Welcome to RTV Documentation!

This site covers RTV’s high-level overview, usage & API documentation.

RTV - Reference-Target Validator

Table Of Contents

  1. Overview

  2. Installation

    1. Prerequisites

    2. Steps

  3. Usage

    1. Python Scripts

    2. Config files

  4. Extending

    1. Custom Entities

    2. Defining Custom Entities in Config

    3. Using Custom Entities in Actions

  5. Running Examples

    1. Setup

    2. Running example scripts

    3. Running example config files

  6. Troubleshooting

Overview

RTV is a framework for validating data against a reference. It provides a set of Python classes that help users set up, manage, and automate complex data validation scenarios. Users can also extend it with custom entities to suit their specific data validation needs.

Installation

Prerequisites

In order to proceed with the installation you need to have the following installed/available on your machine:

  1. Python

  2. Pip

  3. Venv

Steps

  1. Set up the virtual environment:

    python -m venv ~/env/rtv
    
  2. Activate the environment:

    UNIX:

    source ~/env/rtv/bin/activate
    
  3. Install the package:

    pip install --extra-index-url https://pypi.perfacct.eu rtv-framework
    

Usage

RTV provides entities (classes) for use. The key types of those entities are:

  • Readers - Used for reading data from source files and converting it into internal DataCollection objects.

  • Transformers - Used to apply various transformations to the DataCollection objects.

  • Validations - Used to perform validation procedures on DataCollection objects. Details of the internal composition (attributes, methods, etc.) depend on the concrete implementation.

  • Writers - Used to write validation results and any arbitrary data to output destinations (files, stdout, sockets, and so on).

Option 1: Python Script

When using the framework in Python scripts, simply import the entities (classes) you need and use them in your code.

Check the Running Examples section for instructions on how to run this example script yourself.

Here is a basic example script from this repo’s example directory:

from rtv.data.output.writer import JSONFileWriter
from rtv.data.reader import CSVReader
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator


def main():
    pred_filename = "./input/basic/pred.csv"
    true_filename = "./input/basic/true.csv"

    # Instantiate the Reader and the Writer
    reader = CSVReader({"delimiter": ","})
    writer = JSONFileWriter()

    # Read sources to get reference and target DataCollection objects
    reference = reader.read(true_filename)
    target = reader.read(pred_filename)

    # Instantiate validation strategies
    mae_strategy_05 = MeanAbsoluteError({"threshold": 0.5})
    mae_strategy_03 = MeanAbsoluteError({"threshold": 0.3})
    mae_strategy_01 = MeanAbsoluteError({"threshold": 0.1})

    mae_strategy_05.set_name("mae_05")
    mae_strategy_03.set_name("mae_03")
    mae_strategy_01.set_name("mae_01")

    # Instantiate validations
    v1 = StrategyValidation(["default"], [mae_strategy_05])
    v2 = StrategyValidation(["k1"], [mae_strategy_01])
    v3 = StrategyValidation(["k2"], [mae_strategy_03])

    # Set the names for validations
    v1.name = "v1"
    v2.name = "v2"
    v3.name = "v3"

    # Run the validations
    result_collection = Validator().validate(reference, target, [v1, v2, v3])

    # Write the outputs
    writer.write(result_collection, "test_output")


if __name__ == "__main__":
    main()
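
For readers unfamiliar with the metric behind MeanAbsoluteError, the following standalone sketch computes a mean absolute error and compares it with a threshold. It does not use RTV: the function names and the pass rule (error less than or equal to the threshold) are assumptions made for illustration, and RTV's own strategy may differ in detail.

```python
from typing import Sequence


def mean_absolute_error(reference: Sequence[float], target: Sequence[float]) -> float:
    """Average of |reference[i] - target[i]| over all positions."""
    if len(reference) != len(target):
        raise ValueError("reference and target must have the same length")
    if not reference:
        raise ValueError("cannot compute an error over empty input")
    return sum(abs(r - t) for r, t in zip(reference, target)) / len(reference)


def passes(reference: Sequence[float], target: Sequence[float], threshold: float) -> bool:
    # Assumption: a validation passes when the error does not exceed the threshold.
    return mean_absolute_error(reference, target) <= threshold


# Made-up sample values, not taken from the example input files.
true_values = [1.0, 2.0, 3.0, 4.0]
pred_values = [1.5, 2.0, 2.5, 4.0]

for threshold in (0.5, 0.3, 0.1):
    print(threshold, passes(true_values, pred_values, threshold))
```

With these sample values the error is 0.25, so the check succeeds for the thresholds 0.5 and 0.3 and fails for 0.1.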

Option 2: Configuration files

RTV really shines when used to run reusable config files.

Currently supported (out-of-the-box) file formats are YAML and JSON.

The internal (semantic) structure of a config file consists of two main parts:

  • definitions - A list of items where users define the entities that will be used for performing the validation scenario.

  • actions - A sequence/list of actions to be performed in the validation scenario with the use of defined entities.

definition parameters:

  • name: An alias used to reference the defined entity later in the current config file.

  • class: A class name or an alias for the entity constructor to use.

  • …all other parameters are arbitrary.
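
Since JSON is supported alongside YAML, the two-part structure can also be pictured as a JSON document. The sketch below uses only Python's standard json module to load such a document and check that every definition carries the name and class parameters. The helper name check_definitions, the source value ref.csv, and the rejection of duplicate names are assumptions made for illustration; this is not how RTV itself loads configs.

```python
import json

CONFIG_TEXT = """
{
  "definitions": [
    {"name": "csvreader", "class": "CSVReader", "delimiter": ","},
    {"name": "json_writer", "class": "JSONFileWriter"}
  ],
  "actions": [
    {"read": [{"output_name": "ref", "source": "ref.csv", "reader": "csvreader"}]}
  ]
}
"""


def check_definitions(config: dict) -> dict:
    """Return definitions indexed by name, rejecting incomplete or duplicate entries."""
    by_name = {}
    for item in config.get("definitions", []):
        missing = {"name", "class"} - item.keys()
        if missing:
            raise ValueError(f"definition {item!r} lacks {sorted(missing)}")
        if item["name"] in by_name:
            # Assumption: names must be unique so later references are unambiguous.
            raise ValueError(f"duplicate definition name: {item['name']}")
        by_name[item["name"]] = item
    return by_name


config = json.loads(CONFIG_TEXT)
definitions = check_definitions(config)
print(sorted(definitions))
```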

Check the Running Examples section for instructions on how to run this example config yourself.

Here is the basic config example from this repo’s example directory:

definitions:
  # readers:
  - name: csvreader
    class: CSVReader
    delimiter: ","

  # strategies:
  - name: ewa_dist
    class: ElementWiseAbsoluteDistance
    deviation: 0
  - name: map_err
    class: MeanAbsolutePercentageError
    deviation: 0
  - name: ews_dist
    class: ElementWiseSimpleDistance
    range: [-10, 80]

  # transformer
  - name: my_transformer
    class: MyTransformer
    suffix: i

  # validations:
  - name: v1
    class: StrategyValidation
    keys: default
    strategies: ewa_dist
  - name: v2
    class: StrategyValidation
    keys: [Ai,Bi,Ci,Di,Ei]
    strategies: map_err
  - name: v3
    class: StrategyValidation
    keys: all
    strategies: ews_dist

  # writers:
  - name: json_writer
    class: JSONFileWriter
  - name: result_writer
    class: ResultWriter
actions:
  - read:
    - output_name: ref
      source: ${RTV_EXAMPLE_PATH}/input/basic/matrix_a.csv
      reader: csvreader
    - output_name: t1
      source: ${RTV_EXAMPLE_PATH}/input/basic/matrix_b.csv
      reader: csvreader
    - output_name: t2
      source: ${RTV_EXAMPLE_PATH}/input/basic/matrix_c.csv
      reader: csvreader

  - transform:
    - input: ref
      output_name: reference
      transformers: my_transformer
    - input: t1
      output_name: target1
      transformers: my_transformer

  - validate:
    - output_name: result_ab
      validations:
        - v1
        - v2 # partially default
      reference: reference
      target: target1
    - output_name: result_ac
      validations:
        - v1
        - v3 # overwrites default
      reference: ref
      target: t2

  - write:
    - output: ${RTV_EXAMPLE_PATH}/result_matrix_a_matrix_b
      writer: json_writer
      input: result_ab
    - output: ${RTV_EXAMPLE_PATH}/result_matrix_a_matrix_c
      writer: result_writer
      input: result_ac
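
The ${RTV_EXAMPLE_PATH} placeholders in the config above follow the usual shell-style syntax. How RTV expands them is not described here; as an illustration of the idea, Python's standard os.path.expandvars performs this kind of substitution. The directory /tmp/rtv-example and the variable RTV_UNSET_EXAMPLE below are made-up values.

```python
import os

# Made-up location for illustration only.
os.environ["RTV_EXAMPLE_PATH"] = "/tmp/rtv-example"

source = "${RTV_EXAMPLE_PATH}/input/basic/matrix_a.csv"
expanded = os.path.expandvars(source)
print(expanded)

# expandvars leaves references to unset variables unchanged, which makes a
# missing variable easy to detect before any file is opened.
os.environ.pop("RTV_UNSET_EXAMPLE", None)
unexpanded = os.path.expandvars("${RTV_UNSET_EXAMPLE}/file.csv")
print(unexpanded)
```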

Extending

As mentioned earlier, RTV can be extended with custom user entities (classes), either to provide functionality missing for a validation scenario (e.g. implementing custom error metrics) or to support additional configuration file formats.

Custom Entities

Implementing and registering

Custom entities should implement the framework’s predefined interfaces.

from pydantic import BaseModel

from rtv.core.base import BaseEntity
from rtv.transformer.interfaces import ITransformer

class MyAwesomeTransformer(BaseEntity, ITransformer, idf="awesome"):
    class Params(BaseModel):
        my_awesome_param: int
        ...

NOTE: idf is optional for almost all entities and is most likely not needed if you intend to use the entity only in Python scripts; the exceptions are mentioned below. It is just an alias that makes config files more concise.
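
The idf keyword in the class statement above is ordinary Python syntax: keyword arguments in a class header are forwarded to the __init_subclass__ hook of a base class. The sketch below shows one way an alias registry can be built on that hook. It is a generic illustration with hypothetical names (Registered, REGISTRY, PlainTransformer), not RTV's actual implementation.

```python
from typing import Dict, Optional

# Hypothetical registry: lookup key (class name or alias) -> class.
REGISTRY: Dict[str, type] = {}


class Registered:
    """Hypothetical base class that records every subclass by name and alias."""

    def __init_subclass__(cls, idf: Optional[str] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls  # always reachable by class name
        if idf is not None:
            REGISTRY[idf] = cls  # optionally reachable by a short alias


class MyAwesomeTransformer(Registered, idf="awesome"):
    pass


class PlainTransformer(Registered):
    # No idf given: only the class name is registered.
    pass


print(sorted(REGISTRY))
```

A registry of this shape would explain why a config can name an entity either by its class name or by its alias.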

Core Entities Available:

  Entity Name          Inherit from
  -------------------  -------------------------------------------
  Reader               BaseReader, IReader
  Transformer          BaseEntity, ITransformer
  Validation           BaseEntity, IValidation
  Validation Strategy  BaseValidationStrategy, IValidationStrategy
  Action *             BaseAction, IAction
  Writer               BaseWriter, IWriter
  Config Loader **     IConfigLoader, idf="<extension_suffix>"

* - See implementing custom actions

** - See implementing custom config loaders

Implementing custom actions

When implementing custom actions, we recommend adding a short and descriptive identifier:

# ...
class MyCustomAction(BaseAction, IAction, idf="greet"):
    class Params(BaseModel):
        message: str

    def execute(self):
        print(self.message)
# ...

This makes the action more convenient to use in configuration files:

actions:
  - greet:
      message: "Hello World!"

Moreover, implementing custom actions only makes sense for use in config files.
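
To picture why an alias keeps configs short, consider how a list of actions might be dispatched. In the sketch below each action item is a one-key mapping whose key is the alias and whose value holds the parameters, mirroring the greet example above. The ACTIONS table and the run_actions function are hypothetical and are not part of RTV.

```python
from typing import Callable, Dict, List


def greet(message: str) -> str:
    # Stand-in for a custom action; it returns the text instead of printing it
    # so that the result is easy to inspect.
    return message


# Hypothetical alias table: alias -> callable implementing the action.
ACTIONS: Dict[str, Callable[..., str]] = {"greet": greet}


def run_actions(actions: List[dict]) -> List[str]:
    """Run each one-key action item in order and collect the results."""
    results = []
    for item in actions:
        if len(item) != 1:
            raise ValueError(f"expected exactly one action per item, got {item!r}")
        (alias, params), = item.items()
        if alias not in ACTIONS:
            raise KeyError(f"unknown action alias: {alias}")
        results.append(ACTIONS[alias](**params))
    return results


# What the YAML snippet above looks like once parsed into Python objects.
parsed = [{"greet": {"message": "Hello World!"}}]
print(run_actions(parsed))
```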

Implementing custom config loaders

IConfigLoader is a special case: this class does not inherit from BaseEntity and requires idf to be the same as the file extension suffix:

class TxtConfigLoader(IConfigLoader, idf="txt"):
    ...

Otherwise the run is expected to crash.
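
The reason idf must match the file extension can be illustrated with a small lookup keyed by suffix. This is a sketch under the assumption that a loader is selected from the config file's extension; the LOADERS table, the pick_loader function, and the file name configs/basic.json are hypothetical, and the JSON loader simply wraps the standard json module.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Hypothetical table: extension suffix (without the dot) -> parsing function.
LOADERS: Dict[str, Callable[[str], dict]] = {"json": json.loads}


def pick_loader(config_path: str) -> Callable[[str], dict]:
    """Select a loader from the file extension, failing loudly if none matches."""
    suffix = Path(config_path).suffix.lstrip(".").lower()
    try:
        return LOADERS[suffix]
    except KeyError:
        raise ValueError(f"no config loader registered for '.{suffix}' files") from None


# Hypothetical file name; only its extension matters here.
loader = pick_loader("configs/basic.json")
print(loader('{"definitions": [], "actions": []}'))
```

With a table like this, a loader registered under a key that differs from the real extension would never be found, and the lookup would fail.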

Registering for use in config files

The framework automatically adds custom classes to the registry, which makes them available for use in config files.

However, the framework needs to know where to look for the custom code. So, users need to set up an environment variable RTV_USER_CODE_PATH:

export RTV_USER_CODE_PATH=<custom_code_directory_path>

Substitute <custom_code_directory_path> with an actual path on your file system where you are going to store the custom code for RTV. You can structure and name those files as you want.
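
How RTV scans RTV_USER_CODE_PATH is not described here. As a rough illustration, a directory of Python files can be imported with the standard importlib machinery so that the class definitions in them are executed (and, given a registry, register themselves). The function name import_user_code, the user_code_ module prefix, and the file my_entities.py are hypothetical.

```python
import importlib.util
import os
import sys
import tempfile
from pathlib import Path
from types import ModuleType
from typing import List


def import_user_code(directory: str) -> List[ModuleType]:
    """Import every .py file found (recursively) under the given directory."""
    modules = []
    for path in sorted(Path(directory).rglob("*.py")):
        # Simplification: files with the same stem in different folders would clash.
        module_name = f"user_code_{path.stem}"
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        spec.loader.exec_module(module)  # runs the file's top-level code
        modules.append(module)
    return modules


# Demonstration with a throwaway directory standing in for RTV_USER_CODE_PATH.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "my_entities.py").write_text("ANSWER = 42\n")
    os.environ["RTV_USER_CODE_PATH"] = tmp
    loaded = import_user_code(os.environ["RTV_USER_CODE_PATH"])
    print([m.__name__ for m in loaded], loaded[0].ANSWER)
```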

Defining custom entities in config

YAML Config Example:

  • Using custom class name:

    definitions:
        - name: my_awesome_transformer
          class: MyAwesomeTransformer
          my_awesome_param: 42
    # ...
    
  • Using idf (identifier/alias):

    definitions:
        - name: my_awesome_transformer
          class: awesome
          my_awesome_param: 42
    # ...
    

NOTE: Custom actions should not be defined; just use them by alias instead.

Using custom entities in actions

YAML Config Example:

  • Custom transformer:

    actions:
        - transform:
            input: data
            output_name: transformed_data
            transformers: my_awesome_transformer
            # ...
    
  • Custom action:

    actions:
        - my_awesome_action:
            awesome_parameter: 42
            # ...
    

Running examples

This repository holds an example directory with configuration files and scripts that you can use to test and explore RTV’s functionality and features. The directory contains a nested README file describing specific example scripts and configuration files in detail.

Setup

In order to execute these examples you need to download the example directory from this repository.

You will also need to set up an environment variable RTV_EXAMPLE_PATH with the absolute path to the example directory, like this:

UNIX:

export RTV_EXAMPLE_PATH=<example_path>

Substitute <example_path> with an actual path to the example directory in your filesystem.

Running example scripts

  1. Navigate to the example directory.

  2. Execute the example script of your choice, e.g.:

    python scripts/basic.py
    

Running example config files

  1. Navigate to the example directory.

  2. Run the rtv executable, providing the path to the example config as a command-line argument, e.g.:

    rtv configs/basic.yaml
    

Further details on specific examples can be found in the example directory’s README.

Troubleshooting

RTV’s error output is written into the working directory as files with names like rtv-error-<timestamp>.log. You will also see some info logs and warnings on stdout during execution.
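
When several runs have failed, the newest log is usually the one of interest. The following sketch lists files matching the rtv-error-<timestamp>.log pattern in a directory and picks the most recently modified one. The helper name latest_error_log is hypothetical, and the file names in the demonstration are made up rather than real RTV timestamps.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_error_log(directory: str = ".") -> Optional[Path]:
    """Return the most recently modified rtv-error-*.log file, or None."""
    logs = list(Path(directory).glob("rtv-error-*.log"))
    if not logs:
        return None
    return max(logs, key=lambda p: p.stat().st_mtime)


# Demonstration with a throwaway directory and made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp, "rtv-error-0001.log")
    newer = Path(tmp, "rtv-error-0002.log")
    older.write_text("first failure\n")
    newer.write_text("second failure\n")
    # Set modification times explicitly so the ordering is deterministic.
    os.utime(older, (1_000_000_000, 1_000_000_000))
    os.utime(newer, (1_000_000_100, 1_000_000_100))
    newest_name = latest_error_log(tmp).name
    print(newest_name)
```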