RTV Tutorial

  1. Python Script

  2. Configuration File

  3. Classes

  4. Custom Entities

RTV lets users execute validation scenarios in two ways:

  • Writing python scripts that use classes from the rtv package, and executing them.

  • Writing configuration files in one of the supported formats, following the defined structure.

This tutorial covers both approaches. It also provides the basic knowledge required about the framework’s entities/classes and explains how to extend the framework with custom entities/classes.

Python Script

To execute a validation scenario using RTV via a python script, import, instantiate and invoke the required classes from the rtv package inside your script, and then execute it:

python /path/to/your/script

Example

pred.csv:

k1,k2,k3
1,0,0
0,1,0
0,0,1

true.csv:

k1,k2,k3
0,0,1
0,1,1
1,0,1

script.py:

from rtv.data.output.writer import JSONFileWriter
from rtv.data.reader import CSVReader
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator


def main():
    pred_filename = "pred.csv"
    true_filename = "true.csv"

    # Instantiate the Reader and the Writer entities
    reader = CSVReader({"delimiter": ","})
    writer = JSONFileWriter()

    # Read sources to get reference and target DataCollection objects
    reference = reader.read(true_filename)
    target = reader.read(pred_filename)

    # Instantiate ValidationStrategy entities
    mae_strategy_05 = MeanAbsoluteError({"threshold": 0.5})
    mae_strategy_03 = MeanAbsoluteError({"threshold": 0.3})
    mae_strategy_01 = MeanAbsoluteError({"threshold": 0.1})

    # Set names for validation strategies
    mae_strategy_05.set_name("mae_05")
    mae_strategy_03.set_name("mae_03")
    mae_strategy_01.set_name("mae_01")

    # Instantiate Validation entities
    v1 = StrategyValidation(["default"], [mae_strategy_05])
    v2 = StrategyValidation(["k1"], [mae_strategy_01])
    v3 = StrategyValidation(["k2"], [mae_strategy_03])

    # Set the names for validations
    v1.name = "v1"
    v2.name = "v2"
    v3.name = "v3"

    # Run the Validation entities via special Validator object.
    result_collection = Validator().validate(reference, target, [v1, v2, v3])

    # Write the outputs
    writer.write(result_collection, "test_output")


if __name__ == "__main__":
    main()

NOTE: More details on the classes used here are provided later in this tutorial.
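To make the example concrete, the following standalone sketch recomputes by hand what the three validations check. It does not import rtv. It assumes that MeanAbsoluteError is the standard mean absolute error, that each CSV column becomes one key, and that "default" in v1 resolves to the only key not claimed by v2 or v3 (k3). The helper name mean_absolute_error is illustrative, not part of the rtv API.

```python
# Standalone sketch (no rtv import): recompute the MAE checks from script.py.
# Column values are copied from pred.csv and true.csv.
pred = {"k1": [1, 0, 0], "k2": [0, 1, 0], "k3": [0, 0, 1]}
true = {"k1": [0, 0, 1], "k2": [0, 1, 0], "k3": [1, 1, 1]}


def mean_absolute_error(y_true, y_pred):
    """Standard MAE: mean of |y_true - y_pred| over all elements."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# (validation name, key, threshold) as configured in script.py;
# "default" in v1 is assumed to resolve to the remaining key k3.
checks = [("v1", "k3", 0.5), ("v2", "k1", 0.1), ("v3", "k2", 0.3)]

outcome = {}
for name, key, threshold in checks:
    error = mean_absolute_error(true[key], pred[key])
    # A check passes while the error stays within the threshold.
    outcome[name] = error <= threshold
    print(f"{name}: key={key} mae={error:.3f} threshold={threshold} passed={outcome[name]}")
```

Under these assumptions only v3 passes: k2 matches exactly, while k1 and k3 each have an error of 2/3.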

Configuration File

To execute a validation scenario via an RTV configuration file, run rtv from the command line and provide the path to the configuration file:

rtv /path/to/config/file

Currently supported formats:

  • yaml

  • json

NOTE: This tutorial uses yaml format for examples in most places.

Structure

A valid configuration file for RTV should have two main sections:

  • definitions - holds a list of the framework’s entities that will be used in the validation scenario.

  • actions - holds a list of actions which will be performed during validation scenario execution.

Minimal example:

yaml:

definitions:
    - name: csv_reader
      class: CSVReader
      delimiter: "|"

actions:
    - read:
        reader: csv_reader
        source: vector.csv
        output_name: vector_data

json:

{
    "definitions": [
        {
            "name": "csv_reader",
            "class": "CSVReader",
            "delimiter": "|"
        }
    ],
    "actions": [
        {
            "read": {
                "reader": "csv_reader",
                "source": "vector.csv",
                "output_name": "vector_data"
            }
        }
    ]
}

Definitions

Each element in the list in the definitions section of the configuration file must have the following required fields:

  • name: You can think of it as an alias or a variable name that you can later use in the config to reference the defined entity.

  • class: The name of the entity’s constructor class.

The rest of the definition fields are parameters specific to the entity. In the previous example, the delimiter field is a parameter of CSVReader.

NOTE: You can find a list of available entities/classes and their parameters in the following sections of this tutorial.
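The structural rules described so far (two top-level sections, and name plus class on every definition) can be expressed as a small check. This is a sketch for illustration only: check_config is a hypothetical helper, not part of rtv, and RTV’s own validation may behave differently. It uses the json form of the minimal example so that only the standard library is needed.

```python
import json

REQUIRED_SECTIONS = ("definitions", "actions")
REQUIRED_DEFINITION_FIELDS = ("name", "class")


def check_config(config):
    """Return a list of human-readable problems found in a parsed config."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), list):
            problems.append(f"section '{section}' is missing or is not a list")
    definitions = config.get("definitions")
    if isinstance(definitions, list):
        for index, definition in enumerate(definitions):
            for field in REQUIRED_DEFINITION_FIELDS:
                if field not in definition:
                    problems.append(f"definitions[{index}] lacks required field '{field}'")
    return problems


minimal = json.loads("""
{
    "definitions": [
        {"name": "csv_reader", "class": "CSVReader", "delimiter": "|"}
    ],
    "actions": [
        {"read": {"reader": "csv_reader", "source": "vector.csv", "output_name": "vector_data"}}
    ]
}
""")

print(check_config(minimal))  # no problems for the minimal example
print(check_config({"definitions": [{"name": "csv_reader"}]}))  # two problems
```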

Actions

The common structure of an actions section entry is as follows:

actions:
    - <action_type>:
        - <action_param>: ...
          # ...

        - <action_param>: ...
          # ...

A set of <action_param> fields is specific to a certain action type.

Example with read <action_type>:

actions:
    - read:
        - reader: csv_reader
          source: vector.csv
          output_name: vector_table_data

        - reader: txt_reader
          source: vector.txt
          output_name: vector_text_data

NOTE: You will find info on the available <action_type> values and related <action_param> fields in the following section of this tutorial.

During the validation run, actions are executed in the order in which they are defined in the config, so the following example will lead to an error:

actions:
    - transform:
        input: vector_data
        output_name: transformed_vector_data
        transformers: vector_transposer

    - read:
        reader: csv_reader
        source: vector.csv
        output_name: vector_data

The transform action will raise an exception when trying to access the vector_data entry, as it only becomes available after a successful read action.
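The ordering rule can be illustrated with a toy model in which the scenario data store is a plain dict keyed by output_name. This is an assumption made for the sketch: run_actions is a hypothetical stand-in and says nothing about how RTV implements its store or which exception type it raises.

```python
# Hypothetical stand-in for a scenario data store: a plain dict keyed by
# output_name. This is not RTV's implementation.
def run_actions(actions):
    store = {}
    for action_type, params in actions:
        if action_type == "read":
            # Pretend the source was read; only the name bookkeeping matters here.
            store[params["output_name"]] = f"<data from {params['source']}>"
        elif action_type == "transform":
            if params["input"] not in store:
                raise KeyError(f"data entry '{params['input']}' is not available yet")
            store[params["output_name"]] = f"transformed({store[params['input']]})"
    return store


read = ("read", {"source": "vector.csv", "output_name": "vector_data"})
transform = ("transform", {"input": "vector_data", "output_name": "transformed_vector_data"})

print(sorted(run_actions([read, transform])))  # both entries exist

try:
    run_actions([transform, read])  # same actions, wrong order
except KeyError as error:
    print("failed:", error)
```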

Available actions

read

Used to read data from arbitrary source(s), convert it to RTV’s internal data representation and save it to the current scenario’s data store.

Fields:

  • reader: A name of the Reader entity to use for the action execution.

  • source: A path to a source.

  • output_name: A unique (to the current scenario) name that will be used to store and reference the action’s result.

  • pattern: Optional field, a regex pattern to match more than one source file. If this field is provided, source should be a path to a directory containing the source files to match against the pattern. Defaults to empty string.

  • prefix_key: Optional field, a prefix string to prepend to every key of resulting data entry. Defaults to empty string.

Example:

Read the reference.csv source file and every file in the iterations/ directory that matches the pattern, and save the resulting data as reference and target respectively:

definitions:
    - name: csv_reader
      class: CSVReader

actions:
    - read:
        - reader: csv_reader
          source: reference.csv
          output_name: reference
          prefix_key: ref

        - reader: csv_reader
          source: iterations/
          pattern: iter_(\d+).csv
          # will match: iter_001.csv, iter_002.csv...
          output_name: target

write

Used to write a data entry to an output destination using a Writer entity.

Fields:

  • input: A name of the data entry to write to output.

  • writer: A name of the defined Writer entity to use for the action execution.

  • output: The output destination for the action’s result. The actual type depends on the writer implementation.

Example:

Write the result data entry to a json file named validation_result.json using a JSONFileWriter entity.

definitions:
    # ...
    - name: json_writer
      class: JSONFileWriter
    # ...

actions:
    # ...
    - write:
        input: result
        writer: json_writer
        output: validation_result

transform

Used to transform data entries using Transformer entities and save the result as a new data entry.

Fields:

  • input: A name of the data entry to transform.

  • transformers: A name (or a list of names) of the Transformer entities to use for the action execution.

  • output_name: A unique name that will be used to store and later reference the result of the action.

Example:

Transform the result data entry using inverse_transformer and save the transformed result to the result_transformed data entry.

definitions:
    # ...
    - name: inverse_transformer
      class: InverseTransformer
    # ...

actions:
    # ...
    - transform:
        input: result
        transformers: inverse_transformer
        output_name: result_transformed
    # ...

validate

Used to validate a target data entry against a reference data entry using one or more Validation entities.

Fields:

  • reference: A data entry name to use as reference.

  • target: A data entry name to use as target.

  • validations: A name (or a list of names) of the Validation entities to use for the action execution.

  • output_name: A unique name that will be used to store and reference the result of the action.

Example:

Validate the a data entry against the b data entry using the v1 validation and save the resulting data entry as result.

definitions:
    # ...
    - name: mae
      class: MeanAbsoluteError
      threshold: 0.5

    - name: v1
      class: StrategyValidation
      strategies: mae
      keys: all
    # ...

actions:
    # ...
    - validate:
        reference: b
        target: a
        validations: v1
        output_name: result
    # ...

free

Used to remove data entries from the current scenario data store.

Fields:

  • targets: Names of data entries to remove.

Example:

Remove a and b data entries.

actions:
    - free:
        targets: [a,b]

Classes

Every entity in the RTV framework has a semantic meaning based on its purpose.

Currently the following entity types are of interest to the user:

  • Reader - Used to read data from “outer world” (source types are arbitrary).

  • Writer - Used to expose data to the “outer world” (output types are arbitrary).

  • Transformer - Used to apply arbitrary transformations to internal data entries.

  • Validation - Encapsulates data entries (reference and target) and other entities required to perform the validation.

  • ValidationStrategy - An entity that encapsulates and performs actual logic of the validation.

Every entity in the RTV framework has a class constructor that the user can use in two ways:

  1. Define and use the entity in a config file.

    definitions:
        - name: reader
          class: CSVReader
    # ...

    actions:
        - read:
            reader: reader
            # ...

  2. Import the class into a python script.

    from rtv.data.reader import CSVReader

    reader = CSVReader()
    

The following sections list the available entity classes with brief descriptions and usage examples.

Readers

Readers are used to read data from arbitrary sources.

They are imported from the rtv.data.reader sub-package.

Examples of usage:

Python Script:

from rtv.data.reader import RReader

reader = RReader()

data = reader.read("/path/to/source/file")

Configuration file:

definitions:
    - name: rreader
      class: RReader
      # ...

actions:
    - read:
        source: sample.rds
        reader: rreader
        output_name: data

CSVReader

Reads the data from a csv file.

Short parameters list:

  • delimiter: A symbol used to separate values in csv file. Defaults to ,.

  • lineterminator: The character(s) used to denote a line break. Defaults to \r\n.

  • headless: A flag indicating whether to read the csv file in headless mode (for files with no header row). Defaults to False.

  • treat_headless: A string representing the approach used when reading in headless mode. Options: as_matrix, row_wise, column_wise. Defaults to as_matrix.

Simple Examples:

Script:

from rtv.data.reader import CSVReader

reader = CSVReader({
    "delimiter": ",",
    "lineterminator": "\n",
    "headless": False,
})
data = reader.read("/path/to/csv/file")

Configuration file:

definitions:
    - name: reader
      class: CSVReader
      delimiter: ","
      lineterminator: "\n"
      headless: False
    # ...

actions:
    - read:
        reader: reader
        source: /path/to/csv/file
        output_name: data
    # ...

Headless example:

source.csv:

1,2,3
4,5,6

Python Script:

from rtv.data.reader import CSVReader

matrix_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "as_matrix", # can be omitted
})

row_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "row_wise",
})

column_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "column_wise",
})

data = matrix_reader.read("source.csv")

print(data.to_json())
# {
#   "csv_data": [
#        [1,2,3],
#        [4,5,6]
#   ]
# }

data = row_reader.read("source.csv")

print(data.to_json())
# {
#   "csv_data_0": [1,2,3],
#   "csv_data_1": [4,5,6]
# }

data = column_reader.read("source.csv")

print(data.to_json())
# {
#   "csv_data_0": [1,4],
#   "csv_data_1": [2,5],
#   "csv_data_2": [3,6]
# }
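The three treat_headless modes differ only in how the table is split into keys. The sketch below reproduces the output shapes shown above with the standard csv module. It is not CSVReader itself: read_headless is a hypothetical helper, and the conversion of values to int is an assumption made so that the result matches the printed examples.

```python
import csv
import io

SOURCE = "1,2,3\n4,5,6\n"  # contents of source.csv


def read_headless(text, treat_headless="as_matrix", delimiter=","):
    """Mimic the documented output shapes of the three headless modes."""
    rows = [[int(value) for value in row]
            for row in csv.reader(io.StringIO(text), delimiter=delimiter)]
    if treat_headless == "as_matrix":
        return {"csv_data": rows}
    if treat_headless == "row_wise":
        return {f"csv_data_{i}": row for i, row in enumerate(rows)}
    if treat_headless == "column_wise":
        # zip(*rows) transposes the table so each column becomes one entry.
        return {f"csv_data_{i}": list(col) for i, col in enumerate(zip(*rows))}
    raise ValueError(f"unknown treat_headless value: {treat_headless}")


for mode in ("as_matrix", "row_wise", "column_wise"):
    print(mode, read_headless(SOURCE, mode))
```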

RReader

Reads data from .rds files. Based on the rdata package.

Short parameters list:

  • extension: A string specifying a different file extension. Defaults to None.

  • default_encoding: The default encoding for source files. Defaults to None.

  • force_default_encoding: A flag indicating whether to force the default encoding when reading source files. Defaults to False.

Simple Example:

Script:

from rtv.data.reader import RReader

reader = RReader({
    "extension": "rds",
    "default_encoding": "utf-16",
    "force_default_encoding": True,
})

data = reader.read("sample.rds")

QS2Reader

Custom reader for R’s qs2 data files.

Important: Requires R to be installed on the system, and qs2 package installed in R.

Configuration file:

definitions:
    - name: qsreader
      class: QS2Reader

actions:
    - read:
        source: sample.qs2
        reader: qsreader
        output_name: data

Writers

Writers are used to “export” data from rtv to arbitrary destinations.

They are imported from rtv.data.output sub-package.

JSONFileWriter

Writes a data entry to a json file.

Examples:

Script:

from rtv.data.output import JSONFileWriter

writer = JSONFileWriter()

# ... Some operations to get `data_entry`

writer.write(data_entry, "output")
# Writes `data_entry` to output.json file

Configuration file:

definitions:
    - name: json_writer
      class: JSONFileWriter

actions:
    - write:
        input: data_entry
        writer: json_writer
        output: output

ResultWriter

Writes ResultCollection (a special type of DataCollection that stores results of the validations and corresponding artifacts) to a txt file.

Examples:

Script:

from rtv.data.output import ResultWriter

result_writer = ResultWriter()

# ... Some validations happen to produce `result_collection`

result_writer.write(result_collection, "passed")

# passed.txt:
# passed: True
# keys passed:
#   'v3/rmse/k1': True
#   'v3/rmse/k2': True
#   'v3/mae/k1': True
#   'v3/mae/k2': True
#   ...

Configuration file:

definitions:
    - name: result_writer
      class: ResultWriter
    # ...

actions:
    # ...
    - write:
        input: result_collection
        output: passed
        writer: result_writer

Transformers

Transformers are entities meant to apply arbitrary transformations to DataCollection objects.

They are imported from the rtv.transformer sub-package.

Usage Examples:

Script:

from rtv.transformer import PassThrough

transformer = PassThrough()

transformed_data = transformer.transform(some_data_collection)

Configuration file:

definitions:
    - name: transformer
      class: PassThrough

actions:
    - transform:
        input: some_data_collection
        output_name: transformed_data
        transformers: transformer

Available Transformers:

  • PassThrough: Does absolutely nothing. Implemented for demonstration purposes.

    Parameters:

    • delay: number of seconds to sleep (simulate transformations).

Validations

Validations are special entities that encapsulate everything needed to perform a single validation in the scenario (data collections, strategies or other involved objects).

They are imported from rtv.validation sub-package.

They can be executed by calling the execute() method in a python script, or by using the validate action in a configuration file. For python scripts, however, it is recommended to use the special Validator helper object to manage the execution of validation objects (see examples below).

StrategyValidation

Validates target against reference using provided keys and strategies.

Parameters:

  • keys: List of key names of data entries to apply strategies to. Special values are: “default” (applies to all keys in data entries, which were not validated by any other validation) and “all” (applies to all keys in data entries). Defaults to “default”.

  • strategies: List of ValidationStrategy objects.*

  • key_pattern: A regex pattern to use for matching keys in data entries. If provided, keys parameter is ignored. Defaults to empty string.

* - If used in a configuration file, this parameter is a list of names of previously defined ValidationStrategy entities.
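The interplay of keys, the special values "default" and "all", and key_pattern can be sketched as a key-resolution step. This is an illustrative reading of the parameter descriptions above, not RTV’s implementation: resolve_keys is a hypothetical helper, validations are modelled as plain dicts, and pattern matching is assumed to anchor at the start of the key.

```python
import re


def resolve_keys(all_keys, validations):
    """Map each validation name to the data keys it would be applied to.

    `validations` maps a name to {"keys": [...], "key_pattern": "..."}.
    """
    resolved = {}
    for name, spec in validations.items():
        pattern = spec.get("key_pattern", "")
        keys = spec.get("keys", ["default"])
        if pattern:
            # A non-empty pattern takes precedence over `keys`.
            resolved[name] = [k for k in all_keys if re.match(pattern, k)]
        elif keys == ["all"]:
            resolved[name] = list(all_keys)
        elif keys != ["default"]:
            resolved[name] = [k for k in keys if k in all_keys]
    # "default" picks up whatever no other validation has claimed.
    claimed = {k for keys in resolved.values() for k in keys}
    for name in validations:
        if name not in resolved:
            resolved[name] = [k for k in all_keys if k not in claimed]
    return resolved


validations = {
    "v1": {"keys": ["default"]},
    "v2": {"keys": ["k1"]},
    "v3": {"keys": ["k2"]},
}
print(resolve_keys(["k1", "k2", "k3"], validations))
```

With the keys k1, k2 and k3, the "default" validation v1 ends up with k3 only, because k1 and k2 are claimed by v2 and v3.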

Examples:

Script:

from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator

# ... Some operations to get `reference` and `target` DataCollection objects

mae = MeanAbsoluteError()

# Will be applied to all keys
validation_1 = StrategyValidation(["all"], [mae])
# Will be applied to keys that start with `prefix_` (e.g. `prefix_k1`...)
validation_2 = StrategyValidation(["default"], [mae], "prefix_.*")

# To execute validations on data use `Validator` special object:
result_collection = Validator().validate(
    reference,
    target,
    [validation_1, validation_2]
)

# You can also use `execute` method of the validation object itself:
list_of_tuples = validation_1.execute(reference, target)

# Note that the raw output is harder to read and will most likely
# need some manual handling. The Validator object does that for you
# and returns the results packed in a ResultCollection object.

Configuration file:

definitions:
    # ...
    - name: mae
      class: MeanAbsoluteError

    - name: validation_1
      class: StrategyValidation
      keys: all
      strategies: mae

    - name: validation_2
      class: StrategyValidation
      keys: default
      key_pattern: prefix_.*
      strategies: mae
    # ...

actions:
    # ...
    - validate:
        reference: reference
        target: target
        validations:
            - validation_1
            - validation_2
        output_name: result_collection

Validation Strategies

Entities that encapsulate the logic of the actual validation, executed on specific key(s) present in both the reference and target data entries. They are imported from the rtv.validation.strategy sub-package.

Example:

from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator

# ... Some operations to get `reference` and `target` DataCollection objects

print(reference.to_json())
# { "k1": [1,1,1], ...}

print(target.to_json())
# { "k1": [0,1,1], ...}

mae = MeanAbsoluteError({"threshold": 0.1})

validation = StrategyValidation(["k1"], [mae])
# This validation will take `k1` from target as `y_pred`,
# `k1` from reference as `y_true` and calculate:
# mean_absolute_error(y_true, y_pred)
# If the resulting error is greater than 0.1, the validation
# is considered failed.

Error Metrics

A family of ValidationStrategy entities that calculate commonly used error metrics.

Common Parameters:

  • threshold: The upper limit for the value of the calculated metric. If it is exceeded, the parent validation fails. Defaults to 0.

Usage Examples:

Script:

from rtv.validation.strategy import MeanAbsoluteError

strategy = MeanAbsoluteError({"threshold": 0.5})

Configuration file:

definitions:
    - name: strategy
      class: MeanAbsoluteError
      threshold: 0.5

Available Error Metric Strategy Constructors:

  • MeanAbsoluteError

  • MeanAbsolutePercentageError

  • MeanSquaredError

  • MeanSquaredLogError

  • RootMeanSquaredError

  • RootMeanSquaredLogError

  • RelativeMAPE

  • RelativeRMSE
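For orientation, here are textbook definitions of four of these metrics together with the threshold rule. This sketch assumes that the RTV strategies follow the standard formulas, which this tutorial does not state explicitly. The Relative variants and the log-based metrics are left out because their exact definitions are not given here. All function names are illustrative.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    # Undefined when y_true contains zeros; callers must avoid that case.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def passes(metric, y_true, y_pred, threshold=0.0):
    """A strategy passes while the metric does not exceed the threshold."""
    return metric(y_true, y_pred) <= threshold


y_true = [1, 1, 1]  # reference values for `k1`
y_pred = [0, 1, 1]  # target values for `k1`

for metric in (mae, mse, rmse, mape):
    print(metric.__name__, round(metric(y_true, y_pred), 3))

print(passes(mae, y_true, y_pred, threshold=0.1))  # False: 1/3 exceeds 0.1
print(passes(mae, y_true, y_pred, threshold=0.5))  # True
```

With the default threshold of 0, a strategy modelled this way passes only when reference and target match exactly.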

Comparison

A family of ValidationStrategy entities that are based on comparing values from reference and target data entries.

These entities use a Comparator entity with some predefined parameters. If you want more control over the comparison process, you can use Comparator entities directly (see the following section).

Common Parameters:

  • deviation: A limit for the individual elements’ distance. If exceeded, the parent validation is considered failed.

Usage Examples:

Script:

from rtv.validation.strategy import ElementWiseAbsoluteDistance

strategy = ElementWiseAbsoluteDistance({
    "deviation": 50,
})

Configuration file:

definitions:
    - name: strategy
      class: ElementWiseAbsoluteDistance
      deviation: 50

Available Comparison Strategy Constructors:

  • ElementWiseAbsoluteDistance

    Compares two Iterables element-wise by calculating the absolute distance between each pair of elements.

    Parameters:

    • num_max_values: Max number of biggest values to include in validation artifacts. Defaults to 10.

  • ElementWiseSimpleDistance

    Compares two Iterables element-wise by calculating the simple distance between each pair of elements.

    Parameters:

    • num_max_values: Max number of biggest values to include in validation artifacts. Defaults to 10.

    • num_min_values: Max number of smallest values to include in validation artifacts. Defaults to 10.
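The following sketch shows one plausible reading of the two strategies and of the deviation limit. It rests on assumptions: "simple distance" is taken to be the signed difference target - reference, and the artifacts are taken to be the largest (and smallest) distances. None of the helper names below belong to rtv.

```python
import heapq


def absolute_distances(reference, target):
    return [abs(t - r) for r, t in zip(reference, target)]


def simple_distances(reference, target):
    # Assumption: "simple distance" is the signed difference target - reference.
    return [t - r for r, t in zip(reference, target)]


def within_deviation(distances, deviation):
    """Fail as soon as any single element is further away than `deviation`."""
    return all(abs(d) <= deviation for d in distances)


reference = [100, 200, 300, 400]
target = [110, 140, 300, 470]

abs_d = absolute_distances(reference, target)   # [10, 60, 0, 70]
signed_d = simple_distances(reference, target)  # [10, -60, 0, 70]

print(within_deviation(abs_d, 50))   # False: 60 and 70 exceed the limit
print(within_deviation(abs_d, 100))  # True

# Artifacts: keep only the N largest (and, for signed distances, N smallest).
print(heapq.nlargest(2, abs_d))      # [70, 60]
print(heapq.nsmallest(2, signed_d))  # [-60, 0]
```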

Comparators

A special entity that compares values from two data entries (reference and target in most cases). Imported from rtv.validation.comparator sub-package.

Common Parameters:

  • callback: A callback function that performs the comparison.

  • keys: A list of keys used to get the data values for comparison.

NOTE: Currently the direct use of Comparator entities in configuration files is not supported.

ElementWiseComparator

Example:

from rtv.validation.comparator import ElementWiseComparator

# ... Some operations to get `reference` and `target` DataCollection objects

print(reference.to_json())
# { "k1": [1,1,1], ...}

print(target.to_json())
# { "k1": [0,1,1], ...}

comparator = ElementWiseComparator(
    lambda x, y: y - x,
    ["k1"]
)

comparison_result = comparator.compare(reference, target)
# [-1, 0, 0]
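What the callback receives can be seen in a plain-Python stand-in. ElementWiseSketch below is hypothetical and only imitates the behaviour shown in the example: the callback is called with one reference element and one target element at a time. Concatenating the results of several keys into one flat list is an assumption.

```python
class ElementWiseSketch:
    """Hypothetical stand-in that applies `callback` pairwise for each key."""

    def __init__(self, callback, keys):
        self.callback = callback
        self.keys = keys

    def compare(self, reference, target):
        result = []
        for key in self.keys:
            result.extend(
                self.callback(x, y) for x, y in zip(reference[key], target[key])
            )
        return result


reference = {"k1": [1, 1, 1]}
target = {"k1": [0, 1, 1]}

comparator = ElementWiseSketch(lambda x, y: y - x, ["k1"])
print(comparator.compare(reference, target))  # [-1, 0, 0]
```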

Custom Entities

RTV can be extended with custom user entities (classes) to provide functionality missing for the user’s validation scenario (e.g. implementing custom error metrics) or to extend the supported configuration file formats.

Implementing and registering

Custom entities should implement the framework’s pre-defined interfaces and inherit from a base class.

The core base class to use is BaseEntity, imported from rtv.core.base.

The framework also provides base classes for the core types of entities. All base classes are child classes of BaseEntity.

Base classes that can be imported from rtv.base:

  • BaseComparisonStrategy

  • BaseErrorMetricStrategy

  • BaseValidationStrategy

  • BaseWriter

BaseReader class can be imported from rtv.data.reader.base.

BaseAction class can be imported from rtv.action.base.

Most core interfaces are imported from the rtv.interfaces module. Those are:

  • IComparator

  • IReader

  • ITransformer

  • IValidationStrategy

  • IWriter

IAction interface can be imported from rtv.action.interfaces.

IConfigLoader interface can be imported from rtv.config.interfaces.

IValidation interface can be imported from rtv.validation.interfaces.

NOTE: You can find the interface definitions in the Core Interfaces section.

Here is an example of a custom transformer implementation:

from pydantic import BaseModel

from rtv.core.base import BaseEntity
from rtv.interfaces import ITransformer

class MyAwesomeTransformer(BaseEntity, ITransformer, idf="awesome"):
    class Params(BaseModel):
        my_awesome_param: int
        ...

    def transform(self, data):
        ...

The entity defined above can be used in a configuration file like this:

definitions:
    - name: my_awesome_transformer
      class: MyAwesomeTransformer # You can also use `idf` alias here
      my_awesome_param: 42

actions:
    - transform:
        input: data
        output_name: transformed
        transformers: my_awesome_transformer

Using the nested Params(BaseModel) class provides automatic validation of your entity’s parameters. If your entity has no parameters, you can omit this nested class:

from rtv.core.base import BaseEntity
from rtv.interfaces import ITransformer

class MyAwesomeTransformer(BaseEntity, ITransformer, idf="awesome"):
    def transform(self, data):
        ...

It is important to inherit from the interface class when implementing your custom entities, because interface classes provide the mechanisms that later allow you to use your entity via configuration files.

NOTE: idf is optional for almost all entities and is most likely not needed if you only intend to use the entity in a python script. However, there are some exceptions (mentioned below). It is just an alias to make config files more concise.
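The class-statement keyword idf="..." can be supported with Python’s __init_subclass__ hook. The sketch below shows one way such a registry could work. It is not taken from the RTV source: Registrable, REGISTRY and build are hypothetical names, and the real mechanism may differ.

```python
REGISTRY = {}


class Registrable:
    """Hypothetical base that records every subclass under its name and idf."""

    def __init_subclass__(cls, idf=None, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls
        if idf is not None:
            REGISTRY[idf] = cls


class MyAwesomeTransformer(Registrable, idf="awesome"):
    def transform(self, data):
        return data


def build(definition):
    """Instantiate the class named in a config definition's `class` field."""
    return REGISTRY[definition["class"]]()


by_name = build({"name": "t1", "class": "MyAwesomeTransformer"})
by_idf = build({"name": "t2", "class": "awesome"})
print(type(by_name) is type(by_idf))  # True
```

In this model the class name and the idf alias are two keys for the same class, which is why either can appear in the class field of a definition.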

The following table shows which entity types you can implement and which classes you should inherit from:

Entity Name           Inherit from
Reader                BaseReader, IReader
Transformer           BaseEntity, ITransformer
Validation            BaseEntity, IValidation
Validation Strategy   BaseValidationStrategy, IValidationStrategy
Action *              BaseAction, IAction
Writer                BaseWriter, IWriter
Config Loader **      IConfigLoader, idf="<extension_suffix>"

* - See implementing custom actions

** - See implementing custom config loaders

Implementing custom actions

When implementing custom actions, we recommend adding a short and descriptive identifier:

# ...
class MyCustomAction(BaseAction, IAction, idf="greet"):
    class Params(BaseModel):
        message: str

    def execute(self):
        print(self.message)
# ...

That makes it more convenient to use in configuration files:

actions:
  - greet:
      message: "Hello World!"

NOTE: Custom actions should not be listed in the definitions section; they are used directly by their alias.

Moreover, implementing custom actions only makes sense for use in configuration files.

Implementing custom config loaders

IConfigLoader is a special case: this class does not inherit from BaseEntity, does not support the nested Params class, and requires idf to be the same as the file extension suffix:

class TxtConfigLoader(IConfigLoader, idf="txt"):
    ...

Otherwise, the run is expected to crash.
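The reason idf must equal the extension suffix becomes clearer if loaders are looked up by the suffix of the config path. The sketch below models that lookup with hypothetical names (LOADERS, LoaderSketch, JsonLoaderSketch, load_config). It is an assumption about the mechanism, not RTV’s actual code, and it uses json so that only the standard library is needed.

```python
import json
import os
import tempfile
from pathlib import Path

LOADERS = {}  # hypothetical registry: extension suffix -> loader class


class LoaderSketch:
    def __init_subclass__(cls, idf, **kwargs):
        super().__init_subclass__(**kwargs)
        LOADERS[idf] = cls


class JsonLoaderSketch(LoaderSketch, idf="json"):
    def load(self, config_path):
        with open(config_path, encoding="utf-8") as handle:
            return json.load(handle)


def load_config(config_path):
    # The loader is chosen purely by the file extension suffix.
    suffix = Path(config_path).suffix.lstrip(".")
    if suffix not in LOADERS:
        raise ValueError(f"no config loader registered for '.{suffix}' files")
    return LOADERS[suffix]().load(config_path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scenario.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"definitions": [], "actions": []}, handle)
    config = load_config(path)
    print(config)  # {'definitions': [], 'actions': []}

try:
    load_config("scenario.txt")  # no loader was registered with idf="txt"
except ValueError as error:
    print(error)
```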

Registering for use in config files

The framework automatically adds the user’s custom classes to the registry, making them available for use in config files.

However, the framework needs to know where to look for the custom code, so users need to set the RTV_USER_CODE_PATH environment variable:

export RTV_USER_CODE_PATH=<custom_code_directory_path>

Substitute <custom_code_directory_path> with an actual path on your file system where you are going to store the custom code files for RTV. You can structure and name those files as you want.
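How a framework might pick up code from such a directory can be sketched with importlib. This is an assumption about the general technique only: the tutorial does not describe how RTV scans RTV_USER_CODE_PATH, and load_user_code is a hypothetical helper. Importing a file runs its class statements, which is the point at which a registry hook would see the custom classes.

```python
import importlib.util
import os
import tempfile
from pathlib import Path


def load_user_code(env_var="RTV_USER_CODE_PATH"):
    """Import every .py file found under the directory named by `env_var`."""
    root = os.environ.get(env_var)
    if not root:
        return []
    modules = []
    for path in sorted(Path(root).rglob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # class definitions run here
        modules.append(module)
    return modules


with tempfile.TemporaryDirectory() as tmp:
    # A made-up user module, written only for this demonstration.
    Path(tmp, "my_entities.py").write_text("GREETING = 'Hello World!'\n", encoding="utf-8")
    os.environ["RTV_USER_CODE_PATH"] = tmp
    loaded = load_user_code()
    print([module.__name__ for module in loaded])  # ['my_entities']
    print(loaded[0].GREETING)                      # Hello World!
```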

Defining custom entities in config

YAML Config Example:

  • Using custom class name:

    definitions:
        - name: my_awesome_transformer
          class: MyAwesomeTransformer
          my_awesome_param: 42
    # ...
    
  • Using idf (identifier/alias):

    definitions:
        - name: my_awesome_transformer
          class: awesome
          my_awesome_param: 42
    # ...
    

Using custom entities in actions

YAML Config Example:

  • Custom transformer:

    actions:
        - transform:
            input: data
            output_name: transformed_data
            transformers: my_awesome_transformer
            # ...
    
  • Custom action:

    actions:
        - my_awesome_action:
            awesome_parameter: 42
            # ...
    

Core Interfaces

class IReader(Interface):
    @abstractmethod
    def read(
        self, *args, **kwargs
    ) -> DataCollection:
        ...

class IWriter(Interface):
    @abstractmethod
    def write(
        self, data: DataCollection, *args, **kwargs
    ) -> None:
        ...

class ITransformer(Interface):
    @abstractmethod
    def transform(
        self, collection: DataCollection, **kwargs
    ) -> DataCollection:
        ...

class IValidation(Interface):
    @abstractmethod
    def execute(
        self, reference: DataCollection, target: DataCollection
    ) -> List[Tuple[str, ValidationResultModel]]:
        ...

    @property
    @abstractmethod
    def name(self) -> str:
        ...

class IValidationStrategy(Interface):
    @abstractmethod
    def validate(
        self,
        key: str,
        reference: DataCollection,
        target: DataCollection,
    ) -> ValidationResultModel:
        ...

    @property
    @abstractmethod
    def details(self) -> StrategyDetailsModel:
        ...

    @property
    @abstractmethod
    def name(self) -> str:
        ...

class IAction(Interface):
    @abstractmethod
    def execute(self) -> None:
        ...

class IConfigLoader(Interface):
    @abstractmethod
    def load(self, config_path: str) -> Dict[str, Any]:
        ...