RTV’s Core Entities/Classes

This tutorial aims to provide the reader with all the required knowledge about the framework's entities (classes): their purposes, features, and use cases.

NOTE: This tutorial focuses only on the most common use cases and provides a truncated set of parameter descriptions, so if you need more details on a particular class or entity, refer to the related API documentation.

Quick Reference:

Every entity in the RTV framework has a semantic meaning based on its purpose.

Currently, the following entity types are of interest to the user:

  • Reader - Used to read data from “outer world” (source types are arbitrary).

  • Writer - Used to expose data to the “outer world” (output types are arbitrary).

  • Transformer - Used to apply arbitrary transformations to internal data entries.

  • Validation - Encapsulates data entries (reference and target) and other entities required to perform the validation.

  • ValidationStrategy - An entity that encapsulates and performs the actual logic of the validation.

There are also some special classes/entities intended for internal use.

Every entity in the RTV framework has a class constructor that the user can use in two ways:

  1. Define and use the entity in a config file.

    definitions:
        - name: reader
          class: CSVReader
    # ...
    
    actions:
        - read:
              reader: reader
           # ...
    
  2. Import class into a python script.

    from rtv.data.reader import CSVReader
    
    reader = CSVReader()
    

The following sections list the available entity classes with brief descriptions and usage examples.

Data Collection and Store

DataCollection class

DataCollection is the class that RTV uses for the internal representation of the data it works with. From a high-level perspective, it can be considered a collection of key -> value pairs, where keys are unique string names and values are arbitrary data primitives or objects.

You can import the class to your script like this:

from rtv.data.collection import DataCollection

To instantiate/create a data collection, call the class constructor and pass a valid Python dictionary as an argument (defaults to {}):

from rtv.data.collection import DataCollection

my_data_collection = DataCollection({"data": [1,2,3]})

The example above creates a data collection object holding one key named data whose value is an iterable collection of integers (in this particular example a Python list, but it can be any valid Python object, e.g. a numpy array).

However, as an RTV user you will most probably never have to instantiate data collections yourself; instead, you will get them as results from other entities.

# source_file contents:
#
# k1,k2
# 0,1
#

data_collection = csv_reader.read(source_file)

for key in data_collection.keys():
    print(key)

# Output:
# k1
# k2

The example above illustrates how you can get the data collection object from Reader object’s read method call.

NOTE: The Reader class is described in detail in the next section of this tutorial.

The following code example illustrates some common use cases for data collection instances:

# NOTE: Using the data_collection variable from the previous code example

# Get all keys of the data collection
keys = [k for k in data_collection.keys()] # ["k1", "k2"]

# Get all values of the data collection
values = [v for v in data_collection.values()] # [0, 1]

# Iterate on data collection key-value pairs
for k, v in data_collection.items():
    ...

# Get JSON string representing the data collection
print(data_collection.to_json()) # {"k1":0,"k2":1}

# Check if data collection has a value associated with some key
print(data_collection.has("k1")) # True
print(data_collection.has("k3")) # False

# Get specific value by key
k1 = data_collection.get("k1")
print(k1) # 0

# Add a new key-value pair to the data collection
data_collection.add_data("k3", 2)
print(data_collection.has("k3")) # True
print(data_collection.get("k3")) # 2

# Add multiple key-value pairs to the data collection
# NOTE: The key `k3` will be overwritten
data_collection.add_data_bulk({"k3": 3, "k4": 4})
print(data_collection.has("k4")) # True
print(data_collection.get("k3")) # 3
print(data_collection.get("k4")) # 4

Want to know more about DataCollection class internals? Check related API doc.

Data store

data_store is a key-value storage object where the keys are unique string names and values are DataCollection instances.

The data store is used internally by the framework to track/access data states during a single configuration file execution.

Let’s look at the example:

config.yaml:

actions:
    - read:
        source: matrix.csv
        reader: csv_reader
        output_name: source_data_collection

    - transform:
        input: source_data_collection
        transformers:
            - matrix_inverser
            - negative_integers_filter
        output_name: transformed_data_collection

    - free:
        targets: 
            - source_data_collection

After the read action is executed, the data store will have one entry with the key source_data_collection.

After the transform action there will be two entries in the data store: source_data_collection and transformed_data_collection.

After the free action executes, the source_data_collection entry is removed from the data store.
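The sequence of state changes above can be sketched with a plain Python dict standing in for the data store (an illustration only, not RTV code; the keys and sample matrix values are hypothetical):

```python
# A plain dict standing in for the data store; in real RTV code the
# values would be DataCollection instances, not bare dicts.
data_store_state = {}

# After the `read` action: one entry.
data_store_state["source_data_collection"] = {"matrix": [[1, -2], [3, 4]]}

# After the `transform` action: two entries.
data_store_state["transformed_data_collection"] = {"matrix": [[1, 0], [0, 4]]}

# After the `free` action: the source entry is removed.
del data_store_state["source_data_collection"]

print(sorted(data_store_state.keys()))  # ['transformed_data_collection']
```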

NOTE: If you didn’t fully understand the config file example, you should read this Tutorial.

Although the data_store instance is not meant to be used directly, you can import it into your scripts like this if you need it:

from rtv.data.store import data_store

Some operations that you can perform on data_store:

from rtv.data.collection import DataCollection
from rtv.data.store import data_store

sample_collection = DataCollection({})

data_store.set("empty", sample_collection)

print(data_store.has("empty")) # True
print(isinstance(data_store.get("empty"), DataCollection)) # True

data_store.remove("empty")
print(data_store.has("empty")) # False

For more info on Store objects check related API doc.

Readers

Readers are used to read data from arbitrary sources.

They are imported from rtv.data.reader subpackage.

Examples of usage:

Python Script:

from rtv.data.reader import RReader

reader = RReader()

data = reader.read("/path/to/source/file")

Configuration file:

definitions:
    - name: rreader
      class: RReader
      # ...

actions:
    - read:
        source: sample.rds
        reader: rreader
        output_name: data

CSVReader

Reads data from a CSV file.

Short parameters list:

  • delimiter: The symbol used to separate values in the CSV file. Defaults to ,.

  • lineterminator: The character(s) used to denote a line break. Defaults to \r\n.

  • headless: A flag indicating whether to read the CSV file in headless mode (for files with no header row). Defaults to False.

  • treat_headless: A string representing the approach used when reading in headless mode. Options: as_matrix, row_wise, column_wise. Defaults to as_matrix.

Simple Examples:

Script:

from rtv.data.reader import CSVReader

reader = CSVReader({
    "delimiter": ",",
    "lineterminator": "\n",
    "headless": False,
})
data = reader.read("/path/to/csv/file")

Configuration file:

definitions:
   - name: reader
     class: CSVReader
     delimiter: ","
     lineterminator: "\n"
     headless: False
# ...

actions:
   - read:
         reader: reader
         source: /path/to/csv/file
         output_name: data
      # ...

Headless example:

source.csv:

1,2,3
4,5,6

Python Script:

from rtv.data.reader import CSVReader

matrix_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "as_matrix", # can be omitted
})

row_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "row_wise",
})

column_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "column_wise",
})

data = matrix_reader.read("source.csv")

print(data.to_json())
# {
#   "csv_data": [
#        [1,2,3],
#        [4,5,6]
#   ]
# }

data = row_reader.read("source.csv")

print(data.to_json())
# {
#   "csv_data_0": [1,2,3],
#   "csv_data_1": [4,5,6]
# }

data = column_reader.read("source.csv")

print(data.to_json())
# {
#   "csv_data_0": [1,4],
#   "csv_data_1": [2,5],
#   "csv_data_2": [3,6]
# }
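The relationship between the three headless modes can be sketched in plain Python: row_wise keys the rows as they are read, while column_wise keys the columns, i.e. the rows of the transposed matrix (this only illustrates the documented output shapes, not RTV's internal parsing):

```python
# Rows as parsed from source.csv in headless mode (no header row).
rows = [[1, 2, 3], [4, 5, 6]]

# as_matrix: the whole table lives under a single key.
as_matrix = {"csv_data": rows}

# row_wise: one key per row.
row_wise = {f"csv_data_{i}": row for i, row in enumerate(rows)}

# column_wise: one key per column, i.e. per row of the transpose.
columns = [list(col) for col in zip(*rows)]
column_wise = {f"csv_data_{i}": col for i, col in enumerate(columns)}

print(column_wise)
# {'csv_data_0': [1, 4], 'csv_data_1': [2, 5], 'csv_data_2': [3, 6]}
```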

RReader

Reads data from .rds files. Based on the rdata package.

Short parameters list:

  • extension: A string specifying a different file extension. Defaults to None.

  • default_encoding: Specifies the default encoding for source files. Defaults to None.

  • force_default_encoding: A flag indicating whether to force the default encoding when reading source files. Defaults to False.

Simple Example:

Script:

from rtv.data.reader import RReader

reader = RReader({
    "extension": "rds",
    "default_encoding": "utf-16",
    "force_default_encoding": True,
})

data = reader.read("sample.rds")

Configuration file:

definitions:
    - name: rreader
      class: RReader
      extension: rds
      default_encoding: utf-16
      force_default_encoding: True
      # ...

actions:
    - read:
        source: sample.rds
        reader: rreader
        output_name: data

Writers

Writers are used to “export” data from rtv to arbitrary destinations.

They are imported from rtv.data.output subpackage.

JSONFileWriter

Writes data entry to json file.

Examples:

Script:

from rtv.data.output import JSONFileWriter

writer = JSONFileWriter()

# ... Some operations to get `data_entry`

writer.write(data_entry, "output")
# Writes `data_entry` to the output.json file

Configuration file:

definitions:
    - name: json_writer
      class: JSONFileWriter

actions:
    - write:
        input: data_entry
        writer: json_writer
        output: output

ResultWriter

Writes ResultCollection (a special type of DataCollection that stores results of the validations and corresponding artifacts) to a txt file.

Examples:

Script:

from rtv.data.output import ResultWriter

result_writer = ResultWriter()

# ... Some validations happen to produce `result_collection`

result_writer.write(result_collection, "passed")

# passed.txt:
# passed: True
# keys passed:
#   'v3/rmse/k1': True
#   'v3/rmse/k2': True
#   'v3/mae/k1': True
#   'v3/mae/k2': True
#   ...

Configuration file:

definitions:
    - name: result_writer
      class: ResultWriter
    # ...

actions:
    # ...
    - write:
        input: result_collection
        output: passed
        writer: result_writer

Transformers

Transformers are entities that are meant to apply arbitrary transformations to DataCollection objects.

Imported from the rtv.transformer subpackage.

Usage Examples:

Script:

from rtv.transformer import PassThrough

transformer = PassThrough()

transformed_data = transformer.transform(some_data_collection)

Configuration file:

definitions:
    - name: transformer
      class: PassThrough

actions:
    - transform:
        input: some_data_collection
        output_name: transformed_data
        transformers: transformer

Available Transformers:

  • PassThrough: Does absolutely nothing. Implemented for demonstration purposes.

    Parameters:

    • delay: The number of seconds to sleep (to simulate a transformation).

Validations

Validations are special entities that encapsulate everything needed to perform a single validation in the scenario (data collections, strategies or other involved objects).

They are imported from rtv.validation subpackage.

They can be executed by calling the execute() method in a Python script, or by using the validate action in the configuration file. However, for Python script usage it is recommended to use the special Validator helper object to manage the execution of validation objects (see the examples below).

StrategyValidation

Validates the target against the reference using the provided keys and strategies.

Parameters:

  • keys: A list of key names of data entries to apply strategies to. Special values are: “default” (applies to all keys in the data entries that were not validated by any other validation) and “all” (applies to all keys in the data entries). Defaults to “default”.

  • strategies: List of ValidationStrategy objects.*

  • key_pattern: A regex pattern used to match keys in data entries. If provided, the keys parameter is ignored. Defaults to an empty string.

* If used in a configuration file, this parameter is just a list of names of ValidationStrategy entities defined earlier.
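How a key_pattern might select keys can be sketched with Python's re module (the key names here are hypothetical, and RTV's exact matching semantics may differ):

```python
import re

keys = ["prefix_k1", "prefix_k2", "other_k1"]

# Keys selected by the pattern used in the examples below.
matched = [k for k in keys if re.match("prefix_*", k)]

print(matched)  # ['prefix_k1', 'prefix_k2']

# NOTE: as a regex, `prefix_*` means "prefix" followed by zero or more
# underscores; since re.match only anchors at the start, it still selects
# every key beginning with "prefix". A glob-style "prefix_ then anything"
# intent would be written as the regex `prefix_.*`.
```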

Examples:

Script:

from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator

# ... Some operations to get `reference` and `target` DataCollection objects

mae = MeanAbsoluteError()

# Will be applied to all keys
validation_1 = StrategyValidation(["all"], [mae])
# Will be applied to keys that start with `prefix_` (e.g. `prefix_k1`...)
validation_2 = StrategyValidation(["default"], [mae], "prefix_*")

# To execute validations on data use `Validator` special object:
result_collection = Validator().validate(
    reference,
    target,
    [validation_1, validation_2]
)

# You can also use `execute` method of the validation object itself:
list_of_tuples = validation_1.execute(reference, target)

# But note that the output will be less clear and will most probably
# need some manual handling; the Validator object does that for you
# and provides results packed in a convenient ResultCollection object.

Configuration file:

definitions:
    # ...
    - name: mae
      class: MeanAbsoluteError

    - name: validation_1
      class: StrategyValidation
      keys: all
      strategies: mae

    - name: validation_2
      class: StrategyValidation
      keys: default
      key_pattern: prefix_*
      strategies: mae
    # ...

actions:
    # ...
    - validate:
        reference: reference
        target: target
        validations:
            - validation_1
            - validation_2
        output_name: result_collection

Validation Strategies

Entities that encapsulate the logic of the actual validation, executed on specific key(s) present in both the reference and target data entries. They are imported from the rtv.validation.strategy subpackage.

Example:

from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError

# ... Some operations to get `reference` and `target` DataCollection objects

print(reference.to_json())
# { "k1": [1,1,1], ...}

print(target.to_json())
# { "k1": [0,1,1], ...}

mae = MeanAbsoluteError({"threshold": 0.1})

validation = StrategyValidation(["k1"], [mae])
# This validation will take `k1` from target as `y_pred`,
# `k1` from reference as `y_true` and calculate:
# mean_absolute_error(y_true, y_pred)
# If the resulting error is greater than 0.1, the validation
# is considered failed.
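The pass/fail decision described in the comments above can be reproduced with plain Python arithmetic (no RTV involved; this only illustrates the computation):

```python
y_true = [1, 1, 1]  # `k1` from reference
y_pred = [0, 1, 1]  # `k1` from target

# Mean absolute error: the average of |y_true - y_pred| per element.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(mae, 3))  # 0.333
print(mae > 0.1)      # True -> the validation fails
```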

Error Metrics

A family of ValidationStrategy entities that calculate commonly used error metrics.

Common Parameters:

  • threshold: The upper limit for the value of the calculated metric; if exceeded, the parent validation fails. Defaults to 0.

Usage Examples:

Script:

from rtv.validation.strategy import MeanAbsoluteError

strategy = MeanAbsoluteError({"threshold": 0.5})

Configuration file:

definitions:
    - name: strategy
      class: MeanAbsoluteError
      threshold: 0.5

Available Error Metric Strategy Constructors:

  • MeanAbsoluteError

  • MeanAbsolutePercentageError

  • MeanSquaredError

  • MeanSquaredLogError

  • RootMeanSquaredError

  • RootMeanSquaredLogError
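For reference, the metrics behind these constructors can be sketched in plain Python using the standard textbook definitions (RTV's actual implementations may differ in edge-case handling; the sample values are made up):

```python
import math

y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n                       # MeanAbsoluteError
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n  # MeanAbsolutePercentageError
mse = sum(e * e for e in errors) / n                        # MeanSquaredError
rmse = math.sqrt(mse)                                       # RootMeanSquaredError
msle = sum((math.log1p(t) - math.log1p(p)) ** 2
           for t, p in zip(y_true, y_pred)) / n             # MeanSquaredLogError
rmsle = math.sqrt(msle)                                     # RootMeanSquaredLogError
```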

Comparison

A family of ValidationStrategy entities that are based on comparing values from reference and target data entries.

These entities use a Comparator entity with some predefined parameters. If you want more control over the comparison process, you can use Comparator entities directly (see the following section).

Common Parameters:

  • deviation: A limit for the individual elements’ distance. If exceeded, the parent validation is considered failed.

Usage Examples:

Script:

from rtv.validation.strategy import ElementWiseAbsoluteDistance

strategy = ElementWiseAbsoluteDistance({
    "deviation": 50,
})

Configuration file:

definitions:
    - name: strategy
      class: ElementWiseAbsoluteDistance
      deviation: 50

Available Comparison Strategy Constructors:

  • ElementWiseAbsoluteDistance

    Compares two iterables element-wise by calculating the absolute distance between corresponding elements.

    Parameters:

    • num_max_values: The maximum number of largest values to include in the validation artifacts. Defaults to 10.

  • ElementWiseSimpleDistance

    Compares two iterables element-wise by calculating the simple distance between corresponding elements.

    Parameters:

    • num_max_values: The maximum number of largest values to include in the validation artifacts. Defaults to 10.

    • num_min_values: The maximum number of smallest values to include in the validation artifacts. Defaults to 10.
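The comparison behind ElementWiseAbsoluteDistance can be sketched like this (an illustration of the documented deviation and num_max_values parameters with made-up data, not RTV's actual code):

```python
reference = [10, 20, 30, 40]
target = [12, 20, 95, 41]

deviation = 50      # allowed per-element distance
num_max_values = 2  # how many of the largest distances to keep as artifacts

# Absolute distance per element.
distances = [abs(t - r) for r, t in zip(reference, target)]

# The validation fails if any distance exceeds the allowed deviation.
passed = all(d <= deviation for d in distances)

# The largest distances would be kept as validation artifacts.
artifacts = sorted(distances, reverse=True)[:num_max_values]

print(distances)  # [2, 0, 65, 1]
print(passed)     # False
print(artifacts)  # [65, 2]
```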

Comparators

A special entity that compares values from two data entries (reference and target in most cases). Imported from the rtv.validation.comparator subpackage.

Common Parameters:

  • callback: A callback function that performs the comparison.

  • keys: A list of keys used to get the data values for comparison.

NOTE: Currently the direct use of Comparator entities in configuration files is not supported.

ElementWiseComparator

Example:

from rtv.validation.comparator import ElementWiseComparator

# ... Some operations to get `reference` and `target` DataCollection objects

print(reference.to_json())
# { "k1": [1,1,1], ...}

print(target.to_json())
# { "k1": [0,1,1], ...}

comparator = ElementWiseComparator(
    lambda x, y: y - x,
    ["k1"]
)

comparison_result = comparator.compare(reference, target) 
# [-1, 0, 0]
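The result above follows directly from the callback: it is applied element-wise to each aligned pair of values, with x taken from the reference and y from the target. In plain Python (illustration only, not RTV internals):

```python
reference_k1 = [1, 1, 1]  # `k1` from reference
target_k1 = [0, 1, 1]     # `k1` from target

def callback(x, y):
    return y - x

# The comparator applies the callback to each aligned pair of elements.
comparison_result = [callback(x, y) for x, y in zip(reference_k1, target_k1)]

print(comparison_result)  # [-1, 0, 0]
```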