RTV’s Core Entities/Classes¶
This tutorial aims to give the reader the required knowledge about the framework’s entities (classes): their purposes, features and use cases.
NOTE: This tutorial only covers the most common use cases and provides a truncated set of parameter descriptions, so if you need more details on a particular class or entity you should consult the related API documentation.
Quick Reference:
Every entity in the RTV framework has a semantic meaning based on its purpose.
Currently the following entity types are of interest to the user:
Reader - Used to read data from the “outer world” (source types are arbitrary).
Writer - Used to expose data to the “outer world” (output types are arbitrary).
Transformer - Used to apply arbitrary transformations to internal data entries.
Validation - Encapsulates data entries (reference and target) and other entities required to perform the validation.
ValidationStrategy - An entity that encapsulates and performs the actual logic of the validation.
There are also some special classes/entities intended for internal use.
Every entity in the RTV framework has a class constructor that the user can use in two ways:
Define and use the entity in a configuration file:
definitions:
  - name: reader
    class: CSVReader
    # ...
actions:
  - read:
      reader: reader
      # ...
Import the class into a Python script:
from rtv.data.reader import CSVReader
reader = CSVReader()
The following sections contain lists of the available entity classes, their brief descriptions and usage examples.
Data Collection and Store¶
DataCollection class¶
DataCollection is a class that RTV uses for the internal representation of the data it works with. From a high-level perspective it can be considered a collection of key -> value pairs, where keys are unique string names and values are arbitrary data primitives or objects.
You can import the class into your script like this:
from rtv.data.collection import DataCollection
To instantiate/create a data collection you call the class constructor and pass a valid Python dictionary as an argument (defaults to {}):
from rtv.data.collection import DataCollection
my_data_collection = DataCollection({"data": [1,2,3]})
The example above creates a data collection object holding one key named data with a value that is an iterable collection of integers (in this particular example a Python list, but it can be any valid Python object, e.g. a numpy array).
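Since the constructor argument defaults to {}, you can also create an empty collection and fill it later. A minimal sketch, using the add_data method described later in this section:
from rtv.data.collection import DataCollection
# The constructor argument defaults to {}, so this creates an empty collection
empty_collection = DataCollection()
# Fill it afterwards with add_data (see the operations example below)
empty_collection.add_data("data", [1, 2, 3])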
However, as an RTV user you will most probably never have to instantiate data collections yourself; instead you will get them as results from other entities.
# source_file contents:
#
# k1,k2
# 0,1
#
data_collection = csv_reader.read(source_file)
for key in data_collection.keys():
    print(key)
# Output:
# k1
# k2
The example above illustrates how you can get a data collection object from a Reader object’s read method call.
NOTE: The Reader class will be described in detail in the next section of this tutorial.
The following code example illustrates some common use cases for data collection instances:
# NOTE: Using the data_collection variable from the previous code example
# Get all keys of the data collection
keys = [k for k in data_collection.keys()] # ["k1", "k2"]
# Get all values of the data collection
values = [v for v in data_collection.values()] # [0, 1]
# Iterate on data collection key-value pairs
for k, v in data_collection.items():
    ...
# Get JSON string representing the data collection
print(data_collection.to_json()) # {"k1":0,"k2":1}
# Check if data collection has a value associated with some key
print(data_collection.has("k1")) # True
print(data_collection.has("k3")) # False
# Get specific value by key
k1 = data_collection.get("k1")
print(k1) # 0
# Add a new key-value pair to the data collection
data_collection.add_data("k3", 2)
print(data_collection.has("k3")) # True
print(data_collection.get("k3")) # 2
# Add multiple key-value pairs to the data collection
# NOTE: The key `k3` will be overwritten
data_collection.add_data_bulk({"k3": 3, "k4": 4})
print(data_collection.has("k4")) # True
print(data_collection.get("k3")) # 3
print(data_collection.get("k4")) # 4
Want to know more about DataCollection class internals? Check the related API doc.
Data store¶
data_store is a key-value storage object where the keys are unique string names and the values are DataCollection instances.
The data store is used internally by the framework to track/access data states during a single configuration file execution.
Let’s look at the example:
config.yaml:
actions:
  - read:
      source: matrix.csv
      reader: csv_reader
      output_name: source_data_collection
  - transform:
      input: source_data_collection
      transformers:
        - matrix_inverser
        - negative_integers_filter
      output_name: transformed_data_collection
  - free:
      targets:
        - source_data_collection
After the read action is executed, the data store will have one entry with the key source_data_collection.
After the transform action there will be two entries in the data store: source_data_collection and transformed_data_collection.
After the free action is executed, the source_data_collection entry will be removed from the data store.
NOTE: If you didn’t fully understand the config file example, you should read this Tutorial.
Although the data_store instance is not meant to be dealt with directly, you can import it into your scripts like this if you need it:
from rtv.data.store import data_store
Some operations that you can perform on data_store:
from rtv.data.collection import DataCollection
from rtv.data.store import data_store
sample_collection = DataCollection({})
data_store.set("empty", sample_collection)
print(data_store.has("empty")) # True
print(isinstance(data_store.get("empty"), DataCollection)) # True
data_store.remove("empty")
print(data_store.has("empty")) # False
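Tying this back to the config.yaml example above, the following is a minimal sketch of what you could observe in the data store after that whole configuration has been executed (assuming the same entity and output names):
from rtv.data.store import data_store
# After `read` and `transform` have run, and `free` has removed the source entry:
print(data_store.has("source_data_collection"))       # False (removed by `free`)
print(data_store.has("transformed_data_collection"))  # True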
For more info on Store objects check the related API doc.
Readers¶
Readers are used to read data from arbitrary sources.
They are imported from the rtv.data.reader subpackage.
Examples of usage:
Python Script:
from rtv.data.reader import RReader
reader = RReader()
data = reader.read("/path/to/source/file")
Configuration file:
definitions:
  - name: rreader
    class: RReader
    # ...
actions:
  - read:
      source: sample.rds
      reader: rreader
      output_name: data
CSVReader¶
Reads the data from a csv file.
Short parameter list:
delimiter: A symbol used to separate values in the csv file. Defaults to ",".
lineterminator: The character(s) used to denote a line break. Defaults to "\r\n".
headless: A flag indicating whether to read the csv file table in headless mode (for files with no header row). Defaults to False.
treat_headless: A string representing the approach used when reading in headless mode. Options: as_matrix, row_wise, column_wise. Defaults to as_matrix.
Simple Examples:
Script:
from rtv.data.reader import CSVReader
reader = CSVReader({
    "delimiter": ",",
    "lineterminator": "\n",
    "headless": False,
})
data = reader.read("/path/to/csv/file")
Configuration file:
definitions:
  - name: reader
    class: CSVReader
    delimiter: ","
    lineterminator: "\n"
    headless: False
    # ...
actions:
  - read:
      reader: reader
      source: /path/to/csv/file
      output_name: data
      # ...
Headless example:
source.csv:
1,2,3
4,5,6
Python Script:
from rtv.data.reader import CSVReader
matrix_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "as_matrix",  # can be omitted
})
row_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "row_wise",
})
column_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "column_wise",
})
data = matrix_reader.read("source.csv")
print(data.to_json())
# {
# "csv_data": [
# [1,2,3],
# [4,5,6]
# ]
# }
data = row_reader.read("source.csv")
print(data.to_json())
# {
# "csv_data_0": [1,2,3],
# "csv_data_1": [4,5,6]
# }
data = column_reader.read("source.csv")
print(data.to_json())
# {
# "csv_data_0": [1,4],
# "csv_data_1": [2,5],
# "csv_data_2": [3,6]
# }
RReader¶
Reads data from .rds files. Based on the rdata package.
Short parameter list:
extension: A string to specify a different file extension. Defaults to None.
default_encoding: Specifies the default encoding for source files. Defaults to None.
force_default_encoding: A flag indicating whether to force default_encoding when reading source files. Defaults to False.
Simple Example:
Script:
from rtv.data.reader import RReader
reader = RReader({
    "extension": "rds",
    "default_encoding": "utf-16",
    "force_default_encoding": True,
})
data = reader.read("sample.rds")
Configuration file:
definitions:
  - name: rreader
    class: RReader
    extension: rds
    default_encoding: utf-16
    force_default_encoding: True
    # ...
actions:
  - read:
      source: sample.rds
      reader: rreader
      output_name: data
Writers¶
Writers are used to “export” data from rtv to arbitrary destinations.
They are imported from the rtv.data.output subpackage.
JSONFileWriter¶
Writes a data entry to a json file.
Examples:
Script:
from rtv.data.output import JSONFileWriter
writer = JSONFileWriter()
# ... Some operations to get `data_entry`
writer.write(data_entry, "output")
# Writes `data_entry` to the output.json file
Configuration file:
definitions:
  - name: json_writer
    class: JSONFileWriter
actions:
  - write:
      input: data_entry
      writer: json_writer
      output: output
ResultWriter¶
Writes a ResultCollection (a special type of DataCollection that stores the results of the validations and corresponding artifacts) to a txt file.
Examples:
Script:
from rtv.data.output import ResultWriter
result_writer = ResultWriter()
# ... Some validations happen to produce `result_collection`
result_writer.write(result_collection, "passed")
# passed.txt:
# passed: True
# keys passed:
# 'v3/rmse/k1': True
# 'v3/rmse/k2': True
# 'v3/mae/k1': True
# 'v3/mae/k2': True
# ...
Configuration file:
definitions:
  - name: result_writer
    class: ResultWriter
  # ...
actions:
  # ...
  - write:
      input: result_collection
      output: passed
      writer: result_writer
Transformers¶
Transformers are entities that are meant to apply arbitrary transformations to DataCollection objects.
Imported from the rtv.transformer subpackage.
Usage Examples:
Script:
from rtv.transformer import PassThrough
transformer = PassThrough()
transformed_data = transformer.transform(some_data_collection)
Configuration file:
definitions:
  - name: transformer
    class: PassThrough
actions:
  - transform:
      input: some_data_collection
      output_name: transformed_data
      transformers: transformer
Available Transformers:
PassThrough: Does absolutely nothing. Implemented for demonstration purposes (see the sketch below).
Parameters:
delay: Number of seconds to sleep (to simulate a transformation).
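The following is a minimal sketch of using the delay parameter; it assumes PassThrough accepts its parameters as a dictionary, like the other entity constructors in this tutorial (check the API doc for the exact signature):
from rtv.transformer import PassThrough
# NOTE: passing parameters as a dict is an assumption based on the other
# entity constructors shown in this tutorial
slow_transformer = PassThrough({"delay": 2})
# Sleeps for ~2 seconds, then returns the data unchanged;
# `some_data_collection` is a DataCollection obtained elsewhere (see the examples above)
transformed_data = slow_transformer.transform(some_data_collection)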
Validations¶
Validations are special entities that encapsulate everything needed to perform a single validation in the scenario (data collections, strategies or other involved objects).
They are imported from the rtv.validation subpackage.
They can be executed by calling the execute() method in a Python script, or by using the validate action in the configuration file.
However, for Python script usage it is recommended to use the Validator special helper object to manage the execution of validation objects (see the examples below).
StrategyValidation¶
Validates the target against the reference using the provided keys and strategies.
Parameters:
keys: List of key names of data entries to apply the strategies to. Special values are: “default” (applies to all keys in data entries that were not validated by any other validation) and “all” (applies to all keys in data entries). Defaults to “default”.
strategies: List of ValidationStrategy objects.*
key_pattern: A regex pattern to use for matching keys in data entries. If provided, the keys parameter is ignored. Defaults to an empty string.
* - If used in a configuration file, this parameter is just a list of names of ValidationStrategy entities defined earlier.
Examples:
Script:
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator
# ... Some operations to get `reference` and `target` DataCollection objects
mae = MeanAbsoluteError()
# Will be applied to all keys
validation_1 = StrategyValidation(["all"], [mae])
# Will be applied to keys that start with `prefix_` (e.g. `prefix_k1`...)
validation_2 = StrategyValidation(["default"], [mae], "prefix_*")
# To execute validations on data use `Validator` special object:
result_collection = Validator().validate(
    reference,
    target,
    [validation_1, validation_2]
)
# You can also use `execute` method of the validation object itself:
list_of_tuples = validation_1.execute(reference, target)
# But note that the output will not be as clear or easy to understand,
# and it will most probably need some manual handling; the Validator object does that
# for you and provides the results packed in a nice ResultCollection object.
Configuration file:
definitions:
  # ...
  - name: mae
    class: MeanAbsoluteError
  - name: validation_1
    class: StrategyValidation
    keys: all
    strategies: mae
  - name: validation_2
    class: StrategyValidation
    keys: default
    key_pattern: prefix_*
    strategies: mae
  # ...
actions:
  # ...
  - validate:
      reference: reference
      target: target
      validations:
        - validation_1
        - validation_2
      output_name: result_collection
Validation Strategies¶
Entities that encapsulate the logic of the actual validation, executed on specific key(s) present in both the reference and target data entries.
They are imported from the rtv.validation.strategy subpackage.
Example:
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator
# ... Some operations to get `reference` and `target` DataCollection objects
print(reference.to_json())
# { "k1": [1,1,1], ...}
print(target.to_json())
# { "k1": [0,1,1], ...}
mae = MeanAbsoluteError({"threshold": 0.1})
validation = StrategyValidation(["k1"], [mae])
# This validation will take `k1` from target as `y_pred`,
# `k1` from reference as `y_true` and calculate:
# mean_absolute_error(y_true, y_pred)
# If the resulting error is greater than 0.1, the validation
# is considered failed.
Error Metrics¶
A family of ValidationStrategy entities that calculate some commonly used error metrics.
Common Parameters:
threshold: The upper limit for the value of the calculated metric; if exceeded, the parent validation fails. Defaults to 0.
Usage Examples:
Script:
from rtv.validation.strategy import MeanAbsoluteError
strategy = MeanAbsoluteError({"threshold": 0.5})
Configuration file:
definitions:
  - name: strategy
    class: MeanAbsoluteError
    threshold: 0.5
Available Error Metric Strategy Constructors:
MeanAbsoluteError
MeanAbsolutePercentageError
MeanSquaredError
MeanSquaredLogError
RootMeanSquaredError
RootMeanSquaredLogError
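Since a StrategyValidation accepts a list of strategies, several error metrics can be combined in a single validation. A minimal sketch (the threshold values are arbitrary examples):
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError, RootMeanSquaredError
# Both metrics are evaluated for every key covered by the validation;
# the threshold values below are arbitrary example values
mae = MeanAbsoluteError({"threshold": 0.5})
rmse = RootMeanSquaredError({"threshold": 1.0})
validation = StrategyValidation(["all"], [mae, rmse])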
Comparison¶
A family of ValidationStrategy entities that are based on comparing values from the reference and target data entries.
These entities use a Comparator entity with some predefined parameters. If you want more control over the comparison process, you can use Comparator entities directly (see the following section).
Common Parameters:
deviation: A limit for the individual elements’ distance. If exceeded, the parent validation is considered failed.
Usage Examples:
Script:
from rtv.validation.strategy import ElementWiseAbsoluteDistance
strategy = ElementWiseAbsoluteDistance({
    "deviation": 50,
})
Configuration file:
definitions:
  - name: strategy
    class: ElementWiseSimpleDistance
    deviation: 50
Available Comparison Strategy Constructors:
ElementWiseAbsoluteDistance
Compares two iterables element-wise by calculating the absolute distance between each pair of elements.
Parameters:
num_max_values: Max number of biggest values to include in the validation artifacts. Defaults to 10.
ElementWiseSimpleDistance
Compares two iterables element-wise by calculating the simple distance between each pair of elements.
Parameters:
num_max_values: Max number of biggest values to include in the validation artifacts. Defaults to 10.
num_min_values: Max number of smallest values to include in the validation artifacts. Defaults to 10.
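For illustration, a minimal sketch of configuring these artifact-related parameters together with the common deviation parameter (all values are arbitrary examples):
from rtv.validation.strategy import ElementWiseSimpleDistance
# The parameter values below are arbitrary example values
strategy = ElementWiseSimpleDistance({
    "deviation": 50,
    "num_max_values": 5,  # keep the 5 biggest distances in the artifacts
    "num_min_values": 5,  # keep the 5 smallest distances in the artifacts
})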
Comparators¶
A special entity that compares values from two data entries (reference and target in most cases).
Imported from the rtv.validation.comparator subpackage.
Common Parameters:
callback: A callback function that performs the comparison.
keys: A list of keys used to get the data values for comparison.
NOTE: Currently the direct use of Comparator entities in configuration files is not supported.
ElementWiseComparator¶
Example:
from rtv.validation.comparator import ElementWiseComparator
# ... Some operations to get `reference` and `target` DataCollection objects
print(reference.to_json())
# { "k1": [1,1,1], ...}
print(target.to_json())
# { "k1": [0,1,1], ...}
comparator = ElementWiseComparator(
    lambda x, y: y - x,
    ["k1"]
)
comparison_result = comparator.compare(reference, target)
# [-1, 0, 0]