RTV Tutorial¶
RTV lets users execute validation scenarios in two ways:
- Writing Python scripts that use classes from the rtv package, and executing them.
- Writing configuration files in an arbitrary format that follow the defined structure.
This tutorial covers both approaches. It also introduces the framework’s entities/classes and explains how to extend the framework with custom entities/classes.
Python Script¶
To execute a validation scenario with RTV from a Python script, import, instantiate and invoke the required classes from the rtv package inside your script, and then run it:
python /path/to/your/script
Example¶
pred.csv:
k1,k2,k3
1,0,0
0,1,0
0,0,1
true.csv:
k1,k2,k3
0,0,1
0,1,1
1,0,1
script.py:
from rtv.data.output.writer import JSONFileWriter
from rtv.data.reader import CSVReader
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator
def main():
    pred_filename = "pred.csv"
    true_filename = "true.csv"

    # Instantiate the Reader and the Writer entities
    reader = CSVReader({"delimiter": ","})
    writer = JSONFileWriter()

    # Read sources to get reference and target DataCollection objects
    reference = reader.read(true_filename)
    target = reader.read(pred_filename)

    # Instantiate ValidationStrategy entities
    mae_strategy_05 = MeanAbsoluteError({"threshold": 0.5})
    mae_strategy_03 = MeanAbsoluteError({"threshold": 0.3})
    mae_strategy_01 = MeanAbsoluteError({"threshold": 0.1})

    # Set names for validation strategies
    mae_strategy_05.set_name("mae_05")
    mae_strategy_03.set_name("mae_03")
    mae_strategy_01.set_name("mae_01")

    # Instantiate Validation entities
    v1 = StrategyValidation(["default"], [mae_strategy_05])
    v2 = StrategyValidation(["k1"], [mae_strategy_01])
    v3 = StrategyValidation(["k2"], [mae_strategy_03])

    # Set the names for validations
    v1.name = "v1"
    v2.name = "v2"
    v3.name = "v3"

    # Run the Validation entities via the special Validator object
    result_collection = Validator().validate(reference, target, [v1, v2, v3])

    # Write the outputs
    writer.write(result_collection, "test_output")


if __name__ == "__main__":
    main()
NOTE: The classes used above are described in more detail later in this tutorial.
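To make the expected outcome of the example easier to follow, here is a minimal sketch in plain Python. It does not use RTV. It computes the mean absolute error per key for the pred.csv and true.csv contents and compares it with the thresholds from the script. It assumes that a validation fails when the metric exceeds its threshold and that v1, with the "default" key, ends up covering only k3. The helper names are hypothetical.

```python
import csv
import io

# The two example files from above, inlined so the sketch is self-contained.
PRED_CSV = "k1,k2,k3\n1,0,0\n0,1,0\n0,0,1\n"
TRUE_CSV = "k1,k2,k3\n0,0,1\n0,1,1\n1,0,1\n"


def read_columns(text):
    """Parse CSV text into a {column_name: [float, ...]} mapping."""
    columns = {}
    for row in csv.DictReader(io.StringIO(text)):
        for key, value in row.items():
            columns.setdefault(key, []).append(float(value))
    return columns


def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over paired elements."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


reference = read_columns(TRUE_CSV)
target = read_columns(PRED_CSV)

# Thresholds per key, mirroring v2 (k1), v3 (k2) and v1 (remaining key k3).
thresholds = {"k1": 0.1, "k2": 0.3, "k3": 0.5}

errors = {k: mean_absolute_error(reference[k], target[k]) for k in thresholds}
passed = {k: errors[k] <= thresholds[k] for k in thresholds}

for key in thresholds:
    print(key, round(errors[key], 3), "passed" if passed[key] else "failed")
```

Under these assumptions only k2 stays within its threshold; k1 and k3 both have an error of 2/3.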
Configuration File¶
To execute a validation scenario from an RTV configuration file, run rtv from the command line and provide the path to the configuration file, like this:
rtv /path/to/config/file
Currently supported formats:
- yaml
- json
NOTE: Most examples in this tutorial use the yaml format.
Structure¶
A valid configuration file for RTV should have two main sections:
- definitions: a list of the framework’s entities that will be used in the validation scenario.
- actions: a list of actions that will be performed during validation scenario execution.
Minimal example:
yaml:
definitions:
  - name: csv_reader
    class: CSVReader
    delimiter: "|"
actions:
  - read:
      reader: csv_reader
      source: vector.csv
      output_name: vector_data
json:
{
  "definitions": [
    {
      "name": "csv_reader",
      "class": "CSVReader",
      "delimiter": "|"
    }
  ],
  "actions": [
    {
      "read": {
        "reader": "csv_reader",
        "source": "vector.csv",
        "output_name": "vector_data"
      }
    }
  ]
}
Definitions¶
Each element of the list in the definitions section of the configuration file should have the following required fields:
- name: An alias, similar to a variable name, that you can use later in the config to reference the defined entity.
- class: The name of the entity’s constructor class.
The rest of the definition fields are parameters specific to the entity. In the previous example, the delimiter field is a parameter of CSVReader.
NOTE: You can find a list of available entities/classes and their parameters in the following sections of this tutorial.
Actions¶
The common structure of an actions section entry is as follows:
actions:
  - <action_type>:
      - <action_param>: ...
        # ...
      - <action_param>: ...
        # ...
The set of <action_param> fields is specific to each action type.
Example with the read <action_type>:
actions:
  - read:
      - reader: csv_reader
        source: vector.csv
        output_name: vector_table_data
      - reader: txt_reader
        source: vector.txt
        output_name: vector_text_data
NOTE: You will find information on the available <action_type> values and the related <action_param> fields in the following section of this tutorial.
During a validation run, actions are executed in the order in which they are defined in the config, so the following example will lead to an error:
actions:
  - transform:
      input: vector_data
      output_name: transformed_vector_data
      transformers: vector_transposer
  - read:
      reader: csv_reader
      source: vector.csv
      output_name: vector_data
The transform action will raise an exception when it tries to access the vector_data entry, because that entry only becomes available after the read action has executed successfully.
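The ordering rule can be illustrated with a small stand-alone sketch. It is not RTV's implementation: the dictionary-based data store, the run_actions helper and the two toy handlers are assumptions made up for this illustration. It only shows why an action that references a data entry before that entry was stored fails.

```python
def run_actions(actions, handlers):
    """Run (action_type, params) pairs in order against a shared store."""
    store = {}
    for action_type, params in actions:
        handlers[action_type](store, params)
    return store


def read(store, params):
    # Toy reader: stores a fixed vector instead of reading a file.
    store[params["output_name"]] = [1, 2, 3]


def transform(store, params):
    # Fails with KeyError when the input entry has not been stored yet.
    data = store[params["input"]]
    store[params["output_name"]] = list(reversed(data))


handlers = {"read": read, "transform": transform}

read_action = ("read", {"output_name": "vector_data"})
transform_action = (
    "transform",
    {"input": "vector_data", "output_name": "transformed_vector_data"},
)

# Correct order: read first, then transform.
store = run_actions([read_action, transform_action], handlers)
print(store["transformed_vector_data"])

# Wrong order: transform runs before vector_data exists.
try:
    run_actions([transform_action, read_action], handlers)
    order_error = None
except KeyError as exc:
    order_error = exc
print("error:", order_error)
```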
Available actions¶
read¶
Used to read data from arbitrary source(s), convert it to the RTV internal data representation and save it to the current scenario’s data store.
Fields:
- reader: The name of the Reader entity to use for the action execution.
- source: A path to a source.
- output_name: A unique (within the current scenario) name that will be used to store and reference the action’s result.
- pattern: Optional field, a regex pattern to match more than one source file. If this field is provided, then source should be a path to a directory containing the source files to match against pattern. Defaults to an empty string.
- prefix_key: Optional field, a prefix string to prepend to every key of the resulting data entry. Defaults to an empty string.
Example:
Read reference.csv and the files in the iterations/ directory that match the pattern, and save the resulting data as reference and target respectively:
definitions:
  - name: csv_reader
    class: CSVReader
actions:
  - read:
      - reader: csv_reader
        source: reference.csv
        output_name: reference
        prefix_key: ref
      - reader: csv_reader
        source: iterations/
        pattern: iter_(\d+).csv
        # will match: iter_001.csv, iter_002.csv...
        output_name: target
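How a pattern selects files can be sketched with Python's re module. Whether RTV anchors the pattern to the full file name or searches within it is not stated in this tutorial, so the use of re.fullmatch, the match_sources helper and the directory listing below are assumptions.

```python
import re


def match_sources(file_names, pattern):
    """Return the file names that fully match the regex, sorted."""
    regex = re.compile(pattern)
    return sorted(name for name in file_names if regex.fullmatch(name))


# Hypothetical directory listing of iterations/
listing = ["iter_001.csv", "iter_002.csv", "notes.txt", "iter_final.csv"]

matched = match_sources(listing, r"iter_(\d+).csv")
print(matched)  # ['iter_001.csv', 'iter_002.csv']
```

Note that the unescaped `.` in the pattern matches any character; writing `iter_(\d+)\.csv` is stricter.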
write¶
Used to write a data entry to some output destination using a Writer entity.
Fields:
- input: The name of the data entry to write to output.
- writer: The name of the defined Writer entity to use for the action execution.
- output: The output destination for the action’s result. The actual type depends on the writer implementation.
Example:
Write the result data entry to a json file named validation_result.json using the JSONFileWriter entity.
definitions:
  # ...
  - name: json_writer
    class: JSONFileWriter
  # ...
actions:
  # ...
  - write:
      input: result
      writer: json_writer
      output: validation_result
transform¶
Used to transform data entries using Transformer entities and save the result as a new data entry.
Fields:
- input: The name of the data entry to transform.
- transformers: A name (or a list of names) of the Transformer entities to use for the action execution.
- output_name: A unique name that will be used to store and later reference the result of the action.
Example:
Transform the result data entry using inverse_transformer and save the transformed result to the result_transformed data entry.
definitions:
  # ...
  - name: inverse_transformer
    class: InverseTransformer
  # ...
actions:
  # ...
  - transform:
      input: result
      transformers: inverse_transformer
      output_name: result_transformed
  # ...
validate¶
Used to validate the target data entry against the reference data entry using one or more Validation entities.
Fields:
- reference: The name of the data entry to use as reference.
- target: The name of the data entry to use as target.
- validations: A name (or a list of names) of the Validation entities to use for the action execution.
- output_name: A unique name that will be used to store and reference the result of the action.
Example:
Validate the a data entry against the b data entry using the v1 validation and write the resulting data entry to result.
definitions:
  # ...
  - name: mae
    class: MeanAbsoluteError
    threshold: 0.5
  - name: v1
    class: StrategyValidation
    strategies: mae
    keys: all
  # ...
actions:
  # ...
  - validate:
      reference: b
      target: a
      validations: v1
      output_name: result
  # ...
free¶
Used to remove data entries from the current scenario’s data store.
Fields:
- targets: The names of the data entries to remove.
Example:
Remove the a and b data entries.
actions:
  - free:
      targets: [a, b]
Classes¶
Every entity in the RTV framework has a semantic meaning based on its purpose.
Currently, the following entity types are of interest to the user:
- Reader: Used to read data from the “outer world” (source types are arbitrary).
- Writer: Used to expose data to the “outer world” (output types are arbitrary).
- Transformer: Used to apply arbitrary transformations to internal data entries.
- Validation: Encapsulates data entries (reference and target) and other entities required to perform the validation.
- ValidationStrategy: An entity that encapsulates and performs the actual logic of the validation.
Every entity in the RTV framework has a class constructor that the user can use in two ways:
Define and use the entity in a config file:
definitions:
  - name: reader
    class: CSVReader
    # ...
actions:
  - read:
      reader: reader
      # ...
Import the class into a Python script:
from rtv.data.reader import CSVReader
reader = CSVReader()
The following sections list the available entity classes, with brief descriptions and usage examples.
Readers¶
Readers are used to read data from arbitrary sources.
They are imported from rtv.data.reader
sub-package.
Examples of usage:
Python Script:
from rtv.data.reader import RReader
reader = RReader()
data = reader.read("/path/to/source/file")
Configuration file:
definitions:
  - name: rreader
    class: RReader
  # ...
actions:
  - read:
      source: sample.rds
      reader: rreader
      output_name: data
CSVReader¶
Reads data from a csv file.
Short parameters list:
- delimiter: The symbol used to separate values in the csv file. Defaults to ",".
- lineterminator: The character(s) used to denote a line break. Defaults to \r\n.
- headless: A flag indicating whether to read the csv file in headless mode (for files with no header row). Defaults to False.
- treat_headless: A string selecting the approach used when reading in headless mode. Options: as_matrix, row_wise, column_wise. Defaults to as_matrix.
Simple Examples:
Script:
from rtv.data.reader import CSVReader
reader = CSVReader({
    "delimiter": ",",
    "lineterminator": "\n",
    "headless": False,
})
data = reader.read("/path/to/csv/file")
Configuration file:
definitions:
  - name: reader
    class: CSVReader
    delimiter: ","
    lineterminator: "\n"
    headless: False
  # ...
actions:
  - read:
      reader: reader
      source: /path/to/csv/file
      output_name: data
  # ...
Headless example:
source.csv:
1,2,3
4,5,6
Python Script:
from rtv.data.reader import CSVReader

matrix_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "as_matrix",  # can be omitted
})
row_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "row_wise",
})
column_reader = CSVReader({
    "delimiter": ",",
    "headless": True,
    "treat_headless": "column_wise",
})

data = matrix_reader.read("source.csv")
print(data.to_json())
# {
#   "csv_data": [
#     [1,2,3],
#     [4,5,6]
#   ]
# }

data = row_reader.read("source.csv")
print(data.to_json())
# {
#   "csv_data_0": [1,2,3],
#   "csv_data_1": [4,5,6]
# }

data = column_reader.read("source.csv")
print(data.to_json())
# {
#   "csv_data_0": [1,4],
#   "csv_data_1": [2,5],
#   "csv_data_2": [3,6]
# }
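The three treat_headless modes can be mimicked with the standard csv module. The sketch below is not CSVReader's implementation; the read_headless function is hypothetical and simply reproduces the key naming and layouts shown in the outputs above, assuming integer values.

```python
import csv
import io


def read_headless(text, treat_headless="as_matrix", name="csv_data"):
    """Parse header-less CSV text into a dict, per the chosen mode."""
    rows = [[int(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]
    if treat_headless == "as_matrix":
        return {name: rows}
    if treat_headless == "row_wise":
        return {f"{name}_{i}": row for i, row in enumerate(rows)}
    if treat_headless == "column_wise":
        columns = zip(*rows)  # transpose rows into columns
        return {f"{name}_{i}": list(col) for i, col in enumerate(columns)}
    raise ValueError(f"unknown treat_headless mode: {treat_headless}")


source = "1,2,3\n4,5,6\n"
as_matrix = read_headless(source)
row_wise = read_headless(source, "row_wise")
column_wise = read_headless(source, "column_wise")
print(as_matrix, row_wise, column_wise, sep="\n")
```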
RReader¶
Reads data from .rds files. Based on the rdata package.
Short parameters list:
- extension: A string specifying a different file extension. Defaults to None.
- default_encoding: The default encoding for source files. Defaults to None.
- force_default_encoding: A flag indicating whether to force the default encoding when reading source files. Defaults to False.
Simple Example:
Script:
from rtv.data.reader import RReader
reader = RReader({
    "extension": "rds",
    "default_encoding": "utf-16",
    "force_default_encoding": True,
})
data = reader.read("sample.rds")
QS2Reader¶
Custom reader for R’s qs2 data files.
Important: Requires R to be installed on the system, and the qs2 package to be installed in R.
Configuration file:
definitions:
  - name: qsreader
    class: QS2Reader
actions:
  - read:
      source: sample.qs2
      reader: qsreader
      output_name: data
Writers¶
Writers are used to “export” data from rtv to arbitrary destinations.
They are imported from rtv.data.output
sub-package.
JSONFileWriter¶
Writes a data entry to a json file.
Examples:
Script:
from rtv.data.output import JSONFileWriter
writer = JSONFileWriter()
# ...
writer.write(data_entry, "output")
# Writes `data_entry` to the output.json file
Configuration file:
definitions:
  - name: json_writer
    class: JSONFileWriter
actions:
  - write:
      input: data_entry
      writer: json_writer
      output: output
ResultWriter¶
Writes a ResultCollection (a special type of DataCollection that stores the results of the validations and the corresponding artifacts) to a txt file.
Examples:
Script:
from rtv.data.output import ResultWriter
result_writer = ResultWriter()
# ... Some validations happen to produce `result_collection`
result_writer.write(result_collection, "passed")
# passed.txt:
# passed: True
# keys passed:
#   'v3/rmse/k1': True
#   'v3/rmse/k2': True
#   'v3/mae/k1': True
#   'v3/mae/k2': True
#   ...
Configuration file:
definitions:
  - name: result_writer
    class: ResultWriter
  # ...
actions:
  # ...
  - write:
      input: result_collection
      output: passed
      writer: result_writer
Transformers¶
Transformers are entities that apply arbitrary transformations to DataCollection objects.
Imported from the rtv.transformer sub-package.
Usage Examples:
Script:
from rtv.transformer import PassThrough
transformer = PassThrough()
transformed_data = transformer.transform(some_data_collection)
Configuration file:
definitions:
  - name: transformer
    class: PassThrough
actions:
  - transform:
      input: some_data_collection
      output_name: transformed_data
      transformers: transformer
Available Transformers:
- PassThrough: Does absolutely nothing. Implemented for demonstration purposes.
  Parameters:
  - delay: The number of seconds to sleep (to simulate a transformation).
Validations¶
Validations are special entities that encapsulate everything needed to perform a single validation in the scenario (data collections, strategies or other involved objects).
They are imported from the rtv.validation sub-package.
They can be executed by calling the execute() method in a Python script, or by using the validate action in the configuration file.
However, in Python scripts it is recommended to use the special Validator helper object to manage the execution of validation objects (see the examples below).
StrategyValidation¶
Validates target against reference using the provided keys and strategies.
Parameters:
- keys: A list of key names of data entries to apply the strategies to. Special values are “default” (applies to all keys in the data entries that were not validated by any other validation) and “all” (applies to all keys in the data entries). Defaults to “default”.
- strategies: A list of ValidationStrategy objects.*
- key_pattern: A regex pattern used to match keys in data entries. If provided, the keys parameter is ignored. Defaults to an empty string.
* - If used in a configuration file, this parameter is just a list of names of ValidationStrategy entities defined earlier.
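One possible reading of how keys, the special values and key_pattern interact is sketched below. This is an interpretation of the parameter descriptions above, not RTV's code: the resolve_keys helper, the dictionary layout and the use of re.match for key_pattern are all assumptions.

```python
import re


def resolve_keys(validations, available_keys):
    """Map each validation name to the data keys it would cover."""

    def explicit(v):
        # key_pattern, when set, takes precedence over keys.
        if v.get("key_pattern"):
            regex = re.compile(v["key_pattern"])
            return [k for k in available_keys if regex.match(k)]
        if v["keys"] == ["all"]:
            return list(available_keys)
        if v["keys"] == ["default"]:
            return None  # resolved in the second pass
        return [k for k in v["keys"] if k in available_keys]

    resolved = {v["name"]: explicit(v) for v in validations}
    claimed = {k for keys in resolved.values() if keys is not None for k in keys}
    for name, keys in list(resolved.items()):
        if keys is None:
            # "default": every key no other validation has claimed.
            resolved[name] = [k for k in available_keys if k not in claimed]
    return resolved


validations = [
    {"name": "v1", "keys": ["default"]},
    {"name": "v2", "keys": ["k1"]},
    {"name": "v3", "keys": ["k2"]},
]
coverage = resolve_keys(validations, ["k1", "k2", "k3"])
print(coverage)  # {'v1': ['k3'], 'v2': ['k1'], 'v3': ['k2']}
```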
Examples:
Script:
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator

# ... Some operations to get `reference` and `target` DataCollection objects
mae = MeanAbsoluteError()

# Will be applied to all keys
validation_1 = StrategyValidation(["all"], [mae])

# Will be applied to keys that start with `prefix_` (e.g. `prefix_k1`...)
validation_2 = StrategyValidation(["default"], [mae], "prefix_*")

# To execute validations on data, use the special `Validator` object:
result_collection = Validator().validate(
    reference,
    target,
    [validation_1, validation_2]
)

# You can also use the `execute` method of the validation object itself:
list_of_tuples = validation_1.execute(reference, target)
# Keep in mind that this raw output is harder to read and will most likely
# need manual handling. The Validator object does that for you and returns
# the results packed in a ResultCollection object.
Configuration file:
definitions:
  # ...
  - name: mae
    class: MeanAbsoluteError
  - name: validation_1
    class: StrategyValidation
    keys: all
    strategies: mae
  - name: validation_2
    class: StrategyValidation
    keys: default
    key_pattern: prefix_*
    strategies: mae
  # ...
actions:
  # ...
  - validate:
      reference: reference
      target: target
      validations:
        - validation_1
        - validation_2
      output_name: result_collection
Validation Strategies¶
Entities that encapsulate the logic of the actual validation, executed on specific key(s) present in both the reference and target data entries.
They are imported from the rtv.validation.strategy sub-package.
Example:
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator

# ... Some operations to get `reference` and `target` DataCollection objects
print(reference.to_json())
# { "k1": [1,1,1], ...}
print(target.to_json())
# { "k1": [0,1,1], ...}

mae = MeanAbsoluteError({"threshold": 0.1})
validation = StrategyValidation(["k1"], [mae])
# This validation will take `k1` from target as `y_pred`,
# `k1` from reference as `y_true` and calculate:
#     mean_absolute_error(y_true, y_pred)
# If the resulting error is greater than 0.1, the validation
# is considered failed.
Error Metrics¶
A family of ValidationStrategy entities that calculate commonly used error metrics.
Common Parameters:
- threshold: The upper limit for the value of the calculated metric. If it is exceeded, the parent validation fails. Defaults to 0.
Usage Examples:
Script:
from rtv.validation.strategy import MeanAbsoluteError
strategy = MeanAbsoluteError({"threshold": 0.5})
Configuration file:
definitions:
  - name: strategy
    class: MeanAbsoluteError
    threshold: 0.5
Available Error Metric Strategy Constructors:
MeanAbsoluteError
MeanAbsolutePercentageError
MeanSquaredError
MeanSquaredLogError
RootMeanSquaredError
RootMeanSquaredLogError
RelativeMAPE
RelativeRMSE
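For orientation, a few of these metrics are sketched below in plain Python using their textbook definitions. This is not RTV's code, and the exact definitions RTV uses may differ. The Relative* variants are omitted because their definitions are not given in this tutorial.

```python
import math


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))


def mean_absolute_percentage_error(y_true, y_pred):
    # Undefined when a true value is 0; this sketch does not guard against it.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_log_error(y_true, y_pred):
    # Uses log(1 + x), so inputs must be greater than -1.
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


y_true = [2.0, 4.0, 8.0]
y_pred = [1.0, 4.0, 10.0]

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = root_mean_squared_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # (0.5 + 0 + 0.25) / 3
print(mse, rmse, mape)
```

With a threshold of 0.5, for example, the MAPE of 0.25 would stay within the limit while the MSE of about 1.67 would exceed it.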
Comparison¶
A family of ValidationStrategy entities that are based on comparing values from the reference and target data entries.
These entities use a Comparator entity with some predefined parameters.
If you want more control over the comparison process, you can use Comparator entities directly (see the following section).
Common Parameters:
- deviation: A limit for the distance between individual elements. If exceeded, the parent validation is considered failed.
Usage Examples:
Script:
from rtv.validation.strategy import ElementWiseAbsoluteDistance
strategy = ElementWiseAbsoluteDistance({
    "deviation": 50,
})
Configuration file:
definitions:
  - name: strategy
    class: ElementWiseSimpleDistance
    deviation: 50
Available Comparison Strategy Constructors:
- ElementWiseAbsoluteDistance: Compares two Iterables element-wise by calculating the absolute distance between each pair of elements.
  Parameters:
  - num_max_values: The maximum number of largest values to include in the validation artifacts. Defaults to 10.
- ElementWiseSimpleDistance: Compares two Iterables element-wise by calculating the simple distance between each pair of elements.
  Parameters:
  - num_max_values: The maximum number of largest values to include in the validation artifacts. Defaults to 10.
  - num_min_values: The maximum number of smallest values to include in the validation artifacts. Defaults to 10.
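The idea behind these strategies can be sketched in plain Python. The sketch rests on several assumptions that this tutorial does not confirm: that "simple distance" means the signed difference target - reference, that deviation is compared against the magnitude of each distance, and that the artifacts are simply the largest distances. The function names are hypothetical.

```python
def element_wise_distance(reference, target, absolute=True):
    """Per-element distance between two equally long sequences."""
    diffs = [t - r for r, t in zip(reference, target)]
    return [abs(d) for d in diffs] if absolute else diffs


def check_deviation(distances, deviation, num_max_values=10):
    """Fail if any distance magnitude exceeds `deviation`."""
    passed = all(abs(d) <= deviation for d in distances)
    # Keep the largest distances as an artifact for later inspection.
    largest = sorted(distances, reverse=True)[:num_max_values]
    return passed, largest


reference = [100, 200, 300, 400]
target = [110, 140, 300, 455]

distances = element_wise_distance(reference, target)
passed, largest = check_deviation(distances, deviation=50, num_max_values=2)
print(distances, passed, largest)
```

Here two elements differ by more than 50, so the check fails and the two largest distances are kept.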
Comparators¶
A special entity that compares values from two data entries (in most cases, reference and target).
Imported from the rtv.validation.comparator sub-package.
Common Parameters:
- callback: A callback function that performs the comparison.
- keys: A list of keys used to get the data values for comparison.
NOTE: Direct use of Comparator entities in configuration files is currently not supported.
ElementWiseComparator¶
Example:
from rtv.validation.comparator import ElementWiseComparator

# ... Some operations to get `reference` and `target` DataCollection objects
print(reference.to_json())
# { "k1": [1,1,1], ...}
print(target.to_json())
# { "k1": [0,1,1], ...}

comparator = ElementWiseComparator(
    lambda x, y: y - x,
    ["k1"]
)
comparison_result = comparator.compare(reference, target)
# [-1, 0, 0]
Custom Entities¶
RTV can be extended by custom user entities (classes) to provide missing functionality for user validation scenario (e.g. implementing some custom error metrics) or extend supported configuration file formats.
Implementing and registering¶
Custom entities should implement the framework’s pre-defined interfaces and inherit from a base class.
The core base class is BaseEntity, imported from rtv.core.base.
The framework also provides base classes for the core types of entities. All base classes are child classes of BaseEntity.
Base classes that can be imported from rtv.base:
- BaseComparisonStrategy
- BaseErrorMetricStrategy
- BaseValidationStrategy
- BaseWriter
The BaseReader class can be imported from rtv.data.reader.base.
The BaseAction class can be imported from rtv.action.base.
Most core interfaces are imported from the rtv.interfaces module. Those are:
- IComparator
- IReader
- ITransformer
- IValidationStrategy
- IWriter
The IAction interface can be imported from rtv.action.interfaces.
The IConfigLoader interface can be imported from rtv.config.interfaces.
The IValidation interface can be imported from rtv.validation.interfaces.
NOTE: You can find the interface definitions in the Core Interfaces section.
Here is an example of custom transformer implementation:
from pydantic import BaseModel
from rtv.core.base import BaseEntity
from rtv.interfaces import ITransformer


class MyAwesomeTransformer(BaseEntity, ITransformer, idf="awesome"):
    class Params(BaseModel):
        my_awesome_param: int

    ...

    def transform(self, data):
        ...
The entity defined above can be used in a configuration file like this:
definitions:
  - name: my_awesome_transformer
    class: MyAwesomeTransformer  # You can also use the `idf` alias here
    my_awesome_param: 42
actions:
  - transform:
      input: data
      output_name: transformed
      transformers: my_awesome_transformer
Using the nested Params(BaseModel) class provides automatic validation of your entity’s parameters. If your entity does not have parameters, you can omit this nested class:
from rtv.core.base import BaseEntity
from rtv.interfaces import ITransformer


class MyAwesomeTransformer(BaseEntity, ITransformer, idf="awesome"):
    def transform(self, data):
        ...
It is important to inherit from the interface class when implementing your custom entities, because interface classes provide the mechanisms that later allow you to use your entity via configuration files.
NOTE: idf is optional for almost all entities, and is most likely not needed if the entity will only be used in Python scripts. There are some exceptions, which are mentioned below. It is just an alias that makes config files more concise.
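The idf="awesome" argument in the class statements above is ordinary Python syntax: keyword arguments in a class statement are forwarded, by default, to the parent's __init_subclass__ method. The sketch below shows one way such an alias could be recorded in a registry. It is an assumption for illustration only; RTV's actual registration mechanism is not described in this tutorial, and Registrable and REGISTRY are hypothetical names.

```python
REGISTRY = {}


class Registrable:
    """Hypothetical base: registers subclasses by class name and by idf."""

    def __init_subclass__(cls, idf=None, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls
        if idf is not None:
            REGISTRY[idf] = cls


class MyAwesomeTransformer(Registrable, idf="awesome"):
    def transform(self, data):
        return data


# Both the class name and the alias resolve to the same class.
by_name = REGISTRY["MyAwesomeTransformer"]
by_alias = REGISTRY["awesome"]
print(by_name is by_alias)  # True
```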
The following table shows which entity types you can implement and which classes you should inherit from:
Entity Name | Inherit from |
---|---|
Reader | BaseReader, IReader |
Transformer | BaseEntity, ITransformer |
Validation | BaseEntity, IValidation |
Validation Strategy | BaseValidationStrategy, IValidationStrategy |
Action * | BaseAction, IAction |
Writer | BaseWriter, IWriter |
Config Loader ** | IConfigLoader, idf="<extension_suffix>" |
* - See implementing custom actions
** - See implementing custom config loaders
Implementing custom actions¶
When implementing custom actions, we recommend adding a short and descriptive identifier:
# ...
class MyCustomAction(BaseAction, IAction, idf="greet"):
    class Params(BaseModel):
        message: str

    def execute(self):
        print(self.message)
# ...
That makes the action more convenient to use in configuration files:
actions:
  - greet:
      message: "Hello World!"
NOTE: Custom actions should not be listed in the definitions section; they are simply used by their alias in the actions section.
Moreover, implementing custom actions only makes sense for use in configuration files.
Implementing custom config loaders¶
IConfigLoader is a special case: this class does not inherit from BaseEntity, does not support the nested Params class, and requires idf to be the same as the file extension suffix:
class TxtConfigLoader(IConfigLoader, idf="txt"):
    ...
Otherwise, the run is expected to crash.
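Why idf has to match the extension suffix can be illustrated with a sketch of suffix-based dispatch. This is a guess at the general mechanism, not RTV's implementation: ConfigLoader is a hypothetical stand-in for IConfigLoader, and LOADERS and load_config are made-up names. Only the load(config_path) method shape is taken from the interface listed in the Core Interfaces section.

```python
import json
import os
import tempfile
from pathlib import Path

LOADERS = {}


class ConfigLoader:
    """Hypothetical stand-in for IConfigLoader; idf is the file suffix."""

    def __init_subclass__(cls, idf, **kwargs):
        super().__init_subclass__(**kwargs)
        LOADERS[idf] = cls

    def load(self, config_path):
        raise NotImplementedError


class JsonConfigLoader(ConfigLoader, idf="json"):
    def load(self, config_path):
        with open(config_path, encoding="utf-8") as handle:
            return json.load(handle)


def load_config(config_path):
    """Pick a loader by extension suffix and load the file."""
    suffix = Path(config_path).suffix.lstrip(".")
    if suffix not in LOADERS:
        raise ValueError(f"no config loader registered for '.{suffix}'")
    return LOADERS[suffix]().load(config_path)


# Usage: write a tiny config to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"definitions": [], "actions": []}, handle)
    config = load_config(path)
print(config)
```

In this sketch a file whose suffix has no registered loader raises a ValueError, which mirrors the failure described above.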
Registering for use in config files¶
The framework automatically adds the user’s custom classes to the registry, which makes them available for use in config files.
However, the framework needs to know where to look for the custom code, so users need to set the RTV_USER_CODE_PATH environment variable:
export RTV_USER_CODE_PATH=<custom_code_directory_path>
Substitute <custom_code_directory_path> with the actual path on your file system where you will store the custom code files for RTV. You can structure and name those files as you want.
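How a framework might pick up user code from such a directory can be sketched with importlib. This is a generic illustration, not RTV's loader: the load_user_code helper, the recursive search for .py files and the temporary directory used for the demonstration are assumptions. Only the RTV_USER_CODE_PATH variable name comes from this tutorial.

```python
import importlib.util
import os
import tempfile
from pathlib import Path


def load_user_code(env_var="RTV_USER_CODE_PATH"):
    """Import every .py file found under the directory named by env_var."""
    root = os.environ.get(env_var)
    if not root:
        return []
    modules = []
    for file_path in sorted(Path(root).rglob("*.py")):
        spec = importlib.util.spec_from_file_location(file_path.stem, str(file_path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file, defining its classes
        modules.append(module)
    return modules


# Usage with a throwaway directory standing in for the user's code folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "my_entities.py").write_text("ANSWER = 42\n", encoding="utf-8")
    os.environ["RTV_USER_CODE_PATH"] = tmp
    loaded = load_user_code()
    del os.environ["RTV_USER_CODE_PATH"]

print([m.__name__ for m in loaded], loaded[0].ANSWER)
```

Executing each file is what would give class-level registration hooks a chance to run, which is why the files can be structured and named freely.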
Defining custom entities in config¶
YAML Config Example:
Using the custom class name:
definitions:
  - name: my_awesome_transformer
    class: MyAwesomeTransformer
    my_awesome_param: 42
  # ...
Using idf (identifier/alias):
definitions:
  - name: my_awesome_transformer
    class: awesome
    my_awesome_param: 42
  # ...
Using custom entities in actions¶
YAML Config Example:
Custom transformer:
actions:
  - transform:
      input: data
      output_name: transformed_data
      transformers: my_awesome_transformer
  # ...
Custom action:
actions:
  - my_awesome_action:
      awesome_parameter: 42
  # ...
Core Interfaces¶
class IReader(Interface):
    @abstractmethod
    def read(
        self, *args, **kwargs
    ) -> DataCollection:
        ...


class IWriter(Interface):
    @abstractmethod
    def write(
        self,
        data: DataCollection,
        *args, **kwargs
    ) -> None:
        ...


class ITransformer(Interface):
    @abstractmethod
    def transform(
        self, collection: DataCollection, **kwargs
    ) -> DataCollection:
        ...


class IValidation(Interface):
    @abstractmethod
    def execute(
        self, reference: DataCollection, target: DataCollection
    ) -> List[Tuple[str, ValidationResultModel]]:
        ...

    @property
    @abstractmethod
    def name(self) -> str:
        ...


class IValidationStrategy(Interface):
    @abstractmethod
    def validate(
        self,
        key: str,
        reference: DataCollection,
        target: DataCollection,
    ) -> ValidationResultModel:
        ...

    @property
    @abstractmethod
    def details(self) -> StrategyDetailsModel:
        ...

    @property
    @abstractmethod
    def name(self) -> str:
        ...


class IAction(Interface):
    @abstractmethod
    def execute(self) -> None:
        ...


class IConfigLoader(Interface):
    @abstractmethod
    def load(self, config_path: str) -> Dict[str, Any]:
        ...