Welcome to RTV Documentation!¶
This site covers RTV’s high-level overview, usage & API documentation.
RTV - Reference-Target Validator¶
Overview¶
RTV is a framework for validating data against a reference. It provides a set of Python classes that help users set up, manage, and automate complex data validation scenarios. Users can also extend it with custom entities to suit their specific data validation needs.
Installation¶
Prerequisites¶
In order to proceed with the installation you need to have the following installed/available on your machine:
Steps¶
Set up the virtual environment:
python -m venv ~/env/rtv
Activate the environment:
UNIX:
source ~/env/rtv/bin/activate
Install the package:
pip install --extra-index-url https://pypi.perfacct.eu rtv-framework
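To confirm from Python that the install step succeeded, one option is to query the installed distribution's version. This is a minimal sketch using only the standard library; it assumes the distribution name matches the `rtv-framework` name used in the pip command, and the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution is named as in the pip command.
version = installed_version("rtv-framework")
print(version if version else "rtv-framework is not installed in this environment")
```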
Usage¶
RTV provides entities (classes) for use. The key types of those entities are:
- Readers: Used for reading data from source files and converting it into internal DataCollection objects.
- Transformers: Used to apply various transformations to DataCollection objects.
- Validations: Used to perform validation procedures on DataCollection objects. Details of internal composition (attributes, methods, etc.) depend on the concrete implementation.
- Writers: Used to write validation results and any arbitrary data to output destinations (files, stdout, sockets and such).
Option 1: Python Script¶
When using the framework in Python scripts, import the entities (classes) you need and use them in your code.
Here is a basic example script from this repo’s example directory:
Check the Running Examples section for instructions on how to run this example script yourself.
from rtv.data.output.writer import JSONFileWriter
from rtv.data.reader import CSVReader
from rtv.validation import StrategyValidation
from rtv.validation.strategy import MeanAbsoluteError
from rtv.validation.validator import Validator


def main():
    pred_filename = "./input/basic/pred.csv"
    true_filename = "./input/basic/true.csv"

    # Instantiate the Reader and the Writer
    reader = CSVReader({"delimiter": ","})
    writer = JSONFileWriter()

    # Read sources to get reference and target DataCollection objects
    reference = reader.read(true_filename)
    target = reader.read(pred_filename)

    # Instantiate validation strategies
    mae_strategy_05 = MeanAbsoluteError({"threshold": 0.5})
    mae_strategy_03 = MeanAbsoluteError({"threshold": 0.3})
    mae_strategy_01 = MeanAbsoluteError({"threshold": 0.1})
    mae_strategy_05.set_name("mae_05")
    mae_strategy_03.set_name("mae_03")
    mae_strategy_01.set_name("mae_01")

    # Instantiate validations
    v1 = StrategyValidation(["default"], [mae_strategy_05])
    v2 = StrategyValidation(["k1"], [mae_strategy_01])
    v3 = StrategyValidation(["k2"], [mae_strategy_03])

    # Set the names for validations
    v1.name = "v1"
    v2.name = "v2"
    v3.name = "v3"

    # Run the validations
    result_collection = Validator().validate(reference, target, [v1, v2, v3])

    # Write the outputs
    writer.write(result_collection, "test_output")


if __name__ == "__main__":
    main()
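For readers unfamiliar with the metric, the sketch below shows what a mean-absolute-error check against a threshold computes. It is a stand-alone illustration in plain Python, not RTV's implementation: the function names are hypothetical, and the pass condition (error not exceeding the threshold) is an assumption about how such a strategy might decide.

```python
from typing import Sequence


def mean_absolute_error(reference: Sequence[float], target: Sequence[float]) -> float:
    """Average of the absolute element-wise differences."""
    if len(reference) != len(target):
        raise ValueError("reference and target must have the same length")
    if not reference:
        raise ValueError("cannot compute an error over empty data")
    return sum(abs(r - t) for r, t in zip(reference, target)) / len(reference)


def passes(reference: Sequence[float], target: Sequence[float], threshold: float) -> bool:
    """Assumed pass rule: the error must not exceed the threshold."""
    return mean_absolute_error(reference, target) <= threshold


true_values = [1.0, 2.0, 3.0, 4.0]
pred_values = [1.5, 2.0, 2.5, 4.0]

# The absolute differences are 0.5, 0.0, 0.5, 0.0, so the mean is 0.25.
for threshold in (0.5, 0.3, 0.1):
    print(threshold, passes(true_values, pred_values, threshold))
```

With these made-up inputs the same data passes the two looser thresholds and fails the strictest one, which is why the example script defines one strategy per threshold.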
Option 2: Configuration files¶
RTV really shines when used to run reusable config files.
Currently supported (out-of-the-box) file formats are YAML and JSON.
Internal (semantic) structure of config files consists of two main parts:
- definitions: A list of items where users define the entities that will be used for performing the validation scenario.
- actions: A sequence/list of actions to be performed in the validation scenario with the use of defined entities.
definition parameters:
- name: An alias to reference the defined entity later in the current config file.
- class: A class name or an alias for the entity constructor to use.
- …all other parameters are arbitrary.
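To make the split between reserved and arbitrary parameters concrete, the sketch below separates `name` and `class` from the remaining keys of a parsed definition. It is only an illustration of the structure described above; `split_definition` is a hypothetical helper, and nothing here reflects how RTV processes definitions internally.

```python
from typing import Any, Dict, Tuple


def split_definition(definition: Dict[str, Any]) -> Tuple[str, str, Dict[str, Any]]:
    """Separate the reserved keys from the arbitrary constructor parameters."""
    params = dict(definition)  # copy so the caller's mapping is left untouched
    try:
        name = params.pop("name")
        class_ref = params.pop("class")
    except KeyError as missing:
        raise ValueError(f"definition is missing required key {missing}") from None
    return name, class_ref, params


# A reader definition as it would look after parsing YAML or JSON.
csvreader_definition = {"name": "csvreader", "class": "CSVReader", "delimiter": ","}
name, class_ref, params = split_definition(csvreader_definition)
print(name, class_ref, params)
```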
Here is the basic config example from this repo’s example directory:
Check the Running Examples section for instructions on how to run this example config yourself.
definitions:
  # readers:
  - name: csvreader
    class: CSVReader
    delimiter: ","
  # strategies:
  - name: ewa_dist
    class: ElementWiseAbsoluteDistance
    deviation: 0
  - name: map_err
    class: MeanAbsolutePercentageError
    deviation: 0
  - name: ews_dist
    class: ElementWiseSimpleDistance
    range: [-10, 80]
  # transformer
  - name: my_transformer
    class: MyTransformer
    suffix: i
  # validations:
  - name: v1
    class: StrategyValidation
    keys: default
    strategies: ewa_dist
  - name: v2
    class: StrategyValidation
    keys: [Ai,Bi,Ci,Di,Ei]
    strategies: map_err
  - name: v3
    class: StrategyValidation
    keys: all
    strategies: ews_dist
  # writers:
  - name: json_writer
    class: JSONFileWriter
  - name: result_writer
    class: ResultWriter
actions:
  - read:
      - output_name: ref
        source: ${RTV_EXAMPLE_PATH}/input/basic/matrix_a.csv
        reader: csvreader
      - output_name: t1
        source: ${RTV_EXAMPLE_PATH}/input/basic/matrix_b.csv
        reader: csvreader
      - output_name: t2
        source: ${RTV_EXAMPLE_PATH}/input/basic/matrix_c.csv
        reader: csvreader
  - transform:
      - input: ref
        output_name: reference
        transformers: my_transformer
      - input: t1
        output_name: target1
        transformers: my_transformer
  - validate:
      - output_name: result_ab
        validations:
          - v1
          - v2 # partially default
        reference: reference
        target: target1
      - output_name: result_ac
        validations:
          - v1
          - v3 # overwrites default
        reference: ref
        target: t2
  - write:
      - output: ${RTV_EXAMPLE_PATH}/result_matrix_a_matrix_b
        writer: json_writer
        input: result_ab
      - output: ${RTV_EXAMPLE_PATH}/result_matrix_a_matrix_c
        writer: result_writer
        input: result_ac
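The `${RTV_EXAMPLE_PATH}` placeholders in the config refer to an environment variable. As a rough illustration of that kind of substitution, the sketch below expands such placeholders with the standard library's `os.path.expandvars`. Whether RTV resolves placeholders this way is an assumption; `expand_placeholders` is a hypothetical helper, and the `/tmp/rtv_example` default exists only so the snippet runs on its own.

```python
import os


def expand_placeholders(value: str) -> str:
    """Expand ${VAR} references, failing loudly if any remain unresolved."""
    expanded = os.path.expandvars(value)
    if "${" in expanded:
        # os.path.expandvars leaves unknown variables untouched.
        raise KeyError(f"unresolved environment variable in {value!r}")
    return expanded


# Illustrative default only; normally the variable is exported in the shell.
os.environ.setdefault("RTV_EXAMPLE_PATH", "/tmp/rtv_example")
print(expand_placeholders("${RTV_EXAMPLE_PATH}/input/basic/matrix_a.csv"))
```

Failing on an unresolved placeholder is a design choice for this sketch: a silently unexpanded path tends to surface later as a confusing "file not found" error.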
Extending¶
As mentioned earlier, RTV can be extended with custom user entities (classes), either to provide functionality missing for a given validation scenario (e.g. a custom error metric) or to support additional configuration file formats.
Custom Entities¶
Implementing and registering¶
Custom entities should implement the framework’s predefined interfaces.
from pydantic import BaseModel
from rtv.core.base import BaseEntity
from rtv.transformer.interfaces import ITransformer


class MyAwesomeTransformer(BaseEntity, ITransformer, idf="awesome"):
    class Params(BaseModel):
        my_awesome_param: int

    ...
NOTE: idf is optional for almost all entities and is most likely not needed if users intend to use the entity in a Python script; however, there are some exceptions (mentioned below). It is just an alias to make config files more concise.
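The `idf="awesome"` keyword in the class statement relies on a standard Python feature: keyword arguments in a class definition are passed to `__init_subclass__` of a base class. The sketch below shows one way an alias registry could be built on that feature. It is a generic illustration, not RTV's code; `REGISTRY` and `Registered` are hypothetical names.

```python
from typing import Dict, Optional

# Hypothetical registry mapping class names and aliases to classes.
REGISTRY: Dict[str, type] = {}


class Registered:
    """Base class that records every subclass under its name and optional alias."""

    def __init_subclass__(cls, idf: Optional[str] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls
        if idf is not None:
            REGISTRY[idf] = cls


class MyAwesomeTransformer(Registered, idf="awesome"):
    pass


class PlainTransformer(Registered):  # no alias: reachable by class name only
    pass


print(sorted(REGISTRY))
```

Under this scheme a config could refer to the class either by its full name or by the shorter alias, which matches the role the note above gives to idf.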
Core Entities Available:¶
Entity Name | Inherit from |
---|---|
Reader | BaseReader, IReader |
Transformer | BaseEntity, ITransformer |
Validation | BaseEntity, IValidation |
Validation Strategy | BaseValidationStrategy, IValidationStrategy |
Action * | BaseAction, IAction |
Writer | BaseWriter, IWriter |
Config Loader ** | IConfigLoader, idf="<extension_suffix>" |
* - See implementing custom actions
** - See implementing custom config loaders
Implementing custom actions¶
When implementing custom actions, we recommend adding a short and descriptive identifier:
# ...
class MyCustomAction(BaseAction, IAction, idf="greet"):
    class Params(BaseModel):
        message: str

    def execute(self):
        print(self.message)
# ...
That would make it more convenient to use in the configuration files.
actions:
  - greet:
      message: "Hello World!"
Moreover, implementing custom actions only makes sense for use in config files.
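To illustrate why a short identifier is convenient, the sketch below walks a parsed `actions` list and dispatches each entry by its key. It is a simplified stand-in, not RTV's action machinery: plain functions replace action classes, and `ACTIONS`, `action`, and `run_actions` are hypothetical names.

```python
from typing import Any, Callable, Dict, List

# Hypothetical table mapping an action identifier to its handler.
ACTIONS: Dict[str, Callable[..., Any]] = {}


def action(idf: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a handler under the given identifier."""
    def register(handler: Callable[..., Any]) -> Callable[..., Any]:
        ACTIONS[idf] = handler
        return handler
    return register


@action("greet")
def greet(message: str) -> str:
    print(message)
    return message


def run_actions(actions: List[Dict[str, Dict[str, Any]]]) -> List[Any]:
    """Execute each single-key entry by looking its key up in ACTIONS."""
    results = []
    for entry in actions:
        if len(entry) != 1:
            raise ValueError("each action entry must have exactly one key")
        (idf, params), = entry.items()
        if idf not in ACTIONS:
            raise KeyError(f"unknown action {idf!r}")
        results.append(ACTIONS[idf](**params))
    return results


# Parsed form of the config fragment: actions -> greet -> message.
run_actions([{"greet": {"message": "Hello World!"}}])
```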
Implementing custom config loaders¶
IConfigLoader is a special case: this class does not inherit from BaseEntity and requires idf to be the same as the file extension suffix:
class TxtConfigLoader(IConfigLoader, idf="txt"):
    ...
Otherwise the run is expected to fail.
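The requirement that idf match the file extension suggests that loaders are looked up by suffix. The sketch below shows that idea in isolation, using the standard library's JSON parser as the only registered loader. `LOADERS` and `load_config` are hypothetical names, and the lookup logic is an assumption rather than RTV's actual behaviour.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict

# Hypothetical loader table keyed by extension suffix (without the dot).
LOADERS: Dict[str, Callable[[str], Any]] = {"json": json.loads}


def load_config(path: str) -> Any:
    """Pick a loader from the file's suffix and parse the file with it."""
    suffix = Path(path).suffix.lstrip(".").lower()
    loader = LOADERS.get(suffix)
    if loader is None:
        # No loader matches this suffix, so loading cannot proceed.
        raise ValueError(f"no config loader registered for suffix {suffix!r}")
    return loader(Path(path).read_text(encoding="utf-8"))
```

In this sketch a loader registered under a key that differs from the real file suffix would never be found, which is one plausible reason for the matching rule.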
Registering for use in config files¶
The framework automatically adds the custom class to the registry, making it available for use in config files. However, the framework needs to know where to look for the custom code, so users need to set the RTV_USER_CODE_PATH environment variable:
export RTV_USER_CODE_PATH=<custom_code_directory_path>
Substitute <custom_code_directory_path> with the actual path on your file system where you will store the custom code for RTV. You can structure and name those files as you want.
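As an illustration of what scanning a user code directory can involve, the sketch below imports every Python file under a directory with `importlib`. Importing a module executes its class statements, which is when registration by alias would typically happen. This is a generic sketch, not RTV's discovery mechanism; `import_user_code` and the `user_code_` module-name prefix are hypothetical.

```python
import importlib.util
import os
from pathlib import Path
from types import ModuleType
from typing import List


def import_user_code(directory: str) -> List[ModuleType]:
    """Import every .py file below the directory and return the modules."""
    modules = []
    for path in sorted(Path(directory).rglob("*.py")):
        # Hypothetical naming scheme to avoid clashing with installed modules.
        module_name = f"user_code_{path.stem}"
        spec = importlib.util.spec_from_file_location(module_name, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file's top-level code
        modules.append(module)
    return modules


code_dir = os.environ.get("RTV_USER_CODE_PATH")
if code_dir and Path(code_dir).is_dir():
    print([module.__name__ for module in import_user_code(code_dir)])
```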
Defining custom entities in config¶
YAML Config Example:
Using custom class name:
definitions:
  - name: my_awesome_transformer
    class: MyAwesomeTransformer
    my_awesome_param: 42
  # ...
Using idf (identifier/alias):
definitions:
  - name: my_awesome_transformer
    class: awesome
    my_awesome_param: 42
  # ...
NOTE: Custom actions should not be defined, just used by alias instead.
Using custom entities in actions¶
YAML Config Example:
Custom transformer:
actions:
  - transform:
      input: data
      output_name: transformed_data
      transformers: my_awesome_transformer
  # ...
Custom action:
actions:
  - my_awesome_action:
      awesome_parameter: 42
  # ...
Running examples¶
This repository holds an example directory with some configuration files and scripts that you can use to test and explore RTV’s functionality and features. The directory contains a nested README file describing specific example scripts and/or configuration files in detail.
Setup¶
In order to execute these examples you need to download the example directory from this repository.
You will also need to set up an environment variable RTV_EXAMPLE_PATH with the absolute path to the example directory, like this:
UNIX:
export RTV_EXAMPLE_PATH=<example_path>
Substitute <example_path> with an actual path to the example directory in your filesystem.
Running example scripts¶
Navigate to the example directory.
Execute the example script of your choice, e.g.:
python scripts/basic.py
Running example config files¶
Navigate to the example directory.
Run the rtv executable, providing the path to the example config as a command line argument, e.g.:
rtv configs/basic.yaml
Further details on specific examples can be found in the example directory’s README.
Troubleshooting¶
RTV’s error output is written into the working directory as files with names like rtv-error-<timestamp>.log. You will also see some info logs and warnings in your stdout during the execution.
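When several runs have failed, it can help to locate the newest error log quickly. The sketch below searches a directory for files matching the `rtv-error-<timestamp>.log` naming pattern and picks the most recently modified one. The helper name `latest_error_log` is hypothetical, and choosing by modification time (rather than parsing the timestamp, whose format is not specified here) is an assumption.

```python
from pathlib import Path
from typing import Optional


def latest_error_log(directory: str = ".") -> Optional[Path]:
    """Return the most recently modified rtv-error-*.log file, if any."""
    candidates = list(Path(directory).glob("rtv-error-*.log"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


log_file = latest_error_log()
print(log_file if log_file else "no RTV error logs in the working directory")
```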