Developer Interface

Main Interface

vpmbench.api.calculate_metric_or_summary(annotated_variant_data, evaluation_data, report, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates a metric or a summary for all plugins in the annotated variant data.

Parameters

annotated_variant_data (AnnotatedVariantData) – The annotated variant data
evaluation_data (EvaluationData) – The evaluation data
report (Union[Type[PerformanceMetric], Type[PerformanceSummary]]) – The performance summary or metric that should be calculated

Returns

A dictionary where the keys are the plugins and the values are the results of the calculation

Return type

Dict[Plugin, Any]

vpmbench.api.calculate_metrics_and_summaries(annotated_variant_data, evaluation_data, reporting, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the metrics and summaries for the plugins used to annotate the variants.

Uses calculate_metric_or_summary() to calculate all summaries and metrics from reporting.

Parameters

annotated_variant_data – The annotated variant data
evaluation_data – The evaluation data
reporting – The metrics and summaries that should be calculated

Returns

Keys: the name of the metric/summary; Values: The results from calculate_metric_or_summary()

Return type

Dict
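How the two functions relate can be sketched in plain Python. This is not vpmbench's implementation: `calculate_one` is a hypothetical placeholder for calculate_metric_or_summary(), the report classes are dummies, and using the class name as the outer key is an assumption.

```python
from typing import Any, Callable, Dict, List


def calculate_all(reporting: List[type],
                  calculate_one: Callable[[type], Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Run calculate_one for every report class and collect the results by name."""
    # Assumption: the class name serves as the key of the outer dictionary.
    return {report.__name__: calculate_one(report) for report in reporting}


# Hypothetical report classes, used only for this illustration.
class Sensitivity: ...
class Specificity: ...


def fake_calculate(report: type) -> Dict[str, Any]:
    # Placeholder result: one value per (hypothetical) plugin name.
    return {"plugin_a": 0.5, "plugin_b": 0.75}


results = calculate_all([Sensitivity, Specificity], fake_calculate)
```

The result is a two-level dictionary: the outer level is keyed by metric or summary, the inner level by plugin.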

vpmbench.api.extract_evaluation_data(evaluation_data_path, extractor=<class 'vpmbench.extractor.ClinVarVCFExtractor'>, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Extract the EvaluationData from the evaluation input data.

Parses the evaluation input data given by the evaluation_data_path using the extractor.

Parameters

evaluation_data_path (Union[str, Path]) – The path to the evaluation input data
extractor (Type[Extractor]) – The extractor that should be used to parse the evaluation input data

Returns

The evaluation data extracted from the evaluation input data using the extractor

Return type

EvaluationData

vpmbench.api.invoke_method(plugin, variant_data)

Invoke a prioritization method represented as a plugin on the variant_data.

Uses vpmbench.plugin.Plugin.run() to invoke the prioritization method.

Parameters

plugin (Plugin) – The plugin for the method that should be invoked
variant_data (pandas.DataFrame) – The variant data which should be processed by the method

Returns

The plugin and the resulting data from the method

Return type

Tuple[Plugin,pandas.DataFrame]

vpmbench.api.invoke_methods(plugins, variant_data, cpu_count=-1)

Invoke multiple prioritization methods given as a list of plugins on the variant_data in parallel.

Calls vpmbench.api.invoke_method() for each plugin in plugins on the variant_data. The compatibility of the plugins with the variant_data is checked via Plugin.is_compatible_with_data. If cpu_count is -1, then (number of CPUs - 1) CPUs are used to run the plugins in parallel; set it to 1 to disable parallel execution. The resulting annotated variant data is constructed by collecting the outputs of the plugins and using them as input for AnnotatedVariantData.from_results.

Parameters

variant_data (pandas.DataFrame) – The variant data which should be processed by the plugins
plugins (List[Plugin]) – A list of plugins that should be invoked
cpu_count (int) – The number of CPUs that should be used to invoke the plugins in parallel

Returns

The variant data annotated with the scores from the prioritization methods

Return type

AnnotatedVariantData
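The cpu_count rule described above can be sketched as a small helper. The function name `resolve_worker_count` is hypothetical, and the lower bound of one worker on single-CPU machines is an assumption, not something the documentation states.

```python
import os


def resolve_worker_count(cpu_count: int, available: int) -> int:
    """Translate the cpu_count argument into a number of worker processes."""
    if cpu_count == -1:
        # Documented rule: use (number of CPUs - 1).
        # Assumption: never go below one worker on single-CPU machines.
        return max(1, available - 1)
    return cpu_count


available_cpus = os.cpu_count() or 1
workers = resolve_worker_count(-1, available_cpus)
```

Passing cpu_count=1 yields a single worker, which corresponds to the documented way of disabling parallel execution.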

vpmbench.api.load_plugin(manifest_path)

Load a manifest given by the manifest_path as a plugin.

Parameters: manifest_path (Union[str, Path]) – The path to the manifest
Returns: The loaded plugin
Return type: Plugin

vpmbench.api.load_plugins(plugin_path, plugin_selection=None)

Load all plugins from the plugin_path and apply the plugin selection to filter them.

If plugin_selection is None all plugins in the plugin_path are returned.

Parameters

plugin_path (Union[str, PathLike]) – The path to your plugin directory
plugin_selection (Optional[Callable[[Plugin], bool]]) – The selection function that should be applied to filter the plugins

Returns

The list of plugins loaded from the plugin_path

Return type

List[Plugin]
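A selection function is simply a predicate over plugins. The sketch below illustrates the filtering behaviour with a hypothetical `FakePlugin` stand-in and made-up plugin and database names; it does not touch the file system and is not the library's loader.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FakePlugin:
    """Hypothetical stand-in for vpmbench.plugin.Plugin (name and databases only)."""
    name: str
    databases: Dict[str, str] = field(default_factory=dict)


def select_plugins(plugins: List[FakePlugin],
                   plugin_selection: Optional[Callable[[FakePlugin], bool]] = None) -> List[FakePlugin]:
    # With no selection function, every plugin is returned.
    if plugin_selection is None:
        return list(plugins)
    return [plugin for plugin in plugins if plugin_selection(plugin)]


plugins = [FakePlugin("method_a", {"db_x": "2020"}), FakePlugin("method_b")]
uses_db_x = select_plugins(plugins, lambda plugin: "db_x" in plugin.databases)
```

Here `uses_db_x` contains only `method_a`, while calling `select_plugins(plugins)` without a predicate returns both plugins.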

Data

class vpmbench.data.AnnotatedVariantData(annotated_variant_data, plugins)

Represent the variant data annotated with the scores from the prioritization methods.

Contains the same information as the vpmbench.data.EvaluationData.variant_data() and the scores from the methods.

Parameters

annotated_variant_data (pandas.core.frame.DataFrame) – The variant data with the annotated scores
plugins (List[Plugin]) – The plugins used to calculate the scores

static from_results(original_variant_data, plugin_results)

Create annotated variant data from the original variant data and plugin results.

The annotated variant data is created by merging the plugin scores on the UID column.

Parameters

original_variant_data – The original variant data used to calculate the scores
plugin_results – The results from invoking the prioritization methods

Returns

The variant data annotated with the scores

Return type

AnnotatedVariantData

property scores

Return the list of scores from the annotated variant data

Returns: The list of scores.
Return type: List[Score]

class vpmbench.data.EvaluationData(table, interpretation_map=None)

Represent the evaluation data.

The evaluation data contains all the information about the variants required to use the data to evaluate the performance of the prioritization methods.

The data contains the following information for each variant:

UID: A numerical identifier allowing to reference the variant

CHROM: The chromosome in which the variant is found

POS: The 1-based position of the variant within the chromosome

REF: The reference bases.

ALT: The alternative bases.

RG: The reference genome used to call the variant

TYPE: The variation type of the variant

CLASS: The expected classification of the variant

Parameters: table (pandas.DataFrame) – The dataframe containing the required information about the variants.

static from_records(records)

Create an evaluation data table from a list of records.

This method also automatically assigns each record a UID.

Parameters: records (List[EvaluationDataEntry]) – The records that should be included in the table.
Returns: The resulting evaluation data
Return type: EvaluationData

property interpreted_classes

Interpret the CLASS data.

The CLASS data is interpreted by applying vpmbench.enums.PathogencityClass.interpret().

Returns: A series of interpreted classes
Return type: pandas.Series

validate()

Check if the evaluation data is valid.

The following constraints are checked:

CHROM has to be in {"1",...,"22","X","Y"}

POS has to be > 1

REF has to match with re.compile("^[ACGT]+$")

ALT has to match with re.compile("^[ACGT]+$")

RG has to be of type vpmbench.enums.ReferenceGenome

CLASS has to be of type vpmbench.enums.PathogencityClass

TYPE has to be of type vpmbench.enums.VariationType

UID has to be > 0

Raises: SchemaErrors – If the validation of the data fails
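The value constraints listed above can be illustrated with a plain-Python check on a single record. This is a sketch only: the helper `find_violations` is hypothetical, the enum type checks for RG, CLASS and TYPE are left out, and the real method reports failures through SchemaErrors rather than a list of column names.

```python
import re
from typing import Any, Dict, List

VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y"}
BASES = re.compile("^[ACGT]+$")


def find_violations(row: Dict[str, Any]) -> List[str]:
    """Return the names of the columns that break a constraint listed above."""
    checks = {
        "CHROM": row["CHROM"] in VALID_CHROMS,
        "POS": row["POS"] > 1,   # bound copied as stated in the list above
        "REF": bool(BASES.match(row["REF"])),
        "ALT": bool(BASES.match(row["ALT"])),
        "UID": row["UID"] > 0,
    }
    return [column for column, ok in checks.items() if not ok]


good = {"UID": 1, "CHROM": "X", "POS": 1234, "REF": "A", "ALT": "GT"}
bad = {"UID": 2, "CHROM": "23", "POS": 1234, "REF": "N", "ALT": "T"}
```

For the two example records, `good` passes every check, while `bad` fails on CHROM (not a valid chromosome name) and REF (contains a base other than A, C, G or T).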

property variant_data

Get the pure variant data from the evaluation data.

The variant data consists of the data in columns: UID,CHROM,POS,REF,ALT,RG,TYPE

Returns: The variant data from the evaluation data.
Return type: DataFrame

class vpmbench.data.EvaluationDataEntry(CHROM, POS, REF, ALT, CLASS, TYPE, RG)

Represent an entry in the vpmbench.data.EvaluationData table.

Parameters

CHROM (str) – The chromosome in which the variant is found
POS (int) – The 1-based position of the variant within the chromosome
REF (str) – The reference bases
ALT (str) – The alternative bases
CLASS (str) – The expected classification of the variant
TYPE (vpmbench.enums.VariationType) – The variation type of the variant
RG (vpmbench.enums.ReferenceGenome) – The reference genome used to call the variant

Extractors

class vpmbench.extractor.CSVExtractor(row_to_entry_func=None, **kwargs)

An implementation of a generic extractor for CSV files.

The implementation uses the Python csv.DictReader to parse a CSV file. To extract the EvaluationData, _row_to_evaluation_data_entry() is called for each row. If a row-to-entry function is passed as an argument, this function is used instead of the internal method.

Parameters

row_to_entry_func – A function that is called for every row in the CSV file to extract an EvaluationDataEntry
kwargs – Arguments that are passed to the CSV parser

_row_to_evaluation_data_entry(data_row)

Parses a row of a CSV file to an evaluation data entry.

Parameters: data_row (dict) – A dictionary representing a row of the CSV file
Returns: The evaluation data entry for the row
Return type: EvaluationDataEntry
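A row-to-entry function receives one dictionary per CSV row, as produced by csv.DictReader. The sketch below shows the idea with a hypothetical `Entry` tuple standing in for EvaluationDataEntry; the CSV column names and the semicolon delimiter are assumptions about an example input file.

```python
import csv
import io
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for vpmbench.data.EvaluationDataEntry (subset of fields)."""
    CHROM: str
    POS: int
    REF: str
    ALT: str
    CLASS: str


def row_to_entry(row: Dict[str, str]) -> Entry:
    # The column names read here are assumptions about the CSV layout.
    return Entry(row["chromosome"], int(row["position"]),
                 row["ref"], row["alt"], row["label"])


csv_text = "chromosome;position;ref;alt;label\n1;12345;A;G;benign\nX;678;C;T;pathogenic\n"
# Parser options such as delimiter play the role of the extra keyword arguments.
reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
entries = [row_to_entry(row) for row in reader]
```

With the extractor itself, a function like `row_to_entry` would presumably be supplied as row_to_entry_func and the delimiter as a keyword argument.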

class vpmbench.extractor.ClinVarVCFExtractor(record_to_pathogencity_class_func=None): An extractor for ClinVar VCF files based on VCFExtractor.

class vpmbench.extractor.Extractor

Extractors are used to extract the EvaluationData from evaluation input files.

abstract _extract(file_path)

Internal function to extract the evaluation data from the evaluation input file at file_path.

This function has to be implemented for every extractor.

Parameters: file_path – The file path to evaluation input data
Returns: The evaluation data
Return type: EvaluationData

extract(file_path)

Extract the EvaluationData from the file at file_path.

This function calls _extract() and uses vpmbench.data.EvaluationData.validate() to check if the evaluation data is valid.

Parameters

file_path – The file path to evaluation input data

Returns

The validated evaluation data

Return type

EvaluationData

Raises

RuntimeError – If the file cannot be parsed
SchemaErrors – If the validation of the extracted data fails

class vpmbench.extractor.VCFExtractor(record_to_pathogencity_class_func=None)

An implementation of a generic extractor for VCF files.

The implementation uses the pyvcf Reader to parse a VCF file. The implementation already extracts POS, CHROM, REF, and ALT for each variant. To extract the CLASS, the internal _extract_pathogencity_class_from_record() is called for each VCF entry. If a record-to-pathogenicity-class function is passed as an argument, this function is used instead of the internal method.

Parameters: record_to_pathogencity_class_func – A function that returns the pathogenicity class for each entry in the VCF file.

_extract_pathogencity_class_from_record(index, vcf_record)

Extracts the pathogenicity class of a VCF record.

Parameters: vcf_record (vcf.model._Record) – A record of the VCF file
Returns: The pathogenicity class of the variant
Return type: str

class vpmbench.extractor.VariSNPExtractor: An implementation of an extractor for VariSNP files based on CSVExtractor.

Plugins

class vpmbench.plugin.DockerEntryPoint(image, run_command, input, output, bindings=None)

Represent an entry point using Docker to run the custom processing logic

Parameters

image (str) – The name of the Docker image used to create a Docker container
run_command (str) – The command that invokes the custom processing logic in the Docker container
input – Information about the file path and format of the input file.
output (dict) – Information about the file path and format of the output file.
bindings (dict) – Additional bindings that should be mounted into the Docker container. Keys: local file paths, Values: remote file paths

run(variant_information_table)

Run the custom processing for the entry point.

The variant_information_table is converted into the expected input file format using format_input(). The results from the Docker container are converted using format_output().

Parameters: variant_information_table – The variant information table
Returns: The results from the processing logic
Return type: DataFrame

class vpmbench.plugin.EntryPoint

Represent an entry point to the custom processing logic required to invoke a prioritization method.

abstract run(variant_information_table)

Run the custom processing logic

Has to return a DataFrame with two columns:

UID: The UID of the variants

SCORE: The calculated score for the variants

Parameters: variant_information_table – The variant information table
Returns: The results from the processing logic
Return type: DataFrame

class vpmbench.plugin.Plugin(name, version, supported_variations, supported_chromosomes, reference_genome, databases, entry_point, cutoff, manifest_path)

Represent a plugin

Basically, the plugin stores the information from the manifest files.

Parameters

name (str) – The name of the plugin
version (str) – The version of the plugin
supported_variations (List[vpmbench.enums.VariationType]) – The variation types supported by the prioritization method
reference_genome (vpmbench.enums.ReferenceGenome) – The reference genome supported by the prioritization method
databases (dict) – The accompanying databases of the prioritization method; Key: name of the database, Value: version of the database
entry_point (vpmbench.plugin.EntryPoint) – The entry point
cutoff (float) – The cutoff for pathogenicity
manifest_path (Union[str, pathlib.Path]) – The file path to the manifest file for the plugin

static _validate_score_table(variant_information_table, score_table)

Validate the results of the prioritization method.

The following constraints are checked:

Each UID from the variant_information_table is also in the score_table

Each SCORE in the score_table is a numerical value

Parameters

variant_information_table – The variant information table
score_table – The scoring results from the prioritization method

Raises

SchemaErrors – If the validation of the data fails

is_compatible_with_data(variant_information_table)

Check if the plugin is compatible with the variant information table.

The following constraints are checked:

The variant information table contains only variants with the same reference genome as the plugin

The variant information table contains only variants with a variation type supported by the plugin

Parameters: variant_information_table – The variant information table
Raises: RuntimeError – If the validation fails
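The two compatibility constraints can be sketched over plain dictionaries. The helper `check_compatibility`, the record layout and the lowercase string values for RG and TYPE are all assumptions made for illustration; the real method works on the variant information table and the plugin's own attributes.

```python
from typing import Dict, List, Set


def check_compatibility(variants: List[Dict[str, str]],
                        reference_genome: str,
                        supported_variations: Set[str]) -> None:
    """Raise RuntimeError if any variant violates one of the two constraints."""
    for variant in variants:
        if variant["RG"] != reference_genome:
            raise RuntimeError(f"UID {variant['UID']}: unsupported reference genome {variant['RG']}")
        if variant["TYPE"] not in supported_variations:
            raise RuntimeError(f"UID {variant['UID']}: unsupported variation type {variant['TYPE']}")


variants = [
    {"UID": "0", "RG": "hg19", "TYPE": "snp"},
    {"UID": "1", "RG": "hg19", "TYPE": "indel"},
]

try:
    # A plugin that only supports SNPs is not compatible with the INDEL variant.
    check_compatibility(variants, "hg19", {"snp"})
    compatible = True
except RuntimeError:
    compatible = False
```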

run(variant_information_table)

Run the plugin on the variant_information_table

Before running the plugin the compatibility of the data with the plugin is tested. Next the run() method of the entry_point is called with the variant_information_table. The result of the entry_point is validated to ensure that each variant from the variant_information_table got a valid score assigned. Finally, the score column is renamed using the score_column_name().

The resulting Dataframe consists of two columns:

UID: The UID of the variants

score_column_name(): The scores from the prioritization method

Parameters: variant_information_table – The variant information table
Returns: The plugin result.
Return type: DataFrame

property score_column_name

Return the column name for the AnnotatedVariantData.

The name is calculated by "{self.name}_SCORE".

Returns: The name of the column
Return type: str

class vpmbench.plugin.PluginBuilder

This class builds the Plugins

classmethod build_plugin(**kwargs)

Build a plugin from the arguments.

See the documentation for the specification of the manifest schema.

Parameters: kwargs – The arguments
Returns: The Plugin
Return type: Plugin
Raises: RuntimeError – If required information is missing.

class vpmbench.plugin.PythonEntryPoint(file_path)

Represent an entry point using Python to run the custom processing logic

The entry point has to be implemented in the Python file via a function entry_point accepting the variant_data() as input.

Parameters

file_path (pathlib.Path) – Path to the Python file containing the implementation of the custom processing logic
plugin – Reference to the plugin of the entry point

run(variant_information_table)

Run the custom processing logic

Has to return a DataFrame with two columns:

UID: The UID of the variants

SCORE: The calculated score for the variants

Parameters: variant_information_table – The variant information table
Returns: The results from the processing logic
Return type: DataFrame

class vpmbench.plugin.Score(plugin, data)

Represent a score from a prioritization method.

Parameters

plugin (vpmbench.plugin.Plugin) – The plugin of the method that calculated the score
data (pandas.core.series.Series) – The calculated scores

property cutoff

Get the cutoff from the plugin of the score.

Returns: The cutoff
Return type: float

interpret(cutoff=None)

Interpret the score using the cutoff.

If the cutoff is None, vpmbench.plugin.Score.cutoff is used to interpret the score. The score is interpreted by replacing all values greater than the cutoff with 1 and all other values with 0.

Parameters: cutoff – The cutoff
Returns: The interpreted scores
Return type: pandas.Series
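The thresholding rule can be shown on a plain list of numbers. This is a minimal sketch with a hypothetical free function; the actual method operates on the pandas Series stored in the score.

```python
from typing import List


def interpret(scores: List[float], cutoff: float) -> List[int]:
    """Map each score to 1 if it is strictly greater than the cutoff, else 0."""
    return [1 if score > cutoff else 0 for score in scores]


interpreted = interpret([0.1, 0.5, 0.9], cutoff=0.5)  # -> [0, 0, 1]
```

Note that a score exactly equal to the cutoff is interpreted as 0, because only values greater than the cutoff map to 1.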

Performance Summaries

class vpmbench.summaries.ConfusionMatrix

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the confusion matrix.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

A dictionary with the following keys: tn - the number of true negatives, fp - the number of false positives, fn - the number of false negatives, tp - the number of the true positives

Return type

dict
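What the four keys count can be sketched in plain Python for binary labels. The function below is an illustration under the default class map (0 for benign, 1 for pathogenic) and takes already interpreted scores; it is not the library's implementation.

```python
from typing import Dict, List


def confusion_matrix(predicted: List[int], expected: List[int]) -> Dict[str, int]:
    """Count tn, fp, fn and tp for binary labels (0 = benign, 1 = pathogenic)."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for pred, truth in zip(predicted, expected):
        if pred == 1 and truth == 1:
            counts["tp"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        elif pred == 0 and truth == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


matrix = confusion_matrix(predicted=[1, 0, 1, 0, 1], expected=[1, 0, 0, 1, 1])
```

For these five variants the result is two true positives and one each of true negatives, false positives and false negatives.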

class vpmbench.summaries.PerformanceSummary: Represent a performance summary

class vpmbench.summaries.PrecisionRecallCurve

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the precision recall curve.

Parameters

score – The score from the prioritization method
interpreted_classes – The interpreted classes

Returns

A dictionary with the following keys: precision - precision values, recall - recall values, thresholds - the thresholds

Return type

dict

class vpmbench.summaries.ROCCurve

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the ROC curves.

Parameters

score – The score from the prioritization method
interpreted_classes – The interpreted classes

Returns

A dictionary with the following keys: fpr - false positive rates, tpr - true positive rates, thresholds - the thresholds

Return type

dict

Performance Metrics

class vpmbench.metrics.Accuracy

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the accuracy.

Uses a ConfusionMatrix to calculate the accuracy.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The calculated accuracy

Return type

float

class vpmbench.metrics.AreaUnderTheCurveROC

Calculate the area under the ROC curve (AUROC).

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The calculated AUC

Return type

float
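One way to build intuition for the AUROC is its pairwise interpretation: it equals the fraction of (pathogenic, benign) pairs in which the pathogenic variant receives the higher score. The sketch below computes it that way; it is an equivalent definition for illustration, not necessarily how the library computes the value.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the fraction of (pathogenic, benign) pairs ranked correctly."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one variant of each class")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


area = auroc(scores=[0.9, 0.8, 0.4, 0.3], labels=[1, 0, 1, 0])
```

In the example, three of the four pathogenic/benign pairs are ranked correctly, so the area is 0.75.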

class vpmbench.metrics.Concordance

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the concordance, i.e., the sum of true positives and true negatives.

Uses a ConfusionMatrix to calculate the concordance.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The calculated concordance

Return type

float

class vpmbench.metrics.MatthewsCorrelationCoefficient

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the Matthews correlation coefficient.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The Matthews correlation coefficient

Return type

float
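For reference, the Matthews correlation coefficient can be written directly in terms of confusion-matrix counts. The sketch below uses the standard formula; returning 0.0 for a zero denominator is a common convention assumed here, and the function name is hypothetical.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when a marginal sum is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


mcc = matthews_corrcoef(tp=2, tn=1, fp=1, fn=1)
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).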

class vpmbench.metrics.NegativePredictiveValue

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the negative predictive value.

Uses a ConfusionMatrix to calculate the negative predictive value.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The calculated negative predictive value

Return type

float

class vpmbench.metrics.PerformanceMetric: Represent a performance metric.

class vpmbench.metrics.Precision

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the precision.

Uses a ConfusionMatrix to calculate the precision/positive predictive value.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The calculated precision

Return type

float

class vpmbench.metrics.Sensitivity

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the sensitivity.

Uses a ConfusionMatrix to calculate the sensitivity/true positive rate.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The calculated sensitivity

Return type

float

class vpmbench.metrics.Specificity

static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the specificity.

Uses a ConfusionMatrix to calculate the specificity/true negative rate.

Parameters

score – The score from the plugin
interpreted_classes – The interpreted classes

Returns

The calculated specificity

Return type

float
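The confusion-matrix-based metrics in this section follow the standard definitions, which can be summarized in one sketch. The function name and the dictionary keys are hypothetical, and the sketch deliberately omits guards against division by zero.

```python
from typing import Dict


def rates(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Derive the ratio metrics from confusion-matrix counts (no zero-division guard)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),      # positive predictive value
        "npv": tn / (tn + fn),            # negative predictive value
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }


result = rates(tp=40, tn=45, fp=5, fn=10)
```

With these counts the accuracy is 0.85, the sensitivity 0.8 and the specificity 0.9.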

Utilities

vpmbench.utils.plot_confusion_matrices(report, normalize=False, cmap='Blues')

Plot the confusion matrices of the prioritization method from a performance report

Shows the confusion matrices if the vpmbench.summaries.ConfusionMatrix was calculated.

Parameters

report – The performance report
normalize – If true the values in the confusion matrix are normalized
cmap – The colormap that should be used to plot the confusion matrices

vpmbench.utils.plot_precision_recall_curves(report)

Plot the precision recall curves using a performance report

Shows the precision recall curve if the vpmbench.summaries.PrecisionRecallCurve was calculated.

Parameters: report – The performance report

vpmbench.utils.plot_roc_curves(report)

Plot the ROC curves using a performance report

Shows the ROC curve if the vpmbench.summaries.ROCCurve was calculated.

Parameters: report – The performance report

vpmbench.utils.report_metrics(report)

Print the calculated metrics to the terminal

Parameters: report – The performance report

Processor

vpmbench.processors.format_input(variant_information_table, target_format, target_file, **kwargs)

Formats the variant information table into the target format and writes the results to the target file.

Parameters

variant_information_table – The input table
target_format – The format in which data should be written
target_file – The file in which the data should be written
kwargs – Additional arguments that are passed to the converter function

vpmbench.processors.format_output(variant_information_table, output_format, output_file, **kwargs)

Formats the content of the output file into a dataframe.

Parameters

variant_information_table – The variant information table used to calculate the results in the output file
output_format – The format of the output file
output_file – The file from which the output should be read
kwargs – Additional arguments

Returns

The formatted content of the output file

Return type

DataFrame

Enums

class vpmbench.enums.ReferenceGenome(value)

Represent reference genomes.

The following values are supported:

HG38

HG19

HG18

HG17

HG16

static resolve(name)

Resolve string into a reference genome.

The following rules apply:

if “grch38” in name.lower() -> ReferenceGenome.HG38

if “grch37” in name.lower() -> ReferenceGenome.HG19

otherwise: ReferenceGenome(name) is called

Parameters: name – The string.
Returns: The reference genome
Return type: ReferenceGenome
Raises: RuntimeError – If the name cannot be resolved
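The resolution rules can be sketched with a small enum. The member values below (lowercase strings) are an assumption, since the documentation only lists the member names; the control flow follows the three rules stated above.

```python
from enum import Enum


class ReferenceGenome(Enum):
    # Assumption: the member values are lowercase strings; the real values may differ.
    HG38 = "hg38"
    HG19 = "hg19"
    HG18 = "hg18"
    HG17 = "hg17"
    HG16 = "hg16"

    @staticmethod
    def resolve(name: str) -> "ReferenceGenome":
        lowered = name.lower()
        if "grch38" in lowered:
            return ReferenceGenome.HG38
        if "grch37" in lowered:
            return ReferenceGenome.HG19
        try:
            # Fall back to the regular enum lookup by value.
            return ReferenceGenome(name)
        except ValueError as error:
            raise RuntimeError(f"Cannot resolve reference genome: {name}") from error


genome = ReferenceGenome.resolve("GRCh37.p13")
```

Because the first two rules use a substring match, patch-level names such as "GRCh37.p13" resolve as well.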

class vpmbench.enums.VariationType(value)

Represent the variation types of the variants.

The following values are supported:

SNP for single-nucleotide polymorphism

INDEL for insertions or deletions

static resolve(name)

Return the variation type based on the given string

The following rules apply:

if name.lower() == ‘snp’ -> VariationType.SNP

if name.lower() == ‘indel’ -> VariationType.INDEL

Parameters: name – The string
Returns: The variation type
Return type: VariationType
Raises: RuntimeError – If the name cannot be resolved

Predicates

vpmbench.predicates.is_multiclass_plugin(plugin)

Checks whether a plugin is a multi-class plugin.

Parameters: plugin (Plugin) – The plugin to be checked
Returns: The checking result
Return type: bool

vpmbench.predicates.was_trained_with(plugin, database_name)

Checks whether a plugin was trained with a specific database.

Parameters

plugin (Plugin) – The plugin to be checked
database_name (str) – The database name

Returns

The checking result

Return type

bool

Configuration

The config module defines the following variables:

vpmbench.config.DEFAULT_PLUGIN_PATH

The default plugin path where vpmbench searches for plugins.

Type: pathlib.Path
Value: The directory VPMBench-Plugins in the home directory of the current user, e.g., /home/user/VPMBench-Plugins.
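An equivalent path can be built with pathlib, for example to check whether the directory exists before loading plugins. The variable name below is local to this sketch; it mirrors the documented value rather than importing it from vpmbench.config.

```python
from pathlib import Path

# Mirrors the documented default; the variable name is local to this sketch.
default_plugin_path = Path.home() / "VPMBench-Plugins"
plugin_dir_exists = default_plugin_path.is_dir()
```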