Developer Interface

Main Interface

vpmbench.api.calculate_metric_or_summary(annotated_variant_data, evaluation_data, report, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates a metric or a summary for all plugins in the annotated variant data.

Parameters
Returns

A dictionary where the keys are the plugins and the values are the results of the calculations

Return type

Dict[Plugin, Any]

vpmbench.api.calculate_metrics_and_summaries(annotated_variant_data, evaluation_data, reporting, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the metrics and summaries for the plugins used to annotate the variants.

Uses calculate_metric_or_summary() to calculate all summaries and metrics from reporting.

Parameters
  • annotated_variant_data – The annotated variant data

  • evaluation_data – The evaluation data

  • reporting – The metrics and summaries that should be calculated

Returns

Keys: the name of the metric/summary; Values: The results from calculate_metric_or_summary()

Return type

Dict
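The nesting of the two result dictionaries can be illustrated without vpmbench itself. The following is a minimal sketch under stated assumptions: `calculate_one`, `calculate_all`, and `agreement` are hypothetical names, plugins are represented by plain strings, and each report is a simple callable. The real functions operate on AnnotatedVariantData and EvaluationData objects.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins: Scores maps a plugin name to its interpreted
# scores; a Report takes the scores and the expected 0/1 classes.
Scores = Dict[str, List[int]]
Report = Callable[[List[int], List[int]], Any]


def calculate_one(scores: Scores, classes: List[int], report: Report) -> Dict[str, Any]:
    """Apply one metric or summary to every plugin (plugin -> result)."""
    return {plugin: report(values, classes) for plugin, values in scores.items()}


def calculate_all(scores: Scores, classes: List[int],
                  reporting: Dict[str, Report]) -> Dict[str, Dict[str, Any]]:
    """Apply every report (report name -> {plugin -> result})."""
    return {name: calculate_one(scores, classes, report)
            for name, report in reporting.items()}


def agreement(values: List[int], classes: List[int]) -> int:
    """Toy report: count the positions where prediction and class agree."""
    return sum(value == expected for value, expected in zip(values, classes))


results = calculate_all(
    {"method_a": [1, 0, 1], "method_b": [0, 0, 1]},
    [1, 0, 0],
    {"agreement": agreement},
)
```

The outer dictionary is keyed by the name of the metric or summary, and each inner dictionary is keyed by plugin.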

vpmbench.api.extract_evaluation_data(evaluation_data_path, extractor=<class 'vpmbench.extractor.ClinVarVCFExtractor'>, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Extract the EvaluationData from the evaluation input data.

Parses the evaluation input data given by the evaluation_data_path using the extractor.

Parameters
  • evaluation_data_path (Union[str, Path]) – The path to the evaluation input data

  • extractor (Type[Extractor]) – The extractor that should be used to parse the evaluation input data

Returns

The evaluation data extracted from the evaluation input data using the extractor

Return type

EvaluationData

vpmbench.api.invoke_method(plugin, variant_data)

Invoke a prioritization method represented as a plugin on the variant_data.

Uses vpmbench.plugin.Plugin.run() to invoke the prioritization method.

Parameters
  • plugin (Plugin) – The plugin for the method that should be invoked

  • variant_data (pandas.DataFrame) – The variant data which should be processed by the method

Returns

The plugin and the resulting data from the method

Return type

Tuple[Plugin,pandas.DataFrame]

vpmbench.api.invoke_methods(plugins, variant_data, cpu_count=-1)

Invoke multiple prioritization methods given as a list of plugins on the variant_data in parallel.

Calls vpmbench.api.invoke_method() for each plugin in plugins on the variant_data. The compatibility of the plugins with the variant_data is checked via Plugin.is_compatible_with_data. If cpu_count is -1, then (number of CPUs - 1) CPUs are used to run the plugins in parallel; set cpu_count to 1 to disable parallel execution. The resulting annotated variant data is constructed by collecting the outputs of the plugins and using them as input for AnnotatedVariantData.from_results.

Parameters
  • variant_data (pandas.DataFrame) – The variant data which should be processed by the plugins

  • plugins (List[Plugin]) – A list of plugins that should be invoked

  • cpu_count (int) – The number of CPUs that should be used to invoke the plugins in parallel

Returns

The variant data annotated with the scores from the prioritization methods

Return type

AnnotatedVariantData
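The cpu_count rule described for invoke_methods can be sketched as a small helper. This is an illustration only: `resolve_worker_count` and its `available` parameter are hypothetical, and clamping the result to at least one worker is an assumption, not documented behaviour.

```python
import os
from typing import Optional


def resolve_worker_count(cpu_count: int, available: Optional[int] = None) -> int:
    """Translate a cpu_count argument into a number of worker processes."""
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() may return None
    if cpu_count == -1:
        # Use all CPUs but one; the lower bound of 1 is an assumption.
        return max(available - 1, 1)
    return max(cpu_count, 1)
```

With eight CPUs available, a cpu_count of -1 yields seven workers, while a cpu_count of 1 means the plugins run one after another.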

vpmbench.api.load_plugin(manifest_path)

Load a manifest given by the manifest_path as a plugin.

Parameters

manifest_path (Union[str, Path]) – The path to the manifest

Returns

The loaded plugin

Return type

Plugin

vpmbench.api.load_plugins(plugin_path, plugin_selection=None)

Load all plugins from the plugin_path and apply the plugin selection to filter them.

If plugin_selection is None, all plugins in the plugin_path are returned.

Parameters
  • plugin_path (Union[str, PathLike]) – The path to your plugin directory

  • plugin_selection (Optional[Callable[[Plugin], bool]]) – The selection function that should be applied to filter the plugins

Returns

The list of plugins loaded from the plugin_path

Return type

List[Plugin]
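The effect of plugin_selection can be sketched with a stand-in plugin type. `FakePlugin` and `select_plugins` are hypothetical names used only for this illustration; the real function reads manifests from disk and returns Plugin objects.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FakePlugin:
    """Hypothetical stand-in for vpmbench.plugin.Plugin."""
    name: str
    databases: Dict[str, str] = field(default_factory=dict)


def select_plugins(
    plugins: List[FakePlugin],
    plugin_selection: Optional[Callable[[FakePlugin], bool]] = None,
) -> List[FakePlugin]:
    """Return all plugins if no selection is given, else only accepted ones."""
    if plugin_selection is None:
        return list(plugins)
    return [plugin for plugin in plugins if plugin_selection(plugin)]


loaded = [FakePlugin("method_a", {"ClinVar": "2020"}), FakePlugin("method_b")]
# Keep only the plugins that do not list ClinVar as a database.
selected = select_plugins(loaded, lambda plugin: "ClinVar" not in plugin.databases)
```

Any callable that takes a plugin and returns a bool can serve as the selection function.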

Data

class vpmbench.data.AnnotatedVariantData(annotated_variant_data, plugins)

Represent the variant data annotated with the scores from the prioritization methods.

Contains the same information as the vpmbench.data.EvaluationData.variant_data() and the scores from the methods.

Parameters
  • annotated_variant_data (pandas.core.frame.DataFrame) – The variant data with the annotated scores

  • plugins (List[Plugin]) – The plugins used to calculate the scores

static from_results(original_variant_data, plugin_results)

Create annotated variant data from the original variant data and plugin results.

The annotated variant data is created by merging the plugin scores on the UID column.

Parameters
  • original_variant_data – The original variant data used to calculate the scores

  • plugin_results – The results from invoking the prioritization methods

Returns

The variant data annotated with the scores

Return type

AnnotatedVariantData
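Merging scores on the UID column can be pictured with plain dictionaries. This is a sketch, not the library's implementation: `merge_scores_on_uid` is a hypothetical helper, rows are dictionaries rather than DataFrame rows, and the column name `method_a_SCORE` is made up for the example.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_scores_on_uid(variants: List[Row],
                        results: Dict[str, Dict[int, float]]) -> List[Row]:
    """Add one score column per method to each variant row, matched by UID."""
    merged = []
    for variant in variants:
        row = dict(variant)  # copy so the original rows stay untouched
        for column, scores_by_uid in results.items():
            row[column] = scores_by_uid[row["UID"]]
        merged.append(row)
    return merged


variants = [
    {"UID": 1, "CHROM": "1", "POS": 100},
    {"UID": 2, "CHROM": "X", "POS": 200},
]
results = {"method_a_SCORE": {1: 0.9, 2: 0.1}}
annotated = merge_scores_on_uid(variants, results)
```

Each annotated row keeps the original variant columns and gains one score column per method.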

property scores

Return the list of scores from the annotated variant data

Returns

The list of scores.

Return type

List[Score]

class vpmbench.data.EvaluationData(table, interpretation_map=None)

Represent the evaluation data.

The evaluation data contains all the information about the variants required to use the data to evaluate the performance of the prioritization methods.

The data contains the following information for each variant:

  • UID: A numerical identifier allowing to reference the variant

  • CHROM: The chromosome in which the variant is found

  • POS: The 1-based position of the variant within the chromosome

  • REF: The reference bases.

  • ALT: The alternative bases.

  • RG: The reference genome used to call the variant

  • TYPE: The variation type of the variant

  • CLASS: The expected classification of the variant

Parameters

table (pandas.DataFrame) – The dataframe containing the required information about the variants.

static from_records(records)

Create an evaluation data table from a list of records.

This method also automatically assigns each record a UID.

Parameters

records (List[EvaluationDataEntry]) – The records that should be included in the table.

Returns

The resulting evaluation data

Return type

EvaluationData

property interpreted_classes

Interpret the CLASS data.

The CLASS data is interpreted by applying vpmbench.enums.PathogencityClass.interpret().

Returns

A series of interpreted classes

Return type

pandas.Series

validate()

Check if the evaluation data is valid.

The following constraints are checked:

  • CHROM has to be in {"1",...,"22","X","Y"}

  • POS has to be > 1

  • REF has to match with re.compile("^[ACGT]+$")

  • ALT has to match with re.compile("^[ACGT]+$")

  • RG has to be of type vpmbench.enums.ReferenceGenome

  • CLASS has to be of type vpmbench.enums.PathogencityClass

  • TYPE has to be of type vpmbench.enums.VariationType

  • UID has to be > 0

Raises

SchemaErrors – If the validation of the data fails
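The value constraints listed above can be sketched for a single row with the standard library. This is an illustration under stated assumptions: `find_violations` is a hypothetical helper that returns the offending column names instead of raising SchemaErrors, and the enum type checks for RG, CLASS, and TYPE are left out.

```python
import re
from typing import Any, Dict, List

VALID_CHROMOSOMES = {str(number) for number in range(1, 23)} | {"X", "Y"}
BASES = re.compile("^[ACGT]+$")


def find_violations(row: Dict[str, Any]) -> List[str]:
    """Return the names of the columns that break a listed constraint."""
    checks = {
        "CHROM": row["CHROM"] in VALID_CHROMOSOMES,
        "POS": row["POS"] > 1,  # bound taken as listed above
        "REF": bool(BASES.match(row["REF"])),
        "ALT": bool(BASES.match(row["ALT"])),
        "UID": row["UID"] > 0,
    }
    return [column for column, passed in checks.items() if not passed]


valid_row = {"UID": 1, "CHROM": "1", "POS": 12345, "REF": "A", "ALT": "GT"}
invalid_row = {"UID": 1, "CHROM": "MT", "POS": 12345, "REF": "N", "ALT": "G"}
```

Here `find_violations(valid_row)` is empty, while `invalid_row` fails the CHROM and REF checks.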

property variant_data

Get the pure variant data from the evaluation data.

The variant data consists of the data in columns: UID,CHROM,POS,REF,ALT,RG,TYPE

Returns

The variant data from the evaluation data.

Return type

DataFrame

class vpmbench.data.EvaluationDataEntry(CHROM, POS, REF, ALT, CLASS, TYPE, RG)

Represent an entry in the vpmbench.data.EvaluationData table.

Parameters
  • CHROM (str) – The chromosome in which the variant is found

  • POS (int) – The 1-based position of the variant within the chromosome

  • REF (str) – The reference bases

  • ALT (str) – The alternative bases

  • CLASS (str) – The expected classification of the variant

  • TYPE (vpmbench.enums.VariationType) – The variation type of the variant

  • RG (vpmbench.enums.ReferenceGenome) – The reference genome used to call the variant

Extractors

class vpmbench.extractor.CSVExtractor(row_to_entry_func=None, **kwargs)

An implementation of a generic extractor for CSV files.

The implementation uses Python's csv.DictReader to parse a CSV file. To extract the EvaluationData, _row_to_evaluation_data_entry() is called for each row. If a row-to-entry function is passed as an argument, this function is used instead of the internal method.

Parameters
  • row_to_entry_func – A function that is called for every row in the CSV file to extract an EvaluationDataEntry

  • kwargs – Arguments that are passed to the CSV parser

_row_to_evaluation_data_entry(data_row)

Parses a row of a CSV file to an evaluation data entry.

Parameters

data_row (dict) – A dictionary representing a row of the CSV file

Returns

The evaluation data entry for the row

Return type

EvaluationDataEntry
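The interplay between the CSV parser and the row conversion function can be sketched as follows. `extract_entries` and `default_row_to_entry` are hypothetical names, the column names of the sample file are assumptions, and entries are plain dictionaries instead of EvaluationDataEntry objects.

```python
import csv
import io
from typing import Callable, Dict, List, Optional

Entry = Dict[str, object]


def default_row_to_entry(row: Dict[str, str]) -> Entry:
    """Hypothetical default conversion: copy the columns, cast POS to int."""
    return {
        "CHROM": row["CHROM"],
        "POS": int(row["POS"]),
        "REF": row["REF"],
        "ALT": row["ALT"],
        "CLASS": row["CLASS"],
    }


def extract_entries(handle,
                    row_to_entry_func: Optional[Callable[[Dict[str, str]], Entry]] = None,
                    **kwargs) -> List[Entry]:
    """Parse rows with csv.DictReader and convert each row to an entry."""
    convert = row_to_entry_func or default_row_to_entry
    # Extra keyword arguments (e.g. delimiter) go straight to the CSV parser.
    return [convert(row) for row in csv.DictReader(handle, **kwargs)]


data = io.StringIO("CHROM;POS;REF;ALT;CLASS\n1;12345;A;G;benign\n")
entries = extract_entries(data, delimiter=";")
```

Passing a custom `row_to_entry_func` replaces the default conversion without touching the parsing step.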

class vpmbench.extractor.ClinVarVCFExtractor(record_to_pathogencity_class_func=None)

An extractor for ClinVar VCF files based on VCFExtractor.

class vpmbench.extractor.Extractor

Extractors are used to extract the EvaluationData from evaluation input files.

abstract _extract(file_path)

Internal function to extract the evaluation data from the evaluation input file at file_path.

This function has to be implemented for every extractor.

Parameters

file_path – The file path to evaluation input data

Returns

The evaluation data

Return type

EvaluationData

extract(file_path)

Extract the EvaluationData from the file at file_path.

This function calls _extract() and uses vpmbench.data.EvaluationData.validate() to check if the evaluation data is valid.

Parameters

file_path – The file path to evaluation input data

Returns

The validated evaluation data

Return type

EvaluationData

Raises
  • RuntimeError – If the file cannot be parsed

  • SchemaErrors – If the validation of the extracted data fails

class vpmbench.extractor.VCFExtractor(record_to_pathogencity_class_func=None)

An implementation of a generic extractor for VCF files.

The implementation uses the pyvcf Reader to parse a VCF file. The implementation already extracts POS, CHROM, REF, and ALT for each variant. To extract the CLASS, the internal _extract_pathogencity_class_from_record() is called for each VCF entry. If a record-to-pathogenicity-class function is passed as an argument, this function is used instead of the internal method.

Parameters

record_to_pathogencity_class_func – A function that returns the pathogenicity class for each entry in the VCF file.

_extract_pathogencity_class_from_record(index, vcf_record)

Extracts the pathogenicity class of a VCF record.

Parameters

vcf_record (vcf.model._Record) – A record of the VCF file

Returns

The pathogenicity class of the variant

Return type

str

class vpmbench.extractor.VariSNPExtractor

An implementation of an extractor for VariSNP files based on CSVExtractor.

Plugins

class vpmbench.plugin.DockerEntryPoint(image, run_command, input, output, bindings=None)

Represent an entry point using Docker to run the custom processing logic

Parameters
  • image (str) – The name of the Docker image used to create a Docker container

  • run_command (str) – The command that invokes the custom processing logic in the Docker container

  • input – Information about the file path and format of the input file.

  • output (dict) – Information about the file path and format of the output file.

  • bindings (dict) – Additional bindings that should be mounted into the Docker container. Keys: local file paths, Values: remote file paths

run(variant_information_table)

Run the custom processing for the entry point.

The variant_information_table is converted into the expected input file format using format_input(). The results from the Docker container are converted using format_output().

Parameters

variant_information_table – The variant information table

Returns

The results from the processing logic

Return type

DataFrame

class vpmbench.plugin.EntryPoint

Represent an entry point to the custom processing logic required to invoke a prioritization method.

abstract run(variant_information_table)

Run the custom processing logic

Has to return a DataFrame with two columns:

  • UID: The UID of the variants

  • SCORE: The calculated score for the variants

Parameters

variant_information_table – The variant information table

Returns

The results from the processing logic

Return type

DataFrame

class vpmbench.plugin.Plugin(name, version, supported_variations, supported_chromosomes, reference_genome, databases, entry_point, cutoff, manifest_path)

Represent a plugin

Basically, the plugin stores the information from the manifest files.

Parameters
  • name (str) – The name of the plugin

  • version (str) – The version of the plugin

  • supported_variations (List[vpmbench.enums.VariationType]) – The variation types supported by the prioritization method

  • reference_genome (vpmbench.enums.ReferenceGenome) – The reference genome supported by the prioritization method

  • databases (dict) – The accompanying databases of the prioritization method; Key: name of the database, Value: version of the database

  • entry_point (vpmbench.plugin.EntryPoint) – The entry point

  • cutoff (float) – The cutoff for pathogenicity

  • manifest_path (Union[str, pathlib.Path]) – The file path to the manifest file for the plugin

static _validate_score_table(variant_information_table, score_table)

Validate the results of the prioritization method.

The following constraints are checked:

  • Each UID from the variant_information_table is also in the score_table

  • Each SCORE in the score_table is a numerical value

Parameters
  • variant_information_table – The variant information table

  • score_table – The scoring results from the prioritization method

Raises

SchemaErrors – If the validation of the data fails
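The two score-table constraints can be sketched with plain Python containers. This is a simplified illustration: `check_score_table` is a hypothetical helper that works on a list of UIDs and a UID-to-score dictionary and returns messages, whereas the method documented above validates DataFrames and raises SchemaErrors.

```python
from numbers import Real
from typing import Dict, List


def check_score_table(variant_uids: List[int],
                      scores_by_uid: Dict[int, object]) -> List[str]:
    """Report UIDs without a score and scores that are not numerical."""
    problems = []
    missing = [uid for uid in variant_uids if uid not in scores_by_uid]
    if missing:
        problems.append(f"missing scores for UIDs {missing}")
    # bool is a subclass of int, so it is excluded explicitly.
    non_numeric = [uid for uid, score in scores_by_uid.items()
                   if isinstance(score, bool) or not isinstance(score, Real)]
    if non_numeric:
        problems.append(f"non-numerical scores for UIDs {non_numeric}")
    return problems
```

An empty result means that every variant received a numerical score.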

is_compatible_with_data(variant_information_table)

Check if the plugin is compatible with the variant information table.

The following constraints are checked:

  • The variant information table contains only variants with the same reference genome as the plugin

  • The variant information table contains only variants with a variation type supported by the plugin

Parameters

variant_information_table – The variant information table

Raises

RuntimeError – If the validation fails

run(variant_information_table)

Run the plugin on the variant_information_table

Before running the plugin the compatibility of the data with the plugin is tested. Next the run() method of the entry_point is called with the variant_information_table. The result of the entry_point is validated to ensure that each variant from the variant_information_table got a valid score assigned. Finally, the score column is renamed using the score_column_name().

The resulting DataFrame consists of two columns: the UID column and the score column named according to score_column_name.

Parameters

variant_information_table – The variant information table

Returns

The plugin result.

Return type

DataFrame

property score_column_name

Return the column name for the AnnotatedVariantData.

The name is calculated by "{self.name}_SCORE".

Returns

The name of the column

Return type

str

class vpmbench.plugin.PluginBuilder

This class builds the Plugins

classmethod build_plugin(**kwargs)

Build a plugin from the arguments.

See the documentation for the specification of the manifest schema.

Parameters

kwargs – The arguments

Returns

The Plugin

Return type

Plugin

Raises

RuntimeError – If required information is missing.

class vpmbench.plugin.PythonEntryPoint(file_path)

Represent an entry point using Python to run the custom processing logic

The entry point has to be implemented in the Python file via a function entry_point accepting the variant_data() as input.

Parameters
  • file_path (pathlib.Path) – Path to the Python file containing the implementation of the custom processing logic

  • plugin – Reference to the plugin of the entry point

run(variant_information_table)

Run the custom processing logic

Has to return a DataFrame with two columns:

  • UID: The UID of the variants

  • SCORE: The calculated score for the variants

Parameters

variant_information_table – The variant information table

Returns

The results from the processing logic

Return type

DataFrame

class vpmbench.plugin.Score(plugin, data)

Represent a score from a prioritization method.

Parameters
property cutoff

Get the cutoff from the plugin of the score.

Returns

The cutoff

Return type

float

interpret(cutoff=None)

Interpret the score using the cutoff.

If the cutoff is None, the vpmbench.data.Score.cutoff() is used to interpret the score. The score is interpreted by replacing all values greater than the cutoff by 1, and all other values by 0.

Parameters

cutoff – The cutoff

Returns

The interpreted scores

Return type

pandas.Series
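The interpretation rule can be written out for a plain list of scores. In this sketch `interpret_scores` and `plugin_cutoff` are hypothetical names; the documented method works on the score data of a plugin and returns a pandas.Series.

```python
from typing import List, Optional


def interpret_scores(scores: List[float], plugin_cutoff: float,
                     cutoff: Optional[float] = None) -> List[int]:
    """Map each score to 1 if it is greater than the cutoff, else to 0."""
    # Fall back to the cutoff of the plugin when no cutoff is given.
    threshold = plugin_cutoff if cutoff is None else cutoff
    return [1 if score > threshold else 0 for score in scores]
```

Note that a score equal to the cutoff is mapped to 0, because only values strictly greater than the cutoff become 1.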

Performance Summaries

class vpmbench.summaries.ConfusionMatrix
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the confusion matrix.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

A dictionary with the following keys: tn - the number of true negatives, fp - the number of false positives, fn - the number of false negatives, tp - the number of the true positives

Return type

dict
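The four entries of the returned dictionary can be derived from interpreted scores and interpreted classes as in the following sketch. `confusion_counts` is a hypothetical helper and assumes that both inputs are already encoded as 0 (benign) and 1 (pathogenic).

```python
from typing import Dict, List


def confusion_counts(predicted: List[int], expected: List[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for 0/1 encoded labels."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for prediction, truth in zip(predicted, expected):
        if truth == 1:
            counts["tp" if prediction == 1 else "fn"] += 1
        else:
            counts["fp" if prediction == 1 else "tn"] += 1
    return counts


matrix = confusion_counts([1, 0, 1, 0, 1], [1, 0, 0, 1, 1])
```

For the five variants in the example, two pathogenic variants are found (tp), one is missed (fn), one benign variant is flagged (fp), and one is correctly left alone (tn).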

class vpmbench.summaries.PerformanceSummary

Represent a performance summary

class vpmbench.summaries.PrecisionRecallCurve
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the precision recall curve.

Parameters
  • score – The score from the prioritization method

  • interpreted_classes – The interpreted classes

Returns

A dictionary with the following keys: precision - precision values, recall - recall values, thresholds - the thresholds

Return type

dict

class vpmbench.summaries.ROCCurve
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculates the ROC curve.

Parameters
  • score – The score from the prioritization method

  • interpreted_classes – The interpreted classes

Returns

A dictionary with the following keys: fpr - false positive rates, tpr - true positive rates, thresholds - the thresholds

Return type

dict

Performance Metrics

class vpmbench.metrics.Accuracy
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the accuracy.

Uses a ConfusionMatrix to calculate the accuracy.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The calculated accuracy

Return type

float

class vpmbench.metrics.AreaUnderTheCurveROC

Calculate the area under the ROC curve (AUROC).

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The calculated AUC

Return type

float

class vpmbench.metrics.Concordance
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the concordance, i.e., the sum of true positives and true negatives.

Uses a ConfusionMatrix to calculate the concordance.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The calculated concordance

Return type

float

class vpmbench.metrics.MatthewsCorrelationCoefficient
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the Matthews correlation coefficient.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The Matthews correlation coefficient

Return type

float
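For reference, the standard definition of the Matthews correlation coefficient in terms of confusion-matrix counts is sketched below. `matthews_correlation` is a hypothetical helper, and returning 0.0 for a zero denominator is a common convention assumed here, not documented behaviour of this class.

```python
import math


def matthews_correlation(tn: int, fp: int, fn: int, tp: int) -> float:
    """Compute the MCC from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an empty row or column yields 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The value ranges from -1 (total disagreement) over 0 (no better than chance) to 1 (perfect prediction).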

class vpmbench.metrics.NegativePredictiveValue
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the negative predictive value.

Uses a ConfusionMatrix to calculate the negative predictive value.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The calculated negative predictive value

Return type

float

class vpmbench.metrics.PerformanceMetric

Represent a performance metric.

class vpmbench.metrics.Precision
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the precision.

Uses a ConfusionMatrix to calculate the precision/positive predictive value.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The calculated precision

Return type

float

class vpmbench.metrics.Sensitivity
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the sensitivity.

Uses a ConfusionMatrix to calculate the sensitivity/true positive rate.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The calculated sensitivity

Return type

float

class vpmbench.metrics.Specificity
static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})

Calculate the specificity.

Uses a ConfusionMatrix to calculate the specificity/true negative rate.

Parameters
  • score – The score from the plugin

  • interpreted_classes – The interpreted classes

Returns

The calculated specificity

Return type

float
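The metrics in this section that are based on a ConfusionMatrix follow the standard textbook definitions, which can be sketched from the four counts. `metrics_from_counts` is a hypothetical helper; it does not guard against empty denominators, and the dictionary keys are chosen for this example only.

```python
from typing import Dict


def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Derive common metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # positive predictive value
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "negative_predictive_value": tn / (tn + fn),
        "concordance": tp + tn,         # sum of true positives and negatives
    }


metrics = metrics_from_counts(tn=1, fp=1, fn=1, tp=2)
```

All values except the concordance are ratios between 0 and 1; the concordance is a count.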

Utilities

vpmbench.utils.plot_confusion_matrices(report, normalize=False, cmap='Blues')

Plot the confusion matrices of the prioritization methods from a performance report

Shows the confusion matrices if the vpmbench.summaries.ConfusionMatrix was calculated.

Parameters
  • report – The performance report

  • normalize – If true the values in the confusion matrix are normalized

  • cmap – The colormap that should be used to plot the confusion matrices

vpmbench.utils.plot_precision_recall_curves(report)

Plot the precision recall curves using a performance report

Shows the precision recall curves if the vpmbench.summaries.PrecisionRecallCurve was calculated.

Parameters

report – The performance report

vpmbench.utils.plot_roc_curves(report)

Plot the ROC curves using a performance report

Shows the ROC curves if the vpmbench.summaries.ROCCurve was calculated.

Parameters

report – The performance report

vpmbench.utils.report_metrics(report)

Print the calculated metrics to the terminal

Parameters

report – The performance report

Processor

vpmbench.processors.format_input(variant_information_table, target_format, target_file, **kwargs)

Formats the variant information table into the target format and writes the results to the target file.

Parameters
  • variant_information_table – The input table

  • target_format – The format in which data should be written

  • target_file – The file in which the data should be written

  • kwargs – Additional arguments that are passed to the converter function

vpmbench.processors.format_output(variant_information_table, output_format, output_file, **kwargs)

Formats the content of the output file into a dataframe.

Parameters
  • variant_information_table – The variant information table used to calculate the results in the output file

  • output_format – The format of the output file

  • output_file – The file from which the output should be read

  • kwargs – Additional arguments

Returns

The formatted content of the output file

Return type

DataFrame

Enums

class vpmbench.enums.ReferenceGenome(value)

Represent reference genomes.

The following values are supported:

  • HG38

  • HG19

  • HG18

  • HG17

  • HG16

static resolve(name)

Resolve string into a reference genome.

The following rules apply:

  • if "grch38" in name.lower() -> ReferenceGenome.HG38

  • if "grch37" in name.lower() -> ReferenceGenome.HG19

  • otherwise: ReferenceGenome(name) is called

Parameters

name – The string.

Returns

The reference genome

Return type

ReferenceGenome

Raises

RuntimeError – If the name cannot be resolved
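The resolution rules can be sketched with a small enum. `ReferenceGenomeSketch` is a hypothetical stand-in for illustration; in particular, the string values assigned to the members are assumptions, as the actual values are not listed here.

```python
from enum import Enum


class ReferenceGenomeSketch(Enum):
    """Hypothetical stand-in; the member values are assumptions."""
    HG38 = "hg38"
    HG19 = "hg19"
    HG18 = "hg18"
    HG17 = "hg17"
    HG16 = "hg16"

    @staticmethod
    def resolve(name: str) -> "ReferenceGenomeSketch":
        lowered = name.lower()
        if "grch38" in lowered:
            return ReferenceGenomeSketch.HG38
        if "grch37" in lowered:
            return ReferenceGenomeSketch.HG19
        try:
            # Otherwise fall back to the regular enum lookup by value.
            return ReferenceGenomeSketch(name)
        except ValueError as error:
            raise RuntimeError(f"Cannot resolve reference genome: {name}") from error
```

Because the GRCh rules use a substring test, a name such as "GRCh37.p13" resolves to HG19.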

class vpmbench.enums.VariationType(value)

Represent the variation types of the variants.

The following values are supported:

  • SNP for single-nucleotide polymorphism

  • INDEL for insertions or deletions

static resolve(name)

Return the variation type based on the given string

The following rules apply:

  • if name.lower() == 'snp' -> VariationType.SNP

  • if name.lower() == 'indel' -> VariationType.INDEL

Parameters

name – The string

Returns

The variation type

Return type

VariationType

Raises

RuntimeError – If the name cannot be resolved

Predicates

vpmbench.predicates.is_multiclass_plugin(plugin)

Checks whether a plugin is a multi-class plugin.

Parameters

plugin (Plugin) – The plugin to be checked

Returns

The checking result

Return type

bool

vpmbench.predicates.was_trained_with(plugin, database_name)

Checks whether a plugin was trained with a specific database.

Parameters
  • plugin (Plugin) – The plugin to be checked

  • database_name (str) – The database name

Returns

The checking result

Return type

bool

Configuration

The config module defines the following variables:

vpmbench.config.DEFAULT_PLUGIN_PATH

The default plugin path where vpmbench searches for plugins.

Type

pathlib.Path

Value

The directory VPMBench-Plugins in the home directory of the current user, e.g., /home/user/VPMBench-Plugins.
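A path of this form can be built with pathlib as shown below. This is a sketch of how such a value may be constructed, not necessarily how the config module defines it; the variable name `default_plugin_path` is chosen for the example.

```python
from pathlib import Path

# Directory "VPMBench-Plugins" inside the home directory of the current user.
default_plugin_path = Path.home() / "VPMBench-Plugins"
```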