Developer Interface
Main Interface
- vpmbench.api.calculate_metric_or_summary(annotated_variant_data, evaluation_data, report, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculates a metric or a summary for all plugins in the annotated variant data.
- Parameters
annotated_variant_data (AnnotatedVariantData) – The annotated variant data
evaluation_data (EvaluationData) – The evaluation data
report (Union[Type[PerformanceMetric], Type[PerformanceSummary]]) – The performance summary or metric that should be calculated
- Returns
A dictionary where the keys are the plugins and the values are the results of the calculations
- Return type
Dict[Plugin, Any]
- vpmbench.api.calculate_metrics_and_summaries(annotated_variant_data, evaluation_data, reporting, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculates the metrics and summaries for the plugin used to annotate the variants.
Uses calculate_metric_or_summary() to calculate all summaries and metrics from reporting.
- Parameters
annotated_variant_data – The annotated variant data
evaluation_data – The evaluation data
reporting – The metrics and summaries that should be calculated
- Returns
Keys: the name of the metric/summary; Values: the results from calculate_metric_or_summary()
- Return type
Dict
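The relationship between the two functions can be sketched without vpmbench itself. The snippet below is a minimal stand-in, not the library's implementation: `ToyAccuracy`, `ToyConcordance`, `calculate_one`, and `calculate_all` are hypothetical names, the per-plugin predictions are plain lists rather than AnnotatedVariantData, and keying the outer dictionary by the class name is an assumption.

```python
from typing import Any, Dict, List, Type


class ToyAccuracy:
    """Hypothetical stand-in for a metric with a static calculate()."""

    @staticmethod
    def calculate(predicted: List[int], expected: List[int]) -> float:
        hits = sum(p == e for p, e in zip(predicted, expected))
        return hits / len(expected)


class ToyConcordance:
    """Hypothetical stand-in for a second metric."""

    @staticmethod
    def calculate(predicted: List[int], expected: List[int]) -> int:
        return sum(p == e for p, e in zip(predicted, expected))


def calculate_one(
    predictions: Dict[str, List[int]], expected: List[int], report: Type
) -> Dict[str, Any]:
    # One report for all plugins: plugin name -> result.
    return {
        plugin: report.calculate(predicted, expected)
        for plugin, predicted in predictions.items()
    }


def calculate_all(
    predictions: Dict[str, List[int]], expected: List[int], reporting: List[Type]
) -> Dict[str, Dict[str, Any]]:
    # All reports: report name -> {plugin name -> result}.
    return {
        report.__name__: calculate_one(predictions, expected, report)
        for report in reporting
    }


predictions = {"plugin_a": [1, 0, 1, 1], "plugin_b": [0, 0, 0, 0]}
expected = [1, 0, 1, 0]
results = calculate_all(predictions, expected, [ToyAccuracy, ToyConcordance])
```

Here `results["ToyAccuracy"]` holds one value per plugin, mirroring the nested shape described above.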
- vpmbench.api.extract_evaluation_data(evaluation_data_path, extractor=<class 'vpmbench.extractor.ClinVarVCFExtractor'>, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Extract the EvaluationData from the evaluation input data.
Parses the evaluation input data given by the evaluation_data_path using the extractor.
- Parameters
- Returns
The evaluation data extracted from evaluation_input_data using the extractor
- Return type
- vpmbench.api.invoke_method(plugin, variant_data)
Invoke a prioritization method represented as a plugin on the variant_data.
Uses vpmbench.plugin.Plugin.run() to invoke the prioritization method.
- Parameters
plugin (Plugin) – The plugin for the method that should be invoked
variant_data (pandas.DataFrame) – The variant data which should be processed by the method
- Returns
The plugin and the resulting data from the method
- Return type
Tuple[Plugin,pandas.DataFrame]
- vpmbench.api.invoke_methods(plugins, variant_data, cpu_count=-1)
Invoke multiple prioritization methods given as a list of plugins on the variant_data in parallel.
Calls vpmbench.api.invoke_method() for each plugin in plugins on the variant_data. The compatibility of the plugins with the variant_data is checked via Plugin.is_compatible_with_data. If cpu_count is -1, then (number of CPUs - 1) are used to run the plugins in parallel; set it to 1 to disable parallel execution. The resulting annotated variant data is constructed by collecting the outputs of the plugins and using them as input for AnnotatedVariantData.from_results.
- Parameters
variant_data (pandas.DataFrame) – The variant data which should be processed by the plugins
plugins (List[Plugin]) – A list of plugins that should be invoked
cpu_count (int) – The number of CPUs that should be used to invoke the plugins in parallel
- Returns
The variant data annotated with the scores from the prioritization methods
- Return type
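The cpu_count convention can be illustrated with a small standalone sketch. It is not vpmbench's code: `resolve_worker_count` and `run_all` are hypothetical helpers, the methods are plain callables rather than plugins, and a thread pool is used only to keep the example short (the library may parallelize differently).

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def resolve_worker_count(cpu_count: int) -> int:
    """Map the documented cpu_count convention to a concrete worker count."""
    if cpu_count == -1:
        available = os.cpu_count() or 1
        return max(1, available - 1)  # leave one CPU free
    return max(1, cpu_count)


def run_all(
    methods: Dict[str, Callable[[List[int]], List[float]]],
    uids: List[int],
    cpu_count: int = -1,
) -> Dict[str, List[float]]:
    workers = resolve_worker_count(cpu_count)
    if workers == 1:
        # cpu_count=1 disables parallel execution.
        return {name: method(uids) for name, method in methods.items()}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(method, uids) for name, method in methods.items()}
        return {name: future.result() for name, future in futures.items()}


methods = {
    "double": lambda uids: [uid * 2.0 for uid in uids],
    "constant": lambda uids: [0.5 for _ in uids],
}
sequential = run_all(methods, [1, 2, 3], cpu_count=1)
parallel = run_all(methods, [1, 2, 3], cpu_count=2)
```

Both calls produce the same dictionary; only the execution strategy differs.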
- vpmbench.api.load_plugin(manifest_path)
Load a manifest given by the manifest_path as a plugin.
- vpmbench.api.load_plugins(plugin_path, plugin_selection=None)
Load all plugins from the plugin_path and apply the plugin_selection to filter them.
If plugin_selection is None all plugins in the plugin_path are returned.
Data
- class vpmbench.data.AnnotatedVariantData(annotated_variant_data, plugins)
Represent the variant data annotated with the scores from the prioritization methods.
Contains the same information as the vpmbench.data.EvaluationData.variant_data() and the scores from the methods.
- Parameters
annotated_variant_data (pandas.core.frame.DataFrame) – The variant data with the annotated scores
plugins (List[Plugin]) – The plugins used to calculate the scores
- static from_results(original_variant_data, plugin_results)
Create annotated variant data from the original variant data and plugin results.
The annotated variant data is created by merging the plugin scores on the UID column.
- Parameters
original_variant_data – The original variant data used to calculate the scores
plugin_results – The results from invoking the prioritization methods
- Returns
The variant data annotated with the scores
- Return type
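The merge on the UID column can be pictured with a pure-Python sketch. This is an illustration only: `merge_on_uid` is a hypothetical helper, rows are dictionaries instead of a pandas DataFrame, and plugin results are assumed to map a score column name to per-UID scores.

```python
from typing import Dict, List


def merge_on_uid(
    variants: List[dict], plugin_results: Dict[str, Dict[int, float]]
) -> List[dict]:
    """Attach one score column per plugin to every variant row, matched by UID."""
    merged = []
    for variant in variants:
        row = dict(variant)  # copy so the original variant data stays untouched
        for column, scores in plugin_results.items():
            row[column] = scores[variant["UID"]]
        merged.append(row)
    return merged


variants = [
    {"UID": 1, "CHROM": "1", "POS": 100},
    {"UID": 2, "CHROM": "X", "POS": 200},
]
plugin_results = {
    "toy_a_SCORE": {1: 0.2, 2: 0.9},
    "toy_b_SCORE": {2: 0.4, 1: 0.7},  # order of UIDs does not matter
}
annotated = merge_on_uid(variants, plugin_results)
```

With pandas, the same idea is usually expressed as successive `DataFrame.merge(..., on="UID")` calls.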
- class vpmbench.data.EvaluationData(table, interpretation_map=None)
Represent the evaluation data.
The evaluation data contains all the information about the variants required to use the data to evaluate the performance of the prioritization methods.
The data contains the following information for each variant:
UID: A numerical identifier allowing to reference the variant
CHROM: The chromosome in which the variant is found
POS: The 1-based position of the variant within the chromosome
REF: The reference bases.
ALT: The alternative bases.
RG: The reference genome used to call the variant
TYPE: The variation type of the variant
CLASS: The expected classification of the variant
- Parameters
table (pandas.DataFrame) – The dataframe containing the required information about the variants.
- static from_records(records)
Create an evaluation data table from a list of records.
This method also automatically assigns each record a UID.
- Parameters
records (List[EvaluationDataEntry]) – The records that should be included in the table.
- Returns
The resulting evaluation data
- Return type
- property interpreted_classes
Interpret the CLASS data.
The CLASS data is interpreted by applying vpmbench.enums.PathogencityClass.interpret().
- Returns
A series of interpreted classes
- Return type
- validate()
Check if the evaluation data is valid.
The following constraints are checked:
CHROM has to be in {"1",...,"22","X","Y"}
POS has to be > 1
REF has to match with re.compile("^[ACGT]+$")
ALT has to match with re.compile("^[ACGT]+$")
RG has to be of type vpmbench.enums.ReferenceGenome
CLASS has to be of type vpmbench.enums.PathogencityClass
TYPE has to be of type vpmbench.enums.VariationType
UID has to be > 0
- Raises
SchemaErrors – If the validation of the data fails
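A few of the listed constraints can be written down as a plain-Python check. This is a sketch under assumptions, not the schema vpmbench uses: `find_violations` is a hypothetical helper, a record is a dictionary, and the enum-typed columns (RG, CLASS, TYPE) are left out.

```python
import re
from typing import List

BASES = re.compile("^[ACGT]+$")
CHROMOSOMES = {str(number) for number in range(1, 23)} | {"X", "Y"}


def find_violations(record: dict) -> List[str]:
    """Return the names of the columns that break a documented constraint."""
    checks = {
        "CHROM": record["CHROM"] in CHROMOSOMES,
        "POS": record["POS"] > 1,
        "REF": bool(BASES.match(record["REF"])),
        "ALT": bool(BASES.match(record["ALT"])),
        "UID": record["UID"] > 0,
    }
    return [column for column, passed in checks.items() if not passed]


valid = {"UID": 1, "CHROM": "7", "POS": 12345, "REF": "A", "ALT": "G"}
invalid = {"UID": 2, "CHROM": "MT", "POS": 42, "REF": "N", "ALT": "T"}
```

`find_violations(valid)` yields an empty list, while the second record fails on CHROM and REF.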
- property variant_data
Get the pure variant data from the evaluation data.
The variant data consists of the data in columns: UID,CHROM,POS,REF,ALT,RG,TYPE
- Returns
The variant data from the evaluation data.
- Return type
DataFrame
- class vpmbench.data.EvaluationDataEntry(CHROM, POS, REF, ALT, CLASS, TYPE, RG)
Represent an entry in the vpmbench.data.EvaluationData table.
- Parameters
CHROM (str) – The chromosome in which the variant is found
POS (int) – The 1-based position of the variant within the chromosome
REF (str) – The reference bases
ALT (str) – The alternative bases
CLASS (str) – The expected classification of the variant
TYPE (vpmbench.enums.VariationType) – The variation type of the variant
RG (vpmbench.enums.ReferenceGenome) – The reference genome used to call the variant
Extractors
- class vpmbench.extractor.CSVExtractor(row_to_entry_func=None, **kwargs)
An implementation of a generic extractor for CSV files.
The implementation uses the Python DictReader to parse a CSV file. To extract the EvaluationData, the _row_to_evaluation_data_entry() method is called. If a row-to-entry function is passed as an argument, this function is used instead of the internal method.
- Parameters
row_to_entry_func – A function that is called for every row in the CSV file to extract an EvaluationDataEntry
kwargs – Arguments that are passed to the CSV parser
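The pattern described here, a DictReader plus a replaceable row-to-entry function, can be sketched with the standard library alone. The names `Entry`, `default_row_to_entry`, and `extract_entries` are hypothetical and do not come from vpmbench; the column names in the sample data are assumptions as well.

```python
import csv
import io
from typing import Callable, List, NamedTuple, Optional


class Entry(NamedTuple):
    """Hypothetical, reduced stand-in for an evaluation data entry."""

    CHROM: str
    POS: int
    REF: str
    ALT: str
    CLASS: str


def default_row_to_entry(row: dict) -> Entry:
    return Entry(row["CHROM"], int(row["POS"]), row["REF"], row["ALT"], row["CLASS"])


def extract_entries(
    handle, row_to_entry: Optional[Callable[[dict], Entry]] = None, **kwargs
) -> List[Entry]:
    # A user-supplied function takes precedence over the default conversion;
    # kwargs are forwarded to the CSV parser (e.g. delimiter).
    convert = row_to_entry or default_row_to_entry
    return [convert(row) for row in csv.DictReader(handle, **kwargs)]


data = io.StringIO("CHROM;POS;REF;ALT;CLASS\n1;12345;A;G;benign\nX;678;C;T;pathogenic\n")
entries = extract_entries(data, delimiter=";")
```

Passing a custom `row_to_entry` lets a caller map differently named columns without subclassing.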
- class vpmbench.extractor.ClinVarVCFExtractor(record_to_pathogencity_class_func=None)
An extractor for ClinVar VCF files based on VCFExtractor.
- class vpmbench.extractor.Extractor
Extractors are used to extract the EvaluationData from evaluation input files.
- abstract _extract(file_path)
Internal function to extract the evaluation data from the evaluation input file at file_path.
This function has to be implemented for every extractor.
- Parameters
file_path – The file path to evaluation input data
- Returns
The evaluation data
- Return type
- extract(file_path)
Extract the EvaluationData from the file at file_path.
This function calls _extract() and uses vpmbench.data.EvaluationData.validate() to check if the evaluation data is valid.
- Parameters
file_path – The file path to evaluation input data
- Returns
The validated evaluation data
- Return type
- Raises
RuntimeError – If the file cannot be parsed
SchemaErrors – If the validation of the extracted data fails
- class vpmbench.extractor.VCFExtractor(record_to_pathogencity_class_func=None)
An implementation of a generic extractor for VCF files.
The implementation uses the pyvcf Reader to parse a VCF file. The implementation already extracts POS, CHROM, REF, ALT for each variant. To extract the CLASS, the internal _extract_pathogencity_class_from_record() is called for each VCF entry. If a record-to-pathogenicity-class function is passed as an argument, this function is used instead of the internal method.
- Parameters
record_to_pathogencity_class_func – A function that returns the pathogenicity class for each entry in the VCF file.
- _extract_pathogencity_class_from_record(index, vcf_record)
Extracts the pathogenicity class of a VCF record.
- Parameters
vcf_record (vcf.model._Record) – A record of the VCF file
- Returns
The pathogenicity class of the variant
- Return type
- class vpmbench.extractor.VariSNPExtractor
An implementation of an extractor for VariSNP files based on CSVExtractor.
Plugins
- class vpmbench.plugin.DockerEntryPoint(image, run_command, input, output, bindings=None)
Represent an entry point using Docker to run the custom processing logic
- Parameters
image (str) – The name of the Docker image used to create a Docker container
run_command (str) – The command that invokes the custom processing logic in the Docker container
input – Information about the file-path and format of the input file.
output (dict) – Information about the file-path and format of the output file.
bindings (dict) – Additional bindings that should be mounted for the Docker container. Keys: local file paths, Values: remote file paths
- run(variant_information_table)
Run the custom processing for the entry point.
The variant_information_table is converted into the expected input file format using format_input(). The results from the Docker container are converted using format_output().
- Parameters
variant_information_table – The variant information table
- Returns
The results from the processing logic
- Return type
DataFrame
- class vpmbench.plugin.EntryPoint
Represent an entry point to the custom processing logic required to invoke a prioritization method.
- abstract run(variant_information_table)
Run the custom processing logic
Has to return a DataFrame with two columns:
UID: The UID of the variants
SCORE: The calculated score for the variants
- Parameters
variant_information_table – The variant information table
- Returns
The results from the processing logic
- Return type
DataFrame
- class vpmbench.plugin.Plugin(name, version, supported_variations, supported_chromosomes, reference_genome, databases, entry_point, cutoff, manifest_path)
Represent a plugin
Basically, the plugin stores the information from the manifest files.
- Parameters
name (str) – The name of the plugin
version (str) – The version of the plugin
supported_variations (List[vpmbench.enums.VariationType]) – The variation types supported by the prioritization method
reference_genome (vpmbench.enums.ReferenceGenome) – The reference genome supported by the prioritization method
databases (dict) – The accompanying databases of the prioritization method; Key: name of the database, Value: version of the database
entry_point (vpmbench.plugin.EntryPoint) – The entry point
cutoff (float) – The cutoff for pathogenicity
manifest_path (Union[str, pathlib.Path]) – The file path to the manifest file for the plugin
- static _validate_score_table(variant_information_table, score_table)
Validate the results of the prioritization method.
The following constraints are checked:
Each UID from the variant_information_table is also in the score_table
Each SCORE in the score_table is a numerical value
- Parameters
variant_information_table – The variant information table
score_table – The scoring results from the prioritization method
- Raises
SchemaErrors – If the validation of the data fails
- is_compatible_with_data(variant_information_table)
Check if the plugin is compatible with the variant information table.
The following constraints are checked:
The variant information table contains only variants with the same reference genome as the plugin
The variant information table contains only variants with a variation type supported by the plugin
- Parameters
variant_information_table – The variant information table
- Raises
RuntimeError – If the validation fails
- run(variant_information_table)
Run the plugin on the variant_information_table
Before running the plugin, the compatibility of the data with the plugin is tested. Next, the run() method of the entry_point is called with the variant_information_table. The result of the entry_point is validated to ensure that each variant from the variant_information_table got a valid score assigned. Finally, the score column is renamed using the score_column_name().
The resulting DataFrame consists of two columns:
UID: The UID of the variants
score_column_name(): The scores from the prioritization method
- Parameters
variant_information_table – The variant information table
- Returns
The plugin result.
- Return type
DataFrame
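The four steps of run() (compatibility check, entry point call, score validation, column renaming) can be sketched in plain Python. This is not the library's implementation: `run_plugin` is a hypothetical function, tables are lists of dictionaries instead of DataFrames, and the string values used for RG and TYPE are assumptions.

```python
from numbers import Real
from typing import Callable, List, Set


def run_plugin(
    name: str,
    reference_genome: str,
    supported_types: Set[str],
    entry_point: Callable[[List[dict]], List[dict]],
    variants: List[dict],
) -> List[dict]:
    # 1. Compatibility: every variant must match the plugin's genome and types.
    for variant in variants:
        if variant["RG"] != reference_genome or variant["TYPE"] not in supported_types:
            raise RuntimeError(f"Variant {variant['UID']} is not compatible with {name}")

    # 2. Run the entry point; it returns rows with the keys UID and SCORE.
    rows = entry_point(variants)

    # 3. Validate: each input UID needs a numerical score.
    scores = {row["UID"]: row["SCORE"] for row in rows}
    for variant in variants:
        score = scores.get(variant["UID"])
        if isinstance(score, bool) or not isinstance(score, Real):
            raise ValueError(f"No numerical score for UID {variant['UID']}")

    # 4. Rename SCORE to the plugin-specific column name.
    column = f"{name}_SCORE"
    return [{"UID": row["UID"], column: row["SCORE"]} for row in rows]


variants = [
    {"UID": 1, "RG": "hg19", "TYPE": "snp"},
    {"UID": 2, "RG": "hg19", "TYPE": "snp"},
]
result = run_plugin(
    "toy",
    "hg19",
    {"snp"},
    lambda table: [{"UID": v["UID"], "SCORE": 0.25 * v["UID"]} for v in table],
    variants,
)
```

Renaming the column last keeps entry points simple: they always emit SCORE, and the plugin name is applied in one place.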
- property score_column_name
Return the column name for the AnnotatedVariantData.
The name is calculated by "{self.name}_SCORE".
- Returns
The name of the column
- Return type
- class vpmbench.plugin.PluginBuilder
This class builds the Plugins.
- classmethod build_plugin(**kwargs)
Build a plugin from the arguments.
See the documentation for the specification of the manifest schema.
- Parameters
kwargs – The arguments
- Returns
The Plugin
- Return type
- Raises
RuntimeError – If required information is missing.
- class vpmbench.plugin.PythonEntryPoint(file_path)
Represent an entry point using Python to run the custom processing logic
The entry point has to be implemented in the Python file via a function entry_point accepting the variant_data() as input.
- Parameters
file_path (pathlib.Path) – Path to the Python file containing the implementation of the custom processing logic
plugin – Reference to the plugin of the entry point
- run(variant_information_table)
Run the custom processing logic
Has to return a DataFrame with two columns:
UID: The UID of the variants
SCORE: The calculated score for the variants
- Parameters
variant_information_table – The variant information table
- Returns
The results from the processing logic
- Return type
DataFrame
- class vpmbench.plugin.Score(plugin, data)
Represent a score from a prioritization method.
- Parameters
plugin (vpmbench.plugin.Plugin) – The plugin of the method that calculated the score
data (pandas.core.series.Series) – The calculated scores
- interpret(cutoff=None)
Interpret the score using the cutoff.
If the cutoff is None, the vpmbench.data.Score.cutoff() is used to interpret the score. The score is interpreted by replacing all values greater than the cutoff by 1, and all other values by 0.
- Parameters
cutoff – The cutoff
- Returns
The interpreted scores
- Return type
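The cutoff rule amounts to a strict threshold. A minimal sketch, assuming plain Python lists instead of a pandas Series and a hypothetical function name `interpret_scores`:

```python
from typing import List


def interpret_scores(scores: List[float], cutoff: float) -> List[int]:
    """Replace every score strictly greater than the cutoff by 1, all others by 0."""
    return [1 if score > cutoff else 0 for score in scores]


interpreted = interpret_scores([0.1, 0.5, 0.9], cutoff=0.5)
```

Note that a score equal to the cutoff maps to 0 under this reading of "greater than".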
Performance Summaries
- class vpmbench.summaries.ConfusionMatrix
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculates the confusion matrix.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
A dictionary with the following keys: tn - the number of true negatives, fp - the number of false positives, fn - the number of false negatives, tp - the number of true positives
- Return type
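The four counts can be derived from interpreted scores and interpreted classes as follows. This is a standalone sketch with a hypothetical function name; it assumes both inputs are lists of 0 (benign) and 1 (pathogenic) and says nothing about how vpmbench computes the matrix internally.

```python
from typing import Dict, List


def count_outcomes(predicted: List[int], expected: List[int]) -> Dict[str, int]:
    """Count tn, fp, fn, tp for binary predictions against expected classes."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for prediction, truth in zip(predicted, expected):
        if truth == 1:
            counts["tp" if prediction == 1 else "fn"] += 1
        else:
            counts["fp" if prediction == 1 else "tn"] += 1
    return counts


matrix = count_outcomes(predicted=[1, 0, 1, 1, 0], expected=[1, 0, 0, 1, 1])
```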
- class vpmbench.summaries.PerformanceSummary
Represent a performance summary
- class vpmbench.summaries.PrecisionRecallCurve
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculates the precision recall curve.
- Parameters
score – The score from the prioritization method
interpreted_classes – The interpreted classes
- Returns
A dictionary with the following keys: precision - precision values, recall - recall values, thresholds - the thresholds
- Return type
- class vpmbench.summaries.ROCCurve
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculates the ROC curves.
- Parameters
score – The score from the prioritization method
interpreted_classes – The interpreted classes
- Returns
A dictionary with the following keys: fpr - false positive rates, tpr - true positive rates, thresholds - the thresholds
- Return type
Performance Metrics
- class vpmbench.metrics.Accuracy
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculate the accuracy.
Uses a ConfusionMatrix to calculate the accuracy.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The calculated accuracy
- Return type
- class vpmbench.metrics.AreaUnderTheCurveROC
Calculate the area under the ROC curve (AUROC).
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The calculated AUC
- Return type
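For intuition, the AUROC equals the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; `pairwise_auroc` is a hypothetical name and this is not necessarily how the library calculates the value.

```python
from typing import List


def pairwise_auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("Both classes are required to compute the AUROC")
    wins = 0.0
    for positive in positives:
        for negative in negatives:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


auroc = pairwise_auroc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The quadratic loop is fine for illustration; sorting-based implementations are preferable for large data sets.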
- class vpmbench.metrics.Concordance
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculate the concordance, i.e., the sum of true positives and true negatives.
Uses a ConfusionMatrix to calculate the concordance.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The calculated concordance
- Return type
- class vpmbench.metrics.MatthewsCorrelationCoefficient
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculate the Matthews correlation coefficient.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The Matthews correlation coefficient
- Return type
- class vpmbench.metrics.NegativePredictiveValue
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculate the negative predictive value.
Uses a ConfusionMatrix to calculate the negative predictive value.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The calculated negative predictive value
- Return type
- class vpmbench.metrics.PerformanceMetric
Represent a performance metric.
- class vpmbench.metrics.Precision
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculate the precision.
Uses a ConfusionMatrix to calculate the precision/positive predictive value.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The calculated precision
- Return type
- class vpmbench.metrics.Sensitivity
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculate the sensitivity.
Uses a ConfusionMatrix to calculate the sensitivity/true positive rate.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The calculated sensitivity
- Return type
- class vpmbench.metrics.Specificity
- static calculate(score, interpreted_classes, pathogenicity_class_map={'benign': 0, 'pathogenic': 1})
Calculate the specificity.
Uses a ConfusionMatrix to calculate the specificity/true negative rate.
- Parameters
score – The score from the plugin
interpreted_classes – The interpreted classes
- Returns
The calculated specificity
- Return type
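The confusion-matrix-based metrics in this section follow the standard textbook definitions, which can be collected in one place. The function and dictionary keys below are hypothetical and only illustrate the formulas; zero denominators are guarded only for the Matthews correlation coefficient.

```python
import math
from typing import Dict


def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "concordance": tp + tn,
        "precision": tp / (tp + fp),  # positive predictive value
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "negative_predictive_value": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


metrics = metrics_from_counts(tn=40, fp=10, fn=5, tp=45)
```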
Utilities
- vpmbench.utils.plot_confusion_matrices(report, normalize=False, cmap='Blues')
Plot the confusion matrices of the prioritization method from a performance report
Shows the confusion matrices if the vpmbench.summaries.ConfusionMatrix was calculated.
- Parameters
report – The performance report
normalize – If true the values in the confusion matrix are normalized
cmap – The colormap that should be used to plot the confusion matrices
- vpmbench.utils.plot_precision_recall_curves(report)
Plot the precision recall curves using a performance report
Shows the precision recall curves if the vpmbench.summaries.PrecisionRecallCurve was calculated.
- Parameters
report – The performance report
- vpmbench.utils.plot_roc_curves(report)
Plot the ROC curves using a performance report
Shows the ROC curves if the vpmbench.summaries.ROCCurve was calculated.
- Parameters
report – The performance report
- vpmbench.utils.report_metrics(report)
Print the calculated metrics to the terminal
- Parameters
report – The performance report
Processor
- vpmbench.processors.format_input(variant_information_table, target_format, target_file, **kwargs)
Formats the variant information table into the target format and writes the results to the target file.
- Parameters
variant_information_table – The input table
target_format – The format in which data should be written
target_file – The file in which the data should be written
kwargs – Additional arguments that are passed to the converter function
- vpmbench.processors.format_output(variant_information_table, output_format, output_file, **kwargs)
Formats the content of the output file into a dataframe.
- Parameters
variant_information_table – The variant information table used to calculate the results in the output file
output_format – The format of the output file
output_file – The file from which the output should be read
kwargs – Additional arguments
- Returns
The formatted content of the output file
- Return type
DataFrame
Enums
- class vpmbench.enums.ReferenceGenome(value)
Represent reference genomes.
Following values are supported:
HG38
HG19
HG18
HG17
HG16
- static resolve(name)
Resolve string into a reference genome.
The following rules apply:
if "grch38" in name.lower() -> ReferenceGenome.HG38
if "grch37" in name.lower() -> ReferenceGenome.HG19
otherwise: ReferenceGenome(name) is called
- Parameters
name – The string.
- Returns
The reference genome
- Return type
- Raises
RuntimeError – If the name cannot be resolved
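The resolution rules translate directly into an Enum with a static helper. The sketch below uses a hypothetical class `ToyReferenceGenome`; in particular, the lowercase member values are an assumption and may differ from the values used by vpmbench.

```python
from enum import Enum


class ToyReferenceGenome(Enum):
    # Member values are assumed for illustration only.
    HG38 = "hg38"
    HG19 = "hg19"
    HG18 = "hg18"
    HG17 = "hg17"
    HG16 = "hg16"

    @staticmethod
    def resolve(name: str) -> "ToyReferenceGenome":
        lowered = name.lower()
        if "grch38" in lowered:
            return ToyReferenceGenome.HG38
        if "grch37" in lowered:
            return ToyReferenceGenome.HG19
        try:
            # Fall back to a lookup by value.
            return ToyReferenceGenome(name)
        except ValueError as error:
            raise RuntimeError(f"Cannot resolve reference genome: {name}") from error


resolved = ToyReferenceGenome.resolve("GRCh38.p13")
```

Substring matching lets patch-level names such as "GRCh38.p13" resolve to the same member.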
- class vpmbench.enums.VariationType(value)
Represent the variation types of the variants.
Following values are supported:
SNP for single-nucleotide polymorphism
INDEL for insertions or deletions
- static resolve(name)
Return the variation type based on the given string
The following rules apply:
if name.lower() == 'snp' -> VariationType.SNP
if name.lower() == 'indel' -> VariationType.INDEL
- Parameters
name – The string
- Returns
The variation type
- Return type
- Raises
RuntimeError – If the name cannot be resolved
Predicates
- vpmbench.predicates.is_multiclass_plugin(plugin)
Checks whether a plugin is a multi-class plugin.
Configuration
The config module defines the following variables:
- vpmbench.config.DEFAULT_PLUGIN_PATH
The default plugin path where vpmbench searches for plugins.
- Type
- Value
The directory VPMBench-Plugins in the home directory of the current user, e.g., /home/user/VPMBench-Plugins.