Datasources

This section documents the datasources used as input for the static data available in VarFish.

Download and precomputation are performed by the Snakemake workflow in varfish-db-downloader. That git repository uses continuous integration with a reduced dataset for automated testing, supplemented by some small data taken directly from the repository, such as a list of curated microdeletion/-duplication regions from the literature. The reduced dataset is downloaded automatically from the URLs listed in a download_urls.yml file, which keeps the data sources transparent and traceable. In addition, a nightly CI job checks whether the URLs are still available (but not whether the data has changed).
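The nightly availability check described above can be illustrated with a minimal sketch. It is not the code used by varfish-db-downloader: the layout of download_urls.yml is not documented here, so the sketch takes a plain list of URL strings, and the names probe_url, check_urls, and unavailable are hypothetical. Treating a successful HEAD request as "available" is also an assumption. Like the CI job, the sketch only tests reachability and says nothing about whether the data behind a URL has changed.

```python
import http.client
import urllib.request
from typing import Callable, Dict, Iterable, List


def probe_url(url: str, timeout: float = 30.0) -> bool:
    """Return True if `url` answers a HEAD request without an error status.

    Assumption: a HEAD request is an acceptable availability test.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Non-HTTP responses (for example FTP) may carry no status code.
            status = getattr(response, "status", None)
            return status is None or status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, network failures, and malformed URLs.
        return False


def check_urls(
    urls: Iterable[str],
    probe: Callable[[str], bool] = probe_url,
) -> Dict[str, bool]:
    """Map each URL to True (reachable) or False (not reachable)."""
    return {url: probe(url) for url in urls}


def unavailable(results: Dict[str, bool]) -> List[str]:
    """Return the sorted list of URLs that failed the check."""
    return sorted(url for url, ok in results.items() if not ok)
```

The probe is passed in as a parameter so that the check can be exercised without network access, for example by substituting a function that looks URLs up in a fixed table. A CI job built on this sketch would fail whenever unavailable(check_urls(urls)) returns a non-empty list.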

Data in Repository

The following datasources are used directly from the repository.

| Name | License | Synopsis | Source |
|------|---------|----------|--------|
| ACMG SF List v3.1 | Public Domain | Secondary Findings gene list of the ACMG | PMID:35802134 |
| DOMINO | Public Domain | Score for assessing the probability for a gene to harbour dominant changes | Institute of Molecular and Clinical Ophthalmology Basel; PMID:28985496 |
| Enrichment Regions | Public Domain | Target regions of NGS enrichment kits | UCSC Table Browser |
| Patho MMS | Public Domain | Curated regions for microdeletion and microduplication scores | PMID:36435749 |
| sHet | N/A (Emailed Author) | Gene haploinsufficiency score | PMID:31004148 |

Downloaded Data

The following datasources are downloaded from public internet resources.

| Name | License | Synopsis | Source |
|------|---------|----------|--------|
| AlphaMissense | CC BY-NC-SA 4.0 | AlphaMissense score | AlphaMissense |
| CADD Score | free for non-commercial | sequence variant pathogenicity scores | CADD |
| ClinGen | CC0 | clinical gene and genome annotation | ClinGen |
| Comparative Toxicogenomics Database | free for non-commercial | database of biological named entities | CTD |
| dbNSFP academic | suitable for academic use | nonsynonymous variant pathogenicity scores | dbNSFP |
| dbNSFP commercial | suitable for commercial use | nonsynonymous variant pathogenicity scores | dbNSFP |
| dbSNP | no restrictions | Sequence variants from dbSNP | NCBI dbSNP |
| dbVar | no restrictions | Structural variants from dbVar | NCBI dbVar |
| Database of Genomic Variants (DGV) | no restrictions | Structural variants from DGV | The Centre for Applied Genomics |
| DECIPHER HI | N/A (Emailed Author) | DECIPHER haploinsufficiency score | PMID:20976243 |
| ENSEMBL | no restrictions | ENSEMBL gene/genome annotation and transcripts | ENSEMBL |
| ExAC CNVs | no restrictions | Copy number variants from ExAC | gnomAD |
| GenomicsEngland PanelApp | non-commercial | Gene panels with disease associations from Genomics England | GenomicsEngland |
| gnomAD exomes and genomes | no restrictions | sequence and structural variants, gene constraint scores | gnomAD |
| GTEx | free | tissue-specific gene expression | GTEx |
| HelixMtDb | N/A (Emailed Author) | mitochondrial genome frequencies | HelixMtDb |
| HGNC | CC0 | gene information | HGNC |
| HPO | free | Human Phenotype Ontology | HPO |
| Human Disease Ontology (DO) | CC0 | ontology of human diseases | Disease Ontology |
| MONDO | CC BY 4.0 | Mondo Disease Ontology | OBO Foundry |
| NCBI ClinVar | no restrictions | clinical variant interpretation | NCBI ClinVar |
| NCBI Gene | no restrictions | gene information | NCBI Gene |
| NCBI mim2gene | no restrictions | gene-disease associations | NCBI MedGen |
| NCBI RefSeq | no restrictions | gene/genome annotation and transcripts | NCBI RefSeq |
| OMIM titles | restricted | some OMIM disease names are contained in other databases such as HPO | misc. other datasources |
| ORDO | CC BY 4.0 | Orphanet Rare Disease Ontology | BioOntology.org |
| Orphadata | CC BY 4.0 | Orphanet disease-gene associations | Orphadata |
| rCNV Score | no restrictions | dosage sensitivity score | PMID:35917817 |
| TAD annotation | N/A (Emailed Author) | Topologically Associating Domains annotation | YUE Lab |
| 1000G SV map | Fort Lauderdale Agreement | structural variants from the 1000 Genomes Project phase 3 | IGSR |
| UCSC assembly-related tracks | no restrictions | assembly-related tracks: genomicSuperDups, rmsk, altSeqLiftOverPsl, fixSeqLiftOverPsl, multiz100way | UCSC Table Browser |