Dataflows

This section describes the dataflows in the VarFish system. We split the description into the following parts.

  1. Bulk Data Preparation describes the dataflow for preparing the bulk background data used by the Backing Services described in Architecture.

  2. Annotation Process / Import describes the annotation process that prepares variant VCF files for import into VarFish and the import itself.

  3. Query Processing describes how the VarFish Server handles queries.

  4. Periodic Background Tasks describes the dataflows driven by the periodic background tasks.

  5. User Interaction describes the remaining dataflows triggered by user interaction.

Bulk Data Preparation

There are three parts to the bulk data preparation, depicted below.

%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart LR
    publicData[Public\nData Sources]
    varfishDbDownloader[Varfish DB\nDownloader]
    s3Server[S3 Server]
    deployedServer[Deployed\nInstance]
    ncbiClinvar[NCBI ClinVar]
    clinvarDataJsonl[clinvar-data-jsonl]
    clinvarDataJsonlGithub[GitHub\nReleases]
    ensembl[ENSEMBL]
    refseq[RefSeq]
    cdot[CDOT]
    mehariDataTx[mehari-data-tx]
    mehariDataTxGithub[GitHub\nReleases]

    publicData --> varfishDbDownloader
    varfishDbDownloader --> s3Server
    s3Server --> deployedServer
    ncbiClinvar --> clinvarDataJsonl
    clinvarDataJsonl --> clinvarDataJsonlGithub
    clinvarDataJsonlGithub --> deployedServer
    ensembl --> cdot
    refseq --> cdot
    cdot --> mehariDataTx
    mehariDataTx --> mehariDataTxGithub
    mehariDataTxGithub --> deployedServer

First, for most of the data, we use a Snakemake workflow (called varfish-db-downloader) that downloads the necessary public domain data from the internet. The workflow then processes the data and creates the bulk data files that can be used by the Backing Services.

The workflow is executed manually by the VarFish team. The results are uploaded to our public S3 servers. On deployment, the files are downloaded by downloader/installer scripts that the team provides.
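
As an illustration of such a downloader script, the following Python sketch fetches prepared bulk data files from an S3-style HTTP endpoint. The base URL and file names are placeholders and do not reflect the actual layout of the public VarFish S3 server.

    import pathlib
    import urllib.request

    # Placeholder base URL and file names; the real public S3 layout differs.
    BASE_URL = "https://example-s3-server.example.org/varfish-data"
    FILES = [
        "annonars/frequencies.tar.gz",  # hypothetical bulk data archive
        "annonars/genes.tar.gz",        # hypothetical bulk data archive
    ]
    TARGET_DIR = pathlib.Path("/srv/varfish/data")


    def download_bulk_data() -> None:
        """Download the prepared bulk data files into the target directory."""
        for name in FILES:
            dest = TARGET_DIR / name
            dest.parent.mkdir(parents=True, exist_ok=True)
            url = f"{BASE_URL}/{name}"
            print(f"downloading {url} -> {dest}")
            urllib.request.urlretrieve(url, dest)


    if __name__ == "__main__":
        download_bulk_data()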

The workflow features a continuous integration test mode where file excerpts are used for smoke testing the functionality of the workflow. Further, the continuous integration checks the availability of the upstream files. Using a Snakemake workflow together with a conda environment for dependencies allows for reproducible data preparation.

ClinVar data is prepared differently. Here, the software clinvar-this converts ClinVar XML files into JSON lines (JSONL) format. These JSONL files can then be processed by the software packages also used in the Backing Services. The GitHub repository clinvar-data-jsonl hosts continuous integration that downloads the weekly ClinVar releases, uses clinvar-this to transform the XML files to JSONL, and finally publishes them as GitHub software releases. A third GitHub repository, annonars-data-clinvar, uses the output of clinvar-data-jsonl to prepare the per-gene aggregations and per-variant ClinVar files used by the Annonars Backing Service. These files are installed on deployment and can later be updated.
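
To illustrate the kind of downstream processing, the following Python sketch aggregates JSONL records per gene. The field names (gene_symbol, classification) are assumptions for illustration and do not reflect the actual clinvar-this record schema.

    import gzip
    import json
    from collections import Counter, defaultdict

    # NOTE: "gene_symbol" and "classification" are assumed field names for
    # illustration; they are not the actual clinvar-this JSONL schema.


    def aggregate_per_gene(path: str) -> dict[str, Counter]:
        """Count classification labels per gene from a gzipped JSONL file."""
        per_gene: dict[str, Counter] = defaultdict(Counter)
        with gzip.open(path, mode="rt") as inputf:
            for line in inputf:
                record = json.loads(line)
                gene = record.get("gene_symbol")
                classification = record.get("classification")
                if gene and classification:
                    per_gene[gene][classification] += 1
        return dict(per_gene)


    if __name__ == "__main__":
        for gene, counter in sorted(aggregate_per_gene("clinvar.jsonl.gz").items()):
            print(gene, dict(counter))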

Transcript data is also prepared differently. We use the output of the third-party CDOT project, which provides RefSeq and ENSEMBL transcripts. The CI in the GitHub project mehari-data-tx downloads the transcripts from the CDOT releases and fetches the corresponding sequences from the NCBI and ENSEMBL servers. It then prepares the transcript data files for the genome releases with the Mehari software. The resulting files are then also published as GitHub software releases. As for the ClinVar files, these files are installed on deployment and can later be updated.
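
As a rough sketch of consuming such transcript data, the snippet below reads a cdot-style JSON file and lists the transcript accessions. The assumption of a top-level "transcripts" mapping is based on the cdot release format and should be verified against the actual cdot/mehari-data-tx files.

    import gzip
    import json

    # Assumption: the cdot JSON release contains a top-level "transcripts"
    # mapping from transcript accession to transcript record; verify this
    # against the actual files before relying on it.


    def list_transcript_accessions(path: str) -> list[str]:
        """Return the transcript accessions contained in a cdot JSON file."""
        with gzip.open(path, mode="rt") as inputf:
            data = json.load(inputf)
        return sorted(data.get("transcripts", {}))


    if __name__ == "__main__":
        accessions = list_transcript_accessions("cdot-refseq-grch38.json.gz")
        print(f"{len(accessions)} transcripts, e.g. {accessions[:3]}")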

Annotation Process / Import

Variant callers create variant call format (VCF) files that first must be annotated into tab-separated values (TSV) files before they can be imported into VarFish. For this, we use the Mehari software. Mehari uses population frequency and transcript data files generated by the Bulk Data Preparation step that must be downloaded once.

%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart LR
    freqTx[Frequency /\nTranscript Data]
    vcf[Seqvar/Strucvar\nVCF Files]
    mehariAnnotate[Mehari Annotate]
    tsv[Annotated TSV File]
    varfishCli[VarFish CLI]
    varfishServer[VarFish Server]
    postgres[(Postgres)]
    importJob[ImportJob]

    freqTx --> mehariAnnotate
    vcf --> mehariAnnotate
    mehariAnnotate --> tsv
    tsv --> varfishCli
    varfishCli --> varfishServer
    varfishServer -- "(1) store data" --> postgres
    varfishServer -- "(2) create job" --> importJob
    postgres -- "(3) load data" --> importJob
    importJob -- "(4) write final" --> postgres

The VarFish operator then uses Mehari to annotate and aggregate the sequence and structural variant VCF files into one TSV file per variant type (seqvar/strucvar). These files are then uploaded via the VarFish Command Line Interface (CLI), as sketched below.
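
A hedged sketch of this annotate-and-upload step follows. The subcommand and flag names are invented placeholders; the real invocations must be taken from the Mehari and VarFish CLI documentation.

    import subprocess

    # NOTE: Subcommands and flags below are invented placeholders for
    # illustration only; consult the Mehari and VarFish CLI documentation
    # for the actual invocations.


    def annotate_and_upload(vcf: str, tsv: str, project_uuid: str) -> None:
        """Annotate a seqvar VCF with Mehari, then upload the TSV via varfish-cli."""
        subprocess.run(
            ["mehari", "annotate", "seqvars", "--input", vcf, "--output", tsv],
            check=True,
        )
        subprocess.run(
            ["varfish-cli", "case", "import", "--project", project_uuid, tsv],
            check=True,
        )


    if __name__ == "__main__":
        annotate_and_upload("family.vcf.gz", "family.seqvars.tsv.gz", "<project-uuid>")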

The VarFish Server stores the uploaded data in the Postgres database and creates a background job for importing the data. When the import job is run, it will perform certain data processing such as computing quality control metrics and performing fingerprinting of the variant data to allow checking for family relationships. The resulting data is then stored in the final location in the Postgres database where it is available to the user.
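
The fingerprinting idea can be illustrated with the following minimal Python sketch. This is not VarFish's actual algorithm; it only demonstrates how per-sample genotype fingerprints can be compared to check expected family relationships.

    # Minimal illustration of variant fingerprinting for relatedness checks;
    # this is not VarFish's actual algorithm.

    Fingerprint = frozenset[tuple[str, int, str]]


    def fingerprint(genotypes: dict[tuple[str, int], str]) -> Fingerprint:
        """Reduce a sample's genotypes at selected sites to a comparable set."""
        return frozenset((chrom, pos, gt) for (chrom, pos), gt in genotypes.items())


    def similarity(a: Fingerprint, b: Fingerprint) -> float:
        """Jaccard similarity between two fingerprints (1.0 = identical)."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)


    if __name__ == "__main__":
        father = fingerprint({("1", 1000): "0/1", ("2", 2000): "1/1"})
        child = fingerprint({("1", 1000): "0/1", ("2", 2000): "0/1"})
        print(f"similarity: {similarity(father, child):.2f}")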

Query Processing

Query processing is straightforward and the same for seqvar and strucvar queries.

%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart LR
    frontend[Frontend]
    varfishServer[VarFish Server]
    queryJob[Query Job]
    postgres[(Postgres)]

    frontend -- "(1.1) launch query" --> varfishServer
    frontend -- "(1.2) poll for query state" --> varfishServer
    varfishServer -- "(3) fetch results" --> frontend
    varfishServer -- "create job" --> queryJob
    queryJob -- "(2) execute query" --> postgres
    postgres -- "(3) query results" --> queryJob
    queryJob -- "(4) store result table" --> postgres
    varfishServer -- "(1.2) check state" --> postgres
    postgres -- "(3) fetch results" --> varfishServer

The user creates a new query in the frontend provided by the VarFish Server. The server creates a query background job with the query specification for execution in the background.

When the job is executed, it loads the query specification, generates a Postgres SQL query, and executes it. The resulting rows are inserted into the query results table for use by the user.
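
A minimal sketch of this pattern with psycopg2 is shown below; the table and column names are hypothetical and do not match the actual VarFish schema.

    import psycopg2

    # Hypothetical table and column names; the real VarFish schema differs.


    def run_query_job(dsn: str, query_id: int, case_id: int, max_freq: float) -> None:
        """Execute a generated seqvar query and materialize its result rows."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            # (2) execute the generated SQL query and (4) store the result
            # table in a single INSERT ... SELECT statement.
            cur.execute(
                """
                INSERT INTO query_result_row (query_id, variant_id)
                SELECT %s, v.id
                FROM smallvariant v
                WHERE v.case_id = %s AND v.gnomad_frequency <= %s
                """,
                (query_id, case_id, max_freq),
            )
            cur.execute("UPDATE query SET state = 'done' WHERE id = %s", (query_id,))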

The frontend polls the server for the state of the query. When the query is complete, the data is loaded into the frontend for interaction by the user.
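
The launch/poll/fetch interaction can be sketched as follows; the endpoint paths and JSON fields are placeholders, not the actual VarFish REST API.

    import time

    import requests

    # NOTE: Endpoint paths and JSON fields are placeholders, not the real
    # VarFish REST API; they only illustrate the launch/poll/fetch pattern.
    BASE = "https://varfish.example.org/api"


    def run_query_and_fetch(session: requests.Session, case_uuid: str, settings: dict) -> list:
        # (1.1) launch the query
        query = session.post(f"{BASE}/query/{case_uuid}/", json=settings).json()
        # (1.2) poll for the query state until the background job has finished
        while True:
            state = session.get(f"{BASE}/query/{query['uuid']}/status/").json()["state"]
            if state in ("done", "failed"):
                break
            time.sleep(2)
        # (3) fetch the result rows for display in the frontend
        return session.get(f"{BASE}/query/{query['uuid']}/results/").json()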

Periodic Background Tasks

There are a number of background tasks that work on the database. The most important maintenance task rebuilds the in-house background database. This is currently done by re-creating a materialized view in the Postgres database.
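
For illustration, rebuilding such a background database boils down to a single Postgres statement; the view name below is a placeholder.

    import psycopg2

    # The view name is a placeholder; REFRESH MATERIALIZED VIEW is the
    # Postgres mechanism for rebuilding the in-house background database.


    def rebuild_inhouse_db(dsn: str) -> None:
        """Rebuild the in-house background database from the current variant data."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("REFRESH MATERIALIZED VIEW inhouse_variant_summary")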

User Interaction

Besides query processing, the user can interact with the system in various ways. This interactive work leads to transactional/atomic updates in the database, e.g., when editing properties of a case or annotating case members with HPO terms. These operations appear blocking to the client and are not run as background tasks.
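
Assuming a Django-style ORM, such a blocking, transactional update looks like the sketch below; the app, model, and field names are hypothetical.

    from django.db import transaction

    from cases.models import Case  # hypothetical app and model


    def update_case(case_uuid: str, new_name: str, hpo_terms: list[str]) -> None:
        """Apply the user's edits as a single atomic database transaction."""
        with transaction.atomic():
            case = Case.objects.select_for_update().get(uuid=case_uuid)
            case.name = new_name
            case.save()
            for member in case.members.all():  # hypothetical related manager
                member.hpo_terms = hpo_terms
                member.save()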