What Technologies Do We Use?

Phase 1 of the pipeline prepares the input data (V2D, V2G and summary statistics tables) in a standardised way. Workflows are written in Python and run with the Snakemake workflow management system to ensure analyses are reproducible and portable. Workflows run either on a Google Compute Engine instance or on the Sanger Institute cluster, and their output is stored in Google Cloud Storage (GCS).
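As a rough illustration of how such a workflow is defined, the following is a minimal Snakefile sketch; the rule, file and script names are hypothetical and do not correspond to the actual Open Targets Genetics workflows.

```python
# Hypothetical Snakefile sketch. File names, rule names and the script path
# are illustrative only, not the real Open Targets Genetics workflow.

rule all:
    input:
        "output/v2d/top_loci.parquet"

rule format_top_loci:
    """Standardise a raw GWAS top-loci table into the V2D input format."""
    input:
        "input/raw_gwas_top_loci.tsv"
    output:
        "output/v2d/top_loci.parquet"
    shell:
        "python scripts/format_top_loci.py --in {input} --out {output}"
```

Running `snakemake` against such a file rebuilds only the outputs whose inputs have changed, which is what makes the Phase 1 analyses reproducible and portable across the Google Compute Engine and Sanger environments.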

Phase 2 of the pipeline processes and merges the input data to produce evidence linking traits to variants and variants to genes. This merging pipeline is written in Scala and Spark and runs on a Google Dataproc cluster, which scales automatically to accommodate the volume of data. The output tables are written as JSON files streamed directly to a Google Cloud Storage bucket before being loaded into a ClickHouse database.
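The sketch below illustrates the shape of this merge step using PySpark; the production pipeline is written in Scala, and the bucket paths and column names here are assumptions for illustration only.

```python
# Illustrative PySpark sketch of the Phase 2 merge. The production pipeline
# is written in Scala; paths and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("v2d-v2g-merge").getOrCreate()

# Read the standardised Phase 1 outputs from Google Cloud Storage.
v2d = spark.read.parquet("gs://example-bucket/v2d/")
v2g = spark.read.parquet("gs://example-bucket/v2g/")

# Join variant-to-disease and variant-to-gene evidence on the variant ID,
# producing trait -> variant -> gene links.
d2v2g = v2d.join(v2g, on="variant_id", how="inner")

# Stream the result out as JSON, ready to be loaded into ClickHouse.
d2v2g.write.mode("overwrite").json("gs://example-bucket/d2v2g/")
```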

The infrastructure used to serve the data to the front-end runs on Google Cloud. It elastically accommodates unpredictable, global-scale demand while keeping DevOps work to a minimum. Requests are routed by a globally distributed load balancer to the nearest geo-localised zone; the service is currently deployed across three main regions: Asia (northeast), Europe (west) and USA (east). In each region, the infrastructure described below is maintained (see the sketch after the list):

  • an auto-scalable group of API instances, which interpret GraphQL queries and serve the required data from

  • another auto-scalable group of high-performance ClickHouse DB instances, reached through

  • an internal regional TCP load balancer, which keeps the set of ClickHouse nodes running at any given time transparent and highly available to the API.
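To illustrate the data path described above, the following minimal Python sketch shows how an API instance might query ClickHouse through the internal load balancer; the hostname, database and table names are hypothetical, and the production API is implemented in Scala.

```python
# Minimal sketch of the API-to-ClickHouse path. The hostname, database and
# table are hypothetical; the production API is written in Scala.
from clickhouse_driver import Client

# Connect via the regional TCP load balancer rather than a specific node,
# so individual ClickHouse instances can come and go transparently.
client = Client(host="clickhouse-internal-lb.example.internal")

rows = client.execute(
    "SELECT study_id, pval FROM genetics.v2d WHERE variant_id = %(v)s LIMIT 10",
    {"v": "1_154453788_C_T"},
)
print(rows)
```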

The front-end is written with the React JavaScript library and is hosted on Netlify.

The API, which also acts as a playground where you can interactively execute GraphQL queries against real data, is written in Scala with the Play framework, using Sangria as the server-side GraphQL implementation.
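A GraphQL query can be sent to the API from any HTTP client; the sketch below uses Python's requests library. The endpoint URL and the geneInfo fields are assumptions based on the published schema, so check the interactive playground for the current schema before relying on them.

```python
# Example of querying the GraphQL API with the requests library.
# The endpoint URL and the geneInfo query fields are assumptions; the
# interactive playground always exposes the current schema.
import requests

query = """
query geneInfoQuery($geneId: String!) {
  geneInfo(geneId: $geneId) {
    id
    symbol
    chromosome
  }
}
"""

response = requests.post(
    "https://api.genetics.opentargets.org/graphql",
    json={"query": query, "variables": {"geneId": "ENSG00000157764"}},
)
response.raise_for_status()
print(response.json())
```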

Figure: Diagram outlining technologies used for Open Targets Genetics.