Open Targets Genetics Documentation

Data Download

Last updated 1 year ago

You can get all data from Open Targets Genetics via:

  • EMBL-EBI FTP, Open Targets Genetics database

  • Google BigQuery, open-targets-genetics:genetics project

  • Google Cloud Storage (GCS), paywalled public bucket

Please note that if you download this data from Google Cloud Storage, all charges incurred on the open-targets-genetics-releases bucket will be billed to the requester.

Please refer to the Requester Pays feature of Google Cloud Storage for more detail.
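Because the bucket bills the requester, gsutil needs to be told which project to charge, which its top-level -u option does. The following is a minimal sketch rather than an official recipe: my-billing-project and the destination folder are placeholders, the release tag is the one used in the preview example on this page, and the helper names are hypothetical. By default it only prints the command so you can check it before incurring any charges.

```shell
# Placeholder values: replace with your own billing project.
BILLING_PROJECT="my-billing-project"   # assumption: a GCP project you are allowed to bill
RELEASE="19.03.04"                     # release tag from the preview example on this page
BUCKET="gs://open-targets-genetics-releases"

# Build the bucket path of one dataset folder (hypothetical helper).
dataset_url() {
  printf '%s/%s/%s\n' "$BUCKET" "$RELEASE" "$1"
}

# Print the copy command (default), or execute it when RUN=1.
download_dataset() {
  src="$(dataset_url "$1")"
  dest="$2"
  if [ "${RUN:-0}" = "1" ]; then
    mkdir -p "$dest"
    gsutil -u "$BILLING_PROJECT" -m cp -r "$src" "$dest"
  else
    echo gsutil -u "$BILLING_PROJECT" -m cp -r "$src" "$dest"
  fi
}

download_dataset lut/study-index ./data
```

Running the same call with RUN=1 set in the environment performs the download and bills the transfer to the project named in BILLING_PROJECT.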

Data Schema

The datasets, with their corresponding data schemas, are listed below.

Please change the URL tags to their corresponding tables, stated above, as required.

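| Folder name | Format | Spark Schema | SQL Schema (ClickHouse Dialect) |
| --- | --- | --- | --- |
| variant-index | parquet | - | - |
| v2g | jsonl | schema link | schema link |
| v2d | jsonl | schema link | schema link |
| d2v2g | jsonl | - | schema link |
| lut/genes-index | jsonl | - | schema link |
| lut/overlap-index | jsonl | - | schema link |
| lut/study-index | jsonl | - | schema link |
| lut/variant-index | jsonl | - | schema link |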
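When scripting against several folders it can help to look up each folder's format first, since the parquet folder cannot be read with line-oriented tools such as jq. The sketch below encodes only the Folder name and Format columns listed above; dataset_format is a hypothetical helper name.

```shell
# Return the file format of a dataset folder, as listed in the table above.
dataset_format() {
  case "$1" in
    variant-index)
      echo parquet ;;
    v2g|v2d|d2v2g|lut/genes-index|lut/overlap-index|lut/study-index|lut/variant-index)
      echo jsonl ;;
    *)
      echo "unknown dataset: $1" >&2
      return 1 ;;
  esac
}

# Print folder name and format for a few folders.
for d in variant-index v2g lut/study-index; do
  printf '%s\t%s\n' "$d" "$(dataset_format "$d")"
done
```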

Some Tips

Previewing datasets from the GCS bucket

The gsutil command can be used to preview datasets prior to downloading:

gsutil cat 'gs://open-targets-genetics-releases/19.03.04/lut/variant-index/part-*' | head -1 | jq .
{
  "chr_id": "1",
  "position": 55545,
  "ref_allele": "C",
  "alt_allele": "T",
  "rs_id": "rs28396308",
  "most_severe_consequence": "downstream_gene_variant",
  "gene_id_any_distance": 13546,
  "gene_id_any": "ENSG00000186092",
  "gene_id_prot_coding_distance": 13546,
  "gene_id_prot_coding": "ENSG00000186092",
  "raw": 0.028059,
  "phred": 3.065,
  "gnomad_afr": 0.3264216148287779,
  "gnomad_amr": 0.4533582089552239,
  "gnomad_asj": 0.26666666666666666,
  "gnomad_eas": 0.35822021116138764,
  "gnomad_fin": 0.31313131313131315,
  "gnomad_nfe": 0.26266330506532204,
  "gnomad_nfe_est": 0.3397858319604613,
  "gnomad_nfe_nwe": 0.23609443777511005,
  "gnomad_nfe_onf": 0.2256,
  "gnomad_nfe_seu": 0.1,
  "gnomad_oth": 0.27403846153846156
}
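Once a record has been previewed, jq can also pull out just the fields of interest. The sketch below works on a shortened copy of the record shown above so that it runs without bucket access. The chr_pos_ref_alt label and both helper names are illustrative choices, not something this page defines.

```shell
# A shortened copy of the lut/variant-index record from the preview above.
record='{"chr_id":"1","position":55545,"ref_allele":"C","alt_allele":"T","rs_id":"rs28396308","gnomad_afr":0.3264216148287779,"gnomad_nfe_seu":0.1}'

# Join chromosome, position and alleles into one label per record.
variant_label() {
  jq -r '[.chr_id, (.position | tostring), .ref_allele, .alt_allele] | join("_")'
}

# List the names of the gnomAD frequency fields present in each record.
gnomad_fields() {
  jq -r 'keys_unsorted[] | select(startswith("gnomad_"))'
}

printf '%s\n' "$record" | variant_label   # prints 1_55545_C_T
printf '%s\n' "$record" | gnomad_fields
```

Both helpers read a stream of JSON objects from standard input, so the output of gsutil cat piped through head can be fed to them in place of the sample record.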

Loading data into a ClickHouse instance

There is an initial bash script you can use to load all the data into a ClickHouse instance. In that script, you will find lines like these:

echo create studies tables
# Create the staging table ot.studies_log
clickhouse-client -m -n < studies_log.sql
# Stream the JSONL parts from the bucket into the staging table
gsutil cat "${base_path}/lut/study-index/part-*" | clickhouse-client -h 127.0.0.1 --query="insert into ot.studies_log format JSONEachRow "
# Create the final studies tables
clickhouse-client -m -n < studies.sql
# Drop the staging table once it is no longer needed
clickhouse-client -m -n -q "drop table ot.studies_log;"
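The studies excerpt follows a four-step pattern: create a staging table, stream the JSONL parts into it, create the final table, then drop the staging table. The sketch below wraps that pattern in a function as an illustration only. The assumption that every dataset has a matching pair of `<name>_log.sql` and `<name>.sql` files is mine, as are the base_path value and the helper names. It defaults to a dry run that prints each command instead of executing it.

```shell
# Assumption: base_path points at a release folder in the bucket.
base_path="gs://open-targets-genetics-releases/19.03.04"

# Print one command line (DRY_RUN=1, the default) or execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$1"
  else
    sh -c "$1"
  fi
}

# load_dataset FOLDER NAME: stage, insert, finalise, clean up.
# NAME_log.sql and NAME.sql are hypothetical file names modelled on the studies example.
load_dataset() {
  folder="$1"
  name="$2"
  run "clickhouse-client -m -n < ${name}_log.sql" &&
  run "gsutil cat '${base_path}/${folder}/part-*' | clickhouse-client -h 127.0.0.1 --query='insert into ot.${name}_log format JSONEachRow'" &&
  run "clickhouse-client -m -n < ${name}.sql" &&
  run "clickhouse-client -m -n -q 'drop table ot.${name}_log;'"
}

load_dataset lut/study-index studies
```

Chaining the steps with && stops the sequence at the first failing command, so the staging table is not dropped if the final table could not be created.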
