Data Download

You can get all data from Open Targets Genetics via:

EMBL-EBI FTP, database Open Targets Genetics
Google BigQuery, project open-targets-genetics:genetics
Google Cloud Storage (GCS) Requester Pays public bucket

Please note that if you download the data from Google Cloud Storage, all charges incurred on the bucket open-targets-genetics-releases are billed to you, the requester.

See the Google Cloud Storage documentation on the Requester Pays feature for details.
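
For a Requester Pays bucket, gsutil accepts a top-level -u option that names the Google Cloud project to bill. The following is a minimal sketch, not an official recipe: the helper names ot_dataset_url and ot_download are hypothetical, and the billing project ID is a placeholder you would replace with your own. Only the bucket name comes from this page.

```shell
# Bucket name as given on this page; the helpers below are illustrative only.
OT_BUCKET="open-targets-genetics-releases"

# Build the gs:// URL of a dataset folder within a release.
# Usage: ot_dataset_url RELEASE FOLDER
ot_dataset_url() {
    printf 'gs://%s/%s/%s\n' "$OT_BUCKET" "$1" "$2"
}

# Copy a dataset folder into a local directory, billing the given project.
# Usage: ot_download BILLING_PROJECT RELEASE FOLDER DEST_DIR
# ASSUMPTION: BILLING_PROJECT is the ID of your own Google Cloud project.
ot_download() {
    mkdir -p "$4" &&
    gsutil -u "$1" -m cp -r "$(ot_dataset_url "$2" "$3")" "$4"
}
```

For example, ot_download my-billing-project 19.03.04 lut/study-index ./data would copy the study index of release 19.03.04 into ./data, where my-billing-project is a placeholder project ID.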

Data Schema

The table below lists each dataset with its format and corresponding data schemas.

Where required, change the tags in the schema link URLs to match the corresponding tables.

Folder name	Format	Spark Schema	SQL Schema (ClickHouse Dialect)
variant-index	parquet	-	-
v2g	jsonl	schema link	schema link
v2d	jsonl	schema link	schema link
d2v2g	jsonl	-	schema link
lut/genes-index	jsonl	-	schema link
lut/overlap-index	jsonl	-	schema link
lut/study-index	jsonl	-	schema link
lut/variant-index	jsonl	-	schema link

Some Tips

Previewing datasets from the GCS bucket

The gsutil command can be used to preview datasets prior to downloading:

gsutil cat 'gs://open-targets-genetics-releases/19.03.04/lut/variant-index/part-*' | head -1 | jq .
{
  "chr_id": "1",
  "position": 55545,
  "ref_allele": "C",
  "alt_allele": "T",
  "rs_id": "rs28396308",
  "most_severe_consequence": "downstream_gene_variant",
  "gene_id_any_distance": 13546,
  "gene_id_any": "ENSG00000186092",
  "gene_id_prot_coding_distance": 13546,
  "gene_id_prot_coding": "ENSG00000186092",
  "raw": 0.028059,
  "phred": 3.065,
  "gnomad_afr": 0.3264216148287779,
  "gnomad_amr": 0.4533582089552239,
  "gnomad_asj": 0.26666666666666666,
  "gnomad_eas": 0.35822021116138764,
  "gnomad_fin": 0.31313131313131315,
  "gnomad_nfe": 0.26266330506532204,
  "gnomad_nfe_est": 0.3397858319604613,
  "gnomad_nfe_nwe": 0.23609443777511005,
  "gnomad_nfe_onf": 0.2256,
  "gnomad_nfe_seu": 0.1,
  "gnomad_oth": 0.27403846153846156
}
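
When a full record is more than you need, jq can also reduce each line to a few columns. The sketch below assumes jq is installed and uses only field names visible in the record above. The function name variant_summary is hypothetical.

```shell
# Read JSON lines on stdin and print chromosome, position, alleles and rsID
# as tab-separated values, one variant per line.
variant_summary() {
    jq -r '[.chr_id, .position, .ref_allele, .alt_allele, .rs_id] | @tsv'
}
```

It can be placed at the end of the same pipeline, for example: gsutil cat 'gs://open-targets-genetics-releases/19.03.04/lut/variant-index/part-*' | head -5 | variant_summary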

Loading data into a ClickHouse instance

An initial bash script is available for loading all the data into a ClickHouse instance. It contains lines like these:

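echo create studies tables
# Run the SQL that sets up the ot.studies_log staging table
clickhouse-client -m -n < studies_log.sql
# Stream the JSON-lines files from the bucket straight into the staging table
gsutil cat "${base_path}/lut/study-index/part-*" | clickhouse-client -h 127.0.0.1 --query="insert into ot.studies_log format JSONEachRow "
# Run the SQL for the final studies table
clickhouse-client -m -n < studies.sql
# Remove the staging table once it is no longer needed
clickhouse-client -m -n -q "drop table ot.studies_log;"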
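
The four steps above follow a pattern that could be reviewed before anything is executed. The sketch below is a hypothetical helper, print_load_steps, that only prints the commands for one table. It assumes that files named TABLE_log.sql and TABLE.sql exist and that the staging table is called ot.TABLE_log, as in the studies example. Other tables in the actual script may not follow this naming.

```shell
# Print (do not run) the load steps for one table, mirroring the studies example.
# Usage: print_load_steps BASE_PATH FOLDER TABLE
# ASSUMPTION: TABLE_log.sql, TABLE.sql and ot.TABLE_log follow the studies naming.
print_load_steps() {
    # Step 1: set up the staging table
    printf 'clickhouse-client -m -n < %s_log.sql\n' "$3"
    # Step 2: stream the JSON-lines parts into the staging table
    printf 'gsutil cat "%s/%s/part-*" | clickhouse-client -h 127.0.0.1 --query="insert into ot.%s_log format JSONEachRow"\n' "$1" "$2" "$3"
    # Step 3: build the final table
    printf 'clickhouse-client -m -n < %s.sql\n' "$3"
    # Step 4: drop the staging table
    printf 'clickhouse-client -m -n -q "drop table ot.%s_log;"\n' "$3"
}
```

Calling print_load_steps "${base_path}" lut/study-index studies prints the same four commands as the excerpt, which makes it easy to check them against the real script before running anything.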
