Data Download
You can get all data from Open Targets Genetics via:
EMBL-EBI FTP, database Open Targets Genetics
Google BigQuery, project open-targets-genetics:genetics
Google Cloud Storage (GCS), Requester Pays public bucket
Please note that if you download the data from Google Cloud Storage, all charges incurred on the bucket open-targets-genetics-releases are billed to the requester.
Refer to the Requester Pays feature of Google Cloud Storage for more detail.
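As a minimal sketch of what a Requester Pays request can look like, the snippet below lists the bucket while naming a project to bill through gsutil's top-level -u option. The project ID my-billing-project and the run helper are hypothetical placeholders, not part of the Open Targets tooling. By default the helper only prints the command, so nothing is billed until DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder: replace with the ID of a project you can bill.
BILLING_PROJECT="${BILLING_PROJECT:-my-billing-project}"
BUCKET="gs://open-targets-genetics-releases"

# Print the command when DRY_RUN=1 (the default here); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# -u names the project that is billed for this Requester Pays request.
run gsutil -u "$BILLING_PROJECT" ls "$BUCKET/"
```

Run it with DRY_RUN=0 and BILLING_PROJECT set to your own project once the printed command looks right.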
Data Schema
The datasets and their corresponding data schemas are listed below.
Folder name | Format | Spark Schema | SQL Schema (ClickHouse dialect)
variant-index | parquet | - | -
Some Tips
Previewing datasets from the GCS bucket
The gsutil command can be used to preview datasets before downloading them:
gsutil cat 'gs://open-targets-genetics-releases/19.03.04/lut/variant-index/part-*' | head -1 | jq .
{
"chr_id": "1",
"position": 55545,
"ref_allele": "C",
"alt_allele": "T",
"rs_id": "rs28396308",
"most_severe_consequence": "downstream_gene_variant",
"gene_id_any_distance": 13546,
"gene_id_any": "ENSG00000186092",
"gene_id_prot_coding_distance": 13546,
"gene_id_prot_coding": "ENSG00000186092",
"raw": 0.028059,
"phred": 3.065,
"gnomad_afr": 0.3264216148287779,
"gnomad_amr": 0.4533582089552239,
"gnomad_asj": 0.26666666666666666,
"gnomad_eas": 0.35822021116138764,
"gnomad_fin": 0.31313131313131315,
"gnomad_nfe": 0.26266330506532204,
"gnomad_nfe_est": 0.3397858319604613,
"gnomad_nfe_nwe": 0.23609443777511005,
"gnomad_nfe_onf": 0.2256,
"gnomad_nfe_seu": 0.1,
"gnomad_oth": 0.27403846153846156
}
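When the full record is more than you need, the same pipeline can be narrowed to a few fields. The sketch below assumes the data is newline-delimited JSON, as in the preview above. The helper names first_records and pick_fields are hypothetical, and the chosen fields are simply taken from the sample record.

```shell
#!/usr/bin/env bash
# Keep only the first N newline-delimited JSON records from stdin (default 1).
first_records() {
  head -n "${1:-1}"
}

# Project each record onto a few fields seen in the sample record,
# printing one compact JSON object per line.
pick_fields() {
  jq -c '{chr_id, position, ref_allele, alt_allele, rs_id}'
}

# Example (needs gsutil, jq and access to the bucket):
#   gsutil cat 'gs://open-targets-genetics-releases/19.03.04/lut/variant-index/part-*' \
#     | first_records 5 | pick_fields
```

Limiting the stream with head before calling jq keeps the preview cheap, because gsutil stops once the pipe closes.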
Loading data into a ClickHouse instance
An initial bash script is available for loading all the data into a ClickHouse instance. It contains lines like these:
echo create studies tables
clickhouse-client -m -n < studies_log.sql
gsutil cat "${base_path}/lut/study-index/part-*" | clickhouse-client -h 127.0.0.1 --query="insert into ot.studies_log format JSONEachRow "
clickhouse-client -m -n < studies.sql
clickhouse-client -m -n -q "drop table ot.studies_log;"
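The lines above follow a four-step pattern: create a log table, stream the JSON into it, build the final table, then drop the log table. The sketch below prints those steps for one table so they can be reviewed before running. The load_steps function, the default base_path, and the assumption that other tables follow the same naming (NAME_log.sql, NAME.sql, ot.NAME_log) are illustrative and not taken from the actual script.

```shell
#!/usr/bin/env bash
# Assumed release path for illustration; the original script sets base_path itself.
base_path="${base_path:-gs://open-targets-genetics-releases/19.03.04}"

# Print the four load steps for one table.
# $1: table name (e.g. studies), $2: folder under lut/ (e.g. study-index).
load_steps() {
  local table="$1" folder="$2"
  cat <<EOF
clickhouse-client -m -n < ${table}_log.sql
gsutil cat "${base_path}/lut/${folder}/part-*" | clickhouse-client -h 127.0.0.1 --query="insert into ot.${table}_log format JSONEachRow"
clickhouse-client -m -n < ${table}.sql
clickhouse-client -m -n -q "drop table ot.${table}_log;"
EOF
}

# Reproduce the studies example; nothing is executed, only printed.
load_steps studies study-index
```

To run the printed steps, pipe them to a shell, for example load_steps studies study-index | bash, from the directory that holds the .sql files.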