Assigning Variants to Disease (V2D)
Open Targets Genetics is based on the human reference genome assembly GRCh38. Lead variants associated with a phenotype by hypothesis-free approaches (GWAS) are initially annotated with their associated trait(s) as described below:
Reported variant-phenotype associations in literature were identified via the NHGRI-EBI GWAS Catalog, a manually-curated database of published variants meeting certain inclusion criteria, which will be familiar to most geneticists. On an ongoing basis, GWAS Catalog extracts and records detailed variant and study-level data for variants reported to be associated with any phenotype at a significance level of
, and fit the inclusion criteria detailed here.
In Open Targets Genetics we include GWAS Catalog curated associations with
. A subset of studies then undergoes distance-based clustering (±500kb) to remove redundant associations that are an artefact of the curation process rather than true independent signals. For associations that have a reported risk allele, we harmonise the effects so that all are expressed with respect to the alternative allele.
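The distance-based clustering step can be sketched as a greedy clumping pass. This is a minimal illustration, not the pipeline's actual code: the `Assoc` record, the function name, and the rule of keeping the most significant association within each window are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Assoc(NamedTuple):
    """Hypothetical record for one curated association."""
    chrom: str
    pos: int      # GRCh38 base-pair position
    pval: float


def cluster_by_distance(assocs: List[Assoc], window: int = 500_000) -> List[Assoc]:
    """Keep the most significant association per +/- `window` neighbourhood.

    Assumption: redundant associations are resolved greedily by p-value.
    """
    kept: List[Assoc] = []
    # Visit associations from most to least significant.
    for a in sorted(assocs, key=lambda x: x.pval):
        # Keep `a` only if no already-kept association lies within the window
        # on the same chromosome.
        if all(a.chrom != k.chrom or abs(a.pos - k.pos) > window for k in kept):
            kept.append(a)
    return sorted(kept, key=lambda x: (x.chrom, x.pos))
```

Under this rule, two associations 300kb apart on the same chromosome collapse to the more significant one, while associations on different chromosomes are never merged.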
Data from the GWAS Catalog summary statistics repository has been included in the portal as of June 2019. The initial release has been restricted to datasets derived from samples of predominantly European ancestry (N=201) due to the lack of suitable linkage-disequilibrium reference panels for conditional analysis. We are working to include all datasets in a future release of the portal.
Recent efforts to rapidly and systematically apply established GWAS methods to all available data fields in UK Biobank have made available large repositories of summary statistics. To leverage these data for disease locus discovery, we used full summary statistics from:
1. The Neale lab Round 2 (N=2139). These analyses applied GWAS (implemented in Hail) to all data fields using imputed genotypes from HRC as released by UK Biobank in May 2017, consisting of 337,199 individuals post-QC. Full details of the Neale lab GWAS implementation are available here. We have removed all ICD-10-related traits from the Neale data to reduce overlap with the SAIGE results.
The fine-mapping section below explains how associated loci were defined using the UK Biobank summary statistics.
Two methods are used to expand lead disease-associated variants into a more complete set of possibly causal tag variants. Linkage-disequilibrium expansion using a reference population is applied to all studies in Open Targets Genetics, and expansion by fine-mapping (credible set analysis) is used where full summary statistics are available (currently UK Biobank traits and those included in the GWAS Catalog summary statistics repository).
Linkage disequilibrium (LD) information is calculated using the 1000 Genomes Phase 3 (1KG) haplotype panel as a reference. LD is calculated in the 1KG super-population that most closely matches the GWAS study ancestry information curated by the GWAS Catalog. If the study was conducted in a mixture of populations, a weighted average (of Fisher Z-transformed correlation coefficients) across super-populations is used. If ancestry information is unknown, European ancestry is assumed. See here for full methods.
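The weighted average of Fisher Z-transformed correlations can be illustrated as follows. This is a sketch under assumptions: the weights are taken to be each super-population's share of the study sample, and the function name and the clipping constant `max_abs_r` are hypothetical choices made so that the transform stays finite when r is exactly ±1.

```python
import math
from typing import Dict


def weighted_ld(r_by_pop: Dict[str, float],
                weights: Dict[str, float],
                max_abs_r: float = 0.999999) -> float:
    """Combine per-population LD correlations r into one value.

    Each r is mapped to z = atanh(r), the z values are averaged using the
    supplied weights, and the mean is mapped back with tanh.
    """
    total = sum(weights[pop] for pop in r_by_pop)
    z_mean = 0.0
    for pop, r in r_by_pop.items():
        # Clip so that atanh does not fail at |r| == 1 (illustrative choice).
        r_clipped = max(-max_abs_r, min(max_abs_r, r))
        z_mean += weights[pop] * math.atanh(r_clipped) / total
    return math.tanh(z_mean)


# Example: a study that is 70% European and 30% East Asian ancestry.
r_combined = weighted_ld({"EUR": 0.9, "EAS": 0.6}, {"EUR": 0.7, "EAS": 0.3})
r2_combined = r_combined ** 2
```

Averaging on the Z scale rather than on r directly gives more influence to correlations near ±1, where the sampling distribution of r is strongly skewed.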
Overview of the fine-mapping pipeline
Summary statistics were harmonised to ensure that the ALT allele is always the effect allele, and were pre-filtered to remove variants with low minor allele counts which would lead to inaccurate effect estimation. Variants located in the MHC region (6:28,510,120–33,480,577 GRCh38) are excluded from the fine-mapping pipeline. See here for harmonisation scripts and here for "ingestion" scripts and detailed inclusion criteria.
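The harmonisation and pre-filtering rules above can be sketched as below. The dictionary field names, the function names, and the `min_mac` argument are hypothetical; the actual minor allele count threshold is defined in the ingestion scripts and is not reproduced here. Strand flips and palindromic variants are deliberately not handled in this sketch.

```python
from typing import Optional

# MHC region excluded from fine-mapping (GRCh38), as stated in the text.
MHC_CHROM, MHC_START, MHC_END = "6", 28_510_120, 33_480_577


def harmonise(row: dict) -> Optional[dict]:
    """Return a copy of `row` whose effect allele is the ALT allele.

    If the reported effect allele is REF, the sign of beta is flipped and the
    effect allele frequency becomes 1 - eaf. Rows whose effect allele matches
    neither REF nor ALT are dropped (returned as None).
    """
    if row["effect_allele"] == row["alt"]:
        return dict(row)
    if row["effect_allele"] == row["ref"]:
        out = dict(row)
        out["effect_allele"] = row["alt"]
        out["beta"] = -row["beta"]
        out["eaf"] = 1.0 - row["eaf"]
        return out
    return None


def in_mhc(row: dict) -> bool:
    """True if the variant lies inside the excluded MHC interval."""
    return row["chrom"] == MHC_CHROM and MHC_START <= row["pos"] <= MHC_END


def minor_allele_count(eaf: float, n: int) -> float:
    """Expected minor allele count for n diploid samples."""
    return 2 * n * min(eaf, 1.0 - eaf)


def passes_filters(row: dict, min_mac: float) -> bool:
    """Keep variants outside the MHC with at least `min_mac` minor alleles.

    `min_mac` has no default on purpose: the real threshold is an assumption
    the caller must supply.
    """
    return not in_mhc(row) and minor_allele_count(row["eaf"], row["n"]) >= min_mac
```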
Independently associated top loci are detected with the GCTA stepwise selection procedure (cojo-slct), using unrelated European ancestry UK Biobank genotypes down-sampled to 10K individuals as an LD reference. Lead variants (the most associated variant at each locus) are kept if both the conditional and nominal p-values have
Where multiple index SNPs are found at the same locus (within 2Mb of each other), we perform GCTA single-variant association analysis conditional on other index SNPs at the locus. This produces a set of conditional summary statistics for each independently associated locus.
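To make the 2Mb rule concrete, the sketch below works out, for each index SNP, which other index SNPs it would be conditioned on. It only prepares the conditioning lists; the conditional analysis itself is run by GCTA. The function name and the input layout (variant ID mapped to chromosome and position) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def conditioning_sets(index_snps: Dict[str, Tuple[str, int]],
                      window: int = 2_000_000) -> Dict[str, List[str]]:
    """Map each index SNP to the other index SNPs within `window` bp.

    An empty list means the SNP is the only signal at its locus, so no
    conditioning is needed.
    """
    result: Dict[str, List[str]] = {}
    for vid, (chrom, pos) in index_snps.items():
        result[vid] = sorted(
            other
            for other, (o_chrom, o_pos) in index_snps.items()
            if other != vid and o_chrom == chrom and abs(o_pos - pos) <= window
        )
    return result
```

Each non-empty list corresponds to the set of variants supplied as conditioning SNPs when producing that index SNP's conditional summary statistics.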
Credible set analysis is conducted for each associated locus using the above conditional summary statistics. We calculate approximate Bayes factors (ABFs) for all variants in a defined region around the index variant (±500kb). ABFs are computed using the approx.bf.p method re-implemented from the coloc package. Variants are ordered by their posterior probabilities (PP) and sequentially added to the credible set until the cumulative sum is >0.95 (95% credible set).
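The credible set construction can be sketched in a few lines. This is an illustration rather than the pipeline's re-implementation: `log_abf` uses the standard Wakefield form of the approximate Bayes factor computed from a z-score and the variance `V` of the effect estimate, and the prior standard deviation is left as a caller-supplied parameter because its value is not stated above. Posterior probabilities assume a single causal variant per conditional signal and equal priors across variants.

```python
import math
from typing import Dict, List, Tuple


def log_abf(z: float, V: float, prior_sd: float) -> float:
    """Log approximate Bayes factor for one variant (Wakefield form).

    z: association z-score; V: variance of the effect estimate;
    prior_sd: prior standard deviation of the true effect (an assumption).
    """
    r = prior_sd ** 2 / (prior_sd ** 2 + V)
    return 0.5 * (math.log(1.0 - r) + r * z ** 2)


def credible_set(log_abfs: Dict[str, float],
                 coverage: float = 0.95) -> Tuple[List[str], Dict[str, float]]:
    """Return (variants in the credible set, posterior probability per variant)."""
    # Normalise in log space to avoid overflow for very strong signals.
    top = max(log_abfs.values())
    total = sum(math.exp(v - top) for v in log_abfs.values())
    pp = {vid: math.exp(v - top) / total for vid, v in log_abfs.items()}

    chosen: List[str] = []
    cumulative = 0.0
    # Add variants from highest to lowest PP until coverage is exceeded.
    for vid, p in sorted(pp.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(vid)
        cumulative += p
        if cumulative > coverage:
            break
    return chosen, pp
```

For example, if three variants have posterior probabilities of 0.90, 0.06 and 0.04, the first two are selected because their cumulative sum (0.96) is the first to exceed 0.95.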