Assigning Variants to Genes (V2G)
All variants in the variant index are annotated using our Variant-to-Gene (V2G) pipeline. The pipeline integrates V2G evidence that falls into four main data types:
Molecular phenotype quantitative trait loci experiments (eQTLs, pQTLs and sQTLs)
Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C)
In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl
Distance between the variant and each gene's canonical transcription start site (TSS)
Within each data type there are multiple sources of information produced by different experimental methods. Some of these sources can further be broken down into separate tissues or cell types (features). A full list of data sources used in the V2G pipeline can be seen on the Data Sources page.
Raw datasets are processed to conform to a standardised format and filtered so that they:
Only contain associations with strong evidence after multiple-testing correction
Only contain cis-regulatory associations
A full list of filters applied to each dataset, and workflows to reproduce the V2G files, can be found on GitHub.
Different data sources use different metrics to measure the association between variants (or genomic intervals) and a gene. For example, QTLs provide a p-value from standard linear regression, whereas PCHi-C provides a CHiCAGO score. To harmonise scores across sources, a relevant study-specific metric is extracted and then quantile-transformed against a uniform distribution. If multiple features (tissues/cell types) are available, the transformation is applied separately to each feature. Transformed scores are rounded to the nearest decile, so a score of 1.0 is in the top decile, a score of 0.9 is in the 9th decile, and so on.
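The harmonisation step can be sketched in a few lines of Python. This is only an illustration, not the pipeline's code: the function names are hypothetical, larger raw values are assumed to mean stronger evidence (for example -log10 p-values), and rounding up to the decile boundary is an assumption about how "nearest decile" is applied.

```python
import bisect
from collections import defaultdict


def quantile_deciles(scores):
    """Map raw scores to empirical quantiles, expressed as deciles in 0.1..1.0.

    Assumptions (not taken from the pipeline): larger raw scores mean
    stronger evidence, tied scores share the highest rank among them,
    and quantiles are rounded UP to the decile boundary.
    """
    n = len(scores)
    ordered = sorted(scores)
    deciles = []
    for s in scores:
        rank = bisect.bisect_right(ordered, s)  # number of scores <= s
        decile = -(-10 * rank // n)             # integer ceiling of 10 * rank / n
        deciles.append(decile / 10)
    return deciles


def transform_by_feature(records):
    """Apply the transformation separately within each feature.

    records: list of (feature, raw_score) tuples.
    Returns the transformed scores in the same order as the input.
    """
    groups = defaultdict(list)
    for feature, score in records:
        groups[feature].append(score)
    transformed = {f: iter(quantile_deciles(s)) for f, s in groups.items()}
    return [next(transformed[feature]) for feature, _ in records]


# Hypothetical raw scores from two tissues measured on very different scales.
records = [("liver", 3.0), ("liver", 1.0), ("blood", 50.0), ("blood", 10.0)]
print(transform_by_feature(records))
```

Because the transformation is applied per feature, the strongest association in each tissue lands in the top decile regardless of the scale of the raw metric.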
Next, each variant-gene pair is annotated with all available functional evidence. QTL and functional prediction data types contain variant-centric scores and so are simple to combine. Interaction data types link functional genomic regions (interval A) to gene positions (interval B). Variants that lie within interval A are assigned evidence scores that link them to genes located in interval B. The resulting V2G merge table consists of approximately 1.7 billion evidence strings.
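To illustrate how interval-based evidence becomes variant-centric, here is a minimal sketch. The record types, field names, identifiers and the inclusive coordinate convention are all assumptions made for the example; a production pipeline working at this scale would use an indexed or distributed join, not a nested loop.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    chrom: str
    start: int    # interval A start (assumed inclusive)
    end: int      # interval A end (assumed inclusive)
    gene_id: str  # gene located in interval B
    score: float  # harmonised evidence score


class Variant(NamedTuple):
    variant_id: str
    chrom: str
    pos: int


def assign_interval_evidence(variants, interactions):
    """Return (variant_id, gene_id, score) for each variant inside an interval A."""
    by_chrom = {}
    for it in interactions:
        by_chrom.setdefault(it.chrom, []).append(it)

    evidence = []
    for v in variants:
        for it in by_chrom.get(v.chrom, []):
            if it.start <= v.pos <= it.end:
                evidence.append((v.variant_id, it.gene_id, it.score))
    return evidence


# Hypothetical inputs: one variant falls inside two overlapping intervals.
variants = [Variant("1_1000_A_G", "1", 1000), Variant("1_5000_C_T", "1", 5000)]
interactions = [
    Interaction("1", 900, 1100, "GENE_A", 0.8),
    Interaction("1", 950, 1200, "GENE_B", 0.3),
]
print(assign_interval_evidence(variants, interactions))
```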
Given the scale of the data, a scoring system was developed so that, for a given variant, we can get a list of genes ranked by either (i) the overall V2G score or (ii) a per-source V2G score.
Step 1, Aggregate across features (tissues or cell types). Some data sources (e.g. GTEx and PCHi-C) provide associations measured in multiple tissues or cell lines (features). Where multiple features exist, we aggregate by taking the maximum score across all features for each variant-gene pair. This aggregation gives a per-source V2G score for each variant-gene pair.
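Step 1 amounts to a grouped maximum. The sketch below assumes evidence rows arrive as plain tuples; the tuple layout and all identifiers are hypothetical.

```python
def per_source_scores(evidence):
    """Collapse features by taking the maximum score per (variant, gene, source).

    evidence: iterable of (variant_id, gene_id, source, feature, score) tuples.
    Returns a dict keyed by (variant_id, gene_id, source).
    """
    best = {}
    for variant_id, gene_id, source, _feature, score in evidence:
        key = (variant_id, gene_id, source)
        if key not in best or score > best[key]:
            best[key] = score
    return best


# Hypothetical evidence: the same pair scored in two tissues by one source.
evidence = [
    ("1_1000_A_G", "GENE_A", "eqtl_source", "liver", 0.4),
    ("1_1000_A_G", "GENE_A", "eqtl_source", "blood", 0.9),
    ("1_1000_A_G", "GENE_A", "pchic_source", "monocyte", 0.6),
]
print(per_source_scores(evidence))
```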
Step 2, Aggregate across sources. The next stage is to combine information across the sources to produce an overall V2G score. Given the heterogeneous nature of the data, we may have more confidence in evidence from some sources than from others. We therefore down-weight some sources before aggregation. Using prior knowledge, we rank evidence from sources in this order [ Transcript functional prediction > QTLs > Interaction based data sets ] and apply the following weights:
Data type | Experiment type | Source | Weighting |
In silico functional prediction | Transcript consequence | VEP | 1.0 |
QTL | sQTL | | 1.0 |
QTL | eQTL | many | 0.66 |
QTL | pQTL | many | 0.66 |
Interaction | PCHi-C | Javierre et al. (Cell, 2016) | 0.33 |
Interaction | Enhancer-TSS correlation | Andersson et al. (Nature, 2014) | 0.33 |
Interaction | DHS-promoter correlation | Thurman et al. (Nature, 2012) | 0.33 |
Distance | Canonical TSS | | 0.33 |
After weighting, scores are aggregated across sources by taking the mean weighted quantile to give an overall V2G score for each variant-gene pair.
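One possible reading of Step 2 is sketched below. The source keys are hypothetical labels, only a subset of the weights in the table is shown, and normalising by the summed weights of the sources that have evidence for the pair is an assumption; the pipeline may normalise differently.

```python
# Illustrative subset of the weights listed in the table; keys are hypothetical.
SOURCE_WEIGHTS = {
    "vep": 1.0,
    "eqtl": 0.66,
    "pqtl": 0.66,
    "pchic": 0.33,
    "distance": 0.33,
}


def overall_v2g_score(source_scores, weights=SOURCE_WEIGHTS):
    """Weighted mean of per-source scores for one variant-gene pair.

    source_scores: dict mapping source key -> per-source V2G score.
    Assumption: the mean is normalised by the weights of the sources
    that actually have evidence for this pair.
    """
    total_weight = sum(weights[s] for s in source_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[s] * q for s, q in source_scores.items())
    return weighted_sum / total_weight


def rank_genes(per_gene_scores, weights=SOURCE_WEIGHTS):
    """Rank genes for one variant by overall score, highest first."""
    scored = {g: overall_v2g_score(s, weights) for g, s in per_gene_scores.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-source scores for two candidate genes of a single variant.
per_gene = {
    "GENE_A": {"vep": 1.0, "eqtl": 0.8},
    "GENE_B": {"pchic": 0.9, "distance": 0.4},
}
for gene, score in rank_genes(per_gene):
    print(gene, round(score, 3))
```

Ranking by a single source instead only requires sorting on that source's per-source score, so the same per-gene dictionary supports both views.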
The following gene biotypes are excluded from all V2G analysis:
IG_C_pseudogene, IG_J_pseudogene, IG_pseudogene, IG_V_pseudogene, polymorphic_pseudogene, processed_pseudogene, pseudogene, rRNA, rRNA_pseudogene, snoRNA, snRNA, transcribed_processed_pseudogene, transcribed_unitary_pseudogene, transcribed_unprocessed_pseudogene, TR_J_pseudogene, TR_V_pseudogene, unitary_pseudogene, unprocessed_pseudogene