  • Technical Report

A method to predict the impact of regulatory variants from DNA sequence

Abstract

Most variants implicated in common human disease by genome-wide association studies (GWAS) lie in noncoding sequence intervals. Despite the suggestion that regulatory element disruption represents a common theme, identifying causal risk variants within implicated genomic regions remains a major challenge. Here we present a new sequence-based computational method to predict the effect of regulatory variation, using a classifier (gkm-SVM) that encodes cell type–specific regulatory sequence vocabularies. The induced change in the gkm-SVM score, deltaSVM, quantifies the effect of variants. We show that deltaSVM accurately predicts the impact of SNPs on DNase I sensitivity in their native genomic contexts and accurately predicts the results of dense mutagenesis of several enhancers in reporter assays. Previously validated GWAS SNPs yield large deltaSVM scores, and we predict new risk-conferring SNPs for several autoimmune diseases. Thus, deltaSVM provides a powerful computational approach to systematically identify functional regulatory variants.
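The abstract does not spell out how the score change is calculated, so the following is only a minimal sketch of one plausible reading: each k-mer carries a weight, a sequence is scored by summing the weights of its k-mers, and deltaSVM is the difference between the scores of the two alleles in a short window around the variant. The weight table, the window convention and the sign (alternate minus reference) are assumptions made for illustration; the names `score` and `delta_svm` are hypothetical.

```python
from typing import Dict


def score(seq: str, weights: Dict[str, float], k: int) -> float:
    """Sum the weights of every k-mer in seq; k-mers without a weight count as 0."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def delta_svm(ref_window: str, alt_window: str,
              weights: Dict[str, float], k: int) -> float:
    """Score change (alternate minus reference) for two equal-length windows.

    Each window is assumed to span the variant plus k - 1 flanking bases on
    either side, so every k-mer overlapping the variant is counted once.
    """
    if len(ref_window) != len(alt_window):
        raise ValueError("reference and alternate windows must have equal length")
    return score(alt_window, weights, k) - score(ref_window, weights, k)


# Toy 3-mer weights, invented purely for illustration.
toy_weights = {"ACG": 1.5, "CGT": 0.5, "ATG": -1.0}
reference = "AACGT"  # central base C
alternate = "AATGT"  # central base T
change = delta_svm(reference, alternate, toy_weights, k=3)
```

In this toy case the reference window contains two positively weighted 3-mers and the alternate window contains one negatively weighted 3-mer, so the change is negative, which under the assumed sign convention would correspond to a predicted loss of regulatory activity.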

Figure 1: Overview of our deltaSVM method.
Figure 2: deltaSVM can accurately predict SNPs associated with DNase I hypersensitivity.
Figure 3: deltaSVM correlations with dsQTL and eQTL effect sizes.
Figure 4: deltaSVM accurately predicts change in luciferase expression in targeted mutagenesis of Tyr and Tyrp1 mouse melanocyte enhancers.
Figure 5: deltaSVM accurately predicts change of expression in massively parallel reporter assays.
Figure 6: deltaSVM only identifies validated causal SNPs when trained on the appropriate cell type.

Accession codes

Accessions

Gene Expression Omnibus

References

  1. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).

  2. Maurano, M.T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).

  3. Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).

  4. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

  5. Ritchie, G.R.S., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).

  6. Hardison, R.C. & Taylor, J. Genomic approaches towards finding cis-regulatory modules in animals. Nat. Rev. Genet. 13, 469–483 (2012).

  7. Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M.A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).

  8. ENCODE Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  9. Bernstein, B.E. et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045–1048 (2010).

  10. Lee, D., Karchin, R. & Beer, M.A. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 21, 2167–2180 (2011).

  11. Fletez-Brant, C., Lee, D., McCallion, A.S. & Beer, M.A. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucleic Acids Res. 41, W544–W556 (2013).

  12. Gorkin, D.U. et al. Integration of ChIP-seq and machine learning reveals enhancers and a predictive regulatory sequence vocabulary in melanocytes. Genome Res. 22, 2290–2301 (2012).

  13. Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).

  14. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

  15. International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).

  16. Davydov, E.V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).

  17. Lee, D. & Beer, M.A. in Genome Analysis: Current Procedures and Applications (ed. Poptsova, M.S.) 101–120 (Horizon Scientific Press, 2014).

  18. Ghandi, M., Mohammad-Noori, M. & Beer, M.A. Robust k-mer frequency estimation using gapped k-mers. J. Math. Biol. 69, 469–500 (2014).

  19. Pickrell, J.K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772 (2010).

  20. Murisier, F., Guichard, S. & Beermann, F. A conserved transcriptional enhancer that specifies Tyrp1 expression to melanocytes. Dev. Biol. 298, 644–655 (2006).

  21. Murisier, F., Guichard, S. & Beermann, F. The tyrosinase enhancer is activated by Sox10 and Mitf in mouse melanocytes. Pigment Cell Res. 20, 173–184 (2007).

  22. Patwardhan, R.P. et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat. Biotechnol. 30, 265–270 (2012).

  23. Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).

  24. Kheradpour, P. et al. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 23, 800–811 (2013).

  25. Huang, Q. et al. A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat. Genet. 46, 126–135 (2014).

  26. Bauer, D.E. et al. An erythroid enhancer of BCL11A subject to genetic variation determines fetal hemoglobin level. Science 342, 253–257 (2013).

  27. Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466, 714–719 (2010).

  28. Farh, K.K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015).

  29. Jin, Y. et al. Genome-wide association analyses identify 13 new susceptibility loci for generalized vitiligo. Nat. Genet. 44, 676–680 (2012).

  30. Barrett, J.C. et al. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat. Genet. 41, 703–707 (2009).

  31. Franke, A. et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat. Genet. 42, 1118–1125 (2010).

  32. Barrett, J.C. et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat. Genet. 40, 955–962 (2008).

  33. International Multiple Sclerosis Genetics Consortium. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat. Genet. 45, 1353–1360 (2013).

  34. Dubois, P.C.A. et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet. 42, 295–302 (2010).

  35. Parkes, M. et al. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat. Genet. 39, 830–832 (2007).

  36. Hinds, D.A. et al. A genome-wide association meta-analysis of self-reported allergy identifies shared and allergy-specific susceptibility loci. Nat. Genet. 45, 907–911 (2013).

  37. Mells, G.F. et al. Genome-wide association study identifies 12 new susceptibility loci for primary biliary cirrhosis. Nat. Genet. 43, 329–332 (2011).

  38. Trynka, G. et al. Dense genotyping identifies and localizes multiple common and rare variant association signals in celiac disease. Nat. Genet. 43, 1193–1201 (2011).

  39. Eyre, S. et al. High-density genetic mapping identifies new susceptibility loci for rheumatoid arthritis. Nat. Genet. 44, 1336–1340 (2012).

  40. Cooper, J.D. et al. Seven newly identified loci for autoimmune thyroid disease. Hum. Mol. Genet. 21, 5202–5208 (2012).

  41. Gourraud, P.-A. et al. A genome-wide association study of brain lesion distribution in multiple sclerosis. Brain 136, 1012–1024 (2013).

  42. Liu, J.Z. et al. Dense fine-mapping study identifies new susceptibility loci for primary biliary cirrhosis. Nat. Genet. 44, 1137–1141 (2012).

  43. Zhang, Y. et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).

  44. Heintzman, N.D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).

  45. Heintzman, N.D. et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108–112 (2009).

Acknowledgements

This research was supported in part by US National Institutes of Health grant R01 NS62972 to A.S.M. and by grant R01 HG007348 to M.A.B.

Author information

Contributions

M.A.B., A.S.M., D.L. and D.U.G. designed the study and wrote the manuscript. D.U.G. and M.B. performed the experiments and analyzed the data. D.L. and M.A.B. developed the computational algorithms and analyzed the data. B.J.S. and A.L.A. contributed computational analysis. D.L. and D.U.G. contributed equally to this work.

Corresponding authors

Correspondence to Andrew S McCallion or Michael A Beer.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Correlation of deltaSVM and dsQTL effect size drops with increasing distance between dsQTL SNPs and the center of the associated DNase I–sensitive regions.

The original set of dsQTLs was defined as SNPs within ±1,000 bp of covarying hypersensitive regions (ref. 13). We find that deltaSVM is consistent with dsQTL effect size (beta) only when we constrain the set of dsQTLs to be within 200 bp of the modulated DHS region: (a) 0–50 bp (red), (b) 50–200 bp (green), (c) 200–500 bp and (d) 500–1,000 bp. Thus, our analysis is consistent with a local mechanism of action for dsQTLs.
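The binning described in this legend can be sketched as follows. This is a hedged illustration rather than the authors' analysis code: it assumes each dsQTL is available as a hypothetical `(distance_bp, delta_svm, beta)` tuple, uses the four distance ranges named above, and reports a plain Pearson correlation per bin (the legend does not state which correlation statistic was used).

```python
from bisect import bisect_right
from math import sqrt
from typing import Dict, Iterable, List, Optional, Tuple

EDGES = [50, 200, 500]  # inner bin edges in bp
LABELS = ["0-50", "50-200", "200-500", "500-1000"]


def pearson(x: List[float], y: List[float]) -> Optional[float]:
    """Pearson correlation, or None when it is undefined."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def correlation_by_distance(
    records: Iterable[Tuple[float, float, float]]
) -> Dict[str, Optional[float]]:
    """Group (distance, deltaSVM, beta) records by distance and correlate per bin."""
    xs: Dict[str, List[float]] = {label: [] for label in LABELS}
    ys: Dict[str, List[float]] = {label: [] for label in LABELS}
    for distance, dsvm, beta in records:
        if 0 <= distance <= 1000:  # SNPs beyond 1,000 bp are ignored
            label = LABELS[bisect_right(EDGES, distance)]
            xs[label].append(dsvm)
            ys[label].append(beta)
    return {label: pearson(xs[label], ys[label]) for label in LABELS}
```

A distance that falls exactly on an edge (for example 50 bp) is assigned to the higher bin here; that tie-breaking rule is an arbitrary choice of this sketch.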

Supplementary Figure 2 Bases predicted to reduce the activity of functional regions are evolutionarily constrained.

We calculated the average deltaSVM scores of all three possible mutations at each base within LCL GM12878 DHSs and compared the conservation (phyloP) for bases causing (a) negative (red), (b) neutral (gray) and (c) positive (blue) deltaSVM-predicted impact (the top 1% of bases with negative deltaSVM, 1% of bases with deltaSVM near 0 and the top 1% of bases with positive deltaSVM; n = 63,123). (d) Differential distributions relative to bases with neutral deltaSVM. Bases with negative or positive deltaSVM are more conserved than those with neutral deltaSVM; P < 1 × 10^−300 (below machine precision) and P < 1 × 10^−14 (Kolmogorov-Smirnov test), respectively. Also, bases with negative deltaSVM are much more conserved than those with positive deltaSVM (average phyloP of 1.00 versus 0.20, P < 1 × 10^−300).
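The per-base averaging and the selection of the three groups can be illustrated with a short sketch. It assumes the deltaSVM values for the three possible substitutions at each base are already available as a list of triples; the function names and the exact rule for "near 0" (smallest absolute mean) are assumptions of this example, not details given in the legend.

```python
from typing import List, Sequence, Tuple


def mean_delta_per_base(delta_by_mutation: Sequence[Sequence[float]]) -> List[float]:
    """Average the deltaSVM values of the possible substitutions at each base."""
    return [sum(values) / len(values) for values in delta_by_mutation]


def select_groups(mean_delta: Sequence[float],
                  fraction: float = 0.01) -> Tuple[List[int], List[int], List[int]]:
    """Return base indices for the negative, neutral and positive groups.

    negative: the `fraction` of bases with the lowest mean deltaSVM
    neutral:  the `fraction` of bases with mean deltaSVM closest to 0
    positive: the `fraction` of bases with the highest mean deltaSVM
    """
    n = max(1, int(len(mean_delta) * fraction))
    by_value = sorted(range(len(mean_delta)), key=lambda i: mean_delta[i])
    by_magnitude = sorted(range(len(mean_delta)), key=lambda i: abs(mean_delta[i]))
    return by_value[:n], by_magnitude[:n], by_value[-n:]
```

The conservation scores of the three index sets could then be compared with any two-sample test; the legend reports a Kolmogorov-Smirnov test.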

Supplementary Figure 3 Correlations of deltaSVM and in vivo mutation effect size in the ALDOB enhancer using an aggregate model.

We averaged the deltaSVM scores of all three possible mutations at each base and compared them with the expression changes from the univariate model reported by Patwardhan et al. (ref. 22).

Supplementary Figure 4 High-confidence predicted causal SNPs in loci associated with autoimmune disease.

The significance of the maximum of Abs(deltaSVM) depends on the number of flanking candidate causal SNPs. Sampling of random SNPs scored with the TH1 gkm-SVM yielded the solid curves for the top 2% of all loci and the mean, with standard deviation shown (dashed line). Seventeen of the 413 immune-associated loci exceed the 2% threshold, whereas 8 would be expected by chance.
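The dependence of the threshold on the number of candidate SNPs can be illustrated with a small Monte Carlo sketch. It assumes a hypothetical list `background` of deltaSVM scores for random SNPs and estimates, for a locus with `n_snps` candidates, the value that the maximum absolute score exceeds in only the top 2% of random draws. Sampling with replacement and the number of draws are choices of this sketch, not details taken from the legend.

```python
import random
from typing import Sequence


def null_max_abs_threshold(background: Sequence[float], n_snps: int,
                           top_fraction: float = 0.02,
                           n_draws: int = 2000, seed: int = 0) -> float:
    """Estimate the top-`top_fraction` cutoff of max |deltaSVM| over n_snps random SNPs."""
    if n_snps < 1:
        raise ValueError("n_snps must be at least 1")
    rng = random.Random(seed)
    maxima = sorted(
        max(abs(value) for value in rng.choices(background, k=n_snps))
        for _ in range(n_draws)
    )
    index = min(n_draws - 1, int((1 - top_fraction) * n_draws))
    return maxima[index]
```

Because the maximum of more draws tends to be larger, the cutoff rises with the number of candidate SNPs, which is why a single fixed threshold would not be appropriate across loci. Under such a null, about 2% of 413 loci, that is roughly 8, would be expected to exceed the cutoff by chance, in line with the count given above.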

Supplementary Figure 5 Precision of deltaSVM prediction of dsQTLs as a function of gkm-SVM feature length.

As in Figure 2e, with varying (l, k) values (where l is the total k-mer length and k is the number of ungapped positions). Precision improves as k is increased, but gapped k-mer performance is always better than that of ungapped k-mers (where l = k). For this large training set (44,768 sequences), (11, 7) is slightly better than the default (10, 6), but for smaller training sets our default feature set (10, 6) is recommended.
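To make the (l, k) notation concrete, the sketch below enumerates the gapped k-mers contained in a single l-mer by choosing which k of the l positions stay informative and masking the rest. The gap symbol `N` and the function name are assumptions of this example, and explicit enumeration is used only for clarity; it is not meant to reflect how gkm-SVM software evaluates its kernel.

```python
from itertools import combinations
from math import comb
from typing import Iterator


def gapped_kmers(lmer: str, k: int, gap: str = "N") -> Iterator[str]:
    """Yield every gapped k-mer of an l-mer: k informative positions, l - k gaps."""
    length = len(lmer)
    if not 0 < k <= length:
        raise ValueError("k must satisfy 0 < k <= len(lmer)")
    for keep in combinations(range(length), k):
        kept = set(keep)
        yield "".join(base if i in kept else gap for i, base in enumerate(lmer))


# With the default (l, k) = (10, 6), each 10-mer contributes C(10, 6) gapped 6-mers.
features_per_10mer = comb(10, 6)
```

When l = k there is exactly one way to choose the positions, so the feature reduces to the ordinary ungapped k-mer.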

Supplementary Figure 6 Constraining distance to a TSS in the negative set does not affect the precision of deltaSVM prediction of dsQTLs.

In Figure 2, the gkm-SVM was trained on a negative sequence set matched for GC distribution and repeat fraction, but distance to a TSS was unconstrained. We generated an additional negative set that matched the GC distribution in the GM12878 positive set (a) but also matched the distribution of distance to a TSS of the positive set (b). As shown in c and d, using a gkm-SVM trained on this TSS-matched negative set does not affect performance in predicting dsQTLs.
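One simple way to build a GC-matched negative set is to bin sequences by GC fraction and draw, for each positive sequence, one unused candidate from the same bin. The sketch below illustrates only that idea; the bin count, the function names and the one-to-one sampling are assumptions, and it does not attempt the additional matching on repeat fraction or distance to a TSS mentioned in the legend.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def gc_bin(seq: str, n_bins: int) -> int:
    """Map a sequence to a GC-content bin in the range 0 .. n_bins - 1."""
    gc = sum(base in "GCgc" for base in seq) / len(seq)
    return min(int(gc * n_bins), n_bins - 1)


def sample_gc_matched(positives: Sequence[str], candidates: Sequence[str],
                      n_bins: int = 20, seed: int = 0) -> List[str]:
    """Pick one candidate per positive from the same GC bin, without reuse."""
    rng = random.Random(seed)
    pool: Dict[int, List[str]] = defaultdict(list)
    for seq in candidates:
        pool[gc_bin(seq, n_bins)].append(seq)
    for bucket in pool.values():
        rng.shuffle(bucket)
    negatives = []
    for seq in positives:
        bucket = pool[gc_bin(seq, n_bins)]
        if bucket:  # positives with no remaining match are skipped
            negatives.append(bucket.pop())
    return negatives
```

Matching on a second property such as distance to a TSS could be sketched the same way by keying the pool on a pair of bins instead of a single bin.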

Supplementary Figure 7 Constraining distance to a TSS or LD for negative dsQTL control SNPs does not affect the precision of deltaSVM prediction of dsQTLs.

In Figure 2, deltaSVM predictions were tested on the positive dsQTLs and a 50 times larger set of negative dsQTL control SNPs selected at random from the genome. Here we constrain the distance to a TSS for negative dsQTL control SNPs to match the distribution of distance to a TSS for the positive dsQTLs. This set is already matched to the positive dsQTL set in terms of the number of SNPs in strong LD (a). Further constraining distance to a TSS (b) does not affect performance in predicting dsQTLs relative to either negative set (c,d).
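Precision in this setting can be sketched as a precision-recall curve computed over SNPs ranked by the magnitude of their score. The sketch assumes that ranking by |deltaSVM| is the criterion and that labels are 1 for dsQTLs and 0 for control SNPs; both are assumptions of this example rather than details stated in the legend.

```python
from typing import List, Sequence, Tuple


def precision_recall(scores: Sequence[float],
                     labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each SNP, ranked by decreasing |score|."""
    total_positive = sum(labels)
    if total_positive == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: -abs(scores[i]))
    curve = []
    true_positive = 0
    for rank, index in enumerate(order, start=1):
        true_positive += labels[index]
        curve.append((true_positive / total_positive, true_positive / rank))
    return curve
```

With 50 control SNPs for every dsQTL, a ranking with no predictive value would give a precision of about 1/51, or roughly 2%, at every recall level, which is the baseline against which such curves should be read.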

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7. (PDF 303 kb)

Supplementary Table 1

All predictions of the 574 dsQTL SNPs and the 27,735 control SNPs. (XLSX 3759 kb)

Supplementary Table 2

deltaSVM predictions of all possible point mutations in the Tyr and Tyrp1 enhancers. (XLSX 58 kb)

Supplementary Table 3

Experimental validation results of randomly selected deltaSVM predictions from the Tyr and Tyrp1 enhancers. (XLSX 11 kb)

Supplementary Table 4

deltaSVM predictions of all possible single point mutations in the ALDOB enhancer and the corresponding in vivo effect size. (XLSX 48 kb)

Supplementary Table 5

deltaSVM predictions of mutations in K562 and HepG2 enhancers. (XLSX 23 kb)

Supplementary Table 6

deltaSVM predictions of all 3,113 autoimmune disease–associated SNPs. (XLSX 155 kb)

About this article

Cite this article

Lee, D., Gorkin, D., Baker, M. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet 47, 955–961 (2015). https://doi.org/10.1038/ng.3331

  • DOI: https://doi.org/10.1038/ng.3331
