  • Review Article
Literature mining for the biologist: from information retrieval to biological discovery

Key Points

  • Literature-mining tools are becoming essential to researchers because of the growth of the scientific literature and the shift from studying individual genes and proteins to entire systems.

  • Currently, information-retrieval tools such as PubMed are by far the most commonly used literature-mining methods among biologists.

  • Methods for identifying the genes, proteins and other entities that are mentioned in the literature — known as entity recognition — are key components of most complex literature-mining systems.

  • Recently, methods for extracting biomedical facts from text have improved considerably. Such methods will probably soon become mainstream tools for the annotation and analysis of large-scale experimental data sets.

  • By combining facts that have been extracted from several papers, text-mining methods can both discover global trends and generate new hypotheses that are based on the existing literature.

  • To realize the full discovery potential of literature mining, it should be integrated with other data types. Protein networks are well suited for unifying large-scale experimental data with knowledge that has been extracted from the biomedical literature.

  • Data-integration methods have also been developed for ranking candidate genes for inherited diseases and for associating genes with phenotypic characteristics.

  • Bridging the gap between biologists and computational linguists will be crucial to the success of approaches that integrate literature mining with high-throughput experimental data. We hope that this review will inspire more biologists to become actively involved in the development of literature-mining tools.
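
The key points on entity recognition and on combining extracted facts can be illustrated with a minimal Python sketch. It assumes a tiny hand-made synonym dictionary and three invented sentences, none of which come from the review. It tags terms by dictionary lookup, records which terms are mentioned in the same sentence, and then proposes indirect links between terms that never co-occur but share an intermediate term. The names `SYNONYMS`, `recognize`, `cooccurrence_pairs` and `indirect_links` are hypothetical, and real systems are considerably more elaborate.

```python
import re
from itertools import combinations

# Hypothetical synonym dictionary: canonical name -> lowercase surface forms.
SYNONYMS = {
    "fish oil": ["fish oil", "dietary fish oil"],
    "blood viscosity": ["blood viscosity"],
    "Raynaud disease": ["raynaud disease", "raynaud's syndrome"],
}


def recognize(sentence):
    """Return the canonical entities whose synonyms occur in the sentence."""
    text = sentence.lower()
    found = set()
    for canonical, forms in SYNONYMS.items():
        for form in forms:
            # Word boundaries avoid matches inside longer words.
            if re.search(r"\b" + re.escape(form) + r"\b", text):
                found.add(canonical)
                break
    return found


def cooccurrence_pairs(sentences):
    """Collect unordered entity pairs that are mentioned in the same sentence."""
    pairs = set()
    for sentence in sentences:
        # Sorting gives every pair a single, consistent orientation.
        for a, b in combinations(sorted(recognize(sentence)), 2):
            pairs.add((a, b))
    return pairs


def indirect_links(pairs):
    """Propose (A, C, via B) where A-B and B-C co-occur but A-C never does."""
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    proposals = set()
    for b, linked in neighbours.items():
        for a, c in combinations(sorted(linked), 2):
            if (a, c) not in pairs:
                proposals.add((a, c, b))
    return proposals


# Invented example sentences, for illustration only.
SENTENCES = [
    "Dietary fish oil was reported to lower blood viscosity.",
    "Raynaud's syndrome is associated with increased blood viscosity.",
    "Blood viscosity was measured in all patients.",
]

PAIRS = cooccurrence_pairs(SENTENCES)
HYPOTHESES = indirect_links(PAIRS)
print(HYPOTHESES)
```

Here the only proposal links fish oil to Raynaud disease through blood viscosity, because those two terms never appear in the same sentence. A proposal of this kind is a candidate hypothesis to be checked, not an extracted fact.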

Abstract

For the average biologist, hands-on literature mining currently means a keyword search in PubMed. However, methods for extracting biomedical facts from the scientific literature have improved considerably, and the associated tools will probably soon be used in many laboratories to automatically annotate and analyse the growing number of system-wide experimental data sets. Owing to the increasing body of text and the open-access policies of many journals, literature mining is also becoming useful for both hypothesis generation and biological discovery. However, the latter will require the integration of literature and high-throughput data, which should encourage close collaborations between biologists and computational linguists.

Figure 1: Growth of Medline.
Figure 2: The current state of biomedical literature mining.
Figure 3: A literature-derived network for yeast.
Figure 4: Correlating phenotypes with genotypes.

Acknowledgements

The authors would like to thank T. Doerks and S. Hooper for help with figures, and other members of P.B.'s group at the European Molecular Biology Laboratory and of I. Rojas's group at EML Research for valuable discussions. J.S. is funded by the Klaus Tschira Foundation. This work was supported by grants from the European Community and the German Ministry for Education and Science through Nationales Genomforschungsnetz (NGFN).

Author information

Corresponding authors

Correspondence to Lars Juhl Jensen or Peer Bork.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Related links

DATABASES

OMIM

Alzheimer disease

Raynaud disease

FURTHER INFORMATION

An extended bibliography of biological literature-mining papers

Arrowsmith

Bitola

Ensembl

G2D

Genia

iHOP

HighWire Press

Medline

MedMiner

NCBI RefSeq

PennBioIE

PubMed

PubMed Central

STRING

Saccharomyces Genome Database

Textpresso

Glossary

Machine learning

The ability of a machine to learn from experience or extract knowledge from examples in a database. Artificial neural networks and support-vector machines are two commonly used types of machine-learning method.

Gene Ontology

A set of controlled vocabularies that are used to describe the molecular functions of a gene product, the biological processes in which it participates and the cellular components in which it can be found.

Syntax

The orderly manner in which words are put together to form phrases and sentences.

Semantics

The meaning that is implied by words and sentences. If an information-extraction method extracts the right facts from a sentence, it has interpreted the semantics correctly.

Anaphoric relationships

Back-references to previously mentioned entities. A protein that is mentioned in an earlier sentence might, for example, subsequently be referred to as 'it'.

Corpus

A collection of texts. A corpus might either consist of raw text only (for example, Medline) or be tagged so that, for example, gene and protein names are labelled (for example, GENIA).
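
To make the distinction between a raw and a tagged corpus concrete, here is a minimal Python sketch. The example sentence, the label names and the stand-off layout (character offsets plus a label) are assumptions chosen for illustration; they are not the actual formats of Medline or GENIA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # index of the first character of the mention
    end: int    # index one past the last character of the mention
    label: str  # hypothetical entity class, such as "protein" or "gene"


# Raw corpus: text only.
RAW_CORPUS = ["Msn2 activates the HSP104 gene in yeast."]

# Tagged corpus: the same text plus stand-off annotations.
TAGGED_CORPUS = [
    (RAW_CORPUS[0], [Annotation(0, 4, "protein"), Annotation(19, 25, "gene")]),
]


def mentions(text, annotations):
    """Return (surface form, label) for each annotated span of the text."""
    return [(text[a.start:a.end], a.label) for a in annotations]


for text, annotations in TAGGED_CORPUS:
    print(mentions(text, annotations))
```

Keeping the annotations separate from the text leaves the raw sentence untouched, so the same corpus can carry several layers of labels.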

Study bias

Study biases arise because some proteins (or other molecules) are more studied than others. For example, a protein that is known to be phosphorylated is also more likely to have been studied in other respects, and is therefore more likely to be known to be regulated by expression.
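
One hedged way to see the effect of study bias is to compare raw co-mention counts with a score that accounts for how often each term is mentioned overall. The sketch below uses pointwise mutual information as that score; this is an illustrative choice, not a method prescribed by this glossary, and every count and name in it is invented.

```python
import math

TOTAL_ABSTRACTS = 10_000  # hypothetical corpus size

# Hypothetical number of abstracts that mention each term.
MENTIONS = {
    "well_studied": 2_000,
    "little_studied": 20,
    "phosphorylation": 1_000,
}

# Hypothetical number of abstracts that mention both terms.
CO_MENTIONS = {
    ("well_studied", "phosphorylation"): 200,
    ("little_studied", "phosphorylation"): 10,
}


def pmi(a, b):
    """Pointwise mutual information: log2 of observed over expected co-mention rate."""
    p_a = MENTIONS[a] / TOTAL_ABSTRACTS
    p_b = MENTIONS[b] / TOTAL_ABSTRACTS
    p_ab = CO_MENTIONS[(a, b)] / TOTAL_ABSTRACTS
    return math.log2(p_ab / (p_a * p_b))


for pair, count in CO_MENTIONS.items():
    print(pair, "raw count:", count, "PMI:", round(pmi(*pair), 2))
```

With these made-up numbers, the well-studied protein has twenty times as many co-mentions, yet its score is close to zero because that many co-mentions are expected by chance. The little-studied protein scores above 2, which indicates a stronger association relative to how rarely it is mentioned.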

MeSH terms

A controlled vocabulary that is used for annotating Medline abstracts. Several classes of MeSH term exist, the most relevant for literature mining being 'Chemicals and Drugs' (MeSH-D) and 'Diseases' (MeSH-C).

Linkage mapping

A method for localizing genes that is based on the co-inheritance of genetic markers and phenotypes in families over several generations.

About this article

Cite this article

Jensen, L., Saric, J. & Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7, 119–129 (2006). https://doi.org/10.1038/nrg1768

  • DOI: https://doi.org/10.1038/nrg1768
