The BioText Project

University of California, Berkeley

Overview
    News

    PubMedCentral's Search Engine Adopts UI Changes Consistent with Recommendations from BioText Search Interface Research.

    Several BioText research papers (Bioinformatics and PLoS ONE) find evidence for the efficacy of showing figures in bioscience search results and allowing search over figure captions. An NLM Technical Bulletin from July 2011 describes the adoption of these features for PubMedCentral publications that have the necessary open intellectual property rights. (Perhaps the BioText talk at NLM in spring 2009 had some impact? We hope so!)

    Project Goals

    When the project began, new methods and tools were needed to improve how bioscience researchers search for and synthesize information from textual descriptions of bioscience research. This project built a flexible, efficient, platform-independent database infrastructure geared specifically towards the advanced and particular search needs of bioscience researchers. It used this infrastructure to support the development and deployment of statistical approaches to natural language processing, which were used to identify entities and the relations between them in bioscience texts.

    The BioText Search Engine

    This project also worked with bioscience researchers to develop intuitive, appealing interfaces for using these facilities to perform efficient and effective searches. The resulting BioText Search system supports new ways of asking scientific questions of the underlying databases.

    Visit the BioText Search Engine for viewing figures and captions in bioscience literature search.

Publications
  • Do Peers See More in a Paper than its Authors?, Preslav Nakov, Anna Divoli, and Marti A. Hearst, Advances in Bioinformatics, special issue on Literature Mining Solutions for Life Science Research, Volume 2012.   html  pdf
  • Full Text and Figure Display Improves Bioscience Literature Search, Anna Divoli, Michael Wooldridge, and Marti A. Hearst, PLoS ONE 5(4): e9619, April 2010.   html
  • BioText Search Engine: beyond abstract search, Marti A. Hearst, Anna Divoli, Harendra Guturu, Alex Ksikes, Preslav Nakov, Michael A. Wooldridge, and Jerry Ye, Bioinformatics 23(16):2196-2197, 2007. (Advance Access published on June 1, 2007.)   pdf
  • Exploring the efficacy of caption search for bioscience journal search interfaces, Marti Hearst, Anna Divoli, Michael Wooldridge, and Jerry Ye, in the proceedings of BioNLP 2007, ACL 2007 Workshop, Prague, Czech Republic.   pdf
  • Solving Relational Similarity Problems Using the Web as a Corpus, Preslav Nakov and Marti A. Hearst, in the Proceedings of ACL/HLT, 2008. pdf   poster
  • Evidence for Showing Gene/Protein Name Suggestions in Bioscience Literature Search, Anna Divoli, Marti A. Hearst, and Michael A. Wooldridge, in PSB 2008.  pdf
  • Improved Statistical Machine Translation Using Monolingual Paraphrases. Preslav Nakov. In Proceedings of the European Conference on Artificial Intelligence (ECAI'08), Patras, Greece, 2008. pdf
  • Paraphrasing Verbs for Noun Compound Interpretation. Preslav Nakov. In Proceedings of the Workshop on Multiword Expressions (MWE'08), in conjunction with the Language Resources and Evaluation conference, Marrakech, Morocco, 2008. pdf
  • Improving English-Spanish Statistical Machine Translation: Experiments with Domain Adaptation, Sentence-Level Paraphrasing, Tokenization, and Recasing, Preslav Nakov. In Proceedings of the Third Workshop on Statistical Machine Translation (WMT'08), in conjunction with ACL'2008. pdf
  • Noun Compound Interpretation Using Paraphrasing Verbs: Feasibility Study, Preslav Nakov. In Proceedings of the 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA'08), 2008. pdf
  • Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. Preslav Nakov (PhD Dissertation, 2008), Technical Report No. UCB/EECS-2007-173   html
  • Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding, Ariel Schwartz, Anna Divoli, and Marti Hearst, in the proceedings of EMNLP-CoNLL-2007, Prague, Czech Republic, 2007.   pdf
  • Showing Figures and Captions in the Biotext Journal Search Engine (Poster), Marti Hearst, Michael Wooldridge, Jerry Ye, and Anna Divoli, in the Proceedings of ISMB/ECCB 2007, Vienna, Austria. Poster pdf
  • Automating Creation of Hierarchical Faceted Metadata Structures, Emilia Stoica, Marti Hearst, and Megan Richardson, in the proceedings of NAACL-HLT, Rochester NY, April 2007. pdf
  • Posterior Decoding Methods for Optimization and Accuracy Control of Multiple Alignments, Ariel Schwartz (PhD Dissertation, 2007), Technical Report No. UCB/EECS-2007-39.   html abstract   pdf
  • BioText Report for the Second BioCreAtIvE Challenge, Preslav Nakov and Anna Divoli, in the Proceedings of BioCreAtIvE II Workshop, Madrid, Spain, April 23-25, 2007. pdf
  • UCB System Description for the WMT 2007 Shared Task, Preslav Nakov and Marti Hearst, in Proceedings of Second Workshop on Statistical Machine Translation co-located with ACL-2007, Prague, June 23, 2007. pdf
  • UCB System Description for SemEval Task #4, Preslav Nakov and Marti Hearst, in the Proceedings of SemEval-2007 Workshop co-located with ACL-2007, Prague, June 23-24, 2007. pdf
  • SemEval-2007 Task 04: Classification of Semantic Relations between Nominals, Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, Deniz Yuret, in the Proceedings of SemEval-2007 Workshop co-located with ACL-2007, Prague, June 23-24, 2007. pdf
  • Using Verbs to Characterize Noun-Noun Relations, Preslav Nakov and Marti Hearst, in the Proceedings of the Twelfth International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA), Bulgaria, September 2006. pdf
  • Predicting Gene Functions from Text Using a Cross-Species Approach, Emilia Stoica and Marti Hearst, in the 2006 Pacific Symposium on Biocomputing (PSB'06), Maui, HI.   pdf
  • Summarizing Key Concepts Using Citation Sentences, Ariel Schwartz and Marti Hearst, Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology (poster), at HLT-NAACL, 2006. pdf
  • Biotext Team Report for the TREC 2006 Genomics Track, Divoli et al., Proceedings of TREC 2006, Gaithersburg, MD, November 2006. pdf
  • Extraction of semantic relations from bioscience text, Barbara Rosario (PhD Dissertation), UC Berkeley, 2005. pdf
  • Multi-way Relation Classification: Application to Protein-Protein Interaction, Barbara Rosario and Marti Hearst, in HLT/EMNLP'05, Vancouver, 2005.   pdf
  • Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution, Preslav Nakov and Marti Hearst, in HLT/EMNLP'05, Vancouver, 2005.   pdf
  • A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies, Preslav Nakov and Marti Hearst, in RANLP'05, Borovets, Bulgaria, 2005.   pdf
  • Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing, Preslav Nakov and Marti Hearst, in CoNLL-2005, Ann Arbor, MI, 2005.   pdf
  • Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing, Preslav Nakov, Ariel Schwartz, Brian Wolf, and Marti Hearst, in ACL/ISMB BioLINK SIG 2005, Detroit, MI, 2005.   pdf
  • Supporting Annotation Layers for Natural Language Processing, Preslav Nakov, Ariel Schwartz, Brian Wolf, and Marti Hearst, in ACL'05 Poster/Demo Track, Ann Arbor, MI, 2005.   pdf
  • Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti Hearst, in ACL'04, Barcelona, 2004.   pdf
  • Citances: Citation Sentences for Semantic Analysis of Bioscience Text, Preslav Nakov, Ariel Schwartz, and Marti Hearst, in the SIGIR'04 workshop on Search and Discovery in Bioinformatics.   pdf
  • BioText Team Experiments for the TREC 2004 Genomics Track, Preslav Nakov, Ariel S. Schwartz, Emilia Stoica, Marti A. Hearst, Proceedings of TREC 2004, Gaithersburg, MD, 2005. pdf
  • Tools for loading Medline into a local relational database, Diane E. Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst, BMC Bioinformatics 2004 (7 Oct 2004). Available at BioMedCentral
  • Nearly-Automated Metadata Hierarchy Creation, Emilia Stoica and Marti Hearst, in HLT-NAACL'04, Companion Volume, Boston, May 2004. pdf
  • BioText Team Report for the TREC 2003 Genomics Track, Gaurav Bhalotia, Preslav Nakov, Ariel S. Schwartz, Marti A. Hearst, Proceedings of TREC 2003, Gaithersburg, MD. pdf
  • Category-based Pseudowords, Preslav Nakov and Marti Hearst, in the Companion Volume of the Proceedings of HLT-NAACL'03, Edmonton, Canada, May 2003. pdf
  • A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel Schwartz and Marti Hearst, in the proceedings of the Pacific Symposium on Biocomputing (PSB 2003) Kauai, Jan 2003. pdf
  • The Descent of Hierarchy, and Selection in Relational Semantics, Barbara Rosario, Marti Hearst, and Charles Fillmore, in ACL-02, July 2002. pdf   ps
  • Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy, Barbara Rosario and Marti Hearst, in the Proceedings of EMNLP '01, Pittsburgh, PA, June 2001.   pdf   ps
Talks
  • PubMed Should Search Full Text and Show Figures in Search Results, Marti Hearst, National Library of Medicine, Bethesda, MD, June 2009.
  • Text, Tags, and Thumbnails: Latest Trends in Bioscience Literature Search, Tutorial presented by Marti Hearst at the Pharmaceutical and Health Division of the SLA, Spring Meeting, March 2009. ppt (12.8M)  pdf (6.5M)
  • Caption Search for Bioscience Literature Search Interfaces, ACL Workshop in BioNLP, June 29, 2007. ppt
  • Castanet: Using Wordnet to Build Facet Hierarchies, NAACL-HLT 2007.  ppt  
  • Predicting Gene Functions from Text Using a Cross-Species Approach, Pacific Symposium on Biocomputing, PSB 2006. ppt
  • Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution, HLT/EMNLP'05, October 2005.   ppt
  • Classifying Semantic Relations in Bioscience Text, ACL-04, July 2004. ppt
  • Biotext Team Report for TREC 2003 Genomics Track, November 2003. ppt
  • Biotext Project Overview, Myers Seminar, UC Berkeley, September 2003. ppt
  • Interfaces for Intense Information Analysis, IBM Workshop on The User Experience of Business Intelligence and Knowledge Management, March 2002. ppt
  • A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Pacific Symposium on Biocomputing, PSB'03 ppt
  • The Descent of Hierarchy, and Selection in Relational Semantics, ACL-02, July 2002. ppt
  • Category-Based Pseudowords, HLT-NAACL'03 ppt

Software
Data
  • Data for abbreviation recognition.   Download
  • Data for protein-protein interactions.   Download
  • Data for relation recognition.   Download
  • Data for noun compound relation recognition.   Download

Funding

    This research was supported by NSF grant DBI-0317510 as well as a gift from Genentech, an NSF ITR grant (EIA-0122599, part of the CITRIS project), and an ARDA AQUAINT contract.