Biotext

Project Goals

New methods and tools are needed to improve how bioscience researchers search for and synthesize information from textual descriptions of bioscience research. We are building a flexible, efficient, platform-independent database system infrastructure specifically geared towards supporting the advanced and particular search needs of bioscience researchers. We are using this infrastructure to support the development and deployment of statistical approaches to natural language processing, which will identify entities and the relations between them in bioscience texts. This will in turn facilitate more effective search and synthesis. We are working with bioscience researchers to develop intuitive, appealing interfaces for using these facilities to perform efficient and effective searches. The resulting system will support new ways of asking scientific questions of the underlying databases, and new tools for assembling the pieces of bioscience puzzles. Visit the BioText Search Engine to view figures and captions in bioscience literature search results.

Team Members

Publications

    Full Text and Figure Display Improves Bioscience Literature Search, Anna Divoli, Michael Wooldridge, and Marti A. Hearst, PLoS ONE 5(4): e9619, April 2010.   html

    BioText Search Engine: beyond abstract search, Marti A. Hearst, Anna Divoli, Harendra Guturu, Alex Ksikes, Preslav Nakov, Michael A. Wooldridge, and Jerry Ye, Bioinformatics 23(16):2196-2197, 2007. (Advance Access published on June 1, 2007.)   pdf

    Exploring the efficacy of caption search for bioscience journal search interfaces, Marti Hearst, Anna Divoli, Michael Wooldridge, and Jerry Ye, in the proceedings of BioNLP 2007, ACL 2007 Workshop, Prague, Czech Republic.   pdf

    Solving Relational Similarity Problems Using the Web as a Corpus, Preslav Nakov and Marti A. Hearst, in the Proceedings of ACL/HLT, 2008. pdf   poster

    Evidence for Showing Gene/Protein Name Suggestions in Bioscience Literature Search, Anna Divoli, Marti A. Hearst, and Michael A. Wooldridge, in PSB 2008.  pdf

    Improved Statistical Machine Translation Using Monolingual Paraphrases. Preslav Nakov. In Proceedings of the European Conference on Artificial Intelligence (ECAI'08), Patras, Greece, 2008. pdf

    Paraphrasing Verbs for Noun Compound Interpretation. Preslav Nakov. In Proceedings of the Workshop on Multiword Expressions (MWE'08), in conjunction with the Language Resources and Evaluation conference, Marrakech, Morocco, 2008. pdf

    Improving English-Spanish Statistical Machine Translation: Experiments with Domain Adaptation, Sentence-Level Paraphrasing, Tokenization, and Recasing, Preslav Nakov. In Proceedings of the Third Workshop on Statistical Machine Translation (WMT'08), in conjunction with ACL'2008. pdf

    Noun Compound Interpretation Using Paraphrasing Verbs: Feasibility Study, Preslav Nakov. In Proceedings of the 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA'08), 2008. pdf

    Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. Preslav Nakov (PhD Dissertation, 2008), Technical Report No. UCB/EECS-2007-173   html

    Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding, Ariel Schwartz, Anna Divoli, and Marti Hearst, in the proceedings of EMNLP-CoNLL-2007, Prague, Czech Republic, 2007.   pdf

    Showing Figures and Captions in the Biotext Journal Search Engine (Poster), Marti Hearst, Michael Wooldridge, Jerry Ye, and Anna Divoli, in the Proceedings of ISMB/ECCB 2007, Vienna, Austria. Poster pdf

    Automating Creation of Hierarchical Faceted Metadata Structures, Emilia Stoica, Marti Hearst, and Megan Richardson, in the proceedings of NAACL-HLT, Rochester NY, April 2007. pdf

    Posterior Decoding Methods for Optimization and Accuracy Control of Multiple Alignments, Ariel Schwartz (PhD Dissertation, 2007), Technical Report No. UCB/EECS-2007-39.   html abstract   pdf

    BioText Report for the Second BioCreAtIvE Challenge, Preslav Nakov and Anna Divoli, in the Proceedings of BioCreAtIvE II Workshop, Madrid, Spain, April 23-25, 2007. pdf

    UCB System Description for the WMT 2007 Shared Task, Preslav Nakov and Marti Hearst, in Proceedings of Second Workshop on Statistical Machine Translation co-located with ACL-2007, Prague, June 23, 2007. pdf

    UCB System Description for SemEval Task #4, Preslav Nakov and Marti Hearst, in the Proceedings of SemEval-2007 Workshop co-located with ACL-2007, Prague, June 23-24, 2007. pdf

    SemEval-2007 Task 04: Classification of Semantic Relations between Nominals, Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, Deniz Yuret, in the Proceedings of SemEval-2007 Workshop co-located with ACL-2007, Prague, June 23-24, 2007. pdf

    Using Verbs to Characterize Noun-Noun Relations, Preslav Nakov and Marti Hearst, in the Proceedings of the Twelfth International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA), Bulgaria, September 2006. pdf

    Predicting Gene Functions from Text Using a Cross-Species Approach, Emilia Stoica and Marti Hearst, in the 2006 Pacific Symposium on Biocomputing (PSB'06), Maui, HI.   pdf

    Summarizing Key Concepts Using Citation Sentences, Ariel Schwartz and Marti Hearst, Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology (poster), at HLT-NAACL, 2006. pdf

    Biotext Team Report for the TREC 2006 Genomics Track, Divoli et al., Proceedings of TREC 2006, Gaithersburg, MD, November 2006. pdf

    Extraction of semantic relations from bioscience text, Barbara Rosario (PhD Dissertation), UC Berkeley, 2005. pdf

    Multi-way Relation Classification: Application to Protein-Protein Interaction, Barbara Rosario and Marti Hearst, in HLT/EMNLP'05, Vancouver, 2005.   pdf

    Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution, Preslav Nakov and Marti Hearst, in HLT/EMNLP'05, Vancouver, 2005.   pdf

    A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies, Preslav Nakov and Marti Hearst, in RANLP'05, Borovets, Bulgaria, 2005.   pdf

    Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing, Preslav Nakov and Marti Hearst, in CoNLL-2005, Ann Arbor, MI, 2005.   pdf

    Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing, Preslav Nakov, Ariel Schwartz, Brian Wolf, and Marti Hearst, in ACL/ISMB BioLINK SIG 2005, Detroit, MI, 2005.   pdf

    Supporting Annotation Layers for Natural Language Processing, Preslav Nakov, Ariel Schwartz, Brian Wolf, and Marti Hearst, in ACL'05 Poster/Demo Track, Ann Arbor, MI, 2005.   pdf

    Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti Hearst, in ACL'04, Barcelona, 2004.   pdf

    Citances: Citation Sentences for Semantic Analysis of Bioscience Text, Preslav Nakov, Ariel Schwartz, and Marti Hearst, in the SIGIR'04 workshop on Search and Discovery in Bioinformatics.   pdf

    BioText Team Experiments for the TREC 2004 Genomics Track, Preslav Nakov, Ariel S. Schwartz, Emilia Stoica, Marti A. Hearst, Proceedings of TREC 2004, Gaithersburg, MD, 2005. pdf

    Tools for loading Medline into a local relational database, Diane E. Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst, BMC Bioinformatics 2004 (7 Oct 2004). Available at BioMedCentral

    Nearly-Automated Metadata Hierarchy Creation, Emilia Stoica and Marti Hearst, in HLT-NAACL'04, Companion Volume, Boston, May 2004. pdf

    BioText Team Report for the TREC 2003 Genomics Track, Gaurav Bhalotia, Preslav Nakov, Ariel S. Schwartz, Marti A. Hearst, Proceedings of TREC 2003, Gaithersburg, MD. pdf

    Category-based Pseudowords, Preslav Nakov and Marti Hearst, in the Companion Volume of the Proceedings of HLT-NAACL'03, Edmonton, Canada, May 2003. pdf

    A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel Schwartz and Marti Hearst, in the proceedings of the Pacific Symposium on Biocomputing (PSB 2003) Kauai, Jan 2003. pdf

    The Descent of Hierarchy, and Selection in Relational Semantics, Barbara Rosario, Marti Hearst, and Charles Fillmore, in ACL-02, July 2002. pdf   ps

    Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy, Barbara Rosario and Marti Hearst, in the Proceedings of EMNLP '01, Pittsburgh, PA, June 2001.   pdf   ps

    Untangling Text Data Mining, Marti Hearst, Proceedings of ACL'99, the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999. html

Talks

    Text, Tags, and Thumbnails: Latest Trends in Bioscience Literature Search, Tutorial presented by Marti Hearst at the Pharmaceutical and Health Division of the SLA, Spring Meeting, March 2009. ppt (12.8M)  pdf (6.5M)

    Caption Search for Bioscience Literature Search Interfaces, ACL Workshop in BioNLP, June 29, 2007. ppt

    Castanet: Using Wordnet to Build Facet Hierarchies, NAACL-HLT 2007.  ppt  

    Predicting Gene Functions from Text Using a Cross-Species Approach, Pacific Symposium on Biocomputing, PSB 2006. ppt

    Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution, HLT-NAACL'05, October, 2005.   ppt

    Classifying Semantic Relations in Bioscience Text, ACL-04, July 2004. ppt

    Biotext Team Report for TREC 2003 Genomics Track, November 2003. ppt

    Biotext Project Overview, Myers Seminar, UC Berkeley, September 2003. ppt

    Interfaces for Intense Information Analysis, IBM Workshop on The User Experience of Business Intelligence and Knowledge Management, March 2002. ppt

    A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Pacific Symposium on Biocomputing, PSB'03. ppt

    The Descent of Hierarchy, and Selection in Relational Semantics, ACL-02, July 2002. ppt

    Category-Based Pseudowords, HLT-NAACL'03. ppt

Live Search Interface

Software

Data

    Data for abbreviation recognition.   Download

    Data for protein-protein interactions.   Download

    Data for relation recognition.   Download

    Data for noun compound relation recognition.   Download

Funding

    This research is supported by NSF grant DBI-0317510 as well as a gift from Genentech, an NSF ITR grant (EIA-0122599, part of the CITRIS project), and an ARDA AQUAINT contract.

Logo design by Anita Wilhelm.

Highlights

    BioText Project awarded $840,000 from the National Science Foundation.