
BioText Data

This web page contains links to training and testing sets for various research results produced by the BioText project.

    Recognizing Abbreviation Definitions

      Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

        A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel Schwartz and Marti Hearst, in the proceedings of the Pacific Symposium on Biocomputing (PSB 2003) pdf

      To develop this collection, 1000 MEDLINE abstracts were randomly selected from the results of a query on the term "yeast". These were then hand-tagged, producing a list of 954 correct abbreviation-definition pairs.

      The dataset was first annotated by a researcher in computational and biosciences. The data was further verified by comparing any questionable pairs against other occurrences of the same abbreviation in other abstracts, using the web site provided by Chang, Schuetze, and Altman 2002. A pair extracted by the Schwartz and Hearst algorithm is considered correct only if it exactly matches a pair labeled in the dataset.

    • Unlabeled data
    • Labeled data
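
      The exact-match criterion described above can be sketched in a few lines of Python. This is only an illustration, not the scoring script used in the paper: the function name, the (short form, long form) tuple representation, and the choice to report precision and recall are assumptions, and the on-disk format of the labeled file is not modeled here.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (short form, long form), e.g. ("ORF", "open reading frame").
Pair = Tuple[str, str]


def score_exact_match(extracted: Iterable[Pair], labeled: Iterable[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) where a pair counts as correct only if it
    is identical to a labeled pair. No normalization is applied, so any
    difference in case, spacing, or wording makes the pair incorrect.
    Note: sets are used, so repeated pairs are counted once (an assumption)."""
    extracted_set: Set[Pair] = set(extracted)
    labeled_set: Set[Pair] = set(labeled)
    correct = extracted_set & labeled_set
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(labeled_set) if labeled_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented toy pairs for illustration; they are not taken from the dataset.
    gold = [("ATP", "adenosine triphosphate"), ("ORF", "open reading frame")]
    found = [("ATP", "adenosine triphosphate"), ("ORF", "open reading frames")]
    print(score_exact_match(found, gold))
```

      In the toy example the second extracted pair differs from the labeled one by a single trailing "s", so it is counted as incorrect under exact matching.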

    Protein-Protein Interaction Data

      Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

        Multi-way Relation Classification: Application to Protein-Protein Interaction, Barbara Rosario and Marti Hearst, in HLT-NAACL'05, Vancouver, 2005. pdf

      The dataset was annotated by a researcher in computational and biosciences. The paper above describes how the data was extracted. Each record has the following format: interaction_type====PaperPubMedID_Prot1_ID_Prot2_ID==>sentence with proteins labeled|| .....

    • Sentences from full papers
    • Citation sentences
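
      A minimal Python sketch for splitting one record of the format given above into its fields follows. It rests on several assumptions that this page does not confirm: that "====" and "==>" each occur once as separators, that the three identifiers are joined by single underscores and that only the last one may itself contain an underscore, and that whatever follows the first "||" can be kept as an uninterpreted remainder. The class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class InteractionRecord(NamedTuple):
    interaction_type: str
    pubmed_id: str
    protein1_id: str
    protein2_id: str
    sentence: str
    remainder: str  # text after the first "||"; its meaning is not specified here


def parse_record(line: str) -> Optional[InteractionRecord]:
    """Parse one record, or return None if the assumed separators are missing."""
    label, sep, rest = line.partition("====")
    if not sep:
        return None
    ids, sep, text = rest.partition("==>")
    if not sep:
        return None
    # Assumption: PubMed ID, protein 1 ID, protein 2 ID, joined by underscores.
    parts = ids.split("_", 2)
    if len(parts) != 3:
        return None
    sentence, _, remainder = text.partition("||")
    return InteractionRecord(
        label.strip(), parts[0].strip(), parts[1].strip(), parts[2].strip(),
        sentence.strip(), remainder.strip(),
    )


if __name__ == "__main__":
    # Invented example line; the label, IDs, and sentence are not from the dataset.
    example = "binds====12345_P1_P2==>PROT1 binds PROT2 in yeast.|| rest"
    print(parse_record(example))
```

      Check the separators and identifier layout against the downloaded files before relying on this, since the actual files may differ from the assumptions above.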

    Relations between DISEASE/TREATMENT Entities

      Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

        Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti A. Hearst, in the proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, July 2004. pdf

        Information about, and links to, the files

    Noun Compound Semantics

      Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

        Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy.
        Barbara Rosario and Marti Hearst.
        Proceedings of 2001 Conference on Empirical Methods in Natural Language Processing, Pittsburgh, PA (EMNLP 2001).

      The following files contain all the labeled noun compounds (NCs) used in the experiments described in the paper above.

      For questions about the data, please email rosario@sims.berkeley.edu