BioText Data
This web page contains links to training and testing sets for various research results produced by the BioText project.
Recognizing Abbreviation Definitions
To develop this collection,
1000 MEDLINE abstracts were randomly selected from the results of a query on the
term "yeast". These were then hand tagged, producing a list of 954 correct
The dataset was first annotated by a researcher in computational and
biosciences. The data was further verified by comparing any questionable pairs
against other occurrences of the same abbreviation in other abstracts, using the
web site provided by Chang,
Schuetze, and Altman 2002. A pair extracted by the Schwartz and Hearst
algorithm is considered correct only if it exactly matches a pair labeled in the
dataset.
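The exact-match criterion above can be illustrated with a short sketch. It assumes each pair is stored as a (short form, long form) tuple of strings and reports precision and recall. The representation, the function name `score_pairs`, and the choice of metrics are illustrative assumptions, not part of the dataset or of the Schwartz and Hearst implementation.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (short form, long form); this layout is an assumption


def score_pairs(extracted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) under exact matching.

    An extracted pair counts as correct only if the identical
    (short form, long form) tuple appears among the labeled pairs.
    """
    extracted_set = set(extracted)
    gold_set = set(gold)
    correct = extracted_set & gold_set  # exact string equality on both fields

    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up pairs for illustration; they are not taken from the dataset.
gold_pairs = [("ORF", "open reading frame"), ("ABC", "ATP-binding cassette")]
found_pairs = [("ORF", "reading frame"), ("ABC", "ATP-binding cassette")]

precision, recall = score_pairs(found_pairs, gold_pairs)
print(precision, recall)
```

Because matching is exact, the partial long form "reading frame" earns no credit in this example even though it overlaps the labeled definition.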
Protein-Protein Interaction Data
The dataset was annotated by a researcher in computational and biosciences. In the paper above we describe how we extracted the data. The format is the following:
interaction_type====PaperPubMedID_Prot1_ID_Prot2_ID==>sentence with proteins labeled|| .....
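As a rough illustration of reading this format, the sketch below splits one record into its fields. It makes several assumptions the page does not confirm: that each record sits on one line, that "||" separates multiple sentences belonging to the same key, and that the PubMed and protein IDs contain no underscores. The names `InteractionRecord` and `parse_record` are hypothetical, and the sample line is made up.

```python
from typing import List, NamedTuple


class InteractionRecord(NamedTuple):
    interaction_type: str
    pubmed_id: str
    protein1_id: str
    protein2_id: str
    sentences: List[str]


def parse_record(line: str) -> InteractionRecord:
    """Parse one line shaped like
    interaction_type====PubMedID_Prot1ID_Prot2ID==>sentence|| sentence ...
    """
    # Split off the interaction type at the first "====".
    interaction_type, rest = line.split("====", 1)
    # Split the ID key from the sentence text at the first "==>".
    key, text = rest.split("==>", 1)
    # Assumption: exactly three underscore-separated IDs.
    # A key with extra underscores raises ValueError here.
    pubmed_id, protein1_id, protein2_id = key.strip().split("_")
    # Assumption: "||" delimits sentences; empty fragments are dropped.
    sentences = [part.strip() for part in text.split("||") if part.strip()]
    return InteractionRecord(
        interaction_type.strip(), pubmed_id, protein1_id, protein2_id, sentences
    )


# Made-up record for illustration; it is not taken from the dataset.
sample = "binds====12345_P1_P2==>PROT1 binds PROT2 in yeast.|| PROT1 was found with PROT2."
record = parse_record(sample)
print(record.interaction_type, record.pubmed_id, len(record.sentences))
```

If the real IDs do contain underscores, the key would need a different splitting rule, so it is worth checking a few lines of the actual files before relying on this.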
Relations between DISEASE/TREATMENT Entities
Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:
Noun Compound Semantics
Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:
Barbara Rosario and Marti Hearst. Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy. Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Pittsburgh, PA.
The following files contain all the labeled noun compounds (NCs) used in the experiments described in the paper. For questions about the data, please email rosario@sims.berkeley.edu