Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:
Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti A. Hearst, in the proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, July 2004.
These files were obtained from MEDLINE 2001 using the first 100 titles
and the first 40 abstracts from the 59 files medline01n*.xml in Medline
2001. No keywords of any sort were used to retrieve the documents.
The annotator, Kaichi Sung, a (former) UC Berkeley SIMS master's student with a biological background, read the titles and abstracts separately and labeled the text sentence by sentence.
I (Barbara Rosario) decided to concentrate on the semantic roles TREATMENT and DISEASE and I asked Kaichi to see how many different types of relationships could be found between these two roles. She came up with 8 types of relationships (see below) and labeled the text accordingly. She writes:
"I labeled sentences based solely on the content of that individual sentence and not other sentences in the same abstract. Sometimes reading the abstract helped me figure out what was going on in general, especially when the disease or treatment names were obscure or weird or abbreviated. But overall I tried to ensure that a labeled relation within a sentence was not dependent on other sentences around it and could stand on its own."
We did not specify an exact labeling convention, and this has produced some inconsistency in the data. For example, in "ovarian cancer" only "cancer" was labeled as a DISEASE, but in another sentence containing "breast cancer" both words were labeled as a DISEASE; in "non-recurrent cancer of the cervix" only "non-recurrent cancer" is a DISEASE, but in "complicated cancer of the large bowel" the whole phrase is a DISEASE. In some cases the different notations may reflect the different importance and emphasis of the concepts in the sentences; in others, they may simply be mistakes.
The annotation was done without reference to any syntactic information or other constraints, and this also gives rise to some inconsistencies in the labeling. For example, in "The <DIS> lesion </DIS> was resected by..." only part of the noun phrase "The lesion" was labeled as a DISEASE and the determiner was left out, while in "<TREAT> the paravertebral block </TREAT>" the whole NP was labeled.
I also retained the sentences that were not found to contain the entities and relationships of interest, and in my experiments I try to distinguish between relevant and non-relevant sentences. The non-relevant sentences come from the same population of abstracts and titles as the relevant ones, so relevant and non-relevant (also called positive and negative) sentences can be very similar to each other, since they discuss the very same concepts.
This section describes the types of semantic relations that were found to occur between the semantic classes TREATMENT and DISEASE. A few examples of each relationship are shown below.
Kaichi Sung writes: ``To label a sentence as `cure', the treatment has to cure the disease or it is meant to cure it but might still be in testing (e.g., clinical trials). On more thought I wonder if these two relationships should actually be separated into two relationships. This might be useful due to the obvious difference between a treatment that has been shown to be effective clinically versus a treatment that is still being tested or was inconclusive. We decided for the moment to have only one relation for these two concepts''.
<label> means that the word that follows it is the first word of the entity, and </label> means that the word that precedes it is the last word of the entity.
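The tag convention just described can be read mechanically. The following is a minimal sketch, not the code used for the experiments. It assumes that each entity is wrapped as "<TAG> words </TAG>" with a single-word tag name, as in the two fragments quoted earlier; the helper names extract_entities and strip_labels are hypothetical, and labels containing a space (such as <TO SEE>) are not handled.

```python
import re

# Assumption: an entity looks like "<TAG> some words </TAG>", with the same
# single-word tag name on both sides. The real files may use other tag names.
ENTITY = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")


def extract_entities(sentence):
    """Return a list of (tag, text) pairs, one per labeled entity."""
    return [(m.group(1), m.group(2)) for m in ENTITY.finditer(sentence)]


def strip_labels(sentence):
    """Remove opening and closing tags and collapse the leftover whitespace."""
    plain = re.sub(r"</?\w+>", " ", sentence)
    return " ".join(plain.split())


if __name__ == "__main__":
    # The two fragments quoted in the description of the labeling inconsistencies.
    for fragment in (
        "The <DIS> lesion </DIS> was resected by...",
        "<TREAT> the paravertebral block </TREAT>",
    ):
        print(extract_entities(fragment), "|", strip_labels(fragment))
```

Because the pattern keeps whatever lies between the tags, the determiner inconsistency described above is preserved ("lesion" versus "the paravertebral block"); any normalization of entity boundaries would have to be a separate step.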
Some examples for this relation:
When a treatment was not mentioned in the sentence (other entities
may have been present). Some examples:
When a disease was not mentioned in the sentence (other entities may
have been present). Some examples:
When there is a clear implication that a <TREAT> will prevent
a <DIS>. This might be inherent in the definition of the treatment,
e.g. a vaccine works by preventing a disease from occurring, or explicitly
stated, often with the words ``prevent'' or ``prevention of''. Also
seen are the phrases ``reduce incidents'', ``reduce rates of'', and ``reduction
in rates...'', because these also imply that disease events are being
When a DISEASE is a result of a TREATMENT. The cause/effect relationship
should be explicitly stated or at least very clearly
implied or hypothesized. Usually in ``side effect'' sentences (as in ``link'' sentences) there is a timeline element, because the DISEASE occurs after some TREATMENT. Examples:
When there is semantically a very unclear relationship between a TREATMENT and a DISEASE. It can be either a TREATMENT that affects a DISEASE or something associated with the condition of a DISEASE or, not as often, a DISEASE that has some sort of effect on a TREATMENT.
Does NOT Cure
When a TREATMENT that is meant to cure a DISEASE does not work. Unfortunately (and, in my view, surprisingly), we found only 4 instances for this relationship:
We did not include in our experiments the more complex sentences (labeled as <TO SEE>) that incorporate more than one relationship, often with multiple entities or with the same entities taking part in several interconnected relationships. For example, in the first sentence, we have a ``cure'' relationship between oral fludarabine and the DISEASE chronic lymphocytic leukemia, but also a ``side effect'', progressive multifocal leukoencephalopathy. In the second one, we have a treatment that cures and one that doesn't. We found 75 such sentences.
The following file contains all the labeled sentences used in the experiments described in the paper Classifying Semantic Relations in Bioscience Text.
The following files contain the abstracts and titles from Medline with the labels (these are only part of the data used in the experiments).