The BioText Project

University of California, Berkeley

Data for research on relations between DISEASE/TREATMENT entities

Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti A. Hearst, in the proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, July 2004.

Data Provenance

These files were obtained from MEDLINE 2001, using the first 100 titles and the first 40 abstracts from the 59 files medline01n*.xml. No keywords of any sort were used to retrieve the documents.

Labeling Procedure

The annotator, Kaichi Sung, a former UC Berkeley SIMS master's student with a background in biology, considered the titles and abstracts separately and labeled the text sentence by sentence.

I (Barbara Rosario) decided to concentrate on the semantic roles TREATMENT and DISEASE and I asked Kaichi to see how many different types of relationships could be found between these two roles. She came up with 8 types of relationships (see below) and labeled the text accordingly. She writes:

    "I labeled sentences based solely on the content of that individual sentence and not other sentences in the same abstract. Sometimes reading the abstract helped me figure out what was going on in general, especially when the disease or treatment names were obscure or weird or abbreviated. But overall I tried to ensure that a labeled relation within a sentence was not dependent on other sentences around it and could stand on its own."

We did not specify an exact labeling convention, and this has produced some inconsistency in the data. For example, in "ovarian cancer" only "cancer" was labeled as a DISEASE, but in another sentence with "breast cancer" both words were labeled as the DISEASE; in "non-recurrent cancer of the cervix" only "non-recurrent cancer" is a DISEASE, but in "complicated cancer of the large bowel" the whole phrase is a DISEASE. In some cases the different notations may be due to the different importance and emphasis of the concepts in the sentences; in others, they may simply be mistakes.

The annotation was done without reference to any syntactic information or any other constraints, and this also gives rise to some inconsistencies in the labeling. For example, in "The <DIS> lesion </DIS> was resected by..." only part of the noun phrase "The lesion" was labeled as a DISEASE and the determiner was left out, while in "<TREAT> the paravertebral block </TREAT>" the whole NP was labeled.
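Anyone comparing entity spans across sentences may want to smooth over the determiner inconsistency described above. The following is a minimal sketch of one way to do that, not something the authors describe doing; the function name `normalize_span` and the short determiner list are assumptions made for illustration.

```python
# Hypothetical helper: the determiner list is an assumption, not part of the dataset.
DETERMINERS = {"the", "a", "an"}

def normalize_span(span_text):
    """Drop leading determiners from a labeled span, keeping at least one token."""
    tokens = span_text.split()
    while len(tokens) > 1 and tokens[0].lower() in DETERMINERS:
        tokens = tokens[1:]
    return " ".join(tokens)

print(normalize_span("the paravertebral block"))
print(normalize_span("lesion"))
```

With this normalization, "the paravertebral block" and "paravertebral block" compare as equal. It does not address the other inconsistency noted earlier (head noun versus full phrase).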

I also retained the sentences that were not found to contain the entities and relationships of interest, and in my experiments I try to distinguish between relevant and non-relevant sentences. The non-relevant sentences come from the same population of abstracts and titles as the relevant ones, so relevant and non-relevant (also called positive and negative) sentences can be very similar to each other, since they discuss the very same concepts.


This section describes the types of semantic relations that were found to occur between the semantic classes TREATMENT and DISEASE. A few examples are shown below for each relationship.


Cure

Kaichi Sung writes: "To label a sentence as 'cure', the treatment has to cure the disease or it is meant to cure it but might still be in testing (e.g., clinical trials). On more thought I wonder if these two relationships should actually be separated into two relationships. This might be useful due to the obvious difference between a treatment that has been shown to be effective clinically versus a treatment that is still being tested or was inconclusive. We decided for the moment to have only one relation for these two concepts".

<label> means that the word that follows it is the first word of the entity, and </label> that the word that precedes it is the last word of the entity.

Some examples for this relation:

  • OBJECTIVES : <DIS> Obesity </DIS> is an important clinical problem , and the use of <TREAT> dexfenfluramine hydrochloride </TREAT> for weight reduction has been widely publicized since its approval by the Food and Drug Administration .
  • <TREAT> Antibiotics </TREAT> prescribed for <DIS> sore throat </DIS> during the previous year had an additional effect ( hazard ratio 1.69 , 1.20 to 2.37 ) .
  • <TREAT> Intravenous immune globulin </TREAT> for <DIS> recurrent spontaneous abortion </DIS> .
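The tag convention above can be read with a small regular expression. This is a minimal sketch, assuming every opening tag is closed by a tag with the identical label and that tags are never nested; the names `TAG_RE`, `extract_entities`, and `strip_tags` are hypothetical and not part of the distributed data.

```python
import re

# <LABEL> text </LABEL>; the backreference \1 requires the closing tag
# to repeat the opening label exactly (labels may contain spaces).
TAG_RE = re.compile(r"<([A-Z][A-Z ]*)>\s*(.*?)\s*</\1>", re.DOTALL)

def extract_entities(sentence):
    """Return (label, entity_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(sentence)]

def strip_tags(sentence):
    """Return the sentence with tags removed and whitespace collapsed."""
    plain = re.sub(r"</?[A-Z][A-Z ]*>", " ", sentence)
    return " ".join(plain.split())

example = ("<TREAT> Intravenous immune globulin </TREAT> for "
           "<DIS> recurrent spontaneous abortion </DIS> .")
print(extract_entities(example))
print(strip_tags(example))
```

Because the label pattern allows spaces, the same expression also handles labels such as `TREAT PREV` or `DIS SIDE EFF`.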

Only Disease

When a treatment was not mentioned in the sentence (other entities may have been present). Some examples:

  • The objective of this study was to determine if the rate of <DISONLY> preeclampsia </DISONLY> is increased in triplet as compared to twin gestations.
  • <DISONLY> Down syndrome </DISONLY> (12 cases) and <DISONLY> Edward syndrome </DISONLY> (11 cases) were the most common <DISONLY> trisomies </DISONLY> , while 4 cases of <DISONLY> Patau syndrome </DISONLY> were also diagnosed.
  • <DISONLY> Chronic pancreatitis </DISONLY> and <DISONLY> carcinoma of the pancreas </DISONLY>

Only Treatment

When a disease was not mentioned in the sentence (other entities may have been present). Some examples:

  • Patients were randomly assigned either <TREATONLY> roxithromycin </TREATONLY> 150 mg orally twice a day (n = 102) or placebo orally twice a day (n = 100).
  • <TREATONLY> Heterologous vaccines : </TREATONLY> proponent sparks some interest.
  • Meta-analysis of trials comparing <TREATONLY> antidepressants </TREATONLY> with active placebos.


Prevent

When there is a clear implication that a <TREAT> will prevent a <DIS>. This might be inherent in the definition of the treatment, e.g. a vaccine works by preventing a disease from occurring, or explicitly stated, often with the words "prevent" or "prevention of". Also seen are the phrases "reduce incidents", "reduce rates of", or "reduction in rates...", because these also imply that disease events are being prevented. Examples:

  • We investigated the hypothesis that <TREAT PREV> an antichlamydial macrolide antibiotic , roxithromycin </TREAT PREV> , can prevent or reduce recurrent major ischaemic events in patients with <DIS PREV> unstable angina </DIS PREV>.
  • Immunogenicity of <DIS PREV> hepatitis B </DIS PREV> <TREAT PREV> vaccine </TREAT PREV> in term and preterm infants.
  • <TREAT PREV> Modified bra </TREAT PREV> in the prevention of <DIS PREV> mastitis </DIS PREV> in nursing mothers

Side Effect

When a DISEASE is a result of a TREATMENT. The cause/effect relationship should be explicitly stated or at least very clearly implied or hypothesized. Usually in "side effect" sentences (as in "link" sentences) there is a timeline element, because the DISEASE occurs after some TREATMENT. Examples:

  • Initially, all eyes that had <TREAT SIDE EFF> optic capture </TREAT SIDE EFF> without <TREAT SIDE EFF> vitrectomy </TREAT SIDE EFF> also remained clear, but after 6 months, four of five developed <DIS SIDE EFF> opacification </DIS SIDE EFF>
  • Appetite suppressants-most commonly <TREAT SIDE EFF> fenfluramines </TREAT SIDE EFF> -increase the risk of developing <DIS SIDE EFF> PPH </DIS SIDE EFF> ( odds ratio , 6.3 ) , particularly when used for more than 3 months (odds ratio , > 20)
  • The most common toxicity is <DIS SIDE EFF> bone pain </DIS SIDE EFF> , and other reactions such as <DIS SIDE EFF> inflammation </DIS SIDE EFF> at the site of <TREAT SIDE EFF> injection </TREAT SIDE EFF> have also occurred .


Vague

When the relationship between a TREATMENT and a DISEASE is semantically very unclear. It can be either a TREATMENT that affects a DISEASE or something associated with the condition of a DISEASE or, not as often, a DISEASE that has some sort of effect on a TREATMENT.

  • Testing for <DIS VAG> Helicobacter pylori infection </DIS VAG> after <TREAT VAG> antibiotic treatment </TREAT VAG>
  • <DIS VAG> Hyponatremia </DIS VAG> with <TREAT VAG> venlafaxine </TREAT VAG>
  • <TREAT VAG> Hormone replacement therapy </TREAT VAG> and <DIS VAG> breast cancer </DIS VAG>
  • Acute effect of <TREAT VAG> lorazepam </TREAT VAG> on respiratory muscles in patients with <DIS VAG> chronic obstructive pulmonary disease </DIS VAG>

Does NOT Cure

When a TREATMENT that is meant to cure a DISEASE does not work. Unfortunately (and, in my view, surprisingly), we found only 4 instances of this relationship:

  • More of those initially prescribed <TREAT NO> antibiotics </TREAT NO> initially returned to the surgery with <DIS NO> sore throat </DIS NO>.
  • To avoid medicalising a self limiting illness doctors should avoid <TREAT NO> antibiotics </TREAT NO> or offer a delayed prescription for most patients with <DIS NO> sore throat </DIS NO> .
  • <TREAT NO> Subcutaneous injection of irradiated LLC-IL2 </TREAT NO> did not affect the growth of preexisting <DIS NO> s.c. tumors </DIS NO> and also did not improve survival of mice bearing the <DIS NO> lung or peritoneal tumors </DIS NO>
  • Evidence for double resistance to <TREAT NO> permethrin and malathion </TREAT NO> in <DIS NO> head lice </DIS NO>
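The labels seen in the examples so far combine a role (DIS or TREAT) with an optional relation suffix. The sketch below makes that reading explicit. The dictionary and the function name `split_label` are assumptions built only from the labels shown in this document; the data files may contain other labels, in which case the lookup raises `KeyError`.

```python
# Relation names follow this document's section titles; the mapping itself
# is an assumption inferred from the example sentences.
RELATION_BY_SUFFIX = {
    "": "cure",
    "PREV": "prevent",
    "SIDE EFF": "side effect",
    "VAG": "vague",
    "NO": "does not cure",
}

def split_label(label):
    """Split a label such as 'DIS SIDE EFF' into (role, relation)."""
    if label == "DISONLY":
        return "DIS", "only disease"
    if label == "TREATONLY":
        return "TREAT", "only treatment"
    role, _, suffix = label.partition(" ")
    return role, RELATION_BY_SUFFIX[suffix]

for label in ["TREAT", "DIS PREV", "DIS SIDE EFF", "TREAT VAG", "DIS NO", "DISONLY"]:
    print(label, "->", split_label(label))
```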


We did not include in our experiments the more complex sentences (labeled as <TO SEE>) that incorporate more than one relationship, often with multiple entities or with the same entities taking part in several interconnected relationships. For example, in the first sentence below there is a "cure" relationship between the TREATMENT oral fludarabine and the DISEASE chronic lymphocytic leukemia, but also a "side effect", progressive multifocal leukoencephalopathy. In the second, the same treatment cures one disease but does not cure others. We found 75 such sentences.

  • <DIS SIDE EFF> Progressive multifocal leukoencephalopathy </DIS SIDE EFF> following <TREAT> oral fludarabine </TREAT> treatment of <DIS> chronic lymphocytic leukemia </DIS> .
  • <TREAT> Intraperitoneal injection of irradiated LLC-IL2 </TREAT> cured <DIS> pre-existing LLC peritoneal tumors </DIS> and extended the survival of the mice but did not affect survival of mice bearing <DIS NO> lung tumors </DIS NO> nor did it affect the growth of <DIS NO> s.c. tumors </DIS NO> .
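Sentences like the two above can be spotted by checking whether their tags carry more than one relation suffix. This is a heuristic sketch only: according to the text, such sentences are marked <TO SEE> in the data, and nothing here reflects how the authors identified them. The names `OPEN_TAG_RE`, `relation_suffixes`, and `is_complex` are hypothetical.

```python
import re

# Opening tags only; closing tags start with "</" and are not matched.
OPEN_TAG_RE = re.compile(r"<([A-Z][A-Z ]*)>")

def relation_suffixes(sentence):
    """Collect the distinct relation suffixes used by the tags in a sentence."""
    suffixes = set()
    for label in OPEN_TAG_RE.findall(sentence):
        if label in ("DISONLY", "TREATONLY"):
            suffixes.add("ONLY")
        else:
            # "DIS SIDE EFF" -> "SIDE EFF"; plain "DIS" or "TREAT" -> ""
            suffixes.add(label.partition(" ")[2])
    return suffixes

def is_complex(sentence):
    """True when the tags in the sentence carry more than one relation suffix."""
    return len(relation_suffixes(sentence)) > 1

sentence = ("<DIS SIDE EFF> Progressive multifocal leukoencephalopathy </DIS SIDE EFF> "
            "following <TREAT> oral fludarabine </TREAT> treatment of "
            "<DIS> chronic lymphocytic leukemia </DIS> .")
print(relation_suffixes(sentence), is_complex(sentence))
```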

Data Files

The following file contains all the labeled sentences used in the experiments described in the paper Classifying Semantic Relations in Bioscience Text.

The following files contain the abstracts and titles from Medline with the labels (these are only part of the data used in the experiments).