Introduction to the Layered Query Language

Layered Query Language (LQL) is a language developed by Preslav Nakov, Ariel Schwartz, Brian Wolf and Marti Hearst of the Berkeley BioText project. Its goal is to be an intuitive, simple way to query for ranges of text from a database of documents. The text of these documents must first be annotated using any natural language processing method. The processing done by the BioText group creates several annotation layers that allow powerful querying of the text.

Examples of domain-independent layers that can be built on any collection of documents are the sentence layer, the full_parse layer, the shallow_parse layer, the part-of-speech (pos) layer, and the word layer. The current annotation method does not include the word layer, as it is indistinguishable from the pos layer. The BioText group has developed several layers for annotating biology texts, including the gene/protein layer, the MeSH layer (labeled the ontology layer in the following diagram), and the chemical layer. See diagram below.

An annotation can be thought of as a range of text that also comes with certain properties. Every annotation has the following properties: layer (the annotation layer into which the annotation was categorized by the natural language processor), start_char_pos (the index of the first character), end_char_pos (the index after the last character), tag_type, and text (the text from the original document that occurs between start_char_pos and end_char_pos). Also, as the BioText group annotates PubMed documents, every annotation has the property pmid.
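
As a rough illustration of this description, an annotation can be modeled as a small record whose text is derived from its two character offsets. The sketch below is in Python rather than LQL. The class name, the sample document, the tag_type value, and the pmid value are all made up for illustration and are not taken from the BioText implementation.

```python
from dataclasses import dataclass

# Hypothetical sample document, used only for this illustration.
DOCUMENT = "Jane attends UC Berkeley. She studies biology."


@dataclass(frozen=True)
class Annotation:
    layer: str            # e.g. 'sentence' or 'pos'
    start_char_pos: int   # index of the first character
    end_char_pos: int     # index just after the last character
    tag_type: str         # placeholder value below; real values are not shown in this text
    pmid: int             # PubMed ID of the source document (made up below)

    def text(self, document: str) -> str:
        # The text property is the slice of the document between the two offsets.
        return document[self.start_char_pos:self.end_char_pos]


first_sentence = Annotation("sentence", 0, 25, "S", 12345)
print(first_sentence.text(DOCUMENT))
```

Because end_char_pos is the index after the last character, the two offsets can be used directly as a half-open slice.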

So, for example, if you wanted a table of all of the sentences in all of the documents in the database, your query would look like this:

FROM [layer='sentence'] AS sen
SELECT sen.text

In the above example, the expression [layer='sentence'] matches all ranges of text whose start and end positions coincide with the start_char_pos and end_char_pos attributes of some sentence annotation. The AS sen binds the variable sen to each of these ranges, and the statement SELECT sen.text says to return a table containing the content of all of the ranges of text bound to sen.

The expression layer='sentence' is a test on the annotation; it must evaluate to true for the range to match. More complex tests are possible.

FROM [layer='sentence' && text ~ '%Berkeley%'] AS sen
SELECT sen.text

In LQL, the operator ~ is bound to the LIKE operator in SQL. Thus, the character '%' matches any sequence of characters of any length, and the character '_' matches any single character. The above query, therefore, would return a table of all sentences containing the string "Berkeley".
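
The wildcard rules of SQL LIKE can be mimicked in a few lines of Python. This is only a sketch of the matching rule, not of how LQL evaluates ~. The function name like is hypothetical, escape characters are not handled, and the comparison is case-sensitive by assumption (in a real database that depends on the implementation).

```python
import re


def like(text, pattern):
    """Return True if text matches a SQL LIKE-style pattern (sketch only)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any sequence of characters, possibly empty
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # every other character matches itself
    # LIKE must match the whole value, hence fullmatch rather than search.
    return re.fullmatch("".join(parts), text, re.DOTALL) is not None


print(like("She attends UC Berkeley.", "%Berkeley%"))
print(like("cart", "c_t"))
```

The leading and trailing '%' in '%Berkeley%' are what turn a whole-value match into a "contains" test.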

A test on the text property with the ~ operator would be one way to retrieve all sentences containing the word "Berkeley." However, the current implementation of the annotation database stores only the text of whole documents, not the text of each annotation. Thus, although the text property is available in the SELECT clause (after the annotations have been matched and processed), it cannot be part of the test on a layer. Fortunately, LQL allows ranges to be nested. To obtain a table of all sentences containing the word "Berkeley", write the following query:

FROM [layer='sentence'
       [layer='pos' && content='Berkeley']
     ] AS berk_sen
SELECT berk_sen.text

This query uses the content property, which is unique to the pos layer. The query asserts that there is a pos annotation whose content is "Berkeley" (with exactly that case) contained within the range of a sentence annotation. The text of each such sentence is returned.
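
One way to picture the nesting is as containment of character offsets. The Python sketch below assumes that "contained within" means the inner annotation's offsets fall inside the outer annotation's offsets. The helper names and the sample annotations are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    layer: str
    start_char_pos: int
    end_char_pos: int
    content: Optional[str] = None  # only set for pos annotations in this sketch


def contains(outer, inner):
    # Assumption: containment is judged purely on character offsets.
    return (outer.start_char_pos <= inner.start_char_pos
            and inner.end_char_pos <= outer.end_char_pos)


def sentences_with_word(annotations, word):
    sentences = [a for a in annotations if a.layer == "sentence"]
    words = [a for a in annotations if a.layer == "pos" and a.content == word]
    return [s for s in sentences if any(contains(s, w) for w in words)]


DOCUMENT = "Jane attends UC Berkeley. She studies biology."
ANNOTATIONS = [
    Annotation("sentence", 0, 25),
    Annotation("sentence", 26, 46),
    Annotation("pos", 16, 24, "Berkeley"),
    Annotation("pos", 30, 37, "studies"),
]

matched = sentences_with_word(ANNOTATIONS, "Berkeley")
texts = [DOCUMENT[s.start_char_pos:s.end_char_pos] for s in matched]
print(texts)
```

As in the query, the sentence text is only read from the document after matching. The match itself uses the content property of the pos annotation.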

The pos layer also has a content_lower property, which exists because of how string comparisons work: the case-sensitivity of a test depends on the underlying database implementation, and in our implementation content is compared in a case-sensitive manner. The content_lower property holds the result of converting all of the letters in the content property to lower case, so that tests can ignore case.

Putting double quotes (") around a string causes the test to compare the string not to the property that the user specified, but to a property for which it acts as an alias. For example, when the string is in double quotes, content is an alias for content_lower. Thus, the test content="berkeley" has exactly the same results as the test content_lower='berkeley'. In other words, both of these tests will match a word whose content property is "Berkeley", "berkeley", "BERKELEY", or "bErKeLeY", as every word with any of these as its content property has "berkeley" as its content_lower property.
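
The aliasing rule can be sketched in Python as follows. The function name and the double_quoted flag are hypothetical stand-ins for the choice between content="..." and content='...'. The sketch does not lower-case the literal itself, because the text above does not say whether LQL does so.

```python
def matches_content(word, literal, double_quoted):
    """Sketch of a test on content for a single pos annotation."""
    content = word                 # stored exactly as it appears in the document
    content_lower = word.lower()   # lower-cased copy of content
    if double_quoted:
        # content="..." is evaluated as content_lower='...'
        return content_lower == literal
    # content='...' is compared case-sensitively
    return content == literal


for w in ["Berkeley", "berkeley", "BERKELEY", "bErKeLeY"]:
    print(w, matches_content(w, "berkeley", double_quoted=True))
```

With single quotes, only the exact spelling matches: in this sketch matches_content("BERKELEY", "Berkeley", double_quoted=False) is false.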

Any range may contain multiple internal ranges.

FROM [layer='sentence'
       [layer='pos' && content="attends"] AS attends
       [layer='pos' && content='Berkeley']
     ] AS attendance
SELECT attendance.text, attends.content

(Note that the double quotes around "attends" cause the test for that annotation to be interpreted as layer='pos' && content_lower='attends'. Thus, the value returned by attends.content in the SELECT clause may contain some capitalized letters.)

The preceding query selects all sentences which contain the word "attends" immediately followed by the word "Berkeley". This is the default behavior when multiple ranges are asserted to occur within the same enclosing range: they must be adjacent and in the order specified. The behavior can be modified.

FROM [layer='sentence' { ALLOW GAPS }
       [layer='pos' && content="attends"]
       [layer='pos' && content='Berkeley']
     ] AS attendance
SELECT attendance.text

The above query specifies that the ranges contained within the sentence annotation need not be adjacent for the query to match. Thus, it will match any sentence that contains the word "Berkeley" somewhere after an occurrence of the word "attends." To find all sentences containing both words in either order, use the following query:

FROM [layer='sentence' { NO ORDER, ALLOW GAPS }
       [layer='pos' && content="attends"]
       [layer='pos' && content='Berkeley']
     ] AS attendance
SELECT attendance.text
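
The three behaviors seen so far (the default, ALLOW GAPS, and NO ORDER with ALLOW GAPS) can be compared side by side in a small Python sketch. It treats a sentence as a plain list of words and compares words exactly, ignoring the content_lower aliasing for simplicity. The function names and sample sentences are hypothetical, and this is not how LQL is implemented.

```python
from itertools import permutations


def match_in_order(tokens, targets, allow_gaps):
    if allow_gaps:
        # Targets must appear in order, but other words may come between them.
        position = 0
        for target in targets:
            try:
                position = tokens.index(target, position) + 1
            except ValueError:
                return False
        return True
    # No gaps: targets must appear as one contiguous run somewhere in the sentence.
    n = len(targets)
    return any(tokens[i:i + n] == targets for i in range(len(tokens) - n + 1))


def matches(tokens, targets, order=True, allow_gaps=False):
    # Without ORDER, every arrangement of the targets is acceptable.
    candidates = [targets] if order else permutations(targets)
    return any(match_in_order(tokens, list(c), allow_gaps) for c in candidates)


s1 = ["Jane", "attends", "Berkeley"]
s2 = ["Jane", "attends", "UC", "Berkeley"]
s3 = ["Berkeley", "is", "where", "Jane", "attends", "class"]
targets = ["attends", "Berkeley"]

for s in (s1, s2, s3):
    print(matches(s, targets),
          matches(s, targets, allow_gaps=True),
          matches(s, targets, order=False, allow_gaps=True))
```

In this sketch only s1 matches under the default behavior, s2 additionally matches once gaps are allowed, and s3 matches only when the order is also relaxed.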

Say you want to find all sentences containing the word "attends" and the phrase "UC Berkeley."

FROM [layer='sentence' { NO ORDER, ALLOW GAPS }
       [layer='pos' && content="attends"]
       [layer='pos' && content='UC']
       [layer='pos' && content='Berkeley']
     ] AS attendance
SELECT attendance.text

This query may return unintended results, because it does not require the word "UC" to immediately precede the word "Berkeley." The way to solve this problem is to introduce a new range which will default back to enforcing a sequential order.

FROM [layer='sentence' { NO ORDER, ALLOW GAPS }
       [layer='pos' && content="attends"]
       [layer='shallow_parse' && tag_name='NP'
         [layer='pos' && content='UC']
         [layer='pos' && content='Berkeley']
       ]
     ] AS attendance
SELECT attendance.text

Here, the range matching the shallow_parse reverts to the default behavior { ORDER, NO GAPS }, so this query does require the words "UC" and "Berkeley" to be adjacent. However, this is a very verbose way to get around the adjacency problem, and it also requires the sentence to be parsed such that the words "UC" and "Berkeley" appear in the same noun phrase shallow parse annotation. (tag_name is a property of the shallow_parse and pos layers.)

FROM [layer='sentence' { NO ORDER, ALLOW GAPS }
       [layer='pos' && content="attends"]
       ( [layer='pos' && content='UC']
         [layer='pos' && content='Berkeley']
       )
     ] AS attendance
SELECT attendance.text

The parentheses in the above query create an artificial range. This range acts like an annotation range in that it can contain nested ranges, and it acts as though it has the start_char_pos and end_char_pos properties for the purposes of testing the order and sequentiality of other ranges nested in its parent range. However, it should not be given a name (with the AS operator), and its properties cannot be tested or returned in the SELECT clause. The artificial range has the default behavior { ORDER, NO GAPS }.
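
The effect of the artificial range can be sketched in Python by treating a parenthesized group as a single item that must occur as a contiguous, ordered run, while the enclosing sentence is matched with NO ORDER and ALLOW GAPS. Sentences are modeled as plain word lists compared exactly. All names and sample sentences are hypothetical.

```python
from itertools import permutations


def find_from(tokens, item, start):
    """Return the index just past the first occurrence of item at or after start, else None."""
    # A plain string is a single word; a tuple is a group with ORDER and NO GAPS.
    words = [item] if isinstance(item, str) else list(item)
    for i in range(start, len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return i + len(words)
    return None


def matches_no_order_allow_gaps(tokens, items):
    # Try every ordering of the items; within an ordering, gaps are allowed.
    for ordering in permutations(items):
        position = 0
        for item in ordering:
            position = find_from(tokens, item, position)
            if position is None:
                break
        else:
            return True
    return False


a = ["She", "attends", "UC", "Berkeley"]
b = ["UC", "students", "say", "Berkeley", "attends", "to", "them"]

print(matches_no_order_allow_gaps(b, ["attends", "UC", "Berkeley"]))
print(matches_no_order_allow_gaps(b, ["attends", ("UC", "Berkeley")]))
print(matches_no_order_allow_gaps(a, ["attends", ("UC", "Berkeley")]))
```

Sentence b is the kind of unintended result the flat three-word query allows. Grouping "UC" and "Berkeley" rules it out while still accepting sentence a.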

FROM [layer='sentence' { NO ORDER, ALLOW GAPS }
       [layer='pos' && content="attends"]
       [layer='shallow_parse' && tag_name='NP'
         [layer='pos' && content='UC']
       ] AS school
     ] AS sentence
SELECT school.text, sentence.text

The above query is designed to find all sentences stating that someone attends some UC, and return the name of that campus (assumed to be the noun phrase containing the word "UC") as well as the sentence matched.

Just to show a little more of the power of the language:

FROM [layer='sentence' { NO ORDER, ALLOW GAPS }
       [layer='pos' && content="attends"]
       [layer='shallow_parse' && tag_name='NP'
         [layer='pos' && content='UC']
         [layer='pos' && ( content='Berkeley'
                        || content='Davis'
                        || content='Irvine'
                        || content='Los'       -- it is expected that this will be followed by 'Angeles'
                        || content='Merced'
                        || content='Riverside'
                        || content='San'       -- it is expected that this will be followed by either 'Diego' or 'Francisco'
                        || content='Santa'     -- it is expected that this will be followed by either 'Barbara' or 'Cruz'
                         )
         ] AS city
       ] AS campus
     ] AS sentence
SELECT city.content, campus.text, sentence.text

Note the comments in the above query, which use the same comment syntax as SQL: two hyphens '--' begin a comment that lasts until the end of the line.

The special characters ^ and $ can be used as in regular expressions when specifying internal ranges.

FROM [layer='sentence'
       [layer='pos' && content="university"] $
     ] AS s
SELECT s.text

This query will return all sentences ending with the word "university."

FROM [layer='sentence'
       ^ [layer='pos' && content="class"]
     ] AS sen
SELECT sen.text

This query will return all sentences beginning with the word "class."
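
The two anchors can be pictured with a short Python sketch over word lists. Both queries use double quotes, so the comparison goes through the lower-cased form, which matters for a sentence-initial "Class". The sketch assumes that sentence-final punctuation is not counted as a word. Whether the pos layer annotates punctuation separately is not stated here, and the function names are hypothetical.

```python
def begins_with(tokens, word_lower):
    # Corresponds to ^ : the first inner range starts where the sentence starts.
    return bool(tokens) and tokens[0].lower() == word_lower


def ends_with(tokens, word_lower):
    # Corresponds to $ : the last inner range ends where the sentence ends.
    return bool(tokens) and tokens[-1].lower() == word_lower


print(begins_with(["Class", "begins", "at", "noon"], "class"))
print(ends_with(["She", "teaches", "at", "the", "university"], "university"))
print(ends_with(["The", "university", "is", "large"], "university"))
```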

As described above, it is possible to create domain-dependent layers. One such layer is the MeSH layer. For example:

FROM [layer='shallow_parse' && tag_name="NP"
       [layer='pos' && tag_name="noun"
         [layer='mesh' && tree_number BELOW 'A01']
       ] AS m1
       [layer='pos' && tag_name="noun"
         [layer='mesh' && tree_number BELOW 'A07']
       ] AS m2 $
     ]
SELECT m1.content, m2.content

This query looks for two adjacent nouns in the same noun phrase, the first of which falls within the A01 sub-hierarchy of the MeSH hierarchy, which happens to be Body Regions, and the second of which has been categorized in the A07 sub-hierarchy, which is Cardiovascular System. The second noun must be the last word in the noun phrase, as indicated by the '$' in the query.
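
MeSH tree numbers are dot-separated paths, so a BELOW test can be pictured as a prefix check on whole path components. The Python sketch below assumes that a node counts as being below itself. The text above does not say whether BELOW is inclusive, and the function name is hypothetical.

```python
def is_below(tree_number, ancestor):
    """Sketch of tree_number BELOW ancestor for dot-separated MeSH tree numbers."""
    # Assumption: the ancestor node itself also satisfies the test.
    if tree_number == ancestor:
        return True
    # Require a '.' after the prefix so that only whole components are compared.
    return tree_number.startswith(ancestor + ".")


print(is_below("A01.378.610", "A01"))
print(is_below("A07.541", "A01"))
print(is_below("A010.1", "A01"))  # made-up number, shown only to motivate the '.' check
```

A plain string-prefix test without the dot would wrongly accept the made-up number in the last line.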

Something else to note here: the MeSH query above asserts that each MeSH term should occur within a noun pos annotation. Since the purpose of this nesting is simply to ensure that the MeSH term matched is a single noun, the first inner range could equally have been written as [layer='MeSH' && tree_number BELOW 'A01' ^ [layer='pos' && tag_name="noun"] $]. The point here is that the annotations overlap, and the annotation ranges in the query are both intended to match the same word, so it doesn't matter in which order they are nested in the query.

Regarding overlap, there are tentative plans to handle overlapping ranges of text directly. This functionality has not been completely developed, but it may become possible to rewrite the above query as something resembling the following:

FROM [layer='shallow_parse' && tag_name="NP"
       ( { FULL_OVERLAP }
         [layer='pos' && tag_name="noun"] AS m1
         [layer='MeSH' && tree_number below 'A01'] )
       ( { FULL_OVERLAP }
         [layer='pos' && tag_name="noun"] AS m2
         [layer='MeSH' && tree_number below 'A07'] )
       $
     ]
SELECT m1.content, m2.content

The idea here is that the two sub-ranges of each artificial range are required to overlap completely, i.e., they must cover the same range of text.

Last updated: 2005-06-28 17:06