Protein-Protein Interactions Demo:

LQL, SQL and Execution Results


This query looks for potential protein-protein interactions by matching triples of the form PROTEIN...VERB...PROTEIN, in this order and with any number of intervening words. Each protein must lie within an NP (from the shallow parse layer) and must end at the same character position as the NP (ensured by the $ symbol). Further, it must come from the gene layer, i.e., be listed in LocusLink (genes and their corresponding proteins often share the same name, and the distinction between them is often elided), and must appear in the MeSH hierarchy as a protein (which is equivalent to being below the D12.776 subhierarchy). The ^ and $ symbols ensure that the gene and the MeSH layers span the same text fragment. Note that because we are working with spanning intervals, the layers can be switched: for the first protein the MeSH layer is nested inside the gene layer, while for the second one the order is reversed, but the two forms are equivalent.

SELECT lql.p1, lql.verb, lql.p2, COUNT(*) AS cnt
FROM (
   BEGIN_LQL
     [layer='sentence' { ALLOW GAPS }
         [layer='shallow_parse' && tag_name='NP'
           [layer='gene'
             ^ [layer="mesh" && tree_number BELOW 'D12.776'] AS p1 $
           ] $
         ]
         [layer='pos' && tag_name="verb" && (content~"activate%" || content~"inhibit%" || content~"bind%") ] AS verb
         [layer='shallow_parse' && tag_name="NP"
           [layer="mesh" && tree_number BELOW "D12.776"
             ^ [layer='gene'] $
           ] AS p2 $
         ]
     ]
     SELECT p1.content AS p1, verb.content AS verb, p2.content AS p2
   END_LQL
) AS lql
GROUP BY lql.p1, lql.verb, lql.p2
ORDER BY cnt DESC
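
The span constraints in the query above can be illustrated outside LQL. The following Python sketch is not part of the LQL system: the Span type, the helper names and the character offsets are all assumptions made for illustration. It models each annotation as a character interval and shows what the $ anchor (inner span ends where the enclosing span ends) and the ^ ... $ pair (both spans cover the same text) require.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical annotation: a layer name plus a character interval."""
    layer: str
    start: int  # first character offset (inclusive)
    end: int    # end character offset (exclusive)


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies anywhere inside outer (no anchors)."""
    return outer.start <= inner.start and inner.end <= outer.end


def ends_with(outer: Span, inner: Span) -> bool:
    """True if inner is inside outer and ends where outer ends ($ anchor)."""
    return contains(outer, inner) and inner.end == outer.end


def same_extent(a: Span, b: Span) -> bool:
    """True if a and b cover exactly the same characters (^ and $ together)."""
    return a.start == b.start and a.end == b.end


# Made-up annotations over a made-up sentence fragment.
text = "wild-type p53 binds MDM2"
np_span = Span("shallow_parse", 0, 13)   # "wild-type p53"
gene_span = Span("gene", 10, 13)         # "p53"
mesh_span = Span("mesh", 10, 13)         # "p53"

if __name__ == "__main__":
    print(ends_with(np_span, gene_span))      # gene ends at the NP boundary
    print(same_extent(gene_span, mesh_span))  # gene and MeSH spans coincide
    print(same_extent(mesh_span, gene_span))  # nesting order does not matter
```

Because same_extent is symmetric, it does not matter which of the two layers is written as the outer one, which is why the two protein patterns in the query are equivalent.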

The verb should be a form of activate, inhibit or bind, e.g., inhibit, inhibits, binding, activated, etc. (Of course, this simple way of handling morphological variants can produce false positives, e.g., inhibition or inhibitors. Some of these will be filtered out by the part-of-speech mismatch.)

The % symbol here stands for zero or more characters, as in SQL. It is interpreted as a wildcard within the scope of the ~ operator. The conditions are combined with boolean operators such as || and &&. Double quotes denote a case-insensitive match, so inhibit, Inhibits and INHIBITED will all be matched. Single quotes denote a case-sensitive comparison; in some cases the two can be used interchangeably, as the example query shows. We use double quotes for tag_name="verb" because it is a macro, which expands to tag_name="VB%", i.e., to VB, VBZ, VBD, etc. Note that the verb is from the POS layer, while the elements before and after it are shallow parse NPs. This is an example of ordering between elements from different layers. Finally, { ALLOW GAPS } allows for intervening words between the verb and the proteins.
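
The matching rules described above can be approximated in a few lines of Python. This is only a sketch of the described behaviour, not the LQL implementation: the function names lql_like and expand_tag and the TAG_MACROS table are hypothetical, and the table contains only the one macro expansion mentioned in the text.

```python
import re

# Hypothetical macro table; only the expansion stated in the text is included.
TAG_MACROS = {"verb": "VB%"}


def expand_tag(tag: str) -> str:
    """Replace a macro name such as "verb" with its wildcard pattern."""
    return TAG_MACROS.get(tag.lower(), tag)


def lql_like(value: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Emulate a ~ comparison in which % matches zero or more characters.

    case_sensitive=False mimics a double-quoted pattern,
    case_sensitive=True mimics a single-quoted one.
    """
    # Escape the literal parts, then join them with '.*' in place of each %.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    flags = re.DOTALL if case_sensitive else re.DOTALL | re.IGNORECASE
    return re.fullmatch(regex, value, flags) is not None


if __name__ == "__main__":
    for word in ["inhibit", "Inhibits", "INHIBITED", "inhibitors", "binding"]:
        hit = any(lql_like(word, p) for p in ("activate%", "inhibit%", "bind%"))
        print(word, hit)
    # The POS condition is what rejects nouns such as "inhibitors".
    print(lql_like("NNS", expand_tag("verb")))
    print(lql_like("VBZ", expand_tag("verb")))
```

In this sketch the content test alone accepts a noun like inhibitors; it is the separate tag_name condition, expanded to VB%, that rejects it.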

The query returns the contents of the two MeSH layers and of the POS layer, but we could also have selected the NP, the gene or the sentence layers. The actual LQL query is enclosed between BEGIN_LQL and END_LQL, and additional SQL functions are allowed over the LQL selection.
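
To make the role of the outer SQL concrete, here is a minimal Python sketch of what GROUP BY ... COUNT(*) ... ORDER BY cnt DESC does to the rows selected by the inner LQL block. The rows below are invented placeholders, not actual query output, and the function name rank_triples is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (p1, verb, p2)


def rank_triples(rows: List[Triple]) -> List[Tuple[Triple, int]]:
    """Count identical (p1, verb, p2) rows and sort by count, highest first."""
    return Counter(rows).most_common()


# Placeholder rows standing in for the inner SELECT's output.
rows = [
    ("protA", "binds", "protB"),
    ("protC", "inhibits", "protD"),
    ("protA", "binds", "protB"),
]

if __name__ == "__main__":
    for (p1, verb, p2), cnt in rank_triples(rows):
        print(p1, verb, p2, cnt)
```

Each distinct triple appears once in the output together with the number of sentences fragments that produced it, which is the shape of the result table the demo query returns.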

See the automatically generated SQL query for the LQL statement above.

See the results of query execution.

The results are not very good: many proteins are predicted to interact with themselves. This is because in a long sentence a protein tends to be mentioned multiple times, and we placed no constraint on how far these mentions can be from the verb. In addition, only 227 results were returned, mainly because of the redundant requirement that entities identified as genes in LocusLink also be listed as proteins in MeSH, which contains a much smaller set of proteins.





Simplified Protein-Protein Interactions Demo Query

LQL, SQL and Execution Results


We can improve the accuracy of the extracted triples by disallowing gaps. This requires the proteins and the verb to follow each other immediately, which lowers the recall. To compensate, we can also remove the two MeSH layers from the query, which express a redundant requirement anyway (the two NPs are already limited to entities from the gene/protein layer). This produces 91 triples.


SELECT lql.p1, lql.verb, lql.p2, COUNT(*) AS cnt
FROM (
   BEGIN_LQL
     [layer='sentence'
         [layer='shallow_parse' && tag_name='NP'
           [layer='gene'] AS p1 $
         ]
         [layer='pos' && tag_name="verb" && (content~"activate%" || content~"inhibit%" || content~"bind%") ] AS verb
         [layer='shallow_parse' && tag_name="NP"
           [layer='gene'] AS p2 $
         ]
     ]
     SELECT p1.content AS p1, verb.content AS verb, p2.content AS p2
   END_LQL
) AS lql
GROUP BY lql.p1, lql.verb, lql.p2
ORDER BY cnt DESC

See the automatically generated SQL query for the LQL statement above.

See the query execution results.
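
The effect of dropping { ALLOW GAPS } can be illustrated with a toy model in Python. This is a deliberate simplification and not how the LQL engine works: the sentence is flattened into a sequence of units tagged "P" (a protein NP), "V" (a matching verb) or "O" (anything else), and the assumption that gapped matching yields every ordered combination is ours. All names and data are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Unit = Tuple[str, str]            # (kind, text), kind is "P", "V" or "O"
Triple = Tuple[str, str, str]     # (p1, verb, p2)


def triples_with_gaps(sentence: List[Unit]) -> List[Triple]:
    """Every in-order PROTEIN ... VERB ... PROTEIN combination."""
    found = []
    # combinations() keeps the units in their original sentence order.
    for (k1, t1), (k2, t2), (k3, t3) in combinations(sentence, 3):
        if (k1, k2, k3) == ("P", "V", "P"):
            found.append((t1, t2, t3))
    return found


def triples_adjacent(sentence: List[Unit]) -> List[Triple]:
    """Only PROTEIN VERB PROTEIN sequences with nothing in between."""
    found = []
    for (k1, t1), (k2, t2), (k3, t3) in zip(sentence, sentence[1:], sentence[2:]):
        if (k1, k2, k3) == ("P", "V", "P"):
            found.append((t1, t2, t3))
    return found


# Made-up sentence in which protA is mentioned twice.
sentence = [
    ("P", "protA"), ("V", "binds"), ("P", "protB"),
    ("O", "and"), ("O", "then"), ("P", "protA"),
]

if __name__ == "__main__":
    print(triples_with_gaps(sentence))
    print(triples_adjacent(sentence))
```

With gaps allowed, the repeated mention of protA yields a spurious self-interaction triple alongside the intended one; requiring adjacency keeps only the intended triple, at the cost of missing interactions whose parts are separated by other words.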

