Monthly Archives: May 2009

Can Hibernate do this?

The impedance mismatch problem is a well known one in mapping between an object-oriented language and a relational database. Tools such as Hibernate for Java make this easier for the ‘simple’ cases but can be awkward to use where one wants to take advantage of complex query processing within the DBMS. In contrast, prolog is a natural extension to the relational model (for the most part), which makes shifting between the two paradigms much easier.

Draxler’s sql_compiler can be used to easily map prolog goals to SQL queries. We can also use some neat tricks to dynamically map portions of prolog programs (specifically those in the pure prolog subset with no recursion) to on-the-fly SQL.

This blog post covers some of the methods that can be used.

Finding the least common ancestor of two classes in an ontology

In ontol_db.pro, the predicate subclass/2 is used to relate a class
to its superclass. For example:

subclass(human,homo).

This is an extensional predicate – i.e. a clause with only a head, no body that is designed to be asserted at run-time or compiled from a prolog database.

subclassT/2 is the transitive version of this predicate, and subclassRT/2 is the
reflexive transitive version (i.e. subclassRT(human,human)). These are intensional predicates – they are expressed as rules.

The class_pair_subclass_lca/3 definition looks like this:

class_pair_subclass_lca(X,Y,LCA):-
      subclassRT(X,LCA),
      subclassRT(Y,LCA),
      \+ (( subclassRT(X,CA),
            subclassRT(Y,CA),
            subclassT(CA,LCA))).

i.e. LCA is the lowest common ancestor of X and Y by subclass if:

  • subclassRT/2 holds between X and LCA (i.e. LCA is an ancestor of or identical to X)
  • subclassRT/2 holds between Y and LCA (i.e. LCA is an ancestor of or identical to Y)
  • there is no common ancestor of X and Y that is more ‘recent’ than LCA
    (We could avoid some repetition by defining CA first, and then LCA from there)

The above predicate will work given a prolog database of subclass/2 facts. For example, the NCBI Taxonomy (warning – large file).

If you have installed blip you can issue the following on the command line:

blip -i http://purl.org/obo/obo-all/ncbi_taxonomy/ncbi_taxonomy.pro -f ontol_db:pro -u ontol_db\
 findall "(class(X,'Mus musculus'),class(Y,'Homo sapiens'),class_pair_subclass_lca(X,Y,A))" -select A -label

This is equivalent to the prolog program:

:- use_module(bio(ontol_db)).
:- use_module(bio(io)).
test:-
  load_biofile('http://purl.org/obo/obo-all/ncbi_taxonomy/ncbi_taxonomy.pro',ontol_db:pro),
  class(X,'Mus musculus'),
  class(Y,'Homo sapiens'),
  class_pair_subclass_lca(X,Y,A),
  class(A,AN),
  writeln(A-AN),
  fail.

(This may take a while for the initial download, but the ontology is cached for future references)

After a short wait this will yield A=NCBITaxon:314146 (which has the scientific name ‘Euarchontoglires’. The -label option will use entity_label/2 to find the labels for class IDs). We can speed this up by using memoization/tabling of the recursive predicates, but that isn’t covered here.

Mapping to SQL

So far so good. This clause will also work for relational databases too. Here we demonstrate this using an instance of the OBD database in which the transitive closure of the relations are stored as extensional predicates in a table called ‘link’.

The first way to do this is to bind all predicates in the model to the database:

blip-obd -debug sql -r obd/pkb2 -sqlbind ontol_db:all -sqlbind metadata_db:all\
 findall "(class(X,'Mouse'),class(Y,'Human'),class_pair_subclass_lca(X,Y,A))" -select A -label

blip-obd is an alias for blip -u ontol_db -u blipkit_sql -u ontol_sqlmap_obd

ontol_sqlmap_obd contains mappings between ontol_db model facts and the obd schema.

It contains mapping rules such as the following:

ontol_db:subclassT(X,Y) <-
  node(XI,X,_,_),link(XI,RI,YI,_),node(YI,Y,_,_),node(RI,'OBO_REL:is_a',_,_),\+ (XI=YI).

The tables ‘node’ and ‘link’ are used for classes and transitive
reflexive relationships in OBD. We have to specify extra joins here
as like many relational databases this uses internal integer
artificial/surrogate keys. We don’t want to expose these at the ontol_db model
level, so they are hidden in the mapping.

We use sqlbind/2 to use all mappings in the sqlmap module.

Note that this results in class_pair_subclass_lca/3 goals being dynamically rewritten to a goal that executes a SQL query. This is done entirely automatically by examining the prolog rule – there is no human-specified mapping for this rule. The resulting SQL query is quite large and uses a SELECT .. WHERE NOT EXISTS … pattern with lots of joins. Here it is, in gory detail:

SELECT DISTINCT
 'NIF_Organism:birnlex_167' ,
 'NIF_Organism:birnlex_516' ,
 node_5.uid
 FROM
 node node_3 ,
 link link_1 ,
 node node_4 ,
 link link_2 ,
 node node_5 ,
 node node_6
 WHERE
 node_3.uid = 'NIF_Organism:birnlex_167' AND
 node_3.is_obsolete = 'f' AND
 link_1.node_id = node_3.node_id AND
 link_1.combinator = '' AND
 node_4.uid = 'NIF_Organism:birnlex_516' AND
 node_4.is_obsolete = 'f' AND
 link_2.node_id = node_4.node_id AND
 link_2.predicate_id = link_1.predicate_id AND
 link_2.object_id = link_1.object_id AND
 link_2.combinator = '' AND
 node_5.node_id = link_1.object_id AND
 node_5.is_obsolete = 'f' AND
 node_6.node_id = link_1.predicate_id AND
 node_6.uid = 'OBO_REL:is_a' AND
 node_6.is_obsolete = 'f' AND
 NOT EXISTS (
 SELECT DISTINCT
 *
 FROM
 node node_7 ,
 link link_3 ,
 node node_8 ,
 link link_4 ,
 node node_9 ,
 link link_5 ,
 node node_10 ,
 node node_11
 WHERE
 node_7.uid = 'NIF_Organism:birnlex_167' AND
 node_7.is_obsolete = 'f' AND
 link_3.node_id = node_7.node_id AND
 link_3.combinator = '' AND
 node_8.uid = 'NIF_Organism:birnlex_516' AND
 node_8.is_obsolete = 'f' AND
 link_4.node_id = node_8.node_id AND
 link_4.predicate_id = link_3.predicate_id AND
 link_4.object_id = link_3.object_id AND
 link_4.combinator = '' AND
 node_9.node_id = link_3.object_id AND
 node_9.is_obsolete = 'f' AND
 link_5.node_id = link_3.object_id AND
 link_5.predicate_id = link_3.predicate_id AND
 link_5.combinator = '' AND
 node_10.node_id = link_5.object_id AND
 node_10.uid = node_5.uid AND
 node_10.is_obsolete = 'f' AND
 node_11.node_id = link_3.predicate_id AND
 node_11.uid = 'OBO_REL:is_a' AND
 node_11.is_obsolete = 'f' AND
 link_3.object_id != link_5.object_id);

(Phew!)

In theory the RDBMS should be able to optimize this, but in practice, as more joins are added the optimizer suffers. Thankfully we have the option of choosing which clauses are rewritten to SQL and which are executed according to normal prolog WAM model.

Mixed prolog-SQL

The following only rewrites the transitive subclass predicates. class_pair_subclass_lca/3 is executed by the prolog engine:

blip-obd -debug sql -r obd/pkb2 -sqlbind ontol_db:class/1 \
   -sqlbind ontol_db:subclassT/2 -sqlbind ontol_db:subclassRT/2 -sqlbind metadata_db:all\
   findall "(class(X,'Mouse'),class(Y,'Human'),class_pair_subclass_lca(X,Y,A))" -select A -label

This results in multiple individual subclassT/2 and subclassRT/2 queries being executed. All CAs are checked until one that is the LCA is found. This more procedural approach may be less efficient, but it is less dependent on database join optimization, which makes its performance more predictable.

This kind of approach is only possible because a subset of prolog is declarative. In specifying what needs to be achieved rather than how, it’s possible to relegate the details to a configuration step.

Advertisements

SWI-Prolog + GMP rocks

SWI-Prolog provides unbounded integer and rational number arithmetic based on GMP library. Support comes out the box — no need to explicitly use a library in your code.

Even if you aren’t working with number ranges near the limits of most programming languages, unbounded integer support turns out to be fantastically useful for blazingly fast set operations by representing the sets as bit vectors. Given a mapping ix between set elements S and integers [0..n], we can assign any set an integer in the range [0..2**n], with each bit indicating whether an element is in the set, i.e  v(S) =  ∑ eS 2**ix(e). We can then use the bitwise arithmetic operators /\ and \/ to perform set-iintersection and set-union. The SWI-Prolog arithmetic function popcount/1 returns the cardinality of the set (corresponding the number of bits set).

This pattern is used in the blip simmatrix module. This allows for the computation of similarity scores between two features based on the attributes they share. E.g given a database with the predicate likes/2, e.g. between individuals and bands, we can determine the similarity of two individuals based on metrics like cosine similarity and Jacard coefficient.

Benchmarks to follow, but even on a 32 bit machine it appears possible to perform super-fast comparisons for databases of millions of facts utilizing 10k+ attributes (and thus integers in the range [0..2**10000]).

A prolog library for OWL2 and SWRL

Ontologies are vital for the life sciences. The Web Ontology Language (OWL) offers decidability of reasoning, and now with OWL2 and SWRL reasonably high levels of expressivity.

Vangelis Vassilidis and I are writing Thea2, based on his original Thea library. The redesign introduces prolog predicates for every OWL2 axiom, and prolog terms for owl class and property expressions. We use the SWI-Prolog semweb library for reading/writing to RDF. There is also an (optional) JPL bridge wrapping the Manchester OWLAPI.

There are a number of different reasoning strategies, including:

  • simple but limited backward chaining reasoning
  • using Grosof’s translation to DLP in conjunction with systems such as Yap, XSB or DLV
  • using standard OWL reasoners via JPL (DIG interface from Thea1 still needs ported)

Source: github
Documentation: pldoc

Documentation server up

Blipkit pldoc server now up and running