Category Archives: databases

Can Hibernate do this?

The impedance mismatch problem is a well known one in mapping between an object-oriented language and a relational database. Tools such as Hibernate for Java make this easier for the ‘simple’ cases but can be awkward to use where one wants to take advantage of complex query processing within the DBMS. In contrast, prolog is a natural extension to the relational model (for the most part), which makes shifting between the two paradigms much easier.

Draxler’s sql_compiler can be used to easily map prolog goals to SQL queries. We can also use some neat tricks to dynamically map portions of prolog programs (specifically those in the pure prolog subset with no recursion) to on-the-fly SQL.

This blog post covers some of the methods that can be used.

Finding the least common ancestor of two classes in an ontology

In ontol_db.pro, the predicate subclass/2 is used to relate a class
to its superclass. For example:

subclass(human,homo).

This is an extensional predicate – i.e. a clause with only a head, no body that is designed to be asserted at run-time or compiled from a prolog database.

subclassT/2 is the transitive version of this predicate, and subclassRT/2 is the
reflexive transitive version (i.e. subclassRT(human,human)). These are intensional predicates – they are expressed as rules.

The class_pair_subclass_lca/3 definition looks like this:

class_pair_subclass_lca(X,Y,LCA):-
      subclassRT(X,LCA),
      subclassRT(Y,LCA),
      \+ (( subclassRT(X,CA),
            subclassRT(Y,CA),
            subclassT(CA,LCA))).

i.e. LCA is the lowest common ancestor of X and Y by subclass if:

  • subclassRT/2 holds between X and LCA (i.e. LCA is an ancestor of or identical to X)
  • subclassRT/2 holds between Y and LCA (i.e. LCA is an ancestor of or identical to Y)
  • there is no common ancestor of X and Y that is more ‘recent’ than LCA
    (We could avoid some repetition by defining CA first, and then LCA from there)

The above predicate will work given a prolog database of subclass/2 facts. For example, the NCBI Taxonomy (warning – large file).

If you have installed blip you can issue the following on the command line:

blip -i http://purl.org/obo/obo-all/ncbi_taxonomy/ncbi_taxonomy.pro -f ontol_db:pro -u ontol_db\
 findall "(class(X,'Mus musculus'),class(Y,'Homo sapiens'),class_pair_subclass_lca(X,Y,A))" -select A -label

This is equivalent to the prolog program:

:- use_module(bio(ontol_db)).
:- use_module(bio(io)).
test:-
  load_biofile('http://purl.org/obo/obo-all/ncbi_taxonomy/ncbi_taxonomy.pro',ontol_db:pro),
  class(X,'Mus musculus'),
  class(Y,'Homo sapiens'),
  class_pair_subclass_lca(X,Y,A),
  class(A,AN),
  writeln(A-AN),
  fail.

(This may take a while for the initial download, but the ontology is cached for future references)

After a short wait this will yield A=NCBITaxon:314146 (which has the scientific name ‘Euarchontoglires’. The -label option will use entity_label/2 to find the labels for class IDs). We can speed this up by using memoization/tabling of the recursive predicates, but that isn’t covered here.

Mapping to SQL

So far so good. This clause will also work for relational databases too. Here we demonstrate this using an instance of the OBD database in which the transitive closure of the relations are stored as extensional predicates in a table called ‘link’.

The first way to do this is to bind all predicates in the model to the database:

blip-obd -debug sql -r obd/pkb2 -sqlbind ontol_db:all -sqlbind metadata_db:all\
 findall "(class(X,'Mouse'),class(Y,'Human'),class_pair_subclass_lca(X,Y,A))" -select A -label

blip-obd is an alias for blip -u ontol_db -u blipkit_sql -u ontol_sqlmap_obd

ontol_sqlmap_obd contains mappings between ontol_db model facts and the obd schema.

It contains mapping rules such as the following:

ontol_db:subclassT(X,Y) <-
  node(XI,X,_,_),link(XI,RI,YI,_),node(YI,Y,_,_),node(RI,'OBO_REL:is_a',_,_),\+ (XI=YI).

The tables ‘node’ and ‘link’ are used for classes and transitive
reflexive relationships in OBD. We have to specify extra joins here
as like many relational databases this uses internal integer
artificial/surrogate keys. We don’t want to expose these at the ontol_db model
level, so they are hidden in the mapping.

We use sqlbind/2 to use all mappings in the sqlmap module.

Note that this results in class_pair_subclass_lca/3 goals being dynamically rewritten to a goal that executes a SQL query. This is done entirely automatically by examining the prolog rule – there is no human-specified mapping for this rule. The resulting SQL query is quite large and uses a SELECT .. WHERE NOT EXISTS … pattern with lots of joins. Here it is, in gory detail:

SELECT DISTINCT
 'NIF_Organism:birnlex_167' ,
 'NIF_Organism:birnlex_516' ,
 node_5.uid
 FROM
 node node_3 ,
 link link_1 ,
 node node_4 ,
 link link_2 ,
 node node_5 ,
 node node_6
 WHERE
 node_3.uid = 'NIF_Organism:birnlex_167' AND
 node_3.is_obsolete = 'f' AND
 link_1.node_id = node_3.node_id AND
 link_1.combinator = '' AND
 node_4.uid = 'NIF_Organism:birnlex_516' AND
 node_4.is_obsolete = 'f' AND
 link_2.node_id = node_4.node_id AND
 link_2.predicate_id = link_1.predicate_id AND
 link_2.object_id = link_1.object_id AND
 link_2.combinator = '' AND
 node_5.node_id = link_1.object_id AND
 node_5.is_obsolete = 'f' AND
 node_6.node_id = link_1.predicate_id AND
 node_6.uid = 'OBO_REL:is_a' AND
 node_6.is_obsolete = 'f' AND
 NOT EXISTS (
 SELECT DISTINCT
 *
 FROM
 node node_7 ,
 link link_3 ,
 node node_8 ,
 link link_4 ,
 node node_9 ,
 link link_5 ,
 node node_10 ,
 node node_11
 WHERE
 node_7.uid = 'NIF_Organism:birnlex_167' AND
 node_7.is_obsolete = 'f' AND
 link_3.node_id = node_7.node_id AND
 link_3.combinator = '' AND
 node_8.uid = 'NIF_Organism:birnlex_516' AND
 node_8.is_obsolete = 'f' AND
 link_4.node_id = node_8.node_id AND
 link_4.predicate_id = link_3.predicate_id AND
 link_4.object_id = link_3.object_id AND
 link_4.combinator = '' AND
 node_9.node_id = link_3.object_id AND
 node_9.is_obsolete = 'f' AND
 link_5.node_id = link_3.object_id AND
 link_5.predicate_id = link_3.predicate_id AND
 link_5.combinator = '' AND
 node_10.node_id = link_5.object_id AND
 node_10.uid = node_5.uid AND
 node_10.is_obsolete = 'f' AND
 node_11.node_id = link_3.predicate_id AND
 node_11.uid = 'OBO_REL:is_a' AND
 node_11.is_obsolete = 'f' AND
 link_3.object_id != link_5.object_id);

(Phew!)

In theory the RDBMS should be able to optimize this, but in practice, as more joins are added the optimizer suffers. Thankfully we have the option of choosing which clauses are rewritten to SQL and which are executed according to normal prolog WAM model.

Mixed prolog-SQL

The following only rewrites the transitive subclass predicates. class_pair_subclass_lca/3 is executed by the prolog engine:

blip-obd -debug sql -r obd/pkb2 -sqlbind ontol_db:class/1 \
   -sqlbind ontol_db:subclassT/2 -sqlbind ontol_db:subclassRT/2 -sqlbind metadata_db:all\
   findall "(class(X,'Mouse'),class(Y,'Human'),class_pair_subclass_lca(X,Y,A))" -select A -label

This results in multiple individual subclassT/2 and subclassRT/2 queries being executed. All CAs are checked until one that is the LCA is found. This more procedural approach may be less efficient, but it is less dependent on database join optimization, which makes its performance more predictable.

This kind of approach is only possible because a subset of prolog is declarative. In specifying what needs to be achieved rather than how, it’s possible to relegate the details to a configuration step.

Advertisements