WP7a Reasoning Requirements

By reasoning we mean a formal process that derives new implicit information from the existing RDF statements. This document considers three types of reasoning: transformation reasoning (mapping rules), schema inference and consistency checking.

Transformation reasoning / Semantics ETLs / Mapping rules

The LinkedLifeData platform deals with a large number of heterogeneous data sources distributed in various data exchange formats such as RDF, database dumps, OBO or XML. Since there is no single authority or best-practice convention for identifying biomedical entities or representing relationships, our team identified a set of common patterns that have to be handled in order to link the data sources.

The final goal of all these patterns is to generate explicit links (the red one in the figure below) between two suspected matching instances X and Y, based on information that already exists in the model (the blue one). Once the necessary transformations are asserted, SPARQL may be used to query the integrated data model.

http://wiki.larkc.eu/LarkcProject/WP7a/M18prototype?action=AttachFile&do=get&target=WP7a+transformations.png

  1. Namespace mapping - Two RDF datasets use the same local identifier but define different namespaces; this case is common when datasets refer to stable database identifiers, such as those of GO or EntrezGene, for which the authors provide no resolvable URI

  2. Reference node - A trick to avoid the previous pattern is to use a dummy reference node that designates the database id and name; this is also the recommended way to create cross-references to external data sources in the BioPAX specification
  3. Mismatched identifier - Database entries have multiple identifiers used for different purposes. For example, the EntrezGene database has a gene symbol (alphanumeric string) and an id (numeric value). The id could also be regarded as a composite key formed by the unique combination of (gene symbol, organism)

  4. Value dereference - A lazy way to reference controlled vocabularies by using only the concept name instead of its identifier. For example, PubMed is annotated with MeSH term names, but not with MeSH concept ids

  5. Transitive link - A pattern used to link two data sources based on their common relation to a third one. For example, DBPedia has links to Freebase and ICD-10 codes
  6. Literal extraction - Used to link resources based on literals that enumerate a list of named entities. For example, the DrugBank indication field lists a sequence of diseases.
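As an illustration of how patterns #4 and #5 above produce explicit links, here is a minimal Python sketch over plain (subject, predicate, object) tuples. All identifiers, the prefixes and the link predicate `ex:linkedTo` are hypothetical placeholders chosen for the example; this is not the LinkedLifeData implementation.

```python
def transitive_links(triples, pred_x, pred_y, link="ex:linkedTo"):
    """Pattern 5: link X and Y when both refer to the same third-party node."""
    subjects_by_target = {}
    for s, p, o in triples:
        if p == pred_y:
            subjects_by_target.setdefault(o, set()).add(s)
    return {
        (s, link, y)
        for s, p, o in triples
        if p == pred_x
        for y in subjects_by_target.get(o, ())
    }


def value_dereference(triples, annotated_with, label_pred, link="ex:linkedTo"):
    """Pattern 4: resolve a concept name used as an annotation to the concept."""
    # Case-insensitive lookup from label text to concept (an assumption).
    concept_by_label = {o.lower(): s for s, p, o in triples if p == label_pred}
    return {
        (s, link, concept_by_label[o.lower()])
        for s, p, o in triples
        if p == annotated_with and o.lower() in concept_by_label
    }


# Made-up sample data.
triples = [
    ("a:asthma", "a:icd10", "icd10:J45"),
    ("b:record_7", "b:code", "icd10:J45"),
    ("v:concept_1", "skos:prefLabel", "Asthma"),
    ("doc:42", "doc:term", "asthma"),
]
links_5 = transitive_links(triples, "a:icd10", "b:code")
links_4 = value_dereference(triples, "doc:term", "skos:prefLabel")
```

In both functions the existing statements play the role of the blue edges and the returned triples the role of the red ones.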

Open Issue: It is not possible to express mapping rules #1, #2 and #3 in SPARQL, because of the lack of proper string concatenation and substring functions
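To make the string handling behind rule #1 concrete, the following Python sketch strips one namespace prefix and concatenates another, which is exactly the substring and concatenation step mentioned in the open issue. The namespaces, the function name and the link predicate `ex:sameEntity` are assumptions made for the example.

```python
def namespace_links(uris_a, uris_b, ns_a, ns_b, link="ex:sameEntity"):
    """Rule #1: same local identifier, different namespace."""
    # Local identifiers present in dataset B (substring after the namespace).
    local_b = {u[len(ns_b):] for u in uris_b if u.startswith(ns_b)}
    return {
        # Concatenate the other namespace with the shared local identifier.
        (u, link, ns_b + u[len(ns_a):])
        for u in uris_a
        if u.startswith(ns_a) and u[len(ns_a):] in local_b
    }


# Made-up namespaces and identifiers.
NS_A = "http://a.example/go/"
NS_B = "http://b.example/term#"
uris_a = [NS_A + "GO_0008150", NS_A + "GO_0000001"]
uris_b = [NS_B + "GO_0008150"]
ns_links = namespace_links(uris_a, uris_b, NS_A, NS_B)
```

Only the identifier that occurs in both datasets yields a link; the unmatched one is left alone.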

Schema inference

Some of the integrated data sources encode specific semantics. For instance, the BioPAX domain ontology uses OWL-DL, and the data sources distributed in OBO format are good candidates for a SKOS schema. As a baseline, we consider a combination of RDFS + SKOS + OWL a reasonable requirement for WP7a. Full OWL-DL seems unnecessarily computationally complex; in this particular case the disjointness constructs could be replaced by simple consistency checking rules.

To successfully implement the WP7a M18 prototype we require semantic expressiveness that covers at least the RDFS + SKOS specifications.

Example:

<A> skos:broader <B> .
<B> skos:broader <C> .

entails

<A> skos:broaderTransitive <B> .
<B> skos:broaderTransitive <C> .
<A> skos:broaderTransitive <C> .
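The entailment above can be reproduced with a small fixpoint computation. The Python sketch below is only an illustration (the function name is made up): it copies every skos:broader pair to skos:broaderTransitive and then applies transitivity until nothing new is derived.

```python
def broader_transitive(broader_pairs):
    """Return all (narrower, broader) pairs entailed for skos:broaderTransitive."""
    # skos:broader is a sub-property of skos:broaderTransitive.
    closure = set(broader_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # Transitivity: (a, b) and (b, d) entail (a, d).
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


closure = broader_transitive({("A", "B"), ("B", "C")})
```

The nested loop is quadratic per pass, which is acceptable for an illustration but not for thesauri of realistic size.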

Another example, used for the purpose of semantic data integration, is the alignment of different biomedical thesauri:

<A> skos:broadMatch <B> .

entails

<A> skos:mappingRelation <B> .
<A> skos:broader <B> .
<A> skos:broaderTransitive <B> .
<A> skos:semanticRelation <B> .
<A> rdf:type skos:Concept .
<B> rdf:type skos:Concept .
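The same result can be derived mechanically from the SKOS property hierarchy. The Python sketch below hard-codes only the sub-property axioms needed for this example and the domain and range of skos:semanticRelation; the table and function names are illustrative assumptions.

```python
# Sub-property axioms from the SKOS data model (only those used here).
SUB_PROPERTY = {
    "skos:broadMatch": ["skos:broader", "skos:mappingRelation"],
    "skos:broader": ["skos:broaderTransitive"],
    "skos:broaderTransitive": ["skos:semanticRelation"],
    "skos:mappingRelation": ["skos:semanticRelation"],
}
# Properties whose domain and range are skos:Concept.
CONCEPT_RELATIONS = {"skos:semanticRelation"}


def entail(triples):
    """Forward-chain sub-property and domain/range rules to a fixpoint."""
    inferred = set(triples)
    frontier = list(triples)
    while frontier:
        s, p, o = frontier.pop()
        new = [(s, q, o) for q in SUB_PROPERTY.get(p, ())]
        if p in CONCEPT_RELATIONS:
            new += [(s, "rdf:type", "skos:Concept"),
                    (o, "rdf:type", "skos:Concept")]
        for t in new:
            if t not in inferred:
                inferred.add(t)
                frontier.append(t)
    return inferred


entailed = entail({("A", "skos:broadMatch", "B")})
```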

Consistency checking

The data integration process is characterized by a constant flow of new information into the knowledge base (e.g., new versions, additional data sources, etc.). Thus, a mechanism to control the consistency or correctness of the knowledge base is required. As a minimum, we need to enforce the consistency rules required by the SKOS specification, for example:

<COPD_Disease> skos:prefLabel "COPD"@en .

<COPD_Disease> skos:prefLabel "Chronic Obstructive Pulmonary Disease"@en . - inconsistent: a second skos:prefLabel with the same language tag
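A check for this SKOS rule (no more than one skos:prefLabel per resource and language tag) could look like the following Python sketch. The input layout, a list of (resource, text, language tag) tuples, and the function name are assumptions made for the example.

```python
from collections import defaultdict


def pref_label_conflicts(labels):
    """Return {(resource, lang): [texts]} for every violated pair."""
    seen = defaultdict(set)
    for resource, text, lang in labels:
        seen[(resource, lang)].add(text)
    return {key: sorted(texts) for key, texts in seen.items() if len(texts) > 1}


labels = [
    ("COPD_Disease", "COPD", "en"),
    ("COPD_Disease", "Chronic obstructive pulmonary disease", "en"),
    ("COPD_Disease", "BPCO", "fr"),  # different language tag, allowed
]
conflicts = pref_label_conflicts(labels)
```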

On the other hand, we foresee that end users will be able to modify and extend the knowledge base, so there is a need to enforce specific custom constraints on the model, such as:

<Document_X> wp7a:contains <COPD_Disease> .
wp7a:contains rdfs:constraint_domain wp7a:Document .
wp7a:contains rdfs:constraint_range skos:Concept .

<COPD_Disease> rdf:type wp7a:Document .
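One possible reading of such constraints is a closed-world check: the subject and object of a statement must already be typed with the declared classes, otherwise the statement is reported. The Python sketch below follows that reading, which is an assumption; it differs from rdfs:domain and rdfs:range, which would infer the types instead of rejecting the statement. The function name and the tuple layout are hypothetical.

```python
def constraint_violations(triples, domain, range_):
    """Report statements whose subject or object lacks the required type."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    violations = []
    for s, p, o in triples:
        if p in domain and domain[p] not in types.get(s, set()):
            violations.append((s, p, o, "subject is not a " + domain[p]))
        if p in range_ and range_[p] not in types.get(o, set()):
            violations.append((s, p, o, "object is not a " + range_[p]))
    return violations


kb = [
    ("Document_X", "rdf:type", "wp7a:Document"),
    ("Document_X", "wp7a:contains", "COPD_Disease"),
    ("COPD_Disease", "rdf:type", "wp7a:Document"),
]
violations = constraint_violations(
    kb,
    domain={"wp7a:contains": "wp7a:Document"},
    range_={"wp7a:contains": "skos:Concept"},
)
```

In this sample the contained resource is typed as a document rather than a concept, so the range constraint is reported as violated.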

Beyond M18, consistency checking could also be investigated as a way to validate relations generated by information extraction algorithms.

Reasoning strategy

This section investigates the factors that determine the optimal reasoning approach. Two main reasoning strategies can be outlined:

Several factors influence the decision:

Our proposal is to use forward chaining reasoning integrated as part of the data layer.

TODO: Motivate our decision based on the factors above.

Proposed workflow architecture

This section discusses the software plugins that will implement the gathered reasoning requirements.

Workflow data gathering - process to get/update data

To initiate the workflow we would probably need a special query. For example: "ASK { datasourceURI lld:update "Date of last update"; lld:format skos:SKOS }"

  1. WP7aScriptedDecider [Decider] - to be specified
  2. DataSourceDownloader [Identifier] - the plugin checks for and downloads the latest version of the data source; it probably has to return a URI or null

  3. <Format>Transformer [InformationSetTransformer] - multiple plugins that implement the transformation to RDF from: OBO; RDBMS (JDBC connection string) + descriptor; XML; other

  4. MappingRules [Reasoner] - executes mapping rules #1 to #5 (a special workflow has to be implemented for #6, information extraction)

  5. ConsistencyCheckingReasoner - validates whether the knowledge base is consistent using a predetermined set of queries; returns true/false/exception
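The way these plugins could be chained is sketched below in Python. This is only an illustration of the control flow: the plugins are represented by plain callables passed in as arguments, the function and parameter names are made up, the decider step is left out because it is not yet specified, and none of this reflects the actual LarKC plugin API.

```python
def run_update_workflow(source, download, transform, map_rules, is_consistent):
    """Chain the plugin roles; every argument after `source` is a stand-in."""
    location = download(source)       # DataSourceDownloader: URI or None
    if location is None:
        return None                   # no new version, nothing to load
    triples = transform(location)     # <Format>Transformer: native format to RDF
    triples = triples | map_rules(triples)   # MappingRules: add explicit links
    if not is_consistent(triples):    # ConsistencyCheckingReasoner
        raise ValueError("update would make the knowledge base inconsistent")
    return triples


# Stub plugins with made-up data, just to exercise the control flow.
result = run_update_workflow(
    "ex:source",
    download=lambda src: "file:///tmp/dump.obo",
    transform=lambda uri: {("ex:a", "ex:p", "ex:b")},
    map_rules=lambda ts: {("ex:a", "ex:linkedTo", "ex:c")},
    is_consistent=lambda ts: True,
)
skipped = run_update_workflow(
    "ex:source",
    download=lambda src: None,
    transform=None, map_rules=None, is_consistent=None,
)
```

Raising an exception on an inconsistent update is one design choice among several; the workflow could equally return false and leave the decision to the caller.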

Open decision: DataSourceDownloader and ConsistencyCheckingReasoner need additional metadata, such as the URL of the data source or the SPARQL queries used to check consistency. There are two alternative options for storing this metadata: 1) pass it with the query or 2) persist it in the repository

Workflow SKOS based information extraction - process to implement #6 mapping rule

TODO

LarkcProject/WP7a/WP7aRequiredReasoning (last edited 2009-07-27 09:41:41 by VassilMomtchev)