WP7a Reasoning Requirements
By reasoning we mean a formal process for deriving new implicit information from the existing RDF statements. Three different types of reasoning are considered in this document.
Transformation reasoning / Semantics ETLs / Mapping rules
The LinkedLifeData platform deals with a large number of heterogeneous data sources distributed in various data exchange formats such as RDF, database dumps, OBO or XML. Since there is no single authority or best-practice convention for identifying biomedical entities or representing relationships, our team has identified a list of common patterns that must be handled in order to link:
- equal instances
- resources with a specific form of semantic relationship
The final goal of all these patterns is to generate explicit links (shown in red in the figure below) between the two candidate instances X and Y, based on information that already exists in the model (shown in blue). Once the necessary transformations have been asserted, SPARQL can be used to query the integrated data model.
1. Namespace mapping - Two RDF datasets use the same local identifier but define different namespaces. This case is common when datasets refer to stable database identifiers, such as those of GO or EntrezGene, for which the authors do not provide a resolvable URI.
2. Reference node - A technique to avoid the previous pattern: a dummy reference node designates the database id and name. This is also the recommended way to create cross-references to external data sources in the BioPAX specification.
3. Mismatched identifier - Database entries have multiple identifiers used for different purposes. For example, the EntrezGene database has a gene symbol (alphanumeric string) and an id (numeric value). The id can also be regarded as a composite key formed by the unique combination (gene symbol, organism).
4. Value dereference - A lazy way to reference controlled vocabularies by using only the concept name rather than its identifier. For example, PubMed is annotated with MeSH term names, not MeSH concept ids.
5. Transitive link - A pattern used to link two data sources based on their common relation to a third one. For example, DBpedia has links to Freebase and to ICD-10 codes.
6. Literal extraction - Links resources based on literals that enumerate a list of named entities. For example, the DrugBank indication field lists a sequence of diseases.
Open issue: Mapping rules #1, #2 and #3 cannot be expressed in SPARQL, because it lacks suitable string concatenation and substring functions.
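Since these rules depend on string manipulation of identifiers, one option is to compute the links outside SPARQL. The following is a minimal Python sketch of mapping rule #1 (namespace mapping). It assumes URIs are plain strings; the `split_uri` helper, the set of equivalent namespaces and the choice of owl:sameAs as the link predicate are illustrative assumptions, not part of the LinkedLifeData implementation.

```python
from collections import defaultdict

# Link predicate chosen for illustration; another predicate could be used.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def split_uri(uri):
    """Split a URI into (namespace, local identifier) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1], uri[cut + 1:]


def namespace_links(uris, equivalent_namespaces):
    """Link URIs that share a local identifier under equivalent namespaces.

    equivalent_namespaces is a set of namespace strings that are assumed
    (by the caller) to identify entries of the same database.
    """
    by_local_id = defaultdict(set)
    for uri in uris:
        namespace, local_id = split_uri(uri)
        if namespace in equivalent_namespaces and local_id:
            by_local_id[local_id].add(uri)

    links = set()
    for group in by_local_id.values():
        ordered = sorted(group)
        # One link per unordered pair is enough because owl:sameAs is symmetric.
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                links.add((left, OWL_SAME_AS, right))
    return links
```

The generated triples can then be asserted into the repository so that ordinary SPARQL queries see the two identifiers as linked.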
Schema inference
Some of the integrated data sources encode specific semantics. For instance, the BioPAX domain ontology uses OWL-DL, and the data sources distributed in OBO format are good candidates for a SKOS schema. As a baseline, we consider it reasonable for WP7a to require a combination of RDFS + SKOS + OWL. Full OWL-DL seems unnecessarily complex computationally; in this particular case the disjointness constructs could be replaced with simple consistency checking rules.
To implement the WP7a M18 prototype successfully, we require semantic expressiveness that covers at least the RDFS + SKOS specifications.
Example:
<A> skos:broader <B> .
<B> skos:broader <C> .
entails
<A> skos:broaderTransitive <B> .
<B> skos:broaderTransitive <C> .
<A> skos:broaderTransitive <C> .
Another example, used for the purpose of semantic data integration, is the alignment of different biomedical thesauri:
<A> skos:broadMatch <B> .
entails
<A> skos:mappingRelation <B> .
<A> skos:broader <B> .
<A> skos:broaderTransitive <B> .
<A> skos:semanticRelation <B> .
<A> rdf:type skos:Concept .
<B> rdf:type skos:Concept .
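Both entailments can be reproduced with a small fixed-point rule loop. The sketch below is a simplified illustration in Python, not a complete SKOS reasoner: it hard-codes only the sub-property axioms and the transitivity of skos:broaderTransitive needed for these two examples, and it leaves out the rdf:type skos:Concept conclusions. Prefixed names are kept as plain strings, which is an assumption of the sketch.

```python
# Sub-property axioms from the SKOS data model (only the ones used here).
SUB_PROPERTY_OF = {
    "skos:broadMatch": ["skos:broader", "skos:mappingRelation"],
    "skos:broader": ["skos:broaderTransitive"],
    "skos:broaderTransitive": ["skos:semanticRelation"],
    "skos:mappingRelation": ["skos:semanticRelation"],
}
TRANSITIVE = {"skos:broaderTransitive"}


def skos_closure(explicit):
    """Return the explicit triples plus everything the rules above entail."""
    closure = set(explicit)
    while True:
        derived = set()
        # Rule 1: a statement also holds for every super-property.
        for s, p, o in closure:
            for parent in SUB_PROPERTY_OF.get(p, []):
                derived.add((s, parent, o))
        # Rule 2: chain statements of transitive properties.
        for s, p, o in closure:
            if p in TRANSITIVE:
                for s2, p2, o2 in closure:
                    if p2 == p and s2 == o:
                        derived.add((s, p, o2))
        if derived <= closure:  # nothing new: fixed point reached
            return closure
        closure |= derived
```

Note that skos:broader itself is not transitive in this rule set, so `<A> skos:broader <C>` is never derived, only `<A> skos:broaderTransitive <C>`.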
Consistency checking
The data integration process is characterized by a constant flow of new information into the knowledge base (e.g., new versions, additional data sources, etc.). A mechanism to control the consistency, or correctness, of the knowledge base is therefore required. As a minimum, we need to enforce the consistency rules required by the SKOS specification, for example:
<COPD_Disease> skos:prefLabel "COPD"@en .
<COPD_Disease> skos:prefLabel "Chronic Obstructive Pulmonary Disease"@en . - inconsistent (a resource may have only one preferred label per language)
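A check for this rule (at most one skos:prefLabel per language tag for a resource) could look like the sketch below. It assumes the labels are available as (resource, text, language) tuples, a representation chosen here for illustration only.

```python
from collections import defaultdict


def pref_label_conflicts(pref_labels):
    """Return {(resource, language): sorted labels} for every violation.

    pref_labels is an iterable of (resource, label_text, language_tag).
    Language tags are compared case-insensitively.
    """
    seen = defaultdict(set)
    for resource, text, language in pref_labels:
        seen[(resource, language.lower())].add(text)
    # A violation is a resource with two or more distinct labels in one language.
    return {key: sorted(texts) for key, texts in seen.items() if len(texts) > 1}
```

An empty result means the preferred labels are consistent; a non-empty result names each offending resource and language together with the competing labels.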
In addition, we foresee that end users will be able to modify and extend the knowledge base, so there is a need to enforce specific custom constraints in the model, for example:
<Document_X> wp7a:contains <COPD_Disease> .
wp7a:contains rdfs:constraint_domain wp7a:Document .
wp7a:contains rdfs:constraint_range skos:Concept .
<COPD_Disease> rdf:type wp7a:Document .
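One possible reading of such constraints is a closed-world check: a statement is rejected unless its subject and object are explicitly typed with the required classes. The sketch below follows that reading, which is an assumption (the semantics of the constraint properties is not fixed in this document); the function name and the dictionary-based constraint tables are likewise illustrative.

```python
def constraint_violations(triples, domain_of, range_of):
    """List statements whose subject or object lacks the required rdf:type.

    domain_of / range_of map a property to the class required for its
    subject / object. A missing type counts as a violation (closed-world
    reading, an assumption of this sketch).
    """
    # Collect the explicitly asserted types of every resource.
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)

    violations = []
    for s, p, o in triples:
        if p in domain_of and domain_of[p] not in types.get(s, set()):
            violations.append((s, p, o, "domain"))
        if p in range_of and range_of[p] not in types.get(o, set()):
            violations.append((s, p, o, "range"))
    return violations
```

Under this reading, a resource typed only as wp7a:Document that appears as the object of wp7a:contains is reported as a range violation.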
Beyond M18, consistency checking could also be investigated as a means to validate relations generated by information extraction algorithms.
Reasoning strategy
This section examines the factors that determine the optimal reasoning approach. Two main reasoning strategies can be outlined:
- Forward-chaining: start from the known facts and perform the inference in an inductive fashion. This kind of reasoning can have diverse objectives, for instance: computing the inferred closure; answering a particular query; inferring a particular sort of knowledge (e.g. the class taxonomy); etc.
- Backward-chaining: start from a particular fact or query and use deductive reasoning to verify that fact or to obtain all possible results of the query. Typically, the reasoner decomposes the fact into simpler facts that can be found in the knowledge base, or transforms it into alternative facts that can be proven by applying further recursive transformations.
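To make the contrast concrete, the sketch below answers the same question ("is Y reachable from X through skos:broader links?") in both styles over a toy hierarchy. The function names and the dictionary representation of the hierarchy are illustrative assumptions.

```python
def forward_chain(broader):
    """Materialise all (narrower, broader) pairs up front; queries are lookups."""
    closure = {(a, b) for a, targets in broader.items() for b in targets}
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c in broader.get(b, ()):
                if (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure


def backward_chain(broader, start, goal, _visited=None):
    """Prove one goal on demand by decomposing it into sub-goals."""
    visited = _visited if _visited is not None else set()
    if start in visited:  # guard against cycles in the data
        return False
    visited.add(start)
    for parent in broader.get(start, ()):
        # Either the goal is a direct parent, or it must be provable from the parent.
        if parent == goal or backward_chain(broader, parent, goal, visited):
            return True
    return False
```

The forward variant pays the cost once, at load time, and stores every inferred pair; the backward variant stores nothing extra but repeats the search for each query.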
Several factors influence the decision:
- Monotonic/non-monotonic logic - in a monotonic logic, adding new explicit facts (or statements) to the knowledge base (or repository) may produce new implicit facts that extend the inferred closure, but it never invalidates facts that were already part of the inferred closure. In other words, the addition of new facts can only extend the inferred closure monotonically.
- The size of the inferred closure - depending on the use case data, it might not be feasible to persist the full inferred closure; we should estimate, based on a small sample dataset, the overall repository size in terms of explicit and implicit statements.
- Completeness - it still has to be decided whether this is a significant factor; we expect that most user queries will require consistent and complete results.
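The sample-based size estimate mentioned among these factors can be sketched as simple arithmetic. The linear extrapolation used here is an assumption made for illustration: closures of transitive properties can grow faster than linearly, so the result is better treated as a rough lower bound than as a prediction. Both function names are hypothetical.

```python
def expansion_ratio(sample_explicit, sample_closure):
    """Closure size divided by explicit size, measured on a sample dataset."""
    if sample_explicit <= 0:
        raise ValueError("sample must contain at least one explicit statement")
    return sample_closure / sample_explicit


def estimate_repository_size(total_explicit, sample_explicit, sample_closure):
    """Extrapolate the sample's expansion ratio linearly to the full dataset."""
    return round(total_explicit * expansion_ratio(sample_explicit, sample_closure))
```

For instance, a sample of 1,000 explicit statements whose closure holds 2,500 statements gives a ratio of 2.5; applied to 1,000,000 explicit statements, the estimate is 2,500,000 statements in total.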
Our proposal is to use forward-chaining reasoning integrated as part of the data layer.
TODO: Motivate our decision based on the factors above.
Proposed workflow architecture
This section discusses the software plugins that will implement the gathered reasoning requirements.
Workflow data gathering - process to get/update data
To initiate the workflow we would probably need a special query, for example: "ASK { datasourceURI lld:update "Date of last update"; lld:format skos:SKOS }"
- WP7aScriptedDecider [Decider] - to be specified
- DataSourceDownloader [Identifier] - checks for and downloads the latest version of the data source; it probably has to return a URI or null
- <Format>Transformer [InformationSetTransformer] - multiple plugins that implement the transformation to RDF from: OBO; RDBMS (JDBC connection string) + descriptor; XML; other
- MappingRules [Reasoner] - executes mapping rules #1 to #5 (a special workflow has to be implemented for #6, information extraction)
- ConsistencyCheckingReasoner - validates whether the knowledge base is consistent using a predetermined set of queries; returns true/false/exception
Open decision: DataSourceDownloader and ConsistencyCheckingReasoner need additional metadata, such as the URL of the data source or the SPARQL queries used to check consistency. There are two options for storing this metadata: 1) pass it with the query, or 2) persist it in the repository.
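The plugin chain of the data gathering workflow could be wired together roughly as follows. This is a structural sketch only: the function name, the callable-based plugin interface and the return conventions are hypothetical and do not reflect the actual plugin API of the workflow framework. The Decider step is left out because it is still to be specified.

```python
def run_data_gathering(source_url, download, transform, apply_mapping_rules, is_consistent):
    """Run one update cycle; every argument after source_url is a plugin callable.

    download(url) -> local URI or None     (role of DataSourceDownloader)
    transform(uri) -> set of RDF triples   (role of <Format>Transformer)
    apply_mapping_rules(triples) -> triples plus generated links (MappingRules)
    is_consistent(triples) -> bool         (ConsistencyCheckingReasoner)
    Returns the triples to load, or None when there is nothing new.
    """
    local_uri = download(source_url)
    if local_uri is None:  # no newer version of the data source
        return None
    triples = apply_mapping_rules(transform(local_uri))
    if not is_consistent(triples):
        # Refuse to load data that fails the consistency queries.
        raise ValueError("consistency check failed for " + source_url)
    return triples
```

In this sketch the metadata (source URL, consistency queries) is passed in by the caller, which corresponds to option 1 of the open decision; persisting it in the repository would move that lookup inside the plugins.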
Workflow SKOS-based information extraction - process to implement mapping rule #6
TODO
