Thinking aloud about the LarKC plugins, their functionalities and their interfaces
First version of this page by Frank and Annette on 12 August 2008.
On this page we try to think aloud about the LarKC plugins: what their functionality is, some examples of each of them, what control structure governs them, how they interact, etc. This is meant as input to the discussion on how to define the interfaces to the different plugin-types.
Control flow in the pipeline
The original proposal mentions the following five plugin-types, organised in a looped sequence:
repeat
    obtain a selection of data; (RETRIEVAL)
    transform to an appropriate representation; (ABSTRACTION)
    draw a sample; (SELECTION)
    reason on the sample; (INFERENCE)
    if more time is available (DECIDING)
       and/or the result is not good enough (DECIDING)
    then increase/decrease the data selection
    else exit
end
This is the leftmost version in the picture below. Clearly, this is already flawed, since one might want to be able to return to any point in the loop after deciding, not just to the start. Allowing that gives the middle version in the picture.
However, it seems much more plausible that control decisions potentially need to be taken in between every step. This leads to the third picture, where DECIDING has become a "meta"-component, to enable quality/resource/control decisions at any point in the pipeline.
In the picture, the dotted arrows are control-flow, the solid arrows are data-flow. Also, the picture leaves out the data-storage layer, to which all components need to have read/write access.
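To make the third picture a bit more concrete, here is a minimal Python sketch of a pipeline in which a decider is consulted before every step. All names (run_pipeline, one_pass, make_step) are made up for illustration, the steps are dummies, and the data-storage layer is left out, as in the picture.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in: a plugin step maps the current data to new data.
Step = Callable[[list], list]
# The decider sees the last step that ran (None at the start) and the current
# data, and returns the name of the next step, or None to stop.
Decider = Callable[[Optional[str], list], Optional[str]]


def run_pipeline(steps: Dict[str, Step], decide: Decider, data: list) -> list:
    """Run steps in whatever order the decider dictates."""
    last: Optional[str] = None
    while True:
        next_step = decide(last, data)   # control flow (dotted arrows)
        if next_step is None:
            return data
        data = steps[next_step](data)    # data flow (solid arrows)
        last = next_step


ORDER = ["RETRIEVE", "ABSTRACT", "SELECT", "INFER"]


def one_pass(last: Optional[str], data: list) -> Optional[str]:
    """A trivial decider: run each step once, in order, then stop."""
    if last is None:
        return ORDER[0]
    position = ORDER.index(last)
    return ORDER[position + 1] if position + 1 < len(ORDER) else None


def make_step(name: str) -> Step:
    # Dummy step that only records that it ran.
    return lambda data: data + [name]


steps = {name: make_step(name) for name in ORDER}
trace = run_pipeline(steps, one_pass, [])
```

Because the decider returns a step name rather than a yes/no answer, a smarter decider could jump back to SELECT or forward to INFER without any change to run_pipeline.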
Overall I/O spec of the LarKC platform
As agreed in Amsterdam, the overall behaviour of the LarKC platform will be "a SPARQL endpoint on steroids". Something like:
- In: SPARQL query + QoS constraints
- Out: variable bindings (for select queries) or triple set (for transform queries).
QUESTION: should a report on resource usage and indicated quality of the answer also be part of the output?
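As a sketch of what this I/O spec could look like as data types (in Python, with field names such as max_seconds and report that are our own invention and not agreed anywhere), the report is modelled as an optional field so that the question above stays open:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

Triple = Tuple[str, str, str]
Bindings = List[Dict[str, str]]   # one dict per solution: variable -> value


@dataclass
class QoSConstraints:
    # Hypothetical examples of user-level constraints.
    max_seconds: Optional[float] = None
    max_memory_mb: Optional[int] = None


@dataclass
class Request:
    sparql: str
    qos: QoSConstraints = field(default_factory=QoSConstraints)


@dataclass
class Response:
    # Variable bindings for select queries, a triple set otherwise.
    result: Union[Bindings, List[Triple]]
    # Optional report on resource usage and indicated answer quality.
    report: Optional[Dict[str, float]] = None


request = Request("SELECT ?s WHERE { ?s ?p ?o }", QoSConstraints(max_seconds=2.0))
response = Response(result=[{"s": "http://example.org/a"}])
```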
Anytime behaviour
QUESTION: how to deal with anytime behaviour of a SPARQL endpoint (= returning a progressively better set of answers)? Is all this hidden inside the decide-step (so that the user only sees the final answer), or do we want the intermediate answers to be returned to the user over time? A related question is how the anytime behaviour of plugins is combined in such a loop. Similarly for parallel computation.
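One way to picture the two options is a Python generator that yields a progressively better answer set: a caller that wants intermediate answers iterates over it, while a caller that only wants the final answer keeps the last value. This is only a sketch; anytime_answers, refine and add_one are hypothetical names, and the refinement step is a toy.

```python
import time
from typing import Callable, Iterator, Set


def anytime_answers(refine: Callable[[Set[str]], Set[str]],
                    budget_seconds: float,
                    max_rounds: int = 100) -> Iterator[Set[str]]:
    """Yield successively improved answer sets until time runs out
    or the refinement step no longer changes anything."""
    answers: Set[str] = set()
    deadline = time.monotonic() + budget_seconds
    for _ in range(max_rounds):
        if time.monotonic() >= deadline:
            break
        improved = refine(answers)
        if improved == answers:      # no further progress possible
            break
        answers = improved
        yield set(answers)           # copy, so callers cannot mutate our state


POOL = ["a", "b", "c"]


def add_one(current: Set[str]) -> Set[str]:
    # Toy refinement: add the next answer from a fixed pool.
    for candidate in POOL:
        if candidate not in current:
            return current | {candidate}
    return current


intermediate = list(anytime_answers(add_one, budget_seconds=5.0))
final = intermediate[-1]
```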
Informal one-line definitions of each of the plugin-types
Here we try to give informal one-line definitions of each of the plugin-types.
NOTE: if there is disagreement on these one-line definitions, we should sort it out, because that is a sure symptom that we are not sharing the same conceptual view.
- RETRIEVE: find the right collections of resources to answer a query (e.g. find the right triple stores, document sets, info-services, etc)
- ABSTRACT: ensure that content from these collections is in appropriate form for further processing (e.g. vocabulary mapping, or mapping instances from multiple collections to a single ontology)
- SELECT: select a subset from the content (once it is in the right format) (e.g. only pick triples that you expect to be immediately relevant to the query)
- INFER: draw conclusions/hypotheses/etc from the selected information (e.g. do RDFS/OWL inference to derive conclusions)
- DECIDE: decide which step to do next, and which resources to allocate to it (e.g. compare currently obtained quality with minimally required quality and expected progress given available resources to decide on continuation or not)
Examples of each of the plugins
The above one-line definitions are pretty abstract. Hence, we give some examples for each of the plugin-types below. Some of these we made up; we also list all examples that we promised to build in the DoW.
RETRIEVE:
Sindice <http://sindice.com/> as a retrieval plug-in: given identifiers (URIs) for objects, classes, relations, Sindice returns web-locations of RDF resources containing those identifiers (this is what Google does with words and web-pages, but with identifiers and RDF resources instead)
NOTE that folk from WP2 are suggesting to remove this component type. Can we think of other examples for this plugin-type beyond Sindice?
ABSTRACT:
(from SaltLux) generalising phone-caller-data to a social network. Given raw triples about phone calls (source, target, length, time of day), learn from these the triples encoding the social network of the callers (friends, colleagues, family)
classifying: given unclassified instances (from some source) plus an ontology (possibly from another source), compute how to classify the instances in the ontology (= learning rdf:type)
mapping: learning links between two vocabularies, or, stated differently, rephrasing one vocabulary in terms of another (= learning rdfs:subClassOf and owl:sameAs)
NOTE that these are all inductive reasoning tasks; the INFER step does not do induction but only deduction (and possibly abduction).
QUESTION: is a "text-to-triples" extraction also an example of this plugin-type?
SELECT:
- activation spreading based on query terms: given some features (e.g. a query) and a triple set, select from that triple set the most relevant subset.
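A minimal Python sketch of such a SELECT step is given below. It is a deliberately simplified form of activation spreading (each node keeps the maximum activation that reaches it, decayed per hop, and predicates are ignored), and the names spread_activation and select_relevant, the decay factor and the threshold are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def spread_activation(triples: Iterable[Triple], seeds: Set[str],
                      decay: float = 0.5, rounds: int = 2) -> Dict[str, float]:
    """Spread activation from the seed nodes along subject/object links."""
    neighbours = defaultdict(set)
    for subject, _, obj in triples:
        neighbours[subject].add(obj)
        neighbours[obj].add(subject)

    activation = {node: 1.0 for node in seeds}
    frontier = set(seeds)
    for _ in range(rounds):
        next_frontier = set()
        for node in frontier:
            passed_on = activation[node] * decay
            for neighbour in neighbours[node]:
                if passed_on > activation.get(neighbour, 0.0):
                    activation[neighbour] = passed_on
                    next_frontier.add(neighbour)
        frontier = next_frontier
    return activation


def select_relevant(triples: List[Triple], seeds: Set[str],
                    threshold: float = 0.25) -> List[Triple]:
    """Keep triples whose subject and object are both sufficiently active."""
    activation = spread_activation(triples, seeds)
    return [t for t in triples
            if min(activation.get(t[0], 0.0), activation.get(t[2], 0.0)) >= threshold]


graph = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
    ("eve", "knows", "frank"),
]
subset = select_relevant(graph, seeds={"alice"})
```

With these settings the triples within two hops of the query term "alice" are kept, and the output is by construction a subset of the input triple set.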
INFER:
all kinds of deductive reasoning (incomplete, unsound, approx, etc)
QUESTION: should abductive reasoning be allowed to happen here as well? (= "what should I have added to the graph for the given conclusion to follow?"). If so, the SPARQL format is not sufficient as input-format. Do this only as a later extension?
DECIDE:
- simplest version: simple fixed sequence of hardcoded components
- next simplest version: repeat the fixed sequence until out-of-time
- smart version: given some QoS constraints of the user and given R/A/S/I plugins, decide on a distribution of resources between those plugins
- supersmart version: given some QoS constraints from the user, decide on a selection of plugins and distribution of resources.
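The two simplest versions can be sketched in a few lines of Python. The function names are hypothetical, the steps are dummies, and the clock is passed in as a parameter only so that the behaviour can be checked without waiting for real time to pass.

```python
import time
from typing import Callable, List

Step = Callable[[int], int]   # hypothetical stand-in for a plugin invocation


def fixed_sequence(steps: List[Step], data: int) -> int:
    """Simplest version: run a hardcoded sequence of components once."""
    for step in steps:
        data = step(data)
    return data


def repeat_until_timeout(steps: List[Step], data: int, budget_seconds: float,
                         clock: Callable[[], float] = time.monotonic) -> int:
    """Next simplest version: repeat the fixed sequence until out of time.

    The deadline is only checked after a full pass, so at least one pass
    always runs and a pass is never interrupted half-way.
    """
    deadline = clock() + budget_seconds
    while True:
        data = fixed_sequence(steps, data)
        if clock() >= deadline:
            return data


# Fake clock that advances by one "second" per call, for a repeatable demo.
ticks = iter(range(100))
passes = repeat_until_timeout([lambda n: n + 1], 0, budget_seconds=3,
                              clock=lambda: next(ticks))
```

The smart and supersmart versions would replace the fixed list of steps by a choice made at run time, which this sketch does not attempt.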
Plugin signatures
Based on the above one-line informal definitions, we can try to derive the generic signature (I/O-types) of each of the plugins:
RETRIEVE: (find the right collections of resources to answer a query)
- In: features (e.g. terms from query),
- Out: (pointers to) triple sets satisfying these features
(we will write triple sets whenever we mean pointers to triple sets)
ABSTRACT: (ensure that content is in appropriate form for further processing)
- In: triple set(s?)
- Out: triple set(s?) that are suitable abstractions of the input sets
NOTE: we assume that the result of the abstraction/learning is a new set of triples (either entirely different from the input set, or a superset of it).
SELECT: (select a subset from the content)
- In: triple set(s?)
- Out: subset(s?) of the input triple set(s)
INFER: (draw conclusions from the selected information)
- In: triple set + SPARQL query,
- Out: inferred variable bindings (for select queries) or triple set (for transform queries)
DECIDE: (decide which step to do next, and which resources to allocate to it)
- In: SPARQL query,
- Out: variable bindings (for select queries) or triple set (for transform queries)
QUESTION: do all components need access to the original query? If so, this should be added everywhere.
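Written down as Python typing.Protocol classes, the signatures above could look roughly as follows. This is only a transcription of the lists above under our own naming (method names, TripleSet, Result are assumptions); the original query is not passed to every component, since that question is still open.

```python
from typing import Dict, List, Protocol, Set, Tuple, Union

Triple = Tuple[str, str, str]
TripleSet = Set[Triple]            # stands for "pointer to a triple set"
Bindings = List[Dict[str, str]]
Result = Union[Bindings, TripleSet]


class Retrieve(Protocol):
    def retrieve(self, features: List[str]) -> List[TripleSet]: ...


class Abstract(Protocol):
    def abstract(self, inputs: List[TripleSet]) -> List[TripleSet]: ...


class Select(Protocol):
    def select(self, inputs: List[TripleSet]) -> List[TripleSet]: ...


class Infer(Protocol):
    def infer(self, triples: TripleSet, sparql: str) -> Result: ...


class Decide(Protocol):
    def decide(self, sparql: str) -> Result: ...


class KeepSubjects:
    """Toy SELECT plugin: keep only triples whose subject is wanted."""

    def __init__(self, wanted: Set[str]) -> None:
        self.wanted = wanted

    def select(self, inputs: List[TripleSet]) -> List[TripleSet]:
        return [{t for t in triples if t[0] in self.wanted} for triples in inputs]


source: TripleSet = {("a", "p", "b"), ("c", "p", "d")}
selector: Select = KeepSubjects({"a"})
selected = selector.select([source])
```

Note that ABSTRACT and SELECT have the same types here; the difference (new or extended triple set versus subset of the input) is a contract that the types alone do not express.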
Quality of Service aspects
Since LarKC is also about managing resources to get approximate and/or anytime behaviour, the interfaces for the plugins must also capture Quality of Service aspects. Hence, each plugin interface should deal with QoS aspects, as well as with the functional behaviour, as above. We have noticed in other projects (on web-services) that there is very little agreement on a common vocabulary to describe QoS aspects. Here we give some examples of what these could be for the various types.
DECIDE:
This plugin receives QoS constraints from the user (time, memory, allowed user interactions, allowed network traffic), in general any QoS-constraint meaningful to a query-formulating-user. These must then be translated into QoS constraints on the other plugins, in terms that are meaningful to plugin-developers. An example of such a translation is a max. response time dictated by the user, which is translated to a max. number of triples to retrieve by the RETRIEVE plugin, in order to limit the required computation time by the INFERENCE plugin.
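The translation in this example can be sketched as a one-line calculation in Python, under the strong (and probably unrealistic) assumption that inference time grows linearly with the number of triples. The function name, the throughput figure and the fixed overhead are all hypothetical.

```python
def max_triples_for(max_response_seconds: float,
                    infer_triples_per_second: float,
                    overhead_seconds: float = 0.0) -> int:
    """Translate a user-level time limit into a triple limit for RETRIEVE.

    Assumes the INFER plugin processes triples at a constant rate and that
    everything else costs a fixed overhead; both are simplifications.
    """
    usable_seconds = max(0.0, max_response_seconds - overhead_seconds)
    return int(usable_seconds * infer_triples_per_second)


# A 2 second limit, 0.5 s overhead, and an assumed 10,000 triples/s reasoner.
limit = max_triples_for(2.0, 10_000, overhead_seconds=0.5)
```

A real DECIDE plugin would presumably need measured or self-reported cost estimates from the INFER plugin instead of a constant rate.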
RETRIEVE:
Examples of relevant QoS constraints could be:
- upper/lower bounds on the number of triples required;
- importance ranking on input identifiers (must-have, nice-to-have);
- required quality/trust values on retrieved resources
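As an illustration only, these three kinds of constraints could be bundled into one Python structure that a RETRIEVE plugin receives alongside its functional input. The class RetrieveQoS, its field names, the string labels and the check function are assumptions for the sake of the example, not a proposed vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class RetrieveQoS:
    min_triples: int = 0
    max_triples: Optional[int] = None
    # identifier -> "must-have" or "nice-to-have"
    importance: Dict[str, str] = field(default_factory=dict)
    min_trust: float = 0.0


def satisfies(qos: RetrieveQoS, n_triples: int,
              identifiers_found: Set[str], trust: float) -> bool:
    """Check a retrieved resource against the constraints."""
    if n_triples < qos.min_triples:
        return False
    if qos.max_triples is not None and n_triples > qos.max_triples:
        return False
    must_have = {i for i, rank in qos.importance.items() if rank == "must-have"}
    if not must_have <= identifiers_found:
        return False
    return trust >= qos.min_trust


qos = RetrieveQoS(min_triples=10, max_triples=1000,
                  importance={"ex:Alice": "must-have", "ex:Bob": "nice-to-have"},
                  min_trust=0.5)
accepted = satisfies(qos, 200, {"ex:Alice"}, trust=0.8)
rejected = satisfies(qos, 200, {"ex:Bob"}, trust=0.8)
```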
