LarKC Plug-in API - Prototype Version 02
Barry Bishop, Mick Kerrigan - 8th October 2008
1. Introduction
This article contains the developer's notes relating to a prototype implementation of the LarKC plug-in API that has some 'anytime' behaviour.
This prototype is the next step on from the first (batch) version of the API, platform v01.
If you want to take this code and conduct your own experiments then you are strongly advised to fork this code-base and place the code in a suitably named folder in the subversion repository, e.g. platform_barry_experiments23.
The source code for this revision of the API and prototype plug-ins can be found here:
https://svn.gforge.hlrs.de/svn/larkc/trunk/platform_v02
2. How to Run The Demo
If you wish to try out the web-interface then you will need a working installation of Tomcat:
- Copy the platform_v02/web-demo/larkc folder into the webapps folder of your Tomcat instance
- Start your browser and point it to http://localhost:8080/larkc
A more basic way of testing the prototype is to run demo/eu.larkc.demo.ExecuteSparqlQueryAgainstWebResourcesAnytime.main. The query to execute is hard-coded in this Java file.
3. Implemented Plug-ins
In pipe-line order, the following plug-ins are included:
Identify: SindiceKeywordIdentifier, SindiceTriplePatternIdentifier, SwoogleDocumentSearchIdentifier, SwoogleOntologySearchIdentifier, SwoogleTermSearchIdentifier
Transform: SPARQLToTriplePatternQueryTransformer, SPARQLToKeywordQueryTransformer, NullDataTransformer
Select: FirstDataSetOnlySelecter, GrabEverythingSelecter
Reasoner: SparqlQueryEvaluationReasoner
Decide: SimpleConfigurableOnePassDecider, SimpleAnytimeDecider
4. Renaming of data types
We have already had some useful feedback (thanks to Hamish and Naso) regarding the naming of data structures and their functionality. We have started to align with the ORDI data model as put forward by Ontotext.
The data structures (at the moment) are:
InformationSet - base class of all data structures that are processed by LarKC
NaturalLanguageDocument - derived from InformationSet, candidate for data structures that hold free-form text/natural language
RdfGraph - derived from InformationSet, an RDF named graph
RdfStatement - a single triple (quad) of subject, predicate, object (context)
At this stage there is no DataSet or TripleSet from the ORDI data model, but this is expected to change in the near future.
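As a rough illustration of how the RDF types listed above might relate, here is a minimal sketch. The field names, constructors and accessors are assumptions made for this example only (including the idea that an RdfGraph holds a list of RdfStatement objects); they are not taken from the prototype's source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: field names, constructors and accessors are assumptions,
// not the prototype's actual classes.

/** Base class of all data structures processed by LarKC. */
abstract class InformationSet {
}

/** A single triple, or a quad when a context is supplied. */
final class RdfStatement {
    final String subject;
    final String predicate;
    final String object;
    final String context; // null for a plain triple

    RdfStatement(String subject, String predicate, String object, String context) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
        this.context = context;
    }

    boolean isQuad() {
        return context != null;
    }
}

/** An RDF named graph: a name plus the statements it holds (assumed layout). */
class RdfGraph extends InformationSet {
    private final String name;
    private final List<RdfStatement> statements = new ArrayList<>();

    RdfGraph(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    void add(RdfStatement statement) {
        statements.add(statement);
    }

    List<RdfStatement> getStatements() {
        return Collections.unmodifiableList(statements);
    }
}
```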
5. General Approach for this Anytime Experiment
The plug-ins implemented for the first version of the prototype were re-used in this version. It was decided to wrap the existing plug-in interfaces in threaded wrapper classes, in order to shield plug-in developers from:
- the mechanics of passing/serialising data objects between components
- synchronisation issues - the emerging framework takes care of these and plug-in developers can concentrate on implementing atomic units of functionality.
In the (badly named) core/eu.larkc.core.anytime package are container classes that manage a single plug-in. These containers have their own thread, pass input to the plug-in, collect output from the plug-in, listen for and respond to command messages and pass data objects to other plug-in containers by means of external queue objects.
Pipeline construction occurs within the decider when it is instantiated and proceeds as follows:
- The decider instantiates queue objects that serve as the output destination for each plug-in
- Plug-ins and their containers are instantiated and the data pipeline is connected up by passing queue objects to the container constructors
- The control pipe-line is connected up, by calling a method on each container.
- The containers are started, at which point they enter a 'ready' state waiting for an instruction
At the moment, the control pipeline goes directly from one plug-in to the previous plug-in in the pipeline, but in the near future this will instead go via the decider.
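The four construction steps above can be sketched roughly as follows. This is an illustration only: SketchContainer, SketchDecider, connectControl, awaitReady and the "STOP" command are hypothetical names invented for this example, the three-stage pipeline is an arbitrary choice, and the containers do no real work beyond entering their ready state.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from the prototype.
class SketchContainer extends Thread {
    final BlockingQueue<Object> input;   // null for the first (identify) stage
    final BlockingQueue<Object> output;
    final BlockingQueue<String> control = new LinkedBlockingQueue<>();
    private final CountDownLatch ready = new CountDownLatch(1);
    SketchContainer previous;            // control messages go to the previous stage

    SketchContainer(String name, BlockingQueue<Object> input, BlockingQueue<Object> output) {
        super(name);
        this.input = input;
        this.output = output;
    }

    void connectControl(SketchContainer previous) {
        this.previous = previous;
    }

    /** Waits up to two seconds for the container to reach its ready state. */
    boolean awaitReady() {
        try {
            return ready.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    @Override
    public void run() {
        ready.countDown(); // 'ready' state: from here on, wait for an instruction
        try {
            while (!"STOP".equals(control.take())) {
                // A real container would run its plug-in here.
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class SketchDecider {
    final SketchContainer identify;
    final SketchContainer select;
    final SketchContainer reason;
    final BlockingQueue<Object> results;

    SketchDecider() {
        // 1. One queue per plug-in, serving as its output destination.
        BlockingQueue<Object> identifyOut = new LinkedBlockingQueue<>();
        BlockingQueue<Object> selectOut = new LinkedBlockingQueue<>();
        results = new LinkedBlockingQueue<>();

        // 2. Containers receive their queues through the constructor.
        identify = new SketchContainer("identify", null, identifyOut);
        select = new SketchContainer("select", identifyOut, selectOut);
        reason = new SketchContainer("reason", selectOut, results);

        // 3. Control pipeline: each container points at the previous one.
        select.connectControl(identify);
        reason.connectControl(select);

        // 4. Start the containers; each enters its ready state.
        identify.start();
        select.start();
        reason.start();
    }

    void shutdown() {
        for (SketchContainer c : new SketchContainer[] {identify, select, reason}) {
            c.control.add("STOP");
        }
    }
}
```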
From now on, the 'user' who created the decider can proceed in two ways:
- call go() (returns immediately) with no parameters followed by repeated calls to getResult(). Execution terminates when the last call to getResult() returns null.
- call go( int contractSize ) with the number of results required. Execution will stop when a null is returned from getResult() or pause when the number of desired results has been returned. After this (or even before), go(...) can be called again to continue generating answers.
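From the caller's side, the two styles of use described above might look like the sketch below. ToyAnytimeDecider is a hypothetical stand-in that serves a fixed list of answers; only the method names go() and getResult() come from these notes, and everything else (including the choice that getResult() blocks while execution is paused) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a decider; it serves a fixed list of answers.
class ToyAnytimeDecider {
    private static final Object END = new Object(); // marks the end of all results
    private final BlockingQueue<Object> results = new LinkedBlockingQueue<>();
    private long allowed = 0;          // results the worker may still produce
    private boolean unlimited = false; // set by the no-argument go()

    ToyAnytimeDecider(List<String> answers) {
        Thread worker = new Thread(() -> {
            try {
                for (String answer : answers) {
                    awaitPermission();  // pauses here once the contract is used up
                    results.put(answer);
                }
                results.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Returns immediately; results are produced until none remain. */
    synchronized void go() {
        unlimited = true;
        notifyAll();
    }

    /** Returns immediately; allows contractSize further results. */
    synchronized void go(int contractSize) {
        allowed += contractSize;
        notifyAll();
    }

    private synchronized void awaitPermission() throws InterruptedException {
        while (!unlimited && allowed == 0) {
            wait();
        }
        if (!unlimited) {
            allowed--;
        }
    }

    /** Blocks for the next result; returns null once execution has terminated. */
    String getResult() {
        try {
            Object r = results.take();
            if (r == END) {
                results.put(END); // keep the marker so later calls also return null
                return null;
            }
            return (String) r;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}

class AnytimeCaller {
    /** First style: go() once, then call getResult() until it returns null. */
    static List<String> drain(ToyAnytimeDecider decider) {
        List<String> out = new ArrayList<>();
        decider.go();
        for (String r = decider.getResult(); r != null; r = decider.getResult()) {
            out.add(r);
        }
        return out;
    }
}
```

In the second style the caller would call go(2), read two results with getResult(), and later call go(2) again to continue from where the pipeline paused.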
Internally, each plug-in container waits in its ready state (actually blocking on its control queue) until a command message is received. At this point, the thread wakes up, sees that it has been requested to do something, does it and passes the result to its output queue. If the plug-in is not at the beginning of the pipeline (i.e. not an identify component) then it will first request input from the plug-in one step back in the pipeline and wait for this input to arrive.
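A minimal sketch of such a container loop is given below. The names PluginContainer, Command and send are hypothetical, and a plain java.util.function.Function stands in for a real plug-in; the actual classes in core/eu.larkc.core.anytime may be organised quite differently.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: names and message types are assumptions.
class PluginContainer extends Thread {
    enum Command { NEXT, STOP }

    private final Function<Object, Object> plugin; // stand-in; must not return null
    private final PluginContainer previous;        // null for an identify stage
    private final BlockingQueue<Object> input;     // the previous stage's output queue
    private final BlockingQueue<Object> output;
    private final BlockingQueue<Command> control = new LinkedBlockingQueue<>();

    PluginContainer(Function<Object, Object> plugin, PluginContainer previous,
                    BlockingQueue<Object> output) {
        this.plugin = plugin;
        this.previous = previous;
        this.input = (previous == null) ? null : previous.output;
        this.output = output;
        setDaemon(true);
    }

    void send(Command command) {
        control.add(command);
    }

    @Override
    public void run() {
        try {
            while (true) {
                Command command = control.take(); // ready state: block on the control queue
                if (command == Command.STOP) {
                    if (previous != null) {
                        previous.send(Command.STOP); // pass the shutdown upstream
                    }
                    return;
                }
                Object in = null;
                if (previous != null) {
                    previous.send(Command.NEXT); // request input from one step back
                    in = input.take();           // and wait for it to arrive
                }
                output.put(plugin.apply(in));    // run the plug-in, pass the result on
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class ContainerDemo {
    /** Builds a two-stage pipeline, asks for one result and returns it. */
    static Object runOnce() {
        BlockingQueue<Object> identifyOut = new LinkedBlockingQueue<>();
        BlockingQueue<Object> selectOut = new LinkedBlockingQueue<>();
        PluginContainer identify = new PluginContainer(in -> "doc", null, identifyOut);
        PluginContainer select = new PluginContainer(in -> in + "-selected", identify, selectOut);
        identify.start();
        select.start();
        select.send(PluginContainer.Command.NEXT);
        try {
            Object result = selectOut.take();
            select.send(PluginContainer.Command.STOP);
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```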
6. API Modifications Required
Due to the approach taken (of wrapping plug-ins in threaded containers) there were no required changes to the plug-in API. Most of the changes in this version of the prototype involved the creation of the framework for asynchronous execution of plug-ins and communication.
However, a 'contract' parameter has been added (although not used yet) to each method from the plug-in interfaces. The idea is that a decider will be able to control how the plug-in behaves during execution, e.g. by changing the contract size (number of units of output for each request), although this can be extended to include any parameter that can be updated during execution.
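Since the contract parameter is not used yet, the following is only a guess at how a plug-in might honour it once it is. Contract, SketchIdentifier, CountingIdentifier and the method signatures are hypothetical and do not reflect the actual plug-in interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the real 'contract' parameter may look quite different.
final class Contract {
    private final AtomicInteger size;

    Contract(int size) {
        this.size = new AtomicInteger(size);
    }

    /** Number of units of output wanted for each request. */
    int getSize() {
        return size.get();
    }

    /** A decider could call this while the pipeline is running. */
    void setSize(int newSize) {
        size.set(newSize);
    }
}

interface SketchIdentifier {
    List<String> identify(String query, Contract contract);
}

/** Toy plug-in that invents numbered results and returns contract-size many per call. */
class CountingIdentifier implements SketchIdentifier {
    private int next = 0;

    @Override
    public List<String> identify(String query, Contract contract) {
        List<String> out = new ArrayList<>();
        int wanted = contract.getSize(); // read once per request
        for (int i = 0; i < wanted; i++) {
            out.add(query + "#" + next++);
        }
        return out;
    }
}
```

Because the contract object is shared rather than copied, a change made by the decider between two requests takes effect on the next call without any change to the plug-in.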
7. Issues Arising
Observation: Careful thought must be given to the issue of 'contract size'. Different plug-ins might have different units of work for what they do, e.g. Sindice happens to return pages of 10 results per search request, but this should not dictate the quantity of items output to the next stage in the pipeline.
Observation: At the moment, plug-ins bundle up their results and put them on the queue in one 'chunk', e.g. the Sindice results appear as one collection of RDF URLs that is picked up by the next plug-in in the pipeline. However, it might be better (particularly for plug-ins that do intensive processing for each result) if they put results on to their output queues one at a time, regardless of the request's contract size. In that case, a slightly different mechanism is needed to deliver data objects between plug-ins, the most obvious being to insert some kind of 'begin-batch' and 'end-batch' messages.
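One possible shape for such batch markers is sketched below. BatchMessage, its Kind values and the BatchDemo helpers are hypothetical names; nothing like this exists in the prototype yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of 'begin-batch' / 'end-batch' framing on a queue.
final class BatchMessage {
    enum Kind { BEGIN_BATCH, ITEM, END_BATCH }

    final Kind kind;
    final Object payload; // only set for ITEM messages

    private BatchMessage(Kind kind, Object payload) {
        this.kind = kind;
        this.payload = payload;
    }

    static BatchMessage begin() {
        return new BatchMessage(Kind.BEGIN_BATCH, null);
    }

    static BatchMessage item(Object payload) {
        return new BatchMessage(Kind.ITEM, payload);
    }

    static BatchMessage end() {
        return new BatchMessage(Kind.END_BATCH, null);
    }
}

class BatchDemo {
    /** Producer side: results go on the queue one at a time, framed by markers. */
    static void produce(BlockingQueue<BatchMessage> queue, List<?> items) {
        queue.add(BatchMessage.begin());
        for (Object item : items) {
            queue.add(BatchMessage.item(item)); // available downstream immediately
        }
        queue.add(BatchMessage.end());
    }

    /** Consumer side: reads exactly one framed batch from the queue. */
    static List<Object> consumeOneBatch(BlockingQueue<BatchMessage> queue) {
        List<Object> batch = new ArrayList<>();
        try {
            BatchMessage m = queue.take();
            if (m.kind != BatchMessage.Kind.BEGIN_BATCH) {
                throw new IllegalStateException("expected BEGIN_BATCH");
            }
            for (m = queue.take(); m.kind == BatchMessage.Kind.ITEM; m = queue.take()) {
                batch.add(m.payload); // a real consumer could start work on each item here
            }
            if (m.kind != BatchMessage.Kind.END_BATCH) {
                throw new IllegalStateException("expected END_BATCH");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return batch;
    }
}
```

With this framing the consumer does not need to know the contract size in advance, and it can begin processing the first item before the producer has finished the batch.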
