Baseline components version 2

For V2 of the baseline selector components we are:

The objective behind 1/ is that selection based on the current baseline criteria very often returns far more results than are needed, which defeats the purpose of selection and makes it less useful than full query evaluation. For example, if keywords found in SPARQL query literals contain stopwords or words with a high occurrence probability (e.g. "and", "a", "is"), the selected molecules might span many more RDF graph nodes than the original query does. In such situations it makes sense to have a mechanism that limits the result set of the baseline selection algorithms. We implement this by passing a selection limit in the plug-in contract and then selecting no more nodes than this limit allows. As a consequence of this trimming, some of the selected molecules might be incomplete (e.g. if the limit is reached while a selected molecule is still being explored).
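The limit mechanism can be illustrated with a minimal sketch. It is not the LarKC plug-in code: the function name `select_with_limit`, the parameters `molecules` and `seeds`, and the representation of a molecule as a precomputed list of node identifiers are all assumptions made for illustration.

```python
def select_with_limit(molecules, seeds, limit):
    """Select at most `limit` nodes by walking the molecules of the seed nodes.

    molecules: dict mapping a root node to the list of nodes in its molecule
               (hypothetical representation, assumed precomputed).
    seeds:     root nodes matched by the baseline criteria, in visiting order.
    Returns (selected_nodes, partial_roots), where partial_roots lists the
    roots whose molecule was cut short by the limit.
    """
    selected, seen, partial = [], set(), []
    for root in seeds:
        if len(selected) >= limit:
            break  # limit already exhausted, later molecules are not started
        for node in molecules.get(root, []):
            if node in seen:
                continue  # node already selected through another molecule
            if len(selected) >= limit:
                partial.append(root)  # this molecule is only partly selected
                break
            seen.add(node)
            selected.append(node)
    return selected, partial


example = {"a": ["a", "x", "y"], "b": ["b", "y", "z"]}
print(select_with_limit(example, ["a", "b"], limit=4))
# (['a', 'x', 'y', 'b'], ['b'])
```

In the example the molecule of "b" is left incomplete because the limit is reached before "z" can be added, which is the trimming effect described above.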

Putting a limit on the selection makes it important to have the "more relevant" RDF molecules selected first. Inspired by classical information retrieval techniques, we implement the next iteration of the baseline selection components: the IRSelector plug-in. It uses RDF Rank, an RDF graph analysis ranking method similar to Google's PageRank, to compute a measure of the importance of RDF nodes (a further development of the PageRankRDF component presented in D2.4.1). Then, for each node in the graph, a textual molecule is built from all the literals in its molecule and indexed by the Lucene full-text indexing engine. The Lucene indexing and the RDF Rank computation are implemented as a preprocessing step that must be run before selection is possible. At selection time, the query keywords are used as Lucene search terms. Selection is then expanded over the molecules of the most relevant nodes (according to RDF Rank) until the contracted limit is reached. In summary, the IRSelector plug-in implements selection through Lucene's vector space model (VSM) retrieval, with the relevance of results adjusted according to their RDF Rank.
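The ranking idea can be sketched as follows. This is a simplified stand-in, not the IRSelector implementation: a small TF-IDF score replaces Lucene's scoring, and multiplying the text score by the RDF Rank value is an assumption, since the text does not state how the two are combined. All names (`build_index`, `rank_nodes`, `ir_select`, `textual_molecules`, `rdf_rank`) are hypothetical.

```python
import math
from collections import Counter


def build_index(textual_molecules):
    """Preprocessing: term counts per node and document frequency per term."""
    term_counts = {node: Counter(text.lower().split())
                   for node, text in textual_molecules.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def rank_nodes(keywords, term_counts, doc_freq, rdf_rank):
    """Order matching nodes by (TF-IDF text score) * (RDF Rank), best first."""
    n_docs = len(term_counts)
    terms = [kw.lower() for kw in keywords]
    scores = {}
    for node, counts in term_counts.items():
        text_score = sum(counts[t] * math.log(1 + n_docs / doc_freq[t])
                         for t in terms if counts[t] > 0)
        if text_score > 0:
            # Assumed combination: text relevance weighted by node importance.
            scores[node] = text_score * rdf_rank.get(node, 0.0)
    return sorted(scores, key=lambda n: (-scores[n], n))


def ir_select(keywords, textual_molecules, molecules, rdf_rank, limit):
    """Expand the molecules of the best-ranked nodes until `limit` nodes."""
    term_counts, doc_freq = build_index(textual_molecules)
    selected, seen = [], set()
    for root in rank_nodes(keywords, term_counts, doc_freq, rdf_rank):
        for node in molecules.get(root, [root]):
            if len(selected) >= limit:
                return selected
            if node not in seen:
                seen.add(node)
                selected.append(node)
    return selected


texts = {"n1": "semantic web reasoning", "n2": "web search engine",
         "n3": "protein folding"}
ranks = {"n1": 0.2, "n2": 0.7, "n3": 0.9}
mols = {"n1": ["n1", "a", "b"], "n2": ["n2", "c"]}
print(ir_select(["web"], texts, mols, ranks, limit=3))
# ['n2', 'c', 'n1']
```

Here "n1" and "n2" match the keyword equally well, so the higher RDF Rank of "n2" decides that its molecule is expanded first; "n3" has the highest rank but is never selected because it does not match the query.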

LarkcProject/WP2/D222 (last edited 2010-02-26 19:27:49 by DanicaDamljanovic)