Selection
The original idea of selection was to sacrifice accuracy for speed. Given a SPARQL query and a huge dataset (RDF graph), it might take a while to get the answer back. The role of subsetting would be, for example, to partition the RDF graph into smaller subsets offline, so that at query time the process is faster because we execute queries against selected subsets rather than the whole graph. We can look at this as an indexing process: offline you create an index, and later you use that index to perform various searches.
Another example would be some kind of 'ranking' of all statements (RDF triples) in the huge RDF graph. If we then set a threshold and keep only the statements ranked above it, we can still call this subsetting. So, to simplify, selection/subsetting would look like this:
(This happens at runtime; offline we can do all the calculations which take a lot of time, such as building a vector space, an index, etc.)
input: SPARQL query
output: sub-graph which is a correct result (a set of RDF triples); the output should be the same with or without the selection/subsetting method
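The offline/runtime split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the graph is a toy list of tuples, the index is keyed by predicate (one of many possible partitionings), and a single triple pattern stands in for a full SPARQL query. The names `build_index`, `match` and `select_and_match` are hypothetical.

```python
from collections import defaultdict

# Toy RDF graph as (subject, predicate, object) tuples; all data is made up.
GRAPH = [
    ("ex:Merkel", "ex:holdsOffice", "ex:Chancellor"),
    ("ex:Merkel", "ex:livesIn", "ex:Berlin"),
    ("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    ("ex:Paris", "ex:capitalOf", "ex:France"),
]

def build_index(triples):
    """Offline step: partition the graph into subsets keyed by predicate (an assumption)."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[p].append((s, p, o))
    return index

def match(triples, pattern):
    """Return triples matching a pattern; None acts as a wildcard, like a SPARQL variable."""
    return [t for t in triples
            if all(want is None or want == got for want, got in zip(pattern, t))]

def select_and_match(index, triples, pattern):
    """Runtime step: pick a subset via the index, then match only against that subset."""
    predicate = pattern[1]
    subset = index.get(predicate, []) if predicate is not None else triples
    return match(subset, pattern)

INDEX = build_index(GRAPH)                       # done once, offline
pattern = (None, "ex:capitalOf", None)           # stands in for a SPARQL query
full_result = match(GRAPH, pattern)              # without selection
selected_result = select_and_match(INDEX, GRAPH, pattern)  # with selection
```

In this sketch the two results are identical, which is the property the output line above asks for; only the number of triples scanned at runtime differs.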
Using ML for selection
As we are working with RDF, we need to create 'artificial' documents, which in our case we call RDF Molecules. These are lexicalisations of RDF, which we try to process using some method to see if we can get something useful. For example, we could do some kind of clustering, where each cluster would be a set of RDF molecules (or maybe a set of triples relevant for the given query). At runtime (when someone sends a SPARQL query), we could look at only the most relevant cluster, not the whole RDF graph. 'The most relevant cluster' is the one with the highest score.
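A minimal sketch of the idea, under several assumptions that are not part of the notes: a molecule is built by grouping triples by subject, the clusters are taken as already computed offline (here they are simply hard-coded), and the score is the cosine similarity between the query words and the words of a cluster. All names and data are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Toy triples with human-readable labels; all data is made up.
TRIPLES = [
    ("Angela Merkel", "is chancellor of", "Germany"),
    ("Angela Merkel", "lives in", "Berlin"),
    ("Helmut Kohl", "was chancellor of", "Germany"),
    ("Eiffel Tower", "is located in", "Paris"),
    ("Louvre", "is located in", "Paris"),
]

def lexicalise(triples):
    """Group triples by subject and turn each group into one text 'molecule'."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append(f"{s} {p} {o}")
    return {s: " ".join(parts) for s, parts in groups.items()}

def bag_of_words(text):
    """Lower-cased word counts of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_relevant_cluster(clusters, molecules, query_text):
    """Score every cluster against the query and return the best one plus all scores."""
    query_vec = bag_of_words(query_text)
    scores = {}
    for name, subjects in clusters.items():
        centroid = Counter()
        for s in subjects:
            centroid.update(bag_of_words(molecules[s]))
        scores[name] = cosine(query_vec, centroid)
    return max(scores, key=scores.get), scores

MOLECULES = lexicalise(TRIPLES)
# Assumption: clusters were produced offline by some clustering method.
CLUSTERS = {
    "politicians": ["Angela Merkel", "Helmut Kohl"],
    "landmarks": ["Eiffel Tower", "Louvre"],
}
best, scores = most_relevant_cluster(CLUSTERS, MOLECULES, "chancellor Germany")
```

At runtime only the molecules of the cluster named by `best` would be handed to the query engine.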
Active learning
Experimental design
Evaluation
The simplest way would be to compare the results with and without the selection method. The results should be the same, i.e. the selection method must take care not to lose relevant data.
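Such a comparison could be sketched as follows. The function name `compare_results` and the use of recall as a summary number are assumptions made for illustration; the result sets are made-up triples.

```python
def compare_results(full, selected):
    """Compare query results obtained without (full) and with (selected) selection."""
    full_set, selected_set = set(full), set(selected)
    missing = full_set - selected_set   # relevant data lost by selection
    extra = selected_set - full_set     # results that should not be there
    recall = 1.0 if not full_set else len(full_set & selected_set) / len(full_set)
    return {
        "identical": not missing and not extra,
        "recall": recall,
        "missing": missing,
        "extra": extra,
    }

# Made-up result sets: selection dropped one relevant triple.
full = [("a", "p", "b"), ("a", "p", "c"), ("d", "p", "e")]
selected = [("a", "p", "b"), ("d", "p", "e")]
report = compare_results(full, selected)
```

A selection method passes this check only when `report["identical"]` is true; `missing` shows which relevant triples were lost.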
Mutual collaborations between WP2 and WP3 from WP3’s Viewpoint
A. What WP2 can contribute to WP3
- Selection
  - WP2’s SUNS approach has three selection steps:
    - Definition and retrieval of the population. The issue here is to define a group of entities that are homogeneous with respect to the query to be posed.
    - Subsampling the population. Random sampling, link-following sampling and active sampling are options here.
    - Selection of the SUNS.
  - I assume that the selection that WP2 pursued attempts to select all information that is known about an entity of interest. Let’s assume that I want to make inferences about Angela Merkel; selection then finds the available information on Angela Merkel. In contrast, WP3 is interested in selecting entities that are similar to Angela Merkel, i.e. all German chancellors, or all German politicians, or all women living in Berlin. Thus it is not immediately clear how the selection step of WP2 can support WP3.
- WP2 is deriving textual features that could be interesting features for WP3.
- WP3 is exploring whether random projections (in the context of compressed sensing) are applicable to the SUNS model.
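Two of the subsampling options named in the list above, random sampling and link-following sampling, can be sketched as below. This is one possible reading only: link following is interpreted here as a breadth-first walk over RDF links starting from a seed entity, and the function names, the seed and the toy data are all assumptions.

```python
import random
from collections import defaultdict, deque

# Toy graph; all data is made up.
TRIPLES = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "knows", "d"),
    ("e", "knows", "f"),
]
POPULATION = {s for s, _, _ in TRIPLES} | {o for _, _, o in TRIPLES}

def random_sample(entities, k, seed=0):
    """Draw k entities uniformly at random (seeded so the run is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(sorted(entities), k)

def link_following_sample(triples, start, k):
    """Collect up to k entities by following links breadth-first from a seed entity."""
    neighbours = defaultdict(set)
    for s, _, o in triples:
        neighbours[s].add(o)
        neighbours[o].add(s)   # assumption: links are followed in both directions
    seen, queue = [start], deque([start])
    while queue and len(seen) < k:
        node = queue.popleft()
        for nxt in sorted(neighbours[node]):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
                if len(seen) == k:
                    break
    return seen

uniform = random_sample(POPULATION, 3)
linked = link_following_sample(TRIPLES, "a", 3)
```

Random sampling can pick entities from anywhere in the population, while the link-following variant only ever returns entities connected to the seed.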
B. What WP3 can contribute to WP2
- Active Learning:
  - The basic idea is the following. A selection step by WP2 returns too many instances, let’s say 100000.
  - The goal is now to select, let’s say, the 100 most relevant instances out of the returned 100000 instances.
  - In a first step, each instance is described by a feature vector, as in the ML SUNS approach.
  - In a second step, a kernel matrix is formed based on the instances and instance features.
  - Based on the kernel matrix, the 100 most relevant instances are selected (based on the approach by Yu, Lin, Tresp (2006)).
  - The 100 most relevant instances are made available to the reasoning engine.
  - In a way: ranking via active learning.
- Linking RDF Graph with textual features. This is of great interest to us and we believe that we can contribute here to the ranking of the entities.
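The active learning steps above (feature vectors, kernel matrix, selection of a small subset) can be sketched as follows. This is not claimed to be the cited approach: the selection rule is a generic greedy, experimental-design-style heuristic swapped in for illustration, the kernel is a plain linear kernel, and the regularisation constant `mu`, the function names and the feature values are all assumptions.

```python
def linear_kernel(features):
    """Kernel matrix of inner products between all pairs of feature vectors."""
    return [[sum(x * y for x, y in zip(a, b)) for b in features] for a in features]

def greedy_select(kernel, m, mu=0.1):
    """Greedily pick m instances that best 'explain' the remaining kernel matrix."""
    n = len(kernel)
    K = [row[:] for row in kernel]   # work on a copy
    chosen = []
    for _ in range(min(m, n)):
        best, best_score = None, -1.0
        for i in range(n):
            if i in chosen:
                continue
            # How much of the remaining kernel mass instance i accounts for.
            score = sum(K[j][i] ** 2 for j in range(n)) / (K[i][i] + mu)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        # Deflate: remove what the chosen instance already explains.
        col = [K[j][best] for j in range(n)]
        denom = K[best][best] + mu
        for a in range(n):
            for b in range(n):
                K[a][b] -= col[a] * col[b] / denom
    return chosen

# Made-up feature vectors: three near-duplicates and one clearly different instance.
FEATURES = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
KERNEL = linear_kernel(FEATURES)
chosen = greedy_select(KERNEL, m=2)
```

With these toy features the heuristic should pick one of the three near-duplicates and then the instance that differs from them, rather than two near-duplicates. The pure-Python loops are only practical for small inputs; at the scale of 100000 instances a numerical library and a low-rank or sampled kernel would be needed.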
