Parallelisation for Identification and Selection
As mentioned in this page, in order to do within-plugin parallelisation, we need to prepare the following and send it to HLRS:
1. Algorithm (in general, the source code of the plug-in, stating whether it is already embedded within a pipeline/workflow, uses the LarKC API, etc.; or the source code of the complete workflow, if it exists)
2. Data set
3. Input query
4. Expected output
5. Performance metrics and some evaluation results (before parallelization)
6. Guidelines on how to test the code (together with the software needed to do so)
Random Indexing
Both USFD and MPG are experimenting with Random Indexing (using the SemanticVectors and AirHead libraries). Generating vectors takes quite a long time for huge datasets such as those we experiment with in LarKC. Moreover, it takes a while to calculate the inner product between a query vector and the vectors of the semantic space. Could parallelisation help here?
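To illustrate why vector generation looks like a candidate for parallelisation, here is a minimal Random Indexing sketch. It does not use the SemanticVectors or AirHead APIs. The class name, the dimension, the window size and the seeding of index vectors from the term's hash code are all assumptions made for the example, and Java 8 or later is assumed for the stream API. Each document is processed independently into a partial semantic space, and the partial spaces are then merged by vector addition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Stream;

// Hypothetical sketch; not the SemanticVectors or AirHead implementation.
class ParallelRandomIndexing {
    static final int DIM = 64;      // vector length (illustrative; real spaces are much larger)
    static final int NON_ZERO = 4;  // number of +1/-1 entries drawn per index vector

    // Sparse ternary index vector, seeded by the term itself so that every
    // thread derives the same vector without sharing any state.
    static int[] indexVector(String term) {
        Random rnd = new Random(term.hashCode());
        int[] v = new int[DIM];
        for (int i = 0; i < NON_ZERO; i++) {
            v[rnd.nextInt(DIM)] += rnd.nextBoolean() ? 1 : -1;
        }
        return v;
    }

    // Adds to each token's context vector the index vectors of its
    // neighbours within +/- 'window' positions of one document.
    static void accumulate(String[] tokens, int window, Map<String, int[]> space) {
        for (int i = 0; i < tokens.length; i++) {
            int[] context = space.computeIfAbsent(tokens[i], t -> new int[DIM]);
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j == i) continue;
                int[] index = indexVector(tokens[j]);
                for (int d = 0; d < DIM; d++) context[d] += index[d];
            }
        }
    }

    // Merges one partial space into another by element-wise addition.
    static void merge(Map<String, int[]> into, Map<String, int[]> from) {
        from.forEach((term, vec) -> into.merge(term, vec, (a, b) -> {
            for (int d = 0; d < DIM; d++) a[d] += b[d];
            return a;
        }));
    }

    // Builds the space sequentially or in parallel; each worker fills its own
    // HashMap, so no locking is needed until the merge step.
    static Map<String, int[]> build(List<String> documents, int window, boolean parallel) {
        Stream<String> docs = parallel ? documents.parallelStream() : documents.stream();
        return docs.<Map<String, int[]>>collect(
                HashMap::new,
                (space, doc) -> accumulate(doc.trim().toLowerCase().split("\\s+"), window, space),
                ParallelRandomIndexing::merge);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat on the mat", "the dog sat on the log");
        Map<String, int[]> space = build(docs, 2, true);
        System.out.println(space.size() + " terms indexed");
    }
}
```

Because integer addition is associative and commutative, the parallel build produces the same vectors as the sequential one, which makes it easy to check the parallel version against the expected output.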
SemanticVectors (Proposal by MPG)
Prepare a self-contained example and send it to HLRS so that they can parallelise the SemanticVectors library?
AirHead
Requirement from: USFD
USFD Participants:
HLRS Participants:
We have developed a Random Indexing Selector which wraps the AirHead library: http://code.google.com/p/airhead-research
Using this library we perform two operations:
- Generating vectors from a text file: the input is a text file, and the output is a binary file with vectors, usually named 'something.sspace'.
- Searching, e.g. finding similar words using the cosine function: the input is a term and a path to the vectors file (something.sspace), and the output is a set of similar terms (for example synonyms).
Both generating vectors and searching are time-consuming on the subsets of LLD with which we are experimenting. We are currently trying to parallelise these operations with help from HLRS.
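The search operation described above can be sketched as follows. This is a minimal illustration, not the AirHead API: the in-memory map stands in for a loaded .sspace file, and the class and method names are hypothetical. Java 8 or later is assumed. The cosine between the query term's vector and every other vector is computed independently, so the scan can be split across threads.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; the Map is a stand-in for a semantic space loaded from disk.
class ParallelCosineSearch {

    // Cosine similarity of two vectors of equal length; 0.0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the k terms most similar to 'term', best first.
    // Similarities are computed in parallel over the entries of the space.
    static List<String> mostSimilar(Map<String, double[]> space, String term, int k) {
        double[] query = space.get(term);
        if (query == null) return Collections.emptyList();
        return space.entrySet().parallelStream()
                .filter(e -> !e.getKey().equals(term))
                .map(e -> new AbstractMap.SimpleEntry<String, Double>(
                        e.getKey(), cosine(query, e.getValue())))
                .sorted((x, y) -> Double.compare(y.getValue(), x.getValue()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, double[]> space = new HashMap<>();
        space.put("car", new double[] {1.0, 0.0});
        space.put("auto", new double[] {0.9, 0.1});
        space.put("banana", new double[] {0.0, 1.0});
        System.out.println(mostSimilar(space, "car", 2));
    }
}
```

Whether this helps in practice depends on where the time goes: if most of it is spent reading the vectors file from disk rather than computing cosines, parallelising the scan alone may give little speedup, which is worth measuring before and after.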
Another possible case for parallelisation is generation of text from an RDF Graph.
User interests based Selection and Query Refinement (by WICI)
Requirement from: WICI
WICI Participants:
HLRS Participants:
For user-interest-based selection, three main sets of tasks could be parallelized:
(1) User interests extraction and calculation;
(2) Interests based Query refinement processing;
(3) Interests based Selection.
For (2), currently, there are two types of query refinement processing that could be parallelized:
- Query refinement using 9 interests at one time;
- Query refinement using 9 interests one by one.
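The "one by one" case can be sketched as a set of independent tasks, one per interest. This is only an illustration under stated assumptions: the page does not say how an interest modifies a query, so refine() below is a hypothetical placeholder, and the evaluate function is supplied by the caller rather than taken from the WICI code. Java 8 or later is assumed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; not the WICI implementation.
class ParallelQueryRefinement {

    // Placeholder refinement (an assumption): appends the interest as an extra constraint.
    static String refine(String baseQuery, String interest) {
        return baseQuery + " AND interest:\"" + interest + "\"";
    }

    // Refines the base query once per interest and evaluates the refined
    // queries concurrently. Results are returned in the order of 'interests'.
    static Map<String, List<String>> runOneByOne(String baseQuery, List<String> interests,
            Function<String, List<String>> evaluate, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
            for (String interest : interests) {
                String refined = refine(baseQuery, interest);
                Callable<List<String>> task = () -> evaluate.apply(refined);
                pending.put(interest, pool.submit(task));
            }
            Map<String, List<String>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<String>>> e : pending.entrySet()) {
                try {
                    results.put(e.getKey(), e.getValue().get()); // waits for this task
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(ex);
                } catch (ExecutionException ex) {
                    throw new IllegalStateException(ex.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> interests = Arrays.asList("semantic web", "reasoning", "search");
        // Stub evaluator that just echoes the refined query.
        System.out.println(runOneByOne("q", interests, Collections::singletonList, 3));
    }
}
```

The "at one time" case presumably yields a single refined query, so any parallelism there would have to come from inside the query evaluation itself rather than from running several queries side by side.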
The materials prepared for HLRS (already sent to Katharina) are attached here as WICI-query-parallelization.rar, including:
- Algorithm;
- Dataset (2);
- Input query;
- Expected output samples;
- Performance metrics and some evaluation results;
- Guidelines on how to test the code;
- How it can be parallelized;
Currently, the program runs in the following environment:
- HP xw8600 Workstation
- CPU: 3.00 GHz
- Memory: 32.0 GB
- Operating System: Windows XP Professional Edition x64
- Software packages needed: JDK 1.6.0_16, Jena 2.6.2, Eclipse 3.5.0
