Using parallel statistical semantics applications in LarKC
This page aims at both users and developers of the statistical semantics applications and provides basic instruction for running parallelised versions of the LarKC applications, in particular on the high-performance systems.
The following application are addressed in the following:
- Semantic Space generation by means of Random Indexing
- Semantic Space search by means of Random Indexing
Semantic Space generation by means of Random Indexing
The application is parallelised by means of the MapReduce technology and runs in a Hadoop based cluster environment. How one can set up such an environment and how to start the application is described in the sections below.
Accessing the BW-Grid cluster of HLRS, preparing the disk workspace
Principally, access is restricted to users who have requested a dedicated login (i.e., username and password) and thus provided a machine name (i.e., a valid publicly available IP Address) from which they are going to connect to the cluster environment.
Said this, users can use any SSH-based tool (in case of Windows) or any Shell that supports remote logins via the SSH protocol to access the system. A connection can be initiated by running the following command:
ssh -l username frbw.dgrid.hlrs.de
As the user's home space on the disk is limited (usually to ~700MB), a dedicated disk space has to be allocated for proceeding the files of big size. This can be done by the command below.
ws_allocate workspace_name duration_in_days !!!Please note, the maximum life duration of the workspace is 30 days, after expiring of which all data are getting lost!!!
The relevant data, such as source code or application input files, can be up/downloaded using scp:
scp corpusfile username@frbw.dgrid.hlrs.de:workspace_name/data/corpusfile
Executing Hadoop-based Random Indexing application
After being logged in, users have to request resources for running their jobs/applications. How this is done is detailed out on the following Wiki sites: https://wickie.hlrs.de/dgrid/index.php/Main_Page
We highly recommend to read through the cluster documentation before working with the environment.
However, we provide here a quick starting guide for "immediate" testing of Hadoop as well as the application.
Note that this configuration is a reasonable suggestion to immediately execute and test the entire stuff on the cluster. It does not provide good performance nor scalability.
If one aims at reaching better performance and higher scalability, the number of nodes need to be significantly increased. We recommand to use at least a minimum of 4 or 8 cluster nodes.
1. Request resources for "interactive" usage
qsub -I -l nodes=[num_of_nodes, e.g 2,4,or 6]:bwgrid,walltime=[max job's life time, hh:mm:ss, e.g. 1:00:00]
2. Set up the Hadoop environment -> see below
3. Run the application -> see below
Setting up the Hadoop environment
To create a Hadoop-based environment, one has to download the Hadoop package from SVN (https://larkc.svn.sourceforge.net/svnroot/larkc/branches/RandomIndexing/Hadoop-Indexing/hadoop_ri_hlrs.zip), copy it into your cluster workspace and run the setup script which automatically initialises the nodes as well as prepares the required settings. When downloaded and extracted, simply launch the following command on your console (Note that you need to reserve the necessary amount of nodes first!!!):
hadoop-0.20.2/setupHadoop.sh workspace_name username
Starting the parallelised indexing
The indexing can be exectued by running the hadoop-ri.jar within the Hadoop environment (i.e., the previous step Setting up the Hadoop environment has to be completed first). Before executing the application, one has to upload the data to the virtual file system because Hadoop requires all data to be stored within its distributed file system (dfs). To start these processes, simply enter the following commands into your console:
hadoop-0.20.2/bin/hadoop dfs -copyFromLocal workspace_name/data/corpusfile corpus-dir1
hadoop-0.20.2/bin/hadoop jar hadoop-ri.jar corpus-dir1 output-dir/output.sspace
Further information about the ?HadoopRandomIndexing is provided here: http://code.google.com/p/airhead-research/wiki/HadoopRandomIndexing
Hint: An excellent documentation on "Running Hadoop applications" can be found at the following URL: http://www.michael-noll.com/tutorials/
Semantic Space search by means of Random Indexing (GATE)
The pilot implementation of the application, based on the Airhead Semantic Spaces library (http://code.google.com/p/airhead-research/w/list), is available at LarKC@SF svn at https://larkc.svn.sourceforge.net/svnroot/larkc/branches/RandomIndexing/MPI-Search/.
The parallelised Airhead Semantic Space library is located at https://larkc.svn.sourceforge.net/svnroot/larkc/branches/RandomIndexing/Airhead/
The application is parallelised by means of the Message-Passing interface (MPI). MPI is a standard for developing the process-based parallel applications to be run in distributed computing and cluster environments. More about MPI specification for the Java programming language can be found at http://www.hpjava.org/reports/mpiJava-spec/mpiJava-spec/mpiJava-spec.html.
Among the available implementation of the MPI library for the Java applications, we advise to use mpiJava (the source code as well as the installation package are available at http://sourceforge.net/projects/mpijava/).
The following steps are required for building and running the application:
1. Logging in to the cluster's front-end
ssh -l username frbw.dgrid.hlrs.de
2. Setting up the mpiJava configuration environment
module load compiler/gnu/4.4.3 module load mpi/openmpi/1.4.3-static-gnu-4.4.3 module load compiler/java/jdk1.6.0_20 module load mpi/mpijava
3. Building the application
1) Please update the mpi.jar library in both Airhead/lib and MPI-Search/lib from the one located at "$MPIJAVA_HOME/lib"
2) The application can be build with ant, same as for the sequential version
4. Running the application
- 4.1. Requesting the cluster's parallel nodes
qsub -I -X -l nodes=[number of nodes requested]:bwgrid,walltime=[estimated execution time, hh:mm:ss]
*Note*: a standard cluster environment only supports jobs with the execution time less as 24 hours.
- 4.2. Splitting the semantic space into chunks, corresponding to the number of the processes requested
The MPI version supposes the semantic space to be split up into chunks, the number of those should strictly correspond to the number of processes the application is running on. Please note that in case if the number of the files containing semantic subspaces doesn't correspond to the number of the processes, the application can not work correctly.
The subspaces should be stored in the format [original space]_[subspace number, 0..n].sspace
For partitioning a semantic space file, please use the utility program sspace-fragmenter.jar, which can be found in "misc" folder of the package contained in svn. The usage format is as follows:
java -jar sspace-fragmenter.jar -p [part of the original semantic space to be included in each of subspaces; e.g. 0.5 means that 2 subspaces will be created, each of them containing a half of the vectors from the original semantic space] -f [format of the output subspaces, e.g. sparce_binary] [old space file name] [new space file name]
See a usage example in fragment.sh script, located in the "misc" folder.
- 4.3. Launching the application on the requested nodes
prunjava [number of processes] [all the java arguments and parameters as for the sequential version]
For more examples how to run a parallel version of the application, please also see start-up scripts from "use_cases" folder of the code checked out from svn.
