Using parallel statistical semantics applications in LarKC

This page aims at both users and developers of the statistical semantics applications and provides basic instruction for running parallelised versions of the LarKC applications, in particular on the high-performance systems.

The following application are addressed in the following:

Semantic Space generation by means of Random Indexing

The application is parallelised by means of the MapReduce technology and runs in a Hadoop based cluster environment. How one can set up such an environment and how to start the application is described in the sections below.

Accessing the BW-Grid cluster of HLRS, preparing the disk workspace

Principally, access is restricted to users who have requested a dedicated login (i.e., username and password) and thus provided a machine name (i.e., a valid publicly available IP Address) from which they are going to connect to the cluster environment.

Said this, users can use any SSH-based tool (in case of Windows) or any Shell that supports remote logins via the SSH protocol to access the system. A connection can be initiated by running the following command:

ssh -l username frbw.dgrid.hlrs.de

As the user's home space on the disk is limited (usually to ~700MB), a dedicated disk space has to be allocated for proceeding the files of big size. This can be done by the command below.

ws_allocate workspace_name duration_in_days

!!!Please note, the maximum life duration of the workspace is 30 days, after expiring of which all data are getting lost!!!

The relevant data, such as source code or application input files, can be up/downloaded using scp:

scp corpusfile username@frbw.dgrid.hlrs.de:workspace_name/data/corpusfile

Executing Hadoop-based Random Indexing application

After being logged in, users have to request resources for running their jobs/applications. How this is done is detailed out on the following Wiki sites: https://wickie.hlrs.de/dgrid/index.php/Main_Page

We highly recommend to read through the cluster documentation before working with the environment.

However, we provide here a quick starting guide for "immediate" testing of Hadoop as well as the application.

Note that this configuration is a reasonable suggestion to immediately execute and test the entire stuff on the cluster. It does not provide good performance nor scalability.

If one aims at reaching better performance and higher scalability, the number of nodes need to be significantly increased. We recommand to use at least a minimum of 4 or 8 cluster nodes.

1. Request resources for "interactive" usage

qsub -I -l nodes=[num_of_nodes, e.g 2,4,or 6]:bwgrid,walltime=[max job's life time, hh:mm:ss, e.g. 1:00:00]

2. Set up the Hadoop environment -> see below

3. Run the application -> see below

Setting up the Hadoop environment

To create a Hadoop-based environment, one has to download the Hadoop package from SVN (https://larkc.svn.sourceforge.net/svnroot/larkc/branches/RandomIndexing/Hadoop-Indexing/hadoop_ri_hlrs.zip), copy it into your cluster workspace and run the setup script which automatically initialises the nodes as well as prepares the required settings. When downloaded and extracted, simply launch the following command on your console (Note that you need to reserve the necessary amount of nodes first!!!):

hadoop-0.20.2/setupHadoop.sh workspace_name username

Starting the parallelised indexing

The indexing can be exectued by running the hadoop-ri.jar within the Hadoop environment (i.e., the previous step Setting up the Hadoop environment has to be completed first). Before executing the application, one has to upload the data to the virtual file system because Hadoop requires all data to be stored within its distributed file system (dfs). To start these processes, simply enter the following commands into your console:

hadoop-0.20.2/bin/hadoop dfs -copyFromLocal workspace_name/data/corpusfile corpus-dir1

hadoop-0.20.2/bin/hadoop jar hadoop-ri.jar corpus-dir1 output-dir/output.sspace

Further information about the ?HadoopRandomIndexing is provided here: http://code.google.com/p/airhead-research/wiki/HadoopRandomIndexing

Hint: An excellent documentation on "Running Hadoop applications" can be found at the following URL: http://www.michael-noll.com/tutorials/

Semantic Space search by means of Random Indexing (GATE)

The pilot implementation of the application, based on the Airhead Semantic Spaces library (http://code.google.com/p/airhead-research/w/list), is available at LarKC@SF svn at https://larkc.svn.sourceforge.net/svnroot/larkc/branches/RandomIndexing/MPI-Search/.

The parallelised Airhead Semantic Space library is located at https://larkc.svn.sourceforge.net/svnroot/larkc/branches/RandomIndexing/Airhead/

The application is parallelised by means of the Message-Passing interface (MPI). MPI is a standard for developing the process-based parallel applications to be run in distributed computing and cluster environments. More about MPI specification for the Java programming language can be found at http://www.hpjava.org/reports/mpiJava-spec/mpiJava-spec/mpiJava-spec.html.

Among the available implementation of the MPI library for the Java applications, we advise to use mpiJava (the source code as well as the installation package are available at http://sourceforge.net/projects/mpijava/).

The following steps are required for building and running the application:

1. Logging in to the cluster's front-end

ssh -l username frbw.dgrid.hlrs.de

2. Setting up the mpiJava configuration environment

module load compiler/gnu/4.4.3
module load mpi/openmpi/1.4.3-static-gnu-4.4.3
module load compiler/java/jdk1.6.0_20
module load mpi/mpijava

3. Building the application

1) Please update the mpi.jar library in both Airhead/lib and MPI-Search/lib from the one located at "$MPIJAVA_HOME/lib"

2) The application can be build with ant, same as for the sequential version

4. Running the application

qsub -I -X -l nodes=[number of nodes requested]:bwgrid,walltime=[estimated execution time, hh:mm:ss]

*Note*: a standard cluster environment only supports jobs with the execution time less as 24 hours.

The MPI version supposes the semantic space to be split up into chunks, the number of those should strictly correspond to the number of processes the application is running on. Please note that in case if the number of the files containing semantic subspaces doesn't correspond to the number of the processes, the application can not work correctly.

The subspaces should be stored in the format [original space]_[subspace number, 0..n].sspace

For partitioning a semantic space file, please use the utility program sspace-fragmenter.jar, which can be found in "misc" folder of the package contained in svn. The usage format is as follows:

java -jar sspace-fragmenter.jar -p [part of the original semantic space
to be included in each of subspaces; e.g. 0.5 means that 2 subspaces
will be created, each of them containing a half of the vectors from the
original semantic space] -f [format of the output subspaces, e.g.
sparce_binary] [old space file name] [new space file name]

See a usage example in fragment.sh script, located in the "misc" folder.

prunjava [number of processes] [all the java arguments and parameters as for the sequential version]

For more examples how to run a parallel version of the application, please also see start-up scripts from "use_cases" folder of the code checked out from svn.

LarkcProject/statisticalSemantics/parallelisation (last edited 2011-05-04 15:08:05 by ?DanicaDamljanovic)