Keyword Reasoner
This page gives an overview of the tool built in WP4 to support reasoning in the use case WP7b.
Description
The plugin takes as input a list of keywords (plain English words; see below for an example) and extends this set with new keywords derived from ontologies. Keywords can be derived either from a single ontology, which the user must specify, or from the Linked Life Data (LLD) initiative.
Keywords are classified into three different groups or categories (see [1] for details):
- Group A: contains terms/keywords referring to the organ the disease resides in (e.g. for the disease "lung cancer" this group would contain the keyword "lung").
- Group B: contains terms referring to etiological factors, such as "smoking" for the disease "lung cancer". This group is also used for general terms such as "genome-wide association study".
- Group C: contains mechanistic terms such as "DNA damage" or "genetic disease".
How to derive keywords in group A?
The current implementation of the plugin (and its associated Keyword Reasoner) can derive only keywords that belong to group A. The basic idea of the algorithm is to search an ontology for concepts that refer to body parts or anatomical structures and whose names match the keyword's value (either exactly or partially). Once such a concept has been found, its descendants are proposed as new keywords. We call this anatomical reasoning: it uses the ontology's subclass hierarchy to derive new group A keywords from an initial set of group A keywords.
For example, suppose the input keyword is lung and we want to derive new keywords using the MeSH ontology. We search the ontology for a concept whose name is lung (see below for how the algorithm determines a concept's name). In this case the MeSH ontology contains the concept http://org.snu.bike/MeSH#lung. Because group A contains terms that refer to body organs, it makes sense to retrieve this concept's descendants and return each of them as a new keyword: each descendant is part of the same organ and is thus a valid group A keyword. One such descendant is the concept http://org.snu.bike/MeSH#extravascular_lung_water, which is returned as a new keyword (the actual keyword returned is extravascular lung water). The ontology may not contain a concept named lung, in which case we try to find concepts whose names are related to the input keyword, e.g. names that are super-strings of the keyword's value.
More concretely, the algorithm to derive keywords in group A is described in the following pseudo-code:
inputKeywords = initial set of keywords (as defined by the user or domain expert);
outputKeywords = empty set;
allFlag = true;
threshold = the threshold defined by the user (e.g. 3.5);
usesubstringmatch = true;
FOR EACH keyword K in inputKeywords DO {
    concepts = search for ontology concepts whose name is an exact match with K's string value;
    FOR EACH concept C in concepts DO {
        IF (allFlag) {
            outputKeywords += all the descendants of concept C;
        } ELSE {
            outputKeywords += only direct sub concepts of concept C;
        }
    }
    IF (usesubstringmatch) {
        otherConcepts = search for ontology concepts whose name is a super-string of K's string value;
        FOR EACH concept D in otherConcepts DO {
            IF (distance(D,K) <= threshold) {
                outputKeywords += D;
                IF (allFlag) {
                    outputKeywords += all the descendants of concept D;
                } ELSE {
                    outputKeywords += only direct sub concepts of concept D;
                }
            }
        }
    }
}
return outputKeywords;
In other words, for each keyword K in the input list the algorithm searches the ontology (or LLD, if that option is selected) for all concepts whose name exactly matches K's value. The algorithm determines a concept's name from its skos:prefLabel property (or from skos:altLabel or rdfs:label if the former is missing). If none of these properties is present, the concept's name is derived from its URI, with characters such as '_' and '-' removed.
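The name-resolution rule just described can be sketched in Python as follows. This is only an illustration, not the reasoner's actual code: the function name concept_name, the dictionary used to hold a concept's properties, and the choice to replace '_' and '-' with spaces (suggested by the extravascular lung water example) are assumptions.

```python
from typing import Mapping

# Label properties in order of preference; rdfs:label is the standard
# RDF Schema label property (assumed here as the last fallback label).
LABEL_PROPERTIES = ("skos:prefLabel", "skos:altLabel", "rdfs:label")


def concept_name(uri: str, properties: Mapping[str, str]) -> str:
    """Return a readable name for an ontology concept (illustrative sketch)."""
    for prop in LABEL_PROPERTIES:
        label = properties.get(prop)
        if label:
            return label
    # Fallback: take the local part of the URI (after '#' or the last '/')
    # and replace '_' and '-' with spaces.
    local = uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    return local.replace("_", " ").replace("-", " ")


print(concept_name("http://org.snu.bike/MeSH#extravascular_lung_water", {}))
```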
Once these concepts have been retrieved, the next step is to return the descendants of each concept as new keywords (all descendants if allFlag is true, otherwise only the direct sub-concepts). In addition, if usesubstringmatch is set to true, the algorithm retrieves those concepts whose names are super-strings of the keyword K's value. Because these concepts may refer to a wide range of terms (e.g. "cancer research facility"), we need to make sure that concepts that do not refer to anatomical structures or body parts are discarded. To do this, the algorithm computes the distance between the keyword K and each concept D, and only those concepts D whose distance is less than or equal to a given threshold are used; the rest are discarded. The final step consists of retrieving the descendants of the concepts that were not discarded.
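The steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Keyword Reasoner's implementation: the ONTOLOGY dictionary is invented data, concepts are identified by their names only, and length_gap is a stand-in for the real distance measures.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical sub-class hierarchy: concept name -> direct sub-concepts.
ONTOLOGY: Dict[str, List[str]] = {
    "lung": ["bronchi", "pulmonary alveoli"],
    "pulmonary alveoli": ["blood air barrier"],
    "lung cancer research facility": ["lung cancer research ward"],
    "lung lobe": [],
}


def all_concepts() -> Set[str]:
    names = set(ONTOLOGY)
    for children in ONTOLOGY.values():
        names.update(children)
    return names


def descendants(concept: str, all_levels: bool) -> Set[str]:
    """Direct sub-concepts, or the whole subtree when all_levels is True."""
    found: Set[str] = set()
    stack = list(ONTOLOGY.get(concept, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.add(child)
            if all_levels:
                stack.extend(ONTOLOGY.get(child, []))
    return found


def derive_keywords(keywords: Iterable[str],
                    distance: Callable[[str, str], float],
                    threshold: float,
                    all_levels: bool = True,
                    substring_match: bool = True) -> Set[str]:
    output: Set[str] = set()
    for keyword in keywords:
        for concept in all_concepts():
            if concept == keyword:
                # Exact match: propose the descendants as new keywords.
                output |= descendants(concept, all_levels)
            elif substring_match and keyword in concept:
                # Super-string match: keep only concepts close to the keyword.
                if distance(concept, keyword) <= threshold:
                    output.add(concept)
                    output |= descendants(concept, all_levels)
    return output


def length_gap(concept: str, keyword: str) -> float:
    """Stand-in distance: number of extra characters in the concept name."""
    return len(concept) - len(keyword)


print(sorted(derive_keywords(["lung"], length_gap, threshold=6)))
```

With this toy data, "lung cancer research facility" is dropped because its distance to "lung" exceeds the threshold, while "lung lobe" and the descendants of "lung" are kept.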
Currently, the Keyword Reasoner supports two measures of the distance between a keyword and a concept: the Normalized Google Distance and the Levenshtein distance. In the experiments run so far the latter has proven quite useful, as it discards many unwanted terms/concepts such as "cancer research facility", which is not a valid keyword in group A.
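To make the effect of the Levenshtein threshold concrete, here is a textbook dynamic-programming implementation of the Levenshtein distance in Python. It is a generic sketch and not the code used by the Keyword Reasoner; the threshold of 6 is simply the value used in the workflow examples on this page.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


THRESHOLD = 6  # example value only
for keyword, concept in [("breast", "breast cyst"),
                         ("cancer", "cancer research facility")]:
    d = levenshtein(keyword, concept)
    print(keyword, "->", concept, d, "kept" if d <= THRESHOLD else "discarded")
```

Here "breast cyst" is 5 edits away from "breast" and would be kept, whereas "cancer research facility" is 18 edits away from "cancer" and would be discarded.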
How to Use the Keyword Reasoner
The following sections show how to run the Keyword Reasoner to derive keywords for a given set of initial keywords. The section after that shows how to use the reasoner from LarKC by running the LarKC plugin KeywordReasonerPlugin.
Method 1: Compiling and Running as a Java Project in Eclipse
Download the source code from the SVN at source code
Edit/Run the main class nl.vu.few.krr.larkc.keywordreasoner.Test to test the KeywordReasoner.
You can run this class by executing the following command: java Test [keywordfile] [src=[0|1]] {o=ontologyfile} {t=value} {-s} {d=[1|2]} {-a}
The meaning of each parameter is as follows:
- keywordfile: the file that contains the initial set of keywords (REQUIRED).
- src: indicates whether the reasoner should use the given ontology (as specified by the parameter o) or LLD (REQUIRED).
  src = 0 use LLD.
  src = 1 use given ontology.
- o: the URL of the file that contains the ontology to use. If src = 1 this parameter must be present.
- t: the threshold used to discard concepts (e.g. 2.4).
- d: the type of distance measure to use.
  d = 1 means Normalized Google Distance.
  d = 2 means Levenshtein distance.
- -s: if present, the algorithm uses sub-string matching to find concepts in the ontology.
- -a: instructs the reasoner to consider all the descendants of a given concept.
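As an illustration of how these parameters combine, the following Python helper assembles the argument list for the command shown above. The helper build_test_command and the file names are hypothetical, and the Java classpath is omitted just as in the command line above; only the parameter syntax is taken from this page.

```python
from typing import List, Optional


def build_test_command(keyword_file: str, src: int,
                       ontology: Optional[str] = None,
                       threshold: Optional[float] = None,
                       distance: Optional[int] = None,
                       substring_match: bool = False,
                       all_descendants: bool = False) -> List[str]:
    """Build the 'java Test ...' argument list (illustrative helper)."""
    if src not in (0, 1):
        raise ValueError("src must be 0 (LLD) or 1 (given ontology)")
    if src == 1 and not ontology:
        raise ValueError("o=ontologyfile is required when src = 1")
    args = ["java", "Test", keyword_file, f"src={src}"]
    if ontology:
        args.append(f"o={ontology}")
    if threshold is not None:
        args.append(f"t={threshold}")
    if substring_match:
        args.append("-s")
    if distance is not None:
        args.append(f"d={distance}")
    if all_descendants:
        args.append("-a")
    return args


# Hypothetical file names, Levenshtein distance with threshold 6.
print(" ".join(build_test_command("keywords.txt", 1, ontology="meshonto.owl",
                                  threshold=6, distance=2,
                                  substring_match=True, all_descendants=True)))
```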
Method 2: Using the Keyword Manager Tool (GUI)
Another way to try the Keyword Reasoner, and the easiest one, is to run the Keyword Manager Tool, which offers a GUI that lets the user specify all the required parameters and view the results in a table.
To use it, download the tool from the SVN at Keyword Manager Tool and then run the JAR file keywordmanagertool.jar with the following command:
java -jar keywordmanagertool.jar
Method 3: Using the Web-based Interface
The Keyword Reasoner can also be accessed from a Web interface. The interface has a layout similar to the GUI of the Keyword Manager Tool and thus allows the user to specify all the required parameters.
To run the reasoner using the Web interface do the following:
1- Download the reasoner from the SVN at tool
2- Execute the following command: java -Xmx512m -jar keywordreasoning.jar IP where IP is the IP address of the machine where the tool is running. For example: java -Xmx512m -jar keywordreasoning.jar 127.0.0.1
3- Point your browser to the following address: http://IP:8070/ where IP is the same IP as before.
How to Use the KeywordReasonerPlugin Plugin in LarKC
The plugin is not yet shipped with the latest version of the LarKC platform and is not (yet) in the LarKC project's SVN. To run and use the KeywordReasonerPlugin plugin you therefore need to get the code from the SVN (see below) and then integrate it into the LarKC platform, which is as easy as copying a file.
Note: once the plugin becomes a stable component of the platform, the only thing users will have to do in order to use it is run the platform (step 5) and invoke it from a workflow (steps 6 onward). There will be no need to follow steps 3 and 4.
Follow these steps to try it out:
1- Download the plugin from the SVN at KeywordReasonerPlugin
2- Download the latest version of the platform (currently v2.5). See LarKC to know how to do this.
3- Run the following command: mvn assembly:assembly or, in Eclipse: Run As -> Maven assembly:assembly
4- Integrate/deploy the plugin into the platform by copying the file plugin.KeywordReasonerPlugin-0.0.1-SNAPSHOT-LarkcPluginAssembly into the platform's plugins folder, e.g. platform/plugins.
5- Run the LarKC platform by executing the provided scripts or by running the main class eu.larkc.core.Larkc via Run As -> Java Application. If the platform was able to load the plugin you should be able to see a message like this:
14:17:16.573 INFO e.l.c.p.PluginRegistry: Registered the eu.larkc.plugin.keywordreasoner.KeywordReasonerPlugin
6- Point your browser to the following address: http://localhost:8182/ (provided the platform is running on localhost)
7- Submit a workflow definition that makes use of the plugin. This could be ANY workflow description. As an example we can try to submit the following LarKC workflow that uses a single plugin, namely the eu.larkc.plugin.keywordreasoner.KeywordReasonerPlugin
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix larkc: <http://larkc.eu/schema#> .
# Define the plug-ins
_:krPlugin a <urn:eu.larkc.plugin.reasoner.keywordreasoner.KeywordReasonerPlugin> .
_:plugin2 a <urn:eu.larkc.plugin.SOStoVBtransformer> .
# define the parameters
_:krPlugin larkc:hasParameter _:pp1 .
_:pp1 <urn:larkc.keywordreasoner.keywordsfile> "http://www.few.vu.nl/~gtagni/data/initialkeywords1.txt" .
_:pp1 <urn:larkc.keywordreasoner.sourcetype> "1" .
_:pp1 <urn:larkc.keywordreasoner.ontology> "http://wasp.cs.vu.nl/larkc/ontology/medical-ontology/meshonto.owl" .
_:pp1 <urn:larkc.keywordreasoner.threshold> "6" .
_:pp1 <urn:larkc.keywordreasoner.substringmatch> "true" .
_:pp1 <urn:larkc.keywordreasoner.semanticdistance> "2" .
_:pp1 <urn:larkc.keywordreasoner.directsubclasses> "false" .
# connect the plugins
_:krPlugin larkc:connectsTo _:plugin2 .
# Define a path to set the input and output of the workflow
_:path a larkc:Path .
_:path larkc:hasInput _:krPlugin .
_:path larkc:hasOutput _:plugin2 .
# Connect an endpoint to the path
<urn:eu.larkc.endpoint.sparql.ep1> a <urn:eu.larkc.endpoint.sparql> .
<urn:eu.larkc.endpoint.sparql.ep1> larkc:links _:path .
Note: In order to test this plugin you need to use a LarKC plugin to transform the results returned by the plugin into a set of variable bindings. For this, you can use the LarKC plugin eu.larkc.plugin.SOStoVBtransformer. Remember that you will need to download and build the eu.larkc.plugin.SOStoVBtransformer plugin.
The keywords and ontology files can also be located in the local file system. For example, you can instruct the reasoner to use a local copy of the MeSH ontology by passing the following parameter:
_:pp1 <urn:larkc.keywordreasoner.ontology> "file://path/to/the/file/ontology.owl" .
Keywords are stored in the keywords file as a list of words, one per line. The following is an example of a list of keywords.
breast cancer breast cyst
To try deriving keywords from the LLD change the source type as follows:
_:pp1 <urn:larkc.keywordreasoner.sourcetype> "0".
8- After successful workflow creation, use the following URL to get an endpoint for a specific workflow: HTTP GET /workflow/WID/endpoint/?urn=urn where WID is the unique ID of your workflow (as returned by successful workflow creation) and urn is the endpoint's urn (e.g. <urn:myTestEndpoint>). For example, suppose we get WID = 135cc37f-b6e5-490c-8f5b-8271e8d20e17; then to get an endpoint you must do the following:
HTTP GET http://localhost:8182/workflow/135cc37f-b6e5-490c-8f5b-8271e8d20e17/endpoint?urn=urn:eu.larkc.endpoint.sparql.ep1
9- The URL of your endpoint will be returned (e.g. http://localhost:8183/testendpoint). Start your workflow by sending an HTTP POST request to the endpoint of your workflow. Example:
curl -d "query=SELECT * WHERE {?s ?p ?o}" http://127.0.1.1:8183/testendpoint
or by submitting the SPARQL query from the Web interface, like this:
http://localhost:8183/sparql?query=SELECT+%2A+WHERE+%7B%3Fs+%3Fp+%3Fo%7D
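The two URLs used in steps 8 and 9 can also be assembled programmatically. The Python sketch below only builds the strings and sends no requests; the helper names endpoint_lookup_url and query_url are hypothetical, and the host and port values are the ones used in the examples above.

```python
from urllib.parse import urlencode

PLATFORM = "http://localhost:8182"  # platform address from step 6


def endpoint_lookup_url(workflow_id: str, endpoint_urn: str) -> str:
    """URL for the HTTP GET that returns the endpoint of a workflow."""
    # safe=":" keeps the colons of the URN readable, as in the example above.
    return (f"{PLATFORM}/workflow/{workflow_id}/endpoint?"
            + urlencode({"urn": endpoint_urn}, safe=":"))


def query_url(endpoint: str, sparql: str) -> str:
    """URL that submits a SPARQL query as a form-encoded 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


print(endpoint_lookup_url("135cc37f-b6e5-490c-8f5b-8271e8d20e17",
                          "urn:eu.larkc.endpoint.sparql.ep1"))
print(query_url("http://localhost:8183/sparql", "SELECT * WHERE {?s ?p ?o}"))
```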
The KeywordReasonerPlugin returns a SetOfStatements representing the resulting list of keywords. Each keyword is represented by a blank node with 4 predicates specifying the keyword's name, the URI of the ontology concept from which the keyword derives, the confidence level and the semantic distance measure. The following is an example of the results returned by the plugin:
_:k1, urn:larkc.keywordreasoner.results.keyword.name, "nipple" .
_:k1, urn:larkc.keywordreasoner.results.keyword.uri, "http://org.snu.bike/MeSH#nipple" .
_:k1, urn:larkc.keywordreasoner.results.keyword.cl, "1"^^<http://www.w3.org/2001/XMLSchema#int> .
_:k1, urn:larkc.keywordreasoner.results.keyword.sd, "-1.0"^^<http://www.w3.org/2001/XMLSchema#double> .
_:k2, urn:larkc.keywordreasoner.results.keyword.name, "human mammary gland" .
_:k2, urn:larkc.keywordreasoner.results.keyword.uri, "http://org.snu.bike/MeSH#human_mammary_gland" .
_:k2, urn:larkc.keywordreasoner.results.keyword.cl, "1"^^<http://www.w3.org/2001/XMLSchema#int> .
_:k2, urn:larkc.keywordreasoner.results.keyword.sd, "-1.0"^^<http://www.w3.org/2001/XMLSchema#double> .
_:k3, urn:larkc.keywordreasoner.results.keyword.name, "breast cyst" .
_:k3, urn:larkc.keywordreasoner.results.keyword.uri, "http://org.snu.bike/MeSH#breast_cyst" .
_:k3, urn:larkc.keywordreasoner.results.keyword.cl, "2"^^<http://www.w3.org/2001/XMLSchema#int> .
_:k3, urn:larkc.keywordreasoner.results.keyword.sd, "5.0"^^<http://www.w3.org/2001/XMLSchema#double> .
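A client consuming these statements will typically want one record per keyword. The Python sketch below groups the statements by blank node; it assumes the statements have already been parsed into (subject, predicate, value) tuples with datatypes stripped, and the function name group_keywords is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PREFIX = "urn:larkc.keywordreasoner.results.keyword."


def group_keywords(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Dict[str, str]]:
    """Collect the name/uri/cl/sd values of each keyword blank node."""
    keywords: Dict[str, Dict[str, str]] = defaultdict(dict)
    for subject, predicate, value in triples:
        if predicate.startswith(PREFIX):
            keywords[subject][predicate[len(PREFIX):]] = value
    return dict(keywords)


# Subset of the sample results above, as plain tuples.
triples = [
    ("_:k1", PREFIX + "name", "nipple"),
    ("_:k1", PREFIX + "uri", "http://org.snu.bike/MeSH#nipple"),
    ("_:k1", PREFIX + "cl", "1"),
    ("_:k1", PREFIX + "sd", "-1.0"),
    ("_:k3", PREFIX + "name", "breast cyst"),
    ("_:k3", PREFIX + "cl", "2"),
    ("_:k3", PREFIX + "sd", "5.0"),
]

for node, fields in group_keywords(triples).items():
    print(node, fields)
```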
KeywordReasonerPlugin2 is a keyword reasoner plugin that accepts its parameters from a SPARQL query, like this:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX gwas: <http://www.gate.ac.uk/gwas#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX krr: <http://www.cs.vu.nl/krr#>
SELECT * WHERE {
gwas:x rdf:type gwas:Experiment .
gwas:x gwas:hasName "experiment1" .
gwas:x gwas:hasKeywordGroup gwas:g1 .
gwas:g1 gwas:hasKeyword "lung" .
gwas:g1 gwas:hasKeyword "cancer" .
krr:x krr:sourceType "1" .
krr:x krr:ontologyFile "http://wasp.cs.vu.nl/larkc/ontology/medical-ontology/meshonto.owl" .
krr:x krr:threshold "6" .
krr:x krr:substringMatch "true" .
krr:x krr:semanticDistance "2" .
krr:x krr:directSubclasses "false" }
That is, the list of initial keywords can be provided in the SPARQL query, where each keyword is stated as:
gwas:g1 gwas:hasKeyword "lung"
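Such a query can be generated from a keyword list and a parameter table. The following Python sketch shows one way to do this; build_query is a hypothetical helper, the prefixes and property names are copied from the query above, and no escaping of quotes inside keywords is attempted.

```python
from typing import Dict, Iterable

PREFIXES = """PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX gwas: <http://www.gate.ac.uk/gwas#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX krr: <http://www.cs.vu.nl/krr#>"""


def build_query(keywords: Iterable[str], params: Dict[str, str],
                experiment: str = "experiment1") -> str:
    """Assemble a query in the shape shown above (illustrative only)."""
    lines = [
        "gwas:x rdf:type gwas:Experiment .",
        f'gwas:x gwas:hasName "{experiment}" .',
        "gwas:x gwas:hasKeywordGroup gwas:g1 .",
    ]
    # One hasKeyword statement per initial keyword.
    lines += [f'gwas:g1 gwas:hasKeyword "{k}" .' for k in keywords]
    # One krr statement per reasoner parameter.
    lines += [f'krr:x krr:{name} "{value}" .' for name, value in params.items()]
    return PREFIXES + "\nSELECT * WHERE {\n" + "\n".join(lines) + "\n}"


print(build_query(["lung", "cancer"], {
    "sourceType": "1",
    "ontologyFile": "http://wasp.cs.vu.nl/larkc/ontology/medical-ontology/meshonto.owl",
    "threshold": "6",
    "substringMatch": "true",
    "semanticDistance": "2",
    "directSubclasses": "false",
}))
```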
Before posting a SPARQL query to a workflow that contains the KeywordReasonerPlugin2, you have to specify the corresponding workflow like this:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix larkc: <http://larkc.eu/schema#> .
# Define the plug-ins
_:krPlugin a <urn:eu.larkc.plugin.reasoner.keywordreasoner.KeywordReasonerPlugin2> .
_:plugin2 a <urn:eu.larkc.plugin.SOStoVBtransformer> .
# connect the plugins
_:krPlugin larkc:connectsTo _:plugin2 .
# Define a path to set the input and output of the workflow
_:path a larkc:Path .
_:path larkc:hasInput _:krPlugin .
_:path larkc:hasOutput _:plugin2 .
# Connect an endpoint to the path
<urn:eu.larkc.endpoint.sparql.ep1> a <urn:eu.larkc.endpoint.sparql> .
<urn:eu.larkc.endpoint.sparql.ep1> larkc:links _:path .
That is, you have to name KeywordReasonerPlugin2 (instead of KeywordReasonerPlugin) as follows:
_:krPlugin a <urn:eu.larkc.plugin.reasoner.keywordreasoner.KeywordReasonerPlugin2> .
The following is an example of a SPARQL query that uses the Normalized Google Distance (i.e. the semantic distance type is 1) with threshold 0.3 on Linked Life Data (i.e. the source type is 0):
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX gwas: <http://www.gate.ac.uk/gwas#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX krr: <http://www.cs.vu.nl/krr#>
SELECT * WHERE {
gwas:x rdf:type gwas:Experiment .
gwas:x gwas:hasName "experiment1" .
gwas:x gwas:hasKeywordGroup gwas:g1 .
gwas:g1 gwas:hasKeyword "lung" .
gwas:g1 gwas:hasKeyword "cancer" .
krr:x krr:sourceType "0" .
krr:x krr:ontologyFile "http://wasp.cs.vu.nl/larkc/ontology/medical-ontology/meshonto.owl" .
krr:x krr:threshold "0.3" .
krr:x krr:substringMatch "true" .
krr:x krr:semanticDistance "1" .
krr:x krr:directSubclasses "false" }
KeywordReasonerPlugin3 is a keyword reasoner plugin that returns a SPARQL query containing not only the original SPARQL query but also the extended set of keywords obtained by the keyword reasoner. This plugin can therefore be integrated with the existing GWAS workflow, like this:
# Workflow Description
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix larkc: <http://larkc.eu/schema#> .
# Define three plug-ins
_:krPlugin a <urn:eu.larkc.plugin.reasoner.keywordreasoner.KeywordReasonerPlugin3> .
_:plugin1 a <urn:eu.larkc.plugin.identifier.gwas.GWASIdentifier> .
_:plugin2 a <urn:eu.larkc.plugin.SOStoVBtransformer> .
# Connect the plug-ins
_:krPlugin larkc:connectsTo _:plugin1 .
_:plugin1 larkc:connectsTo _:plugin2 .
# Define a path to set the input and output of the workflow
_:path a larkc:Path .
_:path larkc:hasInput _:krPlugin .
_:path larkc:hasOutput _:plugin2 .
# Connect an endpoint to the path
_:ep a <urn:eu.larkc.endpoint.sparql> .
_:ep larkc:links _:path .
The SPARQL query for the GWAS workflow above can be stated as:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX gwas: <http://www.gate.ac.uk/gwas#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX krr: <http://www.cs.vu.nl/krr#>
SELECT * WHERE {
gwas:x rdf:type gwas:Experiment .
gwas:x gwas:hasName "experiment1" .
gwas:x gwas:hasKeywordGroup gwas:g1 .
gwas:g1 gwas:hasKeyword "lung" .
gwas:g1 gwas:hasKeyword "cancer" .
gwas:x gwas:searchInRif "false" .
gwas:x gwas:useUMLS "false" .
gwas:x gwas:searchMode "1" .
gwas:x gwas:dateConstraint "20110412" .
gwas:x gwas:hasSnpId "rs1051730" .
gwas:x gwas:hasSnpId "rs8034191" .
gwas:x gwas:hasSnpId "rs3117582" .
gwas:x gwas:hasSnpId "rs4324798" .
gwas:x gwas:hasSnpId "rs401681" .
krr:x krr:sourceType "1" .
krr:x krr:ontologyFile "http://wasp.cs.vu.nl/larkc/ontology/medical-ontology/meshonto.owl" .
krr:x krr:threshold "6" .
krr:x krr:substringMatch "true" .
krr:x krr:semanticDistance "2" .
krr:x krr:directSubclasses "false" }
Here is a result of the GWAS workflow which uses the KeywordReasonerPlugin3:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head>
<variable name="s"/>
<variable name="p"/>
<variable name="o"/>
</head>
<results>
<result>
<binding name="s">
<uri>http://linkedlifedata.com/resource/hapmap/snp/rs4324798</uri>
</binding>
<binding name="p">
<uri>urn:snpHasScore</uri>
</binding>
<binding name="o">
<literal datatype="http://www.w3.org/2001/XMLSchema#double">0.10000000149011612</literal>
</binding>
</result>
<result>
<binding name="s">
<uri>http://linkedlifedata.com/resource/hapmap/snp/rs401681</uri>
</binding>
<binding name="p">
<uri>urn:snpHasScore</uri>
</binding>
<binding name="o">
<literal datatype="http://www.w3.org/2001/XMLSchema#double">0.4000000059604645</literal>
</binding>
</result>
<result>
<binding name="s">
<uri>http://linkedlifedata.com/resource/hapmap/snp/rs8034191</uri>
</binding>
<binding name="p">
<uri>urn:snpHasScore</uri>
</binding>
<binding name="o">
<literal datatype="http://www.w3.org/2001/XMLSchema#double">0.10000000149011612</literal>
</binding>
</result>
<result>
<binding name="s">
<uri>http://linkedlifedata.com/resource/hapmap/snp/rs1051730</uri>
</binding>
<binding name="p">
<uri>urn:snpHasScore</uri>
</binding>
<binding name="o">
<literal datatype="http://www.w3.org/2001/XMLSchema#double">0.4000000059604645</literal>
</binding>
</result>
<result>
<binding name="s">
<uri>http://linkedlifedata.com/resource/hapmap/snp/rs3117582</uri>
</binding>
<binding name="p">
<uri>urn:snpHasScore</uri>
</binding>
<binding name="o">
<literal datatype="http://www.w3.org/2001/XMLSchema#double">0.4000000059604645</literal>
</binding>
</result>
</results>
</sparql>
Here is an example of the output of the KeywordReasonerPlugin3. The extended keywords of the initial keyword group (i.e. Group 1) are stated as keywords in Group 2, like this:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX gwas: <http://www.gate.ac.uk/gwas#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX krr: <http://www.cs.vu.nl/krr#>
SELECT * WHERE {
gwas:x gwas:hasKeywordGroup gwas:g2 .
gwas:g2 gwas:hasKeyword "bronchi" .
gwas:g2 gwas:hasKeyword "blood air barrier" .
gwas:g2 gwas:hasKeyword "pulmonary alveoli" .
gwas:g2 gwas:hasKeyword "extravascular lung water" .
gwas:x rdf:type gwas:Experiment .
gwas:x gwas:hasName "experiment1" .
gwas:x gwas:hasKeywordGroup gwas:g1 .
gwas:g1 gwas:hasKeyword "lung" .
gwas:g1 gwas:hasKeyword "cancer" .
gwas:x gwas:searchInRif "false" .
gwas:x gwas:useUMLS "false" .
gwas:x gwas:searchMode "1" .
gwas:x gwas:dateConstraint "20110412" .
gwas:x gwas:hasSnpId "rs1051730" .
gwas:x gwas:hasSnpId "rs8034191" .
gwas:x gwas:hasSnpId "rs3117582" .
gwas:x gwas:hasSnpId "rs4324798" .
gwas:x gwas:hasSnpId "rs401681" .
krr:x krr:sourceType "1" .
krr:x krr:ontologyFile "http://wasp.cs.vu.nl/larkc/ontology/medical-ontology/meshonto.owl" .
krr:x krr:threshold "6" .
krr:x krr:substringMatch "true" .
krr:x krr:semanticDistance "2" .
krr:x krr:directSubclasses "false" }
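The shape of this output, the original query plus a second keyword group, can be reproduced with a small amount of string handling. The Python sketch below is an assumption about how such an extension could be built, not the plugin's actual code; extend_query is a hypothetical helper and the short input query is a cut-down stand-in for the full GWAS query.

```python
from typing import Iterable


def extend_query(query: str, new_keywords: Iterable[str], group: str = "g2") -> str:
    """Insert a new keyword group right after the opening brace of the
    WHERE clause, leaving the original statements untouched."""
    head, brace, body = query.partition("{")
    if not brace:
        raise ValueError("query has no WHERE block")
    extra = [f"gwas:x gwas:hasKeywordGroup gwas:{group} ."]
    extra += [f'gwas:{group} gwas:hasKeyword "{k}" .' for k in new_keywords]
    return head + brace + "\n" + "\n".join(extra) + body


original = """SELECT * WHERE {
gwas:x gwas:hasKeywordGroup gwas:g1 .
gwas:g1 gwas:hasKeyword "lung" .
}"""

print(extend_query(original, ["bronchi", "pulmonary alveoli"]))
```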
Source Code
The source code of both the KeywordReasoner and the KeywordReasonerPlugin is maintained by Gaston Tagni and Zhisheng Huang. Contact them if you have questions.
The source code of the KeywordReasoner, including the GUI tools, can be downloaded from code
The source code of the LarKC plugin that uses the KeywordReasoner can be downloaded from code
References
[1] D7b.3.1a "Version 1 Iteration Report"
