Data Repository
This page describes the LinkedLifeData data source analysis process. It serves as the main page for discussions between WP7a and WP7b about the selection of individual structured databases and their transformation to RDF.
Selected list of data sources
See also WP7b data repository
The list summarizes the data sources and their current status. Some data sources cover a very wide variety of information, so the Dataset column denotes which types of data we focus on. The Status column takes one of four values: "evaluation" (the structure and data are being reviewed), "under development" (the transformation is being implemented), "completed" (ready to be deployed on LinkedLifeData), or "revised" (problems were detected and need to be fixed).
Database | Dataset | Schema | Description | Status |
UniProt | Curated entries | Original by the provider | Protein sequences and annotations | completed |
Entrez-Gene | Complete | Custom RDF schema | Genes and annotations | completed |
iProClass | Complete | Custom RDF schema | Protein cross-references | completed |
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus | completed |
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature | completed |
National Cancer Institute - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database | completed |
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database | completed |
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions | completed |
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database | completed |
KEGG | Complete | BioPAX 1.0 (original by the provider) | Metabolic pathways | completed |
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Metabolic pathways | completed |
NCBI Taxonomy | Complete | Custom RDF schema | Organisms | completed |
Medline | Complete | Custom schema | Medline citations | under development (to be verified only) |
UMLS | SNOMED (no significant effort to include others as well) | Custom schema | Meta-thesaurus | completed |
TODO: Remove the differences between the schemata used by the LifeSKIM application and those provided with LinkedLifeData.
TODO: Add an extra column to group the knowledge sources by LinkedLifeData variant (e.g., PIKB).
Transformation to RDF
This section is aimed at LinkedLifeData development contributors. It discusses how the different sources can be recreated. We should consider an additional service that allows downloading already generated data sources.
Database name | Last processed release | Download link | ORDI descriptor | RDF schema | Converter | Short comment |
UniProt | 14.0 | ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf | no (distributed in RDF) | ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/core.owl | | The converter filters out non-curated entries. Revise the changes in the blank node generation. |
Entrez-Gene | ? | no (database dump) | The DB schema and the script to import the data are also needed. | | | |
Gene Ontology | 5.631 | no (database dump) | The DB schema and the script to import the data are also needed. | | | |
Taxonomy | ? | no (database dump) | The DB schema and the script to import the data are also needed. | | | |
UMLS | 2008AA | no (database dump) | The DB schema and the script to import the data are also needed. | | | |
DrugBank | ? | The converter reads the first two fields. The dump has formatting problems. | | | | |
BioGRID | 2.0.39 | no (database dump) | The database is aligned to the BioPAX schema. |
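Several rows above mention converters that work on database dumps, including one that reads the first two fields of a dump with formatting problems. As a rough illustration only, and not the actual ORDI converter code, the Python sketch below shows one way such a conversion could look. It assumes a tab-delimited dump whose first two fields are an identifier and a name; the `http://example.org/lld/` namespace, the function names, and the choice of `rdfs:label` as the predicate are hypothetical and do not come from this page.

```python
import csv
from urllib.parse import quote

# Hypothetical namespace: the real LinkedLifeData URIs are not given on this page.
BASE = "http://example.org/lld/"
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"


def escape_literal(text):
    """Escape a string so it can be written as an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t"))


def dump_to_ntriples(lines, source):
    """Yield one N-Triples statement per usable row of a tab-delimited dump.

    Only the first two fields are read (assumed here: identifier, name).
    Rows with fewer than two fields or an empty identifier are skipped,
    so a dump with formatting problems does not abort the conversion.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or not row[0].strip():
            continue  # malformed or empty row
        # Percent-encode the identifier so the subject is a valid URI.
        subject = f"<{BASE}{source}/{quote(row[0].strip(), safe='')}>"
        label = escape_literal(row[1].strip())
        yield f'{subject} {RDFS_LABEL} "{label}" .'
```

A caller could pass an open file object as `lines`, for example `dump_to_ntriples(open(path, encoding="utf-8", newline=""), "drugbank")`, and write each yielded statement to an `.nt` file. Skipping malformed rows silently is a design choice of this sketch; a real converter would probably log them so the dump can be repaired.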
TODO: Complete the list with Medline and all BioPAX sources that require transformations.
TODO: Revise the code of the ORDI descriptors/converters and upload the new ones to the LarKC version control at SourceForge.
