Information extraction

This page describes the information extraction process to be applied to LinkedLifeData resources. Our understanding is that the documents to be analyzed, together with their metadata, will also be part of the knowledge base.

Corpora

To perform information extraction correctly, we need good evaluation criteria for measuring precision and recall. The corpora listed below could serve as reference data for that evaluation.
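As a minimal sketch of how precision and recall could be measured against a gold-standard corpus, the snippet below scores predicted annotations by exact match. The `Span` type, the function name, and the sample offsets are hypothetical and only illustrate the idea. Exact matching on offsets and type is an assumption, since the page does not say which matching criterion is intended.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    """Hypothetical annotation: document id, character offsets, entity type."""
    doc_id: str
    start: int
    end: int
    label: str


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scoring: a prediction is correct only if all fields match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up gold annotations for one document (illustrative only).
gold = {
    Span("doc1", 0, 5, "Gene"),
    Span("doc1", 10, 18, "Cell Line"),
    Span("doc1", 25, 28, "DNA"),
    Span("doc1", 40, 47, "RNA"),
}
# Made-up system output: two exact hits and one boundary error.
predicted = {
    Span("doc1", 0, 5, "Gene"),
    Span("doc1", 10, 18, "Cell Line"),
    Span("doc1", 25, 29, "DNA"),  # end offset differs, so it counts as wrong
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.667 recall=0.500 f1=0.571
```

A more lenient criterion, such as overlap-based matching, would change only the way true positives are counted. The precision, recall and F1 formulas stay the same.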

| Corpus | Comment | Size | Converter to GATE |
|---|---|---|---|
| GENIA | | 2000 MEDLINE abstracts: terms “blood cells”, “human”, “transcription factors” | no |
| BioCreative I | | 15000 (including testing data) | yes |
| BioCreative II | | 15000 from BioCreative I + 5000 new | yes |
| Penn BioIE CYP450 | | 1100 files, 370784 base annotations, 53875 specific annotations, 0 relations, 1147 chains | no |
| Penn BioIE Malignancy | | 1157 files, 341767 base annotations, 31886 specific annotations, 12 relations, 1251 chains | no |

Possible biomedical entities to be extracted

Entity

Type

GENIA

BioCreative I

BioCreative II

Penn BioIE CYP450

Penn BioIE Malignancy

DNA

Named Entity

+ (no mapping to database entries)

RNA

Named Entity

+ (no mapping to database entries)

Cell Line

Named Entity

+ (no mapping to database entries)

Cell Culture

Named Entity

+ (no mapping to database entries)

Gene

Named Entity

+ (no mapping to database entries)

+ (document level mappings to Entrez-Gene; fly, mouse, yeast)

+ (document level mapping to Entrez-Gene; human)

+ (categories: gene-protein?, gene-rna?, gene-generic?)

CYP450?

Named Entity

?

Substance

Named Entity

any protein, chemical etc.

Quantitative Measurements

Named Entity

+ (units, value, quantity)

+ (units)

Malignancy

Named Entity

+

Biological Process (hierarchy)

Relation

+

Artificial Process

Relation

+

Correlation (co-occurrence?)

Relation

+

Gene Variations

Relation

+ (type, location, state-original, event)

Protein Protein Interaction

Relation

+

POS

Lexical

+

+

+

Token

Lexical

+

+

+

Sentence

Lexical

+

+

+

+

+

Treebank

Lexical

+

+

+

LarkcProject/WP7a/IE (last edited 2009-04-09 07:18:30 by VassilMomtchev)