Pointers to work on InformationExtraction here.
For an introduction to IE, see this Encyclopaedia of Language and Linguistics article, or the GATE IE pages.
- Aiming to extract large numbers of facts (millions) from the Web:
"The extraction starts from as few as 10 seed facts, requires no additional input knowledge or annotated text, and emphasizes scale and coverage by avoiding the use of syntactic parsers, named entity recognizers, gazetteers, and similar text processing tools and resources".
The first paper extracts relations of a given kind (e.g. birth year of people): million-fact-aaai06.pdf
The second paper acquires additional attributes to be harvested next: million-fact-www07.pdf
(And judging by the authors, Google is interested in relational data (i.e. beyond just words) as well!)
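
To make the seed-fact idea in the quote above concrete, here is a toy sketch of one bootstrapping round in Python. It is not the papers' algorithm, only an illustration under simplifying assumptions: the "patterns" are just the literal text found between the two halves of a seed fact, subjects are assumed to be runs of capitalised words, and values are assumed to be four-digit years. The names `learn_patterns`, `apply_patterns`, `SENTENCES` and `SEEDS` are made up for this example, and the sentences are hand-written stand-ins for Web text.

```python
import re

# Hand-written stand-ins for Web sentences (illustrative data only).
SENTENCES = [
    "Mozart was born in 1756 in Salzburg.",
    "Einstein was born in 1879 in Ulm.",
    "Ada Lovelace was born in 1815 in London.",
    "Chopin (born 1810) was a composer.",
    "Turing (born 1912) was a mathematician.",
]

# Seed facts: (person, birth year).
SEEDS = [("Mozart", "1756"), ("Chopin", "1810")]


def learn_patterns(sentences, seeds):
    """Collect the literal text that sits between subject and value of a seed."""
    patterns = set()
    for subject, value in seeds:
        regex = re.compile(re.escape(subject) + r"(.+?)" + re.escape(value))
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(sentences, patterns):
    """Use each learned infix to pull new (subject, year) pairs out of the text."""
    facts = set()
    for infix in patterns:
        # Assumption: subject is one or more capitalised words, value is a year.
        regex = re.compile(
            r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)" + re.escape(infix) + r"(\d{4})"
        )
        for sentence in sentences:
            facts.update(regex.findall(sentence))
    return facts


patterns = learn_patterns(SENTENCES, SEEDS)
facts = apply_patterns(SENTENCES, patterns)
```

In a real setting the newly found facts would be filtered and fed back in as seeds for the next round; that ranking and filtering step is where most of the difficulty lies and is left out here.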
