General plan:

The "Self-Computing Semantic Web", V.0

The "billion triple challenge" <http://challenge.semanticweb.org/> is a
new competition for the Semantic Web community. Participants are asked
to do "something interesting" with their dataset. The challenge lies in
the fact their dataset is very large. It contains a billion "RDF triples"
(each triple is a fact consisting of a relation between two things,
with both things and the relation denoted by URIs).

Current RDF stores already find it difficult to just *store* that
dataset, let alone *do* things with it. One of the interesting things
to do with the billion facts is to infer even more facts: the new facts that
follow from the given ones, but that are not explicitly listed.

The plan is to build a simple mechanism for distributed RDFS inferencing
(ie use a distributed setting to do reasoning that cannot be done in a
centralised setting). The idea is roughly to spread the facts over many
nodes, let each node do local inferencing, upload to results somewhere
centrally, and then exchange some part of the facts between nodes so as
to trigger more inferences, and keep doing this until the process
stabilises (when all facts have been derived).

Algorithm sketch

  1. each computing node (actually: process) consists of three continuous threads of operation:
    1. input queue manager: ensures data is waiting in input queue for processing, polls file server or peers if below threshold
    2. data processor: takes elements from input queue, loads into OWLIM, computes closure, and empties results into output queue
    3. output queue manager: sends data when peers poll, periodically moves data to final-result queue (on file server)
  2. input data (ntriple files) is stored on shared file server
    1. if run on different clusters (DAS-3 contains five clusters), split data and distribute over file-servers (head nodes) at each cluster, to decrease bottleneck at file-server
    2. possibly: copy data to local disk before starting computation
  3. input data split into fixed-sized chunks (say 5K triples)
  4. all computing nodes (actually: processes) start by selecting n (say 5) random chunks from file server
  5. each node starts closure computation (OWLIM); we time the computation (cost), and count how many inferred triples (benefit) are added
    1. the benefit is stored as quality score of all triples in the loaded chunk (if some chunk leads to many inferences, its triples are assumed to be good)
  6. after inferencing, the node computes utility = benefit/cost
    1. if utility > threshold, load another chunk, else empty reasoner contents into output queue

    2. the output queue is prioritised by triple's quality score: other peers will get higher-quality triples first
    3. low-quality triples go to final-result queue (aggregation directory on file-server)

Notes

Outline of the basis steps in the scenario

(interleaved with discussion)

  1. a central store distributes the billion RDF triples among distributed nodes.

    • Stefan: computing all inferences over a large dataset is a difficult problem already, even if done locally. Are you planning to use existing native repositories for the local reasoning?

      • Frank: Yes. This will be stretching the individual DAS nodes to their limits (only 2Gb RAM each). We'll have to see if this really works...
      • Eyal: If not, we just split the data further until it's doable (more, smaller, nodes). Let's start trying to do this with Sesame + OWLIM, otherwise we roll our own (but I'd rather not). I hope Sesame can do it (RDFS), but I fear a bit about OWL (OWLIM) and the possibility to get the difference between explicit/inferred statements out of Sesame. I can try this soon (without the DAS stuff).
  2. each node locally computes all inferences (or: as many as it can)
  3. each nodes sends the conclusions back to the central store, ginving the central store a subset of all possible conclusions
  4. each node then sends some of its facts to other nodes
  5. each node receives some new facts from other nodes (most likely after first removing some of its own derived facts in order to reclaim space)
  6. with these newly received facts, each node can repeat from step 2.
    • Stefan: Do you think that this simple deliberate sending and receiving of facts (randomised?) will be sufficient to ensure saturation and almost completeness? Do we run the risk of cyclic propagation? Or do we need a control structure to avoid that always the same stuff is sent?

      • Frank: The plan was indeed to start naively first. If the selection is randomised, then the chance on cyclic propagation is near 0. Work by Spyros has been about smarter routing (to get the right facts to the right ontologies), but the plan here would be to see if dumb routing works as well (sometimes dumber is better, and certainly as a first step?)

Some further observations

(again interleaved with discussion)

Available hardware platforms for running this on over the summer 2008

BillionTripleChallenge2008 (last edited 2008-09-05 07:59:58 by EyalOren)