data: BTC datasets, ?small sample
General plan:
The "Self-Computing Semantic Web", V.0 The "billion triple challenge" <http://challenge.semanticweb.org/> is a new competition for the Semantic Web community. Participants are asked to do "something interesting" with their dataset. The challenge lies in the fact their dataset is very large. It contains a billion "RDF triples" (each triple is a fact consisting of a relation between two things, with both things and the relation denoted by URIs). Current RDF stores already find it difficult to just *store* that dataset, let alone *do* things with it. One of the interesting things to do with the billion facts is to infer even more facts: the new facts that follow from the given ones, but that are not explicitly listed. The plan is to build a simple mechanism for distributed RDFS inferencing (ie use a distributed setting to do reasoning that cannot be done in a centralised setting). The idea is roughly to spread the facts over many nodes, let each node do local inferencing, upload to results somewhere centrally, and then exchange some part of the facts between nodes so as to trigger more inferences, and keep doing this until the process stabilises (when all facts have been derived).
Algorithm sketch
- each computing node (actually: process) consists of three continuous threads of operation:
- input queue manager: ensures data is waiting in input queue for processing, polls file server or peers if below threshold
- data processor: takes elements from input queue, loads into OWLIM, computes closure, and empties results into output queue
- output queue manager: sends data when peers poll, periodically moves data to final-result queue (on file server)
- input data (ntriple files) is stored on shared file server
- if run on different clusters (DAS-3 contains five clusters), split data and distribute over file-servers (head nodes) at each cluster, to decrease bottleneck at file-server
- possibly: copy data to local disk before starting computation
- input data split into fixed-sized chunks (say 5K triples)
- all computing nodes (actually: processes) start by selecting n (say 5) random chunks from file server
- each node starts closure computation (OWLIM); we time the computation (cost), and count how many inferred triples (benefit) are added
- the benefit is stored as quality score of all triples in the loaded chunk (if some chunk leads to many inferences, its triples are assumed to be good)
- after inferencing, the node computes utility = benefit/cost
if utility > threshold, load another chunk, else empty reasoner contents into output queue
- the output queue is prioritised by triple's quality score: other peers will get higher-quality triples first
- low-quality triples go to final-result queue (aggregation directory on file-server)
Notes
- plot loading/inferencing time of OWLIM as function of data size (is it exponential?)
- split 'data processor' thread into 2-4 threads (data parallelism), since we have 2-4 cores on each machine
- when input queue empty (no food in the system): empty reasoner contents into output queue (put food into system)
- write output queue to disk periodically (all data does not fit in combined memory)
- when output queue full (producing faster than we're consumed): swap to disk and/or empty low-priority triples into result bin (file-server)
Outline of the basis steps in the scenario
(interleaved with discussion)
a central store distributes the billion RDF triples among distributed nodes.
Stefan: computing all inferences over a large dataset is a difficult problem already, even if done locally. Are you planning to use existing native repositories for the local reasoning?
- Frank: Yes. This will be stretching the individual DAS nodes to their limits (only 2Gb RAM each). We'll have to see if this really works...
- Eyal: If not, we just split the data further until it's doable (more, smaller, nodes). Let's start trying to do this with Sesame + OWLIM, otherwise we roll our own (but I'd rather not). I hope Sesame can do it (RDFS), but I fear a bit about OWL (OWLIM) and the possibility to get the difference between explicit/inferred statements out of Sesame. I can try this soon (without the DAS stuff).
- each node locally computes all inferences (or: as many as it can)
- each nodes sends the conclusions back to the central store, ginving the central store a subset of all possible conclusions
- each node then sends some of its facts to other nodes
- each node receives some new facts from other nodes (most likely after first removing some of its own derived facts in order to reclaim space)
- with these newly received facts, each node can repeat from step 2.
Stefan: Do you think that this simple deliberate sending and receiving of facts (randomised?) will be sufficient to ensure saturation and almost completeness? Do we run the risk of cyclic propagation? Or do we need a control structure to avoid that always the same stuff is sent?
- Frank: The plan was indeed to start naively first. If the selection is randomised, then the chance on cyclic propagation is near 0. Work by Spyros has been about smarter routing (to get the right facts to the right ontologies), but the plan here would be to see if dumb routing works as well (sometimes dumber is better, and certainly as a first step?)
Some further observations
(again interleaved with discussion)
- After sufficiently long time, this process will result in all possible conclusions arriving at the central store (under some conditions that are not too hard to satisfy: length of premisses is shorter then size of store, no initial facts are ever removed)
- This gives a nice anytime behaviour (the set of facts at the central store grows monotonically (and probably with diminishing growth)
- Communication between nodes is limited (no communication needed during each inference cycle), and can be asynchronous (no handshaking or replies are needed)
the central store (Eyal calls it "the Mothership" <http://msp95.photobucket.com/albums/l137/GrahamNificent/MotherShip.jpg>
has to be able to store a very large number of facts: the billion given facts plus their deductive closure. - Empirically, we have observed a 2-3 factor blow-up (ie 3 billion triples in total). These can probably be stored in a classical database, and queries as single triples, but not in join-queries. We might have to ask Martin Kersen at CWI for help.
Stefan: why don't we do the "full monty", and implement the mothership using our evolutionary method?
Frank: I would be worried this would be combining two submissions in one: the first to do inference by parallel distribution, the second to do retrieval by evolutionary methods. I think it would be better (both for PR and also intellectually), to keep them seperate?'
- Eyal: Plus, I'd be surprised if the evolutionary method would help here.
- facts from different sources can only be combined if we manage to do "instance unification". This is in itself a non-trivial task, but can fortunately also be done locally at each node.
Stefan: this surprises me. i thought that the most interesting unification is cross-collection? Isn't it the DBLP authors we want to combine with DBpedia? To me this is very much a global task, which also needs to be done in a network kind of way: I send you my instances, you check whether you have anything similar...
Frank: probably just a misunderstanding through my bad phrasing: of course the instance unification must be done between instances from different parts of the dataset. I tried to say that these different parts of the dataset will "meet" each other on a local peer (after sufficient mixing), and the local peer needs to do the instance unification between the instances that are found locally. Agree? '
- Eyal: yes
- We'll have to decide how to deal with representing multiple instances (either by duplication or by normalisation)
- notice that we will be treating the DAS as a set of nodes that do mostly independed computation (ie the less we are using of the high-speed bandwidth, the better). For more realistic experiments, we should be using physically distributed WAN nodes, but a cluster is a (much) easier way to run the experiment.
Stefan: This way we use the parallelism only in running individual repositories on different machines (unless I am mistaken).
- Frank: agree (although I don't understand why you call this "only")
Stefan: Would it not be much more sexy if we would run the evolutionary search of the mothership on the DAS, and use an IRIS architecture for distributing the repositories?
Frank: (I guess you mean "IBIS architecture"). I'm not sure if this would be more sexy, but it would certainly be a different experiment. In fact, I think the experiment would be so different that it had better be an entirely separate experiment. We would have to decide which one to do (unless we can do both, with Eyal doing one and Christophe doing the other...?).
- Again, I wouldn't mix these. You can do the evolutionary algorithm distributed over IBIS, but it's separete from this story. I wouldn't mix them, and I'd focus on the DAS now, without evolution. For Christophe, it wouldn't be trivial to move to IBIS: the evolutionary framework contains a distributed version, but not using IBIS, he might contribute patches to enable that, but it would be a longer story.
Available hardware platforms for running this on over the summer 2008
Stuttgart cluster: max. 430 machines, 8 cores each, 2Gb memory per core, single shared disk access over Infinet
DAS-3 @ VUA: max 83 nodes, 2 cores each, 4Gb memory per node, 250Gb local disk per node, shared disk access over 10Gb ethernet
