When asked about the possibility of getting streaming output from SPARQL endpoints, and whether the current SPARQL protocol would have to be adjusted for that, Andy Seaborne (HP Labs, Bristol) said the following (reproduced here with his permission).


Client-side control [of output, FvH] (cursors, paging, various other names) is really a bundle of different requirements, from client API issues to network flow control. Different aspects can be addressed in different ways and many of the mechanisms already exist, some of which do not require standardization because they are local issues.

The SPARQL protocol can be used in a streaming fashion for results - ARQ already does this. The assumption [is often] that because it’s a single XML document, it must be processed as a lump. This is not so - ARQ uses a SAX parser when receiving results and generates the result set row-by-row as the rows arrive, so the first results can be returned to the client before the last results are even calculated by the query engine. On the server side, some of our stores (TDB especially) are streaming and so, end-to-end, the entire query answering is streamed. If your provider of SPARQL query processing does not do this, then get your provider to fix their system!

When a large set of results is involved, the flow-control of the HTTP/TCP connection can be used to regulate data rates. This is what HTTP already does - simply leave it to TCP/IP and it works on the web very well.

Stream parsing of the results is a common technique in XML processing, using SAX, and that works here as well because the result set format is designed that way - complete rows get sent together and when a row is completed no more information can be added later in the result set.
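
The row-by-row parsing described above can be sketched with the SAX parser from the Java standard library. This is only an illustration of the technique, not ARQ's code: the class name StreamingResultReader and the callback parameter onRow are made up for this example, and the sample document in main is invented data in the SPARQL Query Results XML Format. Each row is handed to the caller as soon as its closing result tag is seen.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of ARQ): streams rows out of a SPARQL XML result document.
class StreamingResultReader {

    /** Parses the stream and passes each row to onRow as soon as its </result> tag arrives. */
    static void read(InputStream in, Consumer<Map<String, String>> onRow) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that local element names are reported
        factory.newSAXParser().parse(in, new DefaultHandler() {
            private Map<String, String> row; // bindings of the row currently being read
            private String variable;         // name attribute of the current <binding>
            private StringBuilder text;      // non-null while inside <uri>, <literal> or <bnode>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                switch (local) {
                    case "result": row = new LinkedHashMap<>(); break;
                    case "binding": variable = atts.getValue("name"); break;
                    case "uri": case "literal": case "bnode": text = new StringBuilder(); break;
                    default: break;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                switch (local) {
                    case "uri": case "literal": case "bnode":
                        row.put(variable, text.toString());
                        text = null;
                        break;
                    case "result":
                        onRow.accept(row); // the row is complete; nothing later can change it
                        row = null;
                        break;
                    default: break;
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">"
            + "<head><variable name=\"x\"/><variable name=\"name\"/></head>"
            + "<results>"
            + "<result><binding name=\"x\"><uri>http://example.org/a</uri></binding>"
            + "<binding name=\"name\"><literal>Alice</literal></binding></result>"
            + "<result><binding name=\"x\"><uri>http://example.org/b</uri></binding>"
            + "<binding name=\"name\"><literal>Bob</literal></binding></result>"
            + "</results></sparql>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        read(in, row -> System.out.println("row: " + row));
    }
}
```

In a real client the InputStream would be the HTTP response body rather than an in-memory byte array, so rows would reach the callback while later rows are still in transit.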

The client does not need to be involved, or changed, to specifically say "next 100 results" although it may be a more useful API for the application. A local API could present results like that by locally buffering but this is not an issue for the protocol itself which can already support the feature through streaming.
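
A "next 100 results" style of API can be layered over a streamed result purely on the client, as the paragraph above suggests. The following is a minimal sketch under that assumption; the class LocalPager and its method names are invented for illustration and do not belong to any existing SPARQL client library. The server only ever sees one streamed response.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical client-side helper: presents any streamed iterator as fixed-size pages.
class LocalPager<T> {
    private final Iterator<T> stream;
    private final int pageSize;

    LocalPager(Iterator<T> stream, int pageSize) {
        if (pageSize <= 0) throw new IllegalArgumentException("pageSize must be positive");
        this.stream = stream;
        this.pageSize = pageSize;
    }

    boolean hasNextPage() {
        return stream.hasNext();
    }

    /** Pulls at most pageSize rows from the stream and buffers them locally. */
    List<T> nextPage() {
        List<T> page = new ArrayList<>(pageSize);
        while (page.size() < pageSize && stream.hasNext()) {
            page.add(stream.next());
        }
        return page;
    }

    public static void main(String[] args) {
        // Stand-in for a streamed result set of 250 rows.
        Iterator<Integer> rows = IntStream.range(0, 250).boxed().iterator();
        LocalPager<Integer> pager = new LocalPager<>(rows, 100);
        while (pager.hasNextPage()) {
            System.out.println("page with " + pager.nextPage().size() + " rows");
        }
    }
}
```

Because the pager only pulls rows when a page is requested, a slow application simply reads the connection more slowly, and the TCP flow control mentioned earlier does the rest.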

Indeed, the explicit paging style is nowadays considered a bad thing. It severely impacts the server design because the server is now required to keep state to track the cursors across requests. Servers are usually the bottlenecks in systems and processing should be placed client-side where possible (for small devices, have a proxy device that is capable). With explicit paging, a misbehaving client can create a denial of service effect to other clients by not consuming data fast enough, causing the server to need more resources to support that client, reducing resources for other clients. Servers crashing and restarting can also be an issue across multiple transport-level requests. HTTP does not require an explicit session-layer but paged results may do.

In DB servers, resources=RAM. There are concurrency and update issues as well. As evidence, I note that MySQL and PostgreSQL both do not provide paging by default and, when paging, only support one request per client at a time. There are other restrictions as well. It is avoided in high performance websites.

There is an issue with large graphs (non-SELECT queries). Given the discussion above on streaming processing, I hope you can see that streaming can already be achieved and client-side paging is a matter of client API for a result set. But with large graphs the on-the-wire serialization can get in the way.

Firstly, at a high level, it is unlikely to be good enough to stream triples and not some higher level unit like all triples about a given subject. Secondly, RDF/XML is less than ideal because the parser is so hard. Jena's parser does stream triples (it's SAX based as well) and it's only some rules about bNode labels that stop it scaling arbitrarily. But because the parser is so hard to write at all, adding streaming to the list of requirements makes it a bigger investment than many toolkits can afford; they may be wanting to use a DOM-based (non-streaming) parser for simplicity.

N-Triples is usable for triple streaming. We have, internally, been discussing streaming RDF tuples (triples being 3-tuples) and have a design that is as easy to work with as N-Triples but provides some compression at the term level (prefixed names like Turtle, use of same-as-before markers, blank nodes don't have the same label rules). It can be further compressed by a gzip stream; it's simple to implement.
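
To illustrate why a line-oriented format streams easily, here is a small sketch that reads plain N-Triples from a gzip-compressed stream one statement at a time. It does not implement the internal tuple format mentioned above (no prefixed names or same-as-before markers), and the splitting is deliberately naive: it assumes one statement per line and no trailing comment after the final dot. The class name NTriplesStreamReader and the sample data are made up for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: a naive line-by-line N-Triples reader, not a full parser.
class NTriplesStreamReader {

    /** Reads one statement per line and passes {subject, predicate, object} to onTriple. */
    static void read(InputStream in, Consumer<String[]> onTriple) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and comment lines
            // Subject and predicate contain no whitespace; the object is the rest minus the final dot.
            String[] parts = line.split("\\s+", 3);
            if (parts.length < 3 || !parts[2].endsWith(".")) {
                throw new IOException("Not an N-Triples statement: " + line);
            }
            String object = parts[2].substring(0, parts[2].length() - 1).trim();
            onTriple.accept(new String[] { parts[0], parts[1], object });
        }
    }

    /** Compresses text in memory; used only to build sample input for the demo. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String data =
            "<http://example.org/a> <http://example.org/name> \"Alice Smith\" .\n"
            + "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n";
        InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzip(data)));
        read(in, t -> System.out.println(String.join(" | ", t)));
    }
}
```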


Response from Barry:

This is very interesting. The fundamental point is that with XML we have the option of two parsing models, DOM or SAX.

As Andy says, if one considers things from the perspective of the TCP endpoint, i.e. the socket, then one writes a SPARQL query in and reads the results out. This is (literally) a stream of bytes that can be fed straight into a SAX parser and the events (e.g. next variable-binding) can be fired up to the application layer, albeit asynchronously.

The important thing to learn from this is that we should not think of making a request to a SPARQL endpoint as a synchronous function call, i.e. pass the query, then wait for (all of) the response.

I'm not convinced that this makes things simple for the server-side, where one would typically have a pool of worker threads operating asynchronously from the threads servicing the TCP endpoints. (The server programmer has to provide a mechanism that connects a socket close event on one thread to the abort of a query answering task on another thread). But then again, this is not harder than most other server tasks.

Also, terminating a TCP session to indicate that the client has had enough results seems a little harsh. In the situation where many sequential SPARQL query requests are made, the client then has the additional task of re-establishing the TCP session each time (along with the risk of using up all the sockets, because they remain in the TIME_WAIT state after closing).

The only issue for me is: Does our data layer API allow for the streaming processing of triples/query results?

Vassil can comment on this better than me, but I think the answer is 'yes'. Looking at http://wiki.larkc.eu/LarkcProject/WP5/DataLayerAPI it seems query results are (or can be) streamed. Answering a query would be something like this:

SPARQLService service =
    DataFactory.createSPARQLEnabledGraph(...).getSPARQLEndpoint();

VariableBinding result = service.executeSelect(query);

CloseableIterator<Binding> bindings = result.iterator();

// Iterate through 'bindings' until exhausted or call close().

Using the 'closeable' iterator allows a worker thread inside the API to parse results and put them into a queue that is accessed on the other side via the iterator. This interface can neatly hide 'streaming' behaviour, allowing the client side programmer to operate in a single-threaded mindset. However, without looking, I don't know if this is actually how the implementation behaves. Vassil?
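
The worker-thread-plus-queue pattern described above might look roughly like the sketch below. It is not the LarKC CloseableIterator and says nothing about how that implementation behaves; QueueBackedIterator is an invented name, and the Iterable passed in stands in for whatever parses results off the wire. A bounded queue means the worker blocks when the consumer falls behind, and close() interrupts the worker so it stops producing.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a closeable iterator fed by a background worker thread.
// The source must not yield null items, because ArrayBlockingQueue rejects nulls.
class QueueBackedIterator<T> implements Iterator<T>, AutoCloseable {
    private static final Object END = new Object(); // marks the end of the stream

    private final BlockingQueue<Object> queue;
    private final Thread worker;
    private Object next; // look-ahead item, END once the stream is finished or closed

    QueueBackedIterator(Iterable<T> source, int capacity) {
        BlockingQueue<Object> q = new ArrayBlockingQueue<>(capacity);
        this.queue = q;
        this.worker = new Thread(() -> {
            try {
                for (T item : source) {
                    q.put(item); // blocks while the queue is full
                }
                q.put(END);
            } catch (InterruptedException e) {
                // close() was called: stop producing
            }
        });
        this.worker.setDaemon(true);
        this.worker.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // blocks until the worker delivers an item or END
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        return next != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T item = (T) next;
        next = null;
        return item;
    }

    /** Stops the worker and discards anything still buffered. */
    @Override
    public void close() {
        worker.interrupt();
        queue.clear();
        next = END;
    }

    public static void main(String[] args) {
        try (QueueBackedIterator<String> it =
                 new QueueBackedIterator<>(Arrays.asList("row1", "row2", "row3"), 2)) {
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
    }
}
```

The calling code is an ordinary single-threaded loop; the threading stays hidden behind hasNext() and next().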


Response from Vassil:

Yes, the closeable iterators allow you to implement streaming processing of data and to clean up all internal resources properly (preventing memory leaks). On the server side, HTTP chunking is already implemented; you can test it on the http://www.linkedlifedata.com/ server by sending a query via the SPARQL endpoint:

Response:

HTTP/1.1 200 OK
Date: Thu, 05 Mar 2009 08:24:10 GMT
Server: Apache-Coyote/1.1
Vary: Accept
Content-Disposition: attachment; filename=query-result.trig
Content-Type: application/x-trig;charset=UTF-8
Content-Language: en-US
Connection: close
Transfer-Encoding: chunked    <--- The transfer is chunked.

I think the original issue mentioned in the WP5 call was related not to SPARQL streaming, but SPARQL anytime behavior.

AndySeaborneOnStreamOutputFromSparQL (last edited 2009-03-06 09:49:34 by FrankVanHarmelen)