Feedback on SPARQL 1.1 Support For Aggregates

Authors: Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Emanuele Della Valle, Michael Grossniklaus

Date: 18.2.2010

Introduction

In this wiki page, we [1] would like to comment on the support for aggregates in SPARQL 1.1. Over the last two years we have been working on an extension to SPARQL for continuous querying over streams of RDF, namely C-SPARQL [2]. Support for aggregates is central to stream processing. For this reason we have defined and implemented [3] our own support for aggregates in SPARQL, which is orthogonal to the other stream processing features of C-SPARQL. We believe such an extension can be of general interest to the SPARQL 1.1 WG.

In the rest of this page we first introduce support for aggregates in C-SPARQL, then we compare the C-SPARQL and SPARQL 1.1 support for aggregates. We show that:

  1. C-SPARQL syntax for aggregates appears more compact and handy than the SPARQL 1.1 one,
  2. all SPARQL 1.1 queries with aggregates can be expressed in C-SPARQL, and
  3. there are queries that can be expressed in C-SPARQL but it is unclear whether they can be expressed in SPARQL 1.1.

Support For Aggregates In C-SPARQL

Aggregation clauses in C-SPARQL are added at the end of the query, and have the following syntax:

 AggregateClause --> ( "AGGREGATE {(" var "," Function "," Group ")" [Filter] "}" )*
 Function --> "COUNT" | "SUM (" var ")" | "AVG (" var ")" | "MIN (" var ")" | "MAX (" var ")"
 Group --> var | "{" var ( "," var )* "}"

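Every aggregation clause has the following three parts, in the order given by the grammar above:

  1. the new variable that receives the aggregate value,
  2. the aggregation function, and
  3. the set of grouping variables.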

Every clause may also have an optional fourth part, a FILTER clause.
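To make the grammar concrete, here is a minimal sketch of a recognizer for a single AGGREGATE clause, written in Python. It is only an illustration and not the C-SPARQL implementation: the function name parse_clause and the regular expression are our own assumptions, variables are simplified to "?" followed by word characters, and the optional FILTER is kept as an unparsed string.

```python
import re

# Hypothetical recognizer for one clause of the grammar above (illustration only).
CLAUSE = re.compile(
    r"AGGREGATE\s*\{\s*\(\s*(?P<var>\?\w+)\s*,\s*"
    r"(?P<func>COUNT|(?:SUM|AVG|MIN|MAX)\s*\(\s*\?\w+\s*\))\s*,\s*"
    r"(?P<group>\?\w+|\{\s*\?\w+(?:\s*,\s*\?\w+)*\s*\})\s*\)"
    r"\s*(?P<filter>FILTER\s*\(.*\))?\s*\}"
)

def parse_clause(text):
    """Split one AGGREGATE clause into its three parts plus the optional FILTER."""
    match = CLAUSE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not an AGGREGATE clause: " + text)
    return {
        "var": match["var"],                              # new variable
        "function": re.sub(r"\s+", "", match["func"]),    # aggregation function
        "group": re.findall(r"\?\w+", match["group"]),    # grouping variables
        "filter": match["filter"],                        # raw FILTER text or None
    }

simple = parse_clause("AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) }")
filtered = parse_clause(
    "AGGREGATE { (?totalPrice, SUM(?lprice), {?org}) FILTER ( ?totalPrice > 10) }"
)
```

The two calls use clauses that appear elsewhere on this page; a clause that does not follow the grammar raises ValueError.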

Example Of Simple Support For Aggregates In C-SPARQL And SPARQL 1.1

Data:

@prefix : <http://books.example/> .

:auth1  :name "Alice Foo" ; :writesBook :book1 , :book2 .
:auth2  :name "Bob Bar" ; :writesBook :book2 .

The following query counts the number of books written by an author and returns the name and the number of books.

SELECT ?name ?book ?numberOfBooks
WHERE {
    ?auth :name ?name .
    ?auth :writesBook ?book .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) }

The semantics of a query containing aggregates consists in adding new variable bindings to the regular variable bindings computed by the WHERE clause. For each of the new variables introduced by the AGGREGATE clauses, one new variable binding is added. The query result constructed in this way may be further filtered by a standard FILTER clause, which may refer to all the variables introduced in the WHERE and AGGREGATE clauses.

Our C-SPARQL extension is based on the conviction that in the context of RDF, knowledge should be extended rather than shrunk. Therefore, we propose to generate additional variable bindings and use them to annotate any existing variable binding that contributed to the aggregate value.

Results:

?name       | ?numberOfBooks
------------------------------
"Alice Foo" |            "2"
"Bob Bar"   |            "1"

This is in contrast to the conventional SQL grouping semantics that replaces all aggregated tuples with a single tuple representing the aggregate value. In this respect, we believe that our approach to aggregation is more aligned with the baseline of the SPARQL semantics.
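The difference between the two semantics can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the C-SPARQL engine: solution mappings are modelled as plain dictionaries, and the helper names annotate_count and collapse_count are hypothetical.

```python
from collections import Counter

# Solution mappings the WHERE clause would produce over the sample data
# (modelled as dictionaries; this representation is an assumption).
bindings = [
    {"auth": "auth1", "name": "Alice Foo", "book": "book1"},
    {"auth": "auth1", "name": "Alice Foo", "book": "book2"},
    {"auth": "auth2", "name": "Bob Bar", "book": "book2"},
]

def group_key(row, group_vars):
    return tuple(row[v] for v in group_vars)

def annotate_count(rows, new_var, group_vars):
    """Annotation semantics: keep every row and add the size of its group."""
    sizes = Counter(group_key(row, group_vars) for row in rows)
    return [{**row, new_var: sizes[group_key(row, group_vars)]} for row in rows]

def collapse_count(rows, new_var, group_vars):
    """SQL-style semantics: replace each group with a single row."""
    sizes = Counter(group_key(row, group_vars) for row in rows)
    return [dict(zip(group_vars, key), **{new_var: n}) for key, n in sizes.items()]

annotated = annotate_count(bindings, "numberOfBooks", ["auth"])
collapsed = collapse_count(bindings, "numberOfBooks", ["auth"])
```

With annotation, all three bindings survive and each carries ?numberOfBooks, so ?name and ?book remain available; with collapsing, only the grouping variable and the aggregate value are left.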

Judging from the example in Section 10 of the SPARQL 1.1 draft, the query above can be expressed in SPARQL 1.1 in the following way:

SELECT ?name ?numberOfBooks
WHERE {
     ?auth :name ?name .     
     {
             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
             WHERE {
                     ?auth :writesBook ?book .
             }
             GROUP BY ?auth
     }
}

In the C-SPARQL language all the variables used in the aggregation function or in the grouping set of AGGREGATE clauses must appear also in the SELECT clause, since aggregation happens after standard SPARQL query evaluation. In SPARQL 1.1 the constraint is not specified.

Wrapping up, queries of this kind can be expressed both in SPARQL 1.1 and C-SPARQL, but C-SPARQL syntax appears more compact and handy.

Aggregates Supported In SPARQL 1.1 Are Also Supported In C-SPARQL

Given the current SPARQL 1.1 support for aggregates, it appears that all SPARQL 1.1 queries with aggregates can be expressed in C-SPARQL.

For instance, the following query is the example of SPARQL 1.1 support for aggregates appearing in Section 9 of the current draft.

PREFIX : <http://books.example/>
SELECT (SUM(?lprice) AS ?totalPrice)
WHERE {
  ?org :affiliates ?auth .
  ?auth :writesBook ?book .
  ?book :price ?lprice .
}
GROUP BY ?org
HAVING (SUM(?lprice) > 10)

Such a query in C-SPARQL will be written as follows.

PREFIX : <http://books.example/>
SELECT ?totalPrice
WHERE {
  ?org :affiliates ?auth .
  ?auth :writesBook ?book .
  ?book :price ?lprice .
}
AGGREGATE { (?totalPrice, SUM(?lprice), {?org}) FILTER ( ?totalPrice > 10) }
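As a rough illustration of how the trailing FILTER plays the role of HAVING, the Python sketch below annotates each binding with the sum of its group and then filters the annotated bindings. The prices and organisations are invented for the example, and the helper name aggregate_sum and its keep parameter are hypothetical.

```python
from collections import defaultdict

# Invented bindings for ?org, ?auth, ?book and ?lprice (not from the draft).
bindings = [
    {"org": "org1", "auth": "auth1", "book": "book1", "lprice": 9},
    {"org": "org1", "auth": "auth1", "book": "book2", "lprice": 5},
    {"org": "org2", "auth": "auth3", "book": "book3", "lprice": 7},
]

def aggregate_sum(rows, new_var, value_var, group_vars, keep=lambda row: True):
    """Annotate each row with the sum over its group, then apply the filter."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[v] for v in group_vars)] += row[value_var]
    annotated = [
        {**row, new_var: totals[tuple(row[v] for v in group_vars)]} for row in rows
    ]
    return [row for row in annotated if keep(row)]

# Corresponds to: AGGREGATE { (?totalPrice, SUM(?lprice), {?org}) FILTER (?totalPrice > 10) }
result = aggregate_sum(
    bindings, "totalPrice", "lprice", ["org"], keep=lambda row: row["totalPrice"] > 10
)
total_prices = sorted({row["totalPrice"] for row in result})
```

Under this reading, the projection on ?totalPrice yields one value per surviving binding, so the same total may appear more than once unless duplicates are eliminated, as the last line of the sketch does.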

Queries That Can Be Expressed In C-SPARQL But It Is Unclear Whether They Can Be Expressed In SPARQL 1.1

Given the current SPARQL 1.1 support for aggregates, it is unclear whether the following C-SPARQL queries can be expressed in SPARQL 1.1.

Query: the average number of books written by authors that wrote more than 5 books.

SELECT ?name ?book ?numberOfBooks ?averageNumberOfBooks
WHERE {
    ?auth :name ?name .
    ?auth :wrote ?book .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) FILTER (?numberOfBooks > 5) }
AGGREGATE { (?averageNumberOfBooks, AVG, {?numberOfBooks} ) }

A possible way to express it in SPARQL 1.1 is illustrated hereafter, but no examples in the current draft show that this is possible.

SELECT ?name ?surname ?book ?numberOfBooks (AVG(?numberOfBooks) AS ?averageNumberOfBooks)
WHERE {
     ?auth :hasSurname ?surname .
     ?auth :hasName ?name .
     {
             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
             WHERE {
                     ?auth :wrote ?book .
             }
             GROUP BY ?auth
             HAVING (?numberOfBooks > 5)
     }
}

More complex sequences of aggregation are supported in C-SPARQL, such as

SELECT ?name ?surname ?book ?numberOfBooks ?affiliationBooks ?auth ?organization
WHERE {
    ?auth :name ?name .
    ?auth :surname ?surname .
    ?auth :wrote ?book .
    ?auth :affiliated ?organization .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) FILTER (?numberOfBooks > 5) }
AGGREGATE { (?affiliationBooks, SUM(?numberOfBooks), {?organization} ) FILTER (?affiliationBooks > 50)}

A possible way to express it in SPARQL 1.1 is illustrated hereafter, but as above no examples in the current draft show that this is possible.

SELECT ?name ?surname ?book ?numberOfBooks
WHERE {
   ?auth :hasSurname ?surname .
   ?auth :hasName ?name .
   {
    SELECT ?organization (SUM(?numberOfBooks) AS ?affiliationBooks)
    WHERE {
     ?auth :affiliated ?organization .
     {
      SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
      WHERE {
       ?auth :wrote ?book .
      }
      GROUP BY ?auth
      HAVING (?numberOfBooks > 5)
     }
    }
    GROUP BY ?organization
    HAVING (?affiliationBooks > 50)
   }
}

In C-SPARQL, multiple aggregations with filtering clauses can be evaluated in the same query. For instance, one can ask for the research topics for which Italian authors outnumber Swiss ones.

SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
WHERE {
    ?auth :name ?name .
    ?auth :wrote ?book .
    ?book :topic ?topic .
    ?auth :hasNationality ?nat .
}
AGGREGATE { FILTER(?nat = 'IT') (?numberOfItalianAuthors, COUNT, {?topic} ) }
AGGREGATE { FILTER(?nat = 'CH') (?numberOfSwissAuthors, COUNT, {?topic} ) FILTER(?numberOfItalianAuthors>?numberOfSwissAuthors)}

A possible way to express it in SPARQL 1.1 is illustrated hereafter, but as above no examples in the current draft show that this is possible. Moreover, it requires the SPARQL 1.1 engine to decide the order of execution, whereas in C-SPARQL the order is given explicitly.

SPARQL 1.1 version that uses the FILTER clause (see line 24):

1. SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
2. WHERE {
4.     {
5.             SELECT ?topic (COUNT(?book) AS ?numberOfSwissAuthors)
6.             WHERE {
7.                     ?auth :wrote ?book .
8.                     ?book :topic ?topic .
9.                     ?auth :hasNationality ?nat .
10.                     FILTER(?nat = 'CH') .
11.             }
12.             GROUP BY ?topic
13.     }
14.     {
15.             SELECT ?topic (COUNT(?book) AS ?numberOfItalianAuthors)
16.             WHERE {
17.                     ?auth :wrote ?book .
18.                     ?book :topic ?topic .
19.                     ?auth :hasNationality ?nat .
20.                     FILTER(?nat = 'IT') .
21.             }
22.             GROUP BY ?topic
23.     }
24.     FILTER(?numberOfItalianAuthors>?numberOfSwissAuthors)    
25. }

Therefore, we believe that there are queries that can be expressed in C-SPARQL but for which it remains unclear whether they can be expressed in SPARQL 1.1.

Computing Multiple Independent Aggregates At The Same Time

As we explained in the introduction, C-SPARQL was explicitly designed for processing RDF streams. The transient nature of streams makes it necessary to compute multiple (possibly independent) aggregates at the same time in the same query, because ensuring that two independent queries process exactly the same data is very difficult. Therefore, multiple independent aggregations are also allowed within the same C-SPARQL query, with different grouping criteria and different partitions over the same set of bindings, thus pushing the aggregation capabilities beyond those of SQL.

The following query counts the number of books written by an author, counts the number of authors per book and returns the name, the book, the number of books and the number of authors.

SELECT ?name ?book ?numberOfBooks ?numberOfAuthors
WHERE {
    ?auth :name ?name .
    ?auth :wrote ?book .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) }
AGGREGATE { (?numberOfAuthors, COUNT, {?book} ) }

Results:

?name       | ?book | ?numberOfBooks | ?numberOfAuthors
---------------------------------------------------------
"Alice Foo" |    b1 |            "2" |              "1"
"Alice Foo" |    b2 |            "2" |              "2"
"Bob Bar"   |    b2 |            "1" |              "2"
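The result table can be reproduced with a short Python sketch in which both aggregates are computed over the same snapshot of bindings. This is an illustration under our own assumptions: bindings are modelled as dictionaries, the names snapshot and clauses are hypothetical, and the books are written book1 and book2 instead of b1 and b2.

```python
from collections import Counter

# Bindings of the WHERE clause over the sample data: one snapshot of the stream.
snapshot = [
    {"name": "Alice Foo", "auth": "auth1", "book": "book1"},
    {"name": "Alice Foo", "auth": "auth1", "book": "book2"},
    {"name": "Bob Bar", "auth": "auth2", "book": "book2"},
]

# (new variable, grouping variable), mirroring the two AGGREGATE clauses.
clauses = [("numberOfBooks", "auth"), ("numberOfAuthors", "book")]

rows = [dict(row) for row in snapshot]
for new_var, group_var in clauses:
    # Every aggregate is computed over the same snapshot, never over a re-read stream.
    sizes = Counter(row[group_var] for row in snapshot)
    for row in rows:
        row[new_var] = sizes[row[group_var]]

table = [
    (r["name"], r["book"], r["numberOfBooks"], r["numberOfAuthors"]) for r in rows
]
```

Because the two COUNT aggregates partition the same three bindings in different ways (by ?auth and by ?book), every row ends up annotated with both values.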

Judging from the SPARQL 1.1 draft, the query above can be expressed in SPARQL 1.1 in the following way:

SELECT ?name ?surname ?book ?numberOfBooks ?numberOfAuthors
WHERE {
     ?auth :hasSurname ?surname .
     ?auth :hasName ?name .
     {
             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
             WHERE {
                     ?auth :wrote ?book .
             }
             GROUP BY ?auth
     }
     {
             SELECT ?book (COUNT(?auth) AS ?numberOfAuthors)
             WHERE {
                    ?auth :wrote ?book .
             }
             GROUP BY ?book
     }
}

Conclusion

Clearly, the C-SPARQL notation is less cumbersome and more concise than the SPARQL 1.1 one. It is therefore easier to express and understand aggregates in C-SPARQL than in the current draft of SPARQL 1.1, which we believe is an important factor in language design and adoption.

References

[1] http://dbgroup.elet.polimi.it/

[2] http://wiki.larkc.eu/c-sparql

[3] http://www.larkc.eu/wp-content/uploads/2008/01/larkc_d33-description-of-strategy-and-design-for-data-stream-management-approaches_final.pdf

c-sparql/sparql11-feedback (last edited 2010-02-18 12:05:09 by ?EmanueledellaValle)