SPARQL Grouping and Aggregation with no matches

Andy Seaborne

March 2020

Grouping and Aggregation is the combination of GROUP BY and one of the aggregation functions such as COUNT or MAX. When there are no matches to the WHERE clause, the spec and one of the original workign groups tests don’t agree. In the case of aggregation and no GROUP BY, the spec needs a fix to get the expected answer.

Description

Informally, what happens in grouping and aggregation is that the matches from the WHERE clause are paritioned by calculating the group key from the GROUP BY expression for each row , with each row going into a collection of rows with the same group key. We end up with a dictionary mapping group key to a collection of those rows with the same group key. Each grouping of rows has at least one row in it because to create dictionary entry a row with that group key must have been encountered.

When there is aggregation, the aggregation function (COUNT, MAX, etc) is called on each dictionary entry with the collection of rows as argument.

If there are no rows as a result of the WHERE clause, there are no pairs of group key row collection.

There is a special case of aggregation when there is no GROUP BY but there is an aggregate function. All the rows are collected together with a special group key of 1. The dictionary has one entry.

There is a problem in the specfication here - it puts in a special GROUP BY expression but when there are no rows from the WHERE clause, it does not get called. A possible solution is given below.

From this we can see that:

  • When group+aggregate, there are the same number of rows as the grouping alone.
  • If there is aggrgation and no group, there is one row in result.

Grouping and Aggregation, no matching rows

GROUP BY, and GROUP BY with aggregation have the same number of rows.

There is a disprepency between the SPARQL tests and the specification. The test aggregates-agg-empty-group runs on an empty graph.

PREFIX ex: <http://example.com/>
SELECT ?x (MAX(?value) AS ?max)
WHERE {
    ?x ex:p ?value
} GROUP BY ?x

Beause it is an empty graph, the WHERE clause has zero rows, and the group step is zero rows.

But the results are given as:

<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="x"/>
    <variable name="max"/>
  </head>
  <results>
    <result> </result>
  </results>
</sparql>

which is one row, with no values, not zero rows.

-----------
| x | max |
===========
|   |     |
-----------

(the original [test results]((https://www.w3.org/2009/sparql/docs/tests/data-sparql11/aggregates/agg-empty-group.srx) have no <variable name="x"/> - that’s corrected in test manifest with these results.)

It should be:

-----------
| x | max |
===========
-----------

that is, zero rows, because

SELECT ?x
WHERE {
    ?x ex:p ?value
} GROUP BY ?x

has zero rows, the grouping is an empty dictionary (no group keys), and (MAX(?value) AS ?max) is not called at all.

Aggregation, no grouping, no matching rows

Compare with the case where there is no GROUP BY

SELECT  (MAX(?value) AS ?max)
WHERE {
    ?x ex:p ?value
} 

-------
| max |
=======
|     |
-------

where there is supposed to be one row.

Which aggregate is used shouldn’t affect the number of rows, and MAX({}) is defined to be an evaluation error. The COUNT case is clearer:

SELECT ?x (COUNT(*) AS ?count) 
WHERE { ?x ex:p ?value }
GROUP BY ?x

gives

-------------
| x | count |
=============
-------------

and

SELECT  (COUNT(*) AS ?count)
WHERE { ?x ex:p ?value }

gives

---------
| count |
=========
|     0 |
---------

The difference is that when there is an empty outcome from GROUP BY, there is an empty dictionary and the aggregation function, MAX or COUNT isn’t called at all. With no GROUP BY, there is a dictionary entry with an empty collection of rows and the case Max({}) = error or Count({}) = 0 covers the case of a call where there zero rows to aggregate.

SELECT ?x
WHERE {
    ?x ex:p ?value
} GROUP BY ?x

and no matches of ?x ex:p ?value is zero rows.

Spec text

These are the relevant parts of the spec and errata:

Section 18.5.1 https://www.w3.org/TR/sparql11-query/#aggregateAlgebra

Group, produces a dictionary with group key ListEval(exprlist, μ) to () rows { μ' | ...}:

Group(exprlist, Ω) = { ListEval(exprlist, μ) → { μ' | μ' in Ω, ListEval(exprlist, μ) = ListEval(exprlist, μ') } | μ in Ω }

Aggregation, produces a dictionary group key to aggregate value:

Aggregation(exprlist, func, scalarvals, { key1→Ω1, ..., keym→Ωm } )
   = { (key, F(Ω)) | key → Ω in { key1→Ω1, ..., keym→Ωm } }

Then AggregateJoin puts the results into a query solution.

During evaluation (after errata-query-11):

eval(D(G), Aggregation(exprlist, func, scalarvals, Group(exprlist, P)))
    = Aggregation(exprlist, func, scalarvals, eval(D(G), Group(exprlist, P))) 

Calling the aggregate function happens within the F(Ω) calling func.

Execution when grouping has no rows

Executing when the grouping has no rows, we get:

Group(exprlist, Ω) = { }

because μ in Ω is empty. Then aggregation is on the empty dictionary:

Aggregation(exprlist, func, scalarvals, { } )
   =  { (key, F(Ω)) | key → Ω in { } }

which is also { } because there is no (key → Ω) in { }.

The query result is zero rows.

If GROUP BY produces zero rows, aggregation produces zero rows.

The “no group” case

Use of aggregates when there is no GROUP BY is different.

In section [18.2.4.1 Grouping and Aggregation}(https://www.w3.org/TR/sparql11-query/#sparqlGroupAggregate) puts in a dummy key of (1) (a exprlist of one element which is the value 1).

But if there are no rows for the patten matching.

Group(exprlist, { }) = { }

regardless of the exprlist - the dictionary entry using key 1 is not created.

But we expect

SELECT (COUNT(*) AS ?C)
WHERE {
    ?x ex:p ?value
}

(no matches to the WHERE) to be one row with ?C being 0.

If it were MAX, then MAX({}) happens, which evaluates to an error so we have one row, no columns.

In particular, aggrgeate with no group is not like writing:

PREFIX ex: <http://example.com/>
SELECT ?x (COUNT(*) AS ?count)
WHERE {
    ?x ex:p ?value
} GROUP BY (1)

which is “no rows” again.

Possible Fix, “no group” case

Somehow, “no group by, with aggregates” needs to be made special.

One way is to make “aggregate, no group” become a distinct, special case of Group() that creates the right dictionary.

This can be done in 18.2.4.1, Step Aggregates.

We make the dummy key special marker 'group-all', then group all rows into one dictionary entry, even if the row collection is empty.

This safe because 'group-all' isn’t an exprlist and so can not be written in a query using GROUP BY.

We could use any marker value in place of 'group-all' as long as it is not an exprlist. 1, not (1), would work or {}. The group key of 1 is introduced when Group is evaluated as that group keys are RDF terms as before.

The only time Group('group-all', P) appears is when there is an aggregation step and there is no GROUP BY.

If Q contains GROUP BY exprlist
   Let G := Group(exprlist, P)
Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
   Let G := Group((1), P)
Else
   ...

becomes:

If Q contains GROUP BY exprlist
   Let G := Group(exprlist, P)
Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
   Let G := Group('group-all', P)
Else
   ...

Then Group handles the case Group('group-all', Ω) by putting in the group key of 1 that the spec uses:

Group('group-all', Ω) = { (1 → Ω) }
Group(exprlist, Ω) = { ListEval(exprlist, μ) → { μ' | μ' in Ω, ListEval(exprlist, μ) = ListEval(exprlist, μ') } | μ in Ω }

exprList is never 'group-all' so these two evaluation cases don’t overlap.