# Substitution of Variables in SPARQL

Andy Seaborne

October 2019

The SPARQL 1.1 algebra operation
“substitute” evaluates
a graph pattern where there is a specific are variables given by a solution
mapping. The operation is used to in the evaluation of `EXISTS` and
`NOT EXISTS` operations.

This document list problems that have been identified with the
`substitute` operation and proposes an improved substitution evaluation
process that addresses these problems based on the concept of
Correlated Subquery found in SQL.

## Summary

The fundamental problem is that a variable can not be simply replaced by a value (an RDF Term) in all places in a graph pattern. There are some places where SPARQL function forms and “AS” assignment require a variable. The proposal here is to take the variable binding from the input solution, retaining the variable in the graph pattern, and disallowing cases that can reset the variable binding as is already the case elsewhere in SPARQL.

## Identified Issues

This section describes the issues identified on the SPARQL Exists Community group mailing list public-sparql-exists/2016Jul/0014.

- Issue-1: Some uses of EXISTS are not defined during evaluation.
- Issue-2: Substitution happens where definitions are only for variables.
- Issue-3: Blank nodes substituted into BGPs act as variables.
- Issue-4: Substitution can flip MINUS to its disjoint-domain case.
- Issue-5: Substitution affects disconnected variables.

### Issue 1: Some uses of EXISTS are not defined during evaluation

The evaluation process in the specificiation is defined for graph patterns but there are situations where the evaluation is of an alegbra form not listed.

For example:

```
FILTER EXISTS { SELECT ?y { ?y :q :c } }
```

and

```
FILTER EXISTS { VALUES ?y { 123 } }
```

The argument to
`exists`
is not explicitly listed as a “Graph Pattern” in the table of SPARQL algebra symbols in
section 18.2
when the argument to `EXISTS` is a
GroupGraphPattern
containing just a
subquery
or just
InlineData.

### Issue 2: Substitution happens where definitions are only for variables

There are places in the SPARQL syntax and algebra where variables are allowed but not RDF terms (constant values).

Example:

```
FILTER EXISTS { BIND ( :e AS ?z )
{ SELECT ?x { :b :p :c } }
}
```

Both positions “AS ?z” and “SELECT ?x” must be variables.

In the algebra, this affects

`extend`(related to the use of`AS`in SPARQL syntax)- in line data
(related to the use of
`VALUES`) `BOUND`

### Issue 3: Blank nodes substituted into BGPs act as variables

In the evaluation of basic graph patterns (BGPs) blank nodes are replaced by RDF terms from the graph being matched and variables are replaced by a solution mapping from query variables to RDF terms so that the basic graph pattern is now a subgraph of the graph being matched.

Simply substituting a variable with a blank node in the `EXISTS`
evaluation process does not cause the basic graph pattern to be
to be restricted to subgraphs containing that blank node as an RDF term
because it is mapped by an
RDF instance mapping
before checking that the BGP after mapping is a subgraph of the
graph being queried.

Note that elsewhere in the evaluation of the SPARQL algebra, a solution mapping with a binding from variable to blank node, does treat blank nodes as RDF terms. They are not mapped by an RDF instance mapping.

Example:

```
SELECT ?x WHERE {
?x :p :d .
FILTER EXISTS { ?x :q :b . }
}
```

against the graph `{ :c :p :d . :e :q :b }`
the substitution for `EXISTS` produces
`BGP(:c :q :b)` which then
matches against `:e :q :b` because the `_:c` can be mapped to `:e` by
the RDF instance mapping that is part of pattern instance mappings in
18.3.1.

### Issue 4: Substitution can flip MINUS to its disjoint-domain case

In

```
SELECT ?x WHERE {
?x :p :c .
FILTER EXISTS { ?x :p :c . MINUS { ?x :p :c . } }
}
```

on the graph `{ :d :p :c }`
the substitution from 18.6 ends up producing

```
Minus( BGP( :d :p :c ), BGP( :d :p :c ) )
```

which produces a non-empty result because the two solution mappings for the Minus have disjoint domains and 18.5 dictates that then the result is not empty.

### Issue 5: Substitution affects disconnected variables

In

```
SELECT ?x WHERE {
BIND ( :d AS ?x )
FILTER EXISTS { BIND ( :e AS ?z )
SELECT ?y WHERE { ?x :p :c } }
}
```

the substitution from 18.6 ends up producing

```
Join ( Extend( BGP(), ?z :e ),
ToMultiSet(
Project( ToList( BGP( :d :p :c ) ), { ?y } )
))
```

The `?x`

inside the `SELECT ?y`

is not projected out so it is a “different” `?x`

than the outer one - changing it to another other unused name in the same query
would not normally affect the query results.

## An Improved “substitute” Operation

Evalauting
`substitute` is
performed for a given solution mapping. For example, the `EXISTS`

operation
evaluates to `true`

if a graph pattern has one or more matches given the variable bindings
of a solution mapping. We call this solution mapping the current row
in this description.

This section proposes an alternative mechanism. Rather than replace each
variable by the value it is bound to in the current row, this alternative
mechanism makes the whole of the current row available at any point in the
evaluation of an `EXISTS` expression. It uses the current row to
restrict the binding of variables at the points where variable bindings are
created during evaluation of `EXISTS` to be those from the current row.
It makes illegal syntactic constructs that could lead to an attempt to rebind a
variable from the current row through using the `AS` syntax.

Section “Addressing Issues” describes how this alternative
definition of `substitute` addresses each of the issues identified above.

There are 3 parts to the proposal:

- Place the current row mapping variables to values (the RDF terms) so that the variables always have their values from the current row. This is the replacement for syntactic substitution in the original definition.
- Renaming inner scope variables so that variables that are only used within a sub-query are not affected by the current row. This reflects the fact that in SPARQL such variables are not present in solutions mappings outside their sub-query.
- Disallow syntactic forms that set variables potentially already present in the current row. SPARQL solutions mappings can only have one binding for a variable and the current row provides that binding.

### Renaming

Within sub-queries, variables with the same name can be used but do not appear in the overall results of the query if they do not occur in the projection in the sub-query. Such inner variables are not in-scope when they are not in the output of the projection part of the inner SELECT expression.

```
SELECT * {
?s :value ?v .
FILTER EXISTS {
{ SELECT (count(*) AS ?C) {
?s :property ?w .
}
}
FILTER ( ?C < ?v )
}
}
```

Here, the `?s` is not mentioned in the projection in
`SELECT (count(*) AS ?C)`. Replacing `?s` by, for example,
`?V1234` in the sub-query does not change the overall results.

```
SELECT * {
?s :value ?v .
FILTER EXISTS {
{ SELECT (count(*) AS ?C) {
?V1234 :property ?w .
}
}
FILTER ( ?C < ?v )
}
}
```

Such variable usages can be replaced with a variable of a different name, if that name is not used anywhere else in the query, and the same results are obtained in the sub-query. A sub-query always has a projection as its top-most algebra operator.

To preserve this, any such variables are renamed so they do not coincide with
variables from the current row being filtered by `EXISTS`.

The SPARQL algebra “project” operator has two components, an algebra expression and a set of variables for the projection.

**Definition: Projection Expression Variable Remapping**

For a projection algebra operation P with set of variables PV, define a partial mapping F from V, the set of all variables, to V where:

F(v) = v if v in PV

F(v) = v1 where v is a variable mentioned in the project expression
and v1 is a fresh variable

F(v) = v otherwise.

`PrjMap(P,PV)`to be the algebra expression P (and the subtree over which the projection is defined) with F applied to every variable of the algebra expression P over which P is evaluated.

This process is applied throughout the graph pattern of `EXISTS`:

**Definition: Variable Remapping**

For any algebra expression X define the Variable Remapping PrjMap(X):

PrjMap(X) = replace all project operations `project(P PV)` with `project(PrjMap(P,PV) PV)` for each projection in X.

`EXISTS`.

Applying the renaming steps inside a sub-query does not change the solution mappings resulting from evaluating the sub-query. Remapping is only applied to variables not visible outside the sub-query. Renaming a variable in a SPARQL algebra expression causes the variable name used in bindings from evaluating the algebra expression to change. Since these are only variables that are not visible outside the sub-query, because they do not occur in the projection, the result of the sub-query is unchanged. SPARQL algebra expressions can not access the name of a variable nor introduce a variable except by remapping. Remapping is only applied to variables not visible outside the sub-query.

### Limitations on Assignment

SPARQL syntactic forms that attempt to bind a variable through the use of
`AS` that might already be in a solution mapping are forbidden in
SPARQL: this is covered in the syntactic restrictions of
19.8 Grammar,
notes 12 and 13.

This proposal adds the restriction that any variables in a current
row, the set of variables
in-scope
of the expression containing EXISTS, can not be assigned with the `extend`
algebra function linked to the `AS` syntax.

In addition, any use of `VALUES` in the EXISTS expression must not
use a variable in the current row.

### Restriction of Bindings

The proposal is to retain the variables from the current row, not substitute them for RDF terms, before evaluation, and also to restrict the binding of the solution to the RDF term of the current row. This occurs after renaming.

Binding for variables occur in several places in SPARQL:

- Basic Graph Pattern Matching
- Property Path Patterns
- The evaluation of algebra form
`Graph(var,P)`involving a variable (from the syntax`GRAPH ?variable {…}`)

Note that other places where solution mappings add variables are in
`extend` function (connected to the `AS` syntax)
and `a multiset` from `VALUES` syntax.
Limitations on Assignment
forbid this being of variables of the current row.

Restricting the RDF Terms for a variable binding is done using inline data that is joined with the evalaution of the basic graph pattern, property path or graph match.

**Definition: Values Insertion**

For solution mapping μ, define Table(μ) to be the multiset formed from μ.

Table(μ) = { μ }

Card[μ] = 1

Define the Values Insertion function `Replace(X, μ)` to
replace each occurence Y of a
Basic Graph Pattern,
Property Path Expression,
Graph(Var, pattern)
in X with join(Y, Table(μ)).

### Evaluation of EXISTS

The evaluation of `EXISTS` is defined as:

**Definition: Evaluation of Exists**

Let μ be the current solution mapping for a filter and X a graph pattern,
define the Evaluation of Exists `exists(X)`

exists(X) = true if eval(D(G), Replace(PrjMap(X), μ) is a non-empty solution sequence.

exists(X) = false otherwise

## Addressing Issues

This section addresses each issue identified, given the proposal above.

### Issue 1: Some uses of EXISTS are not defined during evaluation

This can be handled by handling solution sequences as graph patterns where needed by adding toMultiSet as is done for SubSelect in 18.2.2.6 Translate Graph Patterns with a a correction to the text at the end of Section 18.2 introductory paragraph.

query-errata-N: "Section 18.2 Translation to the SPARQL Algebra" intro (end): ToMultiSet can be used where a graph pattern is mentioned below because the outcome of evaluating a graph pattern is a multiset. Multisets of solution mappings are elements of the SPARQL algebra. Multisets of solution mappings count as graph patterns.

### Issue 2: Substitution happens where definitions are only for variables

Rather then replace a variable by its value in the current row, the new
mechanism makes the binding of variable to value available. The variable
remains in the graph pattern of `EXISTS` and the evaluation.

### Issue 3: Blank nodes substituted into BGPs act as variables

By making the current row, which can include blank nodes, available, and not modifying the BGP by substitution, no blank nodes are introduced into the evalaution of the BGP. Instead, the possible solutions is restricted by the current row.

### Issue 4: Substitution can flip MINUS to its disjoint-domain case

Issue 4 is addressed because variables are not removed from the domain of
`MINUS`. This propsoal does not preserve all uses of `MINUS`
expressions; the problem identified in issue 4 is considered to be a bug in the
SPARQL 1.1 specification.

### Issue 5: Substitution affects disconnected variables

Issue 5 is addressed by noting that variables inside sub-queries which are not projected can be renamed without affecting the sub-query results. Whether to preserve that invariant or allow the variables to be set by the current row is a choice point - this design preserves the independence of disconnected variables.

## Notes

The proposal described in this document does not cover use of variables from
the current row in a `HAVING` clause.