RDF Patch – Describing Changes to an RDF Dataset

RDF Patch is a file format for recording changes made to an RDF dataset. It can be used to replicate changes between multiple copies of the same dataset or as an incremental backup format.

Use Cases

This section outlines some use cases for RDF Patch.

HTTP PATCH

The HTTP PATCH method, introduced in RFC 5789 (March 2010), is for modification of a web resource by sending a description of the changes to the resource. Changes are limited to that resource.

SPARQL Update follows a service-oriented model of operation and may not be suitable for this purpose. RDF Patch offers a more focused and lightweight solution, for example where the web resource is an RDF graph rather than an RDF dataset.

No new web resources can be created by RDF Patch. Data can only be added to or removed from an existing graph or dataset.

Incremental backup

With a large dataset, a full backup can take a significant amount of time and space; if taken on a live system, it may introduce significant load. Keeping an RDF Patch file to record changes provides an incremental backup that can be applied to the last full backup. Such a record can be much smaller, and recording it does not place a significant load on the system during peak times.

By taking periodic full backups and recording changes as an RDF Patch as incremental backups between the full backup points, an up-to-date version of the data can be restored, including the latest changes.

Replica Maintenance

Suppose a SPARQL server is replicated for performance or resilience reasons. Each server has a complete copy of the database. While there are many other issues to consider, one aspect is to be able to update both databases in a consistent manner - that is, blank nodes within the dataset are treated identically.

Data Staging

One deployment architecture is to have a master database for staging the data which is not available to the public/application-facing publication servers. The master database is updated, then the changes are propagated to replicas that support the public/application-facing read-only workload. The staging database may be subject to checking and validation of changes before release to the production publishing servers. In order to keep the production servers identical to the master database, an RDF Patch file can be used because it allows blank nodes to be replicated.

Illustrative Example

This example delta adds a foaf:knows relationship and changes the foaf:name triple for Bob.

 A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
 A <http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .
 A <http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
 D <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
 A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .

In RDF Patch, the first column is the operation marker: A means "add" and D means "delete".

The order in the file is significant - the changes are applied in the order given, so the first triple added, with "Robert" as Bob's name, is removed by the later delete when the name is changed from "Robert" to "Bob".
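The effect of ordering can be sketched in Python. This is only an illustration, not part of the format: the `apply_rows` helper and the representation of rows as already-parsed tuples are assumptions made for the example, and terms are kept as plain strings.

```python
# Hypothetical helper: apply already-parsed patch rows to a set of triples.
# Each row is (op, subject, predicate, object), where op is "A" or "D".

def apply_rows(graph, rows):
    """Apply rows in file order; changing the order can change the result."""
    for op, s, p, o in rows:
        if op == "A":
            graph.add((s, p, o))
        elif op == "D":
            graph.discard((s, p, o))  # deleting an absent triple is a no-op
        else:
            raise ValueError(f"unknown operation marker: {op!r}")
    return graph

FOAF = "http://xmlns.com/foaf/0.1/"
BOB = "<http://example.org/bob>"
rows = [
    ("A", BOB, f"<{FOAF}name>", '"Robert"'),
    ("D", BOB, f"<{FOAF}name>", '"Robert"'),
    ("A", BOB, f"<{FOAF}name>", '"Bob"'),
]
graph = apply_rows(set(), rows)              # only the "Bob" name remains
reordered = apply_rows(set(), rows[::-1])    # both names remain
```

Replaying the same rows in reverse order leaves both names in the graph, which is why a patch has to be applied in the order written.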

RDF Patch Details

An RDF Dataset is defined as a set of one unnamed (default) graph and zero or more named graphs. To record changes to these graphs, we record the triples added to or deleted from the default graph, and the quads added to or deleted from named graphs. The "RDF Patch" format is based on a general format for "RDF tuples", which is not another serialization format for RDF or RDF Datasets.

The format is rows of RDF terms (URIs, literals, blank nodes), with each row ending in a "." (DOT). Prefixed names, SPARQL-style variables and keywords (bare words without a ":" in them) are allowed.

Variables are not used by RDF Patch but other uses of RDF Tuples may define their use.

# starts a comment, which runs until the end of the line. Note that RDF Tuple rows are delimited by DOT and may have white space between the RDF terms, including newlines.

RDF Tuples uses the following keywords:

Keyword	Meaning
`R`	Repeat term from previous row
`U`	Term undefined (not used by RDF Patch)

RDF Patch puts a keyword in the first item of each row:

Keyword	Meaning
`A`	Add
`D`	Delete

This is followed by 3 or 4 RDF terms; 3 for a triple, 4 for a quad.

Using RDF Tuples, RDF Patch has the following characteristics:

The unit being managed is an RDF Dataset, which is a collection of RDF graphs and used by SPARQL for query and for update. RDF Patch can also be used to record changes to a single graph.
It is streaming - changes can be applied as an RDF Patch file or stream is read; it is not necessary to wait until the whole set of changes has been seen before applying any of it. Indeed, the patch must be applied in the order given in the file for the changes to be applied correctly.
Blank nodes are recorded as _:id where id can be the system-internal identifier for the blank node. This means that changes involving blank nodes can be replayed to get a dataset identical to the one where the changes were originally made. This is RDF Patch's interpretation of the _:abc syntax; other uses of RDF Tuples may have different scoping rules for blank node labels.
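A minimal sketch of streaming application in Python, under several simplifying assumptions: one row per line, no prefixed names or repeat markers, no trailing comments on a row, and a deliberately rough tokenizer. The names `TOKEN`, `stream_rows` and `apply_stream` are hypothetical. Blank node labels are kept exactly as written, so a replica ends up with the same labels as the source.

```python
import io
import re

# Simplified tokenizer (an assumption of this sketch): an IRI in <>, a quoted
# literal with optional language tag or datatype, or any other bare token.
TOKEN = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^\S+)?|\S+')

def stream_rows(lines):
    """Yield (op, terms) one row at a time, without reading the whole input."""
    for line in lines:
        tokens = TOKEN.findall(line)
        if not tokens or tokens[0].startswith("#"):
            continue  # blank line or comment line
        if tokens[-1] != ".":
            raise ValueError(f"row does not end in DOT: {line!r}")
        yield tokens[0], tuple(tokens[1:-1])

def apply_stream(dataset, lines):
    """Apply each row as soon as it has been read."""
    for op, terms in stream_rows(lines):
        if op == "A":
            dataset.add(terms)
        elif op == "D":
            dataset.discard(terms)
        else:
            raise ValueError(f"unknown operation marker: {op!r}")
    return dataset

patch = io.StringIO(
    '# blank node labels are kept exactly as written\n'
    'A _:b1 <http://xmlns.com/foaf/0.1/name> "Alice Smith" .\n'
    'A _:b1 <http://xmlns.com/foaf/0.1/knows> _:b2 .\n'
    'D _:b1 <http://xmlns.com/foaf/0.1/name> "Alice Smith" .\n'
)
replica = apply_stream(set(), patch)
```

Because `stream_rows` is a generator, the same code works on a file, a socket wrapper or any other iterable of lines.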

Abbreviation, Compression and Readability

To remove repetition, the value of an element in a row can be the keyword R, meaning it is the same as the corresponding element in the previous row. It is an error if there is no such row. This means repeated subjects, predicates or objects of changes do not need to be re-output for each quad. This may allow significant compression at very low CPU cost. General compression, like gzip streams, may still be usefully applied.

If an element in a row is undefined, a U is used to mark this (not used in RDF Patch).
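Expanding the repeat keyword can be sketched as follows. The `expand_repeats` function is a hypothetical helper, rows are assumed to be already tokenized, and R is assumed to refer to the term at the same position in the previous row. The operation marker itself is never repeated.

```python
def expand_repeats(rows):
    """Replace each "R" with the term at the same position in the previous row."""
    previous = None
    for op, *terms in rows:
        expanded = []
        for i, term in enumerate(terms):
            if term == "R":
                if previous is None or i >= len(previous):
                    raise ValueError(f"R at position {i} has no previous term")
                term = previous[i]
            expanded.append(term)
        previous = expanded
        yield (op, *expanded)

BOB = "<http://example/bob>"
rows = [
    ("D", BOB, "foaf:name", '"bob"'),
    ("A", BOB, "foaf:name", '"Bob"'),
    ("A", "R", "foaf:knows", "<http://example/alice>"),
    ("A", "R", "R", "<http://example/charlie>"),
]
full_rows = list(expand_repeats(rows))
```

The expansion needs only the previous row as state, so it fits a streaming reader.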

Prefixed names are allowed (e.g. rdf:type) using Turtle style @prefix and expansion rules.

Relative URIs are not allowed. There is no @base.
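The two rules above can be sketched in Python. The `expand_term` helper is hypothetical, takes a mapping of prefix to namespace that a reader would build from the @prefix rows, and ignores details such as prefixed datatypes on literals. The absolute-URI check is approximated by requiring a scheme.

```python
from urllib.parse import urlsplit

def expand_term(term, prefixes):
    """Expand a prefixed name to a full <URI>; reject relative URIs."""
    if term.startswith(("<", '"', "_:", "?")):
        # URIs, literals, blank nodes and variables are passed through,
        # but a URI without a scheme is treated as relative and rejected.
        if term.startswith("<") and not urlsplit(term[1:-1]).scheme:
            raise ValueError(f"relative URI not allowed: {term}")
        return term
    prefix, colon, local = term.partition(":")
    if not colon:
        return term  # bare keyword such as A, D, R or U, or a number
    if prefix not in prefixes:
        raise ValueError(f"undeclared prefix: {prefix!r}")
    return f"<{prefixes[prefix]}{local}>"

prefixes = {"foaf": "http://xmlns.com/foaf/0.1/"}
name_uri = expand_term("foaf:name", prefixes)
```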

Further RDF Patch Examples

Changes to a Graph

A <http://example/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .

This is the same as:

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
A <http://example/bob> foaf:name "Bob" .

A slightly longer change that adds multiple triples:

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
D <http://example/bob> foaf:name "bob" .
A <http://example/bob> foaf:name "Bob" .
A R foaf:knows <http://example/alice> .
A R R <http://example/charlie> .

In the last row, we add a triple

<http://example/bob> <http://xmlns.com/foaf/0.1/knows> <http://example/charlie> .

The repeat marker can be used in any position and across adds and deletes:

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
A <http://example/alice> foaf:knows <http://example/charlie> .
A <http://example/bob> R R .

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
D <http://example/bob> foaf:name "Robert" .
A R R "Bob" .

Use with HTTP PATCH

HTTP PATCH (RFC 5789) introduced a mechanism for sending changes to a web resource. The SPARQL Graph Store Protocol mentions the use of HTTP PATCH. RDF Patch is a suitable alternative to full SPARQL Update for the HTTP request body.

  PATCH /resource HTTP/1.1
  Host: example.org
  Content-Type: application/rdf-patch

  A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
  A <http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .
  A <http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
  D <http://example.org/bob> R "Robert" .
  A R R "Bob" .
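A request like the one above could be built with the Python standard library as sketched here. The `build_patch_request` helper and the UTF-8 encoding of the body are assumptions of this example. The request is only constructed, not sent, and whether a given server accepts this media type is outside the scope of the sketch.

```python
import urllib.request

def build_patch_request(url, patch_text):
    """Build (but do not send) an HTTP PATCH request carrying an RDF Patch."""
    return urllib.request.Request(
        url,
        data=patch_text.encode("utf-8"),  # encoding is an assumption
        method="PATCH",
        headers={"Content-Type": "application/rdf-patch"},
    )

patch_text = (
    'A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .\n'
    'D <http://example.org/bob> R "Robert" .\n'
    'A R R "Bob" .\n'
)
request = build_patch_request("http://example.org/resource", patch_text)
```

Passing `request` to `urllib.request.urlopen` would send it.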

Named Graphs

Named graphs are changed by specifying quads:

A <http://example/bob> foaf:name "Bob" <http://example/graphName> .
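A sketch of how a reader might route rows by term count, with three terms going to the default graph and four terms to a named graph. The dictionary layout, the `DEFAULT_GRAPH` key and the `apply_row` helper are assumptions of this example. The graph name is taken from the fourth position, as in the row above.

```python
DEFAULT_GRAPH = None  # key used here for the unnamed graph (an assumption)

def apply_row(dataset, op, terms):
    """Apply one row to a dataset held as {graph name: set of triples}."""
    if len(terms) == 3:
        graph_name, triple = DEFAULT_GRAPH, tuple(terms)
    elif len(terms) == 4:
        graph_name, triple = terms[3], tuple(terms[:3])
    else:
        raise ValueError(f"expected 3 or 4 terms, got {len(terms)}")
    graph = dataset.setdefault(graph_name, set())
    if op == "A":
        graph.add(triple)
    elif op == "D":
        graph.discard(triple)
    else:
        raise ValueError(f"unknown operation marker: {op!r}")

dataset = {}
bob_name = ("<http://example/bob>", "<http://xmlns.com/foaf/0.1/name>", '"Bob"')
apply_row(dataset, "A", bob_name)                                  # default graph
apply_row(dataset, "A", bob_name + ("<http://example/graphName>",))  # named graph
```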

Canonical Patches

A further way to reduce the size of a diff file is to record only changes that have an actual effect. An RDF Dataset is a set of quads with no duplicates, so if a quad is added that is already in the dataset, no record is needed because no effective change occurs. Similarly, if a quad to be deleted does not exist, the delete action has no effect.

An RDF Patch that only records actual changes is said to be a canonical patch.

However, testing before every insert or delete can be expensive or otherwise impractical, so this style is not required. For example:

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

This has the same effect as:

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

This holds if there was no such quad in the dataset before the delta started.

The quads don't need to be adjacent:

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
A <http://example/other> <http://example/p> 123 <http://example/g2> .
A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

becomes

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
A <http://example/other> <http://example/p> 123 <http://example/g2> .

When written in canonical form, the RDF Patch can be run backwards to undo a change. This only works when real changes are recorded, because otherwise knowing that a triple was added does not mean it was not there before.
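Canonicalization and reversal can be sketched as below. The helpers `canonicalize`, `reverse` and `apply_patch` are hypothetical names. Rows are modelled as (op, quad) pairs, and the dataset contents before the patch are assumed to be known.

```python
def canonicalize(rows, initial):
    """Keep only the rows that actually change the dataset."""
    state = set(initial)
    for op, quad in rows:
        if op == "A" and quad not in state:
            state.add(quad)
            yield op, quad
        elif op == "D" and quad in state:
            state.remove(quad)
            yield op, quad

def reverse(canonical_rows):
    """Undo a canonical patch: swap A and D and replay in reverse order."""
    swap = {"A": "D", "D": "A"}
    return [(swap[op], quad) for op, quad in reversed(list(canonical_rows))]

def apply_patch(dataset, rows):
    """Return a new set with the rows applied in order."""
    result = set(dataset)
    for op, quad in rows:
        if op == "A":
            result.add(quad)
        else:
            result.discard(quad)
    return result

Q1 = ("<http://example/s>", "<http://example/p>", "<http://example/o>", "<http://example/g>")
Q2 = ("<http://example/other>", "<http://example/p>", "123", "<http://example/g2>")
rows = [("A", Q1), ("A", Q2), ("A", Q1)]

canonical = list(canonicalize(rows, initial=set()))
after = apply_patch(set(), canonical)
restored = apply_patch(after, reverse(canonical))
```

If the dataset had already contained Q1, the canonical patch would record only the add of Q2, and undoing it would correctly leave Q1 in place. Reversing the non-canonical rows would delete Q1 by mistake.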

Notes

This section contains working notes for the development of this document.

Simplicity of design leads to widespread adoption, which is important for the core use of managing change to RDF data. All features need to balance their utility for a particular use case with the increase in complexity and implementation costs.

Line Mode

The grammar does not specify that tuples must be one per line. Long lines can be split. If there is no reason to require one tuple per line, this is convenient even in a machine-oriented format.

But there are some advantages to a strict one-tuple-per-line format.

Bulk processing with command-line text processing tools is much simpler if one tuple per line is required.
Some features below require markers in the data stream, so requiring them at the start of a line is convenient.

Proposal: One tuple per line.
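Under this proposal, simple line-oriented processing becomes possible without a full parser. As a sketch, the hypothetical `count_operations` helper below tallies operation markers by looking only at the first token of each line. It assumes one tuple per line, and that comment and @prefix lines start with # and @ respectively.

```python
from collections import Counter

def count_operations(lines):
    """Count operation markers, relying on one tuple per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue  # skip blank lines, comments and @prefix lines
        counts[line.split(None, 1)[0]] += 1
    return counts

lines = [
    "@prefix  foaf: <http://xmlns.com/foaf/0.1/> .",
    "# rename Bob",
    'D <http://example/bob> foaf:name "Robert" .',
    'A R R "Bob" .',
    "A R foaf:knows <http://example/alice> .",
]
counts = count_operations(lines)
```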

Metadata

There are two metadata needs: linking and inline.

Linking

Linking can be achieved with an L (or LINK) record. However, the link target may be unavailable, or may change after the patch is created.

Inline

See this discussion

Inline metadata is not subject to such changes. It is made a little more complicated because RDF-patch is not an RDF syntax (the handling of blank nodes is different).

Proposal: A single line of METADATA starts or stops a block of characters that is passed to a Turtle parser. Blank nodes are handled as per RDF; full Turtle readability idioms apply. The block must not contain the string METADATA at the start of line.
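A sketch of how a reader might separate such blocks from the patch rows, assuming the proposal as stated: a line consisting solely of METADATA toggles between patch rows and metadata text. The `split_metadata` helper is hypothetical, and handing each block to a Turtle parser is left out.

```python
def split_metadata(lines):
    """Separate patch rows from METADATA blocks (returned as Turtle text)."""
    patch_lines, metadata_blocks = [], []
    current = None  # lines of the open metadata block, or None outside one
    for line in lines:
        if line.rstrip() == "METADATA":
            if current is None:
                current = []  # marker opens a block
            else:
                metadata_blocks.append("\n".join(current))
                current = None  # marker closes the block
        elif current is not None:
            current.append(line)
        else:
            patch_lines.append(line)
    if current is not None:
        raise ValueError("unterminated METADATA block")
    return patch_lines, metadata_blocks

lines = [
    "METADATA",
    '[] <http://purl.org/dc/terms/description> "example patch" .',
    "METADATA",
    'A <http://example/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .',
]
patch_lines, metadata_blocks = split_metadata(lines)
```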

Transaction Boundaries

The RDF Patch format can be used to record one set of related changes, a group of related changes or a stream of changes. When used to record one set of related changes, the changes should be applied atomically - they all happen or none of them happen.

@@ As part of a section in a more formal style, add the atomicity requirement.

We define atomicity as a "SHOULD" because we do not exclude the use of a data format just because of specific system issues such as light-weight implementation or usages that do not need this feature.

While (ACID) transactions do give atomicity, they are not the only way to achieve it and are appropriate for large systems.

The requirement on the RDF Patch format is to have markers for the beginning and end of sections of changes.

Proposal: Add BEGIN and END to indicate a block of changes. A block should be applied atomically. This may be a transaction - it depends on the nature of the destination.

The start of file is an implicit BEGIN and the end of file is an implicit END.
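The proposal can be sketched in Python as follows. The helpers `split_blocks` and `apply_block` are hypothetical, and rows are assumed to be already tokenized. Atomicity is obtained here by working on a copy, which is only one possible approach and is practical only for small in-memory data.

```python
def split_blocks(rows):
    """Group rows into blocks; start and end of input act as implicit markers."""
    block = []
    for row in rows:
        if row[0] in ("BEGIN", "END"):
            if block:
                yield block  # a marker closes any block that is open
            block = []
        else:
            block.append(row)
    if block:
        yield block  # end of input is an implicit END

def apply_block(dataset, block):
    """All or nothing: change a copy, then swap it in only if every row applied."""
    working = set(dataset)
    for op, *terms in block:
        if op == "A":
            working.add(tuple(terms))
        elif op == "D":
            working.discard(tuple(terms))
        else:
            raise ValueError(f"unknown operation marker: {op!r}")
    dataset.clear()
    dataset.update(working)

Q1 = ("<http://example/s>", "<http://example/p>", "<http://example/o1>")
Q2 = ("<http://example/s>", "<http://example/p>", "<http://example/o2>")
rows = [("A", *Q1), ("BEGIN",), ("A", *Q2), ("D", *Q1), ("END",)]

dataset = set()
blocks = list(split_blocks(rows))
for block in blocks:
    apply_block(dataset, block)
```

If a block contains a row that cannot be applied, `apply_block` raises before touching `dataset`, so none of that block's changes become visible.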

Alignment Errors

Error handling : from tweet

Need to define what happens when R does not align to the previous tuple (presumably an error - define recovery for stream use).

Binary Format

An alternative wire format for efficient processing.

(Need to quantify the gains, if any).

Bulk Operations

In addition to D operations on a single quad, have a "bulk delete", either as a different operation or as, for example, D <http://example/s> ANY ANY ANY ..

c.f. SPARQL DROP ALL and DELETE WHERE {...}.
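One possible reading of the ANY form, sketched in Python: the `bulk_delete` helper is hypothetical, and treating ANY as a wildcard that matches any term in its position is an assumption, since the keyword is not yet defined by the format.

```python
def bulk_delete(dataset, pattern):
    """Remove every quad matching the pattern; "ANY" matches any term."""
    doomed = {
        quad for quad in dataset
        if len(quad) == len(pattern)
        and all(p == "ANY" or p == t for p, t in zip(pattern, quad))
    }
    dataset.difference_update(doomed)
    return doomed

S, P, G = "<http://example/s>", "<http://example/p>", "<http://example/g>"
dataset = {
    (S, P, '"1"', G),
    (S, P, '"2"', G),
    ("<http://example/other>", P, '"3"', G),
}
removed = bulk_delete(dataset, (S, "ANY", "ANY", "ANY"))
```

A pattern of four ANY terms would empty the dataset of quads, in the spirit of DROP ALL.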

Introduction