RDF Patch – Describing Changes to an RDF Dataset

RDF Patch is a file format for recording changes made to an RDF dataset. It can be used for replicating changes between multiple copies of the same dataset or as an incremental backup format.

Introduction

RDF Patch is a file format for recording changes made to an RDF dataset. It can be used for replicating changes between multiple copies of the same dataset or as an incremental backup format.

The design emphasizes deployment concerns such as scalability and efficient processing.

Use Cases

This section outlines some use cases for RDF Patch.

HTTP PATCH

The HTTP PATCH method, introduced in RFC 5789 (March 2010), is for modification of a web resource by sending a description of the changes to the resource. Changes are limited to that resource.

SPARQL Update follows a service-oriented model of operation and may not be suitable. RDF Patch gives a more focused and lightweight solution, for example where the web resource is an RDF graph, not an RDF dataset.

No new web resources can be created by RDF Patch. Data can only be added to and removed from an existing graph or dataset.

Incremental backup

With a large dataset, a full backup can take a significant amount of time and space; if taken on a live system, it may introduce significant load. By keeping an RDF Patch file to record changes, there is an incremental backup that can be applied to the last full backup. Such a record can be much smaller, and recording it does not put a significant load on the system during peak times.

By taking periodic full backups and recording changes as an RDF Patch as incremental backups between the full backup points, an up-to-date version of the data can be restored, including the latest changes.

Replica Maintenance

Suppose a SPARQL server is replicated for performance or resilience reasons. Each server has a complete copy of the database. While there are many other issues to consider, one aspect is to be able to update both databases in a consistent manner; that is, blank nodes within the dataset are treated identically.

Data Staging

One deployment architecture is to have a master database for staging the data, which is not available to the public/application-facing publication servers. The master database is updated, then the changes are propagated to replicas that support the public/application-facing read-only workload. The staging database may be subject to checking and validation of changes before release to the production publishing servers. To keep the production servers identical to the master database, an RDF Patch file can be used because it allows blank nodes to be replicated.

Illustrative Example

This example delta adds a foaf:knows relationship, and changes the foaf:name triple for Bob.

 A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
 A <http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .
 A <http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
 D <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
 A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .

In RDF Patch, the first column is the operation marker. A means "add" and D means "delete".

The order in the file is significant - the changes happened in the order given so the first triple added with "Robert" as Bob's name is removed by the later delete when the name is changed from "Robert" to "Bob".
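The effect of ordering can be sketched in a few lines of Python. This is only an illustration modeled on the example above, not part of the format: triples are represented as plain tuples of strings, and the function name apply_changes is an assumption made for this sketch.

```python
# Hypothetical representation: a triple is a tuple of three strings.
FOAF = "http://xmlns.com/foaf/0.1/"
BOB = "http://example.org/bob"
ALICE = "http://example.org/alice"

def apply_changes(graph, changes):
    """Apply add ("A") and delete ("D") changes to a set of triples, in order."""
    result = set(graph)
    for op, triple in changes:
        if op == "A":
            result.add(triple)
        elif op == "D":
            result.discard(triple)  # deleting an absent triple has no effect
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result

changes = [
    ("A", (BOB, FOAF + "name", '"Robert"')),
    ("A", (BOB, FOAF + "knows", ALICE)),
    ("A", (ALICE, FOAF + "name", '"Alice"')),
    ("D", (BOB, FOAF + "name", '"Robert"')),
    ("A", (BOB, FOAF + "name", '"Bob"')),
]
graph = apply_changes(set(), changes)
```

After the changes are applied in order, the graph holds the "Bob" name triple and no longer holds the "Robert" one.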

RDF Patch Details

An RDF Dataset is defined as a set of one unnamed (default) graph and zero or more named graphs. To record changes to these graphs, we record the triples added to or deleted from the default graph, and the quads added to or deleted from a named graph. The "RDF Patch" format is based on a general format for "RDF tuples", which is not another serialization format for RDF or RDF Datasets.

The format is rows of RDF terms (URIs, literals, bNodes), with each row ending in a "." (DOT). Prefixed names, SPARQL-style variables and keywords (bare words without a ":" in them) are allowed.

Variables are not used by RDF Patch but other uses of RDF Tuples may define their use.

# starts a comment and it runs until the end of line. Note that RDF Tuple rows are delimited by DOT and may have white space between the RDF terms, including newlines.

RDF Tuples uses the following keywords:

Keyword  Meaning
R        Repeat term from previous row
U        Term undefined (not used by RDF Patch)

RDF Patch puts a keyword in the first item of each row:

Keyword  Meaning
A        Add
D        Delete

This is followed by 3 or 4 RDF terms; 3 for a triple, 4 for a quad.

Using RDF Tuples, RDF Patch has the following characteristics:

Abbreviation, Compression and Readability

To remove repetition, the value of an element in a row can be the keyword R, meaning it is the same as the element in that position in the previous row. It is an error if there is no such row. This means repeated subjects, predicates or objects of changes do not need to be re-output for each quad. This may allow significant compression at very low CPU cost. General compression, such as gzip streams, may still be usefully applied.
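As a rough sketch of how a reader might expand R, the Python below assumes rows have already been split into lists of token strings (without the trailing DOT). The function name expand_repeats and the choice to raise ValueError when there is nothing to repeat are assumptions of this sketch, not something the format defines.

```python
def expand_repeats(rows):
    """Replace each "R" item with the item at the same position in the previous row."""
    expanded = []
    previous = None
    for row in rows:
        current = []
        for position, item in enumerate(row):
            if item == "R":
                if previous is None or position >= len(previous):
                    raise ValueError(f"no previous value for R at position {position}")
                item = previous[position]
            current.append(item)
        expanded.append(current)
        previous = current  # already expanded, so chains of R resolve correctly
    return expanded

rows = [
    ["A", "<http://example/bob>", "foaf:name", '"Bob"'],
    ["A", "R", "foaf:knows", "<http://example/alice>"],
    ["A", "R", "R", "<http://example/charlie>"],
]
expanded = expand_repeats(rows)
```

A quoted literal such as "R" is a different token from the bare keyword R, so it is left untouched.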

If an element of a row is undefined, a U is used to mark this (not used in RDF Patch).

Prefixed names are allowed (e.g. rdf:type) using Turtle style @prefix and expansion rules.

Relative URIs are not allowed. There is no @base.

RDF Tuples Format

@@Make formal

Outline grammar

This uses the grammar and tokens from RDF 1.1 Turtle and SPARQL 1.1.

White space is allowed between tokens and may be needed to distinguish tokens.

Comments start with '#' and run to the end of the line.

Term     = iri | BlankNode | literal
RowItem  = WORD | Term | VAR
Row      = RowItem* DOT

WORD    = PN_CHARS_BASE PN_LOCAL without allowing ':'    # Must start with a letter.
VAR     = (from SPARQL)
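To make the outline grammar more concrete, here is a deliberately simplified tokenizer sketch in Python. It is an assumption-laden illustration, not the formal grammar: the regular expressions only approximate the Turtle and SPARQL tokens, and prefixed names, @prefix directives, numeric literals and long string literals are not handled. The names TOKEN and parse_rows are invented for this sketch.

```python
import re

# Simplified token patterns; the real Turtle/SPARQL tokens are richer than this.
TOKEN = re.compile(r"""
    (?P<iri><[^<>"\s]*>)
  | (?P<literal>"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^<>"\s]*>)?)
  | (?P<bnode>_:[A-Za-z0-9_]+)
  | (?P<var>[?$][A-Za-z0-9_]+)
  | (?P<word>[A-Za-z][A-Za-z0-9_]*)
  | (?P<dot>\.)
  | (?P<comment>\#[^\n]*)
  | (?P<space>\s+)
""", re.VERBOSE)

def parse_rows(text):
    """Split text into rows of (kind, token) pairs; each row ends at a DOT."""
    rows, row, pos = [], [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind in ("space", "comment"):
            continue  # white space and comments separate tokens but carry no data
        if kind == "dot":
            rows.append(row)
            row = []
        else:
            row.append((kind, match.group()))
    if row:
        raise ValueError("last row is not terminated by DOT")
    return rows
```

Because rows are delimited by DOT rather than by newlines, a row split across several lines parses the same way as a row on one line.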

RDF Patch Use of RDF Tuples

@@Make formal

RDF Patch uses RDF Tuples for triples or quads (N-Quads order). It requires the first item in a row to be a keyword (A or D).

After the first keyword RowItem, there are 3 or 4 RowItems which are either RDF Terms or the keyword R, for an RDF term repeated from the last use in this row position.

Blank node labels are treated as encoded internal system identifiers. Systems may refuse to perform changes if they do not recognize the label format.

RDF Patch does not allow VAR.
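The constraints in this section can be sketched as a small row check. The Python below assumes a row has already been tokenized into a list of strings without the trailing DOT; the function name check_patch_row and the error messages are hypothetical.

```python
def check_patch_row(items):
    """Return "triple" or "quad" for a well-formed RDF Patch row, else raise ValueError."""
    if not items or items[0] not in ("A", "D"):
        raise ValueError("row must start with the keyword A or D")
    terms = items[1:]
    if len(terms) not in (3, 4):
        raise ValueError("expected 3 items (triple) or 4 items (quad)")
    for term in terms:
        if term.startswith(("?", "$")):
            raise ValueError("variables are not allowed in RDF Patch")
        if term == "U":
            raise ValueError("the keyword U is not used in RDF Patch")
    # Any remaining item is an RDF term or the keyword R.
    return "triple" if len(terms) == 3 else "quad"
```

The check is purely structural: it does not verify that each item is a valid RDF term, nor that an R has a previous row to refer to.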

Further RDF Patch Examples

Changes to a Graph

A <http://example/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .

This is the same as:

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
A <http://example/bob> foaf:name "Bob" .

A slightly longer change, replacing a name and adding multiple triples:

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
D <http://example/bob> foaf:name "bob" .
A <http://example/bob> foaf:name "Bob" .
A R foaf:knows <http://example/alice> .
A R R <http://example/charlie> .

The last row adds the triple:

<http://example/bob> <http://xmlns.com/foaf/0.1/knows> <http://example/charlie> .

The repeat marker can be used in any position and across adds and deletes:

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
A <http://example/alice> foaf:knows <http://example/charlie> .
A <http://example/bob> R R .

@prefix  foaf: <http://xmlns.com/foaf/0.1/> .
D <http://example/bob> foaf:name "Robert" .
A R R "Bob" .
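A writer can introduce the repeat marker mechanically. The Python sketch below assumes rows are lists of token strings with the operation keyword first; the name compress_repeats is hypothetical, and the operation keyword itself is always written out because RDF Patch requires A or D in the first position.

```python
def compress_repeats(rows):
    """Replace items that equal the same position of the previous row with "R"."""
    compressed = []
    previous = None
    for row in rows:
        if previous is None:
            compressed.append(list(row))  # the first row has nothing to repeat
        else:
            out = [row[0]]  # operation keyword is always written
            for position in range(1, len(row)):
                same = position < len(previous) and row[position] == previous[position]
                out.append("R" if same else row[position])
            compressed.append(out)
        previous = row  # compare against the full, uncompressed row
    return compressed

rows = [
    ["A", "<http://example/alice>", "foaf:knows", "<http://example/charlie>"],
    ["A", "<http://example/bob>", "foaf:knows", "<http://example/charlie>"],
]
compressed = compress_repeats(rows)
```

The second row becomes A <http://example/bob> R R, matching the first example above.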

Use with HTTP PATCH

HTTP PATCH (RFC 5789) introduced a mechanism for sending changes to a web resource. The SPARQL Graph Store Protocol mentions the use of HTTP PATCH. RDF Patch is a suitable alternative to full SPARQL Update for the HTTP request body.

  PATCH /resource HTTP/1.1
  Host: example.org
  Content-Type: application/rdf-patch

  A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
  A <http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .
  A <http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
  D <http://example.org/bob> R "Robert" .
  A R R "Bob" .
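A client could build such a request with the Python standard library, as sketched below. The URL is a placeholder, the helper name build_patch_request is an assumption, and the request is only constructed here, not sent; whether a given server accepts this media type is outside the scope of the sketch.

```python
import urllib.request

PATCH_BODY = """\
D <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
A R R "Bob" .
"""

def build_patch_request(url, patch_text):
    """Build (but do not send) an HTTP PATCH request carrying an RDF Patch body."""
    return urllib.request.Request(
        url,
        data=patch_text.encode("utf-8"),  # RDF Patch is UTF-8 text
        headers={"Content-Type": "application/rdf-patch"},
        method="PATCH",
    )

request = build_patch_request("http://example.org/resource", PATCH_BODY)
# Sending would be: urllib.request.urlopen(request)
```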

Named Graphs

Named graphs are changed by specifying quads:

A <http://example/bob> foaf:name "Bob" <http://example/graphName> .

Canonical Patches

A further way to reduce the size of a diff file is to record only changes that actually have an effect. An RDF Dataset is a set of quads with no duplicates, so if a quad is added that is already in the dataset, no record is needed because no effective change occurs. Similarly for deletes: if the quad does not exist, a delete action has no effect.

An RDF Patch that only records actual changes is said to be a canonical patch.

However, testing before every insert or delete can be expensive or otherwise impractical, so this style is not required. For example:

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

This has the same effect as:

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

This assumes there was no such quad in the dataset before the delta started.

The quads don't need to be adjacent:

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
A <http://example/other> <http://example/p> 123 <http://example/g2> .
A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

becomes

A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
A <http://example/other> <http://example/p> 123 <http://example/g2> .
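One way to produce a canonical patch is to track the dataset state while filtering the changes, as in the Python sketch below. It assumes the starting dataset is known and fits in memory as a set of tuples; the function name canonicalize is hypothetical.

```python
def canonicalize(dataset, changes):
    """Keep only the changes that actually alter the dataset, in order."""
    current = set(dataset)
    canonical = []
    for op, quad in changes:
        if op == "A" and quad not in current:
            current.add(quad)
            canonical.append((op, quad))
        elif op == "D" and quad in current:
            current.remove(quad)
            canonical.append((op, quad))
        # otherwise the change has no effect and is dropped
    return canonical

Q1 = ("<http://example/s>", "<http://example/p>", "<http://example/o>", "<http://example/g>")
Q2 = ("<http://example/other>", "<http://example/p>", "123", "<http://example/g2>")
canonical = canonicalize(set(), [("A", Q1), ("A", Q2), ("A", Q1)])
```

Starting from an empty dataset, the repeated add of the first quad is dropped, matching the example above.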

When written in canonical form the RDF Patch can be run backwards, to undo a change. This only works when real changes are recorded because otherwise knowing a triple is added does not mean it was not there before.
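Running a canonical patch backwards can be sketched as reversing the order of the rows and swapping A with D. The Python below is an illustration under the assumption that changes are (operation, quad) pairs and the dataset is a set; the names apply_changes and reverse_patch are hypothetical.

```python
def apply_changes(dataset, changes):
    """Apply add ("A") and delete ("D") changes to a set of quads, in order."""
    result = set(dataset)
    for op, quad in changes:
        if op == "A":
            result.add(quad)
        else:
            result.discard(quad)
    return result

def reverse_patch(changes):
    """Undo patch: last change first, with adds and deletes swapped."""
    flipped = {"A": "D", "D": "A"}
    return [(flipped[op], quad) for op, quad in reversed(changes)]

OLD = ("<http://example/s>", "<http://example/p>", '"old"', "<http://example/g>")
NEW = ("<http://example/s>", "<http://example/p>", '"new"', "<http://example/g>")

before = {OLD}
patch = [("D", OLD), ("A", NEW)]  # canonical with respect to `before`
after = apply_changes(before, patch)
restored = apply_changes(after, reverse_patch(patch))
```

Here restored equals before. If the patch had instead contained a redundant add of a quad that was already present, the reversed patch would delete that quad and the original state would not be recovered.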

Blank Nodes

Both RDF-syntax-style blank node labels (document-scoped labels, new blank nodes created) and store-scoped labels (existing blank nodes) are needed. RDF provides skolemization as a mechanism to reference specific blank nodes when needed.

RDF Patch uses the special URI-like encoding <_:label> for directly identifying a blank node in a store.
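A minimal sketch of handling this encoding in Python follows. The helper names are hypothetical, and what counts as a valid store label is left to the system, as noted earlier for label formats.

```python
def encode_bnode(label):
    """Write a store-scoped blank node label in the <_:label> form."""
    return f"<_:{label}>"

def decode_bnode(term):
    """Return the label if term has the <_:label> form, otherwise None."""
    if term.startswith("<_:") and term.endswith(">"):
        return term[3:-1]
    return None
```

An ordinary IRI such as <http://example/bob> decodes to None, so the two kinds of term can be told apart by syntax alone.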

Notes

This section contains working notes for the development of this document.

Simplicity of design leads to widespread adoption, which is important for the core use of managing change to RDF data. All features need to balance their utility for a particular use case with the increase in complexity and implementation costs.

Line Mode

The grammar does not specify that tuples must be one per line. Long lines can be split. If there is no reason to require one tuple per line, this is convenient even in a machine-oriented format.

But there are some advantages to a strictly one-line-per-tuple format.

  • Bulk processing with command line text processing tools is much simpler if one tuple per line is required.

  • Some features below require markers in the data stream, so requiring them at the start of a line is convenient.

Proposal: One tuple per line.
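If the proposal is adopted, simple line-oriented processing becomes possible without a tokenizer. The Python sketch below counts operations per keyword under that assumption; the function name count_operations is hypothetical, and lines starting with # or @ are skipped as comments or directives.

```python
from collections import Counter

def count_operations(lines):
    """Count rows per operation keyword, assuming exactly one tuple per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue  # blank line, comment, or directive such as @prefix
        counts[line.split(None, 1)[0]] += 1
    return counts

patch_lines = [
    "@prefix  foaf: <http://xmlns.com/foaf/0.1/> .",
    'D <http://example/bob> foaf:name "bob" .',
    'A <http://example/bob> foaf:name "Bob" .',
    "A R foaf:knows <http://example/alice> .",
]
counts = count_operations(patch_lines)
```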

Metadata

There are two metadata needs: linking and inline.

Linking

Linking can be achieved with an L (or LINK) record. However, the link may be unavailable or may be subject to change after the patch is created.

Inline

See this discussion

Inline metadata is not subject to such changes. It is made a little more complicated because RDF-patch is not an RDF syntax (the handling of blank nodes is different).

Proposal: A single line of METADATA starts or stops a block of characters that is passed to a Turtle parser. Blank nodes are handled as per RDF; full Turtle readability idioms apply. The block must not contain the string METADATA at the start of line.
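The proposal could be processed along the following lines. This Python sketch assumes that a delimiter line consists of the word METADATA alone, which is one reading of the proposal; the function name split_metadata is hypothetical, and the collected blocks would still need to be handed to a Turtle parser.

```python
def split_metadata(lines):
    """Separate METADATA-delimited Turtle blocks from the patch rows."""
    metadata_blocks, patch_lines = [], []
    block = None  # None while outside a metadata block
    for line in lines:
        if line.strip() == "METADATA":
            if block is None:
                block = []  # a METADATA line starts a block
            else:
                metadata_blocks.append("\n".join(block))
                block = None  # a second METADATA line stops it
        elif block is not None:
            block.append(line)
        else:
            patch_lines.append(line)
    if block is not None:
        raise ValueError("unterminated METADATA block")
    return metadata_blocks, patch_lines

lines = [
    "METADATA",
    '<> <http://purl.org/dc/terms/creator> "example" .',
    "METADATA",
    "A <http://example/s> <http://example/p> <http://example/o> .",
]
blocks, rows = split_metadata(lines)
```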

Transaction Boundaries

The RDF Patch format can be used to record one set of related changes, a group of related changes or a stream of changes. When used to record one set of related changes, the changes should be applied atomically: they all happen or none of them happen.

@@ As part of a section in a more formal style, add the atomicity requirement.

We define atomicity as a "SHOULD" because we do not exclude the use of a data format just because of specific system issues such as lightweight implementations or usages that do not need this feature.

While (ACID) transactions do give atomicity, they are not the only way and are appropriate for large systems.

The requirement on the RDF Patch format is to have markers for the beginning and end of sections of changes.

Proposal: Add BEGIN and END to indicate a block of changes. A block should be applied atomically. This may be a transaction; it depends on the nature of the destination.

The start of file is an implicit BEGIN and the end of file is an implicit END.
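A sketch of how a reader might honour these markers is given below in Python. It assumes rows arrive as tuples, with ("BEGIN",) and ("END",) for markers and (operation, quad) for changes, and it treats a BEGIN that arrives while changes are pending as closing the implicit block before it. All names are hypothetical, and working on a copy is only one possible way to get all-or-nothing behaviour.

```python
def split_blocks(rows):
    """Group change rows into blocks delimited by BEGIN and END markers."""
    blocks, current = [], []
    for row in rows:
        if row == ("BEGIN",):
            if current:
                blocks.append(current)  # changes before BEGIN form an implicit block
            current = []
        elif row == ("END",):
            blocks.append(current)
            current = []
        else:
            current.append(row)
    if current:
        blocks.append(current)  # end of file is an implicit END
    return blocks

def apply_block(dataset, block):
    """Apply a block all-or-nothing: mutate a copy and return it only on success."""
    working = set(dataset)
    for op, quad in block:
        if op == "A":
            working.add(quad)
        elif op == "D":
            working.discard(quad)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return working
```

If apply_block raises part-way through, the caller's dataset is untouched because only the copy was modified.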

Alignment Errors

Error handling : from tweet

Need to define what happens when R does not align to the previous tuple (presumably an error; define recovery for stream use).

Binary Format

An alternative wire format for efficient processing.

(Need to quantify the gains, if any).

Bulk Operations

In addition to D operations on a single quad, have a "bulk delete", either as a different operation or as, for example, D <http://example/s> ANY ANY ANY ..

c.f. SPARQL DROP ALL and DELETE WHERE {...}.
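The pattern form of bulk delete could behave as in the following Python sketch. Since this feature is only a note, everything here is an assumption: quads are 4-tuples of strings, ANY is a bare keyword that matches any term, and the function name bulk_delete is hypothetical.

```python
ANY = "ANY"  # hypothetical wildcard keyword from the note above

def bulk_delete(dataset, pattern):
    """Return the dataset without the quads that match the pattern."""
    def matches(quad):
        return all(p == ANY or p == q for p, q in zip(pattern, quad))
    return {quad for quad in dataset if not matches(quad)}

dataset = {
    ("<http://example/s>", "<http://example/p>", '"1"', "<http://example/g>"),
    ("<http://example/s>", "<http://example/q>", '"2"', "<http://example/g>"),
    ("<http://example/t>", "<http://example/p>", '"3"', "<http://example/g>"),
}
remaining = bulk_delete(dataset, ("<http://example/s>", ANY, ANY, ANY))
```

Only the quad with subject <http://example/t> remains. A pattern of four ANY items would empty the dataset, which is the analogue of SPARQL DROP ALL.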

Internet Media Type, File Extension and Macintosh File Type

The Internet Media Type / MIME Type for RDF Patch is "application/rdf-patch".

The character set for RDF Patch is "UTF-8".

It is recommended that RDF Patch files have the extension ".rdfp" (all lowercase) on all platforms.

It is recommended that RDF Patch files stored on Macintosh HFS file systems be given a file type of "TEXT".