Introduction
RDF Patch is a file format for recording changes made to an RDF dataset. It can be used for replicating changes between multiple copies of the same dataset or as an incremental backup format.
The design emphasizes deployment concerns such as scalability and efficient processing.
Use Cases
This section outlines some use cases for RDF Patch.
HTTP PATCH
The HTTP PATCH method, introduced in RFC 5789 (March 2010), is for modification of a web resource by sending a description of the changes to the resource. Changes are limited to that resource.
SPARQL Update follows a service-oriented model of operation and may not be suitable. RDF Patch offers a more focused and lightweight solution, for example where the web resource is an RDF graph rather than an RDF dataset.
No new web resources can be created by RDF Patch. Data can only be added to or removed from an existing graph or dataset.
Incremental backup
With a large dataset, a full backup can take a significant amount of time and space; if taken on a live system, it may also introduce significant load. Keeping an RDF Patch file that records changes provides an incremental backup that can be applied to the last full backup. Such a record can be much smaller than a full backup and does not place a significant load on the system during peak times.
By taking periodic full backups and recording changes as an RDF Patch as incremental backups between the full backup points, an up-to-date version of the data can be restored, including the latest changes.
Replica Maintenance
Suppose a SPARQL server is replicated for performance or resilience reasons. Each server has a complete copy of the database. While there are many other issues to consider, one aspect is being able to update both databases in a consistent manner - that is, blank nodes within the dataset are treated identically.
Data Staging
One deployment architecture is to have a master database for staging the data, which is not available to the public/application-facing publication servers. The master database is updated, then the changes are propagated to replicas that support the public/application-facing read-only workload. The staging database may be subject to checking and validation of changes before release to the production publishing servers. To keep the production servers identical to the master database, an RDF Patch file can be used because it allows blank nodes to be replicated.
Illustrative Example
This example delta adds a foaf:knows relationship and changes the foaf:name triple for Bob.

    A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
    A <http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example/alice> .
    A <http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
    D <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
    A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .

In RDF Patch, the first column is the operation marker. A means "add" and D means "delete".
The order in the file is significant: the changes happened in the order given, so the first triple, added with "Robert" as Bob's name, is removed by the later delete when the name is changed from "Robert" to "Bob".
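The effect of ordering can be sketched in a few lines of Python. This is only an illustration: the graph is modelled as a plain set of string tuples, and the function name `apply_patch` and the shortened row list are assumptions of the sketch, not part of RDF Patch.

```python
# Rows are modelled as (operation, subject, predicate, object) tuples of strings.
def apply_patch(triples, rows):
    """Apply add/delete rows to a set of triples, strictly in the order given."""
    result = set(triples)
    for op, *triple in rows:
        if op == "A":
            result.add(tuple(triple))
        elif op == "D":
            result.discard(tuple(triple))  # deleting an absent triple has no effect
        else:
            raise ValueError(f"unknown operation marker: {op!r}")
    return result

NAME = "<http://xmlns.com/foaf/0.1/name>"
BOB = "<http://example.org/bob>"
rows = [
    ("A", BOB, NAME, '"Robert"'),
    ("D", BOB, NAME, '"Robert"'),
    ("A", BOB, NAME, '"Bob"'),
]

in_order = apply_patch(set(), rows)
# Moving the delete to the front changes the outcome: "Robert" is never removed.
reordered = apply_patch(set(), [rows[1], rows[0], rows[2]])
```

Applied in file order, only the "Bob" triple remains; with the delete moved first, both name triples survive.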
RDF Patch Details
An RDF Dataset is defined as a set of one unnamed (default) graph and zero or more named graphs. To record changes to these graphs, RDF Patch records a triple for each addition to or deletion from the default graph, and a quad for each addition to or deletion from a named graph. The "RDF Patch" format is based on a general format for "RDF tuples", which is not another serialization format for RDF or RDF Datasets.
The format is rows of RDF terms (URIs, literals, bNodes), with each row ending in a "." (DOT). Prefixed names, SPARQL-style variables and keywords (bare words without a ":" in them) are allowed.
Variables are not used by RDF Patch but other uses of RDF Tuples may define their use.
# starts a comment, which runs until the end of the line. Note that RDF Tuples rows are delimited by DOT and may have white space, including newlines, between the RDF terms.
RDF Tuples uses the following keywords:
Keyword | Meaning |
---|---|
R | Repeat term from previous row |
U | Term undefined (not used by RDF Patch) |
RDF Patch puts a keyword in the first item of each row:
Keyword | Meaning |
---|---|
A | Add |
D | Delete |
This is followed by 3 or 4 RDF terms; 3 for a triple, 4 for a quad.
Built on RDF Tuples, RDF Patch has the following characteristics:
- The unit being managed is an RDF Dataset, which is a collection of RDF graphs and used by SPARQL for query and for update. RDF Patch can also be used to record changes to a single graph.
- It is streaming - changes can be applied as an RDF Patch file or stream is read; it is not necessary to wait until the whole set of changes has been seen before applying any of it. Indeed, the patch must be applied in the order given in the file to apply the changes correctly.
- Blank nodes are recorded as _:id where id can be the system-internal identifier for the blank node. This means that changes involving blank nodes can be replayed to get a dataset identical to the one where the changes were originally made. This is RDF Patch's interpretation of the _:abc syntax; other uses of RDF Tuples may have different scoping rules for blank node labels.
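A minimal Python sketch of these characteristics follows. It assumes a dataset modelled as a dictionary from graph name to a set of triples, with `None` standing for the unnamed graph; the names `apply_stream` and `incoming_rows` are hypothetical. Rows are consumed from an iterator one at a time, and blank node labels are kept verbatim so that the replica refers to the same node.

```python
DEFAULT_GRAPH = None  # key chosen in this sketch for the unnamed graph

def apply_stream(dataset, rows):
    """Apply each row as soon as it is read; `rows` can be any iterator."""
    for op, *terms in rows:
        if len(terms) == 3:
            graph, triple = DEFAULT_GRAPH, tuple(terms)
        elif len(terms) == 4:
            graph, triple = terms[3], tuple(terms[:3])  # graph name comes last
        else:
            raise ValueError(f"expected 3 or 4 terms, got {len(terms)}")
        if op == "A":
            dataset.setdefault(graph, set()).add(triple)
        elif op == "D":
            dataset.get(graph, set()).discard(triple)
        else:
            raise ValueError(f"unknown keyword: {op!r}")
    return dataset

def incoming_rows():
    """Stand-in for a file or network stream that produces rows lazily."""
    yield ("A", "_:b0a1", "<http://example/p>", '"v1"')
    yield ("A", "_:b0a1", "<http://example/p>", '"v2"', "<http://example/g>")
    yield ("D", "_:b0a1", "<http://example/p>", '"v1"')

dataset = apply_stream({}, incoming_rows())
```

Because nothing is buffered, the same function works for a short file and for an unbounded stream of changes.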
Abbreviation, Compression and Readability
To remove repetition, the value of an element in a row can be the keyword R, meaning it is the same as the element in the same position of the previous row. It is an error if there is no such row. This means repeated subjects, predicates or objects of changes do not need to be re-output for each quad. This may allow significant compression at very low CPU cost. General compression, such as gzip streams, may still be usefully applied.
If an element in a row is undefined, a U is used to mark this (not used in RDF Patch).
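As a rough illustration of how a writer might emit R, the Python sketch below compares each row with the one before it. The function name `compress_rows` is an assumption, rows are plain tuples of strings, and the operation keyword in the first position is always written out because RDF Patch requires A or D there.

```python
def compress_rows(rows):
    """Yield rows with 'R' wherever a term equals the previous row's term in that position."""
    previous = ()
    for op, *terms in rows:
        yield (op, *(
            "R" if i < len(previous) and term == previous[i] else term
            for i, term in enumerate(terms)
        ))
        previous = terms  # compare against the full row, not the abbreviated one

BOB = "<http://example/bob>"
rows = [
    ("D", BOB, "foaf:name", '"bob"'),
    ("A", BOB, "foaf:name", '"Bob"'),
    ("A", BOB, "foaf:knows", "<http://example/alice>"),
    ("A", BOB, "foaf:knows", "<http://example/charlie>"),
]
compressed = list(compress_rows(rows))
```

The cost is one comparison per term, which is why this kind of abbreviation is cheap compared with general-purpose compression.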
Prefixed names are allowed (e.g. rdf:type) using Turtle style @prefix and expansion rules.
Relative URIs are not allowed. There is no @base.
RDF Tuples Format
@@Make formal
Outline grammar
This uses the grammar and tokens from RDF 1.1 Turtle and SPARQL 1.1.
White space is allowed between tokens and may be needed to distinguish tokens.
Comments start with '#' and run to the end of the line.

    Term    = iri | BlankNode | literal
    RowItem = WORD | Term | VAR
    Row     = RowItem* DOT
    WORD    = PN_CHARS_BASE PN_LOCAL without allowing ':'   # Must start with a letter.
    VAR     = (from SPARQL)

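To make the row structure concrete, here is a deliberately simplified Python tokenizer. It is a sketch only: the regular expression approximates the Turtle and SPARQL tokens rather than implementing them (long string forms and escapes inside IRIs are not handled), and the name `read_rows` is hypothetical. It shows that rows end at DOT rather than at a newline and that comments are skipped.

```python
import re

TOKEN = re.compile(
    r'(?P<comment>#[^\n]*)'
    r'|(?P<iri><[^<>\s]*>)'
    r'|(?P<literal>"(?:[^"\\\n]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^(?:<[^<>\s]*>|[^\s.]+))?)'
    r'|(?P<dot>\.)'
    # Everything else: keywords, prefixed names, blank node labels, variables, numbers.
    r'|(?P<other>[^\s<>"#.]+(?:\.[^\s<>"#.]+)*)'
)

def read_rows(text):
    """Yield each row as a list of token strings."""
    row, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        pos = match.end()
        if match.lastgroup == "comment":
            continue
        if match.lastgroup == "dot":
            yield row
            row = []
        else:
            row.append(match.group())
    if row:
        raise ValueError("last row is missing its terminating DOT")

text = '''
# correct the name
D <http://example/bob> <http://xmlns.com/foaf/0.1/name>
    "Robert" .
A R R "Bob" . A R <http://example/p#age> 42 .
'''
parsed = list(read_rows(text))
```

The first row spans two lines and the last two rows share one line, yet each is recovered as a separate list of items.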
RDF Patch Use of RDF Tuples
@@Make formal
RDF Patch uses RDF Tuples for triples or quads (N-Quad order). It requires the first item in a row to be a keyword (A or D).
After the keyword in the first position, there are 3 or 4 RowItems, each of which is either an RDF term or the keyword R, which stands for the RDF term from the same position in the previous row.
Blank node labels are treated as encoded internal system identifiers. A system may refuse to perform changes if it does not recognize the label format.
RDF Patch does not allow VAR.
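These rules can be checked while expanding R back into full terms. The Python sketch below is one possible reading, not a normative algorithm: the name `expand_repeats` is hypothetical, rows are tuples of strings, and treating an R that has no aligned term in the previous row as a hard error is an assumption of the sketch.

```python
def expand_repeats(rows):
    """Validate RDF Patch rows and replace each R with the term it repeats."""
    previous = None
    for row in rows:
        op, terms = row[0], list(row[1:])
        if op not in ("A", "D"):
            raise ValueError(f"first item must be A or D, got {op!r}")
        if len(terms) not in (3, 4):
            raise ValueError(f"expected 3 or 4 items after the keyword, got {len(terms)}")
        for i, term in enumerate(terms):
            if term == "R":
                if previous is None or i >= len(previous):
                    raise ValueError("R has no term to repeat in this position")
                terms[i] = previous[i]
            elif term.startswith(("?", "$")):
                raise ValueError("RDF Patch does not allow variables")
        previous = terms
        yield (op, *terms)

BOB = "<http://example/bob>"
expanded = list(expand_repeats([
    ("D", BOB, "foaf:name", '"Robert"'),
    ("A", "R", "R", '"Bob"'),
]))
```

Because each row is expanded against the already expanded previous row, chains of R across several rows resolve correctly.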
Further RDF Patch Examples
Changes to a Graph

    A <http://example/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .

This is the same as:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    A <http://example/bob> foaf:name "Bob" .

A slightly longer change, which replaces a name and adds multiple triples:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    D <http://example/bob> foaf:name "bob" .
    A <http://example/bob> foaf:name "Bob" .
    A R foaf:knows <http://example/alice> .
    A R R <http://example/charlie> .

The last row adds the triple:

    <http://example/bob> <http://xmlns.com/foaf/0.1/knows> <http://example/charlie> .

The repeat marker can be used in any position and across adds and deletes:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    A <http://example/alice> foaf:knows <http://example/charlie> .
    A <http://example/bob> R R .

and:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    D <http://example/bob> foaf:name "Robert" .
    A R R "Bob" .

Use with HTTP PATCH
HTTP PATCH (RFC 5789) introduced a mechanism for sending changes to a web resource. The SPARQL Graph Store Protocol mentions the use of HTTP PATCH. RDF Patch is a suitable alternative to full SPARQL Update for the HTTP request body.

    PATCH /resource HTTP/1.1
    Host: example.org
    Content-Type: application/rdf-patch

    A <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .
    A <http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example/alice> .
    A <http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
    D <http://example.org/bob> R "Robert" .
    A R R "Bob" .

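A client could build such a request with Python's standard library, as in the sketch below. The target URL is a placeholder, the body is a shortened patch, and the request is only constructed, not sent, so the example runs without a server.

```python
import urllib.request

patch_body = (
    'D <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Robert" .\n'
    'A R R "Bob" .\n'
)

request = urllib.request.Request(
    "http://example.org/resource",        # placeholder resource URL
    data=patch_body.encode("utf-8"),      # RDF Patch is sent as UTF-8
    headers={"Content-Type": "application/rdf-patch"},
    method="PATCH",
)

# urllib.request.urlopen(request) would send it; omitted so the sketch stays offline.
```

Whether a given server accepts this media type on PATCH depends on the server; nothing here implies that a particular SPARQL server does.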
Named Graphs
Named graphs are changed by specifying quads:

    A <http://example/bob> foaf:name "Bob" <http://example/graphName> .

Canonical Patches
A further way to reduce the size of a diff file is to record only changes that have an actual effect. An RDF Dataset is a set of quads with no duplicates, so if a quad is added that is already in the dataset, no record is needed because no effective change occurs. Similarly for deletes: if the quad does not exist, a delete action has no effect.
An RDF Patch that only records actual changes is said to be a canonical patch.
However, testing before every insert or delete can be expensive or otherwise impractical, so canonical form is not required. For example, the patch

    A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
    A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

has the same effect as

    A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

and the single-row form is canonical if there was no such quad in the dataset before the delta started.
The quads don't need to be adjacent:

    A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
    A <http://example/other> <http://example/p> 123 <http://example/g2> .
    A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .

becomes

    A <http://example/s> <http://example/p> <http://example/o> <http://example/g> .
    A <http://example/other> <http://example/p> 123 <http://example/g2> .

When written in canonical form the RDF Patch can be run backwards, to undo a change. This only works when real changes are recorded because otherwise knowing a triple is added does not mean it was not there before.
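The following Python sketch shows one way to produce a canonical patch and to run it backwards. It assumes the dataset fits in memory as a set of quad tuples, which sidesteps the cost concern mentioned above; the function names `canonicalize` and `invert` are hypothetical.

```python
def canonicalize(dataset, rows):
    """Return (canonical_rows, resulting_dataset), keeping only rows that change the state."""
    state = set(dataset)
    canonical = []
    for op, *terms in rows:
        quad = tuple(terms)
        if op == "A" and quad not in state:
            state.add(quad)
            canonical.append((op, *quad))
        elif op == "D" and quad in state:
            state.discard(quad)
            canonical.append((op, *quad))
    return canonical, state

def invert(canonical_rows):
    """Undo a canonical patch: reverse the row order and swap adds with deletes."""
    swap = {"A": "D", "D": "A"}
    return [(swap[op], *terms) for op, *terms in reversed(canonical_rows)]

Q1 = ("<http://example/s>", "<http://example/p>", "<http://example/o>", "<http://example/g>")
Q2 = ("<http://example/other>", "<http://example/p>", "123", "<http://example/g2>")
before = set()
canonical, after = canonicalize(before, [("A", *Q1), ("A", *Q2), ("A", *Q1)])
_, restored = canonicalize(after, invert(canonical))
```

Inverting a non-canonical patch would be unsafe: the duplicate add of the first quad would turn into a delete of data that might have existed before the patch was applied.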
Blank Nodes
Both RDF-syntax-style blank node labels (document-scoped, with new blank nodes created) and store-scoped labels (referring to existing blank nodes) are needed. RDF provides skolemization as a mechanism to reference specific blank nodes when needed.
RDF Patch uses the special URI-like encoding <_:label> for directly identifying a blank node in a store.
Notes
This section contains working notes for the development of this document.
Simplicity of design leads to widespread adoption, which is important for the core use of managing change to RDF data. All features need to balance their utility for a particular use case with the increase in complexity and implementation costs.
Line Mode
The grammar does not specify that tuples must be one per line. Long lines can be split. If there is no reason to require one tuple per line, this is convenient even in a machine-oriented format.
But there are some advantages to a strictly one-tuple-per-line format.
Bulk processing with command line text processing tools is also much simpler if one tuple per line is required.
Some features below require markers in the data stream, so requiring them at the start of a line is convenient.
Proposal: One tuple per line.
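If the one-tuple-per-line proposal were adopted, simple line-oriented processing would need no tokenizer at all. The Python sketch below counts operations by looking only at the first word of each line; the function name `count_operations` and the sample lines are illustrative assumptions.

```python
from collections import Counter

def count_operations(lines):
    """Count rows per operation keyword, assuming exactly one tuple per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue  # skip blank lines, comments and @prefix declarations
        counts[line.split(None, 1)[0]] += 1
    return counts

sample = [
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .",
    "# rename Bob",
    'D <http://example/bob> foaf:name "Robert" .',
    'A R R "Bob" .',
    "A R foaf:knows <http://example/alice> .",
]
counts = count_operations(sample)
```

Without the line rule, the same count would require reading tokens up to each DOT, since a row could start anywhere on a line.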
Metadata
There are two metadata needs: linking and inline.
Linking
Linking can be achieved with an L (or LINK) record. However, the link may be unavailable or may change after the patch is created.
Inline
See this discussion
Inline metadata is not subject to such changes. It is made a little more complicated because RDF-patch is not an RDF syntax (the handling of blank nodes is different).
Proposal: A single line of METADATA starts or stops a block of characters that is passed to a Turtle parser. Blank nodes are handled as per RDF; full Turtle readability idioms apply. The block must not contain the string METADATA at the start of line.
Transaction Boundaries
The RDF Patch format can be used to record one set of related changes, a group of related changes or a stream of changes. When used to record one set of related changes, the changes should be applied atomically - they all happen or none of them happen.
@@ As part of a section in a more formal style, add the atomicity requirement.
We define atomicity as a "SHOULD" because we do not exclude the use of a data format just because of specific system issues, such as light-weight implementations or usages that do not need this feature.
While (ACID) transactions do give atomicity, they are not the only way to achieve it, and they are appropriate for large systems.
The requirement on the RDF Patch format is to have markers for the beginning and end of sections of changes.
Proposal: Add BEGIN and END to indicate a block of changes. A block should be applied atomically. This may be a transaction - it depends on the nature of the destination.
The start of file is an implicit BEGIN and the end of file is an implicit END.
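One way to read this proposal is sketched below in Python. It is speculative, since BEGIN and END are not yet defined: rows are tuples, BEGIN and END are modelled as one-item rows, either marker closes the block in progress, and atomicity is approximated by applying each block to a copy that replaces the dataset only if every row in the block succeeds. All function names are hypothetical.

```python
def split_blocks(rows):
    """Group rows into blocks; start and end of input act as implicit BEGIN and END."""
    block = []
    for row in rows:
        if row[0] in ("BEGIN", "END"):
            if block:
                yield block
            block = []
        else:
            block.append(row)
    if block:
        yield block

def apply_blocks(dataset, rows):
    """Apply each block to a working copy and commit it only if the whole block succeeds."""
    for block in split_blocks(rows):
        working = set(dataset)
        for op, *terms in block:
            if op == "A":
                working.add(tuple(terms))
            elif op == "D":
                working.discard(tuple(terms))
            else:
                raise ValueError(f"unknown keyword: {op!r}")  # abandons this block
        dataset.clear()
        dataset.update(working)  # commit point for the block

data = set()
rows = [
    ("BEGIN",), ("A", "s", "p", "o1"), ("END",),
    ("BEGIN",), ("A", "s", "p", "o2"), ("X", "s", "p", "o3"), ("END",),
]
try:
    apply_blocks(data, rows)
    failed = False
except ValueError:
    failed = True
```

The first block is committed, while the second is abandoned as a whole when its malformed row is reached, so none of its changes become visible.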
Alignment Errors
Error handling : from tweet
Need to define what happens when R does not align to the previous tuple (presumably an error - define recovery for stream use).
Binary Format
An alternative wire format for efficient processing.
(Need to quantify the gains, if any).
Bulk Operations
In addition to D operations on a single quad, have a "bulk delete", either as a different operation or as, for example, D <http://example/s> ANY ANY ANY ..
c.f. SPARQL DROP ALL and DELETE WHERE {...}.
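The wildcard form could behave as in the Python sketch below. This only illustrates the idea in the note above: ANY is not part of RDF Patch, and the name `bulk_delete` and the tuple representation of quads are assumptions.

```python
ANY = "ANY"  # wildcard keyword floated in these notes; not defined by RDF Patch

def bulk_delete(quads, pattern):
    """Return the quads that remain after deleting every quad matching `pattern`."""
    def matches(quad):
        return len(quad) == len(pattern) and all(
            wanted == ANY or wanted == term for wanted, term in zip(pattern, quad)
        )
    return {quad for quad in quads if not matches(quad)}

quads = {
    ("<http://example/s>", "<http://example/p>", '"1"', "<http://example/g>"),
    ("<http://example/s>", "<http://example/q>", '"2"', "<http://example/g2>"),
    ("<http://example/t>", "<http://example/p>", '"3"', "<http://example/g>"),
}
remaining = bulk_delete(quads, ("<http://example/s>", ANY, ANY, ANY))
```

A single pattern row replaces what would otherwise be one D row per matching quad.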
Internet Media Type, File Extension and Macintosh File Type
The Internet Media Type / MIME Type for RDF Patch is "application/rdf-patch".
The character set for RDF Patch is "UTF-8".
It is recommended that RDF Patch files have the extension ".rdfp" (all lowercase) on all platforms.
It is recommended that RDF Patch files stored on Macintosh HFS file systems be given a file type of "TEXT".