RDF Binary encoding using Thrift

Binary RDF is useful for efficient processing and transfer rather than relying on text-based formats. Text based formats are more expensive to write and to parse.

The text oriented syntaxes (e.g. Turtle) provide human readability while the line-oriented syntaxes (e.g. N-Triples provide reasonable universal dump and machine exchange formats on the web. N-triples, despite being larger than Turtle, is generally reported to be faster to parse.

Apache Thrift, and Google’s Protocol Buffers are data formats designed for efficient exchange of data between co-operating processes. This data exchange may be via disk or network. The formats are design for efficient processing and not human readability.

RDF Binary is a uses Apache Thrift for a binary encoding for fast machine encoding and decoding. Apache Thrift provides libraries for reading and writing the encoding in a wide variety of programming languages.

Extended RDF Terms

The datastructures of graph/datasets and SPARQL result sets (e.g. in JSON) build on RDF Terms. RDF Thrift defines RDF Terms in Thrift and adds some forms to give a single set of basic building blocks.

  • RDF Terms – IRIs, literals and blank nodes
  • Prefixed names – shortened IRIs for compression. See prefixed names in Turtle or SPARQL.
  • Named variables (as in SPARQL)
  • ANY – a wildcard often found in RDF APIs for finding triples etc.
  • UNDEF – explicit indication that something does not have a value (e.g. for SPARQL results).

A possible future (meta) term is

  • REPEAT – a marker to indicate the same term is to be used as the previous “row”

In addition, there is a way to declare a prefix and the associated IRI.

These basic elements are used for

  • triples and quads for graphs and datasets
  • rows of values, for SPARQL result sets

This makes for reuse of code to process these different datastructures and simplicity, efficiency and easy of implementation with the assumption that each datastructure built on top is careful about erroneous use of these elements.

Prefix Declarations

Prefix declarations can be inserted at any point where a triple or quad is expected and apply from that the end of the declaration in the data stream.

Graphs and Datasets

Content type: application/rdf+thrift
File extensions: rt, trdf
  • Triple (3 RDF terms)
  • Quad (4 RDF terms)
  • Prefix declaration.

A graph is a stream comprised only of prefix declarations and triples.

A dataset is a stream comprised only of prefix declarations, triples and quads. (A triple is in the default graph; quads go into named graphs.)

This format is like N-Triples and N-Quads, except encoded in binary, and with prefixed names as a way to write IRIs. This means the prefix declarations travel with the triples and quads as they do with Turtle and TriG.

Prefix declarations can be inserted at any point where a triple or quad is expected and apply from that the end of the declaration in the data stream.

SPARQL Result Set

Content type: application/sparql-results+thrift
File extensions: srt
  • Header row: a list of variables.
  • Data row: a list of RDF terms to match the header.
  • Prefix declaration.

A SPARQL Result Set has one row of variables (the header row) and zero or more rows of data rows. The header row is mandatory.

All data rows are the same length as the header row.

Prefix declarations can be inserted at any point where a row is expected and apply from that the end of the declaration in the data stream.

A data row uses UNDEF to indicate that a variable not set in the result set for this

In addition, a data row may use the term REPEAT to indicate that the same RDF term from the previous row is used again. REPEAT is illegal in the first data row.

(REPEAT subject to deployment testing that it does provide useful data reduction.)

Details of Thrift Encoding

This section details the encoding in Thrift.

Encoding values

Use of encoding literals by value is optional.

Some literals can be encoded using their values, rather than their lexical form and datatype. This can lead to a reduction is space. It can result in the changes to the term encoded.

The exact lexcial form is not retained, so "001"^^xsd:integer, "+1"^^xsd:integer and "1"^^xsd:integer are all encoded as the value 1 and when read, will result in the same RDF term ("1"^^xsd:integer is the XSD canonical form).

Some derived dataypes are lost - xsd:long, xsd:int, xsd:short, xsd:byte are encoded as their integer value and will become xsd:integer when read back in.

Input Datatype Value space Outcome Datatype
xsd:integer Integer xsd:integer
xsd:long Integer xsd:integer
xsd:int Integer xsd:integer
xsd:short Integer xsd:integer
xsd:byte Integer xsd:integer
xsd:decimal Decimal xsd:decimal
xsd:double Double xsd:double

Whether this matters, depends on the data. For use as a failthful, general purpose database dump format, value encoding should not be used.

Prefixes

The use of prefixes can reduce the size of the data because it replaces common character sequences with a smaller string.

There are no detailed syntax rules for prefixes, unlike Turtle and SPARQL where, for example, the local part can not include a # (it must be \#) and the local part can’t start with a ..

In RDf Thrift, the prefix part is any string, the local part is any string. The reconsistuted URI is the concatenation of the URI for the prefix and the local part.

There are no escape sequences in either part. Neither % nor \ are special.

Thrift encoding of RDF Terms

RDF Thrift uses the Thrift compact protocol.

struct RDF_IRI {
1: required string iri
}

struct RDF_PrefixName {
1: required string prefix ;
2: required string localName ;
}

struct RDF_BNode {
1: required string label
}

struct RDF_Literal {
1: required string  lex ;
2: optional string  langtag ;
3: optional string  datatype ;
4: optional RDF_PrefixName dtPrefix ;
}

struct RDF_Decimal {
1: required i64  value ;
2: required i32  scale ;
}

struct RDF_VAR {
1: required string name ;
}

struct RDF_ANY { }

struct RDF_UNDEF { }

struct RDF_REPEAT { }

union RDF_Term {
1: RDF_IRI          iri
2: RDF_BNode        bnode
3: RDF_Literal      literal
4: RDF_PrefixName   prefixName 
5: RDF_VAR          variable
6: RDF_ANY          any
7: RDF_UNDEF        undefined
8: RDF_REPEAT       repeat
9: RDF_Triple       tripleTerm  # RDF-star

# Value forms of literals.
10: i64             valInteger
11: double          valDouble
12: RDF_Decimal     valDecimal
}

Thrift encoding of Triples, Quads and rows.

struct RDF_Triple {
1: required RDF_Term S
2: required RDF_Term P
3: required RDF_Term O
}

struct RDF_Quad {
1: required RDF_Term S
2: required RDF_Term P
3: required RDF_Term O
4: optional RDF_Term G
}

struct RDF_PrefixDecl {
1: required string prefix ;
2: required string uri ;
}

Thrift encoding of RDF Graphs and RDF Datasets

union RDF_StreamRow {
1: RDF_PrefixDecl   prefixDecl
2: RDF_Triple       triple
3: RDF_Quad         quad
}

RDF Graphs are encoded as a stream of RDF_Triple and RDF_PrefixDecl.

RDF Datasets are encoded as a stream of RDF_Triple, RDF-Quad and RDF_PrefixDecl.

Thrift encoding of SPARQL Result Sets

A SPARQL Result Set is encoded as a list of variables (the header), then a stream of rows (the results).

struct RDF_VarTuple {
1: list<RDF_VAR> vars
}

struct RDF_DataTuple {
1: list<RDF_Term> row
}