RDF and IRI Syntax

Andy Seaborne

February, 2020

RDF uses IRIs to refer to resources.

This page pulls together the syntax requirements for IRIs into one place. There are several documents, mostly RFCs, that define IRIs and specific IRI schemes. Getting IRIs right means data is more likely to be readable by other systems when the data is published.

Just because an IRI passes all the syntax rules, it does not make it a good choice.

URIs are defined by RFC 3986 and IRIs in the companion RFC 3987 which gives the modifications necessary for the wider range of Unicode characters in URIs.

URI schemes can add constraints on the URI syntax; for example, RFC 7230 defines the http and https shemes.

This article introduces the terminology ‘RDF Reference’ to put all the implications into one definition.

TOC

IRIs

This article will use URI and IRI interchangeably. “IRIs are a generalization of URIs that permits a wider range of Unicode characters.

The RDF Concepts document says:

“IRIs in the RDF abstract syntax MUST be absolute, and MAY contain a fragment identifier.”

“Relative IRIs must be resolved against a base IRI to make them absolute.”

As of RFC 3986, relative IRIs are called “relative references”.

Definition of URI syntax

In RFC3986 section-4.1 defines “URI”.

A “URI” is URI-reference after it has been resolved.

The relevant part of the grammar in RFC3986 appendix-A is:

   absolute-URI  = scheme ":" hier-part [ "?" query ]

   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

We’ve already got to one important point - an absolute URI. An absolute URI has a URI scheme, and does not have a fragment.

An absolute URI with a fragment is just “URI”.

What we want is the full URI rule and also require it uses the “// authority” rule if a scheme involves that component.

Specifc URI schemes can add additional requirments. HTTP (RFC 7230) requires an http URI to have the part "//" authority; the urn scheme does not have an authority.

URN (RFC 8141) does not have an authority part - it uses the path-rootless production. It does additionally require the path to have two colons, the NID part must be at least two characters, and the NSS part at least one character.

Relative References

RDF syntax may use relative references. The process of parsing of a document means that any relative reference is converted to an URI to ensure it identifies the same resource everywhere. This is called resolving against a base URI - there is always a base.

Relative references are short cuts, for the full URI after resolving against the base.

By the time they get to RDF abstrat syntax (the datastructure), relative references should have been converted.

Example

By the definitions in RFC 3986:

    http:abcd
    http:/abcd

are absolute URIs - it has a URI scheme, it does not have a fragment.

But it is not an “http-URI” which is defined in RFC7230 sec 2.7.1 because http-URI defines additional syntax requirements. An http-URI requires the "//" authority path-abempty choice in hier-part.

With non-strict resolution, and an valid HTTP base URI, the examples won’t appear in RDF abstract syntax.

Notes about IRIs/URIs

  • Space (U+0200) is not legal in an IRI.
  • @, ~, (, ) and : are legal in the path, query and fragment components.
  • { and } are not legal in IRIs.
  • [ and ] are only legal as IPv6 address delimiters.
  • Encoding with “%hex” is not an escape mechanism. %20 puts three characters into the IRI: %, 2, 0.

Normalization

Normalization of the syntax (RFC 3998, section 6.2.2 gives some simple rules to make it easier to compare URIs as strings.

By the URI syntax the two characters after % must be legal hex digits (%ST is a syntax error). Normalization prefers the letters “A-F” to be uppercase.

HTTP

RFC7230 sec 2.7.1

  http-URI = "http:" "//" authority path-abempty [ "?" query ] [ "#" fragment ]

so the http URI scheme adds a requirement that there must be a “//” and authority (host and port), followed by an absolute path (starts with “/”) or is absent (empty string, no “/”).

URN

The syntax of the urn URI scheme

      assigned-name = "urn" ":" NID ":" NSS
      NID           = (alphanum) 0*30(ldh) (alphanum)
      ldh           = alphanum / "-"
      NSS           = pchar *(pchar / "/")

The older RFC 2141 allowed “X-…” as NID.

While a URN allows r-component, q-component and f-component, the latter being a URI fragment, usually it is just the assigned-name form used for resource identification.

The urn scheme only applies to ASCII, not the addition characters of IRIs.

An NID must be at least 2 characters, and the first and last must be alphanumeric.

A NSS must be at least one character.

It is not uncommon to see urn:x:... in test data - unfortunately, that isn’t a legal URI in the urn scheme because the “x” is the NID part and is too short.

By the general URI syntax, the URI path component is the “NID:NSS” part of the URN.

UUIDs

The correct way to use UUIDs (RFC 4122) is urn:uuid:0e5f5ff6-6c80-4786-84b9-4c121bb3ae9e. Hex digits should be lower case.

There was a proposal for a “uuid” scheme, it is is sometimes seen but it was only ever a draft.

“file” URI scheme

The file URI scheme had for a long time been only loosely defined in RFC 1738 section 3.10. Common usage was beyond the definition; character set issues were unclear.

RFC 8089 is a formal defintion compatible with the URI syntax for RFC 3986. It includes common usage such as relative filenames (relative to their point of use), for example file:directory/file.txt, can be used.

While for RDF the file schema is of limited use, the file URI scheme is useful when working with local files, for example using RDF for configuration files on the local machine. Such URIs will be of the form file:/// (that is, 3 ‘/’) using the file scheme convention that “localhost” can be dropped.

RDF References

It woudl be useful to pull all these considerations together into a distinct piece of terminolgy

“RDF References”:

  • It is a URI

That means after resolving against a base.

It has a scheme name. It may have query and fragment parts. There is always a “path” even if it is the empty string.

  • It SHOULD NOT have a “user:password@”, the user-password part of authority.
    This is deprecated in RFC 3986 section 3.2.1 and any “userinfo” is further restricted RFC 7230 section 2.7.1.

  • It follows the additional restrictions of the URI scheme.

This can be tricky for the parser to check if it does not know the scheme, but when generating URIs, the software generating a URI should follow the URI scheme.

  • The scheme-specific rules for http, https and urn schemes are required:

    • If ‘http:’ it follows the HTTP scheme rule: http-URI
    • If ‘urn:’, it matches the requirment for “urn:2+chars:1+char”
  • Hex in %-encoding SHOULD be uppercase.

Gotchas

In Turtle related synatxes, there are two places where “partial URIs” are used.

PREFIX u: <urn:uuid:>

u:66d5b9e2-5abe-49be-bfc9-1ed0d997e07f

but urn:uuid: is not a legal URN.

BASE <urn:>

## Resolves to urn:uuid:66d5...
<uuid:66d5b9e2-5abe-49be-bfc9-1ed0d997e07f>

but urn: is not a legal URNs.

Links to relevant documents:

Defined by:
URI RFC 3986
IRI RFC 3987
http: RFC 7230
urn: RFC 8141
urn:uuid: RFC 4122
did: W3C DID
file: RFC8089


Registries: