The World Development Indicators (WDI) dataset that we are converting to RDF is already prepared in SDMX-ML Compact format. A dataset described in SDMX-ML will have at least two XML files: a data file and a data structure definition (DSD) file. To produce RDF we need to use both of these files.
The observations in the data file are associated with coded values. For example, a series of yearly observations may sit under a series heading where FREQ="A", SERIES="NY_ADJ_NNTY_KD_ZG" and REF_AREA="AFG". Abbreviating these values makes the SDMX XML more compact and efficient, but unless we are familiar with the data we can only guess that "AFG" might stand for Afghanistan, that the frequency is annual, and that the data refers to "Adjusted net national income (annual % growth)". The descriptive text for all the abbreviated fields is held in the DSD file. The DSD file also defines the structure of the dataset, including which concepts (dimensions) and observation statuses (footnotes) the dataset uses.
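A simplified sketch of what such a series looks like in the data file may help. The element and attribute names below follow SDMX-ML Compact conventions, but the namespaces are omitted for brevity and the observation values are invented for the example:

```xml
<!-- Illustrative SDMX-ML Compact fragment: namespaces omitted,
     OBS_VALUE figures invented purely for the example. -->
<Series FREQ="A" SERIES="NY_ADJ_NNTY_KD_ZG" REF_AREA="AFG">
  <Obs TIME_PERIOD="2008" OBS_VALUE="3.9"/>
  <Obs TIME_PERIOD="2009" OBS_VALUE="21.0"/>
</Series>
```

Everything the reader needs to interpret the codes FREQ, SERIES and REF_AREA lives in the separate DSD file.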
Modelling the RDF
Often the most difficult aspect of exposing datasets via RDF is modelling the data. The Publishing Statistical Data working group (of which Mimas is part) has already created a vocabulary mapping SDMX objects to the corresponding entities in RDF. The diagram below is an extract from the RDF Data Cube vocabulary (http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html#outline) and shows an outline of the vocabulary.
To utilise this framework we have created RDF classes that reference the SDMX objects for datasets, key-families, observations and so forth.
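To make the mapping concrete, here is a hypothetical RDF/XML sketch of a dataset description. The example.org URIs are placeholders, but qb:DataSet, qb:structure and qb:DataStructureDefinition are terms defined by the RDF Data Cube vocabulary:

```xml
<!-- Hypothetical sketch: example.org URIs are placeholders; the qb:
     terms come from the RDF Data Cube vocabulary. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:qb="http://purl.org/linked-data/cube#">
  <qb:DataSet rdf:about="http://example.org/dataset/WDI">
    <!-- qb:structure links the dataset to its structure definition,
         mirroring the SDMX data file / DSD file pairing -->
    <qb:structure>
      <qb:DataStructureDefinition rdf:about="http://example.org/structure/WDI"/>
    </qb:structure>
  </qb:DataSet>
</rdf:RDF>
```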
SDMX has been designed to encompass a wide diversity of statistical data; the WDI dataset uses only a small subset of what SDMX offers. What we aim to do is convert the SDMX objects in the source XML to their equivalent RDF objects. As the SDMX-ML source files are a form of XML and the RDF output we are producing is a number of XML files, we have chosen to use XSLT transformations to do the conversion. XSLT is a standard technique for manipulating XML: it provides mechanisms for pattern matching in XML, and for collecting and rewriting data.
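As a rough sketch of the approach (not our actual stylesheet), an XSLT template can match each SDMX observation and emit a corresponding qb:Observation. The namespaces on the SDMX source elements are omitted here, and the example.org URI pattern is hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: SDMX source namespaces omitted, URI pattern invented. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:qb="http://purl.org/linked-data/cube#">

  <xsl:template match="/">
    <rdf:RDF>
      <xsl:apply-templates select="//Series/Obs"/>
    </rdf:RDF>
  </xsl:template>

  <!-- One qb:Observation per SDMX Obs, keyed by the parent Series;
       the curly braces are attribute value templates filled from the
       source attributes -->
  <xsl:template match="Obs">
    <qb:Observation
        rdf:about="http://example.org/data/{../@SERIES}/{../@REF_AREA}/{@TIME_PERIOD}">
      <qb:dataSet rdf:resource="http://example.org/dataset/WDI"/>
      <rdf:value><xsl:value-of select="@OBS_VALUE"/></rdf:value>
    </qb:Observation>
  </xsl:template>
</xsl:stylesheet>
```

The real transformation also has to resolve each coded value against the DSD so that the output carries meaningful labels rather than abbreviations.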
On the face of it, the work would seem straightforward; however, we have encountered some difficulties in the process. SDMX has no requirement for unique field or key-family (code list) names, because each pair of files can be considered in isolation. In a collection of RDF data, unique identifiers are a key principle, so duplicate object names cannot be used. We use URIs to identify objects, and it is good practice to use a human-readable URI rather than an abbreviation or code. This forces us to be consistent in the capitalisation we use for identifiers and prevents us from using characters that are illegal in URIs (or that change their meaning, such as “/”, “\” or “:”).
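One way to enforce this inside the transformation (a hypothetical helper, not necessarily the rule set we use) is a small XSLT 1.0 named template that maps the troublesome characters to safe ones with translate():

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical helper: not our production rules. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- translate() replaces each character found in the second argument
       with the character at the same position in the third, so spaces
       become underscores and "/", "\" and ":" become hyphens -->
  <xsl:template name="safe-id">
    <xsl:param name="name"/>
    <xsl:value-of select="translate($name, ' /\:', '_---')"/>
  </xsl:template>
</xsl:stylesheet>
```

Consistent capitalisation can be handled the same way, by passing full upper- and lower-case alphabets to translate().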
Having created and tested a transformation process, we are now ready to begin converting the full WDI dataset, and we hope to have this completed within the next week. The Mimas Linked Data working group is currently looking at implementing a triple store and we hope the dataset will eventually reside there. However, for the short term we are using simple flat files on a web server.
To show how the WDI RDF can be linked out, we will spend the final few weeks of the project building a demonstrator product. This will likely take the form of some kind of visualisation.