This blog post explains in detail the technical process we used to convert SDMX-ML data to RDF for exposure to the web of Linked Data. For background on the project and the datasets we are using, please see this blog post.
Often the most difficult aspect of exposing datasets via RDF is modeling the data. The Publishing Statistical Data working group (of which Mimas is part) has created a vocabulary mapping SDMX objects to the corresponding entities in RDF.
This framework defines the objects corresponding to the SDMX file structure. To convert our data into RDF we had two main tasks:
- Creation of unique URIs to identify our SDMX objects, including Code Lists, Coded Values, Datasets and Observations. No URIs existed for the data, so we defined these ourselves. The URIs are of two types. The first is a URI for each dataset: we chose to have an individual RDF file for each WDI series, and the URI for a series is the file itself. The second is a URI that references a part of a document using the hash (#) notation: our observations are identified as parts of the series file and, in a similar way, the coded values are identified as parts of our code list file. These URIs are referenced by other parts of the RDF output; for example, the URIs for our code lists are used in our dataset RDF. It is therefore necessary to settle on a URI pattern and use it consistently. The following are examples of the URI structure we have used.
There are four codelist files and one file for each WDI series.
- Outputting RDF data.
Outputting the data is done using XSLT to convert several SDMX-ML files into the target RDF format. Since the Data Cube vocabulary has already defined the SDMX features, the XSLT required to turn SDMX into RDF is fairly straightforward. RDF can be serialised in a number of formats such as RDF/XML, Turtle, N3 and JSON. An online converter can be used to convert our RDF/XML to any of the supported formats. We used any23.org to convert examples in Turtle to RDF/XML during development.
The DSD file is used to generate the RDF objects that represent the code lists and coded values for the data cube: entities such as countries, observation frequency and units.
Since the source SDMX data and DSD file are both XML formats, and we chose the XML flavour of RDF as our output, we used XSLT (Extensible Stylesheet Language Transformations) to convert the data.
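The two kinds of URI described above (a document URI for each series or code list file, and a hash URI for each observation or coded value inside it) can be sketched in a few lines of Python. This is only an illustration of the pattern: the base URI, the path layout and the function names are assumptions made for the example, not the URIs we actually published.

```python
from urllib.parse import quote

# Hypothetical base URI; the real base used by the project is not shown here.
BASE = "http://example.org/wdi/"

def series_uri(series_code: str) -> str:
    """Document URI: the RDF file for one WDI series is the series URI."""
    return f"{BASE}series/{quote(series_code, safe='')}.rdf"

def observation_uri(series_code: str, area: str, period: str) -> str:
    """Hash URI: an observation is identified as a part of its series file."""
    fragment = quote(f"{area}_{period}", safe="")
    return f"{series_uri(series_code)}#{fragment}"

def coded_value_uri(code_list: str, code: str) -> str:
    """Hash URI: a coded value is identified as a part of its code list file."""
    return f"{BASE}codelist/{quote(code_list, safe='')}.rdf#{quote(code, safe='')}"

print(series_uri("NY_ADJ_NNTY_KD_ZG"))
print(observation_uri("NY_ADJ_NNTY_KD_ZG", "AFG", "2005"))
print(coded_value_uri("freq", "A"))
```

Keeping the URI construction in one place like this is what makes it practical to reference the same code list URIs from every dataset file.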
Downloading the Source Data
We downloaded the source World Development Indicators data from the World Bank’s Databank service. There are two parts to an SDMX-ML dataset:
- A single data structure definition (DSD) file, which denotes the structure of the dataset
- Numerous data files in SDMX-ML Compact format.
The observations in the data file are associated with coded values from the DSD. For example, a series of yearly observations would be coded as FREQ="A" SERIES="NY_ADJ_NNTY_KD_ZG" REF_AREA="AFG". Abbreviating these values makes the SDMX-ML more compact and efficient. The descriptive text for all the abbreviations is held in the DSD file.
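To show why the compact format saves space, here is a small Python sketch that expands each observation into a full record carrying the series-level codes. The XML below is a simplified, namespace-free stand-in for SDMX-ML Compact: the element names, the TIME_PERIOD and OBS_VALUE attribute names and the observation values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an SDMX-ML Compact fragment (illustrative values only).
SAMPLE = """
<DataSet>
  <Series FREQ="A" SERIES="NY_ADJ_NNTY_KD_ZG" REF_AREA="AFG">
    <Obs TIME_PERIOD="2005" OBS_VALUE="1.0"/>
    <Obs TIME_PERIOD="2006" OBS_VALUE="2.0"/>
  </Series>
</DataSet>
"""

def flatten(xml_text: str) -> list:
    """Expand every observation into a record that repeats the series codes."""
    records = []
    for series in ET.fromstring(xml_text).iter("Series"):
        for obs in series.iter("Obs"):
            # The codes are stored once on the series and inherited by each observation.
            records.append({**series.attrib, **obs.attrib})
    return records

for record in flatten(SAMPLE):
    print(record)
```

Each expanded record repeats the three codes that the compact file stores only once per series.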
Preparation of the SDMX-ML
The size of the original dataset causes some issues. The SDMX-ML file is loaded and processed as an XML tree of nodes, and for a dataset of this size that takes up a large amount of memory. To mitigate this, the dataset was broken up into smaller files for processing.
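One way to break a large file into smaller pieces without first loading the whole tree is to stream it. The Python sketch below uses `xml.etree.ElementTree.iterparse` for this; it is not necessarily how we split our files, and the element and attribute names in the sample are assumptions.

```python
import io
import xml.etree.ElementTree as ET

def split_series(source, tag="Series"):
    """Yield (series_code, xml_bytes) per series without holding every observation in memory."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.get("SERIES"), ET.tostring(elem)
            elem.clear()  # free this series' observations once it has been handed over

# Tiny illustrative input; a real file would be opened from disk instead.
SAMPLE = b"""<DataSet>
  <Series SERIES="SERIES_ONE"><Obs TIME_PERIOD="2005" OBS_VALUE="1.0"/></Series>
  <Series SERIES="SERIES_TWO"><Obs TIME_PERIOD="2005" OBS_VALUE="2.0"/></Series>
</DataSet>"""

for code, chunk in split_series(io.BytesIO(SAMPLE)):
    # In practice each chunk would be written to its own file, e.g. f"{code}.xml".
    print(code, len(chunk), "bytes")
```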
A second issue is the size and redundancy of the SDMX-ML format. For example, our target files will only contain data in English, although we also have structural information in French, so the language attribute (xml:lang="en") on every coded value in the structure is unnecessary. The SDMX data also contains prefixed elements that specify which part of the SDMX standard the file is using. We are only converting the current format, so this level of detail is not needed for the RDF and complicates the XSLT transformations. To address this, we first simplified the SDMX-ML by converting it to an intermediate XML format. A further transformation then incorporated the required RDF namespaces and the URIs for the supporting RDF framework files (SDMX-RDF).
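The simplification step can be pictured with the following Python sketch, which drops namespace prefixes, keeps only the English text and removes the now redundant xml:lang attribute. We did this with XSLT; the Python version, the `simplify` function and the sample namespace URI are only illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree expands xml:lang to this fully qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def simplify(elem: ET.Element, lang: str = "en") -> ET.Element:
    """Return a copy with namespaces removed and only `lang` children kept."""
    out = ET.Element(elem.tag.split("}")[-1])  # drop the "{namespace}" part of the tag
    out.text = elem.text
    for name, value in elem.attrib.items():
        if name != XML_LANG:  # the language attribute is redundant after filtering
            out.set(name.split("}")[-1], value)
    for child in elem:
        if child.get(XML_LANG, lang) == lang:  # children with no language tag are kept
            out.append(simplify(child, lang))
    return out

# Hypothetical DSD fragment; the namespace URI is a placeholder.
SAMPLE = """
<structure:Code xmlns:structure="urn:example:structure" value="A">
  <structure:Description xml:lang="en">Annual</structure:Description>
  <structure:Description xml:lang="fr">Annuel</structure:Description>
</structure:Code>
"""

print(ET.tostring(simplify(ET.fromstring(SAMPLE)), encoding="unicode"))
```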
Preparation of DSD
The DSD file is transformed into two intermediate files: code lists and codes.
The code lists file is used later as the principal instruction for the XSLT process. This file contains only the code list IDs and English names. There are four code lists defined in the SDMX:
- Frequency code list (observation frequency)
- Reference area code list (countries / areas)
- Units multipliers code list
- Series code lists for World Development Indicators (all field names / dimensions)
The codes file is a simple XML file, used subsequently as a lookup, that holds three values for each coded value:
- Code list name (e.g. Frequency code list)
- Code (e.g. A)
- Description (e.g. Annual)
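As a rough picture of how such a codes file can serve as a lookup, the Python sketch below loads it into a dictionary keyed by code list name and code. The XML layout and element names are assumptions; the intermediate file we actually produced may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for the intermediate codes file.
CODES_XML = """
<codes>
  <code><list>Frequency code list</list><value>A</value><description>Annual</description></code>
  <code><list>Reference area code list</list><value>AFG</value><description>Afghanistan</description></code>
</codes>
"""

def load_codes(xml_text: str) -> dict:
    """Build a {(code list name, code): description} lookup table."""
    lookup = {}
    for code in ET.fromstring(xml_text).iter("code"):
        key = (code.findtext("list"), code.findtext("value"))
        lookup[key] = code.findtext("description")
    return lookup

codes = load_codes(CODES_XML)
print(codes[("Frequency code list", "A")])
```

Keying on the pair rather than the code alone avoids clashes if two code lists happen to use the same abbreviation.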
Check URI validity
A principle of RDF is that entities must be unique (and so it follows that URIs should be unique). According to best practice, URIs should be constructed with human-readable names. Because names built from descriptions would collide if two coded values shared the same description, we checked that no description was duplicated.
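A duplicate check of this kind is simple to express. The Python sketch below groups codes by description and reports any description used more than once; the case-insensitive, whitespace-trimmed comparison is an assumption about what should count as a duplicate.

```python
from collections import defaultdict

def find_duplicate_descriptions(codes):
    """Return {normalised description: [codes]} for descriptions used more than once."""
    seen = defaultdict(list)
    for code, description in codes:
        seen[description.strip().lower()].append(code)
    return {desc: ids for desc, ids in seen.items() if len(ids) > 1}

# Illustrative input: the third entry deliberately clashes with the first.
sample = [("A", "Annual"), ("M", "Monthly"), ("X", "annual ")]
print(find_duplicate_descriptions(sample))
```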
Creation of code lists and coded values in RDF
The RDF for each code list contains a series of definitions for the code list itself, followed by individual definitions for each coded value within the list. The purpose of these structures is to identify the SDMX code list with the Code list and Concept objects provided in the Data Cube framework. Two URIs exist for each code list: the first relates to the SDMX concept and is capitalised, while the other has an initial lowercase letter. There is one RDF file for each code list. Code list and coded value names must not contain characters that are incompatible with URIs and file paths, so we checked every name before using it.
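The naming rules above could be checked along these lines. This Python sketch rejects names containing characters outside a conservative set and derives the capitalised and lowercase-initial forms; the allowed character set and the function names are assumptions, not the exact checks we ran.

```python
import re

# Conservative assumption: a letter followed by letters, digits, "_" or "-".
SAFE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")

def is_safe_name(name: str) -> bool:
    """True if the name can be used unchanged in both a URI and a file path."""
    return SAFE_NAME.fullmatch(name) is not None

def code_list_names(name: str) -> tuple:
    """Return the (capitalised, lowercase-initial) pair of local names."""
    if not is_safe_name(name):
        raise ValueError(f"unsafe name for a URI or file path: {name!r}")
    return name[0].upper() + name[1:], name[0].lower() + name[1:]

print(code_list_names("refArea"))
print(is_safe_name("Units / multipliers"))
```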
Validation of SDMX Files
The files were checked for completeness and that they were well-formed.
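A well-formedness check needs nothing more than an XML parser. The Python sketch below returns the parser's error message for a malformed file and nothing for a good one; it covers only well-formedness, not the completeness check, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(xml_text: str):
    """Return None if the text parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None

print(well_formedness_error("<Series><Obs/></Series>"))  # well-formed
print(well_formedness_error("<Series><Obs></Series>"))   # unclosed Obs element
```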
Conversion of Dataset Files to RDF
To allow RDF to be downloaded for individual series an RDF file was created for each series. Firstly, the XSLT matches on the root and puts together the dataset information. This includes namespaces, dimensions, measures and the DSD. This allows each observation to reference the SDMX-RDF objects and the required code list, code, measure and unit objects that we have already built.
For efficiency and memory conservation, one lookup was performed per series rather than per observation.
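The saving from looking codes up once per series can be seen in this small Python sketch. The series-level codes are resolved a single time and the result is reused for every observation; the function, the record layout and the stand-in lookup are assumptions for illustration rather than a transcription of our XSLT.

```python
def resolve_series(series_codes, observations, lookup):
    """Resolve the series-level codes once, then reuse them for every observation."""
    resolved = {dim: lookup(dim, code) for dim, code in series_codes.items()}
    return [
        {**resolved, "period": period, "value": value}
        for period, value in observations
    ]

calls = []

def counting_lookup(dimension, code):
    """Stand-in for the codes-file lookup that records how often it is called."""
    calls.append((dimension, code))
    return f"{dimension}:{code}"

rows = resolve_series(
    {"FREQ": "A", "REF_AREA": "AFG"},
    [("2005", 1.0), ("2006", 2.0), ("2007", 3.0)],  # illustrative values
    counting_lookup,
)
print(len(rows), "observations,", len(calls), "lookups")
```

With a per-observation lookup the number of calls would grow with the number of observations; here it depends only on the number of coded dimensions in the series.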
The code used for this process can be downloaded here.