Final Product Post (ESDS) – DOAP

DOAP Header VALUE
Unique Project Tag mimasld
Full name of project Mimas Linked Data (ESDS)
Short project description Mimas aims to increase its expertise in dealing with Linked Data, to engage fully with other initiatives in this area, and to make data from its production services available.
Long Project Description The Publishing Statistical Data working group was established following a collaborative workshop hosted by the ONS in February 2010.  It has drafted a model that allows SDMX, the established standard for aggregate statistical data and metadata, to be mapped to RDF.  We propose applying this model to the World Bank World Development Indicators (WDI), a popular dataset which has recently been made openly available, enabling it to be exposed as Linked Data.

Applying the SDMX-RDF Model in Practice

The WDI is an authoritative and comprehensive time series dataset which, whilst relatively compact in size, is very popular with ESDS users.  It comprises over 700 development indicators, covering social, economic, financial, natural resource and environmental topics. Data run from 1960 onwards and cover most countries in the world.  Recently made openly available, published in SDMX and simple in structure, the WDI is an ideal case for us to test the SDMX-RDF model in practice.

Contributing to the Linked Open Data Cloud

Whilst some aggregate statistical resources are already available as Linked Data (e.g. Eurostat), there is a gap in the area of international development statistics. The World Bank is the foremost authority on this type of data, so exposing the WDI as Linked Data will fill this gap.

Demonstrating the Value of Open Data

By demonstrating what can be done with open statistical data resources, the project could provide momentum to the open data movement, in particular to the calls for other IGOs, such as the IMF and OECD, to open access to their highly valuable data stores.  This has particular benefits for researchers outside the UK (including those in the developing world) who do not have access to data through services such as ESDS.

Please list the primary products that will be delivered from this project that other Higher Education Institutions will want to reuse? Linked Data version of the World Bank World Development Indicators Dataset
Secondary Tangible Product
JISC Website Keywords Open Technology, Standards
Name of lead institution? The University of Manchester
Department where project is primarily located Mimas
Postcode where the project team is primarily based? M13 9PL
Name of person(s) responsible for JISC project documentation and reporting? Ross MacIntyre
Email of person responsible for project documentation and reporting? ross.macintyre@manchester.ac.uk
Phone / Skype for Person responsible for project documentation +441612757181 / skype: ross.macintyre
Names and roles of all people working on the project team? Mimasld Principal Co-ordinator: Ross MacIntyre
ESDS Co-ordination: Paul Murphy and Jackie Carter
Names and roles of any and all project partners (commercial, consultants or other HEIs) who will be doing *paid* work for the project. None
Emails of all the team members, consultants, partners and any other person who will be working on or with the project regardless of costed participatory status (please include all emails of the people listed above). Ross.MacIntyre@manchester.ac.uk, Jackie.Carter@manchester.ac.uk, Paul.Murphy-2@manchester.ac.uk
Number of “named” end users who will be testing or using your software outputs? Not identified at this stage
URL link to an image of all project team members. http://mimasld.wordpress.com/about/
Project gmail account for access to this form ross.macintyre@manchester.ac.uk, paul.murphy-2@manchester.ac.uk
Project blog URI? http://mimasld.wordpress.com/
RSS2 or ATOM feed for project blog? http://mimasld.wordpress.com/feed/
URL of the code repository for versioned source code produced by project, e.g. GoogleCode, GitHub, Sourceforge, etc. http://esdsw2.mc.manchester.ac.uk/WDI/code/code.zip
URL for where your step-by-step instructional documentation will be drafted. http://www.esds.ac.uk/international/access/LDabout.asp
OSS license you will you be using for the code generated from project?  GPL
What is the phone number, skype handle, twitter and/or a picture on the Web of your JISC programme manager? +44 (0) 7891 50 1194, Skype – david.flanders, Twitter, Image
Have you installed an Analytics Engine (Google Analytics or Piwik) on your project blog, code repository and any other project web presence?  Google Analytics
Please provide the initials of the person who filled out this form along with your thoughts about how this form could have been better? RM, PM
Number of “named” end users whom you have already contacted and gotten their agreement to participate in testing the outputs of the project? 0
Creative Commons Licence used for project presentations and documentation? All written or audio-visual material made during Mimas’ Linked Data project will be made available under a Creative Commons Attribution 3.0 licence.
Creative Commons Licence used for project content? World Bank terms and conditions have to be used for the WDI content.  CC used for all other content.
Project Start Date 1-Apr-2011
Project End Date 31-Jul-2011
What is the total amount of money awarded to the project in your Grant Letter? £44,146
Name of Institutional Budget Manager Neil Chetham E: neil.chetham@manchester.ac.uk | T: 0161 275 0171
Link to ‘Final Product / Prototype’ Post http://www.esds.ac.uk/international/access/LDaccess.asp
PIMS URL for Project https://pims.jisc.ac.uk/projects/view/2061
Link to Final Approved Published Budget
Completion of the Final Sign-off Survey & Completion Form
Programme Manager Notebook Page on the Project

Final Product Post (ESDS) – Table of Contents

The following are the blog posts relating to this ESDS work package of the project.

About us and our aims

About the Data Cube (SDMX to RDF) model

About SDMX-RDF in other domains

About the World Development Indicators


Final Product Post (ESDS) – The World Development Indicators as Linked Data

The World Bank World Development Indicators are widely recognised as the most authoritative and comprehensive source of data on international development.  The data cover all countries for the period 1960-present.

ESDS International is making the WDI dataset available as Linked Data via the ESDS International website.

The screen shot below shows how series and code list RDF can be selected and downloaded.

Downloading WDI as Linked Data

The data for each series can also be downloaded via URI.  The URI for each series can be determined by selecting the series, as above, which will download the series RDF file; the URI for the series is referenced inside this file.

This work forms part of the MimasLD project, which is funded under the jiscEXPO programme and aims to expose Mimas-hosted content as Linked Data.

Detailed information about how we converted the dataset to Linked Data is in this blog post.

Licence

The World Development Indicators data is made available under the World Bank Terms of Use.

Code published on this blog is licensed under a Creative Commons licence (Attribution-ShareAlike 2.0 Generic).


Final Product Post (ESDS) – Reusable Components

In this final blog post for the ESDS work package of the MimasLD project we have detailed the most reusable parts of the project.

World Development Indicators as Linked Data

Our first output, as detailed in this blog post, is the World Bank World Development Indicators in RDF format.

We know the World Development Indicators to be the most important and authoritative source of data on international development; the dataset is especially valued by economists, social scientists, charities and businesses.  We hope that by exposing it as Linked Data it will become even more useful to these groups.  As the dataset only became openly available last year, we hope this demonstrates to other data producers the value that can be added to datasets by the wider data community once they are made open.

We are in the process of creating a tutorial on linking out from the WDI dataset to other data, and this will be available here in the next few weeks.

The SDMX to RDF Process

We have produced an in-depth blog post detailing exactly how we achieved the SDMX to RDF conversion; as part of that post we have shared the XSLT we created to do our transformations.  The purpose of this is to help others who are considering exposing SDMX as Linked Data.  We anticipate that those most likely to find it useful are the intergovernmental organisations that produce aggregate statistical datasets.

Much of this process will also be used in the dissemination of census aggregate statistics.  Work on this area is continuing as part of the CAIRD project and updates are being posted to the Census Dissemination Unit blog.


Final Product Post (ESDS) – The SDMX to RDF Process

This blog post explains in detail the technical process we used to convert SDMX-ML data to RDF for exposure on the web of Linked Data.  For background information about the project and the datasets we are using, please see this blog post.

Approach

Often the most difficult aspect of exposing datasets via RDF is modelling the data.  The Publishing Statistical Data working group (of which Mimas is part) has created a vocabulary mapping SDMX objects to the corresponding entities in RDF.
This framework defines the objects corresponding to the SDMX file structure.  To convert our data into RDF we had two main tasks:

  1. Creation of unique URIs to identify our SDMX objects, including code lists, coded values, datasets and observations. No URIs existed for the data, so we defined these ourselves. The URIs are of two types. Firstly, we need URIs for each dataset: we have chosen to create an individual RDF file for each WDI series, and the URI for a series is the file itself. The second type is a URI that references a part of a document using the hash (#) notation: observations are identified as parts of the series file and, in a similar way, coded values are identified as parts of our code list file. The URIs are referenced by other parts of the RDF output (for example, the URI for a code list is used in our dataset RDF), so it is necessary to define a URI scheme and apply it consistently. The following are examples of the URI structure we have used.
    Code lists: http://esdsw2.mc.manchester.ac.uk/wdi/code/
    Datasets: http://esdsw2.mc.manchester.ac.uk/wdi/data/
    There are four code list files and one file for each WDI series.
  2. Outputting RDF data.
    Outputting the data is done using XSLT to convert several SDMX-ML files into the target RDF format.  Since the Data Cube vocabulary has already defined the SDMX features, the XSLT required to convert SDMX into RDF is fairly straightforward. RDF can be serialised in a number of formats, such as XML, Turtle, N3 and JSON. An online converter can be used to convert our RDF/XML into any of the supported formats; we used any23.org to convert examples in Turtle to RDF/XML during development. A sketch of the output for one observation is given after this list.
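
To make the URI scheme and the output format concrete, the fragment below is a minimal sketch (not copied from our actual output) of one observation in RDF/XML, using the Data Cube and SDMX-RDF vocabularies. The series code is the one used as an example later in this post; the observation value, the hash fragment pattern and the code list file name CL_REF_AREA_WDI are illustrative assumptions.

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:qb="http://purl.org/linked-data/cube#"
             xmlns:sdmx-dimension="http://purl.org/linked-data/sdmx/2009/dimension#"
             xmlns:sdmx-measure="http://purl.org/linked-data/sdmx/2009/measure#">

      <!-- URI type 1: the series RDF file itself identifies the dataset -->
      <qb:DataSet rdf:about="http://esdsw2.mc.manchester.ac.uk/wdi/data/NY_ADJ_NNTY_KD_ZG"/>

      <!-- URI type 2: hash URIs identify parts of a file; here an observation
           within the series file, referencing a coded value within a code list file -->
      <qb:Observation rdf:about="http://esdsw2.mc.manchester.ac.uk/wdi/data/NY_ADJ_NNTY_KD_ZG#AFG-2009">
        <qb:dataSet rdf:resource="http://esdsw2.mc.manchester.ac.uk/wdi/data/NY_ADJ_NNTY_KD_ZG"/>
        <sdmx-dimension:refArea rdf:resource="http://esdsw2.mc.manchester.ac.uk/wdi/code/CL_REF_AREA_WDI#AFG"/>
        <sdmx-measure:obsValue rdf:datatype="http://www.w3.org/2001/XMLSchema#double">21.0</sdmx-measure:obsValue>
      </qb:Observation>

    </rdf:RDF>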

The DSD file is used to generate the RDF objects representing code lists and coded values for the data cube: entities such as countries, observation frequencies and units.

Since the source SDMX data and the DSD file are both XML formats, and we have used the XML flavour of RDF as our output, it was natural to use XSLT (Extensible Stylesheet Language Transformations) to convert the data.
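
To give a flavour of what such a transformation looks like, here is a minimal stylesheet sketch rather than our production XSLT. It assumes a simplified intermediate input in which each series is a Series element (with SERIES and REF_AREA attributes) containing Obs elements (with TIME_PERIOD and OBS_VALUE attributes); those element and attribute names, the hash fragment pattern and the CL_REF_AREA_WDI file name are assumptions made for illustration.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:qb="http://purl.org/linked-data/cube#"
        xmlns:sdmx-dimension="http://purl.org/linked-data/sdmx/2009/dimension#"
        xmlns:sdmx-measure="http://purl.org/linked-data/sdmx/2009/measure#">

      <!-- Base URIs for the output; the values are illustrative -->
      <xsl:param name="dataBase" select="'http://esdsw2.mc.manchester.ac.uk/wdi/data/'"/>
      <xsl:param name="codeBase" select="'http://esdsw2.mc.manchester.ac.uk/wdi/code/'"/>

      <xsl:template match="/">
        <rdf:RDF>
          <xsl:apply-templates select="//Series/Obs"/>
        </rdf:RDF>
      </xsl:template>

      <!-- One qb:Observation per observation in the (assumed) intermediate format -->
      <xsl:template match="Obs">
        <qb:Observation rdf:about="{$dataBase}{../@SERIES}#{../@REF_AREA}-{@TIME_PERIOD}">
          <qb:dataSet rdf:resource="{$dataBase}{../@SERIES}"/>
          <sdmx-dimension:refArea rdf:resource="{$codeBase}CL_REF_AREA_WDI#{../@REF_AREA}"/>
          <sdmx-measure:obsValue rdf:datatype="http://www.w3.org/2001/XMLSchema#double">
            <xsl:value-of select="@OBS_VALUE"/>
          </sdmx-measure:obsValue>
        </qb:Observation>
      </xsl:template>

    </xsl:stylesheet>

A full transformation would also handle the frequency dimension, units and the dataset-level information described later in this post.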

Downloading the Source Data

We downloaded the source World Development Indicators data from the World Bank’s Databank service.  There are two parts to an SDMX-ML dataset:

  • A single data structure definition (DSD) file, which denotes the structure of the dataset
  • Numerous data files in SDMX-ML Compact format.

The observations in the data file are associated with coded values from the DSD.  For example, a series of yearly observations would be coded as FREQ=”A” SERIES=”NY_ADJ_NNTY_KD_ZG” REF_AREA=”AFG”. Abbreviating these values makes the SDMX-ML more compact and efficient.  The descriptive text for all the abbreviations is held in the DSD file.
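
For illustration, a series in a Compact data file has roughly the following shape (namespace declarations omitted and observation values made up; the attribute names are taken from the example above and from typical SDMX Compact messages):

    <Series FREQ="A" SERIES="NY_ADJ_NNTY_KD_ZG" REF_AREA="AFG">
      <Obs TIME_PERIOD="2008" OBS_VALUE="3.9"/>
      <Obs TIME_PERIOD="2009" OBS_VALUE="21.0"/>
    </Series>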

Preparation of the SDMX-ML

The size of the original dataset causes some issues: the SDMX-ML file is loaded and processed as an XML tree of nodes, and for a dataset of this size this takes up large amounts of memory.  To mitigate this, the dataset was broken up into smaller files for processing.

A second issue is the size and redundancy of the SDMX-ML format. For example, our target files will only contain data in English, although the source also carries structural information in French, and the language attribute (xml:lang="en") on every coded value in the structure is unnecessary.  The SDMX data also contains namespace-prefixed elements that specify which part of the SDMX standard the file is using; since we are only converting this one format, that level of detail is not needed for the RDF and complicates the XSLT transformations.  To address this we first simplified the data by converting it to an intermediate XML format.  A further transformation was then done to incorporate the required RDF namespaces and URIs for the supporting RDF framework files (SDMX-RDF).

Preparation of DSD

The DSD file is transformed into two intermediate files: code lists and codes.

The code lists file is used later as the principal driver for the XSLT process.  It contains only the code list IDs and English names. There are four code lists defined in the SDMX:

  • Frequency code list (observation frequency)
  • Reference area code list (countries / areas)
  • Units multipliers code list
  • Series code lists for World Development Indicators (all field names / dimensions)

The codes file contains three values for each coded value, in simple XML, and is subsequently used as a lookup:

  • Code list name (e.g. Frequency code list)
  • Code (e.g. A)
  • Description (e.g. Annual)
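
A hypothetical fragment of this lookup file, with element names invented for illustration (the real intermediate format may differ), would carry those three values like so:

    <codes>
      <code>
        <codelist>Frequency code list</codelist>
        <value>A</value>
        <description>Annual</description>
      </code>
      <code>
        <codelist>Reference area code list</codelist>
        <value>AFG</value>
        <description>Afghanistan</description>
      </code>
    </codes>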

Check URI validity

A principle of RDF is that entities must be uniquely identified, and so it follows that URIs should be unique.  Following best practice, URIs should be constructed from human-readable names, so we checked that no two descriptions were duplicated.
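
One way to perform such a check, sketched here as an assumption rather than a record of our exact method, is a small XSLT 2.0 pass over the hypothetical codes lookup file shown above, reporting any description that appears more than once:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- Group coded values by their description and flag duplicates -->
      <xsl:template match="/">
        <xsl:for-each-group select="//code" group-by="description">
          <xsl:if test="count(current-group()) &gt; 1">
            <xsl:value-of select="concat('Duplicate description: ', current-grouping-key(), '&#10;')"/>
          </xsl:if>
        </xsl:for-each-group>
      </xsl:template>
    </xsl:stylesheet>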

Creation of code lists and coded values in RDF

The RDF for each code list contains a series of definitions for the code list itself and then individual definitions for each coded value within the list. The purpose of these structures is to align the SDMX code list with the code list and concept objects provided by the Data Cube framework.  Two URIs exist for each code list: the first relates to the SDMX concept and is capitalised, the other has an initial lowercase letter.  There is one RDF file for each code list.  It is important to ensure that the code list and coded value names do not contain characters that are incompatible with URIs and file paths, so checks were done to ensure the file names were compatible.
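
As a sketch of what this might look like in RDF/XML, the fragment below shows one code list and one coded value using the SKOS classes adopted by the Data Cube framework. The code list file name CL_FREQ_WDI is an illustrative assumption, and the second (concept) URI mentioned above is omitted for brevity.

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:skos="http://www.w3.org/2004/02/skos/core#">

      <!-- The code list itself -->
      <skos:ConceptScheme rdf:about="http://esdsw2.mc.manchester.ac.uk/wdi/code/CL_FREQ_WDI">
        <skos:prefLabel xml:lang="en">Frequency code list</skos:prefLabel>
      </skos:ConceptScheme>

      <!-- One coded value, identified by a hash URI within the code list file -->
      <skos:Concept rdf:about="http://esdsw2.mc.manchester.ac.uk/wdi/code/CL_FREQ_WDI#A">
        <skos:notation>A</skos:notation>
        <skos:prefLabel xml:lang="en">Annual</skos:prefLabel>
        <skos:inScheme rdf:resource="http://esdsw2.mc.manchester.ac.uk/wdi/code/CL_FREQ_WDI"/>
      </skos:Concept>

    </rdf:RDF>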

Validation of SDMX Files

The files were checked for completeness and well-formedness.

Conversion of Dataset Files to RDF

To allow RDF to be downloaded for individual series, an RDF file was created for each series.  Firstly, the XSLT matches on the root and puts together the dataset information, including namespaces, dimensions, measures and the DSD.  This allows each observation to reference the SDMX-RDF objects and the required code list, code, measure and unit objects that we have already built.

For efficiency and memory conservation, one lookup was performed per series rather than per observation.
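
The dataset-level information assembled by the root match might, for example, look like the sketch below; this is a hypothetical fragment, with the per-series #dsd URI and the component list chosen purely for illustration.

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:qb="http://purl.org/linked-data/cube#">

      <!-- The dataset for one WDI series, pointing at its structure definition -->
      <qb:DataSet rdf:about="http://esdsw2.mc.manchester.ac.uk/wdi/data/NY_ADJ_NNTY_KD_ZG">
        <qb:structure rdf:resource="http://esdsw2.mc.manchester.ac.uk/wdi/data/NY_ADJ_NNTY_KD_ZG#dsd"/>
      </qb:DataSet>

      <!-- The structure definition declares the cube's dimensions and measures -->
      <qb:DataStructureDefinition rdf:about="http://esdsw2.mc.manchester.ac.uk/wdi/data/NY_ADJ_NNTY_KD_ZG#dsd">
        <qb:component>
          <qb:ComponentSpecification>
            <qb:dimension rdf:resource="http://purl.org/linked-data/sdmx/2009/dimension#refArea"/>
          </qb:ComponentSpecification>
        </qb:component>
        <qb:component>
          <qb:ComponentSpecification>
            <qb:measure rdf:resource="http://purl.org/linked-data/sdmx/2009/measure#obsValue"/>
          </qb:ComponentSpecification>
        </qb:component>
      </qb:DataStructureDefinition>

    </rdf:RDF>

Each observation then uses qb:dataSet to tie itself back to this dataset, as in the observation sketch near the start of this post.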

The code can be downloaded here.
