Dec 18 2012

Trialling DataCite for chemistry lab notebooks and repository data services

Steve Hitchcock

To use research data we need to be able to locate and cite it. DataCite is a service for identifying and citing data. The British Library’s DataCite service is being trialled through DataPool at the University of Southampton with a view to making an institutional agreement for the service. First to try the service here are Philip Adler and Simon Coles, who report on how well metadata describing entries in chemistry lab notebooks and repositories maps to the DataCite schema.

Recently, we have undertaken trials of the DataCite service operated in the UK by the British Library for minting DOIs (digital object identifiers). These were based on use cases in chemistry involving an Electronic Lab Notebook (ELN), LabTrove, portions of which can and should be cited using a DOI, particularly from journal articles. Additionally, we checked the suitability of minting DataCite DOIs for records in eCrystals, an institutional data repository based on the EPrints system.

Four use cases have been identified and tested, referencing:

  1. an entire Lab Notebook
  2. a subset of the entries in a Lab Notebook
  3. a single entry in a Lab Notebook
  4. an entry in the eCrystals system

The key part of the work is establishing whether suitable metadata can be located in each system and placed in an XML document that conforms to the DataCite XML schema. At present the XML can only be generated, somewhat laboriously, by hand; given the successful outcome of our trials, we intend to automate this process within each system at a later date.
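As an illustration of what such automation might look like, here is a minimal Python sketch that maps a record onto the schema's mandatory properties (identifier, creators, titles, publisher, publication year). It is not the code used in the trials. The dictionary keys, the function name and the DOI are hypothetical placeholders, and the schema namespace and version attributes are deliberately omitted, so the output would need checking against the DataCite schema version in use before being submitted.

```python
import xml.etree.ElementTree as ET


def build_datacite_xml(record):
    """Map a record (a plain dict with hypothetical keys) to DataCite-style XML.

    Only the mandatory properties are emitted; the schema namespace is
    omitted, so this is a sketch rather than a validated document.
    """
    resource = ET.Element("resource")

    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = record["doi"]

    creators = ET.SubElement(resource, "creators")
    for name in record["authors"]:
        creator = ET.SubElement(creators, "creator")
        ET.SubElement(creator, "creatorName").text = name

    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = record["title"]

    ET.SubElement(resource, "publisher").text = record["publisher"]
    ET.SubElement(resource, "publicationYear").text = str(record["year"])

    return ET.tostring(resource, encoding="unicode")


# Placeholder values for illustration only; none of these are real records.
example_record = {
    "doi": "10.5072/example-notebook",
    "authors": ["Example, Author"],
    "title": "Example lab notebook",
    "publisher": "Example University",
    "year": 2012,
}

xml_text = build_datacite_xml(example_record)
print(xml_text)
```

In a real system the dictionary would be filled from the notebook's or repository's own metadata store, which is the mapping exercise described in the use cases below.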

Use Case 1: Referencing an entire Lab Notebook

The key metadata accompanies each post, so this is for the most part a mapping exercise between the two kinds of metadata. The trickiest element to map is the publication date. In traditional publishing this date is definite: the parts of a single journal issue, for instance, are not published on different dates. In LabTrove, however, the individual entries that make up a complete record (or indeed a category, as in use case 2) can span a range of dates. In its guidelines, DataCite asks for the most appropriate date from a citation perspective, which does not necessarily clarify matters here. For the purposes of this experiment I have used the date of the most recent entry in the record being referenced. There is precedent for this in RSS, an XML-based publishing format used elsewhere.
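The date rule adopted for the experiment is simple enough to state in a few lines of Python. This is only a sketch of the rule, with made-up entry dates and a hypothetical function name; how the dates would actually be read out of LabTrove is not shown.

```python
from datetime import date


def publication_date(entry_dates):
    """Return the date of the most recent entry in a collection.

    Raises ValueError if the collection has no entries, since there is
    then no sensible date to cite.
    """
    if not entry_dates:
        raise ValueError("cannot derive a publication date from no entries")
    return max(entry_dates)


# Made-up dates standing in for the posting dates of notebook entries.
entry_dates = [date(2012, 3, 14), date(2012, 11, 2), date(2012, 7, 30)]

chosen = publication_date(entry_dates)
print(chosen.isoformat())  # date of the latest entry
print(chosen.year)         # value that would go in the publication year
```

One consequence of this choice is that the cited date moves forward whenever a new entry is added to the collection.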

A good feature of the schema is that the optional <dates> field allows a date to be recorded each time an entry collection is ‘updated’, i.e. each time a new entry is posted. Another neat aspect is the <relatedIdentifiers> option, which allows each of the records that make up the collection to be linked to the collection itself. The relationship between different resources can be described semantically using the relationType attribute.
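The following Python sketch shows how these two optional sections might be added to a record. It assumes that "Updated" is an accepted dateType value and that "HasPart" and "URL" are accepted relationType and relatedIdentifierType values; these should be confirmed against the schema version in use. The function name, dates and URLs are placeholders.

```python
import xml.etree.ElementTree as ET


def add_updates_and_parts(resource, update_dates, entry_urls):
    """Append <dates> and <relatedIdentifiers> sections to a resource element.

    update_dates: ISO date strings, one per update of the collection.
    entry_urls:   URLs of the individual entries that make up the collection.
    """
    dates = ET.SubElement(resource, "dates")
    for iso_date in update_dates:
        # Assumed controlled value: dateType="Updated".
        ET.SubElement(dates, "date", dateType="Updated").text = iso_date

    related = ET.SubElement(resource, "relatedIdentifiers")
    for url in entry_urls:
        # Assumed controlled values: relatedIdentifierType="URL",
        # relationType="HasPart" (the collection has this entry as a part).
        item = ET.SubElement(
            related,
            "relatedIdentifier",
            relatedIdentifierType="URL",
            relationType="HasPart",
        )
        item.text = url
    return resource


# Placeholder collection; the URLs are not real notebook entries.
collection = ET.Element("resource")
add_updates_and_parts(
    collection,
    update_dates=["2012-07-30", "2012-11-02"],
    entry_urls=["http://example.org/notebook/1", "http://example.org/notebook/2"],
)
print(ET.tostring(collection, encoding="unicode"))
```

The same structure would serve use case 2, where the list of entry URLs comes from whichever grouping defines the collection.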

Use Case 2: Referencing an arbitrary collection of records

The only adjustment required for this use case is that the items being referenced must have a means of being grouped. Happily, LabTrove comes with the ability to group entries by tag, by category, by date and so on. Other than this, there is no difference in procedure between this and the method for use case 1. Additionally, the ‘related identifiers’ facility in the XML schema permits the URL of each record to be included in the XML metadata.

Use Case 3: Referencing a specific record in a Lab Notebook

LabTrove-based blog example: Pictet-Spengler route to Praziquantel Synthesis of intermediates and derivatives of PZQ

Once again, this is a simple mapping exercise, made simpler than the previous two examples by the fact that there is no ambiguity about the date information associated with the record.

Use Case 4: Referencing an eCrystals record

Once again, this is a simple mapping exercise, since the data are all presented in the eCrystals record, and there is no ambiguity about any of the data. Some of the information in the XML schema is open to field-dependent interpretation, however (in particular, the ‘roles’ section in the schema), and this could use some clarification within the accompanying documentation.