Dec 20 2012

Connecting research data roadmaps and business cases: the IDMB example for the University of Southampton

Steve Hitchcock

The sausage in the roll or the wafer-thin ham in the sandwich: as promised in the last post, this is the alternative to the ubiquitous benefits-evidence slides presented by each project represented at the JISC MRD workshop in Bristol. This presentation connects the development of roadmaps with the business case and policy for making progress with research data management (RDM) at an institutional level.

This was presented by Steve Hitchcock, but draws heavily on a report from the Institutional Data Management Blueprint (IDMB) Project, which began the work on research data management (RDM) at the University of Southampton now being taken on by DataPool. Mark Brown, Oz Parchment and Wendy White, co-authors of that report, are therefore the true authors of this presentation. Comment and interpretation are mine.

This version provides the notes for each slide used to inform the commentary for the presentation. It might be worth opening the Slideshare site (adverts notwithstanding) to switch between the slide notes below and the graphic slides – clicking on View on Slideshare in the embedded view will open these in a separate browser window.

Slide 2 Taking the IDMB example with others, connecting roadmaps with the business case and policy seems like a logical sequence, but in practice this is not always the case. At Southampton we have a roadmap and an official institutional research data policy, but the business case is still to be approved. Other institutions appear to have begun with a policy. Here we will focus on the roadmap and business case rather than policy.

Slide 3 If the IDMB project elaborated the roadmap, DataPool represents progress along the first part (18 months) of the first phase (3 years) of the plan, and is beginning to fill in components of the map, as can be seen by the links in this slide.

Slide 4 For reference, this is a recent poster designed to show graphically the full scope of the DataPool Project. It shows the characteristic tripartite approach of this and comparable JISC institutional RDM projects: policy, training, and technical infrastructure (data repository and storage services).

Slide 5 This middle phase of the Southampton RDM roadmap looks like it may have been the trickiest part of the map to elaborate. It’s not imminent and depends on outcomes from the first stage; on the other hand, it’s not so far away that we don’t need to be aware of it and making plans for it. As seen in this extract, it is essentially describing refinements of many of the expected developments from stage 1.

Slide 6 If looking ahead is trickier than framing immediate work, this final phase looking up to 10 years ahead might have been hardest to describe. It is, however, more aspirational in tone and less inclined to deal with specifics, and seems more appropriate for adopting that approach.

Slide 7 A recent and interesting comparison with the Southampton RDM roadmap is that from Edinburgh University. Edinburgh has a target completion date of early 2014, a startlingly short roadmap compared with a 10Y example. The two are not directly comparable, of course. The Edinburgh case looks to be a well specified, well structured and comprehensive first phase and can be commended for that. Whether it is achievable within the time and resources specified we cannot judge yet. The illustration reproduced here is a helpful representation of the plan – at least, it is once you’ve read the plan.

Slide 8 This extract connects the first progress report of the DataPool Project, by then-PI Mark Brown, with the roadmap and policy. It makes the clear point that research funder requirements (EPSRC, RCUK) had an important influence on adoption of the policy at an executive level, even if some discussion at this JISC MRD Benefits Meeting was around whether supporting compliance with such requirements can usefully be presented to researchers as a ‘benefit’.

Slide 9 Other JISC MRD projects that have roadmaps have similarly emphasised the importance of EPSRC requirements on the production of the roadmap.

Slide 10 Now we move on to the second part of the talk, the business case. The data.bris project from Bristol University was presenting in the same session at this event, so we will spare the detail here, but this extract from a recent blog post by the project illustrates some of the imponderables, Donald Rumsfeld-style, of forming a business case for RDM.

Slide 11 We are heading towards the critical part of this presentation, the financial numbers. First some context. This case covers just the technical infrastructure – IT services – not the wider factors outlined by data.bris. This business model has been updated and presented at the University of Southampton and, as we have already indicated, is currently undergoing further revision with a view to official acceptance. The assumption stated here is not based on the university’s current research data policy, which requires a record of all data produced in the course of research at the institution rather than full data deposit. The university can be said, therefore, to have stopped short, so far, of accepting the business case for supporting the costs of the policy. The data on usage of storage services and projected usage are the basis for the financials that follow.

Slide 12 In the style of the financial services industry, given there are a number of uncertain factors to accommodate in projections of the growth of storage requirements, this chart attempts to draw upper and lower bounded curves to underpin the calculations.
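As a rough illustration of how upper and lower bounded curves of this kind can be produced, here is a minimal sketch that assumes simple compound annual growth. The starting volume and both growth rates are hypothetical placeholders, not figures from the IDMB report, and the report's own method may well differ.

```python
def project_storage(start_tb, annual_growth, years):
    """Return projected storage (TB) for year 0..years under compound growth."""
    return [start_tb * (1 + annual_growth) ** y for y in range(years + 1)]


# Hypothetical inputs for illustration only; NOT taken from the IDMB report.
START_TB = 100.0      # assumed storage in use today
LOW_GROWTH = 0.20     # assumed lower-bound annual growth rate
HIGH_GROWTH = 0.50    # assumed upper-bound annual growth rate
YEARS = 5

lower = project_storage(START_TB, LOW_GROWTH, YEARS)
upper = project_storage(START_TB, HIGH_GROWTH, YEARS)

# Print the band that any cost calculation would then have to cover.
for year, (low, high) in enumerate(zip(lower, upper)):
    print(f"Y{year}: {low:8.1f} TB to {high:8.1f} TB")
```

The point of carrying both curves forward is that each later cost figure becomes a range rather than a single number, which is closer to what can honestly be claimed.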

Slide 13 This illustration also comes directly from the IDMB report. Allowing that the metadata should ideally attach to both active and archive layers, the cost factors introduced here are access bandwidth, latency and storage technology. The basic choices considered are between more expensive, faster-access disk storage and slower tape stores.

Slide 14 Now we get to the actual financial numbers resulting from this analysis. The number that stands out is Y3 in the disk-based scenario, which not only rises above £1M for the first time but gets closer to £2.4M. Subsequent annual costs shown here remain above £1M for this scenario. The slower tape-based costs are always lower.

Slide 15 With the numbers identified, the critical decision is how to meet these costs. This was an important issue for the second DataPool Steering Group meeting recently. A full free-at-point-of-use service may be the simplest if most expensive option for the institution, but it has been strongly argued that RDM must be viewed as a direct cost of research, and funded accordingly. The dilemma for institutions is how much to invest in infrastructure directly, compared with leaving projects to raise additional costs for data management and risking research bids becoming less competitive than those from institutions with more generous direct support.

Slide 16 In summary, roadmaps are useful for focussing discussion on research data management at an institutional level, and for engaging other stakeholders across all disciplines. Given that a roadmap should be based on prior consultations with those stakeholders, it follows that subsequent interaction with the roadmap should lead to further consultation. The roadmap must therefore be used as a living document. Southampton has not yet finalised its business case for supporting RDM, but it has established a process through engaging with the roadmap in the first instance.

Dec 20 2012

DataPool benefits-evidence table

Steve Hitchcock

JISC, funder of DataPool, of other projects in research data management, and many more projects on widening use of digital technology in education, tends to focus on areas close to practical exploitation. On the R&D spectrum, it is typically towards the development end. For project managers, therefore, there is an emphasis on procedures and tools to increase the impact of practical outcomes – evaluation, sustainability, exit strategies, technology transfer, etc.

Another planning tool being adopted in the Managing Research Data Programme (MRD) 2011-13, of which DataPool is a part, is benefits-evidence analysis. As this description suggests, the idea is to elaborate prospective benefits of a project, and then identify the evidence that will demonstrate whether or not the benefit has been realised. It is as much about informing the process of getting to the results, and identifying which results are important and achievable, as the results themselves.

Hence, JISC MRD projects were invited to Bristol for a 2-day programme workshop at the end of November to present their benefits-evidence slides. If this sounds a little repetitive, it is, but it is not uninteresting, especially as in preparing for the workshop all projects had essentially to engage in the same analysis, and were therefore armed not just with their own slide but ready to comment on others.

For project managers used to working towards outputs (products or services arising from the project) and outcomes (effects of the outputs on users in the target community), benefits are another factor. Hence, the JISC MRD programme has recruited a team of evidence gatherers, to work with and assist projects to hone and refine the benefits they are working towards and the consequent evidence measures. “Those are more outputs than benefits” I was advised, fairly, during open discussion on some ‘benefits’ in my slide. But then I had seeded the slide with points to discuss rather than a definitive list, and unwittingly extended the project’s previously discussed benefits.

So after the workshop I was grateful for the advice of Laura Molloy, evidence gatherer for DataPool, on aligning our pre- and post-workshop benefits lists.

After all that effort it would be remiss not to reveal the benefits-evidence table that emerged from the process. For the record, here are the benefits DataPool will seek to demonstrate in its final months into early 2013.

DataPool: Benefits-Evidence

Benefit 1: Improved RDM skills across the target community, including researchers and professional support staff
Evidence:
  Qual reporting on effectiveness of training events.
  Feedback from training courses and deskside consultations, DMP and email help services.
  More staff running RDM support services, increased service offer.

Benefit 2: Greater visibility and use of institution’s research data / research outputs through sharing, collaboration, reuse
Evidence:
  Qual case study describing improved dataset exposure.
  Qual evidence of DMP engagement, including early indications of access routes.
  * Quant indication of increase in dataset downloads.
  No. of datasets stored in data repository.
  Accesses of open datasets vs closed datasets vs shared datasets.

Benefit 3: Sustained institutional support for RDM / sustainability for RDM infrastructure at institution
Evidence:
  No. of training opportunities introduced.
  Scope of: deskside consultations, DMP support service.
  Results from case studies – engagement with existing data facilities.
  Assessment of added value for institution of using institutional storage over other options – report.

Benefit 4: Improved use/uptake of RDM infrastructure
Evidence:
  Quant account of ‘bid preparation consultations’, inc. qual narrative of referrals to data policy and DMP help.
  Case study on working with data policy – feedback on uptake of policy.
  Quant tracking of higher attendance at training.
  Accesses to RDM guidance documents.
  No. of deskside consultations.
  * Quant indication of improved uptake of institutional storage and deposit options.
  No. of large data projects switching to institutional data service.

Benefit 5: Time / costs saved by improved RDM infrastructure
Evidence:
  Identifying early cost-benefits – combined case studies report, inc large data projects, open data, imaging, disciplinary efficiencies.
  Assessment of added value for institution of using institutional storage over other options – report (see 3).

* This evidence not expected to be available during DataPool Project, following launch of RDM repository service by project end, but will be collected in ongoing work at Southampton University on institutional RDM. Table by Steve Hitchcock for DataPool, in collaboration with Wendy White, Dorothy Byatt. We gratefully acknowledge the feedback and suggestions from Laura Molloy, JISC evidence gatherer.

The University of Southampton has a 10 year roadmap for research data, of which DataPool represents the first stretch of road, so there is a commitment to go further, but the clearer the steer from DataPool the faster the progress afterwards.

As a little light relief from projects’ benefits-evidence slides, a presentation on the Southampton roadmap and business plan was given at the Bristol workshop. That will be covered in a separate post.

How will you know which benefits have been achieved as the project moves forward? This post is tagged with the label ‘benefits’. All updates reporting evidence from the table above will use this tag. Tags can be found in the column immediately to the right of this one, and up, from this point in the post.

This is how other JISC MRD projects are tackling these challenges and what benefits-evidence is being targeted:

Dec 18 2012

Trialling DataCite for chemistry lab notebooks and repository data services

Steve Hitchcock

To use research data we need to be able to locate and cite it. DataCite is a service for identifying and citing data. The British Library’s DataCite service is being trialled through DataPool at the University of Southampton with a view to making an institutional agreement for the service. First to try the service here are Philip Adler and Simon Coles, who report on how well metadata describing entries in chemistry lab notebooks and repositories maps to the DataCite schema.

Recently, we have undertaken trials of the DataCite service operated in the UK by the British Library for minting DOIs (digital object identifiers). These were based on use cases in chemistry concerning the use of an Electronic Lab Notebook (ELN), LabTrove, portions of which can and should be referenced using a DOI when being referred to, particularly from journal articles. Additionally we checked the suitability of minting DataCite DOIs for records in eCrystals – an (institutional) data repository based on the EPrints system.

Four use cases have been identified and tested, referencing:

  1. an entire Lab Notebook
  2. a subset of the entries in a Lab Notebook
  3. a single entry in a Lab Notebook
  4. an entry in the eCrystals system

The key part of the work is identifying whether or not suitable metadata can be located, so that it can be placed in an XML framework conformant with the DataCite XML schema. Currently the only mechanism for generating the XML is, somewhat laboriously, by hand but, given the successful outcome of our trials, we will automate this process within each system at a later time.

Use Case 1: Referencing an entire Lab Notebook

The key metadata accompanies each post, so this is for the most part a mapping exercise between the two kinds of metadata. However, the trickiest of these is the publication date. In traditional publication circles, this date is definite – parts of a single journal issue, for instance, would not all be published at different dates. However, in the case of LabTrove the individual entries that make up a complete record (or indeed, a category, as in case 2) can have a range of dates. In guidelines DataCite asks for the most appropriate date based on a citation perspective. This does not necessarily clarify things in this case. For the purposes of experiment, however, I have used the date of the most recent entry in the record being referenced. There is precedent for this in the RSS protocol used as a publishing XML schema elsewhere.
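The date choice described above is easy to state in code. This is only a sketch of the rule used in the trial (take the most recent entry date), with made-up entry dates standing in for a real notebook; it is not part of LabTrove or DataCite.

```python
from datetime import date


def citation_date(entry_dates):
    """Return the date of the most recent entry, used here as the publication date."""
    if not entry_dates:
        raise ValueError("a notebook with no entries has no citation date")
    return max(entry_dates)


# Hypothetical entry dates for one notebook; not real LabTrove data.
entries = [date(2012, 3, 14), date(2012, 9, 2), date(2012, 6, 30)]

latest = citation_date(entries)
print(latest.isoformat(), latest.year)
```

Only the year would normally feed a publication-year field, so two notebooks updated in the same year would share that value even if their latest entries were months apart.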

A good feature of the schema is that the <dates> optional field allows dates to be entered each time an entry collection is ‘updated’, i.e. each time a new entry is posted. Another neat aspect is the <relatedIdentifiers> option, which allows each of the records that make up the collection to be linked to the collection itself. The relationship between different resources can be described semantically using the attribute relationType.
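To make the two optional fields concrete, here is a minimal sketch that builds a DataCite-style record with Python's standard XML library. The element and attribute names follow the DataCite schema as we understand it, but the DOI, names, dates and URLs are all hypothetical placeholders, and the namespace and schema version declarations are left out, so the output has not been validated against the real schema.

```python
import xml.etree.ElementTree as ET


def build_datacite_record(doi, creator, title, publisher, year, updates, parts):
    """Build a DataCite-style <resource> element (namespace omitted for brevity)."""
    resource = ET.Element("resource")
    ET.SubElement(resource, "identifier", identifierType="DOI").text = doi
    creators = ET.SubElement(resource, "creators")
    ET.SubElement(ET.SubElement(creators, "creator"), "creatorName").text = creator
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)

    # One <date> per update, i.e. each time a new entry joins the collection.
    dates = ET.SubElement(resource, "dates")
    for updated in updates:
        ET.SubElement(dates, "date", dateType="Updated").text = updated

    # Link each constituent record back to the collection via relationType.
    related = ET.SubElement(resource, "relatedIdentifiers")
    for url in parts:
        ET.SubElement(
            related,
            "relatedIdentifier",
            relatedIdentifierType="URL",
            relationType="HasPart",
        ).text = url
    return resource


# All values below are hypothetical placeholders.
record = build_datacite_record(
    doi="10.5072/example-notebook",
    creator="Example, Researcher",
    title="Example lab notebook",
    publisher="Example University",
    year=2012,
    updates=["2012-06-30", "2012-09-02"],
    parts=["https://notebook.example.org/1", "https://notebook.example.org/2"],
)
print(ET.tostring(record, encoding="unicode"))
```

Generating the XML from a function like this is one way the hand-built records could later be automated within each system.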

Use Case 2: Referencing an arbitrary collection of records

The only adjustment required for this use case is that the items being referenced have a means of being grouped. Happily, LabTrove comes with the ability to tag things within categories, within date, etc. Other than this, there is no difference in procedure between this and the method for use case 1. Additionally, there is a semantic facility in the XML schema which allows ‘related identifiers’, permitting the inclusion of the URL of each record in the XML metadata.

Use Case 3: Referencing a specific record in a Lab Notebook

LabTrove-based blog example: Pictet-Spengler route to Praziquantel Synthesis of intermediates and derivatives of PZQ

Once again, this is a simple mapping exercise, made simpler than the previous two examples by the fact there is no ambiguity about the date information associated with the record.

Use Case 4: Referencing an eCrystals record

Once again, this is a simple mapping exercise, since the data are all presented in the eCrystals record, and there is no ambiguity about any of the data. Some of the information in the XML schema is open to field-dependent interpretation, however (in particular, the ‘roles’ section in the schema), and this could use some clarification within the accompanying documentation.

Dec 17 2012

To architect or engineer research data repositories

Steve Hitchcock

There cannot be many mature products where development meetings have not been interrupted with a rueful declaration that to make further progress “you wouldn’t start from here”. This encapsulates one key difference between the architect and the engineer, the engineer prepared to work with the set of tools provided, the architect preferring to start with a blank sheet of paper or an open space.

In building research data repositories using two different software platforms, Microsoft Sharepoint and EPrints, the DataPool Project is working somewhere between these extremes. Which approach will prove to be the more resilient for research data management (RDM)? In this invited talk for RDMF9, the ninth in the DCC series of Research Data Management Forums, held in Cambridge on 14-15 November 2012, we will look at the relevant factors. As a project we are agnostic to repository platforms, and as an institutional-scale project we have to work with whoever will support the chosen platform.

The original Powerpoint slides are available from the RDMF9 site. This version additionally reproduces the notes for each slide used to inform the commentary from the presentation. It might be worth opening the Slideshare site (adverts notwithstanding) to switch between the slide notes below and the graphic slides – clicking on View on Slideshare in the embedded view will open these in a separate browser window.

I thank Graham Pryor of DCC, organiser of RDMF9, for inviting this talk, and for suggesting this topic based, presumably, on the project blog post shown in slide 2. This post sets out some of the higher-level issues while avoiding the trap of setting up a straw man pitting Sharepoint versus EPrints.

Before we get into the detailed notes, here is the live Twitter stream for the DataPool presentation (retrieved from #rdmf9 hashtag on 15 Nov.).

@jiscdatapool Preparing to talk at #rdmf9. Have the 9 am slot
@MeikPoschen #rdmf9 2nd day: To architect or engineer? Lessons from DataPool on building RDM repositories, first talk by Steve Hitchcock #jiscmrd
@MeikPoschen JISC DataPool Project at Southampton, see #jiscmrd #rdmf9
@simonhodson99 Down to work at #rdmf9 at Madingley Hall – outside it’s misty, autumnal – inside it’s Steve Hitchcock, DataPool: to architect or engineer?
@simonhodson99 Steve Hitchcock argues that the DataFlow solution is one of the most innovative things to come through #jiscmrd #rdmf9
@simonhodson99 ePrints data apps available from ePrints Bazaar: #jiscmrd #rdmf9
@jtedds Hitchcock (Southampton) describes institutional drive to implement SharePoint type solution but can it compete with DropBox? #jiscmrd #rdmf9
@jtedds Trial integrations with DataFlow MT @simonhodson99 ePrints data apps available from ePrints Bazaar #jiscmrd #rdmf9
@John_Milner Hitchcock highlights the challenge of getting quality RDM while keeping deposit simple for researchers, not easy #RDMF9
@simonhodson99 Perennial question of the level of detail required in metadata: with minimal metadata will the data be discoverable or reusable? #rdmf9
@simonhodson99 Is SharePoint a sufficient and appropriate platform for active data management? Sustainable? One size fits all? #rdmf9

Are the Twitter contributions a fair summary? We return to the slide commentary to find out.

Slide 3 The blog post highlighted in slide 2 included this architectural diagram, produced by Peter Hancock, director of the iSolutions IT services provider at the University of Southampton. Although it leans heavily towards referencing Sharepoint, it can be viewed as a high-level reference model, analogous to the OAIS in digital preservation, and therefore as a model that can embrace other repository types.

Slide 4 Before we get into the detail of the presentation, here is a poster-based summary of the DataPool Project. It has a tripartite approach characteristic of similar institutional projects in the JISC MRD programme, covering data policy, training and, the area of interest here, building a data repository. It is worth noting as well, in this context, that the development partners shown in the row beneath the tripartite elements effectively represent ways of getting data in and out of the RDM service adopted, and are relevant factors in the repository design.

Slide 5 Here is how the different repository platforms might line up on a broad spectrum of Architected vs Engineered. This is a rough-and-ready approach to illustrate the basic point. Also included is DataFlow, from the University of Oxford, perhaps the most innovative repository platform to have emerged for RDM. Given its originality, it appears towards the architected end of the spectrum. We could not claim that Sharepoint is a new software platform in the same way as DataFlow, but from an RDM perspective you don’t get anything out of the box – you have to start from scratch and ‘architect’ an RDM solution. What developers can do is try to ‘engineer’ the designed RDM element with the IT services already provided in Sharepoint. EPrints first appeared in 2000 to manage research publications. It has offered a ‘dataset’ deposit type since 2007, so provides a ready-made solution for an RDM repository, and can be ‘engineered’ to enhance that solution. As the slide notes, other RDM repository platforms are available. In the following slides we will explore the features of our three highlighted RDM platforms, starting with DataFlow.

Slide 6 DataFlow is a two-stage architecture for data management: an open (Dropbox-like) space for data producers (DataStage), and a managed and curated repository (DataBank), connected by a standard content transfer protocol, SWORD. While DataBank provides a bespoke data management service for Oxford, we have recently noted experiments to connect an open source version of DataStage with EPrints- and DSpace-based curated repositories, thus providing the yearned for Dropbox functionality apparently so in demand with research data producers.

Slide 7 This is an example screenshot from the DataStage-EPrints experimental arrangement used by the JISC Kaptur project. It shows the Choose File-Upload button combination for uploading data, familiar to e.g. WordPress blog users. Uploaded data is then shown in a conventional file manager list.

Slide 8 Moving data from DataStage to the curated repository, again shown in the experimental Kaptur implementation, uses this surprisingly simple SWORD client interface. If this seems insufficient description for a curated item, presumably a more detailed SWORD client could be substituted.
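For readers curious about what sits behind such a client, the following is a hedged sketch of how a SWORD deposit request might be assembled, assuming SWORD version 2 conventions (an HTTP POST of a zip package to a collection URI). The collection URI, filename and payload are hypothetical, and nothing is sent over the network; this is not the Kaptur or DataFlow code.

```python
import hashlib
import urllib.request

# Packaging identifier for a plain zip, as defined in the SWORD v2 profile.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def sword_headers(filename, payload, in_progress=False):
    """Return the HTTP headers for a binary SWORD v2 deposit."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(payload).hexdigest(),
        "Packaging": SIMPLE_ZIP,
        "In-Progress": "true" if in_progress else "false",
    }


def build_deposit_request(collection_uri, filename, payload):
    """Build (but do not send) a POST request depositing the package."""
    return urllib.request.Request(
        collection_uri,
        data=payload,
        headers=sword_headers(filename, payload),
        method="POST",
    )


# Hypothetical endpoint and placeholder bytes standing in for a real zip file.
uri = "https://repository.example.org/sword2/collection/data"
payload = b"placeholder bytes, not a real zip"
headers = sword_headers("dataset.zip", payload)
request = build_deposit_request(uri, "dataset.zip", payload)
print(request.get_method(), request.full_url)
```

The small number of headers is consistent with how little the client interface asks of the depositor; richer description would have to travel inside the package or in a separate metadata entry.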

Slide 9 One basis for building a more comprehensive description, or metadata, for research data is this 3-layer model produced by the Institutional Data Management Blueprint (IDMB) Project, the project that preceded DataPool at the University of Southampton. This is quite a general-purpose and flexible model, perhaps with more flexibility than meaning. Structurally, nevertheless, we will see that this has some relevance to repository deposit workflow design.

Slide 10 The 3-layer metadata model can be seen quite clearly in the emerging user interface for data deposit built on Sharepoint. Here we see the interface for collecting project descriptions, used once per project and then linked to each data record produced by the project.

Slide 11 In the same style, here is the Sharepoint user interface for collecting data descriptions. One of the most noticeable features within both the Project and Data forms is the small number of mandatory fields (indicated with a red asterisk), just one on each form. Mandatory fields have to be filled in for the form to submit successfully. Most people will have experienced these fields; invariably when completing a Web shopping form these will be returned with a red warning text. In this case you could feasibly submit a project or data description containing only a title. Aspects such as this are shortly to be subjected to user testing and review of this implementation.
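The effect of having a single mandatory field can be sketched as follows. This is an illustration only, not the Sharepoint implementation: the title field comes from the forms described above, while the other field names are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectDescription:
    title: str            # the one mandatory field on the project form
    funder: str = ""      # hypothetical optional field


@dataclass
class DataDescription:
    title: str                      # the one mandatory field on the data form
    project: ProjectDescription     # entered once per project, then linked
    keywords: list = field(default_factory=list)  # hypothetical optional field


MANDATORY = ("title",)


def missing_mandatory(record, mandatory=MANDATORY):
    """Return the names of mandatory fields that are empty or whitespace."""
    return [name for name in mandatory if not str(getattr(record, name, "")).strip()]


project = ProjectDescription(title="Example project")
dataset = DataDescription(title="Example dataset", project=project)

print(missing_mandatory(project), missing_mandatory(dataset))
print(missing_mandatory(ProjectDescription(title="  ")))
```

A record carrying nothing but a title passes this check, which is exactly the situation the user testing will need to examine.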

Slide 12 Sharepoint has its detractors as an IT service platform, principally bemoaning its complexity-to-functionality ratio. Prof Simon Cox from Southampton University takes the opposite view passionately. This is an extract from his intervention at a DataPool Steering Group meeting (May 2012) putting the case for Sharepoint. It is a good way of understanding the wider strengths of Sharepoint, which may not be immediately apparent to users of particular Sharepoint services. Building the range of services suggested is a difficult and long-term project.

Slide 13 EPrints supports the deposit of many item types, including datasets since 2007. When you open a new deposit process in EPrints you will first be shown this screen, where you can select an item type such as ‘dataset’.

Slide 14 Selecting ‘dataset’ will take you to this next screen, which might look something like this from ePrints Soton, the Southampton Institutional Repository. This is not quite a default screen for standard EPrints installs; the workflow and fields have been customised in some areas by a repository developer.

Slide 15 EPrints users need not be restricted to standard interfaces or interfaces customised to a repository requirement. Interfaces in EPrints can be added to or amended by simply installing an app from the app store, or EPrints Bazaar. Unlike the Apple app store, with which it might optimistically be compared, EPrints apps are not selected to be installed by users but installation is authorised by repository managers. There are already two apps for those managers to choose to suit particular RDM workflow requirements: DataShare and Data Core. More data apps are expected to follow. EPrints is thus being engineered for flexibility in RDM deposit. In the following slides we will explore these first two data apps.

Slide 16 DataShare makes some minor modifications to the default EPrints workflow for deposit of datasets, highlighted with red circles here.

Slide 17 Data Core aims to implement a minimal ‘core’ metadata for datasets. Implementing this app will overwrite the default EPrints workflow, replacing it with the minimal set, approximately half of which is shown here (the remainder in the next slide). In addition, we have a short description of the design aims for Data Core, which are unavailable for Sharepoint data deposit and the DataShare app.

Slide 18 Taking both slides showing the Data Core deposit workflow, this is comparable, in extent, with the Sharepoint ‘data’ interface shown earlier, although it has a few more mandatory fields.

Slide 19 Another example of an EPrints data deposit interface has been developed by Research Data @Essex at the University of Essex. Like Data Core, the Essex approach has explicit design objectives, based on aligning with other metadata initiatives to support multi-disciplinary data. In other words, this does not simply expand or reduce the default EPrints workflow for data deposit, but starts with a new perspective. We have been liaising with its development team to investigate the possibility of building this approach into an Essex EPrints app for other repositories to share.

Slide 20 Here is a section of the Essex workflow, highlighting one area of major difference with the default workflow. It shows fields for time- and geographic-based information.

Slide 21 We’ve looked at getting data into the repository, but not yet how it is displayed as an output, or a data record from the repository. This is one example. It is not the most revealing record, but could be expanded.

Slide 22 Essex has cited specific design criteria for its research data repository. Additionally we have observed some characteristic features, indicated here. In particular, it is a data-only repository, without provision for other data-types offered by EPrints (shown in slide 13). The indication of mandatory fields adds a further layer of insight into the implementation of the design criteria.

Slide 23 So far in this presentation we have seen different implementations of data repository deposit interfaces, including DataFlow, Sharepoint, and multiple interfaces for EPrints. Where is this heading, and what are the common themes? Since we are exploring the difference between architecting and engineering these repositories, I was interested to see this national newspaper article about a major redevelopment of an area close to central London, Nine Elms, an area that interests me as I pass through it on a regular basis. Phrases that stand out refer to the relationship between the planned new high-rise buildings. What does this have to do with data repositories?

Slide 24 Interoperability is the relationship between repositories and how they interact with services, such as search, through shared metadata. If repositories have “nothing in particular to do with anything around them” or “show little interest in anything around” them, then they will not be interoperable. If repositories stand alone rather than interoperate then they become less effective at making their contents visible. Open access repositories have long recognised the importance of interoperability, being founded on the Open Archives Initiative (OAI) over a decade ago, and efforts to improve interoperability continue with current developments. Shown here are some current interoperability initiatives from one morning’s mailbox. Data repositories will be connected to this debate, but so far it has not been a priority in the examples we have considered here.
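As a small illustration of what interoperability through shared metadata means in practice, this sketch builds the kind of request an OAI harvester sends to a repository using the OAI protocol for metadata harvesting (OAI-PMH). The base URL and set name are hypothetical, and no request is actually made.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict harvesting to one set, e.g. datasets only.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint and set name.
BASE = "https://repository.example.org/oai"
print(oai_list_records_url(BASE))
print(oai_list_records_url(BASE, set_spec="datasets"))
```

Any repository that answers such a request with records in the common `oai_dc` format can be picked up by search and aggregation services without bespoke integration, which is the advantage a stand-alone data repository forgoes.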

Slide 25 One of the organisations listed on the previous slide, COAR, produced a report that outlines more comprehensively the scope of current interoperability initiatives for open access. While some solutions to the capture of research data seen here have reasonably been ‘architected’, that is, starting with a blank sheet to focus on the specific design needs of data deposit, these will need to catch up quickly with interoperability requirements, including most of those listed here. Data repositories ‘engineered’ on a platform such as EPrints, originally designed for other data types, do not obviously lack the flexibility to accommodate research data, and by virtue of having contributed to repository interoperability since the original OAI, already support most of the requirements shown here.

Slide 26 As for the DataPool Project, it will continue its dual approach of developing and testing both SharePoint and EPrints apps. As a project it does not get to choose what is ultimately adopted to run the emerging research data repository at the University of Southampton. There are repository-specific factors that will determine that, but there are other organisational factors to take into account as well. Institutions seeking to build research data repositories that are clearly focussed on this range of factors are likely to have most success in implementing a repository to attract data deposit and usage.

This post has covered just one presentation, from DataPool, at RDMF9. The following two blog reports give a wider flavour of the event, the first exploring the architectural issues raised.

Julie Allinson, Some initial thoughts about RDM infrastructure @ York: “I’ll certainly carry on working up my architecture diagram, and will be drawing on the data coming out of our RDM interviews and survey to help flesh out the scenarios we need to support. But what I feel encouraged and even a little bit excited by is the comment by Kevin Ashley at the end of the RDMF9 event: that two years ago everyone was talking about the problem, and now people are coming up with solutions.”

Carlos Silva, RDMF9: Shaping the infrastructure, 14-15 November 2012: “Overall it was a good workshop which provided different points of view but at the same time made me realise that all the institutions are facing similar issues. IT departments will need to work more closely with other departments, and in particular the Library and Research Office in order to secure funding and make sustainable decisions about software.”

Dec 7 2012

DataPool Steering Group, second meeting

Steve Hitchcock

Monday 12 November marked the start of a busy week for DataPool, being the date of the project’s second Steering Group meeting and leading towards a presentation at the 9th meeting of the DCC Research Data Management Forum. In other words, the project was to address two of its key audiences, and had to prepare appropriate documentation for the purpose. We are pleased to share the documentation, starting here with that presented to the Steering Group ahead of its meeting, complementing the record of the first Steering Group meeting.

Collected documents for 2nd Steering Group meeting

Agenda, Steering Group meeting, 12 November 2012
Minutes of previous Steering Group meeting, 31 May 2012
Progress Report by Wendy White, DataPool PI (corrected 20 November 2012)

Introduction to the Progress Report. At the last Steering Group there was a clear emphasis on the importance of supporting cultural change and identifying institutional benefits to improving research data management practice. Recent policy developments from funders have aligned parameters for the accessibility of research data with the strengthening of requirements for research publications. There is a focus on benefits-led activity, working with funders and other external bodies on developing an integrated approach to improving research data management practice. The mid-phase of the project has been informed by this context as we have made progress on the key strands of the project:

  • Developing and rolling out service and training models to work with researchers
  • Planning an evidence-based programme of support for professional services staff providing these services
  • Multidisciplinary engagements
  • Investigating requirements for data storage and archiving
  • Testing the SharePoint and EPrints data catalogue components

PGR Thesis Model: mapping support from start to award, a work-in-progress, particularly with regard to the role of data in the examiners’ process

Note, two documents provided to the Steering Group were from ongoing work and were for current information rather than this record. These were a draft training needs questionnaire aimed at research support staff, and an update report on a 3D data survey at the University of Southampton.

Among the many issues discussed at the meeting, one noteworthy topic was funding models to support a storage strategy, i.e. once the costs have been mapped, does the funding come from grant funding bid applications or from institutional support infrastructure funds? We are particularly grateful to our external (i.e. outside Southampton) steering group members for the additional perspectives they bring, in this case for the valuable insights on the storage funding issue from research councils and data archives.

Members of the steering group present at the meeting (University of Southampton unless otherwise indicated): Wendy White (Chair, DataPool PI and Head of Scholarly Communication), Philip Nelson (Pro-VC Research), Mark Brown (University Librarian), Helen Snaith (National Oceanography Centre Southampton), Mylene Ployart (Associate Director, Research and Innovation Services), Louise Corti (Associate Director, UK Data Archive), Oz Parchment (iSolutions), Les Carr (Electronics and Computer Science), Simon Cox (Engineering Sciences), Graeme Earl (Humanities), Jeremy Frey (Chemistry), Dorothy Byatt, Steve Hitchcock (DataPool Project Managers). Apologies from: Adam Wheeler (Provost and DVC), Graham Pryor (Associate Director, Digital Curation Centre), Sally Rumsey (Digital Collections Development Manager at The Bodleian Libraries, University of Oxford).

Nov 2 2012

Oh no, not another presentation!

Steve Hitchcock

Continuing, and concluding, our brief ‘oh no’ series of presentations by DataPool at the recent JISC MRD (#jiscmrd) programme update workshop held in Nottingham on 24-25 October.

Projects were invited to volunteer short 10-minute talks at the meeting to fit specified session themes. Given the tripartite approach of DataPool, shown in our ‘oh no’ poster, Wendy White chose to present Policy and Guidance on this occasion (noting that we will be covering the Data Repository aspects – the third tripartite element of the project’s work – at the forthcoming RDMF9 meeting).

Earlier in 2012 the University of Southampton approved a Research Data Management (RDM) policy (slide 2). Clearly it is not enough simply to announce a policy with far-reaching and long-term implications such as this. There has to be support for its implementation, and particularly for those it is aimed at, in this case the university’s researchers and producers of research data. The first step towards this is the RDM Web site (slide 3), with a collection of guidance and briefing notes on how to manage research data effectively, covering issues such as planning, description, sharing, access, storage, and more.

The presentation goes on to outline the principles that shape this guidance and its continuing development, and the contexts in which it is presented across the university.

In the #jiscmrd meeting as a whole there were so many presentations like this that it wasn’t possible for one person to attend them all. What you got therefore is a selective quickfire update on companion projects in the programme. Even if you couldn’t catch everything, you were certain to learn something.

Oh no, not another presentation! Why would we have thought that?

Oct 24 2012

Oh no, not another poster!

Steve Hitchcock

Back in the early days of the Web there were fears that content would lose value through ease of sharing, copying, and piracy. It was then suggested by John Perry Barlow, co-founder of the Electronic Frontier Foundation, a digital rights organisation, that value would instead accrue to services and performance. Since then we have seen, for example, the transformation of the economy of the music industry from recording to performance, and growth in performance art. The same idea underlies academic poster papers (minus the art in our case).

Posters are performance. Above is the DataPool poster for a meeting of the JISC Managing Research Data (MRD) programme. We can post it here without fear of diminishing its value (!) because the Web reader

  1. can’t appreciate the scale (although if you ‘View on Slideshare’ you can see a slightly larger full-screen version)
  2. doesn’t get the performance or the interaction

As you can see from the poster, even the version here, we’ve thrown everything at it from the DataPool project. While I tend to be fairly comfortable with narrative storytelling, I am less confident with visual storytelling, as you may, just, be able to tell. Among all the posters at the meeting, I wonder which aspect will win out and attract most viewers. That’s probably obvious – with posters the visual wins every time, but the key is turning that attention into dialogue and shared understanding.

The meeting at which the poster will be displayed, a mid-term progress workshop, is for JISC projects in the MRD programme and selected invitees. If you will be at the workshop on 24-25 October in Nottingham, we will see you by the DataPool poster where we will be on hand to explain the project’s progress, and our curious, although probably not unique, scatter art style.

Oh no, not another poster! Why would we have thought that?

Oct 23 2012

DataPool presents at SxSC Creative Digifest

Gareth Beale

I recently presented the DataPool project’s plans for 3D and imaging data management research at #SxSC2 Creative Digifest. The event (organised by the University of Southampton Digital Economy USRG) was held with the aim of better understanding the impact digital technologies have upon our lives. Participants from several institutions came together to talk about their work, but also to talk more generally about the impact of digital technology on communities and individuals. It was the perfect place to present, but also to reflect upon, our work with the DataPool project.

The 3D and imaging strands of the DataPool project, led by Steve Hitchcock and administered by Gareth Beale and Hembo Pagi respectively, aim to develop a better understanding of how 3D and imaging data are currently handled at the University of Southampton: how they are created, how they are shared, how they are archived, and what this means for research and research culture.

A diverse range of technical and theoretical work was presented at #SxSC2. The presentations served to highlight the highly innovative nature of contemporary research on digital themes, but they also placed repeated emphasis upon the need to understand how the growth of digital technology is affecting the way we live, think and work.

This need to understand the implications of digital technologies and to work in ways which are not only creative but also sustainable represents one impetus behind the DataPool project. It was fantastic to see so many people talking about how we manage our digital lives and to consider how different strategies might lead us in very different directions. It was important for the DataPool project to be at the centre of this discussion. We are left considering how some of the themes raised at the conference may relate to our digital working practice throughout the University.

Two of the talks which I found particularly interesting were those of Les Carr and Ramine Tinati, talking about the Web Observatory. The idea that the web is sufficiently complex and poorly understood that it requires observation, as we might observe a complex natural phenomenon, is highly significant in thinking about relatively small-scale data management on an institutional level. While we do not face many of the challenges faced by those seeking to understand the dynamics of an inherently social and dynamic global network, we must be aware that we are not simply looking at how people structure their files. As research culture becomes increasingly digital and connected, our data becomes socially significant. It will be very interesting to see, as we conduct our research, what the social landscape of Southampton’s 3D and imaging data looks like and whether as participants and observers we can develop a better understanding of the changes which are taking place.


Sep 27 2012

Surveying institutional data practices: the perils of ethics approval

Steve Hitchcock

“A discussion about the throttling of clinical knowledge exchange by well meaning but ill-informed ethics committees is a topic for another paper.” Goble et al., Accelerating scientists’ knowledge turns

Gareth Beale and Steve Hitchcock have been piloting a proposed study of three-dimensional data practices through institutional ethics clearance. Here they recount the experience so that others who are new to such processes – probably most of us – can save time, and some pain, and devote more energy to the study at hand.

Mesolithic stone tool captured in 3D by CT scanning. The growing importance of image and 3D data in all areas of research requires us to think deeply about how we treat these data.

So you want to find out how researchers at your institution generate and manage data. No problem. Set up a survey targeted at a well specified group. It will be fair and rigorous, and the contributed information will be handled carefully to ensure anonymity of data, so it won’t give secrets away or embarrass anyone. By the end we will have learned something that will shape our approach to data management and that we will share with the world by publishing the findings. Easy, done it before.

Well, things may not be quite as straightforward as you think. Current regulations require that all research, regardless of its nature, must be reviewed by an institutional ethics committee. Where humans are the subjects of the study this process can be lengthy and intricate. The need to guard against dubious research practice is a matter of great concern to all researchers. However, this post will argue that the system of ethics governance can cause delays and obstructions which threaten to hamper small ethically non-contentious research projects.

It’s not hard to imagine ethically dubious research practices. If you were the subject of a medical trial, say, you would want to know there was proper oversight to ensure full ethics compliance. In our particular case we are investigating practices in capturing and managing image and 3D data, beginning with 3D images. That does not immediately suggest big ethical dilemmas, but you might be surprised.

First, we are going to defend the ethics procedures here at the University of Southampton, even though hearts sank when we realised we would have to discover and learn this process. We have been helped through the process by numerous people intimately familiar with it. Once you understand the system it works well, and the process is logical and rigorous, from an ethics perspective. All submissions have been handled promptly, and responses are not obviously ‘ill-informed’. So what can go wrong?

  1. It can extend the timescale of your survey substantially. It might be first-time syndrome in our case, but the process began in mid-July and approval has only just been granted. At Southampton there are online forms and six Word document templates to complete, so plenty of scope to take some wrong directions.
  2. You can end up committing yourself to an unworkable design or plan for your survey.
  3. You can tie yourself in knots over issues such as anonymity and confidentiality, to the extent that your capacity to publish data may be restricted, and unless you are careful, you can forget about open data.
  4. After the project has ended and staff may have moved on, you may find no one is authorised to access the data. So much for data verification.

How do we deal with the problems presented here? First, realise that this ethics process is here to stay, so we need to get used to it. In that case we need to treat ethics as integral to the study and not an additional hurdle to be jumped and then forgotten. That means starting with the design and planning of the study, rather than with the ethics process. The design will inform the ethics. In that way you will get consistency and hope to avoid unintended commitments in the ethics submissions that will later restrict your study.

However simple you think your study may be, the ethics process will present you with unexpected questions and tricky dilemmas, particularly when it comes to data dissemination, which is at the heart of research data management. Tackle these honestly, and try to envisage the longer-term consequences. Often the simplest approach to ethics may be to limit and restrict studies, effectively to promise to do nothing with the data beyond your project or group. Ethics submission templates and questions may even be designed to lead you in this direction – easier to comply than confront these issues, especially if the ethics process is simply something you want to clear or avoid. Resist the temptation. Publication and open data are still possible and consistent with ethics clearance if you respect and present in your design the ethical principles of treating your subjects fairly.

There remains the issue of responsibility for confidential data. Survey results will typically contain some data that will remain confidential, notably identities of the subjects and how these might be linked to their contributions to the survey. It is important to remember that someone will have to be responsible for ensuring continued confidentiality as long as these data exist somewhere. As projects invariably come to an end and project staff may move on, this may not be as simple as it seems. Responsibility needs named individuals, and the means to authorise the passing of this responsibility to someone else. This is another process that projects will have to delegate effectively at conclusion, but first any studies subject to ethics approval have to specify names and a procedure that will enable this to happen.

The process of gaining ethical clearance to proceed with research is rigorous and has the capacity to shine a light on areas of your research that may not have seemed to be ethically significant when writing the proposal. However, the process is also time-consuming and perhaps unnecessarily complex where ethically uncontroversial projects are concerned.

We do not doubt that ethics clearance will ‘throttle’ some studies as Goble et al. suggest. Those studies with more difficult ethical issues to confront will find some ethics committees intractable, or researchers may be unwilling or unable to make the necessary compromises. Studies may be lost through the simple expedient of losing too much time on ethics clearance. We may have been lucky – we can still proceed despite the delay.

It’s not only the big issues that cost big time. Failure to align your marks perfectly in columns or rows of an MS Word table can cost you one extra round of the reviewing process, and a few more days. As can declaring that audio recordings will be made of interviews with subjects (a legitimate ethics issue, clearly), but failing to specify which medium (tape/digital/minidisc, etc.) will be used to record (less obviously a major ethics risk). Those extra days and rounds add up, as the documents circulate again between author, supervisor and reviewer. Don’t even think about going away!

As for our actual study, the investigation into 3D data portends some fascinating insights into a technology that is growing rapidly and is already sparking popular imagination:

Is this a common experience? Have others had similar experiences when confronting ethics processes for their research data surveys? Or are we at the precipice of change in the way we perform standard research surveys involving other people?

We were at once reassured, surprised and frustrated by our experiences with University of Southampton ethics governance. It was reassuring to observe the degree of attention that our research was receiving and heartening to receive detailed comments on how our research could be modified in order to conform to ethics guidelines. It would be worrying if research were not checked in this way. We were surprised by the complexity and intricate nature of the process to which our research was subjected, particularly given its relatively uncontroversial nature. The timescale over which the whole process took place was frustrating.

If ethics procedures are to be modified perhaps it should be to simplify and speed up the process, especially for those studies that might be quickly classified, through the process, as low risk. At Southampton there is such a classification, but that did not save us from an extended process.

Until that happens, or until all ethics submitters become more familiar and competent with the process, our experience may save others new to the process some time, and pain, so they have more energy to focus on their study.

Aug 2 2012

Southampton research data management policy: release and follow-up issues

Steve Hitchcock

The University of Southampton has a Research Data Management Policy, coordinated by the DataPool Project. Wendy White explains how the low-profile release of the policy is consistent with the project and the institution’s broader approach of iterative development towards managing research data, and identifies some issues that have emerged following the release.

Champagne remains on ice for the launch of the Southampton RDM policy (photo by James Cridland)

We have had our Research Data Management Policy and our core guidance available for a short while now from our new “one-stop-shop” web pages. We are taking stock of our approach so far and gathering early feedback from users of the site.

You may have missed our grand launch with full orchestra and smashing magnum of champagne. That’s because there wasn’t one – and not just because we can’t compete with the Olympics opening ceremony! The whole approach has been one of iterative development with staff from the academic community and services. Now the policy and guidance are widely available we will continue with this approach. This was influenced by hearing from colleagues working on earlier policy developments at Monash University and Purdue University, thanks to the JISC MRD International event held back in 2011. In the same spirit here are a few of our thoughts on our approach and emerging issues.

We hope that the integrated approach has helped:

  1. Develop institutional support that is complementary to funders’ guidance so researchers can meet local, institutional and discipline needs. If support feels fragmented and effort duplicated then we are not getting things right.
  2. Co-ordinate expertise across academic groups and services. The guidance has been produced with contributions from the Library, Research and Innovation Services, our IT service (iSolutions) and academic staff. Early queries have been about sharing knowledge, problem solving and agreeing approaches.
  3. Provide clear contacts and signposting. We have tried to provide contacts for specific services contextually within the guidance, whilst having a single overarching point of contact. So far we have been Googled, referred to by word-of-mouth and by e-mail alert. We don’t mind how we are discovered – we’re here to help!

Emerging issues include:

  • Linking our definition of research data in the policy to more discipline-specific guidance so researchers can determine what data are “significant” and therefore in scope for storage and archival retention requirements. The more queries we get and examples we work with the better as we explore these issues.
  • Translating roles and responsibilities over the longer term – from written nominees on a document to support for effective decision making. Addressing the key risks without creating a burdensome process.
  • Providing appropriate long-term archival and storage options. We have already undertaken cost modelling as part of the 10-year roadmap produced by the IDMB (Institutional Data Management Blueprint) project, DataPool’s predecessor RDM project at the University of Southampton, but it is clear that more granular disciplinary case studies with details of specific requirements are required.
  • Data Management Planning support is welcomed. We have early encouraging feedback that academic staff do value additional support for this as bids are considered.
  • Imaging requirements are important. Our Life Sciences Institute and Computationally Intensive Imaging multidisciplinary groups have growing data challenges. We will be engaging with some requirements gathering/case analysis over the next few months.
  • Visualising data will be an important component for some research, so we will need to further explore the link between archival storage, workflow access and data visualisation. As the number of datasets that are openly accessible rises, how many of these would benefit from enhanced institutional support for data visualisation options?

It does feel like we are still at the start of an Olympic challenge. We are aiming for gold. However, we are happy to get into training and start with a few personal bests along the way!