Jul 13 2012

DataPool Steering Group, first meeting

Steve Hitchcock

Before the UK took a break for the Diamond Jubilee weekend, DataPool had an important diary date of its own at the end of May, the first meeting of the project steering group, effectively ending phase 1 of the project.

The steering group includes senior managers and academics from the University of Southampton, and experts in running research data repositories elsewhere. This post collects and links to the documents and evidence that were circulated prior to the meeting or presented at it, and which informed discussion. We conclude by summarising and highlighting some of the main steers and outcomes of the meeting that will direct the project going forward to phase 2.

Collected documents for the Steering Group meeting

DataPool service model at the University of Southampton

A forthcoming report will give more detail on the SharePoint and EPrints developments. Similarly, another post here will consider progress with the institutional RDM policy and the accompanying guidance information.

Main steers and outcomes

So what did we learn from the meeting? Among such an eminent gathering and across a wide-ranging discussion it would be hard to represent all views in this short report. Important issues raised will be pursued in the project. To give an indication of some of those directions, here are just three of the more immediate actions identified by the project managers:

  1. There was endorsement for the concise policy guidance notes and iterative approach to engagement and evaluating progress. First guidance notes are now available, and the collection will be extended.
  2. The case study-based training guide was received enthusiastically and regarded as an ‘approach that could evolve incrementally’. Further case studies based on this model will be identified, and used for postgraduate training in more areas.
  3. More detailed disciplinary/multidisciplinary cost modelling case studies are needed to build evidence to support bids for significant institutional investment.

Overall, the meeting expressed a view that the project is working along the right lines, and it was interesting to note from our external advisers that in many cases we are dealing with similar issues to those faced by others.

We are grateful to the members of the steering group for their commitment and contributions. With their encouragement and direction DataPool is able to tackle the challenges ahead with conviction.

Members of the steering group present at the meeting (University of Southampton unless otherwise indicated): Mark Brown (Chair and University Librarian), Philip Nelson (Pro-VC Research), Adam Wheeler (Provost and DVC), Pete Hancock (iSolutions, Director), Helen Snaith (National Oceanography Centre Southampton), Mylene Ployart (Associate Director, Research and Innovation Services), Graham Pryor (Associate Director, Digital Curation Centre), Sally Rumsey (Digital Collections Development Manager at The Bodleian Libraries, University of Oxford), Louise Corti (Associate Director, UK Data Archive), Les Carr (Electronics and Computer Science), Simon Cox (Engineering Sciences), Graeme Earl (Humanities), Wendy White (Head of Scholarly Communication), Dorothy Byatt, Steve Hitchcock (DataPool Project Managers). Apologies from Jeremy Frey (Chemistry).


May 30 2012

Engaging with research data producers

Steve Hitchcock

Engaging, by bigvern

Institutional research data projects such as DataPool and others may focus on concrete outputs such as repositories or policies for research data. While these will be positive steps, ultimately these projects may be judged on the extent to which they can engage with research data producers and use that to inform the development of the outputs in the longer term. Here I simply want to connect two recent, and quite different, works that may help shape our thinking on engagement. This is a big topic that we will inevitably return to.

The two works I refer to above are:

  1. A report Introducing research data, which includes a number of research data case studies at the University of Southampton, and is used as a guide for research students. This was conceived and developed by colleagues in DataPool.
  2. A section (‘What about the data?’) from a blog post Latest progress for RD@Essex.

In the Southampton report the case studies are preceded by sections on data categorisation (where data comes from, forms of data, and how it might be represented electronically) and the data lifecycle, which are used as templates for the case studies. The Essex work has a short data classification, compiled using a DAF-like approach, DAF (Data Asset Framework) being a well used tool that can help with user engagement. There are almost certainly other similar examples.

Both approaches, while clearly not identical in scope or scale, have some points of alignment and begin to give us some insight into research data workflows. Workflows are the set of sequential processes by which data, in this case, and any intermediate forms and versions, are produced, stored and used. Most structured work involves workflow of some kind, but the workflow is typically of less interest when it is well established. Interest in workflow grows when the workflow is in flux, which we believe it may be in many cases for research data.

How does this help engagement with data producers? First we need to identify and contact as many research data producers as we can within our institutions. Then we have to demonstrate an understanding of how they produce and use data, and structure the engagement to get some insights into how we can help them manage their data more effectively given the emerging requirements, policies and practices that will affect research data. At the heart of this will be modelling workflows that produce research data.

These classifications, categorisations and case studies can help us model research data workflows, which we can then use in turn as a framework to guide further engagement to find out more – about how workflows might be changing, how to optimise data management through all workflow stages, and how new institutional research data services can assist.

We want to differentiate ‘engaging’ from ‘advocacy’. Where the extended infrastructure required for research data repositories (e.g. storage, deposit interfaces) is not yet in place, we have a chance to engage with potential users to find out what services they want rather than trying to sell them a set of services we already have.

How are other research data projects engaging with data producers? We believe we have a strong basis for further engagement based on the Southampton report, but we are keen to learn from other examples such as RD@Essex.


May 28 2012

What can research data repositories learn from open access? Part 2

Steve Hitchcock

Open access is finally attracting high-level attention from national governments, but full open access has been a long time arriving despite extensive funding, development and the commitment of many people. As much of that effort switches towards the implementation of repositories to store, share and publish the research data that informs publications, we are considering what lessons might be learned from open access repositories, so that the path to effective data repositories might be shorter and less fraught. In part 1 the factors considered included policy, infrastructure, workflow and curation. Here in part 2 we look at rights and user interfaces.

2500 Creative Commons Licenses


Rights

Since open access is indelibly associated with publication, one of the primary impediments to providing open access is transfer of rights to publishers, a practice that has failed to adapt to the digital switch. Research data is not so encumbered now, and with care data creators can deploy rights more effectively because they begin in the digital era.

It has often been argued that open access repositories failed to adopt or compete with Web 2.0 services. Quite what this means is not clear, but one aspect might be social and user engagement for the purpose of growing content. Well-known services that became associated with Web 2.0 are YouTube and Flickr, so the case might be that OA repositories were not as successful in attracting content as these services. There is one key point that differentiates these services from open access: prior to Web photo and video services, there were no simple publication outlets for this type of content for non-professional or non-broadcast works. In the case of open access there are pre-existing publications, such as journals and conference proceedings. Open access repositories do not seek to eliminate the journals, but to supplement the access they provide. There is thus another party with a vested interest in ownership of this content.

This is why open access can get mired in discussions about rights. Creative Commons (CC) licences were designed for content to be shared on the Web and communicate how creators are prepared to share their rights with users to open and extend use of their content. For research papers, however, since long before the Web, publishers have required a transfer of rights from the author in return for publication – hence the ownership issue. Unlike CC, these rights can be used to lock down access and reuse, for commercial purposes.

There is a form of open access, ‘gold’ OA journals (in BOAI this is complementary to ‘green’ OA repositories), which may be accompanied by release of commercial rights using CC but often at a cost for publication. In other words, publication is paid for financially rather than in a transfer of rights. Such journals present this as an advantage over non-OA journals and OA repositories, and this can be beneficial for text mining and other applications. While this form of OA publishing has been growing in recent years, it remains to be seen how quickly it can replace or adapt key high-impact journals, and at what cost.

Broadly, research data are not yet subject to publication rights. Publications are a highly processed form of research data, in the form of tables and graphs, for example. Typically the data targeted by data repositories precedes the refined and summarised publication versions, and is therefore not covered by the same rights transfer. That could change if expanded publications requiring data deposit or third-party service providers seek to obtain rights in return for these services.

Strictly, while institutions where research has been performed inherently own the rights to that work, they have been reluctant to exercise those rights in ways that would restrict a researcher’s choice of publication, or to require or even advise authors on some retention of rights or amendments to rights agreements. Unlike with peer reviewed papers, where precedent is more strongly established, it is possible that institutions will seek to impose more control of rights where research data is concerned. Recently reported cases show how a university’s allocation of control of rights within research teams, the special case at Purdue University notwithstanding, can have consequences for publication. What data creators and authors will be concerned about is whether the exercise of those rights by institutions is commensurate with the services that are provided in return. There may be resistance if established academic freedoms are constrained and research impact is reduced as a result, but with the right services and effective exercise of rights impact can be increased by sharing research data openly.

The lesson of open access is that rights matter, that the traditional all-rights transfer for academic publication is no longer appropriate for or conducive to fully exploiting new forms of digital dissemination, but also that established practices can be slow to change. Institutions and authors should be careful not to let rights to research data slip away as they did for publications in another era, but equally they must be careful to work together to use those rights in ways that maximise the benefits and impact for them and for research.

User interfaces

Users of repository services are both those who provide the data and those who consume it. The features that define and characterise repositories are the interfaces through which users can perform these actions, but are these interfaces flexible or adaptable enough to serve all those who might want to use repositories for publications or data?

Within this analysis (including part 1) it has been suggested that OA repositories may have overlooked workflow, and Web 2.0 developments with regard to content growth, services and engagement with users. In fact, some helpful developments can be found buried deep within repository software, but to see where these might impact users more directly we have to look away from the familiar repository interfaces. This critical development is called SWORD (Simple Web service Offering Repository Deposit), and it will impact on data repositories as well, in ways that we have not yet seen implemented on a large scale, even for OA repositories.

As the name indicates, SWORD is focussed on one of the actions that a repository supports, deposit, that is, getting new content into a repository or updating content, this updating feature recently becoming available with SWORD version 2.

SWORD frees the user deposit interface from the repository software and the specific instance of a repository. As the number and types of repositories have grown, some authors may wish to deposit in more than one place. SWORD can help with that. If the repository deposit interface demands too many keystrokes (metadata), or does not allow all the metadata you want to record (too few keystrokes), SWORD can help there as well.

The deposit still needs to reach a repository (‘endpoint’) so SWORD and repository softwares are working together on this, not competing. All major repository softwares support SWORD, and the most recent releases support SWORD v2. What’s needed are more SWORD client interfaces, as there have been relatively few examples to date.
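To make the deposit step concrete, here is a minimal sketch of the HTTP request a SWORD v2 client might build for a single binary file. The collection URL, file name and helper function are hypothetical, and the request is only constructed, not sent. A real client would first read the repository's service document to find the collection and would need to authenticate.

```python
import hashlib
import urllib.request

# Hypothetical collection endpoint, for illustration only; a real one is
# advertised by the repository's SWORD service document.
COLLECTION_URL = "https://repository.example.org/sword2/collection/data"


def build_deposit_request(filename: str, payload: bytes,
                          in_progress: bool = False) -> urllib.request.Request:
    """Build (but do not send) a SWORD v2 binary file deposit request."""
    headers = {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": f"attachment; filename={filename}",
        # Packaging identifier for a plain binary file in the SWORD v2 profile.
        "Packaging": "http://purl.org/net/sword/package/Binary",
        # "true" signals that the depositor intends to add more content later.
        "In-Progress": "true" if in_progress else "false",
        # MD5 checksum of the payload (hex digest here) so the server can
        # verify the transfer.
        "Content-MD5": hashlib.md5(payload).hexdigest(),
    }
    return urllib.request.Request(COLLECTION_URL, data=payload,
                                  headers=headers, method="POST")


request = build_deposit_request("results.csv", b"sample,value\na,1\n")
```

Because the client only needs an endpoint URL and these headers, the same function could in principle be pointed at more than one repository, which is the "deposit in many places" case described above.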

It is easy to see that data repositories can benefit from SWORD in the same way as open access – deposit in many places from a single interface. When it comes to scoping metadata within a deposit interface, given the wide disparity in describing different data types in different disciplines with metadata, SWORD begins to appear essential for data deposit. These are just the services we can anticipate now.

With SWORDv2 we can envisage taking deposit out of the forms-based deposit approach and into different applications. One that may work for data deposit is a DropBox-like application for file-based deposit. With this application ‘dropping’ a file to a specified directory in a file manager on a laptop, say, synchronises and copies subsequent versions of that file to a repository (Figure 3), or potentially to a remote storage service, which can be accessed by the user logging on to the storage site using any Web-connected device. Data can thus be accessed and shared, or published in open access repositories. Using SWORDv2, file manager-based services could be used for simple deposit of research data files in conjunction with storage services; with SWORD v2 these could also fulfil automated deposit cases.

Figure 3. Dragging an image copies it to the selected repository
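The 'drop a file in a folder' behaviour can be sketched as a periodic scan that picks out new or modified files for deposit. This is an illustrative sketch only, not the application shown in Figure 3: the function name and the use of SHA-256 digests to detect new versions are assumptions, and the deposit call itself is left out.

```python
import hashlib
from pathlib import Path


def changed_files(folder: Path, seen: dict[str, str]) -> list[Path]:
    """Return files in `folder` that are new or modified since the last scan.

    `seen` maps file names to the SHA-256 digest recorded at the previous
    scan and is updated in place.
    """
    to_deposit = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue  # this sketch ignores sub-directories
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen.get(path.name) != digest:
            seen[path.name] = digest
            to_deposit.append(path)
    return to_deposit
```

Each path returned would then be handed to a deposit client; a version that is already recorded in `seen` is not returned again until the file's content changes.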

The DataFlow workflow illustrated in part 1 uses SWORD as the transfer mechanism between the user’s local storage and the curated institutional storage, in essence using it to capture additional metadata.

Another demonstrated application of a SWORDv2-based interface works within desktop authoring tools, such as a word processor or other office applications.

What these applications portend is that data repositories can fill the workflow gap that we recognised in open access repositories, a gap that looks to be potentially more complex for data repositories. We can begin to support deposit of data to a schedule that need not be based on the same frequency and mode as publication but is more flexible. As well as needing more SWORD client interfaces, however, another open question is how repository softwares designed for publication can adapt to support two different paradigms: managed storage as well as publication.

There are only two reasons for data creators to deposit in data repositories: they want to (share, publish, good academic practice, etc.), or they have to (policy). By focussing on services that are adaptable enough to serve users, building on SWORD to support flexible workflow and bringing deposit into automated or even more creative applications, research data repositories have the chance to support both motivations, instead of being left to emphasise policy as the primary motivator, as has happened for open access repositories.

Summary

Establishing and growing open access content is taking longer and proving harder than originally anticipated back in 2000. As we consider how to extend open access repositories to manage research data, are we learning the right lessons from open access? Have we covered all the important issues, or are we missing key factors? Research data repositories bring challenges that are distinct from open access. What are the new challenges, and which of these will have most impact on the success of research data repositories?

In this analysis the factors we have considered include policy, infrastructure, workflow, curation, rights and user interfaces. We haven’t covered preservation, but digital preservation is served by a comprehensive selection of tools that can be applied to repositories, and one lesson seems to be that repositories will move to be preservation-ready when content volumes and risk-analysis demand.

Open access began with the principle that it is good for researchers to share findings, and that digital networks enable that to happen more widely and at lower cost, ultimately free to users. It was anticipated that users would want to take advantage of this, as physicists already did with arXiv, and when this model failed to take off to the same degree in other disciplines, eventually institutional repositories emerged to encourage further growth of open access. As that growth appeared to hit a ceiling, research funders and institutions began to step in with open access policy. In other words, principle – whichever principle you prefer, returns to taxpayers, for example, or productivity of research, or escalating journal costs – was used to justify and frame policy for users. Users themselves, so it seems based on unmandated rates of open access deposit, have been less keen to put principle into practice.

In hindsight there are lessons that could have been learned to speed up the process. Progress with data repositories need not suffer the same mistakes or the same delays. Data repositories might occupy a more pragmatic, less emotional space than open access. Unlike for open access there is no single or easily defined target for research data repositories – what is data? continues to be a perennial question – so policy and requirements might be broader. Perhaps this time content deposited in data repositories can be driven by services that attract users, as well as by policy. In this case, the aim of data repositories must be to find those users who want these services, and then to make those services work better for them.


May 24 2012

What can research data repositories learn from open access? Part 1

Steve Hitchcock

Institutional research data repositories follow in the wake of the widespread adoption of open access repositories across UK institutions during the last decade. What can these new repositories learn from the experiences of open access, and what pointers can we find for the development of data repositories? In the first part of this post we will consider factors such as policy, infrastructure, workflow and curation. In part 2 we will extend the analysis to rights and user interfaces.

It may be a timely moment to reflect. A recent speech by the UK government’s science minister David Willetts prompted renewed excitement over open access, with a forthcoming report to advise on specific actions to be taken to realise more open access. Less remarked on, apart from comment about the undefined but potentially high-profile role of Wikipedia founder Jimmy Wales, was the bigger picture view that anticipates stronger integration and linking between research publications, research information for reporting and assessment, and research data for data mining but also for research testing and validation.

Open access (OA) repositories, which principally provide free access to an author’s version of published research papers, effectively began with the physics arXiv in 1991. Institutional repositories, which switch the focus of coverage from the subject to the place of authorship, emerged in 2001 following the Open Archives Initiative (OAI). To complete the record, the term ‘open access’ was defined by the Budapest Open Access Initiative (BOAI) in 2002.

So institutional OA repositories have up to a decade head start on proposed institutional research data repositories. The University of Southampton, home of the DataPool project, has hosted a leading OA repository since 2005, so the project team has long experience of running a repository.

As with OA repositories, there are plenty of examples of subject-focussed research data repositories, but here we focus on factors affecting institutional repositories (IRs).

Policy

For OA IRs, technology and infrastructure preceded policy. First impressions are that for data IRs this will be the other way round. As with OA, data policies in the UK are being driven both by research funders and institutions.

OA policies focus on the need to expand full-text content collections held in repositories and typically require (mandate) or encourage authors to deposit versions of their published papers. The first university-wide mandatory OA policy was implemented at Queensland University of Technology in Australia, in 2004, according to the site EnablingOpenScholarship. This site also shows graphically how the number of institutional policies began to accelerate from the first quarter of 2009, some 5 years or so since the growth of IRs saw similar acceleration, although it remains a minority of institutions that have such policies. It has been calculated that OA mandate policies can increase deposit rates to above 60% of eligible papers from the average of 20%. In this respect, the lack of a suitable policy could be seen to hinder an institutional OA repository.

Emerging UK institutional data policies by comparison have focussed on requiring researchers to create data management plans and data records, and emphasise sustainable practices in managing and storing data for the purpose of access, stopping short of requiring open access or of institutional deposit of actual data that would then need to be supported by the institution. This might be because institutions have still to calculate and cost the storage infrastructure needed, whether managed locally or in the ‘cloud’, because institutions are unclear what value they can bring to data management – or even where the value is in the data they seek to help support, or because there is not yet any consensus on whether data repositories should be subject-based, or institutional, an issue which OA repositories have still not fully resolved. Institutional data policies have in turn been driven and directed by research funders’ data policies, principally RCUK and EPSRC (Jones 2012) setting principles and expectations of institutional compliance within a specified timescale (for EPSRC, by 2015).

Data policies may benefit from being instituted ahead of developing infrastructure for collecting, managing and presenting data. However, the few early policies available suggest little common purpose – we are clearly some way from having a best-practice data policy template for others to follow, as has evolved for OA repositories. To serve even the limited requirements of these early policies, institutions will need to connect decisions on infrastructure and understand patterns of workflow that produce research data, as we shall see below.

Infrastructure

By infrastructure we mean the technical capability to support distinctive requirements. While OA repository infrastructure is well established, it has not had to tackle the challenge of large-scale storage that is likely to characterise data repositories.

The essential infrastructure that led to OA repositories was put in place by OAI: this was a protocol for metadata harvesting, OAI-PMH. This allowed individual repositories to be viewed collectively through services based on OAI-PMH, search being the most prominent of these at a time when Google was new and relatively little known. Immediately, software emerged for setting up institutional repositories, first EPrints and later DSpace and others. These repository softwares now also bring a range of integral services established over a decade that can be utilised to manage a range of data types, including research data.
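As a rough illustration of how a harvesting service addresses a repository, the sketch below builds an OAI-PMH `ListRecords` request for unqualified Dublin Core metadata. The base URL is hypothetical and nothing is fetched; a real harvester would also have to follow resumption tokens and handle errors.

```python
from urllib.parse import urlencode

# Hypothetical repository base URL, for illustration only.
BASE_URL = "https://repository.example.org/cgi/oai2"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


url = list_records_url(BASE_URL)
```

A service that issues the same request to many repositories and merges the responses is what lets them be "viewed collectively", as described above.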

Hence this same infrastructure, with modification, is being used to serve data repositories. There is, however, one new infrastructure component that data repositories will need to introduce – large-scale data storage. While content volumes for OA repositories do not test conventional storage systems, data repositories will inevitably provide much bigger challenges to storage and curation. To get a sense of the scale of the problem, Figure 1 compares data volumes at different stages, and is taken from a presentation about scoping curation for digital repositories. It is notable that data generation volumes cannot be visualised on the same scale as the other stages, since these are orders of magnitude larger. We might call this the data curation gap. Rosenthal has recently questioned assumptions that all data generated might be kept ‘forever’, indicating the need to fill the curation gap: “Assuming (data) growth continues, endowing 2012’s data will consume 19% of Gross World Product (GWP). On these trends, endowing 2018’s data will consume more than the entire GWP for the year.”


Figure 1. Comparing data volumes at different stages - generation, repository storage and archiving
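The two figures quoted from Rosenthal above imply a steep growth rate. As a back-of-envelope check, assuming constant compound growth between the two years (my simplification, not necessarily Rosenthal's model), the share of GWP needed to endow a year's data would have to grow as follows:

```python
# Shares of Gross World Product quoted from Rosenthal above.
share_2012 = 0.19  # endowing 2012's data
share_2018 = 1.00  # lower bound: "more than the entire GWP"
years = 2018 - 2012

# Constant annual growth factor that takes the share from 19% to 100%.
annual_growth = (share_2018 / share_2012) ** (1 / years)
print(f"Implied growth in share of GWP: at least {annual_growth - 1:.0%} per year")
```

On that assumption the cost of endowing each year's data would be rising by a little over 30% a year relative to GWP, which gives a sense of why keeping everything 'forever' is open to question.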

Institutions appear to have two choices to serve this level of storage: locally managed, or remote storage in the cloud. It is likely there will be a preference or a requirement to exert institutional control over storage (for example, at the University of Brighton: “we currently have a policy of not hosting staff data outside of the institution”), even in the case of cloud storage, hence developments such as the JISC UMF Cloud Pilot managed by Eduserv.

They could instead opt to advise researchers and data producers on selecting their own storage, from data archiving services such as UK Data Archive and the Archaeology Data Service, or data publication repositories such as Figshare, Dryad and other data repositories listed by DataCite, or even commercial cloud storage services (although a colleague noted that risk-averse advice might wish to start with where not to store data). Apart from the data archiving services, it remains to be seen whether these repositories can provide resilient, cost-effective, sustainable storage over an extended period, where content can be shared collaboratively during development and later made open access.

Workflow

OA repositories were designed from the outset for a publication mode of delivery that does not attempt to capture and support earlier phases in the workflow of writing a research paper. Given the more complex workflow (or life cycle) of research data, and the need to capture data at different stages of production and processing, the single publication mode may be inadequate for data repositories.

As the Web gained popularity in the mid-1990s all sorts of content began to appear, including digital versions of research papers published in what were then still largely paper journals. Authors were simply loading digital versions on to Web servers wherever these happened to be available, usually within their institutions, whether these servers were provided for this purpose or not.

OA institutional repositories served a simple purpose – to provide these authors with a more reliable, managed, services-based Web server to provide access to this digital content over a long timeframe. In this respect the designers behind these repositories over-estimated the number of authors that would use such services and the number of papers that would appear in repositories. Further, because the target content was papers due for peer-reviewed publication, the concept of workflow was barely considered beyond the expectation that the process of repository deposit would happen at the completion of writing the paper and in parallel with submission for formal publication. Thus OA repositories were designed for a one-stage deposit workflow, and no prior contact with authors while a paper was in preparation.

It has been suggested that, by failing to engage authors at a sufficiently early stage and not providing support services for writing papers, OA repositories have lost out to the more established process at the completion of a paper – publication. Further, by the time IRs were widespread, most journals were producing digital versions, so that was no longer a factor for authors posting Web copies of their papers, even if those journal digital versions still mostly stood behind subscription barriers.

While it is in principle simple to upload a completed paper from a local file store to a repository, it has been argued that a restraint to this happening is the requirement by many repositories for extensive accompanying ‘keystrokes’ or metadata. Competition with publishers for keystrokes at the point of completion and submission, lack of clarity in the benefits of OA repositories, and the failure to integrate with workflow may have been factors in preventing OA repositories from growing content to the levels anticipated, and led directly to the mandate policies described above.

The lesson for data repositories is clear: to capture content from data creators you must provide useful services that will become an integral part of the workflow of creating the data. It will not work to isolate particular requirements, such as records creation, from other needs such as storage services. Data does not appear with the same mode and frequency as published papers, so workflow must accommodate many different patterns. Research data is often produced by machines, so deposit workflow must allow scope for non-manual intervention.

While workflow involved in the production of research data is more complex and less easy to classify than for OA publications, one helpful representation of this workflow has been illustrated by the University of Oxford (Figure 2). This shows how a project begins with a bid for funding and in future will invariably be accompanied by a data management plan (DMP), a data roadmap for the project to follow. If a workflow begins with a successful proposal and a DMP, it will lead to data and, increasingly, from policy or from users, a requirement for managed data storage with the ability to support controlled access for collaboration, and discovery for wider access. Figure 2 is taken from this presentation by the DataFlow Project.

Research Data Management Interventions at Oxford University

Figure 2. Representation of a research data management workflow, from the University of Oxford

Effective institutional data services will need to span this whole workflow and engage data creators at all stages. Lessons on workflow from open access suggest that, for research data, providing separate services for creating data records and for storage, for example, will be insufficient. Data creators and authors will not engage in processes that do not enhance their work.

Curation

Digital curation is defined by Wikipedia as the “selection, preservation, maintenance, collection and archiving of digital assets”. For open access, selection is pre-determined – the target content is peer reviewed and published research papers. Further selection of such content for curation purposes is not merited by the data volumes involved. As we have seen with the ‘curation gap’ above, this does not hold for research data. As a result, more attention will need to be paid to curation for research data, and the line between simple user-managed storage and assisted curation will need to be more flexible.

Where that line might be drawn is thus open to question. It is drawn in principle by the strategy exemplified by DataFlow in Figure 2, which has two stages representing user-managed workspace and storage (the stage to Local Storage & Retrieval in Figure 2), with a transition to an institutionally curated space (Institutional Storage, or DataBank in Oxford’s system). The question remains as to what drives that transition. Such spaces are likely to have different curation criteria in different institutions, and will need to take account of researcher, policy and publication requirements, as well as costs.

An example of research data management that has optimised workflow, metadata collection and records creation, data curation, aggregation, discovery and access is eCrystals at Southampton, now extended to a federation led by the UK National Crystallography Service.

Interim summary

There are more lessons to learn from experience with open access that we can apply to research data repositories. In part 2 we will extend the analysis to rights and user interfaces.


Mar 29 2012

DataPool: presented, tweeted, blogged

Steve Hitchcock

Computer Applications and Quantitative Methods in Archaeology (CAA) 2012 conference, hosted by the Archaeological Computing Research Group in the Faculty of Humanities at the University of Southampton on 26-30 March 2012

How do you give a conference presentation when the laptop with the presentation on it dies an hour before you are due to speak? You tweet it.

Graeme Earl is co-investigator with the DataPool Project. He is also a senior lecturer in archaeology at the University of Southampton and organiser of the Computer Applications and Quantitative Methods in Archaeology 2012 (CAA2012) conference being held in Southampton this week (26-30 March). So this is an especially busy time for Graeme, yet he still wanted to give a presentation on DataPool to his own research community.

Why choose the novel means of presenting via Twitter? Graeme explains: “I decided at lunchtime that I would give the paper via twitter, and upload slides as an accompaniment. An hour before the paper my laptop died catastrophically and, irony of ironies, my presentation materials were on my laptop rather than on a network location. So I assembled the presentation as links in 30 minutes and then delivered it.”

In case Graeme hasn’t time to blog his presentation as well, we’ll do it for him. Twitter is intended to be an immediate service so retrieval can get harder over time. You may be able to find the original tweets by searching for Graeme’s username or for the hashtags he used. To avoid repetition these have been removed from the tweets and are copied immediately below. There is also some brief annotation of links between tweets to assist readers. Otherwise, tweets are as Graeme’s originals. For reference, the presentation was given around 5 pm on Wednesday 28th March.

@GraemeEarl #caasoton #datapool #jisc

> Starting my tweeted paper on #datapool now #caasoton

> http://t.co/EX7ZgsGj Managing Research Data

Report on Developing Institutional Research Data Management Policies, a JISC Managing Research Data (MRD) Programme meeting held in Leeds on 12-13 March.

> http://t.co/fed0BzoE

This blog.

> http://t.co/Rkff1pVq Research data management infrastructure

Research data management infrastructure projects (RDMI), Web page on the first phase of the JISC MRD programme.

> http://t.co/zMva6Lx5 IDMB

JISC project page for IDMB: Institutional data management blueprint, predecessor project to DataPool.

> Creating a system – sharepoint, repository, metadata http://t.co/7TydrHQJ

DataPool poster paper, on Graeme’s Slideshare account.

> Rolling out a policy – ratified, embedded, implemented

> Producing examples – discipline, re-use case studies, domains e.g. imaging

> Developing skills – training staff and students; ‘help desk’

> Sharepoint infrastructure provides data access and collaboration

> University deep storage repository + connection to others e.g. via SWORD2

> ADS SWORD ARM project http://t.co/VzGKCGkU

JISC project page for SWORD-ARM: SWORD & Archaeological Research data Management.

> ADS page for SWORD ARM facilitating deposit from outside to ADS repository http://t.co/TEJ7HSz1

SWORD-ARM blog.

> Middle layer of metadata management – initially project/sub-project/item hierarchy

> Publication – push to and pull from external repositories e.g. ADS; policy implications for this?

> Provide external access to cache and deep storage versions

> Demonstration repository; trialling with https://t.co/uMr69v5x

Portus Project, Digital Humanities, University of Southampton.

> Presented at Soton Research and Enterprise Advisory Group (REAG) http://t.co/17GepTLs

A project for the research life cycle? DataPool blog post, 8 March 2012.

> Ratification by Soton senate; included user guides also clarify uncertainties

> Defining core focus areas e.g. USRG Imaging http://t.co/82KfVJrd

Computationally Intensive Imaging, University Strategic Research Groups (USRGs), University of Southampton.

> Building network of experts and interested people http://t.co/TvnF4p0r

Data system, policy, training: putting people first, DataPool blog post, December 8th, 2011.

> Defining internal dissemination mechanisms e.g. USRG DE https://t.co/TxoXWuXY

Digital Economy USRG, University of Southampton.

> data management plans presented to other JISC projects http://t.co/Kth6o8EL

Data management plans (DMPs): the day has arrived, DataPool blog post, 22 March 2012.

> Details of meeting disciplinary challenges in research data management planning workshop http://t.co/ELd7hEvg

Agenda for JISC workshop on Meeting (Disciplinary) Challenges in Research Data Management Planning held in London on 23 March.

> Finished. Taking questions.

> @PatHadley thankfully I had a helper to advance them for me!

That’s it: presented, tweeted, now blogged.


Mar 22 2012

Data management plans (DMPs): the day has arrived

Steve Hitchcock

Changed Days at Paddington Basin, Colin Smith, Geograph Project

Updated 28 March 2012

The day has arrived for data management plans (DMPs). It’s tomorrow (Friday 23rd March 2012) when Research Data Management Planning Projects from the JISCMRD (2011-13) programme convene a workshop in Paddington, London, to present their findings and results. But has the day for DMPs arrived in a bigger sense? Are DMPs pivotal to research data management? I suspect so, and at the meeting I will be looking for evidence to support the assertion, or not.

DMPs are the link between the conception and proposal of research projects, and the later production of data from those projects. These plans can be extensive and demanding to produce, but as a result the information they contain should be invaluable to data repositories. This is not the type of information the researcher is likely to provide again at the point of depositing data in an institutional data repository.

DMPs represent carefully planned information on the project and predict the existence of data, in some cases precisely. This creates a link with emerging research data policy, which requires an open record of data produced in the course of funded research and the effective management and storage of that data. DMPs have a role to play in monitoring and ensuring the completeness of the records.

This approach raises a series of questions about DMPs. What is the scope of a DMP, and who defines this? This is most likely to be the research funder, but might be institutions in other, non-funded cases. In what form will the DMP be completed? Presumably online. Where will online DMPs be hosted? The Digital Curation Centre, not a research funder, hosts the DMP Online tool. Should institutions create and/or host DMP tools? To what extent will it be possible to (pre-) populate data repository records from DMPs? How comfortable are funders, institutions and researchers about sharing and publishing information from DMPs? The answers will involve specifying where DMPs fit the researcher’s workflow, ensuring there is no duplication of effort, and allowing DMPs to be driven by the needs of research and researchers, not by systems requirements or other special pleading.
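One of these questions, pre-populating repository records from DMPs, can be sketched in a few lines. This is only an illustration under stated assumptions: the `DataManagementPlan` fields and the list of deposit-time fields are hypothetical, and real DMP templates vary by funder.

```python
from dataclasses import dataclass, field


@dataclass
class DataManagementPlan:
    # Hypothetical fields; real DMP templates vary by funder.
    project_title: str
    principal_investigator: str
    funder: str
    expected_datasets: list[str] = field(default_factory=list)


# Fields a researcher would still supply at deposit (illustrative only).
DEPOSIT_ONLY_FIELDS = ("file_location", "date_created", "access_conditions")


def prepopulate_records(plan: DataManagementPlan) -> list[dict]:
    """Create one draft repository record per dataset predicted by the plan."""
    records = []
    for name in plan.expected_datasets:
        record = {
            "dataset": name,
            "project": plan.project_title,
            "creator": plan.principal_investigator,
            "funder": plan.funder,
        }
        # Mark what remains to be filled in, so nothing is asked for twice.
        record["to_complete"] = list(DEPOSIT_ONLY_FIELDS)
        records.append(record)
    return records
```

Draft records of this kind would also give a simple check on completeness: any dataset predicted by the plan that never acquires its remaining fields is one the repository has not yet received.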

These are some of the issues that will be in my mind when listening to the DMP project presentations tomorrow, and which I shall report on afterwards.

Update. Following the JISC DMP meeting in Paddington, Meeting (Disciplinary) Challenges in Research Data Management Planning Workshop, and further feedback from key presenters, some of the questions I posed can begin to be answered. For me the key presentations in this context were by Kerry Miller and Adrian Richardson from DCC, who updated the meeting on future plans for the DMP Online tool, and by David Shotton, a zoology researcher at Oxford University who has been compiling a more researcher-friendly set of 20 DMP questions.

First, we were shown how DMP Online v3.0 now includes selectable templates for e.g. different funding council requirements, so we can see how customisation of DMP input forms is beginning to take shape. We can also see from the plans for DMP Online that one possibility being considered for the tool is ‘Ability to host locally within institutions’. This was my choice in the selection exercise, but it appears others did not rank this feature so highly. My recollection is that this group exercise was somewhat curtailed, and the tied rankings for many features suggest the returns were not high, so I hope the development team will not feel bound to the ranking of these results.

Now to David Shotton’s analysis of a short, customised set of DMP questions. If we are to host locally and customise DMP tools, we need to be careful we do not get away from the core requirements of the forms, which are not simply to suit individuals or institutions. They still have to be grounded in formal funder and research requirements. So having framed his 20 DMP questions, David looked into comparing and aligning his questions with known sets of DMP questions, including from DMP Online and the US equivalent DMP Tool, and others. To make this alignment David created a downloadable spreadsheet containing the aligned DMP questions, which can be found in a link towards the end of the blog post. An analysis of this comparison is provided in a follow-up post, also linked.

That’s enough for this update. It’s time to look at David Shotton’s analysis and at the plans for DMP Online in more detail. I expect to return to this. I can’t yet answer my later questions on whether researchers and funders will be happy with the approaches being considered here, nor how soon this work might come to fruition by integrating DMP tools in data repository-based research workflow. Even to suggest this phrase leaves me open to the accusation of over-egging this particular pudding. What I can say is that answers to my first questions, on customisation and hosting, have begun to be revealed and they are highly encouraging.


Mar 8 2012

A project for the research life cycle?

Dorothy Byatt

How do we view a project like DataPool?  What are we hoping to achieve?  These are important questions that need to be kept in mind throughout the life of the project.  It can be easy to become focused on specific tasks.  Projects can be seen as simply “a project” with a full stop and an end, but DataPool is more than “just” a project testing ideas and systems.  It will do that, but we hope that it will do much more.  DataPool is about beginning the process of embedding the management of research data into the infrastructure and culture of our institution.  DataPool is here to make a difference and to make it throughout the research life cycle, from proposal to storing and sharing.

Bedrock

Pozo de las animas by Alejandro Colombo CC BY-NC-SA 2.0

The bedrock underpinning the project will be a Research Data Management Policy for the University.  This will be key to all the other work and will inform the related guidance and training requirements.  Its development is being seen as an iterative process with views of the academic community initially being gathered through designated “data” contacts within the Faculties.  The policy will be valuable in informing data management processes in the University, will influence plans where required by funders and will be a significant benefit arising from the DataPool project.  We would hope by the end of the project to see an increased number of references to the policy within research proposals, resulting over time in an increased number of datasets held securely and in a location that makes them available for re-use.

Infrastructure

The increased focus on research data, its management, storage and sharing requires that the systems offered within the University of Southampton are adapted and developed so that they can meet this need.  DataPool will be of benefit to this process.  DataPool will work to inform the decisions concerning the technical infrastructure of the institution to provide a simple deposit system that will also facilitate sharing at the appropriate time and under approved conditions.  This will be geared towards the individual researcher, influenced by case studies and discipline exemplars, with the aim of seeing how best it can support the research data workflow and capture metadata from existing University systems.  By the end of the project we would expect to have enhanced the storage and deposit options available, and seen an improved uptake of them.

Support

The start of the data life cycle is long before the creation of any data and really begins with the research proposal.  We plan to draw together a network of services that will support the researcher from proposal to deposit.  This will draw on existing services and expertise, both internal, such as our Research and Innovation Service, Doctoral Training Centres and Library, and external, such as the Digital Curation Centre.  We aim to create guidance sheets and training materials, and to offer workshops and a web site.  These will enhance the support that academic, professional and support staff can provide, whether for writing plans or advice on versioning through to different levels and types of metadata.  We would see the establishment of a central web site as an important step in this area.  The creation of this support will be a direct benefit arising from the DataPool project.


Feb 21 2012

Architecting research data management systems

Steve Hitchcock

There seemed to be general surprise following the revelation that Damien Hirst does not always ‘make’ his own works of art. Instead he leaves production to assistants based on his ideas and designs. In effect, he architects his art. Similarly, high profile architects like Norman Foster or Zaha Hadid are no less creative forces if they are not also builders.

In the rather different world of managing research data (MRD) we need systems to manage these data, but given the range of different types of data and emerging requirements of data producers, their institutions and users, we should be careful about simply adopting existing systems or even systems designs. The options are potentially wide, and complex. Instead of the systems engineer, the stage we are at needs the systems architect to take a high-level view of all the requirements to produce an elegant solution, fit for purpose and designed for the environment in which it is to be placed.

So I was interested to see the architecture for a research data repository at the University of Bristol illustrated by the JISC data.bris project. The accompanying description starts with front-ends and storage architectures and on the way refers to various technologies. The key feature, however, seems to be the recognition that this service must integrate with existing institutional information systems – not a unique view perhaps among current JISC projects, but one taken into account in the high-level architecture at the outset rather than as an afterthought.

Another of our companion JISC MRD projects, DataFlow at Oxford University, is developing a data deposit architecture that attracted my attention at the MRD programme launch meeting in Nottingham in December last year. This features a two-stage approach – DataStage and DataBank – that recognises different motivations for data deposit by researchers: 1, for storage and management (mimicking the popular Dropbox approach) prior to 2, data publication and access. The first is driven by researchers themselves, while the second may be more often driven by formal requirements by funders, institutions and policies. Broadly, these stages might offer a simple deposit interface and a more formally structured metadata collection interface, respectively (although I wait to see whether this is what DataFlow provides). The point about this approach is that it supports, and links, both motivations for data deposit.
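The general idea of a two-stage deposit, with light metadata for the working space and fuller metadata before publication, might be sketched as follows. This is my own illustration and not a description of how DataStage or DataBank are implemented; the field lists and the names `Dataset`, `missing_for` and `publish` are assumptions.

```python
from dataclasses import dataclass, field

# Illustrative field lists, not taken from DataStage or DataBank.
WORKSPACE_REQUIRED = {"filename", "owner"}
PUBLICATION_REQUIRED = WORKSPACE_REQUIRED | {"title", "description", "licence"}


@dataclass
class Dataset:
    metadata: dict = field(default_factory=dict)
    stage: str = "workspace"  # "workspace" or "published"


def missing_for(dataset: Dataset, required: set) -> list:
    """Return the required fields that are absent or empty, sorted for stable output."""
    return sorted(f for f in required if not dataset.metadata.get(f))


def publish(dataset: Dataset) -> Dataset:
    """Move a dataset to the second stage only when the fuller metadata is present."""
    missing = missing_for(dataset, PUBLICATION_REQUIRED)
    if missing:
        raise ValueError(f"cannot publish, missing: {', '.join(missing)}")
    dataset.stage = "published"
    return dataset
```

The design choice worth noting is that the first stage asks for almost nothing, so storage is attractive in its own right, while the extra demands arrive only at the point where a funder or institutional requirement gives the researcher a reason to meet them.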

What does DataPool offer in terms of a data system architecture? At the same Nottingham meeting the project presented a poster (pdf) including a diagram of a proposed system architecture. For those that saw it this graphic may have caught the eye for the splash of colour it brought to the poster, but the viewing context was not ideal for detailed information. It is worth reproducing that illustration here with some reflection on what it might offer to the general architectural principles we need to establish for research data systems.

Adapting the Southampton Microsoft Sharepoint 2010 system for data deposit

This diagram was produced by Peter Hancock, director of iSolutions, the University of Southampton’s ICT professional services department. Continuing development of this data architecture will remain an iSolutions responsibility in parallel with and beyond the DataPool Project. Where DataPool comes in is to seek to connect the data system approach with other institutional interests, notably researchers and users, through case studies and faculty contacts; training, through staff and graduate training centres; and policy, through the university’s research advisory and decision-making groups.

What follows are some thoughts on this architecture. First, as with DataFlow, this appears to be a two-stage architecture, in this case indicated as ‘Sharepoint’ and ‘Dropbox type infrastructure’. Actually, it would be hard to compare this too closely with DataFlow, or even Dropbox, without some illustration of the respective deposit interfaces, and that is for another post.

This figure omits to connect other deposit interfaces, which could be EPrints or SWORD for example, with the University Data Repository and storage service. Such interfaces might be produced by DataPool or others, rather than by iSolutions, but these will still need access to the underlying university data infrastructure.

Second, there are two access routes for users, which can broadly be defined as internal and external to the institution. As this is a service-oriented architecture this is inevitable and presupposes a privileged view for internal users. Whether this privilege extends beyond their own work is under discussion.

Finally, deposit is not restricted to the institution’s repository but allows data to be moved and copied between institutional and external disciplinary or subject-based data repositories. This might be accomplished via a service such as SWORD. Research data policies promote such options, providing chosen data storage services are reputable and appropriate, rather than specifying particular data repositories, and many researchers wish to exercise such choice.
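As a rough illustration of what moving a record between repositories might involve: as I understand it, SWORD v2 builds on the Atom Publishing Protocol, so dataset metadata can be packaged as an Atom entry carrying Dublin Core terms. The sketch below only builds such an entry; it makes no deposit, and the choice of fields and the function name `build_atom_entry` are assumptions. Which elements a given repository accepts would need checking against that repository.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_atom_entry(title: str, creator: str, abstract: str) -> bytes:
    """Serialise dataset metadata as an Atom entry with Dublin Core terms."""
    # Register readable prefixes so the output uses atom: and dcterms:.
    ET.register_namespace("atom", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}abstract").text = abstract
    return ET.tostring(entry, encoding="utf-8")
```

Because the same entry could be sent to an institutional or a disciplinary repository, a format of this kind is one way the researcher's choice of destination need not mean re-entering the metadata.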

All institutional research data services will need to make provision for extensive and expanding data storage. This is not elaborated in the figure, and strategy, infrastructure and costs for this continue to be discussed at a high level.

The point of an architecture is that it becomes a detailed plan for building, or in this case implementing a research data system. Prior to that, as a high level abstraction it serves as a platform for input for all interested stakeholders, from all perspectives. For DataPool those perspectives span the whole of the University of Southampton, and to capture those we have ensured we shall be working across faculties, with the faculty contacts, and with data producers through disciplinary exemplars and case studies. Ultimately, to make progress this has to be an iterative process of development and feedback, but central to this is the development of the architecture because that is what should reach out to most people.

At this conceptual level an institutional data management architecture will raise many questions. We may or may not need Sharepoint, EPrints, DSpace and other information systems solutions; what we need are systems to fit the architectural vision.


Dec 14 2011

Driving institutional research data policy

Steve Hitchcock

Porch/Pooch Policy at Powells

Institutional data policy is necessary as one of the drivers of changing practices towards research data across the institution. The role of data policy generally, according to Neylon, is to drive data availability, data management, and data archiving while stressing the importance of data as a core output of public research.

In our first post we identified DataPool’s three-pronged approach – system, policy, training – that we hope will enable us to develop and support a rich collection of research data emerging from the University of Southampton. Here we report on how the proposed research data policy is shaping at Southampton, and on progress piloting it through the research and senior management channels towards adoption.

Where we stand now with the policy at Southampton is that it was recently given a final-stage presentation to the university’s Research and Enterprise Advisory Group (REAG), which directed the policy to the University Executive Group (UEG). Ultimately, UEG can forward it to the university’s highest policy-making body, the Senate, perhaps by March 2012 if it goes well.

This rate of progress is due in part to the work of our predecessor Institutional Data Management Blueprint project, but credit is also due to the sterling work of Wendy White and Mark Brown at the head of the DataPool Project in piloting the draft policy through the advisory group and policy-making networks. Wendy and Mark are veterans of the university’s Open Access Policy, so they know their way around the networks of influence concerning the development of institutional repositories.

The policy comprises the policy document itself, supported by a series of user guides to smooth implementation. It would be premature to describe the specifics of the policy here, although broadly it covers a researcher’s responsibilities, IPR, storage, retention, disposal and access, as well as setting out contextual issues such as purpose, objectives, and definitions. My viewpoint on reading the draft policy is to anticipate how a researcher might respond to it in terms of clarity of actions, options and consequences. In this respect it is noticeable how much the policy has improved through review and iterations. Admittedly it may not attract the same level of excited publicity as, say, an open data policy, but the scope is wider and the purpose more pragmatic.

For an initiative of this scale we clearly do not expect the policy to be without issues when it comes to implementation, but the policy will give the DataPool Project the basis to investigate and resolve the issues, in terms of actions and answers. On the current schedule, there should be a year for the project to work with this.

There is little prior art on institutional data policy, and one of the reasons JISC has funded DataPool is not just to help produce a data policy, but to inform other institutions on implementation. Logged on the DCC page of UK Institutional data policies are currently just four examples, one of which is a ‘commitment’ rather than a policy, while others are in the early stages of implementation. Policy implementation, monitoring and ability to adapt are the real testing ground for this latest phase of research data management projects.

More, and somewhat better established, data policies can be found among the UK’s research funders, again as logged by DCC. These policies can be seen as context rather than competition for institutional data policies. One of the reasons managers of institutions might commit to research data policy is the set of requirements on their researchers that are embedded in the funder policies. For the institutions there is a need to support their researchers in complying with the policies, for no doubt there will in future be implications for research assessment processes. There is also the incentive of competition between institutions, and the scent of a leading edge in exploiting innovation driven by the profound changes in digital research data management. As Neylon says: “In the longer term, those who adopt more effective and efficient approaches will simply out compete those who do not or can not.” We will look in more detail at the funder policies and their implications for institutions in a later post.

One of the points of contention in emerging data policy is how to define the term ‘research data’. How can policy on this be effectively implemented unless everyone has the same understanding? This may be a semantic argument, but it must also be rooted in current practice by researchers, and also in how that practice is already being shaped by current policy, notably from the research funders. My simplistic take here is that researchers are finding their own preferred approaches to storing and managing early-stage research data, that is, data some way from publication. We might call this the Dropbox approach. Funder policy, on the other hand, tends to apply more to data that underpins publication, that is, it is concerned with the quality and reproducibility of results, the bedrock of scientific testability. If simple and unrepresentative, this view on the different motivations and practices for capturing both early and late-stage research data nevertheless seems to mirror the framework of our companion JISC DataFlow Project at the University of Oxford, as represented in its DataStage (a secure personalized ‘local’ file management environment for use at the research group level) and DataBank (an institutional-level research data repository allowing researchers to store, reference, manage and discover datasets) processes, respectively.

Seeing the Southampton policy develop through engagement with research, policy and legal experts on advisory groups, it is easy to anticipate it becoming a worthy policy exemplar for research data. It won’t be the last institutional research data management policy:

> @simonhodson99 By March 2013 all these #jiscmrd projects will develop research data management policies for their institutions http://t.co/gqzYf4pC #idcc11, 6 December 2011

Timing is key, and our aim is to bring forward policy ratification early in 2012 rather than by March 2013, the project end. It’s important to allow enough time to test the policy in practice. Given the scope of its intended coverage and the range of open questions posed by research data, it is possible the policy might contain unexpected holes or omissions that could limit uptake by both willing and unwilling researchers. Even when adopted – perhaps even more so when adopted – we have to be proactive and vigilant in monitoring how researchers respond to the institutional research data policy.


Dec 8 2011

Data system, policy, training: putting people first

Steve Hitchcock

To support research data management across a large multi-disciplinary institution such as the University of Southampton you need a collection, storage and archiving system, right? Yes, but you need more than that. The proposal for the DataPool Project reveals that it will tackle three distinct developments:

  • Research Data Management System Implementation
  • Research Data Management Policy Ratification and Implementation
  • Integrated Training, Guidance and Support for Researchers

So in addition to a system you need an institutional policy setting out requirements for participation by members of the institution, and training to help them do what the policy specifies using the system provided. We will return to these three developments often in this blog as the project builds, and the next posts will set out the details of where we start for each of these developments.

But there is another crucial element, and the clues are becoming clearer – that is, people. As one of the joint project managers for the DataPool Project, with my colleague Dorothy Byatt, I can see that the brief in the project proposal contains no magic bullet that will solve the challenge of rolling out digital data management to members and all disciplines across an institution, but it begins to set out a network of colleagues to help us achieve the goal.

Since we began the project Wendy White, co-investigator from the university library, has been extending this network beyond the co-investigators named in the proposal to encompass data contacts for all eight major faculties across the university, and leaders for a series of disciplinary case studies involving different data types produced by postgraduate and undergraduate students as well as researchers. We look forward to introducing data contacts and case study leaders as we report their work here.

We will, however, introduce our team of co-investigators now because they have shaped both the proposal and the prior project, the Institutional Data Management Blueprint (IDMB), that led to where we begin with DataPool. They are:

Mark Brown (Principal Investigator), Les Carr (computer science), Simon Cox (engineering sciences), Graeme Earl (archaeology), Jeremy Frey (chemistry) and Peter Hancock (iSolutions).

We hope they will have the opportunity during the project to introduce themselves as co-contributors to this blog.