May 20 2013

Research data cataloguing at Southampton using Microsoft SharePoint and EPrints

Steve Hitchcock

Motivated by the JISC RDM Programme, many UK research institutions are implementing research data repositories. A variety of repository platforms has been adopted, from original solutions such as DataFlow at Oxford to established digital repository software such as DSpace and EPrints. The University of Southampton is unusual in pursuing two repository options.

DataPool has been developing a data cataloguing facility based on Microsoft SharePoint, which provides a number of IT services at the university, and on EPrints repository software. An earlier presentation considered the development of SharePoint and EPrints as emerging research data repositories in the context of high-level ‘architectural’ and practical ‘engineering’ challenges. A new report describes further progress along both repository routes, notably collaboration with the JISC Research Data @Essex project on ReCollect, a standards-based data deposit application for EPrints.

The ReCollect plugin, or app, provides an EPrints repository with an expanded metadata profile for describing research data based on DataCite, INSPIRE and DDI standards. The metadata profile was designed by Research Data @Essex, and packaged as an EPrints app in collaboration with Patrick McSweeney at the University of Southampton. The resulting app first appeared on 27 March 2013 in the EPrints Bazaar, which advertises and distributes applications that can be installed in an EPrints repository with one click.

In 2007 the list of item types that can be deposited in EPrints was expanded to include Dataset and Experiment, to support the submission of research data. Selection of a data type presents the depositor with a series of pages and fields designed to be appropriate for describing that type. The order in which these pages and fields are presented defines the deposit ‘workflow’ for the data type, and is typically customised by institutions to their specific repository implementations.
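
To make this concrete, EPrints deposit workflows are defined in an XML configuration file (typically workflows/eprint/default.xml in the archive’s configuration directory). The fragment below is a hedged sketch rather than the configuration of any repository discussed here: the data_description stage and its fields are hypothetical, but the structure – a flow of stages, with fields grouped into components and stages conditionally included by item type – follows the standard EPrints 3 workflow format.

    <?xml version="1.0"?>
    <workflow xmlns="http://eprints.org/ep3/workflow"
              xmlns:epc="http://eprints.org/ep3/control">
      <flow>
        <stage ref="type"/>
        <stage ref="files"/>
        <stage ref="core"/>
        <!-- Hypothetical: show an extra page of fields only for datasets -->
        <epc:if test="type = 'dataset'">
          <stage ref="data_description"/>
        </epc:if>
      </flow>
      <!-- Hypothetical stage grouping data-specific fields -->
      <stage name="data_description">
        <component><field ref="data_type"/></component>
        <component><field ref="date_range"/></component>
        <component><field ref="geographic_cover" required="yes"/></component>
      </stage>
    </workflow>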

If workflow for different data types is provided directly by EPrints, and can be customised to local repository needs, what is the need for an app that implements data deposit workflow? First, it can simplify and speed up customisation if the desired workflow can be implemented from an app. Second, and more importantly for research data workflow, it can lead to greater standards compliance, consistency and collaboration between repositories.

Repositories can customise the data deposit workflow that ReCollect provides, starting from the profile designed by Research Data @Essex, without affecting standards compliance. The new report compares an example of customising the workflow for the ePrints Soton institutional repository with the original ReCollect profile.

A community of potential users for ReCollect, including the EPrints repositories at Glasgow University and Leeds University, has been established through webinar-based conference calls.

Further modifications to the Southampton workflow are likely, including a facility to automate the minting and embedding of DataCite DOIs, issued through the British Library and designed for data citation, for each data record.
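
As a rough sketch of what that automation involves: the British Library service is an allocating agent for DataCite, whose Metadata Store (MDS) exposes a simple HTTP API. The example below is illustrative only – the credentials, DOI, URL and metadata file are placeholders, and a production integration would run inside the repository rather than as a standalone script.

    # Hedged sketch: mint a DataCite DOI via the MDS API.
    # Credentials, prefix, DOI and URLs are placeholders.
    import requests

    MDS = "https://mds.datacite.org"
    AUTH = ("BL.EXAMPLE", "secret")       # datacentre credentials (placeholder)
    doi = "10.1234/example-dataset-1"     # placeholder DOI under the allocated prefix
    url = "https://eprints.example.ac.uk/12345/"  # landing page for the data record

    # Step 1: register a DataCite metadata record for the DOI
    metadata_xml = open("datacite_record.xml", "rb").read()
    r = requests.post(MDS + "/metadata", data=metadata_xml, auth=AUTH,
                      headers={"Content-Type": "application/xml;charset=UTF-8"})
    r.raise_for_status()

    # Step 2: mint the DOI by binding it to the landing-page URL
    body = "doi={0}\nurl={1}".format(doi, url)
    r = requests.post(MDS + "/doi", data=body, auth=AUTH,
                      headers={"Content-Type": "text/plain;charset=UTF-8"})
    r.raise_for_status()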

In the case of SharePoint, user interface forms for creating data records have been piloted and tested. The approach is distinctive in creating two linked forms, one to describe a project, the other to record a dataset, rather than a single workflow as in the case of EPrints. This development of SharePoint for description and storage of research data is part of a longer-term extension and integration of services provided on the platform at Southampton.

So in the first instance research data cataloguing at Southampton uses EPrints, extending the existing institutional repository by installing the Essex ReCollect data app. The service went live on ePrints Soton in April 2013.

What is needed now, however, is practice and experience with real data collections. In this respect many questions about the use of data repositories remain open. These early implementations are likely to change significantly as that process evolves.

For more on this DataPool case study see the full report.


Feb 25 2013

Mapping RDM requirements for the next stage of data repositories

Steve Hitchcock

DataPool and the University of Southampton have been investigating the use of EPrints and SharePoint to extend the capabilities of repositories for research data management (RDM). Others, notably the Universities of Lincoln and Bristol, have been looking at CKAN, a data portal platform from the Open Knowledge Foundation, and were responsible with JISC for a ‘sold out’ meeting on CKAN for Research Data Management in an Academic Setting (18 February 2013).

The principal output of the meeting is a set of CKAN RDM requirements (a Google Doc spreadsheet), produced by workgroups in which all participants at the meeting were involved, based around different stakeholder positions. Delete the term ‘CKAN’ from the title of this spreadsheet and you have a series of RDM requirements that define the space in which all repository platforms seeking to support RDM will be challenged to engage. In other words, while adapting deposit workflow is a start, it is not sufficient. Dropbox – the elephant in the room that went unmentioned for at least an hour into the meeting – stands as the model that illustrates some of these challenges, but many more requirements have now been set out by this workshop.

At Lincoln, Joss Winn explained, they have an EPrints publications repository and are developing a CKAN data approach to “create a record of CKAN data in EPrints, thereby joining research outputs with research data” through a SWORD2 implementation.
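
As a hedged illustration of the mechanics (not Lincoln’s actual code – the endpoint, credentials and field values below are placeholders), a SWORD2 metadata-only deposit with the Python sword2 client library might look like this:

    # Hedged sketch: create an EPrints record that points at a CKAN dataset,
    # using SWORD2. All URLs and credentials are placeholders.
    from sword2 import Connection, Entry

    conn = Connection("https://eprints.example.ac.uk/sword-app/servicedocument",
                      user_name="depositor", user_pass="secret")
    conn.get_service_document()  # discover the collections we may deposit into

    # Minimal metadata describing the dataset held in CKAN
    entry = Entry(title="Example research dataset",
                  dcterms_abstract="Catalogue record for a dataset held in CKAN.",
                  dcterms_identifier="https://ckan.example.ac.uk/dataset/example")

    # col_iri is the deposit IRI of the target collection, taken from the
    # service document; hard-coded here for brevity
    receipt = conn.create(col_iri="https://eprints.example.ac.uk/id/contents",
                          metadata_entry=entry)
    print(receipt.location)  # URI of the newly created record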

Is this a path to get rid of EPrints at Lincoln, to accommodate CKAN? No, Joss said, quite definitely, but then effectively questioned his own answer: if starting now, would we start from here, i.e. a combination of two software platforms? The implication is that over time, possibly years, the definite answer could change. The challenge is on.

Addendum: For more detail on proceedings at the workshop see Patrick McCann’s report for the DCC, and a view from a presenter, Simon Price of data.bris.


Dec 17 2012

To architect or engineer research data repositories

Steve Hitchcock

There cannot be many mature products whose development meetings have not been interrupted by a rueful declaration that to make further progress “you wouldn’t start from here”. This encapsulates one key difference between the architect and the engineer: the engineer is prepared to work with the set of tools provided, while the architect prefers to start with a blank sheet of paper or an open space.

In building research data repositories using two different software platforms, Microsoft SharePoint and EPrints, the DataPool Project is working somewhere between these extremes. Which approach will prove the more resilient for research data management (RDM)? In this invited talk for RDMF9, the ninth in the DCC series of Research Data Management Forums, held in Cambridge on 14-15 November 2012, we look at the relevant factors. As a project we are agnostic about repository platforms, and as an institutional-scale project we have to work with whoever will support the chosen platform.


The original PowerPoint slides are available from the RDMF9 site. This version additionally reproduces the notes for each slide, used to inform the commentary from the presentation. It might be worth opening the Slideshare site (adverts notwithstanding) to switch between the slide notes below and the graphic slides – clicking on View on Slideshare in the embedded view will open these in a separate browser window.

I thank Graham Pryor of the DCC, organiser of RDMF9, for inviting this talk, and for suggesting this topic based, presumably, on the project blog post shown in slide 2. That post sets out some of the higher-level issues while avoiding the trap of setting up a straw man pitting SharePoint against EPrints.

Before we get into the detailed notes, here is the live Twitter stream for the DataPool presentation (retrieved from #rdmf9 hashtag on 15 Nov.).

@jiscdatapool Preparing to talk at #rdmf9. Have the 9 am slot
@MeikPoschen #rdmf9 2nd day: To architect or engineer? Lessons from DataPool on building RDM repositories, first talk by Steve Hitchcock #jiscmrd
@MeikPoschen JISC DataPool Project at Southampton, see t.co/g5FCfkhB #jiscmrd #rdmf9
@simonhodson99 Down to work at #rdmf9 at Madingley Hall – outside it’s misty, autumnal – inside it’s Steve Hitchcock, DataPool: to architect or engineer?
@simonhodson99 Steve Hitchcock argues that the DataFlow t.co/RQqp8VdQ solution is one of the most innovative things to come through #jiscmrd #rdmf9
@simonhodson99 ePrints data apps available from ePrints Bazaar: t.co/d1zk8oD1 #jiscmrd #rdmf9
@jtedds Hitchcock (Southampton) describes institutional drive to implement SharePoint type solution but can it compete with DropBox? #jiscmrd #rdmf9
@jtedds Trial integrations with DataFlow MT @simonhodson99 ePrints data apps available from ePrints Bazaar t.co/4X8pv9iz #jiscmrd #rdmf9
@John_Milner Hitchcock highlights the challenge of getting quality RDM while keeping deposit simple for researchers, not easy #RDMF9
@simonhodson99 Perennial question of the level of detail required in metadata: with minimal metadata will the data be discoverable or reusable? #rdmf9
@simonhodson99 Is SharePoint a sufficient and appropriate platform for active data management? Sustainable? One size fits all? #rdmf9

Are the Twitter contributions a fair summary? We return to the slide commentary to find out.

Slide 3 The blog post highlighted in slide 2 included this architectural diagram, produced by Peter Hancock, director of the iSolutions IT services provider at the University of Southampton. Although it leans heavily towards referencing SharePoint, it can be viewed as a high-level reference model, analogous to the OAIS in digital preservation, and therefore as a model that can embrace other repository types.

Slide 4 Before we get into the detail of the presentation, here is a poster-based summary of the DataPool Project. It has a tripartite approach characteristic of similar institutional projects in the JISC MRD programme, covering data policy, training and, the area of interest here, building a data repository. It is worth noting as well, in this context, that the development partners shown in the row beneath the tripartite elements effectively represent ways of getting data in and out of the RDM service adopted, and are relevant factors in the repository design.

Slide 5 Here is how the different repository platforms might line up on a broad spectrum of Architected vs Engineered. This is a rough-and-ready approach to illustrate the basic point. Also included is DataFlow, from the University of Oxford, perhaps the most innovative repository platform to have emerged for RDM. Given its originality, it appears towards the architected end of the spectrum. We could not claim that SharePoint is a new software platform in the same way as DataFlow, but from an RDM perspective you don’t get anything out of the box – you have to start from scratch and ‘architect’ an RDM solution. What developers can do is try to ‘engineer’ the designed RDM element with the IT services already provided in SharePoint. EPrints first appeared in 2001 to manage research publications. It has offered a ‘dataset’ deposit type since 2007, so provides a ready-made solution for an RDM repository, and can be ‘engineered’ to enhance that solution. As the slide notes, other RDM repository platforms are available. In the following slides we will explore the features of our three highlighted RDM platforms, starting with DataFlow.

Slide 6 DataFlow is a two-stage architecture for data management: an open (Dropbox-like) space for data producers (DataStage), and a managed and curated repository (DataBank), connected by a standard content transfer protocol, SWORD. While DataBank provides a bespoke data management service for Oxford, we have recently noted experiments to connect an open source version of DataStage with EPrints- and DSpace-based curated repositories, thus providing the yearned-for Dropbox functionality apparently so in demand among research data producers.

Slide 7 This is an example screenshot from the experimental DataStage-EPrints arrangement used by the JISC Kaptur project. It shows the Choose File-Upload button combination familiar to, for example, WordPress blog users, for uploading data. Uploaded data is then shown in a conventional file manager list.

Slide 8 Moving data from DataStage to the curated repository, again shown in the experimental Kaptur implementation, uses this surprisingly simple SWORD client interface. If this seems an insufficient description for a curated item, presumably a more detailed SWORD client could be substituted.

Slide 9 One basis for building a more comprehensive description, or metadata, for research data is this 3-layer model produced by the Institutional Data Management Blueprint (IDMB) Project, the project that preceded DataPool at the University of Southampton. This is quite a general-purpose and flexible model, perhaps with more flexibility than meaning. Structurally, nevertheless, we will see that this has some relevance to repository deposit workflow design.

Slide 10 The 3-layer metadata model can be seen quite clearly in the emerging user interface for data deposit built on SharePoint. Here we see the interface for collecting project descriptions, used once per project and then linked to each data record produced by the project.

Slide 11 In the same style, here is the SharePoint user interface for collecting data descriptions. One of the most noticeable features of both the Project and Data forms is the small number of mandatory fields (indicated with a red asterisk): just one on each form. Mandatory fields have to be filled in for the form to submit successfully. Most people will have experienced these fields; invariably when completing a Web shopping form they will be returned with a red text warning. In this case you could feasibly submit a project or data description containing only a title. Aspects such as this will shortly be subjected to user testing and review of this implementation.

Slide 12 SharePoint has its detractors as an IT service platform, principally bemoaning its complexity-to-functionality ratio. Prof Simon Cox from the University of Southampton passionately takes the opposite view. This is an extract from his intervention at a DataPool Steering Group meeting (May 2012) putting the case for SharePoint. It is a good way of understanding the wider strengths of SharePoint, which may not be immediately apparent to users of particular SharePoint services. Building the range of services suggested is a difficult and long-term project.

Slide 13 EPrints supports the deposit of many item types, including datasets since 2007. When you open a new deposit process in EPrints you will first be shown this screen, where you can select an item type such as ‘dataset’.

Slide 14 Selecting ‘dataset’ takes you to the next screen, which might look something like this example from ePrints Soton, the Southampton Institutional Repository. This is not quite the default screen for a standard EPrints install; the workflow and fields have been customised in some areas by a repository developer.

Slide 15 EPrints users need not be restricted to standard interfaces or interfaces customised to a repository requirement. Interfaces in EPrints can be added to or amended simply by installing an app from the app store, the EPrints Bazaar. Unlike in the Apple App Store, with which it might optimistically be compared, EPrints apps are not selected and installed by end users; installation is authorised by repository managers. There are already two apps for those managers to choose from to suit particular RDM workflow requirements: DataShare and Data Core. More data apps are expected to follow. EPrints is thus being engineered for flexibility in RDM deposit. In the following slides we will explore these first two data apps.

Slide 16 DataShare makes some minor modifications to the default EPrints workflow for deposit of datasets, highlighted with red circles here.

Slide 17 Data Core aims to implement a minimal ‘core’ metadata set for datasets. Installing this app will overwrite the default EPrints workflow, replacing it with the minimal set, approximately half of which is shown here (the remainder in the next slide). In addition, we have a short description of the design aims for Data Core, something unavailable for SharePoint data deposit and the DataShare app.

Slide 18 Taking both slides showing the Data Core deposit workflow, this is comparable in extent with the SharePoint ‘data’ interface shown earlier, although it has a few more mandatory fields.

Slide 19 Another example of an EPrints data deposit interface has been developed by Research Data @Essex at the University of Essex. Like Data Core, the Essex approach has explicit design objectives, based on aligning with other metadata initiatives to support multi-disciplinary data. In other words, this does not simply expand or reduce the default EPrints workflow for data deposit, but starts with a new perspective. We have been liaising with its development team to investigate the possibility of building this approach into an Essex EPrints app for other repositories to share.

Slide 20 Here is a section of the Essex workflow, highlighting one area of major difference with the default workflow. It shows fields for time- and geographic-based information.

Slide 21 We’ve looked at getting data into the repository, but not yet how it is displayed as an output, or a data record from the repository. This is one example. It is not the most revealing record, but could be expanded.

Slide 22 Essex has cited specific design criteria for its research data repository. Additionally we have observed some characteristic features, indicated here. In particular, it is a data-only repository, without provision for other data-types offered by EPrints (shown in slide 13). The indication of mandatory fields adds a further layer of insight into the implementation of the design criteria.

Slide 23 So far in this presentation we have seen different implementations of data repository deposit interfaces, including DataFlow, SharePoint, and multiple interfaces for EPrints. Where is this heading, and what are the common themes? Since we are exploring the difference between architecting and engineering these repositories, I was interested to see this national newspaper article about a major redevelopment of an area close to central London, Nine Elms, an area that interests me as I pass through it on a regular basis. Phrases that stand out refer to the relationship between the planned new high-rise buildings. What does this have to do with data repositories?

Slide 24 Interoperability is the relationship between repositories and how they interact with services, such as search, through shared metadata. If repositories have “nothing in particular to do with anything around them” or “show little interest in anything around” them, then they will not be interoperable. If repositories stand alone rather than interoperate then they become less effective at making their contents visible. Open access repositories have long recognised the importance of interoperability, being founded on the Open Archives Initiative (OAI) over a decade ago, and efforts to improve interoperability continue with current developments. Shown here are some current interoperability initiatives from one morning’s mailbox. Data repositories will be connected to this debate, but so far it has not been a priority in the examples we have considered here.

Slide 25 One of the organisations listed on the previous slide, COAR, produced a report that outlines more comprehensively the scope of current interoperability initiatives for open access. While some solutions to the capture of research data seen here have reasonably been ‘architected’, that is, starting with a blank sheet to focus on the specific design needs of data deposit, these will need to catch up quickly with interoperability requirements, including most of those listed here. Data repositories ‘engineered’ on a platform such as EPrints, originally designed for other data types, do not obviously lack the flexibility to accommodate research data, and by virtue of having contributed to repository interoperability since the original OAI, already support most of the requirements shown here.

Slide 26 As for the DataPool Project, it will continue its dual approach of developing and testing both SharePoint and EPrints apps. As a project it does not get to choose what is ultimately adopted to run the emerging research data repository at the University of Southampton. There are repository-specific factors that will determine that, but there are other organisational factors to take into account as well. Institutions that remain clearly focussed on this range of factors are likely to have most success in implementing a repository that attracts data deposit and usage.

This post has covered just one presentation, from DataPool, at RDMF9. The following two blog reports give a wider flavour of the event, the first exploring the architectural issues raised.

Julie Allinson, Some initial thoughts about RDM infrastructure @ York: “I’ll certainly carry on working up my architecture diagram, and will be drawing on the data coming out of our RDM interviews and survey to help flesh out the scenarios we need to support. But what I feel encouraged and even a little bit excited by is the comment by Kevin Ashley at the end of the RDMF9 event: that two years ago everyone was talking about the problem, and now people are coming up with solutions.”

Carlos Silva, RDMF9: Shaping the infrastructure, 14-15 November 2012: “Overall it was a good workshop which provided different points of view but at the same time made me realise that all the institutions are facing similar issues. IT departments will need to work more closely with other departments, and in particular the Library and Research Office in order to secure funding and make sustainable decisions about software.”


May 28 2012

What can research data repositories learn from open access? Part 2

Steve Hitchcock

Open access is finally attracting high-level attention from national governments, but full open access has been a long time arriving despite extensive funding, development and the commitment of many people. As much of that effort switches towards the implementation of repositories to store, share and publish the research data that informs publications, we are considering what lessons might be learned from open access repositories, so that the path to effective data repositories might be shorter and less fraught. In part 1 the factors considered included policy, infrastructure, workflow and curation. Here in part 2 we look at rights and user interfaces.

2500 Creative Commons Licenses

Rights

Since open access is indelibly associated with publication, one of the primary impediments to providing open access is the transfer of rights to publishers, a practice that has failed to adapt to the digital switch. Research data is not yet so encumbered, and with care data creators can deploy rights more effectively because they are starting out in the digital era.

It has often been argued that open access repositories failed to adopt or compete with Web 2.0 services. Quite what this means is not clear, but one aspect might be social and user engagement for the purpose of growing content. Well-known services that became associated with Web 2.0 are YouTube and Flickr, so the case might be that OA repositories were not as successful in attracting content as these services. There is one key point that differentiates those services from open access: prior to Web photo and video services, there were no simple publication outlets for this type of content for non-professional or non-broadcast works. In the case of open access there are pre-existing publications, such as journals and conference proceedings. Open access repositories do not seek to eliminate the journals, but to supplement the access they provide. There is thus another party with a vested interest in ownership of this content.

This is why open access can get mired in discussions about rights. Creative Commons (CC) licences were designed for content to be shared on the Web and communicate how creators are prepared to share their rights with users to open and extend use of their content. For research papers, however, since long before the Web, publishers have required a transfer of rights from the author in return for publication – hence the ownership issue. Unlike CC, these rights can be used to lock down access and reuse, for commercial purposes.

There is a form of open access, ‘gold’ OA journals (in BOAI this is complementary to ‘green’ OA repositories), which may be accompanied by release of commercial rights using CC but often at a cost for publication. In other words, publication is paid for financially rather than in a transfer of rights. Such journals present this as an advantage over non-OA journals and OA repositories, and this can be beneficial for text mining and other applications. While this form of OA publishing has been growing in recent years, it remains to be seen how quickly it can replace or adapt key high-impact journals, and at what cost.

Broadly, research data are not yet subject to publication rights. Publications are a highly processed form of research data, in the form of tables and graphs, for example. Typically the data targeted by data repositories precedes the refined and summarised publication versions, and is therefore not covered by the same rights transfer. That could change if expanded publications requiring data deposit or third-party service providers seek to obtain rights in return for these services.

Strictly, while institutions where research has been performed inherently own the rights to that work, they have been reluctant to exercise those rights in ways that would restrict a researcher’s choice of publication, or to require or even advise authors on some retention of rights or amendments to rights agreements. Unlike with peer reviewed papers, where precedent is more strongly established, it is possible that institutions will seek to impose more control of rights where research data is concerned. Recently reported cases show how a university’s allocation of control of rights within research teams, the special case at Purdue University notwithstanding, can have consequences for publication. What data creators and authors will be concerned about is whether the exercise of those rights by institutions is commensurate with the services that are provided in return. There may be resistance if established academic freedoms are constrained and research impact is reduced as a result, but with the right services and effective exercise of rights impact can be increased by sharing research data openly.

The lesson of open access is that rights matter, that the traditional all-rights transfer for academic publication is no longer appropriate for or conducive to fully exploiting new forms of digital dissemination, but also that established practices can be slow to change. Institutions and authors should be careful not to let rights to research data slip away as they did for publications in another era, but equally they must be careful to work together to use those rights in ways that maximise the benefits and impact for them and for research.

User interfaces

Users of repository services are both those who provide the data and those who consume it. The features that define and characterise repositories are the interfaces through which users can perform these actions, but are these interfaces flexible or adaptable enough to serve all those who might want to use repositories for publications or data?

Within this analysis (including part 1) it has been suggested that OA repositories may have overlooked workflow, and Web 2.0 developments with regard to content growth, services and engagement with users. In fact, some helpful developments can be found buried deep within repository software, but to see where these might impact users more directly we have to look away from the familiar repository interfaces. This critical development is called SWORD (Simple Web service Offering Repository Deposit), and it will impact on data repositories as well, in ways that we have not yet seen implemented on a large scale, even for OA repositories.

As the name indicates, SWORD is focussed on one of the actions a repository supports: deposit, that is, getting new content into a repository or updating existing content, the updating feature having recently become available with SWORD version 2.

SWORD frees the user deposit interface from the repository software and the specific instance of a repository. As the number and types of repositories have grown, some authors may wish to deposit in more than one place. SWORD can help with that. If the repository deposit interface demands too many keystrokes (metadata), or does not allow all the metadata you want to record – too few keystrokes – SWORD can help there as well.

The deposit still needs to reach a repository (an ‘endpoint’), so SWORD and the repository platforms are working together on this, not competing. All major repository platforms support SWORD, and the most recent releases support SWORD v2. What is needed are more SWORD client interfaces, as there have been relatively few examples to date.

It is easy to see that data repositories can benefit from SWORD in the same way as open access – deposit in many places from a single interface. When it comes to scoping metadata within a deposit interface, given the wide disparity in describing different data types in different disciplines with metadata, SWORD begins to appear essential for data deposit. These are just the services we can anticipate now.

With SWORDv2 we can envisage taking deposit out of the forms-based approach and into different applications. One that may work for data deposit is a Dropbox-like application for file-based deposit. With this application, ‘dropping’ a file into a specified directory in a file manager on a laptop, say, synchronises and copies subsequent versions of that file to a repository (Figure 3), or potentially to a remote storage service, which the user can access by logging on to the storage site from any Web-connected device. Data can thus be accessed and shared, or published in open access repositories. File manager-based services could in this way support simple deposit of research data files in conjunction with storage services, and SWORDv2 could also fulfil automated deposit cases.

Figure 3. Dragging an image copies it to the selected repository
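
A minimal sketch of how such a ‘watched folder’ might be wired to a repository, assuming the Python watchdog library for file-system events and the sword2 client library for deposit (the endpoint, credentials and path are placeholders, and version synchronisation via SWORDv2 update is left out for brevity):

    # Hedged sketch: auto-deposit new files from a watched folder via SWORD2.
    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler
    from sword2 import Connection

    COL_IRI = "https://repo.example.ac.uk/id/contents"  # target collection (placeholder)

    class DepositHandler(FileSystemEventHandler):
        def __init__(self, conn):
            self.conn = conn

        def on_created(self, event):
            if event.is_directory:
                return
            with open(event.src_path, "rb") as f:
                payload = f.read()
            # Simple binary deposit; a fuller client would use conn.update()
            # to synchronise subsequent versions of the same file
            receipt = self.conn.create(
                col_iri=COL_IRI,
                payload=payload,
                filename=event.src_path.split("/")[-1],
                mimetype="application/octet-stream",
                packaging="http://purl.org/net/sword/package/Binary")
            print("Deposited", event.src_path, "->", receipt.location)

    conn = Connection("https://repo.example.ac.uk/sword-app/servicedocument",
                      user_name="depositor", user_pass="secret")
    observer = Observer()
    observer.schedule(DepositHandler(conn), "/home/user/deposit-box")
    observer.start()
    try:
        while True:
            time.sleep(1)  # keep watching until interrupted
    except KeyboardInterrupt:
        observer.stop()
    observer.join()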

The DataFlow workflow illustrated in part 1 uses SWORD as the transfer mechanism between the user’s local storage and the curated institutional storage, in essence using it to capture additional metadata.

Another demonstrated application of a SWORDv2-based interface works within desktop authoring tools, such as a word processor or other office applications.

What these applications portend is that data repositories can fill the workflow gap we identified in open access repositories, a gap that looks to be potentially wider still for data. We can begin to support deposit of data to a schedule that need not be based on the same frequency and mode as publication but is more flexible. As well as the need for more SWORD client interfaces, however, another open question is how repository software designed for publication can adapt to support two different paradigms: managed storage as well as publication.

There are only two reasons for data creators to deposit in data repositories: they want to (share, publish, good academic practice, etc.), or they have to (policy). By focussing on services that are adaptable enough to serve users, building on SWORD to support flexible workflow and bringing deposit into automated or even more creative applications, research data repositories have the chance to support both motivations, instead of being left to emphasise policy as the primary motivator, as has happened for open access repositories.

Summary

Establishing and growing open access content is taking longer and proving harder than ever originally anticipated back in 2000. As we consider how to extend open access repositories to manage research data, are we learning the right lessons from open access? Have we covered all the important issues, or are we missing key factors? Research data repositories bring challenges that are distinct from open access. What are the new challenges, and which of these will have most impact on the success of research data repositories?

In this analysis the factors we have considered include policy, infrastructure, workflow, curation, rights and user interfaces. We haven’t covered preservation, but digital preservation is served by a comprehensive selection of tools that can be applied to repositories, and one lesson seems to be that repositories will move to be preservation-ready when content volumes and risk-analysis demand.

Open access began with the principle that it is good for researchers to share findings, and that digital networks enable that to happen more widely and at lower cost, ultimately free to users. It was anticipated that users would want to take advantage of this, as physicists already did with arXiv, and when this model failed to take off to the same degree in other disciplines, eventually institutional repositories emerged to encourage further growth of open access. As that growth appeared to hit a ceiling, research funders and institutions began to step in with open access policy. In other words, principle – whichever principle you prefer, returns to taxpayers, for example, or productivity of research, or escalating journal costs – was used to justify and frame policy for users. Users themselves, so it seems based on unmandated rates of open access deposit, have been less keen to put principle into practice.

In hindsight there are lessons that could have been learned to speed up the process. Progress with data repositories need not suffer the same mistakes or the same delays. Data repositories might occupy a more pragmatic, less emotional space than open access. Unlike open access, there is no single or easily defined target for research data repositories – ‘what is data?’ remains a perennial question – so policy and requirements might be broader. Perhaps this time content deposited in data repositories can be driven by services that attract users, as well as by policy. In this case, the aim of data repositories must be to find those users who want these services, and then to make those services work better for them.


May 24 2012

What can research data repositories learn from open access? Part 1

Steve Hitchcock

Institutional research data repositories follow in the wake of the widespread adoption of open access repositories across UK institutions during the last decade. What can these new repositories learn from the experiences of open access, and what pointers can we find for the development of data repositories? In the first part of this post we will consider factors such as policy, infrastructure, workflow and curation. In part 2 we will extend the analysis to rights and user interfaces.

It may be a timely moment to reflect. A recent speech by the UK government’s science minister David Willetts prompted renewed excitement over open access, with a forthcoming report to advise on specific actions to be taken to realise more open access. Less remarked on, apart from comment about the undefined but potentially high-profile role of Wikipedia founder Jimmy Wales, was the bigger picture view that anticipates stronger integration and linking between research publications, research information for reporting and assessment, and research data for data mining but also for research testing and validation.

Open access (OA) repositories, which principally provide free access to an author’s version of published research papers, effectively began with the physics arXiv in 1991. Institutional repositories, which switch the focus of coverage from the subject to the place of authorship, emerged in 2001 following the Open Archives Initiative (OAI). To complete the record, the term ‘open access’ was defined by the Budapest Open Access Initiative (BOAI) in 2002.

So institutional OA repositories have up to a decade head start on proposed institutional research data repositories. The University of Southampton, home of the DataPool project, has hosted a leading OA repository since 2005, so the project team has long experience of running a repository.

As with OA repositories, there are plenty of examples of subject-focussed research data repositories, but here we focus on factors affecting institutional repositories (IRs).

Policy

For OA IRs, technology and infrastructure preceded policy. First impressions are that for data IRs this will be the other way round. As with OA, data policies in the UK are being driven both by research funders and institutions.

OA policies focus on the need to expand the full-text content collections held in repositories and typically require (mandate) or encourage authors to deposit versions of their published papers. The first university-wide mandatory OA policy was implemented at Queensland University of Technology in Australia in 2004, according to the site EnablingOpenScholarship. This site also shows graphically how the number of institutional policies began to accelerate from the first quarter of 2009, some five years after the growth of IRs saw a similar acceleration, although institutions with such policies remain a minority. It has been calculated that OA mandate policies can increase deposit rates to above 60% of eligible papers, from an average of 20%. In this respect, the lack of a suitable policy could be seen to hinder an institutional OA repository.

Emerging UK institutional data policies, by comparison, have focussed on requiring researchers to create data management plans and data records, and emphasise sustainable practices in managing and storing data for the purpose of access, stopping short of requiring open access or institutional deposit of the actual data, which would then need to be supported by the institution. This might be because institutions have still to calculate and cost the storage infrastructure needed, whether managed locally or in the ‘cloud’; because institutions are unclear what value they can bring to data management – or even where the value lies in the data they seek to help support; or because there is not yet any consensus on whether data repositories should be subject-based or institutional, an issue which OA repositories have still not fully resolved. Institutional data policies have in turn been driven and directed by research funders’ data policies, principally RCUK and EPSRC (Jones 2012), setting principles and expectations of institutional compliance within a specified timescale (for EPSRC, by 2015).

Data policies may benefit from being instituted ahead of developing infrastructure for collecting, managing and presenting data. However, the few early policies available suggest little common purpose – we are clearly some way from having a best-practice data policy template for others to follow, as has evolved for OA repositories. To serve even the limited requirements of these early policies, institutions will need to connect decisions on infrastructure and understand patterns of workflow that produce research data, as we shall see below.

Infrastructure

By infrastructure we mean the technical capability to support distinctive requirements. While OA repository infrastructure is well established, it has not had to tackle the challenge of large-scale storage that is likely to characterise data repositories.

The essential infrastructure that led to OA repositories was put in place by the OAI: a protocol for metadata harvesting, OAI-PMH. This allowed individual repositories to be viewed collectively through services based on OAI-PMH – search being the most prominent, at a time when Google was new and relatively little known. Immediately, software emerged for setting up institutional repositories, first EPrints and later DSpace and others. These repository platforms now also bring a range of integral services, established over a decade, that can be utilised to manage a range of data types, including research data.
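
To illustrate how lightweight the harvesting side is, here is a minimal sketch using the third-party Python sickle library against an illustrative endpoint (the URL is a placeholder; any OAI-PMH-compliant repository exposes the same verbs):

    # Hedged sketch: harvest Dublin Core records over OAI-PMH.
    from sickle import Sickle

    repo = Sickle("https://eprints.example.ac.uk/cgi/oai2")
    for record in repo.ListRecords(metadataPrefix="oai_dc"):
        # record.metadata maps Dublin Core elements to lists of values
        titles = record.metadata.get("title", ["(no title)"])
        print(record.header.identifier, "-", titles[0])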

Hence this same infrastructure, with modification, is being used to serve data repositories. There is, however, one new infrastructure component that data repositories will need to introduce – large-scale data storage. While content volumes for OA repositories do not test conventional storage systems, data repositories will inevitably provide much bigger challenges to storage and curation. To get a sense of the scale of the problem, Figure 1 compares data volumes at different stages, and is taken from a presentation about scoping curation for digital repositories. It is notable that data generation volumes cannot be visualised on the same scale as the other stages, since these are orders of magnitude larger. We might call this the data curation gap. Rosenthal has recently questioned assumptions that all data generated might be kept ‘forever’, indicating the need to fill the curation gap: “Assuming (data) growth continues, endowing 2012’s data will consume 19% of Gross World Product (GWP). On these trends, endowing 2018’s data will consume more than the entire GWP for the year.”

Figure 1. Comparing data volumes at different stages - generation, repository storage and archiving

Institutions appear to have two choices to serve this level of storage: locally managed, or remote storage in the cloud. It is likely there will be a preference or a requirement to exert institutional control over storage (for example, at the University of Brighton: “we currently have a policy of not hosting staff data outside of the institution”), even in the case of cloud storage, hence developments such as the JISC UMF Cloud Pilot managed by Eduserv.

They could instead opt to advise researchers and data producers on selecting their own storage, from data archiving services such as UK Data Archive and the Archaeology Data Service, or data publication repositories such as Figshare, Dryad and other data repositories listed by DataCite, or even commercial cloud storage services (although a colleague noted that risk-averse advice might wish to start with where not to store data). Apart from the data archiving services, it remains to be seen whether these repositories can provide resilient, cost-effective, sustainable storage over an extended period, where content can be shared collaboratively during development and later made open access.

Workflow

OA repositories were designed from the outset for a publication mode of delivery that does not attempt to capture and support earlier phases in the workflow of writing a research paper. Given the more complex workflow (or life cycle) of research data, and the need to capture data at different stages of production and processing, the single publication mode may be inadequate for data repositories.

As the Web gained popularity in the mid-1990s all sorts of content began to appear, including digital versions of research papers published in what were then still largely paper journals. Authors were simply loading digital versions on to Web servers wherever these happened to be available, usually within their institutions, whether these servers were provided for this purpose or not.

OA institutional repositories served a simple purpose – to provide these authors with a more reliable, managed, services-based Web server to provide access to this digital content over a long timeframe. In this respect the designers behind these repositories over-estimated the number of authors that would use such services and the number of papers that would appear in repositories. Further, because the target content was papers due for peer-reviewed publication, the concept of workflow was barely considered beyond the expectation that the process of repository deposit would happen at the completion of writing the paper and in parallel with submission for formal publication. Thus OA repositories were designed for a one-stage deposit workflow, and no prior contact with authors while a paper was in preparation.

It has been suggested that, by failing to engage authors at a sufficiently early stage and by not providing support services for writing papers, OA repositories have lost out to the more established process at the completion of a paper – publication. Further, by the time IRs were widespread, most journals were producing digital versions, so that was no longer a factor for authors posting Web copies of their papers, even if those journal digital versions still mostly stood behind subscription barriers.

While it is in principle simple to upload a completed paper from a local file store to a repository, it has been argued that one deterrent is the requirement by many repositories for extensive accompanying ‘keystrokes’, or metadata. Competition with publishers for keystrokes at the point of completion and submission, lack of clarity about the benefits of OA repositories, and the failure to integrate with workflow may all have prevented OA repositories from growing content to the levels anticipated, and led directly to the mandate policies described above.

The lesson for data repositories is clear: to capture content from data creators you must provide useful services that will become an integral part of the workflow of creating the data. It will not work to isolate particular requirements, such as records creation, from other needs such as storage services. Data does not appear with the same mode and frequency as published papers, so workflow must accommodate many different patterns. Research data is often produced by machines, so deposit workflow must allow scope for non-manual intervention.

While the workflow involved in the production of research data is more complex and less easy to classify than for OA publications, one helpful representation of this workflow comes from the University of Oxford (Figure 2). This shows how a project begins with a bid for funding, which in future will invariably be accompanied by a data management plan (DMP), a data roadmap for the project to follow. If a workflow begins with a successful proposal and a DMP, it will lead to data and, increasingly – from policy or from users – a requirement for managed data storage with the ability to support controlled access for collaboration, and discovery for wider access. Figure 2 is taken from this presentation by the DataFlow Project.

Figure 2. Representation of a research data management workflow, from the University of Oxford

Effective institutional data services will need to span this whole workflow and engage data creators at all stages. Lessons on workflow from open access suggest that for research data providing separate services for creating data records and storage, for example, will be insufficient. Data creators and authors will not engage in processes that do not enhance their work.

Curation

Digital curation is defined by Wikipedia as the “selection, preservation, maintenance, collection and archiving of digital assets”. For open access, selection is pre-determined – the target content is peer reviewed and published research papers. Further selection of such content for curation purposes is not merited by the data volumes involved. As we have seen with the ‘curation gap’ above, this does not hold for research data. As a result, more attention will need to be paid to curation for research data, and the line between simple user-managed storage and assisted curation will need to be more flexible.

Where that line might be drawn is thus open to question. It is drawn in principle by the strategy exemplified by DataFlow in Figure 2, which has two stages representing user-managed workspace and storage (the stage to Local Storage & Retrieval in Figure 2), with a transition to an institutionally curated space (Institutional Storage, or DataBank in Oxford’s system). The question remains as to what drives that transition. Such spaces are likely to have different curation criteria in different institutions, and will need to take account of researcher, policy and publication requirements, as well as costs.

An example of research data management that has optimised workflow, metadata collection and records creation, data curation, aggregation, discovery and access is eCrystals at Southampton, now extended to a federation led by the UK National Crystallography Service.

Interim summary

There are more lessons to learn from experience with open access that we can apply to research data repositories. In part 2 we will extend the analysis to rights and user interfaces.