May 20 2013

Research data cataloguing at Southampton using Microsoft SharePoint and EPrints

Steve Hitchcock

Motivated by the JISC RDM Programme, many UK research institutions are implementing research data repositories. A variety of repository platforms have been adopted, from original solutions such as DataFlow at Oxford to established digital repositories such as DSpace and EPrints. The University of Southampton is unusual in pursuing two repository options.

DataPool has been developing a data cataloguing facility based on Microsoft SharePoint, which provides a number of IT services at the university, and on EPrints repository software. An earlier presentation considered the development of SharePoint and EPrints as emerging research data repositories in the context of high-level ‘architectural’ and practical ‘engineering’ challenges. A new report describes further progress along both repository routes, notably collaboration with the JISC Research Data @Essex project on ReCollect, a standards-based data deposit application for EPrints.

The ReCollect plugin, or app, provides an EPrints repository with an expanded metadata profile for describing research data based on DataCite, INSPIRE and DDI standards. The metadata profile was designed by Research Data @Essex, and packaged as an EPrints app in collaboration with Patrick McSweeney at the University of Southampton. The resulting app first appeared on 27 March 2013 in the EPrints Bazaar, which advertises and distributes applications that can be installed in an EPrints repository with one click.

In 2007 the list of data types that can be deposited in EPrints was expanded to include Dataset and Experiment, to support the submission of research data. Selecting a data type presents the depositor with a series of pages and fields designed to be appropriate for describing that type. The order in which these pages and fields are presented defines the deposit ‘workflow’ for the data type, and is typically customised by institutions for their specific repository implementations.
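The idea of a workflow as an ordered series of pages and fields per data type can be sketched as follows. This is an illustration only: the stage names, field names and the `customise` helper are hypothetical, and this is not how EPrints itself stores its workflow configuration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    """One page of the deposit form, with its fields in display order."""
    name: str
    fields: List[str]


# Hypothetical workflows keyed by data type. The stage and field names are
# illustrative and are not the actual EPrints defaults.
WORKFLOWS: Dict[str, List[Stage]] = {
    "dataset": [
        Stage("type", ["type"]),
        Stage("files", ["documents"]),
        Stage("details", ["title", "creators", "abstract", "date"]),
    ],
}


def fields_in_order(item_type: str) -> List[str]:
    """Flatten a workflow into the order in which a depositor meets the fields."""
    return [f for stage in WORKFLOWS[item_type] for f in stage.fields]


def customise(workflow: List[Stage], stage_name: str, extra: List[str]) -> List[Stage]:
    """Return a copy of a workflow with extra fields appended to one stage."""
    return [
        Stage(s.name, s.fields + extra if s.name == stage_name else list(s.fields))
        for s in workflow
    ]


# A local repository adds a field without altering the shared default.
local_workflow = customise(WORKFLOWS["dataset"], "details", ["doi"])
```

Keeping the shared definition untouched and deriving a local copy mirrors the distinction drawn above between a default workflow and an institution's customisation of it.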

If workflow for different data types is provided directly by EPrints, and can be customised to local repository needs, what is the need for an app that implements data deposit workflow? First, it can simplify and speed up customisation if the desired workflow can be implemented from an app. Second, and more importantly for research data workflow, it can lead to greater standards compliance, consistency and collaboration between repositories.

Repositories can customise the data deposit workflow provided by ReCollect from the profile designed by Research Data @Essex without affecting standards compliance. The new report compares an example of customising the workflow for the ePrints Soton institutional repository with the ReCollect original.

A community of potential users for ReCollect, including the EPrints repositories at Glasgow University and Leeds University, has been established through webinar-based conference calls.

Further modifications to the Southampton workflow are likely, including the facility to automate minting and embedding of British Library DataCite DOIs, designed for data citation, for each data record.

In the case of SharePoint, user interface forms for creating data records have been piloted and tested. The approach is distinctive in creating two linked forms, one to describe a project, the other to record a dataset, rather than a single workflow as in the case of EPrints. This development of SharePoint for description and storage of research data is part of a longer-term extension and integration of services provided on the platform at Southampton.

So in the first instance research data cataloguing at Southampton uses EPrints, extending the existing institutional repository by installing the Essex ReCollect data app. The service went live on ePrints Soton in April 2013.

What is needed now, however, is practice and experience with real data collections. In this respect many questions about the use of data repositories remain open. These early implementations are likely to change significantly as that process evolves.

For more on this DataPool case study see the full report.

Dec 17 2012

To architect or engineer research data repositories

Steve Hitchcock

There cannot be many mature products where development meetings have not been interrupted with a rueful declaration that to make further progress “you wouldn’t start from here”. This encapsulates one key difference between the architect and the engineer: the engineer is prepared to work with the set of tools provided, while the architect prefers to start with a blank sheet of paper or an open space.

In building research data repositories using two different software platforms, Microsoft SharePoint and EPrints, the DataPool Project is working somewhere between these extremes. Which approach will prove to be the more resilient for research data management (RDM)? In this invited talk for RDMF 9, the ninth in the DCC series of Research Data Management Forums, held in Cambridge on 14-15 November 2012, we will look at the relevant factors. As a project we are agnostic to repository platforms, and as an institutional-scale project we have to work with whoever will support the chosen platform.

The original PowerPoint slides are available from the RDMF9 site. This version additionally reproduces the notes for each slide used to inform the commentary from the presentation. It might be worth opening the Slideshare site (adverts notwithstanding) to switch between the slide notes below and the graphic slides – clicking on View on Slideshare in the embedded view will open these in a separate browser window.

I thank Graham Pryor of DCC, organiser of RDMF9, for inviting this talk, and for suggesting this topic based, presumably, on the project blog post shown in slide 2. This post sets out some of the higher-level issues while avoiding the trap of setting up a straw man pitting SharePoint versus EPrints.

Before we get into the detailed notes, here is the live Twitter stream for the DataPool presentation (retrieved from #rdmf9 hashtag on 15 Nov.).

@jiscdatapool Preparing to talk at #rdmf9. Have the 9 am slot
@MeikPoschen #rdmf9 2nd day: To architect or engineer? Lessons from DataPool on building RDM repositories, first talk by Steve Hitchcock #jiscmrd
@MeikPoschen JISC DataPool Project at Southampton, see #jiscmrd #rdmf9
@simonhodson99 Down to work at #rdmf9 at Madingley Hall – outside it’s misty, autumnal – inside it’s Steve Hitchcock, DataPool: to architect or engineer?
@simonhodson99 Steve Hitchcock argues that the DataFlow solution is one of the most innovative things to come through #jiscmrd #rdmf9
@simonhodson99 ePrints data apps available from ePrints Bazaar: #jiscmrd #rdmf9
@jtedds Hitchcock (Southampton) describes institutional drive to implement SharePoint type solution but can it compete with DropBox? #jiscmrd #rdmf9
@jtedds Trial integrations with DataFlow MT @simonhodson99 ePrints data apps available from ePrints Bazaar #jiscmrd #rdmf9
@John_Milner Hitchcock highlights the challenge of getting quality RDM while keeping deposit simple for researchers, not easy #RDMF9
@simonhodson99 Perennial question of the level of detail required in metadata: with minimal metadata will the data be discoverable or reusable? #rdmf9
@simonhodson99 Is SharePoint a sufficient and appropriate platform for active data management? Sustainable? One size fits all? #rdmf9

Are the Twitter contributions a fair summary? We return to the slide commentary to find out.

Slide 3 The blog post highlighted in slide 2 included this architectural diagram, produced by Peter Hancock, director of the iSolutions IT services provider at the University of Southampton. Although it leans heavily towards referencing SharePoint, it can be viewed as a high-level reference model, analogous to the OAIS in digital preservation, and therefore as a model that can embrace other repository types.

Slide 4 Before we get into the detail of the presentation, here is a poster-based summary of the DataPool Project. It has a tripartite approach characteristic of similar institutional projects in the JISC MRD programme, covering data policy, training and, the area of interest here, building a data repository. It is worth noting as well, in this context, that the development partners shown in the row beneath the tripartite elements effectively represent ways of getting data in and out of the RDM service adopted, and are relevant factors in the repository design.

Slide 5 Here is how the different repository platforms might line up on a broad spectrum of Architected vs Engineered. This is a rough-and-ready approach to illustrate the basic point. Also included is DataFlow, from the University of Oxford, perhaps the most innovative repository platform to have emerged for RDM. Given its originality, it appears towards the architected end of the spectrum. We could not claim that SharePoint is a new software platform in the same way as DataFlow, but from an RDM perspective you don’t get anything out of the box – you have to start from scratch and ‘architect’ an RDM solution. What developers can do is try to ‘engineer’ the designed RDM element with the IT services already provided in SharePoint. EPrints first appeared in 2001 to manage research publications. It has offered a ‘dataset’ deposit type since 2007, so provides a ready-made solution for an RDM repository, and can be ‘engineered’ to enhance that solution. As the slide notes, other RDM repository platforms are available. In the following slides we will explore the features of our three highlighted RDM platforms, starting with DataFlow.

Slide 6 DataFlow is a two-stage architecture for data management: an open (Dropbox-like) space for data producers (DataStage), and a managed and curated repository (DataBank), connected by a standard content transfer protocol, SWORD. While DataBank provides a bespoke data management service for Oxford, we have recently noted experiments to connect an open source version of DataStage with EPrints- and DSpace-based curated repositories, thus providing the yearned-for Dropbox functionality apparently so in demand with research data producers.

Slide 7 This is an example screenshot from the DataStage-EPrints experimental arrangement used by the JISC Kaptur project. It shows the Choose File-Upload button combination for uploading data, familiar to e.g. WordPress blog users. Uploaded data is then shown in a conventional file manager list.

Slide 8 Moving data from DataStage to the curated repository, again shown in the experimental Kaptur implementation, uses this surprisingly simple SWORD client interface. If this seems insufficient description for a curated item, presumably a more detailed SWORD client could be substituted.
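To give a feel for how little description a simple deposit client might send, here is a minimal sketch. It assumes a SWORD-style deposit in which the item is described by an Atom entry carrying a few Dublin Core terms; SWORD builds on the Atom Publishing Protocol, but the choice of fields and the `build_entry` helper are assumptions for illustration, not the Kaptur or DataStage client. The sketch only builds the XML and does not contact a repository.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for Atom and Dublin Core terms.
ATOM = "http://www.w3.org/2005/Atom"
DCTERMS = "http://purl.org/dc/terms/"


def build_entry(title: str, creator: str, abstract: str = "") -> str:
    """Build a minimal Atom entry describing one deposited item.

    Hypothetical helper: the set of fields is an illustrative minimum,
    not a profile required by any particular repository.
    """
    ET.register_namespace("atom", ATOM)
    ET.register_namespace("dcterms", DCTERMS)

    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    author = ET.SubElement(entry, f"{{{ATOM}}}author")
    ET.SubElement(author, f"{{{ATOM}}}name").text = creator
    if abstract:
        # A richer client would add further descriptive terms here.
        ET.SubElement(entry, f"{{{DCTERMS}}}abstract").text = abstract
    return ET.tostring(entry, encoding="unicode")


entry_xml = build_entry("Example dataset", "A. Researcher")
```

A more detailed client would differ mainly in how many descriptive elements it adds to the entry, which is the trade-off between simple deposit and adequate description raised above.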

Slide 9 One basis for building a more comprehensive description, or metadata, for research data is this 3-layer model produced by the Institutional Data Management Blueprint (IDMB) Project, the project that preceded DataPool at the University of Southampton. This is quite a general-purpose and flexible model, perhaps with more flexibility than meaning. Structurally, nevertheless, we will see that this has some relevance to repository deposit workflow design.

Slide 10 The 3-layer metadata model can be seen quite clearly in the emerging user interface for data deposit built on SharePoint. Here we see the interface for collecting project descriptions, used once per project and then linked to each data record produced by the project.
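The relationship between a project description, entered once, and the data records linked to it can be sketched as a simple one-to-many structure. The class and field names below are hypothetical and do not reflect the actual Southampton forms or the layers of the IDMB model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Project:
    """A project description, entered once per project."""
    project_id: str
    title: str


@dataclass
class DataRecord:
    """A dataset description that links back to its project."""
    record_id: str
    title: str
    project_id: str


class Catalogue:
    """Hypothetical in-memory catalogue holding both kinds of record."""

    def __init__(self) -> None:
        self.projects: Dict[str, Project] = {}
        self.records: List[DataRecord] = []

    def add_project(self, project: Project) -> None:
        self.projects[project.project_id] = project

    def add_record(self, record: DataRecord) -> None:
        # A data record can only be linked to a project that already exists.
        if record.project_id not in self.projects:
            raise KeyError(f"unknown project: {record.project_id}")
        self.records.append(record)

    def records_for(self, project_id: str) -> List[DataRecord]:
        """All data records produced by one project."""
        return [r for r in self.records if r.project_id == project_id]


catalogue = Catalogue()
catalogue.add_project(Project("p1", "Example project"))
catalogue.add_record(DataRecord("d1", "First dataset", "p1"))
catalogue.add_record(DataRecord("d2", "Second dataset", "p1"))
```

The project details are stored once and reused by every linked record, which is the practical benefit of separating the two forms.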

Slide 11 In the same style, here is the SharePoint user interface for collecting data descriptions. One of the most noticeable features of both the Project and Data forms is the small number of mandatory fields (indicated with a red asterisk), just one on each form. Mandatory fields have to be filled in for the form to submit successfully. Most people will have encountered such fields when completing a Web shopping form: if one is left empty, the form is invariably returned with a warning in red text. In this case you could feasibly submit a project or data description containing only a title. Aspects such as this are shortly to be subjected to user testing and review of this implementation.
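The effect of having a single mandatory field can be shown with a small validation sketch. It assumes, following the remark about a title-only description, that the one mandatory field is the title; the field names and the `missing_mandatory` helper are hypothetical and are not taken from the SharePoint forms.

```python
from typing import Dict, Iterable, List, Optional

# Assumption for illustration: the title is the only mandatory field.
MANDATORY_FIELDS = ("title",)


def missing_mandatory(
    record: Dict[str, Optional[str]],
    mandatory: Iterable[str] = MANDATORY_FIELDS,
) -> List[str]:
    """Return the mandatory fields that are absent or blank, sorted by name."""
    return sorted(f for f in mandatory if not (record.get(f) or "").strip())


def can_submit(record: Dict[str, Optional[str]]) -> bool:
    """A form submits successfully only when no mandatory field is missing."""
    return not missing_mandatory(record)


title_only = {"title": "Example dataset"}
blank_title = {"title": "   ", "description": "Readings from one site"}
```

Here `can_submit(title_only)` is true even though nothing else is described, while `can_submit(blank_title)` is false despite the fuller description, which is exactly the trade-off that user testing would need to examine.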

Slide 12 SharePoint has its detractors as an IT service platform, principally bemoaning its complexity-to-functionality ratio. Prof Simon Cox from Southampton University takes the opposite view passionately. This is an extract from his intervention at a DataPool Steering Group meeting (May 2012) putting the case for SharePoint. It is a good way of understanding the wider strengths of SharePoint, which may not be immediately apparent to users of particular SharePoint services. Building the range of services suggested is a difficult and long-term project.

Slide 13 EPrints supports the deposit of many item types, including datasets since 2007. When you open a new deposit process in EPrints you will first be shown this screen, where you can select an item type such as ‘dataset’.

Slide 14 Selecting ‘dataset’ will take you to this next screen, which might look something like this from ePrints Soton, the Southampton Institutional Repository. This is not quite a default screen for standard EPrints installs; the workflow and fields have been customised in some areas by a repository developer.

Slide 15 EPrints users need not be restricted to standard interfaces or interfaces customised to a repository requirement. Interfaces in EPrints can be added to or amended by simply installing an app from the app store, or EPrints Bazaar. Unlike the Apple app store, with which it might optimistically be compared, EPrints apps are not selected to be installed by users but installation is authorised by repository managers. There are already two apps for those managers to choose to suit particular RDM workflow requirements: DataShare and Data Core. More data apps are expected to follow. EPrints is thus being engineered for flexibility in RDM deposit. In the following slides we will explore these first two data apps.

Slide 16 DataShare makes some minor modifications to the default EPrints workflow for deposit of datasets, highlighted with red circles here.

Slide 17 Data Core aims to implement a minimal ‘core’ metadata for datasets. Implementing this app will overwrite the default EPrints workflow, replacing it with the minimal set, approximately half of which is shown here (the remainder in the next slide). In addition, we have a short description of the design aims for Data Core; no comparable statement of aims is available for SharePoint data deposit or the DataShare app.

Slide 18 Taken together, the two slides showing the Data Core deposit workflow are comparable in extent with the SharePoint ‘data’ interface shown earlier, although Data Core has a few more mandatory fields.

Slide 19 Another example of an EPrints data deposit interface has been developed by Research Data @Essex at the University of Essex. Like Data Core, the Essex approach has explicit design objectives, based on aligning with other metadata initiatives to support multi-disciplinary data. In other words, this does not simply expand or reduce the default EPrints workflow for data deposit, but starts with a new perspective. We have been liaising with its development team to investigate the possibility of building this approach into an Essex EPrints app for other repositories to share.

Slide 20 Here is a section of the Essex workflow, highlighting one area of major difference with the default workflow. It shows fields for time- and geographic-based information.

Slide 21 We’ve looked at getting data into the repository, but not yet how it is displayed as an output, or a data record from the repository. This is one example. It is not the most revealing record, but could be expanded.

Slide 22 Essex has cited specific design criteria for its research data repository. Additionally we have observed some characteristic features, indicated here. In particular, it is a data-only repository, without provision for other data-types offered by EPrints (shown in slide 13). The indication of mandatory fields adds a further layer of insight into the implementation of the design criteria.

Slide 23 So far in this presentation we have seen different implementations of data repository deposit interfaces, including DataFlow, SharePoint, and multiple interfaces for EPrints. Where is this heading, and what are the common themes? Since we are exploring the difference between architecting and engineering these repositories, I was interested to see this national newspaper article about a major redevelopment of an area close to central London, Nine Elms, an area that interests me as I pass through it on a regular basis. Phrases that stand out refer to the relationship between the planned new high-rise buildings. What does this have to do with data repositories?

Slide 24 Interoperability is the relationship between repositories and how they interact with services, such as search, through shared metadata. If repositories have “nothing in particular to do with anything around them” or “show little interest in anything around” them, then they will not be interoperable. If repositories stand alone rather than interoperate then they become less effective at making their contents visible. Open access repositories have long recognised the importance of interoperability, being founded on the Open Archives Initiative (OAI) over a decade ago, and efforts to improve interoperability continue with current developments. Shown here are some current interoperability initiatives from one morning’s mailbox. Data repositories will be connected to this debate, but so far it has not been a priority in the examples we have considered here.
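As a concrete instance of interoperability through shared metadata, the OAI protocol for metadata harvesting (OAI-PMH) lets a service request records from any compliant repository with a plain HTTP query. The sketch below only constructs such a request URL. The `ListRecords` verb and the `oai_dc` metadata prefix are part of OAI-PMH, while the base URL and the helper name are placeholders for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # restrict harvesting to one set
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint: substitute the OAI-PMH base URL of a real repository.
BASE_URL = "https://repository.example.org/oai"
harvest_url = list_records_url(BASE_URL, from_date="2012-11-15")
```

Because every compliant repository answers the same query in the same way, a search service can harvest many repositories with one client, which is what a stand-alone data repository would miss out on.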

Slide 25 One of the organisations listed on the previous slide, COAR, produced a report that outlines more comprehensively the scope of current interoperability initiatives for open access. While some solutions to the capture of research data seen here have reasonably been ‘architected’, that is, starting with a blank sheet to focus on the specific design needs of data deposit, these will need to catch up quickly with interoperability requirements, including most of those listed here. Data repositories ‘engineered’ on a platform such as EPrints, originally designed for other data types, do not obviously lack the flexibility to accommodate research data, and by virtue of having contributed to repository interoperability since the original OAI, already support most of the requirements shown here.

Slide 26 As for the DataPool Project, it will continue its dual approach of developing and testing both SharePoint and EPrints apps. As a project it does not get to choose what is ultimately adopted to run the emerging research data repository at the University of Southampton. There are repository-specific factors that will determine that; but there are other organisational factors to take into account as well. Institutions seeking to build research data repositories that are clearly focussed on this range of factors are likely to have most success in implementing a repository to attract data deposit and usage.

This post has covered just one presentation, from DataPool, at RDMF9. The following two blog reports give a wider flavour of the event, the first exploring the architectural issues raised.

Julie Allinson, Some initial thoughts about RDM infrastructure @ York: “I’ll certainly carry on working up my architecture diagram, and will be drawing on the data coming out of our RDM interviews and survey to help flesh out the scenarios we need to support. But what I feel encouraged and even a little bit excited by is the comment by Kevin Ashley at the end of the RDMF9 event: that two years ago everyone was talking about the problem, and now people are coming up with solutions.”

Carlos Silva, RDMF9: Shaping the infrastructure, 14-15 November 2012: “Overall it was a good workshop which provided different points of view but at the same time made me realise that all the institutions are facing similar issues. IT departments will need to work more closely with other departments, and in particular the Library and Research Office in order to secure funding and make sustainable decisions about software.”

Feb 21 2012

Architecting research data management systems

Steve Hitchcock

There seemed to be general surprise following the revelation that Damien Hirst does not always ‘make’ his own works of art. Instead he leaves production to assistants based on his ideas and designs. In effect, he architects his art. Similarly, high profile architects like Norman Foster or Zaha Hadid are no less creative forces if they are not also builders.

In the rather different world of managing research data (MRD) we need systems to manage these data, but given the range of different types of data and emerging requirements of data producers, their institutions and users, we should be careful about simply adopting existing systems or even systems designs. The options are potentially wide, and complex. Instead of the systems engineer, the stage we are at needs the systems architect to take a high-level view of all the requirements to produce an elegant solution, fit for purpose and designed for the environment in which it is to be placed.

So I was interested to see the architecture for a research data repository at the University of Bristol illustrated by the JISC data.bris project. The accompanying description starts with front-ends and storage architectures and on the way refers to various technologies. The key feature, however, seems to be the recognition that this service must integrate with existing institutional information systems – not a unique view perhaps among current JISC projects, but one taken into account in the high-level architecture at the outset rather than as an afterthought.

Another of our companion JISC MRD projects, DataFlow at Oxford University, is developing a data deposit architecture that attracted my attention at the MRD programme launch meeting in Nottingham in December last year. This features a two-stage approach – DataStage and DataBank – that recognises different motivations for data deposit by researchers: (1) storage and management (mimicking the popular Dropbox approach), prior to (2) data publication and access. The first is driven by researchers themselves, while the second may be more often driven by formal requirements by funders, institutions and policies. Broadly, these stages might offer a simple deposit interface and a more formally structured metadata collection interface, respectively (although I wait to see whether this is what DataFlow provides). The point about this approach is that it supports, and links, both motivations for data deposit.

What does DataPool offer in terms of a data system architecture? At the same Nottingham meeting the project presented a poster (pdf) including a diagram of a proposed system architecture. For those that saw it this graphic may have caught the eye for the splash of colour it brought to the poster, but the viewing context was not ideal for detailed information. It is worth reproducing that illustration here with some reflection on what it might offer to the general architectural principles we need to establish for research data systems.

Adapting the Southampton Microsoft SharePoint 2010 system for data deposit

This diagram was produced by Peter Hancock, director of iSolutions, the University of Southampton’s ICT professional services department. Continuing development of this data architecture will remain an iSolutions responsibility in parallel with and beyond the DataPool Project. Where DataPool comes in is to seek to connect the data system approach with other institutional interests, notably researchers and users, through case studies and faculty contacts; training, through staff and graduate training centres; and policy, through the university’s research advisory and decision-making groups.

What follows are some thoughts on this architecture. First, as with DataFlow, this appears to be a two-stage architecture, in this case indicated as ‘SharePoint’ and ‘Dropbox type infrastructure’. Actually, it would be hard to compare this too closely with DataFlow, or even Dropbox, without some illustration of the respective deposit interfaces, and that is for another post.

This figure omits to connect other deposit interfaces, which could be EPrints or SWORD for example, with the University Data Repository and storage service. Such interfaces might be produced by DataPool or others, rather than by iSolutions, but these will still need access to the underlying university data infrastructure.

Second, there are two access routes for users, which can broadly be defined as internal and external to the institution. As this is a service-oriented architecture this is inevitable and presupposes a privileged view for internal users. Whether this privilege extends beyond their own work is under discussion.

Finally, deposit is not restricted to the institution’s repository but allows data to be moved and copied between institutional and external disciplinary or subject-based data repositories. This might be accomplished via a service such as SWORD. Research data policies promote such options, providing chosen data storage services are reputable and appropriate, rather than specifying particular data repositories, and many researchers wish to exercise such choice.

All institutional research data services will need to make provision for extensive and expanding data storage. This is not elaborated in the figure, and strategy, infrastructure and costs for this continue to be discussed at a high level.

The point of an architecture is that it becomes a detailed plan for building, or in this case implementing a research data system. Prior to that, as a high level abstraction it serves as a platform for input for all interested stakeholders, from all perspectives. For DataPool those perspectives span the whole of the University of Southampton, and to capture those we have ensured we shall be working across faculties, with the faculty contacts, and with data producers through disciplinary exemplars and case studies. Ultimately, to make progress this has to be an iterative process of development and feedback, but central to this is the development of the architecture because that is what should reach out to most people.

At this conceptual level an institutional data management architecture will raise many questions. We may or may not need SharePoint, EPrints, DSpace and other information systems solutions; what we need are systems to fit the architectural vision.