Mar 21 2013

Cost-benefit analysis: experience of Southampton research data producers

Steve Hitchcock

When businesses seek to invest in new development they typically perform a cost-benefit analysis as one measure in the decison-making process. In contrast, it is in the nature of academic research that while the costs may be calculable the benefits may be less definable, at least at the outset. When we consider the management of data and outputs emerging from research, particularly in an institutional context such as DataPool at the University of Southampton, we reach a point where the need for cost-benefit analysis once again becomes more acute. In other words, investment on this scale has to be justified.

Ahead of the scaling up of these services institutionally we have enquired about experience of cost-benefits among some of the large research data producers at Southampton, which are likely to be among the earliest and most extensive users of data management services provided institutionally. In addition we have some pointers from a cross-disciplinary survey of imaging and 3D data producers at Southampton, commissioned by DataPool.

Broadly, we have found elaboration of costs and benefits among these producers, but not necessarily together. It has to be recognised that any switch from data management services currently used by these projects to an institutional service is likely to be cost-driven, i.e. can an institution lower the costs of data management and curation?

KRDS benefits triangle

eCrystals

First we note that for cost-benefits of curation and preservation of research data a formal methodology has been elaborated and tested: Keeping Research Data Safe (KRDS). This method has been used by one of the data producers consulted here, eCrystals, a data repository managed at Southampton for the National Crystallography Service, which participated in the JISC KRDS projects (Beagrie, et al.):

“This benefits case study on research data preservation was developed from longitudinal cost information held at the Department of Chemistry in Southampton and their experience of data creation costs, preservation and data loss profiled in KRDS”

This case study concludes with a table of great clarity, Stakeholder Benefits in three dimensions, based on the benefits triangle, comparing:

  1. Direct Benefits vs Indirect Benefits (Costs Avoided)
  2. Near Term Benefits vs Long-Term Benefits
  3. Private Benefits vs Public Benefits

μ-VIS Imaging Centre

The μ-VIS Imaging Centre at the University of Southampton has calculated its rate of data production as (Boardman, et al.):

up to 2 TB/day (robotic operation) – 20 GB projections+30 GB reconstruction=50GB in as little as 10-15 minutes

As this data generation and storage facility has grown it has been offered as a service beyond the centre, both within and outside the university. The current mix of users is tentatively estimated at

“10-20% commercial, 10-20% external academic (including collaborative work) and 60-70% internal research.”

This mix of users is relevant as, broadly, users will have a range of data storage centres to choose from. Institutional research data policy at Southampton does not require that data is deposited within Southampton-based services, simply that there is a public record of all research data produced and where it is stored, and a requirement that the services used are ‘durable’ and accessible on demand by other researchers, the latter being a requirement of research funders in the UK. We can envisage, therefore, a series of competitive service providers for research data, from institutions to disciplinary archives, archival organisations, publishers and cloud storage services.

The need to be competitive is real for the μ-VIS service:

“After we went through costings for everything from tape storage through to cloud silos, we noticed that we pretty much can’t use anything exotic without increasing cost, and introducing a new cost for the majority of users would generally reduce the attractiveness of the service.”

Although the emphasis here is on cost, implicitly there is a simple cost-benefit analysis underlying this statement, with possible benefits being traded for lower cost. These tradeoffs can be seen more starkly in the cost-reliablity figures (Boardman, et al.):

  • One copy on one hard disk: ~10-20% chance of data loss over 5 years – Approximate cost in 2012: ~$10/TB/year
  • Two copies on two separate disks: ~1-4% chance of data loss over 5 years – Approximate cost in 2012: ~$20/TB/year
  • “Enterprise” class storage (e.g. NetApp): <1% chance of data loss over 5 years – Approximate cost in 2012: ~$500/TB/year
  • Cloud storage. Provides a scalable and reliable option to store data, e.g. Amazon S3 – ‘11 nines’ reliability levels. Typical pricing around $1200/TB/year; additional charges for uploading and downloading
Richard Boardman, Ian Sinclair, Simon Cox, Philippa Reed, Kenji Takeda, Jeremy Frey and Graeme Earl, Storage and sharing of large 3D imaging datasets, International Conference on 3D Materials Science (3DMS), Seven Springs, PA, USA, July 2012

3D rendering of fatigue damage (from muvis collection)

Imaging and 3D data case study

Image data, including data on three-dimensional objects, are a data type that will be produced across all disciplines of a university. A forthcoming imaging case study report from DataPool (when available will be tagged here) surveys producers of such data at the University of Southampton to examine availability and use of facilities, support and data management. Although this study did not examine cost-benefits specifically, it overlaps with, and reinforces, findings from some of the data producers reported here. Gareth Beale, one of the authors of the study, highlights efficiency gains attributed to accountability, collaboration and sharing, statistical monitoring and planning.

“Research groups with an external client base seem to be much better at managing research data than those which do not perceive this link. This is, we might assume, because of a direct accountability to clients. These groups claimed considerable efficiency savings as a result of improved RDM.

“Equipment sharing between different research groups led to ‘higher level’ collaborations and the sharing of data/resources/teaching. An enhanced researcher and student experience, driven by more efficient use of resources.

“Finally, one group used well archived metadata collected from equipment to monitor use. This allowed them to plan investments in servicing, spare parts and most importantly storage. Estimated archive growth based on these statistics was extremely accurate and allowed for more efficient financial planning.”

Open data

Research data and open data have much in common but are not identical. With research data the funders are increasingly driving towards a presumption of openness, that is, visible to other researchers. While openness is inherent to open data, it goes further in prescribing the data is provided in a format in which it can be mixed and mined by data processing tools.

The Open Data Service at Southampton has pioneered the use of linked open data connecting administrative data and data of all kinds “which isn’t in any way confidential which is of use to our members, visitors, and the public. If we make the data available in a structured way with a license which allows reuse then our members, or anyone else, can build tools on top of it”

While the cost-benefit analysis of data, whether research data or open data, may be similar, there are additional benefits when data is open in this way:

  • getting more value out of information you already have
  • making the data more visible means people spot (and correct) errors
  • helping diverse parts of the organisation use the same codes for things (buildings, research groups)

Linked open data map of University of Southampton’s Highfield campus, from the Open Data Service

Conclusion

The allocation of direct and indirect costs within organisations will often drive actions and decisions in both anticipated but also unforeseen ways. There is ongoing debate about whether the costs of institutional research data management infrastructure should be supported by direct subvention from institutional funds, i.e. an institutional cost, or from project overheads supported by research funders, i.e. a research cost.

Although we do not yet see a single approach to cost-benefits among these data producers, if the result of the debate is to produce intended rather than unforeseen outcomes, it will be necessary to look beyond a purely cost-based analysis to invoke more formal cost-benefit analysis.

I am grateful to Gareth Beale, Richard Boardman, Simon Coles, Chris Gutteridge and Hembo Pagi for their input into this short report.


Dec 20 2012

DataPool benefits-evidence table

Steve Hitchcock

JISC, funder of DataPool, of other projects in research data management, and many more projects on widening use of digital technology in education, tends to focus on areas close to practical exploitation. On the R&D spectrum, it is typically towards the development end. For project managers, therefore, there is an emphasis on procedures and tools to increase the impact of practical outcomes – evaluation, sustainability, exit strategies, technology transfer, etc.

Another planning tool being adopted in the Managing Research Data Programme (MRD) 2011-13, of which DataPool is a part, is benefits-evidence analysis. As this description suggests, the idea is to elaborate prospective benefits of a project, and then identify the evidence that will demonstrate whether or not the benefit has been realised. It is as much about informing the process of getting to the results, and identifying which results are important and achievable, as the results themselves.

Hence, JISC MRD projects were invited to Bristol for a 2-day programme workshop at the end of November to present their benefits-evidence slides. If this sounds a little repetitive, it is but not uninteresting, especially as in preparing for the workshop all projects had essentially to engage in the same analysis, and were therefore armed not just with their own slide but ready to comment on others.

For project managers used to working towards outputs (products or services arising from the project) and outcomes (effects of the outputs on users in the target community), benefits are another factor. Hence, the JISC MRD programme has recruited a team of evidence gatherers, to work with and assist projects to hone and refine the benefits they are working towards and the consequent evidence measures. “Those are more outputs than benefits” I was advised, fairly, during open discussion on some ‘benefits’ in my slide. But then I had seeded the slide with points to discuss rather than a definitive list, and unwittingly extended the project’s previously discussed benefits.

So after the workshop I was grateful for the advice of Laura Molloy, evidence gatherer for DataPool, on aligning our pre- and post-workshop benefits lists.

After all that effort it would be a remiss not to reveal our benefits-evidence table that emerged from the process. For the record, here are the benefits DataPool will seek to demonstrate in its final months into early 2013.

DataPool: Benefits-Evidence

1 Improved RDM skills across the target community, including researchers and professional support staff Qual reporting on effectiveness of training events.
Feedback from training courses and deskside consultations, DMP and email help services.
More staff running RDM support services, increased service offer.
2 Greater visibility and use of institution’s research data / research outputs through sharing, collaboration, reuse Qual case study describing improved dataset exposure.
Qual evidence of DMP engagement, including early indications of access routes.
* Quant indication of increase in dataset downloads.
No. of datasets stored in data repository.
Accesses of open datasets vs closed datasets vs shared datasets.
3 Sustained institutional support for RDM / sustainability for RDM infrastructure at institution No. of training opportunities introduced.
Scope of: deskside consultations, DMP support service.
Results from case studies – engagement with existing data facilities.
Assessment of added value for institution of using institutional storage over other options – report.
4 Improved use/uptake of RDM infrastructure Quant account of ‘bid preparation consultations’, inc. qual narrative of referrals to data policy and DMP help.
Case study on working with data policy – feedback on uptake of policy.
Quant tracking of higher attendance at training.
Accesses to RDM guidance documents.
No. of deskside consultations.
* Quant indication of improved uptake of institutional storage and deposit options.
No. of large data projects switching to institutional data service.
5 Time / costs saved by improved RDM infrastructure Identifying early cost-benefits – combined case studies report, inc large data projects, open data, imaging, disciplinary efficiencies.
Assessment of added value for institution of using institutional storage over other options – report (see 3).

* This evidence not expected to be available during DataPool Project, following launch of RDM repository service by project end, but will be collected in ongoing work at Southampton University on institutional RDM. Table by Steve Hitchcock for DataPool, in collaboration with Wendy White, Dorothy Byatt. We gratefully acknowledge the feedback and suggestions from Laura Molloy, JISC evidence gatherer.

The University of Southampton has a 10 year roadmap for research data, of which DataPool represents the first stretch of road, so there is a commitment to go further, but the clearer the steer from DataPool the faster the progress afterwards.

As a little light relief from projects’ benefits-evidence slides, a presentation on the Southampton roadmap and business plan was given at the Bristol workshop. That will be covered in a separate post.

How will you know which benefits have been achieved as the project moves forward? This post is tagged with the label ‘benefits’. All updates reporting evidence from the table above will use this tag. Tags can be found in the column immediately to the right of this one, and up, from this point in the post.

This is how other JISC MRD projects are tackling these challenges and what benefits-evidence are being targetting:


Mar 8 2012

A project for the research life cycle?

Dorothy Byatt

How do we view a project like DataPool?  What are we hoping to achieve?  These are important questions that need to be kept in mind throughout the life of the project.  It can be easy to become focused on specific tasks.  Projects can be seen as simply “a project” with a fullstop and an end, but DataPool is more than “just” a project testing ideas and systems.  It will do that, but we hope that it will do much more.  DataPool is about beginning the process of embedding the management of research data into the infrastructure and culture of our institution.  DataPool is here to make a difference and to make it throughout the research life cycle, from proposal to storing and sharing.

Bedrock

The bedrock underpinning the project will be a Research Data Management Policy for the University.  This will be key Pozo de las animas by Alejandro Colombo CC BY-NC-SA 2.0to all the other work and will inform the related guidance and training requirements.  Its development is being seen as an iterative process with views of the academic community initially being gathered through designated “data” contacts within the Faculties.  The policy will be valuable in informing data management processes in the University, influence plans where required by funders and will be a significant benefit arising from the DataPool project.  We would hope by the end of the project to see an increased number of references to the policy within research proposals, resulting over time in an increased number of datasets held securely and in a location that makes them available for re-use.

Infrastructure

The increased focus on research data, its management, storage and sharing requires that the systems offered within the University of Southampton are adapted and developed so that they can meet this need.  DataPool will be of benefit to this process.  DataPool will work to inform the decisions concerning the technical infrastructure of the institution to provide a simple deposit system that will also facilitate sharing at the appropriate time and under approved conditions.  This will be geared towards the individual researcher, influenced by case studies and discipline exemplars, with the aim of seeing how best it can support the research data workflow and capture metadata from existing University systems.  By the end of the project we would expect to have enhanced the storage and deposit options available, and seen an improved uptake of them.

Support

The start of the data life cycle is long before the creation of any data and really begins with the research proposal.  We plan to draw together a network of services that will support the researcher from proposal to deposit.  This will draw on existing services and expertise, both internal, such as our Research and Innovation Service, Doctoral Training Centres, Library, and external ones, such as the Digital Curation Centre.  We aim to create: guidance sheets; training materials; and to offer workshops and a web site. These will enhance the support that academic, professional and support staff can provide, whether for writing plans or advice on versioning through to different levels and types of metadata.  We would see the establishment of a central web site as an important step in this area.  The creation of this support will be a direct benefit arising from the Datapool project.