Mar 28 2013

DataPool Steering Group, third meeting

Steve Hitchcock

There are moments for a project to be steered, and others when the results of steering come to the fore. The latter was the case for the third and final Steering Group meeting (final at least in the context of DataPool, if not of research data management at the University of Southampton), held on 12 March 2013. Another well-attended meeting testified to the commitment to continue this work across the university beyond the end of the DataPool project, which completes its term of JISC funding at the end of March.

As with other posts in this series on the project Steering Group meetings, we present a record of these meetings based on copies of documents made available to the group prior to the meeting, where possible.

Collected documents for 3rd Steering Group meeting

  • Agenda, Steering Group meeting, 12 March 2013
  • Minutes of previous Steering Group meeting, 12 November 2012
  • Progress Report by Wendy White, DataPool PI

From the Introduction to the Progress Report: “We will take the opportunity with this final Steering Group update to highlight key areas of activity over the last period and illustrate themes for sustainability as we approach the mid-term of the 10 year roadmap (for research data at Southampton).

“Almost all aspects of the project have required collaboration. This has been true of the development of policy, joint work to review storage and investment priorities, training design and delivery, research data management planning and support services and iterating technical developments.  To help support on-going collaboration in the next phase of the roadmap some of the responsibilities as PI of this project are now formally embedded in my role to continue to take initiatives forward and lead co-ordination of services. This reflects an institutional commitment to ensure that responsibility for research data management is reflected in a range of existing roles and not handled as adjunct activity.”

Case study reports

Case studies produced by DataPool were circulated prior to the meeting in draft or summary form. If linked here these will point to complete final versions. If not linked try the case studies tag for updates.

Screenshot of Tweepository: jiscdatapool tweet collection, from the Collecting and archiving tweets case study

Members of the steering group present at the meeting (University of Southampton unless otherwise indicated): Wendy White (Chair, DataPool PI and Head of Scholarly Communication), Philip Nelson (Pro-VC Research), Adam Wheeler (Provost and DVC), Mark Brown (University Librarian), Helen Snaith (National Oceanography Centre Southampton), Mylene Ployart (Associate Director, Research and Innovation Services), Sally Rumsey (Digital Collections Development Manager at The Bodleian Libraries, University of Oxford), Oz Parchment (iSolutions), Les Carr (Electronics and Computer Science), Simon Cox (Engineering Sciences), Jeremy Frey (Chemistry), Simon Coles (Chemistry, case presenter), Gareth Beale (Archaeology, case presenter), Dorothy Byatt, Steve Hitchcock (DataPool Project Managers). Joined by teleconference: Louise Corti (Associate Director, UK Data Archive). Apologies from: Graham Pryor (Associate Director, Digital Curation Centre).


Mar 27 2013

Collecting and archiving tweets: a DataPool case study

Steve Hitchcock

Information presented to a user via Twitter is commonly called a ‘stream’, that is, a constant flow of data passing the viewer or reader. When the totality of information passing through Twitter at any moment is considered, the flow is often referred to as a ‘firehose’, in other words, a gushing torrent of information. Blink and you’ve missed it. But does this information have only momentary value or relevance? Is there additional value in collecting, storing and preserving these data?

A short report from the DataPool Project describes a small case study in archiving collected tweets by, and about, DataPool. It explains the constraints imposed by Twitter on the use of such collections, describes how a service for collections evolved within these constraints, and illustrates the practical issues and choices that resulted in an archived collection.

An EPrints application called Tweepository collects and presents tweets based on a user’s specified search term over a specified period of time (Figure 1). DataPool and researchers associated with the project were among early users of Tweepository using the app installed on a test repository. Collections were based on the project’s Twitter user name, other user names, and selected hashtags, from conferences or other events.

Figure 1. Creating and editing a record for a Tweepository collection based on search terms

A dedicated institutional Tweepository was launched at the University of Southampton in late 2012. A packager tool enabled the ongoing test collections to be transferred to the supported Southampton Tweepository without a known break in service or collection.

For completeness as an exemplar data case study, given that institutional services such as Tweepository are as yet unavailable elsewhere, tweet collections were archived towards the end of the DataPool Project in March 2013. We used the provided export functions to create a packaged version of selected, completed collections for transfer to another repository at the university, ePrints Soton.

Attached to our archived tweet collections in ePrints Soton (see Figure 2) are:

  1. Reviewable PDF of the original Tweepository Web view (with some “tweets not shown…”)
  2. Reviewable PDF of complete tweet collection without data analysis, from HTML export format
  3. JSON Tweetstream* saved using the provided export tool
  4. Zip file* from the Packager tool

* reviewable only by the creator of the record or a repository administrator
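As an illustration of how an archived JSON file of this kind might be inspected later, here is a minimal Python sketch. The layout of the Tweepository export is not described in this post, so the sketch assumes a JSON array of tweet objects that keep the `created_at` and `user.screen_name` fields of the Twitter API v1.1. The function names and the sample records are invented for illustration.

```python
import json
from collections import Counter
from datetime import datetime

# Timestamp layout used in Twitter API v1.1 tweet objects,
# e.g. "Tue Mar 12 10:15:00 +0000 2013".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def summarise_tweets(tweets):
    """Return (count, earliest, latest, tweets per account) for a non-empty list of tweet dicts."""
    times = [datetime.strptime(t["created_at"], TWITTER_TIME) for t in tweets]
    authors = Counter(t["user"]["screen_name"] for t in tweets)
    return len(tweets), min(times), max(times), authors


def summarise_file(path):
    """Load an export file (assumed to hold a JSON array of tweets) and summarise it."""
    with open(path, encoding="utf-8") as handle:
        return summarise_tweets(json.load(handle))


# Made-up records for illustration only; they are not taken from the collection.
sample = [
    {"created_at": "Tue Mar 12 10:15:00 +0000 2013",
     "user": {"screen_name": "jiscdatapool"}},
    {"created_at": "Thu Mar 28 09:00:00 +0000 2013",
     "user": {"screen_name": "example_user"}},
]
count, first, last, authors = summarise_tweets(sample)
print(count, first.date(), last.date(), dict(authors))
```

A summary like this (number of tweets, date span, contributing accounts) is the kind of basic check a later reuser might run before any deeper analysis. If the real export uses a different structure, only the field lookups would need to change.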

Figure 2. File-level management in ePrints Soton, showing the series of files archived from Tweepository and Twitter

We have since added the zip archive of the Project’s Twitter account, downloaded directly from Twitter, spanning the whole period from opening the account in November 2011. This service only applies to the archive of a registered Twitter user, not the general search collections possible with Tweepository.

The value of the data in these collections and archival versions will be measured through reuse by other researchers. It remains an open question, as it does for most research data entering the nascent services at institutions such as the University of Southampton.

Archiving tweets is a first step; realising the value of the data is a whole new challenge.

For more on this DataPool case study see the full report.


Mar 25 2013

Institutional alignments for progressing research data management

Steve Hitchcock

Can visualisation of alignments – of people and ideas across an institution – reveal and predict progress towards research data management (RDM)?

DataPool has been seeking to institute formal RDM practices at the University of Southampton on three fronts – policy, technical infrastructure, and training – as we have noted before. In addition, the university has a longer-term roadmap looking years beyond the point reached in DataPool.

One aspect of this work we haven’t addressed is the alignments that have been instrumental in making progress on these three fronts. If we can visualise these alignments, the result not only charts progress but may also reveal new alignments that need to be forged, and gaps in existing alignments may hold lessons for future progress. Since in terms of these alignments the University of Southampton may be distinctive but not unique, this analysis might extend to other institutional RDM projects. That is the idea, at least, behind the latest DataPool poster presentation, shown below, prepared for the final JISC MRD Programme Workshop (25-26 March 2013, Aston Business School, Birmingham).


Within DataPool we have established formal and informal networks of people that connect with and cross existing institutional forums. For example, the project has close and regular contact with an advisory group of disciplinary experts, has established a network of faculty contacts, has been working with the multidisciplinary strands of the University Strategic Research Groups (USRGs), and with senior managers and teams in IT support (iSolutions) and Research and Innovation Services (RIS). At the apex, we have a high-level steering group that spans all of these areas and also includes senior institutional managers (Provost, Pro-VC) as well as leaders from external data management organisations. A series of case studies provides insights into the current data practices and needs of those researchers who are data creators and users.

Returning to the three fronts of our investigations, we have reached either natural and expected conclusions ready to be taken forward beyond DataPool, or in some cases incomplete and possibly unexpected conclusions. Below we reveal and assess the alignments that have driven progress on these three fronts:

Policy. Approved by Senate, the University’s ‘primary academic authority’, following recommendations from the Research and Enterprise Advisory Group (REAG), and officially published within the University Calendar. This alignment did not happen by chance, but began to be formed by the library team through the IDMB project and was taken forward within DataPool. Supporting documentation and guidance for the policy is provided on the University Library web site. The policy is effective from publication, but with a ‘low-profile’ launch and follow-up it has by design not had widespread impact on researchers to date.

Data infrastructure. Research data apps for EPrints repositories, with selected apps installed on ePrints Soton, the institutional repository, which is now better structured for data deposit. Progress has been made with initial interfaces in Sharepoint, the university’s multi-service IT support platform, to describe data projects and facilitate data deposit; there has been some user testing, but this work remains incomplete. On storage infrastructure it has not been possible to cost extensions to the existing institutional storage provision, which limits the extension of data services to large and regular data producers, who by definition are the most active data researchers. One late development has been to add support for minting and embedding DataCite DOIs for data citation in data repositories at Southampton.

Training and support. Principally extended towards PhD and early career researchers, and in-service support teams in the library. Plans to embed RDM training within the university’s extended support operations across all training areas, Gradbook and Staffbook. One highlight in this area is the uptake of support for data management planning (DMP), particularly at the stage of submitting research project proposals for funding.

In these examples we can see alignments spanning governance-IT-services-users.

From the brief descriptions of these fronts it can be seen that the existing alignments have brought us forward, but to go further we have to return to those alignments and reinforce the actions taken so far: to widen awareness, impact and uptake of policy; to provide adequate and usable RDM infrastructure for data producers; to develop and integrate training support within the primary delivery channels.

Almost all of these outcomes and the need for more follow-through can be traced to the alignments. However, the element that remains elusive across these alignments, despite being a perennial target, is the researcher and data producer. Data initiatives, whether from institutions or wider bodies such as research funders, start out with the researcher in mind, but can lose momentum if the researcher appears not to engage. That may be because the benefits identified do not align with the interests of the researcher, or it may be because at a practical level the support and resources provided are insufficient. Thus the extended alignments required for full RDM do not materialise. Worse, the existing alignments can be prematurely discouraged and lack the incentives and confidence to promote the real innovation they have delivered, which in turn affects investment decisions and service development.

Where the researcher is engaged the results can be quite different, as seen in the DataCite example, motivated and developed by researchers, and in DMP uptake, where researchers clearly begin to recognise both the emergence of good practice in digital data research and the need for compliance with emerging policy.

These alignments are a crucial but largely unnoticed aspect of DataPool, and no doubt of other similar #jiscmrd projects at other institutions as well. If this analysis is correct then for institutional-scale projects alignments can both reveal and predict progress.


Mar 21 2013

Cost-benefit analysis: experience of Southampton research data producers

Steve Hitchcock

When businesses seek to invest in new development they typically perform a cost-benefit analysis as one measure in the decision-making process. In contrast, it is in the nature of academic research that while the costs may be calculable the benefits may be less definable, at least at the outset. When we consider the management of data and outputs emerging from research, particularly in an institutional context such as DataPool at the University of Southampton, we reach a point where the need for cost-benefit analysis once again becomes more acute. In other words, investment on this scale has to be justified.

Ahead of the scaling up of these services institutionally we have enquired about experience of cost-benefits among some of the large research data producers at Southampton, which are likely to be among the earliest and most extensive users of data management services provided institutionally. In addition we have some pointers from a cross-disciplinary survey of imaging and 3D data producers at Southampton, commissioned by DataPool.

Broadly, we have found elaboration of costs and benefits among these producers, but not necessarily together. It has to be recognised that any switch from data management services currently used by these projects to an institutional service is likely to be cost-driven, i.e. can an institution lower the costs of data management and curation?

KRDS benefits triangle

eCrystals

First we note that for cost-benefits of curation and preservation of research data a formal methodology has been elaborated and tested: Keeping Research Data Safe (KRDS). This method has been used by one of the data producers consulted here, eCrystals, a data repository managed at Southampton for the National Crystallography Service, which participated in the JISC KRDS projects (Beagrie, et al.):

“This benefits case study on research data preservation was developed from longitudinal cost information held at the Department of Chemistry in Southampton and their experience of data creation costs, preservation and data loss profiled in KRDS”

This case study concludes with a table of great clarity, Stakeholder Benefits in three dimensions, based on the benefits triangle, comparing:

  1. Direct Benefits vs Indirect Benefits (Costs Avoided)
  2. Near Term Benefits vs Long-Term Benefits
  3. Private Benefits vs Public Benefits

μ-VIS Imaging Centre

The μ-VIS Imaging Centre at the University of Southampton has calculated its rate of data production as (Boardman, et al.):

up to 2 TB/day (robotic operation) – 20 GB projections + 30 GB reconstruction = 50 GB in as little as 10-15 minutes
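As a rough check on how these figures relate, the short Python sketch below converts the per-scan volume into daily rates. The use of decimal units (1 TB = 1000 GB) and the idea of uninterrupted round-the-clock scanning are assumptions made here for illustration, not figures from the centre. The function names are likewise illustrative.

```python
GB_PER_TB = 1000  # decimal units; an assumption, the source does not specify


def scans_per_day(daily_tb, scan_gb=50):
    """Number of scans of scan_gb each that add up to daily_tb."""
    return daily_tb * GB_PER_TB / scan_gb


def daily_tb_if_continuous(scan_gb=50, minutes_per_scan=15):
    """Daily volume if one scan completed every minutes_per_scan, 24 hours a day."""
    scans = 24 * 60 / minutes_per_scan
    return scans * scan_gb / GB_PER_TB


print(scans_per_day(2))                # scans needed to reach the quoted 2 TB/day
print(daily_tb_if_continuous(50, 15))  # ceiling at 15 minutes per scan
print(daily_tb_if_continuous(50, 10))  # ceiling at 10 minutes per scan
```

On these assumptions, 2 TB/day corresponds to 40 scans of 50 GB, while back-to-back scanning at 10-15 minutes per scan would give 4.8 to 7.2 TB/day. The quoted daily maximum therefore presumably allows for sample changes and idle time.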

As this data generation and storage facility has grown it has been offered as a service beyond the centre, both within and outside the university. The current mix of users is tentatively estimated at

“10-20% commercial, 10-20% external academic (including collaborative work) and 60-70% internal research.”

This mix of users is relevant as, broadly, users will have a range of data storage centres to choose from. Institutional research data policy at Southampton does not require that data is deposited within Southampton-based services, simply that there is a public record of all research data produced and where it is stored, and a requirement that the services used are ‘durable’ and accessible on demand by other researchers, the latter being a requirement of research funders in the UK. We can envisage, therefore, a series of competitive service providers for research data, from institutions to disciplinary archives, archival organisations, publishers and cloud storage services.

The need to be competitive is real for the μ-VIS service:

“After we went through costings for everything from tape storage through to cloud silos, we noticed that we pretty much can’t use anything exotic without increasing cost, and introducing a new cost for the majority of users would generally reduce the attractiveness of the service.”

Although the emphasis here is on cost, implicitly there is a simple cost-benefit analysis underlying this statement, with possible benefits being traded for lower cost. These tradeoffs can be seen more starkly in the cost-reliability figures (Boardman, et al.):

  • One copy on one hard disk: ~10-20% chance of data loss over 5 years – Approximate cost in 2012: ~$10/TB/year
  • Two copies on two separate disks: ~1-4% chance of data loss over 5 years – Approximate cost in 2012: ~$20/TB/year
  • “Enterprise” class storage (e.g. NetApp): <1% chance of data loss over 5 years – Approximate cost in 2012: ~$500/TB/year
  • Cloud storage. Provides a scalable and reliable option to store data, e.g. Amazon S3 – ‘11 nines’ reliability levels. Typical pricing around $1200/TB/year; additional charges for uploading and downloading

Richard Boardman, Ian Sinclair, Simon Cox, Philippa Reed, Kenji Takeda, Jeremy Frey and Graeme Earl, Storage and sharing of large 3D imaging datasets, International Conference on 3D Materials Science (3DMS), Seven Springs, PA, USA, July 2012
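
The cost-reliability figures listed above can be set side by side for a given volume of data. The Python sketch below is a minimal illustration under stated assumptions: prices stay flat at the 2012 rates, cloud transfer charges are left out, and the quoted chance of data loss is treated as the chance of losing the whole dataset. The 2 TB example volume and the function names are chosen here for illustration.

```python
# Figures quoted from Boardman et al. (2012): cost in US$/TB/year and the
# chance of data loss over 5 years as a (low, high) range. "<1%" is entered
# as 0 to 1%. No 5-year loss figure is quoted for cloud storage, so it is None.
OPTIONS = {
    "one disk": {"cost": 10, "loss": (0.10, 0.20)},
    "two disks": {"cost": 20, "loss": (0.01, 0.04)},
    "enterprise": {"cost": 500, "loss": (0.0, 0.01)},
    "cloud": {"cost": 1200, "loss": None},
}


def five_year_cost(option, terabytes, years=5):
    """Storage cost in US$ at a flat annual price, excluding transfer charges."""
    return OPTIONS[option]["cost"] * terabytes * years


def expected_loss_tb(option, terabytes):
    """Expected TB lost over 5 years as (low, high), or None if no figure is quoted."""
    loss = OPTIONS[option]["loss"]
    if loss is None:
        return None
    low, high = loss
    return (low * terabytes, high * terabytes)


# Example: 2 TB, the quoted peak daily output of the imaging centre.
for name in OPTIONS:
    print(name, five_year_cost(name, 2), expected_loss_tb(name, 2))
```

For 2 TB over five years this gives $100 for a single disk, $200 for two disks, $5,000 for enterprise storage and $12,000 for cloud storage, which makes plain how steeply cost rises for each step down in risk.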

3D rendering of fatigue damage (from muvis collection)

Imaging and 3D data case study

Image data, including data on three-dimensional objects, are a data type that will be produced across all disciplines of a university. A forthcoming imaging case study report from DataPool (when available will be tagged here) surveys producers of such data at the University of Southampton to examine availability and use of facilities, support and data management. Although this study did not examine cost-benefits specifically, it overlaps with, and reinforces, findings from some of the data producers reported here. Gareth Beale, one of the authors of the study, highlights efficiency gains attributed to accountability, collaboration and sharing, statistical monitoring and planning.

“Research groups with an external client base seem to be much better at managing research data than those which do not perceive this link. This is, we might assume, because of a direct accountability to clients. These groups claimed considerable efficiency savings as a result of improved RDM.

“Equipment sharing between different research groups led to ‘higher level’ collaborations and the sharing of data/resources/teaching. An enhanced researcher and student experience, driven by more efficient use of resources.

“Finally, one group used well archived metadata collected from equipment to monitor use. This allowed them to plan investments in servicing, spare parts and most importantly storage. Estimated archive growth based on these statistics was extremely accurate and allowed for more efficient financial planning.”

Open data

Research data and open data have much in common but are not identical. With research data the funders are increasingly driving towards a presumption of openness, that is, data visible to other researchers. While openness is inherent to open data, open data goes further in prescribing that the data be provided in a format in which it can be mixed and mined by data processing tools.

The Open Data Service at Southampton has pioneered the use of linked open data connecting administrative data and data of all kinds “which isn’t in any way confidential which is of use to our members, visitors, and the public. If we make the data available in a structured way with a license which allows reuse then our members, or anyone else, can build tools on top of it”.

While the cost-benefit analysis of data, whether research data or open data, may be similar, there are additional benefits when data is open in this way:

  • getting more value out of information you already have
  • making the data more visible means people spot (and correct) errors
  • helping diverse parts of the organisation use the same codes for things (buildings, research groups)

Linked open data map of University of Southampton’s Highfield campus, from the Open Data Service

Conclusion

The allocation of direct and indirect costs within organisations will often drive actions and decisions in both anticipated and unforeseen ways. There is ongoing debate about whether the costs of institutional research data management infrastructure should be supported by direct subvention from institutional funds, i.e. an institutional cost, or from project overheads supported by research funders, i.e. a research cost.

Although we do not yet see a single approach to cost-benefits among these data producers, if the debate is to produce intended rather than unforeseen outcomes it will be necessary to look beyond purely cost-based analysis to more formal cost-benefit analysis.

I am grateful to Gareth Beale, Richard Boardman, Simon Coles, Chris Gutteridge and Hembo Pagi for their input into this short report.


Mar 4 2013

Confused about data management? Hands up

Gareth Beale
Dr. Julius Axelrod checking a student's work on the chemistry of catecholamine reactions in nerve cells.

Data management can be a daunting subject

I will always remember sitting in my chemistry lesson trying in vain to balance an equation and our teacher looking me in the eye and telling me that if I wasn’t sure then I should put up my hand. I can still hear his encouraging words: “If you are confused then you can bet that most other people are too, so don’t be embarrassed”. This traumatic moment in my life came back to me recently when I was talking to a colleague about the management of our research data. I will explain…

As part of the DataPool project Hembo Pagi and I have been talking to users of 3D and 2D imaging data. We were interested in finding out how this community were adapting to growing data sets and increasing demands to make these data available. We both produce vast amounts of imaging data as part of our work in archaeology and we were very interested in knowing what kinds of data management strategies other people were using. “We have our own ways of coping but surely other people have got these problems sorted?” we said to ourselves.

Four months later and we are just beginning to uncover answers to this question. As we went around the University talking to physicists, artists, archaeologists, geographers, oceanographers and many others we discovered that there were as many answers as there were researchers.

All of us have different requirements because we work in different areas, use different data, have different outputs and have different resources available to us. We found that not only were people responding to the challenges of data management with amazing creativity and resourcefulness, they were all doing so in unique ways. The range of creative responses which we have encountered paints a picture of a research community that is eager to deal with the challenges and opportunities of data management.

However, like us, very few of these researchers were aware of the approaches adopted by others. Innovative and highly developed data management strategies are frequently used by a small group of researchers but are unknown to the wider research community. The key to making data management work is to devise an institutional approach which reflects the needs of the users. If we are going to design infrastructure and support mechanisms which work then they must be designed in response to real challenges and real research scenarios.

Children in a classroom with hands up facing the camera

Hands Up if you want to join in.

Which finally brings us back to that hot chemistry lab in the early 1990s. If we have problems with data management and we don’t know how to solve them then we need to put our hand up and ask. Our conversations with researchers have clearly shown that we are all facing similar challenges. Conversely, if we have ideas which might help others (and nearly all of you do) then we need to share them. Our report will suggest that improved communication should lie at the heart of the way in which the University plans for institutional data management. As systems which might facilitate these conversations are developed it is important that the considerations of researchers are taken into account.

Help with data management can currently be sought from a number of sources including the Library, Library Digitisation Unit and the Software Sustainability Institute, which are all based here at Southampton. But in addition to this it is important that we talk to each other. If you have a problem relating to the management of data then you can be sure that somebody, somewhere in the University has been there before and can help you to solve it.

If you would like to contribute to the development of a forum of this type, have ideas about what form it might take or you just have questions about data management and don’t know where to look then please get in touch. You can email specialists in data management at the library at data@soton.ac.uk. For more information about the DataPool Project go to datapool.soton.ac.uk, or if you have comments or ideas then email me at gareth[dot]beale[at]soton[dot]ac[dot]uk.