Collecting and archiving tweets: a DataPool case study
Information presented to a user via Twitter is often called a ‘stream’: a constant flow of data passing the viewer or reader. The totality of information passing through Twitter at any moment is often referred to as a ‘firehose’, a gushing torrent of information. Blink and you’ve missed it. But does this information have only momentary value or relevance? Is there additional value in collecting, storing and preserving these data?
A short report from the DataPool Project describes a small case study in archiving collected tweets by, and about, DataPool. It explains the constraints imposed by Twitter on the use of such collections, describes how a service for collections evolved within these constraints, and illustrates the practical issues and choices that resulted in an archived collection.
An EPrints application called Tweepository collects and presents tweets based on a user’s specified search term over a specified period of time (Figure 1). DataPool and researchers associated with the project were among the early users of Tweepository, using the app installed on a test repository. Collections were based on the project’s Twitter user name, other user names, and selected hashtags from conferences or other events.
A dedicated institutional Tweepository was launched at the University of Southampton in late 2012. A packager tool enabled the ongoing test collections to be transferred to the supported Southampton Tweepository without a known break in service or collection.
Institutional services such as Tweepository are as yet unavailable elsewhere, so for completeness as an exemplar data case study we archived the tweet collections towards the end of the DataPool Project in March 2013. We used the provided export functions to create a packaged version of selected, completed collections for transfer to another repository at the university, ePrints Soton.
Attached to our archived tweet collections in ePrints Soton (see Figure 2) are:
- Reviewable PDF of the original Tweepository Web view (with some “tweets not shown…”)
- Reviewable PDF of complete tweet collection without data analysis, from HTML export format
- JSON Tweetstream* saved using the provided export tool
- Zip file* from the Packager tool
* reviewable only by the creator of the record or a repository administrator

Figure 2. File-level management in ePrints Soton, showing the series of files archived from Tweepository and Twitter
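As an illustration of how an archived JSON export of tweets might be reused, the following is a minimal Python sketch that counts tweets per user and per hashtag. It is not part of Tweepository or the DataPool report. It assumes the export is either a JSON array or one JSON object per line, and that each tweet carries the `user.screen_name` and `entities.hashtags` fields found in Twitter API tweet objects; the actual Tweepository export layout may differ. The function names and the sample records are hypothetical.

```python
import json
from collections import Counter


def load_tweets(raw):
    """Parse export text as a JSON array, or as one JSON object per line.

    Assumption: the export uses one of these two layouts.
    """
    raw = raw.strip()
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def summarise(tweets):
    """Count tweets per screen name and occurrences of each hashtag."""
    users = Counter()
    hashtags = Counter()
    for tweet in tweets:
        # Field names are assumed to follow Twitter API tweet objects.
        users[tweet.get("user", {}).get("screen_name", "unknown")] += 1
        for tag in tweet.get("entities", {}).get("hashtags", []):
            hashtags[tag["text"].lower()] += 1
    return users, hashtags


# Hypothetical sample records, shaped like the assumed export format.
sample = '''
{"text": "Collecting and archiving tweets #jiscmrd", "user": {"screen_name": "jiscdatapool"}, "entities": {"hashtags": [{"text": "jiscmrd"}]}}
{"text": "What are the implications? #jiscmrd", "user": {"screen_name": "sparrowbarley"}, "entities": {"hashtags": [{"text": "jiscmrd"}]}}
'''

users, hashtags = summarise(load_tweets(sample))
print(users.most_common())
print(hashtags.most_common())
```

A summary of this kind is one simple way a later researcher could check what an archived collection contains before deciding whether to reuse it.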
We have since added the zip archive of the Project’s Twitter account, downloaded directly from Twitter, spanning the whole period from opening the account in November 2011. This Twitter download service applies only to the archive of a registered Twitter user, not to the general search collections possible with Tweepository.
The value of the data in these collections and archival versions will be measured through reuse by other researchers. It remains an open question, as it does for most research data entering the nascent services at institutions such as the University of Southampton.
Archiving tweets is a first step; realising the value of the data is a whole new challenge.
For more on this DataPool case study see the full report.
Thanks @sparrowbarley for prompting the following Twitter discussion:
jiscdatapool Collecting and archiving tweets: a DataPool case study. The report http://eprints.soton.ac.uk/350646/ #jiscmrd
5:47 PM – 27 Mar 13
sparrowbarley @jiscdatapool What are the copyright & privacy implications of archiving personally identifiable tweets? I didn’t see that covered. #jiscmrd
6:18 PM
jiscdatapool @sparrowbarley Good points. I covered rights in Twitter services, not tweets. Consensus seems to be most tweets not copyrightable
8:37 PM
jiscdatapool @sparrowbarley On privacy of tweets, assume Twitter privacy policy applies, even to 3rd-party archive. Only archiving public tweets and info
8:37 PM
jiscdatapool @sparrowbarley By replying to ‘@jiscdatapool’ your tweet is part of our archive. What is your view on rights and privacy of your tweets?
8:39 PM
sparrowbarley @jiscdatapool Thanks. Why? Because they’re small?
8:39 PM
lescarr @jiscdatapool does that consensus include lawyers???
8:47 PM
jiscdatapool @sparrowbarley On copyrightable tweets see http://www.wipo.int/wipo_magazine/en/2009/04/article_0005.html … Different for linked Twitter content (eg pics) http://www.wipo.int/wipo_magazine/en/2009/04/article_0005.html …
8:51 PM
jiscdatapool @lescarr wipo refs 1 lawyer, otherwise ‘experts’. Also law professors http://blogs.telegraph.co.uk/technology/shanerichmond/100004758/sxsw-2010-can-you-copyright-a-tweet/ …
8:55 PM
jiscdatapool @sparrowbarley Try again. On copyrightable tweets see http://www.wipo.int/wipo_magazine/en/2009/04/article_0005.html … Different for infringing tweets http://gigaom.com/2012/11/04/new-twitter-policy-lets-users-see-tweets-pulled-down-for-copyright/ …
9:04 PM
jiscdatapool @sparrowbarley Tweets that infringe typically link to copyright content, see http://arstechnica.com/tech-policy/2012/01/twitter-uncloaks-a-years-worth-of-dmca-takedown-notices-4410-in-all/ …
9:10 PM
Continuing the discussion from the previous comment on the ethics and privacy issues associated with archiving Tweets, initiated by @sparrowbarley and here joined by @briankelly. I have to thank these contributors for permission to reproduce this discussion, for reasons that will become apparent.
Note: this is presented mostly in chronological order, although where the chronology loses synchronisation the tweets are re-ordered for flow and readability within sub-threads.
jiscdatapool
@sparrowbarley By replying to ‘@jiscdatapool’ your tweet is part of our archive. What is your view on rights and privacy of your tweets?
8:39 PM – 27 Mar 13
sparrowbarley
@jiscdatapool Tbh it would be nice to know up front that by replying to you I’ll be in your tweet archive. Could you say so in your profile?
1:06 PM – 28 Mar 13
briankelly
@sparrowbarley Surely not a scaleable solution. @jiscdatapool
1:12 PM – 28 Mar 13
sparrowbarley
@briankelly @jiscdatapool But perhaps an ethical one nonetheless? I’m not actually bothered but was asked.
1:16 PM – 28 Mar 13
sparrowbarley
@jiscdatapool IDCC 11 had a paper on this. Now in IJDC: What Your Tweets Tell Us About You: Identity, Ownership and Privacy of Twitter Data
1:09 PM – 28 Mar 13
sparrowbarley
@briankelly @jiscdatapool I remember the presenters handed out sample consent forms for tweeps for collecting tweets. http://ijdc.net/index.php/ijdc/article/view/214
1:20 PM – 28 Mar 13
sparrowbarley
Office clearout: @jiscdatapool I just found it! Twitter data deposit form- not consent as I’d thought though that is a yes/no field.
6:12 PM – 1 Apr 13
sparrowbarley
@jiscdatapool ambiguous understanding of privacy and norms= “complex issues for curators trying to ensure…ethical reuse of Twitter data.”
1:25 PM – 28 Mar 13
sparrowbarley
@briankelly I think @jiscdatapool and I are debating asynchronously on twitter – not an ideal situation! Couldn’t even quote whole sentence.
1:26 PM – 28 Mar 13
sparrowbarley
@jiscdatapool So I moved on to ethics/privacy. But thanks for links on tweet copyright.Have you seen the “twitter fiction” in Sat @guardian?
1:39 PM – 28 Mar 13
sparrowbarley
@jiscdatapool @guardian Just mentioning because I bet they enjoy copyright. A whole story told in single tweet-quite well, by known authors.
1:41 PM – 28 Mar 13
briankelly
@sparrowbarley @jiscdatapool Some thoughts from last week’s discussion about copyright of tweets in context of people archiving @ tweets)
6:15 PM – 1 Apr 13
briankelly
@sparrowbarley @jiscdatapool Cookie law shows flaws in seeking to apply legislation which fails to address technological differences.
6:17 PM – 1 Apr 13
briankelly
@sparrowbarley @jiscdatapool Also see Guardian article on how Copyright wars are damaging the health of the internet: bit.ly/179ks58
http://www.guardian.co.uk/technology/blog/2013/mar/28/copyright-wars-internet
6:19 PM – 1 Apr 13
briankelly
@sparrowbarley @jiscdatapool So for me issue isn’t whether tweets are copyrighted, but whether its applicable to apply copyright legislation
6:20 PM – 1 Apr 13
briankelly
@sparrowbarley @jiscdatapool However I suspect commerical providers will be more innovative in analysis of tweets than HE 🙁
6:21 PM – 1 Apr 13
sparrowbarley
@briankelly @jiscdatapool Hmm. Well my point was in regard to the ethics of archiving other people’s tweets without their knowledge/consent.
6:23 PM – 1 Apr 13
sparrowbarley
@briankelly @jiscdatapool Unlike archiving a hashtag where I am adding my tweet to a collection voluntarily and know archiving is likely.
6:24 PM – 1 Apr 13
briankelly
@sparrowbarley @jiscdatapool So you object to tools like Storify?
6:24 PM – 1 Apr 13
briankelly
@sparrowbarley @jiscdatapool I’ve use a manual opt-out approach with Storify e.g. see heading at bit.ly/179l0I5
http://storify.com/briankelly/the-quality-of-embedded-metadata-in-pdfs-jan-2013/
6:25 PM – 1 Apr 13
sparrowbarley
@briankelly @jiscdatapool That looks reasonable. Presumably you also tweeted its existence so participants were likely toknow if they follow
6:35 PM – 1 Apr 13
briankelly
@sparrowbarley @jiscdatapool I’ve tweeted to ppl included, but not always, as it’s not a scalable solution (I’d be sending too many tweets)
6:38 PM – 1 Apr 13
sparrowbarley
@briankelly @jiscdatapool Yes I see but it seems they are somewhat likely to know about it. They have engaged with you, you have tweeted it.
6:40 PM – 1 Apr 13
jiscdatapool
@sparrowbarley @briankelly Thx for helpful extended dialogue on ethics, privacy and rights of tweet archiving
5:13 PM – 10 Apr 13
jiscdatapool
@sparrowbarley @briankelly Consistent with point raised, do I have your permission to ‘archive’ rest of tweet dialogue in blog comment?
5:18 PM – 10 Apr 13
briankelly
@jiscdatapool Me or @sparrowbarley ? I don’t require permission. I would like my tweets to be used sensibly (e.g. not out-of-context)
5:22 PM – 10 Apr 13
briankelly
@jiscdatapool BTW since yours is a project Twitter account, I would expect personal privacy considerations not to be relevant. Yes?
5:24 PM – 10 Apr 13
jiscdatapool
@briankelly We tweet differently for project accts. Do followers have different privacy expectations of projects? Hadn’t thought so. Maybe
5:56 PM – 10 Apr 13
briankelly
@jiscdatapool @sparrowbarley Similar to sharing a private conversation with others – you make a judgement as to whether it’s appropriate.
5:23 PM – 10 Apr 13
jiscdatapool
@briankelly @sparrowbarley Right or not, practical or not, if we accept need to seek permission we can’t assume it. But yes, not your point
5:45 PM – 10 Apr 13
briankelly
@jiscdatapool Do you ask permission to include (copyrighted) screen image? Suspect you don’t (& assume implied consent or take risk)
5:52 PM – 10 Apr 13
jiscdatapool
@briankelly Yes, but to choose to ask (in absence of other signals), or not, in all cases consistently is the point
6:00 PM – 10 Apr 13
sparrowbarley
@jiscdatapool @briankelly Thanks-sure, can you tweet the link so I can see if clarification is needed? Hard to discuss ‘ethics’ via twitter.
6:27 PM – 10 Apr 13
sparrowbarley
@jiscdatapool @briankelly Sorry I didn’t respond promptly; you have my permission!
5:24 PM – 23 Apr 13