Is your research based around the measurement of public opinion? Are you interested in changing social attitudes? If you’re thinking of using content from social media platforms like Facebook, Twitter, or LinkedIn as key sources of research data then you may want to read a recently published Technology Watch Report from the Digital Preservation Coalition on “Preserving Social Media”.
Published in February this year, the report throws interesting light onto issues of archiving and preservation of social media content for social research, and shows how research is helping unpick the technical and legal difficulties associated with this very new area of study. I summarise some of the key points below, but if your research might use content from social media, it’s worth reading the original.
Traditional social research involving human participants takes great care over obtaining their consent, but most users of social media platforms tick away the ownership of their personal data without much thought. Social media archives use data owned by corporations, but created by end users with little power in the social media ecosystem.
Accidental disclosure of personal information is made more likely by the interlinked big datasets of modern social media platforms. Researchers will have to work even harder to protect “the right to be forgotten”.
Commerce vs Public Good
Most social research is conducted for the public good. Social media runs on a commercial model and therefore treats data as a commercial asset rather than a public good. Social media platforms sell data to businesses to measure current trends and behaviours; they are not interested in the long-term value of their data, a key area of interest to social research. This difference in approach affects the ways in which they make their data available and the controls they place on its further use: researchers are prohibited from sharing raw data, or from publishing it except in small, non-machine-readable datasets. Some large archives store the raw data and provide access to a few researchers, whilst negotiating with data owners for future relaxation of controls.
Worryingly, many platforms do not have an internal preservation policy; of all the major social media platforms, only Twitter has allowed the Library of Congress to archive its entire collection of tweets. It has not yet allowed free access to that archive.
Transient Big Data
The multi-platformed, linked nature of social media data makes it hard to select which data to preserve or store. A tweet, for instance, can contain up to 140 characters together with images, shortened URLs and embedded links to other social media content. For a researcher to derive meaning from that content at a later date, some context has to be stored with the data. Geolocation data, hashtags, keywords and timestamps can all help preserve context and give meaning to a specific collection.
The huge volumes of data generated by social media mean storage can be a problem, especially as current EU legislation restricts the use of cloud storage to EU locations. Meaningful access by future researchers to vast data collections depends upon the development of robust database architectures that can cope with natural language queries like “Donald Trump” or “2013 Bundestag” without taking a year to run. Early database designs in this area use pre-filtering by timestamp or hashtag to improve responsiveness.
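To give a feel for what pre-filtering means in practice, here is a minimal Python sketch. It is not taken from the report or from any of the archives it describes: the `PostIndex` class, its field names and its in-memory storage are all assumptions made for illustration. The idea is simply to narrow the candidate posts by hashtag and timestamp first, so that the slower text match only runs over a small subset.

```python
from collections import defaultdict
from datetime import datetime


class PostIndex:
    """Toy in-memory store (hypothetical) that pre-filters before text search."""

    def __init__(self):
        self._posts = []                      # every post, in insertion order
        self._by_hashtag = defaultdict(list)  # lower-cased hashtag -> posts

    def add(self, text, timestamp, hashtags):
        post = {
            "text": text,
            "timestamp": timestamp,
            "hashtags": {tag.lower() for tag in hashtags},
        }
        self._posts.append(post)
        for tag in post["hashtags"]:
            self._by_hashtag[tag].append(post)

    def search(self, phrase, start=None, end=None, hashtag=None):
        # Pre-filter 1: restrict to one hashtag bucket if a hashtag is given.
        if hashtag is not None:
            candidates = self._by_hashtag.get(hashtag.lower(), [])
        else:
            candidates = self._posts

        phrase = phrase.lower()
        results = []
        for post in candidates:
            # Pre-filter 2: skip posts outside the half-open range [start, end).
            if start is not None and post["timestamp"] < start:
                continue
            if end is not None and post["timestamp"] >= end:
                continue
            # Only now run the (comparatively expensive) text match.
            if phrase in post["text"].lower():
                results.append(post)
        return results
```

A query for “2013 Bundestag” could then be limited to posts tagged with a given hashtag and dated within 2013 by passing `hashtag=`, `start=` and `end=` to `search`. A production archive would use a real database with proper indexes rather than Python lists, but the ordering of the steps is the point.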
To preserve meaning and context within social media, data need to be prepared for archiving: shortened URLs have to be linked back to their longer versions, and sites mentioned in social media have to be linked to archived versions of those sites. One case study mentioned by the author has successfully automated those two parts of data preparation to reduce costs.
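The report does not describe how that case study implemented its automation, so the following Python sketch is only an illustration of the two preparation steps. The function `prepare_post` and its two callables, `resolve` (short URL to long URL) and `find_archived` (long URL to archived copy), are hypothetical names introduced here; a real pipeline might follow HTTP redirects for the first and query a web archive for the second.

```python
import re

# Rough pattern for URLs embedded in post text (an assumption, not a full URL parser).
URL_PATTERN = re.compile(r"https?://\S+")


def prepare_post(text, resolve, find_archived):
    """Attach expanded and archived links to a post before it is archived.

    resolve(short_url)      -> long URL, or None if unknown (hypothetical callable)
    find_archived(long_url) -> archived URL, or None if none exists (hypothetical callable)
    """
    links = []
    for match in URL_PATTERN.findall(text):
        # Drop trailing punctuation that belongs to the sentence, not the URL.
        short_url = match.rstrip(".,;:!?)")
        # Step 1: link the shortened URL back to its longer version.
        long_url = resolve(short_url) or short_url
        # Step 2: link the target to an archived version, if there is one.
        links.append({
            "short_url": short_url,
            "long_url": long_url,
            "archived_url": find_archived(long_url),
        })
    # The original text is kept untouched; the context is stored alongside it.
    return {"text": text, "links": links}
```

Passing the two lookups in as functions keeps the sketch free of network calls, which also makes it easy to try out with a small dictionary of known short links.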
Data Management Solutions
The case studies referenced within this report show that many tools and technologies have already been developed, or are under development, to help with both archiving and managed access to the huge datasets that data harvesting can create. In a very new and rapidly evolving area of research, it is heartening to read of the progress that many public research organisations have already made, not just in terms of technology, but in the management of data and of access to data.
The report advocates the creation of centralised storage under the auspices of a specialist national agency to deal with issues of quality and long-term access, and calls for greater collaboration between agencies working in this area.