Last talk of the day from Kevin Ashley (from the Digital Curation Centre) – he says if you don’t know what data citation is now, he hopes he will be able to tell you why you should care about it and why it will be important in the future.
Kevin mentioning the DCC Curation Lifecycle Model – but today’s talk is focussing only on one aspect – Access, Use and Reuse.
So – why should we care about data citation? Kevin giving example of paper on LIDAR and RADAR images of ice clouds – the paper contains only the images, not the data used to create those images. Kevin showing how data can be misrepresented – graphs where one scale doesn’t start at zero can lead to misleading conclusions.
So – data behind graphs can be very important. Kevin says that data used to support statements in publication should be as accessible as the publication – so statements and findings can be examined and challenged.
Kevin showing how you can misrepresent data – e.g. by taking a subset of results (that happen to favour a particular conclusion) – the data published is not always (all of) the data collected. Kevin mentioning a few texts on this – including my favourite, which I was googling as he spoke: ‘How to Lie with Statistics’ by Darrell Huff.
Kevin giving example of studying Biodiversity – requires many different data sources, some of which won’t be published, some of which won’t have been compiled through academic research…
All of these issues mean we really ought to care about data citation.
‘Data is Different’. Traditional bibliographic resources basically come from a ‘print’ paradigm – i.e. ‘published’ – we’ve moved online with many of these resources, but it is still fundamentally the same – you ‘publish’ something and then you cite it.
However, a data set may be added to on a continuing basis – a telescope may be collecting more and more data all the time. What you cite now may be different by tomorrow (Kevin draws parallel to citing web resources like blogs).
So – approaches to dealing with this:
- Giving data digital object identifiers (e.g. DataCite)
- Capturing data subsets at a point of publication
- Freezing those subsets somewhere
- Publication led
These work well in certain areas:
- Dataverse (thedata.org) – submit your data, get a checksum (so you can check if it has changed since publication) and citation and publish
- Ebank/ecrystals – harvest, store, cite
- DataCite – working at national level with libraries and data centers
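To make the checksum idea concrete, here is a minimal sketch of recording a checksum alongside a citation and using it later to check whether the data has changed. It uses a plain SHA-256 hash as a stand-in and is not a description of how Dataverse actually computes its fingerprint; the function names, the citation format and the sample data are all invented for illustration.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the dataset bytes (assumed stand-in checksum)."""
    return hashlib.sha256(data).hexdigest()


def make_citation(author: str, year: int, title: str, data: bytes) -> str:
    # Hypothetical citation format: the checksum travels with the reference.
    return f"{author} ({year}). {title}. sha256:{fingerprint(data)}"


def unchanged_since_citation(citation: str, data: bytes) -> bool:
    # Compare the checksum recorded in the citation with the data as it is now.
    cited_checksum = citation.rsplit("sha256:", 1)[1]
    return cited_checksum == fingerprint(data)


# Made-up dataset contents, frozen at the point of publication.
published = b"site,temp\nA,1.5\nB,2.0\n"
citation = make_citation("Example, A.", 2010, "Ice cloud observations", published)

# Later: the dataset has had a row appended, so the check fails.
updated = published + b"C,3.1\n"
still_same = unchanged_since_citation(citation, published)
changed = not unchanged_since_citation(citation, updated)
```

A reader holding the citation can re-run the check against whatever copy of the data they obtain, which only works if the cited subset has been frozen somewhere.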
However, data changes and can be very, very big – it can be changing by the second, and be petabytes in size. If you take a ‘publication’ approach, it may not be apparent that four different references to subsets of data are actually all part of the same dataset.
One way of dealing with the ‘big data’ issue – rather than making copies, keep change records and create reference mechanisms that allow reference to a specific change point – Kevin mentioning Memento as a possible model for this.
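A rough sketch of the change-record idea, under my own assumptions: the dataset keeps an append-only log of changes, and a reference names a position in that log instead of pointing at a frozen copy. The class name, the `identifier@N` reference format and the sample values are invented for illustration; this is not how Memento itself works.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLoggedDataset:
    """Toy dataset that stores an append-only log of changes, not snapshots."""
    identifier: str
    changes: list = field(default_factory=list)  # list of (key, value) records

    def record(self, key: str, value) -> int:
        # Append a change and return the new change point (1-based count).
        self.changes.append((key, value))
        return len(self.changes)

    def cite(self) -> str:
        # Hypothetical reference format: dataset identifier plus change point.
        return f"{self.identifier}@{len(self.changes)}"

    def resolve(self, reference: str) -> dict:
        # Rebuild the dataset as it stood at the cited change point.
        identifier, point = reference.rsplit("@", 1)
        if identifier != self.identifier:
            raise ValueError(f"reference is for a different dataset: {identifier}")
        state = {}
        for key, value in self.changes[: int(point)]:
            state[key] = value  # later changes overwrite earlier ones
        return state


ds = ChangeLoggedDataset("telescope-obs")
ds.record("obs1", 10)
ref = ds.cite()            # reference taken at the time of publication
ds.record("obs2", 20)      # the dataset keeps growing afterwards
ds.record("obs1", 11)      # and earlier values get corrected

as_cited = ds.resolve(ref)
as_now = ds.resolve(ds.cite())
```

Because every reference carries the same dataset identifier, it stays apparent that different citations point into one dataset, just at different change points.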
Another alternative is using ‘annotation’ rather than citation. When data sources have many (thousands of) contributors, instead of citing data sources in publications, annotate the data sources with publications. Example of ‘Mondrian’ approach, where blocks of colour are assigned based on what types of annotation there are for different parts of the dataset. Turns the data set into something that can be challenged in itself…
Kevin mentioning Buneman’s desiderata (see http://homepages.inf.ed.ac.uk/opb/homepagefiles/harmarnew.pdf)
Kevin concerned that the tools we have now aren’t quite ready for the challenges of data citation.