Any discussion about citing or referencing web-based resources seems to inevitably turn to the issue that web pages, and other resources, are subject to change. This means that if I cite a web page today, there is no guarantee that when you look at it tomorrow the content will be the same. This is in contrast with our expectations of the physical world, where if I cite a book, you’d expect that book to be the same (bar physical damage) as when I cited it.
Because of this it is standard practice to include in a web page reference a date indicating when you accessed the page. For example:
(2009) Citation – Wikipedia, the free encyclopedia, Available from: http://en.wikipedia.org/wiki/Referencing (Accessed 24th November 2009).
This clearly says when I accessed the page. However, if you followed the URL I’ve included in the reference, you’d get the page as it looks today, not the page as I saw it on the 24th November 2009. The content could be the same, but especially with fluid sites such as Wikipedia it is quite likely to have changed in some respect. In areas where new information becomes available, or where information is disputed, the page could change quite radically.
A couple of weeks ago Herbert Van de Sompel (@hvdsomp), Rob Sanderson (@azaroth42) and others published a paper on ‘Memento’ – a proposal to enable archived versions of web pages to be served instead of the current one. The detailed paper is available at http://arxiv.org/abs/0911.1112, and some more information is available at the Memento website. You can also see a webinar Herbert did at OCLC recently.
Memento uses what is known as ‘content negotiation’. This is the ability in http for a client to request a specific representation of a document from a URL. Typically this is used to ask for a specific format for images (e.g. ‘give me the GIF version rather than the PNG version of this graphic’), or more recently, in the context of ‘linked data’, to request a web resource in an RDF representation (as opposed to the usual HTML) – there is some more on using content negotiation for linked data in this tutorial on publishing linked data. Content negotiation can cover a few different attributes of the content – as well as requesting a specific file/data format (‘give me the GIF version of an image’) it can also support language (‘give me the French version of this page’). It is important to note that requesting the French version of a web page doesn’t magically create a French version if one doesn’t already exist!
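To sketch what content negotiation looks like from the client side, here is a minimal example using Python’s standard library (http://example.org/resource is just a placeholder URL – any server that supports negotiation behaves this way):

```python
import urllib.request

# Content negotiation: the same URL can serve different representations,
# chosen by headers the client sends with the request.
url = "http://example.org/resource"

# Ask for the usual HTML representation
html_req = urllib.request.Request(url, headers={"Accept": "text/html"})

# Ask for an RDF representation of the same resource instead
rdf_req = urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})

# Language is negotiated the same way, via the Accept-Language header
fr_req = urllib.request.Request(url, headers={"Accept-Language": "fr"})

print(rdf_req.get_header("Accept"))  # application/rdf+xml
```

The key point is that all three requests go to the same URL – it is the headers, not the address, that say which representation the client would like back.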
What Memento introduces is a time element to content negotiation, so you can say ‘give me the version of this page that was current on the 24th November 2009’. This isn’t part of http at the moment, so most services won’t know how to respond to such a request, but clearly the hope of the Memento team is that some sites will implement it, and also that it might eventually be adopted as part of http.
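As a sketch of what the time element might look like on the wire – the ‘Accept-Datetime’ header name follows the Memento proposal, and since it isn’t part of http yet, a server that doesn’t implement Memento will simply ignore it:

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request

# The date we want: 24th November 2009, expressed as an HTTP date
when = datetime(2009, 11, 24, tzinfo=timezone.utc)
http_date = format_datetime(when, usegmt=True)
print(http_date)  # Tue, 24 Nov 2009 00:00:00 GMT

# Attach it to the request as the proposed Accept-Datetime header
req = urllib.request.Request(
    "http://en.wikipedia.org/wiki/Referencing",
    headers={"Accept-Datetime": http_date},
)
```

A Memento-aware server would respond to this by pointing the client at the archived version nearest that date; everyone else just serves the current page.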
Clearly some services are very well placed to respond to requests for historical versions of web pages. Wikipedia is a good example, as you can revert to any previous version of an article (although note that this will show the historical content within the current template – so not necessarily exactly the same as the original page, if that’s what you need – but I guess it is usually the content that is of interest – see also the ArchivePress project). The other obvious example is the Internet Archive Wayback Machine – although its ability to give you a particular version of a page is limited, as it doesn’t take a copy of the page each time it changes.
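Even without Memento, the Wayback Machine already exposes dated snapshots through its URL scheme – you embed a timestamp in the URL and the archive redirects you to whichever capture is nearest that date. A minimal sketch:

```python
def wayback_url(url, timestamp):
    """Build a Wayback Machine URL for the snapshot nearest to a
    YYYYMMDD (or YYYYMMDDhhmmss) timestamp."""
    return "http://web.archive.org/web/%s/%s" % (timestamp, url)

print(wayback_url("http://en.wikipedia.org/wiki/Referencing", "20091124"))
# http://web.archive.org/web/20091124/http://en.wikipedia.org/wiki/Referencing
```

Of course, ‘nearest capture’ is exactly the limitation mentioned above – if the archive didn’t crawl the page close to your date of interest, the snapshot you get back may not match what you actually saw.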
As well as services that support delivering historical versions of pages, the other part of the puzzle is clients that are able to add the time element to the http header, and therefore request the historical version of the web page. I haven’t had a look at this yet, but the Memento project has developed a Firefox plugin to enable this.
If you’ve read some of the other entries on this blog, you may remember the slightly unusual approach TELSTAR is taking in providing links to web resources from references – we push the reference metadata, including the URL of the web resource, to an OpenURL resolver. We do this specifically to allow the library to deal with web resources which move URLs. However, one of the points that was raised when we first looked at this is that given the URL, and the ‘date accessed’ from the reference, we could potentially redirect the user to a historical version of the web page – there was some discussion on the code4lib email list, and again Wikipedia and the Internet Archive were obvious examples. However, with the proposal of Memento, this would become even simpler. When the URL and date are received by the OpenURL resolver, it could request the page ‘behind the scenes’ with the time data in the http header. The response would tell the resolver whether the web server can deliver historical versions, and the user could then either be automatically redirected to the historical version of the page or offered a choice of different historical versions (in the resolver menu).
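To sketch how the resolver side might work – everything here is hypothetical, the function names and the fallback logic are just one way this kind of redirection could be wired up, assuming an ‘Accept-Datetime’ style header as Memento proposes:

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.error
import urllib.request

def build_dated_request(target_url, accessed):
    """The 'behind the scenes' request: the cited URL plus the
    reference's accessed date as a (proposed) Accept-Datetime header."""
    http_date = format_datetime(accessed, usegmt=True)
    return urllib.request.Request(target_url,
                                  headers={"Accept-Datetime": http_date})

def resolve(target_url, accessed):
    """Hypothetical resolver logic: try the dated request; if the server
    doesn't understand it (or errors), fall back to the current page."""
    req = build_dated_request(target_url, accessed)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.geturl()   # may be a redirect to an archived version
    except urllib.error.URLError:
        return target_url          # fall back to the live URL

# The resolver would pull both values straight from the reference metadata
req = build_dated_request("http://en.wikipedia.org/wiki/Referencing",
                          datetime(2009, 11, 24, tzinfo=timezone.utc))
```

The nice property here is graceful degradation: a server that knows nothing about Memento just serves the current page, which is exactly what the resolver does today.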
Although this probably wouldn’t work in quite the same way as a dedicated client that supported what Memento refers to as ‘time travel mode’, it would be a way of enabling access to a Memento page without requiring the user to install a client etc. As OpenURL is already widely supported for journal and sometimes book references, adding support for web references should be relatively trivial, I would think. One suggestion is that at the upcoming Dev8D event, Memento could be something to look at (http://twitter.com/andymcg/status/5886827230) – so if this sounds interesting, leave a comment…