Ghosts On The Internet

Gavin Bell

By rights the internet should be full of poltergeists, poor rootless things looking for their real homes. Many events on the internet are not properly associated with their correct timeframe. I don’t mean a server set to the wrong time, though that happens too. Much of the content published on the internet is separated from any proper reference to its publication time. What does publication even mean? Let me tell you a story…

“It is 2019 and this is Kathy Clees reporting on the story of the moment, the shock purchase of Microsoft by Apple Inc. A Internet Explorer security scare story from 2008 was responsible, yes from 11 years ago, accidently promoted by an analyst, who neglected to check the date of their sources.”

If you think this is fanciful nonsense, then cast your mind back to September 2008, this story in Wired or The Times (UK) about a huge United Airlines stock tumble. A Florida newspaper had a automated popular story section. A random reader looking at a story about United’s 2002 Bankruptcy proceedings caused this story to get picked up by Google’s later visit to the South Florida Sun Sentinel’s news home page.

The story was undated, Google’s news engine apparently gave it a 2008 date, an analyst picked it up and pushed it to Bloomberg and within minutes the United stock was tumbling. Their stock price dropped from $12 to $3, then recovered to $11 over the day. An eight percent fall in share price over a mis-configured date

Completing this out of order Christmas Carol, lets look at what is current practice and how dates are managed, we might even get to clank some chains. Publication date used to be inseparable from publication, the two things where stamped on the same piece of paper. How can we determine when things have been published, now?

Determining publication dates

Time as defined by http://www.w3.org/TR/NOTE-datetime extends ISO 8601, mandating the use of a year value. This is pretty well defined, we can even get very accurate timings down to milliseconds, Ruby and other languages can even handle Calendar reformation. So accuracy is not the issue.

One problem is that there are many dates which could be interpreted as the publication date. Publication can mean any of date written or created; date placed on server; last modified date; or the current date from the web server. Created and modified have parallels with file systems, but the large number of database driven websites means that this no longer holds much meaning, as there are no longer any files.

Checking web server HEAD may also not correspond, it might give the creation time for the HTML file you are viewing or it might give the last modified time for a file from disk. It is too unreliable and lacking in context to be of real value. So if the web server will not help, then how can we get the right timeframe for our content?

We are left with URLs and the actual page content.

Looking at Flickr, this picture (by Douglas County History Research Center) has four date values which can be associated with it. It was taken around 1900, scanned in 1992 and placed on Flickr on July 29th, 2008 and replaced later that day. Which dates should be represented here?

This is hard question to answer, but currently the date of upload to Flickr is the best represented in terms of the date URL, /photos/douglascountyhistory/archives/date-posted/2008/07/29/, plus some Dublin Core RDF for the year. Flickr uses 2008 as the value for this image. Not accurate, but a reasonable compromise for the millions of other images on their site.

Flickr represents location much better than it represents time. For the most part this is fine, but once you go back in time to the 1800s then the maps of the world start to change a lot and you need to reference both time and place.

The Google timeline search offers another interesting window on the world, showing results organised by decade for any search term. Being able to jump to a specific occurrence of a term makes it easier to get primary results rather than later reporting.

The 1918 “Spanish flu” results jump out in this timeline.

Any major news event will have multiple analysis articles after the event, finding the original reporting of hurricane Katrina is harder now. Many publishers are putting older content online, e.g. Harpers or Nature or The Times, often these use good date based URLs, sometimes they are unhelpful database references. If this content is available for free, then how much better would it be to provide good metadata on date of publication.

Date based URLs

A quick word on date based URLs, they can be brilliant at capturing first published date. However they can be hard to interpret. Is /03/04 a date in March or April, what about 08/03/04? Obviously 2008/03/04 is easier to understand, it is probably March 4th. Including a proper timestamp in the page content avoid this kind of guesswork.

Many sites represent the date as a plain text string; a few hook an HTML class of date around it, a very few provide an actual timestamp. Associating the date with the individual content makes it harder to get the date wrong.

Movable Type and TypePad are a notable exceptions, they will embed Dublin Core RDF to represent each posting e.g. dc:date="2008-12-18T02:57:28-08:00". WordPress doesn’t support date markup out of the box, though there is a patch and a howto for hAtom available.

In terms of newspapers, the BBC use <meta name="OriginalPublicationDate" content="2008/12/18 18:52:05" /> along with opaque URLs such as http://news.bbc.co.uk/1/hi/technology/7787335.stm.

The Guardian use nice clear URLs http://www.guardian.co.uk/business/2008/dec/18/car-industry-recession but have no marked up date on the page.

The New York Times are similar to the Guardian with nice URLs, http://www.nytimes.com/2008/12/19/business/19markets.html, but again no timestamps. All of these papers have all the data available, but it is not marked up in a useful manner.

Syndication formats

Syndication formats are better at supporting dates, RSS uses RFC 822 for dates, just like email so dates such as Wed, 17 Dec 2008 12:52:40 GMT are valid, with all the white space issues that entails.

The Atom syndication format uses the much clearer http://tools.ietf.org/html/rfc3339 with timestamps of the form 1996-12-19T16:39:57-08:00. Both syndication formats encourage the use of last modified. This is understandable, but a pity as published date is a very useful value. The Atom syndication format supports “published” and mandates “updated” as timestamps, see the Atom RFC 4287 for more detail.

Marking up dates

However the aim of this short article is to encourage you to use microformats or RDF to encode dates. A good example of this is Twitter, they use hAtom for each individual entry, http://twitter.com/zzgavin/status/1065835819 contains the following markup, which represents a human and a machine readable version of the time of that tweet.

<span class="published" title="2008-12-18T22:01:27+00:00">about 3 hours ago</span>

The spec for datetime is still draft at the minute and there is still ongoing conversation around the right format and semantics for representing date and time in microformats, see the datetime design pattern for details.

The hAtom example page shows the minimal changes required to implement hAtom on well formed blog post content and for other less well behaved content. You have the information already in your content publication systems, this is not some additional onerous content entry task, simply some template formatting.

I started to see this as a serious issue after reading Stewart Brand’s Clock of the Long Now about five years ago. Brand’s book explores the issues of short term thinking that permeate our society, thinking beyond the end of the financial year is a stretch for many people. The Long Now has a world view of a 10,000 year timeframe, see http://longnow.org/ for much more information. Freebase from Long Now Board member Danny Hillis, supports dates quite well – see the entry for A Christmas Carol.

In conclusion

I feel we should be making it easier for people searching for our content in the future. We’ve moved through tagging content and on to geo-tagging content. Now it is time to get the timestamps right on our content. How do I know when something happened and how can I find other things that happened at the same time is a fair question. This should be something I can satisfy simply and easily. There are a range of tools available to us in either hAtom or RDF to specify time accurately alongside the content, so what is stopping you?

Thinking of the long term it is hard for us to know now what will be of relevance for future generations, so we should aim to raise the floor for publishing tools so that all content has the right timeframe associated with it. We are moving from publishing words and pictures on the internet to being able to associate publication with an individual via XFN and OpenID. We can associate place quite well too, the last piece of useful metadata is timeframe.

Gavin Bell designs web applications and social software for the Nature Publishing Group. Large scale web applications covering identity, on-demand media and social software have been the main focus of his work. Since the early 90s he has worked in academia, advertising, publishing and developed multimedia software.

He is the author of a forthcoming book entitled Building Social Web Applications for O’Reilly Media Inc. He lives in London with his wife and two sons. He keeps track of the world on take one onion, you can keep track of him on twitter and gavinbell.com were he generally avoids the third person.

Photo: James Duncan Davidson