HANZO'S WEB ARCHIVING BLOG

Five Tips For Digital Preservation of Web Archives

There are two broad concepts for preservation of web content:

  1. Preservation in the context of maintaining accessibility of web content in the long term
  2. Preservation in the context of E-Discovery

This post is concerned with the first of these, maintaining accessibility of web content in the long term. I've already written about preservation in the context of E-Discovery.

Maintaining accessibility of web content in the long term

Preservation for web archives is a subset of the activities and ideas defined more generally for digital preservation. Digital preservation is defined in Wikipedia as "the active management of digital information over time to ensure its accessibility".

These activities require constant and ongoing attention to avoid digital obsolescence, such as abandonment of software or encoding technologies, file-format evolution, protocol enhancements, etc. Imagine a web archive of a typical corporate website, captured now, in March 2011. Then imagine accessing the archived content in 10 or 15 years' time. Will the embedded Flash still work? Will the video play? Will the web browser of 2026 tolerate today's poor HTML markup? Will the JavaScript run? If I had to guess, the answer to most of these questions will probably be 'not entirely'.

Web browsing technologies of 2026 are likely to be quite different to those of today due to the rapid and accelerating pace of development. Thanks to the Internet Archive's Web Pioneers Collection we can see evidence of the past evolution of web technologies. Thanks to open standards and the W3C, these websites remain reasonably functional, as their technology is open and still at the core of the web today. However, with the increasing use of proprietary technologies such as Flash, and of dynamic rendering and interactivity using JavaScript, the future cannot be so certain for today's websites.

So what does this mean for web content, particularly native format web archive content, as provided by Hanzo Archives? What can digital preservation do for us?

Hanzo are focussed on two main processes for active preservation of web content:

  1. Migration
  2. Emulation

Migration

Migration is the transfer of data from one system to another, or conversion of one file-format to another so the resource remains fully accessible and functional.

A web archive example could be as follows. Should an image format become obsolete, a conversion process can be developed, in which files of the original format are converted to a new format. The converter should be devised to avoid loss of image fidelity or functionality.

A web archive can take advantage of such a converter in two ways:

  1. In a batch process, convert all instances of the original file format contained in the archive to the new format, updating metadata records to show that the conversion has taken place; or
  2. At access-time, convert files of the original format on-the-fly to the new format.

In either case, the resource remains accessible in the new browser.

Hanzo keep a close watch on a number of projects around the world, particularly amongst the digital library community. Should a migration ever be necessary for a customer, we are able to insert an access-time migration or run a batch process. This would not compromise the integrity or authenticity of the archive content, as the original will never be deleted or changed. We will simply effect the migration using the best-practice techniques we have developed around the WARC standard and with the IIPC community.
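The two migration routes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of Hanzo's implementation: the `Record` and `Archive` classes, the `toy_converter` function and the MIME type names are all hypothetical, and a real converter would do actual format conversion. The point it shows is that both routes produce a derived copy while leaving the original record untouched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Record:
    """One archived resource (hypothetical model); frozen so it cannot be mutated."""
    url: str
    mime_type: str
    payload: bytes
    migrated_from: Optional[str] = None  # MIME type of the source record, if derived


@dataclass
class Archive:
    """Originals are kept forever; migrated copies live alongside them."""
    originals: Dict[str, Record] = field(default_factory=dict)
    derived: Dict[str, Record] = field(default_factory=dict)


Converter = Callable[[bytes], bytes]


def toy_converter(data: bytes) -> bytes:
    """Stand-in for a real format converter (assumption, for illustration only)."""
    return b"NEW:" + data


def migrate_record(record: Record, new_mime: str, convert: Converter) -> Record:
    """Build a derived record, noting in its metadata what it was converted from."""
    return Record(record.url, new_mime, convert(record.payload),
                  migrated_from=record.mime_type)


def batch_migrate(archive: Archive, old_mime: str, new_mime: str,
                  convert: Converter) -> int:
    """Route 1: convert every record of the old format up front."""
    count = 0
    for url, record in archive.originals.items():
        if record.mime_type == old_mime:
            archive.derived[url] = migrate_record(record, new_mime, convert)
            count += 1
    return count


def access(archive: Archive, url: str, old_mime: str, new_mime: str,
           convert: Converter) -> Record:
    """Route 2: serve a stored migration if one exists, else convert on the fly."""
    if url in archive.derived:
        return archive.derived[url]
    record = archive.originals[url]
    if record.mime_type == old_mime:
        return migrate_record(record, new_mime, convert)
    return record
```

In this sketch nothing ever writes to `originals` after capture, so a migration can be redone later with a better converter without any loss.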

Emulation

Emulation is the replication of a system's functionality. Emulators are very popular in other contexts, such as gaming, where one can find many emulators of obsolete systems such as DOS, Atari and the Commodore 64, which can be used to play old games on new machines and operating systems.

Hanzo and the preservation community at large believe there is considerable promise in emulation as a preservation strategy for complex media, as it is relatively easy for us to implement compared to migration. This is especially true for proprietary and "dark" file formats and code, where maintaining a licensed copy of the original target environment, and an emulator or virtual machine to run it, is relatively straightforward to accomplish.

A web archive example could be as follows. Create a software emulator for a standard PC that can run Windows today. Provided the emulator is well constructed as portable code, it should be possible to keep it running for years to come. It will then be possible in the future to run today's operating systems and browsers within the emulator to access a web archive and see how that archive content would have looked.

Hanzo's Preservation Plan

Preservation is a long-term endeavour, aiming to make archived content accessible in the long term. However, today, we don't need to consider the long term ourselves. We only need to consider preservation for the foreseeable future; provided we can preserve our web archives for this generation, using strategies such as those discussed here, we will be able to hand the problem to the next generation to solve for their foreseeable future. In this way, today's web archives will be preserved for the long term.

Hanzo captures and preserves web content for many commercial and government institutions around the world. Our preservation plan is to ensure our customers have the means to access their archived content at any time through pragmatic initiatives such as:

  1. Keep track of developments in web technologies and file-formats
  2. Ensure we keep sufficient metadata and indexes to be able to identify "at risk" file formats contained in the archive and actively manage them
  3. Ensure we keep virtual machines with images that are representative of key releases of platform software, including operating systems, web browsers, plugins and so on
  4. Work with the web archiving community as a whole, through organisations like the International Internet Preservation Consortium, of which we are a member, to ensure full collaboration with the major projects and initiatives around the world
  5. Continue to work on open standards-based archive technologies, to ensure our customers receive the full benefit of preservation initiatives worldwide

Through these initiatives, Hanzo will ensure that in the event of an evolutionary step in technology that adversely affects our customers' archived web content, we'll have the means, knowledge and tools to keep it accessible.
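Identifying "at risk" file formats from an index, as in point 2 of the plan, might look something like the following Python sketch. It assumes a simple index of (URL, MIME type) pairs and a hand-maintained watch list; both are hypothetical, as are the function names. The two MIME types shown are the usual ones for Flash movies and Flash video, chosen only because Flash is the example used earlier in this post.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical watch list; in practice it would be updated as formats fall out of use.
AT_RISK_MIME_TYPES: FrozenSet[str] = frozenset({
    "application/x-shockwave-flash",
    "video/x-flv",
})


def normalise_mime(value: str) -> str:
    """Drop parameters such as '; charset=UTF-8' and lower-case the type."""
    return value.split(";", 1)[0].strip().lower()


def find_at_risk(index: Iterable[Tuple[str, str]],
                 at_risk: FrozenSet[str] = AT_RISK_MIME_TYPES) -> Dict[str, List[str]]:
    """Group the URLs of archived resources by at-risk MIME type."""
    report: Dict[str, List[str]] = {}
    for url, mime in index:
        key = normalise_mime(mime)
        if key in at_risk:
            report.setdefault(key, []).append(url)
    return report
```

A report like this only works if the MIME type was recorded at capture time, which is why the metadata matters as much as the content.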

Five Tips For Digital Preservation of Web Archives

As a web content owner within a commercial or government organisation you have some pretty major considerations and responsibilities concerning digital preservation. You need to make sure that whichever archive technology you use, you avoid lock-in to proprietary formats, avoid systems that do not adhere to standards, and ensure best practices for web archive content are always followed. So check your options! Here are five tips for digital preservation of web archives.

  1. Your web archive system should store content, without modification, in native format
  2. Make sure it collects metadata about the content and files captured from the web, as well as the content itself
  3. Make sure your archive is based on client-side web archiving technology, so that it is independent of publishing platform
  4. Ensure your archive uses ISO 28500 WARC files to store your content, to avoid proprietary lock-in
  5. Do all of the above to ensure you benefit from the global community of archivists and preservation specialists building on the same best practices and foundations

You should ensure your web archive satisfies all five of these points; together they give the best basis for digital preservation of your web content over the long term.
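To make tips 1, 2 and 4 a little more concrete, here is a minimal Python sketch of what a single WARC 1.0 record looks like on disk: a version line, a block of named metadata fields, a blank line, the payload bytes exactly as captured, and two trailing line breaks. The function name is hypothetical, and the sketch writes a simple `resource` record rather than the full HTTP `response` records a crawler would normally store; a production archive would use an established WARC library rather than hand-built records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, content_type: str,
                               payload: bytes) -> bytes:
    """Serialise one WARC 1.0 'resource' record; the payload is stored unmodified."""
    capture_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {capture_time}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates headers from the payload.
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs.
    return header + payload + b"\r\n\r\n"
```

Because the headers are plain text and the payload is stored byte for byte, a record like this can be read with very simple tools, which is much of the appeal of an open format.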

We describe how Hanzo Enterprise Web Archiving meets all of the above criteria for digital preservation in our white paper.

A Short History of Web Archiving

I've been meaning to write a short history of web archiving for ages but never got around to it. Fortunately, Ariel Bleicher did get around to it, with this great article in IEEE Spectrum:

A Memory of Webs Past.

Definitely worth a read. Now a little into the future...

I'm presenting "Commercial Web Archives" at the next IIPC meeting in The Hague on 9 May 2011, with a demonstration of what may be the next stage in the development of web archiving, including demos of archived social media, form-based content (I don't mean simply logging in, I mean multi-page, multi-variable, complex forms), interactive Ajax pages, streaming media, etc. It's open to the public, so come along.

Advertising Standards Authority Now Covers Websites and Social Media

From 1st March 2011 the Advertising Standards Authority (ASA) of the UK extends its online remit "to cover companies’ own marketing claims on their own websites and in other non-paid for space they control", which means "marketing communications on companies own websites and in other non-paid space they control, like Facebook and Twitter".

This is an extension of ASA's existing remit for "internet ads in paid-for space, like banner ads, pop-ups and paid search results", resulting in much-needed broad coverage:

The UK Code of Non-broadcast Advertising, which includes rules to make sure advertisements do not mislead, harm or offend, will be applied to all UK based company websites regardless of the sector or size of business or organisation.

Consumer protection is the key driver of this new initiative:

Since 2008, we have received over 4,500 complaints that we couldn’t deal with, but now anyone who has a concern about an online marketing communication will be able to turn to us. Not only is this good news for consumers, it is also good for business – marketing communications that are trusted are more likely to work and deliver value.

This story coincides with a recent television show, in which I was interviewed as an 'expert' in web archiving. I explained how an archive of a product vendor's website can be used as evidence of their 'product representations' at a given time. The story is under non-disclosure at the time of writing, so I'll write more after the show is aired later this year.

More information on ASA's new online remit is available on the Advertising Standards Authority website.

If you are a company using websites and social media for marketing communications and advertising to consumers, and you are interested in learning more about web archiving, please sign up for our webinar, or contact us for a one-to-one demo.
