HANZO'S WEB ARCHIVING BLOG


Defining web archive scope

  
  
  
  

Just read a great post by Abbie Grotke of the Web Archiving Team at the Library of Congress on what we call "archive scope", i.e. the parameters and constraints we place on a web crawl to ensure we collect what you want, and nothing more.


Abbie and her team call this the “Intellectual” Components of a Website:

"You might imagine that with the web being in its twenties everyone would know exactly what a website is. But you’d be surprised – those of us in the web archiving business spend quite a bit of time pondering what makes up a particular organization or person’s website." 

Abbie describes LOC's definition of scope and provides a useful and overwhelmingly cute example. Although kittens would be cuter; this is "a true fact"!

Kittens and puppies aside, I couldn't agree more with Abbie's opening paragraph: establishing a shared definition of a website is often challenging. In fact, the term "website" clouds the issue somewhat, so we've coined the term "Archive Unit". An archive unit is a specific scope applied to a website or domain, defined by its seed URLs, that results in a consistent, well-defined, and repeatable capture of that website or domain.

In simple cases a website is equivalent to an archive unit. But that's not always the case. Here's a short list of scoping considerations to illustrate:

  • Seed URLs - you will need at least one starting point; from there the crawler can discover the rest
  • Broad scope definition - a blunt instrument. We define the broad scope of a crawl from a number of presets and fine-tune from there: 'n'-hops, page, domain, etc. These limit crawl discovery and collection to the pages within the preset scope
  • Embeds - another blunt instrument really. We treat embedded media and links to documents and other binaries as part of the page. There is no point capturing a page with an embedded YouTube video if you're not going to include the YouTube video!
  • User-role (we're in fine-tuning territory now) - websites often appear differently depending on user-role, so we like to include this in our definition of scope. In financial services, we frequently archive websites using a number of user-roles. For example: anonymous (not logged in), logged in as an investor, logged in as a broker-dealer, etc. In this case, the single website = three archive units
  • Social context (more fine-tuning and optimising) - many social media sites consist of status updates or short messages, together with links and other attributes that connect the user to others on the social media site, or to external web resources. Archiving just the status updates seems pretty pointless, so we have devised a social context scope. When collecting a status update on Facebook or LinkedIn, a tweet, a Chatter post, etc., we also collect any web pages or media linked or attached to the post, along with any comments, likes, @mentions, #tags, etc. Social context scoping is critically important because it collects the subject alongside the social commentary for a more complete historical record (another post on this soon).
  • POST support (a necessary optimisation for the modern web) - many modern websites, especially those based around Ajax/JavaScript, including social media, wikis, etc., use HTTP POST for navigation and user interaction within a page. We include POST support to ensure optimal discovery of resources and comprehensive collection. We also use POST to archive web-based forms and surveys. For example, to pick financial services again, we collect financial calculators by filling in example data provided by the customer, capturing both the completed form and the results of the form post.
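
To make the idea of an archive unit a little more concrete, here is a minimal Python sketch of how the first four considerations above (seeds, broad scope, embeds and user-role) might be modelled. Everything in it is an assumption made for illustration: the names ArchiveUnit, in_scope and units_per_role, the preset strings and the example.com URLs are hypothetical, and this is not a description of how our crawler is actually implemented. Social context and POST support are left out to keep it short.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ArchiveUnit:
    """Hypothetical model: seed URLs plus the scope applied to them."""
    seeds: tuple                  # at least one seed URL
    broad_scope: str = "domain"   # assumed presets: "page", "hops", "domain"
    max_hops: int = 0             # only consulted by the "hops" preset
    include_embeds: bool = True   # treat embedded media as part of the page
    user_role: str = "anonymous"  # which login (if any) the crawl uses

    def __post_init__(self):
        if not self.seeds:
            raise ValueError("an archive unit needs at least one seed URL")


def host(url):
    """Lower-cased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(unit, url, hops, is_embed=False):
    """Decide whether a discovered URL belongs to the archive unit.

    hops is the number of links followed from a seed to reach the URL.
    is_embed marks a resource embedded in a page that is already in scope.
    """
    if is_embed and unit.include_embeds:
        # Embeds are collected with their page, even when hosted elsewhere.
        return True
    if unit.broad_scope == "page":
        return url in unit.seeds
    if unit.broad_scope == "hops":
        return hops <= unit.max_hops
    if unit.broad_scope == "domain":
        return host(url) in {host(seed) for seed in unit.seeds}
    raise ValueError(f"unknown scope preset: {unit.broad_scope}")


def units_per_role(seeds, roles, **scope):
    """One website captured under several user-roles gives one unit per role."""
    return [ArchiveUnit(seeds=tuple(seeds), user_role=role, **scope)
            for role in roles]


# Usage: a single website archived under three user-roles.
units = units_per_role(["https://example.com/"],
                       ["anonymous", "investor", "broker-dealer"])
print(len(units))                                                # 3
print(in_scope(units[0], "https://example.com/about", hops=1))   # True
print(in_scope(units[0], "https://other.example.org/", hops=1))  # False
```

Keeping the user-role inside the unit, rather than as a crawler setting, is what makes "single website = three archive units" fall out naturally: each role gets its own consistent, repeatable capture.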

We have a few more tricks that can be considered part of crawl scope, but I'll reserve these for a later post.

So, if you haven't already, take a look at Abbie's post.

To discuss your own web archiving needs, or to learn more about archiving social media and business social networks, including the context, contact us and we'll review your needs, show you a demo, and define your scope!


Comments

Absolutely agree that it is important to have a concept of what you call the 'archive unit', which is why I've found the workflow systems that just emphasise 'target' URLs to be somehow deficient. While I think your enterprise focus takes you into complex territory beyond what some of us do for national collecting institutions (e.g. the user-role context), for the Australian PANDORA Archive we have always worked with this conceptual entity when harvesting. I have to admit we do continue to use a somewhat arcane and bibliographically derived terminology when we refer to the 'archive unit' as a 'title', but that designation does encompass a much broader concept than simply a target URL. Our 'titles' are defined by seed URLs, gather filters, harvester settings, as well as an array of metadata including title description, permissions, notes on the harvesting and technical limitations and interventions that have been undertaken to complete and constitute the archived instance.
Posted @ Thursday, October 04, 2012 7:55 PM by Paul Koerbin
Thanks Paul. I agree there is enduring value in comprehensive metadata capture, cataloguing and annotating archives. The basic premise, of course, is that the descriptive information about the archive content will enable future users of our archives to discover and understand them more completely. This is certainly the case in selected collections like ours, less so for large-scale archives like Internet Archive.
Arcane or otherwise, these practices are what make our memory institutions so fascinating and vital to a civilised society, right? Although you wouldn't necessarily believe that when you consider the budgets!
Interestingly, we discovered that a lot of the best practices developed by Internet Archive and our global community of memory institutions for digital preservation of the web are practically aligned with best practices in Information Governance and eDiscovery in the corporate world. Our implementation has benefitted from this alignment considerably. Were you at IIPC in DC earlier this year? Mark Williamson spoke about how we store all our descriptive information in WARC records, alongside captured metadata and computed metadata, all relating to the native format archive content.
Posted @ Friday, October 05, 2012 7:59 AM by Mark Middleton