Defining web archive scope
Just read a great post by Abbie Grotke of the Web Archiving Team at the Library of Congress on what we call "archive scope", i.e. the parameters and constraints we place on a web crawl to ensure we collect what you want, and nothing more.
Abbie and her team call this the “Intellectual” Components of a Website:
"You might imagine that with the web being in its twenties everyone would know exactly what a website is. But you’d be surprised – those of us in the web archiving business spend quite a bit of time pondering what makes up a particular organization or person’s website."
Abbie describes LOC's definition of scope and provides a useful and overwhelmingly cute example (although kittens would be cuter; this is "a true fact"!).
Kittens and puppies aside, I couldn't agree more with Abbie's opening paragraph: establishing a shared definition of a website is often challenging. In fact, the term "website" clouds the issue somewhat, so we've coined the term "Archive Unit". An archive unit is a specific scope applied to a website or domain, defined by its seed URLs, that results in a consistent, well-defined, and repeatable capture of that website or domain.
In simple cases a website is equivalent to an archive unit. But that's not always the case. Here's a short list of scoping considerations to illustrate:
- Seed URLs - you will need at least one starting point; from there the crawler can discover the rest
- Broad scope definition - a blunt instrument: we define the broad scope of a crawl from a number of presets, and fine-tune from there: 'n'-hops, page, domain, etc. These limit the crawl discovery and collection to those pages within the preset scope
- Embeds - another blunt instrument really: we consider embedded media and links to documents and other binaries to be part of a page. There is no point capturing a page with an embedded YouTube video if you're not going to include the YouTube video!
- User-role (we're in fine-tuning territory now) - websites often appear differently by user-role, so we like to include this in our definition of scope. In financial services, we frequently archive websites using a number of user-roles. For example: anonymous (not logged in), logged in as an investor, logged in as a broker-dealer, etc. In this case, the single website = three archive units
- Social context (more fine-tuning and optimising) - many social media sites consist of status updates or short messages, together with links and other attributes that connect the user to others on the social media site, or to external web resources. Archiving just the status updates seems pretty pointless, so we have devised a social context scope: when collecting a status update on Facebook or LinkedIn, a tweet, or a Chatter post, etc., we also collect any web pages or media linked or attached to the post, along with any comments or likes, @mentions, #tags, etc. Social context scoping is critically important because it collects the subject alongside the social commentary for a more complete historical record (another post on this soon).
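To make the list above a little more concrete, here is a minimal sketch of how an archive unit might be modelled in Python. This is an illustration only, not our crawler's actual configuration format: the `ArchiveUnit` class, its field names, the `in_scope` method, and the example.com URLs are all hypothetical. The `social_context` field is carried purely as a setting here; the sketch does not implement social context collection.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ArchiveUnit:
    """Hypothetical model of an archive unit: a website plus one specific scope."""
    seed_urls: tuple              # at least one starting point for the crawler
    broad_scope: str = "domain"   # preset: "page", "domain", or "hops"
    max_hops: int = 0             # only used when broad_scope == "hops"
    include_embeds: bool = True   # treat embedded media and documents as part of a page
    user_role: str = "anonymous"  # e.g. "investor", "broker-dealer"
    social_context: bool = False  # setting only; not acted on in this sketch

    def __post_init__(self):
        if not self.seed_urls:
            raise ValueError("an archive unit needs at least one seed URL")

    def in_scope(self, url, hops=0, is_embed=False):
        """Decide whether a discovered URL belongs to this archive unit.

        hops is the number of links followed from a seed to reach url.
        """
        if is_embed and self.include_embeds:
            return True  # embeds count as part of the page that contains them
        if self.broad_scope == "page":
            return url in self.seed_urls
        if self.broad_scope == "hops":
            return hops <= self.max_hops
        if self.broad_scope == "domain":
            seed_hosts = {urlparse(seed).hostname for seed in self.seed_urls}
            return urlparse(url).hostname in seed_hosts
        raise ValueError(f"unknown broad scope: {self.broad_scope}")


# One website captured under three user-roles gives three archive units.
roles = ["anonymous", "investor", "broker-dealer"]
units = [
    ArchiveUnit(seed_urls=("https://www.example.com/",), user_role=role)
    for role in roles
]
```

Keeping the user-role inside the unit, rather than on the website, is what lets a single site map to several archive units, each with its own consistent and repeatable capture.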
We have a few more tricks that can be considered part of crawl scope, but I'll reserve these for a later post.
So, if you haven't already, take a look at Abbie's post.
To discuss your own web archiving needs, or to learn more about archiving social media and business social networks, including the context, contact us and we'll review your needs, show you a demo, and define your scope!
In the meantime, take a look at these white papers: