HANZO'S WEB ARCHIVING BLOG

Senior DevOps / Crawl Engineer

  
  
  
  

Engineers at work!

Hanzo Archives is a cutting-edge web archiving company. Global corporations use our products and services to capture, archive, preserve, and make discoverable web-based electronically stored information (ESI) in native format. Their needs are primarily driven by eDiscovery, information governance and heritage requirements. Our operations are based in Europe and USA.

Hanzo has implemented the entire technology stack required to capture and archive the modern web with a sophisticated crawler at its core. This job is at the heart of crawler operations: to configure and manage crawls, process archived data, and interact with customers. We call this “Crawl Engineering”.

Also known as:

  • DevOps with Front End Debugging
  • Web Archive Operations Engineer

We are looking for bright, enthusiastic, self-motivated, self-learning Senior Crawl Engineers. Candidates must have strong diagnostic skills, be able to hold their own in a busy and challenging environment, and thrive on learning and optimising operations and systems. Candidates must demonstrate experience in Python and JavaScript, a comprehensive knowledge of the workings of the web, solid Unix / Linux skills, and scripting with command line tools like find, grep and awk.

  • Salary: Negotiable base salary plus participation in share options scheme
  • Location: Home-based or office based in Edinburgh, UK or East Coast, USA
  • To find out more or apply for this job, please email an intro plus your CV to Shuba Rao at shuba.rao@hanzoarchives.com.

Agencies, please note: applications forwarded through agents will not be accepted unless you have a prior arrangement with us. We will not make such an arrangement if you contact us!

Job Description

About the Company

Hanzo Archives is a cutting-edge web archiving company. Global corporations use our products and services to capture, archive, preserve, and make discoverable web-based electronically stored information (ESI) in native format. Their needs are primarily driven by eDiscovery, information governance and heritage requirements. Our customers are some of the largest and most successful corporations in their industries. We currently operate in Europe and the USA.

Hanzo has implemented the entire technology stack required to archive the modern web and at the core is a sophisticated crawler. This job is at the heart of crawler operations: to configure and manage crawls, process archived data, and interact with customers. We call this “Crawl Engineering”.

Job Summary

Reporting to the Head of Development, the Senior Crawl Engineer will primarily drive the technical aspects of the archiving operations for customers and ensure that we continue to deliver innovative and high-quality services. This will include writing software and tools to help with these tasks; running large, distributed, long-running jobs; instrumentation and metrics gathering; processing large volumes of data; and managing virtual infrastructure (machines, storage).

Roles and Responsibilities:

  • Run crawler operations, including configuring crawls, making probers, diagnosing and resolving issues
  • Work within our process, which includes monitoring SLAs and updating our issue tracking system
  • Translate feedback from customers and operations into software development to enhance our product and service offerings
  • Maintain and enhance existing software (both internal products and our open source projects)
  • Communicate systematically and at the right time
  • Work proactively, enthusiastically seeking problems in the software and systems and finding solutions
  • Be responsible for completing time-critical day-to-day tasks
  • Solve problems independently and as a team

Skills and Abilities Required for the Role:

  • Diagnose technical problems effectively
  • Work in a startup environment and work on any, sometimes disparate, tasks that need to be completed in a timely manner
  • Document software rigorously
  • Work with and without supervision
  • Solve problems and think laterally, both individually and as part of a team
  • Communicate, and offer or ask for advice when needed
  • Actively seek problems and find solutions
  • Work remotely and with geographically dispersed teams

Person Specification:

Below are essential demonstrable personal attributes for all candidates:

  • Willing to firefight
  • Write quality code
  • Understand and work with other people’s code
  • Solve technical and operational problems
  • Python and JavaScript
  • Regular Expressions
  • Unix / Linux, including scripting with tools like grep, find and awk
  • In-depth understanding of HTTP and the web
  • Write clearly
  • Responsible and self-motivated
  • Eager to learn, teach, and solve problems

Intranet Archiving at LegalTech NY Day 2

  
  
  
  

Day Two at New York Legal Tech.

Over the years, there’s been snow, rain, bitter cold – but this might be the warmest day for Legal Tech in a long, long time.  Maybe ever.  It might have been warmer outside than inside.

Speaking of inside and outside: today at Legal Tech brought to light that the “web” content many companies are most concerned with isn’t outside the company on social media or public websites; it’s INSIDE the company on intranets, wikis, collaboration sites, chat applications and other web-based tools.

Here are a couple of examples from conversations today (CliffsNotes version: Hanzo can help collect internally hosted web content):

Let’s say a company uses an HTTP-based, internally hosted ticketing system to track customer issues, escalation processes and resolution. The company’s response (or maybe lack of response) to the issue, who knew about it, when they knew it, etc., becomes a piece of information relevant in litigation.  There’s no record in email because there was no email – all of the communication and documentation about the issue exists in the ticketing system.  How does the company get at that information?

What about a company that publishes policy (HR, FRCP, security, compliance, etc.) on an intranet site, but the company doesn’t know for sure what policies are published where and in which version. How does the company ensure that it avoids future issues?

These kinds of scenarios (there are many more) made for good conversations on how Hanzo can help companies tackle complex information governance, compliance and e-discovery issues involving web content. 

Talking web content collection and preservation at LegalTech NY

  
  
  
  

It's LegalTech NY Time Again! 

Great to see many friends, colleagues and clients. There’s been exciting buzz around Hanzo Archives and our approach to web content collection and preservation. “We’ve heard great things about you,” and, “we need to start using Hanzo,” are two of the most common (and most gratifying!) comments we’re getting.

One of the things causing the most interest in Hanzo is our "On Demand" web archive service.  We’ve had great meetings today with law firms and service providers who see it as an easy, cost-effective way to perform defensible, standards-driven collections of web content.  With just a few clicks, the collection starts and automatically returns the collected content in true web-native format.

Law firms like it because it is quick, easy to use and easy to get started.  But what might matter more than all of those is that Hanzo’s collections are forensically sound and in compliance with the ISO 28500 standard for web content collection and preservation. Quick, easy and low-cost solutions are great, but defensibility is best.

There are several hot topics at this year’s NY Legal Tech, and web and social media content are certainly near the top of the list.  It’s probably too late to say it’s an emerging area – it’s already here.  

Picture of Flatiron Building thanks to thenails1 and New York Pictures

Social Media Archiving: Ignorance is Futile

  
  
  
  

Miss Jean Lilburne with her class in the Grade One room at Drouin State School, Drouin, Victoria Circa 1916

Though ignorance of the law is widely known to be indefensible in court, there are those who still try to make it work.

Such is the case when social media comes into play as evidence during a trial. There's no denying what you post often spreads like wildfire.

This article, published in The Telegraph, gives a clear example. It reports the following, "District Judge Andrew Shaw ordered nine people to pay footballer Ched Evans's rape victim £624 after they admitted disclosing her identity on Twitter and Facebook." The article further states that the defendants' claims of ignorance of their crime didn't matter. Hands down, they were guilty.

Now, some may contest that the defendants could have gone back and deleted their posts at the first sign of trouble, but as this Hanzo blog post points out, that wouldn't work either. If the prosecution used social media archiving to capture and preserve the public-facing content the defendants posted, they'd meet the same fate - or worse. Hanzo's native format social media archiving technology captures the seen and unseen across many social media platforms.

The Telegraph's article goes on to state the obvious, which social media users cannot continue to ignore: in this digital age, everyone is an author and, as such, is responsible for what they publish and liable under the associated laws.

On the web, secret conversations don't exist. You can find out massive amounts of information on an individual or company's opinions, personality, political and religious preferences - just about anything, really.

I wonder what it is about social media that seems to wipe away all traces of our collective common sense. If you don't want it discovered, don't put it out there. Andy Warhol contended in 1968 that everyone in the future will have 15 minutes of fame; that fame can now span many lifetimes (especially in the case of social media archiving). This can be good or detrimental or both.

The wisest thing to do is educate yourself on what technologies could hold you or your business accountable for the content you publish. To get started, download our white paper and consider contacting Hanzo for a one-on-one demo. Remember, in the digital age, ignorance is futile.

Why Social Media Archiving Reduces Regulatory Risk

  
  
  
  

View from pulley-wheels of north side creeper-crane (jibbed right out) looking into box section of south side arch, Sydney Harbour Bridge, May 1930 / Ted Hood (hanging upside down 130 metres - 420 feet - above the harbour.)

As brick-and-mortar companies come around to the idea of social media platforms being powerful business tools, they also struggle with regulatory risks.

In his article, "Social Media Carries Regulatory Risks," author Taylor Provost outlines how these risks impact the compliance of regulated organizations, and how they're adapting existing communication policies to include social media. The article also points out that, before social media, there wasn't a far-reaching way for employees to make public statements about companies. Now, it can be done in seconds.

This is one of several reasons we focus on assisting companies with information management. Our blog contains this post and this one about the necessity of including web and social media archiving in all information governance policies for compliance.

There's no question web and social media archiving serves information governance in many ways. One of the main benefits is that, when captured and preserved properly in native format, all of the media and data contained within each archive feature proof of authenticity. Another benefit is the easy retrieval and "playback" of the web and social media content as it first appeared online, which in turn creates clear audit trails.

Though Provost's article focuses more on how to reduce or eliminate risks associated with corporate social media use, including web and social media archiving as part of information governance now is a conservative and practical choice.

Email use is fading as new communication platforms arise. This means current methods of preserving ESI for compliance audits and eDiscovery are destined to become obsolete.

Read more on web archiving and information governance by downloading our white paper.

For a one-on-one demo, contact Hanzo.

 

Web Archiving: Intelligence In Information Governance

  
  
  
  

Boys of the Newtown Scout troop collecting waste paper Circa 1938

If you work in information governance, the speed with which change happens on the Web is, no doubt, a huge concern. An important component of that is defensible deletion, which could mean the difference between favorable and unfavorable court rulings.

As regulations on eDiscovery of web and social media content evolve, so do the requirements for data preservation and storage. In Philip Favro's article, "Defensible Deletion: The Cornerstone of Intelligent Information Governance", he discusses organizational failures in data stockpile management. Favro calls out how companies increasingly struggle to keep the costs of electronically stored information (ESI) low and litigation risk at bay. His suggested solution includes web archiving.

In alignment with Favro's article, this blog post from Hanzo highlights the need for corporations to re-visit their auto-delete protocols, and consider new means of storing web content and related data in native format, as well as placing litigation holds on digital content.

What does this mean for your business? Web and social media archiving supports comprehensive information governance. When capture and preservation are conducted using native format archiving technology, many information governance processes become easier to manage. The native format web and social media archiving technology Hanzo uses makes your data searchable and immune to browser and software obsolescence, and creates authentic audit trails.

Intrigued and want to learn more? Download our Information Governance white paper.

Ready to see a Hanzo web archive in action? Contact Hanzo for a one-on-one demo.

Why Coca-Cola Uses Hanzo for Web Archiving

  
  
  
  

Coca Cola Website 1994

It's exciting to see Hanzo customers' assessment of our web archiving services. After all, they're the reason we do what we do: capture and preserve the brands they've worked so hard to build, among other reasons.

We're especially honored by our association with Coca-Cola's web archivist, Ted Ryan. Hanzo enjoys the opportunity to archive Coca-Cola's web and social media content for many reasons, the most important being that it's an iconic brand with a rich history on- and offline. It goes without saying its online presence should be preserved for the enjoyment of future generations.

Coca-Cola's journey into web archiving is the perfect illustration of why any corporation would want to preserve its online presence. To tell their web archiving story, Coca-Cola has produced the video, "1's and 0's: The History of the Coca-Cola Website." We invite you to view it to understand the company's thought process surrounding the need for web archiving and why they do it.

Hanzo has also published a new white paper on the preservation of web and social media for saving corporate heritage. Download it now to learn why this type of web and social media archiving is essential and how it works.

Web Archiving: Catching History Before It's Gone

  
  
  
  

1785 illustration of a Dodo bird

I recall that in the early days at Hanzo Archives, we frequently discussed how much information is disappearing from the web. The numbers have changed over the years, but we certainly know that link-rot (loss of web pages resulting in broken links) and the ephemeral nature of social media are major problems for information persistence. This is one of the fundamental reasons we developed our web and social media archiving business.

Here's a more recent reminder of the nature of the web: important historical events, which were recorded on Twitter, blogs, and other social media platforms, are now lost.

Reading about historical events as they evolve in real time on social media is a very new experience. Unlike old media, with social media we're able to express our views, share our feelings, and rally support or show concern as events unfold. Think: Arab Spring, Pussy Riot, US Elections, etc. It's a very human experience. But what happens afterwards, as these events slip from our timelines? One study suggests around 30% of the resources shared via social media are lost. Remember how important the events I mentioned were at the time? Their part in important historical records is now lost.

This is discussed at length in the article "Losing My Revolution" by SalahEldeen and Nelson. See also this post for a summary.

The tragedy here is not that we'll lose all account of events like Egypt's recent uprising or the Occupy Wall Street movement; it's that without the original resources referred to in the social media conversations, the context of those conversations is lost. As a historical record, the input of thousands or millions of people across the world is significantly degraded. These authentic reactions from people who are there when events happen, and those responding from afar, enhance the human dimension of each moment in time. This is what makes the web as a historical record so dynamic, personal, and visceral. But we're losing it!

Hanzo's web archiving capabilities are built to collect and preserve the full context of such events. Not just the tweets and status updates but also the links, and the content they all refer to, are collected. If I tweet: "here's a video of this... <link to video>" then we collect the tweet and the video, archive and preserve them, and enable the full contextual experience to be explored in the future.
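
To make the idea of capturing context concrete, here is a minimal Python sketch of the first step: taking a post, finding the links in it, and listing everything that would need to be captured alongside it. This is purely illustrative and is not Hanzo's crawler. The names `CaptureJob` and `plan_capture`, the example URLs, and the simple URL pattern are all assumptions made for this example, and a real system would also need to fetch, render and store each target.

```python
import re
from dataclasses import dataclass, field

# Deliberately simple URL pattern (an assumption for this sketch);
# real-world link extraction needs to handle many more cases.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


@dataclass
class CaptureJob:
    """Hypothetical record of what to capture for a single post."""
    post_url: str
    linked_urls: list = field(default_factory=list)

    def targets(self):
        # The post itself comes first, then every resource it refers to.
        return [self.post_url] + self.linked_urls


def plan_capture(post_url, post_text):
    """List the post plus each distinct link found in its text."""
    seen = set()
    links = []
    for match in URL_PATTERN.findall(post_text):
        # Drop trailing punctuation that belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)")
        if url and url != post_url and url not in seen:
            seen.add(url)
            links.append(url)
    return CaptureJob(post_url=post_url, linked_urls=links)
```

Calling `plan_capture` with a post's address and its text returns a job whose `targets()` list holds the post followed by each linked resource once, which is the "tweet plus video" pairing described above.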

We provide this full contextualised archiving capability to a variety of customers for a number of reasons: for financial services companies, government agencies, researchers, brand owners and individuals; for business intelligence, information governance, corporate heritage, and regulatory compliance.

With the explosive use of social media, posts and, most importantly, the resources on the web they refer to (documents, attachments, media) are now business records and need to be archived and preserved in accordance with your information governance policies. Make sure you are doing it right: capture the context, not just the tweets and statuses.

To learn more about how Hanzo Archives captures and preserves social media, including Twitter, Facebook, LinkedIn, YouTube, and Chatter, as well as the web resources they all refer to:

Legal Counsel 2.0: A Call to upgrade to Web Archives

  
  
  
  

Portrait Parle class, France between 1910-1915

This article (Homo Electronicus) is a very interesting read in that it calls out a challenge facing lawyers in the digital age.

Many in the law profession are trained to utilize paper and manual research when putting a case together. This still occurs in eDiscovery quite a bit today.

As the article points out, it's as if the legal community is unaware of the tools available for approaching eDiscovery tasks in new ways. These tools are compatible with eDiscovery software platforms and make it easy to extract and sort online data, which virtually eliminates the need for paper files in court.

In an earlier post, Hanzo's Mark Middleton defined the web archive scope, which is very important for anyone in the legal profession to read. It gives you the "what's in it" overview needed to understand the inherent value in web archiving for many litigation scenarios.

In terms of eDiscovery, imagine being able to access websites or social media case data within minutes, as if you were performing a typical browser search. It's a different approach to the traditional paper file process, but one that must be learned by all lawyers, as the article referenced above points out. Courts and regulators are now requiring web and social media data to be presented in full context. It's unavoidable. Where are you in the era of digital eDiscovery?

If you feel you've been nudged in the right digital eDiscovery direction, learn more about Hanzo web archives' compatibility with Symantec Enterprise Vault™ in this joint solution datasheet. Another point to keep in mind: Hanzo archives are compatible with other eDiscovery software as well.

Financial Services Web Archiving: The Growing Need

  
  
  
  

Pinion pine growing out of rock 1972

As data gets bigger and more global, the financial services sector, like many others, faces the challenge of big data storage, governance, and management.

In Nancy Turbé's article titled "Five Key Challenges with Financial Services Data", she outlines the looming data crises financial firms face.

With the advent of emergent communication technologies, the challenges Turbé outlines promise to get much worse. This digital landscape is changing how every industry conducts business, and virtually none are as heavily regulated as financial services.

Keeping compliant, creating and maintaining a clear and robust audit trail, as well as being able to effectively manage your information governance protocols can be a huge drain on your time and resources. It affects your ability to remain competitive, which is a direct hit to your bottom line.

If you are a financial services provider, it's imperative you look into web archiving your data. Not only does it allow you to capture, preserve, and store your web and social media content in forensically sound, native format web archives, but it also allows for easy search and quick extraction of the information you need when you need it. Think about the benefits of clean audit trails, easy web and social media content organization, plus hassle-free data storage: that's only part of the web archiving picture.

If Turbé's article and this post have piqued your interest, I invite you to download our white paper, then contact Hanzo for your tailored web archiving solution.

 

 
