Web Archiving collects, makes accessible and preserves web resources of scholarly and cultural importance from the UK domain. Our vision is that by 2016:
- The British Library will be the primary collector and provider of a web archive which is representative of the UK domain.
- The web archive will be used for scholarly research in a range of disciplines.
- Researchers will be able to search and use the web archive as a part of the Library's overall digital collections.
- The UK Web Archive will be known as the place where researchers and the general public look for inactive or historical versions of UK websites.
Selective web archiving
Since 2004, the British Library has been selectively archiving websites with research value that are representative of British social history and cultural heritage. Our Collection Development Policy states the criteria we use to select websites from the UK domain. Websites archived to date are made available through the UK Web Archive, along with additional material archived by the National Library of Wales, the Joint Information Systems Committee and the Wellcome Library.
The UK Web Archive contains regular snapshots of thousands of websites and offers rich search functionality, including full-text, title and URL search. The archive can also be browsed by Title, by Subject and by Special Collection.
Exploring domain-scale web archiving
The implementation of Legal Deposit for UK online publications means that the Library will have a mandate to collect and preserve freely available UK online publications. The Web Archiving Team is exploring the technical and curatorial challenges of collecting a much larger proportion of the UK domain in future. Through large-scale discovery crawls and semantic analysis, we aim to build a better understanding of the boundaries and characteristics of the UK domain. We will also put in place a system that is capable of scaling up to the size of the challenge, particularly given the size of the UK web space, which contains over 9 million websites and is growing.
Integration with Library collections and systems
We work closely with colleagues across the Library on the ingest, storage and long-term preservation of web archives in the Digital Library System, initially involving our selective archive. Access to copies of websites archived under Legal Deposit will be provided in Reading Rooms, through the Library’s resource discovery system 'Explore the British Library'. This system already includes websites from the UK Web Archive in search results (hint: try searching for ‘Robin Cook’ or ‘Argotist’ or ‘rhyming slang’).
Developing web archiving tools
In recent years, the British Library has been leading the development of key web archiving software tools on behalf of the international web archiving community.
The Web Curator Tool (WCT), which has been designed to manage the selective web archiving process, started as a collaborative project with the National Library of New Zealand, and has since been adopted by the National Library of Norway. The BL releases periodic revisions via Sourceforge.
Heritrix is open-source crawler software commonly used by national libraries and archives around the world for web archiving. The British Library is part of a multinational group of libraries working on ‘smart’ extensions to Heritrix to provide better support for large-scale domain crawling; Heritrix version 3.0 was released in December 2009.
We are developing a new crowdsourcing application that will use Twitter to support an automated selection process. We envisage that in the future, automated selection of this sort will complement manual selection by subject experts, resulting in a more representative and well-rounded set of collections.
Working with others
The British Library is a founder member of the International Internet Preservation Consortium (IIPC), which brings together national libraries and other organisations interested in web archiving to share experience and promote the use of common standards and tools. Members of the Web Archiving Team currently chair two of the four IIPC working groups: the Access Working Group and the Harvesting Working Group.
From 2004 to 2008 the British Library was also the lead partner in the UK Web Archiving Consortium (UKWAC), comprising six organisations: the BL, the Joint Information Systems Committee, the National Archives, the National Library of Wales, the National Library of Scotland and the Wellcome Trust. UKWAC shared a common infrastructure to get selective web archiving started in the UK. It has now evolved to become a strategic group within the Digital Preservation Coalition (DPC), providing leadership and encouraging collaboration for UK web archiving activities. For more information on UKWAC / DPC, see http://www.dpconline.org/about/working-groups-and-task-forces/524-web-archiving-and-preservation-task-force.
Head of Web Archiving - eIS
The British Library
96 Euston Road
Tel: +44 (0)20 7412 7184