The British Library and other legal deposit libraries are entitled to copy UK-published material from the internet for archiving under legal deposit. This page describes how we identify UK websites, and explains legal deposit restrictions.
Each year, our UK Web Archive collects a ‘snapshot’ of all the UK websites that we can identify. This includes at least 4 million websites, with several billion files.
How we identify UK websites
We can identify a UK website if it has:
- a domain name that relates to the UK (for example, websites whose addresses end in “.uk”, “.scot” or “.london”)
- a UK address as contact information
- other information on it which identifies that the website's content was created in the UK
- a server in the UK hosting it.
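Of the criteria above, only the domain name can be checked from the address alone. As a rough illustration, and not a description of the libraries' actual software, a check of that first criterion might look like the sketch below. The function name `has_uk_domain` is hypothetical, and the suffix list holds only the three examples given above, so it is not exhaustive.

```python
from urllib.parse import urlsplit

# Suffixes taken from the examples in the list above; not an exhaustive set.
UK_SUFFIXES = (".uk", ".scot", ".london")


def has_uk_domain(url: str) -> bool:
    """Return True if the URL's host name ends with one of the listed suffixes."""
    # hostname is None for strings without a network location, hence the fallback.
    host = urlsplit(url).hostname or ""
    return host.lower().endswith(UK_SUFFIXES)
```

A website that fails this check could still qualify under one of the other criteria, such as a UK contact address, which cannot be determined from the URL.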
We don't collect any material that is only shared between restricted private groups, such as material posted to private networks on social media.
Recorded sound and film
We don't collect material on the web where the content consists solely of a sound recording or a moving-image recording. This includes material where there is some accompanying text, but the text would not make sense without the sound or moving image.
We do collect material from the web that includes sound and moving image alongside text or other elements, where the text or other elements have meaning independent of the sound and moving image.
How we archive websites
The legal deposit libraries use web crawling software wherever possible, especially when collecting for the UK Web Archive. Web crawling is an automated process used to collect content and metadata that is available without access restriction on the open web. Crawled websites and material are preserved in the legal deposit libraries’ web archive. We may also use manual or other methods of downloading content and metadata when necessary.
The web crawling process
A seed list of domain addresses is programmed into the web crawling software as Uniform Resource Locators (URLs). The software uses these to initiate the process, requesting a copy of the root or home page. Then it automatically follows links to the next levels down within the same domain, issuing a separate request for each URL identified. Target websites respond automatically, delivering a copy of the page or file to which the URL relates.
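The process described above can be sketched as a breadth-first crawl that stays within the seed's domain. This is a minimal illustration under stated assumptions, not the software the libraries use: the names `LinkExtractor` and `crawl`, the `max_pages` limit, and the `fetch` callable (which stands in for the real network request) are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Crawl breadth-first from seed_url, staying on the seed's host.

    fetch is a hypothetical callable: fetch(url) returns the page's HTML
    as a string, or None if the request failed.
    Returns the URLs in the order they were requested.
    """
    seed_host = urlsplit(seed_url).hostname
    queue = deque([seed_url])
    seen = {seed_url}
    requested = []

    while queue and len(requested) < max_pages:
        url = queue.popleft()
        requested.append(url)  # one separate request per URL
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(link).hostname == seed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return requested
```

Passing `fetch` in as a parameter keeps the sketch runnable without network access: a dictionary's `get` method that maps URLs to HTML strings is enough to try it out.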
Web crawling politeness and protocols
The web crawling software is also programmed with politeness rules and parameters designed to ensure that there is no harmful impact upon the performance of the target website. For example, they include a limit on how many levels are crawled or how much content is requested from an individual website. Also, when multiple requests for different pages and files are issued to the same website, the software is programmed to leave an interval between each request, to safeguard against using up too much bandwidth and overloading the website.
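To make these rules concrete, the sketch below models a depth limit, a cap on requests per website, and a minimum interval between requests. It is an illustration only: the class name `PolitenessPolicy` and every default value are assumptions, since the actual limits and intervals are not stated here.

```python
import time


class PolitenessPolicy:
    """Per-website limits: crawl depth, request budget and request spacing.

    All default values are hypothetical placeholders.
    """

    def __init__(self, delay_seconds=1.0, max_depth=3, max_requests=1000,
                 clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self.max_depth = max_depth
        self.max_requests = max_requests
        self.requests_made = 0
        self._last_request = None
        # clock and sleep are injectable so the policy can be tested without waiting.
        self._clock = clock
        self._sleep = sleep

    def allows(self, depth):
        """Return True if a page at this link depth may still be requested."""
        return depth <= self.max_depth and self.requests_made < self.max_requests

    def wait_turn(self):
        """Pause until delay_seconds have passed since the previous request,
        then record the new request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
        self.requests_made += 1
```

A crawler would keep one such policy per website, call `allows(depth)` before queuing a URL, and call `wait_turn()` immediately before each request to that website.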
The web crawling software uses standard automated protocols to identify itself to the publisher’s webmaster each time a page is crawled. It does this by sending information called a “user-agent string”, which is recorded in the web server’s log of requests. The website owner can choose whether or not to use this information, and is not required to take any action, such as changing the website’s “robots.txt” permission file.
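In HTTP, a client identifies itself through the `User-Agent` request header, which web servers typically write to their access logs. The sketch below builds such a request with Python's standard library without sending it. The user-agent value and the helper name `build_request` are hypothetical; the string the legal deposit crawler actually sends is not given here.

```python
from urllib.request import Request

# Hypothetical identifier, for illustration only.
USER_AGENT = "ExampleArchiveBot/1.0 (+https://archive.example/crawler-info)"


def build_request(url: str) -> Request:
    """Build (but do not send) a GET request that identifies the crawler."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

Including a contact or information URL in the string is a common convention that lets a webmaster who spots the entry in their logs find out who is crawling and why.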
Where the web crawling software encounters a login facility, it cannot access any material behind it without the appropriate password or access credentials.
Our Framework for UK Legal Deposit (PDF format, 468 KB) provides information on how we are developing our capacity to collect digital publications.