Depositing websites and web pages
The British Library and other Legal Deposit Libraries are entitled to copy UK-published material from the internet for archiving under legal deposit. Web crawling is an automated process used to collect content and metadata that is available without access restriction on the open web.
The Legal Deposit Libraries use web crawling software wherever possible, especially when collecting for the UK Web Archive, but may also use manual or other methods of downloading content and metadata when necessary.
A seed list of domain addresses is programmed into the web crawling software as Uniform Resource Locators (URLs). The software uses these to initiate the process, requesting a copy of the root or home page; then it automatically follows links to the next levels down within the same domain, issuing a separate request for each URL identified. Target websites respond automatically, delivering a copy of the page or file to which the URL relates.
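The process described above can be sketched roughly as follows. This is a minimal illustration, not the Libraries' actual crawling software: the function name `crawl`, the `LinkExtractor` helper and the injected `fetch` callable are all assumptions made for the example, and "same domain" is simplified here to "same host name".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch):
    """Breadth-first crawl from seed_url, staying on the seed's host.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML as a string, or None if nothing could be retrieved.
    Returns a dict mapping each retrieved URL to its HTML.
    """
    seed_host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)  # one separate request per identified URL
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == seed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; a real crawler would supply a function that performs the network request.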
The web crawling software is also programmed with politeness rules and parameters designed to ensure that there is no harmful impact upon the performance of the target website. For example, these rules include a limit on how many levels are crawled or how much content is requested from an individual website. Also, when multiple requests for different pages and files are issued to the same website, the software is programmed to leave an interval between each request, to safeguard against using up too much bandwidth and overloading the website.
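As an illustration of how such politeness rules might be expressed, the sketch below combines a depth limit, a cap on the number of requests and a minimum interval between requests. The class name `PolitenessPolicy` and all default values are hypothetical; the actual limits and intervals used by the Legal Deposit Libraries are not stated here.

```python
import time


class PolitenessPolicy:
    """Hypothetical per-website crawl limits (default values are illustrative)."""

    def __init__(self, max_depth=3, max_pages=500, delay_seconds=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_depth = max_depth          # how many link levels to follow
        self.max_pages = max_pages          # how many requests per website
        self.delay_seconds = delay_seconds  # minimum gap between requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        self.pages_requested = 0

    def allows(self, depth):
        """True if a URL at this link depth may still be requested."""
        return depth <= self.max_depth and self.pages_requested < self.max_pages

    def wait_turn(self):
        """Pause until delay_seconds have passed since the previous request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
        self.pages_requested += 1
```

A crawler built on this sketch would call `allows(depth)` before queuing a URL and `wait_turn()` immediately before each request to the same website. The `clock` and `sleep` parameters exist only so the timing can be exercised without real waiting.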
The web crawling software uses standard automated protocols to identify itself to the publisher’s webmaster on each occasion that a page is crawled. Each request carries information called a “user-agent string”, which the web server records in its log of server requests. The webmaster can choose whether or not to use this information, but is not required to take any action such as changing the website’s “robots.txt” permission file.
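For illustration, a request that identifies its sender in this way could be prepared as in the sketch below. The user-agent value and the `build_request` helper are hypothetical; the string actually sent by the Libraries' crawler is not given in this text. The example only constructs the request and does not send it.

```python
from urllib.request import Request

# Hypothetical identifier, used only for this example.
USER_AGENT = "ExampleArchiveBot/1.0 (+https://archive.example/crawler-info)"


def build_request(url, user_agent=USER_AGENT):
    """Prepare an HTTP request whose User-Agent header names the crawler."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://www.example.org/")
```

When such a request is sent, the server typically writes the User-Agent value into its access log alongside the requested URL, which is how a webmaster can see which crawler visited.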
Where the web crawling software encounters a login facility, it cannot access any material behind it without the appropriate password or access credentials.
Crawled websites and material are preserved in the Legal Deposit Libraries’ web archive.
For further detail, see