AI and machine learning with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research?

23 December 2024

Author Mia Ridge, Digital Curator

Blog series Digital scholarship

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme, and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. The conference themes have been published and the Call for Proposals will be open soon.

You can also watch public events we've previously hosted on AI in libraries (January 2025) and Safeguarding Tomorrow: The Impact of AI on Media and Information Industries (February 2024).

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. Transcription also applies to audio: oral history staff have experimented with speech transcription tools, raising important questions, as well as theoretical and practical issues, for automatic speech recognition (ASR) tools and chatbots.

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets. For example:

‘Using Transkribus for Arabic Handwritten Text Recognition’, ‘Using Transkribus for automated text recognition of historical Bengali Books’.
Investigating the legacies of curatorial voice in the descriptions of incunabula collections at the British Library and student work on Detecting Catalogue Entries in Printed Catalogue Data
Handwritten Text Recognition of the Dunhuang manuscripts: the challenges of machine learning on ancient Chinese texts (eScriptorium)
Reinventing the 'Convert-a-Card' crowdsourcing project as a semi-automated workflow: ‘Convert-a-Card: Past, Present and Future of Catalogue Cards Retroconversion’; ‘Convert-a-Card: Helping Cataloguers Derive Records with OCLC APIs and Python’; ‘Convert-a-Card: Extracting Entities from Catalogue Cards to Create E-Records’.

“Blackberry” associations change from “gooseberry, raspberry, strawberry, thyme, marjoram” to “smartphone, unlocked, curve, qwerty, android”; “Cloud” associations change from “bright, cloudless, photosphere, mist, dark” to “android, optimizing, leveraging, drive, millman”; “Eta” associations change from “Basque, terrorist, separatist, milf” to “dial, bezel, rolex, dural, ss”; “Follow” associations change from “wish, same, indicated, body, optional” to “us, connect, feed, rss, google”. — From blackberries to clouds – word associations change over time.

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.
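
As a rough illustration of the idea, the sketch below compares a word's nearest neighbours in two sets of word vectors, one per time period. The two-dimensional vectors, the word list and the function names are all invented for this example and are not taken from DUKweb or the project's code; real embeddings have hundreds of dimensions and are trained on the archive text itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def neighbour_overlap(word, early, late, k=2):
    """Jaccard overlap of a word's neighbours in two periods (0 = none shared)."""
    a = set(nearest_neighbours(word, early, k))
    b = set(nearest_neighbours(word, late, k))
    return len(a & b) / len(a | b)


# Toy vectors (assumption): axis 0 is roughly "fruit", axis 1 roughly "technology".
early = {
    "blackberry": [0.9, 0.1],
    "raspberry": [0.8, 0.2],
    "strawberry": [0.85, 0.15],
    "smartphone": [0.1, 0.9],
    "android": [0.2, 0.8],
}
# In the later period only "blackberry" has moved towards the technology axis.
late = dict(early, blackberry=[0.2, 0.9])

print(nearest_neighbours("blackberry", early))
print(nearest_neighbours("blackberry", late))
print(neighbour_overlap("blackberry", early, late))
```

A low overlap between periods flags a word as a candidate for changed meaning, which a researcher would then check against the sources.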

Living with Machines (2018-2023)

Our Living with Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Outputs that used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings included:

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experiments of on-staff Research Software Engineers and the Living with Machines (LwM) team at the British Library, combined with long-term Library staff's knowledge of collections records and processes:

Identifying upside-down images in the Endangered Archives Project – projects within this important collection were often digitised under trying circumstances, so training machine learning to identify image attributes is useful.
Languid: Language Identification Project (2020) – Metadata Services' Victoria Morris experimented with machine learning (and ‘human review’ from c. 40 enthusiastic language experts checking the results) and was able to add language codes to over 3 million catalogue records. Her project identified 471 languages in the records, 141 of which were not previously represented. Resources: short video, longer video, publication Automated Language Identification of Bibliographic Resources: Cataloging & Classification Quarterly: Vol 58, No 1.
Flyswot (2021) – BL staff trained a machine learning model to find images of digitised manuscripts incorrectly labelled as ‘flysheets’.
Trialling a book genre classification model (2022) - the team concluded that the model worked well, but not yet well enough to use for creating catalogue data. They shared their model on Hugging Face and training data created by British Library staff on Zooniverse. Resources: blog post and tutorial.
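
Several of these pilots pair model predictions with human checking, as Languid did with its language experts. The sketch below shows one common way to organise that: accept predictions above a confidence threshold and queue the rest for review. The record identifiers, the threshold value and all names here are hypothetical; this post does not describe how Languid itself decided what reviewers saw.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    record_id: str        # hypothetical catalogue record identifier
    language_code: str    # language code suggested by the model
    confidence: float     # model confidence between 0 and 1


def route_predictions(predictions, threshold=0.9):
    """Split predictions into auto-accepted and needs-human-review lists."""
    accepted, needs_review = [], []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return accepted, needs_review


# Invented example predictions (assumption, not real catalogue data).
sample = [
    Prediction("rec-001", "cym", 0.98),
    Prediction("rec-002", "gla", 0.62),
    Prediction("rec-003", "fra", 0.91),
]
accepted, needs_review = route_predictions(sample)
print([p.record_id for p in accepted])
print([p.record_id for p in needs_review])
```

Lowering the threshold reduces the reviewers' workload at the cost of more unchecked errors reaching the catalogue, so the value is a judgement call for each collection.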

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th &19th century English language documents relating to botany from the India Office Records'.
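
Ground truth transcriptions can also be used to measure how well a text recognition model performs. A common measure is the character error rate (CER): the edit distance between the model's output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch with invented example strings and hypothetical function names, not code from any of the projects above.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, predicted):
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, predicted) / len(ground_truth)


# Invented example: one wrong character in a six-character word.
print(character_error_rate("botany", "botamy"))
```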

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments, including:

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.

Digital scholarship series

This blog is part of our Digital Scholarship series, tracking exciting developments at the intersection of libraries, scholarship and technology.
