Computational analysis of book descriptions: a placement project
Jeanette Croen, a PhD placement student, reflects on her time at the British Library and her research in using computational analysis on early printed book descriptions.
13 April 2026
Blog series: Digital scholarship
Author: Jeanette Croen, PhD placement student

Example of an incunabulum: folio from the Book of Hawking, Hunting and Heraldry.
One of the big issues with libraries, archives and museums is that much of the information they contain can only be accessed in person. This project extracted information previously held only in print catalogues so that it can be added to the British Library’s online database and accessed around the world. My PhD placement project used various methods of computational analysis on the BMC, the 'Catalogue of Books Printed in the XVth Century Now at the British Museum [British Library]', to achieve this goal. The BMC was published between 1908 and 2007 and comprises detailed descriptions of the incunabula, or printed books up to the year 1500, at the British Library. My project focused on BMC volume XI, published in 2007, which covers English incunabula. The information I extracted will be used to update records in relevant specialist databases: Material Evidence in Incunabula (MEI), the Incunabula Short Title Catalogue (ISTC) and the British Library Catalogue (ALMA). It builds on my supervisor Rossitza Atanassova's AHRC-RLUK-funded Professional Practice Fellowship project (2022-23), Legacies of Curatorial Voice in the Descriptions of Incunabula Collections at the British Library, which worked with BMC volumes I-X to extract free-text descriptions for computational analysis.
I used a hybrid mixture of tools (Transkribus, AntConc and custom code) combined with human review to extract information from the scans in a structured way. Structured data was key to updating the databases en masse rather than record by record. The AI-powered tool Transkribus is specifically designed to work with digitised images of historical documents like the images we have of the BMC. My work reused the layout recognition from the previous project, also completed with Transkribus, to ensure information is correctly assigned to each text in the catalogue. Optical Character Recognition (OCR) was then run to gather the text.
BMC XI has 200 pages of records; 50 of these were manually tagged to train a model that could then tag the remaining pages for regions. First, I used Transkribus to create field models (layout recognition) and tag regions falling under binding, dating, tables or provenance information (as shown in the image below). Unfortunately, tables will not be extracted for now, as superscripts were not consistently identified correctly, but OCR can be rerun on top of the field model once the large language model (LLM) has improved enough to capture these accurately. The extracted information is then sorted by code I wrote, so that it can be mapped and uploaded to the relevant databases to enrich the catalogue.

Example of tagging by Field Model 3 on Transkribus.
I ran Text Titan Ter 1, a general-purpose language recognition model, over all the two-column and four-column pages of BMC XI (these pages had originally been separated to get clean layout recognition of the columns). Multiple iterations of field models were created to build the most accurate model. Originally, more regions had been tagged, but this confused the model and resulted in overlapping sections and general chaos. A third field model was then run with the advanced setting 'keep existing layout' turned on to preserve the two columns. This meant the output file (.XML) contained all the reading-order information needed to extract the text and combine it with the correct entry.
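To illustrate what that export contains: Transkribus writes PAGE XML, where each text region carries a `custom` attribute holding its reading-order index and structure type. The following is a minimal sketch of pulling regions out in reading order; the sample XML, region types and function name are my own illustration, not the project's actual files or code.

```python
import xml.etree.ElementTree as ET

# Minimal PAGE-like XML in the style of a Transkribus export; the region
# types ("provenance", "binding") and texts here are illustrative only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r2" custom="readingOrder {index:1;} structure {type:binding;}">
      <TextEquiv><Unicode>Binding: nineteenth-century calf.</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r1" custom="readingOrder {index:0;} structure {type:provenance;}">
      <TextEquiv><Unicode>Provenance: Grenville Copy.</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

NS = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_regions(xml_text):
    """Return (reading_order, region_type, text) tuples sorted by reading order."""
    root = ET.fromstring(xml_text)
    regions = []
    for region in root.iter("{%s}TextRegion" % NS["p"]):
        custom = region.get("custom", "")
        # The 'custom' attribute is a semi-structured string; pull out the
        # reading-order index and the structure type it records.
        index = int(custom.split("index:")[1].split(";")[0])
        rtype = custom.split("type:")[1].split(";")[0] if "type:" in custom else "unknown"
        unicode_el = region.find(".//p:Unicode", NS)
        text = unicode_el.text if unicode_el is not None else ""
        regions.append((index, rtype, text))
    return sorted(regions)

for index, rtype, text in extract_regions(SAMPLE):
    print(index, rtype, text)
```

Sorting on the reading-order index is what lets text from separated columns be recombined with the correct catalogue entry.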
On reflection, I could have taken another approach: applying the field model first, then the layout model, then the text model. However, this would only have produced text recognition for the regions inside the field models, not for the entire page, and it does not seem that it would have fixed the reading-order issues.
While the models created in Transkribus generally recognise and tag regions correctly, there are occasional errors, such as misidentifying a section or missing text entirely. I therefore wrote bespoke Python code to flag these errors for review. The code uses the keywords at the start of each section to sort it into a correctly tagged or incorrectly tagged group.
Keywords
Binding: Binding, Bound, Rebound, Inlaid
Dating: Dating
Provenance: Provenance, Presented, From, Grenville Copy, King George III’s Copy, Bought
The incorrectly tagged groups will then be reviewed to decide whether they should be included in the mass upload. On initial examination, approximately two-thirds of the regions were correctly tagged, while one-third will require further human review.
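The keyword check described above can be sketched roughly as follows; the keyword lists come from the table above, while the function name and sample texts are my own illustration rather than the project's actual code.

```python
# Keywords from the lists above: a region whose text opens with one of the
# keywords expected for its tagged section is treated as correctly tagged.
KEYWORDS = {
    "binding": ("Binding", "Bound", "Rebound", "Inlaid"),
    "dating": ("Dating",),
    "provenance": ("Provenance", "Presented", "From", "Grenville Copy",
                   "King George III’s Copy", "Bought"),
}

def sort_region(section, text):
    """Return 'correct' if the text opens with an expected keyword for its
    tagged section, else 'review' to flag it for human checking."""
    expected = KEYWORDS.get(section, ())
    return "correct" if text.lstrip().startswith(expected) else "review"

print(sort_region("binding", "Bound in nineteenth-century calf."))
print(sort_region("provenance", "Duff 96; see BMC XI."))
```

A simple first-word test like this cannot catch every mis-tag, which is why the flagged third still goes to human review.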
Finally, the curators and I mapped the derived entries onto a spreadsheet that will enable an upload into the appropriate fields in the MEI and ALMA databases. This work will enrich over 200 entries in the online catalogue with binding, provenance and dating information.
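In outline, that mapping step amounts to writing the extracted fields into tabular form, one row per catalogue entry. This is a minimal sketch only: the column names, identifiers and sample values below are placeholders I have invented, not the real MEI or ALMA field names.

```python
import csv
import io

# Extracted fields keyed by the catalogue entry they belong to; all of the
# identifiers and values here are invented placeholders for illustration.
entries = [
    {"entry_id": "entry-001", "binding": "Bound in calf.",
     "provenance": "Grenville Copy.", "dating": ""},
    {"entry_id": "entry-002", "binding": "",
     "provenance": "Presented by a donor.", "dating": "Dating: see Duff."},
]

# Write one spreadsheet row per entry, with a column per extracted field.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["entry_id", "binding", "provenance", "dating"])
writer.writeheader()
writer.writerows(entries)
print(buffer.getvalue())
```

Keeping one row per entry and one column per field is what makes a mass upload possible, since each database field can be populated from its matching column.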
AntConc is a corpus linguistics tool that I used to analyse the incunabula descriptions. Below are the most common words found in BMC XI, excluding some common stop words: words with little to no semantic meaning that would otherwise obscure the semantically rich words we were more interested in. The results highlight the importance of Caxton, a prolific 15th-century printer, and Duff, an expert on 15th-century books whose works are often cited for dating.

Word cloud created on AntConc with the most common words found in BMC XI.
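At its core, a word list like this is a frequency count with stop words filtered out. The same idea can be expressed in a few lines of Python; the sample sentence and the small stop-word set below are illustrative, not AntConc's actual list or the BMC text.

```python
import re
from collections import Counter

# A small illustrative stop-word set; real stop-word lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "on", "by", "to"}

def word_frequencies(text):
    """Lower-case and tokenise the text, then count words, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

sample = "Printed by Caxton. Duff notes the type used by Caxton in Westminster."
print(word_frequencies(sample).most_common(3))
```

Filtering the stop words first is what lets content-bearing names such as Caxton and Duff rise to the top of the list.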
There are further opportunities for research in computational analysis and named entity recognition with this project. Hopefully, some of the work completed can be applied to BMC XIII, which focuses on Hebrew incunabula. However, that volume contains both Hebrew and English, and there is not currently an LLM on Transkribus that can successfully process these languages together. The field model was also able to detect tables successfully; however, the language detection of superscripts was poor, so this will need to be revisited.
AI, computational analysis and the field of digital humanities are developing at an incredibly rapid rate. While these tools are made to be user-friendly, they still have a learning curve if you have never worked with them before. As the need to digitise collections data is felt across the GLAM sector, the skills gained from learning to work with these tools are invaluable. I hope the work I completed on this project can serve as guidance as this research continues and the tools advance further, making it possible to extract the data I was unable to reach.
My placement at the British Library enabled me to apply the skills I’ve learned in digital humanities in a meaningful way, rather than the more theoretical practice of my PhD research. It allowed me to learn and practise Python coding in ways I found interesting and purposeful, which would not have happened in my own research.
The staff at the library were incredibly kind, helpful and generous with their time in teaching me how to work with these tools and I’ve left feeling much more confident in my digital humanities knowledge and skills than when I began. I was so pleased to attend Fantastic Futures 2025 in December and learned so much from all the speakers at the conference about the future of AI in libraries.
The opportunity to learn more and work with collections as data has been invaluable and I look forward to sharing my experiences and insights with my colleagues and students at the University of Lincoln.

This blog is part of our Digital Scholarship series, tracking exciting developments at the intersection of libraries, scholarship and technology.
