Transcription of Kaye and Johnston's catalogue of manuscripts

This small but moderately complex project used AI to transcribe a catalogue, but still required considerable human effort to review and refine the output.

5 November 2025

Blog series Untold lives

Author Maddy Clark, Digitisation Officer

Page 1169 of the printed catalogue of European Manuscripts Volume II, Part II showing an overview summary of the Wilson Manuscripts.

A recent addition to the Research Repository is the transcription of the Catalogue of Manuscripts in European Languages Volume II Part II: Minor Collections and Miscellaneous Manuscripts, Section II, Nos. 539–842 by George Rusby Kaye (1866–1929) and Edward Hamilton Johnston (1885–1942).

This catalogue is a continuation of Kaye and Johnston Vol. II, Part II, Section 1, Nos. 1–538, (1937), and was completed by Johnston following Kaye’s death in 1929, covering manuscripts numbered 539–842. It was printed for publication but never published.

This transcription has been created from the copy held by the India Office Records & Private Papers section at the British Library. This particular copy contains some useful additional annotations by later curators that broaden the original text’s descriptions, including further details regarding the provenance of some collections, as well as serving to correct some mistakes. These notes have been added in the published transcription.

This project was completed using Transkribus, AI text recognition software for historical documents. For the initial transcription and creation of ground truth data, 75 pages were digitised using a scanning tent. A community-developed AI model was used for the first phase of transcription; however, as a more ‘generic’ model, it was not attuned to the conventions of Kaye and Johnston. It struggled with the following:

  • Recognising minor layout inconsistencies (e.g., margin notes, headers, tables).
  • Recognising characters of a similar shape (e.g., a/e/o, h/l/i), particularly toward line-ends.
  • Recognising non-English characters, specifically Sanskrit diacritics.

Pages 1208-1209 of the Printed Catalogue of European Manuscripts Volume II, Part II being edited in the Transkribus programme.

The transcriptions created by the community model, once corrected, provided the ground truth data needed to train a more precise custom transcription model. Transcription models aren’t always perfect on the first try; they are improved by ‘feeding’ them more ground truth data from corrected pages until they are as accurate as possible. The first custom model had a ‘Character Error Rate’ (CER) of 3.77%, based on 150 pages of text. That percentage indicates that, on average, only about 3.77 out of every 100 characters would be wrong. This was definitely not an accurate reflection of its abilities, but it was a promising start.
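For readers curious about how a figure like this is arrived at, here is a minimal sketch of one common way to calculate a character error rate: count the smallest number of single-character insertions, deletions and substitutions needed to turn the model's output into the corrected ground truth (the Levenshtein distance), then divide by the length of the ground truth. This is an illustration only. The function names are hypothetical, and it is an assumption that Transkribus calculates its reported CER in exactly this way.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Smallest number of single-character edits turning hypothesis into reference."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,  # character missing from the hypothesis
                    current[j - 1] + 1,  # extra character in the hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the ground truth."""
    if not reference:
        raise ValueError("reference (ground truth) text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the model misreads 'i' as 'l', one of the
    # similar-shape confusions described above.
    ground_truth = "Sanskrit"
    model_output = "Sanskrlt"
    print(f"CER: {character_error_rate(ground_truth, model_output):.2%}")
```

In this toy example one character in eight is wrong, giving a CER of 12.50%. Measured the same way, a CER of 3.77% corresponds to an average of 3.77 errors per 100 characters of ground truth.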

In total, three custom models were developed. The final model was trained on 99 manually corrected ground truth pages, comprising 80,668 words, and had a CER of just 0.91%. Success! While some human quality control was required on all pages, human text correction time was significantly reduced, by 3–10 minutes. Some other strange quirks did make themselves known in later models, such as upside-down text and the occasional passage of complete gibberish being transcribed.

One of the clear conclusions from this project is that AI, when used on a small and moderately complex project such as this, still relies heavily on human effort. It can make the repetitive, time-consuming parts more manageable, but a person still needs to review and refine the output. In this case, that meant around 320 hours of human effort.

The final PDF of the transcription is now available on the Research Repository, along with an additional document outlining the project in more detail. Both are free to download and search.

Further resources

Kaye and Johnston Vol. II, Part II, Section 1, Nos. 1–538 (1937) is accessible via Google Books (as of 2025).

Other published catalogues in this series include:

  • Catalogue of manuscripts in European languages belonging to the library of the India Office: Vol 1 Pt 1 Mackenzie Collections: 1822 Collection and Private Collection, by C O Blagden. London: OUP, 1916.
  • Catalogue of manuscripts in European languages belonging to the library of the India Office: Vol 2 Pt 1 Orme Collection, by S C Hill. London: OUP, 1916.

Untold lives series

This blog is part of our Untold Lives series, sharing stories of people’s lives from our collections. Stories from around the world, from the dawn of history to the present day, are told through the written word, images, audio-visual and digital materials.

We hope to inspire new research and encourage enjoyment, knowledge and understanding of the British Library and its collections.