
blplaybills.org: better search results using LLMs

Search for, view and download 300 years of theatre posters from Great Britain and Ireland, 1600–1902.

7 October 2025


The blplaybills.org website provides a way to search for, view and download 300 years of theatre posters from Great Britain and Ireland, 1600–1902, as curated by the British Library (BL).

The site relies on extracting text from scans of historical documents using a machine learning technique called Optical Character Recognition (OCR). Historically, OCR has been a pattern-matching technique: it compares shapes found on the document with image patterns already known to the software.

This can produce reasonable results, but a number of problems arise with historical documents that can lead to unexpected output:

  • Complex layouts can confuse the software, and the order in which content is returned isn’t always as expected
  • Multi-language content can cause problems
  • No awareness of context: a simple pattern matching algorithm does not understand 17th-century grammar or vocabulary, so it can’t make ‘educated guesses’ about what a partially recognised word or character might be
  • Poor quality source images generally produce poor results
  • Unexpected decorations (manicules ☞, braces etc.) and typographical conventions like diphthongs and the long “s” (ſ) are not handled well.

Despite these challenges, blplaybills.org used open source OCR projects like Tesseract and Textra to produce usable content that has given researchers valuable access to the playbill collection.

However, there is now another option: using modern Large Language Models (LLMs) to do the job.

This post discusses how experiments with LLMs for OCR led to significantly improved text extraction from playbill scans, which means that blplaybills.org now provides more accurate search results.

Improving Optical Character Recognition

The pace of change in the world of LLMs is rapid.

In the last six to twelve months a number of LLMs have proven especially adept at character recognition, and they are now used routinely for extracting text in personal settings (photographs and receipts on your phone) and commercial applications (invoices, document summarisation and Retrieval-Augmented Generation, or RAG).

My application is very simple in comparison, and as we shall see this simplicity created problems for the LLMs.

The general approach to evaluating a number of LLMs was as follows:

  • Research open source and commercial, multimodal and vision LLMs that have an OCR capability and compile a shortlist
  • Prefer open source to commercial (in many situations cost is an important factor)
  • Look for an off-the-shelf, trained model that produces better results than Tesseract/Textra. This model, or a pretrained variant, could then be fine-tuned to steadily improve results
  • Make the evaluation process easy and repeatable; I would expect to repeat this process on new or improved LLMs every 6 months to see if better results could be achieved
  • Write software that can be reused with minimal changes, to evaluate each of the short-listed (and future) LLMs, testing their capabilities on a small, fixed sample of ‘difficult’ images (poor scans, accents, strange characters, unusual layouts etc.); a sketch of such a harness follows this list.
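
As an illustration of that last point, the evaluation harness can be little more than a loop over the fixed sample set that writes each model’s transcription to disk for side-by-side comparison. The following is a minimal sketch rather than the production code; the ocr_image argument stands in for whatever wrapper performs the actual API call (one possibility is shown in the next section):

```python
from pathlib import Path

# Short-listed models, by provider identifier
MODELS = [
    "qwen/qwen2.5-vl-72b-instruct",
    "mistral-large-latest",
    "gemini-2.5-flash",
]

# Small, fixed sample of 'difficult' images: poor scans, accents,
# strange characters, unusual layouts
SAMPLES = sorted(Path("samples/difficult").glob("*.jpg"))


def evaluate(ocr_image) -> None:
    """Run every short-listed model over the sample set, saving the
    transcriptions so they can be compared manually."""
    for image_path in SAMPLES:
        for model in MODELS:
            text = ocr_image(model, image_path)
            out = Path("results") / model.replace("/", "_") / f"{image_path.stem}.txt"
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(text, encoding="utf-8")
```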

Choosing the model

After initial research, the short-listed models were:

  • qwen/qwen2.5-vl-72b-instruct
  • mistral-large-latest
  • gemini-2.5-flash

I’ve excluded OpenAI models from this list: although one was originally short-listed, I never got as far as testing it. More on that later.

There are a number of challenges in using larger models with more parameters, such as qwen/qwen2.5-vl-72b-instruct and mistral-large-latest. In practical terms the main issue is that these models will only run on powerful hardware that is not readily available, locally or remotely, to every organisation.

One good solution to this is to access the models on remote hardware through an Application Programming Interface (API) provided by a third party. So test software was written to use the API from OpenRouter (which provides a unified API for accessing many AI models), giving me access to the models on their hardware.
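
As a sketch of what that looks like in practice: OpenRouter exposes an OpenAI-compatible chat completions endpoint that accepts images as base64 data URLs. The Python below is illustrative (the prompt is abbreviated and retry logic is omitted), not the exact test software:

```python
import base64
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def ocr_image(model: str, image_path: str, api_key: str) -> str:
    """Send one scanned image to a vision model via OpenRouter and
    return the transcribed text."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,  # e.g. "qwen/qwen2.5-vl-72b-instruct"
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Perform OCR on the following image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```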

The commercial offering from Google could only be used via an API, and the costs of using it are relatively low compared to some of the competition.

Tests were done and the final winner was gemini-2.5-flash.

Reasons for excluding Qwen and Mistral LLMs:

  • Inconsistent results: asking the LLM to generate text from the same image twice produced significantly different results
  • Occasionally truncating content: I suspect this relates to problems with the amount of material a model can process at once when sending large images, and could have been dealt with after further experimentation
  • Poor handling of unusual layouts and characters found in historic documents
  • Occasional text ordering issues
  • Occasional difficulty convincing the model not to add descriptive text or summaries to image content
  • Poor response to prompt changes.

The Google model, gemini-2.5-flash, did not exhibit these problems, and the OCR text was close to perfect, even when presented with very poor scans and difficult-to-process multi-column layouts.

Consider this extract from a playbill image:

Playbill poster

The previous OCR software produced this extract:

I The most Successi Pantomime Buer Nishio

COBURG

The sp

SO MEDAKIDCE

สงONPLURNUNK

831 AND DURING.

ariug Ex

RICHAR

HIGHWAYMAN,

I TURPIN!

sundere

•SAUNDER

ghosyman, Mr.

dele, Mr, MAr. ELLIOTT.

T PALN

sheelert

ANISFER SCARBRO, &c.

Hetines

MAN.

Mrs. Remnant,

Torse Chach

BRO

เธอไร loo i roteng e.usata bus"

The updated LLM workflow produced this extract:

The most Successful Pantomime Every Night.

COBURG THEATRE

Under the sole Management of Mr. DAVIDGE

ON MONDAY, JANUARY 3d, 1831, AND DURING THE WEEK,

To commence with the Melo-Dramatic Spectacle, written by Mr. W. Barrymore, founded on the Daring Exploits of the Notorious HIGHWAYMAN,

RICHARD TURPIN!

Score, the Landlord of the Orry Mere, Mr. PORTEUS. Farmer Saunders, Mr. SAUNDERS.

Game Keeper, Mr. KING. Clump, Mr. ENNIS. Richard Turpin, the Highwayman, Mr. ELLIOTT.

Ned Rust, Dick Swift, Bob Fielding, Gregory, &c. Mr. Bradley, Mr. MAYNARD.

Hannibal Remuant, a Retired Linen Draper, Mr. SCARBRO, &c.

Mrs. Remnant, Mrs. GOMERSAL. Dame Saunders, Mrs. WESTON.

Robbery of the MAIL Coach,—A HORSE Race,—A desperate Chaise Affair—Turpin Shoots his Horse,—and Capture of Turpin.

There is a significant improvement.

At this point I was satisfied with the results and decided to abandon testing the OpenAI model.

Instead, my work was redirected into creating a workflow to use gemini-2.5-flash to process all 84,000+ playbills and index the content for the live site.

Better results with ‘prompt engineering’

Generally speaking, when ‘talking’ to an LLM via an API, we can send a ‘prompt’ together with the data that we want the model to work with.

Think of the prompt as a set of instructions to help the LLM understand our intentions better, and therefore return a response that better suits our needs. These instructions can specify things we want it to include in the response, and crucially, also what we want it to exclude.

When deciding on a prompt I consider the following:

  • Describe the high level task accurately and in concise, simple wording
  • Be clear about how the response should be formatted
  • For some models, using capital letters for emphasis can help
  • Where there are a number of inclusions/exclusions/tasks, specify these as a clear, concise list
  • Where there are special things to be aware of (e.g. how to deal with the long ‘s’), add an extra bullet point just for that.

My experience is that these guidelines do not work consistently for all models, but they worked well with the Gemini models.

A key aspect of the success of the Gemini model was that it responded well to prompt changes: although the final prompt was quite simple, the model did respect it.

The prompt is:

Perform OCR on the following image.

IMPORTANT:

- Return ONLY the text contained in the image, in the order it appears.

- Do not add any summary, context, analysis, or introductory text like "The text in the image is:".

- The image is a historical document and may use unusual fonts, have variable lighting, and contain artifacts.

- Be aware of historical letterforms, such as the 'long s' (ſ), and transcribe it as a modern 's'.
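
For illustration, here is a minimal sketch of sending that prompt alongside an image, assuming Google’s google-genai Python SDK (the production workflow may be wired differently):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

OCR_PROMPT = """Perform OCR on the following image.
IMPORTANT:
- Return ONLY the text contained in the image, in the order it appears.
- Do not add any summary, context, analysis, or introductory text like "The text in the image is:".
- The image is a historical document and may use unusual fonts, have variable lighting, and contain artifacts.
- Be aware of historical letterforms, such as the 'long s' (ſ), and transcribe it as a modern 's'."""


def ocr_playbill(image_bytes: bytes) -> str:
    """Transcribe a single playbill scan with gemini-2.5-flash."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            OCR_PROMPT,
        ],
    )
    return response.text
```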

With hindsight, I might add more detail about how to handle diphthongs, but as someone with a print background, I enjoy seeing them unchanged in the final text.

A crucial part of this approach is to spend time experimenting with different prompts, understanding how each one changes the behaviour of a specific model, and iteratively refining the prompt until consistent results are achieved.

Processing and indexing the playbills

A new workflow was designed that used Google APIs to OCR all playbills in the collection and send updated text and metadata to the database for indexing.

Key points of the implementation:

  • Reusable code written to process images and return OCR text using Gemini API
  • Workarounds for rate limits and occasional service unavailability for gemini-2.5-flash
  • Use of gemini-2.5-pro when gemini-2.5-flash didn’t work (usually rate limit issues rather than problems handling a specific image); a sketch of this retry-and-fallback logic follows this list
  • Fully automated, including error handling, and runs without supervision.
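
A minimal sketch of that retry-and-fallback logic, again assuming the google-genai SDK (the use of errors.APIError to catch rate limit and service errors is my assumption, not a detail from the production code):

```python
import time

from google import genai
from google.genai import errors, types

client = genai.Client(api_key="YOUR_API_KEY")


def ocr_with_fallback(image_bytes: bytes, prompt: str, max_retries: int = 5) -> str:
    """Try gemini-2.5-flash with exponential backoff; if it keeps failing
    (usually a rate limit rather than the image itself), fall back to
    gemini-2.5-pro."""

    def call(model: str) -> str:
        response = client.models.generate_content(
            model=model,
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
                prompt,
            ],
        )
        return response.text

    for attempt in range(max_retries):
        try:
            return call("gemini-2.5-flash")
        except errors.APIError:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...

    # Flash exhausted its retry budget: switch to the pro model
    return call("gemini-2.5-pro")
```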

blplaybills.org now uses this new text and the quality of search results has been transformed.

Next steps

The Gemini models are commercial, closed source and are constantly updated.

The implications of this are:

  • Cost: using various credits and free tiers offered by Google, costs were kept to several hundred pounds to process tens of thousands of playbills. However, for large collections of historical documents that could be a problem, especially if funding is an issue (blplaybills.org is completely unfunded!)
  • Lack of control: the Gemini models, and the APIs that give us access to them, could be updated at any time. Their behaviour may change, making the current software produce unexpected or unwanted results
  • Because the models are closed source, any improvements made using our data might not be shared back with the open source community. That said, there’s a reasonable chance that Google’s open-source models (such as Gemma 3) could eventually benefit from those improvements.

A useful piece of work might be to experiment with non-commercial models:

  • Research open source, multimodal/vision base models
  • Look for versions that are able to run on lower grade hardware, perhaps even locally on a laptop
  • Fine-tune using good data (our playbill images and OCR text can now be considered to be ‘good data’)
  • Try to achieve OCR results as good as, if not better than, the Gemini models
  • Investigate the effects of prompt changes; it’s possible that a tuned model can be applied successfully to many different types of source image with prompt changes only, i.e. we don’t have to fine-tune the model for every type of content, just change the prompt.

If this approach produces good results, then a more generalised workflow to apply these techniques to other types of English or multilingual content might look like this:

  • Find and use a commercial LLM to produce good data from a subset of source images
  • Use that good data to tune a local and private open source LLM
  • Test content with different prompts until good results are achieved
  • Process all content using the new open source model and prompt.

In the meantime, researchers who use blplaybills.org have a better chance of finding what they’re looking for than was the case a couple of weeks ago.


Digital scholarship series

This blog is part of our Digital Scholarship series, tracking exciting developments at the intersection of libraries, scholarship and technology.
