Automatic Transcription of Historical Handwritten Arabic Texts

Image of a plant on a page from manuscript number or_3366
Or 3366 Kitāb Dīsqūrīdis fī mawādd al-‘ilāj كتاب ديسقوريدس في موادّ العلاج Dīsqūrīdis (Dioscorides) ديسقوريدس

Helping to advance the state of the art in handwritten text recognition technologies for Arabic, this project aims to train software to read historical Arabic manuscripts, share training models and ground truth, and collaborate on Arabic handwritten text recognition.

About

The British Library is home to one of the largest and finest Arabic manuscript collections in Europe and North America, comprising almost 15,000 religious, historical, literary and scientific works. Since 2012, the Library, in partnership with Qatar National Library, has digitised and made freely available over 2 million images on Qatar Digital Library (QDL). Making vast archives of digitised Arabic texts fully searchable would transform research, opening up this rich content for full-text search and enabling large-scale text analysis.

Computer scientists and scholars have been working on this challenge, building systems which can automatically transcribe images of handwritten text. Through this project we aim to support continued research in this area by contributing open image and ground truth datasets of historical handwritten Arabic texts and by collaborating on Arabic handwritten text recognition (HTR), so that historical Arabic collections continue to benefit from state-of-the-art developments in HTR.

What is ground truth?

Most recognition systems require ground truth, a set of files which represent the truthful record of elements of an image, for training and evaluation purposes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. By knowing what the system is supposed to recognise on a digitised page of handwritten text, researchers can both train their system to recognise the characters and test how well the system performs once trained.
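
As an illustration of how ground truth supports evaluation, the sketch below compares a system's output with a ground truth transcription using character error rate (CER), a commonly used measure in text recognition. This is a minimal example under stated assumptions: the function names and the sample strings are hypothetical, and CER is not necessarily the measure used in this project or its competitions.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, recognised: str) -> float:
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, recognised) / len(ground_truth)


# Hypothetical example: one of four characters is misread.
ground_truth = "كتاب"
recognised = "كتات"
print(character_error_rate(ground_truth, recognised))  # prints 0.25
```

A lower value means the recognised text is closer to the ground truth. In practice, both strings would usually be brought to the same Unicode normalisation form before comparison, so that visually identical Arabic text encoded in different ways is not counted as an error.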

Project aims

  • Support the British Library's mission to make our intellectual heritage accessible to everyone “for research, inspiration and enjoyment”, particularly our non-Western materials
  • Raise awareness of our Arabic manuscript collections, specifically our scientific manuscripts, with a wide and diverse audience around the world, from the general public through computer scientists to students
  • Instigate new collaborations in the computer science/recognition domain, creating a dialogue around the challenges and opportunities for automatic transcription of historical Arabic texts
  • Create an openly licensed ground truth dataset from our Arabic manuscripts to aid researchers working on the state-of-the-art in recognition software
  • Pilot the crowdsourcing of transcriptions of handwritten manuscripts in order to create ground truth resources at scale

Activities

Phase 1

Collaborative Transcription Pilot

This proof-of-concept activity explored whether a ground truth dataset can be created collaboratively at scale, using the collective expertise of volunteers around the world. At the heart of this approach was the Library’s commitment to creating new and interesting ways to connect diverse communities of interest and expertise around our collections, be they scholars, the general public, computer scientists, students or curators.

For this we utilised a platform called FromThePage, which allowed anyone with an interest in historical Arabic manuscripts to experience them up close, many for the first time, to discuss, learn and share expertise in their transcription. The Digital Research Team was also able to fund the development of right-to-left transcription support, a feature required for this project, but which will also benefit any scholar wishing to use the software for their own transcription needs. A team of four British Library curatorial and translation experts produced the first 10 pages of ground truth to use as an example for volunteers. It took only 18 days for 36 volunteers from around the world to fully transcribe a further collection of 85 pages selected from 9 manuscripts.

Arabic Scientific Manuscripts Transcription Workshop

In conjunction with the launch of the online collaborative transcription platform, we hosted 12 individuals at the British Library for an Arabic Scientific Manuscripts Transcription Workshop. Throughout the day participants had the opportunity to meet the curators, view a selection of original manuscripts from the Arabic collections, learn about the latest developments in OCR/HTR for handwritten Arabic script, and give the new transcription platform a try. Their feedback was invaluable to the project and helped us make changes to the platform ahead of its wider launch. 

ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts

In collaboration with our partners at the Alan Turing Institute and PRImA Research Lab, we launched a competition as part of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR 2018) held in August 2018 in Niagara Falls (USA). The competition focused on finding an optimal solution for accurately and automatically transcribing historical Arabic scientific handwritten manuscripts, utilising the ground truth we created. Two winners were announced:

  • Page segmentation: Berat Kurar Barakat, Ben-Gurion University of the Negev
  • Text line segmentation & text recognition: Hany Ahmed, RDI Company, Cairo University

A paper describing the competition and results was published in the proceedings of ICFHR 2018:

C. Clausner, A. Antonacopoulos, N. McGregor, D. Wilson-Nunn, "ICFHR 2018 Competition on Recognition of Historical Arabic Scientific Manuscripts - RASM2018", Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR2018), Niagara Falls, USA, August 2018, pp. 471-476.


Phase 2

ICDAR2019 Competition on Recognition of Historical Arabic Scientific Manuscripts

Building on the success of the first pilot, we collaborated again with our partners at the Alan Turing Institute and PRImA Research Lab. We launched a second competition as part of the 15th International Conference on Document Analysis and Recognition (ICDAR2019), held in September 2019 in Sydney, Australia. As in the previous year’s competition, we tried to find the best solution for automatic transcription of historical Arabic scientific handwritten manuscripts.

For this follow-up competition, we enhanced and extended the existing ground truth dataset, from 95 to 120 pages. We also added another challenge to the mix – marginalia. This text, written in the margins of the manuscripts, is often less standardised and legible than the main text, and frequently goes in different directions.

Read more about the 2019 competition and its results in these resources:

All ground truth resources created for both competitions are freely available under an open licence for anyone wishing to advance the state of the art in text recognition technology. Download here: https://doi.org/10.23636/1135.

 

Phase 3

Following on from the two competitions, we have used the resulting ground truth dataset to train Transkribus and evaluate how it performs on the automated recognition of Arabic manuscripts. Transkribus is a platform for “the digitisation, AI-powered text recognition, transcription and searching of historical documents”, and is one of the leading tools for handwritten and printed text recognition. See here for more information on testing Transkribus with Arabic manuscripts: https://blogs.bl.uk/digital-scholarship/2020/01/using-transkribus-for-arabic-handwritten-text-recognition.html.

These activities led to fruitful collaborations with OpenITI and related projects around the development of open source OCR/HTR tools for Arabic, Persian and other Arabic-script languages, e.g. the Automatic Collation for Diversifying Corpora (ACDC) project and the OpenITI AOCP project. Other users have been benefitting from our ground truth dataset as well as the training model available on Transkribus.

We are now considering our next steps, which could include (1) improving the existing Transkribus model for Arabic handwriting (by adding to the ground truth) and making it public; (2) automatically transcribing Arabic manuscripts digitised as part of the British Library Qatar Foundation partnership; and/or (3) running another Arabic HTR competition with PRImA for ICDAR2023.
 
 

Contributors

Project Leads

Curatorial

  • Lynda Barraclough, Head of Curatorial Operations, BL/Qatar Foundation Partnership
  • Dr Bink Hallum, Curator of Arabic Scientific Manuscripts, BL/Qatar Foundation Partnership
  • Daniel Lowe, Curator Arabic Collections, British Library
  • Dr Mariam Abolezz, Translation Support Officer, BL/Qatar Foundation Partnership
  • Julia Ihnatowicz, Translation Support Officer, BL/Qatar Foundation Partnership
  • George Samaan, Translation Support Officer, BL/Qatar Foundation Partnership

Collaborative Transcription Platform (FromThePage)

ICFHR 2018 and ICDAR 2019 Competition Partner