Automatic Transcription of Historical Handwritten Arabic Texts

Image: a plant on a page from manuscript Or 3366, Kitāb Dīsqūrīdis fī mawādd al-‘ilāj (كتاب ديسقوريدس في موادّ العلاج), Dīsqūrīdis (Dioscorides, ديسقوريدس)

This project aims to train software to read historical Arabic manuscripts.

About

The British Library is home to one of the largest and finest Arabic manuscript collections in Europe and North America, comprising almost 15,000 religious, historical, literary and scientific works. Since 2012, the Library, in partnership with the Qatar National Library, has digitised and made freely available over 1,625,000 images, of which 70,000 are from this collection, on Qatar Digital Library (QDL).

Having the capability to make vast archives of digitised Arabic texts fully searchable would truly transform research, opening up this rich content for full-text search and enabling large-scale text analysis.

Computer scientists and scholars are working on this challenge, building systems which can automatically transcribe images of handwritten text, but for historical Arabic script a solution remains just out of reach.

Through this project we aim to support continued research in this area by contributing an open image and ground truth dataset of historical handwritten Arabic texts, ensuring historical Arabic collections continue to benefit from state-of-the-art developments in handwritten text recognition (HTR).

What is ground truth?

Most recognition systems require ground truth for training and evaluation: a set of files that gives a truthful record of the elements of an image. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. By knowing what the system is supposed to recognise on a digitised page of handwritten text, researchers can both train their system to recognise the characters and test how well it performs once trained.
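As a rough illustration of the evaluation side, the sketch below compares a system's output with a ground truth transcription using character error rate (edit distance divided by the length of the ground truth), a measure commonly used in text recognition. It is a minimal sketch, not the scoring method used in this project: the function names and the sample strings are assumptions made for the example, and the Unicode normalisation step is one possible design choice among several.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, system_output: str) -> float:
    """Edit distance relative to the length of the ground truth text."""
    # Assumption: normalise both strings to NFC so that equivalent
    # encodings of the same letter and diacritic are not counted as errors.
    reference = unicodedata.normalize("NFC", ground_truth)
    hypothesis = unicodedata.normalize("NFC", system_output)
    if not reference:
        raise ValueError("ground truth must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the system misreads one letter (qaf as fa).
    truth = "كتاب ديسقوريدس"
    output = "كتاب ديسفوريدس"
    print(f"CER: {character_error_rate(truth, output):.3f}")
```

A lower value means the output is closer to the ground truth. A value of 0 means the two texts are identical after normalisation.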

Project aims

  • Support the British Library's mission to make our intellectual heritage accessible to everyone “for research, inspiration and enjoyment”, particularly our non-western materials
  • Raise awareness of our Arabic manuscript collections, specifically our lesser-known Arabic Scientific Manuscripts, among a wide and diverse audience around the world, from the general public to computer scientists and students
  • Instigate new collaborations in the computer science/recognition domain, creating a dialogue around the challenges/opportunities for automatic transcription of historical Arabic texts
  • Create an openly licensed ground truth dataset from our Arabic manuscripts to aid researchers working on the state-of-the-art in recognition software
  • Gather evidence to inform a much larger commitment to crowdsourcing transcriptions of handwritten manuscripts and creating ground truth resources at scale

Activities

Phase 1

Collaborative Transcription Pilot

This proof-of-concept activity explored whether a ground truth dataset can be created collaboratively at scale, using the collective expertise of volunteers around the world. At the heart of this approach is the Library’s commitment to creating new and interesting ways to connect diverse communities of interest and expertise around our collections, be they scholars, the general public, computer scientists, students or curators.

For this we used a free and open-source platform, From the Page, which allowed anyone with an interest in historical Arabic manuscripts to experience them up close, many for the first time, and to discuss, learn and share expertise in their transcription. The Digital Scholarship Department was also able to fund development of the open-source platform to support right-to-left transcription, a feature required for this project that will also benefit any scholar wishing to use the software for their own transcription needs. A team of four curatorial and translation experts at the British Library produced the first 10 pages of ground truth to use as an example for volunteers. It took only 18 days for 36 volunteers from around the world to fully transcribe a further collection of 85 pages selected from 9 manuscripts.

Arabic Scientific Manuscripts Transcription Workshop

In conjunction with the launch of the online collaborative transcription platform, we hosted 12 individuals at the British Library for an Arabic Scientific Manuscripts Transcription Workshop. Throughout the day participants had the opportunity to meet the curators, view a selection of original manuscripts from the Arabic collections, learn about the latest developments in OCR for handwritten Arabic script and give the new transcription platform a try. Their feedback was invaluable to the project and helped us make changes to the platform ahead of its wider launch. 

ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts

In collaboration with our partners at the Alan Turing Institute and PRImA Research Lab, we launched a competition as part of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR 2018) held August 5-8, 2018 in Niagara Falls (USA). The competition focussed on finding an optimal solution for accurately and automatically transcribing historical Arabic scientific handwritten manuscripts, utilising the ground truth we created. Two winners were announced:

  • Page segmentation: Berat Kurar Barakat, Ben-Gurion University of the Negev
  • Text line segmentation and text recognition: Hany Ahmed, RDI Company, Cairo University

A paper describing the competition and results was published in the proceedings of ICFHR 2018.

C. Clausner, A. Antonacopoulos, N. McGregor, D. Wilson-Nunn, "ICFHR 2018 Competition on Recognition of Historical Arabic Scientific Manuscripts - RASM2018", Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR2018), Niagara Falls, USA, August 2018, pp. 471-476.

Phase 2

ICDAR2019 Competition on Recognition of Historical Arabic Scientific Manuscripts

Building on the success of the first pilot, we are now collaborating again with our partners at the Alan Turing Institute and PRImA Research Lab. We’ve launched a second competition as part of the 15th International Conference on Document Analysis and Recognition (ICDAR2019), to be held in September 2019 in Sydney, Australia. As in last year’s competition, we are trying to find the best solution for automatic transcription of historical Arabic scientific handwritten manuscripts.

For this follow-up competition, we are enhancing and extending the existing ground truth dataset (from 95 to 120 pages). We are also adding another challenge to the mix: marginalia. This text, written in the margins of the manuscripts, is often less standardised and less legible than the main text, and frequently runs in different directions. Read more about the current competition in this blog post.

After this next competition, all ground truth resources created will be hosted by the British Library and made freely available under an open license, with their own DOI, for anyone wishing to advance the state of the art in text recognition technology.

Contributors

Project Leads

Curatorial

  • Lynda Barraclough, Head of Curatorial Operations, BL/Qatar Foundation Partnership
  • Dr Bink Hallum, Curator of Arabic Scientific Manuscripts, BL/Qatar Foundation Partnership
  • Daniel Lowe, Curator Arabic Collections, British Library
  • Mariam Abolezz, Translation Support Officer, BL/Qatar Foundation Partnership
  • Julia Ihnatowicz, Translation Support Officer, BL/Qatar Foundation Partnership
  • George Samaan, Translation Support Officer, BL/Qatar Foundation Partnership

Collaborative Transcription Platform (From the Page)

ICFHR 2018 and ICDAR 2019 Competition Partner