This project aims to train software to read historical Arabic manuscripts.
The British Library is home to one of the largest and finest Arabic manuscript collections in Europe and North America, comprising almost 15,000 religious, historical, literary and scientific works. Since 2012, the Library, in partnership with the Qatar National Library, has digitised and made freely available 1,921,252 images, of which 76,092 are from this collection, on the Qatar Digital Library (QDL).
Having the capability to make vast archives of digitised Arabic texts fully searchable would truly transform research, opening up this rich content for full-text search and enabling large-scale text analysis.
Computer scientists and scholars are working on this challenge, building systems which can automatically transcribe images of handwritten text, but for historical Arabic script a solution remains just out of reach.
Through this project we aim to support continued research in this area by contributing an open image and ground truth dataset of historical handwritten Arabic texts, ensuring historical Arabic collections continue to benefit from state-of-the-art developments in handwritten text recognition (HTR).
What is ground truth?
Most recognition systems require ground truth, a set of files which represent the truthful record of elements of an image, for training and evaluation purposes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. By knowing what the system is supposed to recognise on a digitised page of handwritten text, researchers can both train their system to recognise the characters and test how well the system performs once trained.
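To illustrate the evaluation side, here is a minimal sketch of how a system's transcription might be scored against ground truth using character error rate (CER), a metric commonly used in text recognition. This is an illustration only, not the scoring method used in this project; the function names and the choice of Unicode NFC normalisation are assumptions made for the example.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, transcription: str) -> float:
    """Edit distance divided by the length of the ground truth text.

    Both strings are normalised to NFC first (an assumption for this sketch)
    so that equivalent Unicode encodings of the same text compare as equal.
    """
    reference = unicodedata.normalize("NFC", ground_truth)
    hypothesis = unicodedata.normalize("NFC", transcription)
    if not reference:
        raise ValueError("ground truth must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the system misses one of four letters.
cer = character_error_rate("كتاب", "كتب")
print(f"CER: {cer:.2f}")
```

A CER of 0 means the transcription matches the ground truth exactly; higher values mean more characters had to be corrected.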
- Support the British Library's mission to make our intellectual heritage accessible to everyone “for research, inspiration and enjoyment”, particularly our non-western materials
- Raise awareness of our Arabic manuscript collections, specifically our lesser-known Arabic scientific manuscripts, among a wide and diverse audience around the world, from the general public to computer scientists and students
- Instigate new collaborations in the computer science/recognition domain, creating a dialogue around the challenges/opportunities for automatic transcription of historical Arabic texts
- Create an openly licensed ground truth dataset from our Arabic manuscripts to aid researchers working on the state-of-the-art in recognition software
- Gather evidence to inform a much larger commitment to crowdsourcing transcriptions of handwritten manuscripts and creating ground truth resources at scale
For this we utilised a free and open-source platform, FromThePage, which allowed anyone with an interest in historical Arabic manuscripts to experience them up close, many for the first time, and to discuss, learn and share expertise in their transcription. The Digital Scholarship Department was also able to fund the development of the open-source platform to support right-to-left transcription, a feature required for this project, but which will also benefit any scholar wishing to use the software for their own transcription needs. A team of four curatorial and translation experts at the British Library produced the first 10 pages of ground truth to use as an example for volunteers. It took only 18 days for 36 volunteers from around the world to fully transcribe a further collection of 85 pages selected from 9 manuscripts.
In conjunction with the launch of the online collaborative transcription platform, we hosted 12 individuals at the British Library for an Arabic Scientific Manuscripts Transcription Workshop. Throughout the day participants had the opportunity to meet the curators, view a selection of original manuscripts from the Arabic collections, learn about the latest developments in OCR for handwritten Arabic script and give the new transcription platform a try. Their feedback was invaluable to the project and helped us make changes to the platform ahead of its wider launch.
In collaboration with our partners at the Alan Turing Institute and PRImA Research Lab, we launched a competition as part of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR 2018) held August 5-8, 2018 in Niagara Falls (USA). The competition focussed on finding an optimal solution for accurately and automatically transcribing historical Arabic scientific handwritten manuscripts, utilising the ground truth we created. Two winners were announced:
- Page segmentation: Berat Kurar Barakat, Ben-Gurion University of the Negev
- Text line segmentation & text recognition: Hany Ahmed, RDI Company, Cairo University
A paper describing the competition and results was published in the proceedings of ICFHR 2018.
Building on the success of the first pilot, we collaborated again with our partners at the Alan Turing Institute and PRImA Research Lab. We launched a second competition as part of the 15th International Conference on Document Analysis and Recognition (ICDAR 2019), held in September 2019 in Sydney, Australia. As in the previous year’s competition, the goal was to find the best solution for automatic transcription of historical Arabic scientific handwritten manuscripts.
For this follow-up competition, we enhanced and extended the existing ground truth dataset (from 95 to 120 pages). We also added another challenge to the mix: marginalia. This text, written in the margins of the manuscripts, is often less standardised and less legible than the main text, and frequently runs in different directions.
Read more about the 2019 competition and its results in these blog posts:
All ground truth resources created for both competitions are freely available under an open licence for anyone wishing to advance the state of the art in text recognition technology. Download here: https://doi.org/10.23636/1135.
Phase 3
As a next step, we would like to test these materials with other software and platforms, e.g. Transkribus. Watch this space for developments!
- Nora McGregor, Digital Curator, British Library (Phase 1)
- Dr Adi Keinan-Schoonbaert, Digital Curator, British Library (Phase 2)
- Daniel Wilson-Nunn, PhD student at the University of Warwick & Turing PhD Student
- Lynda Barraclough, Head of Curatorial Operations, BL/Qatar Foundation Partnership
- Dr Bink Hallum, Curator of Arabic Scientific Manuscripts, BL/Qatar Foundation Partnership
- Daniel Lowe, Curator Arabic Collections, British Library
- Mariam Abolezz, Translation Support Officer, BL/Qatar Foundation Partnership
- Julia Ihnatowicz, Translation Support Officer, BL/Qatar Foundation Partnership
- George Samaan, Translation Support Officer, BL/Qatar Foundation Partnership