Cleaning up Hebrew catalogue records with OpenRefine (Adi Keinan-Schoonbaert)

Cleaning up Hebrew catalogue records with OpenRefine

The catalogue records for the British Library’s collection of Hebrew manuscripts are remarkably diverse and informative. Dr Adi Keinan-Schoonbaert explores how this data can be further improved and standardised in preparation for scholarly analysis.
The first phase of the Hebrew Manuscripts Digitisation Project (2013-2016) has come to an end. This project had two main outcomes: the first is the images of 1,300 manuscripts, with a total of ca. 435,000 images. These can be viewed using the British Library’s Digitised Manuscripts platform. The second outcome is digital catalogue records for these manuscripts, in TEI XML format. These catalogue records hold the metadata on each manuscript, such as author, title, date of creation, shelfmark, description, scribe, and provenance.
In order to make these metadata records more user friendly, a TEI Viewer was created by Alex Mendes, the Project Support Officer for the Library’s Asian and African Studies Department. Alex created a very useful tool that parses the XML code into tabular format – which allows viewing the data in its entirety. Among other things, you can export this data into a spreadsheet, and in turn conduct data analysis of your choice. From manuscripts’ scribes, handwriting and language, to decorations and binding – there is interesting information for everyone!
But before conducting any form of analysis or data visualisations, you need to make sure that the data is ‘clean’, meaning: normalised and standardised. A closer look at the Hebrew metadata reveals that author names or place names, for example, were not always written in the same way. Or, that sometimes there are unnecessary spaces or punctuation marks in data fields.
OpenRefine is free and open source, and you can download it here: http://openrefine.org (note: it’s better not to use Internet Explorer as your browser – Google Chrome is a good alternative). There are many free tutorials and user guides (see below), but I would highly recommend Owen Stephens Introduction to OpenRefine, developed on behalf of the British Library. If you wish to try this out, do consult the tutorial below on how to download, run, and use OpenRefine.

Get to know your data


How many times does a particular value appear in your data fields? To find out, you can use OpenRefine’s ‘Facets’. ‘Facets’ are a really useful feature that can be used to see how a certain column of data breaks down into its different values. This feature also helps you ‘clean’ that data and make it more consistent. Using a ‘Facet’, you can group values that appear in a column, and then also filter and edit data across a number of records at the same time.
For example, say I’m interested in the titles of digitised Hebrew manuscripts. In the Title column, I clicked on the drop-down menu at the top and selected Facet > Text facet. The facet is visible on the left-hand panel, listing all possible titles – and how many times each one appears in the digitised collection.

Viewing all available titles in the digitised collection of Hebrew manuscripts
Viewing all available titles in the digitised collection of Hebrew manuscripts


You can see that the title Moreh Nevukhim (Guide for the Perplexed) by Maimonides appears eight times in the collection, and it’s either written as Moreh Nevukhim or More Nevukhim. Similarly, the collection has several manuscripts named Nevi’im u-Khetuvim (Prophets and Writings) – but these titles are not always spelled in the same way. In order to standardise the way these titles are written, you can use OpenRefine to decide on the best default title option, and apply it to all – see next section.

Resolve your inconsistencies


There are several ways to standardise your data, when you have a list of terms that differ from each other but actually mean the same thing – such as the example of the manuscript titles mentioned above. Let’s look at another example of an unstandardised field – and clean it up.
Our Hebrew manuscripts are written in a variety of languages – indicated in the Language field. I chose to view the ‘Text facet’ of the Language column, to see all the options available in that field. This field could be normalised and simplified, for example: joining together the values ‘Arabic and Hebrew’ with ‘Hebrew, Arabic’; merging ‘Aramaic’ with ‘Aramaic of the Zohar’ (which is a late, specific dialect of Aramaic); and filling in the blank fields with ‘Hebrew’ – the default language in the collection. I can do this by clicking the edit link next to the field that I’d like to change and inserting the right option in the dialogue box.

Language field with its different values, before (left) and after (right) cleaning
Language field with its different values, before (left) and after (right) cleaning


Another way to clean data is through clustering: the ‘Cluster’ function looks for similar values and suggests merging them together to an agreed single value. To exemplify this function, I went to the Author field and chose Text facet in the drop-down menu. I then clicked the Cluster button at the top left hand pane. OpenRefine offers different algorithms to cluster similar values – it’s good to play around with different clustering methods to find the most suitable one for your requirements.
In the case of the Author field, I can standardise how author names are represented, and merge values which represent the same name, according to my research needs or interests. For example, perhaps I want only the English name to appear (and remove the Hebrew or Arabic); I want to remove the dates when the author lived; I prefer the last name to come after the first name. I selected the values that I wanted to merge and clicked Merge Selected & Re-Cluster, to view more clustering options – and repeat the process if necessary.

Selecting similar values in the Author field and merging them to an accepted standardised value
Selecting similar values in the Author field and merging them to an accepted standardised value


Transform your data


Another way to help you standardise data format in specific cells is using column and cell transformations. Here you can remove extra spaces, change the text into uppercase or lowercase, and much more. One common task would be to remove unnecessary punctuation. You can do that by writing a regular expression using GREL – General Refine Expression Language. Don’t let that deter you – it’s pretty simple, and there are a lot of pre-made formulas out there (also see below).

Common cell transformations available in OpenRefine
Common cell transformations available in OpenRefine


To give an example: viewing the different values stored in the Material field revealed that they all end with an unwanted full stop, e.g. ‘Parchment.’. In order to remove that full stop, I selected the Material column and then chose Edit cells > Transform. Removing the full stop required writing a short formula that would run through all cells and do that for us. Such regular expression could look something like: if(isNotNull(value),value.replace(".",""), value), meaning: if the value of the field is not ‘null’, replace the full stop with nothing (i.e. remove it); otherwise (if the field value is null) – leave it as is.

Removing unwanted punctuation using a GREL expression
Removing unwanted punctuation using a GREL expression


If you have several types of information in one column and would like to split them into several columns – OpenRefine can help. The list of Hebrew manuscripts includes a Date field for the creation of manuscripts. Often the date is shown as a range, e.g. 1400-1499. If I want to use this field to visualise the collection by dates, I’d like to create a new column which includes only the latter year, and not the whole range.
I duplicated the Date column (Edit column > Add column based on this column) and named the new column Latest Date. In order for this column to show only the latter year, e.g. transforming '1400-1499' into '1499', I needed to use a regular expression.
After selecting Edit cells > Transform, I inserted the following expression: if(isNotNull(value), if(value.length() > 4, value.substring(5,9), value),value). This reads: If the value in the field is not null, and it is more than 4 characters long (i.e. it’s not a single year, but a range) – make the new value just the last 4 characters (i.e. the latter year); otherwise, leave the value as is.
When you’re done cleaning up your data, you can export it into a new spreadsheet (or other formats), and visualise or analyse it more easily and effectively. There’s a lot more that can be done with OpenRefine – so hopefully this is the beginning of a great friendship!
Adi Keinan Schoonbaert
  • Adi Keinan-Schoonbaert
  • Adi Keinan-Schoonbaert was the digital curator (Polonsky Fellow) working for the Hebrew Manuscripts Digitisation Project (HMDP) at the British Library, and an Honorary Research Associate at the Institute of Archaeology, University College London (UCL).

    Adi's PhD (UCL, 2013) focused on Israeli and Palestinian archaeological inventories in the occupied West Bank.

    In 2014-15 she took part in MicroPasts, a joint UCL-British Museum crowdsourcing initiative.

The text in this article is available under the Creative Commons License.