The Cultural Heritage AI Cookbook (2025 Edition)


Gethin Rees
King's College London

Arno Bosse
KNAW HuC

Rossana Damiano
University of Turin

Leif Isaksen
University of Exeter

Tariq Yousef
University of Southern Denmark

Elton Barker
The Open University

Khalid Al Khatib
Rijksuniversiteit Groningen

Anne Chen
Bard College

Enrico Daga
The Open University

Stephen Gadd
University of Pittsburgh

William Mattingly
Yale

Diana Maynard
University of Sheffield

Chiara Palladino
Durham University

Sebastiaan Peeters
University of Twente

Nina Claudia Rastinger
Austrian Academy of Sciences

Mia Ridge
British Library

Matteo Romanello
Odoma

Robert Sanderson
Yale

Marco Antonio Stranisci
University of Turin

William Thorne
University of Sheffield

Erik Tjong Kim Sang
Netherlands eScience Center

Leon van Wissen
University of Amsterdam

Mónica Marrero
Europeana

Margherita Fantoli
Catholic University of Leuven

What, Who, How

The depth and diversity of Cultural Heritage (CH) collections are recognised as invaluable for enriching lives and fostering social and cultural cohesion, and as a significant economic resource. Yet making full use of those collections, and of the individual records within them, remains hampered by a series of interrelated problems: (1) digital catalogue metadata tends to exist for only a small proportion of CH collections; (2) where it exists, it is often sparse, unstructured and subject to various forms of bias; (3) where it is structured, it is often not aligned with external authorities.

This means that it is currently difficult to discover individual items and almost impossible to link them to other records within the same collection, let alone between different resources.

To address these issues, guidelines have been produced to improve the Findability, Accessibility, Interoperability and Reusability (FAIR) of digital assets through machine-actionable methods. Building on the FAIR principles, Linked Open Data (LOD) has proven an effective mechanism for identifying, disambiguating and linking key entities, such as places, people, objects and events, but implementing LOD tends to require a massive investment of time, resources and expertise. More recently, transformer-based Large Language Models (LLMs) have demonstrated a remarkable capacity to interpret and contextualise natural language. However, while LLMs are far more intuitive to use, their probabilistic and variable outputs make data enrichment unstable and unpredictable: they can simply return too many errors for their use in data curation to be worthwhile. The scenario set out here combines LOD and LLM technologies so that digital assets can be enriched through Named Entity Recognition, Named Entity Disambiguation and Relationship Extraction.
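As a rough illustration of the first two of these processes, the sketch below recognises entity mentions in a catalogue description and then disambiguates each one against a small authority file. It is a toy under stated assumptions, not the method used in the recipes: the `AUTHORITY` table, its `example.org` identifiers, and the `recognise`, `disambiguate` and `enrich` functions are all hypothetical, and simple dictionary lookup stands in for the recognition step that an LLM or NER model might perform.

```python
import re
from dataclasses import dataclass

# Hypothetical mini authority file: surface form -> candidate records.
# The URIs are placeholders, not identifiers from any real gazetteer.
AUTHORITY = {
    "alexandria": [
        {"uri": "https://example.org/place/alexandria-egypt",
         "context": {"egypt", "nile", "library"}},
        {"uri": "https://example.org/place/alexandria-virginia",
         "context": {"virginia", "potomac"}},
    ],
    "cleopatra": [
        {"uri": "https://example.org/person/cleopatra-vii",
         "context": {"egypt", "queen", "ptolemaic"}},
    ],
}


@dataclass
class Mention:
    text: str
    start: int
    end: int


def recognise(text):
    """Recognition step: keep capitalised tokens found in the authority file."""
    mentions = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        if match.group().lower() in AUTHORITY:
            mentions.append(Mention(match.group(), match.start(), match.end()))
    return mentions


def disambiguate(mention, text):
    """Disambiguation step: choose the candidate whose context words
    overlap most with the words of the record (ties go to the first)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    candidates = AUTHORITY[mention.text.lower()]
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["uri"]


def enrich(text):
    """Return (mention text, linked URI) pairs for one record."""
    return [(m.text, disambiguate(m, text)) for m in recognise(text)]


record = "Coin of Cleopatra, minted at Alexandria in Egypt."
links = enrich(record)
for surface, uri in links:
    print(surface, "->", uri)
```

Restricting the output to identifiers that exist in an authority file is one way to keep a probabilistic recognition step in check: whatever proposes the mentions, only links that resolve to a known record are kept.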

This cookbook provides a range of recipes, derived from LOD and LLM technologies, that enable CH institutions to enrich their metadata at scale. We envisage two user profiles for the cookbook. The first is a collections manager who is interested in using digital technologies to enrich their records, but who won’t necessarily have the technical expertise to do this for themselves. The second, who has more technical proficiency, will be able to use our recipes as an inspiration or basis for their own work.

The cookbook is structured as follows. It has notebooks for:

Data preparation and processes - in which we set out: (i) how to get the data into a format that can be used in these processes; and (ii) the different ways of identifying named entities and then disambiguating them.
Evaluation - in which we set out how to assess the results of the data processing according to standard metrics.
Applications - in which we set out example use cases for what you’ll be able to do with the processed data.

There is also a Glossary of concepts and an About page.
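To give a sense of what evaluation involves, here is a minimal sketch that scores predicted entity links against a hand-checked gold standard. It assumes that the standard metrics include precision, recall and F1, which the Evaluation notebooks may or may not use in this exact form; the function name, the `ex:` identifiers and the sample data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted (mention, identifier) pairs against gold pairs.

    A prediction counts as correct only if both the mention and the
    identifier it was linked to match a gold pair exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold standard and system output for one record.
gold = {
    ("Cleopatra", "ex:cleopatra-vii"),
    ("Alexandria", "ex:alexandria-egypt"),
    ("Egypt", "ex:egypt"),
}
predicted = {
    ("Cleopatra", "ex:cleopatra-vii"),
    ("Alexandria", "ex:alexandria-virginia"),  # right mention, wrong link
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this example one of the two predictions is correct and one of the three gold links is found, so a wrongly disambiguated mention lowers precision while a missed mention lowers recall.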

A final note: this work is very much of its moment, September 2025. Given the rapid pace of technological change, particularly in LLMs, we anticipate that the specific tools and methods outlined here will no longer be cutting edge in a year's time. In other words, the recipes should not be considered a maintained service, future-proofed best practice or, indeed, a ready-to-go implementation. That said, we believe that these simple-to-follow recipes can be easily adapted to different scenarios, updated with new technologies, and extended for greater coverage. If you have any comments or suggestions, please raise a GitHub ticket on this repo or email officers@pelagios.org.