Abstract - Assembling Legacy Data for AI. The Case of the Paris Bible Project.
F21 Fantastic Futures, 3rd International Conference on Artificial Intelligence for Libraries, Archives and Museums Bibliothèque nationale de France, Paris, France, December 9-10, 2021
Estelle Guéville and David J. Wrisley
Our paper presentation/case study discusses how the Paris Bible Project (PBP, http://parisbible.github.io) uses handwritten text recognition (HTR) to study a large corpus of medieval Bibles found in global manuscript collections. Begun from an inter-institutional collaboration between researchers at the Louvre Abu Dhabi and New York University Abu Dhabi (UAE), the PBP has assembled cultural heritage content into a virtual collection for content-specific research and has set as a priority both the diversity of its project data and the constitution of its team to ensure for a balanced, and nuanced, approach to AI-based research.
WHAT IS A PARIS BIBLE?
The foundation of European universities greatly influenced the production of manuscripts. The Paris Bible manuscript emerged in the thirteenth century as a mass-produced object that spread in response to new literacies, namely teaching and preaching. Paris Bibles have been characterized by modern scholars as objects uniform in style and layout, a thesis that the PBP seeks to revise. It has been estimated that there are some extant 1800 Pocket Bibles, but there are probably many more to be found (Ruzzier, 2010). We have begun to compile an open dataset of Paris Bibles in the world, as well as a representative sample of about 300 digitised manuscripts, thus creating our own “digital collection” of scattered Paris Bibles as well as the training data for HTR.
ASSEMBLING LEGACY DATA FOR AI
In some library collections, Paris Bibles are prestige objects and are among the very first numbered manuscripts, but in other collections they are notoriously undiscoverable. The term “Paris Bible’’ is not universally accepted and such documents can only be found under their specific terms in various European languages: Biblia sacra, Pariser Normbibel, Bibbia dell’ università, Universitetsbibel….
Given this diversity of nomenclature, we have relied on manual and visual identification, using the discipline-specific rich metadata about these objects, primarily in Europe and North America collections which adopted early integral digitization. We took into account accessibility of the documents, including download type, the completeness of digitization, the image/text ratio in collections where digitization is biased towards illumination (e.g. Mandragore). We are slowly incorporating our interlibrary corpus of documents into Transkribus, only a few of which are IIIF compliant. An important lesson here about legacy systems not being ready for new technologies can be drawn here.
UNDERSTANDING DIGITAL COLLECTIONS WITH AI
The PBP aims not only to use AI to understand individual objects better, but also to suggest relationships between objects with very general metadata and to predict the provenance of fragmentary archives of unknown provenance. HTR allows us–building on existing hypotheses about object provenance–to put together ‘human transcription efforts with algorithmic enhancements’ (Widner, p. 133, 2017), making the contents of digitized manuscripts available also for computational research. We have even faced our own variety of AI bias for medieval data that our project design is working to counter. If our models embody existing human knowledge about the manuscripts too closely, we run the risk of reproducing a “history of collections bias” across our corpus. We must build as inclusive of a training corpus as possible and continue to evaluate how models generalize to capture variance.
MANIFOLD EXPERTISE IN TEAMS ENGAGED IN AI RESEARCH FOR CULTURE
Finally, the collaborative nature of the Paris Bible Project is not only interinstitutional, but also involves a network of practitioners from different disciplinary, educational and linguistic backgrounds. The team brings together digital humanities, medieval studies, data science, museum studies, and members who are at different stages of their career and practice. As the project evolves, diversified contributions prove vital to providing dynamism into the research and maintaining ‘a balance between digital and face-to-face communication’ (Siemens, 2009). This inter-institutional approach and the manifold expertise, in turn, extends and diversifies the trajectories of the project’s research.
Estelle Guéville and David J. Wrisley
Further readings and presentations:
- Ruzzier, Chiara. (2010). Entre universités et ordres mendiants. La Miniaturisation de la Bible au XIIIe siècle. PhD Thesis. Université Paris 1 Panthéon-Sorbonne, Paris.
- Siemens, Lynne. (2009). ‘It’s a team if you use “reply all”: an exploration of research teams in digital humanities environments’ in Literary and Linguistic Computing, vol. 24, no. 2, pp. 225-233. doi:10.1093/llc/fqp009
- Widner, Michael. (2017). ‘Toward Text-Mining the Middle Ages’ in The Routledge Research Companion to Digital Medieval Literature, eds. by Jennifer E. Boyle, and Helen J. Burgess, United Kingdom: Taylor & Francis Group, pp. 131-144.
Suggested citation
Guéville, Estelle and Wrisley, David J. (09 December 2021). Abstract - Assembling Legacy Data for AI. The Case of the Paris Bible Project. Paris Bible Project. https://doi.org/10.5281/zenodo.8040632
This post is published with a CC BY-SA-NC 4.0 International license.