These guidelines come from our forthcoming paper to be published in the Journal of Data Mining and Digital Humanities. They will be updated and archived here as the project evolves.
Transcription Guidelines for the Paris Bible Project - v1.0
Each project, each manuscript, each hand being different, the list of possible Unicode characters to be used for transcription can evolve and change accordingly. Our suggestion is that there is not one definitive list that can be used and applied to any project, and especially to all projects, rather a quite large number of possibilities, which must be project-specific and aligned with project objectives. To establish a specific Unicode list, as for any transcription process, setting principles to follow is fundamental: what do we transcribe and how? How do we prioritize the principles? How do we encode variance, exceptions, and aesthetic scribal habits? In what follows, we outline five basic principles for transcribing for machine learning which have emerged at this stage of our research.
- Although transcribing for machine learning is fundamentally an interpretative activity, the first principle to abide by should be: the transcription must be as close as possible as what you see in the manuscript, even if this is not enough to render all the variety and inconsistencies. If there is a basic character in Unicode which corresponds to what you see, and that letter exhibits no significant variance across your document which matches your research problem, there is no reason to opt for a more complex character encoding.
- Since there is always the possibility of variance, the second principle is that it is useful to have a preliminary “scan” of the document you want to transcribe, or through samples of the corpus you will be working with, before beginning transcription. A first pass of transcription allows you to create a working list of Unicode characters.
- Since there is inevitable variety in hands, the third principle is to take care when attempting to encode in maximal granularity the “aesthetic” quality to some graphemes that we don’t want to reproduce (in our case, this meant spacing, the letters v and p), or a variance in the placement certain abbreviations (for example, the macron) which create too great a variety of rare encoding choices or difficulty in ordering the characters in the transcription.
- Since we are creating machine-readable transcriptions for the purpose of computational analysis, the fourth principle is not to choose Unicode characters that will not display in regular text editors (i.e., MUFI green letters) or other platforms.
- The fifth principle is not to make design decisions that will be undone by common NLP practices (lower casing, tokenization, removing punctuation) for working with unstructured text.
- In the case of contradictory decisions, we add a coda to our five principles: there is a need to prioritize the principles.
- For abbreviations (such as the macron), we made an arbitrary decision to consider what letter the abbreviation replaces, rather than where it is placed. Let’s take the example of the word “bien”. In the manuscript BnF français 24428, the scribe wrote it in ways that can be transcribed either biē or bīe, the macron being often written on top of both letters, right in the middle. That is to say that transcription is never really divorced from an interpretation of what was meant by the scribe, even though we do not normalise.
- In the case of spaces and special letter forms, a larger sample should be consulted. A single occurrence is an exception not a principle and doesn’t reflect particular features indicating scribal practices. How you decide on that larger sample is a function of the scope of any particular project. Do not create a new Unicode or a spacing decision based on one example, but rather on a larger sample.
These principles are not universal ones, and as we have found they are enriched by work across different domains, periods of time and languages. The logical conclusion of these principles lies in the fact that a general model for transcribing medieval Latin, or French or Arabic is unlikely, but rather a variety of sub-models is a desirable goal for a language-specific community of scholars in digital manuscript studies.
List of Unicode Characters
|Unicode symbol||Unicode number||Letter(s)||Stands for||Notes||Example||Image|
|̄||0304||a; e; i; u; c; m; n; can be used on most letters||Often for a m or a n||Combining letter|
|́||0301||l, q, ꝺ, h, b, p, t,||varies||Combining letter||ꝺ́s = deus|
|ꝫ||A76B||/||ue, us||Standalone character||often with qꝫ|
|̈||0308||/||ua||Combining letter||q̈i = quasi|
|̾||033E||t||er||Combining letter||t̾re = terre|
|̕||0315||l, q, ꝺ, h, b, p, t,||varies||Combining letter|
|ꝺ||A77A||/||d||Standalone character: insular d|
|ꝛ||A75B||/||r||Standalone character: rotund r||After o, p, b and sometimes ch, y, q etc. (pulchrum)|
|ſ||017F||/||s||Standalone character: long s|
|;||003B||/||Standalone character: upside down semicolon|
|̇||0307||y||not sure if always a part of the letter y|
|ͦ||0366||q, g||Combining letter||quo (qͦ)|
|2||0032||/||MUFI E8B3 gives q + r rotunda||Standalone character||quia = q2|
|ͥ||0365||t, m, s, p, x, h||tibi, mihi, sibi, pri, christi||Superscript letter|
|ͨ||0368||n, h||nec||Superscript letter|
|·||00B7||/||Standalone character: middle dot|