Introduction
TEI (Text Encoding Initiative) is an international standard for digitizing and sharing texts in humanities research. This article introduces the process of customizing a TEI ODD file to match the output format of the NDL Classical Book OCR-Lite application.
ODD (One Document Does it all) is a mechanism for customizing TEI schemas, allowing you to define your own schema containing only the necessary elements and attributes.
Background: Development of the NDL Classical Book OCR-Lite Application
I am developing an application that runs NDL Classical Book OCR-Lite on Japanese classical books and outputs the results as standard TEI/XML.
The output TEI XML was designed to include the following information:
- Text information: Character strings recognized by OCR
- Layout information: Coordinate information for each line (bounding boxes)
- Image references: IIIF (International Image Interoperability Framework) compatible image URLs
- Metadata: Document title, processing information, etc.
I wrote the schema used by this application in ODD. Below, I introduce the customization process.
Customization Approaches
1. Initial Approach: Using Standard Modules
Initially, I created the ODD using TEI standard modules:
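A minimal sketch of what such a module-based ODD might look like. The schema identifier ndl_ocr_lite and the choice of modules are assumptions for illustration, not the application's actual ODD.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite" start="TEI">
  <!-- tei: infrastructure module (classes, datatypes, macros) -->
  <moduleRef key="tei"/>
  <!-- core: common elements such as p, lb, pb, graphic -->
  <moduleRef key="core"/>
  <!-- header: teiHeader and its descendants -->
  <moduleRef key="header"/>
  <!-- textstructure: TEI, text, body -->
  <moduleRef key="textstructure"/>
  <!-- transcr: facsimile, surface, zone -->
  <moduleRef key="transcr"/>
</schemaSpec>
```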
Importance of the include Attribute
The include attribute of the moduleRef element is an important feature that selectively includes only specific elements from a module:
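As a hedged illustration, the following sketch restricts the header module to the elements needed for a basic fileDesc. The element selection and the identifier ndl_ocr_lite are assumptions.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite" start="TEI">
  <moduleRef key="tei"/>
  <!-- Only these header elements enter the schema;
       encodingDesc, profileDesc, revisionDesc and the rest are left out -->
  <moduleRef key="header"
             include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
</schemaSpec>
```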
Benefits of using the include attribute:
- You can explicitly specify only the required elements
- The schema size is smaller than including the entire module
- It is clear which elements are being used
When the include attribute is not used:
In this case, all elements of the header module (encodingDesc, profileDesc, revisionDesc, etc.) are included.
How to specify multiple elements:
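The value of include is a whitespace-separated list of element names. The sketch below applies this to several modules at once; the selected elements are an assumption about what an OCR output schema might need.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite" start="TEI">
  <moduleRef key="tei"/>
  <!-- Element names are separated by spaces, not commas -->
  <moduleRef key="core" include="p lb pb graphic title"/>
  <moduleRef key="textstructure" include="TEI text body"/>
  <moduleRef key="transcr" include="facsimile surface zone"/>
</schemaSpec>
```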
Exclusion using the except attribute:
The except attribute of moduleRef is the opposite of include, excluding specific elements from a module. It is useful when most elements are needed but a few are not.
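A short sketch of exclusion. Note that on moduleRef the attribute is spelled except in TEI ODD. The three excluded elements are chosen only for illustration.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite" start="TEI">
  <moduleRef key="tei"/>
  <!-- Everything in the header module except these three elements -->
  <moduleRef key="header"
             except="encodingDesc profileDesc revisionDesc"/>
</schemaSpec>
```

include and except are alternatives; a single moduleRef should carry only one of them.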
Criteria for choosing include vs except:
- When few elements are needed -> Use include
- When few elements are unnecessary -> Use except
- When clarity is important -> Use include (it is clear what is being used)
However, even with this method, related model classes and attribute classes are automatically included, so complete minimization was not achievable.
2. Improved Approach: Deleting Unnecessary Elements
Next, I explicitly deleted unnecessary classes:
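A sketch of the deletion approach, assuming the classes to remove are global attribute classes defined in the tei module. Which classes the original ODD actually deleted is not stated, so the three below are examples only.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="core" include="p lb pb graphic"/>
  <moduleRef key="textstructure" include="TEI text body"/>
  <!-- Delete attribute classes the OCR output never uses -->
  <classSpec ident="att.global.rendition" type="atts" mode="delete"/>
  <classSpec ident="att.global.responsibility" type="atts" mode="delete"/>
  <classSpec ident="att.global.source" type="atts" mode="delete"/>
</schemaSpec>
```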
3. Final Approach: Minimal Configuration Definition
Ultimately, I adopted the method of explicitly defining only the necessary elements and attributes:
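The following is a minimal sketch of this idea, not the actual ODD: elements are declared from scratch with mode="add" instead of being pulled in through moduleRef. The identifier ndl_ocr_lite_min, the content models, and the descriptions are assumptions.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite_min" start="TEI">
  <!-- No moduleRef: only what is declared here exists in the schema -->
  <elementSpec ident="lb" mode="add">
    <desc>Marks the beginning of a line recognized by OCR.</desc>
    <content><empty/></content>
    <attList>
      <attDef ident="n" usage="opt">
        <desc>Line number.</desc>
        <datatype><dataRef name="string"/></datatype>
      </attDef>
    </attList>
  </elementSpec>
  <elementSpec ident="p" mode="add">
    <desc>A block of recognized text.</desc>
    <content>
      <!-- Any mixture of text and lb -->
      <alternate minOccurs="0" maxOccurs="unbounded">
        <textNode/>
        <elementRef key="lb"/>
      </alternate>
    </content>
  </elementSpec>
  <!-- TEI, teiHeader, text, body, etc. would be declared the same way -->
</schemaSpec>
```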
Implementation Details
Managing Coordinate Information
To manage OCR coordinate information, I defined a dedicated attribute class:
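A hedged sketch of such a class. The class name att.ocr.coords is hypothetical; the attribute names ulx, uly, lrx, and lry mirror those of TEI's standard att.coordinated class, and the integer datatype is an assumption.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite_min" start="TEI">
  <!-- Hypothetical attribute class for bounding boxes -->
  <classSpec ident="att.ocr.coords" type="atts" mode="add">
    <desc>Bounding-box coordinates of an OCR line, in image pixels.</desc>
    <attList>
      <attDef ident="ulx" usage="opt">
        <desc>x coordinate of the upper-left corner.</desc>
        <datatype><dataRef name="nonNegativeInteger"/></datatype>
      </attDef>
      <attDef ident="uly" usage="opt">
        <desc>y coordinate of the upper-left corner.</desc>
        <datatype><dataRef name="nonNegativeInteger"/></datatype>
      </attDef>
      <attDef ident="lrx" usage="opt">
        <desc>x coordinate of the lower-right corner.</desc>
        <datatype><dataRef name="nonNegativeInteger"/></datatype>
      </attDef>
      <attDef ident="lry" usage="opt">
        <desc>y coordinate of the lower-right corner.</desc>
        <datatype><dataRef name="nonNegativeInteger"/></datatype>
      </attDef>
    </attList>
  </classSpec>
  <!-- An element receives the attributes by joining the class -->
  <elementSpec ident="zone" mode="add">
    <desc>Region of the page image occupied by one line.</desc>
    <classes><memberOf key="att.ocr.coords"/></classes>
    <content><empty/></content>
  </elementSpec>
</schemaSpec>
```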
IIIF Support Implementation
To integrate with IIIF manifests, I added the sameAs attribute:
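One way this could look, assuming sameAs is attached to surface and points at an IIIF canvas URI. The source does not say which element carries the attribute, so the host element and content model here are assumptions.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite_min" start="TEI">
  <elementSpec ident="surface" mode="add">
    <desc>One page image and the line regions found on it.</desc>
    <content>
      <sequence>
        <elementRef key="graphic"/>
        <elementRef key="zone" minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
    </content>
    <attList>
      <attDef ident="sameAs" usage="opt">
        <desc>URI of the corresponding IIIF canvas.</desc>
        <!-- anyURI is the XSD datatype; no TEI datatype module is needed -->
        <datatype><dataRef name="anyURI"/></datatype>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>
```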
Line Number Format Constraints
I used Schematron to constrain the line number format:
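A sketch of a Schematron constraint embedded in an ODD. The actual line number format is not given in this article, so the digits-only pattern is an assumption, as is the constraint identifier lb-n-format. The matches() function requires an XPath 2.0 query binding, and the tei prefix is assumed to be bound to the TEI namespace when the Schematron rules are extracted.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            ident="ndl_ocr_lite" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="core" include="p lb"/>
  <elementSpec ident="lb" mode="change">
    <constraintSpec ident="lb-n-format" scheme="schematron">
      <constraint>
        <!-- Applies to every lb that has an n attribute -->
        <sch:rule context="tei:lb[@n]">
          <sch:assert test="matches(@n, '^[0-9]+$')">
            The n attribute of lb must consist of digits only.
          </sch:assert>
        </sch:rule>
      </constraint>
    </constraintSpec>
  </elementSpec>
</schemaSpec>
```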
How to Write Examples
Basic Structure of exemplum and egXML
In ODD, you use the exemplum and egXML elements to write usage examples:
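A minimal sketch of the structure: exemplum wraps egXML, and egXML uses the dedicated TEI Examples namespace so that its content is treated as an example rather than as part of the ODD itself. The lb element and its attribute value are illustrative.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="lb" mode="change">
  <exemplum>
    <!-- Content of egXML lives in the Examples namespace -->
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <lb n="1"/>
    </egXML>
  </exemplum>
</elementSpec>
```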
Writing Complex Examples
When showing examples containing multiple elements:
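A sketch of a larger example with nested elements. The URLs under example.org, the xml:id, and the coordinate values are placeholders, not real data.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="surface" mode="change">
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <surface sameAs="https://example.org/iiif/manifest/canvas/1">
        <graphic url="https://example.org/iiif/image/1/full/max/0/default.jpg"/>
        <!-- One zone per recognized line -->
        <zone xml:id="zone-1-1" ulx="120" uly="80" lrx="180" lry="920"/>
        <zone xml:id="zone-1-2" ulx="60" uly="80" lrx="118" lry="900"/>
      </surface>
    </egXML>
  </exemplum>
</elementSpec>
```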
Namespace Issues and Solutions
Problem: TEI Elements Not Recognized
When using TEI elements (especially root elements) within egXML, namespace issues may arise:
Solution 1: Use Namespace Prefixes
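One possible reading of this solution, offered as an assumption: bind a prefix (teix is a common convention) to the TEI Examples namespace and use it explicitly on egXML and everything inside it, so that no element in the example can be mistaken for a real TEI element.

```xml
<exemplum xmlns="http://www.tei-c.org/ns/1.0"
          xmlns:teix="http://www.tei-c.org/ns/Examples">
  <teix:egXML>
    <!-- Every example element is explicitly in the Examples namespace -->
    <teix:TEI>
      <teix:teiHeader><!-- header content --></teix:teiHeader>
      <teix:text><!-- text content --></teix:text>
    </teix:TEI>
  </teix:egXML>
</exemplum>
```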
Solution 2: Simplify with Comments
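A sketch of this simplification, assuming the troublesome part is the root element: the TEI root and header are replaced by a comment, and only the relevant fragment is shown. The text content is a placeholder.

```xml
<exemplum xmlns="http://www.tei-c.org/ns/1.0">
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <!-- Root TEI element and teiHeader omitted for brevity -->
    <body>
      <p><lb n="1"/>(recognized text of line 1)</p>
    </body>
  </egXML>
</exemplum>
```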
Solution 3: Omit the Example
To avoid validation errors, completely omitting problematic examples is also an option.
Providing Examples in Multiple Languages
When providing multilingual examples:
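A sketch that assumes one exemplum per language, distinguished by xml:lang. The titles are invented sample values.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="title" mode="change">
  <!-- English example -->
  <exemplum xml:lang="en">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <title>Sample classical book</title>
    </egXML>
  </exemplum>
  <!-- Japanese example -->
  <exemplum xml:lang="ja">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <title>古典籍サンプル</title>
    </egXML>
  </exemplum>
</elementSpec>
```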
Showing Attribute Usage Examples
When showing various attribute values:
Display in Roma
These examples are automatically included in the HTML documentation generated by the Roma tool. Having examples:
- Makes element usage clear
- Shows actual attribute values
- Makes it easier for schema users to implement
Japanese Language Support
Multilingual Descriptions
Descriptions can be provided in both Japanese and English within the ODD file:
Document Language Setting
To use the Japanese interface in the Roma tool:
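One likely form of this setting, stated as an assumption: the docLang attribute on schemaSpec selects which language's gloss and desc texts are used in the generated documentation.

```xml
<!-- docLang="ja" asks for Japanese documentation strings where available -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="ndl_ocr_lite" start="TEI" docLang="ja">
  <moduleRef key="tei"/>
  <moduleRef key="core" include="p lb"/>
</schemaSpec>
```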
Actual Output Example
An example of TEI XML generated from this ODD:
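The sketch below is a hypothetical reconstruction based on the four kinds of information listed earlier (text, layout, image references, metadata). It is not real output of the application: the element choices, the example.org URLs, the coordinates, and the text are all placeholders.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Metadata -->
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample classical book</title></titleStmt>
      <publicationStmt><p>Placeholder publication statement.</p></publicationStmt>
      <sourceDesc><p>Placeholder source description.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- Image reference and layout: one surface per page, one zone per line -->
  <facsimile>
    <surface sameAs="https://example.org/iiif/manifest/canvas/1">
      <graphic url="https://example.org/iiif/image/1/full/max/0/default.jpg"/>
      <zone xml:id="zone-1-1" ulx="120" uly="80" lrx="180" lry="920"/>
    </surface>
  </facsimile>
  <!-- Text: each lb points back to its zone -->
  <text>
    <body>
      <p><lb n="1" corresp="#zone-1-1"/>(recognized text of line 1)</p>
    </body>
  </text>
</TEI>
```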
Using the Roma Tool
Loading the ODD File
- Access Roma
- Upload the created ODD file from “Upload ODD”
- Perform additional customization as needed
Schema Generation
Schemas can be generated from Roma in the following formats:
- RELAX NG Schema
- W3C Schema (XSD)
- DTD
- Schematron
HTML Document Generation
HTML documentation can be generated from the “Documentation” tab in Roma. In the minimal configuration version, only the elements and attributes actually used are documented.
Troubleshooting
Common Issues and Solutions
Errors with TEI elements inside egXML
- Problem: The <TEI> element causes errors inside <egXML>
- Solution: Use namespace prefixes or simplify the example
mode="keep" is invalid
- Problem: mode="keep" is not recognized in attDef
- Solution: Use mode="change" (the valid values of mode are add, delete, change, and replace)
Too many unnecessary classes
- Problem: Using standard modules includes unnecessary classes
- Solution: Define only what is needed using mode="add"
Summary
There are multiple approaches to TEI ODD customization:
- Using standard modules: Easy but includes many unnecessary elements
- Deletion method: Delete unnecessary items from the standard
- Addition method: Explicitly add only what is needed (recommended)
It is important to choose the appropriate method based on project requirements. In the case of NDL Classical Book OCR, defining a minimal configuration resulted in a clear and manageable schema.
References
- TEI Guidelines
- ODD: One Document Does it all
- Roma: ODD customization tool
- NDL Classical Book OCR
- IIIF (International Image Interoperability Framework)
License
This ODD file is provided under the Creative Commons Attribution 4.0 International License.