Introduction

TEI (Text Encoding Initiative) is an international standard for digitizing and sharing texts in humanities research. This article introduces the process of customizing a TEI ODD file to match the output format of the NDL Classical Book OCR-Lite application.

ODD (One Document Does it all) is a mechanism for customizing TEI schemas, allowing you to define your own schema containing only the necessary elements and attributes.

Background: Development of the NDL Classical Book OCR-Lite Application

I am developing an application that runs NDL Classical Book OCR-Lite on Japanese classical books and outputs the results as standard TEI/XML.

The output TEI XML was designed to include the following information:

  • Text information: Character strings recognized by OCR
  • Layout information: Coordinate information for each line (bounding boxes)
  • Image references: IIIF (International Image Interoperability Framework) compatible image URLs
  • Metadata: Document title, processing information, etc.

I wrote the schema used by this application in ODD. Below, I introduce the customization process.

Customization Approaches

1. Initial Approach: Using Standard Modules

Initially, I created the ODD using TEI standard modules:

<schemaSpec ident="ndl_koten_ocr" start="TEI" prefix="tei_">
  <moduleRef key="tei"/>
  <moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
  <moduleRef key="core" include="p title name resp respStmt lb pb graphic"/>
  <moduleRef key="textstructure" include="TEI text body"/>
  <moduleRef key="transcr" include="facsimile surface zone"/>
</schemaSpec>

Importance of the include Attribute

The include attribute of the moduleRef element is an important feature that selectively includes only specific elements from a module:

<!-- Select only 5 elements from the header module -->
<moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>

Benefits of using the include attribute:

  • You can explicitly specify only the required elements
  • The schema is smaller than when the entire module is included
  • It is clear which elements are being used

When the include attribute is not used:

<!-- Include the entire module (not recommended) -->
<moduleRef key="header"/>

In this case, all elements of the header module (encodingDesc, profileDesc, revisionDesc, etc.) are included.

How to specify multiple elements:

<!-- List multiple elements separated by spaces -->
<moduleRef key="core" include="p title name resp respStmt lb pb graphic"/>

Exclusion using the except attribute:

<!-- To exclude specific elements -->
<moduleRef key="core" except="hi del add note"/>

The except attribute is the opposite of include: it removes the listed elements from a module and keeps everything else. It is useful when most elements are needed but a few are not.

Criteria for choosing include vs except:

  • When few elements are needed -> Use include
  • When few elements are unnecessary -> Use except
  • When clarity is important -> Use include (it is clear what is being used)

However, even with this method, related model classes and attribute classes are automatically included, so complete minimization was not achievable.

2. Improved Approach: Deleting Unnecessary Elements

Next, I explicitly deleted unnecessary classes:

<!-- Delete unnecessary model classes -->
<classSpec ident="model.emphLike" type="model" mode="delete"/>
<classSpec ident="model.highlighted" type="model" mode="delete"/>
<!-- Delete unnecessary attribute classes -->
<classSpec ident="att.datable" type="atts" mode="delete"/>
<classSpec ident="att.editLike" type="atts" mode="delete"/>

3. Final Approach: Minimal Configuration Definition

Ultimately, I adopted the method of explicitly defining only the necessary elements and attributes:

<schemaSpec ident="ndl_koten_ocr_minimal" start="TEI" prefix="tei_" docLang="ja">
  <!-- Define only the required attribute classes -->
  <classSpec ident="att.global" type="atts" mode="add">
    <attList>
      <attDef ident="xml:id" mode="add">
        <desc xml:lang="ja">...</desc>
        <datatype><dataRef key="ID"/></datatype>
      </attDef>
      <attDef ident="xml:lang" mode="add">
        <desc xml:lang="ja">...</desc>
        <datatype><dataRef key="teidata.language"/></datatype>
      </attDef>
    </attList>
  </classSpec>
</schemaSpec>

Implementation Details

Managing Coordinate Information

To manage OCR coordinate information, I defined a dedicated attribute class:

<classSpec ident="att.coordinated" type="atts" mode="add">
  <desc xml:lang="ja">...</desc>
  <attList>
    <attDef ident="ulx" mode="add">
      <desc xml:lang="ja">...</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="uly" mode="add">
      <desc xml:lang="ja">...</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="lrx" mode="add">
      <desc xml:lang="ja">...</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
    <attDef ident="lry" mode="add">
      <desc xml:lang="ja">...</desc>
      <datatype><dataRef key="teidata.numeric"/></datatype>
    </attDef>
  </attList>
</classSpec>

IIIF Support Implementation

To integrate with IIIF manifests, I added the sameAs attribute:

<elementSpec ident="facsimile" mode="add">
  <desc xml:lang="ja">...</desc>
  <attList>
    <attDef ident="sameAs" mode="add">
      <desc xml:lang="ja">IIIF ... URL</desc>
      <datatype><dataRef key="teidata.pointer"/></datatype>
    </attDef>
  </attList>
</elementSpec>

Line Number Format Constraints

Using Schematron to constrain the line number format:

<constraintSpec ident="page-numbering" scheme="schematron">
  <constraint>
    <sch:rule context="tei:lb[@n]">
      <sch:assert test="matches(@n, '^\d+\.\d+$')">
        ... (1.1, 2.3)
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>

How to Write Examples

Basic Structure of exemplum and egXML

In ODD, usage examples are written with the exemplum element, which wraps the example itself in an egXML element.
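A minimal sketch of this structure, assuming an lb element whose n attribute follows the page.line format used in this schema. The sample value is illustrative, and lb is assumed to come from the standard core module. Note that egXML is declared in the TEI Examples namespace rather than the TEI namespace.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0" ident="lb" module="core" mode="change">
  <!-- exemplum wraps one usage example for the element -->
  <exemplum>
    <!-- egXML and its content live in the TEI Examples namespace -->
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <lb n="1.1"/>
    </egXML>
  </exemplum>
</elementSpec>
```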

Writing Complex Examples

A single egXML can also hold several elements together, which shows how they are used in combination.
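As a sketch of such a combined example, the following shows a page surface with its image and one line zone. The image URL, identifier, and coordinate values are hypothetical placeholders, not values from the application.

```xml
<exemplum xmlns="http://www.tei-c.org/ns/1.0">
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <!-- One page image with the bounding box of a single recognized line -->
    <surface>
      <graphic url="https://example.org/iiif/book1/page1/full/max/0/default.jpg"/>
      <zone xml:id="zone-1-1" ulx="820" uly="110" lrx="880" lry="1240"/>
    </surface>
  </egXML>
</exemplum>
```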

Namespace Issues and Solutions

Problem: TEI Elements Not Recognized

When TEI elements (especially root elements such as TEI) are written inside egXML, namespace issues can arise and the ODD may fail to validate.

Solution 1: Use Namespace Prefixes

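One way to read this solution is sketched below, on the assumption that the prefix teix is bound to the TEI Examples namespace (http://www.tei-c.org/ns/Examples). The prefix name and the sample content are illustrative. Because every element in the example carries the prefix, none of it is in the TEI namespace.

```xml
<exemplum xmlns="http://www.tei-c.org/ns/1.0"
          xmlns:teix="http://www.tei-c.org/ns/Examples">
  <teix:egXML>
    <!-- The root element is written in the Examples namespace via the teix prefix -->
    <teix:TEI>
      <teix:teiHeader>
        <!-- header content omitted -->
      </teix:teiHeader>
      <teix:text>
        <teix:body>
          <teix:p>Sample text</teix:p>
        </teix:body>
      </teix:text>
    </teix:TEI>
  </teix:egXML>
</exemplum>
```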

Solution 2: Simplify with Comments

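A sketch of this approach, assuming it is acceptable to describe the outer structure in a comment and show only the inner part of the document; the sample content is illustrative.

```xml
<exemplum xmlns="http://www.tei-c.org/ns/1.0">
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <!-- The enclosing TEI root element and teiHeader are omitted here -->
    <body>
      <p><lb n="1.1"/>Sample text</p>
    </body>
  </egXML>
</exemplum>
```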

Solution 3: Omit the Example

To avoid validation errors, completely omitting problematic examples is also an option.

Providing Examples in Multiple Languages

Examples can also be provided in more than one language.
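A minimal sketch, assuming each language gets its own exemplum tagged with xml:lang. The title element and the sample titles are illustrative choices.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0" ident="title" module="core" mode="change">
  <!-- Japanese example -->
  <exemplum xml:lang="ja">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <title>源氏物語</title>
    </egXML>
  </exemplum>
  <!-- English example -->
  <exemplum xml:lang="en">
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <title>The Tale of Genji</title>
    </egXML>
  </exemplum>
</elementSpec>
```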

Showing Attribute Usage Examples

Examples are also a good place to show the range of values an attribute can take.
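As an illustration, the sketch below shows two zone elements with different coordinate values. The pixel values are hypothetical, and zone is assumed to come from the standard transcr module.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0" ident="zone" module="transcr" mode="change">
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <!-- A tall, narrow box for a vertical line near the right edge of the page -->
      <zone ulx="820" uly="110" lrx="880" lry="1240"/>
      <!-- The next line, further to the left and slightly shorter -->
      <zone ulx="740" uly="110" lrx="800" lry="1180"/>
    </egXML>
  </exemplum>
</elementSpec>
```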

Display in Roma

These examples are automatically included in the HTML documentation generated by the Roma tool. Having examples:

  • Makes element usage clear
  • Shows actual attribute values
  • Makes it easier for schema users to implement

Japanese Language Support

Multilingual Descriptions

Descriptions in the ODD file can be provided in both Japanese and English.

Document Language Setting

To use the Japanese interface in the Roma tool, set the documentation language to Japanese with docLang="ja" on schemaSpec, as in the minimal configuration shown earlier.

Actual Output Example

The application produces TEI XML that conforms to the schema generated from this ODD.
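As an illustration only, the sketch below shows what such output could look like, based on the four kinds of information listed in the background section. All URLs, identifiers, coordinates, and header wording are hypothetical, and linking each lb to its zone with the facs attribute is an assumption rather than the application's confirmed behavior.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="ja">
  <!-- Metadata: document title and processing information -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>源氏物語</title>
        <respStmt>
          <resp>OCR</resp>
          <name>NDL Classical Book OCR-Lite</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <p>Generated automatically from OCR output.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Digitized images served via IIIF.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- Image references and layout: one surface per page, one zone per line -->
  <facsimile sameAs="https://example.org/iiif/book1/manifest.json">
    <surface xml:id="page-1">
      <graphic url="https://example.org/iiif/book1/page1/full/max/0/default.jpg"/>
      <zone xml:id="zone-1-1" ulx="820" uly="110" lrx="880" lry="1240"/>
      <zone xml:id="zone-1-2" ulx="740" uly="110" lrx="800" lry="1180"/>
    </surface>
  </facsimile>
  <!-- Text: recognized strings, with line numbers in page.line format -->
  <text>
    <body>
      <p>
        <lb n="1.1" facs="#zone-1-1"/>いづれの御時にか
        <lb n="1.2" facs="#zone-1-2"/>女御更衣あまたさぶらひたまひけるなかに
      </p>
    </body>
  </text>
</TEI>
```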

Using the Roma Tool

Loading the ODD File

  1. Access Roma
  2. Upload the created ODD file from “Upload ODD”
  3. Perform additional customization as needed

Schema Generation

Schemas can be generated from Roma in the following formats:

  • RelaxNG Schema
  • W3C Schema (XSD)
  • DTD
  • Schematron

HTML Document Generation

HTML documentation can be generated from the “Documentation” tab in Roma. In the minimal configuration version, only the elements and attributes actually used are documented.

Troubleshooting

Common Issues and Solutions

  1. Errors with TEI elements inside egXML

    • Problem: The <TEI> element causes errors inside <egXML>
    • Solution: Use namespace prefixes or simplify the example
  2. mode="keep" is invalid

    • Problem: mode="keep" is not recognized in attDef
    • Solution: Use mode="change"
  3. Too many unnecessary classes

    • Problem: Using standard modules includes unnecessary classes
    • Solution: Define only what is needed using mode="add"
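For the second issue above, the mode attribute accepts only add, delete, change, and replace, so an existing attribute is modified with mode="change". A minimal sketch, assuming lb comes from the standard core module and inherits n from att.global; making n required is an illustrative choice.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0" ident="lb" module="core" mode="change">
  <attList>
    <!-- n already exists on lb, so it is changed rather than added or "kept" -->
    <attDef ident="n" mode="change" usage="req"/>
  </attList>
</elementSpec>
```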

Summary

There are multiple approaches to TEI ODD customization:

  1. Using standard modules: Easy but includes many unnecessary elements
  2. Deletion method: Delete unnecessary items from the standard
  3. Addition method: Explicitly add only what is needed (recommended)

It is important to choose the appropriate method based on project requirements. For the NDL Classical Book OCR-Lite application, defining a minimal configuration resulted in a clear and manageable schema.

License

This ODD file is provided under the Creative Commons Attribution 4.0 International License.