Trying "oitei" - An Automatic Conversion Tool from OpenITI mARkdown to TEI XML

Introduction

In the OpenITI (Open Islamicate Texts Initiative) project, which handles historical texts from the Islamicate world, texts can be tagged using a lightweight notation called mARkdown instead of TEI/XML.

While TEI/XML is a powerful international standard for structuring texts, it is awkward for right-to-left (RTL) languages such as Arabic: mixing left-to-right XML tags into RTL text causes display problems in editors. mARkdown was designed to solve this problem.

In this article, we will try running oitei, a Python tool that automatically converts mARkdown texts to TEI XML.

What is oitei?

A Python library for converting OpenITI mARkdown to TEI XML
Outputs XML conforming to the OpenITI TEI Schema
Published on PyPI and installable via pip install
Dependencies: oimdp (mARkdown parser), lxml

https://github.com/OpenITI/oitei

Installation

Python 3.8 or later is required. oimdp (OpenITI mARkdown Parser) and lxml are automatically installed as dependencies.
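After installing with pip (pip install oitei), a quick check like the one below can confirm that the environment meets the requirements stated above. This is only a minimal sketch: the helper names missing_packages and python_ok are made up for illustration, and the package list simply mirrors the dependencies named in this article.

```python
import importlib.util
import sys

# Packages named in this article: oitei plus its two dependencies.
REQUIRED = ("oitei", "oimdp", "lxml")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(min_version=(3, 8)):
    """Check the interpreter against the minimum version the article states."""
    return sys.version_info >= min_version


print("Python version OK:", python_ok())
print("Missing packages:", missing_packages() or "none")
```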

OpenITI mARkdown Notation

mARkdown files consist of three parts:

Magic value (line 1): ######OpenITI#
Metadata: Lines starting with #META#
Body text: Written after #META#Header#End#
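To make the three-part layout concrete, here is a small sketch that splits a mARkdown string into metadata lines and body lines. It is not how oimdp parses files; split_markdown is a hypothetical helper that only follows the description above.

```python
MAGIC = "######OpenITI#"
META_PREFIX = "#META#"
HEADER_END = "#META#Header#End#"


def split_markdown(text):
    """Split mARkdown text into (metadata_lines, body_lines)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != MAGIC:
        raise ValueError("line 1 must be the magic value " + MAGIC)

    metadata, body = [], []
    in_body = False
    for line in lines[1:]:
        if in_body:
            body.append(line)
        elif line.strip() == HEADER_END:
            # Everything after this marker is body text.
            in_body = True
        elif line.startswith(META_PREFIX):
            metadata.append(line[len(META_PREFIX):].strip())
        # Blank lines before the end-of-header marker are skipped.
    return metadata, body


demo = "######OpenITI#\n#META# title: demo\n#META#Header#End#\n# A paragraph\n"
meta, body = split_markdown(demo)
print(meta)
print(body)
```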

Main Tags

Notation	Meaning
`### |`	Heading, level 1 (chapter)
`### ||`	Heading, level 2 (section)
`### $`	Biographical entry
`#`	Start of paragraph
`@P02 name`	Person name (includes the following 2 words)
`@T11 place`	Place name (includes the following 1 word)
`@YB732`	Birth year (Hijri year 732)
`@YD808`	Death year (Hijri year 808)
`%~%`	Hemistich (verse line) separator

The two-digit number after a named entity tag (@P, @T, etc.) has two parts: the first digit is the entity number, and the second digit is the number of following words to include in the name. For example, @P02 Ibn Khaldun means “include the following 2 words (Ibn Khaldun) as a person name.”
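As an illustration of how these tags break down, the sketch below pulls a tag apart with regular expressions. The function describe_tag and its field names are hypothetical and are not part of oimdp or oitei; the first digit is kept as an uninterpreted number, and only the tags listed in the table above are handled.

```python
import re

# Entity tags: @P or @T followed by exactly two digits.
ENTITY_RE = re.compile(r"@(P|T)(\d)(\d)")
# Year tags: @YB (birth) or @YD (death) followed by a Hijri year.
YEAR_RE = re.compile(r"@Y(B|D)(\d+)")


def describe_tag(tag):
    """Return a dict describing a tag, or None if it is not recognized."""
    match = ENTITY_RE.fullmatch(tag)
    if match:
        kind = {"P": "person", "T": "place"}[match.group(1)]
        return {
            "kind": kind,
            "first_digit": int(match.group(2)),
            "word_count": int(match.group(3)),
        }
    match = YEAR_RE.fullmatch(tag)
    if match:
        kind = {"B": "birth", "D": "death"}[match.group(1)]
        return {"kind": kind, "year_ah": int(match.group(2))}
    return None


for tag in ("@P02", "@T11", "@YB732", "@YD808"):
    print(tag, describe_tag(tag))
```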

Creating a Sample File

Create a file named sample_markdown.md with the following content.
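For readers who want something to experiment with, the sketch below writes a small stand-in file using only the notation described in this article. The content is an illustrative assumption, not the original sample, and it has not been checked against oimdp; the metadata line and the helper write_sample are made up for this example.

```python
from pathlib import Path

# Illustrative content only: magic value, one metadata line, end-of-header
# marker, a chapter heading, a biographical entry, a paragraph with named
# entity and year tags, and a verse line with a hemistich separator.
SAMPLE = """\
######OpenITI#

#META# title: Illustrative sample (not a real OpenITI text)
#META#Header#End#

### | Chapter One
### $ A biographical entry
# @P02 Ibn Khaldun was born in @T11 Tunis @YB732 and died @YD808
# first hemistich %~% second hemistich
"""


def write_sample(path="sample_markdown.md"):
    """Write the illustrative sample to disk as UTF-8 and return its path."""
    target = Path(path)
    target.write_text(SAMPLE, encoding="utf-8")
    return target
```

Calling write_sample() creates sample_markdown.md in the current directory.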

Running the Conversion

The conversion is completed in just 4 lines.
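A conversion script along these lines might look like the sketch below. It assumes that oitei exposes a convert() function that takes the mARkdown text and returns an object with a tostring() method; please check both names against the oitei README before relying on them. The wrapper convert_file and the output filename sample_tei.xml are hypothetical.

```python
from pathlib import Path


def convert_file(src="sample_markdown.md", dst="sample_tei.xml"):
    """Read a mARkdown file, convert it with oitei, and write the TEI XML."""
    import oitei  # third-party: pip install oitei

    text = Path(src).read_text(encoding="utf-8")
    tei = oitei.convert(text)   # assumed entry point, see the note above
    xml = tei.tostring()        # assumed serializer, see the note above
    if isinstance(xml, bytes):  # accept either bytes or str from tostring()
        xml = xml.decode("utf-8")
    Path(dst).write_text(xml, encoding="utf-8")
    return xml
```

The import sits inside the function so that the module can be loaded even where oitei is not installed; call convert_file() once the sample file exists.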

Conversion Results

The generated TEI XML is as follows.

Key Conversion Points

Each mARkdown tag is appropriately converted to TEI elements.

mARkdown	TEI XML
`### |` chapter heading
`### $` biography	`<div type="biography" subtype="man">`
`@P02 Ibn Khaldun`	`<persName>Ibn Khaldun</persName>`
`@T11 Tunis`	`<placeName>unis</placeName>`
`@YB732`	`<date type="birth" calendar="#ah" when-custom="732"/>`
`@YD808`	`<date type="death" calendar="#ah" when-custom="808"/>`
`%~%` verse separator	`<caesura/>`
`#META#` metadata	`<xenoData>`

Hijri calendar years are automatically assigned calendar="#ah", making the calendar system explicit.
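To show what this mapping amounts to, the sketch below builds the same kind of date element from a year tag using the standard library. It mirrors the attribute values in the table above but is not oitei's implementation; year_tag_to_date is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

YEAR_TAG = re.compile(r"@Y(B|D)(\d+)")
DATE_TYPES = {"B": "birth", "D": "death"}


def year_tag_to_date(tag):
    """Build a <date> element for a @YB or @YD tag with a Hijri year."""
    match = YEAR_TAG.fullmatch(tag)
    if match is None:
        raise ValueError("not a year tag: " + tag)
    return ET.Element(
        "date",
        {
            "type": DATE_TYPES[match.group(1)],
            "calendar": "#ah",  # Hijri calendar, as in the table above
            "when-custom": match.group(2),
        },
    )


print(ET.tostring(year_tag_to_date("@YB732"), encoding="unicode"))
```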

Applying to Japanese Text

Although oitei is designed for Islamicate texts, let’s try applying it to Japanese text.

Note: The Space-Delimited Problem

Named entity tags in mARkdown (@P, @T, etc.) use a mechanism that captures the following N words delimited by spaces as names. Since Japanese does not separate words with spaces, some workarounds are needed.

@P02 Ibn Khaldun -> following 2 words = “Ibn Khaldun” (works for English and Arabic)
@P02 源 頼朝 -> following 2 words = “源” “頼朝” … results in <persName>源頼朝</persName>, but the space inserted into the source text looks unnatural

Workaround: Combine Japanese names into a single word without spaces and use @P01 (following 1 word).
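The effect of whitespace-based word counting can be seen in a few lines of Python. This sketch only imitates the "take the next N space-delimited words" idea described above; take_words is a hypothetical helper, not the actual oimdp logic.

```python
def take_words(text_after_tag, word_count):
    """Split on whitespace and return (captured_words, remaining_words)."""
    words = text_after_tag.split()
    return words[:word_count], words[word_count:]


# Space-delimited text: two words are captured as the name.
print(take_words("Ibn Khaldun was a historian", 2))

# Japanese without spaces: the whole sentence counts as one "word".
print(take_words("源頼朝は鎌倉に幕府を開いた", 1))

# Workaround: put a space after the name and capture one word.
print(take_words("源頼朝 は鎌倉に幕府を開いた", 1))
```

In the second call the entire sentence is captured, which is why a space after the name is needed when tagging Japanese text.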

Japanese Sample

Let’s try a simple example with fictional people and places.

Japanese Conversion Results

Person names, place names, and verse lines are converted to TEI elements, and I confirmed that this XML also passes validation with tei_all.rng.

Since oitei is a tool designed for Islamicate texts, the following points require attention:

By default, the teiHeader declares calendar="#ah" (Hijri calendar)
The publisher is fixed as “Open Islamicate Texts Initiative”
When using with Japanese, spaces are required before and after named entity tags

For serious TEI encoding of Japanese texts, consider modifying the oitei output headers or using a different tool.

TEI Schema Validation

Let’s verify whether the generated XML conforms to the TEI standard.

I confirmed that the output passes validation with the official TEI RelaxNG schema (tei_all.rng).

I also attempted validation with the OpenITI custom schema (tei_openiti.rng), but compiling the 434KB schema took an extremely long time and did not complete in my local environment. Since tei_all is a superset of tei_openiti, passing tei_all validation confirms basic TEI conformance, although it does not by itself prove conformance to the stricter OpenITI schema.
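A validation step of this kind can be sketched with lxml, which is already installed as an oitei dependency. The function name validate_with_relaxng and the file paths in the usage comment are assumptions; tei_all.rng has to be downloaded separately from the TEI Consortium.

```python
def validate_with_relaxng(xml_path, rng_path):
    """Validate an XML file against a RelaxNG schema.

    Returns (is_valid, errors), where errors is a list of messages.
    """
    from lxml import etree  # third-party, installed along with oitei

    schema = etree.RelaxNG(etree.parse(rng_path))
    document = etree.parse(xml_path)
    is_valid = schema.validate(document)
    errors = [str(entry) for entry in schema.error_log]
    return is_valid, errors


# Example usage (both paths are placeholders):
# ok, errors = validate_with_relaxng("sample_tei.xml", "tei_all.rng")
# print("valid" if ok else "\n".join(errors))
```

Large schemas can take a while to compile, so it may be worth building the RelaxNG object once and reusing it when validating many files.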

Summary

Using oitei, OpenITI mARkdown can be converted to TEI XML in just a few lines of Python
No need to hand-write XML tags, avoiding editor confusion especially when dealing with RTL languages
The generated XML passes TEI standard schema validation
Named entity tags such as <persName>, <placeName>, and <date> are automatically assigned

Writing in a lightweight notation and converting to TEI XML when needed is a particularly useful workflow for researchers working with historical texts from the Islamicate world.
