Introduction
In the OpenITI (Open Islamicate Texts Initiative) project, which handles historical texts from the Islamicate world, texts can be tagged using a lightweight notation called mARkdown instead of TEI/XML.
While TEI/XML is a powerful international standard for structuring texts, it has problems with right-to-left (RTL) languages like Arabic, where mixing XML tags causes display issues in editors. mARkdown was designed to solve this problem.
In this article, we will try running oitei, a Python tool that automatically converts mARkdown texts to TEI XML.
What is oitei?
- A Python library for converting OpenITI mARkdown to TEI XML
- Outputs XML conforming to the OpenITI TEI Schema
- Published on PyPI and installable via `pip install`
- Dependencies: `oimdp` (mARkdown parser), `lxml`

https://github.com/OpenITI/oitei
Installation
Python 3.8 or later is required. Install the package with `pip install oitei`; oimdp (OpenITI mARkdown Parser) and lxml are installed automatically as dependencies.
OpenITI mARkdown Notation
mARkdown files consist of three parts:
- Magic value (line 1): `######OpenITI#`
- Metadata: Lines starting with `#META#`
- Body text: Written after `#META#Header#End#`
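To make the three-part layout concrete, here is a minimal sketch that splits a mARkdown string into its metadata lines and body lines. The function name `split_markdown` and the `Title` metadata field are hypothetical, chosen only for illustration; this is not how oimdp parses files.

```python
MAGIC = "######OpenITI#"
HEADER_END = "#META#Header#End#"


def split_markdown(text: str):
    """Split a mARkdown document into (metadata lines, body lines)."""
    lines = text.splitlines()
    # Part 1: the magic value must be the very first line.
    if not lines or lines[0].strip() != MAGIC:
        raise ValueError("first line must be the magic value " + MAGIC)
    # Part 2 ends where the header terminator appears.
    try:
        end = next(i for i, line in enumerate(lines) if line.strip() == HEADER_END)
    except StopIteration:
        raise ValueError("missing header terminator " + HEADER_END) from None
    meta = [line for line in lines[1:end] if line.startswith("#META#")]
    # Part 3: everything after the terminator is body text.
    body = lines[end + 1:]
    return meta, body


sample = "\n".join([
    "######OpenITI#",
    "#META# Title: Example",  # hypothetical metadata field
    "#META#Header#End#",
    "# First paragraph",
])
meta, body = split_markdown(sample)
print(meta)
print(body)
```

A check like this can catch a missing magic value or header terminator before the file is handed to the converter.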
Main Tags
| Notation | Meaning |
|---|---|
| `### \| ` | Chapter heading (first level) |
| `### \|\| ` | Section heading (second level) |
| `### $ ` | Biographical entry |
| `# ` | Start of paragraph |
| `@P02 name` | Person name (includes the following 2 words) |
| `@T11 place` | Place name (includes the following 1 word) |
| `@YB732` | Birth year (Hijri year 732) |
| `@YD808` | Death year (Hijri year 808) |
| `%~%` | Hemistich (verse line) separator |
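The two digits after a named entity tag (@P, @T, etc.) have separate meanings. In the mARkdown specification, the first digit is the number of prefix characters attached to the first word that are not part of the name (for example, an Arabic conjunction such as wa-), and the second digit is the number of following words to include in the name. For example, @P02 Ibn Khaldun means “no prefix; include the following 2 words (Ibn Khaldun) as a person name.” With @T11 Tunis, the first digit 1 marks one leading character as a prefix, which is consistent with the `<placeName>unis</placeName>` output shown in the conversion table later in this article; for a name without a prefix, the first digit should be 0.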
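The word-count behaviour can be sketched with a regular expression. This is an illustrative approximation, not the oimdp implementation: the helper name `capture_entity` is hypothetical, the first digit is returned as-is without being interpreted, and year tags such as @YB732 are deliberately not handled.

```python
import re

# One or more tag letters, exactly two digits, then whitespace.
TAG = re.compile(r"@([A-Z]+)(\d)(\d)\s+")


def capture_entity(line: str):
    """Return (tag letters, first digit, captured words) for the first tag, or None."""
    match = TAG.search(line)
    if match is None:
        return None
    kind = match.group(1)
    first_digit = int(match.group(2))
    word_count = int(match.group(3))
    # Take the next `word_count` space-delimited words after the tag.
    words = line[match.end():].split()[:word_count]
    return kind, first_digit, " ".join(words)


print(capture_entity("He met @P02 Ibn Khaldun in Cairo"))
```

Because the capture relies on spaces between words, the same logic behaves differently on text that has no word separators, which matters for the Japanese example later on.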
Creating a Sample File
Create a file named sample_markdown.md with the following content.
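As a stand-in sample, the sketch below writes a small file that exercises the tags from the table above. The title, heading, and sentences are hypothetical and written only for illustration; this content has not been checked against the parser, so treat it as a starting point to adjust.

```python
from pathlib import Path

# Hypothetical sample content: magic value, metadata, then body text.
SAMPLE = """######OpenITI#
#META# Title: Sample text
#META#Header#End#

### | Chapter One
### $ Ibn Khaldun
# @P02 Ibn Khaldun was born in @T11 Tunis in @YB732 and died in @YD808 .
# first hemistich %~% second hemistich
"""


def write_sample(path: str = "sample_markdown.md") -> Path:
    """Write the sample mARkdown text to `path` as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(SAMPLE, encoding="utf-8")
    return target


sample_path = write_sample()
print(sample_path.name)
```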
Running the Conversion
The conversion is completed in just 4 lines.
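A minimal sketch of such a conversion, wrapped in a function so it can be reused. It assumes that oitei exposes a `convert(text)` function and that the returned object has a `tostring()` method; both are my reading of the project README and should be checked against the installed version. The wrapper name `markdown_to_tei` is hypothetical.

```python
from pathlib import Path


def markdown_to_tei(src: str, dst: str) -> str:
    """Read a mARkdown file, convert it, and write the TEI XML to `dst`."""
    # ASSUMPTION: oitei provides convert(text) and the result has tostring().
    # Imported inside the function so this module loads without oitei installed.
    from oitei import convert

    text = Path(src).read_text(encoding="utf-8")
    xml = convert(text).tostring()
    if isinstance(xml, bytes):  # tolerate either str or bytes output
        xml = xml.decode("utf-8")
    Path(dst).write_text(xml, encoding="utf-8")
    return xml
```

With the sample file in place, a call such as `markdown_to_tei("sample_markdown.md", "sample_tei.xml")` would write the result next to it; the output filename here is only an example.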
Conversion Results
The generated TEI XML is as follows.
Key Conversion Points
Each mARkdown tag is appropriately converted to TEI elements.
| mARkdown | TEI XML |
|---|---|
| `### \| ` chapter heading | |
| `### $ ` biography | `<div type="biography" subtype="man">` |
| `@P02 Ibn Khaldun` | `<persName>Ibn Khaldun</persName>` |
| `@T11 Tunis` | `<placeName>unis</placeName>` |
| `@YB732` | `<date type="birth" calendar="#ah" when-custom="732"/>` |
| `@YD808` | `<date type="death" calendar="#ah" when-custom="808"/>` |
| `%~%` verse separator | `<caesura/>` |
| `#META#` metadata | `<xenoData>` |
Hijri calendar years are automatically assigned calendar="#ah", making the calendar system explicit.
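The year-tag mapping in the table can be reproduced with the standard library. This sketch only mirrors the attributes shown above; it is not taken from the oitei source, and the helper name `year_tag_to_date` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# @YB<year> marks a birth year, @YD<year> a death year.
YEAR_TAG = re.compile(r"@Y([BD])(\d+)")
KIND = {"B": "birth", "D": "death"}


def year_tag_to_date(tag: str) -> str:
    """Build a TEI-style <date> element string for a mARkdown year tag."""
    match = YEAR_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a year tag: {tag!r}")
    date = ET.Element("date", {
        "type": KIND[match.group(1)],
        "calendar": "#ah",  # Hijri calendar, as in the table above
        "when-custom": match.group(2),
    })
    return ET.tostring(date, encoding="unicode")


print(year_tag_to_date("@YB732"))
print(year_tag_to_date("@YD808"))
```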
Applying to Japanese Text
Although oitei is designed for Islamicate texts, let’s try applying it to Japanese text.
Note: The Space-Delimited Problem
Named entity tags in mARkdown (@P, @T, etc.) use a mechanism that captures the following N words delimited by spaces as names. Since Japanese does not separate words with spaces, some workarounds are needed.
- `@P02 Ibn Khaldun` -> following 2 words = “Ibn Khaldun” (works for English and Arabic)
- `@P02 源 頼朝` -> following 2 words = “源” “頼朝” … results in `<persName>源 頼朝</persName>` but looks unnatural
Workaround: Combine Japanese names into a single word without spaces and use @P01 (following 1 word).
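The workaround can be automated with a small helper. This is a sketch under the assumptions stated in this article: the name is joined into one space-free token, tagged with @P01, and padded with spaces on both sides. The function name `tag_person` is hypothetical.

```python
def tag_person(name: str) -> str:
    """Return a @P01 tag for `name`, joined into a single space-free token."""
    # str.split() with no arguments also splits on the full-width space (U+3000).
    token = "".join(name.split())
    # Surround the tag and the name with spaces so they are delimited
    # from the neighbouring Japanese text.
    return f" @P01 {token} "


print(tag_person("源 頼朝"))
```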
Japanese Sample
Let’s try a simple example with fictional people and places.
Japanese Conversion Results
Person names, place names, and verse lines are converted to TEI elements, and I confirmed that this XML also passes validation with tei_all.rng.
Since oitei is a tool designed for Islamicate texts, the following points require attention:
- The `teiHeader` always outputs `calendar="#ah"` (Hijri calendar) by default
- The `publisher` is fixed as “Open Islamicate Texts Initiative”
- When using with Japanese, spaces are required before and after named entity tags
For serious TEI encoding of Japanese texts, consider modifying the oitei output headers or using a different tool.
TEI Schema Validation
Let’s verify whether the generated XML conforms to the TEI standard.
I confirmed that the output passes validation with the official TEI RelaxNG schema (tei_all.rng).
I also attempted validation with the OpenITI custom schema (tei_openiti.rng), but compiling the 434KB schema took an extremely long time and did not complete in my local environment. Because tei_all is a superset of tei_openiti, passing tei_all validation confirms basic conformance to the TEI standard, but it does not by itself guarantee conformance to the stricter OpenITI schema.
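A validation step of this kind can be sketched with lxml, which is already installed as an oitei dependency. The sketch uses `etree.RelaxNG`, `validate`, and `error_log` from lxml; the function name `validate_tei` is hypothetical, and the schema file is assumed to have been downloaded locally.

```python
def validate_tei(xml_path: str, rng_path: str) -> list:
    """Validate an XML file against a RelaxNG schema; return error strings."""
    # Imported here so the function can be defined even where lxml is absent.
    from lxml import etree

    schema = etree.RelaxNG(etree.parse(rng_path))
    document = etree.parse(xml_path)
    if schema.validate(document):
        return []  # an empty list means the document is valid
    # Each log entry carries the line number and a human-readable message.
    return [f"line {entry.line}: {entry.message}" for entry in schema.error_log]
```

Calling `validate_tei("sample_tei.xml", "tei_all.rng")` with the paths of the converted file and the schema would return an empty list for a valid document; both filenames here are examples.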
Summary
- Using oitei, OpenITI mARkdown can be converted to TEI XML in just a few lines of Python
- There is no need to hand-write XML tags, which avoids display problems in editors, especially with RTL languages
- The generated XML passes TEI standard schema validation
- Named entity tags such as `<persName>`, `<placeName>`, and `<date>` are automatically assigned
Writing in a lightweight notation and converting to TEI XML when needed is a particularly useful workflow for researchers working with historical texts from the Islamicate world.