Introduction

In the OpenITI (Open Islamicate Texts Initiative) project, which handles historical texts from the Islamicate world, texts can be tagged using a lightweight notation called mARkdown instead of TEI/XML.

While TEI/XML is a powerful international standard for structuring texts, it is awkward to edit for right-to-left (RTL) scripts such as Arabic: interleaving left-to-right XML tags with RTL text causes display problems in most editors. mARkdown was designed to avoid this problem.

In this article, we will try running oitei, a Python tool that automatically converts mARkdown texts to TEI XML.

What is oitei?

  • A Python library for converting OpenITI mARkdown to TEI XML
  • Outputs XML conforming to the OpenITI TEI Schema
  • Published on PyPI and installable via pip install
  • Dependencies: oimdp (mARkdown parser), lxml

https://github.com/OpenITI/oitei

Installation

pip install oitei

Python 3.8 or later is required. oimdp (OpenITI mARkdown Parser) and lxml are automatically installed as dependencies.

OpenITI mARkdown Notation

mARkdown files consist of three parts:

  1. Magic value (line 1): ######OpenITI#
  2. Metadata: Lines starting with #META#
  3. Body text: Written after #META#Header#End#
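The three parts can be separated with a few lines of plain Python. This is only an illustrative sketch: split_markdown is a hypothetical helper, not part of oimdp or oitei, and it assumes the header markers appear exactly as listed above.

```python
MAGIC = "######OpenITI#"
HEADER_END = "#META#Header#End#"


def split_markdown(text):
    """Split a mARkdown document into (magic value, metadata lines, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != MAGIC:
        raise ValueError("missing magic value on line 1")
    try:
        end = next(i for i, line in enumerate(lines) if line.strip() == HEADER_END)
    except StopIteration:
        raise ValueError("missing #META#Header#End# marker") from None
    # Metadata lines sit between the magic value and the end marker.
    metadata = [
        line[len("#META#"):].strip()
        for line in lines[1:end]
        if line.startswith("#META#")
    ]
    body = "\n".join(lines[end + 1:])
    return lines[0].strip(), metadata, body


sample = """######OpenITI#
#META# Title: Sample Text
#META# Language: Arabic
#META#Header#End#
### | Chapter One
# First paragraph."""

magic, metadata, body = split_markdown(sample)
print(metadata)  # ['Title: Sample Text', 'Language: Arabic']
print(body)
```

Failing early on a missing magic value or end marker seems preferable here, since a file without them is probably not mARkdown at all.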

Main Tags

Notation        Meaning
### |           Chapter heading
### ||          Section heading
### $           Biographical entry
#               Start of paragraph
@P02 name       Person name (includes the following 2 words)
@T11 place      Place name (includes the following 1 word)
@YB732          Birth year (Hijri year 732)
@YD808          Death year (Hijri year 808)
%~%             Hemistich (verse line) separator

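The two digits after a named entity tag (@P, @T, etc.) are read separately: the first is the length of a prefix, counted in characters, that is left outside the name (useful for Arabic particles attached to a word), and the second indicates “how many subsequent words to include in the name.” For example, @P02 Ibn Khaldun means “no prefix; include the following 2 words (Ibn Khaldun) as a person name.” By the same rule, @T11 Tunis treats the leading T as a one-character prefix, which is why the converted output contains <placeName>unis</placeName>; use @T01 to tag the whole word.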
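To make the two digits concrete, here is a minimal sketch of how such a tag could be read. It assumes that the first digit is a prefix length in characters and the second a word count, which is what the <placeName>unis</placeName> result later in this article suggests. The regular expression and the read_entity helper are my own illustration, not code from oimdp.

```python
import re

# Hypothetical pattern: "@", entity letter, prefix digit, word-count digit.
TAG = re.compile(r"@(?P<kind>[A-Z])(?P<prefix>\d)(?P<words>\d)\s+")


def read_entity(text):
    """Return (kind, prefix, name) for an entity tag at the start of `text`."""
    match = TAG.match(text)
    if match is None:
        raise ValueError("no entity tag at start of text")
    prefix_len = int(match.group("prefix"))
    n_words = int(match.group("words"))
    span = " ".join(text[match.end():].split()[:n_words])
    # The first `prefix_len` characters stay outside the name.
    return match.group("kind"), span[:prefix_len], span[prefix_len:]


print(read_entity("@P02 Ibn Khaldun was born in Tunis"))  # ('P', '', 'Ibn Khaldun')
print(read_entity("@T11 Tunis in 732"))                   # ('T', 'T', 'unis')
print(read_entity("@T01 Tunis in 732"))                   # ('T', '', 'Tunis')
```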

Creating a Sample File

Create a file named sample_markdown.md with the following content.

######OpenITI#
#META# Title: Sample Text for OpenITI mARkdown Demo
#META# Author: Demo Author
#META# Language: Arabic
#META#Header#End#
### | Chapter One: Introduction
# This is the first paragraph of the sample text. It demonstrates how OpenITI mARkdown works for tagging biographical and geographical information.
### $ Ibn Khaldun
# @P02 Ibn Khaldun was born in @T11 Tunis in @YB732 and died in @YD808 in @T11 Cairo . He wrote the Muqaddimah.
### || Section on Geography
# The city of @T11 Damascus was an important center of learning. Many scholars traveled from @T11 Baghdad to @T11 Damascus .
### $ Abu Bakr al-Razi
# @P03 Abu Bakr al-Razi was a renowned physician. He was born in @T11 Rayy and later moved to @T11 Baghdad . He died in @YD313 .
### | Chapter Two: Poetry
# The following is a sample verse:
# %~% The morning light shines on the desert sand %~% And caravan bells echo through the land
# This concludes our sample text.

Running the Conversion

import oitei
md = open("sample_markdown.md", "r").read()
tei_string = oitei.convert(md).tostring()
with open("sample_tei.xml", "w") as writer:
    writer.write(tei_string)

The conversion takes only a few lines.

Conversion Results

The generated TEI XML is as follows.

<?xml version='1.0' encoding='UTF-8'?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title/>
        <author/>
      </titleStmt>
      <publicationStmt>
        <publisher>Open Islamicate Texts Initiative (OpenITI)</publisher>
        <availability>
          <p>Creative Commons Attribution Non Commercial Share Alike
             4.0 International</p>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl/>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <calendarDesc>
        <calendar xml:id="ah">
          <p>Anno Hegirae</p>
        </calendar>
      </calendarDesc>
    </profileDesc>
    <xenoData xml:space="preserve">
Title: Sample Text for OpenITI mARkdown Demo
Author: Demo Author
Language: Arabic
</xenoData>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Chapter One: Introduction</head>
        <p>
          <lb/>This is the first paragraph of the sample text.
        </p>
        <div type="biography" subtype="man">
          <head>Ibn Khaldun</head>
          <p>
            <lb/>
            <persName>Ibn Khaldun </persName> was born in
            T<placeName>unis </placeName> in
            <date type="birth" calendar="#ah" when-custom="732"/>
            and died in
            <date type="death" calendar="#ah" when-custom="808"/>
            in C<placeName>airo </placeName>
            He wrote the Muqaddimah.
          </p>
        </div>
        <!-- abbreviated below -->
      </div>
    </body>
  </text>
</TEI>

Key Conversion Points

Each mARkdown tag is converted to a corresponding TEI element.

mARkdown                  TEI XML
### | chapter heading     <div> with a <head>
### $ biography           <div type="biography" subtype="man">
@P02 Ibn Khaldun          <persName>Ibn Khaldun</persName>
@T11 Tunis                T<placeName>unis</placeName>
@YB732                    <date type="birth" calendar="#ah" when-custom="732"/>
@YD808                    <date type="death" calendar="#ah" when-custom="808"/>
%~% verse separator       <caesura/>
#META# metadata           <xenoData>

In the @T11 row, the first digit (1) leaves the leading T outside the element, so only “unis” is tagged.

Hijri calendar years are automatically assigned calendar="#ah", making the calendar system explicit.
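Because the calendar is explicit, the Hijri years are easy to read back out of the generated XML. The sketch below uses only the standard library on a hand-written fragment that mirrors the <date> elements shown above; the fragment is an assumption for illustration, not verbatim oitei output.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

fragment = """<p xmlns="http://www.tei-c.org/ns/1.0">
  <date type="birth" calendar="#ah" when-custom="732"/>
  <date type="death" calendar="#ah" when-custom="808"/>
</p>"""

root = ET.fromstring(fragment)
# Keep only dates that point at the Hijri calendar declaration.
years = {
    date.get("type"): int(date.get("when-custom"))
    for date in root.findall("tei:date", TEI_NS)
    if date.get("calendar") == "#ah"
}
print(years)  # {'birth': 732, 'death': 808}
```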

Applying to Japanese Text

Although oitei is designed for Islamicate texts, let’s try applying it to Japanese text.

Note: The Space-Delimited Problem

Named entity tags in mARkdown (@P, @T, etc.) use a mechanism that captures the following N words delimited by spaces as names. Since Japanese does not separate words with spaces, some workarounds are needed.

  • @P02 Ibn Khaldun -> following 2 words = “Ibn Khaldun” (works for English and Arabic)
  • @P02 源 頼朝 -> following 2 words = “源” “頼朝” … results in <persName>源 頼朝</persName> but looks unnatural

Workaround: Combine Japanese names into a single word without spaces and use @P01 (following 1 word).

@P01 源頼朝 -> <persName>源頼朝</persName>
@T01 … -> <placeName>…</placeName>
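If names arrive with spaces, the workaround can be automated. A minimal sketch, assuming a hypothetical helper tag_name that is not part of oitei: it removes whitespace inside the name and emits an @P01 (or @T01) tag surrounded by spaces.

```python
def tag_name(name, kind="P"):
    """Join a space-separated name into one token and prefix an @x01 tag."""
    # str.split() with no argument also splits on the ideographic space.
    token = "".join(name.split())
    # Prefix length 0, one following word; spaces keep tag and token apart.
    return f" @{kind}01 {token} "


print(tag_name("源 頼朝"))       # ' @P01 源頼朝 '
print(tag_name("Kyoto", "T"))   # ' @T01 Kyoto '
```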

Japanese Sample

Let’s try a simple example with fictional people and places.

######OpenITI#
#META# Title: …
#META# Author: …
#META# Language: Japanese
#META#Header#End#
### | …
# @P01 … … @T01 … …
# @P01 … … @T01 … … @T01 … …
### | …
# …
# %~% … %~% …

Japanese Conversion Results

<div>
  <head>…</head>
  <p>
    <lb/>
    <persName>…</persName> … <placeName>…</placeName>
  </p>
  <p>
    <lb/>
    <persName>…</persName> … <placeName>…</placeName>
    …<placeName>…</placeName>
  </p>
</div>
<div>
  <head>…</head>
  <p>
    <lb/>…
  </p>
  <lg>
    <l>
      <caesura/>…
      <caesura/>…
    </l>
  </lg>
</div>

Person names, place names, and verse lines are converted to TEI elements, and I confirmed that this XML also passes validation with tei_all.rng.


Since oitei is a tool designed for Islamicate texts, the following points require attention:

  • The teiHeader always outputs calendar="#ah" (Hijri calendar) by default
  • The publisher is fixed as “Open Islamicate Texts Initiative”
  • When using with Japanese, spaces are required before and after named entity tags

For serious TEI encoding of Japanese texts, consider modifying the oitei output headers or using a different tool.
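As one way to adjust the header, the publisher can be replaced after conversion with the standard library. This is a sketch under assumptions: set_publisher is a hypothetical helper, “Example Project” is a placeholder value, and the small document here stands in for real oitei output.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)  # serialize without ns0: prefixes


def set_publisher(tei_xml, publisher):
    """Return the TEI document with the <publisher> text replaced."""
    root = ET.fromstring(tei_xml)
    node = root.find(f".//{{{TEI}}}publisher")
    if node is None:
        raise ValueError("no <publisher> element found")
    node.text = publisher
    return ET.tostring(root, encoding="unicode")


# Stand-in document with the same header element that oitei emits.
doc = (
    f'<TEI xmlns="{TEI}"><teiHeader><fileDesc><publicationStmt>'
    "<publisher>Open Islamicate Texts Initiative (OpenITI)</publisher>"
    "</publicationStmt></fileDesc></teiHeader></TEI>"
)
updated = set_publisher(doc, "Example Project")
print(updated)
```

Post-processing the output keeps oitei itself untouched, which seems simpler than patching the library for a one-off project.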

TEI Schema Validation

Let’s verify whether the generated XML conforms to the TEI standard.

# Download the TEI All schema
curl -sL -o tei_all.rng \
  https://tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng
# Validate with xmllint
xmllint --relaxng tei_all.rng sample_tei.xml
sample_tei.xml validates

I confirmed that the output passes validation with the official TEI RelaxNG schema (tei_all.rng).


I also attempted validation with the OpenITI custom schema (tei_openiti.rng), but compiling the 434KB schema took an extremely long time and did not complete in my local environment. Since tei_all is a superset of tei_openiti, passing tei_all validation shows that the output is valid TEI, although it does not by itself prove conformance to the stricter OpenITI schema.

Summary

  • Using oitei, OpenITI mARkdown can be converted to TEI XML in just a few lines of Python
  • No need to hand-write XML tags, avoiding editor confusion especially when dealing with RTL languages
  • The generated XML passes TEI standard schema validation
  • Named entity tags such as <persName>, <placeName>, and <date> are automatically assigned

Writing in a lightweight notation and converting to TEI XML when needed is a particularly useful workflow for researchers working with historical texts from the Islamicate world.

References