Overview

I recently had occasion to convert Word files to TEI/XML. While looking into the options, I found, besides official TEI tools such as TEIGarage Conversion, a conversion example in TEI Publisher:

https://teipublisher.com/exist/apps/tei-publisher/test/test.docx.xml

The example above appears to convert Word style information into TEI tags, so I tried the same approach. For this project I used the python-docx library, so that the conversion can run independently of TEI Publisher.

Word File

I created a prototype Word file like the one below. All styles are provisional, but I defined styles such as “tei:persName” and “tei:warichu” and gave each a distinct appearance, such as a different color. The idea is that applying a style to a span of text gives it simple structural markup.
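As a rough illustration of how such styles can be read back, here is a minimal sketch using python-docx. It assumes the styles are character styles applied to runs and that the “tei:” prefix is the naming convention; the function names are hypothetical and are not taken from my actual script.

```python
TEI_PREFIX = "tei:"  # assumed naming convention for the custom Word styles


def tei_tag_from_style(style_name):
    """Return the name encoded after the "tei:" prefix, or None if absent."""
    if style_name and style_name.startswith(TEI_PREFIX):
        return style_name[len(TEI_PREFIX):]
    return None


def iter_tagged_runs(docx_path):
    """Yield (paragraph_index, tag_or_None, text) for each run in the body."""
    from docx import Document  # python-docx, imported only when needed

    document = Document(docx_path)
    for p_index, paragraph in enumerate(document.paragraphs):
        for run in paragraph.runs:
            # Runs without an explicit character style report the document
            # default, whose name does not carry the "tei:" prefix.
            style = run.style
            name = style.name if style is not None else None
            yield p_index, tei_tag_from_style(name), run.text
```

Calling `list(iter_tagged_runs("sample.docx"))` on a file styled this way (the file name is a placeholder) would give one tuple per run, with unstyled text carrying `None` as its tag.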

Conversion to TEI/XML

I wrote a script that takes the Word file above as input and converts it to TEI/XML, relying primarily on style information. I plan to publish it as a pip-installable package or similar in the future.
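The core of a style-based conversion can be sketched as follows. This is not the script itself: the mapping table is hypothetical (in particular, rendering “tei:warichu” as a note with a type attribute is only an assumption), and the function builds a single paragraph element rather than a complete TEI document with a header. The input is an ordered list of (style name, text) pairs, such as could be collected from the runs of a Word paragraph.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# Hypothetical mapping: Word style name -> (TEI element name, attributes).
STYLE_MAP = {
    "tei:persName": ("persName", {}),
    "tei:warichu": ("note", {"type": "warichu"}),
}


def paragraph_to_tei(runs):
    """Build a TEI <p> element from ordered (style_name, text) pairs."""
    p = ET.Element(f"{{{TEI_NS}}}p")
    last = None  # most recently appended child element, if any
    # Word often splits one styled span into several runs, so adjacent
    # runs sharing a style are merged before elements are created.
    for style_name, group in groupby(runs, key=lambda r: r[0]):
        text = "".join(t for _, t in group)
        if not text:
            continue
        if style_name in STYLE_MAP:
            tag, attrs = STYLE_MAP[style_name]
            last = ET.SubElement(p, f"{{{TEI_NS}}}{tag}", dict(attrs))
            last.text = text
        elif last is None:
            p.text = (p.text or "") + text  # plain text before any element
        else:
            last.tail = (last.tail or "") + text  # plain text after an element
    return p
```

The result can be serialized with `ET.tostring(paragraph_to_tei(runs), encoding="unicode")`. Keeping the mapping in a table means new styles can be supported without touching the conversion logic.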

An example of the converted TEI/XML is below. There is still much room for improvement, but I was able to convert it into a valid TEI/XML file.

[Listing: converted TEI/XML output]

Below is an example displayed in a TEI/XML viewer I am developing separately. Styles such as <rt place="left"> and red text have not yet been applied, but person names and interlinear notes have been successfully reproduced.

Summary

Complex structures may be difficult to handle this way, but I believe that being able to convert text written in Word to TEI/XML in roughly the intended form could help lower the barrier to adopting TEI/XML. I plan to continue experimenting with this approach.