Overview

This is a memo documenting one example method for converting TEI/XML files to vertical-writing (tategaki) PDF.

You can try the program targeting “Koui Genji Monogatari” (Collated Tale of Genji) in the following notebook.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/TEI_XMLファイルを縦書きPDFに変換する.ipynb

Conversion Workflow

This time, I used Quarto.

https://quarto.org/

Please refer to the following for installation instructions.

https://quarto.org/docs/get-started/

TEI/XML -> qmd

First, convert the contents of the TEI/XML file to a qmd file. Below is a sample conversion script.

fseiitaetfoottaf{"wrolmdiuleopseiuot"iouepttexra.xttre"tmpmo=lhmttmtlhmxhereoeeiihaeoadtfb=ntorn=ffk=:rto.o.sts=ti=e::cspw4Bso.=s"needf"xteresps".t.tfi"{":rrniia=aos=enene"r"t{ei(tmutuolaxaxds"iafpoeptshpusemtmta(-tue(p(oio..pomeeto-ltr)atrfusf.ue++as-ehe}tetuppifpn====/.}onhxl.lni.t=={p"rc,tBSfidnfs"eia}e)eoit(di:"\".dt"-"aune"(npnst}hdwupdxt"db"ee..o"t(Ctia("gxqdc)ioh(tu":"tmi:fpioltb:drauelseho+"n/slnd."odacS(rp)ry"mofofea.""en:uintt))(tpl(he..\oeet.xtfnpn,ebtei"at'xaxnt/rtstdhk'=e()o)Tn",u,rapium"eg"ee)xex,(.inmffsjlriiti"eln_m)cedoou)Cknr)h=os[iTgi0lrav]duterea=e)rTnir(/u)teo)ols/genji-doc-style.docx

Below is an example of a qmd file.

tafiuottrlhmeoad:rto::c"x":refer"ence-d"oc:/content/kouigenjimonogatari/tools/genji-doc-style.docx

qmd -> Word (docx)

Next, convert the qmd file to a Word file. By referencing a pre-prepared vertical-writing Word template, you can convert markdown-formatted text into a vertical-writing Word file.

The following articles are helpful references for this process.

https://www.infoworld.com/article/3671668/how-to-create-word-docs-from-r-or-python-with-quarto.html#toc-4

https://quarto.org/docs/output-formats/ms-word-templates.html

As a result, a Word file like the following is created.

Word (docx) -> PDF

Finally, convert the Word file to PDF. In addition to manual conversion, automatic conversion is also possible using the following library.

https://pypi.org/project/docx2pdf/

This allows you to mechanically convert TEI/XML to PDF files.

Summary

There are likely more efficient conversion methods available. I hope this serves as one example to consider when exploring conversion approaches.

One limitation of this workflow is that going through the qmd file format introduces some constraints on layout. Using a library like python-docx below to create Word files directly from TEI/XML may also be an effective approach.

https://pypi.org/project/python-docx/

Of course, conversion using xslt is also effective. There are many other methods available, and I plan to continue exploring them.

I have also documented an example method for converting to epub below. I hope it serves as a useful reference.