Fixing the 'ref' Bug in DHConvalidator

This article was partially written by AI. Overview DHConvalidator is a tool for converting Digital Humanities (DH) conference abstracts into a consistent TEI (Text Encoding Initiative) text base. https://github.com/ADHO/dhconvalidator When using this tool, the following error occurred during the conversion process from Microsoft Word format (DOCX) to TEI XML format: E R R O R : n u . x o m . P a r s i n g E x c e p t i o n : c v c - c o m p l e x - t y p e . 2 . 4 . a : I n v a l i d c o n t e n t w a s f o u n d s t a r t i n g w i t h e l e m e n t ' r e f ' This article shares the cause and solution for this issue. ...

June 27, 2025 · 23 min · Nakamura

How to Convert Word Files to TEI XML: A Guide to Using the TEIgarage API

This article was created by AI with some human modifications. Introduction In the world of digital humanities, it has become common to store documents in TEI (Text Encoding Initiative) format. TEI is a standard for structuring scholarly texts. This article explains how to convert documents created in Microsoft Word to TEI XML format using Python. What is TEIgarage? TEIgarage is an online service for converting documents in various formats to TEI XML. The service provides an API that can be called directly from programs. In this article, we will call this API from Python to convert Word files. ...

March 3, 2025 · 8 min · Nakamura

Converting Word to TEI/XML

Overview I had an opportunity to convert Word files to TEI/XML files. Upon investigation, in addition to official TEI tools such as TEIGarage Conversion, I found a conversion example in TEI Publisher: https://teipublisher.com/exist/apps/tei-publisher/test/test.docx.xml The above example appeared to convert Word style information into TEI tags, so I tried this approach. For this project, I used the python-docx library with the goal of using it independently of TEI Publisher. Word File I created a prototype Word file like the one below. All styles are provisional, but I created styles such as “tei:persName” and “tei:warichu” and changed their visual styling such as color. The mechanism works by applying styles to perform simple structuring. ...

January 17, 2023 · 5 min · Nakamura

Double-Sided Ruby Annotations Using python-docx

This is a memo on how to achieve double-sided ruby (furigana) in Word using python-docx. You can try it from the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/python_docxを用いた両側ルビ.ipynb An output example is shown below. An input example is shown below. < < < b p 私 < に / p < が / / o > は r < < / 行 p > r < < / あ p b d u r < < / r r き > u r r r り > o y b b r < < / r < < / r t u ま b b t u ま d > y > u r r r u r r r 場 b b し y > b す y > b b t u b b t u > p y た > 入 p y 。 > y > b y > b l > 。 学 l > > 打 p y > 球 p y a 試 a < l > < l > c 験 c / a / a e < e r c r c = / = b e b e " r " > = > = l b a " " e > b r r f o i i t v g g " e h h > " t t ビ > " " リ に > > ヤ ゅ ダ キ ー う < ウ ド が / < < く r / / し t r r け > t t ん > > < / r t > The program is still incomplete, but I hope it serves as a helpful reference. ...

October 4, 2022 · 2 min · Nakamura

An Example Method for Converting TEI/XML Files to Vertical-Writing PDF

Overview This is a memo documenting one example method for converting TEI/XML files to vertical-writing (tategaki) PDF. You can try the program targeting “Koui Genji Monogatari” (Collated Tale of Genji) in the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/TEI_XMLファイルを縦書きPDFに変換する.ipynb Conversion Workflow This time, I used Quarto. https://quarto.org/ Please refer to the following for installation instructions. https://quarto.org/docs/get-started/ TEI/XML -> qmd First, convert the contents of the TEI/XML file to a qmd file. Below is a sample conversion script. ...

October 3, 2022 · 8 min · Nakamura

Creating Microsoft Word Files with python-docx: Using Templates and int2kanji

Overview I had the opportunity to convert tabular information into a vertical writing format in Microsoft Word, so this is a memo. Before conversion Research Title Project Number Direct Cost Development of Digital Archive System Construction Methods Considering Sustainability and Usability 21K18014 2600000 After conversion This uses specified templates and the “Kanjize” library for mutual conversion between numbers and Kanji numerals. Creating Microsoft Word Files with python-docx First, create a Microsoft Word template file as follows. While using the specified layout, place {<variable_name>} in the parts where values should be changed. ...

May 31, 2022 · 5 min · Nakamura