Introduction

TEI (Text Encoding Initiative) is an international standard for encoding the structure of texts in the humanities. It is widely used in libraries, museums, and academic research, but writing TEI/XML by hand requires knowledge of XML markup, which raises the barrier to entry.

This is where tools that convert Microsoft Word (.docx) files to TEI/XML come in. A well-known example is TEI Garage (formerly OxGarage), but because it handles many formats, its UI is somewhat complex. For this article, I built a simple browser-based tool dedicated to DOCX to TEI/XML conversion.

Repository: https://github.com/nakamura196/tei-converter

Demo site: https://tei-converter.pages.dev/

How It Works

TEI Garage provides a public REST API. Simply POST a DOCX file to the following endpoint and TEI/XML is returned.

POST https://teigarage.tei-c.org/ege-webservice/Conversions/docx:application:vnd.openxmlformats-officedocument.wordprocessingml.document/TEI:text:xml/

This tool is a frontend that calls this API. The actual conversion processing is performed on the server operated by the TEI Consortium.
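
As a rough illustration, the request such a frontend sends might look like the sketch below. It assumes the endpoint is the TEI Garage conversion URL with the DOCX and TEI format identifiers as path segments, percent-encoded with encodeURIComponent. The function names and the multipart field name "fileToConvert" are assumptions made for this example and are not taken from the tool's source.

```javascript
// Base URL and format identifiers of the TEI Garage conversion endpoint.
const BASE_URL = "https://teigarage.tei-c.org/ege-webservice/Conversions/";
const DOCX_FORMAT =
  "docx:application:vnd.openxmlformats-officedocument.wordprocessingml.document";
const TEI_FORMAT = "TEI:text:xml";

// Build the conversion URL; each format identifier becomes one
// percent-encoded path segment (":" is encoded as "%3A").
function buildConversionUrl(from = DOCX_FORMAT, to = TEI_FORMAT) {
  return `${BASE_URL}${encodeURIComponent(from)}/${encodeURIComponent(to)}/`;
}

// POST the file as multipart/form-data and return the response body as text.
// The field name "fileToConvert" is an assumption of this sketch.
async function convertDocx(file) {
  const form = new FormData();
  form.append("fileToConvert", file, file.name);
  const res = await fetch(buildConversionUrl(), { method: "POST", body: form });
  if (!res.ok) throw new Error(`Conversion failed: HTTP ${res.status}`);
  return res.text();
}
```

Keeping the URL construction in its own function makes it easy to point the same code at another input or output format later.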

Key Features

Drag & Drop

Simply drag and drop a .docx file into the browser to upload it. You can also click to select a file.
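
A minimal sketch of how drag and drop can be wired up with standard DOM events follows. The helper names pickDocx and attachDropZone, and the onFile callback, are hypothetical and only illustrate the idea; the actual tool may structure this differently.

```javascript
// Pick the first .docx file from a FileList-like collection, or null if none.
function pickDocx(files) {
  for (const f of Array.from(files)) {
    if (f.name.toLowerCase().endsWith(".docx")) return f;
  }
  return null;
}

// Wire up a drop zone. `zone` is a DOM element and `onFile` is a callback
// that receives the dropped .docx file (both are hypothetical names).
function attachDropZone(zone, onFile) {
  // Cancelling dragover is required, otherwise the drop event never fires.
  zone.addEventListener("dragover", (e) => e.preventDefault());
  zone.addEventListener("drop", (e) => {
    e.preventDefault(); // stop the browser from opening the file itself
    const file = pickDocx(e.dataTransfer.files);
    if (file) onFile(file);
  });
}
```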

Formatted XML Preview

The conversion result is formatted with indentation and displayed with syntax highlighting. Tag names, attribute names, and attribute values are each color-coded, making it easy to understand the structure.

Copy & Download

You can copy the result to the clipboard with one click or download it as an .xml file.
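
Both actions map onto standard browser APIs. The sketch below is one possible way to do it, not necessarily how the tool does: the function names are hypothetical, and deriving the download name from the uploaded file's name is an assumption of this example.

```javascript
// Derive the download name from the upload name: "report.docx" -> "report.xml".
function toXmlFilename(docxName) {
  return docxName.replace(/\.docx$/i, "") + ".xml";
}

// Copy the XML to the clipboard (browsers require a secure context for this).
async function copyXml(xml) {
  await navigator.clipboard.writeText(xml);
}

// Trigger a download by clicking a temporary link that points at a Blob URL.
function downloadXml(xml, docxName) {
  const blob = new Blob([xml], { type: "application/xml" });
  const url = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = url;
  a.download = toXmlFilename(docxName);
  a.click();
  URL.revokeObjectURL(url); // release the temporary URL
}
```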

Built-in Sample DOCX

Just click “Try with sample .docx” to verify the behavior with a built-in sample file. The sample can also be downloaded.

Japanese/English UI Toggle

The EN / JA buttons next to the title allow switching the UI language between Japanese and English.

Technical Details

Single HTML File

No build tools, frameworks, or external libraries are used. All HTML, CSS, and JavaScript live in a single index.html file, so the tool runs when the file is opened locally in a browser (the conversion itself still needs network access to the TEI Garage API).

XML Pretty Print

The XML returned by the TEI Garage API may not be indented. To address this, the browser's DOMParser parses the XML, and the resulting tree is serialized recursively with indentation.

For mixed content (where text and elements are interleaved, e.g., <p>text <hi>bold</hi> text</p>), adding line breaks and indentation would insert whitespace into the text and change its meaning, so such elements are output inline.

// Mixed content detection (children is assumed to be the element's childNodes)
let hasElement = false, hasText = false;
for (const c of children) {
  if (c.nodeType === Node.ELEMENT_NODE) hasElement = true;
  // Whitespace-only text nodes do not count as text
  if (c.nodeType === Node.TEXT_NODE && c.textContent.trim()) hasText = true;
}
if (hasElement && hasText) {
  // Output inline (no indentation)
}
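
To show how this check fits into a complete serializer, here is a simplified sketch. It is an illustration under stated assumptions rather than the tool's actual code: the function names are hypothetical, comments and processing instructions are ignored, and any element that directly contains non-whitespace text is kept on one line.

```javascript
const ELEMENT_NODE = 1; // same values as Node.ELEMENT_NODE and Node.TEXT_NODE
const TEXT_NODE = 3;

// Escape characters that are special in XML text and attribute values.
function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;")
    .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// Serialize the attributes of an element as ` name="value"` pairs.
function attrString(node) {
  return Array.from(node.attributes)
    .map((a) => ` ${a.name}="${escapeXml(a.value)}"`).join("");
}

// Serialize a node on a single line, preserving its text exactly.
function inline(node) {
  if (node.nodeType === TEXT_NODE) return escapeXml(node.textContent);
  if (node.nodeType !== ELEMENT_NODE) return ""; // comments etc. are dropped
  const kids = Array.from(node.childNodes);
  const open = `<${node.tagName}${attrString(node)}`;
  if (kids.length === 0) return `${open}/>`;
  return `${open}>${kids.map(inline).join("")}</${node.tagName}>`;
}

// Indent element-only content; keep text-bearing (mixed) content inline.
function pretty(node, depth = 0) {
  const pad = "  ".repeat(depth);
  const kids = Array.from(node.childNodes);
  const hasText = kids.some(
    (c) => c.nodeType === TEXT_NODE && c.textContent.trim());
  const elems = kids.filter((c) => c.nodeType === ELEMENT_NODE);
  if (hasText || elems.length === 0) return pad + inline(node);
  const body = elems.map((c) => pretty(c, depth + 1)).join("\n");
  return `${pad}<${node.tagName}${attrString(node)}>\n${body}\n${pad}</${node.tagName}>`;
}

// Browser usage (DOMParser is not available in every JavaScript runtime):
//   const doc = new DOMParser().parseFromString(xml, "application/xml");
//   const formatted = pretty(doc.documentElement);
```

Using numeric node type constants instead of the global Node object keeps the functions usable outside the browser, for example in unit tests with plain objects.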

Syntax Highlighting

Syntax highlighting is implemented with a single-pass tokenizer without external libraries. The XML is scanned character by character to separate tags and text, and the tag portions are further decomposed into tag names, attribute names, and attribute values, then colored with HTML <span> elements.

A multi-pass approach using regular expressions was avoided because markup inserted by earlier passes, such as the class="tag" attribute of an inserted <span>, would itself be matched by the regular expressions of later passes.
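
The idea can be sketched as follows. This is a simplified illustration, not the tool's actual tokenizer: the function names and the CSS class names (tag, attr, val) are assumptions, and comments, CDATA sections, and ">" characters inside attribute values are not handled.

```javascript
// Escape text so that it can be placed inside HTML.
function esc(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

const span = (cls, s) => `<span class="${cls}">${esc(s)}</span>`;

// Highlight the inside of one tag, e.g. `p n="1"`, `/p` or `lb/`.
function highlightTag(inner) {
  let k = 0;
  let out = "&lt;";
  // Optional "/" (closing tag) or "?" (XML declaration) before the name.
  if (inner[k] === "/" || inner[k] === "?") out += inner[k++];
  let start = k;
  while (k < inner.length && !/[\s\/?]/.test(inner[k])) k++;
  out += span("tag", inner.slice(start, k));
  while (k < inner.length) {
    const ch = inner[k];
    if (ch === '"' || ch === "'") {
      // Attribute value: everything through the matching quote.
      let end = inner.indexOf(ch, k + 1);
      if (end === -1) end = inner.length - 1;
      out += span("val", inner.slice(k, end + 1));
      k = end + 1;
    } else if (/[\s=\/?]/.test(ch)) {
      out += ch; // whitespace and punctuation pass through unchanged
      k++;
    } else {
      // Attribute name: up to "=", whitespace, "/" or "?".
      start = k;
      while (k < inner.length && !/[\s=\/?]/.test(inner[k])) k++;
      out += span("attr", inner.slice(start, k));
    }
  }
  return out + "&gt;";
}

// Scan the XML once, alternating between text runs and tags.
function highlightXml(xml) {
  let out = "";
  let i = 0;
  while (i < xml.length) {
    if (xml[i] !== "<") {
      let j = xml.indexOf("<", i);
      if (j === -1) j = xml.length;
      out += esc(xml.slice(i, j)); // text run
      i = j;
      continue;
    }
    const j = xml.indexOf(">", i);
    if (j === -1) { out += esc(xml.slice(i)); break; } // unterminated tag
    out += highlightTag(xml.slice(i + 1, j));
    i = j + 1;
  }
  return out;
}
```

Because every input character is consumed exactly once and the generated <span> markup is only ever appended to the output, the inserted markup can never be matched again.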

Embedded Sample DOCX

The sample DOCX is Base64-encoded and embedded within the JavaScript. It is decoded with atob(), converted to a File object, and then passed through the same processing as file selection, enabling sample loading and downloading without a server.
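
The decoding step might look like the sketch below. The function names, the name parameter, and the MIME type default are choices made for this example; the article itself only states that atob() and a File object are used.

```javascript
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// Decode a Base64 string into raw bytes.
function base64ToBytes(base64) {
  const binary = atob(base64); // each character holds one byte value
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Wrap the bytes in a File so they can go through the same code path
// as a file chosen by the user.
function base64ToFile(base64, name, type = DOCX_MIME) {
  return new File([base64ToBytes(base64)], name, { type });
}
```

A DOCX file is a ZIP archive, so its Base64 form starts with "UEsDB", the encoding of the ZIP signature bytes "PK" 0x03 0x04. This is a quick way to confirm that the embedded string is intact.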

About the TEI Garage API

TEI Garage supports conversion between many formats beyond DOCX.

| Input Format | Output Format (Examples) |
| --- | --- |
| DOCX, ODT, Markdown, RTF | TEI P5 XML |
| TEI P5 XML | HTML, LaTeX, ePub, PDF, DOCX |

You can also specify conversion profiles (default, enrich, iso, etc.) and language as API properties. However, for typical DOCX to TEI conversion, the default profile is sufficient.

Conclusion

By leveraging the TEI Garage API, a DOCX to TEI/XML conversion tool was built without any server-side implementation. Since it is a single HTML file with no dependencies, it is easy to host on your own server or use locally.

Even if you are not familiar with TEI/XML, try the sample first and look at the conversion result. It shows how a document written in Word maps onto a TEI structure.

https://tei-converter.pages.dev/