Introduction
TEI (Text Encoding Initiative) is an international standard for digitally structuring texts in the humanities. It is used in libraries, museums, and academic research, but writing TEI/XML directly requires knowledge of markup, making the barrier to entry high.
This is where tools that convert Microsoft Word (.docx) files to TEI/XML come in. A well-known example is TEI Garage (formerly OxGarage), but because it is a general-purpose converter, its UI is somewhat complex. For this post, I built a simple browser-based tool dedicated to DOCX to TEI/XML conversion.
Repository: https://github.com/nakamura196/tei-converter
Demo site: https://tei-converter.pages.dev/
How It Works
TEI Garage provides a public REST API: POST a DOCX file to its conversion endpoint, and TEI/XML is returned in the response.
This tool is a frontend that calls this API. The actual conversion processing is performed on the server operated by the TEI Consortium.
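As a rough sketch of what such a frontend call could look like, the snippet below posts a file as multipart form data with fetch(). The helper names, the `endpoint` argument, and the `fileToConvert` field name are assumptions made for illustration and are not taken from the tool's source; the actual endpoint URL and field name should be checked against the TEI Garage documentation.

```javascript
// Hypothetical helper: builds the fetch() arguments for a conversion request.
// `endpoint` and `fieldName` are assumptions, not confirmed TEI Garage values.
function buildConversionRequest(file, endpoint, fieldName = "fileToConvert") {
  const body = new FormData();
  // The third argument is the file name sent in the multipart payload.
  body.append(fieldName, file, file.name || "input.docx");
  return { url: endpoint, options: { method: "POST", body } };
}

// Sends the request and resolves with the response body as a string.
// `fetchImpl` can be swapped out, for example in tests.
async function convertDocxToTei(file, endpoint, fetchImpl = fetch) {
  const { url, options } = buildConversionRequest(file, endpoint);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Conversion failed: HTTP ${response.status}`);
  }
  return response.text();
}
```

Leaving the Content-Type header unset is deliberate: the browser adds the multipart boundary itself when the body is a FormData object.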
Key Features
Drag & Drop
Simply drag and drop a .docx file into the browser to upload it. You can also click to select a file.
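For readers curious how drag and drop is typically wired up in plain JavaScript, here is a minimal sketch. The function names and the `onFile` callback are hypothetical, and the extension check is an assumption about how one might filter files, not necessarily what the tool does.

```javascript
// Returns the first .docx file from a list of File-like objects, or null.
function pickDocx(files) {
  for (const f of Array.from(files)) {
    if (f.name.toLowerCase().endsWith(".docx")) return f;
  }
  return null;
}

// Browser-only: turns an element into a drop zone.
// `onFile` is a hypothetical callback that receives the dropped File.
function enableDrop(zone, onFile) {
  // Without preventDefault() on dragover, the drop event never fires.
  zone.addEventListener("dragover", (e) => e.preventDefault());
  zone.addEventListener("drop", (e) => {
    e.preventDefault(); // stop the browser from opening the file itself
    const file = pickDocx(e.dataTransfer.files);
    if (file) onFile(file);
  });
}
```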
Formatted XML Preview
The conversion result is formatted with indentation and displayed with syntax highlighting. Tag names, attribute names, and attribute values are each color-coded, making it easy to understand the structure.
Copy & Download
You can copy the result to the clipboard with one click or download it as an .xml file.
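Both actions can be done with standard browser APIs. The sketch below assumes the download is named after the uploaded file, which is my own illustration rather than confirmed behavior of the tool; all function names are hypothetical.

```javascript
// Derives the download name from the upload name: "report.docx" -> "report.xml".
function xmlFileName(docxName) {
  return docxName.replace(/\.docx$/i, "") + ".xml";
}

// Browser-only: copies text with the asynchronous Clipboard API.
async function copyXml(xml) {
  await navigator.clipboard.writeText(xml);
}

// Browser-only: triggers a download through a temporary object URL.
function downloadXml(xml, docxName) {
  const blob = new Blob([xml], { type: "application/xml" });
  const url = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = url;
  a.download = xmlFileName(docxName);
  a.click();
  URL.revokeObjectURL(url); // release the blob once the download has started
}
```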
Built-in Sample DOCX
Just click “Try with sample .docx” to verify the behavior with a built-in sample file. The sample can also be downloaded.
Japanese/English UI Toggle
The EN / JA buttons next to the title allow switching the UI language between Japanese and English.
Technical Details
Single HTML File
No build tools, frameworks, or external libraries are used. All HTML, CSS, and JavaScript are contained in a single index.html file, so the tool works simply by opening that file locally (an internet connection is still needed to reach the TEI Garage API).
XML Pretty Print
The XML returned by the TEI Garage API may not include indentation. To address this, the browser’s DOMParser is used to parse the XML and recursively serialize it for formatting.
For mixed content (where text and elements are interleaved, e.g., <p>text <hi>bold</hi> text</p>), adding indentation would insert whitespace into the text itself and could change its meaning, so such elements are output inline.
```javascript
// Mixed content detection
let hasElement = false, hasText = false;
for (const c of children) {
  if (c.nodeType === Node.ELEMENT_NODE) hasElement = true;
  if (c.nodeType === Node.TEXT_NODE && c.textContent.trim()) hasText = true;
}
if (hasElement && hasText) {
  // Output inline (no indentation)
}
```
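To show how the detection above could fit into a complete recursive formatter, here is a simplified sketch. It is my own reconstruction, not the tool's actual code: the function names are hypothetical, it also keeps text-only elements on one line, and it ignores comments, CDATA sections, and processing instructions.

```javascript
const ELEMENT = 1, TEXT = 3; // values of Node.ELEMENT_NODE and Node.TEXT_NODE

const escapeText = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
const escapeAttr = (s) => escapeText(s).replace(/"/g, "&quot;");

// Builds the opening tag without its closing ">" or "/>".
function openTag(el) {
  const attrs = Array.from(el.attributes || [])
    .map((a) => ` ${a.name}="${escapeAttr(a.value)}"`)
    .join("");
  return `<${el.nodeName}${attrs}`;
}

// Serializes a subtree on one line, keeping its whitespace as parsed.
function inline(node) {
  if (node.nodeType === TEXT) return escapeText(node.textContent);
  if (node.nodeType !== ELEMENT) return "";
  const kids = Array.from(node.childNodes);
  if (kids.length === 0) return `${openTag(node)}/>`;
  return `${openTag(node)}>${kids.map(inline).join("")}</${node.nodeName}>`;
}

// Pretty-prints an element with two-space indentation.
function format(el, depth = 0) {
  const pad = "  ".repeat(depth);
  // Drop whitespace-only text nodes; they are just old formatting.
  const kids = Array.from(el.childNodes).filter(
    (c) => c.nodeType === ELEMENT || (c.nodeType === TEXT && c.textContent.trim())
  );
  if (kids.length === 0) return `${pad}${openTag(el)}/>`;
  // Any non-blank text means text-only or mixed content: keep it inline.
  if (kids.some((c) => c.nodeType === TEXT)) return pad + inline(el);
  const body = kids.map((c) => format(c, depth + 1)).join("\n");
  return `${pad}${openTag(el)}>\n${body}\n${pad}</${el.nodeName}>`;
}
```

In a browser, the entry point would be something like `format(new DOMParser().parseFromString(xml, "application/xml").documentElement)`.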
Syntax Highlighting
Syntax highlighting is implemented with a single-pass tokenizer without external libraries. The XML is scanned character by character to separate tags and text, and the tag portions are further decomposed into tag names, attribute names, and attribute values, then colored with HTML <span> elements.
A multi-pass approach using regular expressions was avoided because the markup inserted by earlier passes (for example, the class="tag" attribute of an inserted <span>) would be matched again by the regular expressions in later passes.
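The idea can be sketched roughly as follows. This is an illustration of the single-pass technique under my own assumptions, not the tool's source: the function names and the `attr-name` / `attr-value` class names are hypothetical, and comments, CDATA sections, and DOCTYPE declarations get no special handling.

```javascript
const esc = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
const span = (cls, s) => (s ? `<span class="${cls}">${esc(s)}</span>` : "");

// Highlights the inside of one tag, e.g. 'hi rend="bold"' or '/p'.
function highlightTag(inner) {
  let i = 0;
  // Consumes characters while `test` holds and returns them.
  const take = (test) => {
    const start = i;
    while (i < inner.length && test(inner[i])) i++;
    return inner.slice(start, i);
  };
  const isSpace = (c) => /\s/.test(c);
  const isName = (c) => !isSpace(c) && !"=/?".includes(c);

  let out = esc(take((c) => "/?!".includes(c))); // leading "/", "?" or "!"
  out += span("tag", take(isName));
  while (i < inner.length) {
    out += take(isSpace);
    out += span("attr-name", take(isName));
    if (i >= inner.length) break;
    const c = inner[i];
    if (c === "=") {
      out += "=";
      i++;
      const quote = inner[i];
      if (quote === '"' || quote === "'") {
        const end = inner.indexOf(quote, i + 1);
        const stop = end === -1 ? inner.length : end + 1;
        out += span("attr-value", inner.slice(i, stop));
        i = stop;
      }
    } else if (!isSpace(c)) {
      out += esc(c); // stray "/" or "?", as in <lb/>
      i++;
    }
  }
  return out;
}

// Scans the XML once, left to right, and returns highlighted HTML.
function highlight(xml) {
  let out = "", i = 0;
  while (i < xml.length) {
    if (xml[i] !== "<") {
      // Text run up to the next tag.
      let j = xml.indexOf("<", i);
      if (j === -1) j = xml.length;
      out += esc(xml.slice(i, j));
      i = j;
      continue;
    }
    // Find the closing ">" while skipping over quoted attribute values.
    let j = i + 1, quote = null;
    while (j < xml.length && (quote || xml[j] !== ">")) {
      const c = xml[j];
      if (quote) { if (c === quote) quote = null; }
      else if (c === '"' || c === "'") quote = c;
      j++;
    }
    out += "&lt;" + highlightTag(xml.slice(i + 1, j)) + (j < xml.length ? "&gt;" : "");
    i = j + 1;
  }
  return out;
}
```

Because every input character is escaped and emitted exactly once, the inserted `<span>` markup is never fed back into the scanner, which is the property the multi-pass regex approach lacks.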
Embedded Sample DOCX
The sample DOCX is Base64-encoded and embedded within the JavaScript. It is decoded with atob(), converted to a File object, and then passed through the same processing as file selection, enabling sample loading and downloading without a server.
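A minimal sketch of that decoding step might look like this. The function names and the default file name `sample.docx` are hypothetical; the MIME type is the standard one for .docx files.

```javascript
// Decodes a Base64 string into raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64); // "binary string": one character per byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// Wraps the bytes in a File so the sample can follow the normal upload path.
function sampleFile(b64, name = "sample.docx") {
  return new File([base64ToBytes(b64)], name, { type: DOCX_MIME });
}
```

Going through a Uint8Array matters here: passing the string returned by atob() straight to the File constructor would re-encode it as UTF-8 and corrupt any byte above 0x7F.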
About the TEI Garage API
TEI Garage supports conversion between many formats beyond DOCX.
| Input Format | Output Format (Examples) |
|---|---|
| DOCX, ODT, Markdown, RTF | TEI P5 XML |
| TEI P5 XML | HTML, LaTeX, ePub, PDF, DOCX |
You can also specify conversion profiles (default, enrich, iso, etc.) and language as API properties. However, for typical DOCX to TEI conversion, the default profile is sufficient.
Conclusion
By leveraging the TEI Garage API, a DOCX to TEI/XML conversion tool was built without any server-side implementation. Since it is a single HTML file with no dependencies, it is easy to host on your own server or use locally.
Even if you are not familiar with TEI/XML, try the built-in sample first and look at the conversion result. It gives a concrete sense of how a document written in Word maps onto a TEI structure.