Challenges and Solutions for Preserving Order in PDF Transparent Text Extraction

Introduction When extracting the transparent text layer from PDF files, I encountered the problem of “the text order being different from the original PDF.” This article explains the cause of this problem and solutions in both JavaScript and Python. There may be some inaccuracies, but I hope it serves as a useful reference. What Is PDF Transparent Text? The transparent text layer of a PDF is searchable text information embedded within a PDF file. OCR-processed PDFs and digitally generated PDFs contain this transparent text layer, enabling the following features: ...

September 10, 2025 · 26 min · Nakamura

Creating PDFs from TEI/XML of the Koui Genji Monogatari Text Database

Overview The Koui Genji Monogatari (Collated Tale of Genji) Text Database publishes text data from “Koui Genji Monogatari.” https://kouigenjimonogatari.github.io/ This time, I added PDF files like the following to the database. https://kouigenjimonogatari.github.io/output/01/main.pdf This article describes how to create such PDF files using XSLT and TeX. Cloning the Repository Clone the repository as follows. git clone --depth 1 https://github.com/kouigenjimonogatari/kouigenjimonogatari.github.io Then install xslt3 with the following command. ...

January 14, 2025 · 9 min · Nakamura

Creating a Transparent Text PDF from a Single Page Using Google Cloud Vision API

Overview I had the opportunity to create a transparent text PDF from a PDF using Google Cloud Vision API, so this is a personal note for future reference. Below is an example of searching for simple. Background This time, we target PDFs consisting of a single page. Procedure Creating the Image Create an image to be used as the OCR target. With the default settings, the resulting image was blurry, so I set the resolution to 2x and performed position alignment considering the resolution in the process described below. ...

November 2, 2024 · 10 min · Nakamura

Trying Out File Information Tool Set (FITS)

Overview While investigating Archivematica, there were aspects of File Information Tool Set (FITS) behavior I wanted to verify, so I tried it using Docker. This is a memo of that process. https://github.com/harvard-lts/fits Installation The installation method using Docker is described at the following page. https://github.com/harvard-lts/fits?tab=readme-ov-file#docker-installation However, when accessing the following page mentioned in the manual, the latest release (1.6.0) that includes the Dockerfile could not be downloaded. https://projects.iq.harvard.edu/fits/downloads Instead, the latest zip file could be downloaded from the following GitHub releases page. ...

January 26, 2024 · 30 min · Nakamura

Creating PDF Files from IIIF Manifest Files

Overview I had the opportunity to create PDF files from IIIF manifest files. As a solution, I found the following repository, but was unable to get it working. https://github.com/jbaiter/pdiiif While the above repository uses JavaScript, this time I created a conversion tool using Python. Usage You can try it from the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/iiif2pdf.ipynb During the initial installation, img2pdf is installed, but due to PIL version dependencies, a “RESTART RUNTIME” button will appear. Please click it and then re-run the same cell. ...

May 26, 2023 · 1 min · Nakamura

Converting TEI XML to LaTeX Using TEI Critical Apparatus Toolbox

Overview TEI Critical Apparatus Toolbox is “a tool for people preparing a natively digital TEI critical edition.” http://teicat.huma-num.fr/index.php In addition to providing functionality for visualizing critical apparatus information, it offers several other useful features. Among these, I learned that it has a “TEI to LaTeX and PDF conversion” feature, so I decided to try it out. Print an edition Access the following URL. http://teicat.huma-num.fr/print.php Click the link with the text this dummy edition file to download the following sample data. ...

April 19, 2023 · 1 min · Nakamura

An Example Method for Converting TEI/XML Files to Vertical-Writing PDF

Overview This is a memo documenting one example method for converting TEI/XML files to vertical-writing (tategaki) PDF. You can try the program targeting “Koui Genji Monogatari” (Collated Tale of Genji) in the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/TEI_XMLファイルを縦書きPDFに変換する.ipynb Conversion Workflow This time, I used Quarto. https://quarto.org/ Please refer to the following for installation instructions. https://quarto.org/docs/get-started/ TEI/XML -> qmd First, convert the contents of the TEI/XML file to a qmd file. Below is a sample conversion script. ...

October 3, 2022 · 8 min · Nakamura