OCR | Digital Archive Systems Tech Blog

BDRC Tibetan OCR: Introduction and Implementation Examples of a Tibetan OCR Tool

Introduction
Digitizing Tibetan manuscripts is a major challenge in digital humanities. Precious Buddhist scriptures and historical documents are preserved in libraries around the world, but many have not yet been converted to text data. Manual transcription requires enormous time and cost, and few researchers have the necessary expertise. This article introduces BDRC Tibetan OCR, an open-source Tibetan OCR system developed by the Buddhist Digital Resource Center (BDRC). ...

Azure OpenAI GPT-4 vs Document Intelligence: Comparative Evaluation of Japanese Vertical Text OCR

Overview
We ran OCR on Japanese vertical-writing manuscript paper using two OCR services provided by Microsoft Azure (Azure OpenAI GPT-4 Vision and Azure Document Intelligence) and compared the results in detail.
Test Image
Image Source: Canva template (400-character manuscript paper)
URL: https://www.canva.com/ja_jp/templates/EAFbqUoH7P8/
Image Characteristics: 20x20 grid (400-character manuscript paper); vertical writing layout; light grid lines (squares); distinction between title section and body section
Ground Truth
原佐原こ稿藤稿ののち用テタあ紙キイきにスト書トルくをテ使キ用スすトるが場入合りはま、す日。本作語文のや全小角論を文使をう作こっとたでりマ、ス小に説あをっ書たい文た字りをな打どつにこごと活が用でくきだまさすい。。手書きで使用したい場合は、このテキストを削除し、印刷してご使用ください。
1. Recognition Results with Azure OpenAI GPT-4.1
Recognized Text
原佐原こ稿藤稿のの用テタち紙キイあにストき書トルくをテ使キ用スすトるが場入合りはま、す日。本作語文のや全小角論を文使をう作こっとたでりマ、ス小に説あをっ書たい文た字りをな打どつにこごと活が用でくきだまさすい。。手書きで使用したい場合は、このテキストを削除し、印刷してご使用ください。
Evaluation
GPT-4.1 demonstrated the following characteristics for vertical writing manuscript paper: ...

LLM-Based Manuscript Paper OCR Performance Comparison: Verification of Vertical Japanese Recognition Accuracy

Introduction
In this article, we compare the OCR performance of major LLMs on an actual manuscript paper image. While many OCR benchmarks target printed documents and horizontally written text, we evaluate recognition accuracy on the special format of Japanese vertical manuscript paper to verify each model’s Japanese document understanding in a more practical setting.
Features of This Verification
Using the uniquely Japanese manuscript paper format: verification with images containing complex elements such as characters placed in grid cells, vertical writing layout, and distinctive margin composition
Assuming practical use cases: performance evaluation on manuscript paper used in actual writing scenarios such as essays, novels, and academic papers
Comprehensive comparison of the latest models: GPT-5, GPT-4.1, Gemini 2.5 Pro, Claude Opus 4.1, and Claude Sonnet 4, compared under identical conditions
Verification Overview
Image Used
Image source: Canva template (400-character manuscript paper)
URL: https://www.canva.com/ja_jp/templates/EAFbqUoH7P8/
Image characteristics: 20x20 grid (400-character manuscript paper); vertical writing layout; faint grid lines (cells); distinction between title area and body area ...

Challenges and Solutions for Preserving Order in PDF Transparent Text Extraction

Introduction
When extracting the transparent text layer from PDF files, I found that the extracted text came out in a different order from the original PDF. This article explains the cause of this problem and presents solutions in both JavaScript and Python. There may be some inaccuracies, but I hope it serves as a useful reference.
What Is PDF Transparent Text?
The transparent text layer of a PDF is searchable text information embedded within the PDF file. OCR-processed PDFs and digitally generated PDFs contain this layer, enabling the following features: ...
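The excerpt above does not show the article's own solutions, so the following is only a minimal Python sketch of one common way to restore reading order: sort extracted fragments by their position on the page. It assumes horizontal text and PDF-style coordinates (origin at the bottom-left, y growing upward). The `TextItem` type, the `reading_order` function, and the `line_tolerance` parameter are hypothetical names introduced for illustration, not part of any PDF library.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """One extracted text fragment (hypothetical structure for this sketch)."""
    text: str
    x: float  # left edge in PDF points
    y: float  # baseline in PDF points (origin bottom-left, y grows upward)


def reading_order(items, line_tolerance=2.0):
    """Return items sorted top-to-bottom, then left-to-right within a line.

    Fragments whose y differs from the first fragment of the current line
    by at most `line_tolerance` points are treated as the same line.
    """
    # Larger y means higher on the page, so sort by descending y first.
    ordered = sorted(items, key=lambda item: -item.y)
    lines = []
    for item in ordered:
        if lines and abs(lines[-1][0].y - item.y) <= line_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Within each line, sort by x so text reads left to right.
    return [item for line in lines for item in sorted(line, key=lambda i: i.x)]


if __name__ == "__main__":
    fragments = [
        TextItem("world", 60, 700.5),
        TextItem("second", 10, 680),
        TextItem("Hello", 10, 700),
    ]
    print(" ".join(item.text for item in reading_order(fragments)))
```

Vertical Japanese text would need the opposite axis priority (columns right-to-left, then top-to-bottom), so the sort keys here are an assumption that fits horizontal layouts only.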

TEI ODD File Customization: A Case Study with NDL Classical Book OCR

Introduction
TEI (Text Encoding Initiative) is an international standard for digitizing and sharing texts in humanities research. This article introduces the process of customizing a TEI ODD file to match the output format of the NDL Classical Book OCR-Lite application. ODD (One Document Does it all) is a mechanism for customizing TEI schemas, allowing you to define your own schema containing only the necessary elements and attributes.
Background: Development of the NDL Classical Book OCR-Lite Application
I am creating an application that outputs NDL Classical Book OCR-Lite results in TEI/XML. The purpose of this application is to perform OCR on Japanese classical books and output the results in standard TEI format. ...

A Scalable OCR Processing System Using NDL Classical Japanese OCR Lite on Azure Container Apps

Important Usage Notice
The system introduced in this article may place load on external servers. Please exercise caution when using it.
Server Load: Parallel requests place load on target servers.
DoS Attack Risk: Large numbers of simultaneous accesses may be mistaken for DoS attacks.
Recommended Approach: Download images locally in advance and run only the OCR processing in parallel.
Check Terms of Use: Always check the terms of use for target servers and obtain prior permission if necessary.
Appropriate Rate Limiting: For production use, conservative concurrency settings (around 5-10 parallel) are strongly recommended.
Responsible Usage: Be considerate of server administrators and other users.
This article is a record of a technical proof of concept. We ask readers to use it responsibly. ...
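As an illustration of the conservative concurrency setting recommended above, here is a minimal Python sketch that caps the number of parallel workers and optionally pauses after each task. It is not the system described in the article. `run_with_limit`, `max_workers`, and `delay_seconds` are hypothetical names, and the `task` callable stands in for whatever per-image OCR step is being run.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(task, inputs, max_workers=5, delay_seconds=0.0):
    """Apply `task` to every input using at most `max_workers` threads.

    `delay_seconds` adds a pause after each task so that a worker does not
    immediately issue its next request. Results keep the input order.
    """
    def wrapped(item):
        result = task(item)
        if delay_seconds > 0:
            time.sleep(delay_seconds)
        return result

    # The pool size is the upper bound on simultaneous tasks.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(wrapped, inputs))


if __name__ == "__main__":
    # Placeholder task; a real one would process a locally downloaded image.
    print(run_with_limit(lambda n: n * 2, [1, 2, 3], max_workers=2))
```

Keeping `max_workers` in the 5-10 range mentioned in the notice, and running it against locally downloaded images, keeps the parallelism away from the external server.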

Trying DToC: Dynamic Table of Contexts

Overview
These are my notes from trying DToC: Dynamic Table of Contexts.
https://www.leaf-vre.org/docs/features/dtoc
The machine-translated description is as follows: It brings innovation to electronic reading by combining the power of semantic markup with book navigation features. The traditional overview functions of printed books (the table of contents and keyword index) are dynamically integrated with full-text search and tag-based indexing features, creating a new reading experience. ...

Creating TEI/XML Files from IIIF Manifest Files Using NDL Kotenseki OCR-Lite

Overview
This article introduces a Gradio app that creates TEI/XML files from IIIF manifest files using NDL Kotenseki OCR-Lite. It can be accessed at the following URL: https://nakamura196-ndlkotenocr-lite-iiif.hf.space/
Background
This is a continuation of the following articles: Previously, two separate apps were needed, but with this update, the entire conversion process can be completed within a single Gradio app. In addition, two issues have been fixed: progress was hard to track when processing manifest files with many image pages, and processing results could not be copied. ...

Part 2: Creating Annotated IIIF Manifest Files and TEI/XML Files Using NDL Classical Book OCR-Lite

Overview
In the following article, I introduced how to create annotated IIIF manifest files and TEI/XML files using NDL Classical Book OCR-Lite. Since that explanation was insufficient in many areas, I will explain the usage again here.
Supplement
While writing this article, I made the following improvements.
Process 1: Creating IIIF Manifest Files
Added support for IIIF Presentation API v3.
Process 2: Creating TEI/XML Files
Added a form that accepts string input, to make it easier to connect with Process 1.
Usage
Process 1: Creating IIIF Manifest Files
Access the following. ...

A Program to Create TEI/XML Files with OCR Results from IIIF Manifest Files

Overview
I created a program to generate TEI/XML files containing OCR results from IIIF manifest files. This article explains how to use it.
How It Works
Given the URL of an IIIF manifest file, it creates a TEI/XML file containing OCR results from NDL Kotenseki OCR-Lite.
https://github.com/ndl-lab/ndlkotenocr-lite
Usage
Access the following notebook:
https://colab.research.google.com/github/nakamura196/000_tools/blob/main/IIIFマニフェストファイルからTEI_XMLファイルを作成するプログラム.ipynb
Then press the first play button. Once complete, update the manifest_url and output_dir values in the “Execute” section and run the cell. ...
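The first step of such a pipeline is collecting the image URLs listed in the manifest so that each page can be passed to OCR. The notebook's actual code is not shown in this excerpt, so the following is a minimal sketch. It assumes the manifest has already been loaded into a Python dict and follows the standard IIIF Presentation API 2 (`sequences` / `canvases` / `images`) or 3 (`items` nesting) layout. The function name `image_urls_from_manifest` is hypothetical.

```python
def image_urls_from_manifest(manifest):
    """Collect image resource URLs from a IIIF Presentation v2 or v3 manifest dict."""
    urls = []
    if "sequences" in manifest:
        # Presentation API 2: sequences -> canvases -> images -> resource["@id"]
        for sequence in manifest.get("sequences", []):
            for canvas in sequence.get("canvases", []):
                for image in canvas.get("images", []):
                    resource = image.get("resource", {})
                    if "@id" in resource:
                        urls.append(resource["@id"])
    else:
        # Presentation API 3: items (canvases) -> items (annotation pages)
        # -> items (annotations) -> body["id"]
        for canvas in manifest.get("items", []):
            for page in canvas.get("items", []):
                for annotation in page.get("items", []):
                    body = annotation.get("body", {})
                    if isinstance(body, dict) and "id" in body:
                        urls.append(body["id"])
    return urls


if __name__ == "__main__":
    # Tiny hand-written v3 manifest with a placeholder URL.
    demo = {"items": [{"items": [{"items": [
        {"body": {"id": "https://example.org/page1.jpg", "type": "Image"}}
    ]}]}]}
    print(image_urls_from_manifest(demo))
```

In practice the dict would come from fetching the manifest URL and parsing the JSON, for example with `urllib.request.urlopen` and `json.load`.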

Created a Similar Text Search App for the Koui Genji Monogatari

Overview
I created a similar text search app for the Koui Genji Monogatari. You can try it at the following URL.
https://huggingface.co/spaces/nakamura196/genji_predict
This article introduces how to use the app.
Data
The app uses the text data published in the following Koui Genji Monogatari DB.
https://kouigenjimonogatari.github.io/
How the App Works
The mechanism is simple: text for each volume and page of the Koui Genji Monogatari is prepared in advance, the edit distance from the input string is calculated, and texts (along with volume and page numbers) with high similarity are returned. ...
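The mechanism described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the app's code: `edit_distance` is a plain Levenshtein distance, and the similarity score (one minus the distance divided by the longer length) is an assumed normalization. The page texts in the example are ASCII placeholders rather than actual Koui Genji Monogatari text.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def search(query, pages, top_k=3):
    """Return up to top_k (similarity, volume, page, text) tuples, best first.

    `pages` maps (volume, page) to the text of that page.
    """
    scored = []
    for (volume, page), text in pages.items():
        distance = edit_distance(query, text)
        # Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
        similarity = 1 - distance / max(len(query), len(text), 1)
        scored.append((similarity, volume, page, text))
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    pages = {(1, 1): "abcd", (1, 2): "abxd", (2, 1): "zzzz"}  # placeholder texts
    for similarity, volume, page, text in search("abcd", pages):
        print(volume, page, round(similarity, 2), text)
```

Because the distance is computed against each page's full text, a short query will score low against long pages. Comparing against shorter windows of text would be one way to handle that, but the excerpt does not say how the app deals with it.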

Building an NDLOCR Gradio App Using Azure Virtual Machines

Overview
In the following article, I introduced a Gradio app using Azure virtual machines and NDLOCR. This article provides notes on how to build this app.
Building the Virtual Machine
To use a GPU, I first had to request a quota increase. Once it was granted, I used “NC8as_T4_v3” for this project.
Building the Docker Environment
The following article was used as a reference.
https://zenn.dev/koki_algebra/scraps/32ba86a3f867a4
Disabling Secure Boot
The following is stated: ...

Created a Gradio App to Try ndlocr_cli (NDLOCR ver.2.1) Application

Overview
I created a Gradio app that allows you to try the ndlocr_cli (NDLOCR ver.2.1) application. Please try it at the following URL.
https://ndlocr.aws.ldas.jp/
Notes
Currently, only single image uploads are supported. I plan to add options such as PDF upload in the future. The app runs on the “NVIDIA Tesla T4 GPU” installed in the “NC8as_T4_v3” VM available on Azure.
Summary
I’m not sure how long I can continue providing this in its current form, but I hope it will be useful for verifying the accuracy of the ndlocr_cli (NDLOCR ver.2.1) application. ...

Building a Gradio App Using NDL Kotenseki OCR-Lite

Overview
I built a Gradio app using NDL Kotenseki OCR-Lite. You can try it at the following URL.
https://huggingface.co/spaces/nakamura196/ndlkotenocr-lite
“NDL Kotenseki OCR-Lite” is distributed with a desktop application, so it can already be run without a web app like Gradio. The intended use cases for this web app are therefore usage from smartphones or tablets, and integration via web API.
Development Notes and Bug Fixes
Using Submodules
The original ndlkotenocr-lite was introduced as a submodule. ...

Using NDL Classical Book OCR-Lite (ndlkotenocr-lite) on Mac OS

Overview
On November 26, 2024, NDL Lab released NDL Classical Book OCR-Lite.
https://lab.ndl.go.jp/news/2024/2024-11-26/
This article introduces how to use it on Mac OS.
Usage (Video)
https://www.youtube.com/watch?v=NYv93sJ6WLU
Usage (Text)
Access the following.
https://github.com/ndl-lab/ndlkotenocr-lite/releases/tag/1.0.0
From the list, select a file whose name contains “macos” and matches your chip. Clicking the link downloads “ndlkotenocr-lite_v1.0.0_macos_m1.tar.gz” as shown below. Double-click the archive to extract it; the application “NDLkotenOCR-Lite” appears inside a macos folder. ...

Creating a Transparent Text PDF from a Single Page Using Google Cloud Vision API

Overview
I had the opportunity to create a transparent text PDF from a PDF using Google Cloud Vision API, so this is a personal note for future reference. Below is an example of searching for the word “simple”.
Background
This time, we target PDFs consisting of a single page.
Procedure
Creating the Image
Create an image to be used as the OCR target. With the default settings, the resulting image was blurry, so I rendered it at 2x resolution and accounted for that scale factor when aligning positions in the process described below. ...
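The position alignment mentioned above comes down to dividing OCR pixel coordinates by the rendering scale before placing text on the PDF page. The article's own code is not shown in this excerpt, so this is a minimal sketch under stated assumptions: the image was rendered so that 1 pixel equals 1 PDF point at 1x, OCR bounding boxes use a top-left origin, and the target coordinates use the PDF default of a bottom-left origin. The function name `pixel_box_to_pdf` and its parameters are hypothetical.

```python
def pixel_box_to_pdf(box_px, page_height_pt, scale=2.0):
    """Convert an OCR bounding box in image pixels to PDF points.

    box_px is (x0, y0, x1, y1) with the origin at the top-left of the image.
    Returns (left, bottom, right, top) with the origin at the bottom-left
    of the page, as in the PDF default coordinate system.
    """
    x0, y0, x1, y1 = box_px
    # Undo the rendering scale (for example, 2x) to get back to points.
    left = x0 / scale
    right = x1 / scale
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    top = page_height_pt - y0 / scale
    bottom = page_height_pt - y1 / scale
    return (left, bottom, right, top)


if __name__ == "__main__":
    # A word detected at pixels (100, 200)-(300, 260) on a 2x render of an A4 page.
    print(pixel_box_to_pdf((100, 200, 300, 260), page_height_pt=842.0, scale=2.0))
```

Some PDF libraries use a top-left origin for their own coordinates, in which case the vertical flip should be dropped and only the division by `scale` kept.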

Mirador Repository with Vertical Text Support for the Text Overlay Plugin

Overview
I have updated the Mirador repository with the Text Overlay plugin that supports vertical text.
https://github.com/nakamura196/mirador-integration-textoverlay
References
The original Text Overlay plugin repository is below.
https://github.com/dbmdz/mirador-textoverlay
Demo
You can check the behavior on the following page.
https://nakamura196.github.io/mirador-integration-textoverlay/
Press the “Text visible” button in the upper right to display the text. If it remains in a loading state, try reloading the page.
References
The Text Overlay plugin was added to Mirador 3 using the method introduced in the following article. ...

Applying Google Cloud Vision to Image Files to Create IIIF Manifests and TEI/XML Files

Overview
I created a library that applies Google Cloud Vision to image files and generates IIIF manifest and TEI/XML files.
https://github.com/nakamura196/iiif_tei_py
This article explains how to use the library.
Usage
Usage and other details are available on the following page.
https://nakamura196.github.io/iiif_tei_py/
Installing the Library
Install the library from the GitHub repository.
    pip install https://github.com/nakamura196/iiif_tei_py
Creating a GC Service Account
Download a GC (Google Cloud) service account key (JSON file) by referring to articles such as the following. ...

Handling Shared Memory Shortage When Running ndlocr_cli and Other Issues

Overview
This is a memo about issues I encountered when running ndlocr_cli (the NDLOCR (ver.2.1) application repository) and the steps taken to resolve them. Note that many of these issues were caused by my own configuration oversights or atypical usage, and are unlikely to occur during normal use. Please refer to this article if you encounter similar issues.
Shared Memory Shortage
When running ndlocr_cli, the following error occurred.
    Predicting: 0it [00:00, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    DataLoader worker (pid(s) 3999) exited unexpectedly
The response from ChatGPT was as follows. ...

Disk Space After Installing ndlocr_cli with Docker

Notes on disk space after installing ndlocr_cli with Docker. I set up ndlocr_cli by following the steps described in the following article. As shown below, approximately 50GB of space is used, so input/output image files and other data must fit in the remaining capacity. (The example below shows a case with 200GB of disk space allocated.)
    mdxuser@ubuntu-2204:~/ndlocr_cli$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    tmpfs           5.7G  1.4M  5.7G   1% /run
    /dev/sda2       196G   45G  143G  24% /
    tmpfs            29G     0   29G   0% /dev/shm
    tmpfs           5.0M     0  5.0M   0% /run/lock
    /dev/sda1       1.1G  6.1M  1.1G   1% /boot/efi
    tmpfs           5.7G  4.0K  5.7G   1% /run/user/1000
I hope this is helpful when specifying the virtual disk size (GB) when launching virtual machines on AWS (Amazon Web Services) or mdx (Data-Driven Society Creation Platform). ...