A Scalable OCR Processing System Using NDL Classical Japanese OCR Lite on Azure Container Apps

Important Usage Notice

The system introduced in this article may place load on external servers. Please exercise caution when using it.

Server Load: Parallel requests place load on target servers
DoS Attack Risk: Large numbers of simultaneous accesses may be mistaken for DoS attacks
Recommended Approach: Download images locally in advance and run only the OCR processing in parallel
Check Terms of Use: Always check the terms of use for target servers and obtain prior permission if necessary
Appropriate Rate Limiting: For production use, conservative concurrency settings (around 5 to 10 parallel requests) are strongly recommended
Responsible Usage: Be considerate of server administrators and other users

This article is a record of a technical proof of concept. We ask readers to use it responsibly.
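
As an illustration of the conservative concurrency recommended above, the following is a minimal sketch of a client-side cap on parallel calls. The function name `run_ocr_batch` and the `ocr_one` callback are hypothetical and not part of the original system; `ocr_one` stands in for whatever function sends one locally stored image to the OCR API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_ocr_batch(
    image_paths: Iterable[str],
    ocr_one: Callable[[str], str],
    max_parallel: int = 5,
) -> List[str]:
    """Apply ocr_one to each path with at most max_parallel calls in flight.

    Results are returned in the same order as image_paths.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    # The pool size acts as the concurrency cap: no more than
    # max_parallel calls to ocr_one run at the same time.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(ocr_one, image_paths))
```

Keeping the cap in one parameter makes it easy to start low (for example 5) and raise it only after confirming that the target service can tolerate the load.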

Introduction

This article presents a case study in building a scalable OCR processing system on Azure Container Apps using NDL Classical Japanese OCR Lite, developed by the National Diet Library (NDL). It explains the design and implementation of a system that achieves pay-as-you-go billing and auto-scaling through a cloud-native architecture.

System Overview

Architecture

Key Components

OCR Engine: NDL Classical Japanese OCR Lite (specialized for Japanese classical texts)
Infrastructure: Azure Container Apps (serverless containers)
API Design: REST API (image URL in, OCR result out)
Output Format: TEI P5 compliant XML
Scaling: Automatic scaling based on demand

Features of NDL Classical Japanese OCR Lite

OCR Optimized for Japanese Classical Texts

Vertical Layout Support: Handles the vertical text layout typical of classical texts
Reading Order Optimization: Follows the Japanese reading order, top to bottom within a column and columns from right to left
Classical Character Recognition: Support for cursive script and variant kana
Lightweight Implementation: Ready for cloud deployment through Docker containerization

Reasons for Choosing Azure Container Apps

Benefits of Serverless Containers

Cost Optimization

Pay-as-you-go: Charged only for what you use
0 Replicas: Scales to zero replicas, so no compute charges accrue while idle
Auto-scaling: Resource adjustment based on demand

System Implementation

Server-Side Implementation

Reading Order Algorithm
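
The right-to-left, top-to-bottom ordering of vertical text can be sketched as follows. This is a simplified illustration, not the algorithm used by NDL Classical Japanese OCR Lite: the `Box` type, the `reading_order` function, and the `column_tolerance` parameter are all assumptions introduced here. Boxes are grouped into columns by the horizontal position of their centers, columns are visited from right to left, and boxes within a column from top to bottom.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width
    h: float  # height
    text: str = ""

    @property
    def cx(self) -> float:
        """Horizontal center of the box."""
        return self.x + self.w / 2


def reading_order(boxes: List[Box], column_tolerance: float = 20.0) -> List[Box]:
    """Order boxes for vertical text: columns right to left, then top to bottom."""
    columns: List[List[Box]] = []
    # Visit boxes from the rightmost center to the leftmost.
    for box in sorted(boxes, key=lambda b: b.cx, reverse=True):
        # A box joins the current column when its center lies within
        # column_tolerance of the first box placed in that column.
        if columns and abs(columns[-1][0].cx - box.cx) <= column_tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered: List[Box] = []
    for column in columns:
        # Within one column, read from the top of the page downward.
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered
```

A fixed pixel tolerance is the simplest possible grouping rule; real pages with skewed columns or interlinear notes would likely need a more robust approach.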

TEI XML Output
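
To illustrate the kind of output involved, the following sketch builds a small TEI document from recognized lines using only the Python standard library. The function name `build_tei`, the placeholder header text, and the choice of `ab` with `lb` line breaks are assumptions made for this example; the actual element structure produced by the system is not shown in this article, and the result here has not been validated against the TEI P5 schema.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title: str, lines: List[str]) -> str:
    """Return a TEI document as a string, one lb element per OCR line."""
    # Serialize TEI elements in the default namespace (no prefix).
    ET.register_namespace("", TEI_NS)

    def el(parent: ET.Element, tag: str, text: Optional[str] = None) -> ET.Element:
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        if text is not None:
            node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    # Header with the three parts of fileDesc that TEI expects.
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Generated by OCR (placeholder).")
    el(el(file_desc, "sourceDesc"), "p", "Transcribed from a digitized image.")
    # Body: each recognized line follows its own line-break marker.
    block = el(el(el(root, "text"), "body"), "ab")
    for line in lines:
        el(block, "lb").tail = line
    return ET.tostring(root, encoding="unicode")
```

Calling `build_tei("Kiritsubo", recognized_lines)` with a list of strings returns a single XML string that can be written to a file or returned from an API.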

Processing Results Example

Small-Scale Test Processing (Kiritsubo)

Target: “Kiritsubo” held by the University of Tokyo
Pages: 32 pages
Processing Time: Approximately 30 seconds
Success Rate: 100%
Concurrency: 10 parallel requests
Cost: Approximately $0.05

Performance Characteristics

Technical Features of the System

1. Cold Start Handling
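
When the service has scaled to zero replicas, the first request may time out or fail while a container starts. One common way for a client to tolerate this is to retry with exponential backoff. The sketch below is an assumption about how this could be done, not a description of the original implementation; `call_with_retry` and its parameters are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on timeout or connection errors.

    The wait doubles after each failed attempt: base_delay, then
    2 * base_delay, then 4 * base_delay, and so on.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except (TimeoutError, ConnectionError):
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so that the waiting behavior can be replaced in tests; in normal use the default `time.sleep` applies.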

2. Externalized Configuration
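
Externalized configuration usually means reading settings from the environment rather than hard-coding them, which fits container platforms where values are injected at deploy time. The following is a minimal sketch under that assumption. The variable names `OCR_API_URL`, `OCR_MAX_PARALLEL`, and `OCR_TIMEOUT_SECONDS`, and their defaults, are hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_url: str
    max_parallel: int
    timeout_seconds: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    max_parallel = int(env.get("OCR_MAX_PARALLEL", "5"))
    if max_parallel < 1:
        raise ValueError("OCR_MAX_PARALLEL must be at least 1")
    return Settings(
        api_url=env.get("OCR_API_URL", "http://localhost:8000"),
        max_parallel=max_parallel,
        timeout_seconds=float(env.get("OCR_TIMEOUT_SECONDS", "60")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.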

3. Swagger UI Integration

Deployment

Azure Container Apps Deployment

Dockerization

Operations and Monitoring

Performance Metrics

Response Time: Average 2-3 seconds/image
Throughput: 10-15 images/second (with 20 replicas)
Success Rate: Over 99%
Cost Efficiency: $0 when idle, charged only during processing

Log Monitoring
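
Container platforms generally collect whatever the application writes to standard output and standard error, so emitting one JSON object per log line keeps entries easy to filter later. The sketch below shows one possible formatter; the class name `JsonFormatter` and the extra fields `image_url`, `elapsed_ms`, and `status` are assumptions for this example and do not come from the original system.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    # Optional per-request fields passed through logging's `extra` argument.
    EXTRA_FIELDS = ("image_url", "elapsed_ms", "status")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        # ensure_ascii=False keeps Japanese text readable in the output.
        return json.dumps(payload, ensure_ascii=False)
```

To use it, attach the formatter to a `logging.StreamHandler` and log with, for example, `logger.info("ocr done", extra={"elapsed_ms": 2300, "status": "ok"})`.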

Future Prospects

Technical Improvements

Image Caching: Reduction of duplicate processing
Batch Processing: Efficient large-scale processing
GPU Support: Faster OCR processing
Enhanced Metrics: Detailed performance analysis

Application Possibilities

Digital Archives: Utilization in libraries and museums
Research Support: Digitization for humanities research
Education: Creating teaching materials from classical literature
Cultural Preservation: Digital preservation of valuable materials

Summary

By combining NDL Classical Japanese OCR Lite with Azure Container Apps, we were able to build a classical text OCR system that achieves both cost efficiency and scalability. The serverless architecture enables pay-as-you-go billing and auto-scaling, making it practical as a digital humanities tool.

Key Points

Cost Optimization: Charged only during use
Auto-scaling: Resource adjustment based on demand
TEI P5 Compliant: Standardized XML output
Classical Text Specialized: OCR optimized for Japanese classical texts
API Design: Simple and extensible design

This system was developed as a technical proof of concept. For production use, please give sufficient consideration to the load on target servers, apply appropriate rate limiting, and comply with the terms of use.
