BDRC Tibetan OCR: Introduction and Implementation Examples of a Tibetan OCR Tool

Introduction

Digitizing Tibetan manuscripts is a major challenge in the digital humanities. Precious Buddhist scriptures and historical documents are preserved in libraries around the world, but many have not yet been converted to text data. Manual transcription takes enormous time and money, and few researchers have the necessary expertise.

This article introduces BDRC Tibetan OCR, an open-source Tibetan OCR system developed by the Buddhist Digital Resource Center (BDRC).

It also presents implementation examples from a project to digitize 114 Tibetan manuscript Kanjur texts.

What is BDRC Tibetan OCR?

BDRC Tibetan OCR is a free, open-source tool for automatically extracting text from images of Tibetan documents.

Key Features

1. Desktop Application

It is a GUI application that runs on Windows and macOS (both Intel and Apple Silicon M1/M2).

Installation:

Download the ZIP file for your OS from the release page
Simply extract and run the executable

2. Multiple Output Formats

Plain text: Extracted Unicode Tibetan characters
PageXML: XML with coordinate information (compatible with Transkribus)
Wylie: Romanized transliteration format

3. Image Correction Features

Dewarping: Corrects page curvature
Rotation correction: Automatically detects and corrects page tilt
Line detection: Line segmentation functionality

4. Batch Processing Support

Batch processing of multiple image files
Direct OCR from PDF files
Automatic retrieval and processing from IIIF (International Image Interoperability Framework) manifests

Four Specialized OCR Models

BDRC Tibetan OCR provides four specialized models optimized for different scripts and material types.

1. Uchen Model - For Modern Print

Uchen means “headed script”, named for the horizontal stroke at the top of each letter. It is the standard formal script of Tibetan and is used in modern publications and digital fonts.

2. Ume Model - For Handwritten Manuscripts

Ume means “headless script” and was widely used in Buddhist manuscripts.

3. Woodblock Model - For Classical Prints

4. Other Specialized Models

Khyentse Wangpo dataset: Trained on approximately 13,000 lines from modern type editions
Dunhuang manuscript model: Special model for ancient documents dating back to the 8th century

Training Data for Models

These models are trained on datasets collected from the following sources:

BDRC - Buddhist Digital Resource Center
ALL - Asian Legacy Library
Adarsha
NorbuKetaka

Trained models and some datasets are publicly available as open access on the HuggingFace BDRC account and OpenPecha.

Implementation Example: Tibetan Manuscript Kanjur Digitization Project

Here is an implementation example from a project to digitize 114 Tibetan manuscript Kanjur texts.

Project Overview

Target materials: 114 Tibetan manuscript Kanjur texts
Processing method: Automatic image retrieval from IIIF Image API + batch OCR processing
Output format: TEI/XML format (Text Encoding Initiative P5 compliant)
Publication: Parallel display of images and text in a Web viewer

Technical Architecture

1. Efficient Image Retrieval via IIIF Integration

Metadata and high-resolution images are automatically retrieved from image servers compliant with the IIIF (International Image Interoperability Framework) standard.
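
As a rough illustration of this step, the sketch below walks a IIIF Presentation API 2.x manifest and builds one Image API request URL per canvas. It is not the project's code: the sample manifest and its URLs are invented for the example, only the 2.x manifest layout is handled, and the `max` size keyword assumes a server that supports Image API 2.1 or later.

```python
from typing import Any


def image_service_ids(manifest: dict[str, Any]) -> list[str]:
    """Collect the Image API service base URL of each canvas (Presentation 2.x layout)."""
    services = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service", {})
                # Some manifests wrap the service description in a list.
                if isinstance(service, list):
                    service = service[0] if service else {}
                if "@id" in service:
                    services.append(service["@id"].rstrip("/"))
    return services


def image_url(service_id: str, size: str = "max") -> str:
    """Build an Image API URL: {base}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{service_id}/full/{size}/0/default.jpg"


# Hypothetical manifest fragment, used only for illustration.
SAMPLE_MANIFEST = {
    "@type": "sc:Manifest",
    "sequences": [{
        "canvases": [
            {"images": [{"resource": {"service": {"@id": "https://example.org/iiif/page1/"}}}]},
            {"images": [{"resource": {"service": {"@id": "https://example.org/iiif/page2"}}}]},
        ]
    }],
}

for service_id in image_service_ids(SAMPLE_MANIFEST):
    print(image_url(service_id))
```

Requesting the full region at maximum size keeps the resolution that line detection and OCR benefit from; a smaller `size` value such as `1024,` would trade accuracy for faster downloads.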

2. Batch OCR Processing

Key parameters:

k_factor: Line detection sensitivity adjustment (2.5 used for woodblock prints)
bbox_tolerance: Character bounding box tolerance (default: 4.0)
merge_lines: Automatically merge split lines
use_tps: Dewarping via TPS (Thin Plate Spline) transformation
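
One way to keep these parameters together in a batch script is a small settings object. The sketch below is only an illustration: the class name `OcrSettings` is hypothetical, the defaults for `merge_lines` and `use_tps` are assumptions (the article gives defaults only for `bbox_tolerance` and the woodblock `k_factor`), and the range checks are guesses about sensible values rather than rules taken from the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container for the batch OCR parameters listed above."""

    k_factor: float = 2.5        # value the article reports for woodblock prints
    bbox_tolerance: float = 4.0  # default stated in the article
    merge_lines: bool = True     # assumed default
    use_tps: bool = False        # assumed default

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real tool may accept other ranges.
        if self.k_factor <= 0:
            raise ValueError("k_factor must be positive")
        if self.bbox_tolerance < 0:
            raise ValueError("bbox_tolerance must be non-negative")


# Example: woodblock settings with TPS dewarping switched on.
WOODBLOCK = OcrSettings(k_factor=2.5, use_tps=True)
print(WOODBLOCK)
```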

3. TEI/XML Output

TEI/XML structure:
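
The project's actual TEI template is not reproduced here. As a hedged sketch of what such output could look like, the code below builds a minimal TEI P5 document in which each OCR line is a `zone` in the `facsimile` section and an `lb` milestone in the text body pointing back to it. The element choices, the `page1` identifiers, and the function name `build_tei` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, image_url, lines):
    """lines: list of (text, (ulx, uly, lrx, lry)) tuples for one page."""
    tei = ET.Element(q("TEI"))

    # Minimal header: fileDesc needs titleStmt, publicationStmt, sourceDesc.
    file_desc = ET.SubElement(ET.SubElement(tei, q("teiHeader")), q("fileDesc"))
    ET.SubElement(ET.SubElement(file_desc, q("titleStmt")), q("title")).text = title
    ET.SubElement(ET.SubElement(file_desc, q("publicationStmt")), q("p")).text = "OCR output"
    ET.SubElement(ET.SubElement(file_desc, q("sourceDesc")), q("p")).text = "Generated from page images"

    # Facsimile: one surface per page, one zone per detected line.
    surface = ET.SubElement(ET.SubElement(tei, q("facsimile")), q("surface"), {XML_ID: "page1"})
    ET.SubElement(surface, q("graphic"), {"url": image_url})

    # Transcription: line breaks link back to their zones via @facs.
    block = ET.SubElement(ET.SubElement(ET.SubElement(tei, q("text")), q("body")), q("ab"))
    ET.SubElement(block, q("pb"), {"facs": "#page1"})
    for i, (line_text, (ulx, uly, lrx, lry)) in enumerate(lines, start=1):
        zone_id = f"page1_l{i}"
        ET.SubElement(surface, q("zone"), {
            XML_ID: zone_id,
            "ulx": str(ulx), "uly": str(uly), "lrx": str(lrx), "lry": str(lry),
        })
        line_break = ET.SubElement(block, q("lb"), {"n": str(i), "facs": f"#{zone_id}"})
        line_break.tail = line_text  # the line's text follows its <lb/> milestone
    return tei


tei = build_tei(
    "Sample folio",
    "https://example.org/iiif/page1/full/max/0/default.jpg",
    [("བཀྲ་ཤིས་བདེ་ལེགས།", (120, 80, 2300, 160))],
)
print(ET.tostring(tei, encoding="unicode"))
```

Keeping the coordinates in `facsimile` and the text in `body` is what lets a viewer highlight the image region for a selected line.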

Processing Flow

Usage

Single Image OCR Processing

Batch Processing from IIIF Manifest

Key options:

--model: OCR model to use (Modern, Ume_Druma, Ume_Petsuk, Woodblock, Woodblock-Stacks)
--format: Output format (text, xml, json, all)
--encoding: Character encoding (unicode, wylie)
--dewarp: Apply dewarping
--bbox-tolerance: Bounding box tolerance (default: 4.0)
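
The command lines themselves are not shown here, so the following is only a reconstruction of how a wrapper script might declare the options listed above with `argparse`. The program name `ocr-sketch`, the positional `input` argument, and the defaults for `--model`, `--format`, and `--encoding` are assumptions; only the option names, their allowed values, and the `4.0` tolerance default come from the list.

```python
import argparse

MODELS = ["Modern", "Ume_Druma", "Ume_Petsuk", "Woodblock", "Woodblock-Stacks"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="ocr-sketch",  # hypothetical program name
        description="Illustrative parser for the OCR options described in the article.",
    )
    # Hypothetical positional argument: an image path or a IIIF manifest URL.
    parser.add_argument("input", help="image file or IIIF manifest URL")
    parser.add_argument("--model", choices=MODELS, default="Woodblock",
                        help="OCR model to use (default is an assumption)")
    parser.add_argument("--format", choices=["text", "xml", "json", "all"], default="text",
                        help="output format (default is an assumption)")
    parser.add_argument("--encoding", choices=["unicode", "wylie"], default="unicode",
                        help="character encoding (default is an assumption)")
    parser.add_argument("--dewarp", action="store_true", help="apply dewarping")
    parser.add_argument("--bbox-tolerance", type=float, default=4.0,
                        help="bounding box tolerance")
    return parser


args = build_parser().parse_args(["page_0001.jpg", "--model", "Woodblock", "--dewarp"])
print(args)
```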

Project Results

Processed documents: 33 of the 114 target texts so far (ongoing)
TEI/XML output: Generated XML with coordinate information for each manuscript
IIIF integration: Achieved a Web viewer integrating images and text
DTS Collections API: Provided standardized metadata API

Technical Details

Architecture

BDRC Tibetan OCR consists of two main neural networks:

Line detection model (PhotiLines)
- Line region detection via semantic segmentation
- Patch size: 512x512
- Provided in ONNX format
OCR model (Easter2 architecture)
- CRNN (Convolutional Recurrent Neural Network) based
- Input: Variable width x fixed height images
- Output: Unicode strings or Wylie transliteration
- Fast inference with ONNX Runtime
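
To make the 512x512 patch size concrete, the sketch below computes how a page image could be divided into patches for a segmentation model. It assumes a plain non-overlapping grid with clipped edge patches; the tool itself may pad, overlap, or rescale differently, and the function name `patch_grid` is made up for this example.

```python
import math


def patch_grid(width: int, height: int, patch: int = 512):
    """Return (left, top, right, bottom) boxes covering the image in row-major order.

    Patches on the right and bottom edges are clipped to the image size; a real
    pipeline would typically pad them back to the full patch size before inference.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    boxes = []
    for row in range(rows):
        for col in range(cols):
            left, top = col * patch, row * patch
            boxes.append((left, top, min(left + patch, width), min(top + patch, height)))
    return boxes


# Image size taken from the Performance section of this article.
boxes = patch_grid(7360, 4912)
print(len(boxes), boxes[-1])
```

Under these assumptions a 7360x4912 image yields a 15 x 10 grid, which gives a sense of why line detection on high-resolution scans accounts for a noticeable share of the per-page processing time.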

Programmatic Usage

Basic Usage in Python

Output Format Details

1. Plain Text (.txt)

2. PageXML (.xml)
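
The tool's own PageXML output is not reproduced here. As an approximation of the format, the sketch below builds a minimal PAGE document with one text region and one `TextLine` per OCR line, using the 2013-07-15 PAGE namespace. The identifiers, the `Creator` value, and the rectangular line outlines are assumptions; real output would normally carry the detected line polygons.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
ET.register_namespace("", PAGE_NS)  # serialize PAGE as the default namespace


def q(tag: str) -> str:
    """Qualify a tag name with the PAGE namespace."""
    return f"{{{PAGE_NS}}}{tag}"


def rect_points(x0: int, y0: int, x1: int, y1: int) -> str:
    """Encode a rectangle as a PAGE 'points' string, clockwise from top-left."""
    return f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"


def build_page_xml(image_name, width, height, lines, timestamp):
    """lines: list of (text, (x0, y0, x1, y1)); timestamp: ISO 8601 string."""
    root = ET.Element(q("PcGts"))
    metadata = ET.SubElement(root, q("Metadata"))
    ET.SubElement(metadata, q("Creator")).text = "ocr-sketch"  # hypothetical creator name
    ET.SubElement(metadata, q("Created")).text = timestamp
    ET.SubElement(metadata, q("LastChange")).text = timestamp

    page = ET.SubElement(root, q("Page"), {
        "imageFilename": image_name,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })
    region = ET.SubElement(page, q("TextRegion"), {"id": "r1"})
    ET.SubElement(region, q("Coords"), {"points": rect_points(0, 0, width - 1, height - 1)})
    for i, (text, box) in enumerate(lines, start=1):
        line = ET.SubElement(region, q("TextLine"), {"id": f"r1_l{i}"})
        ET.SubElement(line, q("Coords"), {"points": rect_points(*box)})
        ET.SubElement(ET.SubElement(line, q("TextEquiv")), q("Unicode")).text = text
    return root


page_xml = build_page_xml(
    "page_0001.jpg", 7360, 4912,
    [("བཀྲ་ཤིས་བདེ་ལེགས།", (120, 80, 2300, 160))],
    "2025-11-13T00:00:00",
)
print(ET.tostring(page_xml, encoding="unicode"))
```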

3. JSONL (.jsonl)
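
JSONL stores one JSON object per line, which makes large OCR runs easy to stream and append to. The sketch below shows the general pattern only; the field names (`image`, `line`, `bbox`, `text`) are hypothetical and may not match the tool's actual schema.

```python
import io
import json


def write_jsonl(records, stream) -> None:
    """Write one JSON object per line, keeping Tibetan text readable."""
    for record in records:
        # ensure_ascii=False writes Tibetan characters directly instead of \uXXXX escapes.
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Parse a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical records: one per detected line.
records = [
    {"image": "page_0001.jpg", "line": 1, "bbox": [120, 80, 2300, 160], "text": "བཀྲ་ཤིས་བདེ་ལེགས།"},
    {"image": "page_0001.jpg", "line": 2, "bbox": [120, 180, 2300, 260], "text": "ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ།"},
]

buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(restored[0]["text"])
```

When writing to a real file, opening it with `encoding="utf-8"` avoids platform-dependent defaults.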

Performance

Measured values with the Woodblock model (MacBook Pro M1):

Processing speed: Approximately 1 page / 15-20 seconds (7360x4912 pixel high-resolution images)
Line detection accuracy: Over 95%
Character recognition accuracy: 90-95% (varies depending on material condition)
Memory usage: Approximately 2GB
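
The article does not say how the accuracy figures were computed. A common convention is to report character accuracy as 1 minus the character error rate (CER), where CER is the edit distance between the OCR output and a reference transcription divided by the reference length. The sketch below implements that convention over Unicode code points; treating each code point as one character is an assumption, and counting Tibetan stacks or syllables instead would give different numbers.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the reference
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "བཀྲ་ཤིས་བདེ་ལེགས།"
hypothesis = "བཀྲ་ཤིས་བདེ་ལེགས"  # OCR output missing the final shad
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.3f}, accuracy: {1 - cer:.3f}")
```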

Comparison with Related Tools

Tesseract OCR

An open-source OCR engine originally developed at Hewlett-Packard and later sponsored by Google.

Supported languages: Over 100 languages (including Tibetan)
Accuracy: Low recognition accuracy for Tibetan (especially classical manuscripts)
Use case: Suitable for general document OCR

Transkribus

A handwritten document recognition platform developed by READ-COOP.

Features: Specialized in HTR (Handwritten Text Recognition)
Accuracy: Can be improved by training custom models
Compatibility: BDRC Tibetan OCR is compatible with Transkribus via PageXML format
Limitations: Free version limited to 500 credits per month

BDRC Tibetan OCR Strengths

Provides four specialized models for Tibetan
Completely free and open source
Supports woodblock prints and classical manuscripts
IIIF integration for workflow support
Runs in local environment

Summary

Key Features

Specialized models: Four models optimized by script type and material
Completely free: Available without restrictions as open source
Applications: Both GUI app and CLI tools provided
Standards compliant: Supports international standards including IIIF, TEI/XML, and PageXML
Local processing: Processing completes in the local environment

Applicable Projects

Building digital libraries
Digitizing Buddhist scriptures
Creating research corpora
Archiving historical documents
Developing digital educational materials

Future Possibilities

Projects like these could be extended with the following features:

Automatic correction: Improving accuracy through post-processing of OCR results
Parallel text display: Comparative display of multiple versions
Full-text search: Searching across OCR text
Annotation features: Adding comments and annotations by researchers

Resources

Official Links

GitHub repository: https://github.com/buda-base/tibetan-ocr-app
Release page: https://github.com/buda-base/tibetan-ocr-app/releases
Trained models (HuggingFace): https://huggingface.co/BDRC
Training code: https://github.com/buda-base/tibetan-ocr-training

References

Buddhist Digital Resource Center: https://www.bdrc.io/
TEI (Text Encoding Initiative): https://tei-c.org/
IIIF (International Image Interoperability Framework): https://iiif.io/
DTS (Distributed Text Services): https://distributed-text-services.github.io/

Acknowledgments

BDRC Tibetan OCR is an open-source tool developed by the Buddhist Digital Resource Center (BDRC). We thank Eric Werner, the developer of the tool.

Published: 2025-11-13
