Introduction
Digitizing Tibetan manuscripts is a major challenge in the digital humanities. Precious Buddhist scriptures and historical documents are preserved in libraries around the world, but many have not yet been converted to machine-readable text. Manual transcription is slow and costly, and few researchers have the necessary expertise.
This article introduces BDRC Tibetan OCR, an open-source Tibetan OCR system developed by the Buddhist Digital Resource Center (BDRC).
It also presents implementation examples from a project to digitize 114 Tibetan manuscript Kanjur texts.
What is BDRC Tibetan OCR?
BDRC Tibetan OCR is a free, open-source tool for automatically extracting text from Tibetan images.
Key Features
1. Desktop Application
It is a GUI application that runs on Windows and macOS (both Intel and Apple Silicon M1/M2 machines).
Installation:
- Download the ZIP file for your OS from the release page
- Simply extract and run the executable
2. Multiple Output Formats
- Plain text: Extracted Unicode Tibetan characters
- PageXML: XML with coordinate information (compatible with Transkribus)
- Wylie: Romanized transliteration format
3. Image Correction Features
- Dewarping: Corrects page curvature
- Rotation correction: Automatically detects and corrects page tilt
- Line detection: Line segmentation functionality
4. Batch Processing Support
- Batch processing of multiple image files
- Direct OCR from PDF files
- Automatic retrieval and processing from IIIF (International Image Interoperability Framework) manifests
Four Specialized OCR Models
A distinguishing feature of BDRC Tibetan OCR is its set of four specialized models, each optimized for a different script or material type.
1. Uchen Model - For Modern Print
Uchen (literally “headed script”) is the formal, most standard Tibetan script. It is used in modern publications and digital fonts.
2. Ume Model - For Handwritten Manuscripts
Ume means “headless script” and was widely used in Buddhist manuscripts.
3. Woodblock Model - For Classical Prints
4. Other Specialized Models
- Khyentse Wangpo dataset: Trained on approximately 13,000 lines from modern type editions
- Dunhuang manuscript model: Special model for ancient documents dating back to the 8th century
Training Data for Models
These models are trained on datasets collected from the following sources:
- BDRC - Buddhist Digital Resource Center
- ALL - Asian Legacy Library
- Adarsha
- NorbuKetaka
Trained models and some datasets are publicly available as open access on the HuggingFace BDRC account and OpenPecha.
Implementation Example: Tibetan Manuscript Kanjur Digitization Project
This section describes how the tool was applied in a project to digitize 114 Tibetan manuscript Kanjur texts.
Project Overview
- Target materials: 114 Tibetan manuscript Kanjur texts
- Processing method: Automatic image retrieval from IIIF Image API + batch OCR processing
- Output format: TEI/XML format (Text Encoding Initiative P5 compliant)
- Publication: Parallel display of images and text in a Web viewer
Technical Architecture
1. Efficient Image Retrieval via IIIF Integration
Metadata and high-resolution images are automatically retrieved from image servers compliant with the IIIF (International Image Interoperability Framework) standard.
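As a rough illustration of this retrieval step, the sketch below collects one image URL per canvas from a IIIF Presentation API 2.x manifest and builds IIIF Image API request URLs. The function name `image_urls_from_manifest`, the sample manifest, and the example.org host are assumptions made for illustration; the project's own retrieval code may differ, and Presentation 3.0 manifests use a different layout (`items` instead of `sequences`).

```python
from typing import Any


def image_urls_from_manifest(manifest: dict[str, Any], size: str = "full") -> list[str]:
    """Return one image URL per canvas of a IIIF Presentation 2.x manifest."""
    urls: list[str] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                service = resource.get("service")
                if isinstance(service, dict) and "@id" in service:
                    # Image API pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
                    # ("full" is the Image API 2.x size keyword; 3.0 uses "max").
                    base = service["@id"].rstrip("/")
                    urls.append(f"{base}/full/{size}/0/default.jpg")
                elif "@id" in resource:
                    # No image service advertised: fall back to the static image URL.
                    urls.append(resource["@id"])
    return urls


# Hypothetical two-page manifest; example.org is a placeholder host.
sample_manifest = {
    "@type": "sc:Manifest",
    "sequences": [{
        "canvases": [
            {"images": [{"resource": {
                "@id": "https://example.org/static/p1.jpg",
                "service": {"@id": "https://example.org/iiif/p1/"},
            }}]},
            {"images": [{"resource": {"@id": "https://example.org/static/p2.jpg"}}]},
        ]
    }],
}

for url in image_urls_from_manifest(sample_manifest):
    print(url)
```

Working on the parsed manifest rather than fetching it keeps the sketch independent of any HTTP library; in practice the manifest would first be downloaded and decoded from JSON.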
2. Batch OCR Processing
Key parameters:
- k_factor: Line detection sensitivity adjustment (2.5 used for woodblock prints)
- bbox_tolerance: Character bounding box tolerance (default: 4.0)
- merge_lines: Automatically merge split lines
- use_tps: Dewarping via TPS (Thin Plate Spline) transformation
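To keep these parameters together in a batch script, they can be grouped in a small settings object. The following is only a sketch: the class name `LineDetectionSettings`, the defaults for `merge_lines` and `use_tps`, and the validation ranges are assumptions, and it is not the tool's own configuration API. Only the `k_factor` value of 2.5 and the `bbox_tolerance` default of 4.0 come from the parameter descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineDetectionSettings:
    """Hypothetical container for the four parameters listed above."""

    k_factor: float = 2.5        # value reported for woodblock prints
    bbox_tolerance: float = 4.0  # documented default
    merge_lines: bool = True     # assumed default
    use_tps: bool = False        # assumed default

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real tool may accept other ranges.
        if self.k_factor <= 0:
            raise ValueError("k_factor must be positive")
        if self.bbox_tolerance < 0:
            raise ValueError("bbox_tolerance must be non-negative")


# Settings for a woodblock batch with TPS dewarping switched on.
woodblock = LineDetectionSettings(use_tps=True)
print(woodblock)
```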
3. TEI/XML Output
TEI/XML structure:
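One way to pair OCR lines with their image coordinates in TEI P5 is to record each line as a `zone` inside `facsimile` and to point to it from the transcription with a `facs` attribute. The sketch below builds such a document with Python's standard library. The function name `build_tei`, the header wording, and the choice of `ab` with `lb` elements are assumptions made for illustration; the project's actual encoding may differ.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def _q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, image_url, lines):
    """lines is a list of (text, (ulx, uly, lrx, lry)) tuples, one per OCR line."""
    root = ET.Element(_q("TEI"))
    file_desc = ET.SubElement(ET.SubElement(root, _q("teiHeader")), _q("fileDesc"))
    ET.SubElement(ET.SubElement(file_desc, _q("titleStmt")), _q("title")).text = title
    ET.SubElement(ET.SubElement(file_desc, _q("publicationStmt")), _q("p")).text = "Illustrative sketch"
    ET.SubElement(ET.SubElement(file_desc, _q("sourceDesc")), _q("p")).text = "OCR of a scanned folio"

    surface = ET.SubElement(ET.SubElement(root, _q("facsimile")), _q("surface"))
    ET.SubElement(surface, _q("graphic"), {"url": image_url})
    body = ET.SubElement(ET.SubElement(root, _q("text")), _q("body"))
    block = ET.SubElement(body, _q("ab"))

    for i, (text, (ulx, uly, lrx, lry)) in enumerate(lines, start=1):
        zone_id = f"line{i}"
        # Bounding box of the line on the page image.
        ET.SubElement(surface, _q("zone"), {
            f"{{{XML_NS}}}id": zone_id,
            "ulx": str(ulx), "uly": str(uly), "lrx": str(lrx), "lry": str(lry),
        })
        # Line break that links the transcribed text to its zone.
        lb = ET.SubElement(block, _q("lb"), {"facs": f"#{zone_id}"})
        lb.tail = text
    return ET.tostring(root, encoding="unicode")


print(build_tei("Sample folio", "https://example.org/iiif/p1/full/full/0/default.jpg",
                [("བཀྲ་ཤིས།", (120, 120, 6900, 300)), ("བདེ་ལེགས།", (120, 340, 6900, 520))]))
```

Keeping coordinates in `facsimile` and text in `body` lets a viewer highlight the image region for whichever line the reader selects.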
Processing Flow
Usage
Single Image OCR Processing
Batch Processing from IIIF Manifest
Key options:
- --model: OCR model to use (Modern, Ume_Druma, Ume_Petsuk, Woodblock, Woodblock-Stacks)
- --format: Output format (text, xml, json, all)
- --encoding: Character encoding (unicode, wylie)
- --dewarp: Apply dewarping
- --bbox-tolerance: Bounding box tolerance (default: 4.0)
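For readers wrapping OCR in their own scripts, the options above map naturally onto Python's `argparse`. The parser below is a sketch that mirrors the documented option names and choices; it is not the tool's actual command-line code. The positional `source` argument and the defaults for `--model`, `--format`, and `--encoding` are assumptions.

```python
import argparse

MODELS = ["Modern", "Ume_Druma", "Ume_Petsuk", "Woodblock", "Woodblock-Stacks"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of an OCR command-line interface")
    # Hypothetical positional argument: an image path or a IIIF manifest URL.
    parser.add_argument("source", help="image file or IIIF manifest URL")
    parser.add_argument("--model", choices=MODELS, default="Woodblock",
                        help="OCR model to use (default is an assumption)")
    parser.add_argument("--format", choices=["text", "xml", "json", "all"], default="text",
                        help="output format (default is an assumption)")
    parser.add_argument("--encoding", choices=["unicode", "wylie"], default="unicode",
                        help="character encoding (default is an assumption)")
    parser.add_argument("--dewarp", action="store_true", help="apply dewarping")
    parser.add_argument("--bbox-tolerance", type=float, default=4.0,
                        help="bounding box tolerance")
    return parser


args = build_parser().parse_args(["page_0001.jpg", "--model", "Ume_Druma", "--dewarp"])
print(args.model, args.format, args.encoding, args.dewarp, args.bbox_tolerance)
```

Declaring `choices` makes the parser reject a misspelled model name before any image is processed.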
Project Results
- Processed documents: 33 of 114 (ongoing)
- TEI/XML output: Generated XML with coordinate information for each manuscript
- IIIF integration: Achieved a Web viewer integrating images and text
- DTS Collections API: Provided standardized metadata API
Technical Details
Architecture
BDRC Tibetan OCR consists of two main neural networks:
Line detection model (PhotiLines)
- Line region detection via semantic segmentation
- Patch size: 512x512
- Provided in ONNX format
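A segmentation network with a fixed 512x512 input is usually applied to a large page by cutting the page into patches. How PhotiLines tiles and pads its input is not described here, so the sketch below assumes the simplest scheme: non-overlapping patches, with the page padded up to a multiple of the patch size. The function names are hypothetical.

```python
import math


def padded_size(width: int, height: int, patch: int = 512) -> tuple[int, int]:
    """Smallest size that is a multiple of `patch` and contains the image."""
    return (math.ceil(width / patch) * patch, math.ceil(height / patch) * patch)


def patch_origins(width: int, height: int, patch: int = 512) -> list[tuple[int, int]]:
    """Top-left (x, y) corner of every patch, in row-major order."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return [(col * patch, row * patch) for row in range(rows) for col in range(cols)]


# A 7360x4912 scan needs 15 columns x 10 rows of patches.
print(padded_size(7360, 4912))         # (7680, 5120)
print(len(patch_origins(7360, 4912)))  # 150
```

The per-patch predictions are then stitched back at the same origins and cropped to the original page size.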
OCR model (Easter2 architecture)
- CRNN (Convolutional Recurrent Neural Network) based
- Input: Variable width x fixed height images
- Output: Unicode strings or Wylie transliteration
- Fast inference with ONNX Runtime
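CRNN-style recognizers are commonly trained with CTC, in which case the network emits one score vector per horizontal time step and a decoding step turns those scores into a string. Whether the BDRC models are decoded exactly this way is an assumption. The sketch below shows greedy CTC decoding on plain Python lists, with a toy three-symbol character set.

```python
def ctc_greedy_decode(scores, charset, blank=0):
    """Greedy CTC decoding: take the best class per step, collapse repeats, drop blanks.

    scores  -- list of per-time-step score lists, one score per class
    charset -- charset[i] is the character for class i (the blank entry is unused)
    """
    best = [max(range(len(row)), key=row.__getitem__) for row in scores]
    decoded = []
    previous = blank
    for index in best:
        # Emit a character only when the class changes and is not the blank.
        if index != blank and index != previous:
            decoded.append(charset[index])
        previous = index
    return "".join(decoded)


# Toy example: class 0 is the CTC blank, classes 1 and 2 are Tibetan letters.
toy_charset = ["", "ཀ", "ཁ"]
toy_scores = [
    [0.1, 0.8, 0.1],  # ཀ
    [0.1, 0.8, 0.1],  # ཀ again (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank separates two genuine occurrences
    [0.1, 0.8, 0.1],  # ཀ
    [0.1, 0.1, 0.8],  # ཁ
]
print(ctc_greedy_decode(toy_scores, toy_charset))
```

The blank between the repeated letters is what allows a genuinely doubled character to survive the collapsing step.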
Programmatic Usage
Basic Usage in Python
Output Format Details
1. Plain Text (.txt)
2. PageXML (.xml)
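PageXML stores each recognized line as a `TextLine` element with a `Coords` polygon and a `TextEquiv/Unicode` transcription. The sketch below reads those elements with Python's standard library. The sample document, its file name, and its coordinates are made up for illustration, and the schema version the tool actually writes is an assumption, so the namespace is taken from the root element rather than hard-coded.

```python
import xml.etree.ElementTree as ET

SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="7360" imageHeight="4912">
    <TextRegion id="r1">
      <Coords points="100,100 7000,100 7000,600 100,600"/>
      <TextLine id="r1_l1">
        <Coords points="120,120 6900,120 6900,300 120,300"/>
        <TextEquiv><Unicode>བཀྲ་ཤིས།</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <Coords points="120,340 6900,340 6900,520 120,520"/>
        <TextEquiv><Unicode>བདེ་ལེགས།</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def read_page_xml(xml_text: str):
    """Return (line id, polygon points, text) for every TextLine in a PageXML string."""
    root = ET.fromstring(xml_text)
    # Reuse whatever namespace the document declares, e.g. "{http://...}".
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    lines = []
    for line in root.iter(f"{ns}TextLine"):
        coords = line.find(f"{ns}Coords")
        raw_points = coords.get("points", "") if coords is not None else ""
        points = [tuple(int(v) for v in pair.split(",")) for pair in raw_points.split()]
        unicode_el = line.find(f"{ns}TextEquiv/{ns}Unicode")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append((line.get("id"), points, text))
    return lines


for line_id, points, text in read_page_xml(SAMPLE_PAGE_XML):
    print(line_id, points[0], text)
```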
3. JSONL (.jsonl)
Performance
Measured values with the Woodblock model (MacBook Pro M1):
- Processing speed: Approximately 1 page / 15-20 seconds (7360x4912 pixel high-resolution images)
- Line detection accuracy: Over 95%
- Character recognition accuracy: 90-95% (varies depending on material condition)
- Memory usage: Approximately 2GB
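Character recognition accuracy is often reported as one minus the character error rate (CER), where CER is the edit distance between the OCR output and a reference transcription divided by the reference length. The article does not state how the figures above were measured, so the sketch below shows only this common definition, with hypothetical function names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


reference = "bkra shis bde legs"
hypothesis = "bkra shis bda legs"  # one substituted character
cer = character_error_rate(reference, hypothesis)
print(f"CER = {cer:.3f}, accuracy = {1 - cer:.1%}")
```

For Unicode Tibetan, a single stacked syllable spans several code points, so a CER computed over code points is not directly comparable with one computed over syllables or Wylie letters.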
Comparison with Related Tools
Tesseract OCR
An open-source OCR engine originally developed at Hewlett-Packard and later sponsored by Google.
- Supported languages: Over 100 languages (including Tibetan)
- Accuracy: Low recognition accuracy for Tibetan (especially classical manuscripts)
- Use case: Suitable for general document OCR
Transkribus
A handwritten document recognition platform developed by READ-COOP.
- Features: Specialized in HTR (Handwritten Text Recognition)
- Accuracy: Custom model training available
- Compatibility: BDRC Tibetan OCR is compatible with Transkribus via PageXML format
- Limitations: Free version limited to 500 credits per month
BDRC Tibetan OCR Strengths
- Provides four specialized models for Tibetan
- Completely free and open source
- Supports woodblock prints and classical manuscripts
- IIIF integration for workflow support
- Runs in local environment
Summary
Key Features
- Specialized models: Four models optimized by script type and material
- Completely free: Available without restrictions as open source
- Applications: Both GUI app and CLI tools provided
- Standards compliant: Supports international standards including IIIF, TEI/XML, and PageXML
- Local processing: Processing completes in the local environment
Applicable Projects
- Building digital libraries
- Digitizing Buddhist scriptures
- Creating research corpora
- Archiving historical documents
- Developing digital educational materials
Future Possibilities
Projects of this kind could be extended with features such as:
- Automatic correction: Improving accuracy through post-processing of OCR results
- Parallel text display: Comparative display of multiple versions
- Full-text search: Searching across OCR text
- Annotation features: Adding comments and annotations by researchers
Resources
Official Links
- GitHub repository: https://github.com/buda-base/tibetan-ocr-app
- Release page: https://github.com/buda-base/tibetan-ocr-app/releases
- Trained models (HuggingFace): https://huggingface.co/BDRC
- Training code: https://github.com/buda-base/tibetan-ocr-training
References
- Buddhist Digital Resource Center: https://www.bdrc.io/
- TEI (Text Encoding Initiative): https://tei-c.org/
- IIIF (International Image Interoperability Framework): https://iiif.io/
- DTS (Distributed Text Services): https://distributed-text-services.github.io/
Acknowledgments
BDRC Tibetan OCR is an open-source tool developed by the Buddhist Digital Resource Center (BDRC). We thank Eric Werner, the developer of the tool.
Published: 2025-11-13