A Program to Create TEI/XML Files with OCR Results from IIIF Manifest Files

Overview I created a program to generate TEI/XML files containing OCR results from IIIF manifest files. This article explains how to use it. How It Works By specifying the URL of an IIIF manifest file, it creates a TEI/XML file containing OCR results from NDL Kotenseki OCR-Lite. https://github.com/ndl-lab/ndlkotenocr-lite Usage Access the following notebook: https://colab.research.google.com/github/nakamura196/000_tools/blob/main/IIIFマニフェストファイルからTEI_XMLファイルを作成するプログラム.ipynb Then press the first play button. Once complete, update the manifest_url and output_dir values in the “Execute” section and run the cell. ...

January 30, 2025 · 10 min · Nakamura
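As a rough illustration of the TEI/XML output described in the entry above, the following sketch builds a minimal TEI document from per-page OCR lines. The input shape (`pages` with `image`, `lines`, `text`, `bbox`), the helper names, and the placeholder header text are assumptions made for this example; they are not the notebook's actual data structures or the OCR tool's output format.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def el(parent, tag, text=None, **attrs):
    """Append a TEI-namespaced child element and return it."""
    child = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)
    child.text = text
    return child


def build_tei(title, pages):
    """Build a minimal TEI document from OCR lines (hypothetical input shape).

    pages: [{"image": url, "lines": [{"text": str, "bbox": (x1, y1, x2, y2)}]}]
    """
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(tei, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished sketch")
    el(el(file_desc, "sourceDesc"), "p", "Generated from OCR output")
    facsimile = el(tei, "facsimile")
    body = el(el(tei, "text"), "body")
    for p, page in enumerate(pages, start=1):
        surface = el(facsimile, "surface")
        el(surface, "graphic", url=page["image"])
        el(body, "pb", n=str(p))
        block = el(body, "ab")
        for i, line in enumerate(page["lines"], start=1):
            zone_id = f"zone_{p}_{i}"
            x1, y1, x2, y2 = line["bbox"]
            # One zone per OCR line, holding its bounding box on the image.
            zone = el(surface, "zone", ulx=str(x1), uly=str(y1),
                      lrx=str(x2), lry=str(y2))
            zone.set(XML_ID, zone_id)
            # The line text follows an <lb/> that points back at its zone.
            el(block, "lb", facs=f"#{zone_id}").tail = line["text"]
    return ET.tostring(tei, encoding="unicode")
```

Linking each `lb` to a `zone` through `facs` is one common way to keep text and image coordinates together; the real program may organize its output differently.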

A Library for Creating RDF Files from VSDX Files

Overview This is a memo about a library I created for generating RDF files from VSDX files. https://github.com/nakamura196/vsdx-rdf Background I have been exploring methods for creating RDF data using Microsoft Visio in articles like the following. This article corresponds to the note in the above article that said “This library will be introduced in a separate article.” Usage Please refer to the following. https://nakamura196.github.io/vsdx-rdf/ Google Colab A notebook is available for trying out this library. ...

July 18, 2024 · 1 min · Nakamura

Created Notebooks Using NDLOCR and NDL Classical Japanese OCR ver.2

Notice 2026-02-24 The notebooks provided on this page will no longer be updated. For NDLOCR, “NDLOCR-Lite” has been released as a desktop application and command-line tool for easy use. Please use that going forward. https://github.com/ndl-lab/ndlocr-lite 2025-04-02 There was a bug, and use of the notebooks was discouraged until the fix was complete. The bug has been fixed. 2025-03-21 For NDL Classical Japanese OCR, “NDL Classical Japanese OCR-Lite” has been released as a desktop application for easy use. Please use that going forward. ...

September 20, 2023 · 3 min · Nakamura

Bug Fixes and Feature Additions to the NDL Classical Book OCR Tutorial Using Google Colab

Overview I have been creating a tutorial for the NDL “Classical Book” OCR application using Google Colab, as introduced in the following article. This time, the following updates were made: added terms of use, fixed bugs, and added support for IIIF Presentation API v3 manifest file input. The updated notebook can be accessed at the same URL as before. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/NDL古典籍OCRの実行例.ipynb Terms of Use Please use the notebook itself under CC0. However, the “NDL Classical Book OCR Application” is released by the National Diet Library under the CC BY 4.0 license, so please include the appropriate credit. Also, please check the terms of use for the materials to which OCR is applied. ...

April 12, 2023 · 1 min · Nakamura
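To show what supporting both manifest versions involves, here is a small sketch that pulls one image URL per canvas from either a IIIF Presentation API v2 or v3 manifest. The function names are hypothetical and the code is not taken from the notebook; it only assumes the standard manifest layout of each version and that every canvas has at least one painted image.

```python
import json
import urllib.request


def load_manifest(url):
    """Download and parse a IIIF manifest (plain JSON over HTTP)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def image_urls(manifest):
    """Yield one image URL per canvas for Presentation API v2 or v3."""
    if "sequences" in manifest:  # Presentation API v2
        for canvas in manifest["sequences"][0]["canvases"]:
            yield canvas["images"][0]["resource"]["@id"]
    else:  # Presentation API v3
        for canvas in manifest.get("items", []):
            annotation_page = canvas["items"][0]
            annotation = annotation_page["items"][0]
            yield annotation["body"]["id"]
```

Checking for the `sequences` key is a simple heuristic; inspecting the `@context` value would be a stricter way to tell the versions apart.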

Running Tesseract on Google Colab (with Japanese Support)

I created a notebook for running Tesseract on Google Colab. It also supports Japanese. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/Tesseractを試す.ipynb At the end, I also introduce a workflow for converting hOCR files to ALTO format XML files. Specifically, the following tool is used: https://digi.bib.uni-mannheim.de/ocr-fileformat/ We hope this serves as a useful reference.

November 24, 2022 · 1 min · Nakamura
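For readers who want to see the shape of a Japanese Tesseract run, the sketch below assembles the command line and executes it with `subprocess`. It assumes the `tesseract` binary and the Japanese language data are already installed (on Colab this is typically done with the `tesseract-ocr` and `tesseract-ocr-jpn` apt packages); the function names are illustrative and not taken from the notebook.

```python
import subprocess


def tesseract_cmd(image_path, output_base, lang="jpn", hocr=True):
    """Build a tesseract command; 'hocr' selects hOCR output instead of text."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        cmd.append("hocr")
    return cmd


def run_ocr(image_path, output_base, **kwargs):
    """Run tesseract and raise CalledProcessError if it fails."""
    subprocess.run(tesseract_cmd(image_path, output_base, **kwargs), check=True)
```

Calling `run_ocr("page.png", "page")` would write `page.hocr`, which can then be passed to a converter such as the one linked above.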

An Example Method for Converting TEI/XML Files to Vertical-Writing PDF

Overview This is a memo documenting one example method for converting TEI/XML files to vertical-writing (tategaki) PDF. You can try the program targeting “Koui Genji Monogatari” (Collated Tale of Genji) in the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/TEI_XMLファイルを縦書きPDFに変換する.ipynb Conversion Workflow This time, I used Quarto. https://quarto.org/ Please refer to the following for installation instructions. https://quarto.org/docs/get-started/ TEI/XML -> qmd First, convert the contents of the TEI/XML file to a qmd file. Below is a sample conversion script. ...

October 3, 2022 · 8 min · Nakamura
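The TEI/XML to qmd step mentioned above can be sketched roughly as follows. This is not the conversion script from the article: it assumes the text lives in `p` elements under `body`, and it deliberately leaves out the vertical-writing settings, emitting only a plain `format: pdf` front matter.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_qmd(tei_xml):
    """Turn the title and paragraphs of a TEI document into qmd text."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title",
                          default="Untitled", namespaces=TEI)
    # itertext() flattens inline markup such as <hi> into plain text.
    paragraphs = ["".join(p.itertext()).strip()
                  for p in root.iterfind(".//tei:body//tei:p", TEI)]
    front_matter = f'---\ntitle: "{title}"\nformat: pdf\n---\n'
    return front_matter + "\n" + "\n\n".join(paragraphs) + "\n"
```

The resulting string can be written to a `.qmd` file and rendered with Quarto once the PDF layout options have been added to the front matter.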

Similar Image Search Using VGG16

In relation to the following article, I created a notebook for performing similar image search using VGG16. The notebook is available here: https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/[vgg16]_Image_Similarity_Search_in_PyTorch.ipynb You can verify the operation by selecting “Runtime” > “Run all.” We hope this serves as a useful reference.

August 19, 2022 · 1 min · Nakamura
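Once a network such as VGG16 has turned each image into a feature vector, the search itself comes down to ranking vectors by similarity. The sketch below shows that ranking step only, in plain Python with cosine similarity; the feature extraction is omitted, and cosine similarity is an assumption here, since the notebook may use a different distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, features, top_k=3):
    """Return the top_k (name, score) pairs, most similar first.

    features: dict mapping an image name to its feature vector.
    """
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in features.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```

For a large collection, the same idea is usually implemented with a vectorized library or an approximate nearest-neighbor index rather than a Python loop.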

Similar Image Search Using an Autoencoder

Based on the following article, I created a notebook for similar image search using an autoencoder. https://medium.com/pytorch/image-similarity-search-in-pytorch-1a744cf3469 The notebook is available below. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/Image_Similarity_Search_in_PyTorch.ipynb You can verify its operation by selecting “Runtime” > “Run all.” I hope this is helpful.

August 19, 2022 · 1 min · Nakamura

Building a Layout Extraction Model Using the NDL-DocL Dataset and YOLOv5

Overview I built a layout extraction model using the NDL-DocL dataset and YOLOv5. https://github.com/ndl-lab/layout-dataset https://github.com/ultralytics/yolov5 You can try this model using the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/NDL_DocLデータセットとYOLOv5を用いたレイアウト抽出モデル.ipynb This article is a record of the training process above. Creating the Dataset The NDL-DocL dataset in Pascal VOC format is converted to YOLO format. For this method, refer to the following article. In addition to the conversion from Pascal VOC format to COCO format, conversion from COCO format to YOLO format was added. ...

July 25, 2022 · 1 min · Nakamura
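The COCO to YOLO step mentioned above is mostly a change of box representation: COCO stores `[x, y, width, height]` in pixels from the top-left corner, while YOLO stores a normalized center point and size. A minimal sketch of that arithmetic, with hypothetical function names, looks like this:

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x, y, w, h] in pixels to YOLO's normalized form."""
    x, y, w, h = bbox
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    return x_center, y_center, w / img_w, h / img_h


def yolo_label_line(class_id, bbox, img_w, img_h):
    """Format one line of a YOLO label file: class id and four values."""
    values = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)
```

YOLOv5 expects one such line per object in a `.txt` file named after the image, with class ids starting from 0.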

Getting a Google Drive Folder ID from a Path Using Google Colab

This is based on the following page. https://stackoverflow.com/questions/67324695/is-there-a-way-to-get-the-id-of-a-google-drive-folder-from-the-path-using-colab By writing the following code, you can get a Google Drive folder ID from a path.

    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    # Install kora
    !pip install kora
    from kora.xattr import get_id
    # Example: get the ID of My Drive
    path = "/content/drive/MyDrive"
    fid = get_id(path)
    print("https://drive.google.com/drive/u/1/folders/{}".format(fid))

You can also try it from the following notebook. I hope you find this helpful. ...

July 25, 2022 · 2 min · Nakamura

Conversion and Visualization of the NDL-DocL Dataset (Document Image Layout Dataset)

I created a notebook that converts Pascal VOC format XML files to COCO format JSON files and visualizes the contents of the NDL-DocL Dataset (Document Image Layout Dataset) published by NDL Lab. https://github.com/nakamura196/ndl_ocr/blob/main/NDL_DocLデータセット(資料画像レイアウトデータセット)の変換と可視化.ipynb By opening the above notebook and pressing “Runtime” > “Run all cells,” you can perform the conversion and visualization. By using the “/content/img” folder and “/content/dataset_kotenseki.json” file created after execution, you can use the data in machine learning programs that require COCO format data. ...

July 22, 2022 · 1 min · Nakamura
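As a sketch of the Pascal VOC to COCO conversion described above, the function below reads annotation XML strings and builds a COCO-style dictionary. It assumes the standard Pascal VOC element names (`filename`, `size`, `object`, `bndbox`) and a caller-supplied category mapping; it is not the notebook's code, and the dataset's files may carry additional fields.

```python
import xml.etree.ElementTree as ET


def voc_to_coco(voc_xml_strings, categories):
    """Convert Pascal VOC annotation XML strings to a COCO-style dict.

    categories: dict mapping a class name to its integer category id.
    """
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": cid, "name": name}
                       for name, cid in categories.items()],
    }
    ann_id = 1
    for image_id, xml_text in enumerate(voc_xml_strings, start=1):
        root = ET.fromstring(xml_text)
        size = root.find("size")
        coco["images"].append({
            "id": image_id,
            "file_name": root.findtext("filename"),
            "width": int(size.findtext("width")),
            "height": int(size.findtext("height")),
        })
        for obj in root.iter("object"):
            box = obj.find("bndbox")
            xmin, ymin, xmax, ymax = (
                int(float(box.findtext(key)))
                for key in ("xmin", "ymin", "xmax", "ymax")
            )
            width, height = xmax - xmin, ymax - ymin
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": categories[obj.findtext("name")],
                # COCO boxes are [x, y, width, height] from the top-left corner.
                "bbox": [xmin, ymin, width, height],
                "area": width * height,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```

Dumping the result with `json.dump` gives a file in the general shape that COCO-based training code expects.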

NDL OCR Now Supports Ruby (Furigana) Text Extraction

Overview For NDL OCR, the default setting previously did not include ruby (furigana) text extraction. Thanks to the cooperation of the NDL team, it is now possible to configure whether or not to perform text extraction for ruby. https://github.com/ndl-lab/ndlocr_cli/ Setting the following to True in config.yaml enables the ruby text extraction feature.

    yield_block_rubi: False

Please note the following caveats when using this feature: ...

July 6, 2022 · 2 min · Nakamura
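If the setting needs to be switched from a notebook cell rather than by hand, a small text substitution is enough. The sketch below assumes `yield_block_rubi` appears once in config.yaml on a line of its own; the function name is hypothetical, and a YAML library would be the more robust choice for anything beyond a one-line change.

```python
import re


def enable_ruby(config_text):
    """Set yield_block_rubi to True, keeping indentation and other lines."""
    return re.sub(
        r"^([ \t]*yield_block_rubi[ \t]*:[ \t]*)\S+",
        r"\g<1>True",
        config_text,
        flags=re.MULTILINE,
    )
```

Reading config.yaml, passing its contents through `enable_ruby`, and writing the result back leaves every other setting untouched.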

Created a Program to Download Data from Omeka Classic

I created a program to download data from Omeka Classic. It is published in the following repository. https://github.com/nakamura196/omekac_backup I also created a Google Colab notebook that demonstrates how to run this program. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/omeka_classic_backup.ipynb The tutorial downloads data from the following Omeka Classic site. https://jinmoncom2017.omeka.net/ After execution, the API download results are output to the docs folder. You can use this data for backups and similar purposes. I hope this serves as a useful reference when using Omeka Classic. ...

June 23, 2022 · 1 min · Nakamura
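The general pattern behind this kind of backup is to page through an API resource until an empty page comes back, then save everything as JSON. The sketch below illustrates that loop; it is not the code from the repository. The function names are hypothetical, and it assumes the site's API is enabled, accepts `page` and `per_page` query parameters, and returns an empty list past the last page.

```python
import json
import urllib.request
from pathlib import Path


def fetch_page(base_url, resource, page, per_page=50):
    """Fetch one page of an API resource as parsed JSON."""
    url = f"{base_url.rstrip('/')}/api/{resource}?page={page}&per_page={per_page}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def download_all(fetch, out_dir, resource="items"):
    """Call fetch(page) until it returns an empty list, then save the records."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    records, page = [], 1
    while True:
        batch = fetch(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
    (out / f"{resource}.json").write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records
```

For example, `download_all(lambda p: fetch_page("https://jinmoncom2017.omeka.net", "items", p), "docs")` would write `docs/items.json`. Passing the fetch function in as an argument keeps the loop easy to test without network access.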

Created a Program to Download Omeka S Data

I created a program to download Omeka S data. It is published in the following repository. https://github.com/nakamura196/omekas_backup I also created a Google Colab notebook that shows an example of running this program. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/omekas_backup.ipynb The tutorial downloads data from the following Omeka S sandbox. https://omeka.org/s/download/#sandbox After execution, the API download results are output to the docs folder, and an MS Excel file summarizing them is output to the data folder. ...

June 22, 2022 · 1 min · Nakamura

Sample Notebook for Fetching Google Spreadsheet Data from Google Colab

I created a sample notebook for fetching Google Spreadsheet data from Google Colab. You can try it from the following link. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/Google_ColabからGoogle_Spreadsheetのデータを取得するサンプル.ipynb As shown below, you can retrieve the contents of a Google Spreadsheet. Below is the source code.

    from google.colab import auth
    auth.authenticate_user()
    import gspread
    from google.auth import default
    creds, _ = default()
    gc = gspread.authorize(creds)
    import pandas as pd
    from pandas import json_normalize
    # Specify the sheet
    ss_id = "<Google Spreadsheet ID>"
    workbook = gc.open_by_key(ss_id)
    worksheet = workbook.get_worksheet(0)
    # Fetch all data
    data = worksheet.get_all_records()
    df = json_normalize(data)
    df

I hope this serves as a useful reference. ...

May 25, 2022 · 2 min · Nakamura

What to do when

Overview When creating a large number of files on a shared drive, I encountered the error message “An error has occurred in Google Drive” and the files could not be saved. The likely cause is that the shared drive hit the limit shown below. https://support.google.com/a/answer/7338880?hl=en Maximum number of items that can be stored on a shared drive: The maximum number of items that can be stored on a shared drive is 400,000. This includes files, folders, and shortcuts. ...

May 9, 2022 · 5 min · Nakamura
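To get a feel for how close a folder tree is to the 400,000-item limit, the items under a mounted path can be counted with `os.walk`. This is only a sketch under assumptions: the path is whatever location the shared drive is mounted at in Colab, and anything not visible through the mount (for example, items in the trash) is not counted, so the result is a lower bound.

```python
import os

SHARED_DRIVE_ITEM_LIMIT = 400_000  # limit quoted from the Google help page


def count_items(root):
    """Count files and folders below root (root itself is not counted)."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def remaining_capacity(root):
    """Items that could still be added before reaching the quoted limit."""
    return SHARED_DRIVE_ITEM_LIMIT - count_items(root)
```

Walking a very large tree over a mounted drive can be slow, so it may be worth running this on one subfolder at a time.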

How to Fix "An error occurred in Google Drive": Script to Empty Shared Drive Trash

Overview When creating a large number of files in a shared drive, I encountered a situation where “An error occurred in Google Drive” was displayed and files could no longer be saved. The likely cause was hitting the following shared drive limitations. https://support.google.com/a/answer/7338880?hl=ja Maximum number of items in a shared drive A shared drive can contain a maximum of 400,000 items. This includes files, folders, and shortcuts. Daily upload limit Individual users can upload up to 750 GB per day to My Drive and all shared drives. ...

May 6, 2022 · 5 min · Nakamura

Running gcv2hocr on Google Colab: Creating Searchable PDFs with Transparent Text Using Google Vision API

Overview gcv2hocr is a repository that converts Google Cloud Vision OCR output to hOCR format and creates searchable PDFs. https://github.com/dinosauria123/gcv2hocr I created a notebook to run the above repository on Google Colab. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/gcv2hocrの実行サンプル.ipynb As shown below, you can create searchable PDF files. How to Use Access the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/gcv2hocrの実行サンプル.ipynb First, obtain an API key to use the Google Cloud Vision API. The following article may be helpful. https://zenn.dev/tmitsuoka0423/articles/get-gcp-api-key ...

May 3, 2022 · 1 min · Nakamura
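For context on the API key step, the sketch below shows how an OCR request to the Google Cloud Vision REST endpoint can be assembled with an API key. The function names and the `ja` language hint are assumptions for this example, and the notebook may call the API differently; the request fields themselves follow the `images:annotate` format.

```python
import base64
import json
import urllib.request

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_request(image_bytes, language_hints=("ja",)):
    """Build the JSON body for a DOCUMENT_TEXT_DETECTION request."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            "imageContext": {"languageHints": list(language_hints)},
        }]
    }


def annotate(image_bytes, api_key):
    """POST the request and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=json.dumps(build_request(image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Saving the returned JSON to a file gives the kind of Vision output that an hOCR converter can then process.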

How to Delete Files on Google Drive Using Google Colab

I created a notebook that demonstrates how to delete files on Google Drive using Google Colab. I hope this is useful when you have accidentally created a large number of unnecessary files on Google Drive. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/Google_Drive上のファイルを削除するノートブック.ipynb

May 2, 2022 · 1 min · Nakamura
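One cautious way to clean up unwanted files through a mounted Drive folder is to list the matches first and delete only after checking the list. The sketch below follows that idea with a `dry_run` flag; the function name and the glob-pattern approach are assumptions for illustration and may differ from what the notebook does.

```python
from pathlib import Path


def delete_matching(root, pattern, dry_run=True):
    """Find files under root matching a glob pattern; delete unless dry_run.

    Returns the list of matched paths so they can be reviewed first.
    """
    targets = sorted(p for p in Path(root).rglob(pattern) if p.is_file())
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Running it once with the default `dry_run=True`, reviewing the returned paths, and only then calling it again with `dry_run=False` reduces the risk of removing the wrong files.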

Created Version 2 of the NDLOCR App Using Google Colab

Announcements Notebook URL https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/ndl_ocr_v2.ipynb 2022-07-06 A demo video showing how to use it has been created. https://youtu.be/46p7ZZSul0o Additionally, a ruby (furigana) text conversion feature has been added. Overview I created an NDLOCR app using Google Colab and introduced it in the following article. This time, I created Version 2, an improved version of the above notebook. You can access the notebook from the following link. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/ndl_ocr_v2.ipynb Features Support for multiple input formats has been added. The following options are available: ...

May 2, 2022 · 2 min · Nakamura