Overview

By default, NDL OCR did not previously extract ruby (furigana) text. Thanks to the cooperation of the NDL team, you can now configure whether ruby text is extracted.

https://github.com/ndl-lab/ndlocr_cli/

The following option in config.yaml controls ruby text extraction; set it to True to enable the feature.

yield_block_rubi: False
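
If you would rather script this change than edit the file by hand (for example, from a notebook cell), something like the following could work. This is only a minimal sketch: `set_ruby_extraction` and `enable_in_file` are hypothetical helpers, not part of ndlocr_cli, and the code assumes the key sits on its own line in config.yaml as shown above. It edits the text directly instead of parsing YAML, so comments and the rest of the file are left untouched.

```python
import re
from pathlib import Path

# Matches a line such as "    yield_block_rubi: False" or "yield_block_rubi:False".
# Group 1 keeps the indentation and the key so that only the value is rewritten.
_RUBI_LINE = re.compile(r"^([ \t]*yield_block_rubi)[ \t]*:[ \t]*\S+", re.MULTILINE)


def set_ruby_extraction(config_text: str, enabled: bool) -> str:
    """Return config_text with yield_block_rubi set to True or False.

    Hypothetical helper (not part of ndlocr_cli).
    """
    new_text, count = _RUBI_LINE.subn(
        lambda m: f"{m.group(1)}: {enabled}", config_text
    )
    if count == 0:
        # Fail loudly instead of silently leaving the config unchanged.
        raise ValueError("yield_block_rubi was not found in the config text")
    return new_text


def enable_in_file(path: str = "config.yaml") -> None:
    """Rewrite the file in place; the default path is an assumption."""
    config_path = Path(path)
    original = config_path.read_text(encoding="utf-8")
    config_path.write_text(set_ruby_extraction(original, True), encoding="utf-8")
```

Calling `enable_in_file()` from the directory that contains the config file would switch the option on; passing `False` to `set_ruby_extraction` restores the default.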

Please note the following caveats when using this feature:

  • Ruby text is not always split at the kanji to which each reading is attached; several ruby segments may be merged into a single output.
  • Because ruby characters are small, they are sometimes output as a placeholder character instead of the actual text.
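
To get a rough sense of how often the second caveat affects a given result, you could count placeholder characters in the recognized text. The sketch below is illustrative only: `placeholder_ratio` is a hypothetical helper, and the default placeholder "〓" (the geta mark) is an assumption, so check your own output and pass whichever character actually appears there.

```python
def placeholder_ratio(text: str, placeholder: str = "〓") -> float:
    """Return the share of non-whitespace characters equal to the placeholder.

    Hypothetical helper; the default placeholder character is an assumption.
    """
    # Ignore spaces and line breaks so that layout does not skew the ratio.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == placeholder for c in chars) / len(chars)
```

A high ratio for the ruby text of a page suggests that the furigana there was too small to be read reliably.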

Tutorial Notebook Updates

The ruby text extraction option has also been added to the Google Colab tutorial.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/ndl_ocr_v2.ipynb

Checking the “ruby” option enables ruby text extraction. By default, it is set to False (ruby text extraction is not performed), maintaining backward compatibility.

Along with this feature, I also fixed a bug related to PDF input and changed how recognition results are output. The notebook now consistently outputs a link to the Google Drive location where the recognition results are saved, and the results can be viewed from that link.

The following demo video walks through the operation procedure, although it does not cover the changes described here. I hope it serves as a helpful reference for using Google Colab.

https://youtu.be/46p7ZZSul0o