Overview

This article introduces “Extract Ocr,” an Omeka S module that extracts OCR text from PDF files and saves it as XML.

Installation

Refer to the following page.

This module requires a command-line tool called pdftohtml.
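Before installing the module, it can help to confirm that the tool is visible to the system. The following is a minimal sketch, assuming pdftohtml is provided by Poppler's utilities (the poppler-utils package on Debian-based systems) and that the web server uses the same PATH as your shell. The helper name find_pdftohtml is hypothetical.

```python
import shutil


def find_pdftohtml(command="pdftohtml"):
    """Return the full path of the command if it is on PATH, otherwise None."""
    return shutil.which(command)


path = find_pdftohtml()
if path is None:
    print("pdftohtml was not found on PATH; install Poppler's utilities first")
else:
    print(f"pdftohtml found at {path}")
```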

In the instructions below, replace omeka-s as appropriate for your environment.

In an environment using AWS Lightsail, it could be installed with the following command:

Additionally, you need to edit omeka-s/config/local.config.php. Change the base_uri portion according to your environment. Example: https://omekas.aws.ldas.jp/sandbox/files
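To illustrate how the example value above is formed, here is a small sketch. It assumes Omeka S is installed at the URL passed in and serves its files directory directly beneath it. The function name files_base_uri is hypothetical and is not part of Omeka S.

```python
def files_base_uri(install_url):
    """Build the public URL of the files directory for an Omeka S install."""
    # Drop any trailing slash so the result never contains "//files".
    return install_url.rstrip("/") + "/files"


# Assumed install URL, derived from the example value in this article.
print(files_base_uri("https://omekas.aws.ldas.jp/sandbox/"))
```

The printed value is what would go into base_uri for an install at that address.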

After the above configuration, download and install the module.

On AWS Lightsail, the following error occurred during installation.

The issue was resolved by creating a temp directory under omeka-s/files with the following commands:

After this configuration, the installation completed successfully.
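The directory creation described above can be sketched in Python as follows. This is an illustration rather than the exact commands I ran: the helper name ensure_temp_dir is hypothetical, and the path passed in is assumed to be your Omeka S root.

```python
from pathlib import Path


def ensure_temp_dir(omeka_root):
    """Create <omeka_root>/files/temp if it is missing and return its path."""
    temp_dir = Path(omeka_root) / "files" / "temp"
    # exist_ok=True makes the call safe to repeat on an existing directory.
    temp_dir.mkdir(parents=True, exist_ok=True)
    return temp_dir


# Example call; replace "omeka-s" with the root of your own install:
# ensure_temp_dir("omeka-s")
```

The web server user also needs write access to this directory. How to grant it depends on the environment, so that step is not shown here.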

Uploading Files

Create a new item and upload a PDF file as media.

After registering the item, the message “Extracting OCR in background.” is displayed, and as shown below, an XML file is added to the media section in the lower right.

When you examine the newly created XML file, you can confirm that the text has been saved as shown below.
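To read the saved text programmatically, something like the following sketch could be used. It assumes the file follows the pdf2xml layout written by pdftohtml in XML mode, with page elements that contain text elements. The sample string is made up for illustration and is much shorter than a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the pdf2xml layout; real files carry more
# attributes and additional elements such as fontspec.
SAMPLE = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="80" width="200" height="16" font="0">Hello <b>OCR</b></text>
    <text top="120" left="80" width="200" height="16" font="0">second line</text>
  </page>
</pdf2xml>
"""


def page_texts(xml_string):
    """Map each page number to the list of text lines found on that page."""
    root = ET.fromstring(xml_string)
    pages = {}
    for page in root.iter("page"):
        # itertext() also collects text nested in inline tags such as <b>.
        lines = ["".join(text.itertext()).strip() for text in page.iter("text")]
        pages[int(page.get("number"))] = lines
    return pages


print(page_texts(SAMPLE))
```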

Summary and Challenges

Uploading a PDF file also saved its OCR text at the same time. However, when I uploaded PDF files containing Japanese text, the OCR text was not generated properly. I plan to continue investigating this issue.

I tried this module in order to enable content search with the IIIF-Search module, but the search did not work. After some investigation, the cause appears to be that the generated XML files have the MIME type text/xml, whereas the IIIF-Search module expects application/vnd.pdf2xml+xml. I plan to continue investigating this issue as well.
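One way to spot this mismatch across many media is to compare the media type of each record with the expected value. The sketch below assumes media records shaped like the JSON returned by the Omeka S REST API, where the media type is stored under the o:media_type key. The sample records and the function name mismatched_xml_media are hypothetical.

```python
EXPECTED_MEDIA_TYPE = "application/vnd.pdf2xml+xml"


def mismatched_xml_media(media_list, expected=EXPECTED_MEDIA_TYPE):
    """Return (id, media type) pairs for XML media whose type differs from expected."""
    mismatches = []
    for media in media_list:
        media_type = media.get("o:media_type") or ""
        # Only XML media are of interest; PDFs and images are skipped.
        if media_type.endswith("xml") and media_type != expected:
            mismatches.append((media.get("o:id"), media_type))
    return mismatches


# Hypothetical records standing in for an API response.
sample = [
    {"o:id": 1, "o:media_type": "application/pdf"},
    {"o:id": 2, "o:media_type": "text/xml"},
]
print(mismatched_xml_media(sample))
```

This only reports the mismatch. It does not change the stored media type.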

I hope the findings from this investigation serve as a helpful reference for others.
