Overview

These are my notes from investigating how the partial-match search in Omeka S handles accents and similar character variations.

Symptom

For example, when stored items contain the string “tako” written in hiragana, searching for “tako” in hiragana, in katakana, or with a dakuten (voiced sound mark) added returns the same results in every case.

Cause

The cause appears to be the default collation, “utf8mb4_unicode_ci”, which is set during installation.

Specifically, this collation is “case-insensitive” (does not distinguish between uppercase and lowercase) and “accent-insensitive” (does not distinguish between accented and unaccented characters). In the same way, it treats hiragana and katakana as equal and ignores dakuten, which is why all three variations return the same search results.
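To make the behavior concrete, here is a minimal Python sketch that imitates this kind of loose comparison. It is not MySQL's actual collation algorithm, only an approximation: it strips combining marks (which include the dakuten), folds katakana to hiragana, and ignores case. The sample strings and the function names (`to_hiragana`, `loose_key`) are assumptions made for illustration.

```python
import unicodedata


def to_hiragana(text: str) -> str:
    """Map katakana U+30A1..U+30F6 to hiragana by subtracting 0x60."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c
        for c in text
    )


def loose_key(text: str) -> str:
    """Approximate an accent-, kana- and case-insensitive comparison key.

    This only imitates the effect of a collation such as
    utf8mb4_unicode_ci; it is not the algorithm MySQL uses.
    """
    # NFD splits a voiced kana into a base kana plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (accents, dakuten, handakuten).
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return to_hiragana(stripped).casefold()


if __name__ == "__main__":
    # Assumed sample strings: hiragana, katakana, and a dakuten variation.
    words = ["たこ", "タコ", "だこ"]
    for word in words:
        print(word, "->", loose_key(word))
    # A binary comparison, like utf8mb4_bin, keeps all three distinct.
    print("distinct under binary comparison:", len(set(words)))
```

All three sample strings end up with the same key under the loose comparison, while a plain code point comparison keeps them apart.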

Solution


When changing the collation, be careful not to affect existing data, and always back up your data before making changes.

If you want to distinguish between these variations, one approach is to set the collation of the “value” table to “utf8mb4_bin”.

After applying this setting, searching for the dakuten variation, for example, no longer matches the hiragana items and returns 0 results.
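As a rough sketch of how the collation change described above might be applied, the following Python helper builds two SQL statements: one to check the table's current collation and one to convert it. The helper name `build_collation_sql` is hypothetical, and the statements are standard MySQL syntax rather than anything specific to Omeka S. Note that `CONVERT TO` changes every character column in the table, not only the one holding the text values, so try it on a backup first.

```python
import re


def build_collation_sql(table: str = "value", collation: str = "utf8mb4_bin") -> dict:
    """Build SQL to inspect and change a table's collation (hypothetical helper)."""
    # Accept only simple identifiers so the names can be embedded safely.
    for name in (table, collation):
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            raise ValueError(f"unexpected identifier: {name!r}")
    # The character set is the collation prefix, e.g. utf8mb4 for utf8mb4_bin.
    charset = collation.split("_", 1)[0]
    return {
        "check": (
            "SELECT TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES "
            f"WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = '{table}';"
        ),
        "alter": (
            f"ALTER TABLE `{table}` "
            f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        ),
    }


if __name__ == "__main__":
    # Print the statements so they can be reviewed before being run manually.
    for label, statement in build_collation_sql().items():
        print(f"-- {label}")
        print(statement)
```

Printing the statements instead of executing them keeps the change a deliberate manual step, which fits the advice to back up the data beforehand.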

Summary

I hope this serves as a helpful reference when using Omeka.