Overview

I created a program that calculates the edit distance between witnesses in TEI/XML files that contain app elements.

You can use it from the following Google Colab notebook.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/編集距離を算出するプログラム.ipynb

When you upload an XML file, the program calculates the similarity between each pair of witnesses.
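The notebook's own extraction logic is not reproduced in this article, so the following is only a minimal sketch of how the text of each witness might be rebuilt from app elements before comparison. It assumes that readings are encoded as lem/rdg children carrying a wit attribute of the form "#A #B", and that text outside any app is shared by all witnesses. The sample XML, the sigla A, B, C, and the function name witness_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample: witnesses A and B read "d", witness C reads "x".
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p>abc<app><lem wit="#A #B">d</lem><rdg wit="#C">x</rdg></app>ef</p>'
    "</body></text></TEI>"
)


def witness_texts(xml_string, sigla):
    """Return {siglum: full text} by resolving every app for each witness."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI}body")
    parts = {s: [] for s in sigla}

    def add(text, targets):
        # Append a text fragment to every witness in `targets`.
        if text:
            for s in targets:
                parts[s].append(text)

    def walk(elem, targets):
        if elem.tag == f"{TEI}app":
            # Each reading contributes only to the witnesses named in @wit.
            for reading in elem:
                wits = [w.lstrip("#") for w in reading.get("wit", "").split()]
                walk(reading, [w for w in wits if w in targets])
        else:
            # Ordinary element: its text is common to all current targets.
            add(elem.text, targets)
            for child in elem:
                walk(child, targets)
                add(child.tail, targets)

    walk(body, list(sigla))
    return {s: "".join(p) for s, p in parts.items()}


print(witness_texts(SAMPLE, ["A", "B", "C"]))
# {'A': 'abcdef', 'B': 'abcdef', 'C': 'abcxef'}
```

Real files usually contain indentation and other markup, so a production version would also need to decide how to normalize whitespace and which elements to skip.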

Example

Let’s try uploading the following XML file.

https://tei-eaj.github.io/koui/data/nakamura.xml

The result is an Excel file like the following, which shows the similarity between witnesses at a glance.

| index | name1 | name2 | distance | ratio |
|---|---|---|---|---|
| 0 | 中村式五十音 | 中村式五十音又様 | 10 | 0.85 |
| 1 | 中村式五十音 | 中村式五十音欠損本 | 7 | 0.8947368421052632 |
| 2 | 中村式五十音又様 | 中村式五十音欠損本 | 8 | 0.868421052631579 |

The following library is used for calculating similarity.

https://pypi.org/project/python-Levenshtein/
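To make concrete what an edit distance measures, here is a pure-Python sketch of the Levenshtein distance applied to every pair of witnesses. It is a stand-in for illustration, not the notebook's code: in practice the library's Levenshtein.distance(a, b) and Levenshtein.ratio(a, b) would be used instead. The witness texts and the function names are hypothetical, and the "similarity" column uses one common normalization (1 - distance / longer length), which is an assumption and not necessarily how the ratio column above was computed.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pairwise_table(texts):
    """Build one row per unordered pair of witnesses."""
    rows = []
    for (n1, t1), (n2, t2) in combinations(texts.items(), 2):
        d = edit_distance(t1, t2)
        longest = max(len(t1), len(t2)) or 1  # avoid division by zero
        rows.append({"name1": n1, "name2": n2,
                     "distance": d, "similarity": 1 - d / longest})
    return rows


# Hypothetical witness texts for illustration only.
witnesses = {"A": "abcdef", "B": "abcxef", "C": "abcef"}
for row in pairwise_table(witnesses):
    print(row)
```

Rows in this shape map directly onto a spreadsheet with the name1, name2, and distance columns shown in the table above.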

Summary

The text comparison method itself could be refined further, but I hope this serves as a useful example of quantitative comparison between witnesses.

References

I have also added this feature to the “program for extracting differences between two texts” introduced in the following article.