Overview
I wrote a program that calculates the edit distance between witnesses in TEI/XML files containing <app> elements.
You can use it from the following Google Colab notebook.
https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/編集距離を算出するプログラム.ipynb
When you upload an XML file, the notebook calculates the similarity between each pair of witnesses.
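To compare witnesses, the text of each witness first has to be reconstructed from the apparatus. The following is a minimal sketch of one way to do that, not the notebook's actual code. It assumes that readings are encoded as <lem> or <rdg> children of <app> with a wit attribute such as "#A #B", and that text outside <app> is shared by all witnesses. The sample XML, the witness IDs, and the function name witness_texts are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample document (not taken from nakamura.xml).
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    'あい<app><lem wit="#A #B">う</lem><rdg wit="#C">ウ</rdg></app>'
    'えお<app><rdg wit="#A">か</rdg><rdg wit="#B #C">が</rdg></app>'
    '</p></body></text></TEI>'
)


def witness_texts(xml_string, witness_ids):
    """Return {witness_id: full text} under the assumptions stated above."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI_NS}body")
    parts = {w: [] for w in witness_ids}

    def add_to_all(text):
        # Text outside <app> is assumed to be common to every witness.
        for chunks in parts.values():
            chunks.append(text)

    def walk(node):
        if node.tag == f"{TEI_NS}app":
            # Inside <app>, each reading goes only to the witnesses it names.
            for reading in node:
                if reading.tag not in (f"{TEI_NS}lem", f"{TEI_NS}rdg"):
                    continue
                content = "".join(reading.itertext())
                for ref in reading.get("wit", "").split():
                    wit = ref.lstrip("#")
                    if wit in parts:
                        parts[wit].append(content)
            return
        if node.text:
            add_to_all(node.text)
        for child in node:
            walk(child)
            if child.tail:
                add_to_all(child.tail)

    walk(body)
    return {w: "".join(chunks) for w, chunks in parts.items()}


texts = witness_texts(SAMPLE, ["A", "B", "C"])
for name, text in texts.items():
    print(name, text)
```

A real TEI file may need more handling than this, for example nested <app> elements, witness groups, or whitespace normalization.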
Example
Let’s try uploading the following XML file.
https://tei-eaj.github.io/koui/data/nakamura.xml
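The result is an Excel file like the following, which shows the similarity between each pair of witnesses at a glance. Here, distance is the edit distance and ratio is a similarity score in which 1 means the two texts are identical.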
| index | name1 | name2 | distance | ratio |
|---|---|---|---|---|
| 0 | 中村式五十音 | 中村式五十音又様 | 10 | 0.85 |
| 1 | 中村式五十音 | 中村式五十音欠損本 | 7 | 0.8947368421052632 |
| 2 | 中村式五十音又様 | 中村式五十音欠損本 | 8 | 0.868421052631579 |
Similarity is calculated with the following library.
https://pypi.org/project/python-Levenshtein/
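For readers who want to see what these two numbers mean, here is a small pure-Python sketch that needs no installation. It is a stand-in for illustration, not the library's implementation and not the notebook's code. The distance function is the standard dynamic-programming Levenshtein distance. The ratio function assumes, to the best of my understanding of the library, that the similarity is (len(a) + len(b) - d) / (len(a) + len(b)), where d is an edit distance in which a substitution costs 2. The witness names and texts below are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b, substitution_cost=1):
    """Edit distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else substitution_cost
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def ratio(a, b):
    """Similarity in [0, 1]; assumes substitutions cost 2 (see lead-in)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein(a, b, substitution_cost=2)) / total


def pairwise_rows(texts):
    """Build one row per witness pair, with the same columns as the table."""
    rows = []
    for index, (name1, name2) in enumerate(combinations(texts, 2)):
        rows.append({
            "index": index,
            "name1": name1,
            "name2": name2,
            "distance": levenshtein(texts[name1], texts[name2]),
            "ratio": ratio(texts[name1], texts[name2]),
        })
    return rows


# Hypothetical witness texts.
witnesses = {"A": "あいうえおか", "B": "あいうえおが", "C": "あいウえおが"}
for row in pairwise_rows(witnesses):
    print(row)
```

Because ratio is based on a differently weighted distance, it cannot be derived from the distance column alone. In practice, Levenshtein.distance and Levenshtein.ratio from the library are much faster than this sketch.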
Summary
There is room for further consideration regarding text comparison methods, but I hope this serves as a useful example of quantitative comparison between witnesses.
References
I have also added this feature to the “program for extracting differences between two texts” introduced in the following article.