Overview

I wrote a program that extracts the differences between two texts. You can run it in the following Google Colab notebook.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/校異情報の生成.ipynb

A well-known service for this purpose is “difff”, but this time I implemented the comparison myself in Python.

https://difff.jp/

To calculate the differences between the texts, I used difflib.SequenceMatcher from the Python standard library.

https://docs.python.org/ja/3/library/difflib.html
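As a rough illustration of how difflib.SequenceMatcher reports differences, here is a minimal sketch. It is not the notebook's actual code: the function name extract_differences and the choice of autojunk=False are assumptions made for this example.

```python
from difflib import SequenceMatcher


def extract_differences(text_a, text_b):
    """Return (tag, span_from_a, span_from_b) for every non-matching region.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent elements in long sequences (an assumption made here so
    # that long texts are compared character by character).
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    differences = []
    # get_opcodes() yields (tag, i1, i2, j1, j2), where tag is one of
    # "equal", "replace", "delete", or "insert".
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, text_a[i1:i2], text_b[j1:j2]))
    return differences
```

Calling extract_differences on two strings returns only the regions where they disagree, which is the raw material for both output formats described below.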

Usage

You can choose between two output formats: HTML files and TEI files.

HTML

Here is an example of the HTML file output.
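The notebook's exact HTML markup is not reproduced in this text. As a hedged sketch of one way such output could be built, the following renders SequenceMatcher opcodes with the standard HTML del and ins elements. The function name to_html_diff and the choice of del/ins are assumptions for this example, not the notebook's implementation.

```python
from difflib import SequenceMatcher
from html import escape


def to_html_diff(text_a, text_b):
    """Render the differences between two strings as an HTML fragment.

    Hypothetical helper: text found only in text_a is wrapped in <del>,
    text found only in text_b is wrapped in <ins>.
    """
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared text is emitted once, escaped for HTML.
            parts.append(escape(text_a[i1:i2]))
            continue
        if i1 < i2:  # something was removed from, or replaced in, text_a
            parts.append(f"<del>{escape(text_a[i1:i2])}</del>")
        if j1 < j2:  # something was added in text_b
            parts.append(f"<ins>{escape(text_b[j1:j2])}</ins>")
    return "".join(parts)
```

The returned fragment can be placed inside any HTML page and styled with CSS, for example by colouring del and ins differently.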

TEI (XML)

Here is an example of the TEI/XML file output.

[XML listing not recoverable. It is a TEI document with an xml-model declaration that references the tei_all.rng schema (RELAX NG, http://relaxng.org/ns/structure/1.0).]

A key design feature is that the output uses the app element, which the TEI (Text Encoding Initiative) defines for critical apparatus entries. This means the results can be visualized with any tool that supports the app element.
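To make the idea concrete, here is a minimal sketch of how differences could be wrapped in app elements. It is an illustration under stated assumptions, not the notebook's code: the function name to_tei_apparatus and the witness identifiers #A and #B are hypothetical, and treating the first text as the lem (base reading) and the second as the rdg (variant reading) is a choice made for this example.

```python
from difflib import SequenceMatcher
from xml.sax.saxutils import escape


def to_tei_apparatus(text_a, text_b, wit_a="#A", wit_b="#B"):
    """Return a <p> element in which every difference is an <app> entry.

    Hypothetical helper: text_a supplies <lem>, text_b supplies <rdg>.
    wit_a and wit_b are placeholder witness references.
    """
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a_part = escape(text_a[i1:i2])  # escape &, <, > for XML
        b_part = escape(text_b[j1:j2])
        if tag == "equal":
            # Text shared by both witnesses stays outside any <app>.
            parts.append(a_part)
        else:
            # replace, delete, and insert all become one apparatus entry;
            # an empty <lem> or <rdg> marks text missing in that witness.
            parts.append(
                f'<app><lem wit="{wit_a}">{a_part}</lem>'
                f'<rdg wit="{wit_b}">{b_part}</rdg></app>'
            )
    return "<p>" + "".join(parts) + "</p>"
```

In a complete TEI file, this fragment would sit inside the text body, and the values used in wit would need matching witness entries declared in the header.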

For example, let’s try the “TEI Critical Apparatus Toolbox” at the link below.

http://teicat.huma-num.fr/witnesses.php

Uploading the XML file downloaded from Google Colab to the page above produces a visualization like the following.

Summary

I hope this serves as a useful reference for anyone who needs to compare two texts.