Notice

I have written a more accessible article explaining the workflow introduced here. Please refer to it as well.

Overview

I would like to introduce a prototype tool for creating annotated IIIF manifest files and TEI/XML files using NDL Klasseki OCR-Lite.

Creating Annotated IIIF Manifest Files

First, I created a Gradio app that takes a IIIF manifest file as input, runs NDL Koten OCR-Lite on its images, and outputs an annotated IIIF manifest file. It is hosted on Hugging Face Spaces.

https://nakamura196-ndlkotenocr-lite-iiif.hf.space/

As output, you get an annotated IIIF manifest file like the following.

[Listing: annotated IIIF manifest (JSON); contents not recoverable.]
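To make the structure of such a manifest concrete, the following Python sketch reads text annotations out of a manifest. It assumes the manifest follows IIIF Presentation API 3.0, with each canvas carrying an `annotations` list of annotation pages whose items have a `TextualBody` body and a target ending in `#xywh=x,y,w,h`. The sample manifest, its example.org URLs, and the function names are hypothetical, and the actual output of the app may differ in detail.

```python
from typing import Any, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]


def parse_xywh(target: str) -> Optional[Box]:
    """Return (x, y, w, h) from a target such as 'canvas-uri#xywh=10,20,30,40'."""
    _, sep, fragment = target.partition("#xywh=")
    if not sep:
        return None  # the target addresses the whole canvas
    x, y, w, h = (int(float(v)) for v in fragment.split(","))
    return (x, y, w, h)


def extract_text_annotations(manifest: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the text and region of every text annotation in the manifest."""
    results: List[Dict[str, Any]] = []
    for canvas in manifest.get("items", []):
        for page in canvas.get("annotations", []):
            for anno in page.get("items", []):
                body = anno.get("body")
                target = anno.get("target")
                # This sketch only handles a single dict body and a string target.
                if not isinstance(body, dict) or not isinstance(target, str):
                    continue
                results.append({
                    "canvas": canvas.get("id"),
                    "text": body.get("value", ""),
                    "box": parse_xywh(target),
                })
    return results


# Hypothetical manifest, hand-written for illustration only.
SAMPLE_MANIFEST = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": "https://example.org/manifest.json",
    "type": "Manifest",
    "items": [{
        "id": "https://example.org/canvas/1",
        "type": "Canvas",
        "annotations": [{
            "id": "https://example.org/canvas/1/annos",
            "type": "AnnotationPage",
            "items": [
                {
                    "id": "https://example.org/canvas/1/annos/1",
                    "type": "Annotation",
                    "motivation": "commenting",
                    "body": {"type": "TextualBody", "value": "いろは"},
                    "target": "https://example.org/canvas/1#xywh=100,200,50,400",
                },
                {
                    "id": "https://example.org/canvas/1/annos/2",
                    "type": "Annotation",
                    "motivation": "commenting",
                    "body": {"type": "TextualBody", "value": "にほへ"},
                    "target": "https://example.org/canvas/1#xywh=40,200,50,380",
                },
            ],
        }],
    }],
}

for item in extract_text_annotations(SAMPLE_MANIFEST):
    print(item["text"], item["box"])
```

Keeping the region as a plain `(x, y, w, h)` tuple makes it straightforward to map each line of OCR text to a rectangle on the page image later on.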

Creating TEI/XML Files

I created a library that takes the annotated IIIF manifest file obtained above as input and creates TEI/XML files.

https://www.npmjs.com/package/@nakamura196/iiif-to-tei

It can be used with the following configuration.

[Listing: package.json configuration for the library; contents not recoverable.]

Place manifest files in the data/input folder and run the following to output TEI/XML files in the data/output folder.

[Listing: Node.js script that converts the manifests in data/input to TEI/XML files in data/output; contents not recoverable.]
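The folder-based flow described above (read every manifest in data/input, convert it, write the result to data/output) can be sketched in Python as follows. This is not the library's code: `convert_directory` and `placeholder_convert` are hypothetical names, and the converter is a stand-in that a real conversion function would replace.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List
from xml.sax.saxutils import quoteattr


def placeholder_convert(manifest: Dict[str, Any]) -> str:
    """Stand-in converter (hypothetical): records only the manifest id."""
    return "<placeholder source=" + quoteattr(str(manifest.get("id", ""))) + "/>\n"


def convert_directory(
    input_dir: Path,
    output_dir: Path,
    convert: Callable[[Dict[str, Any]], str],
) -> List[Path]:
    """Convert every *.json manifest in input_dir and write *.xml to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for src in sorted(input_dir.glob("*.json")):
        manifest = json.loads(src.read_text(encoding="utf-8"))
        dest = output_dir / (src.stem + ".xml")
        dest.write_text(convert(manifest), encoding="utf-8")
        written.append(dest)
    return written


if __name__ == "__main__":
    # Only runs when the folder layout from the article exists.
    if Path("data/input").is_dir():
        for path in convert_directory(
            Path("data/input"), Path("data/output"), placeholder_convert
        ):
            print("wrote", path)
```

Passing the converter in as a function keeps the file handling separate from the conversion itself, so the same loop works with any manifest-to-XML function.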

An example of the output TEI/XML file is as follows.

[Listing: output TEI/XML document; contents not recoverable.]
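As a rough illustration of how OCR lines and their regions can be expressed in TEI, the sketch below builds a small document with Python's standard `xml.etree.ElementTree`. It assumes one plausible encoding: each region becomes a `zone` inside `facsimile/surface` with `ulx`, `uly`, `lrx`, `lry` coordinates, and each line of text becomes a `seg` that points to its zone through `corresp`. The function name `build_tei`, the `zone-0-N` id scheme, and the sample data are assumptions for illustration, and the library's actual output may be structured differently.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

Line = Tuple[str, Tuple[int, int, int, int]]  # (text, (x, y, w, h))


def tei(parent: ET.Element, tag: str, **attrib: str) -> ET.Element:
    """Create a child element in the TEI namespace."""
    return ET.SubElement(parent, "{%s}%s" % (TEI_NS, tag), attrib)


def build_tei(title: str, image_url: str, lines: List[Line]) -> str:
    """Build a minimal TEI document for one page image and its OCR lines."""
    root = ET.Element("{%s}TEI" % TEI_NS)

    # Minimal header: titleStmt, publicationStmt, and sourceDesc.
    file_desc = tei(tei(root, "teiHeader"), "fileDesc")
    tei(tei(file_desc, "titleStmt"), "title").text = title
    tei(tei(file_desc, "publicationStmt"), "p").text = "Unpublished sketch"
    tei(tei(file_desc, "sourceDesc"), "p").text = "Generated from OCR annotations"

    # One surface for the page image; one zone per OCR line.
    surface = tei(tei(root, "facsimile"), "surface")
    tei(surface, "graphic", url=image_url)
    block = tei(tei(tei(root, "text"), "body"), "ab")

    for index, (text, (x, y, w, h)) in enumerate(lines):
        zone_id = "zone-0-%d" % index
        zone = tei(
            surface, "zone",
            ulx=str(x), uly=str(y), lrx=str(x + w), lry=str(y + h),
        )
        zone.set(XML_ID, zone_id)
        tei(block, "seg", corresp="#" + zone_id).text = text

    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data for illustration only.
SAMPLE_LINES: List[Line] = [
    ("いろは", (100, 200, 50, 400)),
    ("にほへ", (40, 200, 50, 380)),
]

print(build_tei(
    "Sample page",
    "https://example.org/iiif/page1/full/max/0/default.jpg",
    SAMPLE_LINES,
))
```

The lower-right corner is computed as `x + w` and `y + h` because the `#xywh=` form gives a width and height, while TEI zones take two corner points.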

You can verify the output using Oxygen XML Editor.

Reference: Monorepo Development with Turborepo

I developed the npm package mentioned above in a monorepo managed with Turborepo.

The web app was developed with Next.js and is available at the following link.

https://iiif-tei-monorepo-web.vercel.app/

The API can be explored through Swagger UI at the following link.

https://iiif-tei-monorepo-web.vercel.app/api-docs

From Python, it can be used as follows.

[Listing: Python script that calls the API; contents not recoverable.]

Summary

This article introduced a workflow for creating TEI/XML files from OCR text produced by NDL Koten OCR-Lite.

In the future, I would like to build a single app that completes the whole process, rather than chaining multiple apps as described above.

There are various areas that need improvement, but I hope some parts serve as a useful reference.