Overview

The “SAT Daizokyo Text Database 2018” site describes itself as follows.

https://21dzk.l.u-tokyo.ac.jp/SAT2018/master30.php

This site is the 2018 version of the digital research environment provided by the SAT Daizokyo Text Database Research Society. Since April 2008, the society has offered full-text search across all 85 volumes of the text portion of the Taisho Shinshu Daizokyo, while improving usability through collaboration with various web services and exploring the possibilities of web-based research environments for the humanities.

SAT2018 adds new services, including IIIF-based linkage to high-resolution images that draws on increasingly widespread machine-learning technology, and the publication of modern Japanese translations, linked to the original text, that a high-school student can understand. We have also updated the Chinese characters in the main text to Unicode 10.0 and integrated most functions of the previously published SAT Taisho Image Database. This release also provides a framework for collaboration, and data will be expanded along these lines to further enhance usability.

The web services provided by our research society rely on services and support from many stakeholders. For the new services in SAT2018, we received support from the Institute for Research in Humanities regarding machine learning and IIIF integration, and from the Japan Buddhist Federation and Buddhist researchers nationwide in creating the modern Japanese translations. We hope that SAT2018 will be useful not only to Buddhist researchers but to anyone interested in Buddhist texts, and we would be delighted if the approach to applying technology to cultural materials presented here serves as a model for humanities research.

This article attempts a simple analysis of the text data published by the above database.

Description

We will use the text of “T0220 Mahaprajnaparamita Sutra” as our subject.

Method

Retrieving Text Data

Upon examining the network traffic, I found that text data can be retrieved from a URL like the following.

https://21dzk.l.u-tokyo.ac.jp/SAT2018/satdb2018pre.php?mode=detail&ob=1&mode2=2&useid=0220_,05,0001

In the useid value 0220_,05,0001, changing 05 to 06 retrieves the data for volume 6, and changing the trailing 0001 to 0011 retrieves the text including the context around page 0011.
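To illustrate the pattern, a URL for an arbitrary volume and page can be assembled like this (the function name and the zero-padding widths are my assumptions, inferred from the identifier format above):

```python
def build_url(vol: int, page: int) -> str:
    """Build a SAT2018 detail URL for the given volume and page.

    The page number is zero-padded to four digits and the volume to
    two, matching the observed "0220_,05,0001" identifier.
    """
    base = ("https://21dzk.l.u-tokyo.ac.jp/SAT2018/satdb2018pre.php"
            "?mode=detail&ob=1&mode2=2")
    return f"{base}&useid=0220_,{vol:02d},{page:04d}"

print(build_url(6, 11))
```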

Based on this pattern, I executed the following program.

import time

import requests

URL_TEMPLATE = (
    "https://21dzk.l.u-tokyo.ac.jp/SAT2018/satdb2018pre.php"
    "?mode=detail&ob=1&mode2=2&useid=0220_,{vol},{page_str}"
)

# NOTE: the original listing was garbled; the loop bounds and the step
# of 10 pages per request are assumptions. Adjust them to the volumes
# and page ranges actually needed.
for vol in ["05", "06"]:
    for page in range(1, 1000, 10):
        page_str = f"{page:04d}"
        url = URL_TEMPLATE.format(vol=vol, page_str=page_str)
        response = requests.get(url)
        response.raise_for_status()
        with open(f"sat_{vol}_{page_str}.html", "w", encoding="utf-8") as f:
            f.write(response.text)
        time.sleep(1)  # be polite to the server

The above process downloads the HTML files.

Parsing HTML Files

Execute the following program to extract text by ID.

import glob
import json

from bs4 import BeautifulSoup

# NOTE: the original listing was garbled; the file glob and the
# span-with-id selector are assumptions reconstructed from the JSON
# output below, in which each line of text is keyed by its id.
results = {}
for path in sorted(glob.glob("sat_*.html")):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "lxml")
    for span in soup.find_all("span"):
        span_id = span.get("id")
        if span_id:  # spans without an id do not contain relevant data
            # Keep the inner HTML as-is; tags are cleaned up later.
            results[span_id] = span.decode_contents()

with open("text.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

As a result, the following JSON data is obtained.

{
  "T0220_.05.0001a01": "No.220",
  ...
  "T0220_.05.0001a07": "<button class=\"ftntf\" style=\"font-size:8px;padding:2px\" title=\"西\">\n1\n</button>\n西",
  ...

Analysis Example

Let’s analyze the character occurrence frequency.

import json
from collections import Counter

with open("text.json", encoding="utf-8") as f:
    data = json.load(f)

# Count every character across all lines. Leftover HTML tags from the
# previous step are counted too, so the figures are approximate.
counter = Counter()
for text in data.values():
    counter.update(text)

print(counter.most_common(30))

The following results were obtained. Since HTML tags and other artifacts were not completely removed in the previous step, the results are not perfectly accurate, but they provide an overview of which characters are most frequently used.
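One way to tighten the counts would be to strip the markup before counting. The sketch below removes tags with a regular expression and keeps only CJK ideographs; the helper name and this regex-based approach are my own, not part of the original pipeline:

```python
import re
from collections import Counter

def strip_markup(text: str) -> str:
    """Drop HTML tags, then keep only CJK ideographs so that tag
    remnants, digits, and punctuation do not skew the counts."""
    no_tags = re.sub(r"<[^>]+>", "", text)
    return "".join(re.findall(r"[\u4e00-\u9fff]", no_tags))

# Example: a line as extracted, with a footnote button embedded.
sample = '<button class="ftntf">1</button>西大般若'
cleaned = strip_markup(sample)
print(Counter(cleaned).most_common(3))
```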

[(character, count) pairs for the 30 most frequent characters; the listing was garbled in extraction and is omitted]

Summary

This article presented a very simple analysis example of the texts published in “SAT Daizokyo Text Database 2018.”

I am grateful to everyone involved in the publication of this database.