This article was partially written by AI.

Overview

DHConvalidator is a tool for converting Digital Humanities (DH) conference abstracts into a consistent TEI (Text Encoding Initiative) text base.

https://github.com/ADHO/dhconvalidator

When using this tool, the following error occurred during the conversion process from Microsoft Word format (DOCX) to TEI XML format:

ERROR: nu.xom.ParsingException: cvc-complex-type.2.4.a: Invalid content was found starting with element 'ref'

This article shares the cause and solution for this issue.

Identifying the Cause

Investigation revealed that the cause of the problem was INCLUDEPICTURE field codes embedded within the Word document.

Specifically, when images were copied and pasted from Google Docs, field codes like the following remained in the document:

INCLUDEPICTURE "https://lh7-rt.googleusercontent.com/docsz/..." \* MERGEFORMATINET

These external image references were not handled correctly during TEI conversion. Judging from the error message, they ended up as a ref element in a position the schema does not allow, which caused XML validation to fail.
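To check whether a given file is affected, the field instructions can be listed straight from word/document.xml. The following is a minimal sketch and not part of the original tool: the function name find_includepicture_instructions is hypothetical, and it assumes each instruction sits in a single w:instrText element (Word can split long instructions across several runs).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_includepicture_instructions(docx_path):
    """Return the text of every w:instrText that mentions INCLUDEPICTURE."""
    # A DOCX file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [
        instr.text.strip()
        for instr in root.iter(f"{{{W_NS}}}instrText")
        if instr.text and "INCLUDEPICTURE" in instr.text
    ]
```

An empty list suggests that the document contains no such field codes and that the conversion error has a different cause.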

Solution

To resolve this issue, a Python script was developed to automatically remove the problematic field codes from DOCX files.

Script Features

  1. Safe processing: Preserves the image content itself and only removes the field code portions
  2. ZIP format support: Properly handles the internal structure of DOCX files (ZIP + XML)
  3. Namespace support: Accurate element searching that considers Word document XML namespaces

Main Processing Logic

  • Extract the DOCX file to a temporary directory
  • Parse the field code structure in word/document.xml
  • Identify fields containing INCLUDEPICTURE
  • Remove only field control elements (begin/separate/end) while preserving image elements
  • Generate a new DOCX file with the modified XML
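The extract, modify, and repack steps above can be sketched as follows. This is an illustration under stated assumptions rather than the script itself: rewrite_document_xml and its transform callback are hypothetical names, and the sketch only handles word/document.xml (headers, footers, and footnotes are left untouched).

```python
import os
import tempfile
import zipfile


def rewrite_document_xml(input_path, output_path, transform):
    """Copy a DOCX, passing word/document.xml through transform(bytes) -> bytes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # 1. Extract the archive, remembering the original member order.
        with zipfile.ZipFile(input_path) as archive:
            names = [n for n in archive.namelist() if not n.endswith("/")]
            archive.extractall(tmp_dir)

        # 2. Rewrite the main document part in place.
        doc_path = os.path.join(tmp_dir, "word", "document.xml")
        with open(doc_path, "rb") as f:
            data = f.read()
        with open(doc_path, "wb") as f:
            f.write(transform(data))

        # 3. Build the new DOCX from the extracted files.
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for name in names:
                archive.write(os.path.join(tmp_dir, name), name)
```

Keeping the member order of the input archive is a cautious choice here; the temporary directory is deleted automatically when the with block ends.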

Implementation Details

Field Code Detection

def is_includepicture_field(field_runs, ns):
    for run in field_runs:
        instr_text = run.find('.//w:instrText', ns)
        if instr_text is not None and instr_text.text:
            if 'INCLUDEPICTURE' in instr_text.text:
                return True
    return False
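Detection works on the list of runs that make up one field. Word stores a field as a sequence of w:r runs: one holding a fldChar of type begin, one or more holding instrText, one holding fldChar separate, the result runs (here, the image), and one holding fldChar end. A minimal sketch of how such lists might be collected is shown below; collect_fields is a hypothetical helper, and it assumes a field does not span several paragraphs.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
FLD_CHAR_TYPE = f"{{{NS['w']}}}fldCharType"


def collect_fields(paragraph):
    """Group the direct child runs of a w:p element into one list per field."""
    fields = []       # completed fields, each a list of runs
    open_fields = []  # stack of fields that have begun but not yet ended
    for run in paragraph.findall("w:r", NS):
        fld_char = run.find("w:fldChar", NS)
        char_type = fld_char.get(FLD_CHAR_TYPE) if fld_char is not None else None
        if char_type == "begin":
            open_fields.append([])       # a new (possibly nested) field starts
        if open_fields:
            open_fields[-1].append(run)  # run belongs to the innermost open field
        if char_type == "end" and open_fields:
            fields.append(open_fields.pop())
    return fields
```

Runs outside any field, such as ordinary text, are not included in any list, so they are never candidates for removal.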

Selecting Elements for Removal

def should_remove_run(run, ns):
    # Check if it has field control elements
    has_field_control = (run.find('.//w:fldChar', ns) is not None or
                         run.find('.//w:instrText', ns) is not None)
    # Check if it has actual image content
    has_image_content = (run.find('.//w:drawing', ns) is not None or
                         run.find('.//w:pict', ns) is not None)
    # Remove elements that have field control elements but no image content
    return has_field_control and not has_image_content
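To see the effect of this selection rule, the sketch below applies it to a hand-written paragraph. The sample XML, the example.com URL, and the empty w:drawing placeholder are made up for illustration, and the rule is restated inline under the hypothetical name is_control_only so that the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# One INCLUDEPICTURE field: begin, instruction, separate, image result, end.
SAMPLE = f"""
<w:p xmlns:w="{NS['w']}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> INCLUDEPICTURE "https://example.com/a.png" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:drawing/></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
"""


def is_control_only(run):
    """True for runs that carry field control elements but no image."""
    has_control = (run.find(".//w:fldChar", NS) is not None
                   or run.find(".//w:instrText", NS) is not None)
    has_image = (run.find(".//w:drawing", NS) is not None
                 or run.find(".//w:pict", NS) is not None)
    return has_control and not has_image


paragraph = ET.fromstring(SAMPLE)
for run in list(paragraph):  # iterate over a copy while removing
    if is_control_only(run):
        paragraph.remove(run)

remaining = [child.tag.split("}")[1] for run in paragraph for child in run]
print(remaining)  # ['drawing']
```

Only the run that holds the image survives, which matches the goal of dropping the field wrapper while keeping its result.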

Result

With this script, the problematic field codes are removed and the TEI conversion completes successfully. The images themselves remain embedded in the document.

Usage

python fix_docx_fields.py input.docx [output.docx]

If no output file name is specified, it is saved as input_fixed.docx.
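The default naming rule can be expressed in a few lines. This is a sketch of one way to derive the name, assuming the suffix is inserted before the extension; default_output_path is a hypothetical helper name.

```python
import os


def default_output_path(input_path):
    """Append _fixed to the file name, keeping directory and extension."""
    base, ext = os.path.splitext(input_path)
    return f"{base}_fixed{ext}"


print(default_output_path("input.docx"))  # input_fixed.docx
```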

However, a warning dialog was displayed when opening the generated file. I was unable to figure out how to fix this on the script side, but the file opened successfully after clicking the “Yes” button.

Summary

When images are copied from Google Docs or a web browser and pasted into Word, external reference links of this kind (INCLUDEPICTURE field codes) may be embedded in the document.

Since this issue may also occur in other DOCX processing systems, I hope this serves as a reference when encountering similar errors.

Script

#!/usr/bin/env python3
"""
DOCX Field Code Fixer
Removes INCLUDEPICTURE field codes
that cause errors during TEI conversion
"""
import os
import sys
import tempfile
import zipfile
import xml.etree.ElementTree as ET


def process_docx(input_file, output_file):
    """Extract the DOCX, fix word/document.xml, and write a new DOCX."""
    with tempfile.TemporaryDirectory() as temp_dir:
        # Extract the DOCX (a ZIP archive) to a temporary directory
        with zipfile.ZipFile(input_file) as archive:
            names = [n for n in archive.namelist() if not n.endswith('/')]
            archive.extractall(temp_dir)

        removed = process_document_xml(
            os.path.join(temp_dir, 'word', 'document.xml'))

        # Generate the new DOCX, keeping the original member order
        with zipfile.ZipFile(output_file, 'w', zipfile.ZIP_DEFLATED) as archive:
            for name in names:
                archive.write(os.path.join(temp_dir, name), name)
    return removed


def process_document_xml(xml_path):
    """Remove INCLUDEPICTURE field codes from word/document.xml."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    ns = {
        'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    }
    fld_char_type = f"{{{ns['w']}}}fldCharType"
    # ElementTree has no parent pointers, so map each element to its parent
    parent_map = {child: parent for parent in root.iter() for child in parent}

    removed = 0
    open_fields = []  # stack of run lists, one per field not yet closed
    for run in list(root.iter(f"{{{ns['w']}}}r")):
        fld_char = run.find('w:fldChar', ns)
        char_type = None
        if fld_char is not None:
            char_type = fld_char.get(fld_char_type)

        if char_type == 'begin':
            open_fields.append([])
        if open_fields:
            open_fields[-1].append(run)
        if char_type == 'end' and open_fields:
            field_runs = open_fields.pop()
            if is_includepicture_field(field_runs, ns):
                for field_run in field_runs:
                    if should_remove_run(field_run, ns):
                        parent_map[field_run].remove(field_run)
                        removed += 1

    tree.write(xml_path, encoding='UTF-8', xml_declaration=True)
    return removed


def is_includepicture_field(field_runs, ns):
    """Check whether the field contains an INCLUDEPICTURE instruction."""
    for run in field_runs:
        instr_text = run.find('.//w:instrText', ns)
        if instr_text is not None and instr_text.text:
            if 'INCLUDEPICTURE' in instr_text.text:
                return True
    return False


def should_remove_run(run, ns):
    """Determine whether the run should be removed."""
    # Check if it has field control elements
    has_field_control = (run.find('.//w:fldChar', ns) is not None or
                         run.find('.//w:instrText', ns) is not None)
    # Check if it has actual image content
    has_image_content = (run.find('.//w:drawing', ns) is not None or
                         run.find('.//w:pict', ns) is not None)
    # Remove elements that have field control elements but no image content
    return has_field_control and not has_image_content


def main():
    """Main function."""
    if len(sys.argv) < 2:
        print('Usage: python fix_docx_fields.py input.docx [output.docx]')
        sys.exit(1)

    input_file = sys.argv[1]
    if len(sys.argv) > 2:
        output_file = sys.argv[2]
    else:
        base, ext = os.path.splitext(input_file)
        output_file = f'{base}_fixed{ext}'

    try:
        removed = process_docx(input_file, output_file)
        print(f'Removed {removed} field control runs; saved {output_file}')
    except Exception as e:
        print(f'Error: {e}')
        sys.exit(1)


if __name__ == '__main__':
    main()