Introduction

When extracting the transparent text layer from PDF files, I encountered the problem of “the text order being different from the original PDF.” This article explains the cause of this problem and solutions in both JavaScript and Python. There may be some inaccuracies, but I hope it serves as a useful reference.

What Is PDF Transparent Text?

The transparent text layer of a PDF is searchable text information embedded within a PDF file. OCR-processed PDFs and digitally generated PDFs contain this transparent text layer, enabling the following features:

  • Text search
  • Copy and paste
  • Screen reader narration
  • Machine translation

The Problem: Why Text Order Gets Scrambled

PDF Internal Structure

PDF files store text in a format called “content streams.” These streams contain text and its position information, but it is not necessarily stored in reading order.

E[[[xPPPaooomssspiiilttteiii:ooonnnC:::onxxxc===e131p000t000u,,,alyyy===d243i000a000g,,,raTTTmeeexxxotttf==="""aHFBeooPaodDdtyFinnotcgteo"exn]"tt]"e]ntstream

Problems with Common Extraction Methods

Many PDF processing libraries extract text using the following steps:

  1. Retrieve text and position information from the content stream
  2. Sort by coordinates (top to bottom, left to right)
  3. Output the sorted results

This “sort by coordinates” process is the main cause of text order disruption.

Specific Problem Examples

  • Mixed vertical and horizontal writing: Commonly seen in Japanese documents
  • Multi-column layouts: Newspaper or magazine formats
  • Inserted figures and tables: Elements that break the flow of body text
  • Headers and footers: Elements that span across pages

Solutions: Language-Specific Approaches

JavaScript (PDF.js) Solution

PDF.js is a JavaScript-based PDF rendering library developed by Mozilla.

Implementation That Preserves Order

a}sycc}rOnoo)ercnnr};tdgsise;Uuefetttttxywhsrruteue::ieen-nTtmorxdipceesrntiitgtortxxd:tthhhreittie{ee:tedsoCCsrimm:eenooet..iarrnnadetttirevettnTmrretrdixeee.aameaTntnnaxsnn.myegrttrttssw.xa(rrffihottc)=a=,ooder;etyrrtidxTmatmmhgeteawpe[[,hrxiarx45tetniet]]axWttsC,,stiaeo-rtiprniahnavtscOsgietrenn(idt.gtnoehg.onreeti(thtcupcTeeosaoemoignxosrnettr.dg)eCiminoganP{tnipaDtn(tFseaie.tnltjrtesse(omoa)rrm;d=te>iornr{gd)er

Key Points

  • The getTextContent() method returns text in an order faithful to the PDF’s internal structure
  • The array index represents the original order
  • No re-sorting by coordinates is performed

Python (PyMuPDF) Solution

PyMuPDF (fitz) is a Python binding for the MuPDF library.

Implementation That Preserves Order

idmepforedfdtxoootcrcfr.ia=p#r#ieyctcaaflilztfgMwMseo_iee_enelstt_tttotp#ft:tde#ezihehteaoee(x.doxoxgPrxxp)Ptoxdtdrterttay_p,a__obigMwe1=2wdtclf==euinp::_ieeo_Pt(aptcxscbriDhpgRaEettsklf\adF_deagxxsoonwxofwett=bicr'_,r_i.r.=lnk.tdpntgaspo.llfijeteaeecta[ctgiiofoxertexttrg]keennritx(hnt_iiesxteelntp)utop.t(_slip(dmeen(gi_"itpinapfexx)endtneanega_rttp:tiyxne_egpar(r_ocpbt_t_eata"etrtelitet_tectsei."o=nexeth(teexgg)cxtxe)dixrtiek"lt.tx:ootv(nt=."isstcn"i"a(=gn+t.s))ndl"ee=ra):(gib0t.ipccol:(gsppodtro"ep(ene"dclta)ntt)ek#i(n:dearsn".(ni"Tesgltl,espeiex"atnsd[t,n(et]s"_rs)b["ttet:l],eearo)xxmuc:[ttck]")ot),ru:dr"ee"r))

Key Points

  • get_text("text") preserves the PDF content stream order
  • get_text("dict") allows obtaining detailed structural information
  • Avoid coordinate-based sorting

Python (pdfplumber) Issues

pdfplumber is a popular Python library, but it performs coordinate-based processing by default:

#iwmipptdohfrfptpoldrupfmdpp#tbflaeepugexrlmextubtemeir=xbrnaae.cpmroptappd_glefteen.e.(pxe(patxpdg(trfe)ro_sabp:pclaetetr_mhfta)oetrxiamtcss()apcpdopfo#r:rodOaircndhae)tremsaoyrtbiengdiisnrtueprtneadlly

Implementation Comparison Table

FeaturePDF.js (JavaScript)PyMuPDF (Python)pdfplumber (Python)
Content stream order preservationYesYesNo
Coordinate information retrievalYesYesYes
Processing speedMediumHighLow
Memory usageMediumLowHigh
Japanese supportYesYesPartial
Browser supportYesNoNo

Practical Example: Hybrid Approach

An implementation example that leverages both order preservation and coordinate information:

JavaScript Implementation

c}lac}a}g}g}soseesntyct}rtr}trshnoh)eSTe)GTePticniotxywh)toeti}r;eetDrssPsre::ie;urxufetxuFu.etr.ixdirttrrttrTctxetgtiitgnBn(euiBnetettsei:tthhbyMtrnyxoxreexnee:ttyP[aunOttrtaxrtaimm:ho.trorhE(IctvIlt..iics.hnbriix)ttCetIetttisoi...igsteWoenmrret.ottaaygi.r{minomd.aametrihb.intasttrsesnn.medoisx-naecheixtssw.xins(alxt=Mng=:rffihtn(.a-alOtoeti,oodeIa)t..rIr[tntirrtitteybyodt]a=aenmmhgee{x.;ree{;dlxd[[,hmst-xdrmaate45tsI;e(stwoCx]];wtbr);aaro,,,he.(idnemy{ptetns)Tare]ogpnn.<LpeawteseA)gh.Oeo5ftleiirdr)tor{.ltietegeegd({tbaemi(oodtrsnatyTe.a,rtecmlioixoabgmntrpo)ChCd(rotooi(d=nrnnie>sitgtrigee{dinpmento,ra(sel)iid;tnoidsroeadnxme)erin=lf>ionr(em{ation

Python Implementation

classddddeeeePffffDFsedfdrg#rg#rT_exoooeeeeeeiltcrcttSttMtxnfr.u_ou_auti.a=p#tifcrtrrtirEttcaetolnetnennx_etfgGxeroxxtt_x_ieetmsstbstasr(twt_t__biee_yo_ieas_izndilf(lbrbnlceit.udino)fyctyftlthomecdcb._oe_o.ofe_p,tteklftpodortr)mmeaxooeor(krie::senpi=icrxsdseigxt(al=nktiieygit=apgep.lf_tnl=in_ddeda0tgioiiaflnai[afgeenrtot.aalt]t_iiexteenetmleapnn.t(ss}im(seb_om(afg_"ipe)tssxdorssteoedtnaleewtardehnrtiynfmlh_del)um_cpb.'''''_feixerfmattelitoptbfsi)nt:r,ete."oneraeboin:e(rixg)cxigxonzdnmspaotekltgetxteees-edtn(t=.i_i'''''xe,xlfe"(=gnin:::::d[f_(id"eeta+e')pdnib0t.elpssss=db:aocl:(gm_appppbtcdto"esigaaaa1oh)i"clt.nennnnx):c)k#i(ad_....':tsn"pengggg]i"Tespxueeee[o,espe'mtttt1nx"an:,((((]a[t,nd"""",r]s(itbfsy)b["{teboix:l],exonz[fo)mtxte'oc:[_""""brk]i,,,,bm)noa:d"["0xte"]")'x)))],,,,[0]#))[x0,y0,x1,y1]

Best Practices

1. Choose Based on Use Case

defcieehflloiiouffss#r#r#reeeueue__FtsLtsCtecuueaueouxalr_yr_nrtslnconctnre-auaeat"st"sn"c=eoepetht=xraoyiti=ns=ebo"g=ai=xrnfsiltti_uen"yi"rdmlaalsoca"elrlainoct_c_ys_nththoo:btioe:ruaeodxdtpsnn(tpe_ret:u_rraid_ssi"no"eheeoarxy_arlitbcriytrractsiaishiizcde"zset):e"i::cooonro"ir:gdiinnaalteoridnefrormation

2. Error Handling

a}syt}}nrcyci}c}rccrofo)eaoef{ncrnr;ttntusC(oeCseucsunth!nthttrhorcetsueRunlnttceorcier(eiekxlnktmnie.[oxteeotre]ntfC.[fmv!err;Coow]oseimorsorna;rtsroantr=Ce;)rfteengIm(eemn(atD.{'Enpt'resTxtt.Nbxctetyiolthrxr=teCa.tatetdoricaemenanetwxsxctccxTatthetlteianeurxtfrtrdatoa.sec(ptucistpaentt(iagxdee'ogetrm(ne.Cissc)gon.ifenfda{ttpi:iTeal'lengt)extee;dt.'r:Ci)('ot;i,nettmeeesmrn.rtl=o(e>r)n);g{;th===0){

3. Performance Optimization

#dePfroedtfdcxooooetctrcsra.sa=lsebfycic_tnaoilntfpadtreog_iar_clsltgtihpp#tb#pdelaze_d_aaeaa(ar.sixtggMxtRgb)rgodeeeetceeagep=x=x_mhlte_eti=o=_e=cpnlimsdrtahPd(enixdypesN_Dfpnn=o-axeotF(d(r(icegtnespfdas[n[fespexd_ont]pf..atfpcgaraigags_a)eragcepept(tneitpah0_g_e_eot),ieintnbhd(dtedj,txsxx(eot]pttcbt+ar(etaaro"xtlbtcttc_a_ee)hptisx_acdstsghxi"ie_,n)zssge,ie=zn1bed0a,_)ti:ctdhox_t)sa:ilz_ep)a:ges)

Troubleshooting

Common Issues and Solutions

  1. Japanese character garbling
#teExxtpl=icpiatgley.gsepte_ctiefxyt(U"TtFe-x8t"e)n.ceondciondge('utf-8',errors='ignore').decode('utf-8')
  1. Handling vertical text
a}sycc}rOnoo)ercnnr};tdgsise;Uuefetttttxywhsrruteue::ieen-nTtmorxdipceesrntiitgtortxxd:tthhhreittie{ee:tedsoCCsrimm:eenooet..iarrnnadetttirevettnTmrretrdixeee.aameaTntnnaxsnn.myegrttrttssw.xa(rrffihottc)=a=,ooder;etyrrtidxTmatmmhgeteawpe[[,hrxiarx45tetniet]]axWttsC,,stiaeo-rtiprniahnavtscOsgietrenn(idt.gtnoehg.onreeti(thtcupcTeeosaoemoignxosrnettr.dg)eCiminoganP{tnipaDtn(tFseaie.tnltjrtesse(omoa)rrm;d=te>iornr{gd)er

0 3. Out of memory

a}sycc}rOnoo)ercnnr};tdgsise;Uuefetttttxywhsrruteue::ieen-nTtmorxdipceesrntiitgtortxxd:tthhhreittie{ee:tedsoCCsrimm:eenooet..iarrnnadetttirevettnTmrretrdixeee.aameaTntnnaxsnn.myegrttrttssw.xa(rrffihottc)=a=,ooder;etyrrtidxTmatmmhgeteawpe[[,hrxiarx45tetniet]]axWttsC,,stiaeo-rtiprniahnavtscOsgietrenn(idt.gtnoehg.onreeti(thtcupcTeeosaoemoignxosrnettr.dg)eCiminoganP{tnipaDtn(tFseaie.tnltjrtesse(omoa)rrm;d=te>iornr{gd)er

1

Summary

Order preservation is an important challenge in PDF transparent text extraction. The key points are:

  1. Understanding the problem: Coordinate-based sorting is the main cause of order disruption
  2. Choosing the right library: Use PDF.js (JavaScript) or PyMuPDF (Python)
  3. Implementation method: Maintain the content stream order
  4. Hybrid approach: Leverage both order and coordinate information

By applying this knowledge, you can build more accurate and reliable PDF text extraction systems.

References