Introduction

When extracting the transparent text layer from PDF files, I ran into a problem: the extracted text order differed from the original PDF. This article explains the cause of this problem and presents solutions in both JavaScript and Python. There may be some inaccuracies, but I hope it serves as a useful reference.
What Is PDF Transparent Text?

The transparent text layer of a PDF is searchable text information embedded within a PDF file. OCR-processed PDFs and digitally generated PDFs contain this transparent text layer, enabling the following features:

- Text search
- Copy and paste
- Screen reader narration
- Machine translation

The Problem: Why Text Order Gets Scrambled

PDF Internal Structure

PDF files store text in a format called “content streams.” These streams contain text and its position information, but the text is not necessarily stored in reading order.
```
Example: Conceptual diagram of a PDF content stream
[Position: x=100, y=200, Text="Heading"]
[Position: x=300, y=400, Text="Footnote"]
[Position: x=100, y=300, Text="Body text"]
```
Many PDF processing libraries extract text using the following steps:
1. Retrieve text and position information from the content stream
2. Sort by coordinates (top to bottom, left to right)
3. Output the sorted results

This “sort by coordinates” step is the main cause of text order disruption.
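To make the effect of the coordinate sort concrete, here is a minimal sketch that is not taken from any particular library. It assumes a hypothetical two-column page whose content stream lists the whole left column first. The `TextItem` type and the `sort_by_coordinates` helper are names invented for this illustration, and y is assumed to grow downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    x: float  # left edge of the item
    y: float  # distance from the top of the page (assumed to grow downward)


# Hypothetical two-column page, listed in content stream order:
# the whole left column first, then the whole right column.
stream_order = [
    TextItem("Left column, line 1", x=50, y=100),
    TextItem("Left column, line 2", x=50, y=120),
    TextItem("Right column, line 1", x=300, y=100),
    TextItem("Right column, line 2", x=300, y=120),
]


def sort_by_coordinates(items):
    """Naive 'top to bottom, left to right' ordering."""
    return sorted(items, key=lambda item: (item.y, item.x))


coordinate_order = sort_by_coordinates(stream_order)

print([item.text for item in stream_order])
print([item.text for item in coordinate_order])
```

Sorting by (y, x) interleaves the two columns line by line, while the stream order keeps each column intact.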
Specific Problem Examples

- Mixed vertical and horizontal writing: commonly seen in Japanese documents
- Multi-column layouts: newspaper or magazine formats
- Inserted figures and tables: elements that break the flow of body text
- Headers and footers: elements that span across pages

Solutions: Language-Specific Approaches

JavaScript (PDF.js) Solution

PDF.js is a JavaScript-based PDF rendering library developed by Mozilla.
Implementation That Preserves Order

```javascript
// Order-preserving text extraction using PDF.js
async function extractTextWithOrder(page) {
  // getTextContent() maintains the content stream order
  const textContent = await page.getTextContent();
  // items is an array preserving the original order
  const orderedText = textContent.items.map(item => {
    return {
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    };
  });
  // Use the array order as-is (no coordinate sorting)
  return orderedText;
}
```

Key Points

- The getTextContent() method returns text in an order faithful to the PDF’s internal structure
- The array index represents the original order
- No re-sorting by coordinates is performed

Python (PyMuPDF) Solution

PyMuPDF (fitz) is a Python binding for the MuPDF library.
Implementation That Preserves Order

```python
import fitz  # PyMuPDF

def extract_text_with_order(pdf_path):
    doc = fitz.open(pdf_path)
    for page_idx, page in enumerate(doc):
        # Method 1: Raw text extraction (content stream order)
        raw_text = page.get_text("text")
        # Method 2: Extraction preserving detailed structure
        if not raw_text.strip():
            text_dict = page.get_text("dict")
            page_texts = []
            # Process blocks in original order
            for block in text_dict.get("blocks", []):
                if block.get("type") == 0:  # Text block
                    for line in block.get("lines", []):
                        line_text = ""
                        for span in line.get("spans", []):
                            line_text += span.get("text", "")
                        if line_text.strip():
                            page_texts.append(line_text)
            text = "\n".join(page_texts)
        else:
            text = raw_text
        yield page_idx, text
    doc.close()
```
Key Points

- get_text("text") preserves the PDF content stream order
- get_text("dict") allows obtaining detailed structural information
- Avoid coordinate-based sorting

Python (pdfplumber) Issues

pdfplumber is a popular Python library, but it performs coordinate-based processing by default:
```python
# pdfplumber example (problematic approach)
import pdfplumber

pdf_path = "example.pdf"  # placeholder path

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # extract_text() performs coordinate sorting internally
        text = page.extract_text()  # Order may be disrupted
```
Implementation Comparison Table

| Feature | PDF.js (JavaScript) | PyMuPDF (Python) | pdfplumber (Python) |
| --- | --- | --- | --- |
| Content stream order preservation | Yes | Yes | No |
| Coordinate information retrieval | Yes | Yes | Yes |
| Processing speed | Medium | High | Low |
| Memory usage | Medium | Low | High |
| Japanese support | Yes | Yes | Partial |
| Browser support | Yes | No | No |
Practical Example: Hybrid Approach

An implementation example that leverages both order preservation and coordinate information:
JavaScript Implementation

```javascript
class PDFTextExtractor {
  constructor() {
    this.textItems = [];
  }

  async extractWithMetadata(page) {
    const textContent = await page.getTextContent();
    // Preserve original order while recording position information
    this.textItems = textContent.items.map((item, index) => ({
      originalIndex: index,  // Original order
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    }));
    return this.textItems;
  }

  // Sort by coordinates when needed
  getTextByPosition() {
    return [...this.textItems].sort((a, b) => {
      if (Math.abs(a.y - b.y) < 5) {  // Considered same line
        return a.x - b.x;  // Left to right
      }
      return b.y - a.y;  // Top to bottom
    });
  }

  // Get in original order
  getTextByOriginalOrder() {
    return this.textItems;  // Already in original order
  }
}
```

Python Implementation

```python
import fitz  # PyMuPDF

class PDFTextExtractor:
    def __init__(self):
        self.text_items = []

    def extract_with_metadata(self, pdf_path):
        doc = fitz.open(pdf_path)
        for page_num, page in enumerate(doc):
            # Get detailed information in dictionary format
            text_dict = page.get_text("dict")
            item_index = 0
            for block in text_dict.get("blocks", []):
                if block.get("type") == 0:  # Text block
                    for line in block.get("lines", []):
                        for span in line.get("spans", []):
                            self.text_items.append({
                                'original_index': item_index,
                                'page': page_num,
                                'text': span.get("text", ""),
                                'bbox': span.get("bbox", []),  # [x0, y0, x1, y1]
                                'font': span.get("font", ""),
                                'size': span.get("size", 0)
                            })
                            item_index += 1
        doc.close()
        return self.text_items

    def get_text_by_position(self):
        # Sort by coordinates when needed.
        # PyMuPDF uses a top-left origin, so a smaller y0 is higher on the page;
        # the page number comes first so pages are not interleaved.
        return sorted(self.text_items,
                      key=lambda x: (x['page'], x['bbox'][1], x['bbox'][0]))

    def get_text_by_original_order(self):
        # Maintain original order
        return self.text_items
```

Best Practices

1. Choose Based on Use Case

```python
def choose_extraction_method(use_case):
    if use_case == "full_text_search":
        # Full-text search: prioritize original order
        return "original_order"
    elif use_case == "layout_analysis":
        # Layout analysis: prioritize coordinate information
        return "position_based"
    elif use_case == "content_extraction":
        # Content extraction: hybrid
        return "hybrid"
```
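As a rough illustration of how the three labels ("original_order", "position_based", "hybrid") could be acted on, the sketch below orders a list of already extracted items. The `order_items` function, the `position_rank` key and the `line_tolerance` parameter are hypothetical names introduced here. It assumes each item carries an `original_index` and a `bbox` of the form [x0, y0, x1, y1] with y growing downward. The sample positions reuse the Heading / Footnote / Body text example from earlier in the article.

```python
def order_items(items, method, line_tolerance=5.0):
    """Order text items according to a method label.

    items: list of dicts with 'original_index' and 'bbox' ([x0, y0, x1, y1]),
    where y is assumed to grow downward (top-left origin).
    """
    if method == "original_order":
        return sorted(items, key=lambda it: it["original_index"])
    if method == "position_based":
        # Bucket y0 into rows so that small vertical jitter does not split a line.
        return sorted(
            items,
            key=lambda it: (round(it["bbox"][1] / line_tolerance), it["bbox"][0]),
        )
    if method == "hybrid":
        # Keep stream order, but attach the rank each item would get by position.
        by_position = order_items(items, "position_based", line_tolerance)
        rank = {id(it): n for n, it in enumerate(by_position)}
        return [
            {**it, "position_rank": rank[id(it)]}
            for it in order_items(items, "original_order")
        ]
    raise ValueError(f"unknown method: {method!r}")


sample_items = [
    {"original_index": 0, "text": "Heading", "bbox": [100, 200, 300, 220]},
    {"original_index": 1, "text": "Footnote", "bbox": [300, 400, 500, 410]},
    {"original_index": 2, "text": "Body text", "bbox": [100, 300, 500, 320]},
]

for method in ("original_order", "position_based", "hybrid"):
    print(method, [it["text"] for it in order_items(sample_items, method)])
```

Bucketing y into rows gives a consistent sort key, which a pairwise "within 5 units" comparison does not guarantee. The trade-off is that two items sitting just either side of a bucket boundary can still land on different rows.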
2. Error Handling

```javascript
async function safeExtractText(page) {
  try {
    const textContent = await page.getTextContent();
    // Check for empty text
    if (!textContent.items || textContent.items.length === 0) {
      console.warn('No text found in page');
      return [];
    }
    // Check for garbled characters
    const items = textContent.items.filter(item => {
      // Remove CID characters
      return !item.str.includes('(cid:');
    });
    return items;
  } catch (error) {
    console.error('Text extraction failed:', error);
    return [];
  }
}
```

```python
import fitz  # PyMuPDF

# Processing large PDFs
def extract_large_pdf(pdf_path, batch_size=10):
    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    for start_idx in range(0, total_pages, batch_size):
        end_idx = min(start_idx + batch_size, total_pages)
        batch_texts = []
        for page_idx in range(start_idx, end_idx):
            page = doc[page_idx]
            # Memory-efficient processing
            text = page.get_text("text")
            batch_texts.append(text)
            # Release page object
            page = None
        yield batch_texts
    doc.close()
```
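The same two checks as in the JavaScript example (skip empty results, drop strings that contain "(cid:" placeholders) can be sketched in Python on plain strings, independent of any PDF library. `clean_spans` and `CID_PATTERN` are hypothetical names. The assumption is that unmapped glyphs show up in the extracted text as "(cid:N)" with a numeric N, which depends on the extractor in use.

```python
import re

# Matches placeholders such as "(cid:123)" (assumed format for unmapped glyphs).
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def clean_spans(span_texts):
    """Drop empty strings and strings containing '(cid:N)' placeholders."""
    cleaned = []
    for text in span_texts:
        if not text or not text.strip():
            continue  # empty or whitespace-only
        if CID_PATTERN.search(text):
            continue  # contains an unmapped-glyph placeholder
        cleaned.append(text)
    return cleaned


print(clean_spans(["Hello", "", "(cid:123)(cid:45)", "   ", "World"]))
```

Dropping a whole string is the bluntest option. Depending on the document, replacing only the placeholder may lose less text.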
Troubleshooting

Common Issues and Solutions

1. Japanese character garbling

```python
# Explicitly specify UTF-8 encoding
text = page.get_text("text").encode('utf-8', errors='ignore').decode('utf-8')
```
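To show what an encode/decode round trip with errors='ignore' actually does, here is a small self-contained sketch. `drop_unencodable` is a hypothetical helper name, and the sample string is constructed by hand rather than taken from a real PDF.

```python
def drop_unencodable(text):
    """Round-trip through UTF-8, silently dropping code points that cannot be encoded."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")


# A lone surrogate (U+D800) cannot be encoded as UTF-8, so it is removed.
sample = "日本語\ud800テキスト"
cleaned = drop_unencodable(sample)
```

This only removes code points that UTF-8 cannot represent. Text that is already wrong because of a bad font mapping passes through unchanged, so it is not a general fix for garbling.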
2. Handling vertical text
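One possible approach to vertical text, offered as a sketch rather than the article's own method, is to separate lines by writing mode while keeping stream order within each group. The function below works on a dictionary shaped like the output of PyMuPDF's get_text("dict"). It assumes each line dictionary carries a `wmode` key (0 for horizontal, 1 for vertical). `split_by_writing_mode` is a hypothetical name and the sample data is hand-written.

```python
def split_by_writing_mode(page_dict):
    """Return (horizontal, vertical) line texts, each kept in stream order."""
    horizontal, vertical = [], []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip non-text blocks such as images
            continue
        for line in block.get("lines", []):
            text = "".join(span.get("text", "") for span in line.get("spans", []))
            if not text.strip():
                continue
            # wmode: 0 = horizontal, 1 = vertical (assumed dict layout)
            target = vertical if line.get("wmode", 0) == 1 else horizontal
            target.append(text)
    return horizontal, vertical


sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"wmode": 0, "spans": [{"text": "Hello "}, {"text": "world"}]},
                {"wmode": 1, "spans": [{"text": "縦書き"}]},
            ],
        },
        {"type": 1},  # image block, ignored
    ]
}

horizontal_lines, vertical_lines = split_by_writing_mode(sample_page)
```

Keeping the two groups separate avoids applying a horizontal "top to bottom, left to right" assumption to vertical lines.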
3. Out of memory
Summary

Order preservation is an important challenge in PDF transparent text extraction. The key points are:
- Understanding the problem: coordinate-based sorting is the main cause of order disruption
- Choosing the right library: use PDF.js (JavaScript) or PyMuPDF (Python)
- Implementation method: maintain the content stream order
- Hybrid approach: leverage both order and coordinate information

By applying this knowledge, you can build more accurate and reliable PDF text extraction systems.