This article was partially written by AI.
Overview#
DHConvalidator is a tool for converting Digital Humanities (DH) conference abstracts into a consistent TEI (Text Encoding Initiative) text base.
https://github.com/ADHO/dhconvalidator
When using this tool, the following error occurred during the conversion process from Microsoft Word format (DOCX) to TEI XML format:
    ERROR: nu.xom.ParsingException: cvc-complex-type.2.4.a: Invalid content was found starting with element 'ref'
This article shares the cause and solution for this issue.
Identifying the Cause#
Investigation showed that the problem was caused by INCLUDEPICTURE field codes embedded in the Word document.
Specifically, when images were copied and pasted from Google Docs, field codes like the following remained in the document:
    INCLUDEPICTURE "https://lh7-rt.googleusercontent.com/docsz/..." MERGEFORMATINET

These external image reference links were not properly processed during the TEI conversion process, causing XML validation errors.
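To check whether a given DOCX file is affected, the document XML can be scanned for such field instructions. The following is a minimal sketch and not part of the original script: the function name `find_includepicture_instructions` is hypothetical, and it assumes that each instruction sits in a single `w:instrText` element.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def find_includepicture_instructions(docx_source):
    """Return the text of every field instruction that mentions INCLUDEPICTURE.

    docx_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [
        node.text
        for node in root.iterfind(".//w:instrText", NS)
        if node.text and "INCLUDEPICTURE" in node.text
    ]
```

An empty list suggests that the document contains no such fields. Word can split one instruction across several runs, so a stricter check would need to join the text of adjacent `w:instrText` elements first.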
Solution#
To resolve this issue, a Python script was developed to automatically remove the problematic field codes from DOCX files.
Script Features#
- Safe processing: Preserves the image content itself and only removes the field code portions
- ZIP format support: Properly handles the internal structure of DOCX files (ZIP + XML)
- Namespace support: Accurate element searching that considers Word document XML namespaces
Main Processing Logic#
1. Extract the DOCX file to a temporary directory
2. Parse the field code structure in word/document.xml
3. Identify fields containing INCLUDEPICTURE
4. Remove only field control elements (begin/separate/end) while preserving image elements
5. Generate a new DOCX file with the modified XML
Implementation Details#
Field Code Detection#

    def is_includepicture_field(field_runs, ns):
        for run in field_runs:
            instr_text = run.find('.//w:instrText', ns)
            if instr_text is not None and instr_text.text:
                if 'INCLUDEPICTURE' in instr_text.text:
                    return True
        return False

Selecting Elements for Removal#

    def should_remove_run(run, ns):
        # Check if it has field control elements
        has_field_control = (run.find('.//w:fldChar', ns) is not None or
                             run.find('.//w:instrText', ns) is not None)
        # Check if it has actual image content
        has_image_content = (run.find('.//w:drawing', ns) is not None or
                             run.find('.//w:pict', ns) is not None)
        # Remove elements that have field control elements but no image content
        return has_field_control and not has_image_content

Result#
With this script, problematic field codes are removed and the TEI conversion process completes successfully. Images are preserved, properly embedded within the document.
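The steps listed under Main Processing Logic can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the full script: `strip_includepicture_fields`, `register_document_namespaces`, and `clean_docx` are hypothetical names, fields are assumed not to be nested, and the archive members are copied directly from one ZIP to another instead of being extracted to a temporary directory. It needs Python 3.8 or later for the `xml_declaration` argument.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def is_includepicture_field(field_runs, ns):
    """True if any run of the field carries an INCLUDEPICTURE instruction."""
    for run in field_runs:
        instr_text = run.find(".//w:instrText", ns)
        if instr_text is not None and instr_text.text:
            if "INCLUDEPICTURE" in instr_text.text:
                return True
    return False


def should_remove_run(run, ns):
    """True for runs that hold field control elements but no image."""
    has_field_control = (run.find(".//w:fldChar", ns) is not None or
                         run.find(".//w:instrText", ns) is not None)
    has_image_content = (run.find(".//w:drawing", ns) is not None or
                         run.find(".//w:pict", ns) is not None)
    return has_field_control and not has_image_content


def strip_includepicture_fields(root):
    """Remove control runs of INCLUDEPICTURE fields; return how many were removed.

    Assumption: fields are not nested and all runs of a field share one parent.
    """
    removed = 0
    for parent in list(root.iter()):
        in_field, field_runs = False, []
        for run in parent.findall("w:r", NS):
            fld_char = run.find("w:fldChar", NS)
            kind = None
            if fld_char is not None:
                kind = fld_char.get(f"{{{W_NS}}}fldCharType")
            if kind == "begin":
                in_field, field_runs = True, []
            if in_field:
                field_runs.append(run)
            if kind == "end" and in_field:
                in_field = False
                if is_includepicture_field(field_runs, NS):
                    for candidate in field_runs:
                        if should_remove_run(candidate, NS):
                            parent.remove(candidate)
                            removed += 1
    return removed


def register_document_namespaces(xml_bytes):
    """Register the document's own prefixes so ElementTree reuses them on output."""
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",)):
        try:
            ET.register_namespace(prefix, uri)
        except ValueError:
            pass  # prefixes such as "ns0" are reserved by ElementTree


def clean_docx(src, dst):
    """Copy the DOCX archive from src to dst, cleaning word/document.xml."""
    removed = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                register_document_namespaces(data)
                root = ET.fromstring(data)
                removed = strip_includepicture_fields(root)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            zout.writestr(item, data)
    return removed
```

Calling `clean_docx("input.docx", "input_fixed.docx")` returns the number of runs that were removed. The source and destination must be different files. ElementTree re-serializes the XML, so the cleaned `word/document.xml` is not byte-identical to what Word originally wrote.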
Usage#

    python fix_docx_fields.py input.docx [output.docx]

If no output file name is specified, it is saved as input_fixed.docx.
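The default naming rule can be expressed in a few lines. This is a sketch under assumptions: the helper name `resolve_paths` is hypothetical, and writing the output next to the input file is a guess, since the text only states the default file name.

```python
from pathlib import Path


def resolve_paths(args):
    """Return (input_path, output_path) for the given command-line arguments.

    args is the argument list without the script name, e.g. sys.argv[1:].
    """
    if not 1 <= len(args) <= 2:
        raise SystemExit("usage: python fix_docx_fields.py input.docx [output.docx]")
    input_path = Path(args[0])
    if len(args) == 2:
        return input_path, Path(args[1])
    # No output given: append "_fixed" to the input file's stem
    default_name = f"{input_path.stem}_fixed{input_path.suffix}"
    return input_path, input_path.with_name(default_name)
```

For example, `resolve_paths(["input.docx"])` yields `input_fixed.docx` as the output path, while an explicit second argument is used unchanged.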
However, a warning was displayed when opening the output file. I was unable to figure out how to prevent this on the script side, but the file opened successfully after clicking the “Yes” button.
Summary#
When images are copied from Google Docs or a web browser, external reference links like these may be embedded in the document.
Since this issue may also occur in other DOCX processing systems, I hope this serves as a reference when encountering similar errors.
Script# # " D R t " i i i i d d d d d i ! " O e h " m m m m e e e e e f / " C m a " p p p p f f f f f u X o t o o o o s v r r r r p " P A " i # w p p " P " # t r # n } # r # f # t p i " C " f r s " D " # h # h # r m " i i i o i t e _ m r F e c t t t t r " r r " f i r r " r " r o s e o r r s " h " o e h " e " a a e a " m f n u f r x n a / i s a o " o g " C t i o " o " P e o D F m P r S e i _ " e " r t o " t " C s C s R t i " p p t y c a i b e u z x t o c c s o r h n c c a e t e = i o r a e n i c u u e h _ h _ e u n M o l u p n : e m n i l p s i m e s e e : i o u o e # w # d i # w t e e r f ' n v o p r # r # i w # f v . t n k r i i r l r e f e i m r ( a r e p s t u o p s p p p p s e ( n d r e p l m s s n u t u a t i o f i ( s s s = = i { w d e c a u u h o e w ( c u n f n d m c i c m o n ) i t n r y _ t t r y r r t r y _ ) / o f . p s s p t p t t e E t P c C t f s s e n ' d e r n F n L = i R r r f l i n s _ i k e k a v : n ( i s f _ i s o i i s _ e C b T i e f _ u p u p e m x h r _ o r h " _ E t e : a _ s a s i s o l e t i " u f t i F r n l g e h s s n . i f o n . c n E n . n o l E l t i d D t u t u p t z o x s p e f F d t X T r n c s _ n o 0 e r # f i e m r p h t R d i r n i a e e i d i e a f y y t e l i s t e e t x t e = v d e I e r l o O _ t _ t t f r z i c m . r a z o i o h M . e n ' d o i t d = k u l f l o u a e e e e t n _ s f l m f _ f _ r s u s s ( x e l . ( x s ( c ( x = e m e e c C f _ f _ e i a i p e l p o t i r x c e L p e a h u e n o i n C d s v n r ( m p h t t s o i c c u _ n . " i e p f i s " e f i p a c e x X i f i f m l c p _ s _ a c e p e u a . m t r n a _ a p f h _ f e e a m x o i e f e r ' e v f r o r o n f c a U t = a " t _ P p " t " y P t o . _ l i l i p e t f r s p t e f r f d m d r g e t e t c r r l a o < = e c l # f j # w # i # i : i i . o m v c i x _ I r e u n u n s i t r s ( = t E ( d r t E ( _ t r i n E f f e l e l o . 
i e a h s n i o o e o s e s p m h o e l r r c h d i h f t n r d l e t f e t t N e _ a n t n t e i g a 1 s h r 1 o o i r 1 _ h o c v l i i e e r T D l f d t . s e l o r D n c e t p : o = o m a l r k a _ F e = C i C S = + h e i _ d u i l e C t r r e w l o v g ) y s . r ) c c o r ) m o c e e e l ( i a e O e . o h e _ w e t O t u ( r a / v p t o r . I e u r c o l o l h i k = e r m f f r e d = x L u u r h o h n i d n ) e s y e o x e n o a n e f r m l e s ( s = r m C . e c x d . , f f a z C _ m x o c / e 0 a . v u f N n n f h u d i l e n f n i j e s # f r i j u o i i { e l _ t U r n u a l a t t _ : . s x r _ s r i 3 s i s e d t s y p X Z x u = i o D Z i i r i X x e m o e s r f e n i C ( s o = a n _ l e i e f c _ o e p 1 p n v e l r _ d r r D n ( n s s h c f < a . i : f s a n s e i n s t r t N i o i t m s c O i d l l c p m n l t s c f a i s n L r [ r r d r + e j x e x + k i R r m + r s e d e e f u u i E r = = o o p r a s i i s p _ o l o t ( o ) r o n d r f p r e o t u C p i e e _ _ s l t _ ( h i g n = d U u i r u c t l t n = n e o t o _ ( _ m i r n n s P T u s f a f n r 2 y g r t I e n r _ r d n T i : ) n p i a i F a n s s m X F r _ p o a ( . f ) e e r d i a D n ] f u i f n 1 t < _ d _ e b i c m f i v o 1 b t r X p o e u s . I r n h i ( r c ( r r i t : t v g s n l g e o " r n r : e u r r l i c t . ( e i s i p a u v x x i m l a a [ n l E s i n s i s r _ f x r 1 f l o i f e l o u M a v l n : f n C u , o e r u t r u u e r c h [ v ( p d : c : c i e p e P : t e y e l t . p d n f l , n a t t e m m l a d p l ] l P ) e . e a l u r l t e u v e d a e _ n L t e d s i o T e u l u n u u n n l o o o 1 [ i u s c e o s e u m a P _ c D e a x a o t i e t h . d l l e s h l t ( I : l f n l = l e n u d _ a t d e l s _ f m r ) h d ( n t U n l d n . a n . . d l m n ] 2 n t ( o s d s t o t a f t i ( l m t c _ l ( f f h w _ _ . c ( h ' C d i o d l n n _ f k h e d h r c t a e , _ f c d R s d . f l . f f m ] p i m s e u a _ v h t i o r i l l h _ x e o i i = r a f f p o o ' i . 
T n t [ ( = s c l i p o _ o u o e t m c i o ( N E ) c f i f i i c a a f u f n p i s e s f e h l r e n ( . x m u l l = i s i i a p d . s / U b d b r r r . h d s i n r u n u r i o e o e n ' o ' : b o i n i i n n o n n i i t i p l n s i t e y c p t j m l t e e o t : l l t e e / / R e ( N e u u u r a a _ c l u l s n c v n u l t . n e n n d m n d d n d d x f _ l u e g ( . E l p o t . t u e o l ( p s s o s e e e h n s / p w E g ' o g n n n u p r c i t y n d _ t t e c n d a / e i t d ( a d ( ( t _ f e t t l T e r o r o t m i _ d u : s . ( { _ ) x w a : i . n i ] s s n p h s u _ t h r : o t _ i / n r r ( ' g ( ' ' r n l d l i _ e f i , o i e r _ p n p o t i . p f o p t m : r r f n / e n ) s e = a r f i r o + e u d } r n w a e o ' . e ' . . o o i o e l ' f d i k b n o p y f _ ( a c _ n p a i u a o l p a ' i / , u : [ n r a e i n e _ = n i u : n i m l . . l t n c n e { i l e o l p u l ( i d t t _ f a t l t t f ' g , e w a n j d n n _ e m r f s n I n I i d n o c / e x ( ) i l s e u e u t a ) l i e h x i o t h e p h r o , r l : n l t ] ( e i f l f o e 1 i g N s N n s v e w w w e h _ s : n e u : I t m t p c e r m ) m l s h . _ u ) e r a n d f d o i n x s I i d i v m e = C , C s i t e l w : n w : : l a u f y p , c N p a u e a , ) p : l e . . r p t : m m n p s l o l e t N e e e o l ' L L t n r d e : i t : p o e s s i s u c { C u t D t ( s _ _ , w j e a _ o a s h ) p d f k x _ n C l c l _ v d u U n U r s _ m f n e d i b m _ a e . t o e e L t i O ' ' d p a o l t f v t ) a C l f t r o L d o d r e t D s D T t t ( e l s n r c j e i g l a _ u s } U _ c C D . t r i a ' l i p h i e s : t h d f i _ u t U ( n _ u . f E ) E e r e c n d t t a t e n m e d r f t s " D f X O d e ' r t w k n a , l . t a _ o e r n D f t r n a - P : P x _ x o t C r w ' c t a . s g i p f ) E i f C o m ) , h ' ( ( t e I o e r c r l u . N E i r u ( p 8 I I t t t n s h T ( i , t s g " . v l u u P l i f X c p ) , t r h a } N r r ' h d n f o P e o n f p ' C C ' e . 
t a e d n ' e " p ) e t l I e e i x _ a ' e o ( r " C g n , a t ) i n I l l s i e , T T , x t a ( r x r g n , b _ " y } _ l C = l l f ' d s w z m o f c ) L / s r h e n e C d : e n U U t e i b ' t a ' s u c > ' f y T N d e i , i o i p t i _ U w n . e n d T _ r l d x R R n . x n e , ' w , ) n t o < i ! U o l r z r p _ , l p D o s g d ( a U r u d ( m E E s t t s g , i s n i 2 n l " R n c e : i d f d e a E r ) e c ' n R u n _ f l . ) e : i n n n i ) n t n o e ) E e o _ p ' i i f _ t P d t o . d E n s r i _ f x f n s n g s s o e p e t ) ) ) d f _ , l r i p h I p ( m / s , u e d i t i , ) s ) i n u l : e o i r e ) l a ) C r f p / n f , n l e e : e ) e n s i t t s f f s p x e ' . : e t T o ' l w e i k , d c l l s i l i o m . e o r . t e f d Z ) h U c { e : x e n e _ l d d e s i e s t n a d u o i d : o I , R e { t f t l s e n r a p s m o g o N n m o . c P E s { e l _ d ) p s u r c c a n e n N t e c o d n d u _ t s n d f : ) n a o o r o n n o o x n " W a o m D e f i s f C l i : ) t d d a t o t t n N c > e ) o l c e E m i n [ i h d m i e e t t s e o o r ) x n F p e g " e a _ a o s s e N ) N n n [ d ' t L _ l m w l r c g n , o N o o e t o ) . A d d l " d ' h e = w b n o n r ) e u d x T i / ] , a T h u e e n e n t o m E r c 2 } s r c r i t n e t p c l D ) o 0 } t n . o u l d o ) o u u ' ) d 0 } r s g n e e n , r r t m ) e 6 f u ) e t ) o . e a s / l c t e p t i d n s m d t ( n r n o t w a C u f t e i s c s z h i h r ' s m t x i i n a e { e a r ] p l ' r { r g T " _ e T { v e e ) o y n i x u p p s n c t t r e [ g o ) : e ' " n s ) w i t e " m e r = ] a n v = } g t i } e ) n ' } s . g b f " e l ) i g d m i C a n h g ' a e : r s T . y p e ' ) = = ' e n d ' :