Good day SO,
I am trying to copy part of a .docx file into another .docx file using Python, while keeping the formatting of the copied part as well as any images.
I have tried python-docx, but I am unable to find anything regarding images. Link to my previous question here: Extracting .docx data, images and structure
Is there a way for me to copy a part of a document, let's say DocA, and insert it at the end of DocB (including images and formatting, basically a clean copy-and-paste)?
Thanks a lot!
EDIT:
I have managed to find paragraphs containing images in DocA using the following code. I understand that it is a very hack-ish way as I am a complete beginner in python-docx, but here it is:
for x in document.paragraphs:
    if "<w:pict" in x._p.xml:
        print(x._p.xml)
Using this code, I successfully managed to find paragraphs containing the said images in the document. However, I am still unable to copy the image over to DocB (It appears as blanks in DocB), which is because (based on my understanding) I didn't extract the image data from the .docx file DocA.
EDIT 2:
Here is the XML of the Paragraph object containing the images:
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" w14:paraId="18A83B04" w14:textId="77777777" w:rsidR="00200C54" w:rsidRDefault="00051C61" w:rsidP="00200C54">
<w:pPr>
<w:jc w:val="center"/>
</w:pPr>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="30C19523">
<v:shapetype id="_x0000_t202" coordsize="21600,21600" o:spt="202" path="m,l,21600r21600,l21600,xe">
<v:stroke joinstyle="miter"/>
<v:path gradientshapeok="t" o:connecttype="rect"/>
</v:shapetype>
<v:shape id="Text Box 2" o:spid="_x0000_s1029" type="#_x0000_t202" style="position:absolute;left:0;text-align:left;margin-left:305.1pt;margin-top:112.75pt;width:86.25pt;height:19.5pt;z-index:1;visibility:visible;mso-wrap-distance-top:3.6pt;mso-wrap-distance-bottom:3.6pt;mso-width-relative:margin;mso-height-relative:margin" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQBxa2hSIQIAAB0EAAAOAAAAZHJzL2Uyb0RvYy54bWysU11v2yAUfZ+0/4B4X+x4cdNYcaouXaZJ
3YfU7gdgjGM04DIgsbtfvwtO06h7m+YHxPW9HM4997C+GbUiR+G8BFPT+SynRBgOrTT7mv543L27
psQHZlqmwIiaPglPbzZv36wHW4kCelCtcARBjK8GW9M+BFtlmee90MzPwAqDyQ6cZgFDt89axwZE
1yor8vwqG8C11gEX3uPfuylJNwm/6wQP37rOi0BUTZFbSKtLaxPXbLNm1d4x20t+osH+gYVm0uCl
Z6g7Fhg5OPkXlJbcgYcuzDjoDLpOcpF6wG7m+atuHnpmReoFxfH2LJP/f7D86/G7I7KtaTFfUmKY
xiE9ijGQDzCSIuozWF9h2YPFwjDib5xz6tXbe+A/PTGw7ZnZi1vnYOgFa5HfPJ7MLo5OOD6CNMMX
aPEadgiQgMbO6SgeykEQHef0dJ5NpMLjlfmqfL8sKeGYKxbLqzINL2PV82nrfPgkQJO4qanD2Sd0
drz3IbJh1XNJvMyDku1OKpUCt2+2ypEjQ5/s0pcaeFWmDBlquiqLMiEbiOeThbQM6GMldU2v8/hN
zopqfDRtKglMqmmPTJQ5yRMVmbQJYzNiYdSsgfYJhXIw+RXfF256cL8pGdCrNfW/DswJStRng2Kv
5otFNHcKFuWywMBdZprLDDMcoWoaKJm225AeRNTBwC0OpZNJrxcmJ67owSTj6b1Ek1/GqerlVW/+
AAAA//8DAFBLAwQUAAYACAAAACEAiK7BRuMAAAAQAQAADwAAAGRycy9kb3ducmV2LnhtbExPy26D
MBC8V+o/WFupl6oxQQESgon6UKtek+YDDN4ACl4j7ATy992e2stKuzM7j2I3215ccfSdIwXLRQQC
qXamo0bB8fvjeQ3CB01G945QwQ097Mr7u0Lnxk20x+shNIJFyOdaQRvCkEvp6xat9gs3IDF2cqPV
gdexkWbUE4vbXsZRlEqrO2KHVg/41mJ9PlysgtPX9JRspuozHLP9Kn3VXVa5m1KPD/P7lsfLFkTA
Ofx9wG8Hzg8lB6vchYwXvYJ0GcVMVRDHSQKCGdk6zkBUfElXCciykP+LlD8AAAD//wMAUEsBAi0A
FAAGAAgAAAAhALaDOJL+AAAA4QEAABMAAAAAAAAAAAAAAAAAAAAAAFtDb250ZW50X1R5cGVzXS54
bWxQSwECLQAUAAYACAAAACEAOP0h/9YAAACUAQAACwAAAAAAAAAAAAAAAAAvAQAAX3JlbHMvLnJl
bHNQSwECLQAUAAYACAAAACEAcWtoUiECAAAdBAAADgAAAAAAAAAAAAAAAAAuAgAAZHJzL2Uyb0Rv
Yy54bWxQSwECLQAUAAYACAAAACEAiK7BRuMAAAAQAQAADwAAAAAAAAAAAAAAAAB7BAAAZHJzL2Rv
d25yZXYueG1sUEsFBgAAAAAEAAQA8wAAAIsFAAAAAA==
" stroked="f">
<v:textbox>
<w:txbxContent>
<w:p w14:paraId="467DC1DB" w14:textId="77777777" w:rsidR="00200C54" w:rsidRDefault="00200C54" w:rsidP="00200C54">
<w:pPr>
<w:jc w:val="center"/>
</w:pPr>
<w:r>
<w:t>tLSTM</w:t>
</w:r>
</w:p>
</w:txbxContent>
</v:textbox>
</v:shape>
</w:pict>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="0D832600">
<v:line id="Straight Connector 8" o:spid="_x0000_s1028" style="position:absolute;left:0;text-align:left;flip:y;z-index:2;visibility:visible;mso-width-relative:margin;mso-height-relative:margin" from="205.4pt,44.35pt" to="249.05pt,45.55pt" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQBOa9I08gEAAMcDAAAOAAAAZHJzL2Uyb0RvYy54bWysU02P0zAQvSPxHyzfadqyQUvUdA+tlssK
KrVwn3XsxMJf8pim+feM3dIWuCFysGzPvJeZN8+rp5M17Cgjau9avpjNOZNO+E67vuVfD8/vHjnD
BK4D451s+SSRP63fvlmNoZFLP3jTyciIxGEzhpYPKYWmqlAM0gLOfJCOgspHC4mOsa+6CCOxW1Mt
5/MP1ehjF6IXEpFut+cgXxd+paRIX5RCmZhpOdWWyhrL+prXar2Cpo8QBi0uZcA/VGFBO/rplWoL
CdiPqP+islpEj16lmfC28kppIUsP1M1i/kc3+wGCLL2QOBiuMuH/oxWfj7vIdNdyGpQDSyPapwi6
HxLbeOdIQB/ZY9ZpDNhQ+sbtYu5UnNw+vHjxHZnzmwFcL0u9hykQySIjqt8g+YDhDD6paJkyOnzL
qZmOpGCnMpfpOhd5SkzQZV0/vK9rzgSFFvXyoYytgiazZGyImD5Jb1netNxol1WDBo4vmHIdt5R8
7fyzNqZM3jg2EufHeU3mEEAGVAYSbW0gSdD1nIHpydkixUKJ3uguwzMRTrgxkR2BzEWe7Px4oJI5
M4CJAtRH+YoUlH0PzZVuAYczuKPd2YpWJ3oPRlsayD3YuPxDWRx9aeqmZ969+m7axV+ik1tK2xdn
Zzven8tobu9v/RMAAP//AwBQSwMEFAAGAAgAAAAhAMrmE8fjAAAADgEAAA8AAABkcnMvZG93bnJl
di54bWxMj8FOwzAQRO9I/IO1SNyoY1TaJI1TIapyRLRw4ebGJomw15HtNIGvZzmVy0qj3Z15U21n
Z9nZhNh7lCAWGTCDjdc9thLe3/Z3ObCYFGplPRoJ3ybCtr6+qlSp/YQHcz6mlpEJxlJJ6FIaSs5j
0xmn4sIPBmn36YNTiWRouQ5qInNn+X2WrbhTPVJCpwbz1Jnm6zg6CZN9Xj3oYjh87HkQ69efUePu
Rcrbm3m3ofG4AZbMnC4f8NeB+KEmsJMfUUdmJSxFRvxJQp6vgdHBssgFsJOEQgjgdcX/16h/AQAA
//8DAFBLAQItABQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAAAAAAAAAAAAAAAAAAABbQ29udGVu
dF9UeXBlc10ueG1sUEsBAi0AFAAGAAgAAAAhADj9If/WAAAAlAEAAAsAAAAAAAAAAAAAAAAALwEA
AF9yZWxzLy5yZWxzUEsBAi0AFAAGAAgAAAAhAE5r0jTyAQAAxwMAAA4AAAAAAAAAAAAAAAAALgIA
AGRycy9lMm9Eb2MueG1sUEsBAi0AFAAGAAgAAAAhAMrmE8fjAAAADgEAAA8AAAAAAAAAAAAAAAAA
TAQAAGRycy9kb3ducmV2LnhtbFBLBQYAAAAABAAEAPMAAABcBQAAAAA=
" strokecolor="windowText" strokeweight="1.5pt">
<v:stroke dashstyle="dash" joinstyle="miter"/>
</v:line>
</w:pict>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="7B559002">
<v:line id="Straight Connector 9" o:spid="_x0000_s1027" style="position:absolute;left:0;text-align:left;z-index:3;visibility:visible;mso-width-relative:margin;mso-height-relative:margin" from="203.6pt,47.3pt" to="249.65pt,114pt" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQDiYn9F7wEAAL4DAAAOAAAAZHJzL2Uyb0RvYy54bWysU02P0zAQvSPxHyzfadJlC23UdA+tlssK
KrX8gFnHSSz8JY9pkn/P2P3YAjdEDtbY43me9+Zl/TQazU4yoHK25vNZyZm0wjXKdjX/fnz+sOQM
I9gGtLOy5pNE/rR5/249+Eo+uN7pRgZGIBarwde8j9FXRYGilwZw5ry0lGxdMBBpG7qiCTAQutHF
Q1l+KgYXGh+ckIh0ujsn+Sbjt60U8VvbooxM15x6i3kNeX1Na7FZQ9UF8L0SlzbgH7owoCw9eoPa
QQT2M6i/oIwSwaFr40w4U7i2VUJmDsRmXv7B5tCDl5kLiYP+JhP+P1jx9bQPTDU1X3FmwdCIDjGA
6vrIts5aEtAFtko6DR4rur61+5CYitEe/IsTP5BZt+3BdjL3e5w8gcxTRfFbSdqgPxePbTAJhARg
Y57GdJuGHCMTdLhYPi4/LjgTlFo+fi5XeVoFVNdiHzB+kc6wFNRcK5vEggpOLxjT81Bdr6Rj656V
1nng2rKBelyVC/KEAPJdqyFSaDwpgbbjDHRHhhYxZEh0WjWpPAHhhFsd2AnIU2TFxg1H6pkzDRgp
QUTylxWg2/elqZ8dYH8ubig6O9CoSL+BVoao3hdrmx6U2cgXUm8ypujVNdM+XLUmk2TaF0MnF97v
80TefrvNLwAAAP//AwBQSwMEFAAGAAgAAAAhAMoUjjTgAAAADwEAAA8AAABkcnMvZG93bnJldi54
bWxMT0tOwzAQ3SNxB2sqsaN2TVSaNE6FIFRiSekBpvGQRI3tKHY+vT1mBZuRnuZ988NiOjbR4Ftn
FWzWAhjZyunW1grOX++PO2A+oNXYOUsKbuThUNzf5ZhpN9tPmk6hZtHE+gwVNCH0Gee+asigX7ue
bPx9u8FgiHCouR5wjuam41KILTfY2pjQYE+vDVXX02gUmCo9jjSV5VGeb3zm/fWjwVKph9Xyto/n
ZQ8s0BL+FPC7IfaHIha7uNFqzzoFiXiWkaogTbbAIiFJ0ydgFwVS7gTwIuf/dxQ/AAAA//8DAFBL
AQItABQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAAAAAAAAAAAAAAAAAAABbQ29udGVudF9UeXBl
c10ueG1sUEsBAi0AFAAGAAgAAAAhADj9If/WAAAAlAEAAAsAAAAAAAAAAAAAAAAALwEAAF9yZWxz
Ly5yZWxzUEsBAi0AFAAGAAgAAAAhAOJif0XvAQAAvgMAAA4AAAAAAAAAAAAAAAAALgIAAGRycy9l
Mm9Eb2MueG1sUEsBAi0AFAAGAAgAAAAhAMoUjjTgAAAADwEAAA8AAAAAAAAAAAAAAAAASQQAAGRy
cy9kb3ducmV2LnhtbFBLBQYAAAAABAAEAPMAAABWBQAAAAA=
" strokecolor="windowText" strokeweight="1.5pt">
<v:stroke dashstyle="dash" joinstyle="miter"/>
</v:line>
</w:pict>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:pict w14:anchorId="1C829DE8">
<v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m#4#5l#4#11#9#11#9#5xe" filled="f" stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum #0 1 0"/>
<v:f eqn="sum 0 0 #1"/>
<v:f eqn="prod #2 1 2"/>
<v:f eqn="prod #3 21600 pixelWidth"/>
<v:f eqn="prod #3 21600 pixelHeight"/>
<v:f eqn="sum #0 0 1"/>
<v:f eqn="prod #6 1 2"/>
<v:f eqn="prod #7 21600 pixelWidth"/>
<v:f eqn="sum #8 21600 0"/>
<v:f eqn="prod #7 21600 pixelHeight"/>
<v:f eqn="sum #10 21600 0"/>
</v:formulas>
<v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
<o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype>
<v:shape id="Picture 6" o:spid="_x0000_s1026" type="#_x0000_t75" style="position:absolute;left:0;text-align:left;margin-left:247.8pt;margin-top:10pt;width:186pt;height:112.15pt;z-index:-1;visibility:visible" wrapcoords="17332 576 17332 2880 3571 3024 1742 3312 1742 5184 348 5328 348 6192 1742 7488 1742 9792 871 12096 871 12816 1481 14400 1742 16704 261 16848 261 18000 2439 19008 2613 21456 3135 21456 5661 21312 18726 19440 19945 19008 21426 17712 21339 16704 19510 14400 19510 7488 20816 7200 20816 5472 19510 4752 18639 3456 17855 2880 18639 2016 18639 864 17768 576 17332 576">
<v:imagedata r:id="rId8" o:title=""/>
<w10:wrap type="tight"/>
</v:shape>
</w:pict>
</w:r>
<w:r w:rsidR="00524183">
<w:rPr>
<w:noProof/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:pict w14:anchorId="63A496C5">
<v:shape id="Picture 5" o:spid="_x0000_i1025" type="#_x0000_t75" style="width:191.7pt;height:128.1pt;visibility:visible">
<v:imagedata r:id="rId9" o:title=""/>
</v:shape>
</w:pict>
</w:r>
</w:p>
The images are in the docx file but do not show up in document.inline_shapes (python-docx), so I have no idea how to continue. Any help is appreciated :)
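The v:imagedata elements in the XML above point at their images through relationship ids (r:id="rId8", r:id="rId9"). Those ids are resolved through the relationships file of the source package, which is why copying only the paragraph XML leaves blanks in DocB. As a minimal sketch using only the standard library, the ids can be mapped to the image files inside the package like this. The function name image_relationships is made up for illustration, and the sketch assumes the usual layout in which targets are relative to the word/ folder:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship files (*.rels).
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def image_relationships(docx_file):
    """Map relationship ids (e.g. 'rId8') to image part paths in the package."""
    with zipfile.ZipFile(docx_file) as package:
        rels_xml = package.read("word/_rels/document.xml.rels")
    mapping = {}
    for rel in ET.fromstring(rels_xml).iter(REL_NS + "Relationship"):
        # Image relationships have a Type URI ending in '/image'.
        if rel.get("Type", "").endswith("/image"):
            # Assumption: the target is relative to the 'word/' folder.
            mapping[rel.get("Id")] = "word/" + rel.get("Target")
    return mapping
```

Calling image_relationships("DocA.docx") would then show which media file each rId in the paragraph refers to. Copying the image still requires adding that file and a matching relationship to DocB.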
Check this code. It identifies the location of an image relative to the surrounding text:
tags = []
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
for t in document.element.iter():  # iter() replaces the deprecated getiterator()
    if t.tag in [W + 'r', W + 't', W + 'drawing']:
        if t.tag == W + 'drawing':
            print('Picture Found')
        else:
            print(t.text)
Check this code. It extracts the position of each image between two texts, along with the image name:
tags = []
text = []
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'
for t in doc.element.iter():  # iter() replaces the deprecated getiterator()
    if t.tag in [W + 'r', W + 't', PIC + 'cNvPr', W + 'drawing']:
        if t.tag == PIC + 'cNvPr':
            print('Picture Found: ', t.attrib['name'])
            tags.append('Picture')
            text.append(t.attrib['name'])
        elif t.text:
            tags.append('text')
            text.append(t.text)
You can check the previous and next entries in the text list and their kinds in the tags list.
If you have extracted the image location and image name, you can add the image to your docx file with this code:
from docx import Document
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
r.add_picture('/tmp/foo.jpg')
r.add_text(' do you like it?')
document.save('demo.docx')
You can access the images by unzipping your docx file. Unzipping produces several folders, and all the images in the file are in the word/media folder.
Check this link for unzipping a docx file:
https://towardsdatascience.com/how-to-extract-data-from-ms-word-documents-using-python-ed3fbb48c122
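Since a .docx file is a zip archive, the unzip step described above can also be done from Python with the standard zipfile module. This is only a sketch. The function name extract_media and the output directory are placeholders chosen for illustration:

```python
import zipfile


def extract_media(docx_file, out_dir):
    """Extract every file under word/media/ and return the archive names."""
    with zipfile.ZipFile(docx_file) as package:
        names = [n for n in package.namelist() if n.startswith("word/media/")]
        # The files are written to <out_dir>/word/media/...
        package.extractall(out_dir, members=names)
    return names
```

For example, extract_media("DocA.docx", "/tmp/doca") returns the list of image paths found in the package, and the extracted files can then be passed to add_picture as in the code above.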
This may not be a direct answer to your question, but it is worth considering.
If you have control over docA, have you considered using a docx template? In my case I needed to generate reports from a template, so I had to copy information from Python variables into a document. I found this library, which does the replacement: https://github.com/elapouya/python-docx-template
You can then replace the content from your variables like this:
from docxtpl import DocxTemplate
doc = DocxTemplate("my_word_template.docx")
context = { 'company_name' : "World company" }
doc.render(context)
doc.save("generated_doc.docx")
I have not checked, but I believe this preserves formatting. Here is an example of what my template looked like before replacing variables:
I have the following XML text:
<R-PORT-PROTOTYPE>
<SHORT-NAME>VtDBGO_ReMotTqReqL2_null</SHORT-NAME>
<REQUIRED-COM-SPECS>
<NONQUEUED-RECEIVER-COM-SPEC>
<DATA-ELEMENT-REF DEST="VARIABLE-DATA-PROTOTYPE">/FS_DBGO_pkg/FS_DBGO_if/VtDBGO_ReMotTqReqL2_null/VtDBGO_ReMotTqReqL2_null</DATA-ELEMENT-REF>
<USES-END-TO-END-PROTECTION>false</USES-END-TO-END-PROTECTION>
<ALIVE-TIMEOUT>0</ALIVE-TIMEOUT>
<ENABLE-UPDATE>false</ENABLE-UPDATE>
<FILTER>
<DATA-FILTER-TYPE>ALWAYS</DATA-FILTER-TYPE>
</FILTER>
<HANDLE-NEVER-RECEIVED>false</HANDLE-NEVER-RECEIVED>
<INIT-VALUE>
<NUMERICAL-VALUE-SPECIFICATION>
<VALUE>0</VALUE>
</NUMERICAL-VALUE-SPECIFICATION>
</INIT-VALUE>
</NONQUEUED-RECEIVER-COM-SPEC>
</REQUIRED-COM-SPECS>
<REQUIRED-INTERFACE-TREF DEST="SENDER-RECEIVER-INTERFACE">/FS_DBGO_pkg/FS_DBGO_if/VtDBGO_ReMotTqReqL2_null</REQUIRED-INTERFACE-TREF>
</R-PORT-PROTOTYPE>
How can I replace the SHORT-NAME of VtDBGO_FrMotTqReqL2_null with the attribute content,
so that it looks like the following text?
<R-PORT-PROTOTYPE UUID="E8000CF6-DAFD-49C7-B1D4-D7EC20F43654">
<SHORT-NAME>VtDBGO_FrMotTqReqL2_null</SHORT-NAME>
<REQUIRED-COM-SPECS>
<NONQUEUED-RECEIVER-COM-SPEC>
<DATA-FILTER-TYPE>ALWAYS</DATA-FILTER-TYPE>
</FILTER>
<HANDLE-NEVER-RECEIVED>false</HANDLE-NEVER-RECEIVED>
<INIT-VALUE>
<RECORD-VALUE-SPECIFICATION>
<FIELDS>
<NUMERICAL-VALUE-SPECIFICATION>
<SHORT-LABEL>ElementConstant</SHORT-LABEL>
<VALUE>0</VALUE>
</NUMERICAL-VALUE-SPECIFICATION>
<NUMERICAL-VALUE-SPECIFICATION>
<SHORT-LABEL>ElementConstant_1</SHORT-LABEL>
<VALUE>0</VALUE>
</NUMERICAL-VALUE-SPECIFICATION>
</FIELDS>
</RECORD-VALUE-SPECIFICATION>
</INIT-VALUE>
</NONQUEUED-RECEIVER-COM-SPEC>
</REQUIRED-COM-SPECS>
Here are my two attempts, but neither worked:
pattern_update=re.compile(r"""<R-PORT -PROTOTYPE>
\s+<SHORT-NAME>{0}</SHORT-NAME>
.*?<INIT-VALUE>((?:.(?!<REQUIRED-COM-SPECS>))*?)</INIT-VALUE>.*?
\s+</R-PORT-PROTOTYPE>""".format(port),re.DOTALL | re.MULTILINE)
_ar_xml_str = re.sub(pattern_update, replace_str, ar_xml_str)
ar_xml_str = _ar_xml_str
pattern_update=re.compile(r"""<SHORT-NAME>VtDBGO_ReMotTqReqL2_null</SHORT-NAME>
.*?<INIT-VALUE>(.*?)</INIT-VALUE>""",re.DOTALL | re.MULTILINE)
_ar_xml_str = re.sub(pattern_update, replace_str, ar_xml_str)
ar_xml_str = _ar_xml_str
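Nested XML like this is hard to match reliably with regular expressions, so an alternative is to parse the text and edit the tree. The following is a minimal sketch, assuming the goal is to swap the contents of INIT-VALUE inside the port with a given SHORT-NAME. The function name replace_init_value and its parameters are hypothetical, and the sketch assumes the XML has no default namespace, as in the snippet shown (a full ARXML file may declare one, in which case the tag names need to be qualified):

```python
import xml.etree.ElementTree as ET


def replace_init_value(xml_text, port_name, new_value_xml):
    """Replace the children of <INIT-VALUE> in the port named port_name."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself when its tag matches.
    for port in root.iter("R-PORT-PROTOTYPE"):
        if port.findtext("SHORT-NAME") != port_name:
            continue
        for init in port.iter("INIT-VALUE"):
            # Drop the old value specification, then attach the new one.
            for child in list(init):
                init.remove(child)
            init.append(ET.fromstring(new_value_xml))
    return ET.tostring(root, encoding="unicode")
```

Here new_value_xml would be a string holding the replacement element, for example the RECORD-VALUE-SPECIFICATION block from the desired output.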
I am trying to access the name of a picture in a Word document using python-docx, but I also need to know which paragraph and run it is contained in, so I cannot use inline_shapes.
from docx import Document
docx = Document()
section = docx.sections[0]
p = docx.add_paragraph()
run = p.add_run()
img = run.add_picture("pptExporter.png", 100000)
a = run._r.xml
print(a)
b = run._r.drawing.inline.graphic.graphicData.pic.nvPicPr.get('name')
p2 = docx.add_paragraph("hello", style="Caption")
docx.save("test.docx")
When I print the xml I get:
<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<w:drawing>
<wp:inline xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<wp:extent cx="100000" cy="75021"/>
<wp:docPr id="1" name="Picture 1"/>
<wp:cNvGraphicFramePr>
<a:graphicFrameLocks noChangeAspect="1"/>
</wp:cNvGraphicFramePr>
<a:graphic>
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic>
<pic:nvPicPr>
<pic:cNvPr id="0" name="pptExporter.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId9"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="100000" cy="75021"/>
</a:xfrm>
<a:prstGeom prst="rect"/>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
</w:r>
but I get the following error:
Traceback (most recent call last):
File "file", line 11, in <module>
b = run._r.drawing.inline.graphic.graphicData.pic.nvPicPr.get('name')
AttributeError: 'CT_R' object has no attribute 'drawing'
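The traceback shows that the run element does not expose drawing as an attribute, so one workaround is to search the run's XML by tag instead of walking attributes. The sketch below uses only the standard library and takes the string that run._r.xml returns in the code above. The helper name picture_names is made up for illustration:

```python
import xml.etree.ElementTree as ET

PIC_NS = {"pic": "http://schemas.openxmlformats.org/drawingml/2006/picture"}


def picture_names(run_xml):
    """Return the name attribute of every pic:cNvPr element in a run's XML."""
    root = ET.fromstring(run_xml)
    return [el.get("name") for el in root.iterfind(".//pic:cNvPr", PIC_NS)]


# Hypothetical usage with python-docx (not executed here):
# for p_idx, paragraph in enumerate(document.paragraphs):
#     for r_idx, run in enumerate(paragraph.runs):
#         for name in picture_names(run._r.xml):
#             print(p_idx, r_idx, name)
```

Because the loop visits paragraphs and runs explicitly, the paragraph and run index of each picture are known at the point where its name is found.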
I am using the script below to remove child nodes based on the image type from the XML below. It raised an error because of the xmlns header, so I removed the header and tried again, but it still removes only 3 of the 5 matching child nodes.
Can you please check?
<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (c) All rights reserved. -->
<dummy_list xmlns="https://dummy_list_file"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="template.xsd">
<dummy_capability>
<dummy_type>1</dummy_type>
<dummy_type_string>dummy_3700E</dummy_type_string>
<dummy_image>c3700</dummy_image>
<dummy_string>dummy3702E,dummy3701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>2</dummy_type>
<dummy_type_string>dummy_2700E</dummy_type_string>
<dummy_image>c2700</dummy_image>
<dummy_string>dummy2702E,dummy2701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>3</dummy_type>
<dummy_type_string>dummy_1700E</dummy_type_string>
<dummy_image>c1700</dummy_image>
<dummy_string>dummy1702E,dummy1701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>4</dummy_type>
<dummy_type_string>dummy_4700E</dummy_type_string>
<dummy_image>c4700</dummy_image>
<dummy_string>dummy4702E,dummy4701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>4</dummy_type>
<dummy_type_string>dummy_4700E</dummy_type_string>
<dummy_image>c4700</dummy_image>
<dummy_string>dummy4702E,dummy4701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>4</dummy_type>
<dummy_type_string>dummy_4700E</dummy_type_string>
<dummy_image>c4700</dummy_image>
<dummy_string>dummy4702E,dummy4701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
</dummy_list>
#!/router/bin/python3-3.6.3
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse('dummy.xml')
root = tree.getroot()
for child in root:
    if (child.find('dummy_image').text == 'c3700'):
        print("Removing child: " + child.find('dummy_image').text)
        root.remove(child)
tree.write('out.xml')
How can I parse this with the following attributes also present?
xmlns="https://dummy_list_file"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="template.xsd"
Why is it not removing all the child nodes for a particular image type?
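Two separate things are at play here. With xmlns="https://dummy_list_file" present, every element lives in that namespace, so find('dummy_image') returns None unless the tag is qualified. And removing children from root while iterating over root makes the loop skip the sibling that follows each removed node, which is why consecutive matches survive. A minimal sketch that handles both, using a made-up function name remove_by_image and working on a string for brevity:

```python
import xml.etree.ElementTree as ET

NS = {"d": "https://dummy_list_file"}


def remove_by_image(xml_text, image):
    """Remove every <dummy_capability> whose <dummy_image> equals image."""
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", NS["d"])
    root = ET.fromstring(xml_text)
    # Iterate over a copy: removing from root while iterating it skips siblings.
    for child in list(root):
        if child.findtext("d:dummy_image", namespaces=NS) == image:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")
```

The same two changes (list(root) and the namespaces argument) apply unchanged when the tree comes from tree.parse('dummy.xml') as in the script above.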
Another method.
from simplified_scrapy import SimplifiedDoc,utils
import json
xml = utils.getFileContent('dummy.xml')
doc = SimplifiedDoc(xml)
dummy_capabilitys = doc.selects('dummy_image').contains('c3700').parent
for dummy_capability in dummy_capabilitys:
    dummy_capability.repleaceSelf("")
utils.saveFile("out.xml",doc.html)
# Get attributes
root = doc.select('dummy_list')
print (root["xmlns"],root["xmlns:xsi"],root["xsi:schemaLocation"])
Result:
https://dummy_list_file http://www.w3.org/2001/XMLSchema-instance template.xsd
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I have plain text file with words in each line:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3521 >India<TOPONYM>/O
3526 >Zimbabwe<TOPONYM>/O
3531 >England<TOPONYM>/O
3536 >Melbourne<TOPONYM>/O
3541 >England<TOPONYM>/O
3546 >England<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3556 >England<TOPONYM>/O
3561 >England<TOPONYM>/O
3566 >Australia<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3821 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4234 >Hampden<TOPONYM>/O
4239 >Hampden<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4845 >Edinburgh<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove the repeated location names from this list, so that it looks like this:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
I want to remove the duplicate location names, while the docid lines should remain in the file. I know this can be done on Linux with uniq, but running that would also remove locations that repeat across different docids.
Is there any way to split the file at every docid and remove duplicate location names only within each docid?
I am writing from mobile, so this will not be a complete solution, but the key points:
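import re

# Raw strings keep \d intact; the name ends at the first '<' or '/'.
Docid = re.compile(r"^ *\d+ +<DOCID>")
Location = re.compile(r"^ *\d+ +>?([^</]+)")
Lines = {}
for line in file:  # file: an already opened input file
    if Docid.match(line):
        Lines = {}  # new docid: forget the locations seen so far
        print(line, end="")
    else:
        loc = Location.findall(line)[0]
        if loc not in Lines:
            print(line, end="")
            Lines[loc] = True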
Basically, it checks whether each line is a new docid. If it isn't, it tries to read the location and see whether it was already read. If not, it prints the line and adds the location to the list of locations read.
If there is a new docid, it resets the list of read locations.
Here is a way to do it.
import string
filename = 'testfile'
lines = tuple(open(filename, 'r'))
final_list = []
unique_list = []  # this resets itself every docid
for line in lines:
    currentline = str(line)
    if 'DOCID' in currentline:
        unique_list = []  # this resets itself every docid
        final_list.append(line)
    else:
        exclude = set(string.punctuation)
        currentline = ''.join(ch if ch not in exclude else " " for ch in currentline)
        city = currentline.split()[1]
        if city not in unique_list:
            unique_list.append(city)
            final_list.append(line)
for line in final_list:
    print(line)
output:
3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O
Note: The testfile is a text file with your input text. You can optimize the code if necessary.
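Both answers above follow the same idea: keep a collection of names already seen, and reset it at every DOCID line. As a sketch, the same logic can be written as a generator over any iterable of lines. The name dedupe_locations and the two patterns are illustrative choices, and the name pattern is written so that multi-word names such as Fir Park stay in one piece:

```python
import re

DOCID = re.compile(r"^\s*\d+\s+<DOCID>")
# Offset, optional '>', then the name up to the first '<' or '/'.
NAME = re.compile(r"^\s*\d+\s+>?([^</]+)")


def dedupe_locations(lines):
    """Yield lines, dropping repeated location names within each DOCID block."""
    seen = set()
    for line in lines:
        if DOCID.match(line):
            seen = set()  # a new document starts: reset the seen names
            yield line
            continue
        match = NAME.match(line)
        if match is None:
            yield line  # keep lines that cannot be parsed
            continue
        name = match.group(1).strip()
        if name not in seen:
            seen.add(name)
            yield line
```

With an open file f, the result could be written out with something like sys.stdout.writelines(dedupe_locations(f)). Using a set keeps the membership check fast even for long documents.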
<GeocodeResponse>
<status>OK</status>
<result>
<type>locality</type>
<type>political</type>
<formatted_address>Chengam, Tamil Nadu 606701, India</formatted_address>
<address_component>
<long_name>Chengam</long_name>
<short_name>Chengam</short_name>
<type>locality</type>
<type>political</type>
</address_component>
<address_component>
<long_name>Tiruvannamalai</long_name>
<short_name>Tiruvannamalai</short_name>
<type>administrative_area_level_2</type>
<type>political</type>
</address_component>
<address_component>
<long_name>Tamil Nadu</long_name>
<short_name>TN</short_name>
<type>administrative_area_level_1</type>
<type>political</type>
</address_component>
<address_component>
<long_name>India</long_name>
<short_name>IN</short_name>
<type>country</type>
<type>political</type>
</address_component>
<address_component>
<long_name>606701</long_name>
<short_name>606701</short_name>
<type>postal_code</type>
</address_component>
<geometry>
<location>
<lat>12.3067864</lat>
<lng>78.7957856</lng>
</location>
<location_type>APPROXIMATE</location_type>
<viewport>
<southwest>
<lat>12.2982423</lat>
<lng>78.7832165</lng>
</southwest>
<northeast>
<lat>12.3213030</lat>
<lng>78.8035583</lng>
</northeast>
</viewport>
<bounds>
<southwest>
<lat>12.2982423</lat>
<lng>78.7832165</lng>
</southwest>
<northeast>
<lat>12.3213030</lat>
<lng>78.8035583</lng>
</northeast>
</bounds>
</geometry>
<place_id>ChIJu8JCb3jxrDsRAOfhACQczWo</place_id>
</result>
</GeocodeResponse>
I am new to XML and don't know how to handle it with Python's xml.etree. The basics I read at https://docs.python.org/2/library/xml.etree.elementtree.html#parsing-xml are useful, but I am still struggling to print the latitude and longitude values under geometry --> location. I have tried something like this:
import xml.etree.ElementTree as ET

# xmlURL is the response fetched earlier (not shown)
with open('data.xml', 'w') as f:
    f.write(xmlURL.text)
tree = ET.parse('data.xml')
root = tree.getroot()
lat = root.find(".//geometry/location")
print(lat.text)
You almost got it. Change root.find(".//geometry/location") to root.find(".//geometry/location/lat"):
lat = root.find(".//geometry/location/lat")
print(lat.text)
>> 12.3067864
Same goes for lng of course:
lng = root.find(".//geometry/location/lng")
print(lng.text)
>> 78.7957856
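The original attempt printed nothing useful because the location element only contains child elements, so its own text is just whitespace. The values live in the lat and lng children. As a small sketch, both can be read in one step and converted to floats. The function name first_location is made up, and it works on the XML text directly instead of a saved file:

```python
import xml.etree.ElementTree as ET


def first_location(xml_text):
    """Return (lat, lng) of the first geometry/location as floats, or None."""
    root = ET.fromstring(xml_text)
    location = root.find(".//geometry/location")
    if location is None:
        return None  # e.g. a response without any result
    return float(location.findtext("lat")), float(location.findtext("lng"))
```

Returning None when no location is present avoids an AttributeError on responses that carry a status but no result.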