I am trying to convert my pandas DataFrame to XML in Python.
My dataframe looks something like this:
LoaderTXNID Description ANUMS Value Date Fund
67805499 CA67805499 44554 1/27/2023 NC1_AR
67805499 CA67805499 33002 1/27/2023 NC1_AR
67805499 CA67805499 11504 1/27/2023 NC1_AR
67805501 CA67805501 16704 1/27/2023 NC1_AR
67805501 CA67805501 33002 1/27/2023 NC1_AR
67805501 CA67805501 88504 1/27/2023 NC1_AR
67805503 CA67805503 11504 1/27/2023 NC1_AR
67805503 CA67805503 33002 1/27/2023 NC1_AR
67805503 CA67805503 11504 1/27/2023 NC1_AR
67805503 CA67805503 33002 1/27/2023 NC1_AR
67805505 CA67805505 11504 1/27/2023 NC1_AR
67805505 CA67805505 33002 1/27/2023 NC1_AR
67805505 CA67805505 11504 1/27/2023 NC1_AR
67805505 CA67805505 33002 1/27/2023 NC1_AR
Requirement:
I want to convert it to XML resulting in something like the output below. For example, Description CA67805499 has 3 ANUMS (44554, 33002 and 11504), so I want them under the Entries tag:
<Request
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Transactions>
<Transaction>
<Description>CA67805499</Description>
<Entries>
<Entry>
<Anum>44554</Anum>
</Entry>
<Entry>
<Anum>33002</Anum>
</Entry>
<Entry>
<Anum>11504</Anum>
</Entry>
</Entries>
</Transaction>
<Transaction>
<Description>CA67805501</Description>
<Entries>
<Entry>
<Anum>16704</Anum>
</Entry>
<Entry>
<Anum>33002</Anum>
</Entry>
<Entry>
<Anum>88504</Anum>
</Entry>
</Entries>
</Transaction>
</Transactions>
</Request>
I wrote the following code to achieve the same:
import pandas as pd
df = pd.read_csv('C:\\Users\\Sddl\\Desktop\\twoigma.csv')
df.columns = df.columns.str.replace(' ', '_')
with open('outputf.xml', 'w') as myfile:
    myfile.write(df.to_xml(index=False, row_name='Entry', root_name='Transaction', elem_cols={'ANUMS'}, attr_cols={'Description'}, pretty_print=True, parser='lxml'))
But I get result like below:
<?xml version="1.0" encoding="UTF-8"?>
<Transaction>
<Entry Description="CA67805499">
<ANUMS>33002</ANUMS>
</Entry>
<Entry Description="CA67805499">
<ANUMS>11504</ANUMS>
</Entry>
<Entry Description="CA67805499">
<ANUMS>33002</ANUMS>
</Entry>
<Entry Description="CA67805501">
<ANUMS>16704</ANUMS>
</Entry>
<Entry Description="CA67805501">
<ANUMS>33002</ANUMS>
</Entry>
<Entry Description="CA67805501">
<ANUMS>88504</ANUMS>
</Entry>
</Transaction>
How do I group by records in xml?
Since your needed output is a specialized, grouped XML, consider XSLT, the special-purpose language to transform XML files and sibling to XPath.
Both pandas read_xml and to_xml support XSLT 1.0 through the lxml parser, for reading from and writing to complex XML files. Specifically, after generating a flat XML with to_xml, run the Muenchian Method to group <ANUMS> by <Description> nodes under a new <Request> root:
XSLT (save as .xsl or in a Python string where XSLT is a special XML type)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="desc-key" match="Entry" use="Description" />
<xsl:template match="/Transaction">
<Request
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:copy>
<xsl:for-each select="Entry[count(. | key('desc-key', Description)[1]) = 1]">
<xsl:sort select="Description" />
<Transaction>
<xsl:copy-of select="Description"/>
<Entries>
<xsl:for-each select="key('desc-key', Description)">
<Entry>
<xsl:copy-of select="ANUMS" />
</Entry>
</xsl:for-each>
</Entries>
</Transaction>
</xsl:for-each>
</xsl:copy>
</Request>
</xsl:template>
</xsl:stylesheet>
Python
from io import StringIO
import pandas as pd
txt = '''LoaderTXNID Description ANUMS "Value Date" Fund
67805499 CA67805499 44554 "1/27/2023" NC1_AR
67805499 CA67805499 33002 "1/27/2023" NC1_AR
67805499 CA67805499 11504 "1/27/2023" NC1_AR
67805501 CA67805501 16704 "1/27/2023" NC1_AR
67805501 CA67805501 33002 "1/27/2023" NC1_AR
67805501 CA67805501 88504 "1/27/2023" NC1_AR
67805503 CA67805503 11504 "1/27/2023" NC1_AR
67805503 CA67805503 33002 "1/27/2023" NC1_AR
67805503 CA67805503 11504 "1/27/2023" NC1_AR
67805503 CA67805503 33002 "1/27/2023" NC1_AR
67805505 CA67805505 11504 "1/27/2023" NC1_AR
67805505 CA67805505 33002 "1/27/2023" NC1_AR
67805505 CA67805505 11504 "1/27/2023" NC1_AR
67805505 CA67805505 33002 "1/27/2023" NC1_AR'''
with StringIO(txt) as f:
    annum_df = pd.read_csv(f, sep="\\s+")
annum_df.to_xml(
"outputf.xml", # OUTPUT XML
index = False,
row_name = 'Entry',
root_name = 'Transaction',
elem_cols = ['ANUMS', 'Description'],
stylesheet = "style.xsl", # EXTERNAL XSLT
parser='lxml'
)
Output XML
<?xml version="1.0"?>
<Request xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Transaction>
<Transaction>
<Description>CA67805499</Description>
<Entries>
<Entry>
<ANUMS>44554</ANUMS>
</Entry>
<Entry>
<ANUMS>33002</ANUMS>
</Entry>
<Entry>
<ANUMS>11504</ANUMS>
</Entry>
</Entries>
</Transaction>
<Transaction>
<Description>CA67805501</Description>
<Entries>
<Entry>
<ANUMS>16704</ANUMS>
</Entry>
<Entry>
<ANUMS>33002</ANUMS>
</Entry>
<Entry>
<ANUMS>88504</ANUMS>
</Entry>
</Entries>
</Transaction>
<Transaction>
<Description>CA67805503</Description>
<Entries>
<Entry>
<ANUMS>11504</ANUMS>
</Entry>
<Entry>
<ANUMS>33002</ANUMS>
</Entry>
<Entry>
<ANUMS>11504</ANUMS>
</Entry>
<Entry>
<ANUMS>33002</ANUMS>
</Entry>
</Entries>
</Transaction>
<Transaction>
<Description>CA67805505</Description>
<Entries>
<Entry>
<ANUMS>11504</ANUMS>
</Entry>
<Entry>
<ANUMS>33002</ANUMS>
</Entry>
<Entry>
<ANUMS>11504</ANUMS>
</Entry>
<Entry>
<ANUMS>33002</ANUMS>
</Entry>
</Entries>
</Transaction>
</Transaction>
</Request>
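For comparison, the same grouping can be sketched without XSLT, using only the standard library. This is a minimal sketch, not the answer's method: build_request and the sample rows are hypothetical names, the rows are assumed to be dictionaries (with pandas they could come from df.to_dict("records")), the element names follow the layout requested in the question, and the xmlns:xsd / xmlns:xsi declarations are left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_request(rows):
    """Group rows by Description and nest each ANUMS value under Entries."""
    grouped = {}
    for row in rows:
        # dicts keep insertion order, so groups appear in first-seen order
        grouped.setdefault(row["Description"], []).append(row["ANUMS"])

    request = ET.Element("Request")
    transactions = ET.SubElement(request, "Transactions")
    for description, anums in grouped.items():
        transaction = ET.SubElement(transactions, "Transaction")
        ET.SubElement(transaction, "Description").text = str(description)
        entries = ET.SubElement(transaction, "Entries")
        for anum in anums:
            entry = ET.SubElement(entries, "Entry")
            ET.SubElement(entry, "Anum").text = str(anum)
    return request


# Hypothetical sample rows taken from the first two groups of the question
rows = [
    {"Description": "CA67805499", "ANUMS": 44554},
    {"Description": "CA67805499", "ANUMS": 33002},
    {"Description": "CA67805499", "ANUMS": 11504},
    {"Description": "CA67805501", "ANUMS": 16704},
    {"Description": "CA67805501", "ANUMS": 33002},
    {"Description": "CA67805501", "ANUMS": 88504},
]

request = build_request(rows)
print(ET.tostring(request, encoding="unicode"))
```

The XSLT route keeps everything inside to_xml; this variant trades that for plain Python that may be easier to debug when the grouping rules change.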
I would like help accessing the text inside a text box element of a Word .docx file. After some research, the best approach I found is to reach the element through XPath. To find out which path contains the text of a text box, I converted my document to a .zip file to inspect its structure. Inside it, the file /word/document.xml has this structure:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex"
xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex"
xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex"
xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex"
xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex"
xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex"
xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex"
xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex"
xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink"
xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:oel="http://schemas.microsoft.com/office/2019/extlst"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex"
xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid"
xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml"
xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash"
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh wp14">
<w:body>
<w:p w14:paraId="390851B5" w14:textId="0F1265C2" w:rsidR="00D477FD"
w:rsidRDefault="00A91CF5">
<w:r>
<w:rPr>
<w:noProof />
</w:rPr>
<mc:AlternateContent>
<mc:Choice Requires="wps">
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300"
simplePos="0" relativeHeight="251659264" behindDoc="0" locked="0"
layoutInCell="1" allowOverlap="1" wp14:anchorId="4FD68124"
wp14:editId="0D7E409D">
<wp:simplePos x="0" y="0" />
<wp:positionH relativeFrom="column">
<wp:posOffset>34290</wp:posOffset>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>195580</wp:posOffset>
</wp:positionV>
<wp:extent cx="2190750" cy="2438400" />
<wp:effectExtent l="0" t="0" r="19050" b="19050" />
<wp:wrapNone />
<wp:docPr id="1" name="Cuadro de texto 1" />
<wp:cNvGraphicFramePr />
<a:graphic
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData
uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<wps:wsp>
<wps:cNvSpPr txBox="1" />
<wps:spPr>
<a:xfrm>
<a:off x="0" y="0" />
<a:ext cx="2190750" cy="2438400" />
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst />
</a:prstGeom>
<a:solidFill>
<a:schemeClr val="lt1" />
</a:solidFill>
<a:ln w="6350">
<a:solidFill>
<a:prstClr val="black" />
</a:solidFill>
</a:ln>
</wps:spPr>
<wps:txbx>
<w:txbxContent>
<w:p w14:paraId="563CE207" w14:textId="6341C126"
w:rsidR="00A91CF5" w:rsidRDefault="00A91CF5">
<w:r>
<w:t>Hola como estas</w:t>
</w:r>
</w:p>
</w:txbxContent>
</wps:txbx>
<wps:bodyPr rot="0" spcFirstLastPara="0"
vertOverflow="overflow" horzOverflow="overflow"
vert="horz" wrap="square" lIns="91440" tIns="45720"
rIns="91440" bIns="45720" numCol="1" spcCol="0"
rtlCol="0" fromWordArt="0" anchor="t" anchorCtr="0"
forceAA="0" compatLnSpc="1">
<a:prstTxWarp prst="textNoShape">
<a:avLst />
</a:prstTxWarp>
<a:noAutofit />
</wps:bodyPr>
</wps:wsp>
</a:graphicData>
</a:graphic>
</wp:anchor>
</w:drawing>
</mc:Choice>
<mc:Fallback>
<w:pict>
<v:shapetype w14:anchorId="4FD68124" id="_x0000_t202"
coordsize="21600,21600" o:spt="202" path="m,l,21600r21600,l21600,xe">
<v:stroke joinstyle="miter" />
<v:path gradientshapeok="t" o:connecttype="rect" />
</v:shapetype>
<v:shape id="Cuadro de texto 1" o:spid="_x0000_s1026"
type="#_x0000_t202"
style="position:absolute;margin-left:2.7pt;margin-top:15.4pt;width:172.5pt;height:192pt;z-index:251659264;visibility:visible;mso-wrap-style:square;mso-wrap-distance-left:9pt;mso-wrap-distance-top:0;mso-wrap-distance-right:9pt;mso-wrap-distance-bottom:0;mso-position-horizontal:absolute;mso-position-horizontal-relative:text;mso-position-vertical:absolute;mso-position-vertical-relative:text;v-text-anchor:top"
o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQBG9oAoNgIAAH0EAAAOAAAAZHJzL2Uyb0RvYy54bWysVEtv2zAMvg/YfxB0X+yk6cuIU2QpMgwo
2gLp0LMiS7EwWdQkJXb260cpzqPtTsMuMilSH8mPpCd3XaPJVjivwJR0OMgpEYZDpcy6pD9eFl9u
KPGBmYppMKKkO+Hp3fTzp0lrCzGCGnQlHEEQ44vWlrQOwRZZ5nktGuYHYIVBowTXsICqW2eVYy2i
Nzob5flV1oKrrAMuvMfb+72RThO+lIKHJym9CESXFHML6XTpXMUzm05YsXbM1or3abB/yKJhymDQ
I9Q9C4xsnPoA1SjuwIMMAw5NBlIqLlINWM0wf1fNsmZWpFqQHG+PNPn/B8sft0v77EjovkKHDYyE
tNYXHi9jPZ10TfxipgTtSOHuSJvoAuF4ORre5teXaOJoG40vbsZ5IjY7PbfOh28CGhKFkjrsS6KL
bR98wJDoenCJ0TxoVS2U1kmJsyDm2pEtwy7qkJLEF2+8tCFtSa8uMI8PCBH6+H6lGf8Zy3yLgJo2
eHkqPkqhW3U9IyuodkiUg/0MecsXCnEfmA/PzOHQIAG4COEJD6kBk4FeoqQG9/tv99Efe4lWSloc
wpL6XxvmBCX6u8Eu3w7H4zi1SRlfXo9QceeW1bnFbJo5IENDXDnLkxj9gz6I0kHzivsyi1HRxAzH
2CUNB3Ee9quB+8bFbJaccE4tCw9maXmEjuRGPl+6V+Zs38+Ao/AIh3Flxbu27n3jSwOzTQCpUs8j
wXtWe95xxlNb+n2MS3SuJ6/TX2P6BwAA//8DAFBLAwQUAAYACAAAACEASPXnCtsAAAAIAQAADwAA
AGRycy9kb3ducmV2LnhtbEyPwU7DMBBE70j8g7VI3KhdmqIQ4lSAChdOLYizG29ti9iObDcNf89y
guPOjGbftJvZD2zClF0MEpYLAQxDH7ULRsLH+8tNDSwXFbQaYkAJ35hh011etKrR8Rx2OO2LYVQS
cqMk2FLGhvPcW/QqL+KIgbxjTF4VOpPhOqkzlfuB3wpxx71ygT5YNeKzxf5rf/IStk/m3vS1SnZb
a+em+fP4Zl6lvL6aHx+AFZzLXxh+8QkdOmI6xFPQmQ0S1hUFJawEDSB7tRYkHCRUy6oG3rX8/4Du
BwAA//8DAFBLAQItABQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAAAAAAAAAAAAAAAAAAABbQ29u
dGVudF9UeXBlc10ueG1sUEsBAi0AFAAGAAgAAAAhADj9If/WAAAAlAEAAAsAAAAAAAAAAAAAAAAA
LwEAAF9yZWxzLy5yZWxzUEsBAi0AFAAGAAgAAAAhAEb2gCg2AgAAfQQAAA4AAAAAAAAAAAAAAAAA
LgIAAGRycy9lMm9Eb2MueG1sUEsBAi0AFAAGAAgAAAAhAEj15wrbAAAACAEAAA8AAAAAAAAAAAAA
AAAAkAQAAGRycy9kb3ducmV2LnhtbFBLBQYAAAAABAAEAPMAAACYBQAAAAA=
"
fillcolor="white [3201]" strokeweight=".5pt">
<v:textbox>
<w:txbxContent>
<w:p w14:paraId="563CE207" w14:textId="6341C126"
w:rsidR="00A91CF5" w:rsidRDefault="00A91CF5">
<w:r>
<w:t>Hola como estas</w:t>
</w:r>
</w:p>
</w:txbxContent>
</v:textbox>
</v:shape>
</w:pict>
</mc:Fallback>
</mc:AlternateContent>
</w:r>
</w:p>
<w:sectPr w:rsidR="00D477FD">
<w:pgSz w:w="12240" w:h="15840" />
<w:pgMar w:top="1417" w:right="1701" w:bottom="1417" w:left="1701" w:header="708"
w:footer="708" w:gutter="0" />
<w:cols w:space="708" />
<w:docGrid w:linePitch="360" />
</w:sectPr>
</w:body>
</w:document>
The text I want is inside <w:t>Hola como estas</w:t>. To get there, I wrote out the entire path down to that element, ending up with something like this:
import docx
doc = docx.Document("documento.docx")
textbox_text = doc.element.xpath('/w:document/w:body/w:p/w:r/mc:AlternateContent/mc:Fallback/w:pick/v:shape/v:textbox/w:txbxContent/w:t')
print(textbox_text)
But it doesn't return the text; it returns nothing.
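One possible way to reach the text is to search for w:txbxContent by namespace instead of spelling out the whole path. The sketch below is only an illustration under stated assumptions: textbox_texts and SAMPLE are hypothetical names, SAMPLE is a heavily trimmed stand-in for word/document.xml, and the standard library ElementTree is used rather than python-docx. Because the document stores the same text under both mc:Choice and mc:Fallback, the search is limited to mc:Choice to avoid duplicates.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the document.xml shown above
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "mc": "http://schemas.openxmlformats.org/markup-compatibility/2006",
}

# Trimmed, hypothetical stand-in for word/document.xml.
# A real file could be read with zipfile.ZipFile(path).read("word/document.xml").
SAMPLE = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
  <w:body><w:p><w:r>
    <mc:AlternateContent>
      <mc:Choice Requires="wps">
        <w:drawing><w:txbxContent><w:p><w:r><w:t>Hola como estas</w:t></w:r></w:p></w:txbxContent></w:drawing>
      </mc:Choice>
      <mc:Fallback>
        <w:pict><w:txbxContent><w:p><w:r><w:t>Hola como estas</w:t></w:r></w:p></w:txbxContent></w:pict>
      </mc:Fallback>
    </mc:AlternateContent>
  </w:r></w:p></w:body>
</w:document>
"""


def textbox_texts(document_xml):
    """Return the text of each text box found under mc:Choice."""
    root = ET.fromstring(document_xml.strip())
    texts = []
    for content in root.findall(".//mc:Choice//w:txbxContent", NS):
        # A text box may hold several runs; join their w:t pieces
        runs = content.findall(".//w:t", NS)
        texts.append("".join(run.text or "" for run in runs))
    return texts


print(textbox_texts(SAMPLE))
```

Note that the path in the question also spells w:pict as w:pick and skips the w:p/w:r levels inside w:txbxContent, either of which is enough for an absolute path to match nothing.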
I have been looking around and there are a lot of similar questions, but none that solved my issue sadly.
My XML file looks like this
<?xml version="1.0" encoding="utf-8"?>
<Nodes>
<Node ComponentID="1">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="On"/>
</Settings>
</Node>
<Node ComponentID="2">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="Off"/>
</Settings>
</Node>
<Node ComponentID="3">
<Settings>
<Value name="Text Box (1)"> SettingG </Value>
<Value name="Text Box (2)"> SettingH </Value>
<Value name="Text Box (3)"> SettingI </Value>
<Value name="Text Box (4)"> SettingJ </Value>
<AdvSettings State="Yes"/>
</Settings>
</Node>
</Nodes>
With Python I'm trying to get the values of Text Box 1 and Text Box 2 for each Node that has "AdvSettings" set to On.
So in this case I would like a result like
ComponentID State Textbox1 Textbox2
1 On SettingA SettingB
3 On SettingG SettingH
I have done some attempts but didn't get far. With this I managed to get the AdvSettings tag, but that's as far as I got:
import xml.etree.ElementTree as ET
tree = ET.parse('XMLSearch.xml')
root = tree.getroot()
for AdvSettings in root.iter('AdvSettings'):
    print(AdvSettings.tag, AdvSettings.attrib)
You can use an XPath expression to find all the relevant nodes and then extract the needed data from them. An example is shown below, with comments as explanation.
from lxml import etree
xml = etree.fromstring('''
<Nodes>...
</Nodes>
''')
# Use XPath to select the relevant nodes
on_nodes = xml.xpath("//Node[Settings[AdvSettings[@State='Yes' or @State='On']]]")
# Get all needed information from every node
# (iterate the Settings element directly; getchildren() is deprecated)
data_collected = [
    dict(
        [("ComponentID", node.attrib['ComponentID'])] +
        [(c.get("name"), c.text.strip()) for c in node.find("Settings") if c.text]
    )
    for node in on_nodes
]
# You got a list of dicts with all relevant information
# print it out, I used pandas for formatting. Optional
import pandas
print(pandas.DataFrame.from_records(data_collected).to_markdown(index=False))
Would give you an output like
| ComponentID | Text Box (1) | Text Box (2) | Text Box (3) | Text Box (4) |
|--------------:|:---------------|:---------------|:---------------|:---------------|
| 1 | SettingA | SettingB | SettingC | SettingD |
| 3 | SettingG | SettingH | SettingI | SettingJ |
Below is a version using only Python's built-in XML library:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="utf-8"?>
<Nodes>
<Node ComponentID="1">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="On"/>
</Settings>
</Node>
<Node ComponentID="2">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="Off"/>
</Settings>
</Node>
<Node ComponentID="3">
<Settings>
<Value name="Text Box (1)"> SettingG </Value>
<Value name="Text Box (2)"> SettingH </Value>
<Value name="Text Box (3)"> SettingI </Value>
<Value name="Text Box (4)"> SettingJ </Value>
<AdvSettings State="Yes"/>
</Settings>
</Node>
</Nodes>'''
data = []
root = ET.fromstring(xml)
nodes = root.findall('.//Node')
for node in nodes:
    adv = node.find('.//AdvSettings')
    if adv is None:
        continue
    flag = adv.attrib.get('State','Off')
    if flag == 'On' or flag == 'Yes':
        data.append({
            'id': node.attrib.get('ComponentID'),
            'txt_box_1': node.find('.//Value[@name="Text Box (1)"]').text.strip(),
            'txt_box_2': node.find('.//Value[@name="Text Box (2)"]').text.strip(),
        })
df = pd.DataFrame(data)
print(df)
output
id txt_box_1 txt_box_2
0 1 SettingA SettingB
1 3 SettingG SettingH
I would like to sort the below xml, by the attribute "value" of the "entry" tags and sort the strings (letters) before the numbers.
<test>
<entry value="-12" />
<entry value="0" />
<entry value="043" />
<entry value="14" />
<entry value="6" />
<entry value="_null" />
<entry value="abc" />
<entry value="abcd" />
<entry value="empty" />
<entry value="false" />
<entry value="test1" />
<entry value="test2" />
<entry value="true" />
</test>
I have written some python that sorts this xml, but it sorts first the numbers and then the strings.
I have checked this thread, but could not implement any of the solutions to sorting XML.
import xml.etree.ElementTree as ElT
import os
from os.path import sep
def sort_xml(directory, xml_file, level1_tag, attribute, mode=0):
    #mode 0 - numbers before letters
    #mode 1 - letters before numbers
    file = directory + sep + xml_file
    tree = ElT.parse(file)
    data = tree.getroot()
    els = data.findall(level1_tag)
    if mode == 0:
        new_els = sorted(els, key=lambda e: (e.tag, e.attrib[attribute]))
    if mode == 1:
        new_els = sorted(els, key=lambda e: (isinstance(e.tag, (float, int)), e.attrib[attribute]))
    for el in new_els:
        if mode == 0:
            el[:] = sorted(el, key=lambda e: (e.tag, e.attrib[attribute]))
        if mode == 1:
            el[:] = sorted(el, key=lambda e: (isinstance(e.tag, (float, int)), e.attrib[attribute]))
    data[:] = new_els
    tree.write(file, xml_declaration=True, encoding='utf-8')
    with open(file, 'r') as fin:
        data = fin.read().splitlines(True)
    with open(file, 'w') as fout:
        fout.writelines(data[1:])
sort_xml(os.getcwd(), "test.xml", "entry", "value", 1)
Any ideas how this could be done?
Edit1: Desired output
<test>
<entry value="_null" />
<entry value="abc" />
<entry value="abcd" />
<entry value="empty" />
<entry value="false" />
<entry value="test1" />
<entry value="test2" />
<entry value="true" />
<entry value="-12" />
<entry value="0" />
<entry value="043" />
<entry value="14" />
<entry value="6" />
</test>
I think your problem is that when sorting you check whether the value is an int or a float. In fact, all the values are strings (and e.tag is the element name, also a string), so isinstance(e.tag, (float, int)) will always be False.
A sorter function like this does what you want
def sorter(x):
    "Check if the value can be interpreted as an integer, then by the string"
    value = x.get("value")
    def is_integer(i):
        try:
            int(i)
        except ValueError:
            return False
        return True
    return is_integer(value), value
which can be used like so (using StringIO as a substitute for the file)
from xml.etree import ElementTree
from io import StringIO
xml = """<test>
<entry value="-12" />
<entry value="0" />
<entry value="043" />
<entry value="14" />
<entry value="6" />
<entry value="_null" />
<entry value="abc" />
<entry value="abcd" />
<entry value="empty" />
<entry value="false" />
<entry value="test1" />
<entry value="test2" />
<entry value="true" />
</test>"""
tree = ElementTree.parse(StringIO(xml))
root = tree.getroot()
root[:] = sorted(root, key=sorter)
tree.write("output.xml")
The contents of output.xml are
<test>
<entry value="_null" />
<entry value="abc" />
<entry value="abcd" />
<entry value="empty" />
<entry value="false" />
<entry value="test1" />
<entry value="test2" />
<entry value="true" />
<entry value="-12" />
<entry value="0" />
<entry value="043" />
<entry value="14" />
<entry value="6" />
</test>
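The sorter above orders the numeric entries as strings, which is why "043" comes before "14". If a numeric order within the number group were wanted instead, the key could carry the parsed integer. This is a variation rather than what the question asked for, letters_then_numbers is a hypothetical name, and it assumes every entry has a value attribute.

```python
import xml.etree.ElementTree as ET


def letters_then_numbers(entry):
    """Sort key: non-numeric values first (as strings), then integers by value."""
    value = entry.get("value")
    try:
        number = int(value)
    except ValueError:
        return (0, 0, value)   # non-numeric group, ordered alphabetically
    return (1, number, value)  # numeric group, ordered by integer value


root = ET.fromstring(
    '<test>'
    '<entry value="14" /><entry value="abc" /><entry value="6" />'
    '<entry value="-12" /><entry value="_null" /><entry value="043" />'
    '</test>'
)
root[:] = sorted(root, key=letters_then_numbers)
print([entry.get("value") for entry in root])
# ['_null', 'abc', '-12', '6', '14', '043']
```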
I took the part where the letters start and put it at the top. This is the actual requirement: letters at the top; the order of the rest does not matter.
See below:
import xml.etree.ElementTree as ET
xml = '''<test>
<entry value="-12" />
<entry value="/this" />
<entry value="0" />
<entry value="043" />
<entry value="14" />
<entry value="6" />
<entry value="_null" />
<entry value="abc" />
<entry value="abcd" />
<entry value="empty" />
<entry value="false" />
<entry value="test1" />
<entry value="test2" />
<entry value="true" />
</test>'''
root = ET.fromstring(xml)
numeric = []
non_numeric = []
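for entry in root.findall('.//entry'):
    try:
        x = int(entry.attrib['value'])
        numeric.append((x, entry.attrib['value']))
    except ValueError as e:
        non_numeric.append(entry.attrib['value'])
# sorted() returns a new list, so the original calls had no effect; sort in place instead.
# Numeric entries are ordered by their original string, which matches the requested output.
numeric.sort(key=lambda x: x[1])
non_numeric.sort()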
root = ET.Element('test')
for value in non_numeric:
    entry = ET.SubElement(root, 'entry')
    entry.attrib['value'] = value
for value in numeric:
    entry = ET.SubElement(root, 'entry')
    entry.attrib['value'] = str(value[1])
ET.dump(root)
output
<?xml version="1.0" encoding="UTF-8"?>
<test>
<entry value="/this" />
<entry value="_null" />
<entry value="abc" />
<entry value="abcd" />
<entry value="empty" />
<entry value="false" />
<entry value="test1" />
<entry value="test2" />
<entry value="true" />
<entry value="-12" />
<entry value="0" />
<entry value="043" />
<entry value="14" />
<entry value="6" />
</test>
I have the following XML file:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE TaskDefinition PUBLIC "xxx" "yyy">
<TaskDefinition created="time_stamp" formPath="path/sometask.xhtml" id="sample_id" modified="timestamp_b" name="sample_task" resultAction="Delete" subType="subtype_sample_task" type="sample_type">
<Attributes>
<Map>
<entry key="applications" value="APP_NAME"/>
<entry key="aaa" value="true"/>
<entry key="bbb" value="true"/>
<entry key="ccc" value="true"/>
<entry key="ddd" value="true"/>
<entry key="eee" value="Disabled"/>
<entry key="fff"/>
<entry key="ggg"/>
</Map>
</Attributes>
<Description>Description.</Description>
<Owner>
<Reference class="sample_owner_class" id="sample_owner_id" name="sample__owner_name"/>
</Owner>
<Parent>
<Reference class="sample_parent_class" id="sample_parent_id" name="sample_parent_name"/>
</Parent>
</TaskDefinition>
I want to search for:
<entry key="applications" value="APP_NAME"/>
and change the value to, e.g., APP_NAME_2.
I know I can extract this value like this:
import xml.etree.cElementTree as ET
tree = ET.ElementTree(file='sample.xml')
root = tree.getroot()
print(root[0][0][0].tag, root[0][0][0].attrib)
But in this case I have to know the exact position of this entry in the tree, so it is not flexible, and I have no idea how to change it.
I also tried something like this:
for app in root.attrib:
    if 'applications' in root.attrib:
        print(app)
but I can't figure out why this returns nothing.
In the Python docs, there is the following example:
for rank in root.iter('rank'):
    new_rank = int(rank.text) + 1
    rank.text = str(new_rank)
    rank.set('updated', 'yes')
tree.write('output.xml')
but I have no idea how to adjust this to my example.
I don't want to use regex for this case.
Any help appreciated.
You can locate the specific entry element with XPath.
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# Find the element that has a 'key' attribute with a value of 'applications'
entry = tree.find(".//entry[@key='applications']")
# Change the value of the 'value' attribute
entry.set("value", "APP_NAME_2")
tree.write("output.xml")
Result (output.xml):
<TaskDefinition created="time_stamp" formPath="path/sometask.xhtml" id="sample_id" modified="timestamp_b" name="sample_task" resultAction="Delete" subType="subtype_sample_task" type="sample_type">
<Attributes>
<Map>
<entry key="applications" value="APP_NAME_2" />
<entry key="aaa" value="true"/>
<entry key="bbb" value="true"/>
<entry key="ccc" value="true"/>
<entry key="ddd" value="true"/>
<entry key="eee" value="Disabled"/>
<entry key="fff"/>
<entry key="ggg"/>
</Map>
</Attributes>
<Description>Description.</Description>
<Owner>
<Reference class="sample_owner_class" id="sample_owner_id" name="sample__owner_name"/>
</Owner>
<Parent>
<Reference class="sample_parent_class" id="sample_parent_id" name="sample_parent_name"/>
</Parent>
</TaskDefinition>
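Since find returns None when nothing matches, a slightly more defensive version may be useful when the key is not guaranteed to exist. This is a minimal sketch: set_entry_value is a hypothetical helper, the inline XML is a trimmed stand-in for sample.xml, and the key is assumed not to contain a single quote.

```python
import xml.etree.ElementTree as ET


def set_entry_value(root, key, new_value):
    """Set value on the first <entry> with the given key; report whether it was found."""
    entry = root.find(f".//entry[@key='{key}']")
    if entry is None:
        return False
    entry.set("value", new_value)
    return True


# Trimmed, hypothetical stand-in for sample.xml
root = ET.fromstring(
    "<TaskDefinition><Attributes><Map>"
    '<entry key="applications" value="APP_NAME"/>'
    '<entry key="fff"/>'
    "</Map></Attributes></TaskDefinition>"
)

print(set_entry_value(root, "applications", "APP_NAME_2"))  # True
print(set_entry_value(root, "missing", "x"))                # False
print(root.find(".//entry[@key='applications']").get("value"))  # APP_NAME_2
```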
Thanks for taking the time with this one.
I have an XML file with an element called selectionset. The idea is to take that element and modify some of its subelements' attributes and tails; that part is done.
What I don't understand is why, when I try to add the new subelements to the original parent (called selectionsets), only the last item of the list inplist shows up:
import xml.etree.ElementTree as etree
from xml.etree.ElementTree import *
from xml.etree.ElementTree import ElementTree
tree=ElementTree()
tree.parse('STRUCTURAL.xml')
root = tree.getroot()
col=tree.find('selectionsets/selectionset')
#find the value needed
val=tree.findtext('selectionsets/selectionset/findspec/conditions/condition/value/data')
setname=col.attrib['name']
listnames=val + " 6"
inplist=["D","E","F","G","H"]
entry=3
catcher=[]
ss=root.find('selectionsets')
outxml=ss
for i in range(len(inplist)):
    str(val)
    col.set('name',(setname +" "+ inplist[i]))
    col.find('findspec/conditions/condition/value/data').text=str(inplist[i]+val[1:3])
    #print (etree.tostring(col)) #everything working well til this point
    timper=col.find('selectionset')
    root[0].append(col)
    # new=etree.SubElement(outxml,timper)
#you need to create a tree with element tree before creating the xml file
itree=etree.ElementTree(outxml)
itree.write('Selection Sets.xml')
print (etree.tostring(outxml))
# print (Test_file.selectionset())
#Initial xml
<?xml version="1.0" encoding="UTF-8" ?>
<exchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://download.autodesk.com/us/navisworks/schemas/nw-exchange-12.0.xsd" units="ft" filename="STRUCTURAL.nwc" filepath="C:\Users\Ricardo\Desktop\Comun\Taller 3">
<selectionsets>
<selectionset name="Column Location" guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8">
<findspec mode="all" disjoint="0">
<conditions>
<condition test="contains" flags="10">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">C-A </data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
</selectionsets>
</exchange>
#----Current Output
<selectionsets>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
</selectionsets>
Here's what I've been able to put together and it looks like it'll do what you're looking for. Here are the main differences: (1) This will iterate over multiple selectionset items (if you end up with more than one), (2) It creates a deepcopy of the element before modifying the values (I think you were always modifying the original "col"), (3) It appends the new selectionset to the selectionsets tag rather than the root.
Here's the deepcopy documentation
import xml.etree.ElementTree as etree
import copy
tree=etree.ElementTree()
tree.parse('test.xml')
root = tree.getroot()
inplist=["D","E","F","G","H"]
for selectionset in tree.findall('selectionsets/selectionset'):
    for i in inplist:
        col = copy.deepcopy(selectionset)
        col.set('name', '%s %s' % (col.attrib['name'], i))
        data = col.find('findspec/conditions/condition/value/data')
        data.text = '%s%s' % (i, data.text[1:3])
        root.find('selectionsets').append(col)
itree = etree.ElementTree(root)
itree.write('Selection Sets.xml')
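The difference between appending the same element repeatedly and appending deep copies can be seen in isolation. The sketch below uses a stripped-down, hypothetical selectionset with only a name attribute; it relies on the fact that xml.etree.ElementTree elements do not track their parent, so one object can appear in a tree several times.

```python
import copy
import xml.etree.ElementTree as ET

parent = ET.fromstring(
    '<selectionsets><selectionset name="Column Location" /></selectionsets>'
)
template = parent.find("selectionset")

# Appending the same object: every slot refers to one element,
# so the last modification shows up everywhere.
shared = ET.Element("selectionsets")
for letter in ["D", "E"]:
    template.set("name", f"Column Location {letter}")
    shared.append(template)
shared_names = [s.get("name") for s in shared]
print(shared_names)  # ['Column Location E', 'Column Location E']

# Appending deep copies: each slot is an independent element.
template.set("name", "Column Location")
copied = ET.Element("selectionsets")
for letter in ["D", "E"]:
    clone = copy.deepcopy(template)
    clone.set("name", f"Column Location {letter}")
    copied.append(clone)
copied_names = [s.get("name") for s in copied]
print(copied_names)  # ['Column Location D', 'Column Location E']
```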