CDATA sections and comments are lost when parsing XML with ElementTree - python

I am editing XML files and ran into a problem: when a Python script modifies a file, parts of its structure are lost.
XML file:
<?xml version="1.0" encoding="UTF-8"?>
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path><![CDATA[path]]></path>
<code_main />
</errors>
<reference>3</reference>
</element>
....
</main>
Using:
import xml.etree.ElementTree as ET
ET.parse(xml_file).write("test.xml", encoding='utf-8', xml_declaration=True)
I lose all comments in the file, and when I compare the original file with the modified one using diff (on Linux), the files are reported as completely different.
Is there a way to change the XML file (my task is to add a subelement to <element>) while leaving the overall structure of the file unchanged, including comments and element order?
The order and the comments are essential in this file.
Update:
After running the code above, the source XML comes out in the following form:
<?xml version='1.0' encoding='utf-8'?>
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path>path</path>
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
Note the <path> element: the CDATA section is gone.
Comments are not preserved either:
Source:
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path><![CDATA[path]]></path>
<!--Stt-->
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
Modified:
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path>path</path>
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
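One way to keep the comments with the standard library, offered as a sketch rather than a definitive fix: on Python 3.8 or later, a `TreeBuilder` created with `insert_comments=True` retains comment nodes while parsing. The sample document is shortened from the question, and `fourthValue` is a hypothetical subelement used only for illustration. ElementTree still writes the CDATA section as plain text; lxml's `etree.XMLParser(strip_cdata=False)` is the usual choice when the CDATA markers themselves must survive.

```python
import xml.etree.ElementTree as ET

# Shortened version of the document from the question.
SOURCE = """<main>
<element formatVersion="1.0">
<errors>
<path><![CDATA[path]]></path>
<!--Stt-->
<code_main />
</errors>
<reference>3</reference>
</element>
</main>"""


def add_subelement_keeping_comments(xml_text, tag, text):
    """Parse xml_text, append <tag>text</tag> to <element>, keep comments."""
    # insert_comments=True (Python 3.8+) keeps comment nodes in the tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    element = root.find("element")
    new_child = ET.SubElement(element, tag)  # appended after existing children
    new_child.text = text
    return ET.tostring(root, encoding="unicode")


# "fourthValue" is a made-up tag name for this illustration.
result = add_subelement_keeping_comments(SOURCE, "fourthValue", "fourthText")
print(result)
```

The existing children keep their order because `SubElement` only appends. The output will still differ from the input in small ways (for example, the CDATA markers), so a byte-for-byte identical diff should not be expected from this approach.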

Related

Can't find a specific node/element using Python ElementTree

I have the below XML document I am trying to parse. I just need to grab one node from the document. I need to get the serviceProfile text. I'm banging my head against the desk here... I am new to Python.
<?xml version='1.0' encoding='UTF-8'?>
<soapenv:Envelope
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns:getUserResponse
xmlns:ns="http://www.cisco.com/AXL/API/11.5">
<return>
<user uuid="{blbhbl-bhblb-kbhb}">
<firstName>fname</firstName>
<displayName>fname lname</displayName>
<middleName/>
<lastName>lname</lastName>
<userid>wooty</userid>
<password/>
<pin/>
<mailid>wooty#woot.com</mailid>
<department/>
<manager/>
<userLocale />
<associatedDevices/>
<primaryExtension/>
<associatedPc/>
<enableCti>false</enableCti>
<digestCredentials/>
<phoneProfiles/>
<defaultProfile/>
<presenceGroupName uuid="{sdsds-sdsds-sdsdsd-sdsdsd-sdsd}">Standard Presence group</presenceGroupName>
<subscribeCallingSearchSpaceName/>
<enableMobility>false</enableMobility>
<enableMobileVoiceAccess>false</enableMobileVoiceAccess>
<maxDeskPickupWaitTime>10000</maxDeskPickupWaitTime>
<remoteDestinationLimit>4</remoteDestinationLimit>
<associatedRemoteDestinationProfiles/>
<associatedTodAccess/>
<status>1</status>
<enableEmcc>false</enableEmcc>
<associatedCapfProfiles/>
<ctiControlledDeviceProfiles/>
<patternPrecedence />
<numericUserId />
<mlppPassword />
<customUserFields/>
<homeCluster>true</homeCluster>
<imAndPresenceEnable>true</imAndPresenceEnable>
<serviceProfile uuid="{dsdsdsd-sdsdsd-sdsd-sdsds-sdsds}">1 IM Presence Only</serviceProfile>
<lineAppearanceAssociationForPresences/>
<directoryUri>blah#wooty.com</directoryUri>
<telephoneNumber>555-555-5555</telephoneNumber>
<title/>
<mobileNumber/>
<homeNumber/>
<pagerNumber/>
<extensionsInfo/>
<selfService />
<userProfile/>
<calendarPresence>false</calendarPresence>
<ldapDirectoryName uuid="{sdsd-sdsdsd-sdsds-sdsds}">someinfo</ldapDirectoryName>
<userIdentity>blah#woot.com</userIdentity>
<nameDialing>blehWoot</nameDialing>
<ipccExtension/>
<convertUserAccount uuid="{sdsd-sdsdsd-sdsds-sdsds}">someinfo</convertUserAccount>
<enableUserToHostConferenceNow>false</enableUserToHostConferenceNow>
<attendeesAccessCode/>
</user>
</return>
</ns:getUserResponse>
</soapenv:Body>
</soapenv:Envelope>
Based on @danielHaley's suggestion, I wrote the following code to retrieve the node.
# Read the XML response and get the service profile
import xml.etree.ElementTree as ET
tree = ET.ElementTree(ET.fromstring(response.content))
root = tree.getroot()
serviceprofile = root.find(".//serviceProfile").text
It worked great. Thank you so much for your help.
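As a self-contained illustration of why the lookup above works, here is a sketch using a shortened copy of the response. The `<serviceProfile>` element carries no namespace (only `getUserResponse` uses the `ns` prefix and there is no default namespace), so a plain `.//serviceProfile` path matches it. The `NAMESPACES` mapping and the second lookup are additions for illustration and do not appear in the original post.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question (bytes, like response.content).
RESPONSE = b"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns:getUserResponse xmlns:ns="http://www.cisco.com/AXL/API/11.5">
<return>
<user uuid="{blbhbl-bhblb-kbhb}">
<userid>wooty</userid>
<serviceProfile uuid="{dsdsdsd-sdsdsd-sdsd-sdsds-sdsds}">1 IM Presence Only</serviceProfile>
</user>
</return>
</ns:getUserResponse>
</soapenv:Body>
</soapenv:Envelope>"""

# Prefix-to-URI mapping, needed only for elements that are namespace-qualified.
NAMESPACES = {"ns": "http://www.cisco.com/AXL/API/11.5"}

root = ET.fromstring(RESPONSE)

# Unqualified element: no namespace handling required.
service_profile = root.find(".//serviceProfile").text

# Qualified element: the prefix is resolved through the mapping.
response_node = root.find(".//ns:getUserResponse", NAMESPACES)

print(service_profile)
print(response_node.tag)
```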

How to merge only selected lines of two different xml files into single xml file using python?

I have an XML file, "A.xml":
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Debug/exampled.lib" target="lib/Debug/exampled.lib" />
</files>
</package>
Another xml file "B.xml" as:
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Release/example.lib" target="lib/Release/example.lib" />
</files>
</package>
I only want to merge these two files in the following way:
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Debug/exampled.lib" target="lib/Debug/exampled.lib" />
<file src="lib/Release/example.lib" target="lib/Release/example.lib" />
</files>
</package>
So, kindly suggest how to merge only the files tag of the two files using a Python script.
This Stack Overflow answer covers many ways to parse XML:
How do I parse XML in Python?
I suggest going with ElementTree, as it is very simple to use and can solve your case with minimal code.
Since the community is hostile toward questions with little research behind them, I'm not going to post the full answer here; I suggest that you take the time to read up on ElementTree.
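For readers who want a starting point, the ElementTree approach suggested above might be sketched as follows. This is not the answerer's code: the function name `merge_files_sections` is hypothetical, and the documents are passed as strings to keep the example self-contained. Note that `ET.register_namespace` changes module-wide state; it is used here so the output keeps `xmlns="http://example"` instead of an `ns0:` prefix.

```python
import xml.etree.ElementTree as ET

NS = "http://example"
# Serialize this namespace as the default one (global, module-wide setting).
ET.register_namespace("", NS)

A_XML = """<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Debug/exampled.lib" target="lib/Debug/exampled.lib" />
</files>
</package>"""

B_XML = """<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Release/example.lib" target="lib/Release/example.lib" />
</files>
</package>"""


def merge_files_sections(a_text, b_text):
    """Append the <file> children of B's <files> to A's <files>."""
    root_a = ET.fromstring(a_text)
    root_b = ET.fromstring(b_text)
    # Elements live in the default namespace, so tags are written {uri}name.
    files_a = root_a.find(f"{{{NS}}}files")
    files_b = root_b.find(f"{{{NS}}}files")
    files_a.extend(list(files_b))
    return ET.tostring(root_a, encoding="unicode")


merged = merge_files_sections(A_XML, B_XML)
print(merged)
```

With real files, `ET.parse("A.xml")` and `tree.write(..., xml_declaration=True, encoding="utf-8")` would take the place of the string handling.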

Using Python ElementTree/ElementInclude and xpointer to access included XML files

I have a 'main.xml' file that includes two 'sub_x.xml' files. The include lines use 'xpointer' to include only specific tags from the included XML files. When I use ElementTree to check whether this worked, it shows that the whole 'sub' XML files are being included, not just the tags I want. I am not sure whether I am using xpointer incorrectly or whether ElementTree or ElementInclude does not support this. Here are the files:
------'main.xml'--------
`<?xml version='1.0' encoding='utf-8'?>
<ModelInfo xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="sub_1.xml" xpointer="xpointer(//ModelInfo/Model)" parse="xml" />
<xi:include href="sub_2.xml" xpointer="xpointer(//ModelInfo/Model)" parse="xml" />
</ModelInfo>`
-------'sub_1.xml'------
`<?xml version="1.0" ?>
<ModelInfo>
<Model ModelName="glow">
<Variables>
<Variable Alias="glow_val" Input="False" Output="True" />
</Variables>
</Model>
</ModelInfo>`
-------'sub_2.xml'------
`<?xml version='1.0' encoding='utf-8'?>
<ModelInfo>
<Model ModelName="sirpwr_b_supply8v1">
<Variables>
<Variable Alias="sirpwr_a_supplyecu_Snsr8vIstat" Input="True" Output="False" />
</Variables>
</Model>
</ModelInfo>`
I would like 'main.xml' to appear to ElementTree as:
`<?xml version='1.0' encoding='utf-8'?>
<ModelInfo xmlns:xi="http://www.w3.org/2001/XInclude">
<Model ModelName="glow">
<Variables>
<Variable Alias="glow_val" Input="False" Output="True" />
</Variables>
</Model>
<Model ModelName="sirpwr_b_supply8v1">
<Variables>
<Variable Alias="sirpwr_a_supplyecu_Snsr8vIstat" Input="True" Output="False" />
<Variable Alias="sirpwr_b_supply8v1_qstat" Input="False" Output="True" />
</Variables>
</Model>
</ModelInfo>`
The script I am running to load the XML files and test is:
`from xml.etree import ElementTree, ElementInclude
tree = ElementTree.parse('main.xml')
root = tree.getroot()
ElementInclude.include(root)
for element in root:
    print(element.tag)`
xpointer is not working because 'ModelInfo' is being copied over from the 'sub_x' xml files.
ElementInclude does not support all of XInclude. The xpointer attribute on the <include> element is ignored.
It does work the way you want it with lxml and the xinclude() method:
from lxml import etree
tree = etree.parse('main.xml')
tree.xinclude()
print(etree.tostring(tree).decode())
Note that the XPointer xpointer() scheme never reached the status of W3C Recommendation (it's still just a working draft). It has been implemented in libxml2 (the C library behind lxml) but almost nowhere else.
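If lxml is not available, a rough standard-library workaround is to resolve the includes by hand. The sketch below is an assumption-laden illustration, not an XInclude implementation: the helper `include_children` is hypothetical, it only handles `xi:include` elements that are direct children of the root, and it ignores the `xpointer` attribute entirely, taking the path of the elements to copy (`"Model"`) as an explicit argument. The files are written to a temporary directory so the example runs on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def include_children(root, base_dir, child_path):
    """Replace each xi:include child of root with the elements that
    match child_path in the referenced file (xpointer is NOT evaluated)."""
    # Walk backwards so earlier indexes stay valid while inserting.
    for index, node in reversed(list(enumerate(root))):
        if node.tag != XI_INCLUDE:
            continue
        sub_root = ET.parse(os.path.join(base_dir, node.get("href"))).getroot()
        matches = sub_root.findall(child_path)
        root.remove(node)
        for offset, match in enumerate(matches):
            root.insert(index + offset, match)
    return root


MAIN = """<ModelInfo xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="sub_1.xml" xpointer="xpointer(//ModelInfo/Model)" parse="xml" />
<xi:include href="sub_2.xml" xpointer="xpointer(//ModelInfo/Model)" parse="xml" />
</ModelInfo>"""
SUB_1 = '<ModelInfo><Model ModelName="glow"><Variables /></Model></ModelInfo>'
SUB_2 = '<ModelInfo><Model ModelName="sirpwr_b_supply8v1"><Variables /></Model></ModelInfo>'

with tempfile.TemporaryDirectory() as tmp:
    for name, text in (("main.xml", MAIN), ("sub_1.xml", SUB_1), ("sub_2.xml", SUB_2)):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    merged = ET.parse(os.path.join(tmp, "main.xml")).getroot()
    include_children(merged, tmp, "Model")

for element in merged:
    print(element.tag, element.get("ModelName"))
```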

How to remove all attributes of a tag

How can I remove all the attributes of an XML tag, so that I can get from this:
<xml blah blah blah> to just <xml>.
With lxml I know I can remove the whole element, but I didn't find any way to remove only the attributes of a tag. (I found solutions on Stack Overflow for C#, but I want Python.)
I am opening a GPX (XML) file, and this is my code so far (based on How do I get the whole content between two xml tags in Python?):
from lxml import etree
t = etree.parse("1.gpx")
e = t.xpath('//trk')[0]
print((e.text + ''.join(etree.tostring(child, encoding='unicode') for child in e)).strip())
Another approach I did was this:
from lxml import etree
TOPOGRAFIX_NS = './/{http://www.topografix.com/GPX/1/1}'
TRACKPOINT_NS = TOPOGRAFIX_NS + 'extensions/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}TrackPointExtension/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}'
doc1 = etree.parse("1.gpx")
for node1 in doc1.findall(TOPOGRAFIX_NS + 'trk'):
    node_to_string1 = etree.tostring(node1)
    print(node_to_string1)
But I get the trk tag with the TOPOGRAFIX_NS namespace attributes, which I don't want, so I would like to remove the attributes from the tag. I just want to get:
<trk> all the inside content </trk>
Thank you very much!
P.S. The content of the gpx file:
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="Endomondo.com" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<author>
<name>Blah Blah</name>
<email id="blah" domain="blah.com"/>
</author>
<link href="http://www.endomondo.com">
<text>Endomondo</text>
</link>
<time>2014-01-20T10:50:28Z</time>
</metadata>
<trk>
<name>Galati</name>
<src>http://www.endomondo.com/</src>
<link href="http://www.endomondo.com/workouts/260782567/13005122">
<text>Galati</text>
</link>
<type>MOUNTAIN_BIKING</type>
<trkseg>
<trkpt lat="45.431074" lon="28.021038">
<time>2013-10-20T05:49:04Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
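One possible approach, sketched here with the standard library's ElementTree rather than lxml: the `xmlns` declarations in the output come from the namespace-qualified tag names, so rewriting the tags to their local names makes the serialized `<trk>` come out bare, and `attrib.clear()` removes ordinary attributes from a single element. The helper name `strip_namespaces` is hypothetical, and the GPX sample is shortened from the question.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"

# Shortened version of the GPX file from the question.
GPX = """<gpx version="1.1" creator="Endomondo.com" xmlns="http://www.topografix.com/GPX/1/1">
<trk>
<name>Galati</name>
<trkseg>
<trkpt lat="45.431074" lon="28.021038">
<time>2013-10-20T05:49:04Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>"""


def strip_namespaces(element):
    """Rewrite '{uri}name' tags to 'name' for element and all descendants."""
    for node in element.iter():
        if isinstance(node.tag, str) and "}" in node.tag:
            node.tag = node.tag.split("}", 1)[1]
    return element


root = ET.fromstring(GPX)
trk = root.find(f"{{{GPX_NS}}}trk")
strip_namespaces(trk)
trk.attrib.clear()  # drops ordinary attributes on <trk> itself (none here)
trk.tail = None     # avoid serializing the whitespace that follows </trk>
trk_text = ET.tostring(trk, encoding="unicode")
print(trk_text)
```

Attributes on child elements, such as `lat` and `lon` on `<trkpt>`, are left alone because only `trk.attrib` is cleared.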

XML file to Excel, error when opening

I have this file here:
<?xml?>
<table name="data">
<row et_kt="215846" et_nafn="" et_kt_maka="" et_kt_fjolsk="215846" et_kyn="X" et_hjusk_stada="1" et_faeddag="190201" et_danrdag="198612" />
<row et_kt="239287" et_nafn="" et_kt_maka="" et_kt_fjolsk="239287" et_kyn="X" et_hjusk_stada="4" et_faeddag="190401" et_danrdag="199106" />
.
.
.
</table>
Excel tells me the file is in a different format than the .xml extension implies. What's wrong with the format?
Try changing <?xml?> to <?xml version="1.0"?>.
EDIT: Check this answer for some extra information about the issue.
At first glance, I'd say there is no valid XML declaration, i.e.
<?xml version="1.0" encoding="UTF-8" ?>
or, since it is a Microsoft product, maybe you should try <?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
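The point about the declaration can be checked outside Excel. The following is a small sketch using Python's ElementTree parser as a stand-in well-formedness check (Excel's own parser is not involved, and the `is_well_formed` helper is hypothetical); the sample keeps one shortened row from the question.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Declaration without the mandatory version attribute, as in the question.
BROKEN = '<?xml?>\n<table name="data">\n<row et_kt="215846" et_kyn="X" />\n</table>'

# Same document with the declaration suggested in the answer.
FIXED = BROKEN.replace("<?xml?>", '<?xml version="1.0"?>', 1)

print(is_well_formed(BROKEN))
print(is_well_formed(FIXED))
```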
