Parse Two XML using Python - python

I have an XML file that contains 200 Event blocks looking like below:
<?xml version='1.0' encoding='UTF-8'?>
<ProjectData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.project.com/proj1/projv" xsi:schemaLocation="http://www.pp.com/oj/p http://www.onj.com/p/IXX/schema/proj.xsd">
<fileType>This file is sample</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Id>0</Id>
<pp define="something">2</pp>
<Index>3</Index>
<Conf ref="point">CFG.AC.UF</Conf>
<Check>tttt</Check>
<Group>wwll</Group>
<Heart ref="point">mbmb</Heart>
<Name>kkk</Name>
<Thresh ref="point">kckcv</Thresh>
<Hyster ref="point">foo</Hyster>
<Trip ref="point">dim</Trip>
<Clear ref="point">CLR.AC.UF</Clear>
</Event>
</EventList>
</ProjectData>
The Event block contains the information I want to extract (only four elements: Id, Index, Name, and Group) to generate my new XML file. I want to do this with Python code.
Does anyone know how I can achieve this in Python?
My new xml file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>0</Id>
<Index>3</Index>
<Name>kkk**$Id**</Name>
<Group>wwll**$Index**</Group>
<desc>placeholder</desc>
</event>
</Newevents>
I also want to append the Id and Index, which are numbers, to the Name and Group strings, zero-padded to three digits.
For example, if the Id is 1, I want my Name to be kkk001; if the Id is 3, I want my Name to be kkk003.
The same goes for my Group element string, but using Index: if Index is 5, I want my Group to be wwll005.
I Googled, but the information about this is scattered.
Can anyone come up with neat Python code that parses my XML file and generates the new XML file in the format and numbering I want above?
I have another XML file called descXML.xml that I need to parse only to take the desc element string and add it to my new XML file.
In the second XML file (descXML.xml), the desc element data should be taken based on an Id match with my new XML file.
Is it possible to check whether the Id element is equal to the Id element of my new XML file and, if so, add the desc element content for the corresponding code number? How can I express this condition? Can you provide an example in Python?
Here is what the descXML.xml file looks like; like my original XML file, it also contains 200 Event blocks:
<EventList>
<Event>
<Mnemonic>AC.UF.SLOW</Mnemonic>
<Id define="xyz">3</Id>
<Index>13</Index>
<Description>today was warm and I want to go swimming</Description>
</Event>
</EventList>
Can 1 and 2 above merge into one python file?
The final XML file I want should look like below:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>3</Id>
<Index>13</Index>
<Name>kkk003</Name>
<Group>wwll013</Group>
<desc>today was warm and I want to go swimming</desc>
</event>
</Newevents>
Trial based on comments given below:
I tried to be concise here and tried the solution provided below, but it did not work, so I am providing my exact XML files:
My file1.xml
<?xml version='1.0' encoding='UTF-8'?>
<Dataizx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.kklk.com/cx1/ASD" xsi:schemaLocation="http://www.kklk.com/cx1/ASD http://www.kklk.com/cx1/ASD/schema/tell.xsd">
<fileType>Auto-Generated IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Mnemonic>ijk</Mnemonic>
<Id define="rece">2</Id>
<Index>0</Index>
<Config ref="point">shine</Config>
</Event>
<Event>
<Mnem>xyz</Mnem>
<Id define="teller">3</Id>
<Index>1</Index>
<Config ref="point">good</Config>
</Event>
</EventList>
</Dataizx>
And here is my xml that contains the description:
<?xml version="1.0" encoding="UTF-8"?>
<IXXData xmlns="http://www.mnm.com/mnm/mnm" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:i="http://www.mnm.com/mnm/mnm" xsi:schemaLocation="http://www.mnm.com/mnm/mnm http://www.mnm.com/mnm/mnm/schema/mnm.xsd">
<fileType>Merged IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0 + none</version>
<description>Merged event metadata.</description>
</header>
<EventList>
<Event>
<Id define="mmm">2</Id>
<Description>everything was good.</Description>
</Event>
<Event>
<Id define="lll">4</Id>
<Description>teller and the other one.</Description>
</Event>
<Event>
<Id define="ggg">3</Id>
<Description>weather is nice.</Description>
</Event>
</EventList>
</IXXData>
I used your xsl and python but I could not get description out of the second file.

Consider an XSLT solution, which can pick various nodes from the original XML and merge in nodes from an external XML file based on specific criteria. Python (like many general-purpose programming languages) has an XSLT 1.0 processor available, in its case through the third-party lxml module.
As information, XSLT is a special-purpose, declarative programming language (not an object-oriented one) for transforming XML files into various formats and structures.
For your purposes, you can use XSLT's document() and concat() functions. Your XSLT is a little involved, as it requires setting a variable to match Ids across documents and has quite a few namespaces to manage.
XSLT (save externally as .xsl file)
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.kklk.com/cx1/ASD"
xmlns:i="http://www.mnm.com/mnm/mnm"
xsi:schemaLocation="http://www.kklk.com/cx1/ASD http://www.kklk.com/cx1/ASD/schema/tell.xsd"
exclude-result-prefixes="xsi p i">
<xsl:output version="1.0" encoding="UTF-8"/>
<xsl:strip-space elements="*"/>
<!-- Suppress the text of elements outside EventList -->
<xsl:template match="p:fileType|p:header"/>
<xsl:template match="p:EventList">
<Dataroot>
<xsl:for-each select="p:Event">
<NewsEvents>
<Id><xsl:value-of select="p:Id"/></Id>
<Index><xsl:value-of select="p:Index"/></Index>
<!-- format-number() zero-pads to three digits, e.g. 3 becomes 003 -->
<Name><xsl:value-of select="concat(p:Name, format-number(p:Id, '000'))"/></Name>
<Group><xsl:value-of select="concat(p:Group, format-number(p:Index, '000'))"/></Group>
<xsl:variable name="descID" select="p:Id"/>
<!-- The i: namespace and root element name must match those of descXML.xml -->
<desc><xsl:value-of select="document('descXML.xml')/i:IXXData/i:EventList/i:Event/i:Id[text()=$descID]/following-sibling::i:Description"/></desc>
</NewsEvents>
</xsl:for-each>
</Dataroot>
</xsl:template>
</xsl:transform>
Python (loads .xml and .xsl, transforming former with the latter for new .xml output)
#!/usr/bin/python
import lxml.etree as ET
# Load the source XML and the stylesheet
dom = ET.parse('C:\\Path\\To\\MainXML.xml')
xslt = ET.parse('C:\\Path\\To\\AboveXSLT.xsl')
# Compile the stylesheet and apply it to the source document
transform = ET.XSLT(xslt)
newdom = transform(dom)
# Serialize the result and write it out as bytes
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
with open('C:\\Path\\To\\Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
Output (using above posted XML data)
(if descXML's Id matches any Event's Id, corresponding <desc> node below will be populated)
<?xml version='1.0' encoding='UTF-8'?>
<Dataroot>
<NewsEvents>
<Id>2</Id>
<Index>0</Index>
<Name>002</Name>
<Group>000</Group>
<desc>everything was good.</desc>
</NewsEvents>
<NewsEvents>
<Id>3</Id>
<Index>1</Index>
<Name>003</Name>
<Group>001</Group>
<desc>weather is nice.</desc>
</NewsEvents>
</Dataroot>
I know this XSLT approach may look intimidating, but it saves a lot of looping and creating of elements, subelements, and attributes in Python code. I often recommend this route whenever XML files are being handled, and I find it overlooked by programmers in general, not just Python users. Meanwhile, most work easily and without question with another special-purpose, declarative language: SQL!
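For comparison, the same merge can be sketched with only the standard library's xml.etree.ElementTree. This is a minimal sketch, not the answer's method: the sample strings are trimmed versions of the posted files (the Name and Group values on the second event are added for illustration), and the helper names child_text and merge_events are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the posted files; the prefixes are arbitrary.
MAIN_NS = {"p": "http://www.kklk.com/cx1/ASD"}
DESC_NS = {"i": "http://www.mnm.com/mnm/mnm"}

# Trimmed sample data (assumption: Name/Group may be absent, as in file1.xml).
MAIN_XML = """<Dataizx xmlns="http://www.kklk.com/cx1/ASD">
  <EventList>
    <Event><Id define="rece">2</Id><Index>0</Index></Event>
    <Event><Id define="teller">3</Id><Index>13</Index><Name>kkk</Name><Group>wwll</Group></Event>
  </EventList>
</Dataizx>"""

DESC_XML = """<IXXData xmlns="http://www.mnm.com/mnm/mnm">
  <EventList>
    <Event><Id define="mmm">2</Id><Description>everything was good.</Description></Event>
    <Event><Id define="lll">4</Id><Description>teller and the other one.</Description></Event>
    <Event><Id define="ggg">3</Id><Description>weather is nice.</Description></Event>
  </EventList>
</IXXData>"""


def child_text(element, path, ns):
    """Return the stripped text of a child element, or '' if it is missing."""
    return (element.findtext(path, default="", namespaces=ns) or "").strip()


def merge_events(main_xml, desc_xml):
    """Build a <Newevents> tree from the main file plus matching descriptions."""
    # Map each Id in the description file to its Description text.
    descriptions = {
        child_text(ev, "i:Id", DESC_NS): child_text(ev, "i:Description", DESC_NS)
        for ev in ET.fromstring(desc_xml).iterfind(".//i:Event", DESC_NS)
    }
    out_root = ET.Element("Newevents")
    for ev in ET.fromstring(main_xml).iterfind(".//p:Event", MAIN_NS):
        event_id = child_text(ev, "p:Id", MAIN_NS)
        index = child_text(ev, "p:Index", MAIN_NS)
        out = ET.SubElement(out_root, "event")
        ET.SubElement(out, "Id").text = event_id
        ET.SubElement(out, "Index").text = index
        # Zero-pad the numbers to three digits: 3 becomes "003".
        name = child_text(ev, "p:Name", MAIN_NS)
        group = child_text(ev, "p:Group", MAIN_NS)
        ET.SubElement(out, "Name").text = f"{name}{int(event_id):03d}"
        ET.SubElement(out, "Group").text = f"{group}{int(index):03d}"
        # An event without a matching Id gets an empty desc.
        ET.SubElement(out, "desc").text = descriptions.get(event_id, "")
    return out_root
```

The resulting tree could then be saved with ET.ElementTree(merge_events(MAIN_XML, DESC_XML)).write("Output.xml", encoding="UTF-8", xml_declaration=True). The dictionary lookup replaces the XSLT variable and document() call, at the cost of the explicit loop the answer set out to avoid.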

Related

Parse deeply nested XML to pandas dataframe

I'm trying to fetch particular parts of an XML file and move them into a pandas dataframe. After following some xml.etree tutorials, I'm still stuck at getting the output. So far, I've managed to find the child nodes, but I can't access them (i.e. I can't get the actual data out of them). Here is what I've got so far.
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within the XML data
tree.findall(".//*")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
What I want is to get the data from the node programDescriptions and specifically the child programDescriptionText xml:lang="nl", and of course a couple extra. But first focus on this one.
Some data to work with:
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster#url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
<applicationOpen>true</applicationOpen>
<applicationType>individual</applicationType>
<maxNumberOfParticipants>12</maxNumberOfParticipants>
<minNumberOfParticipants>8</minNumberOfParticipants>
<paymentDue>up-front</paymentDue>
<requiredLevel>academic bachelor</requiredLevel>
<startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
<instructionMode>training</instructionMode>
<teacher>
<id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
<name>SomeName</name>
<summary xml:lang="nl">
Long text of the summary. Not needed.
</summary>
</teacher>
<studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
<programName xml:lang="nl">Program Course Name</programName>
<programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
<programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
<programDescriptionText xml:lang="nl">This part is needed from the XML.
Big program description text. This part is needed to parse from the XML file.
</programDescriptionText>
<programDescriptionHtml xml:lang="nl">Not needed;
Not needed as well;
</programDescriptionHtml>
<subjectText>
<subject>curriculum</subject>
<header1 xml:lang="nl">Beschrijving</header1>
<descriptionHtml xml:lang="nl">Yet another HTML desscription;
Not necessarily needed;</descriptionHtml>
</subjectText>
<searchword xml:lang="nl">search word</searchword>
<webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
<programRun>
<id>PR-019514</id>
<status>application opened</status>
<startDate isFinal="true">2019-06-26</startDate>
<endDate isFinal="true">2020-02-11</endDate>
</programRun>
</programSchedule>
</program>
</programs>
Try the code below: (55703748.xml contains the xml you have posted)
import xml.etree.ElementTree as ET
tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)
Output
short Program Course Name summary
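The question asked for programDescriptionText with xml:lang="nl" rather than the summary. A minimal sketch of that lookup follows, assuming a trimmed copy of the posted XML; the function name program_rows and the choice of fields are illustrative only. ElementTree exposes xml:lang under the fixed XML namespace URI, which is why the attribute key looks unusual.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# ElementTree stores xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed sample based on the posted document.
SAMPLE = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
  <program>
    <programDescriptions>
      <programName xml:lang="nl">Program Course Name</programName>
      <programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
      <programDescriptionText xml:lang="nl">Big program description text.
      </programDescriptionText>
    </programDescriptions>
  </program>
</programs>"""


def program_rows(xml_text, lang="nl",
                 fields=("programName", "programDescriptionText")):
    """Return one dict per <program>, holding the requested text fields."""
    rows = []
    for program in ET.fromstring(xml_text).iterfind("p:program", NS):
        row = {}
        for field in fields:
            for node in program.iterfind(f"p:programDescriptions/p:{field}", NS):
                # Keep only the variant in the requested language.
                if node.get(XML_LANG) == lang:
                    row[field] = (node.text or "").strip()
        rows.append(row)
    return rows
```

The list of dicts returned by program_rows(SAMPLE) can be passed straight to pandas.DataFrame(rows) to get one row per program.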

convert one line xml into csv

I have xml documents in the format which is given below and I cannot find a successful way to convert this to csv using python. I am using Spyder IDE and am an extremely amateur python-ista. I managed to use an online converter for one of the files, but the remaining files are too large to upload.
I am looking for the output to be columns of rowID, PostID, Score, Text.
Please could someone assist?
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="1" Score="5" Text="Was there something in particular you didn't understand in the wikipedia article? http://en.wikipedia.org/wiki/Spin_%28physics%29" CreationDate="2010-11-02T19:11:07.043" UserId="42" />
<row Id="2" PostId="3" Score="1" Text="I thought the wikipedia article here was pretty good, but maybe it only makes sense if you have a little quantum mechanics background: http://en.wikipedia.org/wiki/Particle_physics_and_representation_theory Were you able to get anything out of it?" CreationDate="2010-11-02T19:13:34.870" UserId="42" />
<row Id="3" PostId="3" Score="0" Text="i mostly thought this was a better place for the question than MO." CreationDate="2010-11-02T19:16:09.873" UserId="40" />
<row Id="6" PostId="4" Score="11" Text="An accurate answer, but if the poster doesn't understand the actual concept of spin (not to mention group theory), this is all but useless." CreationDate="2010-11-02T19:32:15.410" UserId="13" />
<row Id="7" PostId="2" Score="2" Text="I'm tempted to answer: with much difficulty, in a highly qualitative way, and only by reading a fair-sized book. There are many decent pop-sci books on string theory; I can't remember the names of any I read, but I'm sure someone can recommend one or two." CreationDate="2010-11-02T19:36:53.290" UserId="13" />
<row Id="8" PostId="8" Score="0" Text="so the fundamental particle is acting on the quantum states?" CreationDate="2010-11-02T19:36:55.263" UserId="40" />
Secondly, if some rows do not have all fields or have extra fields, how can I ignore those and only populate what is there for the fields specified? I am getting the below error message, but do not want the additional 3 columns?
ParserError: Error tokenizing data. C error: Expected 4 fields in line 41, saw 7
The following is working for me:
import os
import xml.etree.ElementTree as ET
xml_file = "c:/temp/test.xml"
csv_file_output = '{}_out.csv'.format(os.path.splitext(xml_file)[0])
tree = ET.parse(xml_file)
xml_root = tree.getroot()
with open(csv_file_output, 'w') as fout:
    fout.write("Id,PostId,Score,Text")
    for row in xml_root.iter("row"):
        id = row.get("Id")
        postId = row.get("PostId")
        score = row.get("Score")
        text = row.get("Text")
        fout.write('\n{0},{1},{2},"{3}"'.format(id, postId, score, text))
This could also be done using pandas and saving a Data Frame to CSV but I kept it simple.
A file with the same name but ending in _out.csv will be generated in the same folder as the XML file.
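For the second part of the question (rows with missing or extra attributes, and files too large for an online converter), here is a hedged sketch using the standard csv module and ET.iterparse. The csv writer takes care of quoting commas and embedded quotes in Text, missing attributes become empty cells, and extra attributes are simply not read. The function name rows_to_csv and the sample data are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["Id", "PostId", "Score", "Text"]

# Made-up sample: row 1 has an embedded comma and quotes, row 2 lacks Score
# and carries an extra attribute that should be ignored.
SAMPLE = b"""<comments>
<row Id="1" PostId="1" Score="5" Text="He said &quot;hi&quot;, then left" UserId="42" />
<row Id="2" PostId="3" Text="no score here" CreationDate="2010-11-02T19:13:34.870" />
</comments>"""


def rows_to_csv(xml_source, out_stream, fields=FIELDS):
    """Stream <row> elements from xml_source into CSV on out_stream.

    xml_source may be a file name or a binary file object.
    """
    writer = csv.DictWriter(out_stream, fieldnames=fields)
    writer.writeheader()
    # iterparse yields each element once its end tag has been read,
    # so the whole document never has to be held in memory.
    for _event, elem in ET.iterparse(xml_source):
        if elem.tag == "row":
            # Missing attributes become empty strings; extras are never read.
            writer.writerow({name: elem.get(name, "") for name in fields})
            elem.clear()  # release the attributes of processed rows
```

With real files this might be called as rows_to_csv("c:/temp/test.xml", open("c:/temp/test_out.csv", "w", newline="", encoding="utf-8")), ideally inside a with block so the output file is closed.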

Pretty print subnode without namespace declaration

I have an xml document and I want to extract a subnode (boundedBy) and pretty_print it exactly as it looks in the original document (with exception to the pretty formatting).
<?xml version="1.0" encoding="UTF-8" ?>
<wfs:FeatureCollection
xmlns:sei="https://somedomain.com/namespace"
xmlns:wfs="http://www.opengis.net/wfs"
xmlns:gml="http://www.opengis.net/gml"
xmlns:ogc="http://www.opengis.net/ogc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd
https://somedomain.com/schemas/wfsnamespace some.xsd">
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>-10.934396,-139.997120 77.396455,-53.627763</gml:coordinates>
</gml:Box>
</gml:boundedBy>
<gml:featureMember>
<sei:HUB_HEIGHT_FCST>
<!-- This is the section I want -->
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates>
</gml:Box>
</gml:boundedBy>
<!-- This is the section I want -->
<sei:geometry_4326>
<gml:Point srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120</gml:coordinates>
</gml:Point>
</sei:geometry_4326>
<sei:rundatetime>2017-09-26 00:00:00</sei:rundatetime>
<sei:validdatetime>2017-09-26 17:00:00</sei:validdatetime>
</sei:HUB_HEIGHT_FCST>
</gml:featureMember>
</wfs:FeatureCollection>
Here is how I'm extracting the subnode:
from lxml import etree
# parse the xml string
parser = etree.XMLParser(remove_blank_text=True, remove_comments=True, recover=False, strip_cdata=False)
root = etree.fromstring(xmlstr, parser=parser)
# find the subnode I want
subnodes = root.xpath("./gml:boundedBy", namespaces={'gml': 'http://www.opengis.net/gml'})
subnode = subnodes[0]
# make a pretty output
xmlstr = etree.tostring(subnode, xml_declaration=False, encoding="UTF-8", pretty_print=True)
print(xmlstr.decode("UTF-8"))
Which gives me this. Unfortunately lxml is adding the namespaces to the boundedBy node (which makes sense for the sake of completeness in xml).
<gml:boundedBy xmlns:gml="http://www.opengis.net/gml" xmlns:sei="https://somedomain.com/namespace" xmlns:wfs="http://www.opengis.net/wfs" xmlns:ogc="http://www.opengis.net/ogc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<gml:Box srsName="EPSG:4326">
<gml:coordinates>-10.934396,-139.997120 77.396455,-53.627763</gml:coordinates>
</gml:Box>
</gml:boundedBy>
I only want the subnode as it looked in the original document.
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates>
</gml:Box>
</gml:boundedBy>
I'm flexible with not using lxml, but either way I haven't found options on how to accomplish this.
edit:
Since it was pointed out that I should explain why I want to do this...
I'm trying to log the XML fragment without altering its original structure. The automated test I'm building looks at certain nodes for correctness. In the process I'm logging the fragment and want to make it a bit more readable for the person reviewing. Some of the fragments can get fairly large, which is why pretty_print is so nice.
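Since the question is open to alternatives to lxml, one possible sketch uses the standard library's xml.dom.minidom. It keeps namespace declarations as ordinary attributes on the element where they were written, so serializing a subnode does not repeat the declarations from the root. The sample below is a trimmed version of the posted document, and the function name pretty_fragment is hypothetical. Note that the printed fragment is not standalone well-formed XML, which is acceptable for logging but not for re-parsing.

```python
import xml.dom.minidom

GML_NS = "http://www.opengis.net/gml"

# Trimmed version of the posted document: one top-level and one nested boundedBy.
SAMPLE = """<wfs:FeatureCollection
    xmlns:sei="https://somedomain.com/namespace"
    xmlns:wfs="http://www.opengis.net/wfs"
    xmlns:gml="http://www.opengis.net/gml">
  <gml:boundedBy>
    <gml:Box srsName="EPSG:4326">
      <gml:coordinates>-10.934396,-139.997120 77.396455,-53.627763</gml:coordinates>
    </gml:Box>
  </gml:boundedBy>
  <gml:featureMember>
    <sei:HUB_HEIGHT_FCST>
      <gml:boundedBy>
        <gml:Box srsName="EPSG:4326">
          <gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates>
        </gml:Box>
      </gml:boundedBy>
    </sei:HUB_HEIGHT_FCST>
  </gml:featureMember>
</wfs:FeatureCollection>"""


def pretty_fragment(xml_text, namespace, local_name, index=0):
    """Pretty-print the index-th matching element without inherited xmlns attributes."""
    dom = xml.dom.minidom.parseString(xml_text)
    # Matches are returned in document order.
    node = dom.getElementsByTagNameNS(namespace, local_name)[index]
    pretty = node.toprettyxml(indent="  ")
    # Whitespace-only text nodes from the source show up as blank lines; drop them.
    return "\n".join(line for line in pretty.splitlines() if line.strip())
```

Here pretty_fragment(SAMPLE, GML_NS, "boundedBy", index=1) selects the nested boundedBy marked in the question; index=0 would select the top-level one.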

Python script to change an attribute value in .tcx file (XML)

I have a .tcx (XML) file, with the following schema:
<Activities>
<Activity>
<Lap StartTime="2015-12-24T08:12:18.969Z">
<TotalTimeSeconds>4069.0</TotalTimeSeconds>
<DistanceMeters>30458.794921875</DistanceMeters>
<MaximumSpeed>43.36123275756836</MaximumSpeed>
<Calories>2286</Calories>
<AverageHeartRateBpm><Value>144</Value></AverageHeartRateBpm><MaximumHeartRateBpm><Value>169</Value></MaximumHeartRateBpm>
<Intensity>Active</Intensity>
<Cadence>87</Cadence>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2015-12-24T08:12:19.969Z</Time>
<Position><LatitudeDegrees>45.4917</LatitudeDegrees><LongitudeDegrees>9.16198</LongitudeDegrees></Position>
<AltitudeMeters>124.018</AltitudeMeters>
<DistanceMeters>0.0</DistanceMeters>
<SensorState>Present</SensorState>
<Extensions><TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2"><Watts>0</Watts></TPX></Extensions></Trackpoint>
...
</Track>
</Lap>
</Activity>
</Activities>
and need to change (double) the Watts attribute.
Would like a simple python script
Simply run an XSLT script. No Python loops or expensive XPath expressions (//) are needed. As information, XSLT is a declarative, special-purpose programming language used specifically to restructure, redesign, or reformat XML documents for various end uses. As with most general-purpose languages such as Java, C#, Perl, PHP, and VB, an XSLT 1.0 processor is available for Python, here through the third-party lxml module.
Below runs an identity transform to copy entire document as is and then multiplies the current value in any Watts node by 2. I declare a namespace doc in XSLT to reference the Watts element.
XSLT (save as .xsl or .xslt)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="doc:Watts">
<xsl:copy>
<xsl:value-of select=". * 2"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Python Script
import lxml.etree as ET
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
# tostring() returns bytes, so open the output file in binary write mode
with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
Your last two element tags need to be closing tags, and you have a Watts element not an attribute. Here is how to do it with your file structure.
Python provides the ElementTree library for this. The following script will accomplish what you want:
import xml.etree.ElementTree as ET
tree = ET.parse("test.tcx")
tpxns = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"
for watts in tree.iter("{%s}Watts" % tpxns):
    watts.text = str(2 * int(watts.text))
tree.write("testnew.tcx")
Here I import the ElementTree library and use a simpler name for it. The parse function creates an ElementTree object from your file. I walk through the file to find all Watts elements (as these occur in a namespace, I actually need to look for {http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts, which I build up using string formatting).
When I find such an element, I set the inner text to be twice what the previous value was (converting to an int first and then back to a string).
Finally, I write the new XML file out. I could have overwritten the original file here if I wanted to.
Look over the documentation for the ElementTree module if you need to do anything fancier. It provides very powerful tools for working with XML. There are even more powerful libraries out there if you need more features (I like lxml for example).
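One caveat with the script above: int(watts.text) raises a ValueError if a Watts value has a fractional part. The question's sample only shows an integer, so whether this matters is an assumption about the data. A small sketch that tolerates both cases follows; the sample fragment and the function name double_watts are made up for illustration.

```python
import xml.etree.ElementTree as ET

TPX_NS = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"

# Made-up fragment in the shape of the posted file, with one fractional value.
SAMPLE = """<Track>
<Trackpoint><Extensions><TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2"><Watts>120</Watts></TPX></Extensions></Trackpoint>
<Trackpoint><Extensions><TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2"><Watts>87.25</Watts></TPX></Extensions></Trackpoint>
</Track>"""


def double_watts(root):
    """Double every Watts value in place, keeping whole numbers free of '.0'."""
    for watts in root.iter("{%s}Watts" % TPX_NS):
        doubled = float(watts.text) * 2
        # Write 240 rather than 240.0 when the result is a whole number.
        watts.text = str(int(doubled)) if doubled.is_integer() else str(doubled)
    return root
```

With a file, the same function could be applied as double_watts(tree.getroot()) before calling tree.write("testnew.tcx").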

Create/parse an xml file using python

I've been reading around online and came across this solution, Creating a simple XML file using python, but was not sure if it was still relevant since the post is 5 years old. I'm guessing Python has evolved a lot in the past 5 years.
I have a class which has a few attributes: in this case, an attribute name, which is a string, and another attribute called items, which is a list of strings. I want to write this data to an XML file and later be able to read it back in to repopulate the variable Teams when I run a tool. I want the generated XML to have pretty spacing and indentation.
Can I create the desired XML, as well as parse an XML file, with a standard library in Python? Or do I need a separate download and, if so, what do you recommend?
Teams = []
Teams.append(Team(name="Zebras", items=[]))
Teams.append(Team(name="Cobras", items=[]))
Teams.append(Team(name="Tigers", items=[]))
Teams.append(Team(name="Lizards", items=[]))
Xml output example
<?xml version="1.0" ?>
<teams>
<team name="cobras">
<item name="teapot001"/>
<item name="teapot002"/>
<item name="teapot003"/>
</team>
<team name="lizards">
<item name="teapot001"/>
<item name="teapot002"/>
<item name="teapot003"/>
</team>
</teams>
Use ElementTree or minidom from the standard xml package; you can see the answer in this post:
How do I parse XML in Python?
example:
import xml.dom.minidom
# xml_fname is the path to your XML file
dom = xml.dom.minidom.parse(xml_fname)
# or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()
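To cover the full round trip with the standard library only, here is a minimal sketch that builds the XML with ElementTree, pretty-prints it through minidom, and reads it back. The Team dataclass is an assumption standing in for the asker's class, and the function names teams_to_xml and teams_from_xml are hypothetical.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Team:
    """Stand-in for the asker's class: a name plus a list of item names."""
    name: str
    items: list = field(default_factory=list)


def teams_to_xml(teams):
    """Serialize teams to an indented XML string."""
    root = ET.Element("teams")
    for team in teams:
        team_el = ET.SubElement(root, "team", name=team.name)
        for item in team.items:
            ET.SubElement(team_el, "item", name=item)
    rough = ET.tostring(root, encoding="unicode")
    # minidom adds the indentation and the <?xml version="1.0" ?> header.
    return xml.dom.minidom.parseString(rough).toprettyxml(indent="  ")


def teams_from_xml(xml_text):
    """Rebuild the list of Team objects from XML produced by teams_to_xml."""
    root = ET.fromstring(xml_text)
    return [
        Team(team_el.get("name"), [item.get("name") for item in team_el.findall("item")])
        for team_el in root.findall("team")
    ]
```

For example, teams_to_xml([Team("cobras", ["teapot001", "teapot002"])]) yields a document in the shape shown in the question, and passing that string to teams_from_xml returns an equal list. On Python 3.9 and later, ET.indent is an alternative to the minidom step.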
