Create/parse an xml file using python - python

I've been reading around online and came across the solution "Creating a simple XML file using python", but I wasn't sure whether it is still relevant since the post is 5 years old. I'm guessing Python has evolved a lot in the past 5 years.
I have a class with a few attributes: name, which is a string, and items, which is a list of strings. I want to write this data to an XML file and later parse it back in to repopulate the variable Teams when I run a tool. I want the generated XML to be pretty-printed with proper spacing and indentation.
Can I create the desired XML, and parse an XML file, with the standard library in Python? Or do I need a separate download, and if so, what do you recommend?
Teams = []
Teams.append(Team(name="Zebras", items=[]))
Teams.append(Team(name="Cobras", items=[]))
Teams.append(Team(name="Tigers", items=[]))
Teams.append(Team(name="Lizards", items=[]))
XML output example:
<?xml version="1.0" ?>
<teams>
  <team name="cobras">
    <item name="teapot001"/>
    <item name="teapot002"/>
    <item name="teapot003"/>
  </team>
  <team name="lizards">
    <item name="teapot001"/>
    <item name="teapot002"/>
    <item name="teapot003"/>
  </team>
</teams>

Use ElementTree or minidom from the standard library's xml package. See the answers to this post:
How do I parse XML in Python?
Example:
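import xml.dom.minidom
# xml_fname is the path to your XML file; the result is named dom so it does not shadow the xml package
dom = xml.dom.minidom.parse(xml_fname)
# or: dom = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()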
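To cover both halves of the question (writing and reading back), here is a minimal standard-library sketch. The Team dataclass is a hypothetical stand-in for the asker's class, and the function names teams_to_xml and teams_from_xml are made up for illustration. ElementTree builds the tree, and minidom is used only to add the indentation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from xml.dom import minidom


@dataclass
class Team:
    # Hypothetical stand-in for the class described in the question.
    name: str
    items: list = field(default_factory=list)


def teams_to_xml(teams):
    """Serialize a list of Team objects to an indented XML string."""
    root = ET.Element("teams")
    for team in teams:
        team_el = ET.SubElement(root, "team", name=team.name)
        for item in team.items:
            ET.SubElement(team_el, "item", name=item)
    # minidom re-parses the compact output only to add indentation.
    compact = ET.tostring(root, encoding="unicode")
    return minidom.parseString(compact).toprettyxml(indent="  ")


def teams_from_xml(xml_text):
    """Rebuild the list of Team objects from the XML string."""
    root = ET.fromstring(xml_text)
    return [
        Team(name=t.get("name"), items=[i.get("name") for i in t.findall("item")])
        for t in root.findall("team")
    ]


teams = [Team("cobras", ["teapot001", "teapot002"]), Team("lizards")]
xml_text = teams_to_xml(teams)
restored = teams_from_xml(xml_text)
```

Writing xml_text to a file and reading it back later gives the same round trip. On Python 3.9 and later, ET.indent(root) is another way to get the indentation without minidom.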

Related

python lxml pkg - how to incrementally write to an XML file using etree.xmlfile AND passing in existing elements?

I'm very new to anything XML-related, so please bear with me. I'm trying to build some code that converts rasters to KML files for Google Earth.
I've come across the lxml package, which has made my life easier, but I'm now facing an issue.
Let's say I've created an element called kml with namespaces:
from lxml import etree
version = '2.2'
namespace_hdr = {'gx': f'http://www.google.com/kml/ext/{version}',
                 'kml': f'http://www.opengis.net/kml/{version}',
                 'atom': 'http://www.w3.org/2005/Atom'
                 }
kml = etree.Element('kml', nsmap=namespace_hdr)
And I've also created an element called Document:
Document = etree.SubElement(kml, 'Document')
Now, I have a lot of data to write and am running into memory issues, so I figured the best approach would be to generate the data on the fly and write it as I go, hence the incremental writing.
The approach I'm using is:
out_file = 'test.kml'
with etree.xmlfile(out_file, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element(kml):
        xf.write(Document)
Which returns the error:
TypeError: Argument must be bytes or unicode, got '_Element'
If I change kml to 'kml' it works fine, but obviously does not write the namespaces to the file that I've defined in the kml element.
How is it possible to pass in the kml element instead of a string? Is there a way to do this? Or some other way of incrementally writing to the file?
Any thoughts would be appreciated.
FYI - output when using 'kml' is:
<?xml version='1.0' encoding='utf-8'?>
<kml><Document/>
</kml>
I'm trying to achieve the same but with the namespaces:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document/>
</kml>

Parse deeply nested XML to pandas dataframe

I'm trying to fetch particular parts of an XML file and move them into a pandas dataframe. Following some tutorials on xml.etree, I'm still stuck at getting the output. So far, I've managed to find the child nodes, but I can't access them (i.e. I can't get the actual data out of them). So, here is what I've got so far.
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within the XML data
tree.findall(".//*")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
What I want is to get the data from the node programDescriptions and specifically the child programDescriptionText xml:lang="nl", and of course a couple extra. But first focus on this one.
Some data to work with:
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster#url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
<applicationOpen>true</applicationOpen>
<applicationType>individual</applicationType>
<maxNumberOfParticipants>12</maxNumberOfParticipants>
<minNumberOfParticipants>8</minNumberOfParticipants>
<paymentDue>up-front</paymentDue>
<requiredLevel>academic bachelor</requiredLevel>
<startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
<instructionMode>training</instructionMode>
<teacher>
<id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
<name>SomeName</name>
<summary xml:lang="nl">
Long text of the summary. Not needed.
</summary>
</teacher>
<studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
<programName xml:lang="nl">Program Course Name</programName>
<programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
<programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
<programDescriptionText xml:lang="nl">This part is needed from the XML.
Big program description text. This part is needed to parse from the XML file.
</programDescriptionText>
<programDescriptionHtml xml:lang="nl">Not needed;
Not needed as well;
</programDescriptionHtml>
<subjectText>
<subject>curriculum</subject>
<header1 xml:lang="nl">Beschrijving</header1>
<descriptionHtml xml:lang="nl">Yet another HTML desscription;
Not necessarily needed;</descriptionHtml>
</subjectText>
<searchword xml:lang="nl">search word</searchword>
<webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
<programRun>
<id>PR-019514</id>
<status>application opened</status>
<startDate isFinal="true">2019-06-26</startDate>
<endDate isFinal="true">2020-02-11</endDate>
</programRun>
</programSchedule>
</program>
</programs>
Try the code below: (55703748.xml contains the xml you have posted)
import xml.etree.ElementTree as ET
tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)
Output
short Program Course Name summary
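The same lookup can be extended to the programDescriptionText element the question asks for. The following is a sketch only: SAMPLE is a trimmed-down copy of the posted data, and the prefix "p" and the column names "name" and "description" are arbitrary choices for illustration. It uses a namespace dictionary instead of spelling out the full URI in every path, and filters on the xml:lang attribute.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample with the same default namespace as the question.
SAMPLE = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
  <program>
    <programDescriptions>
      <programName xml:lang="nl">Program Course Name</programName>
      <programDescriptionText xml:lang="nl">Big program description text.
      </programDescriptionText>
    </programDescriptions>
  </program>
</programs>"""

# "p" is an arbitrary prefix that stands for the document's default namespace.
NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# The xml: prefix is predefined and always maps to this URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(SAMPLE)
rows = []
for program in root.findall("p:program", NS):
    desc = program.find("p:programDescriptions", NS)
    # Take the first Dutch description, or an empty string if there is none.
    dutch_text = next(
        ((el.text or "").strip()
         for el in desc.findall("p:programDescriptionText", NS)
         if el.get(XML_LANG) == "nl"),
        "",
    )
    rows.append({
        "name": desc.findtext("p:programName", default="", namespaces=NS),
        "description": dutch_text,
    })
```

Assuming pandas is installed, pandas.DataFrame(rows) turns this list of dictionaries into a dataframe with one row per program.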

convert one line xml into csv

I have xml documents in the format which is given below and I cannot find a successful way to convert this to csv using python. I am using Spyder IDE and am an extremely amateur python-ista. I managed to use an online converter for one of the files, but the remaining files are too large to upload.
I am looking for the output to be columns of rowID, PostID, Score, Text.
Please could someone assist?
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="1" Score="5" Text="Was there something in particular you didn't understand in the wikipedia article? http://en.wikipedia.org/wiki/Spin_%28physics%29" CreationDate="2010-11-02T19:11:07.043" UserId="42" />
<row Id="2" PostId="3" Score="1" Text="I thought the wikipedia article here was pretty good, but maybe it only makes sense if you have a little quantum mechanics background: http://en.wikipedia.org/wiki/Particle_physics_and_representation_theory Were you able to get anything out of it?" CreationDate="2010-11-02T19:13:34.870" UserId="42" />
<row Id="3" PostId="3" Score="0" Text="i mostly thought this was a better place for the question than MO." CreationDate="2010-11-02T19:16:09.873" UserId="40" />
<row Id="6" PostId="4" Score="11" Text="An accurate answer, but if the poster doesn't understand the actual concept of spin (not to mention group theory), this is all but useless." CreationDate="2010-11-02T19:32:15.410" UserId="13" />
<row Id="7" PostId="2" Score="2" Text="I'm tempted to answer: with much difficulty, in a highly qualitative way, and only by reading a fair-sized book. There are many decent pop-sci books on string theory; I can't remember the names of any I read, but I'm sure someone can recommend one or two." CreationDate="2010-11-02T19:36:53.290" UserId="13" />
<row Id="8" PostId="8" Score="0" Text="so the fundamental particle is acting on the quantum states?" CreationDate="2010-11-02T19:36:55.263" UserId="40" />
</comments>
Secondly, if some rows do not have all fields or have extra fields, how can I ignore those and only populate what is there for the fields specified? I am getting the below error message, but do not want the additional 3 columns?
ParserError: Error tokenizing data. C error: Expected 4 fields in line 41, saw 7
The following is working for me:
import os
import xml.etree.ElementTree as ET
xml_file = "c:/temp/test.xml"
csv_file_output = '{}_out.csv'.format(os.path.splitext(xml_file)[0])
tree = ET.parse(xml_file)
xml_root = tree.getroot()
with open(csv_file_output, 'w') as fout:
    fout.write("Id,PostId,Score,Text")
    for row in xml_root.iter("row"):
        id = row.get("Id")
        postId = row.get("PostId")
        score = row.get("Score")
        text = row.get("Text")
        fout.write('\n{0},{1},{2},"{3}"'.format(id, postId, score, text))
This could also be done using pandas and saving a DataFrame to CSV, but I kept it simple.
A file with the same name but ending in _out.csv will be generated in the same folder as the XML file.
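One caveat with writing the lines by hand is that a Text value containing a double quote produces a malformed CSV row. As a hedged alternative, the sketch below uses the csv module so that quoting is handled for you, leaves missing attributes empty, and ignores extra ones such as UserId. SAMPLE is made-up data in the question's format and rows_to_csv is a hypothetical helper name. It reads with iterparse so that large files are processed row by row.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample: the first row has an extra attribute and embedded quotes,
# the second row has no Score attribute.
SAMPLE = """<comments>
  <row Id="1" PostId="1" Score="5" Text="He said &quot;hi&quot;, then left" UserId="42" />
  <row Id="2" PostId="3" Text="no score here" />
</comments>"""

FIELDS = ["Id", "PostId", "Score", "Text"]


def rows_to_csv(xml_source, out):
    """Write the chosen attributes of every <row> to the open text file `out`."""
    # extrasaction="ignore" drops attributes that are not listed in FIELDS;
    # attributes missing from a row are written as empty cells.
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    # iterparse yields each element as soon as its end tag has been read.
    for _event, elem in ET.iterparse(xml_source):
        if elem.tag == "row":
            writer.writerow(elem.attrib)
            elem.clear()  # drop the row's attributes once they are written


buffer = io.StringIO()
rows_to_csv(io.BytesIO(SAMPLE.encode("utf-8")), buffer)
csv_text = buffer.getvalue()
```

With real files, pass the XML path as xml_source and open the output with open(csv_path, "w", newline="", encoding="utf-8"), as the csv module documentation recommends.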

Parse Two XML using Python

I have an XML file that contains 200 Event blocks looking like below:
<?xml version='1.0' encoding='UTF-8'?>
<ProjectData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.project.com/proj1/projv" xsi:schemaLocation="http://www.pp.com/oj/p http://www.onj.com/p/IXX/schema/proj.xsd">
<fileType>This file is sample</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Id>0</Id>
<pp define="something">2</pp>
<Index>3</Index>
<Conf ref="point">CFG.AC.UF</Conf>
<Check>tttt</Check>
<Group>wwll</Group>
<Heart ref="point">mbmb</Heart>
<Name>kkk</Name>
<Thresh ref="point">kckcv</Thresh>
<Hyster ref="point">foo</Hyster>
<Trip ref="point">dim</Trip>
<Clear ref="point">CLR.AC.UF</Clear>
</Event>
</EventList>
</ProjectData>
Each Event block contains information I am interested in (only 4 elements: Id, Index, Name, and Group) for generating my new XML file. I want to do this with Python code.
Does anyone know how I can achieve this in Python?
My new xml file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>0</Id>
<Index>3</Index>
<Name>kkk**$Id**</Name>
<Group>wwll**$Index**</Group>
<desc>placeholder</desc>
</event>
</Newevents>
I also want to append the Id and Index, which are numbers, to the Name and Group strings, zero-padded to three digits.
For example, if the Id is 1, I want my Name to be kkk001; if the Id is 3, I want my Name to be kkk003.
The same goes for my Group element string, but using Index: if Index is 5, I want my Group to be wwll005.
I Googled, but the information about this is sporadic.
Can anyone come up with neat Python code that parses my XML file and generates the new XML file in the format and numbering I described above?
I have another XML file called descXML.xml that I need to parse, taking only the desc element string and adding it to my new XML file.
From the second XML file (descXML.xml), the desc element data should be taken based on an Id match with my new XML file.
Is it possible to check whether the Id element equals the Id element of my new XML file, and then add the desc element content for the corresponding event? How can I write this condition? Can you provide an example in Python?
Here is what the descXML.xml file looks like; analogous to my first XML file, it also contains 200 Event blocks:
<EventList>
<Event>
<Mnemonic>AC.UF.SLOW</Mnemonic>
<Id define="xyz">3</Id>
<Index>13</Index>
<Description>today was warm and I want to go swimming</Description>
</Event>
</EventList>
Can 1 and 2 above merge into one python file?
The final XML file I want should look like below:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>3</Id>
<Index>13</Index>
<Name>kkk003</Name>
<Group>wwll013</Group>
<desc>today was warm and I want to go swimming</desc>
</event>
</Newevents>
Trial based on the comments given below:
I tried to be concise here and tried the solution provided below, but it did not work, so I am providing my exact XML files:
My file1.xml:
<?xml version='1.0' encoding='UTF-8'?>
<Dataizx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.kklk.com/cx1/ASD" xsi:schemaLocation="http://www.kklk.com/cx1/ASD http://www.kklk.com/cx1/ASD/schema/tell.xsd">
<fileType>Auto-Generated IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Mnemonic>ijk</Mnemonic>
<Id define="rece">2</Id>
<Index>0</Index>
<Config ref="point">shine</Config>
</Event>
<Event>
<Mnem>xyz</Mnem>
<Id define="teller">3</Id>
<Index>1</Index>
<Config ref="point">good</Config>
</Event>
</EventList>
</Dataizx>
And here is my xml that contains the description:
<?xml version="1.0" encoding="UTF-8"?>
<IXXData xmlns="http://www.mnm.com/mnm/mnm" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:i="http://www.mnm.com/mnm/mnm" xsi:schemaLocation="http://www.mnm.com/mnm/mnm http://www.mnm.com/mnm/mnm/schema/mnm.xsd">
<fileType>Merged IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0 + none</version>
<description>Merged event metadata.</description>
</header>
<EventList>
<Event>
<Id define="mmm">2</Id>
<Description>everything was good.</Description>
</Event>
<Event>
<Id define="lll">4</Id>
<Description>teller and the other one.</Description>
</Event>
<Event>
<Id define="ggg">3</Id>
<Description>weather is nice.</Description>
</Event>
</EventList>
</IXXData>
I used your xsl and python but I could not get description out of the second file.
Consider an XSLT solution, which can pick various nodes from the original XML and merge in nodes from an external XML file based on specific criteria. Python, like many general-purpose languages, can run XSLT, for example through the third-party lxml module.
As information, XSLT is a special-purpose, declarative programming language (not an object-oriented one) to transform XML files in various formats and structures.
Additionally for your purposes you can use XSLT's document() and concat() functions. Your XSLT was a little involved as it required setting a variable to match ids across documents and had quite a bit of namespaces to manage.
XSLT (save externally as .xsl file)
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:p="http://www.kklk.com/cx1/ASD"
               xmlns:i="http://www.mnm.com/mnm/mnm"
               xsi:schemaLocation="http://www.kklk.com/cx1/ASD http://www.kklk.com/cx1/ASD/schema/tell.xsd"
               exclude-result-prefixes="xsi p i">
  <xsl:output version="1.0" encoding="UTF-8"/>
  <xsl:template match="p:EventList">
    <NewsEvents>
      <xsl:for-each select="p:Event">
        <Id><xsl:value-of select="p:Id"/></Id>
        <Index><xsl:value-of select="p:Index"/></Index>
        <!-- format-number pads to three digits, e.g. 3 becomes 003 -->
        <Name><xsl:value-of select="concat(p:Name, format-number(p:Id, '000'))"/></Name>
        <Group><xsl:value-of select="concat(p:Group, format-number(p:Index, '000'))"/></Group>
        <!-- the i prefix and the IXXData root must match the posted descXML.xml -->
        <xsl:variable name="descID" select="p:Id"/>
        <desc><xsl:value-of select="document('descXML.xml')/i:IXXData/i:EventList/
                                    i:Event/i:Id[text()=$descID]/following-sibling::i:Description"/></desc>
      </xsl:for-each>
    </NewsEvents>
  </xsl:template>
</xsl:transform>
Python (loads .xml and .xsl, transforming former with the latter for new .xml output)
#!/usr/bin/python
import lxml.etree as ET
dom = ET.parse('C:\\Path\\To\\MainXML.xml')
xslt = ET.parse('C:\\Path\\To\\AboveXSLT.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
with open('C:\\Path\\To\\Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
Output (using above posted XML data)
(if descXML's Id matches any Event's Id, corresponding <desc> node below will be populated)
<?xml version='1.0' encoding='UTF-8'?>
<Dataroot>
<NewsEvents>
<Id>2</Id>
<Index>0</Index>
<Name>002</Name>
<Group>000</Group>
<desc>everything was good.</desc>
</NewsEvents>
<NewsEvents>
<Id>3</Id>
<Index>1</Index>
<Name>003</Name>
<Group>001</Group>
<desc>weather is nice.</desc>
</NewsEvents>
</Dataroot>
I know this XSLT approach may look intimidating, but it saves a lot of looping and creating of elements, subelements, and attributes in Python code. I often recommend this route whenever XML files are being handled, and I find it overlooked by programmers in general, not just Python users. Meanwhile, most work easily and without question with another special-purpose, declarative language: SQL!
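For comparison, the same merge can be sketched with the standard library's ElementTree alone. This is not the XSLT route recommended in this answer, just an alternative. MAIN and DESC are trimmed-down stand-ins for the two posted files (the Name and Group values are borrowed from the question's first listing), and the prefixes "p" and "i" are arbitrary. Descriptions are indexed by Id in a dictionary, and the :03d format spec does the three-digit zero padding.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-ins for file1.xml and descXML.xml.
MAIN = """<Dataizx xmlns="http://www.kklk.com/cx1/ASD">
  <EventList>
    <Event><Id define="rece">2</Id><Index>0</Index><Name>kkk</Name><Group>wwll</Group></Event>
    <Event><Id define="teller">3</Id><Index>1</Index><Name>kkk</Name><Group>wwll</Group></Event>
  </EventList>
</Dataizx>"""

DESC = """<IXXData xmlns="http://www.mnm.com/mnm/mnm">
  <EventList>
    <Event><Id define="mmm">2</Id><Description>everything was good.</Description></Event>
    <Event><Id define="ggg">3</Id><Description>weather is nice.</Description></Event>
  </EventList>
</IXXData>"""

P = {"p": "http://www.kklk.com/cx1/ASD"}
I = {"i": "http://www.mnm.com/mnm/mnm"}

# Index the descriptions by Id so each lookup is a dictionary access.
descriptions = {
    ev.findtext("i:Id", namespaces=I): ev.findtext("i:Description", default="", namespaces=I)
    for ev in ET.fromstring(DESC).iterfind(".//i:Event", I)
}

out_root = ET.Element("Newevents")
for ev in ET.fromstring(MAIN).iterfind(".//p:Event", P):
    event_id = ev.findtext("p:Id", namespaces=P)
    index = ev.findtext("p:Index", namespaces=P)
    name = ev.findtext("p:Name", default="", namespaces=P)
    group = ev.findtext("p:Group", default="", namespaces=P)

    new_ev = ET.SubElement(out_root, "event")
    ET.SubElement(new_ev, "Id").text = event_id
    ET.SubElement(new_ev, "Index").text = index
    # :03d zero-pads the number to three digits, e.g. 3 -> "003".
    ET.SubElement(new_ev, "Name").text = f"{name}{int(event_id):03d}"
    ET.SubElement(new_ev, "Group").text = f"{group}{int(index):03d}"
    # Events without a matching Id in DESC get an empty <desc>.
    ET.SubElement(new_ev, "desc").text = descriptions.get(event_id, "")

result = ET.tostring(out_root, encoding="unicode")
```

For real files, ET.parse(path).getroot() replaces ET.fromstring, and ET.ElementTree(out_root).write(path, encoding="UTF-8", xml_declaration=True) saves the result.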

Python XML Generation: Avoid multiple ns0 tag without lxml

I have a Python script that simply reads "input.xml" and copies it into "output.xml". As shown in "output.xml" below, Python's ElementTree generates ns0 and ns1 prefixes. How can I avoid these prefixes without using other XML libraries (e.g. lxml)?
Script:
import xml.etree.ElementTree as ET
fileName = "input.xml"
tree = ET.parse(fileName)
tree.write("output.xml")
Input.xml:
<Car>
<brand xmlns = "www.car.com" xmlns:brand="www.bmw.com">
<arg key="name" value="series 3" />
</brand>
<market xmlns = "www.ebay.com">
<arg key="name" value="auto"/>
</market>
</Car>
output.xml:
<Car xmlns:ns0="www.car.com" xmlns:ns1="www.ebay.com">
<ns0:brand>
<ns0:arg key="name" value="series 3" />
</ns0:brand>
<ns1:market>
<ns1:arg key="name" value="auto" />
</ns1:market>
</Car>
I am afraid there is no simple solution to this.
There is a related issue in the Python bug tracker that has been open for a while.
You could try the solution proposed there, but it does not look very clear.
My recommendation is to reconsider using lxml: it provides real power for XML processing, and Google App Engine includes it.
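A partial workaround that stays within the standard library is ET.register_namespace, which lets you choose the prefixes instead of getting ns0 and ns1. Note the limits of this sketch: it does not restore the per-element default xmlns declarations of the input, the prefix names "car" and "ebay" are arbitrary choices, and the registration is global for the process.

```python
import xml.etree.ElementTree as ET

# Same structure as Input.xml from the question.
SOURCE = """<Car>
  <brand xmlns="www.car.com"><arg key="name" value="series 3" /></brand>
  <market xmlns="www.ebay.com"><arg key="name" value="auto" /></market>
</Car>"""

# Map each namespace URI to a readable prefix before serializing.
ET.register_namespace("car", "www.car.com")
ET.register_namespace("ebay", "www.ebay.com")

root = ET.fromstring(SOURCE)
output = ET.tostring(root, encoding="unicode")
```

The serialized elements then appear as car:brand and ebay:market, with both prefixes declared on the root element.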
