Pretty print subnode without namespace declaration - python

I have an XML document and I want to extract a subnode (boundedBy) and pretty-print it exactly as it appears in the original document (with the exception of the pretty formatting).
<?xml version="1.0" encoding="UTF-8" ?>
<wfs:FeatureCollection
xmlns:sei="https://somedomain.com/namespace"
xmlns:wfs="http://www.opengis.net/wfs"
xmlns:gml="http://www.opengis.net/gml"
xmlns:ogc="http://www.opengis.net/ogc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd
https://somedomain.com/schemas/wfsnamespace some.xsd">
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>-10.934396,-139.997120 77.396455,-53.627763</gml:coordinates>
</gml:Box>
</gml:boundedBy>
<gml:featureMember>
<sei:HUB_HEIGHT_FCST>
<!--- This is the section I want --->
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates>
</gml:Box>
</gml:boundedBy>
<!--- This is the section I want --->
<sei:geometry_4326>
<gml:Point srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120</gml:coordinates>
</gml:Point>
</sei:geometry_4326>
<sei:rundatetime>2017-09-26 00:00:00</sei:rundatetime>
<sei:validdatetime>2017-09-26 17:00:00</sei:validdatetime>
</sei:HUB_HEIGHT_FCST>
</gml:featureMember>
</wfs:FeatureCollection>
Here is how I'm extracting the subnode:
from lxml import etree

# parse the xml string (xmlstr holds the document shown above)
parser = etree.XMLParser(remove_blank_text=True, remove_comments=True, recover=False, strip_cdata=False)
root = etree.fromstring(xmlstr, parser=parser)
# find the subnode I want
subnodes = root.xpath("./gml:boundedBy", namespaces={'gml': 'http://www.opengis.net/gml'})
subnode = subnodes[0]
# make a pretty output
xmlstr = etree.tostring(subnode, xml_declaration=False, encoding="UTF-8", pretty_print=True)
print(xmlstr.decode("UTF-8"))
Which gives me this. Unfortunately lxml is adding the namespaces to the boundedBy node (which makes sense for the sake of completeness in xml).
<gml:boundedBy xmlns:gml="http://www.opengis.net/gml" xmlns:sei="https://somedomain.com/namespace" xmlns:wfs="http://www.opengis.net/wfs" xmlns:ogc="http://www.opengis.net/ogc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<gml:Box srsName="EPSG:4326">
<gml:coordinates>-10.934396,-139.997120 77.396455,-53.627763</gml:coordinates>
</gml:Box>
</gml:boundedBy>
I only want the subnode as it looked in the original document.
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates>
</gml:Box>
</gml:boundedBy>
I'm flexible with not using lxml, but either way I haven't found options on how to accomplish this.
edit:
Since it was pointed out that I should explain why I want to do this...
I'm trying to log the XML fragment without altering its original structure. The automated test I'm building checks certain nodes for correctness. In the process I'm logging the fragment and want to make it a bit more readable for the person reviewing. Some of the fragments can get fairly large, which is why pretty_print is so nice.
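One possible workaround, offered as a sketch rather than a definitive answer: since the fragment is only being logged, the namespace declarations can be stripped from the serialized text after the fact. The example below uses the standard library's xml.etree.ElementTree instead of lxml (the question allows for that), assumes Python 3.9+ for ET.indent, and works on a shortened copy of the document. The helper name pretty_fragment and the regular expression are illustrative assumptions, not part of either library.

```python
import re
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# Shortened stand-in for the document in the question.
XML_TEXT = """<wfs:FeatureCollection
    xmlns:sei="https://somedomain.com/namespace"
    xmlns:wfs="http://www.opengis.net/wfs"
    xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember>
    <sei:HUB_HEIGHT_FCST>
      <gml:boundedBy><gml:Box srsName="EPSG:4326"><gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates></gml:Box></gml:boundedBy>
    </sei:HUB_HEIGHT_FCST>
  </gml:featureMember>
</wfs:FeatureCollection>"""


def pretty_fragment(node):
    """Hypothetical helper: pretty-print `node` for logging, minus xmlns declarations."""
    ET.indent(node)  # Python 3.9+; rewrites the node's whitespace in place
    text = ET.tostring(node, encoding="unicode").strip()
    # Drop every xmlns / xmlns:prefix declaration from the serialized text.
    return re.sub(r'\s+xmlns(?::[\w.-]+)?="[^"]*"', "", text)


# Keep the "gml" prefix on output instead of an auto-generated ns0.
ET.register_namespace("gml", GML_NS)
root = ET.fromstring(XML_TEXT)
# Select the boundedBy nested inside featureMember, not the top-level one.
node = root.find(".//gml:featureMember//gml:boundedBy", {"gml": GML_NS})
fragment = pretty_fragment(node)
print(fragment)
```

Keep in mind that the result is no longer namespace-well-formed XML (the gml prefix is left undeclared), so it is suitable for display in a log but not for feeding back into a parser.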

Related

python lxml pkg - how to incrementally write to an XML file using etree.xmlfile AND passing in existing elements?

I'm very new to anything XML-related, so please bear with me. I'm trying to build some code that converts rasters to KML files for Google Earth.
I've come across the lxml package which has made my life easier, but now am facing an issue.
Let's say I've created an element called kml with namespaces:
from lxml import etree
version = '2.2'
namespace_hdr = {'gx':f'http://www.google.com/kml/ext/{version}',
                 'kml':f'http://www.opengis.net/kml/{version}',
                 'atom':f'http://www.w3.org/2005/Atom'
                 }
kml = etree.Element('kml', nsmap=namespace_hdr)
And I've also created an element called Document:
Document = etree.SubElement(kml, 'Document')
Now, I have a lot of data I want to write and am running into memory issues, so I figured the best approach would be to generate the data on the fly and write it as I go, hence the incremental writing.
The approach I'm using is:
out_file = 'test.kml'
with etree.xmlfile(out_file, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element(kml):
        xf.write(Document)
Which returns the error:
TypeError: Argument must be bytes or unicode, got '_Element'
If I change kml to 'kml' it works fine, but obviously does not write the namespaces to the file that I've defined in the kml element.
How is it possible to pass in the kml element instead of a string? Is there a way to do this? Or some other way of incrementally writing to the file?
Any thoughts would be appreciated.
FYI - output when using 'kml' is:
<?xml version='1.0' encoding='utf-8'?>
<kml><Document/>
</kml>
I'm trying to achieve the same but with the namespaces:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document/>
</kml>
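A possible way around the TypeError, sketched under the assumption that the root element only needs its tag name and namespace map: xf.element() takes a tag name (not an _Element), but it also accepts an nsmap keyword, so the mapping can be passed there directly. The example writes to an in-memory buffer instead of test.kml so it can be run as-is; the Placemark children are made-up stand-ins for data generated on the fly.

```python
import io
from lxml import etree

version = "2.2"
namespace_hdr = {
    "gx": f"http://www.google.com/kml/ext/{version}",
    "kml": f"http://www.opengis.net/kml/{version}",
    "atom": "http://www.w3.org/2005/Atom",
}

out = io.BytesIO()  # stand-in for a real output file such as 'test.kml'
with etree.xmlfile(out, encoding="utf-8") as xf:
    xf.write_declaration()
    # Pass the tag as a string and the namespaces through nsmap.
    with xf.element("kml", nsmap=namespace_hdr):
        with xf.element("Document"):
            for i in range(3):
                # Build each (hypothetical) Placemark on the fly, write it, drop it.
                placemark = etree.Element("Placemark")
                etree.SubElement(placemark, "name").text = f"cell {i}"
                xf.write(placemark)

result = out.getvalue().decode("utf-8")
print(result)
```

Because each child element is created, written, and then goes out of scope, only one Placemark needs to be held in memory at a time.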

xmltodict and back again; attributes get their own tag

I'm trying to create something that imports an XML file, compares some of the values to values from another XML file or an Oracle database, and then writes it back with some values changed. I've tried simply importing the XML and then exporting it again, but that already leads to an issue: XML attributes are no longer shown as attributes within the tag; instead they get their own child tag.
I think it's the same problem as described here, in which the top answer says that the issue has been open for years. I'm hoping you guys know an elegant way to fix this, because the only thing I can think of is doing a replace after the export.
import xmltodict
from dicttoxml import dicttoxml
testfile = '<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>'
print(testfile)
print('\n')
orgfile = xmltodict.parse(testfile)
print(orgfile)
print('\n')
newfile = dicttoxml(orgfile, attr_type=False).decode()
print(newfile)
Result:
D:\python3 Test.py
<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>
OrderedDict([('Testfile', OrderedDict([('Amt', OrderedDict([('@Ccy', 'EUR'), ('#
text', '123.45')]))]))])
<?xml version="1.0" encoding="UTF-8" ?><root><Testfile><Amt><key name="@Ccy">EUR
</key><key name="#text">123.45</key></Amt></Testfile></root>
You can see the input tag Amt Ccy="EUR" gets converted to Amt with child tags.
I'm not sure which libraries you're actually using, but xmltodict has an unparse method, that does exactly what you want:
import xmltodict
testfile = '<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>'
print(testfile)
print('\n')
orgfile = xmltodict.parse(testfile)
print(orgfile)
print('\n')
newfile = xmltodict.unparse(orgfile, pretty=False)
print(newfile)
Output:
<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>
OrderedDict([('Testfile', OrderedDict([('Amt', OrderedDict([('@Ccy', 'EUR'), ('#text', '123.45')]))]))])
<?xml version="1.0" encoding="utf-8"?>
<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>

Parse deeply nested XML to pandas dataframe

I'm trying to fetch particular parts of an XML file and move them into a pandas dataframe. Even after following some xml.etree tutorials, I'm still stuck at getting the output. So far I've managed to find the child nodes, but I can't access them (i.e. I can't get the actual data out of them). Here is what I've got so far.
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within xml data
tree.findall(".//*")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
What I want is to get the data from the node programDescriptions and specifically the child programDescriptionText xml:lang="nl", and of course a couple extra. But first focus on this one.
Some data to work with:
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster@url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
<applicationOpen>true</applicationOpen>
<applicationType>individual</applicationType>
<maxNumberOfParticipants>12</maxNumberOfParticipants>
<minNumberOfParticipants>8</minNumberOfParticipants>
<paymentDue>up-front</paymentDue>
<requiredLevel>academic bachelor</requiredLevel>
<startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
<instructionMode>training</instructionMode>
<teacher>
<id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
<name>SomeName</name>
<summary xml:lang="nl">
Long text of the summary. Not needed.
</summary>
</teacher>
<studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
<programName xml:lang="nl">Program Course Name</programName>
<programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
<programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
<programDescriptionText xml:lang="nl">This part is needed from the XML.
Big program description text. This part is needed to parse from the XML file.
</programDescriptionText>
<programDescriptionHtml xml:lang="nl">Not needed;
Not needed as well;
</programDescriptionHtml>
<subjectText>
<subject>curriculum</subject>
<header1 xml:lang="nl">Beschrijving</header1>
<descriptionHtml xml:lang="nl">Yet another HTML desscription;
Not necessarily needed;</descriptionHtml>
</subjectText>
<searchword xml:lang="nl">search word</searchword>
<webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
<programRun>
<id>PR-019514</id>
<status>application opened</status>
<startDate isFinal="true">2019-06-26</startDate>
<endDate isFinal="true">2020-02-11</endDate>
</programRun>
</programSchedule>
</program>
</programs>
Try the code below: (55703748.xml contains the xml you have posted)
import xml.etree.ElementTree as ET
tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)
Output
short Program Course Name summary
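To get from there to a DataFrame, one option (a sketch, not the only approach) is to build one dict per program element and hand the list to pandas. The column choices and the field helper are assumptions made for illustration, and the sample is a trimmed copy of the posted XML. The pandas call is left as a comment so the snippet runs with only the standard library.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# xml:lang lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the XML from the question.
XML_TEXT = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
  <program>
    <expires>2019-04-21</expires>
    <programDescriptions>
      <programName xml:lang="nl">Program Course Name</programName>
      <programDescriptionText xml:lang="nl">Big program description text.</programDescriptionText>
    </programDescriptions>
  </program>
</programs>"""


def field(program, path, lang=None):
    """Hypothetical helper: stripped text of the first match (optionally by xml:lang), or None."""
    for node in program.findall(path, NS):
        if lang is None or node.get(XML_LANG) == lang:
            return (node.text or "").strip()
    return None


root = ET.fromstring(XML_TEXT)
# One dict per <program>; each key becomes a DataFrame column.
rows = [
    {
        "name": field(program, "p:programDescriptions/p:programName", "nl"),
        "description": field(program, "p:programDescriptions/p:programDescriptionText", "nl"),
        "expires": field(program, "p:expires"),
    }
    for program in root.findall("p:program", NS)
]
print(rows)
# With pandas installed: df = pandas.DataFrame(rows)
```

Going through a list of dicts keeps the XML handling separate from pandas, and missing elements simply end up as None in the corresponding column.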

Python: xml.Find always returns none

So I'm trying to search and replace the xml keyword RunCodeAnalysis inside a vcxproj file with python.
I'm pretty new to python so be gentle, but I thought it would be the simplest language to do this kind of thing.
I read a handful of similar examples and came up with the code below, but no matter what I search for the ElementTree Find call always returns None.
from xml.etree import ElementTree as et
xml = '''\
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Protected_Debug|Win32'">
<RunCodeAnalysis>false</RunCodeAnalysis>
</PropertyGroup>
</Project>
'''
et.register_namespace('', "http://schemas.microsoft.com/developer/msbuild/2003")
tree = et.ElementTree(et.fromstring(xml))
print(tree.find('.//RunCodeAnalysis'))
Here's a simplified code example online: https://ideone.com/1T1wsb
Can anyone tell me what I'm doing wrong?
OK, so @ThomWiggers helped me with the missing piece, and here's my final code in all its naive glory. There's no parameter checking or any kind of smarts yet, but it takes two parameters: the filename and whether to set static code analysis to true or false. I've got about 30 projects I want to turn it on for in nightly builds, but I really don't want it on day to day as it's just too slow.
import sys
from xml.etree import ElementTree as et
et.register_namespace('', "http://schemas.microsoft.com/developer/msbuild/2003")
tree = et.parse(sys.argv[1])
value = sys.argv[2]
for item in tree.findall('.//{http://schemas.microsoft.com/developer/msbuild/2003}RunCodeAnalysis'):
    item.text = value
for item in tree.findall('.//{http://schemas.microsoft.com/developer/msbuild/2003}EnablePREfast'):
    item.text = value
tree.write(sys.argv[1])
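For anyone landing here with the same symptom: register_namespace only controls the prefixes used when serializing. It does not change how find matches tags, so the namespace still has to appear in the search path. Besides the {uri}tag form used above, find and findall accept a prefix mapping as their second argument. A small sketch, where the msb prefix is an arbitrary name chosen for this example:

```python
from xml.etree import ElementTree as et

MSBUILD_NS = "http://schemas.microsoft.com/developer/msbuild/2003"
ns = {"msb": MSBUILD_NS}  # "msb" is an arbitrary local prefix

xml = """<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <RunCodeAnalysis>false</RunCodeAnalysis>
  </PropertyGroup>
</Project>"""

# Only affects serialization: keeps the default namespace unprefixed on output.
et.register_namespace("", MSBUILD_NS)
root = et.fromstring(xml)

unqualified = root.find(".//RunCodeAnalysis")        # no namespace in the path: no match
qualified = root.find(".//msb:RunCodeAnalysis", ns)  # namespace-aware path: match
print(unqualified, qualified.text)

qualified.text = "true"
output = et.tostring(root, encoding="unicode")
print(output)
```

The prefix mapping keeps the search paths short when several elements from the same namespace need to be updated.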

Parse Two XML using Python

I have an XML file that contains 200 Event blocks looking like below:
<?xml version='1.0' encoding='UTF-8'?>
<ProjectData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.project.com/proj1/projv" xsi:schemaLocation="http://www.pp.com/oj/p http://www.onj.com/p/IXX/schema/proj.xsd">
<fileType>This file is sample</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Id>0</Id>
<pp define="something">2</pp>
<Index>3</Index>
<Conf ref="point">CFG.AC.UF</Conf>
<Check>tttt</Check>
<Group>wwll</Group>
<Heart ref="point">mbmb</Heart>
<Name>kkk</Name>
<Thresh ref="point">kckcv</Thresh>
<Hyster ref="point">foo</Hyster>
<Trip ref="point">dim</Trip>
<Clear ref="point">CLR.AC.UF</Clear>
</Event>
</EventList>
</ProjectData>
The Event block contains the information I am interested in (only four elements: Id, Index, Name, and Group), which I want to use to generate my new XML file with Python code.
Does anyone know how I can achieve this in Python?
My new xml file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>0</Id>
<Index>3</Index>
<Name>kkk**$Id**</Name>
<Group>wwll**$Index**</Group>
<desc>placeholder</desc>
</event>
</Newevents>
I also want to append the Id and Index, which are numbers, to the Name and Group strings, zero-padded to three digits.
For example, if the Id is 1, I want my Name to be kkk001; if the Id is 3, I want my Name to be kkk003.
The same goes for my Group string, but using Index: if Index is 5, I want my Group to be wwll005.
I Googled, but the information about this is sporadic.
Can anyone come up with neat Python code that parses my XML file and generates the new XML file in the format and numbering I want above?
I have another XML file called descXML.xml that I need to parse only to take the desc element string and add it to my new XML file.
In the second XML file (descXML.xml), the desc element data should be taken based on an Id match with my new XML file.
Is it possible to check whether the Id element is equal to the Id element of my new XML file, and then add the desc element content for the corresponding Id? How can I express this condition? Can you provide an example in Python?
Here is what the descXML.xml file looks like; analogous to my first XML file, it also contains 200 Event blocks:
<EventList>
<Event>
<Mnemonic>AC.UF.SLOW</Mnemonic>
<Id define="xyz">3</Id>
<Index>13</Index>
<Description>today was warm and I want to go swimming</Description>
</Event>
</EventList>
Can 1 and 2 above merge into one python file?
The final XML file I want should look like below:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>3</Id>
<Index>13</Index>
<Name>kkk000</Name>
<Group>wwll003</Group>
<desc>today was warm and I want to go swimming</desc>
</event>
</Newevents>
Trial based on comments given below:
I tried to be concise here and tried the solution provided below, but it did not work, so I am providing my exact XML files:
My file1.xml
<?xml version='1.0' encoding='UTF-8'?>
<Dataizx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.kklk.com/cx1/ASD" xsi:schemaLocation="http://www.kklk.com/cx1/ASD http://www.kklk.com/cx1/ASD/schema/tell.xsd">
<fileType>Auto-Generated IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Mnemonic>ijk</Mnemonic>
<Id define="rece">2</Id>
<Index>0</Index>
<Config ref="point">shine</Config>
</Event>
<Event>
<Mnem>xyz</Mnem>
<Id define="teller">3</Id>
<Index>1</Index>
<Config ref="point">good</Config>
</Event>
</EventList>
</Dataizx>
And here is my xml that contains the description:
<?xml version="1.0" encoding="UTF-8"?>
<IXXData xmlns="http://www.mnm.com/mnm/mnm" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:i="http://www.mnm.com/mnm/mnm" xsi:schemaLocation="http://www.mnm.com/mnm/mnm http://www.mnm.com/mnm/mnm/schema/mnm.xsd">
<fileType>Merged IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0 + none</version>
<description>Merged event metadata.</description>
</header>
<EventList>
<Event>
<Id define="mmm">2</Id>
<Description>everything was good.</Description>
</Event>
<Event>
<Id define="lll">4</Id>
<Description>teller and the other one.</Description>
</Event>
<Event>
<Id define="ggg">3</Id>
<Description>weather is nice.</Description>
</Event>
</EventList>
</IXXData>
I used your xsl and python but I could not get description out of the second file.
Consider an XSLT solution, which can pick various nodes from the original XML and merge in nodes from an external XML file based on specific criteria. Python, like many general-purpose languages, has access to an XSLT processor, here through the third-party lxml module.
For background, XSLT is a special-purpose, declarative programming language (not an object-oriented one) for transforming XML files into various formats and structures.
For your purposes you can use XSLT's document() and concat() functions. Your XSLT is a little involved because it requires setting a variable to match Ids across documents and has quite a few namespaces to manage.
XSLT (save externally as .xsl file)
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:p="http://www.kklk.com/cx1/ASD"
               xmlns:i="http://www.mnm.com/mnm/mnm"
               exclude-result-prefixes="p i">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Wrap all converted events in a single root element -->
  <xsl:template match="/">
    <Dataroot>
      <xsl:apply-templates select="//p:EventList/p:Event"/>
    </Dataroot>
  </xsl:template>

  <xsl:template match="p:Event">
    <NewsEvents>
      <Id><xsl:value-of select="p:Id"/></Id>
      <Index><xsl:value-of select="p:Index"/></Index>
      <!-- format-number pads to three digits, e.g. 3 becomes 003 -->
      <Name><xsl:value-of select="concat(p:Name, format-number(p:Id, '000'))"/></Name>
      <Group><xsl:value-of select="concat(p:Group, format-number(p:Index, '000'))"/></Group>
      <xsl:variable name="descID" select="p:Id"/>
      <!-- Look up the matching Id in the external file; the i: prefix must be bound to that file's namespace -->
      <desc><xsl:value-of select="document('descXML.xml')/i:IXXData/i:EventList/i:Event/i:Id[text()=$descID]/following-sibling::i:Description"/></desc>
    </NewsEvents>
  </xsl:template>
</xsl:transform>
Python (loads .xml and .xsl, transforming former with the latter for new .xml output)
#!/usr/bin/python
import lxml.etree as ET
dom = ET.parse('C:\\Path\\To\\MainXML.xml')
xslt = ET.parse('C:\\Path\\To\\AboveXSLT.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
with open('C:\\Path\\To\\Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
Output (using above posted XML data)
(if descXML's Id matches any Event's Id, corresponding <desc> node below will be populated)
<?xml version='1.0' encoding='UTF-8'?>
<Dataroot>
<NewsEvents>
<Id>2</Id>
<Index>0</Index>
<Name>002</Name>
<Group>000</Group>
<desc>everything was good.</desc>
</NewsEvents>
<NewsEvents>
<Id>3</Id>
<Index>1</Index>
<Name>003</Name>
<Group>001</Group>
<desc>weather is nice.</desc>
</NewsEvents>
</Dataroot>
I know this XSLT approach may look intimidating, but it saves a lot of looping and creating of elements, subelements, and attributes in Python code. I often recommend this route whenever XML files are being handled, and I find it overlooked by programmers in general, not just Python users. Meanwhile, most readily work with another special-purpose, declarative language without question: SQL!
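For comparison, the same merge can be sketched in plain Python with the standard library. This is an alternative to the XSLT route, not a replacement for it. It assumes the two namespaces from the posted files and inlines trimmed copies of both documents. The Name and Group values (kkk, wwll) are borrowed from the first sample in the question, since the posted file1.xml has neither element, and the child_text helper is hypothetical.

```python
import xml.etree.ElementTree as ET

MAIN_NS = {"p": "http://www.kklk.com/cx1/ASD"}
DESC_NS = {"i": "http://www.mnm.com/mnm/mnm"}

# Trimmed stand-ins for file1.xml and descXML.xml (Name/Group values are assumed).
MAIN_XML = """<Dataizx xmlns="http://www.kklk.com/cx1/ASD"><EventList>
  <Event><Id define="rece">2</Id><Index>0</Index><Name>kkk</Name><Group>wwll</Group></Event>
  <Event><Id define="teller">3</Id><Index>1</Index><Name>kkk</Name><Group>wwll</Group></Event>
</EventList></Dataizx>"""

DESC_XML = """<IXXData xmlns="http://www.mnm.com/mnm/mnm"><EventList>
  <Event><Id define="mmm">2</Id><Description>everything was good.</Description></Event>
  <Event><Id define="ggg">3</Id><Description>weather is nice.</Description></Event>
</EventList></IXXData>"""


def child_text(event, tag, prefix, ns):
    """Hypothetical helper: stripped text of a direct child, or '' if absent."""
    return event.findtext(f"{prefix}:{tag}", default="", namespaces=ns).strip()


# Index the descriptions by Id so each lookup is a dict access.
descriptions = {
    child_text(event, "Id", "i", DESC_NS): child_text(event, "Description", "i", DESC_NS)
    for event in ET.fromstring(DESC_XML).iterfind(".//i:Event", DESC_NS)
}

new_root = ET.Element("Newevents")
for event in ET.fromstring(MAIN_XML).iterfind(".//p:Event", MAIN_NS):
    event_id = child_text(event, "Id", "p", MAIN_NS)
    index = child_text(event, "Index", "p", MAIN_NS)
    out = ET.SubElement(new_root, "event")
    ET.SubElement(out, "Id").text = event_id
    ET.SubElement(out, "Index").text = index
    # Zero-pad to three digits: Id 3 -> "kkk003", Index 1 -> "wwll001".
    ET.SubElement(out, "Name").text = f"{child_text(event, 'Name', 'p', MAIN_NS)}{int(event_id):03d}"
    ET.SubElement(out, "Group").text = f"{child_text(event, 'Group', 'p', MAIN_NS)}{int(index):03d}"
    # Events without a matching Id keep the placeholder text.
    ET.SubElement(out, "desc").text = descriptions.get(event_id, "placeholder")

result = ET.tostring(new_root, encoding="unicode")
print(result)
```

Building the Id-to-description dictionary first means the description file is scanned once, rather than once per event in the main file.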
