Incorrect Value When Parsing XML File - python

I am trying to get the ImageVersion number from an xml file.
This is the code I have:
from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')  # the path must be passed as a string
root = tree.getroot()
# getchildren() was removed in Python 3.9; index the root element instead
siteImageVersion = root[0].attrib['ImageVersion']
The xml file looks like this
<!--InputFile D:/OutputFiles/Config.xml was parsed-->
<Configuration xmlns="http://....xsd" version="3">
<TesterRecord TimeStamp="2020-09-04T02:07:51-07:00" Name="SomeName" IPAddress="IPAddress" SystemId="Id" Version="0.1.0.1.00003" ImageVersion="Test_XXX_3.10.5.1" CellIndex="33" GeneratedBy="Name" Other="N/A">
</TesterRecord>
</Configuration>
I would expect the output to be Test_XXX_3.10.5.1. But for some reason I am getting this output instead: Test_XXX_3.10.4.2. I have no idea how the number changed; there is no 3.10.4.2 anywhere in the XML file.

Are you sure that you are reading the correct file?
(Sometimes it is merely the correct processing on the wrong data.)
Is there a file anywhere in that directory that does have "Test_XXX_3.10.4.2"?
Delete/Move/Rename it and see what happens.
Caching could also be a cause, if you are accessing the data from a remote source. You might not be getting the updated file, but getting the old cached version. Try a brand new file and see what happens.
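One way to check which file is really being read is to print the resolved path, its modification time, and every ImageVersion the parser sees. This is only a sketch: describe_config is a hypothetical helper name, and the temporary file below is a stand-in for the real Config.xml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def describe_config(path):
    """Return the absolute path, modification time, and all ImageVersion values found."""
    resolved = os.path.abspath(path)
    root = ET.parse(resolved).getroot()
    # Attributes are not namespaced, so a plain attribute lookup works on every element.
    versions = [el.get("ImageVersion") for el in root.iter() if "ImageVersion" in el.attrib]
    return resolved, os.path.getmtime(resolved), versions

# Stand-in file so the sketch runs as-is; pass the real path in practice.
demo_path = os.path.join(tempfile.mkdtemp(), "Config.xml")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write(
        '<!--InputFile D:/OutputFiles/Config.xml was parsed-->\n'
        '<Configuration xmlns="http://....xsd" version="3">\n'
        '<TesterRecord Name="SomeName" ImageVersion="Test_XXX_3.10.5.1"></TesterRecord>\n'
        '</Configuration>\n'
    )

resolved, modified, versions = describe_config(demo_path)
print(resolved, modified, versions)
```

If the printed path or modification time is not what you expect, the script is probably picking up a different (or stale) copy of the file.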

Related

Can't parse XML tree without bounds element from Pyosmium

I downloaded some data from OpenStreetMap and have been filtering it so that I only have the nodes and ways I need for my project (highways and the nodes they reference). To filter the XML file and create a new one, I use the Pyosmium library. Everything works except that I can't parse the new XML file with xml.etree.ElementTree. When I write my data into the new file, I am not carrying over the bounds element that contains the min and max longitude and latitude. If I manually copy in the bounds, it parses.
I read through the Pyosmium docs and only found osmium.io.Reader and osmium.io.Header, as well as some geometry attributes that describe the box (containing what I need), but I found nothing on how to get it from my file and write it to the new one with my writer.
So far this is what I have in my main method, which just handles the nodes and ways using SimpleHandlers:
wayHandler = XMLhandlers.StreetHandler()
nodeHandler = XMLhandlers.NodeHandler()
wayHandler.apply_file('data/map_2.osm')
nodeHandler.apply_file('data/map_2.osm')
if os.path.exists('data/map_2_TEST.osm'):
    os.remove('data/map_2_TEST.osm')
writer = XMLhandlers.wayWriter('data/map_2_TEST.osm')
writer.apply_file('data/map_2.osm')
tree = ET.parse('data/map_2_TEST.osm')
This produces the following error:
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
Pastebin of original XML file: https://pastebin.com/i8uyCneC
Pastebin of sorted XML file that wont parse: https://pastebin.com/WZUcsZg4
EDIT:
The error is not in the parsing itself. If I comment out the part that generates the new XML and only parse the new XML file (generated beforehand), it works.
EDIT 2:
The error was that I forgot to call close() on my SimpleWriter to flush the remaining buffers and close the writer.
The issue occurs because the code never closes the writer when it is done, so the output file is still empty (or incomplete) when ET.parse() reads it. Calling writer.close() flushes the remaining buffers and closes the writer.
The following code has the line added, and the tree parses as expected.
wayHandler = XMLhandlers.StreetHandler()
nodeHandler = XMLhandlers.NodeHandler()
wayHandler.apply_file('data/map_2.osm')
nodeHandler.apply_file('data/map_2.osm')
if os.path.exists('data/map_2_TEST.osm'):
    os.remove('data/map_2_TEST.osm')
writer = XMLhandlers.wayWriter('data/map_2_TEST.osm')
writer.apply_file('data/map_2.osm')
writer.close()
tree = ET.parse('data/map_2_TEST.osm')
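The same failure mode can be reproduced without Pyosmium. The sketch below uses an ordinary buffered file object as a stand-in for the SimpleWriter (an analogy, not Pyosmium's implementation): the data is still in the write buffer when the first parse runs, so the parser sees an empty file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

path = os.path.join(tempfile.mkdtemp(), "demo.osm")

out = open(path, "w", encoding="utf-8")
out.write("<osm><node id='1'/></osm>")  # small write: stays in the buffer for now

# Parsing before close(): the file on disk is still empty.
try:
    ET.parse(path)
    parsed_before_close = True
    error_before_close = ""
except ET.ParseError as exc:
    parsed_before_close = False
    error_before_close = str(exc)

out.close()  # flushes the buffer to disk

# Parsing after close(): the full document is there.
root_after_close = ET.parse(path).getroot()
print(parsed_before_close, error_before_close, root_after_close.tag)
```

Closing the writer in a try/finally block (or as soon as the last apply_file() call returns) guards against parsing a half-written file.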

Processing large xml files. Only root tree children attributes are relevant

I'm new to xml and python and I hope that I phrased my problem right:
I have xml files with a size of one gigabyte.
The files look like this:
<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS">
Stuff I don't care about
</step>
<step ID="1" step="NameOfStep2" result="PASS">
Stuff I don't care about
</step>
</test>
For fast analysis I want to get the name and the result of the steps, which are the children of the root element. "Stuff I don't care about" stands for lots of nested elements.
I have already tried following:
tree = ET.parse(xmlLocation)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
Here I get a memory error because the files are too big.
Then I tried:
try:
    for event, elem in ET.iterparse(pathToSteps, events=("start", "end")):
        if elem.tag == "step" and event == "start":
            stepAndResult.append([elem.attrib['step'], elem.attrib['result'], "System1"])
        elem.clear()
except (OSError, ET.ParseError):
    print("ERROR: Unable to open and parse file !!!")
This works but is really slow. I guess it iterates through all elements and this takes a very long time.
Then I found a solution looking like this:
try:
    tree = ET.iterparse(pathToSteps, events=("start","end"))
    _, root = next(tree)
    print('ROOT:', root.tag)
except:
    print("ERROR: Unable to open and parse file !!!")
for child in root:
    print(child.attrib)
But this prints only the attributes of the first step.
Is there a way to speed up the working solution?
Since I'm pretty new to this stuff I would appreciate a complete example or a reference where I can figure it out by myself with an example.
I think you're on the right track with iterparse().
Maybe try specifying the step element name in the tag argument and only processing "start" events...
from lxml import etree
for event, elem in etree.iterparse("input.xml", tag="step", events=("start",)):
    print(elem.attrib)
    elem.clear()
EDIT: For some reason I thought you were using lxml and not ElementTree. My answer would require you to switch to lxml.
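If switching to lxml is not an option, something similar can be sketched with the standard library's ElementTree by tracking nesting depth and only looking at direct children of the root. The function name iter_top_level_steps and the nested sample content are made up for illustration. This still parses the whole file; it only limits the Python-side work and keeps memory flat.

```python
import io
import xml.etree.ElementTree as ET

def iter_top_level_steps(source, tag="step"):
    """Yield (step, result) for each direct child <tag> of the root element."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            depth += 1
            # Attributes are already available on "start" events.
            if depth == 2 and elem.tag == tag:
                yield elem.get("step"), elem.get("result")
        else:
            depth -= 1
            elem.clear()          # drop the finished element's content
            if depth == 1:
                root.clear()      # drop references to finished top-level children

sample = b"""<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS"><inner><step step="nested" result="FAIL"/></inner></step>
<step ID="1" step="NameOfStep2" result="PASS">text</step>
</test>"""

steps = list(iter_top_level_steps(io.BytesIO(sample)))
print(steps)
```

Pass a file path instead of the BytesIO object to run it against a real file; nested step elements are ignored because they sit deeper than level 2.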
Without knowing the specifics of your setup, it is hard to guess what the 'fastest possible' might be and how much of the delay is due to parsing the file. The first thing I would do is time the run so you have an initial benchmark. Then I would write a simple Python program that does nothing but read the file from disk (no XML parsing). If the time difference is not significant, then XML parsing isn't the issue; reading the file from disk is. Also note that an XML document gives no indication of where the next tag ends, so the I/O for the portions you don't care about can't be skipped (you still need to read the file linearly). Other than potentially using a different (non-interpreted) programming language, there may not be much you can do.
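A rough way to make that comparison is sketched below. The helper names time_raw_read and time_iterparse are hypothetical, and the tiny temporary file only exists so the snippet runs as-is; point path at the real file to get meaningful numbers.

```python
import os
import tempfile
import time
import xml.etree.ElementTree as ET

def time_raw_read(path, chunk_size=1 << 20):
    """Seconds needed to read the file in chunks, with no XML parsing."""
    start = time.perf_counter()
    with open(path, "rb") as fh:
        while fh.read(chunk_size):
            pass
    return time.perf_counter() - start

def time_iterparse(path):
    """Seconds needed to stream the file through ElementTree's iterparse."""
    start = time.perf_counter()
    for _event, elem in ET.iterparse(path, events=("end",)):
        elem.clear()
    return time.perf_counter() - start

# Stand-in file so the sketch runs as-is.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "w", encoding="utf-8") as fh:
    fh.write('<test name="T" result="PASS"><step ID="0" step="S" result="PASS"/></test>')

read_seconds = time_raw_read(path)
parse_seconds = time_iterparse(path)
print(f"raw read: {read_seconds:.4f}s, iterparse: {parse_seconds:.4f}s")
```

Keep in mind that the operating system may cache the file after the first pass, so run each measurement a few times before drawing conclusions.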
If you do get a significant slowdown from the actual XML parsing, you could then potentially try to pre-process the file into a different one. Since the file format of your files is very static, you could read the file and output to a different file (using a regex) until you get the tag. Then just throw out the data until you close the </step> tag or </test> tag. That will result in a valid, but hopefully much smaller XML file. The key here would be to do the 'parsing' yourself instead of having the underlying parser try to understand all of the document format, which could be much faster since your format is simple. You could then run your original program on this output which will not 'see' any of the extraneous tags. Of course, this breaks if you actually have nested <step> tags, but if that is the case, then you likely need to parse the file with a real XML parser to understand where the first-level starts and stops.
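A minimal sketch of that do-it-yourself scan, under several assumptions that may not hold for the real files: each opening step tag fits on one line, the step attribute comes before result, and no step elements are nested inside others. STEP_RE and scan_steps are illustrative names.

```python
import io
import re

# Matches an opening <step ...> tag and captures its step and result attributes.
STEP_RE = re.compile(rb'<step\b[^>]*\bstep="([^"]*)"[^>]*\bresult="([^"]*)"')

def scan_steps(stream):
    """Yield (step, result) pairs from a binary stream, one line at a time."""
    for line in stream:
        match = STEP_RE.search(line)
        if match:
            yield match.group(1).decode(), match.group(2).decode()

sample = b"""<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS">
Stuff I don't care about
</step>
<step ID="1" step="NameOfStep2" result="PASS">
Stuff I don't care about
</step>
</test>"""

found = list(scan_steps(io.BytesIO(sample)))
print(found)
```

For a real file, open it in binary mode and pass the file object to scan_steps. Because this is not a real XML parser, it will also report nested step tags if any exist.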

How to read XML file into Pandas Dataframe like Read XML Table in Excel

I have an xml file and I am trying to iterate through the tags to convert it to a pandas dataframe. My current process is to open the XML file with excel as an "XML table" but this takes forever. Trying to find a similar process in Python.
I am trying to follow along with the code presented on numerous other Stack Overflow questions and articles such as here here and here
I believe there are 2 problems I am facing:
Does having the namespace affect my xml?
I don't want to specify all of my tags as seen as a solution in 19.7.1.6. of the Element Tree documentation. I just want all of my tags to appear as a column for each "Security." If it doesn't have that tag it should be null. I also do not want to do a nasty if-else.
The problem is that when I run the code:
import xml.etree.ElementTree as et
etree = et.parse(xml_path)
test = etree.getroot()
and try and iterate as suggested in the above links, I am not able to easily access the child nodes.
Sample File:
<?xml version="1.0"?>
<SecurityInformation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://tempuri.org/SecurityInformation.xsd">
<Security>
<Country>United States</Country>
</Security>
</SecurityInformation>
I've made a package for a similar use case. It could work here too.
pip install pandas_read_xml
you can do something like
import pandas_read_xml as pdx
df = pdx.read_xml('filename.xml', ['SecurityInformation'])
To flatten, you could
df = pdx.flatten(df)
or
df = pdx.fully_flatten(df)
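As an alternative that needs no extra parsing package, the rows can be collected with ElementTree alone. On the namespace question: the default xmlns does affect parsing, because every tag comes back as "{http://tempuri.org/SecurityInformation.xsd}Security" and so on. The sketch below strips that prefix with a hypothetical local_name helper; the second Security element (with Currency) is invented to show a missing column.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<SecurityInformation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://tempuri.org/SecurityInformation.xsd">
<Security>
<Country>United States</Country>
</Security>
<Security>
<Currency>USD</Currency>
</Security>
</SecurityInformation>"""

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]

root = ET.fromstring(SAMPLE)

# One dict per <Security>, keyed by whatever child tags that element has.
rows = [
    {local_name(child.tag): child.text for child in security}
    for security in root
    if local_name(security.tag) == "Security"
]
print(rows)
```

Passing rows to pandas.DataFrame(rows) then gives one column per tag, with NaN where a Security element lacks that tag, so no per-tag if-else is needed.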

python lxml pkg - how to incrementally write to an XML file using etree.xmlfile AND passing in existing elements?

I'm very new to anything XML related, so please bear with me. I'm trying to build some code that converts rasters to KML files for Google Earth.
I've come across the lxml package which has made my life easier, but now am facing an issue.
Let's say I've created an element called kml with namespaces:
from lxml import etree
version = '2.2'
namespace_hdr = {'gx':f'http://www.google.com/kml/ext/{version}',
'kml':f'http://www.opengis.net/kml/{version}',
'atom':f'http://www.w3.org/2005/Atom'
}
kml = etree.Element('kml', nsmap=namespace_hdr)
And I've also created an element called Document:
Document = etree.SubElement(kml, 'Document')
Now, I have a lot of data I want to write and am running into memory issues, so I figured the best approach would be to generate the data on the fly and write it as I go, hence the incremental writing.
The approach I'm using is:
out_file = 'test.kml'
with etree.xmlfile(out_file, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element(kml):
        xf.write(Document)
Which returns the error:
TypeError: Argument must be bytes or unicode, got '_Element'
If I change kml to 'kml' it works fine, but obviously does not write the namespaces to the file that I've defined in the kml element.
How is it possible to pass in the kml element instead of a string? Is there a way to do this? Or some other way of incrementally writing to the file?
Any thoughts would be appreciated.
FYI - output when using 'kml' is:
<?xml version='1.0' encoding='utf-8'?>
<kml><Document/>
</kml>
I'm trying to achieve the same but with the namespaces:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document/>
</kml>
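One possible approach, sketched here rather than offered as a definitive answer: xf.element() takes a tag name, not an element, but it also accepts an nsmap argument, so the namespaces can be passed there. The Placemark and name children are hypothetical stand-ins for data generated on the fly, and a BytesIO buffer replaces 'test.kml' so the snippet is self-contained.

```python
import io
from lxml import etree

version = '2.2'
namespace_hdr = {
    'gx': f'http://www.google.com/kml/ext/{version}',
    'kml': f'http://www.opengis.net/kml/{version}',
    'atom': 'http://www.w3.org/2005/Atom',
}

buf = io.BytesIO()  # stand-in for the output file 'test.kml'
with etree.xmlfile(buf, encoding='utf-8') as xf:
    xf.write_declaration()
    # Open <kml> by name and attach the namespace declarations to it.
    with xf.element('kml', nsmap=namespace_hdr):
        with xf.element('Document'):
            for i in range(3):
                # Build one small element at a time and write it out immediately.
                placemark = etree.Element('Placemark')
                etree.SubElement(placemark, 'name').text = f'item {i}'
                xf.write(placemark)
                placemark = None  # release it before building the next one

kml_bytes = buf.getvalue()
print(kml_bytes.decode('utf-8'))
```

Only the element currently being written lives in memory, which is the point of incremental writing; the enclosing kml and Document tags are opened and closed by the context managers.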

Updating an Existing XML Document in Python

I have a large XML file whose structure is approximately as follows:
<GROUNDTRUTH>
<thing fileName="1" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="2" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="3" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
</GROUNDTRUTH>
I don't think I was clear enough in the original posting of this question. I have an xml document called GROUNDTRUTH, and inside of that I have several thousand "things". I want to search through all of the things in the document via filename and then change an attribute. So if I was searching for fileName="2", I would change its attribute to attrib=x. And for some thing, perhaps I'd go down to the sub level and change moreStuff.
My plan is to store into a csv file the names of the 'things' I need to change, and what I want to change the value of 'attrib' to. What function or module will provide this kind of functionality? Or am I just missing an easy/obvious approach? Ultimately I'd like to have a working script that will take a csv file with the thing identifier, and value to be updated, and take the xml file to make those changes onto.
Thanks for your help and suggestions!
First, you can transform the original xml file into an outputted xml file using an xslt stylesheet which can modify xml files in any way, shape, or form such as modifying, re-structuring, re-ordering attributes, elements, etc. Do note xsl is a declarative special-purpose language to transform and render XML documents.
Then, you can use Python's lxml library to run the transformation:
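#!/usr/bin/python
import lxml.etree as ET
# Parse the source document and the XSLT stylesheet
dom = ET.parse('originalfile.xml')
xslt = ET.parse('transformfile.xsl')
# Compile the stylesheet and apply it to the source document
transform = ET.XSLT(xslt)
newdom = transform(dom)
# Serialize the result; 'wb' overwrites the file instead of appending to it
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
with open('finalfile.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)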
By the way, PHP, Java, C, VB, or pretty much any language, even your everyday browser can run transformations! To have the browser run it, simply add stylesheet in header:
<?xml-stylesheet type="text/xsl" href="transformfile.xsl"?>
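For the CSV-driven update described in the question, a plain ElementTree loop may be simpler than a stylesheet. The sketch below assumes a hypothetical CSV layout with fileName and attrib columns, and uses in-memory strings in place of the real files.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_TEXT = """<GROUNDTRUTH>
<thing fileName="1" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
<thing fileName="2" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
</GROUNDTRUTH>"""

# Assumed CSV layout: one row per thing to change, with its new attrib value.
CSV_TEXT = "fileName,attrib\n2,x\n"

# Map each fileName to the new value for quick lookup.
updates = {
    row["fileName"]: row["attrib"]
    for row in csv.DictReader(io.StringIO(CSV_TEXT))
}

root = ET.fromstring(XML_TEXT)
for thing in root.iter("thing"):
    new_value = updates.get(thing.get("fileName"))
    if new_value is not None:
        thing.set("attrib", new_value)

result = ET.tostring(root, encoding="unicode")
print(result)
```

With real files, ET.parse(path) followed by tree.write(output_path) replaces the fromstring/tostring calls; child values such as moreStuff can be changed the same way via thing.find("SUBSUB").set(...).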
