Updating an Existing XML Document in Python - python

I have a large XML file whose structure is approximately as follows:
<GROUNDTRUTH>
<thing fileName="1" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="2" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="3" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
</GROUNDTRUTH>
I don't think I was clear enough in the original posting of this question. I have an xml document called GROUNDTRUTH, and inside of that I have several thousand "things". I want to search through all of the things in the document via filename and then change an attribute. So if I was searching for fileName="2", I would change its attribute to attrib=x. And for some thing, perhaps I'd go down to the sub level and change moreStuff.
My plan is to store into a csv file the names of the 'things' I need to change, and what I want to change the value of 'attrib' to. What function or module will provide this kind of functionality? Or am I just missing an easy/obvious approach? Ultimately I'd like to have a working script that will take a csv file with the thing identifier, and value to be updated, and take the xml file to make those changes onto.
Thanks for your help and suggestions!

First, write an XSLT stylesheet that transforms the original XML file into the output XML file. XSLT can modify XML in almost any way: changing values, restructuring, reordering attributes and elements, and so on. Note that XSLT is a declarative, special-purpose language for transforming and rendering XML documents.
Then, you can use Python's lxml library to run the transformation:
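#!/usr/bin/python
import lxml.etree as ET

# Load the source document and the stylesheet
dom = ET.parse('originalfile.xml')
xslt = ET.parse('transformfile.xsl')

# Compile the stylesheet and apply it to the document
transform = ET.XSLT(xslt)
newdom = transform(dom)

# Serialize the result; 'wb' overwrites the output file instead of appending to it
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
with open('finalfile.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)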
By the way, PHP, Java, C, VB, or pretty much any other language, and even your everyday browser, can run XSLT transformations. To have the browser run it, add a stylesheet processing instruction at the top of the XML file:
<?xml-stylesheet type="text/xsl" href="transformfile.xsl"?>
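If the changes are simple attribute updates driven by a CSV file, as described in the question, the standard library may be enough without XSLT. The following is only a sketch under assumptions: the CSV column names (fileName, attrib) and the helper apply_updates are made up for illustration, and the XML and CSV are inlined so the example runs on its own.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Small stand-in for the real GROUNDTRUTH file
XML_TEXT = """<GROUNDTRUTH>
<thing fileName="1" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="2" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
</GROUNDTRUTH>"""

# Hypothetical CSV layout: one row per thing to change
CSV_TEXT = "fileName,attrib\n2,x\n"


def apply_updates(root, rows):
    """Set 'attrib' on every <thing> whose fileName appears in rows."""
    updates = {row["fileName"]: row["attrib"] for row in rows}
    changed = 0
    for thing in root.iter("thing"):
        new_value = updates.get(thing.get("fileName"))
        if new_value is not None:
            thing.set("attrib", new_value)
            changed += 1
    return changed


root = ET.fromstring(XML_TEXT)
changed = apply_updates(root, csv.DictReader(io.StringIO(CSV_TEXT)))
print(changed, ET.tostring(root, encoding="unicode"))
```

With real files you would load the document with ET.parse(path), pass tree.getroot() to the helper, and save with tree.write(path). Changing the sub level works the same way, for example thing.find("SUBSUB").set("moreStuff", value).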

Related

Processing large xml files. Only root tree children attributes are relevant

I'm new to xml and python and I hope that I phrased my problem right:
I have xml files with a size of one gigabyte.
The files look like this:
<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS">
Stuff I don't care about
</step>
<step ID="1" step="NameOfStep2" result="PASS">
Stuff I don't care about
</step>
</test>
For fast analysis, I want to get the name and the result of the steps, which are the children of the root element. "Stuff I don't care about" stands for lots of nested elements.
I have already tried the following:
tree = ET.parse(xmlLocation)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
Here I get a memory error because the files are too big.
Then I tried:
for event, elem in ET.iterparse(pathToSteps, events=("start","end")):
    if elem.tag == "step" and event == "start":
        stepAndResult.append([elem.attrib['step'],elem.attrib['result'],"System1"])
    elem.clear()
This works but is really slow. I guess it iterates through all elements and this takes a very long time.
Then I found a solution looking like this:
try:
    tree = ET.iterparse(pathToSteps, events=("start","end"))
    _, root = next(tree)
    print('ROOT:', root.tag)
except:
    print("ERROR: Unable to open and parse file !!!")
for child in root:
    print(child.attrib)
But this prints only the attributes of the first step.
Is there a way to speed up the working solution?
Since I'm pretty new to this stuff I would appreciate a complete example or a reference where I can figure it out by myself with an example.
I think you're on the right track with iterparse().
Maybe try specifying the step element name in the tag argument and only processing "start" events...
from lxml import etree
for event, elem in etree.iterparse("input.xml", tag="step", events=("start",)):
    print(elem.attrib)
    elem.clear()
EDIT: For some reason I thought you were using lxml and not ElementTree. My answer would require you to switch to lxml.
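If you would rather stay with the standard library's ElementTree, something along these lines might work. It is a sketch, not a benchmark-backed fix: the parser still reads every nested element, so it mainly keeps memory flat rather than guaranteeing a speed-up. The function name top_level_steps and the inlined sample are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in for the real one-gigabyte file
SAMPLE = b"""<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS"><detail><more>x</more></detail></step>
<step ID="1" step="NameOfStep2" result="PASS"><detail/></step>
</test>"""


def top_level_steps(source):
    """Yield (step, result) for each <step> that is a direct child of the root."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem
            elif depth == 2 and elem.tag == "step":
                # Attributes are already populated when the start event fires
                yield elem.get("step"), elem.get("result")
        else:
            depth -= 1
            if depth == 1:
                # A direct child of the root just closed: drop it and
                # everything beneath it so the tree does not keep growing
                root.clear()


steps = list(top_level_steps(io.BytesIO(SAMPLE)))
print(steps)
```

Tracking the depth means nested elements that happen to be called step are ignored, and only one top-level step is held in memory at a time.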
Without knowing the specifics of your setup, it is hard to guess what the 'fastest possible' would be, or how much of the delay is due to parsing the file. The first thing I would do is time the run so you have a baseline. Then I would write a simple Python program that does nothing but read the file from disk (no XML parsing). If the time difference is not significant, then the XML parsing isn't the issue; reading the file from disk is. Nothing in an XML document indicates where the current element ends, so the I/O for the portions you don't care about can't be skipped (you still need a linear read of the file). Other than potentially using a different, compiled programming language, there may not be much you can do.
If you do get a significant slowdown from the actual XML parsing, you could try to pre-process the file into a different one. Since the format of your files is very static, you could read the file and copy it to a different file (using a regex) until you reach a <step> opening tag, then throw out the data until the closing </step> or </test> tag. That will result in a valid, but hopefully much smaller, XML file. The key here is to do the 'parsing' yourself instead of having the underlying parser try to understand the whole document, which could be much faster since your format is simple. You could then run your original program on this output, which will not 'see' any of the extraneous tags. Of course, this breaks if you actually have nested <step> tags; in that case you likely need a real XML parser to understand where the first level starts and stops.
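A rough sketch of that do-it-yourself scan is below. It rests on assumptions that may not hold for your files: each <step ...> opening tag sits on a single line, attribute values are double-quoted and contain no ">" character, and there are no nested <step> elements. The names OPEN_STEP, ATTRIBUTE and scan_steps are invented for the example.

```python
import io
import re

# Small stand-in for the real file
SAMPLE = """<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS">
<detail>Stuff I don't care about</detail>
</step>
<step ID="1" step="NameOfStep2" result="FAIL">
<detail>Stuff I don't care about</detail>
</step>
</test>
"""

# Matches a <step ...> opening tag and captures its attribute text
OPEN_STEP = re.compile(r"<step\s([^>]*)>")
# Matches name="value" pairs inside the captured attribute text
ATTRIBUTE = re.compile(r'([\w:.-]+)="([^"]*)"')


def scan_steps(lines):
    """Yield (step, result) for every line that contains a <step ...> tag."""
    for line in lines:
        match = OPEN_STEP.search(line)
        if match:
            attrs = dict(ATTRIBUTE.findall(match.group(1)))
            yield attrs.get("step"), attrs.get("result")


steps = list(scan_steps(io.StringIO(SAMPLE)))
print(steps)
```

For a real file you would pass an open file object instead of the StringIO. Note that this does not decode entity references such as &amp; in attribute values.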

python lxml pkg - how to incrementally write to an XML file using etree.xmlfile AND passing in existing elements?

I'm very new to anything XML-related, so please bear with me. I'm trying to build some code that converts rasters to KML files for Google Earth.
I've come across the lxml package, which has made my life easier, but I'm now facing an issue.
Let's say I've created an element called kml with namespaces:
from lxml import etree
version = '2.2'
namespace_hdr = {'gx':f'http://www.google.com/kml/ext/{version}',
'kml':f'http://www.opengis.net/kml/{version}',
'atom':f'http://www.w3.org/2005/Atom'
}
kml = etree.Element('kml', nsmap=namespace_hdr)
And I've also created an element called Document:
Document = etree.SubElement(kml, 'Document')
Now, I have a lot of data I want to write and am running into memory issues, so I figured the best approach would be to generate my data on the fly and write it as I go, hence the incremental writing.
The approach I'm using is:
out_file = 'test.kml'
with etree.xmlfile(out_file, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element(kml):
        xf.write(Document)
Which returns the error:
TypeError: Argument must be bytes or unicode, got '_Element'
If I change kml to 'kml' it works fine, but obviously does not write the namespaces to the file that I've defined in the kml element.
How is it possible to pass in the kml element instead of a string? Is there a way to do this? Or some other way of incrementally writing to the file?
Any thoughts would be appreciated.
FYI - output when using 'kml' is:
<?xml version='1.0' encoding='utf-8'?>
<kml><Document/>
</kml>
I'm trying to achieve the same but with the namespaces:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document/>
</kml>
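One possible approach, sketched here rather than offered as the definitive answer: xf.element() takes a tag name, not an element, but it also accepts an nsmap keyword, so the namespace mapping can be passed there instead of through a pre-built kml element. The Placemark contents and the in-memory buffer are assumptions made so the example runs on its own; with a real file you would pass the output path as in the question.

```python
import io
from lxml import etree

version = '2.2'
namespace_hdr = {
    'gx': f'http://www.google.com/kml/ext/{version}',
    'kml': f'http://www.opengis.net/kml/{version}',
    'atom': 'http://www.w3.org/2005/Atom',
}

buffer = io.BytesIO()  # stands in for 'test.kml'
with etree.xmlfile(buffer, encoding='utf-8') as xf:
    xf.write_declaration()
    # Pass the tag as a string and the namespaces via nsmap
    with xf.element('kml', nsmap=namespace_hdr):
        with xf.element('Document'):
            for i in range(3):
                # Build one small element at a time and write it out,
                # so the whole document never has to be held in memory
                placemark = etree.Element('Placemark')
                etree.SubElement(placemark, 'name').text = f'cell {i}'
                xf.write(placemark)

output = buffer.getvalue()
print(output.decode('utf-8'))
```

Each element passed to xf.write() is serialized immediately and can be discarded afterwards.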

how to use QWebView of PyQt to display xml with xslt applied

According to PythonCentral :
QWebView ... allows you to display web pages from URLs, arbitrary HTML, XML with XSLT stylesheets, web pages constructed as QWebPages, and other data whose MIME types it knows how to interpret
However, the xml contents are displayed as if it were interpreted as html, that is, the tags filtered away and the textnodes shown w/o line breaks.
Question is: how do I show xml in QWebView with the xsl style sheet applied?
The same xml-file opened in any stand-alone webbrowser shows fine. The html-file resulted from the transformed xml (by lxml.etree) also displays well in QWebView.
Here is my (abbreviated) xml file:
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="../../page.xsl"?>
<specimen>
...
</specimen>
OK, I found part of the solution. It's a multi-step approach using QXmlQuery:
path = base + "16000-16999/HF16019"
xml = os.path.join(path, "specimen.xml")
xsl = os.path.join(path, "../../page.xsl")
app = QApplication([])
query = QXmlQuery(QXmlQuery.XSLT20)
query.setFocus(QUrl("file:///" + xml))
query.setQuery(QUrl("file:///" + xsl))
out = query.evaluateToString()
win = QWebView()
win.setHtml(out)
win.show()
app.exec_()
Evidently, the xslt is applied this way. What's still wrong is that the css style sheets referenced in the xslt are not applied/found.
I came across your question because I had a similar problem. I thought I'd post the solution I found to the problem as well, because it works without QXmlQuery and is rather simple.
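For my solution, my xml file was also interpreted as HTML, so I just worked with that and replaced every & with &amp;, every < with &lt; and every > with &gt; as mentioned in this answer. The & has to be replaced first; otherwise the ampersands introduced by the other two replacements would be escaped again.
So, for your xmlString, just do:
escaped = xmlString.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")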
This way, if your xml file gets interpreted as html it will at least show the text correctly with all the tags.
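The standard library's html.escape does the same replacements in a safe order, so a sketch like the following may be simpler. The sample string and the surrounding <pre> wrapper are assumptions for illustration; <pre> is used only so that line breaks in the XML survive when the result is shown as HTML.

```python
import html

# Hypothetical XML content to display verbatim
xml_string = '<specimen id="HF16019">Tom & Jerry</specimen>'

# quote=False leaves double quotes alone; &, < and > are escaped
escaped = html.escape(xml_string, quote=False)

# Wrap in <pre> so whitespace and line breaks are preserved
page = "<pre>" + escaped + "</pre>"
print(page)
```

The resulting page string could then be handed to win.setHtml(page).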

Accessing non tree structured xml data in python

I have several xml files that I want to parse in python. I am aware of the ElementTree package in python, however my xml files aren't stored in a tree like structure. Below is an example
<tag1 attribute1="at1" attribute2="at2">My files are text that I annotated with a tool
to create these xml files.</tag1>
Some parts of the text are enclosed in an xml tag, whereas others are not.
<tag1 attribute1="at1" attribute2="at2"><tag2 attribute3="at3" attribute4="at4">Some
are even enclosed in multiple tags.</tag1></tag2>
And some have overlapping tags:
<tag1 attribute1="at1" attribute2="at2">This is an example sentence
<tag3 attribute5="at5">containing a nested example sentence</tag3></tag1>
Whenever I use an ElementTree like function to parse the file, I can only access the very first tag. I am looking for a way to parse all the tags and don't want a tree like structure. Any help is greatly appreciated.
If you have one XML fragment per line, just parse each line individually.
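import xml.etree.ElementTree as ET
for line in some_file:
    # Parse each line as its own document; fromstring() returns the root element.
    root = ET.fromstring(line)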
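If the fragments span several lines, as in the example from the question, another option is to wrap the whole text in a synthetic root element and walk everything beneath it. This is a sketch under assumptions: the wrapper name doc is arbitrary, the sample text is shortened, and the tags must be properly nested. A fragment closed in the wrong order, such as </tag1></tag2>, is rejected by any XML parser and would have to be repaired first.

```python
import xml.etree.ElementTree as ET

# Shortened version of the annotated text from the question
TEXT = (
    '<tag1 attribute1="at1" attribute2="at2">My files are text that I annotated.</tag1>\n'
    'Some parts of the text are not enclosed in a tag.\n'
    '<tag1 attribute1="at1" attribute2="at2">This is an example sentence '
    '<tag3 attribute5="at5">containing a nested example sentence</tag3></tag1>'
)

# Wrap the fragments in a single root so the parser accepts them
root = ET.fromstring("<doc>" + TEXT + "</doc>")

# Every tag, at any nesting level, with its attributes and full text
annotations = [
    (elem.tag, dict(elem.attrib), "".join(elem.itertext()))
    for elem in root.iter()
    if elem is not root
]

# Text that follows a top-level element but sits outside any tag is stored in .tail
loose_text = [
    child.tail.strip() for child in root if child.tail and child.tail.strip()
]

for tag, attrs, text in annotations:
    print(tag, attrs, text)
print(loose_text)
```

root.iter() visits nested tags as well, so you get a flat list of annotations rather than having to walk the tree yourself.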

What is the best way to change text contained in an XML file using Python?

Let's say I have an existing trivial XML file named 'MyData.xml' that contains the following:
<?xml version="1.0" encoding="utf-8" ?>
<myElement>foo</myElement>
I want to change the text value of 'foo' to 'bar' resulting in the following:
<?xml version="1.0" encoding="utf-8" ?>
<myElement>bar</myElement>
Once I am done, I want to save the changes.
What is the easiest and simplest way to accomplish all this?
Use Python's minidom.
Basically, you will take the following steps:
1. Read the XML data into a DOM object
2. Use DOM methods to modify the document
3. Save the new DOM object to a new XML document
The Python documentation should hold your hand rather nicely through this process.
This is what I wrote based on @Ryan's answer:
from xml.dom.minidom import parse
import os
# create a backup of original file
new_file_name = 'MyData.xml'
old_file_name = new_file_name + "~"
os.rename(new_file_name, old_file_name)
# change text value of element
doc = parse(old_file_name)
node = doc.getElementsByTagName('myElement')
node[0].firstChild.nodeValue = 'bar'
# persist changes to new file; the file encoding matches the XML declaration
with open(new_file_name, "w", encoding="utf-8") as xml_file:
    doc.writexml(xml_file, encoding="utf-8")
Not sure if this was the easiest and simplest approach, but it does work. (@Javier's answer has fewer lines of code but requires a non-standard library.)
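For comparison, the same change can probably be made with xml.etree.ElementTree, which is also in the standard library. This is a sketch that assumes, as in the question, that myElement is the root element; the document is read from and written to in-memory buffers only so the example is self-contained.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the contents of MyData.xml
SOURCE = b'<?xml version="1.0" encoding="utf-8" ?>\n<myElement>foo</myElement>'

tree = ET.parse(io.BytesIO(SOURCE))
root = tree.getroot()

# myElement is the root here; for a nested element use root.find('myElement')
root.text = "bar"

out = io.BytesIO()
tree.write(out, encoding="utf-8", xml_declaration=True)
result = out.getvalue()
print(result.decode("utf-8"))
```

With a real file, ET.parse('MyData.xml') and tree.write('MyData.xml', encoding='utf-8', xml_declaration=True) take the place of the buffers.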
For quick, non-critical XML manipulations, I really like P4X. It lets you write code like this:
import p4x
doc = p4x.P4X(open(file).read())
doc.myElement = 'bar'
You also might want to check out Uche Ogbuji's excellent XML Data Binding Library, Amara:
http://uche.ogbuji.net/tech/4suite/amara
(Documentation here:
http://platea.pntic.mec.es/~jmorilla/amara/manual/)
The cool thing about Amara is that it turns an XML document into a Python object, so you can just do stuff like:
record = doc.xml_create_element(u'Record')
nameElem = doc.xml_create_element(u'Name', content=unicode(name))
record.xml_append(nameElem)
valueElem = doc.xml_create_element(u'Value', content=unicode(value))
record.xml_append(valueElem)
(which creates a Record element that contains Name and Value elements, which in turn contain the values of the name and value variables).
