Python: xml.Find always returns none - python

So I'm trying to search and replace the xml keyword RunCodeAnalysis inside a vcxproj file with python.
I'm pretty new to python so be gentle, but I thought it would be the simplest language to do this kind of thing.
I read a handful of similar examples and came up with the code below, but no matter what I search for the ElementTree Find call always returns None.
from xml.etree import ElementTree as et
xml = '''\
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Protected_Debug|Win32'">
<RunCodeAnalysis>false</RunCodeAnalysis>
</PropertyGroup>
</Project>
'''
et.register_namespace('', "http://schemas.microsoft.com/developer/msbuild/2003")
tree = et.ElementTree(et.fromstring(xml))
print(tree.find('.//RunCodeAnalysis'))
Here's a simplified code example online: https://ideone.com/1T1wsb
Can anyone tell me what I'm doing wrong?

OK, so @ThomWiggers helped me with the missing piece, and here's my final code in all its naive glory. There's no parameter checking or any kind of smarts yet, but it takes two parameters: the filename and whether to set static code analysis to true or false. I've got about 30 projects I want to turn it on for in nightly builds, but I really don't want it on day to day as it's just too slow.
import sys
from xml.etree import ElementTree as et
et.register_namespace('', "http://schemas.microsoft.com/developer/msbuild/2003")
tree = et.parse(sys.argv[1])
value = sys.argv[2]
for item in tree.findall('.//{http://schemas.microsoft.com/developer/msbuild/2003}RunCodeAnalysis'):
    item.text = value
for item in tree.findall('.//{http://schemas.microsoft.com/developer/msbuild/2003}EnablePREfast'):
    item.text = value
tree.write(sys.argv[1])
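For anyone hitting the same None result, here is a minimal sketch of why the original search failed. It assumes a cut-down version of the project XML from the question; the MSBUILD_NS constant and the variable names are only illustrative. register_namespace affects how the tree is serialized, not how find matches tags, so the search path still has to spell out the namespace.

```python
from xml.etree import ElementTree as et

# Illustrative constant: the default namespace declared on <Project>
MSBUILD_NS = "http://schemas.microsoft.com/developer/msbuild/2003"

xml = '''<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup>
<RunCodeAnalysis>false</RunCodeAnalysis>
</PropertyGroup>
</Project>'''

root = et.fromstring(xml)

# The parsed tag is '{namespace}RunCodeAnalysis', so the bare name never matches
unqualified = root.find('.//RunCodeAnalysis')

# Qualifying the tag with the namespace URI in braces does match
qualified = root.find('.//{%s}RunCodeAnalysis' % MSBUILD_NS)
qualified.text = 'true'

# register_namespace only controls the prefix used when writing the tree back out
et.register_namespace('', MSBUILD_NS)
out = et.tostring(root, encoding='unicode')
print(unqualified)
print(out)
```

On Python 3.8 or later, the path './/{*}RunCodeAnalysis' should also match regardless of namespace, which may be handy when the namespace URI varies between files.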

Related

Parsing an xml file with an emphasis tag in it in python

I am currently writing a python script that extracts all of the text in an xml file. I am using the Element Tree library to interpret the data, but I run into a problem when the data is structured like this...
<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>
When I attempt to read out the text, I get the first half of the Segment ("Alright. So what we had") before the pause tag.
What I am trying to figure out is if there is a way to ignore the tags in the data segments and print out all of the text.
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
doc = SimplifiedDoc(html)
print(doc.Segment)
print(doc.Segment.text)
Result:
{'StartTime': '639.752', 'EndTime': '642.270', 'Participant': 'fe016', 'tag': 'Segment', 'html': "\n But I bet it's a good <Pause /> superset of it.\n"}
But I bet it's a good superset of it.
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
pause = root.find('./Pause')
print(root.text + pause.tail)
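If a segment can contain several inline tags, or none at all, a more general option is itertext(), which walks the element and yields every text and tail fragment in document order. This is only a sketch using the sample segment from the question; the variable names are illustrative.

```python
from xml.etree import ElementTree as ET

xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''

root = ET.fromstring(xml)

# itertext() yields the element's text plus the text and tail of every descendant
full_text = ''.join(root.itertext())

# Collapse the newlines and the double space left where <Pause/> used to be
normalized = ' '.join(full_text.split())
print(normalized)
```

Unlike root.text + pause.tail, this does not fail when a segment has no Pause element.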

python lxml pkg - how to incrementally write to an XML file using etree.xmlfile AND passing in existing elements?

I'm very new to anything XML related, so please bear with me. I'm trying to build some code that converts rasters to KML files for Google Earth.
I've come across the lxml package which has made my life easier, but now am facing an issue.
Let's say I've created an element called kml with namespaces:
from lxml import etree
version = '2.2'
namespace_hdr = {'gx': f'http://www.google.com/kml/ext/{version}',
                 'kml': f'http://www.opengis.net/kml/{version}',
                 'atom': 'http://www.w3.org/2005/Atom'
                 }
kml = etree.Element('kml', nsmap=namespace_hdr)
And I've also created an element called Document:
Document = etree.SubElement(kml, 'Document')
Now, I have a lot of data I want to write and am running into memory issues, so I figured the best approach would be to generate the data on the fly and write it as I go, hence the incremental writing.
The approach I'm using is:
out_file = 'test.kml'
with etree.xmlfile(out_file, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element(kml):
        xf.write(Document)
Which returns the error:
TypeError: Argument must be bytes or unicode, got '_Element'
If I change kml to 'kml' it works fine, but obviously does not write the namespaces to the file that I've defined in the kml element.
How is it possible to pass in the kml element instead of a string? Is there a way to do this? Or some other way of incrementally writing to the file?
Any thoughts would be appreciated.
FYI - output when using 'kml' is:
<?xml version='1.0' encoding='utf-8'?>
<kml><Document/>
</kml>
I'm trying to achieve the same but with the namespaces:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document/>
</kml>
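One possible approach, sketched here under the assumption that lxml is installed: xf.element() takes a tag name rather than an element, but it also accepts an nsmap argument, so the namespace declarations can be passed directly instead of being attached to a pre-built kml element. The BytesIO buffer stands in for 'test.kml', and the loop with an id attribute is a hypothetical placeholder for data generated on the fly.

```python
import io
from lxml import etree

version = '2.2'
namespace_hdr = {
    'gx': f'http://www.google.com/kml/ext/{version}',
    'kml': f'http://www.opengis.net/kml/{version}',
    'atom': 'http://www.w3.org/2005/Atom',
}

buf = io.BytesIO()  # stand-in for the output file 'test.kml'
with etree.xmlfile(buf, encoding='utf-8') as xf:
    xf.write_declaration()
    # Pass the tag as a string and the namespaces via nsmap
    with xf.element('kml', nsmap=namespace_hdr):
        for i in range(3):  # hypothetical: one element per generated chunk
            doc = etree.Element('Document', id=str(i))
            xf.write(doc)
            doc = None  # drop the reference so the element can be freed

data = buf.getvalue()
print(data.decode('utf-8'))
```

Each element is built, written, and released inside the loop, so only one chunk needs to be held in memory at a time.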

How to make nested xml structure flat with python

I have XML with huge nested structure.
Like this one
<root>
<node1>
<subnode1>
<name1>text1</name1>
</subnode1>
</node1>
<node2>
<subnode2>
<name2>text2</name2>
</subnode2>
</node2>
</root>
I want convert it to
<root>
<node1>
<name1>text1</name1>
</node1>
<node2>
<name2>text2</name2>
</node2>
</root>
I tried the following:
from xml.etree import ElementTree as et
tr = et.parse(path)
root = tr.getroot()
for node in root.getchildren():
    for element in node.iter():
        if (element.text is not None):
            node.extend(element)
I also tried node.append(element), but that does not work either: it adds the element at the end and I get an infinite loop.
Any help would be appreciated.
A few points to mention here:
First, the test element.text is not None is always true when you parse the XML above with xml.etree.ElementTree: each container element is followed by a newline before its first child, so the text of every element that looks empty is actually a \n character. One alternative is to parse with lxml.etree.parse and an lxml.etree.XMLParser that ignores blank text, as below.
Second, it's not a good idea to append to a tree while iterating over it. This code loops forever for the same reason:
>>> a = [1,2,3,4]
>>> for k in a:
...     a.append(5)
See @Alex Martelli's answer to the question "Modifying list while iterating" for details.
Hence, you should build a separate buffer XML tree rather than modifying your tree while traversing it.
import copy
from lxml import etree
# Drop whitespace-only text so container elements end up with text None
p = etree.XMLParser(remove_blank_text=True)
path = 'test.xml'
tr = etree.parse(path, parser=p)
root = tr.getroot()
buffer = etree.Element(root.tag)
for node in root:
    bnode = etree.SubElement(buffer, node.tag)
    for element in node.iter():
        if element.text is not None:
            # lxml's append() moves an element out of its old tree,
            # so append a copy and leave the tree being traversed intact
            bnode.append(copy.deepcopy(element))
print(etree.tostring(buffer).decode())
Sample run and results:
Chip chip# 01:01:53# ~: python stackoverflow.py
<root><node1><name1>text1</name1></node1><node2><name2>text2</name2></node2></root>
NOTE: the tree printed above is hard to read as a single line. You can pretty-print it with the lxml package; see the tutorials under "Pretty printing XML in Python".
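The same idea can be sketched with only the standard library, assuming that "flattening" means keeping just the leaf elements that carry non-blank text. The flatten helper below is hypothetical, not part of any library; it builds a new tree instead of modifying the one being traversed.

```python
from xml.etree import ElementTree as ET

xml = """<root>
<node1>
<subnode1>
<name1>text1</name1>
</subnode1>
</node1>
<node2>
<subnode2>
<name2>text2</name2>
</subnode2>
</node2>
</root>"""


def flatten(root):
    """Return a new tree in which each top-level node holds only its text-bearing leaves."""
    flat_root = ET.Element(root.tag)
    for node in root:
        flat_node = ET.SubElement(flat_root, node.tag)
        for element in node.iter():
            # A leaf has no children; strip() filters out whitespace-only text
            if len(element) == 0 and element.text and element.text.strip():
                leaf = ET.SubElement(flat_node, element.tag, dict(element.attrib))
                leaf.text = element.text
    return flat_root


result = ET.tostring(flatten(ET.fromstring(xml)), encoding='unicode')
print(result)
```

Checking for whitespace-only text explicitly means no special parser option is needed to discard the newlines between elements.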

Find an element in an XML tree using ElementTree

I am trying to locate a specific element in an XML file, using ElementTree. Here is the XML:
<documentRoot>
<?version="1.0" encoding="UTF-8" standalone="yes"?>
<n:CallFinished xmlns="http://api.callfire.com/data" xmlns:n="http://api.callfire.com/notification/xsd">
<n:SubscriptionId>96763001</n:SubscriptionId>
<Call id="158864460001">
<FromNumber>5129618605</FromNumber>
<ToNumber>15122537666</ToNumber>
<State>FINISHED</State>
<ContactId>125069153001</ContactId>
<Inbound>true</Inbound>
<Created>2014-01-15T00:15:05Z</Created>
<Modified>2014-01-15T00:15:18Z</Modified>
<FinalResult>LA</FinalResult>
<CallRecord id="94732950001">
<Result>LA</Result>
<FinishTime>2014-01-15T00:15:15Z</FinishTime>
<BilledAmount>1.0</BilledAmount>
<AnswerTime>2014-01-15T00:15:06Z</AnswerTime>
<Duration>9</Duration>
</CallRecord>
</Call>
</n:CallFinished>
</documentRoot>
I am interested in the <Created> item. Here is the code I am using:
import xml.etree.ElementTree as ET
calls_root = ET.fromstring(calls_xml)
for item in calls_root.find('CallFinished/Call/Created'):
    print("Found you!")
    call_start = item.text
I have tried a bunch of different XPath expressions, but I'm stumped - I cannot locate the element. Any tips?
You aren't referencing the namespaces that exist in the XML document, so ElementTree can't find the elements in that XPath. You need to tell ElementTree what namespaces you are using.
The following should work:
import xml.etree.ElementTree as ET
namespaces = {'n': '{http://api.callfire.com/notification/xsd}',
              '_': '{http://api.callfire.com/data}'
              }
calls_root = ET.fromstring(calls_xml)
# find() returns a single element (or None), so there is nothing to loop over
item = calls_root.find('{n}CallFinished/{_}Call/{_}Created'.format(**namespaces))
if item is not None:
    print("Found you!")
    call_start = item.text
Alternatively, lxml implements the ElementTree API and has good support for namespaces, so you don't have to worry about string formatting.
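The standard library can also avoid the string formatting: find and findall accept a prefix-to-URI mapping as a second argument. The sketch below uses a shortened copy of the XML from the question; the prefix 'd' for the default namespace is an arbitrary choice and does not need to match anything in the document.

```python
import xml.etree.ElementTree as ET

calls_xml = """<documentRoot>
<n:CallFinished xmlns="http://api.callfire.com/data" xmlns:n="http://api.callfire.com/notification/xsd">
<n:SubscriptionId>96763001</n:SubscriptionId>
<Call id="158864460001">
<Created>2014-01-15T00:15:05Z</Created>
</Call>
</n:CallFinished>
</documentRoot>"""

# Prefixes are local to this mapping; only the URIs have to match the document
ns = {
    'n': 'http://api.callfire.com/notification/xsd',
    'd': 'http://api.callfire.com/data',
}

calls_root = ET.fromstring(calls_xml)
created = calls_root.find('n:CallFinished/d:Call/d:Created', ns)
call_start = created.text if created is not None else None
print(call_start)
```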

Elementtree displaying elements out of order

I'm using Python's ElementTree to parse xml files. I have a "findall" to find all "revision" subelements, but when I iterate through the result, they are not in document order. What can I be doing wrong?
Here's my code:
allrevisions = page.findall('{http://www.mediawiki.org/xml/export-0.5/}revision')
for rev in allrevisions:
    print(rev)
    print(rev.find('{http://www.mediawiki.org/xml/export-0.5/}timestamp').text)
Here's a link to the document I'm parsing: http://pastie.org/2780983
Thanks,
bsg
-Oops. By going through my code and running it piece by piece, I worked out the problem - I had stuck in a reverse() on the elements list in the wrong place, which was causing all the trouble. Thank you so much for your help - I'm sorry it was such a silly issue.
The documentation for ElementTree says that findall returns the elements in document order.
A quick test shows the correct behaviour:
import xml.etree.ElementTree as et
xmltext = """
<root>
<number>1</number>
<number>2</number>
<number>3</number>
<number>4</number>
</root>
"""
tree = et.fromstring(xmltext)
for number in tree.findall('number'):
    print(number.text)
Result:
1
2
3
4
It would be helpful to see the document you are parsing.
Update:
Using the source data you provided:
import xml.etree.ElementTree as et
with open('xmldata.xml', 'r', encoding='utf-8') as f:
    xmldata = f.read()
tree = et.fromstring(xmldata)
for revision in tree.findall('.//{http://www.mediawiki.org/xml/export-0.5/}revision'):
    # Print the first ten characters of each revision's text
    print(revision.find('{http://www.mediawiki.org/xml/export-0.5/}text').text[0:10])
Result:
‘The Mind
{{db-spam}
‘The Mind
'''The Min
<!-- Pleas
The same order as they appear in the document.
