Using ElementTree to find a node - invalid predicate - python

I'm very new to this area so I'm sure it's just something obvious. I'm trying to change a python script so that it finds a node in a different way but I get an "invalid predicate" error.
import xml.etree.ElementTree as ET
tree = ET.parse("/tmp/failing.xml")
doc = tree.getroot()
thingy = doc.find(".//File/Diag[@id='53']")
print(thingy.attrib)
thingy = doc.find(".//File/Diag[BaseName = 'HTTPHeaders']")
print(thingy.attrib)
That should find the same node twice but the second find gets the error. Here is an extract of the XML:
<Diag id="53">
<Formatted>xyz</Formatted>
<BaseName>HTTPHeaders</BaseName>
<Column>17</Column>
I hope I've not cut it down too much. Basically, finding it with "@id" works but I want to search on that BaseName tag instead.
Actually, I want to search on a combination of tags so I have a more complicated expression lined up but I can't get the simple one to work!

The code in the question works when using Python 3.7. If the spaces before and after the equals sign in the predicate are removed, it also works with earlier Python versions.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders']")
See https://bugs.python.org/issue31648.
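For the combined search mentioned in the question, predicates can be chained one after another. Here is a minimal sketch, assuming a document shaped like the extract above; the wrapping Root and File elements and the second Diag are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the first Diag comes from the question's extract.
xml_text = """
<Root>
  <File>
    <Diag id="53">
      <Formatted>xyz</Formatted>
      <BaseName>HTTPHeaders</BaseName>
      <Column>17</Column>
    </Diag>
    <Diag id="54">
      <Formatted>abc</Formatted>
      <BaseName>HTTPHeaders</BaseName>
      <Column>3</Column>
    </Diag>
  </File>
</Root>
"""
doc = ET.fromstring(xml_text)

# No spaces around "=" so that this also works before Python 3.7.
by_name = doc.find(".//File/Diag[BaseName='HTTPHeaders']")

# Predicates can be chained; each one narrows the previous match.
combined = doc.find(".//File/Diag[BaseName='HTTPHeaders'][Column='3']")

print(by_name.attrib)
print(combined.attrib)
```

find() returns the first match in document order, so the first call yields the Diag with id 53 and the chained version yields the one with id 54.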

Related

Python's ElementTree, how to create links in a paragraph

I have a website I'm building running off Python 2.7 and using ElementTree to build the HTML on the fly. I have no problem creating the elements and appending them to the tree. It's where I have to insert links in the middle of a large paragraph that I am stumped. This is easy when it's done in text, but this is doing it via XML. Here's what I mean:
Sample text:
lawLine = "..., a vessel as defined in Section 21 of the Harbors and Navigation Code which is inhabited and designed for habitation, an inhabited floating home as defined in subdivision (d) of Section 18075.55 of the Health and Safety Code, ..."
To add that text to the HTML as H4-style text, I typically use:
h4 = ET.Element('h4')
htmlTree.append(h4)
h4.text = lawLine
I need to add links at the word "Section" and the numbers associated with it, but I can't simply create a new element "a" in the middle of a paragraph and add it to the HTML tree, so I'm trying to build that piece as text, then do ET.fromstring and append it to the tree:
thisLawType = 'PC'
matches = re.findall(r'Section [0-9.]*', lawLine)
if matches:
    lawLine = """<h4>{0}</h4>""".format(lawLine)
    for thisMatch in matches:
        thisMatchLinked = """{2}""".format(thisLawType, thisMatch.replace('Section ',''), thisMatch)
        lawLine = lawLine.replace(thisMatch, thisMatchLinked)
    htmlBody.append(ET.fromstring(lawLine))
I am getting "xml.etree.ElementTree.ParseError: not well-formed" errors when I do ET.fromstring. Is there a better way to do this in ElementTree? I'm sure there are better extensions out there, but my work environment is limited to Python 2.7 and the standard library. Any help would be appreciated. Thanks!
Evan
The XML you are generating is indeed not well formed, because of the & in thisMatchLinked. It's one of the special characters that need to be escaped (see an interesting explanation here).
So try replacing & with &amp; and see if it works.
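A minimal sketch of that fix, using only the standard library. The link template here is an assumption, since the one in the question did not survive; the URL and its query parameters are hypothetical, and the part that matters is that the & between them is written as &amp;.

```python
import re
import xml.etree.ElementTree as ET

law_line = ("a vessel as defined in Section 21 of the Harbors and Navigation Code, "
            "an inhabited floating home as defined in subdivision (d) of "
            "Section 18075.55 of the Health and Safety Code")
law_type = "PC"

# Hypothetical link template; "&" between query parameters is written as "&amp;".
link = '<a href="/law?type={0}&amp;section={1}">{2}</a>'

def add_link(match):
    # group(1) is the section number, group(0) the whole "Section N" text.
    return link.format(law_type, match.group(1), match.group(0))

# re.sub handles each match once, so text inside an inserted link is not replaced again.
markup = "<h4>{0}</h4>".format(re.sub(r"Section ([0-9.]+)", add_link, law_line))

h4 = ET.fromstring(markup)
print(ET.tostring(h4))
```

The parser turns &amp; back into a plain & in the href attribute, and the serializer escapes it again on output. If the law text itself could contain & or <, it would need escaping as well before being wrapped in markup.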

find specific child in xml

<graphiceditor>
<plot name="DS_Autobahn 1.Track: Curvature <78.4204 km>" type="CurvePlot">
<parent>DS_Autobahn 1</parent>
...
<curve name="" isXTopAxis="0" markerSize="8" symbol="-1"
<point x="19.986891478960015" y="-0.00020825890723451596"/>
<point ....
Hello, I want to open the .xml file, find "curve" and import the y-coordinate of the curve into a list. I know that "curve" has the index [16] so I am using this right now:
tree = ET.parse(file_name)
root = tree.getroot()
curvature = [float(i) for i in[x["y"] for x in [root[0][16][i].attrib for i in range(len(root[0][16]))]]]
But how do I do it, if curve is not at the 16th position? How do I find curve in any xml file then? I have been trying for several hours now but I simply do not get it. Thank you very much in advance.
You could use XPath, for instance.
This would essentially look like:
root.findall(xpath)
where your xpath would be './/curve' if you are just interested in all elements with the tag curve, wherever they sit in the tree.
For more information regarding XPath, see w3schools.
I recommend learning about regular expressions (more commonly referred to as regex); I use them all the time for problems like this.
This is a good place to look up the different aspects of regex:
Regex
Regex is a way to match text. It's a lot like if "substring" in string: except a million times more powerful. The entire purpose of regex is to find that "substring" even when you don't know what it is.
So let's take a closer look at your example in particular. The first thing to do is figure out exactly which rules need to be true in order to "match" the y value.
I don't know exactly how you are reading in your data, but I am reading it in as a single string.
string = '<graphiceditor>' \
'<plot name="DS_Autobahn 1.Track: Curvature <78.4204 km>" type="CurvePlot">' \
'<parent>DS_Autobahn 1</parent>' \
'<curve name="" isXTopAxis="0" markerSize="8" symbol="-1"' \
'<point x="19.986891478960015" y="-0.00020825890723451596"/>' \
'<point ....'
You can see I split the string into multiple lines to make it more readable. If you are reading it from a file with open(), make sure to remove the "\n" characters or my regex won't work (not that you can't write a regex that would!).
The first thing I want to do is find the curve tag, then I want to continue on to find the y= section, then grab just the number. Let's simplify that out into really defined steps:
1. Find where the curve section begins.
2. Continue until the next y= section begins.
3. Get the value from inside the quotes after the y= section.
Now for the regex. I could explain exactly how it works, but we would be here all day. Go back to the doc I linked at the start and read up.
import re
string = "[see above]"
y_val = re.search('<curve.*?y="(.*?)"', string).group(1)
That's it! Cast your y_val to a float() and you are ready to go!
Use an XML parser to parse XML; not regex.
As mentioned in another answer, I would also use XPath. If you need complex XPath expressions, I'd recommend using lxml. In your example, though, ElementTree will suffice.
For example, this Python...
import xml.etree.ElementTree as ET
tree = ET.parse("file_name.xml")
root = tree.getroot()
curvature = [float(y) for y in [point.attrib["y"] for point in root.findall(".//curve/point[@y]")]]
print(curvature)
using this XML ("file_name.xml")...
<graphiceditor>
<plot name="DS_Autobahn 1.Track: Curvature <78.4204 km>" type="CurvePlot">
<parent>DS_Autobahn 1</parent>
<curve name="" isXTopAxis="0" markerSize="8" symbol="-1">
<point x="19.986891478960015" y="-0.00020825890723451596"/>
<point x="19.986891478960015" y="-0.00030825690983451678"/>
</curve>
</plot>
</graphiceditor>
will print...
[-0.00020825890723451596, -0.0003082569098345168]
Note: Notice the difference between the second y coordinate in the list and what's in the XML. That's because you're casting the value to a float. Consider casting to a decimal if you need to maintain precision.
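A minimal sketch of the decimal variant, assuming the same point elements as above. Here the XML is trimmed to a single curve element and parsed from a string to keep the example self-contained.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

xml_text = """
<curve>
  <point x="19.986891478960015" y="-0.00020825890723451596"/>
  <point x="19.986891478960015" y="-0.00030825690983451678"/>
</curve>
"""
root = ET.fromstring(xml_text)

# Decimal is built from the attribute string, so every digit is kept.
curvature = [Decimal(point.attrib["y"]) for point in root.findall("./point[@y]")]
print(curvature)
```

Because the values never pass through float, the second coordinate keeps its trailing digits exactly as written in the file.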

Create XML node with empty text content and with a named closing tag using ElementTree.SubElement

I can't seem to find a way to generate a sub-element like this
<child attr="something"></child>
Using the following code:
myChild = ElementTree.SubElement(root, tag="child", attrib={'attr': 'something'})
I always get:
<child attr="something" />
Unless I at least add:
whiteSpace = " "
myChild.text = whiteSpace
This is very annoying.
Is there a way that I can generate null text for the element with ElementTree?
UPDATES:
After some tries, I tend to agree that it really shouldn't matter. The reason I asked is that I wanted to generate an Xcode workspace file, which uses empty content for its project nodes. But I found that the default SubElement output works as well, so I won't put any more effort into making the output XML identical in format to a normal Xcode workspace.
Case closed.
The answer to the question is probably: there is no way to achieve what I wanted in the beginning.
But as the updates in my question said, the difference between the two formats really doesn't matter. They both work as valid XML.
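For readers on Python 3.4 or later, the standard library serializer does offer a switch for this: the short_empty_elements keyword argument of ET.tostring and ElementTree.write. A minimal sketch, with a made-up root tag for illustration:

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
ET.SubElement(root, "child", attrib={"attr": "something"})

# Default serialization collapses empty elements to the self-closing form.
short_form = ET.tostring(root, encoding="unicode")

# short_empty_elements=False writes an explicit closing tag instead.
long_form = ET.tostring(root, encoding="unicode", short_empty_elements=False)

print(short_form)
print(long_form)
```

The setting applies to the whole serialized tree, so every empty element gets a closing tag, not just one chosen node.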

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash; however, Minidom seems to suit my needs for XML parsing perfectly, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree
tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue  # skip this one
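Here is a self-contained sketch of the same loop that also collects the names it keeps. The XML layout is hypothetical, since the real file is not shown in the question; only the network_object and Name tags come from it.

```python
from xml.etree import ElementTree

# Hypothetical input; the real file layout is not shown in the question.
xml_text = """
<network_objects>
  <network_object><Name>fw-cluster-01</Name></network_object>
  <network_object><Name>web-01</Name></network_object>
  <network_object><Name>db-01</Name></network_object>
</network_objects>
"""
root = ElementTree.fromstring(xml_text)

kept = []
for name in root.findall(".//network_object//Name"):
    if name.text is not None and "cluster" in name.text:
        continue  # skip this one, like grep -v cluster
    kept.append(name.text)

print(kept)
```

To exclude at the network_object level instead, loop over './/network_object' and test the Name child of each object before processing the rest of its fields.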

How can you parse xml in Google Refine using jython/python ElementTree

I'm trying to parse some XML in Google Refine using Jython and ElementTree, but I'm struggling to find any documentation to help me get this working (probably not helped by not being a Python coder).
Here's an extract of the XML I'm trying to parse. I'm trying to return a joined string of all the dc:identifier values:
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:creator>J. Koenig</dc:creator>
<dc:date>2010-01-13T15:47:38Z</dc:date>
<dc:date>2010-01-13T15:47:38Z</dc:date>
<dc:date>2010-01-13T15:47:38Z</dc:date>
<dc:identifier>CCTL0059</dc:identifier>
<dc:identifier>CCTL0059</dc:identifier>
<dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
<dc:format>application/pdf</dc:format>
</oai_dc:dc>
Here's the code I've got so far. This is a test to return anything as right now all I'm getting is 'Error: null'
from elementtree import ElementTree as ET
element = ET.parse(value)
namespace = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
e = element.findall('{0}identifier'.format(namespace))
for i in e:
    count += 1
return count
You can use a GREL expression like this, try it:
forEach(value.parseHtml().select("dc|identifier"),v,v.htmlText()).join(",")
For each identifier found, give me the htmlText and join them all with commas.
parseHtml() uses the Jsoup library (jsoup.org) and really just parses tags and structure. It also understands namespaced selectors in the form ns|identifier, which is a nice way to get what you're after in this case.
You've used the wrong namespace. This works on Jython 2.5.1:
from xml.etree import ElementTree as ET
element = ET.fromstring(value) # `value` is a string with the xml from question
namespace = "{http://purl.org/dc/elements/1.1/}"
for e in element.getiterator(namespace+'identifier'):
    print e.text
Output
CCTL0059
CCTL0059
http://open.jorum.ac.uk:80/xmlui/handle/123456789/335
Here's a slight tweak on J.F. Sebastian's version which can be pasted directly into Google Refine:
from xml.etree import ElementTree as ET
element = ET.fromstring(value)
namespace = "{http://purl.org/dc/elements/1.1/}"
return ','.join([e.text for e in element.getiterator(namespace+'identifier')])
It returns a comma separated list, but you can change the delimiter used in the return statement.
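Outside Google Refine, on a current CPython, note that getiterator() was removed in Python 3.9. A minimal sketch of the same lookup using findall() with a namespace map follows; the dc prefix in the map is our own choice and only has to match the prefix used in the path.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
value = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator>J. Koenig</dc:creator>
<dc:identifier>CCTL0059</dc:identifier>
<dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
</oai_dc:dc>"""

element = ET.fromstring(value)

# Map a prefix of our choosing to the Dublin Core namespace URI.
namespaces = {"dc": "http://purl.org/dc/elements/1.1/"}

# findall() with a bare tag path looks at direct children of the root element.
identifiers = [e.text for e in element.findall("dc:identifier", namespaces)]
print(",".join(identifiers))
```

If the identifiers could be nested deeper, './/dc:identifier' or element.iter() with the full '{uri}identifier' name would search the whole tree.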
