find specific child in xml - python

<graphiceditor>
  <plot name="DS_Autobahn 1.Track: Curvature <78.4204 km>" type="CurvePlot">
    <parent>DS_Autobahn 1</parent>
    ...
    <curve name="" isXTopAxis="0" markerSize="8" symbol="-1"
      <point x="19.986891478960015" y="-0.00020825890723451596"/>
      <point ....
Hello, I want to open the .xml file, find the "curve" element, and read the y-coordinates of its points into a list. I know that "curve" has index [16] in my file, so I am using this right now:
import xml.etree.ElementTree as ET

tree = ET.parse(file_name)
root = tree.getroot()
curvature = [float(i) for i in [x["y"] for x in [root[0][16][i].attrib for i in range(len(root[0][16]))]]]
But how do I do it if "curve" is not at the 16th position? How do I find "curve" in any XML file? I have been trying for several hours now, but I simply do not get it. Thank you very much in advance.

You could use XPath, for instance.
This would then essentially look like:
root.findall(xpath)
where your xpath would be './/curve' if you are just interested in all children of tag type curve.
For more information regarding XPath, see w3schools.
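For example, a minimal sketch against the snippet above (file_name is the same variable as in your code):
import xml.etree.ElementTree as ET

tree = ET.parse(file_name)
root = tree.getroot()
# Find every <curve> element anywhere in the tree, no matter where it sits
# among its siblings, then pull the y attribute of each of its points.
for curve in root.findall('.//curve'):
    ys = [float(point.get('y')) for point in curve.findall('point')]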

I recommend learning about regular expressions (more commonly referred to as regex); I use them all the time for problems like this.
This is a good place to reference the different aspects of regex:
Regex
Regex is a way to match text. It's a lot like if "substring" in string:, except a million times more powerful. The entire purpose of regex is to find that "substring" even when you don't know what it is.
So let's take a closer look at your example in particular. The first thing to do is figure out exactly which rules need to be true in order to "match" the y value.
I don't know exactly how you are reading in your data, but I am reading it in as a single string.
string = '<graphiceditor>' \
'<plot name="DS_Autobahn 1.Track: Curvature <78.4204 km>" type="CurvePlot">' \
'<parent>DS_Autobahn 1</parent>' \
'<curve name="" isXTopAxis="0" markerSize="8" symbol="-1"' \
'<point x="19.986891478960015" y="-0.00020825890723451596"/>' \
'<point ....'
You can see I split the string into multiple lines to make it more readable. If you are reading it from a file with open(), make sure to remove the "\n" meta-characters or my regex won't work (not that you can't write regex that would!).
The first thing I want to do is find the curve tag, then continue on until the next y= section, then grab just the number. Let's break that down into clearly defined steps:
Find where the curve section begins
Continue until the next y= section begins
Get the value from inside the quotes after the y= section.
Now for the regex. I could explain exactly how it works, but we would be here all day; go back to the doc I linked at the start and read up.
import re

string = "[see above]"
# re.search stops at the first match, i.e. the first y value after <curve.
y_val = re.search('<curve.*?y="(.*?)"', string).group(1)
That's it! Cast your y_val to a float() and you are ready to go!
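If you want every y value in the curve rather than just the first, a rough sketch (assuming the real file contains a closing </curve> tag, which the truncated snippet above does not show) would be:
import re

# Slice out the whole <curve>...</curve> section, then collect every y inside it.
match = re.search(r'<curve.*?</curve>', string, re.DOTALL)
if match:
    curvature = [float(y) for y in re.findall(r'y="(.*?)"', match.group(0))]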

Use an XML parser to parse XML, not regex.
As mentioned in another answer, I would also use XPath. If you need to use complex XPaths, I'd recommend lxml. In your example, though, ElementTree will suffice.
For example, this Python...
import xml.etree.ElementTree as ET
tree = ET.parse("file_name.xml")
root = tree.getroot()
curvature = [float(y) for y in [point.attrib["y"] for point in root.findall(".//curve/point[@y]")]]
print(curvature)
using this XML ("file_name.xml")...
<graphiceditor>
  <plot name="DS_Autobahn 1.Track: Curvature &lt;78.4204 km&gt;" type="CurvePlot">
    <parent>DS_Autobahn 1</parent>
    <curve name="" isXTopAxis="0" markerSize="8" symbol="-1">
      <point x="19.986891478960015" y="-0.00020825890723451596"/>
      <point x="19.986891478960015" y="-0.00030825690983451678"/>
    </curve>
  </plot>
</graphiceditor>
will print...
[-0.00020825890723451596, -0.0003082569098345168]
Note: Notice the difference between the second y coordinate in the list and what's in the XML. That's because you're casting the value to a float. Consider casting to a decimal.Decimal instead if you need to maintain precision.
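For example, a small variation on the list comprehension above (Decimal keeps the exact digits from the XML text):
from decimal import Decimal

curvature = [Decimal(point.attrib["y"]) for point in root.findall(".//curve/point[@y]")]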

Related

Using ElementTree to find a node - invalid predicate

I'm very new to this area, so I'm sure it's just something obvious. I'm trying to change a Python script so that it finds a node in a different way, but I get an "invalid predicate" error.
import xml.etree.ElementTree as ET
tree = ET.parse("/tmp/failing.xml")
doc = tree.getroot()
thingy = doc.find(".//File/Diag[#id='53']")
print(thingy.attrib)
thingy = doc.find(".//File/Diag[BaseName = 'HTTPHeaders']")
print(thingy.attrib)
That should find the same node twice but the second find gets the error. Here is an extract of the XML:
<Diag id="53">
<Formatted>xyz</Formatted>
<BaseName>HTTPHeaders</BaseName>
<Column>17</Column>
I hope I've not cut it down too much. Basically, finding it with "@id" works, but I want to search on that BaseName tag instead.
Actually, I want to search on a combination of tags so I have a more complicated expression lined up but I can't get the simple one to work!
The code in the question works when using Python 3.7. If the spaces before and after the equals sign in the predicate are removed, it also works with earlier Python versions.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders']")
See https://bugs.python.org/issue31648.
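Since you mentioned wanting to match on a combination of tags, here's an untested sketch chaining predicates (ElementTree allows one predicate to follow another), using the Column value from your extract:
# Matches a Diag whose BaseName child is 'HTTPHeaders' AND whose Column child is '17'.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders'][Column='17']")
print(thingy.attrib)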

How to get all text inside XML tags

xml file snapshot
From the above .xml file I am extracting article-id, article-title, abstract and keywords. For normal text inside a single tag I am getting correct results. But for text with multiple tags, such as:
<title-group>
  <article-title>
    Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
    <italic>Rapidithrix thailandica</italic>
  </article-title>
</title-group>
.
.
The same goes for the abstract...
I got this output:
OrderedDict([(u'italic', u'Rapidithrix thailandica'), ('#text', u'Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,')])
The code has treated the tag as text, and the output is also not in sequence.
How can I simply extract the text from such a document as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"?
I am using the Python code below to perform this task.
import xmltodict
import os
from os.path import basename
import re
with open('2630847.nxml') as fd:
    doc = xmltodict.parse(fd.read())

pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii','ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii','ignore') for g in z]
article_keywords = ",".join(zz).replace(","," ")
fout = open(str(pmc_id)+".txt","w")
fout.write(str(pmc_id)+"\n"+str(article_title)+". "+str(article_abstract)+". "+str(article_keywords))
Can somebody please suggest corrections?
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom.minidom or recent versions of BeautifulSoup. In many such libraries you just load the document with one call and then call some function like innerText() to get all of its text content. You could even just load the document into a browser and call the JavaScript innerText() function to get what you want. If the tool you choose doesn't provide innertext() already, you can write it yourself:
from xml.dom.minidom import Text, Element

def innertext(node):
    # Recursively concatenate the values of all descendant text nodes.
    t = ""
    for curNode in node.childNodes:
        if isinstance(curNode, Text):
            t += curNode.nodeValue
        elif isinstance(curNode, Element):
            t += innertext(curNode)
    return t
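As a quick check against the <article-title> snippet above (a sketch only; the whitespace is collapsed here for brevity):
from xml.dom.minidom import parseString

snippet = ('<title-group><article-title>'
           'Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives '
           'from a Novel Marine Gliding Bacterium, '
           '<italic>Rapidithrix thailandica</italic>'
           '</article-title></title-group>')
doc = parseString(snippet)
title = doc.getElementsByTagName('article-title')[0]
print(innertext(title))
# Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a
# Novel Marine Gliding Bacterium, Rapidithrix thailandica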
You could tweak that to put spaces between the text nodes, depending on your data.
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict:
<and>
  <many>elements</many>
  <many>more elements</many>
</and>
becomes:
"and": {
    "many": [
        "elements",
        "more elements"
    ]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.

Create XML node with empty text content and with a named closing tag using ElementTree.SubElement

I can't seem to find a way to generate a sub-element like this
<child attr="something"></child>
Using the following code:
myChild = ElementTree.SubElement(root, tag="child", attrib={'attr': 'something'})
I always get:
<child attr="something" />
Unless I at least add:
whiteSpace = " "
myChild.text = whiteSpace
This is very annoying.
Is there a way that I can generate null text for the element with ElementTree?
UPDATES:
After some tries, I tend to agree that it really shouldn't matter. The reason I asked is that I wanted to generate an Xcode workspace file, which uses empty content for its project nodes. But I found that the default SubElement output actually works as well, so I won't put any more effort into making the output XML "identical" to a normal Xcode workspace file.
Case closed.
The answer to the question is probably: there is no way to achieve what I wanted in the beginning.
But as the updates in my question say, the difference between the two mentioned formats really doesn't matter. They both work as valid XML.
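(For what it's worth, a minimal sketch assuming Python 3.4 or later, where ElementTree's tostring() and write() also accept a short_empty_elements flag that emits the expanded form:)
import xml.etree.ElementTree as ET

root = ET.Element("root")
child = ET.SubElement(root, "child", attrib={"attr": "something"})
# short_empty_elements=False serializes empty elements with an explicit closing tag.
print(ET.tostring(root, encoding="unicode", short_empty_elements=False))
# <root><child attr="something"></child></root>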

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash; however, minidom seems to perfectly suit my needs for XML parsing, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
ElementTree is giving me a ton of problems; can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree

tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue  # skip this one
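If staying with minidom is a hard requirement, here is a rough equivalent sketch (the tag names come from your pseudo code; "somefile.xml" is a placeholder):
from xml.dom import minidom

dom = minidom.parse("somefile.xml")
for obj in dom.getElementsByTagName("network_object"):
    name_nodes = obj.getElementsByTagName("Name")
    if not name_nodes or name_nodes[0].firstChild is None:
        continue
    name = name_nodes[0].firstChild.data
    if "cluster" in name:
        continue  # the grep -v part: skip names containing "cluster"
    print(name)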

python libxml2dom xpath question

Quick question... I can create/parse a chunk of HTML using libxml2dom, etc.
However, is there a way to display the XPath used to generate/extract that HTML chunk? I'm assuming there's some method/way of doing this that I can't find.
ex:
import libxml2dom
d = libxml2dom.parseString(s, html=1)
##
hdr="//div[3]/table[1]/tr/th"
thdr_ = d.xpath(hdr)
print "lent = ",len(thdr_)
At this point, thdr_ is an array/list of objects, each of which points to a chunk of HTML (if you will).
I'm trying to figure out if there's a way to get, say, the XPath for the thdr_[x] element/item of the list...
i.e.:
thdr_[0]=//div[3]/table[1]/tr[0]/th
thdr_[1]=//div[3]/table[1]/tr[1]/th
thdr_[2]=//div[3]/table[1]/tr[2]/th
.
.
.
Any thoughts/comments?
Thanks,
-tom
I did this by iterating each node and comparing the textContent with my expected text. For fuzzy comparisons I used the SequenceMatcher class from difflib.
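Roughly like this (a sketch of the described approach; expected stands for whatever text you are looking for, and textContent is the DOM property mentioned above):
from difflib import SequenceMatcher

expected = "Expected header text"
best_node, best_ratio = None, 0.0
for node in thdr_:
    text = (node.textContent or "").strip()
    # ratio() gives a similarity score in [0, 1]; keep the closest match.
    ratio = SequenceMatcher(None, text, expected).ratio()
    if ratio > best_ratio:
        best_node, best_ratio = node, ratio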
