I'm very new to this area so I'm sure it's just something obvious. I'm trying to change a python script so that it finds a node in a different way but I get an "invalid predicate" error.
import xml.etree.ElementTree as ET
tree = ET.parse("/tmp/failing.xml")
doc = tree.getroot()
thingy = doc.find(".//File/Diag[#id='53']")
print(thingy.attrib)
thingy = doc.find(".//File/Diag[BaseName = 'HTTPHeaders']")
print(thingy.attrib)
That should find the same node twice but the second find gets the error. Here is an extract of the XML:
<Diag id="53">
<Formatted>xyz</Formatted>
<BaseName>HTTPHeaders</BaseName>
<Column>17</Column>
I hope I've not cut it down too much. Basically, finding it with "#id" works but I want to search on that BaseName tag instead.
Actually, I want to search on a combination of tags so I have a more complicated expression lined up but I can't get the simple one to work!
The code in the question works when using Python 3.7. If the spaces before and after the equals sign in the predicate are removed, it also works with earlier Python versions.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders']")
See https://bugs.python.org/issue31648.
xml file snapshot
From above .xml file I am extracting article-id, article-title, abstract and keywords. For normal text inside single tag getting correct results. But text with multiple tags such as:
<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>
.
.
same is for abstract...
I got output as:
OrderedDict([(u'italic**', u'Rapidithrix thailandica'), ('#text', u'Acetylcholines terase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Ba cterium,')])
code has considered tag as a text and the o/p generated is also not in the sequence.
How to simply extract text from such input document as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica".
I am using below python code to perform above task..
import xmltodict
import os
from os.path import basename
import re
with open('2630847.nxml') as fd:
doc = xmltodict.parse(fd.read())
pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii','ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii','ignore') for g in z]
article_keywords = ",".join(zz).replace(","," ")
fout = open(str(pmc_id)+".txt","w")
fout.write(str(pmc_id)+"\n"+str(article_title)+". "+str(article_abstract)+". "+str(article_keywords))
Can somebody please suggest corrections..
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom, minidom, or recent versions of BeautifulSoup. In many such libraries you just load the document with one call and then call some function like innerText() to get all the text content of it. You could even just load the document into a browser and call the Javascript innerText() function to get what you want. If the tool you choose doesn't provide innertext() already, it is:
def innertext(node):
t = ""
for curNode in node.childNodes:
if (isinstance(curNode, Text)):
t += curNode.nodeValue
elif (isinstance(curNode, Element)):
t += curNode.innerText
return(t)
You could tweak that to put spaces between the text nodes, depending on your data.
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict):
<and>
<many>elements</many>
<many>more elements</many>
</and>
becomes:
"and": {
"many": [
"elements",
"more elements"
]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.
I can't seem to find a way to generate a sub-element like this
<child attr="something"></child>
Using the following code:
myChild = ElementTree.SubElement(root, tag="child", attrib={'attr': 'something'})
I always get:
<child attr="something" />
Unless I at least add:
whiteSpace = " "
myChild.text = whiteSpace
This is very annoying.
Is there a way that I can generate null text for the element with ElementTree?
UPDATES:
After some tries, I tend to agree that it really shouldn't matter. The reason why I asked is that I wanted to generate Xcode workspace file which uses empty content for it's project nodes. But I found that the default subElement actually works as well. So I won't put any more efforts in making the output XML in the "identical" format as the normal Xcode workspace.
Case closed.
The answer to the question is probably: There is no way to achieve what I wanted in the beginning.
But as the updates in my question said. The difference between the two mentioned formats really doesn't matter. They both work as valid XML.
This is one of my first forays into Python. I'd normally stick with bash, however Minidom seems to perfectly suite my needs for XML parsing, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree
tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
if name.text is not None and 'cluster' in name.text:
continue # skip this one
quick question... i can create/parse a chunk of html using libxml2dom, etc...
however, is there a way to somehow display the xpath used to generate/extract the html chunk.. i'm assuming that there's some method/way of doing this that i can't find..
ex:
import libxml2dom
d = libxml2dom.parseString(s, html=1)
##
hdr="//div[3]/table[1]/tr/th"
thdr_ = d.xpath(hdr)
print "lent = ",len(thdr_)
at this point, thdr_ is an array/list of objects.. each of which points to a chunk of html (if you will)
i'm trying to figure out if there's a way to get, say, the xpath for say, the thdr_[x] element/item of the list...
ie:
thdr_[0]=//div[3]/table[1]/tr[0]/th
thdr_[1]=//div[3]/table[1]/tr[1]/th
thdr_[2]=//div[3]/table[1]/tr[2]/th
.
.
.
any thoughts/comments..
thanks
-tom
I did this by iterating each node and comparing the textContent with my expected text. For fuzzy comparisons I used the SequenceMatcher class from difflib.