Choosing a specific XML node using lxml in Python - python

I'm from a VBScript background and new to lxml with Python.
In VBScript, to choose a specific node, I would simply do something like:
Set myNode = xmlDoc.selectSingleNode("/node1/node2/myNode").
What I have done with Python:
from lxml import etree
xmlDoc = etree.parse(fileName)
myNode =
Question: So what should be written in front of myNode to be able to select it?
Preferably without using XPath? Also taking lxml into account

You could use something like:
myNode = xmlDoc.find('node2/myNode')
The etree.parse function will return a root node (ie your node1), so you don't need to use an absolute path.
Example
content = '''
<root>
<div>
<p>content 1</p>
</div>
</root>
'''
from lxml import etree
xmlDoc = etree.fromstring(content)
paragraph_element = xmlDoc.find('div/p')
print(paragraph_element)
Output
<Element p at 0x9f54bc8>
Note:
For my example I have used the function etree.fromstring. This is purely for demonstration purposes, so you can see a workable example using a string. The function etree.parse should generate the same result when working with files rather than strings.
Aside: Why not use XPath? It is extremely powerful!

Related

Incorrect parent element lxml

I am implementing a web scraping program in Python.
Consider my following HTML snippet.
<div>
<b>
<i>
HelloWorld
</i>
HiThere
</b>
</div>
If I wish to use lxml to extract my bold or italicized texts only, I use the following command
tree = etree.fromstring(myhtmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
This gives me the correct result, i.e. the result of my opp1 is :
['HelloWorld', 'HiThere']
So far, everything is perfect. However, the real problem arises if I try to query the parents of the tags. As expected, the output of opp1[0].getparent().tag and opp1[0].getparent().getparent().tag are i and b.
The real problem is however in the second tag. Ideally, the parent of opp[1] should be the b tag. However, the output of opp1[1].getparent().tag and opp1[1].getparent().getparent().tag are i and b again.
You can verify the same in the following code:
from lxml import etree
htmlstr = """<div><b><i>HelloWorld</i>HiThere</b></div>"""
htmlparser = etree.HTMLParser()
tree = etree.fromstring(htmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
print(opp1)
print(opp1[0].getparent(), opp1[0].getparent().getparent())
print(opp1[1].getparent(), opp1[1].getparent().getparent())
Can someone point out why this is the case? What can I do to correct it? I plan to use only lxml for my program, and do not want any solution that uses bs4.
The issue seems to stem from LXML's (and ElementTree's) data model, where an element is roughly "tag, attributes, text, children, tail"; the DOM data model has actual nodes for text too.
If you change your program to do
for x in tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]"):
print(x, x.getparent(), "text?", x.is_text, "tail?", x.is_tail)
it will print
HelloWorld <Element i at 0x10aa0ccd0> text? True tail? False
HiThere <Element i at 0x10aa0ccd0> text? False tail? True
i.e. "HiThere" is the tail of the i element, since that's the way the Etree data model represents intermingled text and tags.
The takeaway here (which should probably work for your use case) is to consider .getparent().getparent() as the effective parent of a text result that has is_tail=True.

How do i parse a xml comment properly in python

i have been using Python recently and i want to extract information from a given xml file. The problem is that the information is really badly stored, in a format like this
<Content>
<tags>
....
</tags>
<![CDATA["string1"; "string2"; ....
]]>
</Content>
I can not post the entire data here, since it is about 20.000 lines.
I just want to recieve the list containing ["string1", "string2", ...] and this is the code i have been using so far:
import xml.etree.ElementTree as ET
tree = ET.parse(xmlfile)
for node in tree.iter('Content'):
print (node.text)
However my output is none. How can i recieve the comment data? (again, I am using Python)
You'll want to create a SAX based parser instead of a DOM based parser. Especially with a document as large as yours.
A sax based parser requires you to write your own control logic in how data is stored. It's more complicated than simply loading it into a DOM, but much faster as it loads line by line and not the entire document at once. Which gives it the advantage that it can deal with squirrely cases like yours with comments.
When you build your handler, you'll probably want to use the LexicalHandler in your parser to pull out those comments.
I'd give you a working example on how to build one, but it's been a long time since I've done it myself. There's plenty of guides on how to build a sax based parser online, and will defer that discussion to another thread.
The problem is that your comment does not seem to be standard. The standard comment is <!--Comment here--> like this.
And these kind of comments can be parsed with Beautifulsoup for example:
from bs4 import BeautifulSoup, Comment
xml = """<Content>
<tags>
...
</tags>
<!--[CDATA["string1"; "string2"; ....]]-->
</Content>"""
soup = BeautifulSoup(xml)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
print(comments)
This returns ['[CDATA["string1"; "string2"; ....]]'] From where it could be easy to parse further into the required strings.
If you have non standard comments, i would recommend a regular expression like:
import re
xml = """<Content>
<tags>
asd
</tags>
<![CDATA["string1"; "string2"; ....]]>
</Content>"""
for i in re.findall("<!.+>",xml):
for j in re.findall('\".+\"', i):
print(j)
This returns: "string1"; "string2"
With Python 3.8 you can insert Comment in Element TREE
A sample code to read attrs, value, tag and comment in XML
import csv, sys
import xml.etree.ElementTree as ET
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True)) # Python 3.8
tree = ET.parse(infile_path, parser)
csvwriter.writerow(TextWorkAdapter.CSV_HEADERS)
COMMENT = ""
TAG =""
NAME=""
# Get the comment nodes
for node in tree.iter():
if "function Comment" in str(node.tag):
COMMENT = node.text
else:
#read tag
TAG = node.tag # string
#read attributes
NAME= node.attrib.get("name") # ID
#Value
VALUE = node.text # value
print(TAG, NAME, VALUE, COMMENT)

Python lxml: how to get human-readable XPath for XML element?

I have a short XML document:
<tag1 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://example.com/2009/namespace">
<tag2>
<tag3/>
<tag3/>
</tag2>
</tag1>
A short Python program loads this XML file like this:
from lxml import etree
f = open( 'myxml.xml' )
tree = etree.parse(f)
MY_NAMESPACE = 'http://example.com/2009/namespace'
xpath = etree.XPath( '/f:tag1/f:tag2/f:tag3', namespaces = { 'f': MY_NAMESPACE } )
# get first element that matches xpath
elem = xpath(tree)[0]
# get xpath for an element
print tree.getpath(elem)
I am expecting to get a meaningful, human-readable xpath with this code, however, instead I get a string like /*/*/*[1].
Any idea what could be causing this and how I can diagnose this issue?
Note: Using Python 2.7.9 and lxml 2.3
It looks like getpath() (underlying libxml2 call xmlGetNodePath) produces positional expression xpath for namespaced documents.
User mzjn in the comments section pointed out that since lxml v3.4.0 a function getelementpath() produces a human-readable xpath with fully qualified tag names (using "Clark notation"). This function generates xpath by traversing the tree from the node up to the root instead of using libxml2 API call.
Similarly, if lxml v3.4+ is not available one can write a tree traversal function of their own.

How to prevent xml.ElementTree fromstring from dropping commentnode

I have tho following code fragment:
from xml.etree.ElementTree import fromstring,tostring
mathml = fromstring(input)
for elem in mathml.getiterator():
elem.tag = 'm:' + elem.tag
return tostring(mathml)
When i input the following input:
<math>
<a> 1 2 3 </a> <b />
<foo>Uitleg</foo>
<!-- <bar> -->
</math>
It results in:
<m:math>
<m:a> 1 2 3 </m:a> <m:b />
<m:foo>Uitleg</m:foo>
</m:math>
How come? And how can I preserve the comment?
edit: I don't care for the exact xml library used, however, I should be able to do the pasted change to the tags. Unfortunately, lxml does not seem to allow this (and I cannot use proper namespace operations)
You cannot with xml.etree, because its parser ignores comments (which is acceptable behaviour for an xml parser by the way). But you can if you use the (compatible) lxml library, which allows you to configure parser options.
from lxml import etree
parser = etree.XMLParser(remove_comments=False)
tree = etree.parse('input.xml', parser=parser)
# or alternatively set the parser as default:
# etree.set_default_parser(parser)
This would by far be the easiest option. If you really have to use xml.etree, you could try hooking up your own parser, although even then, comments are not officially supported: have a look at this example (from the author of xml.etree) (still seems to work in python 2.7 by the way)

Setting value for a node in XML document in Python

I have a XML document "abc.xml":
I need to write a function replace(name, newvalue) which can replace the value node with tag 'name' with the new value and write it back to the disk. Is this possible in python? How should I do this?
import xml.dom.minidom
filename='abc.xml'
doc = xml.dom.minidom.parse(filename)
print doc.toxml()
c = doc.getElementsByTagName("c")
print c[0].toxml()
c[0].childNodes[0].nodeValue = 'zip'
print doc.toxml()
def replace(tagname, newvalue):
'''doc is global, first occurrence of tagname gets it!'''
doc.getElementsByTagName(tagname)[0].childNodes[0].nodeValue = newvalue
replace('c', 'zit')
print doc.toxml()
See minidom primer and API Reference.
# cat abc.xml
<root>
<a>
<c>zap</c>
</a>
<b>
</b>
</root>
Sure it is possible.
The xml.etree.ElementTree module will help you with parsing XML, finding tags and replacing values.
If you know a little bit more about the XML file you want to change, you can probably make the task a bit easier than if you need to write a generic function that will handle any XML file.
If you are already familiar with DOM parsing, there's a xml.dom package to use instead of the ElementTree one.

Categories

Resources