lxml bug in .xpath? - python

After going through the XPath tutorial for lxml in Python, I'm finding it hard to understand two behaviours that seem like bugs to me. Firstly, lxml seems to return a list even when my XPath expression clearly selects only one element, and secondly .xpath seems to return the elements' parent rather than the elements themselves selected by a straightforward XPath expression.
Is my understanding of XPath all wrong or does lxml indeed have a bug?
The script to replicate the behaviors I'm talking about:
from lxml.html.soupparser import fromstring
doc = fromstring("""
<html>
<head></head>
<body>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</body>
</html>
""")
print doc.xpath("//html")
#[<Element html at 1f385e0>]
#(This makes sense - return a list of all possible matches for html)
print doc.xpath("//html[1]")
#[<Element html at 1f385e0>]
#(This doesn't make sense - why do I get a list when there
#can clearly only be 1 element returned?)
print doc.xpath("body")
#[<Element body at 1d003e8>]
#(This doesn't make sense - according to
#http://www.w3schools.com/xpath/xpath_syntax.asp if I use a tag name
#without any leading / I should get the *child* nodes of the named
#node, which in this case would mean I get a list of
#p tags [<Element p at ...>, <Element p at ...>])

That's because the context node of doc is the 'html' node. When you use doc.xpath('body'), it selects the child element 'body' of 'html'. This conforms to the XPath 1.0 standard.

To get all the p tags, use doc.findall(".//p").
As per the guide, the expression nodename selects all child nodes of the named node.
Thus, to use a bare nodename (without a leading /), you must have the named node selected as the context node (to make the current node the context explicitly, use a leading dot).
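Continuing the snippet from the question, a minimal sketch of how to reach the p elements from the html context node (the element addresses will differ):
# "body" is relative to the <html> context node, so these reach the <p> children:
print(doc.xpath("body/p"))    # [<Element p at ...>, <Element p at ...>]
print(doc.xpath(".//p"))      # same two <p> elements, at any depth below the context node
print(doc.findall(".//p"))    # ElementTree-style equivalent of the line above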

In fact doc.xpath("//html[1]") can return more than one node for an input document different from your example. That path selects every html element that is the first html element among its siblings, so if matching elements occur under different parents, you get one for each parent.
The XPath (//html)[1] forces a different order of evaluation: it selects all of the matching elements in the document and then takes the first one.
But in any case it is better API design to always return a list; otherwise calling code would always have to test for single elements or None before processing the result.
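A minimal sketch of the difference, using a made-up two-container document (well-formed XML so it can be fed to etree.fromstring directly):
from lxml import etree

root = etree.fromstring("<root><div><p>a</p><p>b</p></div><div><p>c</p><p>d</p></div></root>")
print([p.text for p in root.xpath("//p[1]")])    # ['a', 'c'] - the first <p> within each parent <div>
print([p.text for p in root.xpath("(//p)[1]")])  # ['a'] - the single first <p> in document order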

Related

Why does adding 3 elements make findall not work?

I am parsing information from this URL: http://py4e-data.dr-chuck.net/comments_42.xml
import ssl
import urllib.request
import xml.etree.ElementTree as ET
url = "http://py4e-data.dr-chuck.net/comments_42.xml"
ctx = ssl.create_default_context()  # the original defines ctx elsewhere; a default context works here
fhandle = urllib.request.urlopen(url, context=ctx)
string_data = fhandle.read()
xml = ET.fromstring(string_data)
Why does
lst = xml.findall("./commentinfo/comments/comment")
not put anything into lst, while
lst = xml.findall("comments/comment")
creates a list of elements.
Thanks!
Element.findall uses a subset of the XPath specification (see XPath support), evaluated relative to the element you call it on. When you loaded the document, you got a reference to the root element <commentinfo>. The XPath comments/comment selects all of that element's child elements named "comments" and then all of their children named "comment".
./comments/comment is identical to comments/comment. "." is the current node (<commentinfo>) and the following "/comments" selects its child nodes as above.
./commentinfo/comments/comment is the same as commentinfo/comments/comment. It's easy to see the issue: since you are already on the <commentinfo> node, there aren't any child elements also named "commentinfo". Some XPath processors would let you reference from the root of the tree, as in //commentinfo/comments/comment, but ElementTree doesn't do that.
'.' in the XPath refers to the element you call findall on, which here is the root element <commentinfo>. So your path is looking for a <commentinfo> child of <commentinfo>, which doesn't exist.
You can see this by cross-referencing the example from the documentation with the corresponding XML. Notice how none of the example XPaths mention data.
You want just './comments/comment'.
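A minimal sketch with an inline document of the same shape (the fields are trimmed down from the real feed):
import xml.etree.ElementTree as ET

data = ("<commentinfo><comments>"
        "<comment><count>97</count></comment>"
        "<comment><count>90</count></comment>"
        "</comments></commentinfo>")
root = ET.fromstring(data)
print(len(root.findall("./commentinfo/comments/comment")))  # 0 - no <commentinfo> child of the root
print(len(root.findall("./comments/comment")))              # 2 - path is relative to the root <commentinfo>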

lxml etree iteration inside xpath searches

I've been trying to retrieve information from tree nodes. I got the nodes from XPath searches, but when I run XPath searches again inside those nodes, the search goes back to the root. Is it possible to iterate over nodes with specific classes and retrieve different information inside them?
An example of the HTML I'm parsing:
<li>
<div class="product-preview">
<div class="product-image">
<div class="product-info">
</li>
I need to find those product-preview nodes, which I'm already getting using
tree.xpath('//div[contains(@class, "product-preview")]')
but when I try to get the different subnodes by iterating over the results of the previous search, the search always runs against the whole document.
What I am trying to do is
for element in tree.xpath('//div[contains(@class, "product-preview")]'):
    element.xpath("new search")
How should I iterate over the elements to make new searches inside them?
Thank you very much.
I've resolved the issue using the find_class method from lxml.html; it returns the matching nodes directly, and after that I can use xpath to search inside each of them.
for element in tree.find_class(class_name):
    value = element.xpath(xpath)[0].get(attrib)
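For reference, the xpath-only approach also works once the inner searches are made relative with a leading dot (a sketch; page_source and the product-image/product-info lookups are illustrative, not taken from the question):
from lxml import html

tree = html.fromstring(page_source)  # page_source: the HTML shown in the question
for preview in tree.xpath('//div[contains(@class, "product-preview")]'):
    # ".//" keeps the search inside this node; a bare "//" would search the whole document again
    images = preview.xpath('.//div[contains(@class, "product-image")]')
    info = preview.xpath('.//div[contains(@class, "product-info")]')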

Navigating in html by xpath in python

So, I am accessing some URL whose content is formatted something like the following:
<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
...
As you can see, it starts a document (which has sequence number 1), then finishes that document, then the document with sequence 2 starts, and so on.
What I want to do is write an XPath expression in Python that gets just the document with sequence value 1 (or, equivalently, TYPE A).
I assumed something like this would work:
import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
however, it just gives me an empty list for type_a.
Could someone please let me know what my mistake is in this code? I am really new to this XML stuff.
It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the markup up to the next </DOCUMENT>, meaning it does not end up containing just the 1. When your XPath then looks for a <SEQUENCE> containing 1, there isn't one.
Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.
Try //DOCUMENT[starts-with(SEQUENCE,'1')]. That's based on XPath's starts-with function.
Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
I'd like to point out that, apart from the great answer provided by @GKFX, the lxml.html module is capable of parsing broken HTML or HTML fragments. In fact it will parse your string just fine and handle it well.
fromstring(string): Returns document_fromstring or
fragment_fromstring, based on whether the string looks like a full
document, or just a fragment.
The problem you have (perhaps caused by whatever other code generates the string) also lies in the fact that you haven't given the true path to the SEQUENCE node.
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
The XPath above tries to find all document nodes with a direct child node called sequence whose value is 1; however, your document's first child node is type, and sequence ends up nested inside it, so you will never get what you want.
Consider rewriting it to this, which will get what you need:
page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']
Since your HTML string is missing the closing tag for sequence, you cannot, however, get the correct result with another XPath like this:
page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']
That is because your sequence=1 element has no closing tag, so sequence=2 becomes a child node of it.
I have to point out that your HTML string is still invalid, but lxml's tolerant parser can handle your case just fine.
Try using a relative path, explicitly specifying the correct path to your element (not skipping type):
page.xpath("//document[./type/sequence = 1]")
See: http://pastebin.com/ezQXtKcr
Output:
Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]
Currently, the XPath I provided is the only one that does what was asked: to just get the document with sequence value 1.
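A self-contained version of that comparison (a sketch: pagehtml stands for the markup shown in the question, and the exact output depends on the surrounding whitespace):
from lxml import html

page = html.fromstring(pagehtml)   # pagehtml: the two <DOCUMENT> blocks shown above
# The HTML parser lowercases tag names and nests the unclosed <TYPE>/<SEQUENCE> tags,
# so the path has to go through type to reach sequence.
docs = page.xpath("//document[./type/sequence = 1]")
if docs:
    print([t.strip() for t in docs[0].xpath(".//text()") if t.strip()])  # e.g. ['A', '1']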

Iterating Over Elements and Sub Elements With lxml

This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content (content is the class) tags on a website. Once in a div.content tag, I want to see if there are any <a> tags that are the children of <h3> elements. This seems relatively simple by just trying to create a list using XPath from the div.cont tag, i.e.,
linkList = tree.xpath('div[contains(@class, "cont")]//h3//a')
The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.
lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
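For reference, a sketch of the manual workaround hinted at here (iter() filters on tag name only, so the class check is done by hand); the answer below suggests a cleaner route:
for div in tree.iter('div'):
    # iter() cannot filter on attributes, so test the class attribute ourselves
    if 'cont' in (div.get('class') or ''):
        links = div.xpath('.//h3//a')
        if not links:
            continue
        paragraph_text = div.findtext('p')   # text of the first <p> child, if any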
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3>The link I want to obtain</h3>
</div>
</body>
There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
You could just use a less specific XPath expression:
for matchingdiv in tree.xpath('div[contains(@class, "cont")]'):
    # skip those without a h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue
    # grab the `p` text and of course the link.
You could expand this (be ambitious) and select for the h3 > a tags, then go to the div.cont ancestor (based off XPath query with descendant and descendant text() predicates):
for matchingdiv in tree.xpath('.//h3//a/ancestor::*[self::div[contains(@class, "cont")]]'):
    # no need to skip anymore, this is a div.cont with an h3 > a contained
    link = matchingdiv.xpath('.//h3//a')
    # grab the `p` text and of course the link
but since you need to then scan for the link anyway that doesn't actually buy you anything.
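Putting it together, a sketch of the (paragraph text, link) pairing the question asks for (the href attribute and the first-link choice are assumptions, not from the original):
results = []
for matchingdiv in tree.xpath('div[contains(@class, "cont")]'):
    links = matchingdiv.xpath('.//h3//a')
    if not links:
        continue
    paragraph = matchingdiv.findtext('p', default='')
    # one (paragraph text, link href) tuple per qualifying div.cont
    results.append((paragraph, links[0].get('href')))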

xpath selecting elements and iterating over the tag

Consider a tag in my HTML like this:
<div class ="summary">
<p>Best <a class="abch" href="/canvas">canvas</a> abcdefgh <a class="zph" href="/canvas">canvas</a>, I cycle them to garden</p>
</div>
When I do
site.select('.//*[contains(@class, "summary")]/p/text()').extract()
I get only the text of p and the hyperlinks are lost.
I want to extract the data of the <a> tags as well as the textual data of the <p> (e.g. canvas above). There can be any number of <a> tags inside the <p> element; they may or may not be present.
Any idea how to extract the entire data?
I think two slashes after p will work for you. One slash / selects children only; two slashes // include deeper descendants. Since the text nodes under a are not direct children of p, they are not selected.
site.select('.//*[contains(@class, "summary")]/p//text()').extract()
Update:
Answering your comment: I can only think of doing it this way:
for p in site.select('.//*[contains(@class, "summary")]/p'):
    p.select('.//text()').extract()
When this XPath expression is evaluated:
string(.//*[contains(@class, "summary")]/p)
the result is a string that is the concatenation (in document order) of all of the text-node descendants of p.
I guess that this is what you want.
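The same distinction shown with plain lxml, using a trimmed version of the snippet from the question (the printed lists are illustrative; exact whitespace depends on the markup):
from lxml import html

snippet = html.fromstring('<div class="summary"><p>Best <a class="abch" href="/canvas">canvas</a> abcdefgh</p></div>')
print(snippet.xpath('//*[contains(@class, "summary")]/p/text()'))   # ['Best ', ' abcdefgh'] - direct child text only
print(snippet.xpath('//*[contains(@class, "summary")]/p//text()'))  # ['Best ', 'canvas', ' abcdefgh'] - includes the text inside <a>
print(snippet.xpath('string(//*[contains(@class, "summary")]/p)'))  # 'Best canvas abcdefgh' - one concatenated string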
