I'm parsing information from this URL: http://py4e-data.dr-chuck.net/comments_42.xml
import ssl
import urllib.request
import xml.etree.ElementTree as ET

ctx = ssl.create_default_context()  # SSL context passed to urlopen below

url = "http://py4e-data.dr-chuck.net/comments_42.xml"
fhandle = urllib.request.urlopen(url, context=ctx)
string_data = fhandle.read()
xml = ET.fromstring(string_data)
Why does
lst = xml.findall("./commentinfo/comments/comment")
not put anything into lst, while
lst = xml.findall("comments/comment")
creates a list of elements?
Thanks!
Element.findall uses a subset of the XPath specification (see XPath support), evaluated relative to the element you call it on. When you loaded the document, you got a reference to the root element, <commentinfo>. The XPath comments/comment selects all of that element's child elements named "comments", then selects all of their children named "comment".
./comments/comment is identical to comments/comment. "." is the current node (<commentinfo>), and the following "/comments" selects its child nodes as above.
./commentinfo/comments/comment is the same as commentinfo/comments/comment. It's easy to see the issue: since you are already on the <commentinfo> node, there aren't any child elements also named "commentinfo". Some XPath processors would let you reference from the root of the tree, as in //commentinfo/comments/comment, but ElementTree doesn't do that.
'.' in the XPath already means the top-level element, here <commentinfo>. So your path is looking for a <commentinfo> child of that, which doesn't exist.
You can see this by cross-referencing the example from the documentation with the corresponding XML. Notice how none of the example XPaths mention <data>, the root element of that document.
You want just './comments/comment'.
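For reference, a minimal sketch of the corrected lookup (assuming the file keeps the usual <commentinfo><comments><comment> structure of the py4e sample data, where each <comment> holds a <name> and a <count>):

lst = xml.findall("./comments/comment")
print("Count:", len(lst))
for item in lst:
    # each <comment> carries a <name> and a <count> child in the sample file
    print(item.find("name").text, item.find("count").text)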
Related
I'm parsing an xml document in Python using minidom.
I have an element:
<informationRequirement>
<requiredDecision href="id"/>
</informationRequirement>
The only thing I need is the value of href in the subelement, but its tag name can differ (for example requiredKnowledge instead of requiredDecision; it always begins with required).
If the tag was always the same I would use something like:
element.getElementsByTagName('requiredDecision')[0].attributes['href'].value
But that's not the case. What can substitute for this, given that the tag name varies?
(there will be always one subelement)
If you're always guaranteed to have one subelement, just grab that element:
element.childNodes[0].attributes['href'].value
However, this is brittle. A (perhaps) better approach could be:
hrefs = []
for child in element.childNodes:
    # skip text nodes (e.g. whitespace between elements), which have no tagName
    if child.nodeType == child.ELEMENT_NODE and child.tagName.startswith('required'):
        hrefs.append(child.attributes['href'].value)
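A quick way to try it, as a minimal sketch using the snippet from the question (parseString comes from xml.dom.minidom):

from xml.dom.minidom import parseString

doc = parseString(
    '<informationRequirement><requiredDecision href="id"/></informationRequirement>'
)
element = doc.documentElement
# collect the href value from any element child whose tag starts with 'required'
hrefs = [child.attributes['href'].value
         for child in element.childNodes
         if child.nodeType == child.ELEMENT_NODE
         and child.tagName.startswith('required')]
print(hrefs)  # ['id']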
I would like to search through child elements for specific attributes using BeautifulSoup. From what I can see using the method below, each child is a string (child['value'] gives me "string indices must be integers"), which does not allow selection based on attributes or returning those attributes, which is incidentally what I need to do.
def get_value(container):
    html_file = open(html_path)
    html = html_file.read()
    soup = BeautifulSoup(html)
    values = {}
    container = soup.find(attrs={"name": container})
    if container.contents != []:
        for child in container.children:
            # i would like to be able to search through these children based on
            # their attributes, and return one or more of their values
            value = unicode(child['value'])
    return value
I could probably get around this with a further child_soup = BeautifulSoup(child) and then a find command, but this seems really horrible. Has anyone got a better solution?
container.children is a generator that yields the element's direct children, both Tag objects and NavigableString objects; the strings are what produce "string indices must be integers". Skip the strings and you can operate on the Tags normally.
You also might want to try element.find_all(..., recursive=False) in order to look for an element's direct children with some traits.
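A minimal sketch of both ideas, assuming bs4 and an html string like the one read in the question (container_name stands in for the question's container argument):

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "html.parser")
container = soup.find(attrs={"name": container_name})

# filter the children down to Tags that carry a 'value' attribute
values = [child["value"] for child in container.children
          if isinstance(child, Tag) and child.has_attr("value")]

# or ask directly for the container's children that have a 'value' attribute
tags = container.find_all(attrs={"value": True}, recursive=False)
values = [tag["value"] for tag in tags]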
After going through the XPath section of the lxml tutorial for Python, I'm finding it hard to understand two behaviors that seem like bugs to me. Firstly, lxml seems to return a list even when my XPath expression clearly selects only one element, and secondly, .xpath seems to return the elements' parent rather than the elements themselves selected by a straightforward XPath search expression.
Is my understanding of XPath all wrong or does lxml indeed have a bug?
The script to replicate the behaviors I'm talking about:
from lxml.html.soupparser import fromstring
doc = fromstring("""
<html>
<head></head>
<body>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</body>
</html>
""")
print(doc.xpath("//html"))
# [<Element html at 1f385e0>]
# (This makes sense - return a list of all possible matches for html)

print(doc.xpath("//html[1]"))
# [<Element html at 1f385e0>]
# (This doesn't make sense - why do I get a list when there
# can clearly only be 1 element returned?)

print(doc.xpath("body"))
# [<Element body at 1d003e8>]
# (This doesn't make sense - according to
# http://www.w3schools.com/xpath/xpath_syntax.asp, if I use a tag name
# without any leading / I should get the *child* nodes of the named
# node, which in this case would mean I get a list of
# p tags: [<Element p at ...>, <Element p at ...>])
It's because the context node of doc is the html element. When you use doc.xpath('body'), it selects the body child element of html. This conforms to the XPath 1.0 standard.
To get all the p tags, use doc.findall(".//p").
As per the guide, the expression nodename selects all child nodes of the named node.
Thus, to use a bare nodename (without a leading /), you must have a named node selected as the context node (to refer to the current node itself, use a dot).
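A minimal sketch of the context-node behavior, reusing the doc from the question:

print(doc.xpath("body"))    # [<Element body ...>] - the body child of the context node <html>
print(doc.xpath("body/p"))  # the <p> children of <body>, still relative to <html>
print(doc.findall(".//p"))  # all <p> descendants, starting from the current node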
In fact, doc.xpath("//html[1]") can return more than one node with a different input document from your example. That path picks, within each group of matching siblings, the first one; if there are matching elements under different parents, it selects the first from each sibling group.
The XPath (//html)[1] forces a different order of evaluation: it selects all of the matching elements in the document and then chooses the first.
But, in any case, it's better API design to always return a list. Otherwise, code would always have to test for single or None values before processing the list.
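A minimal sketch of the difference, using a made-up document with two sibling groups of <p> elements:

from lxml import etree

doc = etree.fromstring("<root><a><p>1</p></a><a><p>2</p></a></root>")
print(doc.xpath("//p[1]"))    # two elements: each <p> is the first <p> among its siblings
print(doc.xpath("(//p)[1]"))  # one element: the first <p> in the whole document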
I'm trying to get the links from a page with XPath. The problem is that I only want the links inside a table, but if I apply the XPath expression to the whole page I capture links I don't want.
For example:
tree = lxml.html.parse(some_response)
links = tree.xpath("//a[contains(#href, 'http://www.example.com/filter/')]")
The problem is that this applies the expression to the whole document. I located the element I want, for example:
tree = lxml.html.parse(some_response)
root = tree.getroot()
table = root[1][5] #for example
links = table.xpath("//a[contains(@href, 'http://www.example.com/filter/')]")
But that seems to be performing the query on the whole document as well, as I am still capturing the links outside of the table. This page says that "When xpath() is used on an Element, the XPath expression is evaluated against the element (if relative) or against the root tree (if absolute)". So, what I'm using is an absolute expression and I need to make it relative? Is that it?
Basically, how can I go about filtering only elements that exist inside of this table?
Your XPath starts with a slash (/) and is therefore absolute. Add a dot (.) in front to make it relative to the current element, i.e.
links = table.xpath(".//a[contains(@href, 'http://www.example.com/filter/')]")
Another option would be to ask directly for elements inside your table.
For instance:
tree = lxml.html.parse(some_response)
links = tree.xpath("//table[**criteria**]//a[contains(#href, 'http://www.example.com/filter/')]")
Where **criteria** is necessary if there are many tables in the page. Some possible criteria would be to filter based on the table id or class. For instance:
links = tree.xpath("//table[#id='my_table_id']//a[contains(#href, 'http://www.example.com/filter/')]")
I am trying to remove comments from a list of elements that were obtained using lxml.
The best I have been able to do is:
no_comments = [element for element in element_list if 'HtmlComment' not in str(type(element))]
I am wondering if there is a more direct way?
I am going to add something based on Matthew's answer - he got me almost there. The problem is that when the elements are taken out of the tree, the comments lose some identity (I don't know how to describe it), so that it cannot be determined whether they are HtmlComment class objects using the isinstance() method.
However, that method can be used while the elements are being iterated through on the tree:
from lxml.html import HtmlComment

no_comments = [element for element in root.iter() if not isinstance(element, HtmlComment)]
For those novices like me: root is the base html element that holds all of the other elements in the tree. There are a number of ways to get it; one is to open the file and parse it, so instead of root.iter() in the above you can use:
html.fromstring(open(r'c:\temp\testlxml.htm').read()).iter()
You can cut out the string comparison:
from lxml.html import HtmlComment  # or similar

no_comments = [element for element in element_list if not isinstance(element, HtmlComment)]
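Another option, if you are iterating the tree anyway: lxml accepts the Element factory as a tag filter to .iter(), which, as I understand it, yields only real elements and skips comments and processing instructions. A minimal sketch:

from lxml import etree
import lxml.html

root = lxml.html.fromstring("<div><!-- a comment --><p>text</p></div>")
# iterate only over elements; comments and processing instructions are skipped
no_comments = [element for element in root.iter(etree.Element)]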