Getting element by undefined tag name - python

I'm parsing an XML document in Python using minidom.
I have an element:
<informationRequirement>
    <requiredDecision href="id"/>
</informationRequirement>
The only thing I need is the value of href in the subelement, but the subelement's tag name can vary (for example, requiredKnowledge instead of requiredDecision; it will always begin with required).
If the tag were always the same, I would use something like:
element.getElementsByTagName('requiredDecision')[0].attributes['href'].value
But that's not the case here. What can I use instead, given that the tag name varies?
(There will always be exactly one subelement.)

If you're always guaranteed to have one subelement, just grab that element:
element.childNodes[0].attributes['href'].value
However, this is brittle. A (perhaps) better approach could be:
hrefs = []
for child in element.childNodes:
    # Skip text/whitespace nodes, which have no tagName
    if child.nodeType == child.ELEMENT_NODE and child.tagName.startswith('required'):
        hrefs.append(child.attributes['href'].value)
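For reference, here is a minimal self-contained version of that approach (a sketch using xml.dom.minidom and the sample XML from the question):

from xml.dom.minidom import parseString

doc = parseString(
    '<informationRequirement>'
    '<requiredDecision href="id"/>'
    '</informationRequirement>'
)
element = doc.documentElement
hrefs = [
    child.attributes['href'].value
    for child in element.childNodes
    if child.nodeType == child.ELEMENT_NODE
    and child.tagName.startswith('required')
]
print(hrefs)  # ['id']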


Why does adding 3 elements make findall not work?

For parsing information from this URL: http://py4e-data.dr-chuck.net/comments_42.xml
import ssl
import urllib.request
import xml.etree.ElementTree as ET

ctx = ssl.create_default_context()  # ctx was undefined in the original snippet
url = "http://py4e-data.dr-chuck.net/comments_42.xml"
fhandle = urllib.request.urlopen(url, context=ctx)
string_data = fhandle.read()
xml = ET.fromstring(string_data)
Why does
lst = xml.findall("./commentinfo/comments/comment")
not put anything into lst, while
lst = xml.findall("comments/comment")
creates a list of elements.
Thanks!
Element.findall uses a subset of the XPath specification (see "XPath support" in the ElementTree docs), evaluated relative to the element you are referencing. When you loaded the document, you referenced the root element <commentinfo>. The XPath comments/comment selects all of that element's child elements named "comments", then selects all of their children named "comment".
./comments/comment is identical to comments/comment. "." is the current node (<commentinfo>) and the following "/comments" selects its child nodes as above.
./commentinfo/comments/comment is the same as commentinfo/comments/comment. It's easy to see the issue: since you are already on the <commentinfo> node, there aren't any child elements also named "commentinfo". Some XPath processors would let you reference from the root of the tree, as in //commentinfo/comments/comment, but ElementTree doesn't do that.
'.' in the XPath already means the top-level element, here <commentinfo>. So your path is looking for a <commentinfo> child of that, which doesn't exist.
You can see this by cross-referencing the example in the documentation with its corresponding XML. Notice how none of the example XPaths mention the root element, data.
You want just './comments/comment'.
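To see the difference concretely, here is a minimal sketch with a stub document shaped like the feed (an assumption: the real file has many <comment> elements, this uses one):

import xml.etree.ElementTree as ET

xml = ET.fromstring(
    "<commentinfo><comments>"
    "<comment><name>A</name><count>97</count></comment>"
    "</comments></commentinfo>"
)
print(len(xml.findall("./commentinfo/comments/comment")))  # 0 -- already at <commentinfo>
print(len(xml.findall("comments/comment")))                # 1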

PYTHON - Unable To Find Xpath Using Selenium

I have been struggling with this for a while now.
I have tried various ways of finding the XPath for the following highlighted HTML.
I am trying to grab the dollar value listed under the highlighted <strong> tag.
Here is what my last attempt looks like below:
try:
    price = browser.find_element_by_xpath(".//table[@role='presentation']")
    price.find_element_by_xpath(".//tbody")
    price.find_element_by_xpath(".//tr")
    price.find_element_by_xpath(".//td[@align='right']")
    price.find_element_by_xpath(".//strong")
    print(price.get_attribute("text"))
except:
    print("Unable to find element text")
I attempted to access the table and all nested elements but I am still unable to access the highlighted portion. Using .text and get_attribute('text') also does not work.
Is there another way of accessing the nested element?
Or maybe I am not using XPath as it properly should be.
I have also tried the below:
price = browser.find_element_by_xpath("/html/body/div[4]")
UPDATE:
Here is the Full Code of the Site.
The Site I am using here is www.concursolutions.com
I am attempting to automate booking a flight using selenium.
When you reach the end of the process of booking and receive the price I am unable to print out the price based on the HTML.
It may have something to do with the HTML being generated by JavaScript as you proceed.
Looking at the structure of the html, you could use this xpath expression:
//div[@id="gdsfarequote"]/center/table/tbody/tr[14]/td[2]/strong
Making it work
There are a few things keeping your code from working.
price.find_element_by_xpath(...) returns a new element.
Each time, you're not saving it to use with your next query. Thus, when you finally ask it for its text, you're still asking the <table> element—not the <strong> element.
Instead, you'll need to save each found element in order to use it as the scope for the next query:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr")
td = tr.find_element_by_xpath(".//td[@align='right']")
strong = td.find_element_by_xpath(".//strong")
find_element_by_* returns the first matching element.
This means your call to tbody.find_element_by_xpath(".//tr") will return the first <tr> element in the <tbody>.
Instead, it looks like you want the third:
tr = tbody.find_element_by_xpath(".//tr[3]")
Note: XPath is 1-indexed.
get_attribute(...) returns HTML element attributes.
Therefore, get_attribute("text") will return the value of the text attribute on the element.
To return the text content of the element, use element.text:
strong.text
Cleaning it up
But even with the code working, there’s more that can be done to improve it.
You often don't need to specify every intermediate element.
Unless there is some ambiguity that needs to be resolved, you can ignore the <tbody> and <td> elements entirely:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tr = table.find_element_by_xpath(".//tr[3]")
strong = tr.find_element_by_xpath(".//strong")
XPath can be overkill.
If you're just looking for an element by its tag name, you can avoid XPath entirely:
strong = tr.find_element_by_tag_name("strong")
The fare row may change.
Instead of relying on a specific position, you can scope using a text search:
tr = table.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
Other <table> elements may be added to the page.
If the table had some header text, you could use the same text search approach as with the <tr>.
In this case, it would probably be more meaningful to scope to the #gdsfarequote <div> rather than something as ambiguous as a <table>:
farequote = browser.find_element_by_id("gdsfarequote")
tr = farequote.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
But even better, capybara-py provides a nice wrapper on top of Selenium, helping to make this even simpler and clearer:
fare_quote = page.find("#gdsfarequote")
base_fare_row = fare_quote.find("tr", text="Base Fare")
base_fare = base_fare_row.find("strong").text
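As a side note: newer Selenium releases (4.x) removed the find_element_by_* helpers used above in favor of find_element(By..., ...). A sketch of the final approach in that style, assuming the same page structure:

from selenium.webdriver.common.by import By

farequote = browser.find_element(By.ID, "gdsfarequote")
tr = farequote.find_element(By.XPATH, ".//tr[contains(text(), 'Base Fare')]")
print(tr.find_element(By.TAG_NAME, "strong").text)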

Getting XML attribute value with lxml module

How can I get the value of an attribute in an XML file with the lxml module?
My XML looks like this:
<process>
    <name>somename</name>
    <statistics>
        <stats param='someparam'>
            <value>0.456</value>
            <real_value>0.4</real_value>
        </stats>
        <stats ...>
        .
        .
        .
        </stats>
    </statistics>
</process>
I want to get the value 0.456 from the <value> element. I'm iterating through the elements and getting the text, but I'm not sure this is the best way to do it:
for attribute in root.iter('statistics'):
    for stats in attribute:
        for param_value in stats.iter('value'):
            value = param_value.text
Is there an easier way of doing this? Something like stats.get_value('value')?
Use XPath:
root.find('.//value').text
This gets you the content of the first value tag.
If you want to iterate over all value elements, use findall; it returns a list of all matching elements.
If you only want the value elements inside <stats param='someparam'> elements, make the path more specific:
root.findall("./statistics/stats[@param='someparam']/value")
Edit: Note that find/findall only support a subset of XPath. If you want to use full XPath 1.0 functionality, use the xpath method.
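Putting it together, a small self-contained sketch (using the corrected XML from the question):

from lxml import etree

root = etree.fromstring("""
<process>
    <name>somename</name>
    <statistics>
        <stats param='someparam'>
            <value>0.456</value>
            <real_value>0.4</real_value>
        </stats>
    </statistics>
</process>""")

print(root.find('.//value').text)  # 0.456
# Full XPath via the xpath() method returns a list of matches:
print(root.xpath("./statistics/stats[@param='someparam']/value/text()"))  # ['0.456']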

Extract text between tags with XPath including markup

I have the following piece of XML:
...<span class="st">In Tim <em>Power</em>: Politieman...</span>...
I want to extract the part between the <span> tags.
For this I use XPath:
/span[@class="st"]
This however will extract everything, including the <span> itself, and
/span[@class="st"]/text()
will return a list of two text elements: one containing "In Tim", the other ": Politieman...". The <em>..</em> is not included and is handled like a separator.
Is there a pure XPath solution which returns:
In Tim <em>Power</em>: Politieman...
EDIT
Thanks to @helderdarocha and @TextGeek. It seems non-trivial to extract the plain text, including the <em>, with XPath alone.
The /span[@class="st"]/node() solution creates a list containing the individual nodes, from which it is trivial in Python to build a string.
To get any child node you can use:
/span[@class="st"]/node()
This will return:
Two child text nodes
The full <em> node (element and contents).
If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:
/span[@class="st"]//text()
or
/span[@class="st"]/descendant::text()
This will return three text nodes, including the text inside <em>, but not the <em> element itself.
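Following up on the edit above, a sketch of the Python side with lxml: join the node() results, serializing element nodes back to markup (the sample span is from the question):

from lxml import etree

doc = etree.fromstring(
    '<div><span class="st">In Tim <em>Power</em>: Politieman...</span></div>'
)
span = doc.xpath('//span[@class="st"]')[0]
parts = []
for node in span.xpath('node()'):
    if isinstance(node, str):  # a text node
        parts.append(str(node))
    else:                      # an element node; its tail text is returned separately
        parts.append(etree.tostring(node, encoding='unicode', with_tail=False))
print(''.join(parts))  # In Tim <em>Power</em>: Politieman...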
Sounds like you want the equivalent of the JavaScript DOM innerHTML property, but for XML. I don't think there's a way to do that in pure XPath.
XPath doesn't really operate on markup strings like "<em>" and "</em>" at all -- it works with a tree of Node objects (there might possibly be an XPath implementation that tries to work directly off markup, but I doubt it). Most XPath implementations wouldn't even have the 4 characters "<em>" anywhere (except maybe kept around for printing error messages or something), and of course the DOM could have been built from scratch rather than from XML or other input in the first place. Likewise, XPath doesn't really figure on handing back marked-up strings, but lists of nodes.
In XSLT or XQuery you can do this easily, but not in XPath by itself, unless I'm missing something.

lxml bug in .xpath?

After going through the XPath-in-lxml tutorial for Python, I'm finding it hard to understand two behaviors that seem like bugs to me. First, lxml seems to return a list even when my XPath expression clearly selects only one element. Second, .xpath seems to return the elements' parent rather than the elements themselves selected by a straightforward XPath expression.
Is my understanding of XPath all wrong or does lxml indeed have a bug?
The script to replicate the behaviors I'm talking about:
from lxml.html.soupparser import fromstring
doc = fromstring("""
<html>
<head></head>
<body>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</body>
</html>
""")
print doc.xpath("//html")
#[<Element html at 1f385e0>]
#(This makes sense - return a list of all possible matches for html)
print doc.xpath("//html[1]")
#[<Element html at 1f385e0>]
#(This doesn't make sense - why do I get a list when there
#can clearly only be 1 element returned?)
print doc.xpath("body")
#[<Element body at 1d003e8>]
#(This doesn't make sense - according to
#http://www.w3schools.com/xpath/xpath_syntax.asp if I use a tag name
#without any leading / I should get the *child* nodes of the named
#node, which in this case would mean I get a list of
#p tags [<Element p at ...>, <Element p at ...>])
It's because the context node of doc is the 'html' node. When you use doc.xpath('body'), it selects the child element 'body' of 'html'. This conforms to the XPath 1.0 standard.
To get all p tags, use doc.findall(".//p").
As per the guide, the expression nodename selects all child nodes of the named node. Thus, a bare nodename (without a leading /) is evaluated relative to the currently selected node (to refer to the current node itself, use a dot).
In fact, doc.xpath("//html[1]") can return more than one node for an input document different from your example. That path picks every element matching //html that is the first such element among its siblings; if matching elements exist under different parents, it selects the first from each group.
XPath: (//html)[1] forces a different order of evaluation. It selects all of the matching elements in the document and then chooses the first.
But, in any case, it's a better API design to always return a list. Otherwise, code would always have to test for single or None values before processing the list.
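To make the sibling-position point concrete, here's a small sketch with a document that does have matches under different parents (not the HTML from the question):

from lxml import etree

doc = etree.fromstring(
    "<root><group><item/></group><group><item/></group></root>"
)
# //item[1]: the first <item> within each parent -- two nodes here.
print(len(doc.xpath("//item[1]")))    # 2
# (//item)[1]: all <item> nodes first, then the first of those -- one node.
print(len(doc.xpath("(//item)[1]")))  # 1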
