I'm trying to get the links from a page with xpath. The problem is that I only want the links inside a table, but if I apply the xpath expression on the whole page I'll capture links which I don't want.
For example:
import lxml.html

tree = lxml.html.parse(some_response)
links = tree.xpath("//a[contains(@href, 'http://www.example.com/filter/')]")
The problem is that this applies the expression to the whole document. I located the element I want, for example:
tree = lxml.html.parse(some_response)
root = tree.getroot()
table = root[1][5] #for example
links = table.xpath("//a[contains(@href, 'http://www.example.com/filter/')]")
But that seems to be performing the query over the whole document as well, as I am still capturing the links outside of the table. This page says that "When xpath() is used on an Element, the XPath expression is evaluated against the element (if relative) or against the root tree (if absolute)". So, what I am using is an absolute expression and I need to make it relative? Is that it?
Basically, how can I go about filtering only elements that exist inside of this table?
Your xpath starts with a slash (/) and is therefore absolute. Add a dot (.) in front to make it relative to the current element, i.e.
links = table.xpath(".//a[contains(@href, 'http://www.example.com/filter/')]")
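A minimal end-to-end sketch, reusing the placeholder names from the question (some_response and the root[1][5] table position are the question's examples, not real values):

import lxml.html

# Parse the page and locate the table (index taken from the question's example)
tree = lxml.html.parse(some_response)
table = tree.getroot()[1][5]

# ".//a" is evaluated relative to `table`; a leading "//" would search the whole document
links = table.xpath(".//a[contains(@href, 'http://www.example.com/filter/')]")
for link in links:
    print(link.get("href"))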
Another option would be to ask directly for elements inside your table.
For instance:
tree = lxml.html.parse(some_response)
links = tree.xpath("//table[**criteria**]//a[contains(@href, 'http://www.example.com/filter/')]")
Where **criteria** is necessary if there are many tables in the page. Some possible criteria would be to filter based on the table id or class. For instance:
links = tree.xpath("//table[@id='my_table_id']//a[contains(@href, 'http://www.example.com/filter/')]")
I am trying to select all table elements from a div parent node by using a customized function.
This is what I've got so far:
from bs4 import BeautifulSoup
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'

def getTables(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')  # the lxml parser must be installed
    div_component = soup.find('div', attrs={'class': 'td-post-content'})
    tables = div_component.find_all('table', attrs={'class': 'listas'})
    return tables
However, when called as getTables(url), the output is an empty list: [].
I expect this function to return all HTML table elements inside the div node, given its specific attributes.
How could I adjust this function?
Is there any other library I could use to accomplish this task?
Taking what the other commenters have said, and expanding on it.
Your div_component lookup returns just one element, and that element doesn't contain any tables, but using find_all() yields 8 such divs:
len(soup.find_all('div', attrs={'class':'td-post-content'}))
So you can't just use find(); find_all() gives you a list, and you'll need to iterate through it to find a div that contains tables.
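A rough sketch of that iteration, using the class names from the question:

from bs4 import BeautifulSoup
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Collect the tables from every td-post-content div, not just the first one
tables = []
for div in soup.find_all('div', attrs={'class': 'td-post-content'}):
    tables.extend(div.find_all('table', attrs={'class': 'listas'}))

print(len(tables))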
Another way is to go straight after the tables you want; you can just use
tables = soup.find_all('table', attrs={'class':'listas'})
where tables is a list with 6 elements. If you know which table you want, you can iterate through the tables until you find the one you want.
The first problem is that find() finds only the first such match, and the first td-post-content <div> does not contain any tables; I think you want find_all(). Second, you can use CSS selectors with BeautifulSoup, so you can search with soup.select('div.td-post-content') without using the attrs parameter.
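For instance, a small sketch using select() with CSS selectors (same page and class names as in the question):

from bs4 import BeautifulSoup
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# CSS selectors: every table.listas that sits inside a div.td-post-content
tables = soup.select('div.td-post-content table.listas')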
For parsing information from this url: http://py4e-data.dr-chuck.net/comments_42.xml
url = "http://py4e-data.dr-chuck.net/comments_42.xml"
fhandle = urllib.request.urlopen(url, context=ctx)
string_data = fhandle.read()
xml = ET.fromstring(string_data)
Why does
lst = xml.findall("./commentinfo/comments/comment")
not put anything into lst, while
lst = xml.findall("comments/comment")
creates a list of elements?
Thanks!
Element.findall uses a subset of the XPath specification (see "XPath support" in the ElementTree docs), evaluated relative to the element you are referencing. When you loaded the document, you referenced the root element <commentinfo>. The XPath comments/comment selects all of that element's child elements named "comments", then selects all of their children named "comment".
./comments/comment is identical to comments/comment. "." is the current node (<commentinfo>), and the following "/comments" selects its child nodes as above.
./commentinfo/comments/comment is the same as commentinfo/comments/comment, and it's easy to see the issue: since you are already on the <commentinfo> node, there aren't any child elements also named "commentinfo". Some XPath processors would let you reference from the root of the tree, as in //commentinfo/comments/comment, but ElementTree doesn't do that.
'.' in the XPath already means the top-level element, here <commentinfo>. So your path is looking for a <commentinfo> child of that, which doesn't exist.
You can see this by cross-referencing the example from the documentation with the corresponding XML. Notice how none of the example XPaths mention the root element, data.
You want just './comments/comment'.
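A short sketch of the working query; the name and count child elements are assumptions about this feed's layout, so adjust them to whatever each <comment> actually contains:

import urllib.request
import xml.etree.ElementTree as ET

url = "http://py4e-data.dr-chuck.net/comments_42.xml"
xml = ET.fromstring(urllib.request.urlopen(url).read())

# The path is relative to the <commentinfo> root, so "commentinfo" is not repeated in it
lst = xml.findall('./comments/comment')
print(len(lst))

# Hypothetical child elements; adjust to the names actually present in each <comment>
for comment in lst:
    print(comment.findtext('name'), comment.findtext('count'))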
On this website https://classicdb.ch/?quest=788
I tried:
driver.find_element_by_xpath(
"//div[contains(text(), 'Start')]").text
It finds the element and it returns
'Start: Kaltunk'
However, when I try to find the element that contains "End", it doesn't find anything.
driver.find_element_by_xpath(
"//div[contains(text(), 'End')]").text
Why is this?
Thank you.
Try with the below xpath.
//table//div[contains(.,'End:')]
Explanation:
First of all, let's see how many text() nodes are present under the target div.
The div has 3 text() nodes.
Let me elaborate on the original xpath used by the OP.
//div[contains(text(), 'End')]
^div present anywhere in the document
^which contains
^only the **first** text() node, checked for the string 'End'
When a node-set such as text() is passed to contains() as its first argument, it is converted to a string using only the first text() node's value, but 'End' appears in the second text() node, not the first. That's the reason why the xpath did not work.
We have 2 options to handle this.
1) Using text() in a predicate of its own - this selects all text() nodes under the current context and then applies contains() as a condition to each one, so it matches any text() node whose value contains 'End', as shown below.
//div[text()[contains(., 'End')]]
^div present anywhere in the document
^which has a text() node
^that contains 'End'
By this time you may be wondering why the first xpath used by the OP (//div[contains(text(), 'Start')]) worked.
If you look at the text() nodes in the div, the 'Start' text is present in the 1st text() node itself; that's why that xpath worked.
2) Using . to check the current node's context - In simple terms, when you say ., the entire string value of the current element is checked for 'End'.
//div[contains(.,'End')]
If you don't limit the scope to //table (at the beginning of the xpath), you will get 5 divs, because the ancestor divs of the original div contain this text as well and are also matched by the xpath. So limit the scope to check within the table, like
//table//div[contains(.,'End')]
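Putting that together in Selenium, a sketch along the lines of the OP's code (the printed text depends on the page):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://classicdb.ch/?quest=788")

# Scoping to //table keeps the match from also hitting the ancestor divs that contain the same text
end_div = driver.find_element_by_xpath("//table//div[contains(., 'End:')]")
print(end_div.text)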
I have been struggling with this for a while now.
I have tried various ways of finding the xpath for the following highlighted HTML.
I am trying to grab the dollar value listed under the highlighted Strong tag.
Here is what my last attempt looks like below:
try:
    price = browser.find_element_by_xpath(".//table[@role='presentation']")
    price.find_element_by_xpath(".//tbody")
    price.find_element_by_xpath(".//tr")
    price.find_element_by_xpath(".//td[@align='right']")
    price.find_element_by_xpath(".//strong")
    print(price.get_attribute("text"))
except:
    print("Unable to find element text")
I attempted to access the table and all nested elements but I am still unable to access the highlighted portion. Using .text and get_attribute('text') also does not work.
Is there another way of accessing the nested element?
Or maybe I am not using XPath properly.
I have also tried the below:
price = browser.find_element_by_xpath("/html/body/div[4]")
UPDATE:
Here is the Full Code of the Site.
The Site I am using here is www.concursolutions.com
I am attempting to automate booking a flight using selenium.
When I reach the end of the booking process and the price is displayed, I am unable to print out the price based on the HTML.
It may have something to do with the HTML being generated by JavaScript as you proceed.
Looking at the structure of the html, you could use this xpath expression:
//div[#id="gdsfarequote"]/center/table/tbody/tr[14]/td[2]/strong
Making it work
There are a few things keeping your code from working.
price.find_element_by_xpath(...) returns a new element.
Each time, you're not saving it to use with your next query. Thus, when you finally ask it for its text, you're still asking the <table> element—not the <strong> element.
Instead, you'll need to save each found element in order to use it as the scope for the next query:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr")
td = tr.find_element_by_xpath(".//td[@align='right']")
strong = td.find_element_by_xpath(".//strong")
find_element_by_* returns the first matching element.
This means your call to tbody.find_element_by_xpath(".//tr") will return the first <tr> element in the <tbody>.
Instead, it looks like you want the third:
tr = tbody.find_element_by_xpath(".//tr[3]")
Note: XPath is 1-indexed.
get_attribute(...) returns HTML element attributes.
Therefore, get_attribute("text") will return the value of the text attribute on the element.
To return the text content of the element, use element.text:
strong.text
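Putting those three fixes together, a sketch using the browser object and table structure from the question:

table = browser.find_element_by_xpath(".//table[@role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr[3]")   # XPath is 1-indexed
td = tr.find_element_by_xpath(".//td[@align='right']")
strong = td.find_element_by_xpath(".//strong")
print(strong.text)                             # .text, not get_attribute("text")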
Cleaning it up
But even with the code working, there’s more that can be done to improve it.
You often don't need to specify every intermediate element.
Unless there is some ambiguity that needs to be resolved, you can ignore the <tbody> and <td> elements entirely:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tr = table.find_element_by_xpath(".//tr[3]")
strong = tr.find_element_by_xpath(".//strong")
XPath can be overkill.
If you're just looking for an element by its tag name, you can avoid XPath entirely:
strong = tr.find_element_by_tag_name("strong")
The fare row may change.
Instead of relying on a specific position, you can scope using a text search:
tr = table.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
Other <table> elements may be added to the page.
If the table had some header text, you could use the same text search approach as with the <tr>.
In this case, it would probably be more meaningful to scope to the #gdsfarequote <div> rather than something as ambiguous as a <table>:
farequote = browser.find_element_by_id("gdsfarequote")
tr = farequote.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
But even better, capybara-py provides a nice wrapper on top of Selenium, helping to make this even simpler and clearer:
fare_quote = page.find("#gdsfarequote")
base_fare_row = fare_quote.find("tr", text="Base Fare")
base_fare = base_fare_row.find("strong").text
I am using Scrapy along with XPath. In a scenario, I need to get the anchor element's href and text.
What I did is:
Get all the anchors from the container using a selector
Loop through the anchors to find the href and text. I am able to get the href but not the text.
Here is the snippet to understand better
anchors = response.selector.xpath("//table[@class='style1']//ul//li//a")
for anchor in anchors:
    link = anchor.xpath('@href').extract()[0]
    name = anchor.xpath('[how-to-access-current-node-here]').text()
How can i achieve this?
Thanks in advance!
You can use xpath text(), provided that you know where the desired text is relative to the a. Let's say, from your sample, that the text is within the a's parent element; then extracting it is just a matter of going one level back, like this:
anchors = response.selector.xpath("//table[@class='style1']//ul//li//a")
for anchor in anchors:
    link = anchor.xpath('@href').extract()[0]
    # go one level back and access text()
    name = anchor.xpath('../text()').extract()
Or, better still, you don't even need to do this in a for loop; just use extract() and it will return a list:
anchors = response.selector.xpath("//table[@class='style1']//ul//li//a")
links = anchors.xpath('@href').extract()
names = anchors.xpath('../text()').extract()
paired_links_with_names = zip(links, names)
...
# you may do your thing here or still use a for loop
Of course, you need to inspect the elements and find out where the desired text actually is; this only shows how you access that text from your existing xpath location.
Hope this helps.