XPath for LXML with Intermediary Element - python

I'm trying to scrape some pages with python and LXML. My test page is http://www.sarpy.com/oldterra/prop/PDisplay3.asp?ParamValue1=010558233
I'm having good luck with most of the XPaths. For example,
tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../tr[3]/td[1]/text()')
successfully gets me the date of the first sale listed. I have several other pieces too. However, I cannot get the B&P listed under the sale date. For example the B&P of the first sale is 200639333.
I notice in the page structure that there is a form element preceding the tr of the B&P item. Since it's the next table row, I tried incrementing the tr index as follows:
tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../tr[4]/td[1]/text()')
That returns:
['\r\n ']
Because of the line breaks and sub element of br and input within the field, I tried making text() into text()[1], text()[2], etc., but no luck.
I tried to base the path off of the adjacent form like this:
tree.xpath('/html/body/table[7]/form[@action="../rod/ImageDisplay.asp"]/following-sibling::tr/td[1]/text()')
No luck.
I figure there are two potential issues: the intermediary form elements that may be breaking the indexing patterns, and the whitespace. I'd appreciate any help in correcting this xpath.

The <tr> you are looking for is a child of the <form>, not its sibling. Try:
tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../form[1]/tr[1]/td[1]/text()')
This may get you 200639333 surrounded by a lot of whitespace.
Or, to match every such element:
tree.xpath('/html/body/table[7]/form[@action="../rod/ImageDisplay.asp"]/tr[1]/td[1]/text()')
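As a minimal sketch of the point above, the snippet below uses a made-up, simplified stand-in for the page's table (the markup and the SNIPPET name are assumptions, not the real page). It is parsed as XML so the structure is exactly as written. It shows that following-sibling::tr finds nothing when the row sits inside the form, and that the stray whitespace can be stripped in Python afterwards.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the page's table (not the real markup).
SNIPPET = """
<table>
  <tr><td>Sales Information</td></tr>
  <tr><td>Sale Date</td></tr>
  <form action="../rod/ImageDisplay.asp">
    <tr>
      <td>
        200639333<br/>
        <input type="submit" value="View"/>
      </td>
    </tr>
  </form>
</table>
""".strip()

table = etree.fromstring(SNIPPET)
form = 'form[@action="../rod/ImageDisplay.asp"]'

# The row is nested inside the form, so no <tr> follows the form as a sibling.
as_sibling = table.xpath(form + '/following-sibling::tr/td[1]/text()')

# Stepping into the form finds the cell; its text nodes carry line breaks.
as_child = table.xpath(form + '/tr[1]/td[1]/text()')

# Drop whitespace-only nodes and trim the rest.
book_page = [t.strip() for t in as_child if t.strip()]

print(as_sibling)
print(book_page)
```

Filtering in Python keeps the XPath simple. Whether the real page nests the rows the same way depends on how the parser handles a form inside a table, so it is worth inspecting the parsed tree first.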

Related

Choose XPATH based on <th> string value with selenium

There is a table that I want to get the XPath of. However, the number of rows and columns is inconsistent across results, so I can't just right-click and copy the full XPath.
My current code:
result_priority_number = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[6]/div/div[2]/table/tbody/tr[18]/td[2]")
The table header names, though, are always consistent. How do I get the value of an element where the table header specifically says something (e.g. "Priority Number")?
I can't just right-click and copy the full XPath.
Never use that method. XPath has very useful search features; it isn't just for nested paths:
//td[contains(text(),'header value')]
Or, if the page has many tables and you want only one of them:
//table[@id='id_of_table']//td[contains(text(),'header value')]
Or, if the table has no id or class:
(//table)[2]//td[contains(text(),'header value')]
where 2 is the index of the table in the page. (Without the parentheses, //table[2] would mean "a table that is the second table child of its parent".)
There are many other features for searching HTML nodes.
In your case, to get the Filing language:
//td[contains(text(),'Filing language')]/following-sibling::td
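The label-then-sibling pattern can be tried without a browser. The sketch below uses lxml on a made-up two-row table (the ROWS markup, the cell values, and the value_for helper are all hypothetical). The same XPath string could be passed to driver.find_element(By.XPATH, ...) in Selenium.

```python
from lxml import etree

# Hypothetical table: each row holds a label cell followed by a value cell.
ROWS = """
<table>
  <tr><td>Filing language</td><td>English</td></tr>
  <tr><td>Priority Number</td><td>12345</td></tr>
</table>
""".strip()

table = etree.fromstring(ROWS)


def value_for(label):
    """Return the text of the cell right after the cell containing `label`."""
    # $label is an XPath variable; lxml fills it from the keyword argument.
    cells = table.xpath(
        '//td[contains(text(), $label)]/following-sibling::td[1]',
        label=label,
    )
    return cells[0].text if cells else None


print(value_for('Priority Number'))
print(value_for('Filing language'))
```

Using following-sibling::td[1] rather than a bare following-sibling::td keeps the match to the single cell next to the label, which matters once rows have more than two cells.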

PYTHON - Unable To Find Xpath Using Selenium

I have been struggling with this for a while now.
I have tried various ways of finding the XPath for the following highlighted HTML.
I am trying to grab the dollar value listed under the highlighted Strong tag.
Here is what my last attempt looks like below:
try:
    price = browser.find_element_by_xpath(".//table[@role='presentation']")
    price.find_element_by_xpath(".//tbody")
    price.find_element_by_xpath(".//tr")
    price.find_element_by_xpath(".//td[@align='right']")
    price.find_element_by_xpath(".//strong")
    print(price.get_attribute("text"))
except:
    print("Unable to find element text")
I attempted to access the table and all nested elements but I am still unable to access the highlighted portion. Using .text and get_attribute('text') also does not work.
Is there another way of accessing the nested element?
Or maybe I am not using XPath properly.
I have also tried the below:
price = browser.find_element_by_xpath("/html/body/div[4]")
UPDATE:
Here is the Full Code of the Site.
The Site I am using here is www.concursolutions.com
I am attempting to automate booking a flight using selenium.
When you reach the end of the process of booking and receive the price I am unable to print out the price based on the HTML.
It may have something to do with the HTML being generated by JavaScript that executes as you proceed.
Looking at the structure of the html, you could use this xpath expression:
//div[@id="gdsfarequote"]/center/table/tbody/tr[14]/td[2]/strong
Making it work
There are a few things keeping your code from working.
price.find_element_by_xpath(...) returns a new element.
Each time, you're not saving it to use with your next query. Thus, when you finally ask it for its text, you're still asking the <table> element—not the <strong> element.
Instead, you'll need to save each found element in order to use it as the scope for the next query:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr")
td = tr.find_element_by_xpath(".//td[@align='right']")
strong = td.find_element_by_xpath(".//strong")
find_element_by_* returns the first matching element.
This means your call to tbody.find_element_by_xpath(".//tr") will return the first <tr> element in the <tbody>.
Instead, it looks like you want the third:
tr = tbody.find_element_by_xpath(".//tr[3]")
Note: XPath is 1-indexed.
get_attribute(...) returns HTML element attributes.
Therefore, get_attribute("text") will return the value of the text attribute on the element.
To return the text content of the element, use element.text:
strong.text
Cleaning it up
But even with the code working, there’s more that can be done to improve it.
You often don't need to specify every intermediate element.
Unless there is some ambiguity that needs to be resolved, you can ignore the <tbody> and <td> elements entirely:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tr = table.find_element_by_xpath(".//tr[3]")
strong = tr.find_element_by_xpath(".//strong")
XPath can be overkill.
If you're just looking for an element by its tag name, you can avoid XPath entirely:
strong = tr.find_element_by_tag_name("strong")
The fare row may change.
Instead of relying on a specific position, you can scope using a text search:
tr = table.find_element_by_xpath(".//tr[contains(., 'Base Fare')]")
Other <table> elements may be added to the page.
If the table had some header text, you could use the same text search approach as with the <tr>.
In this case, it would probably be more meaningful to scope to the #gdsfarequote <div> rather than something as ambiguous as a <table>:
farequote = browser.find_element_by_id("gdsfarequote")
tr = farequote.find_element_by_xpath(".//tr[contains(., 'Base Fare')]")
But even better, capybara-py provides a nice wrapper on top of Selenium, helping to make this even simpler and clearer:
fare_quote = page.find("#gdsfarequote")
base_fare_row = fare_quote.find("tr", text="Base Fare")
base_fare = base_fare_row.find("strong").text
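One caveat with the text-search scoping above: in XPath 1.0, contains(text(), ...) only checks the first text node that is a direct child of the element, while contains(., ...) checks the text of the element and all its descendants. For a <tr>, the label usually sits inside a <td>, so the difference matters. The sketch below shows this offline with lxml on a made-up fare table (the FARE markup and the prices are hypothetical).

```python
from lxml import etree

# Hypothetical stand-in for the fare quote markup.
FARE = """
<div id="gdsfarequote">
  <table role="presentation">
    <tr><td>Taxes</td><td align="right"><strong>$20.00</strong></td></tr>
    <tr><td>Base Fare</td><td align="right"><strong>$100.00</strong></td></tr>
  </table>
</div>
""".strip()

root = etree.fromstring(FARE)

# text() looks only at text nodes directly under <tr>; the label is in a <td>.
direct_text_match = root.xpath('.//tr[contains(text(), "Base Fare")]')

# "." is the string value of the whole row, including nested cells.
whole_row_match = root.xpath('.//tr[contains(., "Base Fare")]')

price = whole_row_match[0].xpath('.//strong/text()')[0]

print(len(direct_text_match), len(whole_row_match), price)
```

The same two expressions behave the same way when passed to Selenium, since the browser evaluates them as XPath 1.0 as well.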

Scraping Text from table using Soup / Xpath / Python

I need help in extracting data from : http://agmart.in/crop.aspx?ccid=1&crpid=1&sortby=QtyHigh-Low
Using the filter, there are about 4 pages of data (Under rice crops) in tables I need to store.
I'm not quite sure how to proceed with it. I've been reading all the documentation I can find. As someone who just started Python, I'm very confused at the moment. Any help is appreciated.
Here's a code snippet I'm basing it on:
Example website : http://www.uscho.com/rankings/d-i-mens-poll/
from urllib2 import urlopen
from lxml import etree
url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())
for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print
I can't seem to understand any of the code above. Only understood that the URL is being read. :(
Thank you for any help!
Just like we have CSS selectors such as .window or #rankings, XPath is used to navigate through elements and attributes in XML.
So in the for loop, you first search for an element called "section", on the condition that it has an attribute id whose value is rankings. But you are not done yet: this section also contains the heading "Final USCHO.com Division I Men's Poll", the date, and extra elements in the table. There is only one such element, so this loop runs only once. Inside it, you extract the text (everything within the tags) of the h1 (heading) and h3 (date).
The next part selects the rows of the table, with a condition on each row's class: it can be even or odd. Because you need all the rows in this table, that condition is not doing anything here.
You could replace the line
for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
with
for row in section.xpath('table/tr'):
Now when we are inside the loop, it will return us each 'td' element - each cell in that row. That's why the last line says row.xpath('td'). When you iterate over them, you'll receive multiple cell elements, e.g. each for 1, Providence, 49, 26-13-2, 997, 15. Check first line in the webpage table.
Try this for yourself. Replace the last loop block with this much easier to read alternative:
for row in section.xpath('table/tr'):
    print row.xpath('td//text()')
You will see that it presents all the table data as Python lists, each list item containing one cell. Your code is just a fancier way of writing these list items out as a string with spaces between them. The xpath() method returns Element objects, which represent each XML/HTML element; xpath('something//text()') instead returns the actual text content within that tag.
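To experiment with the loop without fetching the page, here is a small Python 3 sketch that runs it on an inline snippet. The PAGE markup is a made-up stand-in for the rankings section: the first row reuses the values quoted above, and the second row, heading, and date are placeholders. It is parsed as XML with etree.fromstring so the tree is exactly as written.

```python
from lxml import etree

# Hypothetical stand-in for the rankings section of the page.
PAGE = """
<section id="rankings">
  <h1>Poll heading</h1>
  <h3>Poll date</h3>
  <table>
    <tr class="odd"><td>1</td><td>Providence</td><td>49</td><td>26-13-2</td><td>997</td><td>15</td></tr>
    <tr class="even"><td>2</td><td><a href="#">Example Team</a></td><td>0</td><td>0-0-0</td><td>0</td><td>0</td></tr>
  </table>
</section>
""".strip()

section = etree.fromstring(PAGE)

rows = []
for row in section.xpath('table/tr'):
    # './/text()' gathers text from the cell and anything nested in it,
    # so the team name inside the <a> is picked up as well.
    cells = [''.join(col.xpath('.//text()')) for col in row.xpath('td')]
    rows.append(cells)
    print('%-3s %-20s %10s %10s %10s %10s' % tuple(cells))
```

Collecting each row into a list first makes it easy to inspect the raw cells before deciding how to format or store them.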
Here are a few helpful references:
Easy to understand tutorial :
http://www.w3schools.com/xpath/xpath_examples.asp
Stackoverflow question : Extract text between tags with XPath including markup
Another tutorial : http://www.tutorialspoint.com/xpath/

Removing child node from selector

I'm creating a project in scrapy whereby I scrape (obviously!) particular data from a webpage.
items = sel.xpath('//div[@class="productTiles cf"]/ul').extract()
for item in items:
    price = sel.xpath('//ul/li[@class="productPrice"]/span/span[@class="salePrice"]').extract()
    print price
This will produce the following result:
u'<span class="salePrice">$20.43\xa0<span class="reducedFrom">$40.95</span></span>',
u'<span class="salePrice">$20.93\xa0<span class="reducedFrom">$40.95</span></span>'
What I need to get is just the salePrice, e.g. 20.43 and 20.93 respectively, while ignoring the rest of the other tag and the rest of the data. Any help here would be much appreciated.
Looks like the solution is as follows:
//ul/li[@class="productPrice"]/span/span[@class="salePrice"]//text()
It'll grab just the text of the correct element I'm looking for, like so:
u'$20.43\xa0', u'$20.93\xa0'
Now I can just parse it to remove the unnecessary rubbish at the end, and I'm set. If anyone has a more elegant solution, I'd love to see it.
span[@class="salePrice"] returns the span together with its children.
This should get only the text of the top span:
sel.xpath('//ul/li[@class="productPrice"]/span/span[@class="salePrice"]/text()').extract()[0]
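The difference between /text() and //text(), plus the cleanup step, can be sketched with plain lxml on markup shaped like the output above. The LISTING wrapper is an assumption; only the span contents come from the question. Scrapy selectors accept the same XPath expressions.

```python
from lxml import etree

# Wrapper markup is hypothetical; the span contents mirror the question's output.
LISTING = """
<ul>
  <li class="productPrice"><span><span class="salePrice">$20.43&#160;<span class="reducedFrom">$40.95</span></span></span></li>
  <li class="productPrice"><span><span class="salePrice">$20.93&#160;<span class="reducedFrom">$40.95</span></span></span></li>
</ul>
""".strip()

ul = etree.fromstring(LISTING)
sale = '//li[@class="productPrice"]/span/span[@class="salePrice"]'

# /text(): only text nodes that are direct children of the salePrice span.
own_text = ul.xpath(sale + '/text()')

# //text(): text nodes at any depth, so the reducedFrom price is included too.
all_text = ul.xpath(sale + '//text()')

# str.strip() also removes the non-breaking space (\xa0); then drop the "$".
prices = [t.strip().lstrip('$') for t in own_text]

print(own_text)
print(all_text)
print(prices)
```

Stripping in Python is usually simpler than trying to remove the non-breaking space inside the XPath expression itself.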

trouble getting text from xpath entry in python

I am on the website
http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b
and trying to scrape the data from the tables. When I pull the xpath from one entry, say the pitcher
"Terry Mulholland," I retrieve this:
pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tbody/tr[2]/td/a")
When I try to print pitcher[0].text for pitcher in pitchers, I get [] rather than the text. Any idea why?
The problem is that the last tbody doesn't exist in the original source. If you got that XPath from a browser, keep in mind that browsers can guess and add missing elements to make the HTML valid.
Removing the last tbody resolves the problem.
In : import lxml.html as html
In : site = html.parse("http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b")
In : pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tr[2]/td/a")
In : pitchers[0].text
Out: 'Terry Mulholland'
But I should add that the XPath expression you are using is pretty fragile. One div added in the wrong place and you have a broken script. If possible, try to anchor on better references, such as an id or class that points to your expected location.
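A minimal sketch of both points, using lxml on a made-up fragment (the SOURCE markup and the "events" id are hypothetical, not taken from the site): a path that includes tbody finds nothing when the source has none, and a path anchored on an id with a descendant step works whether or not a tbody is present.

```python
from lxml import etree

# Hypothetical fragment; like the raw page source, the table has no <tbody>.
SOURCE = """
<div id="events">
  <table>
    <tr><th>Pitcher</th></tr>
    <tr><td><a href="#">Terry Mulholland</a></td></tr>
  </table>
</div>
""".strip()

root = etree.fromstring(SOURCE)

# Path copied from a browser: expects a <tbody> that the source does not have.
with_tbody = root.xpath('//div[@id="events"]/table/tbody/tr[2]/td/a')

# Same path with the tbody step removed.
without_tbody = root.xpath('//div[@id="events"]/table/tr[2]/td/a')

# Descendant step (//) tolerates an optional tbody between table and tr.
tolerant = root.xpath('//div[@id="events"]/table//tr[2]/td/a')

print(with_tbody)
print(without_tbody[0].text)
print(tolerant[0].text)
```

Anchoring on an id also means that extra wrapper divs added elsewhere on the page do not break the expression.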
