I've written a Python script to find the text within the td tag that is the next sibling of the first td tag, using BeautifulSoup in combination with CSS selectors. When I run the script, it works. However, when I do the same using the lxml library, it no longer works. How can I get the latter script working? Thanks.
This is the content:
html_content="""
<tr>
<td width="25%" valign="top" bgcolor="lightgrey" nowrap="">
<font face="Arial" size="-1" color="224119">
<b>Owner Address </b>
</font>
</td>
<td width="75%" valign="top" nowrap="">
<font face="Arial" size="-1" color="black">
1698 EIDER DOWN DR<br>SUMMERVILLE SC 29483
</font>
</td>
</tr>
"""
Working one with bs4:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content,"lxml")
item = soup.select("td")[0].find_next_sibling().text
print(item)
Result:
1698 EIDER DOWN DRSUMMERVILLE SC 29483
The below script can find the address string:
from lxml.html import fromstring
root = fromstring(html_content)
item = root.cssselect("td b:contains('Address')")[0].text
print(item)
Result:
Owner Address
It doesn't work when it comes to finding the next sibling (I applied the "+" sign to find the next sibling):
from lxml.html import fromstring
root = fromstring(html_content)
item = root.cssselect("td b:contains('Owner Address')+td")[0].text
print(item)
Result:
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\new_line_one.py", line 28, in <module>
item = root.cssselect("td b:contains('Owner Address')+td")[0].text
IndexError: list index out of range
How can I make it find the next sibling? By the way, I'm only after CSS selectors, not XPath. Thanks.
From the css3 selector docs:
8.3.1. Adjacent sibling combinator
The adjacent sibling combinator is made of the "plus sign" (U+002B, +)
character that separates two sequences of simple selectors. The
elements represented by the two sequences share the same parent in the
document tree and the element represented by the first sequence
immediately precedes the element represented by the second one.
This means that with your selector td b:contains('Owner Address')+td, you're asking for a td that has the same parent as the b which contains 'Owner Address' and is itself a child of another td. This node does not exist. To make it work, you need to make sure that your first partial selector matches the td, not the b node. Since the td contains the b (and therefore its text), the following would work:
td:contains('Owner Address') + td
Note that this td has no text of its own (only child nodes), so your snippet from above prints only whitespace. To get the address, call text_content() on the matched element instead of reading .text.
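The "same parent" rule can be illustrated without lxml. The sketch below is only an illustration, not the selector engine itself: it uses the standard library's xml.etree.ElementTree as a stand-in, works on a simplified, well-formed copy of the row (the br is closed and most attributes are dropped, both assumptions made so an XML parser accepts it), and the helper name next_sibling_text is made up for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the row from the question
# (assumption: <br/> is closed so the XML parser accepts it).
ROW = """
<tr>
  <td><font><b>Owner Address </b></font></td>
  <td><font>1698 EIDER DOWN DR<br/>SUMMERVILLE SC 29483</font></td>
</tr>
"""


def next_sibling_text(row, label):
    """Return the text of the cell that directly follows the cell containing label."""
    cells = list(row)  # direct children of <tr>: these are the siblings
    for i, cell in enumerate(cells[:-1]):
        # The label lives in a nested <b>, so look at the cell's full text.
        if label in "".join(cell.itertext()):
            # Collect the text of all descendants and normalise whitespace.
            return " ".join("".join(cells[i + 1].itertext()).split())
    return None


row = ET.fromstring(ROW)
address = next_sibling_text(row, "Owner Address")
print(address)
```

The match is made on the td (a child of tr), not on the nested b, which is why a following sibling exists at all. Joining the descendant text plays the same role here as text_content() does in lxml.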
xpath returns empty list for the following queries.
Need to fetch UrlOne1, UrlOne2, DataOne1, DataOne, DataOne2
<table>
<thead></thead>
<tbody class="dataContainer">
<tr class="tableLight">
<td>DataOne1</td>
<td> <span class="badge"></span> <span class="long">DataOne</span> <span class="short">DataOne</span> </td>
<td class="hide-s"><span class="ClassOneCN"></span> <span class="ClassOne2">DataOne2</span></td></tr>
<tr class="tableLight">
<tr class="tableLight">
<tr class="tableLight">
The following queries return an empty list []:
response.xpath('//*[@class="dataContainer"]/a/@href')
response.xpath('//*[@class="tableLight"]')
response.xpath('//*[local-name() = "tr" and class="tableLight"]')
but the query below works fine and returns a non-empty result:
response.xpath('//*[@class="dataContainer"]')
For the first XPath, //*[@class="dataContainer"]/a/@href:
// is the descendant-or-self axis, whereas / selects a direct child of the current node. In this case a isn't a direct child, so you need to use //:
//*[@class="dataContainer"]//a/@href
The second path, //*[@class="tableLight"], should work, but if you know it's a tr tag, use it:
//tr[@class="tableLight"]
And for the third XPath, //*[local-name() = "tr" and class="tableLight"], class is an attribute, so you need to use @class (but I would suggest using the XPath above instead):
//*[local-name() = "tr" and @class="tableLight"]
As for what you need (UrlOne1, UrlOne2, DataOne1, DataOne, DataOne2), you could get the a elements with response.xpath('//tr[@class="tableLight"]//a') and then retrieve the href attribute or text of each a element.
Or directly get the href attributes and text:
//tr[@class="tableLight"]//a/@href
//tr[@class="tableLight"]//a//text()
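The difference between / and // can be checked on a small document. This is a minimal sketch, not the Scrapy call from the question: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath), and the placement of the a elements inside the cells is an assumption, since the posted HTML does not show them.

```python
import xml.etree.ElementTree as ET

# Assumed structure: the question's snippet does not show where the <a>
# elements sit, so they are placed inside the cells for illustration.
TABLE = """
<table>
  <tbody class="dataContainer">
    <tr class="tableLight">
      <td><a href="UrlOne1">DataOne1</a></td>
      <td><a href="UrlOne2">DataOne2</a></td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(TABLE)

# "/" only looks at direct children of the matched element: no <a> there.
direct = root.findall(".//*[@class='dataContainer']/a")

# "//" searches all descendants, so the nested <a> elements are found.
nested = root.findall(".//*[@class='dataContainer']//a")

hrefs = [a.get("href") for a in nested]
texts = [a.text for a in nested]
print(direct, hrefs, texts)
```

ElementTree has no /@href step, so the attribute is read with get() after selecting the elements.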
Given this code ("sleep" instances used to help display what's going on):
from splinter import Browser
import time
with Browser() as browser:
    # Visit URL
    url = "https://mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx"
    browser.visit(url)
    browser.fill('ctl00$ContentPlaceHolder1$txtCredentialNumber', 'IF0000000262422')
    # Find and click the 'search' button
    button = browser.find_by_name('ctl00$ContentPlaceHolder1$btnSearch')
    # Interact with elements
    button.first.click()
    time.sleep(5)
    # Only click the link next to "Professional Teaching Certificate Renewal"
    certificate_link = browser.find_by_xpath("//td[. = 'Professional Teaching Certificate Renewal']/following-sibling::td/a")
    certificate_link.first.click()
    time.sleep(10)
I am now trying to get the values from the table that shows after this code runs. I am not well-versed in xpath commands, but based on the response to this question, I have tried these, to no avail:
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/a")
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/[1]")
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/[2]")
I tried [2] because I do notice a colon (:) sibling character between "Name" and the cell containing the name. I just want the string value of the name itself (and all other values in the table).
I do notice a different structure (span is used within td instead of just td) in this case (I also tried td span[. ='Name']... but no dice):
Updated to show more detail
<tr>
<td>
<span class="MOECSBold">Name</span>
</td>
<td>:</td>
<td>
<span id="ContentPlaceHolder1_lblName" class="MOECSNormal">MICHAEL WILLIAM LANCE </span>
</td>
</tr>
This ended up working:
browser.find_by_xpath("//td[span='Name']/following-sibling::td")[1].value
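The index [1] is needed because the first following sibling is the cell holding the colon. The sketch below shows this on the posted row. It is an illustration only: it uses the standard library's xml.etree.ElementTree instead of splinter, wraps the row in a table element (an assumption, to have a root), and emulates following-sibling by slicing the parent's children, because ElementTree does not support that axis.

```python
import xml.etree.ElementTree as ET

# The row from the question, wrapped in <table> so there is a single root.
TABLE = """
<table>
  <tr>
    <td><span class="MOECSBold">Name</span></td>
    <td>:</td>
    <td><span id="ContentPlaceHolder1_lblName" class="MOECSNormal">MICHAEL WILLIAM LANCE </span></td>
  </tr>
</table>
"""

root = ET.fromstring(TABLE)

# td[span='Name']: a cell with a child <span> whose text is exactly "Name".
label_cell = root.find(".//td[span='Name']")
row = root.find(".//td[span='Name']/..")  # ".." steps up to the parent <tr>

# Emulate following-sibling::td by taking the cells after the label cell.
cells = list(row)
following = cells[cells.index(label_cell) + 1:]
values = ["".join(cell.itertext()).strip() for cell in following]
print(values)
```

Index 0 of the result is the ":" cell and index 1 is the name, which matches the [1] used in the splinter call.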
I want to recover a number that is located in the following table:
the site
<table class="table table-hover table-inx">
<tbody><tr>
</tr>
<tr>
</tr>
<tr>
</tr>
<tr>
<td class=""><label for="RentNet">Miete (netto)</label></td>
<td>478,28 €</td>
</tr>
<tr>
</tr>
<tr>
</tr>
<tr>
<td class=""><label for="Rooms">Zimmer</label></td>
<td>4</td>
</tr>
</tbody></table>
I suppose this strange format happens because the table entries are optional. I get to the table with driver.find_element_by_css_selector("table.table.table-hover") and I see how one could easily iterate through the <tr> tags. But how do I find the second <td> holding the data, in the <tr> with the <label for="Rooms"> ?
Is there a more elegant way than "find the only td field with a one-digit number" or load the detail page?
This similar question didn't help me, because there the tag in question has an id
EDIT:
I just found out about a very helpful cheat sheet for Xpath/CSS selectors posted in an answer to a related question: it contains ways to reference child/parent, next table entry etc
You can select the appropriate td tag using driver.find_element_by_xpath(). The XPath expression that you should use is as follows:
'//label[@for="Rooms"]/parent::td/following-sibling::td'
This selects the label tag with for attribute equal to Rooms, then navigates to its parent td element, then navigates to the following td element.
So your code will be:
elem = driver.find_element_by_xpath(
    '//label[@for="Rooms"]/parent::td/following-sibling::td')
An example of the XPath expression in action is here.
With xpath, you can create a search for an element that contains another element, like so:
elem = driver.find_element_by_xpath('//tr[./td/label[@for="Rooms"]]/td[2]')
The elem variable will now hold the second td element within the "Rooms" label row (which is what you were looking for). You could also assign the tr element to the variable, and then work with all of the data in the row since you know the cell structure (if you would like to work with the label and data).
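The "row that contains this label, then its second cell" idea can be tried offline. This is a rough sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than Selenium, runs on a shortened copy of the posted table, and value_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the table from the question, including empty rows.
TABLE = """
<table class="table table-hover table-inx">
  <tbody>
    <tr></tr>
    <tr>
      <td class=""><label for="RentNet">Miete (netto)</label></td>
      <td>478,28 €</td>
    </tr>
    <tr></tr>
    <tr>
      <td class=""><label for="Rooms">Zimmer</label></td>
      <td>4</td>
    </tr>
  </tbody>
</table>
"""


def value_for(root, label_for):
    """Return the text of the second cell in the row whose label has for=label_for."""
    for tr in root.iter("tr"):
        # Same condition as the predicate [./td/label[@for="..."]].
        if tr.find("td/label[@for='%s']" % label_for) is not None:
            return tr.findall("td")[1].text  # td[2] in XPath is index 1 here
    return None


root = ET.fromstring(TABLE)
rooms = value_for(root, "Rooms")
rent = value_for(root, "RentNet")
print(rooms, rent)
```

Because the lookup is keyed on the label's for attribute, empty or missing optional rows do not shift the result the way a positional tr[7] would.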
Have you tried xpath? Firebug is a great tool for copying xpaths. It will use indices to select the element you want. It's especially useful when your element has no name or ID.
Edit: not sure why I was down voted? I went on the site and found the XPath Firebug gave me:
/html/body/div[2]/div[7]/div[2]/div[3]/div/div[1]/div/div[3]/div[3]/div/table/tbody/tr[7]/td[2]
To get that 4, just:
xpath = "/html/body/div[2]/div[7]/div[2]/div[3]/div/div[1]/div/div[3]/div[3]/div/table/tbody/tr[7]/td[2]"
elem = driver.find_element_by_xpath(xpath)
print(elem.text)  # prints '4'
And to get all the elements for "rooms", you can simply driver.find_elements_by_xpath using partial xpath, so like this:
xpath = "//div/div[1]/div/div[3]/div[3]/div/table/tbody/tr[7]/td[2]"
elems = driver.find_elements_by_xpath(xpath)  # returns list
for elem in elems:
    print(elem.text)  # prints '3', '3', '4'
Finally, you might be able to get the data with page source.
First, let's make a function that outputs a list of rooms when we input the page source:
def get_rooms(html):
    rooms = list()
    partials = html.split('''<label for="Rooms">''')[1:]
    for partial in partials:
        partial = partial.split("<td>")[1]
        room = partial.split("</td>")[0]
        rooms.append(room)
    return rooms
Once we have that function defined, we can retrieve the list of room numbers by:
html = driver.page_source
print(get_rooms(html))
It should output:
["3", "3", "4"]
I am trying to extract the information from a link from a page that is structured as such:
...
<td align="left" bgcolor="#FFFFFF">$725,000</td>
<td align="left" bgcolor="#FFFFFF"> Available</td>
*<td align="left" bgcolor="#FFFFFF">
<a href="/washington">
Washington Street Studios
<br>1410 Washington Street SW<br>Albany, Oregon, 97321
</a>
</td>*
<td align="center" bgcolor="#FFFFFF">15</td>
<td align="center" bgcolor="#FFFFFF">8.49%</td>
<td align="center" bgcolor="#FFFFFF">$48,333</td>
</tr>
I tried targeting elements with attribute 'align = left' and iterating over it but that didn't work out. If anybody could help me locate the element <a href = "/washington"> (multiple tags like these within the same page) with selenium I would appreciate it.
I would use lxml instead, if it is just to process HTML...
It would be helpful if you're more specific, but you can try this if you are traversing links in a webpage..
from lxml.html import parse
pdoc = parse(url_of_webpage)
doc = pdoc.getroot()
list_of_links = [i[2] for i in doc.iterlinks()]
list_of_links will look like ['/en/images/logo_com.gif', 'http://www.brand.com/', '/en/images/logo.gif']
doc.iterlinks() looks for all links (in form, img and a tags, among others) and yields tuples of (element, attribute, link, pos): the Element object, the name of the attribute holding the link, the URL, and a position offset. The line list_of_links = [i[2] for i in doc.iterlinks()] therefore simply grabs each URL and collects them in a separate list.
Note that the retrieved URLs may be relative, i.e. you will see URLs like
'/en/images/logo_com.gif'
instead of
'http://somedomain.com/en/images/logo_com.gif'
if you want to have the latter kind of url, add the code
from lxml.html import parse
pdoc = parse(url_of_webpage)
doc = pdoc.getroot()
doc.make_links_absolute() # add this line
list_of_links = [i[2] for i in doc.iterlinks()]
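What "making links absolute" amounts to can be sketched with the standard library alone. This is not lxml's implementation, only an illustration of the same resolution rule using urllib.parse.urljoin. The base URL is hypothetical, and the link list is the example list from above.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were extracted from.
BASE_URL = "http://somedomain.com/en/index.html"

relative_links = [
    "/en/images/logo_com.gif",
    "http://www.brand.com/",
    "/en/images/logo.gif",
]

# Relative links are resolved against the page URL;
# links that are already absolute are left unchanged.
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links)
```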
If you are processing the url one by one, then simply modify the code to something like
for i in doc.iterlinks():
    url = i[2]
    # some processing here with url...
Finally, if for some reason you need selenium to come in to get the webpage content, then simply add the following to the beginning
from selenium import webdriver
from io import StringIO
browser = webdriver.Firefox()
browser.get(url)
doc = parse(StringIO(browser.page_source)).getroot()
From what you have provided so far, there is a table and the desired links are in a specific column. There are no "data-oriented" attributes to rely on, but using the column index to locate the links looks good enough:
for row in driver.find_elements_by_css_selector("table#myid tr"):
    cells = row.find_elements_by_tag_name("td")
    print(cells[2].text)  # put a correct index here
Here is an example web page I am trying to get data from.
http://www.makospearguns.com/product-p/mcffgb.htm
The xpath was taken from chrome development tools, and firepath in firefox is also able to find it, but using lxml it just returns an empty list for 'text'.
from lxml import html
import requests
site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'
page = requests.get(site_url)
tree = html.fromstring(page.text)
text = tree.xpath(xpath)
Printing out the tree text with
print(tree.text_content().encode('utf-8'))
shows that the data is there, but it seems the xpath isn't working to find it. Is there something I am missing? Most other sites I have tried work fine using lxml and the xpath taken from chrome dev tools, but a few I have found give empty lists.
1. Browsers frequently change the HTML
Browsers quite frequently change the HTML served to them to make it "valid". For example, if you serve a browser this invalid HTML:
<table>
<p>bad paragraph</p>
<tr><td>Note that cells and rows can be unclosed (and valid) in HTML
</table>
To render it, the browser is helpful and tries to make it valid HTML and may convert this to:
<p>bad paragraph</p>
<table>
<tbody>
<tr>
<td>Note that cells and rows can be unclosed (and valid) in HTML</td>
</tr>
</tbody>
</table>
The above is changed because <p>aragraphs cannot be inside <table>s and <tbody>s are recommended. What changes are applied to the source can vary wildly by browser. Some will put invalid elements before tables, some after, some inside cells, etc...
2. Xpaths aren't fixed, they are flexible in pointing to elements.
Using this 'fixed' HTML:
<p>bad paragraph</p>
<table>
<tbody>
<tr>
<td>Note that cells and rows can be unclosed (and valid) in HTML</td>
</tr>
</tbody>
</table>
If we try to target the text of <td> cell, all of the following will give you approximately the right information:
//td
//tr/td
//tbody/tr/td
/table/tbody/tr/td
/table//*/text()
And the list goes on...
However, a browser will generally give you the most precise (and least flexible) XPath, one that lists every element of the DOM on the way down. In this case (note that XPath positions start at 1, not 0):
/table[1]/tbody[1]/tr[1]/td[1]/text()
3. Conclusion: Browser given Xpaths are usually unhelpful
This is why the XPaths produced by developer tools will frequently give you the wrong Xpath when trying to use the raw HTML.
The solution: always refer to the raw HTML and use a flexible, but precise XPath.
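The effect is easy to reproduce on two tiny documents. This is a minimal sketch, assuming a well-formed table so that the standard library's xml.etree.ElementTree can parse it. RAW stands for what the server sends and BROWSER for what the browser's DOM looks like after tbody has been inserted. Both strings are made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up documents: the served HTML versus the browser's "fixed" DOM.
RAW = "<table><tr><td>cell text</td></tr></table>"
BROWSER = "<table><tbody><tr><td>cell text</td></tr></tbody></table>"

raw_root = ET.fromstring(RAW)
dom_root = ET.fromstring(BROWSER)

rigid = "./tbody/tr/td"  # the kind of path developer tools produce
flexible = ".//td"       # any <td> below the table, however deep

rigid_on_raw = [td.text for td in raw_root.findall(rigid)]
rigid_on_dom = [td.text for td in dom_root.findall(rigid)]
flexible_on_raw = [td.text for td in raw_root.findall(flexible)]
flexible_on_dom = [td.text for td in dom_root.findall(flexible)]

print(rigid_on_raw, rigid_on_dom, flexible_on_raw, flexible_on_dom)
```

The rigid path only matches the browser's version of the document, while the flexible path matches both.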
Examine the actual HTML that holds the price:
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td>
<font class="pricecolor colors_productprice">
<div class="product_productprice">
<b>
<font class="text colors_text">Price:</font>
<span itemprop="price">$149.95</span>
</b>
</div>
</font>
<br/>
<input type="image" src="/v/vspfiles/templates/MAKO/images/buttons/btn_updateprice.gif" name="btnupdateprice" alt="Update Price" border="0"/>
</td>
</tr>
</table>
If you want the price, there is actually only one place to look!
//span[@itemprop="price"]/text()
And this will return:
$149.95
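The attribute-based lookup can be tested against the price markup shown above without fetching the page. This is a sketch only: it uses the standard library's xml.etree.ElementTree in place of lxml, on a trimmed copy of the snippet (the br and input elements are left out), and it reads .text because ElementTree has no text() step.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the price markup shown above.
PRICE_HTML = """
<table border="0" cellspacing="0" cellpadding="0">
  <tr>
    <td>
      <font class="pricecolor colors_productprice">
        <div class="product_productprice">
          <b>
            <font class="text colors_text">Price:</font>
            <span itemprop="price">$149.95</span>
          </b>
        </div>
      </font>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(PRICE_HTML)

# Select by attribute instead of spelling out every ancestor.
span = root.find(".//span[@itemprop='price']")
price = span.text if span is not None else None
print(price)
```

The query does not mention table, tr or tbody at all, so it keeps working whether or not a parser or browser inserts extra wrapper elements.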
The xpath is simply wrong
Here is snippet from the page:
<form id="vCSS_mainform" method="post" name="MainForm" action="/ProductDetails.asp?ProductCode=MCFFGB" onsubmit="javascript:return QtyEnabledAddToCart_SuppressFormIE();">
<img src="/v/vspfiles/templates/MAKO/images/clear1x1.gif" width="5" height="5" alt="" /><br />
<table width="100%" cellpadding="0" cellspacing="0" border="0" id="v65-product-parent">
<tr>
<td colspan="2" class="vCSS_breadcrumb_td"><b>
Home >
You can see that the element with id "v65-product-parent" is of type table and has a subelement tr.
There can be only one element with such an id (otherwise it would be broken XML).
The XPath expects tbody as a child of the given element (table), and there is none in the whole page.
This can be tested by
>>> "tbody" in page.text
False
How did Chrome come up with that XPath?
If you simply download this page by
$ wget http://www.makospearguns.com/product-p/mcffgb.htm
and review content of it, it does not contain a single element named tbody
But if you use Chrome Developer Tools, you find some.
How did it get there?
This often happens if JavaScript comes into play and generates some page content in the browser. But as LegoStormtroopr noted, that is not the case here: this time it is the browser itself that modifies the document to make it valid.
How to get content of page dynamically modified within browser?
You have to give some sort of browser a chance. E.g. if you use selenium, you would get it.
byselenium.py
from selenium import webdriver
from lxml import html
url = "http://www.makospearguns.com/product-p/mcffgb.htm"
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'
browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source
print("test tbody", "tbody" in html_source)
tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print(text)
which prints
$ python byselenium.py
test tbody True
['$149.95']
Conclusions
Selenium is great when the page is changed within the browser. However, it is a rather heavy tool, and if you can do it in a simpler way, do it that way. LegoStormtroopr has proposed such a simpler solution, which works on the plainly fetched web page.
I had a similar issue (Chrome inserting tbody elements when you do Copy as XPath). As others answered, you have to look at the actual page source, though the browser-given XPath is a good place to start. I've found that often, removing tbody tags fixes it, and to test this I wrote a small Python utility script to test XPaths:
#!/usr/bin/env python
import sys, requests
from lxml import html
if (len(sys.argv) < 3):
    print 'Usage: ' + sys.argv[0] + ' url xpath'
    sys.exit(1)
else:
    url = sys.argv[1]
    xp = sys.argv[2]
    page = requests.get(url)
    tree = html.fromstring(page.text)
    nodes = tree.xpath(xp)
    if (len(nodes) == 0):
        print 'XPath did not match any nodes'
    else:
        # tree.xpath(xp) produces a list, so always just take first item
        print (nodes[0]).text_content().encode('ascii', 'ignore')
(that's Python 2.7, in case the non-function "print" didn't give it away)
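Removing the tbody steps by hand gets tedious, so here is a small sketch of how it could be automated. It is written for Python 3, unlike the script above. strip_tbody is a hypothetical helper, and the sample expression is a shortened prefix of the XPath from the question. It assumes the raw HTML contains no tbody at all, which should be checked first (for example with "tbody" in page.text).

```python
import re


def strip_tbody(xpath):
    """Drop every /tbody or /tbody[n] step from an XPath string."""
    # The lookahead makes sure only whole steps named "tbody" are removed.
    return re.sub(r"/tbody(?:\[\d+\])?(?=/|$)", "", xpath)


# Shortened prefix of the browser-generated XPath from the question.
browser_xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td'
raw_xpath = strip_tbody(browser_xpath)
print(raw_xpath)
```

If the page source does contain real tbody elements, stripping them would break the path, so a more flexible, attribute-based XPath is still the safer choice.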