XPath from Chrome results in an empty list in scrapy - python

I'm inspecting a page with Chrome Dev Tools and have the xpath of an element on the page. I disable JavaScript deliberately so the DOM doesn't get changed. However, the xpath Chrome gives for the element results in [] in scrapy, although the element, of course, exists. What might be the problem?
In particular, the xpath //*[@id="prddeatailed_container"]/table[1]/tbody/tr[1]/td/div/table/tbody/tr[2]/td[1]/span for this page http://cheaptool.ru/product/sadovyj-pylesos-billy-goat-lb351/ - the price 29 990.
$ scrapy shell 'http://cheaptool.ru/product/sadovyj-pylesos-billy-goat-lb351'
In [2]: xp1 = '//*[@id="prddeatailed_container"]/table[1]/tbody/tr[1]/td/div/table/tbody/tr[2]/td[1]/span'
In [3]: aaa = response.xpath(xp1)
In [4]: aaa
Out[4]: []
UPDATE:
It turned out that in the resulting html there was no tbody. Why did Chrome show it in the xpath? How do I get an xpath that matches the real html?

"I disable javascript deliberately so DOM doesn't get changed"
Besides JavaScript, the DOM can also get changed because browsers usually have algorithms to fix the html source so that it can be rendered reasonably well.
"@user3616725, the question is not what to use, but why it doesn't work"
A common case is exactly what you discovered while I was writing this answer: Chrome added the <tbody> tag automatically. See the following discussion for an explanation of this behavior:
Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
"It turned out that in the resulting html there was no tbody. Why did Chrome show it in the xpath? How do I get an xpath that matches the real html?"
The html as rendered by Chrome indeed has <tbody>; that's why Chrome showed it in the xpath. Chrome dev tools work against the final DOM, which may differ from the actual HTML source, so you simply can't rely on an xpath copied from Chrome for use in Scrapy.

Since you mention tbody: a lot of HTML doesn't follow the rule of using tbody, and Chrome usually fixes that by adding tbody automatically. If you print the response HTML, you won't find any tbody.
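A quick way to confirm this in scrapy shell, and the usual fix, is to check the raw response for tbody and then drop the /tbody steps from the copied xpath. A minimal sketch; whether the trimmed path then matches the real markup still has to be verified against the page:
$ scrapy shell 'http://cheaptool.ru/product/sadovyj-pylesos-billy-goat-lb351'
In [1]: 'tbody' in response.text   # check whether the raw HTML actually contains tbody
In [2]: xp1 = '//*[@id="prddeatailed_container"]/table[1]/tbody/tr[1]/td/div/table/tbody/tr[2]/td[1]/span'
In [3]: response.xpath(xp1.replace('/tbody', '')).extract()   # the same path with the /tbody steps removed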

Related

xpath on browser and response are different

When I search for an xpath in my browser after inspecting, it shows the required result, but when I use the same xpath on my response in scrapy it gives an empty list.
So when I find an element in the browser, it shows me the number of matching elements (see the picture for an example).
Now, when I run the same xpath on my response in scrapy shell, I get an empty list, even though the response status is 200. What could be causing this?
Your browser renders JavaScript code, and this changes the HTML. So, in this case, you need a JavaScript engine for your requests in Scrapy. Please look at scrapy-splash to render JS and get the same results as in the browser.
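A minimal sketch of that approach, assuming a Splash instance is running and the scrapy-splash middlewares plus SPLASH_URL are configured in settings.py; the spider name, URL, and price xpath below are placeholders:
import scrapy
from scrapy_splash import SplashRequest

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        # Render the page in Splash first so JavaScript-generated markup is in the response.
        yield SplashRequest('http://example.com/product', self.parse, args={'wait': 2})

    def parse(self, response):
        # The xpath now runs against the rendered DOM instead of the raw source.
        yield {'price': response.xpath('//span[@class="price"]/text()').extract_first()}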
If you use the Chrome browser, some tags will be a little different from what you get with requests or Scrapy.
For example, Chrome will automatically add <tbody> in the html.

Is it possible to copy xpath in google or chrome without tbody tags?

I am trying to scrape data from an old website that uses only tables, without any class or id to identify them. When I use the copy xpath function in Chrome or Firefox, they return something like this to me:
/html/body/table[3]/tbody/tr/td[1]/table/tbody/tr[4]/td[2]/font
These tbody tags seem unreachable to Python's Scrapy, and the chaotic nature of the site's html structure makes it almost impossible for me to create an xpath myself. Is there any way I can copy the xpath without these tbody tags?

How to scrape HTML rendered by JavaScript

I need to write an automated scraper that can handle websites that are rendered by JavaScript (like YouTube) or that simply use some JavaScript somewhere in their HTML to generate some content (like a generated copyright year); downloading their HTML source therefore makes no sense, as it won't be the final code (what users will actually see).
I use Python with Selenium and WebDriver, so that I can execute JavaScript on a given website. My code for that purpose is:
def execute_javascript_on_website(self, js_command):
    driver = webdriver.Firefox(firefox_options=self.webdriver_options,
                               executable_path=os.path.dirname(os.path.abspath(__file__)) + '/executables/geckodriver')
    driver.get(self.url)
    try:
        return driver.execute_script(js_command)
    except Exception as exception_message:
        pass
    finally:
        driver.close()
Where js_command = "return document.documentElement.outerHTML;".
With this code I'm able to get the source code, but not the rendered one. I can do js_command = "return document;" (as I would do in the console), but then I get a <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="5a784804-f623-3041-9840-03f13ce83f53", element="585b43a1-f3b2-1e4a-b348-4ddaf2944550")> object that holds the HTML, but it's not possible to get it out of it.
Does anyone know about the way how to get HTML rendered by JavaScript (ideally in form of string), using Selenium? Or some other technique that would do it?
PS.: I also tried WebDriverWait, but it didn't help; I still got HTML with unrendered JavaScript.
PPS.: I need to get the whole HTML code (the whole html tag) with the JavaScript rendered in it (as it is, for example, when inspecting in the browser's inspector). Or at least to get the DOM of the website in which the JavaScript has already been rendered.
driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
I've looked into it and I have to admit that the JavaScript in @Rumpelstiltskin Koriat's answer works. The current year is present in the returned HTML string, placed after the script tag (which, as @pguardiario mentioned, has to be there, since it's just an HTML tag). I've also found that in this case of simple JavaScript code in script tags, WebDriverWait is not even needed to obtain the HTML string with the rendered JavaScript. Apparently I had somehow managed to overlook the JavaScript-rendered string I was so eagerly looking for.
What I've also found (as @Corey Goldberg suggested) is that the Selenium method also works well, while looking better than the pure JavaScript line: driver.find_element_by_tag_name('html').get_attribute('innerHTML'). It returns a string and not a webelement.
On the other hand, when there is a need to scrape the whole HTML of an Angular-powered website, it's necessary (at least in the case of the YouTube website) to locate its tag with id="content" (and then prepend this locator to all XPaths used later in the code, simulating that we have the whole HTML), or some tag inside that one. WebDriverWait was not needed here either.
But when locating just the html tag, the yt-app tag, or any other tag outside the one with id="content", HTML with unrendered JavaScript is returned. HTML in Angular-generated websites is mixed with Angular's own tags (which browsers apparently ignore).
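Putting the working pieces together, a minimal sketch of both ways to pull the rendered HTML out of Selenium as a plain string; the geckodriver path and the URL are placeholders, and find_element_by_tag_name is the older Selenium API that the code above already uses:
from selenium import webdriver

driver = webdriver.Firefox(executable_path='/path/to/geckodriver')
driver.get('https://www.youtube.com/')
try:
    # Option 1: let the browser serialize the rendered DOM via JavaScript.
    rendered = driver.execute_script(
        "return document.getElementsByTagName('html')[0].innerHTML")
    # Option 2: the pure-Selenium equivalent, which also returns a string.
    rendered_alt = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
finally:
    driver.close()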

Using Python requests.get to parse html code that does not load at once

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
at this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests always grabs the whole page into .text; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do the equivalent work in Python (see the sketch after this list).
Run a headless JavaScript interpreter against a DOM that you've built up.
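The second option usually means watching the browser's Network tab, finding the XHR the page makes for the missing data, and calling that request directly with requests. A hedged sketch; the endpoint and parameter below are purely hypothetical and would have to be read from the actual Network tab:
import requests

# Hypothetical endpoint; replace with the real XHR URL observed in the Network tab.
api_url = 'https://example.com/api/availability'
data = requests.get(api_url, params={'sku': '4120200892474'}).json()
print(data)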
The page uses JavaScript to load the table, and that JavaScript hasn't run when requests fetches the html, so you are getting all the html, just not what is generated by JavaScript. You could use selenium combined with phantomjs for headless browsing to get the rendered html:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)
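If the JavaScript-built element still isn't in page_source when it's read, an explicit wait can be added before grabbing the HTML. A variant of the same idea; the class name comes from the xpath in the question and the 10-second timeout is arbitrary:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
# Wait up to 10 seconds for the JavaScript-generated element to appear.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "product-soldout")))
html = browser.page_source
print(html)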

why this xpath is not working

I am scraping this page
http://www.modeluxproperties.com/?m=search&web=1&act=details_web&id=503
I want to get the values of all the Amenities
my xpath is
normalize-space(.//div[@id='specimen']/div[@class='section']/table//tr[4]/td/table//tr/td/text())
I got an empty result; why?
The correct xpath for amenities is:
"//table//div[@id='specimen']//table/tr[4]/td/table/tr/td/text()"
so your xpath is actually completely ok; perhaps you are extracting it in some strange way? You can extract it like so:
sel.xpath("//table//div[@id='specimen']//table/tr[4]/td/table/tr/td/text()").extract()
where sel is simply an instance of Selector, created like so sel = Selector(response).
To debug this kind of issue the Firefox firepath extension is very helpful; for Chrome there is xpath helper. Typically you should start by finding the right xpath with firepath and then trying it in scrapy shell. It's really simple, something like:
scrapy shell
fetch "http://[your url]"
then you will get the selector object sel, and you can test your xpath there. Testing with scrapy shell is often necessary because browsers modify the html displayed on pages. For example, in the case of tables most browsers add tbody.
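For instance, for the page from this question the check might look like the following in scrapy shell; the tbody-free xpath below is an assumption that still has to be verified against the actual source:
scrapy shell
fetch "http://www.modeluxproperties.com/?m=search&web=1&act=details_web&id=503"
sel.xpath("//div[@id='specimen']//table/tr[4]/td/table/tr/td/text()").extract()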
