When I search for an XPath in my browser after inspecting the page, it shows the required result, but when I use the same XPath on my response in Scrapy, it returns an empty list.
When I search for an element in the browser, it shows me the number of matching elements (see the picture for an example).
But when I run the same XPath on my response in the Scrapy shell, I get an empty list, even though the response status is 200. What could be causing this?
Your browser renders JavaScript, and that changes the HTML. So, in this case, you need a JavaScript rendering engine for your requests in Scrapy. Take a look at scrapy-splash to render the JS and get the same results as in the browser.
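For reference, a minimal sketch of what that looks like with scrapy-splash, assuming a Splash instance is running locally (for example via the scrapinghub/splash Docker image) and the scrapy-splash middlewares plus SPLASH_URL are set in settings.py; the spider name and URL are placeholders:

# Minimal scrapy-splash sketch: the page is rendered by Splash before
# the response reaches the callback, so browser XPaths should match.
import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = "rendered"  # placeholder name

    def start_requests(self):
        # Replace with the page you are scraping; 'wait' gives the JS time to run.
        yield SplashRequest(
            "https://example.com/page-with-js",
            callback=self.parse,
            args={"wait": 2},
        )

    def parse(self, response):
        # The response now contains the DOM after JavaScript execution.
        yield {"titles": response.xpath("//h1/text()").getall()}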
If you use the Chrome browser, the HTML you see there can differ slightly in some tags from what you get with requests or Scrapy. For example, Chrome automatically inserts tags (such as tbody inside tables) into the HTML.
I am trying to scrape some content from a website, but Selenium's page_source does not contain all the content I need because the website is dynamically rendered. When opening DevTools in Chrome, you are able to inspect all of the DOM elements, even those rendered dynamically. This made me believe that there must be a way to do this in Selenium as well.
Any suggestions?
Get the inner HTML of the html or body element:
driver.find_element_by_xpath("/html/body").get_attribute('innerHTML')
If that does not get everything, please post the source html/website.
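If the content takes a moment to render, it can help to wait for it explicitly before reading the innerHTML. A rough sketch, where the URL and the XPath used in the wait are placeholders you would adapt to your page:

# Wait for a dynamically rendered element to appear, then dump the body's innerHTML.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Block (up to 10 s) until the dynamically added element shows up.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[@class='dynamic-content']"))
)

html = driver.find_element(By.XPATH, "/html/body").get_attribute("innerHTML")
print(html)
driver.quit()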
I am trying to get the tag text from an Instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
This should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always gone fine. I also tried including more divs, but the content is always empty.
The first rule of thumb when using Scrapy is to open "view page source" in the browser: that is mostly the response you get in Scrapy. (If you are getting blocked, that is a totally different question.)
Viewing the source, we can see that this website loads all of its content dynamically with AJAX requests, as nothing is present in the page source. You can search for your desired content in the Network tab and then try replicating the request that returns it, as sketched below. Alternatively, you can use Splash. Documentation for Splash can be found here.
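A hypothetical sketch of the "replicate the request" route; the endpoint, header, and JSON keys below are placeholders, so copy the real ones out of the Network tab (the "Copy as cURL" option is handy for this):

# Replicate an AJAX call discovered in the Network tab instead of rendering JS.
import json
import scrapy

class HashtagSpider(scrapy.Spider):
    name = "hashtags"

    def start_requests(self):
        yield scrapy.Request(
            "https://www.instagram.com/some/ajax/endpoint",  # hypothetical URL
            headers={"X-Requested-With": "XMLHttpRequest"},
            callback=self.parse_json,
        )

    def parse_json(self, response):
        data = json.loads(response.text)
        # Drill into the JSON the same way the browser does; the key below
        # is a placeholder for whatever the real payload contains.
        yield {"tags": data.get("tags", [])}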
If you look at the response body, you will see that the Instagram page is not fully loaded. The post data is stored in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So, you may want to change selector path and extract information from meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]')
and you will get your content back. If you need other information, first try inspecting response.body and see if it is there.
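For example, in the Scrapy shell you can pull the hashtag straight out of the meta tag's content attribute:

# Extract the hashtag text from the meta tag's content attribute.
response.xpath('//meta[@property="instapp:hashtags"]/@content').get()
# -> 'cyberpunk'

# Or grab every hashtag meta tag if the post carries several:
response.xpath('//meta[@property="instapp:hashtags"]/@content').getall()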
I'm trying to get all links on a page 'https://www.jumia.com.eg' using scrapy.
The code is like this:
all_categories = response.xpath('//a')
But I found a lot of missing links in the results.
The count of the results is 242 links.
When I tried Chrome developer tools, I got all the links: the count was 608 with the same XPath selector (//a).
Why doesn't Scrapy get all the links using the mentioned selector while Chrome does?
It turned out that the problem was that the data is loaded using JavaScript, as Justin commented.
That's because the website is using reCAPTCHA.
If you type view(response) in the Scrapy shell, you will notice that you are actually parsing the reCAPTCHA page (which explains the unexpected a tags):
You can try solving the reCAPTCHA (not sure how easy that would be, but this question might help). Alternatively, you can run your scraper from a proxy, such as Crawlera, which uses rotating IPs. I have not used Crawlera, but according to their website it retries the page several times (with different IPs) until it hits a clean page.
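In the meantime, a small guard in the parse callback can at least make the block visible instead of silently yielding the reCAPTCHA page's links. The "recaptcha" marker string is an assumption; check what the blocked page actually contains with view(response):

def parse(self, response):
    # Bail out loudly if we got the reCAPTCHA interstitial instead of the real page.
    if "recaptcha" in response.text.lower():
        self.logger.warning("Blocked by reCAPTCHA at %s - consider retrying via a proxy", response.url)
        return
    for href in response.xpath('//a/@href').getall():
        yield {"link": response.urljoin(href)}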
I can't scrape the 'Resolution' field from the web page, which I believe is rendered with JavaScript.
Webpage address:
https://support.na.sage.com/selfservice/viewdocument.do?noCount=true&externalId=60390&sliceId=1&noCount=true&isLoadPublishedVer=&docType=kc&docTypeID=DT_Article&stateId=4183&cmd=displayKC&dialogID=197243&ViewedDocsListHelper=com.kanisa.apps.common.BaseViewedDocsListHelperImpl&openedFromSearchResults=true
I need to extract Description, Cause, and Resolution.
I tried various ways to get the element, including:
find_element_by_xpath
find_element_by_id
find_element_by_class_name.
Nothing gave the desired result.
Could you direct me in which way should I work?
https://support.na.sage.com/selfservice/viewContent.do?externalId=60390&sliceId=1
This is the correct URL whose HTML you can crawl; use the Network tab of your browser's DevTools to find it.
Example with Chrome
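As a rough sketch, assuming that endpoint returns plain HTML; the heading-based selectors below are guesses, so confirm the real structure in DevTools:

# Fetch the content endpoint directly and pick out the three sections.
import requests
from lxml import html

url = ("https://support.na.sage.com/selfservice/viewContent.do"
       "?externalId=60390&sliceId=1")
tree = html.fromstring(requests.get(url).text)

for section in ("Description", "Cause", "Resolution"):
    # Hypothetical structure: a heading node followed by its content sibling.
    nodes = tree.xpath('//*[normalize-space(text())="%s"]/following-sibling::*[1]' % section)
    if nodes:
        print(section, "->", nodes[0].text_content().strip())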
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[#class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only one), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it only grabs the first part. So my questions are:
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the HTML, or perhaps another route entirely to get the whole page?
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests' .text always gives you the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses JavaScript to load the table, which is not present when requests fetches the HTML, so you are getting all the HTML, just not the parts generated by JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)
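PhantomJS support has since been removed from Selenium, so the same idea with headless Chrome (assuming a matching chromedriver is installed and on PATH) would look roughly like this:

# Same approach with headless Chrome instead of PhantomJS.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)
browser.quit()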