I can't use Scrapy on all web pages

I am new to Scrapy and I need to extract price information for some products on Walmart Canada. The problem is that my spider does not extract anything. This only happens with Walmart Canada; when I use Scrapy on another web page, it works correctly.
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
class WalmartItem(Item):
    barcodes = Field()
    sku = Field()
class WalmartCrawler(CrawlSpider):
    name = 'walmartCrawler'
    start_urls = [
        'https://www.walmart.ca/en/ip/apple-gala/6000195494284']
    def parse(self, response):
        item = ItemLoader(WalmartItem(), response)
        item.add_xpath(
            'barcodes', "//div[@class='css-1dar8at e1cuz6d10']/div[@class='css-w8lmum e1cuz6d11']/div[contains(text(), 'UPC')]/parent::node()/div[2]/text()")
        item.add_xpath(
            'sku', "//*[contains(text(), 'UPC')]/parent::node()/div[2]/text()")
        yield item.load_item()

Your XPath doesn't match anything in the response Scrapy receives.
One way to get the values is to run a regex over the raw response text:
import re, ast
sku = re.search(r'"sku":"(\d+)', response.text).groups()[0]
barcodes = ast.literal_eval(re.search(r'"upc":(\[.*?\])', response.text).groups()[0])
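The regex idea above can be wrapped in a small helper that does not crash when a field is missing. This is only a sketch: SAMPLE_TEXT is a made-up stand-in for response.text, the helper name is hypothetical, and json.loads is swapped in for ast.literal_eval because the matched fragment is JSON.

```python
import json
import re

# Made-up stand-in for response.text; the real page embeds a much larger blob.
SAMPLE_TEXT = '<script>{"sku":"12345","upc":["000111","000222"]}</script>'


def extract_sku_and_upcs(text):
    """Return (sku, upcs); sku is None and upcs is [] when a field is absent."""
    sku_match = re.search(r'"sku":"(\d+)', text)
    upc_match = re.search(r'"upc":(\[.*?\])', text)
    sku = sku_match.group(1) if sku_match else None
    # The captured fragment is a JSON array, so json.loads can parse it.
    upcs = json.loads(upc_match.group(1)) if upc_match else []
    return sku, upcs


print(extract_sku_and_upcs(SAMPLE_TEXT))
```

Inside a spider callback, the same function could be called with response.text.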

TL;DR: You cannot assume that Scrapy alone can extract data from every web page.
Some websites load information using browser scripting (JavaScript code) or AJAX requests. These run in the browser after the initial response is received from the server. This means that the HTML response you receive in Scrapy may not contain the information you see in the browser.
To see the response Scrapy will receive, open the Network tab in your browser's DevTools (in Google Chrome you can access them with Right Click > Inspect). There, find the initial request the browser makes to the server and look at the response to that request. This is the response you are going to receive in Scrapy.
Inside Scrapy you can only work with this HTML, and as you can see, the price is not in it. In these cases you must find an alternative, such as: a) using Selenium WebDriver; b) finding the product data inside a script tag in the HTML (which is the way to go in this case; check the first script tag in the HTML); or c) extracting the data via an API.
Take a look at this walmart.ca extraction script, which uses solution b) for each product in a list of products:
https://github.com/juansimon27/scrapy-walmart/blob/master/product_scraping/spiders/spider.py
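As a rough illustration of option b), the sketch below pulls a JSON object out of the first script tag of a page. The page body, the variable name __PRELOADED_STATE__, and the JSON layout are all assumptions made up for the example; check the real response to see what the site actually embeds.

```python
import json
import re

# Made-up page body; the variable name and JSON layout are assumptions.
SAMPLE_HTML = """<html><head>
<script>window.__PRELOADED_STATE__={"product":{"sku":"12345","price":1.97}};</script>
</head><body><div id="root"></div></body></html>"""


def first_script_json(html):
    """Return the JSON object found in the first <script> tag, or None."""
    script = re.search(r"<script[^>]*>(.*?)</script>", html, re.DOTALL)
    if script is None:
        return None
    body = script.group(1)
    start = body.find("{")
    if start == -1:
        return None
    # raw_decode parses one JSON value and ignores trailing text such as ";".
    obj, _end = json.JSONDecoder().raw_decode(body[start:])
    return obj


state = first_script_json(SAMPLE_HTML)
print(state["product"]["price"])
```

In a spider, the script text could come from response.xpath('//script/text()').get() instead of a regex over the whole page. If the script body is not valid JSON, raw_decode raises json.JSONDecodeError, which the caller would need to handle.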
On top of this, in the specific case of walmart.ca, if you do not send a suitable user agent with your requests, walmart.ca may respond with <h2>Your web browser is not accepting cookies.</h2> or with something like: Your browser is not able to execute JS.
Configure the following user agent to avoid these problems:
custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36'
}
In your script you can put this custom_settings definition just below your start_urls variable, or instead set USER_AGENT in a settings.py file.

Related

How to Take Advantage of Scrapy's Concurrency With Non-Selenium Requests

I've got an interesting problem here. I'm writing a Scrapy web scraper to obtain products off of a website. The catalog pages use lazy-loading, which means I cannot obtain more than the first 12 items or use pagination using the default Scrapy. I have started using Selenium with a headless chrome client in order to scroll the page manually to obtain the data.
I have read online that using Scrapy + Selenium means that I can't run Scrapy requests concurrently, which is unfortunate because the vast majority of my requests don't require Selenium. My selenium middleware checks the request.meta property to see if it needs to do anything, otherwise it simply returns None. However, all requests are filtered through the middleware.
My question is this: Is there a way to allow those requests that DON'T require Selenium to be run concurrently?
My middleware:
def __init__(self):
    options = Options()
    options.add_argument("--headless")
    self.driver = webdriver.Chrome("path/to/driver", chrome_options=options)
def process_request(self, request, spider):
    if request.meta.get("selenium"):
        self.driver.get(request.url)
        ... # Perform Selenium scroll logic and return body
    return None
My spider parse function:
def parse(self, response):
    meta = {"otherMetaData": "data", "selenium": True}
    ... # Obtain link to catalog page
    yield response.follow(page_link, callback=self.parseProducts, meta=meta)
def parseProducts(self, response):
    ... # Obtain links to product pages
    response.meta.pop("selenium")
    yield response.follow(page_link, callback=self.parseProductPage, meta=response.meta)

This was an issue I faced recently. What you need to know is that Scrapy relies on the scheduler to time its requests. The scheduler keeps up to CONCURRENT_REQUESTS requests in flight (fewer when fewer requests remain, i.e. when the spider is finishing up its final requests). The scheduler also depends on priority, which you can modify to give certain links a higher priority so that they leave the queue earlier.
There is no direct way of doing what you are asking for, but a good way to do it would be to pass your "non-selenium" requests to a different spider with its own custom_settings.
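A minimal sketch of what the two sets of per-spider settings could look like. The middleware import path is hypothetical, and the concurrency values are only illustrative (16 is Scrapy's default for CONCURRENT_REQUESTS).

```python
# Settings for the spider whose requests all go through Selenium.
# "myproject.middlewares.SeleniumMiddleware" is a hypothetical import path.
SELENIUM_SPIDER_SETTINGS = {
    "CONCURRENT_REQUESTS": 1,  # one browser instance, one page at a time
    "DOWNLOADER_MIDDLEWARES": {
        "myproject.middlewares.SeleniumMiddleware": 543,
    },
}

# Settings for the spider that only needs plain HTTP requests.
# The Selenium middleware is simply not enabled here.
PLAIN_SPIDER_SETTINGS = {
    "CONCURRENT_REQUESTS": 16,
}


def settings_for(needs_selenium):
    """Pick the settings dict for a spider, depending on whether it drives a browser."""
    return SELENIUM_SPIDER_SETTINGS if needs_selenium else PLAIN_SPIDER_SETTINGS


print(settings_for(False))
```

Each spider class would then assign one of these dicts to its custom_settings class attribute, so the plain spider never touches the browser.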

Selecting dependent dropdown with scrapy-splash

I am trying to scrape the following website: https://www.climatempo.com.br/climatologia/558/saopaulo-sp. It has two drop-down menus, the second depending on the first, so I chose to use Scrapy and Splash via scrapy-splash.
I need to automate the change of location by selecting first the state, then the city. I tried SplashFormRequest, but I am not able to change the list of cities. My spider is (prints are for debugging):
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest
class ExampleSpider(scrapy.Spider):
    name = 'climatologia'
    def start_requests(self):
        urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},)
    def parse(self, response):
        print(response.url)
        state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
        print(state)
        return SplashFormRequest(response.url, method='POST',
                                 formdata={'sel-state-geo': 'SP'},
                                 callback=self.state_selected,
                                 args={'wait': 0.5})
    def state_selected(self, response):
        print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
        print(response.css("select.slt-geo")[0].css("option::text").extract())
        print(response.css("select.slt-geo")[1].css("option::text").extract())

This is a job that I would suggest Selenium for if you absolutely must use the site's menus. The only way to script Splash is through Lua scripts: you would have to write a Lua script and send it to the execute endpoint. I found the options you were trying to select, but not where the form is submitted or how it functions on the site. I did have to translate the page to English.
My suggestion is to look in the browser inspector for endpoints. This is one of several that look particularly interesting:
https://www.climatempo.com.br/json/busca-estados
This endpoint returns JSON like the following:
{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o 
Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}]}
Hopefully this is another way to get the data you are looking for.
You can then use normal requests to get the data. You just have to form the request the same way the browser does. Usually adding Accept, User-Agent, and X-Requested-With headers is enough to pass.
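A small sketch of working with a payload of this shape. SAMPLE_PAYLOAD is a trimmed copy of the JSON shown above (two entries only); the header values and the helper name are assumptions, not something the site is known to require.

```python
import json

# Trimmed copy of the payload shown above (two of the entries).
SAMPLE_PAYLOAD = r'''{"success":true,"message":"Resultados encontrados","data":[
{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N"},
{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE"}]}'''

# Headers of the kind mentioned above; the exact values are assumptions.
AJAX_HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
}


def states_by_uf(payload_text):
    """Map each state abbreviation (uf) to its full record."""
    payload = json.loads(payload_text)
    if not payload.get("success"):
        return {}
    return {row["uf"]: row for row in payload["data"]}


states = states_by_uf(SAMPLE_PAYLOAD)
print(states["SP"]["state"], states["SP"]["idstate"])
```

In a spider, AJAX_HEADERS could be passed as the headers argument of a scrapy.Request to the JSON endpoint, and the callback would call states_by_uf(response.text).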

Scrapy - Javascript website

I'm familiar with scraping websites with Scrapy, but I can't seem to scrape this one (JavaScript, perhaps?).
I'm trying to download historical data for commodities for some personal research from this website:
http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx
On this website you will have to select the date and then click go. Once the data is loaded, you can click 'View in Excel' to download a CSV file with commodity prices for that day. I'm trying to build a scraper to download these CSV files for a few months. However, this website seems like a hard nut to crack. Any help will be appreciated.
Things I've tried:
1) Looked at the page source to see if data is being loaded but not shown (hidden)
2) Used Firebug to see if there are any AJAX requests
3) Modified POST headers to see if I can get data for different days. The post headers seem very complicated.

ASP.NET websites are notoriously hard to crawl because they rely on view state and session data, are extremely strict about requests, and add loads of other nonsense.
Luckily your case seems to be pretty straightforward. Your Scrapy approach should look something like:
import scrapy
from scrapy import FormRequest
from scrapy.shell import inspect_response
class MxindiaSpider(scrapy.Spider):
    name = "mxindia"
    allowed_domains = ["mcxindia.com"]
    start_urls = ('http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx',)
    def parse(self, response):
        # Build the POST from the form already present in the response
        yield FormRequest.from_response(response,
                                        formdata={
                                            'mTbdate': '02/13/2015',  # your date here
                                            'ScriptManager1': 'MupdPnl|mImgBtnGo',
                                            '__EVENTARGUMENT': '',
                                            '__EVENTTARGET': '',
                                            'mImgBtnGo.x': '12',
                                            'mImgBtnGo.y': '9'
                                        },
                                        callback=self.parse_cal, )
    def parse_cal(self, response):
        inspect_response(response, self)  # everything is there!
What we do here is create a FormRequest from the response object we already have. It's smart enough to find the <form> and its <input> fields and generate the formdata from them.
However, input fields that have no defaults, or whose defaults we need to change, must be overridden with the formdata argument.
So we provide the formdata argument with the updated form values. When you inspect the request in your browser, you can see all of the form values you need to make a successful request.
Just copy all of them over to your formdata. ASP.NET is really picky about the formdata, so it takes some experimenting to find out what is required and what is not.
I'll leave you to figure out how to get to the next page yourself; usually it just adds an additional key to formdata, like 'page': '2'.
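Since the goal is to download several months of files, the form values can be generated per day. The sketch below reuses the field names and values from the spider above and assumes the site wants the date as MM/DD/YYYY (inferred from '02/13/2015'); the helper names are hypothetical.

```python
from datetime import date, timedelta


def build_formdata(day):
    """Form fields from the spider above, with the date filled in as MM/DD/YYYY."""
    return {
        "mTbdate": day.strftime("%m/%d/%Y"),
        "ScriptManager1": "MupdPnl|mImgBtnGo",
        "__EVENTARGUMENT": "",
        "__EVENTTARGET": "",
        "mImgBtnGo.x": "12",
        "mImgBtnGo.y": "9",
    }


def daily_formdata(start, end):
    """Yield one formdata dict per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield build_formdata(day)
        day += timedelta(days=1)


for formdata in daily_formdata(date(2015, 2, 12), date(2015, 2, 13)):
    print(formdata["mTbdate"])
```

Inside parse, each dict could be passed as formdata to FormRequest.from_response. Requests for days without trading data (weekends, holidays) would still need to be handled in the callback.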

use scrapy to crawl node

I'm trying to use Scrapy to crawl some advertisement information from this website.
The website has some div tags with class="product-card new_ outofstock installments_ ".
When I use:
items = response.xpath("//div[contains(@class, 'product-')]")
I get some nodes whose class attribute is "product-description", but none with "product-card".
When I use:
items = response.xpath("//div[contains(@class, 'product-card')]")
I still get nothing in the result.
Why is that?

As pointed out in the other answer, the content you are trying to scrape is generated dynamically using JavaScript. If performance is not a big deal for you, you can use Selenium to emulate a real user and interact with the site, while still letting Scrapy extract the data for you.
If you want a similar example of how to do this, consider this tutorial: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/

The data you want is being populated by JavaScript.
You would have to use a Selenium WebDriver to extract the data.
If you want to check beforehand whether data is being populated using JavaScript, open a Scrapy shell and try extracting the data as below.
scrapy shell 'http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT'
>>> response.xpath('//div[contains(@class,"product-card")]')
Output:
[]
Now, if you use the same XPath in the browser and it does return matching nodes, then the data is populated using scripts and Selenium has to be used to get the data.
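The same raw-versus-rendered comparison can be sketched without a browser or Scrapy. Both HTML strings below are made-up stand-ins: RAW_HTML plays the role of what Scrapy downloads, RENDERED_HTML the role of what the browser shows after scripts run. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <div> tags whose class attribute contains a given fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = dict(attrs).get("class") or ""
            if self.fragment in classes:
                self.count += 1


def count_divs(html, fragment):
    parser = ClassCounter(fragment)
    parser.feed(html)
    return parser.count


# Made-up stand-ins for the downloaded HTML and the browser-rendered HTML.
RAW_HTML = '<div class="product-description">...</div><script>render()</script>'
RENDERED_HTML = RAW_HTML + '<div class="product-card new_ outofstock installments_ ">Phone</div>'

print(count_divs(RAW_HTML, "product-card"), count_divs(RENDERED_HTML, "product-card"))
```

A count of zero in the raw HTML together with a non-zero count in the rendered page is the sign that the nodes are created by scripts.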
Here is an example to extract data using selenium:
import scrapy
from selenium import webdriver
from scrapy.http import TextResponse
class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['lazada.vn']
    start_urls = ['http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT']
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # let Scrapy finish its own spider setup
        self.driver = webdriver.Firefox()
    def parse(self, response):
        # Load the page in a real browser so that its JavaScript runs
        self.driver.get(response.url)
        # Wrap the rendered HTML so that Scrapy selectors can be used on it
        page = TextResponse(response.url, body=self.driver.page_source, encoding='utf-8')
        required_data = page.xpath('//div[contains(@class,"product-card")]').extract()
        self.driver.close()
Here are some examples of "selenium spiders":
Executing Javascript Submit form functions using scrapy in python
Snipplr
Scrapy with selenium
Extract data from dynamic webpages

Scraping Page That Requires JavaScript Interaction

I am trying to scrape https://a836-propertyportal.nyc.gov/Default.aspx with Scrapy. I am having difficulty using the FormRequest--specifically, I do not know how to tell Scrapy how to fill the block and lot forms out, and then subsequently get the response of the page. I tried following the FormRequest example on the Scrapy website found here (http://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login), but continued to have difficulty with properly clicking on the "Search" button.
I would really appreciate it if you could offer any suggestions so that I can extract data from the submitted page. Some poster on SO suggested that Scrapy cannot handle JS events well, and to use another library like CasperJS instead.
Update: I would very much appreciate it if someone could please point me to a Java/Python/JS library that allows me to submit a form, and retrieve the subsequent information
Updated Code (following Pawel's comment): My code can be found here:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
class MonshtarSpider(Spider):
    name = "monshtar"
    allowed_domains = ["a836-propertyportal.nyc.gov"]  # domain only, not a full URL
    start_urls = (
        'https://a836-propertyportal.nyc.gov/Default.aspx/',
    )
    def parse(self, response):
        print("entered the parsing section!!")
        yield Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
                      cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"},
                      callback=self.aftersubmit)
    def aftersubmit(self, response):
        # get the data....
        print("SUCCESS!!\n\n\n")

Your page is somewhat bizarre and difficult to parse. After a valid POST request is submitted, the page responds with a 302 HTTP status and a bunch of cookies (your formdata is invalid, by the way: you need to replace underscores with dollar signs in your parameter names).
The content can be viewed after sending a GET to https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx
The most surprising thing is that you can crawl this site using only cookies, without a POST request. The POST is there only to give you cookies; it does not redirect to or respond with an HTML page. You can manipulate those cookies from your spider. You only need to make a first GET to get the session cookie, and then successive GETs with borough, block, etc.
Try this in scrapy shell:
pawel@stackoverflow:~/stack/scrapy$ scrapy shell "https://a836-propertyportal.nyc.gov/Default.aspx"
In [1]: from scrapy.http import Request
In [2]: req = Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx", cookies = {"borough":"1", "block":"01000", "style":"default", "lot":"0011"})
In [3]: fetch(req)
In [4]: view(response)
Out[4]: True # opening browser window
The response at this point will contain the data for the property with the given block, borough and lot. Now you only need to use this knowledge in your spider: replace your POST with a GET with cookies, add a callback to the request you built in the shell, and it should work fine.
If this still does not work, or is somehow unsuited to your purposes, try extracting the hidden AJAX parameter (the value of nullctl00_ScriptManager1_HiddenField) and adding it to your formdata (and of course correct your formdata so that it is identical to what the browser sends).
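To look up several properties with the cookie approach, the cookie dict can be built per lookup. This is a sketch under assumptions: the zero-padding to five and four digits is inferred from the "01000" and "0011" values in the shell session, the helper name is hypothetical, and the second lookup tuple is made up.

```python
DETAILS_URL = "https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx"


def bbl_cookies(borough, block, lot):
    """Build the cookie dict in the shape used in the shell session above.

    Padding block to 5 digits and lot to 4 digits is an assumption
    inferred from the "01000" and "0011" example values.
    """
    return {
        "borough": str(borough),
        "block": str(block).zfill(5),
        "style": "default",
        "lot": str(lot).zfill(4),
    }


# The first tuple matches the shell session; the second one is made up.
lookups = [(1, 1000, 11), (1, 1000, 12)]
for borough, block, lot in lookups:
    print(DETAILS_URL, bbl_cookies(borough, block, lot))
```

In a spider, each dict would go into Request(DETAILS_URL, cookies=..., callback=...). Because every lookup hits the same URL, the requests would probably also need dont_filter=True so that Scrapy's duplicate filter does not drop them.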

You don't click the search button; instead, you make a POST request to a page with all the data. Checking the code, though, it sends a lot of data. Below are the parameters from my request:
ctl00_ScriptManager1_HiddenField:;;AjaxControlToolkit, Version=3.0.11119.25904, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:f48478dd-9360-4d50-94c1-5c5fa55bd379:865923e8:411fea1c:e7c87f07:91bd373d:1d58b08c:8e72a662:acd642d2:596d588c:77c58d20:14b56adc:269a19ae:bbfda34c:30a78ec5:5430d994
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTEwMDA4NDY4ODAPZBYCZg9kFgICBQ9kFgQCAg9kFgQCAQ8WAh4HVmlzaWJsZWhkAgcPFgIfAGgWAgIBDxYCHglpbm5lcmh0bWwFGEFsZXJ0IGZvcjxiciAvPiBCQkwgOiAtLWQCBA9kFgQCAg9kFgQCAQ9kFgRmDw8WBB4IQ3NzQ2xhc3MFF2FjY29yZGlvbkhlYWRlclNlbGVjdGVkHgRfIVNCAgJkZAIBDw8WBB8CBRBhY2NvcmRpb25Db250ZW50HwMCAhYCHgVzdHlsZQUOZGlzcGxheTpibG9jaztkAgIPZBYEZg8PFgQfAgUPYWNjb3JkaW9uSGVhZGVyHwMCAmRkAgEPDxYEHwIFEGFjY29yZGlvbkNvbnRlbnQfAwICFgIfBAUNZGlzcGxheTpub25lOxYCAgEPZBYCZg9kFgZmDw9kFgIfBAUNZGlzcGxheTpub25lO2QCDA8PFgIfAGhkZAINDw8WAh8AaGRkAgMPD2QWBh4FU3R5bGUFN3dpZHRoOjM1MHB4O2JhY2tncm91bmQ6d2hpdGU7ZGlzcGxheTpub25lO29wYWNpdHk6MC45MjseC29ubW91c2VvdmVyBQ93d2hIZWxwLnNob3coKTseCm9ubW91c2VvdXQFD3d3aEhlbHAuaGlkZSgpO2Rky2sFuMlw1iy/E0GN9cB65RXg7Aw=
__EVENTVALIDATION:/wEWGgKWm9a2BgL687aTAwLmha0BAujn2IECAo3DtaEJAtLdz/kGAr3g5K4DAu78ttcEAvOB3+MGAvKB3+MGAvGB3+MGAvCB3+MGAveB3+MGAoHAg44PArT/mOoPAqrvlMAJAtzQstcEAoDswboFAoHswboFAoLswboFAoPswboFAoTswboFAtjqpO8KAujQ7b0GAqvgnb0NAsPa/KsBQz19YIqBRvCWvZh8bk6XKxp+wQo=
grpStyle:blue
ctl00$SampleContent$MyAccordion_AccordionExtender_ClientState:0
ctl00$SampleContent$ctl01$TextBox1:(unable to decode value)
ctl00$SampleContent$ctl01$ddlParclBorough:1
ctl00$SampleContent$ctl01$txtBlock:100
ctl00$SampleContent$ctl01$txtLot:200
ctl00$SampleContent$ctl01$btnSearchBBL:Please Wait...
ctl00$SampleContent$ctl03$TextBox2:(unable to decode value)
ctl00$SampleContent$ctl03$ddlParclBoroughPropAddr:1
ctl00$SampleContent$ctl03$txtHouseNbr:
ctl00$SampleContent$ctl03$txtStreetNm:
ctl00$SampleContent$ctl03$txtAptNbr:
My suggestion is to use a scraping library that supports executing JS, or to use something else. I have had a lot of success using Selenium and WebDriver to run code in a browser, which supports JS.
Update:
There is an example in: How to submit a form using PhantomJS.
