Selecting dependent dropdown with scrapy-splash - python
I am trying to scrape the following website: https://www.climatempo.com.br/climatologia/558/saopaulo-sp. It has two drop-down menus, with the second depending on the first, so I chose to use Scrapy and Splash via scrapy-splash.
I need to automate the change of location by selecting first the state, then the city. I tried SplashFormRequest but I am not able to change the list of cities. My spider is below (the prints are for debugging):
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest


class ExampleSpider(scrapy.Spider):
    name = 'climatologia'

    def start_requests(self):
        urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse(self, response):
        print(response.url)
        state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
        print(state)
        return SplashFormRequest(response.url, method='POST',
                                 formdata={'sel-state-geo': 'SP'},
                                 callback=self.state_selected,
                                 args={'wait': 0.5})

    def state_selected(self, response):
        print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
        print(response.css("select.slt-geo")[0].css("option::text").extract())
        print(response.css("select.slt-geo")[1].css("option::text").extract())
This is a job I would suggest Selenium for if you absolutely must use the site's menus. The only way to script Splash is through Lua scripts: you would have to send the request to the execute endpoint and write a Lua script. I found the options you were trying to select, but not where the form is submitted or how it works on the site (I did have to translate the page to English).
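If you do go the Splash route, here is a minimal sketch of what driving the first menu through the execute endpoint could look like. The Lua script, the selectors, the dispatched change event and the wait times are all assumptions about how the page behaves, not something verified against the site:

import scrapy
from scrapy_splash import SplashRequest

# Assumed interaction: set the first <select> and fire a 'change' event so the
# page's own JavaScript repopulates the city list. Selectors and waits are guesses.
LUA_SELECT_STATE = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1.0)
    splash:runjs([[
        var state = document.querySelectorAll('select.slt-geo')[0];
        state.value = 'SP';
        state.dispatchEvent(new Event('change'));
    ]])
    splash:wait(1.0)
    return splash:html()
end
"""

class ClimatologiaLuaSpider(scrapy.Spider):
    name = 'climatologia_lua'

    def start_requests(self):
        yield SplashRequest(
            url='https://www.climatempo.com.br/climatologia/558/saopaulo-sp',
            callback=self.parse,
            endpoint='execute',                  # 'execute' runs the Lua script
            args={'lua_source': LUA_SELECT_STATE},
        )

    def parse(self, response):
        # if the page reacted to the scripted change, the second select
        # should now contain the cities for SP
        print(response.css("select.slt-geo")[1].css("option::text").extract())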
My suggestion is to look in the browser inspector for endpoints instead; this is one of several that look particularly interesting:
https://www.climatempo.com.br/json/busca-estados
This endpoint returns JSON like the following:
{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}]}
Hopefully this is another way to get the data you are looking for.
Then you can use normal Scrapy requests to get the data; you just have to form the request the same way the browser does. Usually adding Accept, User-Agent and X-Requested-With headers is enough to pass.
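As a rough sketch (the header values are guesses at what the site expects, and the JSON structure is taken from the sample above):

import json
import scrapy

class EstadosSpider(scrapy.Spider):
    name = 'climatempo_estados'

    def start_requests(self):
        # headers mimic an AJAX call from a browser; the exact set needed is an assumption
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest',
        }
        yield scrapy.Request('https://www.climatempo.com.br/json/busca-estados',
                             headers=headers, callback=self.parse_states)

    def parse_states(self, response):
        payload = json.loads(response.text)
        for state in payload['data']:
            # e.g. "SP - São Paulo"; a matching cities endpoint presumably exists,
            # but its URL would have to be confirmed in the browser's network tab
            self.logger.info('%s - %s', state['uf'], state['state'])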
Related
I can't use Scrapy on all web pages
I am new to Scrapy and I need to extract prices for some products from Walmart Canada. The problem is that it does not extract anything, and this only happens with Walmart Canada; when I use Scrapy on another web page it works correctly.

import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader


class WalmartItem(Item):
    barcodes = Field()
    sku = Field()


class WalmartCrawler(CrawlSpider):
    name = 'walmartCrawler'
    start_urls = ['https://www.walmart.ca/en/ip/apple-gala/6000195494284']

    def parse(self, response):
        item = ItemLoader(WalmartItem(), response)
        item.add_xpath(
            'barcodes',
            "//div[@class='css-1dar8at e1cuz6d10']/div[@class='css-w8lmum e1cuz6d11']"
            "/div[contains(text(), 'UPC')]/parent::node()/div[2]/text()")
        item.add_xpath(
            'sku',
            "//*[contains(text(), 'UPC')]/parent::node()/div[2]/text()")
        yield item.load_item()
Your XPath doesn't work; one way to do it is using a regex:

import re
import ast

sku = re.search(r'"sku":"(\d+)', response.text).groups()[0]
barcodes = ast.literal_eval(re.search(r'"upc":(\[.*?\])', response.text).groups()[0])
TL;DR: You cannot assume Scrapy will work to extract data from any web page. Some websites load information using browser scripting (JavaScript code) or AJAX requests; these run in the browser after the initial response is received from the server. This means that the HTML response you receive in Scrapy may not contain the information as you see it in the browser.

To check the response you will actually receive in Scrapy, look at the Network tab inside your browser's DevTools (in Google Chrome you can access them with Right Click > Inspect). Search for the initial request the browser makes to the server, then check the response to that request. This is the response you are going to receive in Scrapy, and it is the only HTML you can work with; as you can see, the price is not in it.

In these cases you must find another alternative, such as: a) using Selenium WebDriver, b) finding the product data inside a script tag in the HTML (which is the way to go in this case; check the first script tag inside the HTML), or c) doing an extraction via API. Take a look at this walmart.ca extraction script, which uses approach b) for each product in a list of products: https://github.com/juansimon27/scrapy-walmart/blob/master/product_scraping/spiders/spider.py

On top of this, in this specific case of walmart.ca, if you do not use the correct user agent in your requests, walmart.ca may respond with an <h2>Your web browser is not accepting cookies.</h2> message, or something like "Your browser is not able to execute JS." Configure the following user agent to avoid these problems:

custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
                  'Googlebot/2.1; http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36'
}

In your spider you can put this custom_settings definition just below your start_urls variable, or instead set USER_AGENT in a settings.py file.
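For this particular page, here is a rough sketch combining the user-agent fix with the regex idea from the other answer; the regex patterns and the JSON keys they target are assumptions to verify against the actual page source:

import ast
import re
import scrapy

class WalmartScriptSpider(scrapy.Spider):
    name = 'walmart_script'
    start_urls = ['https://www.walmart.ca/en/ip/apple-gala/6000195494284']
    # custom_settings sits right below start_urls, as suggested above
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
                      'Googlebot/2.1; http://www.google.com/bot.html) Safari/537.36',
    }

    def parse(self, response):
        # the product data is embedded as JSON in the raw HTML, so search the
        # response text instead of the rendered DOM; patterns are assumptions
        sku = re.search(r'"sku":"(\d+)', response.text)
        upc = re.search(r'"upc":(\[.*?\])', response.text)
        yield {
            'sku': sku.group(1) if sku else None,
            'barcodes': ast.literal_eval(upc.group(1)) if upc else None,
        }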
Click Button in Scrapy-Splash
I am writing a scrapy-splash program and I need to click the display button on the webpage for the 10th Edition section in order to display the data so I can scrape it. The code I tried is below, but it does not work; the information I need is only accessible if I click the display button.

UPDATE: Still struggling with this and I have to believe there is a way to do it. I do not want to scrape the JSON because that could be a red flag to the site owners.

import scrapy
from ..items import NameItem


class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'email123@example.com',
                      'ex_usr_pass': 'password123'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "- Display>>")]/@href').get()
        response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
Your code can't work because there is no anchor element and no href attribute. Clicking the button sends an XMLHttpRequest to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and the data you want is found in the JSON response.

To check the request URL and the response, open Dev Tools -> Network -> XHR and click Display. In the Headers tab you will find the request URL, and in the Preview or Response tabs you can inspect the JSON.

As you can see, you'll need a category id to build the request URL. You can find it by parsing the script element matched by this XPath:

//script[contains(., "categories")]

Then you can send your request from the spider to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and get the data you want.

$ curl 'http://www.starcitygames.com/buylist/search?search-type=category&id=5061'
{"ok":true,"search":"10th Edition","results":[[{"id":"46269","name":"Abundance","subtitle":null,"condition":"NM\/M","foil":true,"is_parent":false,"language":"English","price":"20.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"},{"id":"176986","name":"Abundance","subtitle":null,"condition":"PL","foil":true,"is_parent":false,"language":"English","price":"12.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"}....

As you can see, you don't even need to log in to the website, or use Splash.
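A sketch of what that could look like as a spider; the regex used to pull category ids out of the script tag is an assumption about how the ids are embedded, so adjust it after inspecting the actual script contents:

import json
import re
import scrapy

class BuylistSpider(scrapy.Spider):
    name = 'buylist'
    start_urls = ['http://www.starcitygames.com/buylist/']

    def parse(self, response):
        script = response.xpath('//script[contains(., "categories")]/text()').get() or ''
        # assumed pattern: numeric ids appear as "id":5061 inside the categories blob
        for category_id in set(re.findall(r'"id"\s*:\s*"?(\d+)"?', script)):
            url = ('http://www.starcitygames.com/buylist/search'
                   '?search-type=category&id={}'.format(category_id))
            yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        data = json.loads(response.text)
        # 'results' is a list of lists, as shown in the curl output above
        for group in data.get('results', []):
            for card in group:
                yield {'name': card['name'],
                       'price': card['price'],
                       'condition': card['condition']}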
LinkExtractor - extract with condition
I have a crawler that takes in URLs and then follows the next-page links for each URL in the start URLs, and it is working:

rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="pagnNext"]',)),
              callback="parse_start_url", follow=True),)

However, as you can imagine, I start getting captchas at some point for some URLs. I've heard there might be honeypots that are not visible to a human but are placed in the HTML code, designed to make you click them and identify yourself as a bot. I want the extractor to extract a link conditionally, for example not extract (and click) it if the CSS style display:none is present, or something like that. Is this doable?
I would do something like this:

def parse_page1(self, response):
    # only follow the next-page link if the element you want to check exists
    if response.css("thing i want to check exists"):
        next_url = response.xpath('//a[@class="pagnNext"]/@href').get()
        return scrapy.Request(response.urljoin(next_url), callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

Official docs: https://doc.scrapy.org/en/latest/topics/request-response.html

Note: as for your captcha issue, try adjusting your settings; at the very least make sure your DOWNLOAD_DELAY is set to something other than 0. Check out the other options: https://doc.scrapy.org/en/latest/topics/settings.html
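If you would rather keep the filtering at the LinkExtractor level, one option is to exclude hidden anchors directly in restrict_xpaths. This is a sketch: the honeypot markup is an assumption, and links hidden via classes or parent elements would need extra conditions:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ConditionalSpider(CrawlSpider):
    name = 'conditional'
    start_urls = ['http://www.example.com/']  # placeholder start URL

    # only follow "next" links whose inline style does not hide them
    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths=(
                    '//a[@class="pagnNext"][not(contains(@style, "display:none"))]',
                )
            ),
            callback='parse_start_url',
            follow=True,
        ),
    )

    def parse_start_url(self, response):
        self.logger.info("Visited %s", response.url)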
Can't crawl more than a few items per page
I'm new to Scrapy and tried to crawl a couple of sites, but wasn't able to get more than a few images from them. For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code:

def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }

I got 6 products; I expect 66. For the URL https://www.renttherunway.com/products/dress with the following code:

def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }

I got 12; I expect roughly 100. Even when I changed it to crawl every 'next' page, I got the same number per page, though it went through all pages successfully. I have tried a different USER_AGENT, disabled COOKIES, and a DOWNLOAD_DELAY of 5. I imagine I will run into the same problem on any site, so folks must have seen this before, but I can't find a reference to it. What am I missing?
It's one of those weird websites where they store product data as JSON in the HTML source and unpack it with JavaScript on page load. To figure this out, what you usually want to do is disable JavaScript and do scrapy view <url>, investigate the results, find the id in the product URL, and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, that means it's being populated by some AJAX request: re-enable JavaScript, go to the page and dig through the browser inspector's network tab to find it.

If you do a regex-based search:

re.findall("ProductResults, (\{.+\})\)", response.body_as_unicode())

you'll get a huge JSON that contains all the products and their information.

import json
import re

data = re.findall("ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66

That gets the correct number of products! So in your parse you can do this:

def parse(self, response):
    for product in data['ProductResult']['Products']:
        # find main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}
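Tying that together into a self-contained parse method (a sketch that assumes the ProductResults blob is present in the raw HTML exactly as matched by the regex above; response.text is the newer spelling of body_as_unicode()):

import json
import re

def parse(self, response):
    # extract the embedded JSON blob and iterate over the products it lists
    matches = re.findall(r"ProductResults, (\{.+\})\)", response.text)
    if not matches:
        return
    data = json.loads(matches[0])['data']
    for product in data['ProductResult']['Products']:
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}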
Python Scrapy - Ajax Pagination Tripadvisor
I'm using Python Scrapy to scrape the reviews on TripAdvisor member pages. Here is the URL I'm using: http://www.tripadvisor.com/members/scottca075

I'm able to get the first page using Scrapy, but I haven't been able to get the other pages. I observed the XHR requests in the Network tab of the browser on clicking the Next button: one GET and one POST request are sent. Checking the parameters of the GET request, I see this:

action: undefined_Other_ClickNext_REVIEWS_ALL
gaa: Other_ClickNext_REVIEWS_ALL
gal: 50
gams: 0
gapu: Vq85qQoQKjYAABktcRMAAAAh
gass: members

The request URL is http://www.tripadvisor.com/ActionRecord?action=undefined_Other_ClickNext_REVIEWS_ALL&gaa=Other_ClickNext_REVIEWS_ALL&gal=0&gass=members&gapu=Vq8xPAoQLnMAAUutB9gAAAAJ&gams=1

The parameter gal represents the offset. Each page has 50 reviews; on moving to the second page by clicking the Next button, gal is set to 50, then 100, 150, 200 and so on. The data that I want is in the POST request, in JSON format. The request URL of the POST request is http://www.tripadvisor.com/ModuleAjax?

I'm confused about how to make the request in Scrapy to get the data. I tried using FormRequest as follows:

pagination_url = "http://www.tripadvisor.com/ActionRecord"
form_date = {'action': 'undefined_Other_ClickNext_REVIEWS_ALL',
             'gaa': 'Other_ClickNext_REVIEWS_ALL',
             'gal': '0', 'gams': '0',
             'gapu': 'Vq8EngoQL3EAAJKgcx4AAAAN', 'gass': 'members'}
FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parseItem)

I also tried setting header options in the FormRequest:

headers = {'Host': 'www.tripadvisor.com',
           'Referer': 'http://www.tripadvisor.com/members/prizm',
           'X-Requested-With': 'XMLHttpRequest'}

If someone could explain what I'm missing and point me in the right direction, that would be great; I have run out of ideas. I'm aware that I can use Selenium, but I want to know if there is a faster way to do this.
Use ScrapyJS - Scrapy + JavaScript integration.

To use ScrapyJS in your project, you first need to enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:

import scrapy

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # response.body is a result of the render.html call; it
        # contains HTML processed by a browser.
        pass

A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:

function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end

For more details check here.
So far you are doing it correctly; add a yield in front of the FormRequest:

yield FormRequest(...)

Secondly, focus on the value of gal, because it is the only parameter changing here, and don't keep gal = "0". Find the total number of reviews and, starting from 50, add 50 with each request until you reach the total:

form_date = {'action': 'undefined_Other_ClickNext_REVIEWS_ALL',
             'gaa': 'Other_ClickNext_REVIEWS_ALL',
             'gal': reviews_till_this_page, 'gams': '0',
             'gapu': 'Vq8EngoQL3EAAJKgcx4AAAAN', 'gass': 'members'}
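A rough sketch of that pagination loop. The total review count is hard-coded because the selector for it depends on the profile page markup, the gapu value looks session-specific so it is omitted, and ActionRecord may only be a tracking call (the useful JSON reportedly comes from the ModuleAjax POST), so treat every parameter here as an assumption:

import scrapy
from scrapy import FormRequest

class MemberReviewsSpider(scrapy.Spider):
    name = 'member_reviews'
    start_urls = ['http://www.tripadvisor.com/members/scottca075']

    def parse(self, response):
        total_reviews = 100  # placeholder; read the real count from the profile page

        for offset in range(50, total_reviews, 50):
            yield FormRequest(
                url='http://www.tripadvisor.com/ActionRecord',
                formdata={'action': 'undefined_Other_ClickNext_REVIEWS_ALL',
                          'gaa': 'Other_ClickNext_REVIEWS_ALL',
                          'gal': str(offset), 'gams': '0',
                          'gass': 'members'},
                headers={'X-Requested-With': 'XMLHttpRequest'},
                callback=self.parse_reviews,
            )

    def parse_reviews(self, response):
        # inspect whatever comes back to decide how to parse the reviews
        self.logger.info('Got %d bytes of review data', len(response.body))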