Python Scrapy - Ajax Pagination Tripadvisor - python

I'm using Python-Scrapy to scrap the reviews of tripadvisor members pages.
Here is the url I'm using : http://www.tripadvisor.com/members/scottca075
I'm able to get the first page using scrapy. I haven't been able to get the other pages. I observed the XHR Request in the Network Tab of the browser on clicking Next button.
One GET and One POST request is sent:
On checking the parameters for the GET request, I see this:
action : undefined_Other_ClickNext_REVIEWS_ALL
gaa : Other_ClickNext_REVIEWS_ALL
gal : 50
gams : 0
gapu : Vq85qQoQKjYAABktcRMAAAAh
gass : members`
The request url is
`http://www.tripadvisor.com/ActionRecord?action=undefined_Other_ClickNext_REVIEWS_ALL&gaa=Other_ClickNext_REVIEWS_ALL&gal=0&gass=members&gapu=Vq8xPAoQLnMAAUutB9gAAAAJ&gams=1`
The parameter gal represents the offset. Each page has 50 reviews. On moving to the second page by clicking the next button, the parameter gal is set to 50. Then, 100,150,200..and so on.
The data that I want is in the POST request in json format. Image of JSON data in POST request. The request url on the post request is http://www.tripadvisor.com/ModuleAjax?
I'm confused as to how to make the request in scrapy to get the data.
I tried using FormRequest as follows:
pagination_url = "http://www.tripadvisor.com/ActionRecord"
form_date = {'action':'undefined_Other_ClickNext_REVIEWS_ALL','gaa':'Other_ClickNext_REVIEWS_ALL', 'gal':'0','gams':'0','gapu':'Vq8EngoQL3EAAJKgcx4AAAAN','gass':'members'}
FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parseItem)
I also tried setting headers options in the FormRequest
headers = {'Host':'www.tripadvisor.com','Referer':'http://www.tripadvisor.com/members/prizm','X-Requested-With': 'XMLHttpRequest'}
If someone could explain what I'm missing and point me in the right direction that would be great. I have run out of ideas.
And also, I'm aware that I can use selenium. But I want to know if there is a faster way to do this.

Use ScrapyJS - Scrapy+JavaScript integration
To use ScrapyJS in your project, you first need to enable the middleware:
DOWNLOADER_MIDDLEWARES = {
'scrapyjs.SplashMiddleware': 725,
}
For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:
import scrapy
class MySpider(scrapy.Spider):
start_urls = ["http://example.com", "http://example.com/foo"]
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(url, self.parse, meta={
'splash': {
'endpoint': 'render.html',
'args': {'wait': 0.5}
}
})
def parse(self, response):
# response.body is a result of render.html call; it
# contains HTML processed by a browser.
# …
A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:
function main(splash)
splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
splash:go("http://example.com")
splash:runjs("$('#some-button').click()")
return splash:html()
end
For more details check here

so for you are doing correct,
add the yield in front of FormRequest as:
yield FormRequest(''')
secondly focus on the value of gal, because it is the only parameter changing here and don`t keep gal = "0".
Find the total number of reviews and start from 50 to total pages adding 50 with each request.
form_date = {'action':'undefined_Other_ClickNext_REVIEWS_ALL','gaa':'Other_ClickNext_REVIEWS_ALL', 'gal':reviews_till_this_page,'gams':'0','gapu':'Vq8EngoQL3EAAJKgcx4AAAAN','gass':'members'}

Related

Scrapy gets only 24 first items of page

I tried many ways to scrape ikea page and I figured out that at last page ikea actually shows all the items. But when I try to scrape last page of ikea's product it only returns me the 24 first items (which corresponds to the items displayed for the first page.
this is the URL of the page:
https://www.ikea.com/fr/fr/cat/lits-bm003/?page=12
and this is the spider :
import scrapy
import pprint
class SpiderSpider(scrapy.Spider):
name = 'Ikea'
pages = 9
start_urls = ['https://www.ikea.com/fr/fr/cat/canapes-fu003/?page=12']
def parse(self, response):
data = {}
products = response.css('div.plp-product-list')
for product in products:
for p in product.css('div.range-revamp-product-compact'):
yield {
'Title' : p.css('div.range-revamp-header-section__title--small::text').getall()[0],
'Price' : p.css('span.range-revamp-price__integer::text').getall()[0],
'Desc' : p.css('span.range-revamp-header-section__description-text::text').getall()[0],
'Img' : p.css('img.range-revamp-aspect-ratio-image__image::attr(src)').getall()[0]
}
Scrapy's spider doesn't run JavaScript (that's the job of a browser), it will only load the same response content as a cURL would.
To do what exactly you suggest, you need a browser-based solution, like Selenium (Python) or Cypress (JavaScript). Either that or go through each page separately. Try to use a 'headless browser'.
There are probably better ways of doing this, but to address your exact question, this is the intended answer.

Click Button in Scrapy-Splash

I am writing a scrapy-splash program and I need to click on the display button on the webpage, as seen in the image below, in order to display the data, for 10th edition, so I can scrape it. I have the code I tried below but it does not work. The information I need is only accessible if I click the display button. UPDATE: Still struggling with this and I have to believe there is a way to do this. I do not want to scrape the JSON because that could be a red flag to site owners.
import scrapy
from ..items import NameItem
class LoginSpider(scrapy.Spider):
name = "LoginSpider"
start_urls = ["http://www.starcitygames.com/buylist/"]
def parse(self, response):
return scrapy.FormRequest.from_response(
response,
formcss='#existing_users form',
formdata={'ex_usr_email': 'email123#example.com', 'ex_usr_pass': 'password123'},
callback=self.after_login
)
def after_login(self, response):
item = NameItem()
display_button= response.xpath('//a[contains(., "- Display>>")]/#href').get()
response.follow(display_button, self.parse)
item["Name"] = response.css("div.bl-result-title::text").get()
return item
Your code can't work because there is no anchor element and no href attribute. Clicking the button will send an XMLHttpRequest to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and the data you want is found in the JSON response.
To check the request URL and response, open Dev Tools -> Network -> XHR and click Display.
In Headers tab you will find the request URL and in Preview or Response tabs you can inspect the JSON.
As you can see you'll need a category id to build the request URL. You can find this by parsing the script element found with this XPath //script[contains(., "categories")]
Then you can send your request from the spider to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and get the data you want.
$ curl 'http://www.starcitygames.com/buylist/search?search-type=category&id=5061'
{"ok":true,"search":"10th Edition","results":[[{"id":"46269","name":"Abundance","subtitle":null,"condition":"NM\/M","foil":true,"is_parent":false,"language":"English","price":"20.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"},{"id":"176986","name":"Abundance","subtitle":null,"condition":"PL","foil":true,"is_parent":false,"language":"English","price":"12.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"}....
As you can see, you don't even need to log in into the website or Splash.

Selecting dependent dropdown with scrapy-splash

I am trying to scrape the following website: https://www.climatempo.com.br/climatologia/558/saopaulo-sp. It has a two drop-down menu with the second depending on the first, so I choose to use scrapy and splash via scrapy-splash.
I need to automate the change of location by selecting first the state, then the city. I tried SplashFormRequest but I am not being able to change the cities list. My spider is (prints for debugging):
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest
class ExampleSpider(scrapy.Spider):
name = 'climatologia'
def start_requests(self):
urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
for url in urls:
yield SplashRequest(url=url, callback=self.parse,
endpoint='render.html',
args={'wait': 0.5},)
def parse(self, response):
print(response.url)
state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
print(state)
return SplashFormRequest(response.url, method='POST',
formdata={'sel-state-geo': 'SP'},
callback=self.state_selected,
args={'wait': 0.5})
def state_selected(self, response):
print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
print(response.css("select.slt-geo")[0].css("option::text").extract())
print(response.css("select.slt-geo")[1].css("option::text").extract())
This is a job that I would suggest Selenium for if you absolutely must use the sites menus. The only way to script Splash is through LUA scripts. You would have to send to the execute end point and create a LUA script. I found the options you were trying to select but not where to submit the form or how it functions on the site. I did have to translate to english.
My suggestion is to look in the browser inspector for end points like this is one of several which look particularly interesting:
https://www.climatempo.com.br/json/busca-estados
This endpoint gives json like follows
{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}]}
Hopefully this is another way to get the data you are looking for?
Then you can use normal requests to get the data. You would just have to form the request the same. Usually adding an accept, useragent, and requested with header is enough to pass.

Can't crawl more than a few items per page

I'm new to scrapy and tried to crawl from a couple of sites, but wasn't able to get more than a few images from there.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
for dress in response.css('article.npr-product-module'):
yield {
'src': dress.css('img.product-photo').xpath('#src').extract_first(),
'url': dress.css('a.product-photo-href').xpath('#href').extract_first()
}
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
for dress in response.css('div.cycle-image-0'):
yield {
'image-url': dress.xpath('.//img/#src').extract_first(),
}
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled COOKIES, and DOWNLOAD_DELAY of 5.
I imagine I will run into the same problem on any site so folks should have seen this before but can't find a reference to it.
What am I missing?
It's one of those weird websites where they store product data as json in html source and unpack it with javascript on page load later.
To figure this out usually what you want to do is
disable javascript and do scrapy view <url>
investigate the results
find the id in the product url and search that id in page source to check whether it exists and if so where it is hidden. If it doesn't exist that means it's being populated by some AJAX request -> reenable javascript, go to the page and dig through browser inspector's network tab to find it.
if you do regex based search:
re.findall("ProductResults, (\{.+\})\)", response.body_as_unicode())
You'll get a huge json that contains all products and their information.
import json
import re
data = re.findall("ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66
That gets a correct amount of products!
So in your parse you can do this:
def parse(self, response):
for product in data['ProductResult']['Products']:
# find main image
image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
yield {'image_url': image_url}

Scrapy - Javascript website

I'm familiar with scraping websites with Scrapy, however I cant seem to scrape this one (javascript perhaps ?).
I'm trying to download historical data for commodities for some personal research from this website:
http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx
On this website you will have to select the date and then click go. Once the data is loaded, you can click 'View in Excel' to download a CSV file with commodity prices for that day. I'm trying to build a scraper to download these CSV files for a few months. However, this website seems like a hard nut to crack. Any help will be appreciated.
Things i've tried:
1) Look at the page source to see if data is being loaded but not shown (hidden)
2) Used firebug to see if there are any AJAX requests
3) Modified POST headers to see if I can get data for different days. The post headers seem very complicated.
Asp.net websites are notoriously hard to crawl because it relies on viewsessions, being extremely strict with requests and loads of other nonsense.
Luckily your case seems to be pretty straightforward. Your scrapy approach should look something like:
import scrapy
from scrapy import FormRequest
class MxindiaSpider(scrapy.Spider):
name = "mxindia"
allowed_domains = ["mcxindia.com"]
start_urls = ('http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx',)
def parse(self, response):
yield FormRequest.from_response(response,
formdata={
'mTbdate': '02/13/2015', # your date here
'ScriptManager1': 'MupdPnl|mImgBtnGo',
'__EVENTARGUMENT': '',
'__EVENTTARGET': '',
'mImgBtnGo.x': '12',
'mImgBtnGo.y': '9'
},
callback=self.parse_cal, )
def parse_cal(self, response):
inspect_response(response, self) # everything is there!
What we do here is create FormRequest from the response object we already have. It's mart enough to find the <input> and <form> fields and generates formdata.
However some input fields that don't have defaults or we need to override the defaults need to be overriden with formdata argument.
So we provide formdata argument with updated form values. When you inspect the request you can see all of the form values you need to make a successful request:
So just copy all of them over to your formdata. Asp is really anal about the formdata so it takes some time experimenting what is required and what is not.
I'll leave you to figure out how to get to the next page yourself, usually it just adds aditional key to formadata like 'page': '2'.

Categories

Resources