invoke javascript function when scraping the web page

invoke javascript function when scraping the web page - python

I am scraping a web page using python, the page has:
<input name="Submit" type="button" class="btn" value="query" onclick = "dataQuery();" />
I want to trigger the onclick event in python, how can I do that, thanks

You can use selenium and/or HTMLUnit to control/simulate a web browser.
Note that if you just want to scrape a specific site, it's often easier to just recreate the JavaScript logic in your own code.

For scraping web-pages you might be interested in scrapy
For submitting a form you can use FormRequest.from_response:
class Spider(BaseSpider):
name = 'my_spider_name'
start_urls = ['http://www.domain.com/']
allowed_domains = ['domain.com']
advertiser = 'domain.com'
def parse(self, response):
"""Parse first page and submit a form on it"""
...
formdata = {'input1': '1234', 'input2': 5678} # overriden form data
yield = FormRequest.from_response(response, 'form_name', formdata = formdata,
callback = self.parseFormSubmit)
For more info see the scrapy docs and search on this site questions tagged with scrapy.

Related

Click Button in Scrapy-Splash

I am writing a scrapy-splash program and I need to click on the display button on the webpage, as seen in the image below, in order to display the data, for 10th edition, so I can scrape it. I have the code I tried below but it does not work. The information I need is only accessible if I click the display button. UPDATE: Still struggling with this and I have to believe there is a way to do this. I do not want to scrape the JSON because that could be a red flag to site owners.
import scrapy
from ..items import NameItem
class LoginSpider(scrapy.Spider):
name = "LoginSpider"
start_urls = ["http://www.starcitygames.com/buylist/"]
def parse(self, response):
return scrapy.FormRequest.from_response(
response,
formcss='#existing_users form',
formdata={'ex_usr_email': 'email123#example.com', 'ex_usr_pass': 'password123'},
callback=self.after_login
)
def after_login(self, response):
item = NameItem()
display_button= response.xpath('//a[contains(., "- Display>>")]/#href').get()
response.follow(display_button, self.parse)
item["Name"] = response.css("div.bl-result-title::text").get()
return item

Your code can't work because there is no anchor element and no href attribute. Clicking the button will send an XMLHttpRequest to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and the data you want is found in the JSON response.
To check the request URL and response, open Dev Tools -> Network -> XHR and click Display.
In Headers tab you will find the request URL and in Preview or Response tabs you can inspect the JSON.
As you can see you'll need a category id to build the request URL. You can find this by parsing the script element found with this XPath //script[contains(., "categories")]
Then you can send your request from the spider to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and get the data you want.
$ curl 'http://www.starcitygames.com/buylist/search?search-type=category&id=5061'
{"ok":true,"search":"10th Edition","results":[[{"id":"46269","name":"Abundance","subtitle":null,"condition":"NM\/M","foil":true,"is_parent":false,"language":"English","price":"20.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"},{"id":"176986","name":"Abundance","subtitle":null,"condition":"PL","foil":true,"is_parent":false,"language":"English","price":"12.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"}....
As you can see, you don't even need to log in into the website or Splash.

How to scrape the items loaded via a "view more" button using Scrapy

Here is the inspection of View more button in a website. I can crawl through data that are shown in the website but I want it somehow that it can crawl through items that are hidden behind the view more button. How do I do that?
<div id="view-more" class="p20px pt10px">
<div id="view-more-loader" class="tac"></div>
<b class="icon-refresh xsmall mr5px"></b>View More Products..
</div>
My scrapy code:
import scrapy
class DummymartSpider(scrapy.Spider):
name = 'dummymart'
allowed_domains = ['dummymart.net']
start_urls =['https://www.dummymart.com/catalog/car-dvd-player_cid100001018.html']
def parse(self, response):
Product = response.xpath('//div[#class="attr"]/h2/a/#title').extract()
Company = response.xpath('//div[#class="supplier"]/p/a/#title').extract()
Country = response.xpath('//*[#class="location a-color-secondary"]/span/text()').extract()
Category = response.xpath('//*[#class="attr category hide--mobile"]/span/a/text()').extract()
for item in zip(Product,Company,Country,Category):
scraped_info = {
'Product':item[0],
'Company': item[1],
'Country':item[2],
'Category':item[3]
}
yield scraped_info

The usual solution for a problem like this is:
Fire up the Developer Tools in your browser;
Go to the Network panel so that you can view the requests made by your browser;
Click the "view more" button in the page and check which request your browser did to fetch the data;
Make the same request on your spider.
This blog post may help you.

Selecting dependent dropdown with scrapy-splash

I am trying to scrape the following website: https://www.climatempo.com.br/climatologia/558/saopaulo-sp. It has a two drop-down menu with the second depending on the first, so I choose to use scrapy and splash via scrapy-splash.
I need to automate the change of location by selecting first the state, then the city. I tried SplashFormRequest but I am not being able to change the cities list. My spider is (prints for debugging):
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest
class ExampleSpider(scrapy.Spider):
name = 'climatologia'
def start_requests(self):
urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
for url in urls:
yield SplashRequest(url=url, callback=self.parse,
endpoint='render.html',
args={'wait': 0.5},)
def parse(self, response):
print(response.url)
state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
print(state)
return SplashFormRequest(response.url, method='POST',
formdata={'sel-state-geo': 'SP'},
callback=self.state_selected,
args={'wait': 0.5})
def state_selected(self, response):
print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
print(response.css("select.slt-geo")[0].css("option::text").extract())
print(response.css("select.slt-geo")[1].css("option::text").extract())

This is a job that I would suggest Selenium for if you absolutely must use the sites menus. The only way to script Splash is through LUA scripts. You would have to send to the execute end point and create a LUA script. I found the options you were trying to select but not where to submit the form or how it functions on the site. I did have to translate to english.
My suggestion is to look in the browser inspector for end points like this is one of several which look particularly interesting:
https://www.climatempo.com.br/json/busca-estados
This endpoint gives json like follows
{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}]}
Hopefully this is another way to get the data you are looking for?
Then you can use normal requests to get the data. You would just have to form the request the same. Usually adding an accept, useragent, and requested with header is enough to pass.

Python Scrapy - Ajax Pagination Tripadvisor

I'm using Python-Scrapy to scrap the reviews of tripadvisor members pages.
Here is the url I'm using : http://www.tripadvisor.com/members/scottca075
I'm able to get the first page using scrapy. I haven't been able to get the other pages. I observed the XHR Request in the Network Tab of the browser on clicking Next button.
One GET and One POST request is sent:
On checking the parameters for the GET request, I see this:
action : undefined_Other_ClickNext_REVIEWS_ALL
gaa : Other_ClickNext_REVIEWS_ALL
gal : 50
gams : 0
gapu : Vq85qQoQKjYAABktcRMAAAAh
gass : members`
The request url is
`http://www.tripadvisor.com/ActionRecord?action=undefined_Other_ClickNext_REVIEWS_ALL&gaa=Other_ClickNext_REVIEWS_ALL&gal=0&gass=members&gapu=Vq8xPAoQLnMAAUutB9gAAAAJ&gams=1`
The parameter gal represents the offset. Each page has 50 reviews. On moving to the second page by clicking the next button, the parameter gal is set to 50. Then, 100,150,200..and so on.
The data that I want is in the POST request in json format. Image of JSON data in POST request. The request url on the post request is http://www.tripadvisor.com/ModuleAjax?
I'm confused as to how to make the request in scrapy to get the data.
I tried using FormRequest as follows:
pagination_url = "http://www.tripadvisor.com/ActionRecord"
form_date = {'action':'undefined_Other_ClickNext_REVIEWS_ALL','gaa':'Other_ClickNext_REVIEWS_ALL', 'gal':'0','gams':'0','gapu':'Vq8EngoQL3EAAJKgcx4AAAAN','gass':'members'}
FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parseItem)
I also tried setting headers options in the FormRequest
headers = {'Host':'www.tripadvisor.com','Referer':'http://www.tripadvisor.com/members/prizm','X-Requested-With': 'XMLHttpRequest'}
If someone could explain what I'm missing and point me in the right direction that would be great. I have run out of ideas.
And also, I'm aware that I can use selenium. But I want to know if there is a faster way to do this.

Use ScrapyJS - Scrapy+JavaScript integration
To use ScrapyJS in your project, you first need to enable the middleware:
DOWNLOADER_MIDDLEWARES = {
'scrapyjs.SplashMiddleware': 725,
}
For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:
import scrapy
class MySpider(scrapy.Spider):
start_urls = ["http://example.com", "http://example.com/foo"]
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(url, self.parse, meta={
'splash': {
'endpoint': 'render.html',
'args': {'wait': 0.5}
}
})
def parse(self, response):
# response.body is a result of render.html call; it
# contains HTML processed by a browser.
# …
A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:
function main(splash)
splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
splash:go("http://example.com")
splash:runjs("$('#some-button').click()")
return splash:html()
end
For more details check here

so for you are doing correct,
add the yield in front of FormRequest as:
yield FormRequest(''')
secondly focus on the value of gal, because it is the only parameter changing here and don`t keep gal = "0".
Find the total number of reviews and start from 50 to total pages adding 50 with each request.
form_date = {'action':'undefined_Other_ClickNext_REVIEWS_ALL','gaa':'Other_ClickNext_REVIEWS_ALL', 'gal':reviews_till_this_page,'gams':'0','gapu':'Vq8EngoQL3EAAJKgcx4AAAAN','gass':'members'}

scrapy avoid crawler logging out

I am using the scrapy library to facilitate crawling a website.
The website uses authentication and I can successfully login to the page using scrapy.
The page has a URL which will log out the user and destroy the session.
How do I ensure scrapy avoids the logout page when crawling?

If you are using Link Extractors and simply don't want to follow this particular "logout" link, you can set deny property:
rules = [Rule(SgmlLinkExtractor(deny=[r'logout/']), follow=True),]
Another option is to check for response.url inside your spider's parse method:
def parse(self, response):
if 'logout' in response.url:
return
# extract items
Hope that helps.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.