scrapy avoid crawler logging out - python

I am using the scrapy library to facilitate crawling a website.
The website uses authentication and I can successfully login to the page using scrapy.
The page has a URL which will log out the user and destroy the session.
How do I ensure scrapy avoids the logout page when crawling?

If you are using Link Extractors and simply don't want to follow this particular "logout" link, you can set the deny property:
rules = [Rule(SgmlLinkExtractor(deny=[r'logout/']), follow=True),]
Another option is to check for response.url inside your spider's parse method:
def parse(self, response):
    if 'logout' in response.url:
        return
    # extract items
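In recent Scrapy versions SgmlLinkExtractor has been removed, but the same idea works with its replacement, LinkExtractor. A minimal sketch combining both options (the spider name and start URL are placeholders):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AuthedCrawlSpider(CrawlSpider):
    name = 'authed_crawl'
    start_urls = ['http://www.example.com/']  # placeholder

    # URLs matching a deny pattern are never scheduled, so the crawler
    # cannot hit the logout page and destroy its own session.
    rules = [
        Rule(LinkExtractor(deny=[r'logout/']), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        if 'logout' in response.url:  # defensive second check
            return
        # extract items here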
Hope that helps.

Related

Click Button in Scrapy-Splash

I am writing a scrapy-splash program and I need to click a display button on the webpage in order to reveal the data for 10th Edition so I can scrape it. The code I tried is below, but it does not work. The information I need is only accessible by clicking the display button. UPDATE: Still struggling with this, and I have to believe there is a way to do it. I do not want to scrape the JSON, because that could be a red flag to the site owners.
import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'email123@example.com', 'ex_usr_pass': 'password123'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "- Display>>")]/@href').get()
        response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
Your code can't work because there is no anchor element and no href attribute. Clicking the button will send an XMLHttpRequest to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and the data you want is found in the JSON response.
To check the request URL and response, open Dev Tools -> Network -> XHR and click Display.
In the Headers tab you will find the request URL, and in the Preview or Response tab you can inspect the JSON.
As you can see, you'll need a category id to build the request URL. You can find it by parsing the script element located with this XPath: //script[contains(., "categories")]
Then you can send your request from the spider to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and get the data you want.
$ curl 'http://www.starcitygames.com/buylist/search?search-type=category&id=5061'
{"ok":true,"search":"10th Edition","results":[[{"id":"46269","name":"Abundance","subtitle":null,"condition":"NM\/M","foil":true,"is_parent":false,"language":"English","price":"20.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"},{"id":"176986","name":"Abundance","subtitle":null,"condition":"PL","foil":true,"is_parent":false,"language":"English","price":"12.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"}....
As you can see, you don't even need to log in into the website or Splash.
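Putting this together, a minimal sketch of a spider that pulls the category ids out of that script element and requests the JSON directly (the regex over the script body is an assumption; adjust it to the structure the page actually embeds):
import json
import re
import scrapy

class BuylistSpider(scrapy.Spider):
    name = 'buylist'
    start_urls = ['http://www.starcitygames.com/buylist/']

    def parse(self, response):
        # Category ids live in an inline script; the exact JS shape is an
        # assumption here, so tune the regex to what you see in the source.
        script = response.xpath('//script[contains(., "categories")]/text()').get()
        for cat_id in re.findall(r'"id"\s*:\s*"?(\d+)"?', script or ''):
            url = ('http://www.starcitygames.com/buylist/search'
                   '?search-type=category&id=%s' % cat_id)
            yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        data = json.loads(response.text)  # response.json() on Scrapy >= 2.2
        # "results" is a list of lists, as in the curl output above
        for group in data.get('results', []):
            for card in group:
                yield {'name': card['name'], 'price': card['price']}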

Selecting dependent dropdown with scrapy-splash

I am trying to scrape the following website: https://www.climatempo.com.br/climatologia/558/saopaulo-sp. It has two drop-down menus, with the second depending on the first, so I chose to use Scrapy and Splash via scrapy-splash.
I need to automate the change of location by selecting first the state, then the city. I tried SplashFormRequest but I have not been able to change the cities list. My spider is (prints for debugging):
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest

class ExampleSpider(scrapy.Spider):
    name = 'climatologia'

    def start_requests(self):
        urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse(self, response):
        print(response.url)
        state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
        print(state)
        return SplashFormRequest(response.url, method='POST',
                                 formdata={'sel-state-geo': 'SP'},
                                 callback=self.state_selected,
                                 args={'wait': 0.5})

    def state_selected(self, response):
        print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
        print(response.css("select.slt-geo")[0].css("option::text").extract())
        print(response.css("select.slt-geo")[1].css("option::text").extract())
This is a job I would suggest Selenium for, if you absolutely must use the site's menus. The only way to script Splash is through Lua scripts: you would have to send your request to the execute endpoint with a Lua script. I found the options you were trying to select, but not where the form is submitted or how it functions on the site. I did have to translate the page to English.
My suggestion is to look in the browser inspector for endpoints; this is one of several that look particularly interesting:
https://www.climatempo.com.br/json/busca-estados
This endpoint returns JSON like the following:
{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}]}
Hopefully this is another way to get the data you are looking for. Then you can use plain requests (no Splash) to fetch it; you just have to form the request the same way the browser does. Usually adding Accept, User-Agent, and X-Requested-With headers is enough to pass.
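For instance, a minimal sketch of a spider hitting that endpoint and yielding one item per state (whether the site actually checks those headers is an assumption worth verifying):
import json
import scrapy

class EstadosSpider(scrapy.Spider):
    name = 'busca_estados'

    def start_requests(self):
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'User-Agent': 'Mozilla/5.0',
            'X-Requested-With': 'XMLHttpRequest',
        }
        yield scrapy.Request('https://www.climatempo.com.br/json/busca-estados',
                             headers=headers, callback=self.parse_states)

    def parse_states(self, response):
        # The JSON shape matches the sample above: a "data" list of states
        for state in json.loads(response.text)['data']:
            yield {'uf': state['uf'],
                   'state': state['state'],
                   'idstate': state['idstate']}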

Python/Scrapy scraping from Techcrunch

I am trying to build a spider to scrape some data from the website Techcrunch - Heartbleed search.
My thought was to pass the spider a tag when executing it from the command line (example: Heartbleed). The spider should then search through all the associated search results, open each link, and get the data contained within.
import scrapy

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem; actually getting the URLs to them (from the search page linked above) is.
The thing is, when looking at the source HTML (Ctrl+U) of the search page, I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your Scrapy class along the lines shown in this answer, but using the PhantomJS Selenium headless browser. The essential problem is that when Scrapy downloads those pages, the site uses JavaScript code to build the HTML (DOM) that you see, so you cannot access it via the route you have chosen.
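A short sketch of that setup (the CSS selector for result links is an assumption, so inspect the rendered markup; note that PhantomJS support was later deprecated in Selenium, and headless Chrome or Firefox are drop-in replacements):
# Requires: pip install selenium, plus the PhantomJS binary on PATH
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('https://techcrunch.com/?s=Heartbleed')

# Once JavaScript has built the DOM, the result links exist in the page
for link in driver.find_elements_by_css_selector('h2 a'):
    print(link.get_attribute('href'))

driver.quit()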

Scraping Page That Requires JavaScript Interaction

I am trying to scrape https://a836-propertyportal.nyc.gov/Default.aspx with Scrapy. I am having difficulty using the FormRequest--specifically, I do not know how to tell Scrapy how to fill the block and lot forms out, and then subsequently get the response of the page. I tried following the FormRequest example on the Scrapy website found here (http://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login), but continued to have difficulty with properly clicking on the "Search" button.
I would really appreciate it if you could offer any suggestions so that I can extract data from the submitted page. Some poster on SO suggested that Scrapy cannot handle JS events well, and to use another library like CasperJS instead.
Update: I would very much appreciate it if someone could please point me to a Java/Python/JS library that allows me to submit a form, and retrieve the subsequent information
Updated Code (following Pawel's comment): My code can be found here:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request

class MonshtarSpider(Spider):
    name = "monshtar"
    allowed_domains = ["https://a836-propertyportal.nyc.gov/Default.aspx"]
    start_urls = (
        'https://a836-propertyportal.nyc.gov/Default.aspx/',
    )

    def parse(self, response):
        print "entered the parsing section!!"
        yield Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
                      cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"},
                      callback=self.aftersubmit)

    def aftersubmit(self, response):
        # get the data....
        print "SUCCESS!!\n\n\n"
Your page is somewhat bizarre and difficult to parse. After submitting a valid POST request, the page responds with a 302 HTTP status and a bunch of cookies (your formdata is invalid, by the way; you need to replace the underscores with dollars in your parameter names).
Content can be viewed after sending a GET to https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx
The most surprising thing is that you can crawl this site using only cookies, without the POST request. The POST is there only to give you cookies; it does not redirect to or respond with an HTML response. You can manipulate those cookies from your spider. You only need to make a first GET to get the session cookie, and then successive GETs with borough, block, etc.
Try this in scrapy shell:
pawel@stackoverflow:~/stack/scrapy$ scrapy shell "https://a836-propertyportal.nyc.gov/Default.aspx"
In [1]: from scrapy.http import Request
In [2]: req = Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx", cookies = {"borough":"1", "block":"01000", "style":"default", "lot":"0011"})
In [3]: fetch(req)
In [4]: view(response)
Out[4]: True # opening browser window
Response at this point will contain data for property with given block, borough and lot. Now you only need to use this knowledge in your spider. Just replace your POST with GET with cookies, add callback to what you have in shell and it should work fine.
If this still does not work or is somehow unsuited to your purposes, try extracting the hidden AJAX parameter (the value of ctl00_ScriptManager1_HiddenField) and adding it to formdata (and of course correct your formdata so that it is identical to what the browser sends).
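Note that FormRequest.from_response already copies the hidden inputs (__VIEWSTATE, __EVENTVALIDATION, the ScriptManager field) from the page into the POST body, so a sketch like this only has to override the visible fields (field names taken from the POST data listed below; which form from_response picks up by default is an assumption to check):
from scrapy.spider import Spider
from scrapy.http import FormRequest

class MonshtarFormSpider(Spider):
    name = "monshtar_form"
    start_urls = ('https://a836-propertyportal.nyc.gov/Default.aspx',)

    def parse(self, response):
        # from_response pre-fills every input already present in the form,
        # including the hidden ASP.NET state fields. Note the dollars, not
        # underscores, in the parameter names.
        return FormRequest.from_response(
            response,
            formdata={
                'ctl00$SampleContent$ctl01$ddlParclBorough': '1',
                'ctl00$SampleContent$ctl01$txtBlock': '100',
                'ctl00$SampleContent$ctl01$txtLot': '200',
            },
            callback=self.aftersubmit)

    def aftersubmit(self, response):
        pass  # parse the exemption details here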
You don't click the search button; you make a POST request to the page with all the data. But checking the code, it sends a lot of data. Below I posted my request...
ctl00_ScriptManager1_HiddenField:;;AjaxControlToolkit, Version=3.0.11119.25904, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:f48478dd-9360-4d50-94c1-5c5fa55bd379:865923e8:411fea1c:e7c87f07:91bd373d:1d58b08c:8e72a662:acd642d2:596d588c:77c58d20:14b56adc:269a19ae:bbfda34c:30a78ec5:5430d994
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTEwMDA4NDY4ODAPZBYCZg9kFgICBQ9kFgQCAg9kFgQCAQ8WAh4HVmlzaWJsZWhkAgcPFgIfAGgWAgIBDxYCHglpbm5lcmh0bWwFGEFsZXJ0IGZvcjxiciAvPiBCQkwgOiAtLWQCBA9kFgQCAg9kFgQCAQ9kFgRmDw8WBB4IQ3NzQ2xhc3MFF2FjY29yZGlvbkhlYWRlclNlbGVjdGVkHgRfIVNCAgJkZAIBDw8WBB8CBRBhY2NvcmRpb25Db250ZW50HwMCAhYCHgVzdHlsZQUOZGlzcGxheTpibG9jaztkAgIPZBYEZg8PFgQfAgUPYWNjb3JkaW9uSGVhZGVyHwMCAmRkAgEPDxYEHwIFEGFjY29yZGlvbkNvbnRlbnQfAwICFgIfBAUNZGlzcGxheTpub25lOxYCAgEPZBYCZg9kFgZmDw9kFgIfBAUNZGlzcGxheTpub25lO2QCDA8PFgIfAGhkZAINDw8WAh8AaGRkAgMPD2QWBh4FU3R5bGUFN3dpZHRoOjM1MHB4O2JhY2tncm91bmQ6d2hpdGU7ZGlzcGxheTpub25lO29wYWNpdHk6MC45MjseC29ubW91c2VvdmVyBQ93d2hIZWxwLnNob3coKTseCm9ubW91c2VvdXQFD3d3aEhlbHAuaGlkZSgpO2Rky2sFuMlw1iy/E0GN9cB65RXg7Aw=
__EVENTVALIDATION:/wEWGgKWm9a2BgL687aTAwLmha0BAujn2IECAo3DtaEJAtLdz/kGAr3g5K4DAu78ttcEAvOB3+MGAvKB3+MGAvGB3+MGAvCB3+MGAveB3+MGAoHAg44PArT/mOoPAqrvlMAJAtzQstcEAoDswboFAoHswboFAoLswboFAoPswboFAoTswboFAtjqpO8KAujQ7b0GAqvgnb0NAsPa/KsBQz19YIqBRvCWvZh8bk6XKxp+wQo=
grpStyle:blue
ctl00$SampleContent$MyAccordion_AccordionExtender_ClientState:0
ctl00$SampleContent$ctl01$TextBox1:(unable to decode value)
ctl00$SampleContent$ctl01$ddlParclBorough:1
ctl00$SampleContent$ctl01$txtBlock:100
ctl00$SampleContent$ctl01$txtLot:200
ctl00$SampleContent$ctl01$btnSearchBBL:Please Wait...
ctl00$SampleContent$ctl03$TextBox2:(unable to decode value)
ctl00$SampleContent$ctl03$ddlParclBoroughPropAddr:1
ctl00$SampleContent$ctl03$txtHouseNbr:
ctl00$SampleContent$ctl03$txtStreetNm:
ctl00$SampleContent$ctl03$txtAptNbr:
My suggestion is to use a scraping library which supports executing JavaScript, or use something else. I have had much success using Selenium and WebDriver to execute code in the browser, which supports JS.
Update:
Here is an example of how to submit a form using PhantomJS.

invoke javascript function when scraping the web page

I am scraping a web page using Python; the page has:
<input name="Submit" type="button" class="btn" value="query" onclick = "dataQuery();" />
I want to trigger the onclick event in Python. How can I do that?
You can use Selenium and/or HtmlUnit to control/simulate a web browser.
Note that if you just want to scrape a specific site, it's often easier to just recreate the JavaScript logic in your own code.
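For example, a minimal Selenium sketch for this exact button (the URL is a placeholder, and the element API shown is the pre-Selenium-4 style):
from selenium import webdriver

driver = webdriver.Firefox()  # or a headless driver
driver.get('http://www.example.com/query-page')  # placeholder URL

# Either click the button so the browser runs its onclick handler...
driver.find_element_by_css_selector('input[name="Submit"]').click()
# ...or call the page's handler directly:
driver.execute_script('dataQuery();')

driver.quit()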
For scraping web pages you might be interested in Scrapy.
For submitting a form you can use FormRequest.from_response:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class Spider(BaseSpider):
    name = 'my_spider_name'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']
    advertiser = 'domain.com'

    def parse(self, response):
        """Parse the first page and submit a form on it"""
        ...
        formdata = {'input1': '1234', 'input2': 5678}  # overridden form data
        yield FormRequest.from_response(response, formname='form_name',
                                        formdata=formdata,
                                        callback=self.parseFormSubmit)
For more info, see the Scrapy docs and search this site for questions tagged with scrapy.
