I'm using Scrapy to crawl a website that requires authentication.
I want to be able to save the state of the crawler, so I run:
scrapy crawl myspider -s JOBDIR=mydir
After I resume with the same command, I want to log in to the website before Scrapy reschedules all the saved requests.
Basically, I want to be sure that my functions login() and after_login() are called before any other request is scheduled and executed. I don't want to rely on cookies, because they don't allow me to pause the crawl for a long time.
I can call login() in start_requests(), but this works only when I run the crawler for the first time.
from scrapy import FormRequest, Request
from scrapy.spiders import CrawlSpider

class MyCrawlSpider(CrawlSpider):
    # ...
    START_URLS = ['someurl.com', 'someurl2.com']
    LOGIN_PAGE = u'https://login_page.php'

    def login(self):
        # The callback must match the method defined below (login_form).
        return Request(url=self.LOGIN_PAGE, callback=self.login_form,
                       dont_filter=True, priority=9999)

    def login_form(self, response):
        return FormRequest.from_response(response,
                                         formdata={'Email': 'myemail',
                                                   'Passwd': 'mypasswd'},
                                         callback=self.after_login,
                                         dont_filter=True,
                                         priority=9999)

    def after_login(self, response):
        # response.text is the decoded body; response.body is bytes on Python 3.
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
        else:
            print("Login Successful!!")
            self.is_logged_in = True
            for url in self.START_URLS:
                yield Request(url, callback=self.parse_artists_json, dont_filter=True)
Bottom line: is there any callback that is always called when I resume crawling with the -s JOBDIR=... option, before the previous requests are rescheduled? I would use it to call the login() method.
You can use the spider_opened signal.
Its handler is intended for resource allocation and other spider initialization, so it isn't expected to yield a Request object.
You can work around this by keeping a list of pending requests. This is needed because Scrapy doesn't let you schedule requests manually.
Then, after resuming the spider, you can put the login request at the front of that list:
def spider_opened(self, spider):
    # The signal passes the spider in, so use that argument directly.
    spider.requests.insert(0, spider.login())
You also need to add a next_request method to your spider:
def next_request(self):
    # Return the next pending request, or None when the list is empty.
    if self.requests:
        return self.requests.pop(0)
Then queue all your requests by appending them to the requests list, and call next_request at the end of each method:
def after_login(self, response):
    if "authentication failed" in response.text:
        self.logger.error("Login failed")
        return
    else:
        print("Login Successful!!")
        self.is_logged_in = True
        if not self.requests:
            for url in self.START_URLS:
                self.requests.append(Request(url, callback=self.parse_artists_json, dont_filter=True))
        yield self.next_request()
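As a rough illustration of the ordering described above, here is a plain-Python model in which strings stand in for Request objects. The class name PendingRequests and its methods are hypothetical and not part of Scrapy; the sketch only shows that a login request inserted at the front is handed out before anything queued earlier.

```python
class PendingRequests:
    """Hypothetical stand-in for the spider's `requests` list."""

    def __init__(self):
        self._items = []

    def append(self, request):
        self._items.append(request)

    def push_front(self, request):
        # Mirrors requests.insert(0, ...) in the spider_opened handler.
        self._items.insert(0, request)

    def next_request(self):
        # Return the next pending request, or None when nothing is left.
        if self._items:
            return self._items.pop(0)
        return None


pending = PendingRequests()
pending.append("GET someurl.com")
pending.append("GET someurl2.com")
# On resume, the login request jumps ahead of everything already queued.
pending.push_front("GET login page")

order = []
while True:
    req = pending.next_request()
    if req is None:
        break
    order.append(req)
print(order)
```

In a real spider the spider_opened handler is usually connected inside from_crawler with crawler.signals.connect(handler, signal=signals.spider_opened).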
Related
I am trying to maintain a user session while scraping a site, using the Splash session handling described in the Splash GitHub repo. What is a possible way to maintain the user session while following rules?
Just as I was able to pass the cookies and headers to the after_login method, is there a way to do the same while processing requests (process_request="use_splash") in the rules' link extractors using Splash? I am new to scrapy-splash, so I am sorry if my question sounds a bit foggy; I have posted only a minimal example here.
rules = (
    Rule(LinkExtractor(restrict_xpaths="//div[@id= 'archive-tables']//tbody/tr[@xsid=1]/td/a"), process_request="use_splash", follow=True),
    Rule(LinkExtractor(restrict_xpaths="//div[@class = 'main-menu2 main-menu-gray']//strong/a"), process_request="use_splash", callback='parse', follow=True),
)

def start_requests(self):
    """called before crawling starts. Try to login"""
    yield SplashRequest(
        url=self.login_page, callback=self.after_login, endpoint='execute',
        args={'lua_source': self.login_lua, 'wait': 2, 'timeout': 90, 'images': 0},
    )

def after_login(self, response):
    cookies = response.data['cookies']
    headers = response.data['headers']
    url = 'https://www.oddsportal.com/results/'
    yield SplashRequest(url=url, endpoint='execute', cache_args=['lua_source'],
                        cookies=cookies, headers=headers,
                        args={'lua_source': self.lua_request},
                        errback=self.errback_httpbin,
                        dont_filter=True)

def use_splash(self, request, response):
    #print(response.cookiejar)
    #print('check the cookies',response.url,response.meta['splash']['args']['cookies'])
    request.meta.update(splash={'args': {'wait': 0.5, 'session_id': 1,
                                         # 'lua_source':self.lua_request --gives me error
                                         }, 'endpoint': 'execute', 'timeout': 90, 'images': 0})
    return request
I have a list of entries in a database, each of which corresponds to a scraping task. I want the spider to continue to the next entry only once the current one has finished. Here is some pseudocode that gives the idea, though it is not exactly what I want: it uses a while loop, which creates a massive backlog of entries waiting to be processed.
def start_requests(self):
    while True:
        rec = GetDocumentAndMarkAsProcessing()
        if rec is None:
            break
        script = getScript(rec)
        yield SplashRequest(..., callback=self.parse, endpoint="execute",
            args={
                'lua_source': script
            }
        )

def parse(self, response):
    ... store results in database ...
How can I make scrapy work on the next entry only when it has received a response from the previous SplashRequest for the previous entry?
I am not sure if simple callback functions would be enough to do the trick or if I need something more sophisticated.
All I needed to do was explicitly call another request with yield in the parse function with parse as the callback itself. So in the end I have something like this:
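def start_requests(self):
    rec = GetDocumentAndMarkAsProcessing()
    if rec is None:
        return  # nothing to process (assumes None means "no more records", as in the question)
    script = getScript(rec)
    yield SplashRequest(..., callback=self.parse, endpoint="execute",
        args={
            'lua_source': script
        }
    )

def parse(self, response):
    ... store results in database ...
    # Only now fetch the next record, so a single entry is in progress at a time.
    rec = GetDocumentAndMarkAsProcessing()
    if rec is None:
        return  # no records left, so the chain ends here
    script = getScript(rec)
    yield SplashRequest(..., callback=self.parse, endpoint="execute",
        args={
            'lua_source': script
        }
    )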
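The chaining idea can be checked without Scrapy. The following is a minimal sketch in plain Python: the record queue, the function names and the small driver loop are all hypothetical stand-ins (the loop plays the role of the engine, and the "response" is simply the record itself). It shows that when the callback is the only place that asks for the next record, no more than one task is ever in flight.

```python
from collections import deque

# Hypothetical stand-ins: a queue of database records and a results store.
records = deque(["rec-1", "rec-2", "rec-3"])
results = []
in_flight = 0      # number of requests currently outstanding
max_in_flight = 0  # highest value in_flight ever reached


def get_document_and_mark_as_processing():
    """Return the next record, or None when none are left."""
    return records.popleft() if records else None


def make_request():
    """Build the next 'request' (here just the record), or None to stop."""
    global in_flight, max_in_flight
    rec = get_document_and_mark_as_processing()
    if rec is None:
        return None
    in_flight += 1
    max_in_flight = max(max_in_flight, in_flight)
    return rec


def parse(response):
    """Store the result, then ask for the next request, like the callback above."""
    global in_flight
    in_flight -= 1
    results.append(response.upper())
    return make_request()


# A tiny driver playing the role of the engine: "download", then call back.
request = make_request()
while request is not None:
    request = parse(request)

print(results, max_in_flight)
```

The None check is what ends the chain; without it the last callback would try to build a request from a missing record.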
I believe you might be able to achieve this by setting CONCURRENT_REQUESTS to 1 in your settings.py. That will make it so the crawler only sends one request at a time, although I admit I am not sure how the timing works on the second request - whether it sends it when the callback is finished executing or when the response is retrieved.
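For reference, the setting mentioned above would look roughly like this in settings.py. This is a config sketch only; whether it is sufficient to serialize the database-driven tasks is not confirmed here.

```python
# settings.py
# Allow at most one request to be in flight at any time.
CONCURRENT_REQUESTS = 1
```

The same key can also be placed in a spider's custom_settings dict if it should apply to one spider only.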
I've created a Python script using Scrapy to scrape some information from a certain webpage. The problem is that the link I'm trying gets redirected very often. However, when I try a few times using requests, I get the desired content.
With Scrapy, I'm unable to reuse the link because it gets redirected no matter how many times I try. I can even capture the original URL using response.meta.get("redirect_urls")[0], which is meant to be used recursively within the parse method. However, it always gets redirected, and as a result the callback never fires.
This is my current attempt (the link used within the script is just a placeholder):
import scrapy
from scrapy.crawler import CrawlerProcess

class StackoverflowSpider(scrapy.Spider):
    handle_httpstatus_list = [301, 302]
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        if response.meta.get("lead_link"):
            self.lead_link = response.meta.get("lead_link")
        elif response.meta.get("redirect_urls"):
            self.lead_link = response.meta.get("redirect_urls")[0]

        try:
            if response.status != 200: raise
            if not response.css("[itemprop='text'] > h2"): raise
            answer_title = response.css("[itemprop='text'] > h2::text").get()
            print(answer_title)
        except Exception:
            print(self.lead_link)
            yield scrapy.Request(self.lead_link, meta={"lead_link": self.lead_link}, dont_filter=True, callback=self.parse)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
Question: How can I force scrapy to make a callback using the url that got redirected?
As far as I understand, you want to keep scraping a link until it stops redirecting and you finally get HTTP status 200.
If so, first remove handle_httpstatus_list = [301, 302] from your code.
Then create a CustomMiddleware in middlewares.py:
import logging

class CustomMiddleware(object):
    def process_response(self, request, response, spider):
        if not response.css("[itemprop='text'] > h2"):
            logging.info('Desired text not found in %s, so re-scraping' % request.url)
            # Set dont_filter on the copy that is returned, not on the original.
            req = request.copy()
            req.dont_filter = True
            return req
        if response.status in [301, 302]:
            original_url = request.meta.get('redirect_urls', [response.url])[0]
            logging.info('%s is redirecting to %s, so re-scraping it' % (original_url, request.url))
            # replace() builds a new request instead of mutating the private _url attribute.
            return request.replace(url=original_url, dont_filter=True)
        return response
Then your spider should look something like this:
class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.CustomMiddleware': 100,
        }
    }

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        answer_title = response.css("[itemprop='text'] > h2::text").get()
        print(answer_title)
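The middleware relies on the redirect_urls list that Scrapy's redirect handling stores in the request meta. The lookup, plus a guard against re-scraping forever, can be sketched in plain Python as below. The rescrape_count key and the should_rescrape helper are hypothetical additions for illustration and are not part of the answer or of Scrapy.

```python
def original_url(meta, fallback_url):
    """Return the first URL of the redirect chain, or fallback_url if there is none."""
    return meta.get("redirect_urls", [fallback_url])[0]


def should_rescrape(meta, max_attempts=5):
    """Hypothetical guard: allow another attempt only while below max_attempts."""
    return meta.get("rescrape_count", 0) < max_attempts


meta = {"redirect_urls": ["https://example.com/a", "https://example.com/b"]}
first = original_url(meta, "https://example.com/c")
fallback = original_url({}, "https://example.com/c")
print(first, fallback)
print(should_rescrape({"rescrape_count": 2}), should_rescrape({"rescrape_count": 5}))
```

A cap of this kind may be worth adding, because a page that never contains the desired element would otherwise be requested again indefinitely.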
If you tell me which site you are scraping, I can help you out. You can also email me; the address is on my profile.
You may want to see this.
If you need to prevent redirecting, you can do so through the request meta:
request = scrapy.Request(self.start_url,meta={"lead_link":self.start_url},callback=self.parse)
request.meta['dont_redirect'] = True
yield request
According to the documentation, this is a way to stop redirecting.
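As a small sketch of how that meta could be assembled, the helper below builds the dict in one place. The function name no_redirect_meta is hypothetical. Including handle_httpstatus_list is an assumption about what the spider needs: the 301/302 response is not a 2xx response, so it would normally be filtered out before reaching the callback.

```python
def no_redirect_meta(lead_link):
    """Build request meta that keeps the redirect response instead of following it."""
    return {
        "lead_link": lead_link,
        "dont_redirect": True,
        # Let the 301/302 response itself through to the callback.
        "handle_httpstatus_list": [301, 302],
    }


meta = no_redirect_meta("https://example.com/start")
print(meta)
```

The result would be passed as the meta argument of scrapy.Request.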
I'm trying to scrape a website for broken links. So far I have this code, which successfully logs in and crawls the site, but it only records HTTP 200 status codes:
import requests
import scrapy
from lxml import html
from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor

# HttpResponseItem is the project's own Item class; its definition is not shown here.

class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True
    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value
        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item
        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r
The docs and these answers lead me to believe that handle_httpstatus_all = True should cause scrapy to pass errored requests to my parse method, but so far I've not been able to capture any.
I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.
What do I need to change to capture the HTTP error codes scrapy is encountering?
handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be set at the Request level, by including it in the meta argument.
I would still recommend using an errback for these cases, but if everything is controlled, it shouldn't create new problems.
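A minimal sketch of the per-request approach, written without Scrapy so it can be run on its own: with_all_statuses and is_broken are hypothetical helper names, and treating every status of 400 or above as "broken" is an assumption about what the question wants to record.

```python
def with_all_statuses(meta=None):
    """Return a copy of meta with handle_httpstatus_all switched on."""
    merged = dict(meta or {})
    merged["handle_httpstatus_all"] = True
    return merged


def is_broken(status):
    """Assumed definition of a broken link: any 4xx or 5xx status."""
    return status >= 400


meta = with_all_statuses({"link_text": "About us"})
print(meta)
print([is_broken(s) for s in (200, 301, 404, 500)])
```

In the spider, the dict would be passed as Request(link.url, self.parse, meta=with_all_statuses({...})), and the login FormRequest would need the same key for its own response to be covered.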
So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).
I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}
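A possibly gentler alternative, assuming the documented HTTPERROR_ALLOW_ALL setting fits the use case, is to leave the middleware enabled and tell it to pass every response through. This is a config sketch and has not been tried against the spider above.

```python
# settings.py
# Pass all responses to the spider callbacks, regardless of status code.
HTTPERROR_ALLOW_ALL = True
```

If only a few codes matter, HTTPERROR_ALLOWED_CODES accepts a list of status codes instead.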
I'm trying to crawl a large site that has a rate-limiting system in place. Is it possible to pause Scrapy for 10 minutes when it encounters a 403 page? I know I can set a DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes when it gets a 403. This way the rate limiting is triggered only once every hour or so.
You can write your own retry middleware and put it in middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from time import sleep

class SleepRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        RetryMiddleware.__init__(self, settings)

    def process_response(self, request, response, spider):
        if response.status in [403]:
            sleep(120)  # 120 seconds; increase this for a longer pause
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super(SleepRetryMiddleware, self).process_response(request, response, spider)
And don't forget to update settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
Scrapy is a Twisted-based Python framework. So, never use time.sleep or pause.until inside it!
Instead, try using Deferred() from Twisted.
from scrapy import Request, Spider
from twisted.internet.defer import Deferred

class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        # pause is a helper that is not defined in this answer.
        parse_and_pause.addCallback(pause, seconds=10)  # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
More info here: Scrapy: non-blocking pause
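To see why a blocking sleep is discouraged in an event-driven framework, here is a small stand-alone sketch. It uses asyncio from the standard library as a stand-in for the Twisted reactor, which is an assumption made so the example runs without Scrapy or Twisted installed. The task names are hypothetical. While one task waits, the other still gets to run, which would not happen with time.sleep.

```python
import asyncio

events = []


async def paused_task():
    events.append("pause start")
    # A non-blocking wait: control returns to the event loop meanwhile.
    await asyncio.sleep(0.05)
    events.append("pause end")


async def other_task():
    # Starts slightly later and finishes while paused_task is still waiting.
    await asyncio.sleep(0.01)
    events.append("other work")


async def main():
    await asyncio.gather(paused_task(), other_task())


asyncio.run(main())
print(events)
```

Recent Scrapy releases document support for async def callbacks, so a similar await-based pause may be possible there, depending on the reactor configured for the project.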