I am trying to maintain a user session while scraping a site, using the Splash session handling described in the scrapy-splash GitHub repo. My question is: how can I maintain the user session while following the rules?
Just as I was able to pass the cookies and headers to the after_login method, is there a way to do the same while processing requests (process_request="use_splash") in the rule's link extractor with Splash? I am a newbie with scrapy-splash, so I am sorry if my question sounds a bit foggy; I have posted a minimal example here.
rules = (
    Rule(LinkExtractor(restrict_xpaths="//div[@id='archive-tables']//tbody/tr[@xsid=1]/td/a"),
         process_request="use_splash", follow=True),
    Rule(LinkExtractor(restrict_xpaths="//div[@class='main-menu2 main-menu-gray']//strong/a"),
         process_request="use_splash", callback='parse', follow=True),
)
def start_requests(self):
    """Called before crawling starts. Try to log in."""
    yield SplashRequest(
        url=self.login_page, callback=self.after_login, endpoint='execute',
        args={'lua_source': self.login_lua, 'wait': 2, 'timeout': 90, 'images': 0},
    )
def after_login(self, response):
    cookies = response.data['cookies']
    headers = response.data['headers']
    url = 'https://www.oddsportal.com/results/'
    yield SplashRequest(url=url, endpoint='execute', cache_args=['lua_source'],
                        cookies=cookies, headers=headers,
                        args={'lua_source': self.lua_request},
                        errback=self.errback_httpbin,
                        dont_filter=True)
def use_splash(self, request, response):
    # print(response.cookiejar)
    # print('check the cookies', response.url, response.meta['splash']['args']['cookies'])
    request.meta.update(splash={
        'args': {
            'wait': 0.5,
            'session_id': 1,
            # 'lua_source': self.lua_request  -- gives me an error
        },
        'endpoint': 'execute',
        'timeout': 90,
        'images': 0,
    })
    return request
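For what it's worth, one possible way to carry the login session into the rule-generated requests (a minimal sketch, not tested against this site) is to keep the cookies returned by the login script on the spider, e.g. self.session_cookies = response.data['cookies'] in after_login, and forward them from use_splash together with a Lua script that calls splash:init_cookies(splash.args.cookies) and returns splash:get_cookies(), as described in the scrapy-splash session handling notes. The attribute names self.session_cookies and self.lua_request are assumptions here:
def use_splash(self, request, response):
    # Sketch: forward the cookies captured in after_login (assumed to be
    # stored as self.session_cookies) to the /execute endpoint.
    # self.lua_request is assumed to call splash:init_cookies(splash.args.cookies)
    # and to return {cookies = splash:get_cookies(), html = splash:html()}.
    request.meta['splash'] = {
        'endpoint': 'execute',
        'args': {
            'lua_source': self.lua_request,
            'cookies': self.session_cookies,
            'wait': 0.5,
            'timeout': 90,
            'images': 0,
        },
    }
    return request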
I have a list of entries in a database, each of which corresponds to a scraping task. I want the spider to continue to the next entry only once the previous one is finished. Here is some pseudocode that gives the idea of what I want to do, though it is not exactly what I want because it uses a while loop, creating a massive backlog of entries waiting to be processed.
def start_requests(self):
    while True:
        rec = GetDocumentAndMarkAsProcessing()
        if rec is None:
            break
        script = getScript(rec)
        yield SplashRequest(..., callback=self.parse, endpoint="execute",
                            args={
                                'lua_source': script
                            })

def parse(self, response):
    ... store results in database ...
How can I make Scrapy work on the next entry only once it has received a response to the SplashRequest for the previous entry?
I am not sure whether simple callback functions are enough to do the trick or whether I need something more sophisticated.
All I needed to do was explicitly yield another request in the parse function, with parse itself as the callback. So in the end I have something like this:
def start_requests(self):
    rec = GetDocumentAndMarkAsProcessing()
    script = getScript(rec)
    yield SplashRequest(..., callback=self.parse, endpoint="execute",
                        args={
                            'lua_source': script
                        })

def parse(self, response):
    ... store results in database ...
    rec = GetDocumentAndMarkAsProcessing()
    script = getScript(rec)
    yield SplashRequest(..., callback=self.parse, endpoint="execute",
                        args={
                            'lua_source': script
                        })
I believe you might be able to achieve this by setting CONCURRENT_REQUESTS to 1 in your settings.py. That will make the crawler send only one request at a time, although I admit I am not sure how the timing works for the second request - whether it is sent when the callback finishes executing or when the response is received.
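If you want to try that approach, the setting itself is a one-liner in settings.py; limiting the per-domain concurrency as well is probably a good idea:
# settings.py
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1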
I've created a Python script using Scrapy to scrape some information from a certain webpage. The problem is that the link I'm working with gets redirected very often. However, when I try a few times using requests, I get the desired content.
With Scrapy, I'm unable to reuse the link because I find it redirecting no matter how many times I try. I can even catch the main URL using response.meta.get("redirect_urls")[0], meant to be used recursively within the parse method. However, it always gets redirected, and as a result the callback never takes place.
This is my current attempt (the link used within the script is just a placeholder):
import scrapy
from scrapy.crawler import CrawlerProcess

class StackoverflowSpider(scrapy.Spider):
    handle_httpstatus_list = [301, 302]
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        if response.meta.get("lead_link"):
            self.lead_link = response.meta.get("lead_link")
        elif response.meta.get("redirect_urls"):
            self.lead_link = response.meta.get("redirect_urls")[0]

        try:
            if response.status != 200:
                raise Exception("non-200 status")
            if not response.css("[itemprop='text'] > h2"):
                raise Exception("desired content not found")
            answer_title = response.css("[itemprop='text'] > h2::text").get()
            print(answer_title)
        except Exception:
            print(self.lead_link)
            yield scrapy.Request(self.lead_link, meta={"lead_link": self.lead_link},
                                 dont_filter=True, callback=self.parse)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
Question: How can I force scrapy to make a callback using the url that got redirected?
As far as I understand, you want to keep scraping a link until it stops redirecting and you finally get HTTP status 200.
If so, then you first have to remove handle_httpstatus_list = [301, 302] from your code.
Then create a CustomMiddleware in middlewares.py:
import logging

class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if not response.css("[itemprop='text'] > h2"):
            logging.info('Desired text not found, so re-scraping %s' % request.url)
            req = request.copy()
            req.dont_filter = True
            return req

        if response.status in [301, 302]:
            original_url = request.meta.get('redirect_urls', [response.url])[0]
            logging.info('%s is redirecting to %s, so re-scraping it' % (request._url, request.url))
            # Request.url is read-only, hence the private attribute
            request._url = original_url
            request.dont_filter = True
            return request

        return response
Then your spider should look something like this:
class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.CustomMiddleware': 100,
        }
    }

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        answer_title = response.css("[itemprop='text'] > h2::text").get()
        print(answer_title)
If you tell me which site you are scraping, I can help you out; you can also email me at the address on my profile.
You may want to see this.
If you need to prevent redirecting, it is possible via the request meta:
request = scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)
request.meta['dont_redirect'] = True
yield request
According to the documentation, this is one way to stop redirecting.
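Note that with redirecting disabled, the 301/302 response itself is what comes back, and by default HttpErrorMiddleware will drop it before it reaches your callback. A small sketch of allowing it through on a per-request basis, combining two standard meta keys:
request = scrapy.Request(self.start_url,
                         meta={"lead_link": self.start_url,
                               "dont_redirect": True,
                               # let the 301/302 response reach the callback
                               "handle_httpstatus_list": [301, 302]},
                         callback=self.parse)
yield request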
I'm trying to scrape a website for broken links. So far I have this code, which successfully logs in and crawls the site, but it only records HTTP status 200 codes:
import requests
import scrapy
from lxml import html
from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor

# HttpResponseItem is the project's Item subclass, defined in items.py


class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value
        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r
The docs and these answers lead me to believe that handle_httpstatus_all = True should cause scrapy to pass errored requests to my parse method, but so far I've not been able to capture any.
I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.
What do I need to change to capture the HTTP error codes scrapy is encountering?
handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be set at the Request level, by including it in the meta argument.
I would still recommend using an errback for these cases, but if everything is under control, it shouldn't create new problems.
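A minimal sketch of both suggestions applied to the link-following loop from the question above; handle_error is a hypothetical errback name:
from scrapy import Request

def parse(self, response):
    # ... item handling as before ...
    for link in self.link_extractor.extract_links(response):
        r = Request(link.url, self.parse,
                    # per-request opt-in: every status code (4xx/5xx included)
                    # is passed to the callback instead of being filtered out
                    meta={'handle_httpstatus_all': True},
                    # download-level failures (DNS errors, timeouts, ...) still
                    # end up here rather than in parse
                    errback=self.handle_error)
        r.meta.update(link_text=link.text)
        yield r

def handle_error(self, failure):
    self.logger.error('Request failed: %s', repr(failure))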
So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).
I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}
I'm using Scrapy to crawl a website with authentication.
I want to be able to save the state of the crawler, so I use
scrapy crawl myspider -s JOBDIR=mydir
After I resume with the same command, I want to be able to log in to the website before it reschedules all the saved requests.
Basically, I want to be sure that my login() and after_login() functions are called before any other request is scheduled and executed. I don't want to rely on cookies, because they don't allow me to pause the crawl for a long time.
I can call login() in start_requests(), but this only works the first time I run the crawler.
class MyCrawlSpider(CrawlSpider):
    # ...
    START_URLS = ['someurl.com', 'someurl2.com']
    LOGIN_PAGE = u'https://login_page.php'

    def login(self):
        return Request(url=self.LOGIN_PAGE, callback=self.login_form,
                       dont_filter=True, priority=9999)

    def login_form(self, response):
        return FormRequest.from_response(response,
                                         formdata={'Email': 'myemail',
                                                   'Passwd': 'mypasswd'},
                                         callback=self.after_login,
                                         dont_filter=True,
                                         priority=9999)

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        else:
            print("Login Successful!!")
            self.is_logged_in = True
            for url in self.START_URLS:
                yield Request(url, callback=self.parse_artists_json, dont_filter=True)
Bottom line: is there any callback which will always be called when I resume crawling with the -s JOBDIR=... option, before the previously saved requests are rescheduled? I would use it to call the login() method.
You can use the spider_opened signal (more here).
This signal is intended for resource allocation for the spider and other initialization, so it doesn't expect you to yield a Request object from there.
You can get around this by keeping an array of pending requests. This is needed because Scrapy doesn't allow you to manually schedule requests.
Then, after resuming the spider, you can queue the login as the first request in the queue:
def spider_opened(self, spider):
    spider.requests.insert(0, spider.login())
You also need to add a next_request method to your spider; it returns the next pending request (or None), so the callbacks can simply yield self.next_request():
def next_request(self):
    if self.requests:
        return self.requests.pop(0)
And queue all your requests by adding them to the requests array and calling next_request at the end of each method:
def after_login(self, response):
    if "authentication failed" in response.body:
        self.logger.error("Login failed")
        return
    else:
        print("Login Successful!!")
        self.is_logged_in = True
        if not self.requests:
            for url in self.START_URLS:
                self.requests.append(Request(url, callback=self.parse_artists_json,
                                             dont_filter=True))
        yield self.next_request()
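For completeness, here is a minimal sketch of how the spider_opened handler could be wired up if you keep it on the spider itself (that placement is an assumption, not something required by the signal API); the requests list also has to be initialized somewhere, e.g. in __init__:
from scrapy import signals
from scrapy.spiders import CrawlSpider

class MyCrawlSpider(CrawlSpider):
    # ...

    def __init__(self, *args, **kwargs):
        super(MyCrawlSpider, self).__init__(*args, **kwargs)
        self.requests = []  # pending requests queue used by next_request()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MyCrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        # connect the spider_opened handler shown above
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider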
I'm using Scrapy, and I would like to check the status code before entering the parse method.
My code looks like this:
class mywebsite(BaseSpider):
    # Crawling start
    CrawlSpider.started_on = datetime.now()
    # CrawlSpider
    name = 'mywebsite'
    DOWNLOAD_DELAY = 10
    allowed_domains = ['mywebsite.com']
    pathUrl = "URL/mywebsite.txt"

    # Init
    def __init__(self, local=None, *args, **kwargs):
        # Heritage
        super(mywebsite, self).__init__(*args, **kwargs)
        # On spider closed
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def start_requests(self):
        return [Request(url=start_url) for start_url in
                [l.strip() for l in open(self.pathUrl).readlines()]]

    def parse(self, response):
        print "==============="
        print response.headers
        print "==============="
        # Selector
        sel = Selector(response)
When my proxy is not blocked, I see the response headers, but when my IP is blocked, I just see this in the output console:
DEBUG: Ignoring response <999 https://www.mywebsite.com>: HTTP status
code is not handled or not allowed.
How can I check the response headers before entering the parse method?
Edit:
Answer: this behaviour appears when the spider is blocked/banned by an anti-crawling system. You must use an unblocked proxy.
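If you still want the blocked responses to reach parse so you can inspect the headers before deciding what to do, you can whitelist the 999 status code at the spider level; a small sketch based on the code above:
class mywebsite(BaseSpider):
    name = 'mywebsite'
    # pass the anti-bot 999 responses to parse instead of ignoring them
    handle_httpstatus_list = [999]

    def parse(self, response):
        if response.status == 999:
            # blocked: inspect response.headers here, switch proxy, retry, etc.
            self.log('Blocked (999) at %s' % response.url)
            return
        self.log('Headers: %s' % response.headers)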