I am receiving a 302 response from a server while scraping a website:
2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>
I want to send request to GET urls instead of being redirected. Now I found this middleware:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
But I am still getting redirected. Is that all I have to do in order to get this middleware working? Am I missing something?
Forget about middlewares in this scenario; this will do the trick:
meta = {'dont_redirect': True,'handle_httpstatus_list': [302]}
That said, you will need to include the meta parameter when you yield your request:
yield Request(item['link'], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302]
}, callback=self.your_callback)
An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.
You must either reduce your crawl rate or use a smart proxy (e.g. Crawlera) or a proxy-rotation service and retry your requests when you get such a response.
To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check if response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).
When retrying, you should also make your code limit the maximum number of retries of any given URL. You could keep a dictionary to track retries:
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'
    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return
Depending on the scenario, you might want to move this code to a downloader middleware.
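For reference, a minimal sketch of what such a downloader middleware could look like. This is only an illustration of the retry logic above: the class name, the myproject path and the retry limit are assumptions, and it relies on the requests still carrying 'handle_httpstatus_list': [302] in their meta so that the built-in RedirectMiddleware leaves the 302 responses alone.

# middlewares.py -- hypothetical sketch, not the original poster's code.


class Retry302Middleware:
    max_retries = 2  # assumed limit

    def __init__(self):
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.max_retries:
            self.retries[request.url] += 1
            # Returning a Request re-schedules it; dont_filter bypasses the dupefilter.
            return request.replace(dont_filter=True)
        spider.logger.error('%s still returns 302 responses after %s retries',
                            request.url, retries)
        return response

You would then enable it in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.Retry302Middleware': 650} (where 'myproject' is a placeholder for your project name).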
You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py
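For example, in settings.py (the HTTPERROR_ALLOWED_CODES line is an addition of mine: with redirects disabled globally, the 3xx responses would otherwise be dropped by HttpErrorMiddleware before they reach your callbacks):

REDIRECT_ENABLED = False
HTTPERROR_ALLOWED_CODES = [301, 302]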
I had an issue with infinite loop on redirections when using HTTPCACHE_ENABLED = True. I managed to avoid the problem by setting HTTPCACHE_IGNORE_HTTP_CODES = [301,302].
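In settings.py terms, that combination looks like this (a sketch of the relevant lines):

HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]  # don't cache redirect responses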
I figured out how to bypass the redirect with the following steps:
1- check whether you are redirected in parse().
2- if redirected, arrange to simulate the action that escapes this redirection and return to your required URL for scraping; you may need to check the Network tab in Google Chrome and simulate the POST request needed to get back to your page.
3- go to another method via a callback, complete all the scraping work inside it with a recursive loop calling itself, and put a condition to break this loop at the end.
The example below is what I used to bypass the Disclaimer page, return to my main URL and start scraping.
import scrapy
import requests
from scrapy.http import FormRequest


class ScrapeClass(scrapy.Spider):
    name = 'terrascan'
    page_number = 0

    start_urls = [
        # Your main URL, or a list of your URLs, or URLs read from a file into a list
    ]

    def parse(self, response):
        ''' Here I killed the Disclaimer page and continued in the method below with follow '''
        # Get the currently requested URL
        current_url = response.request.url

        # Get all followed redirect URLs
        redirect_url_list = response.request.meta.get('redirect_urls')
        # Get the first URL followed by the spider
        redirect_url_list = response.request.meta.get('redirect_urls')[0]

        # Handle redirection as below (check for redirection; taken from redirect.py
        # in the downloadermiddlewares folder)
        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' in response.headers or response.status in allowed_status:  # <== redirection condition
            print(current_url, '<========= am not redirected ##########')
        else:
            print(current_url, '<====== kill that please %%%%%%%%%%%%%')
            session_requests = requests.session()

            # Got all the data below from monitoring network behavior in Google Chrome
            # while simulating a click on 'I Agree'
            headers_ = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
                        'ctl00$cphContent$btnAgree': 'I Agree'}

            Post_ = session_requests.post(current_url, headers=headers_)
            # if Post_.status_code == 200: print('killed it')

            print(response.url, '<========= check this please')

            return FormRequest.from_response(Post_, callback=self.parse_After_disclaimer)

    def parse_After_disclaimer(self, response):
        print(response.status)
        print(response.url)

        # Put your condition here to make sure the current URL is the one you need;
        # otherwise escape again until you kill the redirection
        if response.url not in [your list of URLs]:
            print('I am here brother')
            yield scrapy.Request(Your URL, callback=self.parse_After_disclaimer)
        else:
            # Here you are good to go with the scraping work
            items = TerrascanItem()

            all_td_tags = response.css('td')
            print(len(all_td_tags), 'all_td_results', response.url)

            # for tr_ in all_tr_tags:
            parcel_No = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbParcelNumber::text').extract()
            Owner_Name = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbOwnerName::text').extract()

            if parcel_No: items['parcel_No'] = parcel_No
            else: items['parcel_No'] = ''

            yield items

            # Here you put the condition for the recursive call of this method
            ScrapeClass.page_number += 1
            # next_page = 'http://terrascan.whitmancounty.net/Taxsifter/Search/results.aspx?q=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]&page=' + str(terraScanSpider.page_number) + '&1=1#rslts'
            next_page = Your URLS[ScrapeClass.page_number]
            print('am in page #', ScrapeClass.page_number, '===', next_page)
            if ScrapeClass.page_number < len(ScrapeClass.start_urls_AfterDisclaimer) - 1:  # 20
                yield response.follow(next_page, callback=self.parse_After_disclaimer)
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES_BASE says that RedirectMiddleware is already enabled by default, so what you did didn't matter.
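If you want your own redirect handling to take effect, you also have to disable the built-in middleware in that same dict, along these lines (a sketch; 'street' is the project name from the question, and on modern Scrapy the path is scrapy.downloadermiddlewares.redirect.RedirectMiddleware):

DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    # Disable the default RedirectMiddleware so it no longer follows redirects for you
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}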
I want to send request to GET urls instead of being redirected.
How? The server responds with 302 on your GET request. If you do GET on the same URL again you will be redirected again.
What are you trying to achieve?
If you don't want to be redirected, see these questions:
Avoiding redirection
Facebook url returning an mobile version url response in scrapy
How to avoid redirection of the webcrawler to the mobile edition?
Related
I've created a python script using scrapy to scrape some information available in a certain webpage. The problem is that the link I'm trying it with gets redirected very often. However, when I try a few times using requests, I get the desired content.
In case of scrapy, I'm unable to reuse the link because I found it redirecting no matter how many times I try. I can even catch the main url using response.meta.get("redirect_urls")[0], meant to be used recursively within the parse method. However, it always gets redirected and as a result the callback never takes place.
This is my current attempt (the link used within the script is just a placeholder):
import scrapy
from scrapy.crawler import CrawlerProcess


class StackoverflowSpider(scrapy.Spider):
    handle_httpstatus_list = [301, 302]
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        if response.meta.get("lead_link"):
            self.lead_link = response.meta.get("lead_link")
        elif response.meta.get("redirect_urls"):
            self.lead_link = response.meta.get("redirect_urls")[0]

        try:
            if response.status != 200: raise
            if not response.css("[itemprop='text'] > h2"): raise
            answer_title = response.css("[itemprop='text'] > h2::text").get()
            print(answer_title)
        except Exception:
            print(self.lead_link)
            yield scrapy.Request(self.lead_link, meta={"lead_link": self.lead_link}, dont_filter=True, callback=self.parse)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
Question: How can I force scrapy to make a callback using the url that got redirected?
As far as I understand, you want to keep scraping a link until it stops redirecting and you finally get HTTP status 200.
If so, then you first have to remove handle_httpstatus_list = [301, 302] from your code.
Then create a CustomMiddleware in middlewares.py
import logging


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if not response.css("[itemprop='text'] > h2"):
            logging.info('Desired text not found in %s, so re-scraping it' % request.url)
            req = request.copy()
            req.dont_filter = True
            return req

        if response.status in [301, 302]:
            original_url = request.meta.get('redirect_urls', [response.url])[0]
            logging.info('%s is redirecting, so re-scraping the original url %s' % (request.url, original_url))
            req = request.replace(url=original_url, dont_filter=True)
            return req

        return response
Then your spider should look something like this:
class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.CustomMiddleware': 100,
        }
    }

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        answer_title = response.css("[itemprop='text'] > h2::text").get()
        print(answer_title)
If you tell me which site you are scraping, I can help you out; you can also email me at the address on my profile.
You may want to see this.
If you need to prevent redirects, it is possible via the request meta:
request = scrapy.Request(self.start_url,meta={"lead_link":self.start_url},callback=self.parse)
request.meta['dont_redirect'] = True
yield request
According to the documentation, this is a way to stop redirects.
I've written a script in Python's Scrapy to make proxied requests using one of the newly generated proxies from the get_proxies() method. I used the requests module to fetch the proxies in order to reuse them in the script. However, the problem is that the proxy my script chooses to use may not always be a good one, so sometimes it doesn't fetch a valid response.
How can I let my script keep trying with different proxies until there is a valid response?
My script so far:
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.http.request import Request
from scrapy.crawler import CrawlerProcess


class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"

    def start_requests(self):
        proxylist = self.get_proxies()
        random.shuffle(proxylist)
        proxy_ip_port = next(cycle(proxylist))
        print(proxy_ip_port)  # Checking out the proxy address
        request = scrapy.Request(self.check_url, callback=self.parse, errback=self.errback_httpbin, dont_filter=True)
        request.meta['proxy'] = "http://{}".format(proxy_ip_port)
        yield request

    def get_proxies(self):
        response = requests.get(self.proxy_link)
        soup = BeautifulSoup(response.text, "lxml")
        proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
        return proxy

    def parse(self, response):
        print(response.meta.get("proxy"))  # Compare this to the earlier one; they should be the same

    def errback_httpbin(self, failure):
        print("Failure: " + str(failure))


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'DOWNLOAD_TIMEOUT': 5,
    })
    c.crawl(ProxySpider)
    c.start()
PS: My intention is to find a solution along the lines of what I've started here.
As we know, an HTTP response needs to pass through all middlewares in order to reach spider methods.
This means that only requests with valid proxies can proceed to spider callback functions.
In order to use valid proxies, we need to check ALL proxies first and then choose only from the valid ones.
When our previously chosen proxy stops working, we mark it as invalid and choose a new one from the remaining valid proxies in the spider's errback.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http.request import Request


class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"
    current_proxy = ""
    proxies = {}

    def start_requests(self):
        yield Request(self.proxy_link, callback=self.parse_proxies)

    def parse_proxies(self, response):
        for row in response.css("table#proxylisttable tbody tr"):
            if "yes" in row.extract():
                td = row.css("td::text").extract()
                self.proxies["http://{}".format(td[0] + ":" + td[1])] = {"valid": False}

        for proxy in self.proxies.keys():
            yield Request(self.check_url, callback=self.parse, errback=self.errback_httpbin,
                          meta={"proxy": proxy,
                                "download_slot": proxy},
                          dont_filter=True)

    def parse(self, response):
        if "proxy" in response.request.meta.keys():
            # As the script reaches this parse method we can mark the current proxy as valid
            self.proxies[response.request.meta["proxy"]]["valid"] = True
            print(response.meta.get("proxy"))
            if not self.current_proxy:
                # The scraper reaches this line on the first valid response
                self.current_proxy = response.request.meta["proxy"]
                # yield Request(next_url, callback=self.parse_next,
                #               meta={"proxy": self.current_proxy,
                #                     "download_slot": self.current_proxy})

    def errback_httpbin(self, failure):
        if "proxy" in failure.request.meta.keys():
            proxy = failure.request.meta["proxy"]
            if proxy == self.current_proxy:
                # If the current proxy is no longer valid after our usage,
                # mark it as not valid
                self.proxies[proxy]["valid"] = False

                for ip_port in self.proxies.keys():
                    # and choose a valid proxy from self.proxies
                    if self.proxies[ip_port]["valid"]:
                        failure.request.meta["proxy"] = ip_port
                        failure.request.meta["download_slot"] = ip_port
                        self.current_proxy = ip_port
                        return failure.request
        print("Failure: " + str(failure))


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'COOKIES_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 10,
        'DOWNLOAD_DELAY': 3,
    })
    c.crawl(ProxySpider)
    c.start()
You need to write a downloader middleware and install a process_exception hook; Scrapy calls this hook when an exception is raised. In the hook, you can return a new Request object with the dont_filter=True flag, to let Scrapy reschedule the request until it succeeds.
Meanwhile, you can verify the response extensively in a process_response hook: check the status code, response content, etc., and reschedule the request as necessary.
In order to change the proxy easily, you should use the built-in HttpProxyMiddleware instead of tinkering with environ:
request.meta['proxy'] = proxy_address
take a look at this project as an example.
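To make that concrete, here is a minimal sketch of such a middleware. The class name, the myproject.middlewares path and the spider.proxies attribute are assumptions for illustration, and a real version should also cap the number of retries per URL:

import random


class RotatingProxyRetryMiddleware:

    def _retry_with_new_proxy(self, request, spider):
        # spider.proxies is an assumed list of proxy URLs kept up to date by the spider
        new_request = request.copy()
        new_request.meta['proxy'] = random.choice(spider.proxies)
        new_request.dont_filter = True  # let the scheduler accept the repeated URL
        return new_request

    def process_exception(self, request, exception, spider):
        # Called by Scrapy when the download fails (timeout, connection refused, ...)
        spider.logger.info('Exception %r for %s, rotating proxy', exception, request.url)
        return self._retry_with_new_proxy(request, spider)

    def process_response(self, request, response, spider):
        # Treat unexpected status codes as a bad proxy and retry as well
        if response.status != 200:
            spider.logger.info('Got %s for %s, rotating proxy', response.status, request.url)
            return self._retry_with_new_proxy(request, spider)
        return response

You would enable it with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RotatingProxyRetryMiddleware': 550}; the built-in HttpProxyMiddleware then picks up request.meta['proxy'] for each request.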
I'm trying to scrape a website for broken links; so far I have this code which successfully logs in and crawls the site, but it only records HTTP status 200 codes:
class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r
The docs and these answers lead me to believe that handle_httpstatus_all = True should cause scrapy to pass errored requests to my parse method, but so far I've not been able to capture any.
I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.
What do I need to change to capture the HTTP error codes scrapy is encountering?
handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be defined at the Request level, by including it in the meta argument.
I would still recommend using an errback for these cases, but if everything is under control, it shouldn't create new problems.
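Applied to the code in the question, that means dropping the class-level handle_httpstatus_all = True and passing the key per request instead, roughly like this (a sketch; only the meta argument is new):

for link in self.link_extractor.extract_links(response):
    yield Request(
        link.url,
        self.parse,
        meta={
            'handle_httpstatus_all': True,  # every status code now reaches parse()
            'link_text': link.text,
        },
    )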
So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).
I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}
I have written a scraper in Scrapy 1.5 that successfully navigates to a webpage (running ASP.NET on IIS version 8.5), submits a form, and then gets to scraping. After a few hours, all of the pages start returning blank data. I believe that my ASP.NET session id is expiring when this happens. I can never make it through the entire table (several thousand pages) while crawling at a respectful rate, but the table doesn't change from session to session.
My approach was to scrape until the pages were returned blank, then go back to the form submission page and resubmit the form. I am keeping track of the page number so that I can pick up where I left off. The problem is that when I resubmit the form, the pages are still returned blank. If I stop the scraper and set the count variable manually to the last page scraped, it works fine when I restart the scraper. Using Fiddler I can see that the only thing that is different is that I have a new ASP.NET session id.
So my question is: how can I clear out my ASP.NET session id so that I am given a new one and can continue scraping? Here is a redacted version of the spider:
class assessorSpider(scrapy.Spider):
    name = 'redacted'
    allowed_domains = ['redacted.redacted']
    start_urls = ['http://redacted.redacted/search.aspx']
    base_url = start_urls[0]
    rows_selector = '#T1 > tbody:nth-child(3) > tr'
    numberOfPages = -1
    count = 1

    def parse(self, response):
        # ASP.NET Session Id gets stored in the headers after initial search
        frmdata = {'id': 'frmSearch', 'SearchField': '%', 'cmdGo': 'Go'}
        yield scrapy.FormRequest(url=self.base_url, formdata=frmdata, callback=self.parse_index)
        self.log('Search submitted')

    def parse_index(self, response):
        self.log('proceeding to next page')
        rows = response.css(self.rows_selector)
        if (len(rows) < 50 and self.count != self.numberOfPages):
            self.log('Deficient rows. Resubmitting')
            yield scrapy.Request(callback=self.parse, url=self.base_url, headers='')
        self.log('Ready to yield value')
        for row in rows:
            value = {
                # a whole bunch of css selectors
            }
            yield value
        if self.numberOfPages == -1:
            self.numberOfPages = response.css('a.button::attr(href)')[2].extract().split('=')[-1]
        self.count = self.count + 1
        if self.count <= self.numberOfPages:
            self.log(self.base_url + '?page=' + str(self.count))
            yield scrapy.Request(callback=self.parse_index, url=self.base_url + '?page=' + str(self.count))
Note: I have read that making a request with an expired ASP.NET session id should result in a new one being issued (depending on how the site is set up), so it is possible that scrapy is not accepting the new session id. I am not sure how to diagnose this issue.
There are two things that come to mind:
1) Your "start a fresh session" request might be getting rejected by the duplicate filter: by default, Scrapy filters out URLs it has already seen, such as your base URL. Try yield scrapy.Request(callback=self.parse, url=self.base_url, dont_filter=True, headers='') in your "reset the session" request.
2) If that doesn't work (or perhaps in addition to):
I'm pretty new to Scrapy and Python, so there might be a more direct method to "reset your cookies" but specifying a fresh cookiejar should work.
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#multiple-cookie-sessions-per-spider
The cookiejar is essentially a dict-like object that is tracking your current session's cookies. You can specify the cookiejar key by using meta.
# Set up a new session if bad news:
if (len(rows) < 50 and self.count != self.numberOfPages):
    self.log('Deficient rows. Resubmitting')
    yield scrapy.Request(
        callback=self.parse,
        url=self.base_url,
        dont_filter=True,
        meta={
            # Since you are already tracking a counter,
            # this might make for a reasonable "next cookiejar id"
            'cookiejar': self.count
        }
    )
Now that you are specifying a new cookiejar, you are in a new session. You must account for this in your other requests by checking whether a cookiejar is set and continuing to pass this value along; otherwise you end up back in the default cookiejar. It might be easiest to manage this expectation from the very beginning by defining start_requests:
def start_requests(self):
    return [
        scrapy.Request(
            url,
            dont_filter=True,
            meta={'cookiejar': self.count}
        ) for url in self.start_urls
    ]
Now your other request objects just need to implement the following pattern to "stay in the same session", such as in your parse method:
yield scrapy.FormRequest(
    url=self.base_url,
    formdata=frmdata,
    callback=self.parse_index,
    meta={
        'cookiejar': response.meta.get('cookiejar')
    }
)