I'm new to Scrapy and have been trying to develop a spider that scrapes Tripadvisor's "things to do" page. Tripadvisor paginates results with an offset, so I made the spider find the last page number, multiply it by the number of results per page, and loop over that range with a step of 30. However, it returns only a fraction of the results it's supposed to, and get_details prints out only 7 of the 28 pages scraped. I believe what is happening is URL redirection on random pages.
Scrapy logs this 301 redirect for the other pages, and it appears to redirect back to the first page. I tried disabling redirection, but that did not work.
2021-03-28 18:46:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-Nashville_Davidson_County_Tennessee.html> from <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa90-Nashville_Davidson_County_Tennessee.html>
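For reference, "disabling redirection" above means the standard Scrapy options, either globally in settings.py or per request via meta; a rough sketch of that approach (it did not help here):

# settings.py: turn off the redirect middleware globally
REDIRECT_ENABLED = False

# or per request, with the dont_redirect meta key
yield scrapy.Request(formatted_url, meta={'dont_redirect': True}, callback=self.get_details)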
Here's my code for the spider:
import scrapy
import re


class TripadvisorSpider(scrapy.Spider):
    name = "tripadvisor"
    start_urls = [
        'https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa{}-Nashville_Davidson_County_Tennessee.html'
    ]

    def parse(self, response):
        num_pages = int(response.css(
            '._37Nr884k .DrjyGw-P.IT-ONkaj::text')[-1].get())

        for offset in range(0, num_pages * 30, 30):
            formatted_url = self.start_urls[0].format(offset)
            yield scrapy.Request(formatted_url, callback=self.get_details)

    def get_details(self, response):
        print('url is ' + response.url)

        for listing in response.css('div._19L437XW._1qhi5DVB.CO7bjfl5'):
            yield {
                'title': listing.css('._392swiRT ._1gpq3zsA._1zP41Z7X::text')[1].get(),
                'category': listing.css('._392swiRT ._1fV2VpKV .DrjyGw-P._26S7gyB4._3SccQt-T::text').get(),
                'rating': float(re.findall(r"[-+]?\d*\.\d+|\d+", listing.css('svg.zWXXYhVR::attr(title)').get())[0]),
                'rating_count': float(listing.css('._392swiRT .DrjyGw-P._26S7gyB4._14_buatE._1dimhEoy::text').get().replace(',', '')),
                'url': listing.css('._3W_31Rvp._1nUIPWja._17LAEUXp._2b3s5IMB a::attr(href)').get(),
                'main_image': listing.css('._1BR0J4XD').attrib['src']
            }
Is there a way to get Scrapy working for each page? What exactly is causing this problem?
Found a solution: I discovered I needed to handle the redirection manually and disable Scrapy's default middleware.
Here is the custom middleware I added to middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.selector import Selector
from scrapy.utils.response import get_meta_refresh


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url

        if response.status in [301, 307]:
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response

        interval, redirect_url = get_meta_refresh(response)
        # handle meta redirect
        if redirect_url:
            reason = 'meta'
            return self._retry(request, reason, spider) or response

        hxs = Selector(response)
        # test for captcha page
        captcha = hxs.xpath(
            ".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            reason = 'captcha'
            return self._retry(request, reason, spider) or response

        return response
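To disable the default handling, the middleware also has to be registered in settings.py; a sketch of what that looks like (myproject is a placeholder for the actual project name, and the priority value is a judgement call):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.CustomRetryMiddleware': 550,
}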
It is an updated version of the top answer to this question:
Scrapy retry or redirect middleware
Related
I've created a Python script using Scrapy to scrape some information available on a certain webpage. The problem is that the link I'm trying gets redirected very often. However, when I try a few times using requests, I get the desired content.
In the case of Scrapy, I'm unable to reuse the link because it keeps redirecting no matter how many times I try. I can even catch the original URL using response.meta.get("redirect_urls")[0], which is meant to be reused recursively within the parse method. However, it always gets redirected and, as a result, the callback never takes place.
This is my current attempt (the link used within the script is just a placeholder):
import scrapy
from scrapy.crawler import CrawlerProcess


class StackoverflowSpider(scrapy.Spider):
    handle_httpstatus_list = [301, 302]
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        if response.meta.get("lead_link"):
            self.lead_link = response.meta.get("lead_link")
        elif response.meta.get("redirect_urls"):
            self.lead_link = response.meta.get("redirect_urls")[0]

        try:
            if response.status != 200: raise
            if not response.css("[itemprop='text'] > h2"): raise
            answer_title = response.css("[itemprop='text'] > h2::text").get()
            print(answer_title)
        except Exception:
            print(self.lead_link)
            yield scrapy.Request(self.lead_link, meta={"lead_link": self.lead_link}, dont_filter=True, callback=self.parse)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
Question: How can I force Scrapy to make a callback using the URL that got redirected?
As far as I understand, you want to scrape a link until it stops redirecting and you finally get HTTP status 200.
If so, then you first have to remove handle_httpstatus_list = [301, 302] from your code.
Then create a CustomMiddleware in middlewares.py:
import logging


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if not response.css("[itemprop='text'] > h2"):
            logging.info('Desired text not found in %s, so re-scraping' % request.url)
            req = request.copy()
            req.dont_filter = True
            return req

        if response.status in [301, 302]:
            original_url = request.meta.get('redirect_urls', [response.url])[0]
            logging.info('%s is redirecting, so re-scraping %s' % (request.url, original_url))
            # Request.url is read-only, so return a copy pointing back at the original URL
            return request.replace(url=original_url, dont_filter=True)

        return response
Then your spider should look something like this:
class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.CustomMiddleware': 100,
        }
    }

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        answer_title = response.css("[itemprop='text'] > h2::text").get()
        print(answer_title)
If you tell me which site you are scraping, I can help you out; you can also email me at the address on my profile.
You may want to see this.
If you need to prevent redirects, you can do it via the request meta:
request = scrapy.Request(self.start_url,meta={"lead_link":self.start_url},callback=self.parse)
request.meta['dont_redirect'] = True
yield request
According to the documentation, this is a way to stop redirecting.
I'm trying to scrape a website for broken links. So far I have this code, which successfully logs in and crawls the site, but it only records HTTP 200 status codes:
import requests
from lxml import html

import scrapy
from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor

from myproject.items import HttpResponseItem  # project-specific item; adjust the import path


class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r
The docs and these answers lead me to believe that handle_httpstatus_all = True should cause Scrapy to pass errored requests to my parse method, but so far I've not been able to capture any.
I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.
What do I need to change to capture the HTTP error codes scrapy is encountering?
handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be set at the Request level, by including it in the meta argument.
I would still recommend using an errback for these cases, but if everything is controlled, it shouldn't create new problems.
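A minimal sketch of both options (per-request handle_httpstatus_all plus an errback); the spider name and URL here are just placeholders:

import scrapy


class StatusDemoSpider(scrapy.Spider):
    name = 'status_demo'

    def start_requests(self):
        # handle_httpstatus_all has to go on the request's meta, not the spider
        yield scrapy.Request(
            'http://localhost:8000/',
            meta={'handle_httpstatus_all': True},
            callback=self.parse,
            errback=self.on_error,  # still useful for non-HTTP failures (DNS errors, timeouts, ...)
        )

    def parse(self, response):
        # with handle_httpstatus_all set, every response reaches this callback regardless of status
        yield {'url': response.url, 'status': response.status}

    def on_error(self, failure):
        self.logger.error(repr(failure))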
So, I don't know if this is the proper Scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).
I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None,
}
I use Scrapy to crawl 1000 URLs and store the scraped items in MongoDB. I'd like to know how many items have been found for each URL. From the Scrapy stats I can see 'item_scraped_count': 3500.
However, I need this count for each start_url separately. There is also a referer field for each request that I might use to count each URL's items manually:
2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6w-_ucPV674> (referer: https://www.youtube.com/results?q=billys&sp=EgQIAhAB)
But I wonder if there is built-in support for this in Scrapy.
challenge accepted!
There isn't anything in Scrapy that directly supports this, but you can keep it separate from your spider code with a spider middleware:
middlewares.py
from scrapy.http.request import Request


class StartRequestsCountMiddleware(object):

    start_urls = {}

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            self.start_urls[i] = request.url
            request.meta.update(start_request_index=i)
            yield request

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, Request):
                output.meta.update(
                    start_request_index=response.meta['start_request_index'],
                )
            else:
                spider.crawler.stats.inc_value(
                    'start_requests/item_scraped_count/{}'.format(
                        self.start_urls[response.meta['start_request_index']],
                    ),
                )
            yield output
Remember to activate it in settings.py:
SPIDER_MIDDLEWARES = {
    ...
    'myproject.middlewares.StartRequestsCountMiddleware': 200,
}
Now you should be able to see something like this in your spider stats:
'start_requests/item_scraped_count/START_URL1': ITEMCOUNT1,
'start_requests/item_scraped_count/START_URL2': ITEMCOUNT2,
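These counters show up in the stats dump at the end of the crawl; if you want to read one programmatically (for example from a spider_closed handler or a pipeline's close_spider), something along these lines should work, with the key built from your actual start URL:

count = spider.crawler.stats.get_value('start_requests/item_scraped_count/START_URL1')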
I'm using Scrapy, and I would like to check the status code before entering the parse method.
My code looks like this:
from datetime import datetime

from scrapy import signals
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher


class mywebsite(BaseSpider):
    # Crawling start
    CrawlSpider.started_on = datetime.now()

    # CrawlSpider
    name = 'mywebsite'
    DOWNLOAD_DELAY = 10
    allowed_domains = ['mywebsite.com']
    pathUrl = "URL/mywebsite.txt"

    # Init
    def __init__(self, local=None, *args, **kwargs):
        # Heritage
        super(mywebsite, self).__init__(*args, **kwargs)
        # On spider closed
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # handler for the spider_closed signal connected above (body not shown in the original)
        pass

    def start_requests(self):
        return [Request(url=start_url)
                for start_url in [l.strip() for l in open(self.pathUrl).readlines()]]

    def parse(self, response):
        print "==============="
        print response.headers
        print "==============="
        # Selector
        sel = Selector(response)
When my proxy is not blocked, I see the response headers, but when my IP is blocked, I just see this in the output console:
DEBUG: Ignoring response <999 https://www.mywebsite.com>: HTTP status
code is not handled or not allowed.
How can I check the response headers before entering the parse method?
Edit:
Answer: this happens when the spider is blocked/banned by an anti-crawling system. You must use an unblocked proxy.
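Independently of the proxy issue, if you just want responses like the 999 above to reach parse so you can inspect their headers, one option is to whitelist that status on the spider; a minimal sketch (spider name and URL are placeholders):

import scrapy


class BlockCheckSpider(scrapy.Spider):
    name = 'block_check'
    start_urls = ['https://www.mywebsite.com']
    # let the site's 999 block responses through to the callback instead of being dropped
    handle_httpstatus_list = [999]

    def parse(self, response):
        # runs even for 999 responses, so the status and headers can be inspected
        self.logger.info('status=%s headers=%s', response.status, response.headers)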
I'm trying to crawl a large site. They have a rate-limiting system in place. Is it possible to pause Scrapy for 10 minutes when it encounters a 403 page? I know I can set a DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes when it gets a 403. This way the rate limiting gets triggered only once every hour or so.
You can write your own retry middleware and put it in middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from time import sleep


class SleepRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        RetryMiddleware.__init__(self, settings)

    def process_response(self, request, response, spider):
        if response.status in [403]:
            sleep(120)  # a few minutes
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super(SleepRetryMiddleware, self).process_response(request, response, spider)
and don't forget to change settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
Scrapy is a Twisted-based Python framework. So, never use time.sleep or pause.until inside it!
Instead, try using Deferred() from Twisted.
from twisted.internet.defer import Deferred

from scrapy import Request, Spider


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        parse_and_pause = Deferred()                              # changed
        parse_and_pause.addCallback(self.second_parse_function)   # changed
        parse_and_pause.addCallback(pause, seconds=10)            # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)          # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
More info here: Scrapy: non-blocking pause
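The pause helper used above isn't defined in the snippet; a minimal non-blocking version, assuming Twisted's task.deferLater, might look like this:

from twisted.internet import reactor, task


def pause(result, seconds=0.0):
    # Return a Deferred that fires with `result` after `seconds`,
    # without blocking the Twisted reactor (unlike time.sleep).
    return task.deferLater(reactor, seconds, lambda: result)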