For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:
def parse_page(self, response):
    url = response.url
    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return
    ...
    # Normal parsing logic
However, it seems that when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse, which the CrawlSpider uses to apply the crawl rules, together with dont_filter=True, I could avoid the duplicate filter. However, with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.
I would think about having a custom Retry Middleware instead - similar to a built-in one.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)

class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        # response.body is bytes, so compare against a bytes literal
        if b'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})
        # Re-schedule a copy of the request, bypassing the duplicate filter
        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
And don't forget to activate it.
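Activation happens in the project's settings.py. A minimal sketch, assuming the class was saved in a module named myproject.middlewares (a hypothetical path, use wherever you put it). The priority 550 is only one reasonable choice: it is the slot the built-in retry middleware occupies by default.

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # 'myproject.middlewares' is a hypothetical module path (assumption)
    "myproject.middlewares.RetryMiddleware": 550,
}
```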
In Scrapy, how can I use different callback functions for allowed domains and denied domains?
I'm using the following rules:
rules = [Rule(LinkExtractor(allow=(), deny_domains=allowed_domains), callback='parse_denied_item', follow=True),
         Rule(LinkExtractor(allow_domains=allowed_domains), callback='parse_item', follow=True)]
Basically I want parse_item to be called whenever there is a request from an allowed_domain (or sub-domain of one of those domains). Then I want parse_denied_item to be called for all requests that are not whitelisted by allowed_domains.
How can I do this?
I believe the best approach is not to use allowed_domains on LinkExtractor, and instead parse the domain out of response.url in your parse_* method and perform a different logic depending on the domain.
You can keep separate parse_* methods and a triaging method that, depending on the domains, calls yield from self.parse_*(response) (Python 3) with the corresponding parse_* method:
rules = [Rule(LinkExtractor(), callback='parse_all', follow=True)]

def parse_all(self, response):
    # [Get domain out of response.url]
    if domain in allowed_domains:
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)
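The "get domain out of response.url" step is left open above. One possible way to fill it in, using only the standard library, is sketched below. The helper name is_allowed is made up for this example; it treats subdomains of an allowed domain as allowed, which is what the question asks for.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    `is_allowed` is a hypothetical helper, not part of Scrapy.
    """
    host = (urlparse(url).hostname or "").lower()
    # Compare whole host labels, so "notexample.com" does not match "example.com"
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

In parse_all this could be called as is_allowed(response.url, self.allowed_domains). Comparing on the parsed host rather than searching the whole URL string avoids false matches when an allowed domain merely appears somewhere in the path or query.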
Based on Gallaecio's answer. An alternate option is to use process_request of Rule. process_request will capture the request before it is sent.
From my understanding (which could be wrong), Scrapy will only crawl domains listed in self.allowed_domains (assuming it's used). However, if an offsite link is encountered on a scraped page, Scrapy will send a single request to this offsite link in some cases [1]. I'm not sure why this happens. I think it possibly occurs because the target site performs a 301 or 302 redirect and the crawler automatically follows that URL. Otherwise, it's probably a bug.
process_request can be used to perform processing on a request before it is executed. In my case, I wanted to log all links that aren't being crawled. So I'm verifying that an allowed domain is in request.url before proceeding, and logging any requests where it isn't.
Here is an example:
rules = [Rule(LinkExtractor(), callback='parse_item', process_request='process_item', follow=True)]

def process_item(self, request):
    found = False
    for url in self.allowed_domains:
        if url in request.url:
            # an allowed domain is in the request.url, proceed
            found = True
    if not found:  # otherwise log it
        self.logDeniedDomain(urlparse(request.url).netloc)
        # according to: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule
        # setting request to None should prevent this call from being executed (which is not the case for all)
        # middleware is used to catch these few requests
        request = None
    return request
[1]: If you're encountering this problem, using process_request in Downloader middleware appears to solve it though.
My Downloader middleware:
def process_request(self, request, spider):
    # catch any requests that should be filtered, and ignore them
    found = False
    for url in spider.allowed_domains:
        if url in request.url:
            # an allowed domain is in the request.url, proceed
            found = True
    if not found:
        print("[ignored] " + request.url)
        raise IgnoreRequest('Offsite link, ignore')
    return None
Make sure you import IgnoreRequest as well:
from scrapy.exceptions import IgnoreRequest
and enable the Downloader middleware in settings.py.
To verify this, you can add some verification code in process_item of your crawler to ensure no requests to out of scope sites have been made.
Is there a way to set a new proxy ip (e.g.: from a pool) according to the HTTP response status code?
For example, start with an IP from an IP list and use it until it gets a 503 response (or another HTTP error code), then use the next one until it gets blocked, and so on. Something like:
if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # Then keep using it till it gets another error code
Any ideas?
Scrapy has a downloader middleware, enabled by default, that handles proxies. It's called HTTP Proxy Middleware, and it lets you supply a proxy meta key on your Request so that the request is sent through that proxy.
There are a few ways of doing this.
The first, straightforward one is to use it directly in your spider code:
def parse(self, response):
    if response.status in range(400, 600):
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)  # you need to ignore filtering because you already did one request to this url
Another more elegant way would be to use custom downloader middleware which would handle this for multiple callbacks and keep your spider code cleaner:
import logging

from scrapy import Request
from project.settings import PROXY_URL

class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'.format(response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
Note that by default Scrapy only passes 2xx responses through to spider callbacks. It handles 3xx redirects automatically with the Redirect middleware, and the HttpError middleware filters out 4xx and 5xx responses. To handle responses with other status codes, you need to either:
Specify that in Request Meta:
Request(url, meta={'handle_httpstatus_list': [404,505]})
# or for all
Request(url, meta={'handle_httpstatus_all': True})
Or set a project-wide or spider-wide parameter:
HTTPERROR_ALLOW_ALL = True # for all
HTTPERROR_ALLOWED_CODES = [404, 505] # for specific
as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes
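Both snippets above switch to a single fixed proxy. For the "next one from the pool" behaviour the question asks about, the rotation itself can be kept in a small helper. This is only a sketch under assumptions: ProxyPool, its method names and the proxy URLs are invented for this example, and the set of status codes that count as "blocked" is up to you.

```python
from itertools import cycle


class ProxyPool:
    """Hypothetical helper: keeps one current proxy and advances on 'bad' statuses."""

    def __init__(self, proxies, bad_statuses=(403, 503)):
        # cycle() wraps around to the first proxy after the last one;
        # an empty list would fail on the next() call below
        self._proxies = cycle(proxies)
        self.bad_statuses = set(bad_statuses)
        self.current = next(self._proxies)

    def update(self, status):
        """Advance to the next proxy if `status` signals a block; return the proxy to use."""
        if status in self.bad_statuses:
            self.current = next(self._proxies)
        return self.current
```

In a downloader middleware, process_request could set request.meta['proxy'] to pool.current, and process_response could call pool.update(response.status) before deciding whether to re-issue the request with dont_filter=True. You would probably also want a cap on retries so that a URL that fails on every proxy is not requested forever.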
I have a Scrapy spider, but sometimes requests it yields never produce a response.
I found this out by adding log messages before yielding each request and after receiving each response.
The spider iterates over pages and, on each page, parses the links to the items to scrape.
Here is part of the code:
class SampleSpider(BaseSpider):
    ....
    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...
I've compared the counts of the two log messages, and there are more "parse_item_general_send" entries than "parse_item_general_recv" entries.
There are no 400 or 500 errors in the final statistics; every response has status code 200. It looks like the requests just disappear.
I've also added these parameters to minimize possible errors:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8
Because of the asynchronous nature of Twisted, I don't know how to debug this.
I've found a similar question, "Python Scrapy not always downloading data from website", but it has no answers.
On the same note as Rho, you can add the setting
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
to your settings.py, which turns off the filtering of URLs that have already been requested (in Scrapy 1.0 and later the module is named scrapy.dupefilters). This is a tricky issue, since the Scrapy logs don't tell you every time a request is dropped as a duplicate.
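To see why a request can vanish without any error, it may help to look at a toy model of a duplicate filter. This is a deliberately simplified sketch, not Scrapy's actual implementation: the class name SeenFilter and the fingerprint recipe are assumptions made for illustration.

```python
import hashlib


class SeenFilter:
    """Toy duplicate filter (hypothetical; a simplified model, not Scrapy's code)."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url):
        """Return True if this (method, url) pair was already scheduled."""
        fp = hashlib.sha1(f"{method} {url}".encode()).hexdigest()
        if fp in self.seen:
            return True  # the scheduler would silently drop this request
        self.seen.add(fp)
        return False
```

If two different pages link to the same item URL, the second request is dropped before it is ever downloaded, so a "send" log line has no matching "recv" line and no HTTP error shows up in the stats. Passing dont_filter=True on a request skips this check for that request only.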
In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # continue scraping with authenticated session...
I've got that working, and it's fine. But my question is: What do you have to do to continue scraping with authenticated session, as they say in the last line's comment?
In the code above, the FormRequest that is being used to authenticate has the after_login function set as its callback. This means that the after_login function will be called and passed the page that the login attempt got as a response.
It is then checking that you are successfully logged in by searching the page for a specific string, in this case "authentication failed". If it finds it, the spider ends.
Now, once the spider has got this far, it knows that it has successfully authenticated, and you can start spawning new requests and/or scrape data. So, in this case:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# ...

def after_login(self, response):
    # check login succeed before going on
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    # We've successfully authenticated, let's have some fun!
    else:
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')
    # etc.
If you look here, there's an example of a spider that authenticates before scraping.
In this case, it handles things in the parse function (the default callback of any request).
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)
So, whenever a request is made, the response is checked for the presence of the login form. If it is there, we know that we need to log in, so we call the relevant function. If it's not present, we call the function that is responsible for scraping the data from the response.
I hope this is clear, feel free to ask if you have any other questions!
Edit:
Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.
To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:
def parse_page(self, response):
    """ Scrape useful stuff from page, and spawn new requests
    """
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    # extract() turns the matched href attributes into plain strings
    links = hxs.select('//a/@href').extract()
    # Yield a new request for each link we found
    for link in links:
        yield Request(url=link, callback=self.parse_page)
As you can see, it spawns a new request for every URL on the page, and each one of those requests will call this same function with their response, so we have some recursive scraping going on.
What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.
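One practical detail when following links by hand: href values are often relative, or point at things that cannot be requested (fragments, mailto: links). A small standard-library sketch of cleaning them up before building requests is shown here; the helper name absolute_links is made up for this example.

```python
from urllib.parse import urldefrag, urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments, non-HTTP schemes and repeats.

    `absolute_links` is a hypothetical helper, not part of Scrapy.
    """
    seen = set()
    result = []
    for href in hrefs:
        # urljoin resolves relative paths; urldefrag strips any "#fragment"
        url, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Inside a callback this could be used as absolute_links(response.url, links), where links is the list of extracted href strings, before yielding one Request per returned URL.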
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, along with its subsequent child requests, needs to have its own session, and so needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests.. otherwise one spider would be making multiple searches under the same session cookie, and future requests will only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the other.
I suppose that would mean disabling cookies.. and then grabbing the session cookie from the search response, and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...
class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href").extract()
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            # Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            # This is needed because the site uses sessions to remember the search parameters.
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href").extract():
            itemLink = urlparse.urljoin(response.url, itemLink)
            print('Requesting item page %s' % itemLink)
            yield Request(...)
        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            # Carry this session's cookies forward by hand in our own jar
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )
Scrapy has a downloader middleware, CookiesMiddleware, that implements cookie support. It is enabled by default (see the COOKIES_ENABLED setting), and it mimics how the cookiejar in a browser works.
When a request goes through CookiesMiddleware, the middleware reads the cookies stored for that domain and sets them in the Cookie header.
When a response returns, CookiesMiddleware reads the cookies the server sent in the Set-Cookie response header and saves or merges them into the cookiejar held by the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read cookies from the cookiejar, and that the Set-Cookie headers from its response should not be merged into the cookiejar. It's a request-level switch.
CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level, with Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
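The two meta keys are easier to reason about with a toy model of the bookkeeping involved. The sketch below is a simplified illustration only, not the middleware's real code: the class SessionJars and its methods are invented for this example, and cookies are reduced to plain name/value dicts.

```python
from collections import defaultdict


class SessionJars:
    """Toy model (hypothetical) of one cookie store per 'cookiejar' meta key."""

    def __init__(self):
        # One independent dict of cookies per jar key; None is the shared default jar
        self.jars = defaultdict(dict)

    def cookies_for(self, meta):
        """Cookies that would be sent with a request carrying this meta."""
        if meta.get("dont_merge_cookies"):
            return {}
        return dict(self.jars[meta.get("cookiejar")])

    def store(self, meta, set_cookies):
        """Record cookies set by the response to a request carrying this meta."""
        if meta.get("dont_merge_cookies"):
            return
        self.jars[meta.get("cookiejar")].update(set_cookies)
```

The point for the search scenario is that each search gets its own jar key, and every follow-up request has to pass the same key along in its meta. A request that omits the key falls back to the shared default jar and loses that session.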
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies, and rotate profiles on demand.
scrapy-dynamic-sessions is almost the same, but lets you pick the proxy and User-Agent at random and handles retrying requests after errors.