I'm working with scrapy. I want to loop through a db table and grab the starting page for each scrape (random_form_page), then yield a request for each start page. Please note that I am hitting an api to get a proxy with the initial request. I want to set up each request to have its own proxy, so using the callback model I have:
def start_requests(self):
    for x in xrange(8):
        random_form_page = session.query(....

        PR = Request(
            'http://my-api',
            headers=self.headers,
            meta={'newrequest': Request(random_form_page, headers=self.headers)},
            callback=self.parse_PR
        )
        yield PR
I notice:
[scrapy] DEBUG: Filtered duplicate request: <GET 'http://my-api'> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
In my code I can see that although it loops through 8 times it only yields a request for the first page. The others I assume are being filtered out. I've looked at http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class but still unsure how to turn off this filtering action. How can I turn off the filtering?
Use dont_filter=True in the Request object:
def start_requests(self):
    for x in xrange(8):
        random_form_page = session.query(....

        PR = Request(
            'http://my-api',
            headers=self.headers,
            meta={'newrequest': Request(random_form_page, headers=self.headers)},
            callback=self.parse_PR,
            dont_filter=True
        )
        yield PR
Since you are accessing an API, you most probably want to disable the duplicate filter altogether:
# settings.py
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
This way you don't have to clutter all your Request creation code with dont_filter=True.
One word of caution, though (thanks to Brick Yang's comment): if your spider is crawling a website by discovering, extracting and following links you should not do this, as your spider is likely picking up the same links multiple times and would be recrawling them over and over again - resulting in an endless crawling loop.
In Scrapy, how can I use different callback functions for allowed domains and denied domains?
I'm using the following rules:
rules = [
    Rule(LinkExtractor(allow=(), deny_domains=allowed_domains), callback='parse_denied_item', follow=True),
    Rule(LinkExtractor(allow_domains=allowed_domains), callback='parse_item', follow=True),
]
Basically I want parse_item to be called whenever there is a request from an allowed_domain (or sub-domain of one of those domains). Then I want parse_denied_item to be called for all requests that are not whitelisted by allowed_domains.
How can I do this?
I believe the best approach is not to use allowed_domains on LinkExtractor, and instead parse the domain out of response.url in your parse_* method and perform different logic depending on the domain.
You can keep separate parse_* methods and a triaging method that, depending on the domain, calls the corresponding parse_* method with yield from self.parse_*(response) (Python 3):
rules = [Rule(LinkExtractor(), callback='parse_all', follow=True)]

def parse_all(self, response):
    # [Get domain out of response.url]
    if domain in allowed_domains:
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)
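One way to fill in the domain check above, as a minimal sketch assuming the spider defines allowed_domains and the two parse_* callbacks, is to pull the host out of response.url with urllib.parse:

from urllib.parse import urlparse

def parse_all(self, response):
    # netloc is the host part of the URL, e.g. 'shop.example.com'
    domain = urlparse(response.url).netloc
    # treat exact matches and sub-domains of allowed_domains as allowed
    if any(domain == d or domain.endswith('.' + d) for d in self.allowed_domains):
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)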
Based on Gallaecio's answer: an alternative option is to use process_request of Rule. process_request will capture the request before it is sent.
From my understanding (which could be wrong), Scrapy will only crawl domains listed in self.allowed_domains (assuming it's used). However, if an offsite link is encountered on a scraped page, Scrapy will in some cases send a single request to this offsite link [1]. I'm not sure why this happens. I think it is possibly occurring because the target site is performing a 301 or 302 redirect and the crawler is automatically following that URL. Otherwise, it's probably a bug.
process_request can be used to perform processing on a request before it is executed. In my case, I wanted to log all links that aren't being crawled, so I verify that an allowed domain is in request.url before proceeding and log any that aren't.
Here is an example:
from urllib.parse import urlparse

rules = [Rule(LinkExtractor(), callback='parse_item', process_request='process_item', follow=True)]

def process_item(self, request):
    found = False
    for url in self.allowed_domains:
        if url in request.url:
            # an allowed domain is in request.url, proceed
            found = True
    if not found:  # otherwise log it
        self.logDeniedDomain(urlparse(request.url).netloc)
        # According to https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule,
        # setting request to None should prevent this call from being executed
        # (which is not the case for all of them); middleware is used to catch
        # these few requests.
        request = None
    return request
[1]: If you're encountering this problem, using process_request in a Downloader middleware appears to solve it.
My Downloader middleware:
def process_request(self, request, spider):
    # catch any requests that should be filtered, and ignore them
    found = False
    for url in spider.allowed_domains:
        if url in request.url:
            # an allowed domain is in request.url, proceed
            found = True
    if not found:
        print("[ignored] " + request.url)
        raise IgnoreRequest('Offsite link, ignore')
    return None
Make sure you import IgnoreRequest as well:
from scrapy.exceptions import IgnoreRequest
and enable the Downloader middleware in settings.py.
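For reference, enabling a custom downloader middleware looks roughly like this; the module path, class name and priority below are placeholders for whatever your project actually uses:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # placeholder path; point this at your middleware class
    'myproject.middlewares.OffsiteFilterMiddleware': 543,
}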
To verify this, you can add some verification code in process_item of your crawler to ensure no requests to out of scope sites have been made.
When I send a request to the API I'm scraping, sometimes it doesn't load properly and returns -1 instead of the price.
So I put in a while loop to repeat the request as long as I get -1, but the spider stops after the first request because of the duplicate request filter.
So my question is: how can I make it process duplicate requests?
example code:
is_checked = False
while not is_checked:
    response = yield scrapy.Request("https://api.bookscouter.com/v3/prices/sell/" + isbn + ".json")
    jsonresponse = loads(response.body)
    sellPrice = jsonresponse['data']['Prices'][0]['Price']
    if sellPrice != -1:
        is_checked = True
yield {'SellPrice': sellPrice}
Bear in mind that I use the inline requests library, but it is not relevant to the solution.
To force scheduling of duplicate requests, set dont_filter=True in the Request constructor. In your example above, change
response = yield scrapy.Request("https://api.bookscouter.com/v3/prices/sell/"+isbn+".json")
to
response = yield scrapy.Request("https://api.bookscouter.com/v3/prices/sell/"+isbn+".json", dont_filter=True)
For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:
def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic
However, it seems like when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse which the CrawlSpider uses to apply the crawl rules and dont_filter=True, I could avoid the duplicate filter. However with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.
I would think about having a custom Retry Middleware instead, similar to the built-in one.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):

    def process_response(self, request, response, spider):
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})

        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
And don't forget to activate it.
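Activation would look something like this in settings.py; the module path and priority are placeholders for your project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # placeholder path; point it at wherever RetryMiddleware lives in your project
    'myproject.middlewares.RetryMiddleware': 550,
}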
I have a Scrapy spider, but sometimes it doesn't return requests.
I found this out by adding log messages before yielding the request and after getting the response.
The spider iterates over pages and parses a link for item scraping on each page.
Here is part of the code:
class SampleSpider(BaseSpider):
    ....

    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...
I've compared the counts of each log message, and "parse_item_general_send" appears more often than "parse_item_general_recv".
There are no 400 or 500 errors in the final statistics; all response status codes are 200. It looks like the requests just disappear.
I've also added these parameters to minimize possible errors:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8
Because of the asynchronous nature of Twisted, I don't know how to debug this.
I've found a similar question, Python Scrapy not always downloading data from website, but it has no answers.
On the same note as Rho, you can add the setting
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
to your settings.py, which will disable the duplicate URL filtering. This can be tricky to spot, since by default the Scrapy logs only show the first duplicate request that was filtered.
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per-spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests... otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the next.
I suppose that would mean disabling cookies... and then grabbing the session cookie from the search response and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):

    def parse(self, response):
        '''Parse category page, extract subcategories links.'''

        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})
            '''Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            This is needed because the site uses sessions to remember the search parameters.'''

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)

            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )
Scrapy has a downloader middleware, CookiesMiddleware, implemented to support cookies; you just need to enable it. It mimics how a browser's cookiejar works.
When a request goes through CookiesMiddleware, it reads the cookies for that domain and sets them on the Cookie header.
When a response returns, CookiesMiddleware reads the cookies sent by the server in the Set-Cookie response header and saves/merges them into the cookiejar held by the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read cookies from the cookiejar, and that Set-Cookie headers from its response should not be merged into the cookiejar. It's a request-level switch.
CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level with Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
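As a minimal sketch of the two request-level switches described above (the spider name, URLs and callbacks are placeholders):

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'search'

    def start_requests(self):
        # each search gets its own cookiejar, keyed by an arbitrary name
        for i, query in enumerate(['books', 'music']):
            yield scrapy.FormRequest('http://www.example.com/search',
                                     formdata={'q': query},
                                     meta={'cookiejar': i},
                                     callback=self.parse_results)

    def parse_results(self, response):
        # this request neither sends cookies from the jar nor merges
        # Set-Cookie from its response back into the jar
        yield scrapy.Request('http://www.example.com/ping',
                             meta={'dont_merge_cookies': True},
                             callback=self.parse_ping)

    def parse_ping(self, response):
        pass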
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
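For reference, receiving the search query in the spider constructor might look like this (a sketch; the spider and attribute names are placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, search_query=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # '-a search_query=something' on the command line ends up here
        self.search_query = search_query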
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies and rotate profiles on demand
scrapy-dynamic-sessions is almost the same, but allows you to randomly pick the proxy and User-Agent and handles retrying requests due to any errors