I'm using scrapy to perform tests on an internal web app.
Once all my tests are done, I use CrawlSpider to check everywhere, running an HTML validator on each response and looking for 404 media files.
It works very well except for one thing: at the end, the crawl GETs things in a random order...
So, URLs that perform DELETE operations are being executed before other operations.
I would like to schedule all the deletes at the end. I tried many ways, with this kind of scheduler:
from scrapy import log

class DeleteDelayer(object):
    def enqueue_request(self, spider, request):
        if request.url.find('delete') != -1:
            log.msg("delay %s" % request.url, log.DEBUG)
            request.priority = 50
But it does not work... I see the deletes being "delayed" in the log, but they are still executed during the crawl.
I thought of using a middleware that piles up all the delete URLs in memory and, when the spider_idle signal fires, puts them back in, but I'm not sure how to do this.
What is the best way to achieve this?
The default priority for requests is 0, and Scrapy executes higher-priority requests sooner, so setting the priority to 50 will not delay them; if anything, it moves them forward. You would need a lower (negative) priority to push them toward the end.
You can use a middleware to collect those 'delete' requests (insert them into your own queue, e.g. a Redis set) and ignore them (raise an IgnoreRequest exception).
Then start a second crawl with the requests loaded from the queue you built in the previous step.
Just like in Scrapy request+response+download time, I would like to know how much time it takes to get a Response. The solution proposed there doesn't meet my needs because of the following issue:
When a Downloader Middleware's process_request method returns a Request object, the Request is rescheduled and isn't passed immediately to the remaining process_request methods. As a consequence, the proposed solution also includes the time the scheduler needs to return the Request to the Engine.
What I want is only the time the Downloader takes to download a page (the time elapsed between the end of the Downloader Middleware processing of the Request and the first Downloader Middleware processing of the Response).
My idea is that one could either:
Disable the rescheduling of the returned Request. But is that desirable, and how can we do it?
Or try to use the 'timer' that triggers a TimeoutError. But I don't know how to access it.
Thanks in advance!
Isn't this exactly what the download_latency request meta key contains? Or is your requirement different?
Whenever Scrapy gets a 302, the redirected request is added as the last item in the queue. Is there a way to force Scrapy to finish the redirection and only then process the next URLs?
As stated by Tomáš in the comments, REDIRECT_PRIORITY_ADJUST controls redirect priority.
However, what you describe shouldn't happen with default Scrapy settings, since this setting defaults to +2. By default all Scrapy requests are scheduled at priority 0, so redirected requests should take priority over other requests.
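If you want redirects pushed even further ahead of the queue, the adjustment can be raised in settings.py (the value 10 here is an arbitrary illustration, not a recommended default):

```python
# settings.py (sketch): increase the priority boost redirected requests get.
# Scrapy's default is +2; a larger value makes redirects jump the queue sooner.
REDIRECT_PRIORITY_ADJUST = 10
```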
You can set the priority of individual requests with the priority argument.
For example, to give a request a priority of 100, you'd write this:
yield Request("http://someurl.com", priority=100)
I'm using this script to randomize proxies in scrapy. The problem is that once it's allocated a proxy to a request, it won't allocate another one because of this code:
def process_request(self, request, spider):
    # Don't overwrite with a random one (server-side state for IP)
    if 'proxy' in request.meta:
        return
That means that if there is a bad proxy which is not connecting to anything, then the request will fail. I'm intending to modify it like this:
if request.meta.get('retry_times', 0) < 5:
    return
thereby letting it allocate a new proxy if the current one fails 5 times. I'm assuming that if I set RETRY_TIMES to, say, 20 in settings.py, then the request won't fail until 4 different proxies have each made 5 attempts.
I'd like to know if that will cause any problems. As I understand it, the reason that the check is there in the first place is for stateful transactions, such as those relying on log-ins, or perhaps cookies. Is that correct?
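Put together, that plan might look like the following minimal sketch. RotateProxyMiddleware, PROXIES, and PROXY_RETRY_LIMIT are illustrative names, not from the linked script:

```python
import random

PROXY_RETRY_LIMIT = 5  # failures allowed before switching proxies
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # placeholder pool

class RotateProxyMiddleware(object):
    def process_request(self, request, spider):
        retries = request.meta.get('retry_times', 0)
        # The proxy is "exhausted" once it has burned through its retries
        exhausted = retries and retries % PROXY_RETRY_LIMIT == 0
        if 'proxy' in request.meta and not exhausted:
            return  # keep the proxy the request already has
        # First attempt, or the current proxy used up its attempts:
        request.meta['proxy'] = random.choice(PROXIES)
```

Note that random.choice can pick the same proxy again, which is one of the caveats raised in the answers below.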
I ran into the same problem.
I improved aivarsk/scrapy-proxies. My middleware inherits from the basic RetryMiddleware and tries to use one proxy for RETRY_TIMES attempts. If the proxy becomes unavailable, the middleware changes it.
Yes, I think the idea of that script was to check whether the user had already defined a proxy in the meta parameter, so it could be controlled from the spider.
Changing the proxy after every 5 failures is OK, but I think you'll have to log in to the page again, as most pages can tell when you've changed where you are making the request from (the proxy).
The idea of rotating proxies is not as easy as just selecting one randomly, because you could still end up using the same proxy, and defining the rules for when a site has "banned" you is not as simple as only checking statuses. These are the services I know of for what you want: Crawlera and Proxymesh.
If you want rotating-proxy functionality built directly into Scrapy, I recommend using Crawlera, as it is already fully integrated.
I would like to keep a scrapy crawler constantly running inside a celery task worker probably using something like this. Or as suggested in the docs
The idea would be to use the crawler to query an external API that returns XML responses. I would like to pass the URL (or the query parameters, letting the crawler build the URL) I want to query to the crawler; the crawler would make the call and give me back the extracted items. How can I pass a new URL to fetch to the crawler once it has started running? I do not want to restart the crawler every time I give it a new URL; instead, I want it to sit idle, waiting for URLs to crawl.
The two methods I've spotted for running Scrapy inside another Python process spawn a new Process to run the crawler in. I would rather not fork and tear down a new process every time I want to crawl a URL, since that is pretty expensive and unnecessary.
Just have a spider that polls a database (or file?) and, when presented with a new URL, creates and yields a new Request() object for it.
You can build it by hand easily enough. There is probably a better way to do it, but that's basically what I did for an open-proxy scraper. The spider gets a list of all the 'potential' proxies from the database and generates a Request() object for each one; when they're returned, they're dispatched down the chain, verified by downstream middleware, and their records are updated by the item pipeline.
You could use a message queue (like IronMQ--full disclosure, I work for the company that makes IronMQ as a developer evangelist) to pass in the URLs.
Then in your crawler, poll for the URLs from the queue, and crawl based on the messages you retrieve.
The example you linked to could be updated (this is untested and pseudocode, but you should get the basic idea):
import time

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ

mq = IronMQ()
q = mq.queue("scrape_queue")
crawler = Crawler(Settings())
crawler.configure()
while True:  # poll forever
    # timeout is the number of seconds the message will be reserved for,
    # making sure no other crawlers get that message. Set it to a safe value
    # (the max amount of time it will take you to crawl a page).
    msg = q.get(timeout=120)  # get messages from the queue
    if len(msg["messages"]) < 1:  # no messages waiting to be crawled
        time.sleep(1)  # wait one second
        continue  # try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"])  # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here
    q.delete(msg["messages"][0]["id"])  # when you're done with the message, delete it
I have my own templatetag:
@register.inclusion_tag('path_to_module.html', takes_context=True)
def getmodule(context, token):
    try:
        return slow_function(params)
    except Exception, e:
        return None
And it is very slow; the template waits for these tags.
Can I call them asynchronously?
If it's cacheable (doesn't need to be unique per page view), then cache it, either using Django's cache API in your templatetag or template fragment caching directly in your template. As @jpic says, if it's something that takes a while to recalculate, pass it off to a task queue like Celery.
If you need this function to run on every page view for whatever reason, then separate it out into a new view and load it into some container in your main template asynchronously using JavaScript.
You can execute python functions in a background process:
django-ztask
celery (with django-kombu for the database transport)
uWSGI spooler (if using uWSGI for deployment)
You could create a background task that renders path_to_module and caches the output. When the cache should be invalidated, run slow_function in the background again.