I would like to keep a Scrapy crawler constantly running inside a Celery task worker, probably using something like this, or as suggested in the docs.
The idea is to use the crawler to query an external API that returns XML responses. I would like to pass the URL I want to query (or the query parameters, letting the crawler build the URL) to the crawler, have the crawler make the call, and get the extracted items back. How can I pass a new URL to the crawler once it has started running? I do not want to restart the crawler every time I give it a new URL; instead I want the crawler to sit idle, waiting for URLs to crawl.
The two methods I've spotted for running Scrapy inside another Python process use a new Process to run the crawler in. I would like to avoid forking and tearing down a new process every time I want to crawl a URL, since that is expensive and unnecessary.
Just have a spider that polls a database (or file?) and, when presented with a new URL, creates and yields a new Request() object for it.
You can build it by hand easily enough. There is probably a better way to do it, but that's basically what I did for an open-proxy scraper. The spider gets a list of all the 'potential' proxies from the database and generates a Request() object for each one. When the responses come back, they're dispatched down the chain and verified by downstream middleware, and their records are updated by an item pipeline.
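As a rough illustration of the polling step only, here is a minimal sketch that uses the standard library's queue.Queue as a stand-in for the database or file. The names pending_urls and next_urls are hypothetical; inside a spider, each yielded URL would be wrapped in a Request() object.

```python
import queue

# Stand-in for the database (or file) that receives new URLs; hypothetical name.
pending_urls = queue.Queue()


def next_urls(source):
    """Yield every URL currently waiting in `source` without blocking."""
    while True:
        try:
            url = source.get_nowait()
        except queue.Empty:
            return  # nothing left for now; the caller polls again later
        yield url


pending_urls.put("http://example.com/api?q=1")
pending_urls.put("http://example.com/api?q=2")
for url in next_urls(pending_urls):
    # A spider would yield Request(url) here instead of printing.
    print("would yield Request(%r)" % url)
```

The generator returns as soon as the source is empty, so the caller decides how often to poll again.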
You could use a message queue (like IronMQ--full disclosure, I work for the company that makes IronMQ as a developer evangelist) to pass in the URLs.
Then in your crawler, poll for the URLs from the queue, and crawl based on the messages you retrieve.
The example you linked to could be updated (this is untested and pseudocode, but you should get the basic idea):
import time

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ

mq = IronMQ()
q = mq.queue("scrape_queue")

crawler = Crawler(Settings())
crawler.configure()

while True:  # poll forever
    # timeout is the number of seconds the message will be reserved for,
    # making sure no other crawlers get that message. Set it to a safe value
    # (the max amount of time it will take you to crawl a page)
    msg = q.get(timeout=120)  # get messages from queue
    if len(msg["messages"]) < 1:  # if there are no messages waiting to be crawled
        time.sleep(1)  # wait one second
        continue  # try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"])  # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here
    q.delete(msg["messages"][0]["id"])  # when you're done with the message, delete it
If there are many requests in the scheduler, will the scheduler reject new requests?
I've run into a very tricky problem. I am trying to scrape a forum with all its posts and comments. The problem is that Scrapy never seems to finish its job and quits without any error message. I am wondering whether I yielded too many requests, so that Scrapy stopped accepting new requests and just quit.
But I could not find any documentation saying that Scrapy will quit if there are too many requests in the scheduler. Here is my code:
https://github.com/spacegoing/sentiment_mqd/blob/a46b59866e8f0a888b43aba6df0481a03136cf21/guba_spiders/guba_spiders/spiders/guba_spider.py#L217
The strange thing is that Scrapy seems to be able to scrape only 22 pages. If I start from page 1, it stops at page 21. If I start from page 21, it stops at page 41... No exception is raised, and the scraped results are the desired output.
1.
The code you shared on GitHub at a46b598 is probably not the exact version you ran locally for the sample jobs. E.g. I couldn't find any code that produces log lines like <timestamp> [guba] INFO: <url>.
Still, I'll assume the differences aren't significant.
2.
It's a good idea to set the log level to DEBUG whenever you run into an issue.
3.
If you've got the log level configured to DEBUG, you'd probably see something like this:
2018-10-26 15:25:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://guba.eastmoney.com/topic,600000_22.html>: max redirections reached
Some more lines: https://gist.github.com/starrify/b2483f0ed822a02d238cdf9d32dfa60e
That happens because you're passing the full response.meta dict to the following requests (related code), and Scrapy's RedirectMiddleware relies on some meta values (e.g. "redirect_times" and "redirect_ttl") to perform the check. Because those values are carried over from page to page, they keep accumulating until the redirect limit is reached.
The solution is simple: pass only the values you need into next_request.meta.
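A minimal sketch of that fix, assuming the callbacks only need two values. The key names stock_id and page and the helper pick_meta are hypothetical and are not taken from the linked spider.

```python
# Hypothetical keys: replace them with whatever your own callbacks actually read.
KEYS_TO_KEEP = ("stock_id", "page")


def pick_meta(meta, keys=KEYS_TO_KEEP):
    """Copy only the listed keys, leaving out Scrapy's own bookkeeping entries."""
    return {key: meta[key] for key in keys if key in meta}


# Inside a spider callback this would be used roughly as:
#     yield scrapy.Request(next_url, callback=self.parse,
#                          meta=pick_meta(response.meta))
old_meta = {"stock_id": "600000", "page": 21, "redirect_times": 20, "redirect_ttl": 0}
print(pick_meta(old_meta))
```

With this, "redirect_times" and "redirect_ttl" start fresh for every new request instead of being inherited from the previous page.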
4.
I also noticed that you're rotating the user agent strings, possibly to avoid crawl bans, but you take no other action. Your requests would still look suspicious, because:
Scrapy's cookie management is enabled by default, so the same cookie jar is used for all your requests.
All your requests come from the same source IP address.
So I'm unsure whether this is good enough to scrape the whole site properly, especially since you're not throttling the requests.
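For the cookie and throttling points, a settings.py fragment along these lines might be a starting point. The setting names are standard Scrapy settings, but the values are illustrative assumptions and are not tuned for this site. The shared source IP is not addressed here.

```python
# settings.py (sketch; the values below are assumptions, adjust them to the site)

# Do not send or store cookies, so requests no longer share one cookie jar.
COOKIES_ENABLED = False

# Wait at least this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Let the AutoThrottle extension adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```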
Whenever Scrapy gets a 302, the redirected request is added as the last item in the queue. Is there a way to force Scrapy to finish the redirection first and process the next URLs after that?
As Tomáš stated in the comments, REDIRECT_PRIORITY_ADJUST controls the priority of redirected requests.
However, what you describe shouldn't happen with the default Scrapy settings, since this setting defaults to +2. By default all Scrapy requests are scheduled with priority 0, so redirected requests should take priority over other requests.
You can set the priority of individual requests with the priority argument.
For example, if you want to set the priority to 100, you'd write this:
yield Request("http://someurl.com", priority=100)
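To make the effect of the +2 adjustment concrete, here is a toy model built on heapq. It is not Scrapy's scheduler: ToyScheduler is a hypothetical class, and breaking ties in insertion order is a simplification (the order Scrapy uses within one priority depends on its queue settings).

```python
import heapq
import itertools

REDIRECT_PRIORITY_ADJUST = 2  # Scrapy's documented default


class ToyScheduler:
    """Toy model only: pops the highest priority first, insertion order on ties."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, url, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]


sched = ToyScheduler()
for url in ("/a", "/b", "/c"):
    sched.push(url)  # ordinary requests, priority 0

first = sched.pop()  # "/a"; suppose the server answers it with a 302
sched.push("/a-redirected", priority=0 + REDIRECT_PRIORITY_ADJUST)

order = [first] + [sched.pop() for _ in range(3)]
print(order)
```

In this model the redirected request is served before "/b" and "/c", even though it was enqueued after them.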
I've encountered a page with Ajax hidden elements that I need to crawl. I've found this neat tutorial, which shows how to do this with Selenium when there are no additional calls to the server (which is the case for me as well).
http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
However, this and other sources mention the performance cost of using Selenium for this purpose. In this example the driver is initialized in the constructor, so I'm assuming all of the spider's requests will then go through Firefox?
Only a small portion of my calls involve Ajax; the rest is standard Scrapy crawling. Is it feasible to switch from Selenium/the browser back to the default Scrapy mechanism within a single spider after part of the tasks is completed? If so, how should I do this?
def __init__(self):
    self.driver = webdriver.Firefox()

def parse(self, response):
    items = []
    self.driver.get(response.url)
Edit
What I'm after is scraping the Ajax-based menu from a single site, just the URLs. Then I want to pass this list as start_urls to the main spider.
Your code does not break standard Scrapy behaviour. Try switching back to the standard way like this:
def __init__(self):
    self.driver = webdriver.Firefox()

def parse(self, response):
    items = []
    self.driver.get(response.url)
    # get hidden menu urls
    yield scrapy.Request(hidden_menu_url, callback=self.parse_original_scrapy)

def parse_original_scrapy(self, response):
    pass
You can try my framework, Pomp, instead of Scrapy.
Start with the phantomjs example and implement your own Downloader that either dispatches the request to webdriver or fetches it with a plain HTTP request. It is not so easy to do, but it is much better than using webdriver inside the parse method of a Scrapy spider.
Sorry for my poor English
There is no direct way to do this, as Scrapy fetches the requests you send as plain text, without JavaScript rendering, much like curl, if you've tried it.
Moving from Selenium to Scrapy alone is possible by reproducing every (or just every necessary) request yourself. You can use Chrome DevTools or Firebug to check which requests the browser makes for each call, then work out which of them return the information you want.
I'm using Scrapy to perform tests on an internal web app.
Once all my tests are done, I use CrawlSpider to check everywhere; for each response I run an HTML validator and look for 404 media files.
It works very well except for this: the crawl at the end GETs things in a random order...
So URLs that perform DELETE operations are executed before other operations.
I would like to schedule all deletes at the end. I tried many ways, with this kind of scheduler:
from scrapy import log

class DeleteDelayer(object):
    def enqueue_request(self, spider, request):
        if request.url.find('delete') != -1:
            log.msg("delay %s" % request.url, log.DEBUG)
            request.priority = 50
But it does not work... I see the deletes being "delayed" in the log, but they are still executed during the crawl.
I thought of using a middleware that piles up all the delete URLs in memory and puts them back in when the spider_idle signal is fired, but I'm not sure how to do this.
What is the best way to achieve this?
1. The default priority for a request is 0, and requests with a higher priority are scheduled earlier, so setting the priority to 50 will not delay them.
2. You can use a middleware to collect those 'delete' requests (insert them into your own queue, e.g. a Redis set) and ignore them (raise an IgnoreRequest exception).
3. Start a second crawl with the requests loaded from your queue in step 2.
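The collecting step could be sketched as a downloader middleware like the one below. The class name CollectDeletesMiddleware is hypothetical, and a plain in-memory list stands in for the Redis set; a real second crawl would need the URLs stored somewhere that survives the first run. The try/except around the import is there only so the sketch can be exercised without Scrapy installed.

```python
try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Fallback so the sketch can be read and exercised without Scrapy installed.
    class IgnoreRequest(Exception):
        pass


class CollectDeletesMiddleware:
    """Downloader middleware sketch: set delete requests aside instead of sending them."""

    def __init__(self):
        self.deferred_urls = []  # in-memory stand-in for e.g. a Redis set

    def process_request(self, request, spider):
        if "delete" in request.url:
            self.deferred_urls.append(request.url)
            raise IgnoreRequest("deferred until the second crawl: %s" % request.url)
        return None  # let every other request continue as usual
```

The middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting, and the second crawl would read its start URLs from wherever the deferred URLs were stored.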
I have a scraper that runs at a timed interval. I want to send an email when the scrape completes. What would be the best way to go about this?
I was thinking of writing an extension, but I can't figure out how to access the file that the output was written to from within the extension.
Have you considered hooking the spider_closed signal and using the scrapy.mail.MailSender service?
scrapy.signals.spider_closed(spider, reason)
[...]
reason (str) – a string which describes the reason why the spider was closed. If it was closed because the spider has completed scraping, the reason is 'finished'.
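A minimal sketch of such an extension follows, under a few assumptions: the class name EmailOnClose and the setting CLOSE_MAIL_RCPTS are hypothetical, and MailSender.from_settings is assumed to be available in your Scrapy version. It does not attach the output file; it only reports the close reason.

```python
class EmailOnClose:
    """Extension sketch: send one email when the spider closes."""

    def __init__(self, mailer, recipients):
        self.mailer = mailer
        self.recipients = recipients

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so the class itself can be exercised without Scrapy.
        from scrapy import signals
        from scrapy.mail import MailSender

        ext = cls(
            MailSender.from_settings(crawler.settings),
            crawler.settings.getlist("CLOSE_MAIL_RCPTS"),  # hypothetical setting
        )
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # reason is 'finished' when the spider completed scraping normally.
        subject = "Spider %s closed (%s)" % (spider.name, reason)
        body = "The crawl ended with reason %r." % reason
        return self.mailer.send(to=self.recipients, subject=subject, body=body)
```

The extension would be enabled through the EXTENSIONS setting, with the mail server configured through the usual MAIL_* settings.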