So, I made this class so that I can crawl on-demand using Scrapy:
from scrapy import signals
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy.settings import Settings
class NewsCrawler(object):
def __init__(self, spiders=[]):
self.spiders = spiders
self.settings = Settings()
def crawl(self, start_date, end_date):
crawled_items = []
def add_item(item):
crawled_items.append(item)
process = CrawlerProcess(self.settings)
for spider in self.spiders:
crawler = Crawler(spider, self.settings)
crawler.signals.connect(add_item, signals.item_scraped)
process.crawl(crawler, start_date=start_date, end_date=end_date)
process.start()
return crawled_items
Basically, I have a long running process and I will call the above class' crawl method multiple times, like this:
import time
crawler = NewsCrawler(spiders=[Spider1, Spider2])
while True:
items = crawler.crawl(start_date, end_date)
# do something with crawled items ...
time.sleep(3600)
The problem is, the second time crawl being called, this error will occurs: twisted.internet.error.ReactorNotRestartable.
From what I gathered, it's because reactor can't be run after it's being stopped. Is there any workaround for that?
Thanks!
This is a limitation of scrapy(twisted) at the moment and makes it hard using scrapy as a lib.
What you can do is fork a new process which runs the crawler and stops the reactor when the crawl is finished. You can then wait for join and spawn a new process after the crawl has finished. If you want to handle the items in your main thread you can post the results to a Queue. I would recommend using a customized pipelines for your items though.
Have a look at the following answer by me: https://stackoverflow.com/a/22202877/2208253
You should be able to apply the same principles. But you would rather use multiprocessing instead of billiard.
Based on #bj-blazkowicz's answer above. I found out a solution with CrawlerRunner which is the recommended crawler to use when running multiple spiders as stated in the docs https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
Using this class the reactor should be explicitly run after scheduling your spiders. It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
Code in your main file:
from multiprocessing import Process, Queue
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
# Enable logging for CrawlerRunner
configure_logging()
class CrawlerRunnerProcess(Process):
def __init__(self, spider, q, *args):
Process.__init__(self)
self.runner = CrawlerRunner(get_project_settings())
self.spider = spider
self.q = q
self.args = args
def run(self):
deferred = self.runner.crawl(self.spider, self.q, self.args)
deferred.addBoth(lambda _: reactor.stop())
reactor.run(installSignalHandlers=False)
# The wrapper to make it run multiple spiders, multiple times
def run_spider(spider, *args): # optional arguments
q = Queue() # optional queue to return spider results
runner = CrawlerRunnerProcess(spider, q, *args)
runner.start()
runner.join()
return q.get()
Code in your spider file:
class MySpider(Spider):
name = 'my_spider'
def __init__(self, q, *args, **kwargs):
super(MySpider, self).__init__(*args, **kwargs)
self.q = q # optional queue
self.args = args # optional args
def parse(self, response):
my_item = MyItem()
self.q.put(my_item)
yield my_item
Related
I am trying to do Multiprocessing of my spider. I know CrawlerProcess runs the spider in a single process.
I want to run multiple times the same spider with different arguments.
I tried this but doesn't work.
How do I do multiprocessing?
Please do help. Thanks.
from scrapy.utils.project import get_project_settings
import multiprocessing
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess(settings=get_project_settings())
process.crawl(Spider, data=all_batches[0])
process1 = CrawlerProcess(settings=get_project_settings())
process1.crawl(Spider, data=all_batches[1])
p1 = multiprocessing.Process(target=process.start())
p2 = multiprocessing.Process(target=process1.start())
p1.start()
p2.start()
You need to run each scrapy crawler instance inside a separate process. This is because scrapy uses twisted, and you can't use it multiple times in the same process.
Also, you need to disable the telenet extension, because scrapy will try to bind to the same port on multiple processes.
Test code:
import scrapy
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
class TestSpider(scrapy.Spider):
name = 'blogspider'
start_urls = ['https://blog.scrapinghub.com']
def parse(self, response):
for title in response.css('.post-header>h2'):
print('my_data -> ', self.settings['my_data'])
yield {'title': title.css('a ::text').get()}
def start_spider(spider, settings: dict = {}, data: dict = {}):
all_settings = {**settings, **{'my_data': data, 'TELNETCONSOLE_ENABLED': False}}
def crawler_func():
crawler_process = CrawlerProcess(all_settings)
crawler_process.crawl(spider)
crawler_process.start()
process = Process(target=crawler_func)
process.start()
return process
map(lambda x: x.join(), [
start_spider(TestSpider, data={'data': 'test_1'}),
start_spider(TestSpider, data={'data': 'test_2'})
])
Actually the scrapy docs explained how to chain two spyder like this
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
#defer.inlineCallbacks
def crawl():
yield runner.crawl(MySpider1)
yield runner.crawl(MySpider2)
reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
But in my use case, the MySpider2 need informations retrieved by MySpider1 after transformation using some transformFunction().
So i want something like this :
def transformFunction():
... transforme data retrieved by spyder1 ...
return newdata
def crawl():
yield runner.crawl(MySpider1)
newdata = transformFunction()
yield runner.crawl(MySpider2, data=newData)
reactor.stop()
What i want to be scheduled :
MySpider1 start, write data on disk then quit
transformFunction() transform data to newdata
MySpider2 start using the newData
So how can i manage this behavior using twisted reactor and scrapy ?
runner.crawl returns a Deferred so you could chain callbacks to it. Minor tweaks will have to be done to your code.
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
configure_logging()
def crawl(reactor):
runner = CrawlerRunner()
d = runner.crawl(MySpider1)
d.addCallback(transformFunction)
d.addCallback(crawl2, runner)
return d
def transformFunction(result):
# crawl doesn't usually return any results if successful so ignore result var here
# ...
return newdata
def crawl2(result, runner):
# result == newdata from transformFunction
# runner is passed in from crawl()
return runner.crawl(MySpider2, data=result)
task.react(crawl)
The main function is crawl(), it gets executed by task.react() which will start and stop the reactor for you. The Deferred is returned from runner.crawl() and the transformFunction + crawl2 functions are chained to it so that when a function completes, the next one starts.
I have an existing script (main.py) that requires data to be scraped.
I started a scrapy project for retrieving this data. Now, is there any way main.py can retrieve the data from scrapy as an Item generator, rather than persisting data using the Item pipeline?
Something like this would be really convenient, but I couldn't find out how to do it, if it's feasible at all.
for item in scrapy.process():
I found a potential solution there: https://tryolabs.com/blog/2011/09/27/calling-scrapy-python-script/, using multithreading's queues.
Even though I understand this behaviour is not compatible with distributed crawling, which is what Scrapy is intended for, I'm still a little surprised that you wouldn't have this feature available for smaller projects.
You could send json data out from the crawler and grab the results. It can be done as follows:
Having the spider:
class MySpider(scrapy.Spider):
# some attributes
accomulated=[]
def parse(self, response):
# do your logic here
page_text = response.xpath('//text()').extract()
for text in page_text:
if conditionsAreOk( text ):
self.accomulated.append(text)
def closed( self, reason ):
# call when the crawler process ends
print JSON.dumps(self.accomulated)
Write a runner.py script like:
import sys
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from spiders import MySpider
def main(argv):
url = argv[0]
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s', 'LOG_ENABLED':False })
runner = CrawlerRunner( get_project_settings() )
d = runner.crawl( MySpider, url=url)
# For Multiple in the same process
#
# runner.crawl('craw')
# runner.crawl('craw2')
# d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
if __name__ == "__main__":
main(sys.argv[1:])
And then call it from your main.py as:
import json, subprocess, sys, time
def main(argv):
# urlArray has http:// or https:// like urls
for url in urlArray:
p = subprocess.Popen(['python', 'runner.py', url ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
# do something with your data
print out
print json.loads(out)
# This just helps to watch logs
time.sleep(0.5)
if __name__ == "__main__":
main(sys.argv[1:])
Note
This is not the best way of using Scrapy as you know, but for fast results which do not require a complex post processing, this solution can provide what you need.
I hope it helps.
You can do it this way in a Twisted or Tornado app:
import collections
from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals
def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
"""
Start a crawl and return an object (ItemCursor instance)
which allows to retrieve scraped items and wait for items
to become available.
Example:
.. code-block:: python
#inlineCallbacks
def f():
runner = CrawlerRunner()
async_items = scrape_items(runner, my_spider)
while (yield async_items.fetch_next):
item = async_items.next_item()
# ...
# ...
This convoluted way to write a loop should become unnecessary
in Python 3.5 because of ``async for``.
"""
# this requires scrapy >= 1.1rc1
crawler = crawler_runner.create_crawler(crawler_or_spidercls)
# for scrapy < 1.1rc1 the following code is needed:
# crawler = crawler_or_spidercls
# if not isinstance(crawler_or_spidercls, Crawler):
# crawler = crawler_runner._create_crawler(crawler_or_spidercls)
d = crawler_runner.crawl(crawler, *args, **kwargs)
return ItemCursor(d, crawler)
class ItemCursor(object):
def __init__(self, crawl_d, crawler):
self.crawl_d = crawl_d
self.crawler = crawler
crawler.signals.connect(self._on_item_scraped, signals.item_scraped)
crawl_d.addCallback(self._on_finished)
crawl_d.addErrback(self._on_error)
self.closed = False
self._items_available = Deferred()
self._items = collections.deque()
def _on_item_scraped(self, item):
self._items.append(item)
self._items_available.callback(True)
self._items_available = Deferred()
def _on_finished(self, result):
self.closed = True
self._items_available.callback(False)
def _on_error(self, failure):
self.closed = True
self._items_available.errback(failure)
#property
def fetch_next(self):
"""
A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
asynchronously retrieve the next item, waiting for an item to be
crawled if necessary. Resolves to ``False`` if the crawl is finished,
otherwise :meth:`next_item` is guaranteed to return an item
(a dict or a scrapy.Item instance).
"""
if self.closed:
# crawl is finished
d = Deferred()
d.callback(False)
return d
if self._items:
# result is ready
d = Deferred()
d.callback(True)
return d
# We're active, but item is not ready yet. Return a Deferred which
# resolves to True if item is scraped or to False if crawl is stopped.
return self._items_available
def next_item(self):
"""Get a document from the most recently fetched batch, or ``None``.
See :attr:`fetch_next`.
"""
if not self._items:
return None
return self._items.popleft()
The main idea is to listen to item_scraped signal, and then wrap it to an object with a nicer API.
Note that you need an event loop in your main.py script for this to work; the example above works with twisted.defer.inlineCallbacks or tornado.gen.coroutine.
The official docs give many ways for running scrapy crawlers from code:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
# Your spider definition
...
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
But all of them block script until crawling is finished. What's the easiest way in python to run the crawler in a non-blocking, async manner?
I tried every solution I could find, and the only working for me was this. But in order to make it work with scrapy 1.1rc1 I had to tweak it a little bit:
from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from billiard import Process
class CrawlerScript(Process):
def __init__(self, spider):
Process.__init__(self)
settings = get_project_settings()
self.crawler = Crawler(spider.__class__, settings)
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
self.spider = spider
def run(self):
self.crawler.crawl(self.spider)
reactor.run()
def crawl_async():
spider = MySpider()
crawler = CrawlerScript(spider)
crawler.start()
crawler.join()
So now when I call crawl_async, it starts crawling and doesn't block my current thread. I'm absolutely new to scrapy, so may be this isn't a very good solution but it worked for me.
I used these versions of the libraries:
cffi==1.5.0
Scrapy==1.1rc1
Twisted==15.5.0
billiard==3.3.0.22
Netimen's answer is correct. process.start() calls reactor.run(), which blocks the thread. Just that I don't think it is necessary to subclass billiard.Process. Although poorly documented, billiard.Process does have a set of APIs to call another function asynchronously without subclassing.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from billiard import Process
crawler = CrawlerProcess(get_project_settings())
process = Process(target=crawler.start, stop_after_crawl=False)
def crawl(*args, **kwargs):
crawler.crawl(*args, **kwargs)
process.start()
Note that if you don't have stop_after_crawl=False, you may run into ReactorNotRestartable exception when you run the crawler for more than twice.
I'm trying to execute scrapy spider in separate script and when I execute this script in a loop (for instance run the same spider with different parameters), I get ReactorAlreadyRunning. My snippet:
from celery import task
from episode.skywalker.crawlers import settings
from multiprocessing.queues import Queue
from scrapy import log, project, signals
from scrapy.settings import CrawlerSettings
from scrapy.spider import BaseSpider
from scrapy.spidermanager import SpiderManager
from scrapy.xlib.pydispatch import dispatcher
import multiprocessing
from twisted.internet.error import ReactorAlreadyRunning
class CrawlerWorker(multiprocessing.Process):
def __init__(self, spider, result_queue):
from scrapy.crawler import CrawlerProcess
multiprocessing.Process.__init__(self)
self.result_queue = result_queue
self.crawler = CrawlerProcess(CrawlerSettings(settings))
if not hasattr(project, 'crawler'):
self.crawler.install()
self.crawler.configure()
self.items = []
self.spider = spider
dispatcher.connect(self._item_passed, signals.item_passed)
def _item_passed(self, item):
self.items.append(item)
def run(self):
self.crawler.crawl(self.spider)
try:
self.crawler.start()
except ReactorAlreadyRunning:
pass
self.crawler.stop()
self.result_queue.put(self.items)
#task
def execute_spider(spider, **spider__kwargs):
'''
Execute spider within separate process
#param spider: spider class to crawl or the name (check if instance)
'''
if not isinstance(spider, BaseSpider):
manager = SpiderManager(settings.SPIDER_MODULES)
spider = manager.create(spider, **spider__kwargs)
result_queue = Queue()
crawler = CrawlerWorker(spider, result_queue)
crawler.start()
items = []
for item in result_queue.get():
items.append(item)
My suggestion is that it caused by multiple twisted reactor runs.
How can I avoid it? Is there in general a way to run the spiders without reactor?
I figured out, what caused the problem: if you call execute_spider method somehow in CrawlerWorker process (for instance via recursion ), it causes creating second reactor, which is not possible.
My solution: to move all statements, causing recursive calls, in execute_spider method, so they will trigger the spider execution in the same process, not in secondary CrawlerWorker. I also built in such a statement
try:
self.crawler.start()
except ReactorAlreadyRunning:
raise RecursiveSpiderCall("Spider %s was called from another spider recursively. Such behavior is not allowed" % (self.spider))
to catch unintentionally recursive calls of spiders.