I am having this error when I run a crawl process multiples times.
I am using scrapy 2.6
This is my code:
from scrapy.crawler import CrawlerProcess
from football.spiders.laliga import LaligaSpider
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(settings=get_project_settings())
for i in range(1, 29):
process.crawl(LaligaSpider, **{'week': i})
process.start()
For me this worked, I put it before the CrawlerProcess
import sys
if "twisted.internet.reactor" in sys.modules:
del sys.modules["twisted.internet.reactor"]
This solution avoids use of CrawlerProcess as stated in the docs.
https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from football.spiders.laliga import LaligaSpider
# Enable logging for CrawlerRunner
configure_logging()
runner = CrawlerRunner(settings=get_project_settings())
for i in range(1, 29):
runner.crawl(LaligaSpider, **{'week': i})
deferred = runner.join()
deferred.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
I've just run into this issue as well. It appears that the docs at https://docs.scrapy.org/en/latest/topics/practices.html are incorrect in stating that CrawlerProcess can be used to run multiple crawlers built with spiders, since each new crawler attempts to load a new reactor instance if you give it a spider. I was able to get my code to work by using CrawlerRunner instead, as also detailed on the same page.
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
settings = get_project_settings() # settings not required if running
runner = CrawlerRunner(settings) # from script, defaults provided
runner.crawl(MySpider1) # your loop would go here
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
I have encountered this problem, but it is solved after updating both Scrapy and Twisted.
the current version of the packages.
Twisted==22.8.0
Scrapy==2.6.2
Related
I'm trying to change the settings for Scrapy. I've managed to successfully do this for CrawlerProcess before. But I can't seem to get it to work for CrawlerRunner. The log should be disabled but I'm still seeing output from the log. What am I doing wrong? Thanks.
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
class MySpider1(scrapy.Spider):
name = "spider1"
class MySpider2(scrapy.Spider):
name = "spider2"
configure_logging()
s = get_project_settings()
s.update({
"LOG_ENABLED": "False"
})
runner = CrawlerRunner(s)
#defer.inlineCallbacks
def crawl():
yield runner.crawl(MySpider1)
yield runner.crawl(MySpider2)
reactor.stop()
crawl()
reactor.run()
According to the doc, and the api, you should use your setting to init the logger, so you should adjust your code like that:
# comment that line
# configure_logging()
s = get_project_settings()
s.update({
"LOG_ENABLED": "False"
})
# init the logger using setting
configure_logging(s)
runner = CrawlerRunner(s)
Then you will get what you want.
Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:
using CrawlerProcess
using CrawlerRunner
What is the difference between the two? When should I use "process" and when "runner"?
Scrapy's documentation does a pretty bad job at giving examples on real applications of both.
CrawlerProcess assumes that scrapy is the only thing that is going to use twisted's reactor. If you are using threads in python to run other code this isn't always true. Let's take this as an example.
from scrapy.crawler import CrawlerProcess
import scrapy
def notThreadSafe(x):
"""do something that isn't thread-safe"""
# ...
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
notThreadSafe(3) # it will get executed when the crawlers stop
Now, as you can see, the function will only get executed when the crawlers stop, what if I want the function to be executed while the crawlers crawl in the same reactor?
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy
def notThreadSafe(x):
"""do something that isn't thread-safe"""
# ...
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run() #it will run both crawlers and code inside the function
The Runner class is not limited to this functionality, you may want some custom settings on your reactor (defer, threads, getPage, custom error reporting, etc)
CrawlerRunner:
This class shouldn’t be needed (since Scrapy is responsible of using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
CrawlerProcess:
This utility should be a better fit than CrawlerRunner if you aren’t running another Twisted reactor within your application.
It sounds like the CrawlerProcess is what you want unless you're adding your crawlers to an existing Twisted application.
The official docs give many ways for running scrapy crawlers from code:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
# Your spider definition
...
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
But all of them block script until crawling is finished. What's the easiest way in python to run the crawler in a non-blocking, async manner?
I tried every solution I could find, and the only working for me was this. But in order to make it work with scrapy 1.1rc1 I had to tweak it a little bit:
from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from billiard import Process
class CrawlerScript(Process):
def __init__(self, spider):
Process.__init__(self)
settings = get_project_settings()
self.crawler = Crawler(spider.__class__, settings)
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
self.spider = spider
def run(self):
self.crawler.crawl(self.spider)
reactor.run()
def crawl_async():
spider = MySpider()
crawler = CrawlerScript(spider)
crawler.start()
crawler.join()
So now when I call crawl_async, it starts crawling and doesn't block my current thread. I'm absolutely new to scrapy, so may be this isn't a very good solution but it worked for me.
I used these versions of the libraries:
cffi==1.5.0
Scrapy==1.1rc1
Twisted==15.5.0
billiard==3.3.0.22
Netimen's answer is correct. process.start() calls reactor.run(), which blocks the thread. Just that I don't think it is necessary to subclass billiard.Process. Although poorly documented, billiard.Process does have a set of APIs to call another function asynchronously without subclassing.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from billiard import Process
crawler = CrawlerProcess(get_project_settings())
process = Process(target=crawler.start, stop_after_crawl=False)
def crawl(*args, **kwargs):
crawler.crawl(*args, **kwargs)
process.start()
Note that if you don't have stop_after_crawl=False, you may run into ReactorNotRestartable exception when you run the crawler for more than twice.
I'm trying to run spiders from python script following scrapy document: http://doc.scrapy.org/en/latest/topics/practices.html
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent
But python just cannot import the module, the error looks like this:
Traceback (most recent call last):
...
from scrapy.crawler import Crawler
File "aappp/scrapy.py", line 1, in <module>
ImportError: No module named crawler
The issue is briefly mentioned in faq of scrapy document, but it doesn't help too much for me.
Have you tried doing it this way?
from scrapy.project import crawler
(That's how it's done on http://doc.scrapy.org/en/latest/faq.html - looks like they already answered your question there.)
It also gives a more recent way of doing it and calls this previous method deprecated:
"This way to access the crawler object is deprecated, the code should be ported to use from_crawler class method, for example:
class SomeExtension(object):
#classmethod
def from_crawler(cls, crawler):
o = cls()
o.crawler = crawler
return o
"
I want to start a crawler in Scrapy from a Python module. I want to essentially mimic the essence of $ scrapy crawl my_crawler -a some_arg=value -L DEBUG
I have the following things in place:
a settings.py file for the project
items and pipelines
a crawler class which extends BaseSpider and requires arguments upon initialisation.
I can quite happily run my project using the scrapy command as specified above, however I'm writing integration tests and I want to programatically:
launch the crawl using the settings in settings.py and the crawler that has the my_crawler name attribute (I can instantiate this class easily from my test module.
I want all the pipelines and middleware to be used as per the specification in settings.py.
I'm quite happy for the process to be blocked until the crawler has finished. The pipelines dump things in a DB and it's the contents of the DB I'll be inspecting after the crawl is done to satisfy my tests.
So, can anyone help me? I've seen some examples on the net but they are either hacks for multiple spiders, or getting around Twisted's blocking nature, or don't work with Scrapy 0.14 or above. I just need something real simple. :-)
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent
See this part of the docs
#wilfred's answer from official docs works fine except logging part, here's mine:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider()
crawler = crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start_from_settings(get_project_settings())
reactor.run()