CrawlerProcess vs CrawlerRunner - python

Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:
using CrawlerProcess
using CrawlerRunner
What is the difference between the two? When should I use "process" and when "runner"?

Scrapy's documentation does a pretty bad job at giving examples on real applications of both.
CrawlerProcess assumes that Scrapy is the only thing that is going to use Twisted's reactor. If you are using threads in Python to run other code, this isn't always true. Let's take this as an example:
from scrapy.crawler import CrawlerProcess
import scrapy
def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...
class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
notThreadSafe(3) # it will get executed when the crawlers stop
As you can see, the function is only executed once the crawlers stop. What if I want the function to be executed while the crawlers are crawling, in the same reactor?
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy
def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...
class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run() # it will run both crawlers and code inside the function
The Runner class is not limited to this functionality; you may also want custom settings on your reactor (defer, threads, getPage, custom error reporting, etc.).

CrawlerRunner:
This class shouldn’t be needed (since Scrapy is responsible of using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
CrawlerProcess:
This utility should be a better fit than CrawlerRunner if you aren’t running another Twisted reactor within your application.
It sounds like the CrawlerProcess is what you want unless you're adding your crawlers to an existing Twisted application.

Related

twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

I get this error when I run a crawl process multiple times.
I am using Scrapy 2.6.
This is my code:
from scrapy.crawler import CrawlerProcess
from football.spiders.laliga import LaligaSpider
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(settings=get_project_settings())
for i in range(1, 29):
    process.crawl(LaligaSpider, **{'week': i})
process.start()
This worked for me; I put it before creating the CrawlerProcess:
import sys
if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"]
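The check above works because Twisted treats the presence of `twisted.internet.reactor` in `sys.modules` as a sign that a reactor is already installed. As a minimal sketch, the same test can be used to decide which class to reach for, following the rule of thumb that CrawlerRunner suits an application that already has a reactor. The helper names `reactor_is_installed` and `pick_crawler_class_name` are hypothetical (they are not part of Scrapy or Twisted), and the result is a heuristic, not a guarantee:

```python
import sys

# Twisted registers the installed reactor under this module name.
REACTOR_MODULE = "twisted.internet.reactor"


def reactor_is_installed(modules=None):
    """Return True if a Twisted reactor module is already registered.

    `modules` defaults to sys.modules; it is a parameter only so the
    check can be exercised without importing Twisted.
    """
    modules = sys.modules if modules is None else modules
    return REACTOR_MODULE in modules


def pick_crawler_class_name(modules=None):
    """Heuristic only: suggest which Scrapy class to use (hypothetical helper)."""
    if reactor_is_installed(modules):
        # Something already installed a reactor: reuse it via CrawlerRunner.
        return "CrawlerRunner"
    # No reactor yet: CrawlerProcess can install and start its own.
    return "CrawlerProcess"
```

Deleting the `sys.modules` entry, as in the snippet above, makes the second branch apply again, but it hides the existing reactor rather than reusing it, so it is best treated as a workaround.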
This solution avoids use of CrawlerProcess as stated in the docs.
https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from football.spiders.laliga import LaligaSpider
# Enable logging for CrawlerRunner
configure_logging()
runner = CrawlerRunner(settings=get_project_settings())
for i in range(1, 29):
    runner.crawl(LaligaSpider, **{'week': i})
deferred = runner.join()
deferred.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
I've just run into this issue as well. It appears that the docs at https://docs.scrapy.org/en/latest/topics/practices.html are incorrect in stating that CrawlerProcess can be used to run multiple crawlers built with spiders, since each new crawler attempts to load a new reactor instance if you give it a spider. I was able to get my code to work by using CrawlerRunner instead, as also detailed on the same page.
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...
class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
configure_logging()
settings = get_project_settings()  # settings are optional when running from a script; defaults are provided
runner = CrawlerRunner(settings)
runner.crawl(MySpider1)  # your loop would go here
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
I encountered this problem too, but it was solved after updating both Scrapy and Twisted.
My current package versions are:
Twisted==22.8.0
Scrapy==2.6.2

Scrapy Running multiple spiders from one file

I have made one file with two spiders (classes). The second spider will use some data from the first one, but it doesn't seem to work. Here is what I do to set up and start the spiders:
process=CrawlerProcess()
process.crawl(Zoopy1)
process.crawl(Zoopy2)
process.start()
What do you suggest?
Your code will run 2 spiders simultaneously.
Running spiders sequentially (starting Zoopy2 after Zoopy1 completes) can be achieved with @defer.inlineCallbacks:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
...
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Zoopy1)  # wait for Zoopy1 to finish
    yield runner.crawl(Zoopy2)  # then run Zoopy2
    reactor.stop()
crawl()
reactor.run()
An alternative option (if it suits your task) is to merge the logic of the two spiders into a single spider class.

ReactorNotRestartable with scrapy when using Google Cloud Functions

I am trying to send multiple crawl requests with Google Cloud Functions. However, I seem to be getting the ReactorNotRestartable error. From other posts on StackOverflow, such as this one, I understand that this comes because it is not possible to restart the reactor, in particular when doing a loop.
The way to solve this is by putting the start() outside the for loop. However, with Cloud Functions this is not possible as each request should be technically independent.
Is the CrawlerProcess somehow cached with Cloud Functions? And if so, how can we remove this behaviour.
I tried for instance to put the import and initialization process inside a function, instead of outside, to prevent the caching of imports, but that did not work:
# main.py
def run_single_crawl(data, context):
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()
By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.
Instead, we can use scrapydo to run your existing spider in a blocking fashion:
requirements.txt:
scrapydo
main.py:
import scrapy
import scrapydo
scrapydo.setup()
class MyItem(scrapy.Item):
    url = scrapy.Field()
class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]
    def parse(self, response):
        yield MyItem(url=response.url)
def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)
This also shows a simple example of how to yield one or more scrapy.Item from the spider and collect the results from the crawl, which would also be challenging to do if not using scrapydo.
Also: make sure that you have billing enabled for your project. By default Cloud Functions cannot make outbound requests, and the crawler will succeed, but return no results.
You can simply queue the spiders one after another on a single CrawlerProcess and call start() once.
main.py
from scrapy.crawler import CrawlerProcess
def run_single_crawl(data, context):
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()
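Another commonly used way around ReactorNotRestartable is to give every crawl its own short-lived child process, so that a fresh reactor is created and discarded each time while the long-lived function instance never starts one. The following is only a sketch under stated assumptions: `run_in_fresh_process` and `crawl_once` are hypothetical helper names, the "fork" start method is assumed to be available (Linux), and whether child processes are permitted in a given Cloud Functions runtime has not been verified here.

```python
import multiprocessing


def _child(conn, func, args, kwargs):
    """Run func in the child and report ("ok", result) or ("error", text)."""
    try:
        conn.send(("ok", func(*args, **kwargs)))
    except BaseException as exc:  # report every failure to the parent
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_in_fresh_process(func, *args, **kwargs):
    """Call func(*args, **kwargs) in a forked child and return its result."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX platform
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(send_end, func, args, kwargs))
    proc.start()
    send_end.close()  # the parent only reads
    try:
        status, payload = recv_end.recv()
    except EOFError:
        status, payload = "error", "child exited without reporting"
    proc.join()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def crawl_once(spider_cls, **spider_kwargs):
    """Blocking crawl; meant to be called through run_in_fresh_process."""
    # Imported here so the reactor is only ever created inside the child.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(spider_cls, **spider_kwargs)
    process.start()  # blocks until the crawl finishes
    return True


# Hypothetical use inside the function entry point:
# def run_single_crawl(data, context):
#     run_in_fresh_process(crawl_once, MySpider)
```

Because the parent process never runs a reactor, repeated invocations on the same instance should not hit the "not restartable" state; the cost is one process start per crawl.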

Scrapy: how to use multiple spiders in one redis queue

I'm learning scrapy-redis, and I have multiple spiders in one scrapy-redis project. Is there a smart way to control when each spider starts and stops? And is it sensible to keep multiple spiders in one project just so that they share the same settings?
My code is like this
from scrapy_redis.spiders import RedisSpider
from scrapy.crawler import CrawlerProcess
class MySpider1(RedisSpider):
    ...
class MySpider2(RedisSpider):
    ...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
I have to lpush a start URL for every spider, which is inconvenient. Is there a smarter way to control this?
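One way to reduce the manual lpush work is to seed all the queues from a small helper. The sketch below assumes the spiders keep scrapy-redis's default start-URL key pattern, `%(name)s:start_urls` (it differs if `redis_key` or `REDIS_START_URLS_KEY` is overridden), and that `client` is any object with a redis-py style `lpush(key, *values)` method. The function names and the spider names in the usage comment are hypothetical.

```python
def start_urls_key(spider_name, template="%(name)s:start_urls"):
    """Build the Redis list key a spider reads its start URLs from."""
    return template % {"name": spider_name}


def seed_spiders(client, spider_names, urls):
    """Push the same start URLs onto the queue of every named spider.

    Returns a mapping of Redis key to the number of URLs pushed to it.
    """
    urls = list(urls)
    if not urls:
        return {}  # LPUSH needs at least one value
    pushed = {}
    for name in spider_names:
        key = start_urls_key(name)
        client.lpush(key, *urls)
        pushed[key] = len(urls)
    return pushed


# Hypothetical usage with redis-py:
# import redis
# seed_spiders(redis.Redis(), ["myspider1", "myspider2"], ["http://example.com/"])
```

If the spiders should instead compete for a single queue, giving them the same `redis_key` attribute should have that effect, with each URL then consumed by only one of them.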

What is the simplest way to programmatically start a crawler in Scrapy >= 0.14

I want to start a crawler in Scrapy from a Python module. I want to essentially mimic the essence of $ scrapy crawl my_crawler -a some_arg=value -L DEBUG
I have the following things in place:
a settings.py file for the project
items and pipelines
a crawler class which extends BaseSpider and requires arguments upon initialisation.
I can quite happily run my project using the scrapy command as specified above, however I'm writing integration tests and I want to programmatically:
launch the crawl using the settings in settings.py and the crawler that has the my_crawler name attribute (I can instantiate this class easily from my test module).
I want all the pipelines and middleware to be used as per the specification in settings.py.
I'm quite happy for the process to be blocked until the crawler has finished. The pipelines dump things in a DB and it's the contents of the DB I'll be inspecting after the crawl is done to satisfy my tests.
So, can anyone help me? I've seen some examples on the net but they are either hacks for multiple spiders, or getting around Twisted's blocking nature, or don't work with Scrapy 0.14 or above. I just need something real simple. :-)
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent
See this part of the docs
@wilfred's answer from the official docs works fine except for the logging part. Here's mine:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider()
crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start_from_settings(get_project_settings())
reactor.run()
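Since the goal is to mimic `scrapy crawl my_crawler -a some_arg=value -L DEBUG` and blocking until the crawl ends is acceptable, another option for integration tests is to run that command in a child process. This is a minimal sketch, assuming Scrapy is installed for the same interpreter and that `project_dir` points at the directory containing the project's scrapy.cfg (so settings.py, pipelines and middleware are picked up as usual). `build_crawl_command` and `run_crawl` are hypothetical helper names.

```python
import subprocess
import sys


def build_crawl_command(spider_name, log_level=None, **spider_args):
    """Return the argv list equivalent to `scrapy crawl <name> -a k=v -L LEVEL`."""
    cmd = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    for key, value in spider_args.items():
        cmd += ["-a", f"{key}={value}"]  # one -a flag per spider argument
    if log_level is not None:
        cmd += ["-L", log_level]
    return cmd


def run_crawl(spider_name, project_dir=".", log_level=None, **spider_args):
    """Run the crawl and block until it finishes.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    cmd = build_crawl_command(spider_name, log_level, **spider_args)
    return subprocess.run(cmd, cwd=project_dir, check=True)


# Hypothetical usage in a test, before inspecting the database:
# run_crawl("my_crawler", log_level="DEBUG", some_arg="value")
```

Each call starts a new interpreter, so the reactor is never reused between tests; the trade-off is that the test cannot reach into the spider object directly and has to inspect side effects such as the database contents.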
