Scrapy: Running multiple spiders from the same Python process via cmdline fails - python

Here's the code:
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl spider_a -L INFO".split())
    cmdline.execute("scrapy crawl spider_b -L INFO".split())
I intend to run multiple spiders from the same main entry point of a Scrapy project, but it turns out that only the first spider runs successfully, whereas the second one seems to be ignored. Any suggestions?

Just do
import subprocess
subprocess.call('for spider in spider_a spider_b; do scrapy crawl $spider -L INFO; done', shell=True)
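The reason the original snippet stops after the first spider is that cmdline.execute ends up calling sys.exit once the first command finishes, so the second call is never reached; running each crawl in its own subprocess sidesteps that. A pure-Python sketch of the same loop, assuming the spider names spider_a and spider_b exist in the project:
import subprocess

# run each spider in its own scrapy process, one after the other
for spider in ["spider_a", "spider_b"]:
    subprocess.run(["scrapy", "crawl", spider, "-L", "INFO"], check=True)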

From the scrapy documentation: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
import scrapy
from scrapy.crawler import CrawlerProcess
from .spiders import Spider1, Spider2
process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start() # the script will block here until all crawling jobs are finished
EDIT: If you wish to run multiple spiders two by two, you can do the following:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
configure_logging()
spiders = [Spider1, Spider2, Spider3, Spider4]

@defer.inlineCallbacks
def join_spiders(spider_group):
    """Set up a new runner with the provided spiders and wait for all of them to finish"""
    runner = CrawlerRunner()
    # Add each spider to the current runner
    for spider in spider_group:
        runner.crawl(spider)
    # This will fire once all the spiders inside the runner have finished
    yield runner.join()

@defer.inlineCallbacks
def crawl(group_by=2):
    # Run the spiders `group_by` at a time, waiting for each group to finish
    for i in range(0, len(spiders), group_by):
        yield join_spiders(spiders[i:i + group_by])
    # When we have finished running all the spiders, stop the twisted reactor
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
I haven't tested all of this though, so let me know if it works!

twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

I am getting this error when I run a crawl process multiple times.
I am using Scrapy 2.6.
This is my code:
from scrapy.crawler import CrawlerProcess
from football.spiders.laliga import LaligaSpider
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(settings=get_project_settings())
for i in range(1, 29):
    process.crawl(LaligaSpider, **{'week': i})
process.start()
This worked for me; I put it before creating the CrawlerProcess:
import sys
if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"]
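For clarity, here is a sketch of where that workaround sits relative to the original loop (everything else is unchanged):
import sys

# drop the already-installed reactor module before Scrapy tries to install its own
if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"]

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from football.spiders.laliga import LaligaSpider

process = CrawlerProcess(settings=get_project_settings())
for i in range(1, 29):
    process.crawl(LaligaSpider, **{'week': i})
process.start()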
This solution avoids use of CrawlerProcess as stated in the docs.
https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from football.spiders.laliga import LaligaSpider
# Enable logging for CrawlerRunner
configure_logging()
runner = CrawlerRunner(settings=get_project_settings())
for i in range(1, 29):
    runner.crawl(LaligaSpider, **{'week': i})
deferred = runner.join()
deferred.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
I've just run into this issue as well. It appears that the docs at https://docs.scrapy.org/en/latest/topics/practices.html are incorrect in stating that CrawlerProcess can be used to run multiple crawlers, since each new crawler attempts to install a new reactor if you give it a spider. I was able to get my code to work by using CrawlerRunner instead, as also detailed on the same page.
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
configure_logging()
settings = get_project_settings() # settings not required if running
runner = CrawlerRunner(settings) # from script, defaults provided
runner.crawl(MySpider1) # your loop would go here
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
I encountered this problem as well, but it was solved after updating both Scrapy and Twisted. These are the current versions of the packages:
Twisted==22.8.0
Scrapy==2.6.2
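If you want to confirm which versions are actually installed, a quick check (assuming both packages expose __version__, which recent releases do) is:
import scrapy
import twisted

print("Scrapy:", scrapy.__version__)
print("Twisted:", twisted.__version__)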

Scrapy Running multiple spiders from one file

I have made one file with 2 spiders/classes. The 2nd spider will use some data from the first one, but it doesn't seem to work. Here is what I do to initiate and start the spiders:
process = CrawlerProcess()
process.crawl(Zoopy1)
process.crawl(Zoopy2)
process.start()
What do you suggest?
Your code will run the 2 spiders simultaneously.
Running spiders sequentially (starting Zoopy2 after completion of Zoopy1) can be achieved with @defer.inlineCallbacks:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
...
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Zoopy1)
    yield runner.crawl(Zoopy2)
    reactor.stop()
crawl()
reactor.run()
An alternative option (if it is suitable for your task) is to merge the logic from the 2 spiders into a single spider class, as sketched below.
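A rough sketch of that idea (the URLs, selectors, and field names are hypothetical; cb_kwargs needs Scrapy 1.7+), where the first stage's data is passed on to the second stage inside one class:
import scrapy

class ZoopyCombinedSpider(scrapy.Spider):
    name = "zoopy_combined"
    start_urls = ["https://example.com/listings"]  # hypothetical start page

    def parse(self, response):
        # logic that used to live in Zoopy1
        for href in response.css("a.item::attr(href)").getall():
            # hand first-stage data to the second stage via cb_kwargs
            yield response.follow(href, callback=self.parse_detail,
                                  cb_kwargs={"listing_url": response.url})

    def parse_detail(self, response, listing_url):
        # logic that used to live in Zoopy2, with the first-stage data available
        yield {"listing_url": listing_url, "title": response.css("h1::text").get()}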

CrawlerProcess vs CrawlerRunner

Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:
using CrawlerProcess
using CrawlerRunner
What is the difference between the two? When should I use "process" and when "runner"?
Scrapy's documentation does a pretty bad job of giving examples of real applications of both.
CrawlerProcess assumes that Scrapy is the only thing that is going to use Twisted's reactor. If you are using threads in Python to run other code, this isn't always true. Let's take this as an example.
from scrapy.crawler import CrawlerProcess
import scrapy
def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
notThreadSafe(3) # it will get executed when the crawlers stop
Now, as you can see, the function only gets executed when the crawlers stop. What if I want the function to be executed while the crawlers crawl in the same reactor?
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy
def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run() #it will run both crawlers and code inside the function
The Runner class is not limited to this functionality; you may want some custom setup on your reactor (defer, threads, getPage, custom error reporting, etc.).
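For instance, here is a minimal sketch (reusing the MySpider1 definition above; blocking_work is a hypothetical helper) that pushes blocking work onto the reactor's thread pool while a crawl is in progress:
from twisted.internet import reactor, threads
from scrapy.crawler import CrawlerRunner

def blocking_work(x):
    # hypothetical slow call that should not block the reactor thread
    return x * 2

runner = CrawlerRunner()
runner.crawl(MySpider1)

# schedule the blocking call on the reactor's thread pool
d = threads.deferToThread(blocking_work, 3)
d.addCallback(lambda result: print("blocking_work finished:", result))

runner.join().addBoth(lambda _: reactor.stop())
reactor.run()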
CrawlerRunner:
This class shouldn’t be needed (since Scrapy is responsible of using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
CrawlerProcess:
This utility should be a better fit than CrawlerRunner if you aren’t running another Twisted reactor within your application.
It sounds like the CrawlerProcess is what you want unless you're adding your crawlers to an existing Twisted application.

How to stop reactor when both spiders finished

I have this code, and when both spiders have finished, the program is still running.
#!C:\Python27\python.exe
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from carrefour.spiders.tesco import TescoSpider
from carrefour.spiders.carr import CarrSpider
from scrapy.utils.project import get_project_settings
import threading
import time
def tescofcn():
    tescoSpider = TescoSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(tescoSpider)
    crawler.start()

def carrfcn():
    carrSpider = CarrSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(carrSpider)
    crawler.start()

t1 = threading.Thread(target=tescofcn)
t2 = threading.Thread(target=carrfcn)
t1.start()
t2.start()
log.start()
reactor.run()
When I tried inserting this into both functions:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
the spider that finished first stopped the reactor for both spiders, and the slower spider was terminated even though it had not finished.
What you could do is create a function that checks the list of running spiders and connect that to signals.spider_closed.
from scrapy.utils.trackref import iter_all
def close_reactor_if_no_spiders():
    running_spiders = [spider for spider in iter_all('Spider')]
    if not running_spiders:
        reactor.stop()
crawler.signals.connect(close_reactor_if_no_spiders, signal=signals.spider_closed)
That said, I would still recommend using scrapyd to manage running multiple spiders.
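For example, with a scrapyd instance running on the default port and the project deployed (the project and spider names below are guesses based on the imports above), each spider can be scheduled over scrapyd's HTTP API:
import requests

for spider in ["tesco", "carr"]:
    response = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "carrefour", "spider": spider},
    )
    print(response.json())  # contains the job id on success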

What is the simplest way to programmatically start a crawler in Scrapy >= 0.14

I want to start a crawler in Scrapy from a Python module. I want to essentially mimic $ scrapy crawl my_crawler -a some_arg=value -L DEBUG
I have the following things in place:
a settings.py file for the project
items and pipelines
a crawler class which extends BaseSpider and requires arguments upon initialisation.
I can quite happily run my project using the scrapy command as specified above; however, I'm writing integration tests and I want to programmatically:
launch the crawl using the settings in settings.py and the crawler that has the my_crawler name attribute (I can instantiate this class easily from my test module).
I want all the pipelines and middleware to be used as per the specification in settings.py.
I'm quite happy for the process to be blocked until the crawler has finished. The pipelines dump things in a DB and it's the contents of the DB I'll be inspecting after the crawl is done to satisfy my tests.
So, can anyone help me? I've seen some examples on the net but they are either hacks for multiple spiders, or getting around Twisted's blocking nature, or don't work with Scrapy 0.14 or above. I just need something real simple. :-)
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent
See this part of the docs
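If the goal is an integration test, the pattern above can be wrapped into a blocking helper (a sketch against the Scrapy 0.14-era API used here; MyCrawler and the DB check are placeholders for your own code, and note that the Twisted reactor cannot be restarted, so only one such run per process):
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

def run_spider_blocking(spider):
    crawler = Crawler(get_project_settings())
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks until spider_closed stops the reactor

def test_crawl_populates_db():
    run_spider_blocking(MyCrawler(some_arg='value'))  # placeholder spider class
    # inspect the DB your pipelines write to, e.g.:
    # assert count_rows('items') > 0  # hypothetical helper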
@wilfred's answer from the official docs works fine except for the logging part; here's mine:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider()
crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start_from_settings(get_project_settings())
reactor.run()
