With:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

I've always run this process successfully:
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
but since I've moved this code into a web_crawler(self) function, like so:
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()

    # (...)

    return (result1, result2)
and started calling the method using class instantiation, like:
def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]
and running:
test()
I am getting the following error:
Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
What is wrong?
You cannot restart the reactor, but you should be able to run it more times by forking a separate process:
import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Run it twice:
configure_logging()
print('first run:')
run_spider(QuotesSpider)
print('\nsecond run:')
run_spider(QuotesSpider)
Result:
first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
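If you also need the scraped items back in the parent process (the wrapper above only reports success or an exception through the queue), one possible extension of the same idea is to collect items via the item_scraped signal and put them on the queue. This is my own sketch, not part of the original answer; run_spider_collect is a made-up name, and the items must be picklable to cross the process boundary:

from multiprocessing import Process, Queue

import scrapy.crawler as crawler
from scrapy import signals
from twisted.internet import reactor

def run_spider_collect(spider):
    def f(q):
        try:
            items = []
            runner = crawler.CrawlerRunner()
            c = runner.create_crawler(spider)
            # append every scraped item; only the 'item' argument of the signal is used
            c.signals.connect(lambda item: items.append(item), signal=signals.item_scraped)
            deferred = runner.crawl(c)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(items)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if isinstance(result, Exception):
        raise result
    return result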
This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question.

0) pip install crochet
1) add from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
   a) d.addBoth(lambda _: reactor.stop())
   b) reactor.run()

I had the same problem with this error and spent 4+ hours solving it, reading all the questions here about it. I finally found that one, and I'm sharing it. This is how I solved it. The only meaningful lines left from the Scrapy docs are the last two lines of this code:
# some more imports
from importlib import import_module
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from crochet import setup
setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)          # do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()                # get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderObj)                         # from Scrapy docs
This code allows me to select which spider to run just by passing its name to the run_spider function, and after scraping finishes, to select another spider and run it again.
Hope this helps somebody, as it helped me :)
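One thing to be aware of: crochet's setup() runs the reactor in a background thread, so crawler.crawl() returns a Deferred immediately rather than waiting for the crawl to finish. If you need a blocking call, crochet also provides a wait_for decorator that can wrap the same logic. The sketch below is only an assumption-laden variant: it assumes the same project layout as above, an arbitrary 10-minute timeout, and run_spider_blocking is a name I made up.

from importlib import import_module

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

@wait_for(timeout=600)  # blocks the caller until the crawl's Deferred fires, or raises crochet.TimeoutError
def run_spider_blocking(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)          # same dynamic import as above
    crawler = CrawlerRunner(get_project_settings())
    return crawler.crawl(scrapy_var.mySpider())      # returning the Deferred is what wait_for waits on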
As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:
"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."
The error you are receiving is being thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jimmy-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.
Honestly, if you think you need to restart the reactor, you're likely doing something wrong.
Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation.
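For reference, the pattern shown in that documentation section schedules every crawl up front and starts the reactor exactly once, so it never needs to be restarted. A minimal sketch with placeholder spider classes (FirstSpider and SecondSpider are not from the question):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class FirstSpider(scrapy.Spider):
    # placeholder spiders standing in for your own
    name = 'first'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        yield {'url': response.url}


class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        yield {'url': response.url}


process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)   # schedule both crawls first...
process.crawl(SecondSpider)
process.start()              # ...then start the reactor once; it stops when both finish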
As some people pointed out already: You shouldn't need to restart the reactor.
Ideally if you want to chain your processes (crawl1 then crawl2 then crawl3) you simply add callbacks.
For example, I've been using this loop spider that follows this pattern:
1. Crawl A
2. Sleep N
3. goto 1
And this is how it looks in scrapy:
import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)


def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here


def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d


def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()


if __name__ == '__main__':
    loop_crawl()
To explain the process a bit more: the crawl function schedules a crawl and adds two extra callbacks that are called when crawling is over: a blocking sleep and a recursive call to itself (which schedules another crawl).
$ python endless_crawl.py
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
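A side note on the design: time.sleep blocks the reactor thread for the whole pause. If that matters in your setup, a non-blocking variant of the same chaining idea is to schedule the next crawl with twisted.internet.task.deferLater. This is my sketch, not part of the original answer, and it reuses HttpbinSpider and the runner from the snippet above:

from twisted.internet import reactor, task

def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    # wait 5 seconds without blocking the reactor, then schedule the next crawl
    d.addBoth(lambda _: task.deferLater(reactor, 5, crawl, runner))
    return d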
The mistake is in this code:
def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0]  # here
web_crawler() returns two results, and for that purpose it tries to start the process twice, restarting the reactor, as pointed out by @Rejected.
Obtaining both results by running one single process, and storing them in a tuple, is the way to go here:
def __call__(self):
    result1, result2 = test.web_crawler()
This solved my problem. Put the code below after reactor.run() or process.start():
time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)
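A minimal sketch of how this looks in context; the spider below is just a placeholder, and in a real script you would decide whether another run is actually needed before re-executing. os.execl replaces the current process with a fresh Python interpreter running the same script, so the next run gets a brand-new, startable reactor:

import os
import sys
import time

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class MySpider(scrapy.Spider):
    # placeholder spider standing in for whatever you are actually running
    name = 'my_spider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}


process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # blocks until crawling is finished; the reactor is now stopped for good

# ... decide here whether another run is needed, then restart the whole script:
time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)  # replaces this process with a fresh one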
Related
I have made one file with two spiders/classes. The second spider will use some data from the first one, but it doesn't seem to work. Here is what I do to initiate and start the spiders:
process=CrawlerProcess()
process.crawl(Zoopy1)
process.crawl(Zoopy2)
process.start()
What do you suggest?
Your code will run 2 spiders simultaneously.
Running spiders sequentially (starting Zoopy2 after the completion of Zoopy1) can be achieved with @defer.inlineCallbacks:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
...
configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Zoopy1)
    yield runner.crawl(Zoopy2)
    reactor.stop()

crawl()
reactor.run()
An alternative option (if it is suitable for your task) is to merge the logic from the two spiders into a single spider class.
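For illustration, the merge could look roughly like the sketch below; the class name, URLs, and cb_kwargs plumbing are made-up placeholders, since the real Zoopy1/Zoopy2 logic isn't shown:

import scrapy


class ZoopyCombinedSpider(scrapy.Spider):
    name = 'zoopy_combined'
    start_urls = ['https://example.com/first']  # where Zoopy1 would start

    def parse(self, response):
        # ... Zoopy1 logic: collect whatever Zoopy2 needs ...
        shared_value = response.css('title::text').get()
        # then kick off the Zoopy2 stage, passing the data along
        yield scrapy.Request('https://example.com/second',
                             callback=self.parse_second,
                             cb_kwargs={'shared_value': shared_value})

    def parse_second(self, response, shared_value):
        # ... Zoopy2 logic, using data from the first stage ...
        yield {'url': response.url, 'from_first_stage': shared_value}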
My GUI has an "Update Database" button, and every time the user presses it, I want to start a Scrapy spider that stores the scraped data in a Sqlite3 database. I implemented qt5reactor, as this answer suggests, but now I'm getting a ReactorNotRestartable error when I press the update button a second time. How can I get around this? I tried switching from CrawlerRunner to CrawlerProcess, but it still throws the same error (maybe I'm doing it wrong, though). I also cannot use this answer, because q.get() locks the event loop, so the GUI freezes when I run the spider. I'm new to multiprocessing, so sorry if I'm missing something incredibly obvious.
In main.py
... # PyQt5 imports
import qt5reactor
from scrapy import crawler
from twisted.internet import reactor
from currency_scraper.currency_scraper.spiders.investor import InvestorSpider


class MyGUI(QMainWindow):
    def __init__(self):
        self.update_db_button.clicked.connect(self.on_clicked_update)
        ...

    def on_clicked_update(self):
        """Gives command to run scraper and fetch data from the website"""
        runner = crawler.CrawlerRunner(
            {
                "USER_AGENT": "currency scraper",
                "SCRAPY_SETTINGS_MODULE": "currency_scraper.currency_scraper.settings",
                "ITEM_PIPELINES": {
                    "currency_scraper.currency_scraper.pipelines.Sqlite3Pipeline": 300,
                }
            }
        )
        deferred = runner.crawl(InvestorSpider)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()  # has to be run here or the crawling doesn't start
        update_notification()

... # other stuff

if __name__ == "__main__":
    open_window()
    qt5reactor.install()
    reactor.run()
Error log:
Traceback (most recent call last):
  File "c:/Users/Familia/Documents/Programação/Python/Projetos/Currency_converter/main.py", line 330, in on_clicked_update
    reactor.run()
  File "c:\Users\Familia\Documents\Programação\Python\Projetos\Currency_converter\venv\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\Users\Familia\Documents\Programação\Python\Projetos\Currency_converter\venv\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "c:\Users\Familia\Documents\Programação\Python\Projetos\Currency_converter\venv\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I'm trying to scrape some urls with Scrapy and Selenium.
Some of the urls are processed by Scrapy directly and the others are handled with Selenium first.
The problem is: while Selenium is handling a url, Scrapy is not processing the others in parallel. It waits for the webdriver to finish its work.
I have tried to run multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried to spawn another process in the parse method, but it seems that I don't have enough experience to make it right.
In the example below all the urls are printed only when the webdriver is closed. Please advise, is there any way to make it run "in parallel"?
import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)
It seems that I've found a solution.
I decided to use multiprocessing, running one spider in each process and passing a task as its init parameter. In some cases this approach may be inappropriate, but it works for me.
I tried this way before but I was getting the twisted.internet.error.ReactorNotRestartable exception. It was caused by calling the start() method of the CrawlerProcess in each process multiple times, which is incorrect. Here I found a simple and clear example of running a spider in a loop using callbacks.
So I split my tasks list between the processes. Then inside the crawl(tasks) method I make a chain of callbacks to run my spider multiple times passing a different task as its init parameter every time.
import multiprocessing

import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]


def crawl(tasks):
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        if index < len(tasks):
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()


def main():
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        pool.map(crawl, np.array_split(tasks, processes))


if __name__ == '__main__':
    main()
The code of TestSpider in my question post must be modified accordingly to accept a task as an init parameter.
def __init__(self, task):
    scrapy.Spider.__init__(self)
    self.task = task

def start_requests(self):
    yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)
So, I made this class so that I can crawl on-demand using Scrapy:
from scrapy import signals
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy.settings import Settings


class NewsCrawler(object):

    def __init__(self, spiders=[]):
        self.spiders = spiders
        self.settings = Settings()

    def crawl(self, start_date, end_date):
        crawled_items = []

        def add_item(item):
            crawled_items.append(item)

        process = CrawlerProcess(self.settings)

        for spider in self.spiders:
            crawler = Crawler(spider, self.settings)
            crawler.signals.connect(add_item, signals.item_scraped)
            process.crawl(crawler, start_date=start_date, end_date=end_date)

        process.start()

        return crawled_items
Basically, I have a long running process and I will call the above class' crawl method multiple times, like this:
import time

crawler = NewsCrawler(spiders=[Spider1, Spider2])

while True:
    items = crawler.crawl(start_date, end_date)
    # do something with crawled items ...
    time.sleep(3600)
The problem is, the second time crawl is called, this error occurs: twisted.internet.error.ReactorNotRestartable.
From what I gathered, it's because the reactor can't be run again after it has been stopped. Is there any workaround for that?
Thanks!
This is a limitation of Scrapy (Twisted) at the moment and makes it hard to use Scrapy as a library.
What you can do is fork a new process which runs the crawler and stops the reactor when the crawl is finished. You can then wait for join and spawn a new process after the crawl has finished. If you want to handle the items in your main thread, you can post the results to a Queue. I would recommend using customized pipelines for your items, though.
Have a look at the following answer by me: https://stackoverflow.com/a/22202877/2208253
You should be able to apply the same principles. But you would rather use multiprocessing instead of billiard.
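A minimal sketch of that fork-and-queue idea applied to the NewsCrawler class from the question (crawl_once and _crawl_in_subprocess are names I made up, and the scraped items must be picklable to cross the process boundary):

import multiprocessing


def _crawl_in_subprocess(queue, spiders, start_date, end_date):
    # child process: the Twisted reactor here is brand new, so start() works every time
    crawler = NewsCrawler(spiders=spiders)
    queue.put(crawler.crawl(start_date, end_date))


def crawl_once(spiders, start_date, end_date):
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_crawl_in_subprocess,
                                args=(queue, spiders, start_date, end_date))
    p.start()
    items = queue.get()  # drain the queue before join() to avoid a deadlock on large results
    p.join()
    return items


# in the long-running loop, call crawl_once(...) instead of crawler.crawl(...)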
Based on @bj-blazkowicz's answer above, I found a solution with CrawlerRunner, which is the recommended crawler to use when running multiple spiders, as stated in the docs: https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
Using this class the reactor should be explicitly run after scheduling your spiders. It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
Code in your main file:
from multiprocessing import Process, Queue

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

# Enable logging for CrawlerRunner
configure_logging()


class CrawlerRunnerProcess(Process):
    def __init__(self, spider, q, *args):
        Process.__init__(self)
        self.runner = CrawlerRunner(get_project_settings())
        self.spider = spider
        self.q = q
        self.args = args

    def run(self):
        deferred = self.runner.crawl(self.spider, self.q, self.args)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run(installSignalHandlers=False)


# The wrapper to make it run multiple spiders, multiple times
def run_spider(spider, *args):  # optional arguments
    q = Queue()  # optional queue to return spider results
    runner = CrawlerRunnerProcess(spider, q, *args)
    runner.start()
    runner.join()
    return q.get()
Code in your spider file:
class MySpider(Spider):
    name = 'my_spider'

    def __init__(self, q, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.q = q  # optional queue
        self.args = args  # optional args

    def parse(self, response):
        my_item = MyItem()
        self.q.put(my_item)
        yield my_item
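A hypothetical usage of the wrapper above, running the same spider twice; each call forks its own process, so the reactor is fresh every time:

if __name__ == '__main__':
    first_result = run_spider(MySpider)   # returns whatever the spider put on the queue
    second_result = run_spider(MySpider)  # works again, because this runs in a new process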
I'm trying to execute a Scrapy spider in a separate script, and when I execute this script in a loop (for instance, running the same spider with different parameters), I get ReactorAlreadyRunning. My snippet:
from celery import task
from episode.skywalker.crawlers import settings
from multiprocessing.queues import Queue
from scrapy import log, project, signals
from scrapy.settings import CrawlerSettings
from scrapy.spider import BaseSpider
from scrapy.spidermanager import SpiderManager
from scrapy.xlib.pydispatch import dispatcher
import multiprocessing
from twisted.internet.error import ReactorAlreadyRunning


class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        from scrapy.crawler import CrawlerProcess
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue
        self.crawler = CrawlerProcess(CrawlerSettings(settings))
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        try:
            self.crawler.start()
        except ReactorAlreadyRunning:
            pass

        self.crawler.stop()
        self.result_queue.put(self.items)


@task
def execute_spider(spider, **spider__kwargs):
    '''
    Execute spider within separate process
    @param spider: spider class to crawl or the name (check if instance)
    '''
    if not isinstance(spider, BaseSpider):
        manager = SpiderManager(settings.SPIDER_MODULES)
        spider = manager.create(spider, **spider__kwargs)
    result_queue = Queue()
    crawler = CrawlerWorker(spider, result_queue)
    crawler.start()
    items = []
    for item in result_queue.get():
        items.append(item)
My suggestion is that it is caused by multiple Twisted reactor runs.
How can I avoid it? Is there, in general, a way to run the spiders without the reactor?
I figured out what caused the problem: if you somehow call the execute_spider method inside the CrawlerWorker process (for instance via recursion), it causes the creation of a second reactor, which is not possible.
My solution: move all statements that cause recursive calls into the execute_spider method, so that they trigger the spider execution in the same process, not in a secondary CrawlerWorker. I also built in a statement like this
try:
    self.crawler.start()
except ReactorAlreadyRunning:
    raise RecursiveSpiderCall("Spider %s was called from another spider recursively. Such behavior is not allowed" % (self.spider))
to catch unintentional recursive calls of spiders.
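RecursiveSpiderCall is not a Scrapy or Twisted class; presumably it is a custom exception defined elsewhere in that project, something as simple as:

class RecursiveSpiderCall(Exception):
    """Raised when a spider is started recursively from within another spider's process."""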