ReactorNotRestartable with scrapy when using Google Cloud Functions - python

I am trying to send multiple crawl requests with Google Cloud Functions. However, I seem to be getting the ReactorNotRestartable error. From other posts on StackOverflow, such as this one, I understand that this happens because it is not possible to restart the reactor, in particular when doing it in a loop.
The way to solve this is by putting the start() outside the for loop. However, with Cloud Functions this is not possible as each request should be technically independent.
Is the CrawlerProcess somehow cached with Cloud Functions? And if so, how can we remove this behaviour?
I tried, for instance, to put the import and initialization inside the function, instead of outside, to prevent the caching of imports, but that did not work:
# main.py
def run_single_crawl(data, context):
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()

By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.
Instead, we can use scrapydo to run your existing spider in a blocking fashion:
requirements.txt:
scrapydo
main.py:
import scrapy
import scrapydo

scrapydo.setup()


class MyItem(scrapy.Item):
    url = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield MyItem(url=response.url)


def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)
This also shows a simple example of how to yield one or more scrapy.Item from the spider and collect the results from the crawl, which would also be challenging to do if not using scrapydo.
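For illustration, here is a small sketch of working with those collected items inside the function, assuming (as in the snippet above) that scrapydo.run_spider returns the list of captured items:
# Sketch only: inspect the items collected by scrapydo.run_spider.
# Assumes run_spider returns the scraped items, as used above.
def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)
    for item in results:
        # MyItem behaves like a dict, so fields are easy to read
        print(item['url'])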
Also: make sure that you have billing enabled for your project. Without billing, Cloud Functions cannot make outbound requests, so the crawl will appear to succeed but return no results.

You can simply crawl the spiders in sequence.
main.py:
from scrapy.crawler import CrawlerProcess

def run_single_crawl(data, context):
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()
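If the two crawls are really the same spider run with different parameters, a variation of the snippet above passes keyword arguments through process.crawl; the category argument here is hypothetical and would have to be accepted by MySpider:
# Sketch: queue the same spider twice with different (made-up) arguments,
# then start the single reactor once. `category` is a hypothetical
# argument that MySpider would need to accept.
from scrapy.crawler import CrawlerProcess

def run_single_crawl(data, context):
    process = CrawlerProcess()
    process.crawl(MySpider, category='books')
    process.crawl(MySpider, category='music')
    process.start()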

Related

How can I make Selenium run in parallel with Scrapy?

I'm trying to scrape some urls with Scrapy and Selenium.
Some of the urls are processed by Scrapy directly and the others are handled with Selenium first.
The problem is: while Selenium is handling a url, Scrapy is not processing the others in parallel. It waits for the webdriver to finish its work.
I have tried to run multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried to spawn another process in the parse method, but it seems that I don't have enough experience to make it right.
In the example below, all the urls are printed only once the webdriver is closed. Please advise: is there any way to make it run "in parallel"?
import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))
        for url in response.xpath('//a/@href').getall():
            print(url)
It seems that I've found a solution.
I decided to use multiprocessing, running one spider in each process and passing a task as its init parameter. In some cases this approach may be inappropriate, but it works for me.
I tried this way before but I was getting the twisted.internet.error.ReactorNotRestartable exception. It was caused by calling the start() method of the CrawlerProcess in each process multiple times, which is incorrect. Here I found a simple and clear example of running a spider in a loop using callbacks.
So I split my tasks list between the processes. Then, inside the crawl(tasks) method, I make a chain of callbacks to run my spider multiple times, passing a different task as its init parameter each time.
import multiprocessing

import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]


def crawl(tasks):
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        if index < len(tasks):
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()


def main():
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        pool.map(crawl, np.array_split(tasks, processes))


if __name__ == '__main__':
    main()
The code of TestSpider in my question post must be modified accordingly to accept a task as an init parameter.
def __init__(self, task):
    scrapy.Spider.__init__(self)
    self.task = task

def start_requests(self):
    yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)

Scrapy: how to use multiple spiders in one redis queue

I'm learning scrapy-redis and I have multiple spiders in one scrapy-redis project. How can I cleanly control when each spider starts and stops? And is it a good idea to keep multiple spiders in one project just so every spider can share the same settings?
My code is like this
from scrapy_redis.spiders import RedisSpider
from scrapy.crawler import CrawlerProcess


class MySpider1(RedisSpider):
    ...


class MySpider2(RedisSpider):
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
I have to lpush a start_url for every spider, which is inconvenient. Is there any smarter way to control it?
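For reference, a minimal sketch of the lpush step described above, assuming the default scrapy-redis key pattern <spider_name>:start_urls and a local Redis instance; the spider names and URLs are placeholders:
# Sketch only: push a start URL onto each spider's scrapy-redis queue.
# Assumes the default REDIS_START_URLS_KEY pattern "<spider>:start_urls"
# and a Redis server on localhost; names and URLs are placeholders.
import redis

r = redis.Redis(host='localhost', port=6379)
start_urls = {
    'myspider1': 'http://example.com/',
    'myspider2': 'http://example.org/',
}
for spider_name, url in start_urls.items():
    r.lpush(f'{spider_name}:start_urls', url)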

CrawlerProcess vs CrawlerRunner

Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:
using CrawlerProcess
using CrawlerRunner
What is the difference between the two? When should I use "process" and when "runner"?
Scrapy's documentation does a pretty bad job of giving examples of real applications of both.
CrawlerProcess assumes that scrapy is the only thing that is going to use twisted's reactor. If you are using threads in python to run other code, this isn't always true. Let's take this as an example.
from scrapy.crawler import CrawlerProcess
import scrapy


def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
notThreadSafe(3)  # it will get executed when the crawlers stop
Now, as you can see, the function only gets executed when the crawlers stop. What if I want the function to be executed while the crawlers crawl in the same reactor?
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy


def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run()  # it will run both crawlers and code inside the function
The Runner class is not limited to this functionality; you may want some custom settings on your reactor (defer, threads, getPage, custom error reporting, etc.).
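As one illustration of that point, a minimal sketch using twisted's deferToThread to run a blocking helper on the same reactor alongside the crawls; blocking_work and its callback are just placeholders:
# Sketch: run other work on the same reactor while CrawlerRunner crawls.
# blocking_work is a placeholder for real blocking code.
from twisted.internet import reactor, threads
from scrapy.crawler import CrawlerRunner

def blocking_work():
    return sum(range(1_000_000))  # stand-in for real blocking work

runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)

d = threads.deferToThread(blocking_work)  # runs in the reactor's thread pool
d.addCallback(lambda total: print('side work done:', total))

runner.join().addBoth(lambda _: reactor.stop())  # stop once all crawls finish
reactor.run()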
CrawlerRunner:
This class shouldn’t be needed (since Scrapy is responsible of using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
CrawlerProcess:
This utility should be a better fit than CrawlerRunner if you aren’t running another Twisted reactor within your application.
It sounds like the CrawlerProcess is what you want unless you're adding your crawlers to an existing Twisted application.

Start scrapy multiple spider without blocking the process

I'm trying to execute a scrapy spider in a separate script, and when I execute this script in a loop (for instance, running the same spider with different parameters), I get ReactorAlreadyRunning. My snippet:
from celery import task
from episode.skywalker.crawlers import settings
from multiprocessing.queues import Queue
from scrapy import log, project, signals
from scrapy.settings import CrawlerSettings
from scrapy.spider import BaseSpider
from scrapy.spidermanager import SpiderManager
from scrapy.xlib.pydispatch import dispatcher
import multiprocessing

from twisted.internet.error import ReactorAlreadyRunning


class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        from scrapy.crawler import CrawlerProcess
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue
        self.crawler = CrawlerProcess(CrawlerSettings(settings))
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        try:
            self.crawler.start()
        except ReactorAlreadyRunning:
            pass
        self.crawler.stop()
        self.result_queue.put(self.items)


@task
def execute_spider(spider, **spider__kwargs):
    '''
    Execute spider within separate process
    @param spider: spider class to crawl or the name (check if instance)
    '''
    if not isinstance(spider, BaseSpider):
        manager = SpiderManager(settings.SPIDER_MODULES)
        spider = manager.create(spider, **spider__kwargs)
    result_queue = Queue()
    crawler = CrawlerWorker(spider, result_queue)
    crawler.start()
    items = []
    for item in result_queue.get():
        items.append(item)
My guess is that it is caused by multiple twisted reactor runs.
How can I avoid it? Is there, in general, a way to run the spiders without the reactor?
I figured out what caused the problem: if you somehow call the execute_spider method inside the CrawlerWorker process (for instance via recursion), it causes a second reactor to be created, which is not possible.
My solution: move all statements that cause recursive calls into the execute_spider method, so they trigger the spider execution in the same process, not in a secondary CrawlerWorker. I also built in the following check
try:
    self.crawler.start()
except ReactorAlreadyRunning:
    raise RecursiveSpiderCall("Spider %s was called from another spider recursively. Such behavior is not allowed" % (self.spider))
to catch unintentional recursive calls of spiders.

What is the simplest way to programmatically start a crawler in Scrapy >= 0.14

I want to start a crawler in Scrapy from a Python module. I want to essentially mimic the essence of $ scrapy crawl my_crawler -a some_arg=value -L DEBUG
I have the following things in place:
a settings.py file for the project
items and pipelines
a crawler class which extends BaseSpider and requires arguments upon initialisation.
I can quite happily run my project using the scrapy command as specified above, however I'm writing integration tests and I want to programmatically:
launch the crawl using the settings in settings.py and the crawler that has the my_crawler name attribute (I can instantiate this class easily from my test module).
I want all the pipelines and middleware to be used as per the specification in settings.py.
I'm quite happy for the process to be blocked until the crawler has finished. The pipelines dump things in a DB and it's the contents of the DB I'll be inspecting after the crawl is done to satisfy my tests.
So, can anyone help me? I've seen some examples on the net but they are either hacks for multiple spiders, or getting around Twisted's blocking nature, or don't work with Scrapy 0.14 or above. I just need something real simple. :-)
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent
See this part of the docs
@wilfred's answer from the official docs works fine except for the logging part; here's mine:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider()
crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start_from_settings(get_project_settings())
reactor.run()
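On current Scrapy releases (1.x and later) the Crawler/log API used above has been superseded; a minimal sketch of the same blocking, settings-aware run with CrawlerProcess (the testspiders import and domain argument are carried over from the examples above) would look like this:
# Sketch for modern Scrapy (1.x+): blocking run using the project settings.
# FollowAllSpider and the domain argument mirror the examples above;
# replace them with your own spider and arguments.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

process = CrawlerProcess(get_project_settings())
process.crawl(FollowAllSpider, domain='scrapinghub.com')
process.start()  # blocks until the crawl finishes, like `scrapy crawl`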
