How to make scrapy spider crawl again if condition is not met? - python

In my close function, I am checking for the presence of a document scraped today, and I'd like to tell my Spider to scrape again if no such document is found. Basically, I need a robust way for the scraper to keep calling its crawl routine until a certain condition is met or MAX_RETRIES has been exhausted.

To run the spider again after it has finished, you will need to use the reactor and the CrawlerRunner class. The crawl method returns a Deferred that fires once the spider has finished scraping, and you can add a callback to it in which you do your checks. See the example below, where the spider reruns until the number of retries reaches 3, at which point it stops.
You will need to be careful how you do your checks, because this is asynchronous code and the sequence of execution might not be what you expect.
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            "url": response.url
        }

if __name__ == '__main__':
    RETRIES = 0
    configure_logging()
    runner = CrawlerRunner()
    d = runner.crawl(ExampleSpider)

    def finished():
        global RETRIES
        # do your checks in this callback and run the spider again if needed
        # in this example, we check if the number of retries is less than the required value
        if RETRIES < 3:
            RETRIES += 1
            d = runner.crawl(ExampleSpider)
            d.addBoth(lambda _: finished())
        else:
            reactor.stop()  # stop the reactor once the retries are exhausted

    d.addBoth(lambda _: finished())
    reactor.run()
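The asker's actual condition is "a document scraped today exists". Stripped of the Twisted machinery, the retry loop in the callback reduces to the following synchronous sketch; `condition` and `run` are hypothetical stand-ins for that document check and for a single crawl:

```python
MAX_RETRIES = 3

def run_until(condition, run, max_retries=MAX_RETRIES):
    """Re-run `run()` until `condition()` holds or retries are exhausted.

    Synchronous sketch of the callback-driven retry loop above;
    returns whether the condition was eventually satisfied.
    """
    retries = 0
    while not condition() and retries < max_retries:
        run()
        retries += 1
    return condition()
```

In the asynchronous version the `while` loop becomes the chain of `addBoth(lambda _: finished())` callbacks, but the stopping logic is the same.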

Related

Scrapy Running multiple spiders from one file

I have made one file with two spiders/classes. The second spider uses some data from the first one, but it doesn't seem to work. Here is what I do to initiate and start the spiders:
process = CrawlerProcess()
process.crawl(Zoopy1)
process.crawl(Zoopy2)
process.start()
What do you suggest?
Your code will run 2 spiders simultaneously.
Running spiders sequentially (starting Zoopy2 after completion of Zoopy1) can be achieved with @defer.inlineCallbacks:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
...
configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Zoopy1)
    yield runner.crawl(Zoopy2)
    reactor.stop()

crawl()
reactor.run()
An alternative option (if it suits your task) is to merge the logic from the two spiders into a single spider class.

Chain Scrapy Spiders which have data dependencies in a twisted reactor

Actually the Scrapy docs explain how to chain two spiders like this:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
But in my use case, MySpider2 needs information retrieved by MySpider1 and transformed by some transformFunction(). So I want something like this:
def transformFunction():
    ...  # transform data retrieved by spider1
    return newdata

def crawl():
    yield runner.crawl(MySpider1)
    newdata = transformFunction()
    yield runner.crawl(MySpider2, data=newdata)
    reactor.stop()
What I want scheduled:
MySpider1 starts, writes data to disk, then quits
transformFunction() transforms the data into newdata
MySpider2 starts using newdata
So how can I manage this behavior using the Twisted reactor and Scrapy?
runner.crawl returns a Deferred so you could chain callbacks to it. Minor tweaks will have to be done to your code.
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()

def crawl(reactor):
    runner = CrawlerRunner()
    d = runner.crawl(MySpider1)
    d.addCallback(transformFunction)
    d.addCallback(crawl2, runner)
    return d

def transformFunction(result):
    # crawl doesn't usually return any results if successful, so ignore result here
    # ...
    return newdata

def crawl2(result, runner):
    # result == newdata from transformFunction
    # runner is passed in from crawl()
    return runner.crawl(MySpider2, data=result)

task.react(crawl)
The main function is crawl(), it gets executed by task.react() which will start and stop the reactor for you. The Deferred is returned from runner.crawl() and the transformFunction + crawl2 functions are chained to it so that when a function completes, the next one starts.
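The value-threading that makes this work can be shown without Twisted: each callback's return value becomes the next callback's first argument. A minimal sketch, with `transform` and `crawl2` as hypothetical stand-ins (the real ones would load what MySpider1 wrote to disk and call runner.crawl):

```python
def transform(result):
    # the first crawl's Deferred fires with None, so ignore `result`
    # and load whatever the first spider wrote to disk (faked here)
    return {"urls": ["https://example.com"]}

def crawl2(newdata):
    # in the real code this would be runner.crawl(MySpider2, data=newdata)
    return ("crawl2", newdata)

# Deferred.addCallback threads each return value into the next callback,
# which is equivalent to folding the callbacks in order:
result = None
for cb in [transform, crawl2]:
    result = cb(result)
```

This is why `crawl2` receives `newdata` in the answer above: it is simply whatever `transformFunction` returned.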

Scrapy: Running multiple spiders from the same python process via cmdLine fails

Here's the code:
if __name__ == '__main__':
    cmdline.execute("scrapy crawl spider_a -L INFO".split())
    cmdline.execute("scrapy crawl spider_b -L INFO".split())
I intend to run multiple spiders from the same main entry point of a Scrapy project, but it turns out that only the first spider runs successfully, whereas the second one seems to be ignored. Any suggestions?
Just do
import subprocess
subprocess.call('for spider in spider_a spider_b; do scrapy crawl $spider -L INFO; done', shell=True)
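If you'd rather stay in Python than write a shell loop, the same sequential behavior can be sketched with subprocess; the spider names below are the hypothetical ones from the question:

```python
import subprocess

def crawl_commands(spiders):
    """Build one `scrapy crawl` command per spider name."""
    return [["scrapy", "crawl", name, "-L", "INFO"] for name in spiders]

# Each subprocess.run call blocks until that spider's process exits,
# so the spiders run strictly one after another, each in a fresh
# process with its own reactor:
# for cmd in crawl_commands(["spider_a", "spider_b"]):
#     subprocess.run(cmd, check=True)
```

Because every crawl gets its own process, this sidesteps the reactor-restart problem entirely.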
From the scrapy documentation: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
import scrapy
from scrapy.crawler import CrawlerProcess
from .spiders import Spider1, Spider2

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # the script will block here until all crawling jobs are finished
EDIT: If you wish to run multiple spiders two by two, you can do the following:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()

spiders = [Spider1, Spider2, Spider3, Spider4]

@defer.inlineCallbacks
def join_spiders(spiders):
    """Set up a new runner with the provided spiders"""
    runner = CrawlerRunner()
    # Add each spider to the current runner
    for spider in spiders:
        runner.crawl(spider)
    # This will yield when all the spiders inside the runner have finished
    yield runner.join()

@defer.inlineCallbacks
def crawl(group_by=2):
    # Yield a new runner containing `group_by` spiders
    for i in range(0, len(spiders), group_by):
        yield join_spiders(spiders[i:i + group_by])
    # When we have finished running all the spiders, stop the twisted reactor
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
I didn't test all of this though, so let me know if it works!
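The `group_by` slicing in crawl() is ordinary fixed-size chunking; as a standalone sketch:

```python
def chunked(items, size):
    """Yield successive `size`-sized groups of `items`,
    like the spiders[i:i + group_by] slicing above."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. chunked over five spider names with size 2 yields
# groups of 2, 2, and then the leftover 1
groups = list(chunked(["s1", "s2", "s3", "s4", "s5"], 2))
```

The last group may be smaller than `size`, which is exactly what you want when the number of spiders isn't a multiple of the batch size.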

Scrapy - Reactor not Restartable [duplicate]

This question already has answers here:
ReactorNotRestartable error in while loop with scrapy
(10 answers)
Closed 3 years ago.
with:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
I've always run this process successfully:
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
but since I've moved this code into a web_crawler(self) function, like so:
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()
    # (...)
    return (result1, result2)
and started calling the method using class instantiation, like:
def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]
and running:
test()
I am getting the following error:
Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
what is wrong?
You cannot restart the reactor, but you should be able to run it more times by forking a separate process:
import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Run it twice:
configure_logging()
print('first run:')
run_spider(QuotesSpider)
print('\nsecond run:')
run_spider(QuotesSpider)
Result:
first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question
0) pip install crochet
1) add from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error, spent 4+ hours solving it, and read all the questions here about it. Finally I found that one, and I'm sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the last two lines in my code:
# some more imports
from crochet import setup
setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)  # do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()  # get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderObj)  # from Scrapy docs
This code allows me to select which spider to run just by passing its name to the run_spider function, and after scraping finishes, to select another spider and run it again.
Hope this will help somebody, as it helped me :)
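The dynamic-import step in run_spider can be isolated into a small stdlib-only helper; a sketch, where the first_scrapy module paths and the mySpider class name are the answer's own project conventions:

```python
from importlib import import_module

def load_class(module_path, class_name):
    """Dynamically import `module_path` and fetch `class_name` from it,
    as run_spider above does with first_scrapy.spiders modules."""
    module = import_module(module_path)
    return getattr(module, class_name)

# in the answer's project this would be something like:
# spider_cls = load_class("first_scrapy.spiders." + spiderName, "mySpider")
```

Using `getattr` after `import_module` keeps the lookup explicit and raises a clear AttributeError if the class name is wrong.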
As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:
"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."
The error you are receiving is being thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jimmy-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.
Honestly, if you think you need to restart the reactor, you're likely doing something wrong.
Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation, too.
As some people pointed out already: You shouldn't need to restart the reactor.
Ideally if you want to chain your processes (crawl1 then crawl2 then crawl3) you simply add callbacks.
For example, I've been using this loop spider that follows this pattern:
1. Crawl A
2. Sleep N
3. goto 1
And this is how it looks in scrapy:
import time
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here

def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d

def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()

if __name__ == '__main__':
    loop_crawl()
To explain the process more: the crawl function schedules a crawl and adds two extra callbacks that are called when crawling is over: a blocking sleep and a recursive call to itself (scheduling another crawl).
$ python endless_crawl.py
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
The mistake is in this code:
def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0]  # here
web_crawler() returns two results, and to get them this code tries to start the process twice, restarting the reactor, as pointed out by @Rejected.
Obtaining both results by running one single process, and storing them in a tuple, is the way to go here:
def __call__(self):
    result1, result2 = test.web_crawler()
This solved my problem; put the code below after reactor.run() or process.start():
import os
import sys
import time

time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)
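For context, os.execl replaces the current process image, so re-executing the same interpreter with the same arguments restarts the whole script, which is what gives you a fresh reactor. A sketch of the argument list that call builds, without actually exec-ing:

```python
import sys

def restart_argv():
    """The argument list passed to os.execl above: the current interpreter
    followed by the script's own argv, so the process re-runs itself."""
    return [sys.executable, *sys.argv]
```

Note that nothing after the os.execl line ever runs, since the old process image is gone.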

Start scrapy multiple spider without blocking the process

I'm trying to execute scrapy spider in separate script and when I execute this script in a loop (for instance run the same spider with different parameters), I get ReactorAlreadyRunning. My snippet:
from celery import task
from episode.skywalker.crawlers import settings
from multiprocessing.queues import Queue
from scrapy import log, project, signals
from scrapy.settings import CrawlerSettings
from scrapy.spider import BaseSpider
from scrapy.spidermanager import SpiderManager
from scrapy.xlib.pydispatch import dispatcher
import multiprocessing
from twisted.internet.error import ReactorAlreadyRunning

class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        from scrapy.crawler import CrawlerProcess
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue
        self.crawler = CrawlerProcess(CrawlerSettings(settings))
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        try:
            self.crawler.start()
        except ReactorAlreadyRunning:
            pass
        self.crawler.stop()
        self.result_queue.put(self.items)

@task
def execute_spider(spider, **spider__kwargs):
    '''
    Execute spider within separate process
    @param spider: spider class to crawl or the name (check if instance)
    '''
    if not isinstance(spider, BaseSpider):
        manager = SpiderManager(settings.SPIDER_MODULES)
        spider = manager.create(spider, **spider__kwargs)
    result_queue = Queue()
    crawler = CrawlerWorker(spider, result_queue)
    crawler.start()
    items = []
    for item in result_queue.get():
        items.append(item)
My suspicion is that it is caused by multiple Twisted reactor runs.
How can I avoid it? Is there in general a way to run the spiders without a reactor?
I figured out what caused the problem: if you somehow call the execute_spider method within the CrawlerWorker process (for instance via recursion), it causes the creation of a second reactor, which is not possible.
My solution: move all statements that cause recursive calls into the execute_spider method, so that they trigger the spider execution in the same process, not in the secondary CrawlerWorker. I also built in a statement like this
try:
    self.crawler.start()
except ReactorAlreadyRunning:
    raise RecursiveSpiderCall("Spider %s was called from another spider recursively. Such behavior is not allowed" % (self.spider))
to catch unintentionally recursive calls of spiders.
