How to set the pipeline for a spider - python

I have a spider that crawls a site. But in one crawl I need to grab certain data and save it to one database; in another, grab different data and put it somewhere else.
Right now I pass a target parameter when I create the spider to control which options it will use; the __init__ method then tweaks the search parameters.
Is there a way I can have the spider set its pipeline in __init__? Or something in the crawl script that would do it?
At the moment, I start the crawl like this:
process = CrawlerProcess(get_project_settings())
process.crawl('my_spider', target='target_one')
process.start()
I have a separate script for whichever target I intend to run.

You can change settings in the spider itself via custom_settings; these override the values in settings.py:
class MySpider(scrapy.Spider):
    name = "My_spider"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'my_scraper.middlewares.MyMiddleware': 300,
        },
        'ITEM_PIPELINES': {
            'my_scraper.pipelines.MyPipeline': 300,
        },
    }
Or, if you are using CrawlerProcess:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

FollowAllSpider.custom_settings = {'ITEM_PIPELINES': {'my_scraper.pipelines.MyPipeline': 300}}
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
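Applied to the question's target parameter, a minimal sketch of switching ITEM_PIPELINES in the crawl script before starting the crawl (the pipeline class names below are assumptions, not from the original project):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical mapping from target to pipeline; adjust the dotted paths to your project.
PIPELINES = {
    'target_one': {'my_scraper.pipelines.DatabaseOnePipeline': 300},
    'target_two': {'my_scraper.pipelines.DatabaseTwoPipeline': 300},
}

target = 'target_one'
settings = get_project_settings()
settings.set('ITEM_PIPELINES', PIPELINES[target])  # override before the crawler is built

process = CrawlerProcess(settings)
process.crawl('my_spider', target=target)
process.start()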

Related

How to get stats value after CrawlerProcess finished, i.e. at line after process.start()

I am using this code somewhere inside my spider:
raise scrapy.exceptions.CloseSpider('you_need_to_rerun')
So, when this exception is raised, my spider eventually finishes closing and I see stats in the console containing this string:
'finish_reason': 'you_need_to_rerun',
But how can I get it from code? I want to run the spider again in a loop, based on this stat, something like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import spaida.spiders.spaida_spider
import spaida.settings

you_need_to_rerun = True
while you_need_to_rerun:
    process = CrawlerProcess(get_project_settings())
    process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
    process.start(stop_after_crawl=False)  # the script will block here until the crawling is finished
    finish_reason = 'and here I get somehow finish_reason from stats'  # <- how??
    if finish_reason == 'finished':
        print("everything ok, I don't need to rerun this")
        you_need_to_rerun = False
I found this in the docs, but I can't get it right: "The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain name." https://doc.scrapy.org/en/latest/topics/stats.html#scrapy.statscollectors.MemoryStatsCollector.spider_stats
P.S.: I'm also getting the error twisted.internet.error.ReactorNotRestartable when using process.start(), and the recommendation to use process.start(stop_after_crawl=False) - but then the spider just stops and does nothing. That is another problem, though...
You need to access the stats object via the Crawler object:
process = CrawlerProcess(get_project_settings())
# after process.crawl(...) and process.start() have run; crawlers is a set, not a list
crawler = list(process.crawlers)[0]
reason = crawler.stats.get_value('finish_reason')
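To use that inside the rerun loop without hitting ReactorNotRestartable, one option (a sketch, not from the original answer) is to drive repeated crawls with CrawlerRunner and check finish_reason between runs:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from spaida.spiders.spaida_spider import SpaidaSpiderSpider

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_until_finished():
    while True:
        crawler = runner.create_crawler(SpaidaSpiderSpider)
        yield runner.crawl(crawler)
        # the stats stay on the crawler after the run ends
        if crawler.stats.get_value('finish_reason') == 'finished':
            break
    reactor.stop()

crawl_until_finished()
reactor.run()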

Scrapy: crawl multiple spiders sharing same items, pipeline, and settings but with separate outputs

I am trying to run multiple spiders using a Python script based on the code provided in the official documentation. My Scrapy project contains multiple spiders (Spider1, Spider2, etc.) which crawl different websites and save the content of each website in a different JSON file (output1.json, output2.json, etc.).
The items collected on the different websites share the same structure, therefore the spiders use the same item, pipeline, and settings classes. The output is generated by a custom JSON class in the pipeline.
When I run the spiders separately they work as expected, but when I use the script below to run the spiders through the Scrapy API, the items get mixed in the pipeline. Output1.json should only contain items crawled by Spider1, but it also contains the items of Spider2. How can I crawl multiple spiders with the Scrapy API using the same items, pipeline, and settings but generating separate outputs?
Here is the code I used to run multiple spiders:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(Spider1)
process.crawl(Spider2)
process.start()
Example output1.json:
{
    "Name": "Thomas",
    "source": "Spider1"
}
{
    "Name": "Paul",
    "source": "Spider2"
}
{
    "Name": "Nina",
    "source": "Spider1"
}
Example output2.json:
{
    "Name": "Sergio",
    "source": "Spider1"
}
{
    "Name": "David",
    "source": "Spider1"
}
{
    "Name": "James",
    "source": "Spider2"
}
Normally, all the names crawled by spider1 ("source": "Spider1") should be in output1.json, and all the names crawled by spider2 ("source": "Spider2") should be in output2.json
Thank you for your help!
The first problem was that the spiders were running at the same time. Running the spiders sequentially by chaining the deferreds solved this problem:
# scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
# spiders
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()
I also had a second mistake in my pipeline: I didn't clear my list of results in close_spider. Therefore, spider2 was adding items to a list that already contained the items of spider1.
import json

class ExportJSON(object):
    results = []

    def process_item(self, item, spider):
        self.results.append(dict(item))
        return item

    def close_spider(self, spider):
        # file_name is defined elsewhere in the original pipeline
        file = open(file_name, 'w')
        line = json.dumps(self.results)
        file.write(line)
        file.close()
        self.results.clear()
Thank you!
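As a side note, a way to avoid the shared class-level list entirely is to keep per-spider state in open_spider and name the output after the spider; a sketch with a hypothetical file-naming scheme, not the original pipeline:
import json

class ExportJSON(object):
    def open_spider(self, spider):
        # fresh list per spider run, so nothing leaks between spiders
        self.results = []

    def process_item(self, item, spider):
        self.results.append(dict(item))
        return item

    def close_spider(self, spider):
        # hypothetical naming scheme: output_<spider name>.json
        with open('output_%s.json' % spider.name, 'w') as f:
            json.dump(self.results, f)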
According to the docs, to run spiders sequentially in the same process you must chain the deferreds.
Try this:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()
A better solution (if you have multiple spiders) is to load the spiders dynamically and run them:
from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

settings = project.get_project_settings()
spider_loader = spiderloader.SpiderLoader.from_settings(settings)
runner = CrawlerRunner(settings)

@inlineCallbacks
def crawl():
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        yield runner.crawl(my_spider)
    reactor.stop()

crawl()
reactor.run()
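On the separate-outputs part of the question: newer Scrapy versions (2.1+) can also write one file per spider with the FEEDS setting and the %(name)s placeholder, which may remove the need for a custom JSON pipeline; a minimal sketch:
# in settings.py (or custom_settings); %(name)s expands to each spider's name
FEEDS = {
    'output_%(name)s.json': {
        'format': 'json',
    },
}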

scrapy LOG_FILE and LOG_LEVEL setting does not work per spider

I have several spiders in my project, and I want each spider to log to its own file (such as brand.log, product.log, ...).
So I set custom_settings per spider, but it doesn't seem to work. Is this a known bug? Is there an easy configuration that solves this?
Thanks very much for your help!
It's working for me. Here is the spider inside a dummy project:
# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']
    custom_settings = {
        'LOG_FILE': '/tmp/example.log',
    }

    def parse(self, response):
        self.logger.info('XXXXX')
I start the spider using scrapy crawl example and the log file is successfully written to /tmp/example.log.
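For several spiders the same pattern simply repeats per spider class, as in the sketch below (the spider names are hypothetical). Note that this works when each spider is started with its own scrapy crawl command; when several spiders share one CrawlerProcess, logging is configured once for the whole process, which is a common reason a per-spider LOG_FILE appears not to work.
import scrapy

class BrandSpider(scrapy.Spider):
    name = 'brand'
    custom_settings = {'LOG_FILE': 'brand.log'}

class ProductSpider(scrapy.Spider):
    name = 'product'
    custom_settings = {'LOG_FILE': 'product.log'}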

How to access spider attributes after crawl

I've created a test spider. This spider gets one object which has url and xpath attributes. It scrapes the url and then populates the self.result dictionary accordingly. So self.result can be {'success': True, 'httpresponse': 200} or {'success': False, 'httpresponse': 404}, etc.
The problem is that I don't know how to access spider.result, since there is no spider object.
..
def test(self):
    from scrapy.crawler import CrawlerProcess
    ts = TestSpider
    process = CrawlerProcess({...})
    process.crawl(ts, [object, ])
    process.start()
    print ts.result
I tried:
def test(self):
    from scrapy.crawler import CrawlerProcess
    ts = TestSpider(object)
    process = CrawlerProcess({...})
    process.crawl(ts)
    process.start()
    print ts.result
But it says that crawl needs 2 arguments.
Do you know how to do that? I don't want to save the results to a file or database.
That's how you call crawl:
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider, arg1=val1, arg2=val2)
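To actually read spider.result after the run, one option (a sketch, not part of the original answer; it reuses the question's TestSpider and arguments) is to capture the spider instance with the spider_closed signal:
from scrapy import signals
from scrapy.crawler import CrawlerProcess

captured = {}

def on_spider_closed(spider):
    # fires just before the spider is torn down; copy what it collected
    captured.update(spider.result)

process = CrawlerProcess()                  # pass the same settings dict you already use
crawler = process.create_crawler(TestSpider)
crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
process.crawl(crawler, [object, ])          # same arguments as in the question
process.start()
print(captured)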

Using one Scrapy spider for several websites

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes; these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.
WARNING: This answer was for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.
in mybot/settings.py:
SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
in mybot/spidermanager.py:
from mybot.spider import MyParametrizedSpider

class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # put here code you want to run before the spider is closed
        pass

    def _get_spider_info(self, name):
        # query your backend (maybe a sqldb) using `name` as primary key,
        # and return start_urls, extra_domains and regexes
        ...
        return (start_urls, extra_domains, regexes)
and now your custom spider class, in mybot/spider.py:
from scrapy.spider import BaseSpider

class MyParametrizedSpider(BaseSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        self.regexes = regexes

    def parse(self, response):
        ...
Notes:
You can extend CrawlSpider too if you want to take advantage of its Rules system
To run a spider use: ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key to retrieve more spider info from the backend system.
Because this solution overrides the default SpiderManager, coding a classic spider (a Python module per spider) doesn't work, but I think this is not an issue for you. More info on the default spider manager: TwistedPluginSpiderManager.
What you need is to dynamically create spider classes, subclassing your favorite generic spider class as supplied by scrapy (CrawlSpider subclasses with your rules added, or XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc), as you get or deduce them from your GUI (or config file, or whatever).
Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:
from scrapy import spider

def makespider(domain_name, start_urls,
               basecls=spider.BaseSpider):
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls})

allspiders = []
for domain, urls in listofdomainurlpairs:
    allspiders.append(makespider(domain, urls))
This gives you a list of very bare-bones spider classes -- you'll probably want to add parse methods to them before you instantiate them. Season to taste...;-).
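For reference, a sketch of the same idea against the current Scrapy API (scrapy.Spider plus name, start_urls, and a parse method built into the class dict); this is an adaptation, not the original answer's code:
import scrapy

def makespider(name, start_urls, basecls=scrapy.Spider):
    def parse(self, response):
        # placeholder parse; replace with real extraction logic
        self.logger.info('Visited %s', response.url)

    return type(name.capitalize() + 'Spider',
                (basecls,),
                {'name': name,
                 'start_urls': start_urls,
                 'parse': parse})

# usage: build spiders from (name, urls) pairs coming from a GUI or config file
config = [('example', ['http://example.com/'])]
allspiders = [makespider(n, urls) for n, urls in config]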
Shameless self-promotion on domo! You'll need to instantiate the crawler as given in the examples, for your project.
You'll also need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding its settings at runtime when the configuration changes.
Now it is extremely easy to configure scrapy for these purposes:
For the first URLs to visit, you can pass them as spider arguments with -a, and use the start_requests method to define how the spider starts.
You don't need to set the allowed_domains attribute on the spiders. If you don't include that class variable, the spider will allow every domain.
It should end up as something like:
from scrapy import Spider, Request

class MySpider(Spider):
    name = "myspider"

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse)

    def parse(self, response):
        ...
and you should call it with:
scrapy crawl myspider -a start_url="http://example.com"
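The -a value is handed to the spider's __init__ as a keyword argument and becomes an attribute, which is where self.start_url above comes from. The default Spider.__init__ already does this; the sketch below only spells that wiring out explicitly:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # value passed on the command line with: scrapy crawl myspider -a start_url=...
        self.start_url = start_url

    def start_requests(self):
        yield scrapy.Request(self.start_url, callback=self.parse)

    def parse(self, response):
        self.logger.info('Parsed %s', response.url)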
