I've created a test spider. The spider gets one object which has url and xpath attributes. It scrapes the url and then populates the self.result dictionary accordingly, so self.result can be {'success': True, 'httpresponse': 200} or {'success': False, 'httpresponse': 404}, etc.
The problem is that I don't know how to access spider.result, since there is no spider object.
..
def test(self):
    from scrapy.crawler import CrawlerProcess
    ts = TestSpider
    process = CrawlerProcess({...})
    process.crawl(ts, [object,])
    process.start()
    print ts.result
I tried:
def test(self):
    from scrapy.crawler import CrawlerProcess
    ts = TestSpider(object)
    process = CrawlerProcess({...})
    process.crawl(ts)
    process.start()
    print ts.result
But it says that crawl needs 2 arguments.
Do you know how to do that? I don't want to save the results to a file or a db.
That's how you call crawl (pass the spider class, not an instance):
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider, arg1=val1, arg2=val2)
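If the goal is to read self.result back without writing to a file or db, one option is a rough sketch like the following: keep a reference to the Crawler created for TestSpider and collect the result in a spider_closed handler. The obj variable and the handler name are just stand-ins for your url/xpath object, not anything from the original code:
from scrapy import signals
from scrapy.crawler import CrawlerProcess

collected = {}

def on_spider_closed(spider, reason):
    # the handler receives the live spider instance, so its result dict is reachable here
    collected.update(spider.result)

process = CrawlerProcess()                  # pass your settings dict here if needed
crawler = process.create_crawler(TestSpider)
crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
process.crawl(crawler, obj)                 # obj: the object with url and xpath attributes
process.start()
print(collected)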
I have two spiders in my spider.py file, and I want to run them and generate a csv file.
Below is the structure of my spider.py:
class tmallSpider(scrapy.Spider):
    name = 'tspider'
    ...

class jdSpider(scrapy.Spider):
    name = 'jspider'
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(tmallSpider)
    yield runner.crawl(jdSpider)
    reactor.stop()

crawl()
reactor.run()
Below is the structure for my items.py:
class TmallspiderItem(scrapy.Item):
    # define the fields for your item here like:
    product_name_tmall = scrapy.Field()
    product_price_tmall = scrapy.Field()

class JdspiderItem(scrapy.Item):
    product_name_jd = scrapy.Field()
    product_price_jd = scrapy.Field()
I want to generate a csv file with four columns:
product_name_tmall | product_price_tmall | product_name_jd | product_price_jd
I ran scrapy crawl -o prices.csv in PyCharm's terminal, but nothing is generated.
I scrolled up and found that only the jd items are printed in the terminal; I do not see any tmall items printed.
However, if I add an open_in_browser command for the tmall spider, the browser DOES open. I guess the code was executed, but somehow the data is not recorded?
If I run scrapy crawl tspider and scrapy crawl jspider individually, everything is correct and the csv file is generated.
Is this a problem with how I ran the program, or is there a problem with my code? Any ideas how to fix it?
I think the problem is in how you are initiating the spider runs.
You can simply use CrawlerProcess to start the jobs.
Have a look at https://docs.scrapy.org/en/latest/topics/practices.html for how CrawlerProcess is used.
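For instance, a minimal sketch along those lines, assuming both spider classes can be imported from your spider.py. Note that -o prices.csv only applies to the scrapy crawl command, so when running a script like this the CSV export has to be configured through the feed export settings instead:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spider import tmallSpider, jdSpider   # assuming spider.py is importable like this

process = CrawlerProcess(get_project_settings())
process.crawl(tmallSpider)
process.crawl(jdSpider)
process.start()   # blocks until both spiders have finished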
I am using this code somewhere inside my spider:
raise scrapy.exceptions.CloseSpider('you_need_to_rerun')
So, when this exception is raised, my spider eventually finishes closing and I get stats in the console containing this string:
'finish_reason': 'you_need_to_rerun',
But how can I get it from code? I want to run the spider again in a loop based on info from these stats, something like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import spaida.spiders.spaida_spider
import spaida.settings

you_need_to_rerun = True
while you_need_to_rerun:
    process = CrawlerProcess(get_project_settings())
    process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
    process.start(stop_after_crawl=False)  # the script will block here until the crawling is finished
    finish_reason = 'and here I get somehow finish_reason from stats'  # <- how??
    if finish_reason == 'finished':
        print("everything ok, I don't need to rerun this")
        you_need_to_rerun = False
I found this in the docs, but I can't figure out how to actually reach it: "The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain name." https://doc.scrapy.org/en/latest/topics/stats.html#scrapy.statscollectors.MemoryStatsCollector.spider_stats
P.S.: I'm also getting the error twisted.internet.error.ReactorNotRestartable when using process.start(), and the recommendation to use process.start(stop_after_crawl=False) - but then the spider just stops and does nothing. That is another problem, though...
You need to access the stats object via the Crawler object, and you need to grab the crawler before the crawl finishes (process.crawlers is a set that is emptied once a crawl completes):
process = CrawlerProcess(get_project_settings())
process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
crawler = list(process.crawlers)[0]  # keep a reference before starting
process.start()
reason = crawler.stats.get_value('finish_reason')
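If you also want the rerun loop from the question without hitting ReactorNotRestartable, one way is to reuse a single reactor with CrawlerRunner and decide from a callback whether to schedule another crawl. A minimal sketch, assuming SpaidaSpiderSpider is importable as shown:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spaida.spiders.spaida_spider import SpaidaSpiderSpider

configure_logging()
runner = CrawlerRunner(get_project_settings())

def run_spider():
    crawler = runner.create_crawler(SpaidaSpiderSpider)
    deferred = runner.crawl(crawler)
    deferred.addCallback(check_finish_reason, crawler)

def check_finish_reason(_, crawler):
    reason = crawler.stats.get_value('finish_reason')
    if reason == 'you_need_to_rerun':
        run_spider()      # schedule another crawl on the same, still-running reactor
    else:
        reactor.stop()    # 'finished' (or anything else): we are done

run_spider()
reactor.run()             # blocks until reactor.stop() is called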
I have a spider that crawls a site. But in one crawl I need to grab certain data and save it to one database, and in another crawl grab different data and put it somewhere else.
Right now I pass a target parameter when I create the spider to manage which options it will use. The __init__ method then tweaks the search parameters.
Is there a way I can have the spider set its pipeline in __init__? Or something in the crawl script that would do it?
At the moment, I start the crawl like this:
process = CrawlerProcess(get_project_settings())
process.crawl('my_spider', target='target_one')
process.start()
I have a separate script for whichever target I intend to run.
Change the settings in the Spider; this will override the settings in settings.py:
class MySpider(scrapy.Spider):
    name = "My_spider"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'my_scraper.middlewares.MyMiddleware': 300,
        },
        'ITEM_PIPELINES': {
            'my_scraper.pipelines.MyPipeline': 300,
        }
    }
Or, if you are using CrawlerProcess:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

FollowAllSpider.custom_settings = {'ITEM_PIPELINES': {'my_scraper.pipelines.MyPipeline': 300}}
process = CrawlerProcess(get_project_settings())
# FollowAllSpider is the spider registered as 'followall' in the project.
process.crawl(FollowAllSpider, domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
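A related option, closer to your current setup: since you already pass a target argument, you can pick the ITEM_PIPELINES mapping in the crawl script before creating the process. A sketch, with the pipeline class names invented for illustration:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical pipelines, one per target
PIPELINES = {
    'target_one': {'my_scraper.pipelines.TargetOnePipeline': 300},
    'target_two': {'my_scraper.pipelines.TargetTwoPipeline': 300},
}

target = 'target_one'
settings = get_project_settings()
settings.set('ITEM_PIPELINES', PIPELINES[target])

process = CrawlerProcess(settings)
process.crawl('my_spider', target=target)   # the spider name is resolved via the project settings
process.start()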
I'm trying to crawl various websites looking for particular keywords of interest and only scraping those pages. I've written the script to run as a standalone Python script rather than in the traditional Scrapy project structure (following this example), using the CrawlSpider class. The idea is that from a given homepage the Spider will crawl pages within that domain and only scrape links from pages which contain the keyword. I'm also trying to save a copy of the page when I find one containing the keyword. The previous version of this question related to a syntax error (see comments below, thanks @tegancp for helping me clear that up), but now, although my code runs, I am still unable to crawl links only on pages of interest as intended.
I think I want to either i) remove the call to LinkExtractor in the __init__ function or ii) only call LinkExtractor from within __init__ but with a rule based on what I find when I visit that page rather than some attribute of the URL. I can't do i) because the CrawlSpider class wants a rule, and I can't do ii) because LinkExtractor doesn't have a process_links option like the old SgmlLinkExtractor, which appears to be deprecated. I'm new to Scrapy, so I'm wondering if my only option is to write my own LinkExtractor (see the process_links sketch after the code below).
import sys

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor
# define an item class
class GenItem(Item):
    url = Field()

# define a spider
class GenSpider(CrawlSpider):
    name = "genspider3"

    # requires 'start_url', 'allowed_domains' and 'folderpath' to be passed as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):
        self.start_urls = [sys.argv[1]]
        self.allowed_domains = [sys.argv[2]]
        self.folder = sys.argv[3]
        self.writefile1 = self.folder + 'hotlinks.txt'
        self.writefile2 = self.folder + 'pages.txt'
        self.rules = [Rule(LinkExtractor(allow_domains=(sys.argv[2],)), follow=True, callback='parse_links')]
        super(GenSpider, self).__init__()

    def parse_start_url(self, response):
        # get list of links on start_url page and process using parse_links
        list(self.parse_links(response))

    def parse_links(self, response):
        # if this page contains a word of interest save the HTML to file and crawl the links on this page
        theHTML = response.body
        if 'keyword' in theHTML:
            with open(self.writefile2, 'a+') as f2:
                f2.write(theHTML + '\n')
            with open(self.writefile1, 'a+') as f1:
                f1.write(response.url + '\n')
            for link in LinkExtractor(allow_domains=(sys.argv[2],)).extract_links(response):
                linkitem = GenItem()
                linkitem['url'] = link.url
                log.msg(link.url)
                with open(self.writefile1, 'a+') as f1:
                    f1.write(link.url + '\n')
                yield linkitem
# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?
    # stop the reactor
    reactor.stop()
# instantiate settings and provide a custom configuration
settings = Settings()
#settings.set('DEPTH_LIMIT', 2)
settings.set('DOWNLOAD_DELAY', 0.25)
# instantiate a crawler passing in settings
crawler = Crawler(settings)
# instantiate a spider
spider = GenSpider()
# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)
# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()
# start logging
log.start(loglevel=log.DEBUG)
# start the reactor (blocks execution)
reactor.run()
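For what it's worth, process_links is an argument of Rule rather than of the link extractor itself, so filtering the extracted links can be sketched roughly like this. The keep_interesting_links name is invented, and note that it filters on link attributes such as anchor text, not on the body of the target page:
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

def keep_interesting_links(links):
    # keep only links whose anchor text mentions the keyword
    return [link for link in links if 'keyword' in (link.text or '')]

class FilteredSpider(CrawlSpider):
    name = 'filteredspider'
    rules = [
        Rule(LinkExtractor(allow_domains=('example.com',)),
             process_links=keep_interesting_links,
             callback='parse_links',
             follow=True),
    ]

    def parse_links(self, response):
        # handle the followed page here
        pass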
Say I have this spider:
class SomeSpider(Spider):
    name = 'spname'
Then I can crawl my spider by creating a new instance of SomeSpider and calling the crawler, for example like this:
spider = SomeSpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
....
Can I do the same thing using just the spider name, i.e. 'spname'?
crawler.crawl('spname')  ## I give just the spider name here
How can I dynamically create the Spider?
I guess the scrapy manager does it internally, since this works fine:
scrapy crawl spname
One solution is to parse my spiders folder, get all Spider classes and filter them using the name attribute, but this looks like a far-fetched solution!
Thank you in advance for your help.
Please take a look at the source code:
# scrapy/commands/crawl.py
class Command(ScrapyCommand):
    def run(self, args, opts):
        ...

# scrapy/spidermanager.py
class SpiderManager(object):
    def _load_spiders(self, module):
        ...
    def create(self, spider_name, **spider_kwargs):
        ...

# scrapy/utils/spider.py
def iter_spider_classes(module):
    """Return an iterator over all spider classes defined in the given module
    that can be instantiated (ie. which have name)
    """
    ...
Inspired by @kev's answer, here is a function that inspects the spider classes:
from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes

def _load_spiders(module='spiders.SomeSpider'):
    spiders = {}
    for mod in walk_modules(module):
        for spcls in iter_spider_classes(mod):
            spiders[spcls.name] = spcls
    return spiders

Then you can instantiate:
spiders = _load_spiders()
somespider = spiders['spname']()
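A newer alternative, as a sketch: recent Scrapy versions ship a spider loader, so the class registered under a given name can be resolved from the project settings without walking the modules yourself:
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

spider_loader = SpiderLoader.from_settings(get_project_settings())
spider_cls = spider_loader.load('spname')   # the class registered under the name 'spname'
somespider = spider_cls()
Likewise, if CrawlerProcess is created with get_project_settings(), recent versions accept the name directly in process.crawl('spname').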