python-scrapy: set stats value in extension

I'm trying to write a simple Scrapy extension class to send crawler stats via email when the spider closes. This is what I have so far, which works fine.
from scrapy import signals


class SpiderClosedCommit(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        spider_stats = self.stats.get_stats(spider)
        # some more code to send an email with stats ...
But now I'm trying to figure out how to add a list of the domains that were scraped to the stats. I looked through the docs but I couldn't figure out what the code should look like and where to put it, in the extension or in the spider class. How can I get access to the scraped domains in the extension class, or how can I get access to the stats in the spider class?
Thanks in advance and all the best
Jacques

Here's one way to do it (see the sketch below):
hook your extension to the response_received signal and extract the domain from response.url
keep a set() in your extension with the domains seen
when closing the spider, add those domains to spider_stats before sending the email
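A minimal sketch of how the extension could look with that change, building on the class from the question (the response_received handler signature follows Scrapy's documented signal arguments):

from urllib.parse import urlparse

from scrapy import signals


class SpiderClosedCommit(object):

    def __init__(self, stats):
        self.stats = stats
        self.domains_seen = set()

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def response_received(self, response, request, spider):
        # collect the domain of every response the spider receives
        self.domains_seen.add(urlparse(response.url).netloc)

    def spider_closed(self, spider):
        spider_stats = self.stats.get_stats(spider)
        spider_stats['domains_seen'] = sorted(self.domains_seen)
        # ... send the email with spider_stats ...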

Related

How can I access a spider class attribute from pipelines.py?

I have three spider files and classes, and I want to save the item information to CSV files whose filenames depend on a search-condition parameter set in the spider. For that, I need to access the spider class's parameter.
So, my questions are three:
How can I access the spider class's parameter?
What is the best way to create a separate CSV file for each search? A new file should be started whenever the parse function issues a request for a new search result.
logger = logging.getLogger(__name__) is not working in pipelines.py.
How can I print that information?
Below is my logging call:
logger.log(logging.INFO, '\n======= %s ========\n', filename)
I have searched Google many times, but I couldn't find a solution.
I did try to use the from_crawler function, but I couldn't find an example that fits my case.
Scrapy 1.6.0
python 3.7.3
Windows 7, 32-bit
Code:
import scrapy


class CensusGetitemSpider(scrapy.Spider):
    name = 'census_getitem'
    startmonth = 1
    filename = None

    def parse(self, response):
        for x in data:
            self.filename = str(self.startmonth + 1)
            # ...
            yield item

            yield scrapy.Request(link, callback=self.parse)
You can access spider class and instance attributes from pipelines.py using the spider parameter that is passed to most pipeline methods.
For example:
def open_spider(self, spider):
    self.filename = spider.name
You can read more about item pipelines here: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
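A minimal sketch of a pipeline that uses a spider attribute to name its CSV file (CsvWriterPipeline and the fallback logic are illustrative, not part of Scrapy):

import csv


class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # `filename` is the attribute set in the question's spider; fall back to the spider name
        name = getattr(spider, 'filename', None) or spider.name
        self.file = open('{}.csv'.format(name), 'w', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow(dict(item).values())
        return item

    def close_spider(self, spider):
        self.file.close()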
You can save it directly from the command line; just define a filename:
scrapy crawl yourspider -o output.csv
But if you really need it to be set from the spider, you can use a custom setting per spider, for example:
class YourSpider(scrapy.Spider):
    name = 'yourspider'
    start_urls = ['http://www.yoursite.com']
    custom_settings = {
        'FEED_URI': 'output.csv',
        'FEED_FORMAT': 'csv',
    }
Use spider.logger.info('Your message')
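For example, inside a pipeline method (a short illustrative snippet; the message format mirrors the question's logging call):

class MyPipeline(object):

    def process_item(self, item, spider):
        # the spider argument gives access to the spider's own logger
        spider.logger.info('======= %s ========', spider.filename)
        return item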

Scrapy: use Item and save data in a JSON file

I want to use a Scrapy Item, manipulate the data, and save it all to a JSON file (using the JSON file like a database).
import json

import scrapy


# Spider Class
class Spider(scrapy.Spider):
    name = 'productpage'
    start_urls = ['https://www.productpage.com']

    def parse(self, response):
        for product in response.css('article'):
            link = product.css('a::attr(href)').get()
            id = link.split('/')[-1]
            title = product.css('a > span::attr(content)').get()
            # `price` is not defined in the snippet from the question
            product = Product(self.name, id, title, price, '', link)

            yield scrapy.Request('{}.json'.format(link), callback=self.parse_product, meta={'product': product})

        yield scrapy.Request(url=response.url, callback=self.parse, dont_filter=True)

    def parse_product(self, response):
        product = response.meta['product']
        for size in json.loads(response.body_as_unicode()):
            product.size.append(size['name'])

        if self.storage.update(product.__dict__):
            product.send('url')


# Storage Class
class Storage:

    def __init__(self, name):
        self.name = name
        self.path = '{}.json'.format(self.name)
        self.load()  # load the JSON database

    def update(self, new_item):
        # .... do things and update data ...
        return True


# Product Class
class Product:

    def __init__(self, name, id, title, price, size, link):
        self.name = name
        self.id = id
        self.title = title
        self.price = price
        self.size = []
        self.link = link

    def send(self, url):
        return  # send notify...
The Spider class searches for products on the main page of start_urls, then parses each product page to also collect the sizes.
Finally it checks for updates via self.storage.update(product.__dict__) and, if there are any, sends a notification.
How can I implement Item in my code? I thought I could put it in the Product class, but I can't include the send method...
You should define the Item you want and yield it after parsing (see the sketch below).
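A minimal sketch of what that Item could look like, with field names mirroring the Product class from the question (ProductItem is a hypothetical name):

import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    id = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()

In parse_product you would then build a ProductItem from the scraped values and yield it instead of (or alongside) the plain Product object.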
Finally, run the command:
scrapy crawl [spider] -o xx.json
PS: Scrapy supports exporting to a JSON file out of the box.
@Jadian's answer will get you a file with JSON in it, but not quite database-like access to it. To do this properly from a design standpoint, I would follow the instructions below. You don't have to use Mongo either; there are plenty of other NoSQL databases available that work with JSON.
What I would recommend in this situation is that you build out the items properly using scrapy.Item() classes. Then you can store them as JSON documents in MongoDB. You will need to assign a PK (primary key) to each item, but Mongo is basically made to be a non-relational JSON store. What you would do then is create an item pipeline which checks for the PK of the item: if it is found and no details have changed, raise DropItem(); otherwise update/store the new data in MongoDB. You could probably even pipe into the JSON exporter if you wanted to, but I think just dumping the Python object to JSON in Mongo is the way to go, and then Mongo will present you with JSON to work with on the front end.
I hope you understand this answer, but I think from a design point of view this will be a much easier solution, since Mongo is basically a non-relational data store based on JSON, and you will be dividing your item pipeline logic into its own area instead of cluttering your spider with it.
I would provide a code sample, but most of mine use an ORM for a SQL DB. Mongo is actually easier to use than that...
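A rough sketch of the pipeline described above, using pymongo (the class name, database/collection names, and the use of the item's id field as the primary key are all assumptions):

import pymongo
from scrapy.exceptions import DropItem


class MongoPipeline(object):

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['products_db']['products']

    def process_item(self, item, spider):
        data = dict(item)
        existing = self.collection.find_one({'_id': data['id']})
        if existing and all(existing.get(k) == v for k, v in data.items()):
            # same primary key and nothing changed: drop the item
            raise DropItem('No changes for item {}'.format(data['id']))
        # insert the document or update the stored copy
        self.collection.update_one({'_id': data['id']}, {'$set': data}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()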

Pointing Scrapy at a local cache instead of performing a normal spidering process

I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.
What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?
I like Scrapy's support for CSS and XPath selectors, otherwise I would just query the database separately and parse the documents with lxml.
For a time, I wasn't caching the documents at all and was using Scrapy in a normal fashion - parsing the items on the fly - but I've found that changing the item logic requires a time- and resource-intensive recrawl. Instead, I'm now caching the document body along with the item parse, and I want the option to have Scrapy iterate through those documents from a database instead of crawling the target URLs.
How do I go about modifying Scrapy so that I can pass it a set of documents and have it parse them individually as if it had just pulled them down from the web?
I think a custom downloader middleware is a good way to go. The idea is to have this middleware return the source code directly from the database and not let Scrapy make any HTTP requests.
Sample implementation (not tested and definitely needs error-handling):
import re

import MySQLdb
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
from scrapy import log
from scrapy.conf import settings


class CustomDownloaderMiddleware(object):

    def __init__(self, *args, **kwargs):
        super(CustomDownloaderMiddleware, self).__init__(*args, **kwargs)
        self.connection = MySQLdb.connect(**settings.DATABASE)
        self.cursor = self.connection.cursor()

    def process_request(self, request, spider):
        # extract the product id from the url
        product_id = re.search(r"(\d+)$", request.url).group(1)

        # get the cached source code from the database by product id
        self.cursor.execute("""
            SELECT source_code
            FROM products
            WHERE product_id = %s
        """, (product_id,))
        source_code = self.cursor.fetchone()[0]

        # make an HTTP response instance without actually hitting the web-site
        return Response(url=request.url, body=source_code)
And don't forget to activate the middleware.
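For example, in the project's settings.py (the module path and priority number are assumptions):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}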

Scrapy Spider - Saving data through Stats Collection

I'm trying to save some information between the last spider run and the current one. To make this possible I found the Stats Collection supported by Scrapy. My code is below:
from scrapy import Spider
from scrapy.http import Request


class StatsSpider(Spider):
    name = 'stats'

    def __init__(self, crawler, *args, **kwargs):
        Spider.__init__(self, *args, **kwargs)
        self.crawler = crawler
        print(self.crawler.stats.get_value('last_visited_url'))

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_requests(self):
        return [Request(url)
                for url in ['http://www.google.com', 'http://www.yahoo.com']]

    def parse(self, response):
        self.crawler.stats.set_value('last_visited_url', response.url)
        print('URL: %s' % response.url)
When I run my spider, I can see via the debugger that the stats variable is being refreshed with the new data; however, when I run my spider again (locally), the stats variable starts out empty. How should I properly run my spider in order to persist the data?
I'm running it on console:
scrapy runspider stats.py
EDIT: If you are running it on Scrapinghub, you can use their Collections API.
You need to save this data to disk in one way or another (in a file or database).
The crawler object you're writing the data to only exists during the execution of your crawl. Once your spider finishes, that object leaves memory and your data is lost.
I suggest loading the stats from your last run in __init__, updating them in parse like you are, and then hooking up the Scrapy spider_closed signal to persist the data when the spider is done running.
If you need an example of spider_closed, let me know and I'll update this answer. But plenty of examples are readily available on the web.
Edit: I'll just give you an example: https://stackoverflow.com/a/12394371/2368836
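A minimal sketch of that pattern, assuming the stats are persisted to a local JSON file (the file name and spider details are illustrative):

import json
import os

from scrapy import Spider, signals


class PersistentStatsSpider(Spider):
    name = 'stats'
    stats_file = 'spider_stats.json'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PersistentStatsSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        # load the stats persisted by the previous run, if any
        if os.path.exists(cls.stats_file):
            with open(cls.stats_file) as f:
                for key, value in json.load(f).items():
                    crawler.stats.set_value(key, value)
        return spider

    def parse(self, response):
        self.crawler.stats.set_value('last_visited_url', response.url)

    def spider_closed(self, spider):
        # persist the collected stats to disk for the next run
        with open(self.stats_file, 'w') as f:
            json.dump(self.crawler.stats.get_stats(), f, default=str)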

Using one Scrapy spider for several websites

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes: these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.
WARNING: This answer was for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.
in mybot/settings.py:
SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
in mybot/spidermanager.py:
from mybot.spider import MyParametrizedSpider


class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # put here code you want to run before the spider is closed
        pass

    def _get_spider_info(self, name):
        # query your backend (maybe a sqldb) using `name` as primary key,
        # and return start_urls, extra_domains and regexes
        ...
        return (start_urls, extra_domains, regexes)
and now your custom spider class, in mybot/spider.py:
from scrapy.spider import BaseSpider


class MyParametrizedSpider(BaseSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        self.regexes = regexes

    def parse(self, response):
        ...
Notes:
You can extend CrawlSpider too if you want to take advantage of its Rules system
To run a spider use: ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key to retrieve more spider info from the backend system.
As this solution overrides the default SpiderManager, coding a classic spider (a Python module per spider) doesn't work, but I think this is not an issue for you. More info on the default spider manager: TwistedPluginSpiderManager.
What you need is to dynamically create spider classes, subclassing your favorite generic spider class as supplied by Scrapy (CrawlSpider subclasses with your rules added, XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc.), as you get or deduce them from your GUI (or config file, or whatever).
Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:
from scrapy import spider

def makespider(domain_name, start_urls,
               basecls=spider.BaseSpider):
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls})

allspiders = []
for domain, urls in listofdomainurlpairs:
    allspiders.append(makespider(domain, urls))
This gives you a list of very bare-bones spider classes -- you'll probably want to add parse methods to them before you instantiate them (see the sketch below). Season to taste... ;-).
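For instance, a shared parse method could be attached to the generated classes before use (a hypothetical sketch; the placeholder body is only illustrative):

def parse(self, response):
    # minimal placeholder parse method shared by every generated spider class
    self.log('parsed %s' % response.url)
    return []

for cls in allspiders:
    cls.parse = parse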
Shameless self-promotion on domo! You'll need to instantiate the crawler as given in the examples for your project.
Also, you'll need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding the settings at runtime when the configuration changes.
Now it is extremely easy to configure Scrapy for these purposes:
As for the first URLs to visit, you can pass them as an attribute on the spider call with -a, and use the start_requests function to set up how the spider starts.
You don't need to set the allowed_domains variable for the spiders. If you don't include that class variable, the spider will allow every domain.
It should end up as something like:
from scrapy import Spider
from scrapy.http import Request


class MySpider(Spider):
    name = "myspider"

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse)

    def parse(self, response):
        ...
and you should call it with:
scrapy crawl myspider -a start_url="http://example.com"
