Scrapy Spider - Saving data through Stats Collection - python

I'm trying to save some information between the last spider run and the current one. To make this possible I found the Stats Collection supported by Scrapy. My code is below:
from scrapy import Request, Spider


class StatsSpider(Spider):
    name = 'stats'

    def __init__(self, crawler, *args, **kwargs):
        Spider.__init__(self, *args, **kwargs)
        self.crawler = crawler
        # value left by the previous run, if any
        print(self.crawler.stats.get_value('last_visited_url'))

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        return cls(crawler, *args, **kwargs)

    def start_requests(self):
        return [Request(url)
                for url in ['http://www.google.com', 'http://www.yahoo.com']]

    def parse(self, response):
        self.crawler.stats.set_value('last_visited_url', response.url)
        print('URL: %s' % response.url)
When I run my spider, I can see in the debugger that the stats variable is updated with the new data. However, when I run the spider again (locally), the stats variable starts out empty. How should I run my spider so that the data persists?
I'm running it from the console:
scrapy runspider stats.py
EDIT: If you are running it on Scrapinghub, you can use their Collections API.

You need to save this data to disk in one way or another (in a file or a database).
The crawler object you're writing the data to only exists during the execution of your crawl. Once your spider finishes, that object leaves memory and you lose your data.
I suggest loading the stats from your last run in __init__, updating them in parse as you already do, and then hooking into Scrapy's spider_closed signal to persist the data when the spider is done running.
If you need an example of spider_closed, let me know and I'll update. Plenty of examples are readily available on the web, though.
Edit: I'll just give you an example: https://stackoverflow.com/a/12394371/2368836
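The load, update, persist cycle described above could be sketched roughly as follows. This is a minimal sketch, not the code from the linked answer: the file name spider_state.json, the helpers load_state and save_state, and the mixin class are all hypothetical. It relies on Scrapy calling a spider's closed(reason) method when the spider_closed signal fires, and on the spider having a crawler attribute as in the question.

```python
import json
import os

STATE_FILE = "spider_state.json"  # hypothetical location; any file or database works


def load_state(path=STATE_FILE):
    """Return the dict saved by the previous run, or {} on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(state, path=STATE_FILE):
    """Write the state dict to disk so the next run can read it."""
    with open(path, "w") as f:
        json.dump(state, f)


class PersistLastUrlMixin(object):
    """Hypothetical mixin; use as `class StatsSpider(PersistLastUrlMixin, Spider)`."""

    state_path = STATE_FILE

    def previous_last_url(self):
        # Value recorded by the previous run (None on the first run).
        return load_state(self.state_path).get("last_visited_url")

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; copy the in-memory
        # stat to disk before the crawler object is discarded.
        last_url = self.crawler.stats.get_value("last_visited_url")
        save_state({"last_visited_url": last_url}, self.state_path)
```

With this in place, parse can keep calling stats.set_value exactly as in the question, and previous_last_url() gives the value from the run before.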

Related

Scrapy not writing all outputs to CSV

I am using Scrapy to ping a bunch of webpages and determine which of the homepages are no longer live.
The full code of my spider is very simple, and reproduced below:
import scrapy


class BrokenLinksSpider(scrapy.Spider):
    name = 'broken-link-spider'

    def __init__(self, id_homepage_mapping, *args, **kwargs):
        super(BrokenLinksSpider, self).__init__(*args, **kwargs)
        self.org_homepages = id_homepage_mapping
        self.download_timeout = 10

    def start_requests(self):
        for qId, url in self.org_homepages.items():
            if url.endswith("/"):
                url = url[0:-1]
            yield scrapy.Request(url,
                                 callback=self.complete_callback,
                                 errback=self.error_callback,
                                 dont_filter=True,
                                 meta={'handle_httpstatus_all': True, 'id': qId})

    def complete_callback(self, response):
        # only 4xx and 5xx responses are reported as broken
        if str(response.status)[0] in ("4", "5"):
            yield BrokenLinkCheckerItem(
                url=response.request.url,
                status=response.status,
                qId=response.meta['id'])

    def error_callback(self, failure):
        # called for download errors (DNS failures, timeouts, ...)
        yield BrokenLinkCheckerItem(
            url=failure.request.url,
            status=repr(failure),
            qId=failure.request.meta['id'])
In my app's logs, the Scrapy stats collector reports exceptions while fetching about 40,000 links. However, the CSV output file that I'm writing to contains only about 8k rows. Furthermore, the CSV file seems to contain every type of exception reported by the stats collector, just fewer instances of each.
So my question is: why is there such a discrepancy between what the stats collector reports and what's actually written to the CSV? Given my understanding of the errback callback function, shouldn't all those exceptions end up being written to the CSV?
Thanks!

Parse single URL without crawling

I'm trying to scrape a single URL with Scrapy. I don't want it to crawl, just parse the item, run the pipelines and return. My pipeline just updates the database. The following code is what I've done so far. It takes around 3 seconds, but most of that time seems to be spent loading Scrapy. Is there a better way to do this?
Ideally I want to parse a single URL from a Python script and not from the command line.
def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.start_urls = [kwargs.get('start_url')]

def parse(self, response):
    if 'item.asp' in response.url:
        yield Request(response.url, callback=self.parse_item)
Then I'm running it from the command line like this:
time scrapy crawl --loglevel=DEBUG MySpider -a start_url="www.example.com"
I also tried the following, but it never worked with the pipelines parameter:
time scrapy parse "www.example.com" --spider=MySpider --callback parse_item --pipelines AddToDB
Check the documentation for scrapy parse: http://doc.scrapy.org/en/latest/topics/commands.html?highlight=parse#std:command-parse
You are misunderstanding the --pipelines argument. It is a flag that takes no value: it processes the scraped items through the pipelines enabled in settings.py.
So just run the command without AddToDB.
Disabling only some of the pipelines is trickier. One way is to subclass your spider, add a custom_settings class attribute, and restrict the pipelines there.
So in your case something like:
class MySpider2(MySpider):
    name = 'spider2'
    # ITEM_PIPELINES maps each pipeline path to an order value
    custom_settings = {'ITEM_PIPELINES': {'project.pipelines.AddToDB': 300}}
and then use scrapy parse 'http://example.com' --spider=spider2 --pipelines.
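For the "from a Python script" part of the question, one option is Scrapy's CrawlerProcess. The following is only a sketch under assumptions: it must run inside the Scrapy project so that get_project_settings() can find settings.py (and with it the pipelines), and ensure_scheme and crawl_single_url are hypothetical helper names. Scrapy requests need an absolute URL including the scheme, so a bare "www.example.com" is prefixed with http:// first. Starting from a script is not guaranteed to be faster than the command line.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default="http"):
    """Prefix a bare host such as 'www.example.com' with a scheme."""
    return url if urlparse(url).scheme else "%s://%s" % (default, url)


def crawl_single_url(spider_cls, url):
    """Run one spider on one URL from a script, pipelines included."""
    # Imported here so ensure_scheme() stays usable without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    # Keyword arguments are passed to the spider's __init__, like `-a` on the CLI.
    process.crawl(spider_cls, start_url=ensure_scheme(url))
    process.start()  # blocks until the crawl has finished
```

A call would then look like crawl_single_url(MySpider, "www.example.com"), where MySpider is the spider class from the question.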

Pointing Scrapy at a local cache instead of performing a normal spidering process

I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.
What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?
I like Scrapy's support for CSS and XPath selectors; otherwise I would just hit the database separately with an lxml parser.
For a time, I wasn't caching the document at all and using Scrapy in a normal fashion - parsing the items on the fly - but I've found that changing the item logic requires a time and resource intensive recrawl. Instead, I'm now caching the document body along with the item parse, and I want to have the option to have Scrapy iterate through those documents from a database instead of crawling the target URL.
How do I go about modifying Scrapy so that I can pass it a set of documents and have it parse them individually, as if it had just pulled them down from the web?
I think a custom downloader middleware is a good way to go. The idea is to have the middleware return the cached source code directly from the database, so that Scrapy never makes the HTTP request.
Sample implementation (not tested and definitely needs error-handling):
import re

import MySQLdb
from scrapy.http import HtmlResponse


class CustomDownloaderMiddleware(object):

    def __init__(self, db_settings):
        self.connection = MySQLdb.connect(**db_settings)
        self.cursor = self.connection.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        # DATABASE is a custom project setting holding the MySQLdb.connect() kwargs
        return cls(crawler.settings.getdict('DATABASE'))

    def process_request(self, request, spider):
        # extract the product id from the end of the URL
        product_id = re.search(r"(\d+)$", request.url).group(1)

        # fetch the cached source code from the database by product id
        self.cursor.execute("""
            SELECT
                source_code
            FROM
                products
            WHERE
                product_id = %s
        """, (product_id,))
        source_code = self.cursor.fetchone()[0]

        # build the response without hitting the web site; returning it here
        # makes Scrapy skip the download. HtmlResponse (rather than Response)
        # is what provides the CSS/XPath selectors.
        return HtmlResponse(url=request.url, body=source_code, encoding='utf-8')
And don't forget to activate the middleware.
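Activation happens through the DOWNLOADER_MIDDLEWARES setting. A sketch, assuming the class lives in a hypothetical module myproject/middlewares.py; the order value 543 is an arbitrary choice for illustration.

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # hypothetical module path; adjust to wherever the class is defined
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
}
```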

python-scrapy set stats value in extension

I'm trying to write a simple scrapy extension-class to send crawler-stats when the spider closes via email. This is what I have so far, which works fine.
from scrapy import signals


class SpiderClosedCommit(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        spider_stats = self.stats.get_stats(spider)
        # some more code to send an email with stats ...
But now I'm trying to figure out how to add a list of the scraped domains to the stats. I looked through the docs but couldn't figure out what the code should look like and where to put it: in the extension or in the spider class. How can I get access to the scraped domains in the extension class, or how can I get access to the stats in the spider class?
Thanks in advance and all the best
Jacques
Here's one way to do it:
Hook your extension to the response_received signal and extract the domain from response.url.
Keep a set() in your extension with the domains seen.
When the spider closes, add those domains to spider_stats before sending the email.
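The three steps above could be sketched like this. It is an untested outline rather than a drop-in extension: the class name DomainStatsExtension and the stats key scraped_domains are made up for illustration, and the email code is left out as in the question.

```python
from urllib.parse import urlparse


class DomainStatsExtension(object):
    """Hypothetical extension that records every domain a response came from."""

    def __init__(self, stats):
        self.stats = stats
        self.domains = set()

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals  # imported here; only needed when run by Scrapy

        ext = cls(crawler.stats)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def response_received(self, response, request, spider):
        # Collect the host part of each downloaded URL.
        self.domains.add(urlparse(response.url).netloc)

    def spider_closed(self, spider):
        # 'scraped_domains' is a made-up stats key.
        self.stats.set_value("scraped_domains", sorted(self.domains))
        spider_stats = self.stats.get_stats()
        # ... send spider_stats by email ...
```

Storing a sorted list instead of the set keeps the value easy to serialize in the email body.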

Using one Scrapy spider for several websites

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes; these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.
WARNING: This answer was written for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.
in mybot/settings.py:
SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
in mybot/spidermanager.py:
from mybot.spider import MyParametrizedSpider


class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # Put here code you want to run before the spider is closed
        pass

    def _get_spider_info(self, name):
        # query your backend (maybe a sqldb) using `name` as primary key,
        # and return start_urls, extra_domains and regexes
        ...
        return (start_urls, extra_domains, regexes)
and now your custom spider class, in mybot/spider.py:
from scrapy.spider import BaseSpider


class MyParametrizedSpider(BaseSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        self.regexes = regexes

    def parse(self, response):
        ...
Notes:
You can extend CrawlSpider too if you want to take advantage of its Rules system.
To run a spider use: ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key used to retrieve more spider info from the backend system.
Because this solution overrides the default SpiderManager, coding a classic spider (one Python module per spider) doesn't work, but I think this is not an issue for you. More info is available on the default spider manager, TwistedPluginSpiderManager.
What you need is to dynamically create spider classes, subclassing your favorite generic spider class as supplied by scrapy (CrawlSpider subclasses with your rules added, or XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc), as you get or deduce them from your GUI (or config file, or whatever).
Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:
from scrapy import spider


def makespider(domain_name, start_urls,
               basecls=spider.BaseSpider):
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls})


allspiders = []
for domain, urls in listofdomainurlpairs:
    allspiders.append(makespider(domain, urls))
This gives you a list of very bare-bones spider classes -- you'll probably want to add parse methods to them before you instantiate them. Season to taste... ;-)
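The same type() technique, written with the attribute names used by later Scrapy releases (name and allowed_domains rather than domain_name), might look like the sketch below. make_spider_class is a hypothetical helper, and GenericSpider is a stand-in base class that exists only so the snippet runs without Scrapy; in a project you would pass scrapy.Spider or CrawlSpider as base_cls instead.

```python
def make_spider_class(name, start_urls, allowed_domains, base_cls):
    """Build a spider class at runtime from configuration values."""
    attrs = {
        "name": name,
        "start_urls": list(start_urls),
        "allowed_domains": list(allowed_domains),
    }
    # type(class_name, bases, namespace) creates a new class object.
    return type(name.title().replace("-", "") + "Spider", (base_cls,), attrs)


class GenericSpider(object):
    """Stand-in base class for this demo only."""

    def parse(self, response):
        raise NotImplementedError


# One (name, start_urls, allowed_domains) tuple per configured site.
configs = [("example", ["http://example.com/"], ["example.com"])]
spider_classes = [make_spider_class(n, u, d, GenericSpider) for n, u, d in configs]
```

Each generated class inherits parse from the base class, so the shared parsing logic only has to be written once.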
Shameless self-promotion of domo! You'll need to instantiate the crawler as shown in the examples for your project.
You'll also need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding the settings at runtime when the configuration changes.
It is now extremely easy to configure Scrapy for these purposes:
For the first URLs to visit, you can pass them as an attribute on the spider call with -a, and use the start_requests function to set up how the spider starts.
You don't need to set the allowed_domains variable for the spiders. If you leave out that class variable, the spider will allow every domain.
It should end up looking something like:
from scrapy import Request, Spider


class MySpider(Spider):
    name = "myspider"

    def start_requests(self):
        # start_url is set from the `-a start_url=...` argument
        yield Request(self.start_url, callback=self.parse)

    def parse(self, response):
        ...
and you should call it with:
scrapy crawl myspider -a start_url="http://example.com"
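Since the question mentions writing the configuration to a file, the spider could also read that file when it starts. The sketch below assumes a JSON layout with start_urls and allow keys; the layout and the helper names load_crawl_config and url_allowed are assumptions for illustration, not part of Scrapy.

```python
import json
import re


def load_crawl_config(path):
    """Read start URLs and allowed-URL regexes written by the GUI."""
    with open(path) as f:
        raw = json.load(f)
    return {
        "start_urls": raw["start_urls"],
        "allow": [re.compile(p) for p in raw.get("allow", [])],
    }


def url_allowed(url, patterns):
    """True if no patterns are configured or any pattern matches the URL."""
    return not patterns or any(p.search(url) for p in patterns)
```

A spider could then be started with -a config_path=crawl.json, call load_crawl_config(self.config_path) in start_requests to yield one Request per start URL, and use url_allowed to filter the links it follows.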
