Using one Scrapy spider for several websites - python

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.

WARNING: This answer was for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.
in mybot/settings.py:
SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
in mybot/spidermanager.py:
from mybot.spider import MyParametrizedSpider

class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # Put here code you want to run before the spider is closed
        pass

    def _get_spider_info(self, name):
        # query your backend (maybe a SQL db) using `name` as primary key,
        # and return start_urls, extra_domains and regexes
        ...
        return (start_urls, extra_domains, regexes)
and now your custom spider class, in mybot/spider.py:
from scrapy.spider import BaseSpider

class MyParametrizedSpider(BaseSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        self.regexes = regexes

    def parse(self, response):
        ...
Notes:
You can extend CrawlSpider too if you want to take advantage of its Rules system.
To run a spider use ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key to retrieve more spider info from the backend system.
Because this solution overrides the default SpiderManager, coding a classic spider (a Python module per spider) doesn't work, but I think this is not an issue for you. More info on the default spider manager: TwistedPluginSpiderManager.
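For the "write the configuration to a file" part of the question, _get_spider_info could read a JSON file instead of a database. A minimal sketch, assuming a spiders.json file whose name and layout are invented here for illustration:

import json

def _get_spider_info(self, name):
    # Assumed layout of spiders.json:
    # {"myspider": {"start_urls": [...], "extra_domains": [...], "regexes": [...]}}
    with open('spiders.json') as f:
        conf = json.load(f)[name]
    return (conf['start_urls'], conf['extra_domains'], conf['regexes'])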

What you need is to dynamically create spider classes, subclassing your favorite generic spider class as supplied by scrapy (CrawlSpider subclasses with your rules added, or XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc), as you get or deduce them from your GUI (or config file, or whatever).
Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:
from scrapy import spider

def makespider(domain_name, start_urls,
               basecls=spider.BaseSpider):
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls})

allspiders = []
for domain, urls in listofdomainurlpairs:
    allspiders.append(makespider(domain, urls))
This gives you a list of very bare-bone spider classes -- you'll probably want to add parse methods to them before you instantiate them. Season to taste...;-).
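For instance, a hedged variation of makespider that also attaches a parse callback (the parse_func parameter and the parse_links callback are illustrative, not part of Scrapy):

def makespider(domain_name, start_urls, parse_func,
               basecls=spider.BaseSpider):
    # parse_func is any callable taking (self, response); it becomes the
    # parse method of the dynamically built spider class
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls,
                 'parse': parse_func})

def parse_links(self, response):
    # illustrative callback: just log the visited URL
    self.log('Visited %s' % response.url)

spider_cls = makespider('example.com', ['http://example.com'], parse_links)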

Shameless self-promotion of domo! You'll need to instantiate the crawler as given in the examples for your project.
You'll also need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding the settings at runtime when the configuration changes.
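A minimal sketch of what "passing the configuration to the crawler" can look like in recent Scrapy versions, using CrawlerProcess from a script (the config dict and its keys are placeholders for whatever your GUI or config file produces):

from scrapy.crawler import CrawlerProcess

def run_with_config(spider_cls, config):
    # config is assumed to be a dict built from your GUI or config file
    process = CrawlerProcess(settings={
        'USER_AGENT': config.get('user_agent', 'mybot'),
        'DOWNLOAD_DELAY': config.get('download_delay', 0.5),
    })
    # keyword arguments are passed on to the spider's constructor
    process.crawl(spider_cls, start_urls=config['start_urls'])
    process.start()  # blocks until the crawl finishes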

Now it is extremely easy to configure Scrapy for these purposes:
For the first URLs to visit, you can pass them as attributes on the spider call with -a, and use the start_requests method to set up how the spider starts.
You don't need to set the allowed_domains variable on the spider. If you don't include that class variable, the spider will allow every domain.
It should end up to something like:
class MySpider(Spider):
    name = "myspider"

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse)

    def parse(self, response):
        ...
and you should call it with:
scrapy crawl myspider -a start_url="http://example.com"

Related

how can I access a variable parameter at spider class from pipelines.py

I have three spider files and classes, and I want to save the item information to CSV files whose filenames depend on a variable parameter of the search condition. For that, I need to access the spider class parameter.
So, my questions are three:
1. How can I access the spider class's parameter?
2. What is the best way to create each CSV file? The trigger is a new request being issued in the parse function for a new search result.
3. logger = logging.getLogger(__name__) is not working in pipelines.py. How can I print that information?
Below is my logging style:
logger.log(logging.INFO, '\n======= %s ========\n', filename)
I have searched Google many times but couldn't find a solution. I did try to use the from_crawler function, but I couldn't find a case that fits.
Scrapy 1.6.0
Python 3.7.3
OS: Windows 7 / 32-bit
Code:
class CensusGetitemSpider(scrapy.Spider):
    name = 'census_getitem'
    startmonth = 1
    filename = None

    def parse(self, response):
        for x in data:
            self.filename = str(self.startmonth + 1)
            # ...
            yield item
        yield scrapy.Request(link, callback=self.parse)
You can access spider class and instance attributes from pipelines.py using the spider parameter passed to most pipeline methods.
For example:
def open_spider(self, spider):
    self.filename = spider.name
You can see more about item pipelines here: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
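For the per-search-condition CSV files, one option is a pipeline that picks the output file from the spider's current attribute. A minimal sketch, assuming the filename attribute from the question (the pipeline name and CSV layout are illustrative):

import csv

class CsvWriterPipeline(object):
    # Writes each item to a CSV file chosen from spider.filename, so items
    # scraped under different search conditions end up in different files.

    def open_spider(self, spider):
        self.files = {}      # filename -> open file handle
        self.writers = {}    # filename -> csv.writer

    def process_item(self, item, spider):
        name = spider.filename or spider.name
        if name not in self.files:
            f = open('%s.csv' % name, 'w', newline='')
            self.files[name] = f
            self.writers[name] = csv.writer(f)
        self.writers[name].writerow([item.get(k) for k in sorted(item.keys())])
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

Remember to enable it via ITEM_PIPELINES in settings.py (or in the spider's custom_settings).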
You can save it directly from the command line, just define a filename:
scrapy crawl yourspider -o output.csv
But if you really need it to be set from the spider, you can use a custom setting per spider, for example:
class YourSpider(scrapy.Spider):
    name = 'yourspider'
    start_urls = ['www.yoursite.com']
    custom_settings = {
        'FEED_URI': 'output.csv',
        'FEED_FORMAT': 'csv',
    }
Use spider.logger.info('Your message')
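Inside a pipeline method that would look something like this (a sketch; the pipeline name is made up):

class MyPipeline(object):

    def process_item(self, item, spider):
        # spider.logger replaces logging.getLogger(__name__) in pipelines.py
        spider.logger.info('======= %s ========', spider.filename)
        return item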

Parse single URL without crawling

I'm trying to scrape a single URL with Scrapy. I don't want it to crawl, just parse the item, run the pipelines, and return. My pipeline just updates the database. The following code is what I've done so far; it takes around 3 seconds, but it seems like most of the time is spent loading Scrapy. Is there a better way to do this?
Ideally I want to parse a single URL from a Python script and not the command line.
def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.start_urls = [kwargs.get('start_url')]

def parse(self, response):
    if 'item.asp' in response.url:
        yield Request(response.url, callback=self.parse_item)
Then I'm running it from the command line like the following:
time scrapy crawl --loglevel=DEBUG MySpider -a start_url="www.example.com"
I also tried the following, but it never worked with the pipelines parameter:
time scrapy parse "www.example.com" --spider=MySpider --callback parse_item --pipelines AddToDB
Check the documentation for scrapy parse: http://doc.scrapy.org/en/latest/topics/commands.html?highlight=parse#std:command-parse
In your case you are misunderstanding the --pipelines argument: it enables all of the pipelines defined in settings.py, so just run it without AddToDB.
If you want to prevent some pipelines from running, it can be tricky; you might want to create a subclass of your spider, add the class attribute custom_settings, and restrict the pipelines in it.
So in your case, something like:
class MySpider2(MySpider):
    name = 'spider2'
    custom_settings = {'ITEM_PIPELINES': {'project.pipelines.AddToDB': 300}}
and then use scrapy parse 'http://example.com' --spider=spider2 --pipelines.

Scrapy Spider - Saving data through Stats Collection

I'm trying to save some information between the last spider run and the current one. To make this possible I found the Stats Collection supported by Scrapy. My code is below:
class StatsSpider(Spider):
    name = 'stats'

    def __init__(self, crawler, *args, **kwargs):
        Spider.__init__(self, *args, **kwargs)
        self.crawler = crawler
        print self.crawler.stats.get_value('last_visited_url')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_requests(self):
        return [Request(url)
                for url in ['http://www.google.com', 'http://www.yahoo.com']]

    def parse(self, response):
        self.crawler.stats.set_value('last_visited_url', response.url)
        print 'URL: %s' % response.url
When I run my spider, I can see via debugging that the stats variable is being refreshed with the new data; however, when I run my spider again (locally), the stats variable starts empty. How should I properly run my spider in order to persist the data?
I'm running it on console:
scrapy runspider stats.py
EDIT: If you are running it on Scrapinghub, you can use their Collections API.
You need to save this data to disk in one way or another (in a file or database).
The crawler object you're writing the data to only exists during the execution of your crawl. Once your spider finishes, that object leaves memory and you lose your data.
I suggest loading the stats from your last run in __init__, updating them in parse like you are, and then hooking up Scrapy's spider_closed signal to persist the data when the spider is done running.
If you need an example of spider_closed let me know and I'll update. But plenty of examples are readily available on the web.
Edit: I'll just give you an example: https://stackoverflow.com/a/12394371/2368836
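For reference, a minimal sketch of that approach, assuming a reasonably recent Scrapy and a stats.json file as the on-disk store (the filename and spider name are illustrative):

import json
import os

from scrapy import Spider, signals


class PersistentStatsSpider(Spider):
    name = 'stats'
    stats_file = 'stats.json'  # illustrative filename

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PersistentStatsSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Reload whatever the previous run persisted
        if os.path.exists(cls.stats_file):
            with open(cls.stats_file) as f:
                for key, value in json.load(f).items():
                    crawler.stats.set_value(key, value)
        # Dump the stats back to disk when this run ends
        crawler.signals.connect(spider.save_stats, signal=signals.spider_closed)
        return spider

    def save_stats(self, spider):
        with open(self.stats_file, 'w') as f:
            json.dump(self.crawler.stats.get_stats(), f, default=str)

    def parse(self, response):
        self.crawler.stats.set_value('last_visited_url', response.url)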

scrapy how to import the settings to override it

This is my code:
class Test(Spider):
    self.settings.overrides['JOBDIR'] = "seen"
I got:
  File "C:\Python27\lib\site-packages\scrapy\spider.py", line 46, in settings
    return self.crawler.settings
  File "C:\Python27\lib\site-packages\scrapy\spider.py", line 41, in crawler
    assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
AssertionError: Spider not bounded to any crawler
I am extending Spider and I am not using Crawler because I don't have links or rules to follow.
I am guessing that my problem is that I didn't import the settings correctly, and I need your help please.
In order to change the settings in the spider you can:
class TestSpider(Spider):
    def set_crawler(self, crawler):
        super(TestSpider, self).set_crawler(crawler)
        crawler.settings.set('JOBDIR', 'seen')

    # rest of spider code
According to the documentation, individual settings for each spider can be set as a class attribute custom_settings, which should be a dictionary. In your case it will look like this:
class TestSpider(Spider):
    custom_settings = {'JOBDIR': "seen"}

    # The rest of the spider goes here
Not sure if this will work with early versions of scrapy.

Creating a generic scrapy spider

My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have a GUI that takes parameters like domain, keywords, tag names, etc., and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of Scrapy, suggesting either overriding the spider manager class or dynamically creating a spider. Which method is preferred, and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I pared it down, so hopefully I didn't remove anything crucial to understanding it.
class MySpider(CrawlSpider):
    name = 'MySpider'
    allowed_domains = ['somedomain.com', 'sub.somedomain.com']
    start_urls = ['http://www.somedomain.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
        Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
    )

    def parse_item(self, response):
        contentTags = []
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll('p', itemprop="myProp")
        for contentTag in contentTags:
            matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)
        pass
You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so:
>>> a = open("test.py")
>>> from compiler import compile
>>> d = compile(a.read(), 'spider.py', 'exec')
>>> eval(d)
>>> MySpider
<class '__main__.MySpider'>
>>> print MySpider.start_urls
['http://www.somedomain.com']
I use the Scrapy Extensions approach to extend the Spider class into a class named Masterspider that includes a generic parser.
Below is the very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a Javascript engine (such as Selenium or BeautifulSoup) as soon as you start working on pages using AJAX. And a lot of additional code to manage differences between sites (scrape based on column title, handle relative vs. long URLs, manage different kinds of data containers, etc...).
What is interesting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit, but I never had to. The Masterspider class checks if certain methods have been created (e.g. parser_start, next_url_parser...) under the site-specific spider class to allow the management of specificities: send a form, construct the next_url request from elements in the page, etc.
As I'm scraping very different sites, there are always specificities to manage. That's why I prefer to keep a class for each scraped site, so that I can write specific methods to handle it (pre-/post-processing except pipelines, request generators...).
masterspider/sitespider/settings.py
EXTENSIONS = {
    'masterspider.masterspider.MasterSpider': 500
}
masterspider/masterspider/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem

class MasterSpider(Spider):

    def start_requests(self):
        if hasattr(self, 'parse_start'):  # First page requiring a specific parser
            fcallback = self.parse_start
        else:
            fcallback = self.parse
        return [Request(self.spd['start_url'],
                        callback=fcallback,
                        meta={'itemfields': {}})]

    def parse(self, response):
        sel = Selector(response)
        lines = sel.xpath(self.spd['xlines'])
        # ...
        for line in lines:
            item = genspiderItem(response.meta['itemfields'])
            # ...
            # Get request_url of the detail page and scrape the basic item info
            # ...
            yield Request(request_url,
                          callback=self.parse_item,
                          meta={'item': item, 'itemfields': response.meta['itemfields']})

        for next_url in sel.xpath(self.spd['xnext_url']).extract():
            if hasattr(self, 'next_url_parser'):  # Need to process the next page URL before?
                yield self.next_url_parser(next_url, response)
            else:
                yield Request(
                    next_url,
                    callback=self.parse,
                    meta=response.meta)

    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
            item[itemname] = "\n".join(sel.xpath(xitemname).extract())
        return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider

class targetsiteSpider(MasterSpider):
    name = "targetsite"
    allowed_domains = ["www.targetsite.com"]
    spd = {
        'start_url': "http://www.targetsite.com/startpage",  # Start page
        'xlines': "//td[something...]",
        'xnext_url': "//a[contains(@href,'something?page=')]/@href",  # Next pages
        'x_ondetailpage': {
            "itemprop123": u"id('someid')//text()"
        }
    }

    # def next_url_parser(self, next_url, response):  # OPTIONAL next_url regexp pre-processor
    #     ...
Instead of having the variables name, allowed_domains, start_urls, and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from it passing the necessary arguments, and set name, allowed_domains, etc. per object.
MyProp and keywords also should be set within your __init__. So in the end you should have something like below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
class MySpider(CrawlSpider):

    def __init__(self, allowed_domains=[], start_urls=[],
                 rules=[], findtag='', finditemprop='', keywords='', **kwargs):
        CrawlSpider.__init__(self, **kwargs)
        self.allowed_domains = allowed_domains
        self.start_urls = start_urls
        self.rules = rules
        self.findtag = findtag
        self.finditemprop = finditemprop
        self.keywords = keywords

    def parse_item(self, response):
        contentTags = []
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
        for contentTag in contentTags:
            matchedResult = re.search(self.keywords, contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)
I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this, and I would be interested to learn what other people think.
I usually just override the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own init function, or I pull in that data from a database where I have previously saved a list of links to be appended to start_urls.
As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
class MySpider(scrapy.Spider):
    """
    This spider will try to crawl whatever is passed in `start_urls` which
    should be a comma-separated string of fully qualified URIs.

    Example: start_urls=http://localhost,http://example.com
    """

    def __init__(self, name=None, **kwargs):
        if 'start_urls' in kwargs:
            self.start_urls = kwargs.pop('start_urls').split(',')
        super(MySpider, self).__init__(name, **kwargs)
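And for the "pull in that data from a database" variant mentioned above, a hedged sketch that seeds start_urls from a SQLite table in __init__ (the database path, table schema, and spider name are assumptions):

import sqlite3

import scrapy


class DbSeededSpider(scrapy.Spider):
    name = 'dbseeded'

    def __init__(self, db_path='links.db', *args, **kwargs):
        super(DbSeededSpider, self).__init__(*args, **kwargs)
        conn = sqlite3.connect(db_path)
        try:
            # assumed schema: CREATE TABLE links (url TEXT)
            self.start_urls = [row[0] for row in conn.execute('SELECT url FROM links')]
        finally:
            conn.close()

    def parse(self, response):
        self.logger.info('Crawled %s', response.url)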
