Say I have this spider:
class SomeSpider(Spider):
    name = 'spname'
Then I can crawl my spider by creating a new instance of SomeSpider and calling the crawler, for example:
spider = SomeSpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
....
Can I do the same thing using just the spider name, i.e. 'spname'?
crawler.crawl('spname') ## I give just the spider name here
How can I dynamically create the spider?
I guess the Scrapy spider manager does this internally, since this works fine:
scrapy crawl spname
One solution would be to parse my spiders folder, get all the spider classes and filter them by their name attribute, but that looks like a far-fetched solution!
Thank you in advance for your help.
Please take a look at the source code:
# scrapy/commands/crawl.py
class Command(ScrapyCommand):
    def run(self, args, opts):
        ...

# scrapy/spidermanager.py
class SpiderManager(object):
    def _load_spiders(self, module):
        ...
    def create(self, spider_name, **spider_kwargs):
        ...

# scrapy/utils/spider.py
def iter_spider_classes(module):
    """Return an iterator over all spider classes defined in the given module
    that can be instantiated (ie. which have name)
    """
    ...
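If you are on an older Scrapy where scrapy.spidermanager is still present (the module shown above), a minimal sketch of crawling by name could look like this; it assumes the settings and crawler objects are the ones from your own code, and that SPIDER_MODULES in the settings points at your spider packages:
from scrapy.spidermanager import SpiderManager

spiders = SpiderManager.from_settings(settings)   # discovers classes via SPIDER_MODULES
spider = spiders.create('spname')                 # instantiate a spider from its name
crawler.crawl(spider)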
Inspired by kev's answer, here is a function that collects the spider classes by name:
from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes

def _load_spiders(module='spiders.SomeSpider'):
    spiders = {}
    for mod in walk_modules(module):
        for spcls in iter_spider_classes(mod):
            spiders[spcls.name] = spcls
    return spiders
Then you can instantiate:
somespider = _load_spiders()['spname']()
Related
I have 3 spider files and classes, and I want to save the item information to CSV files whose filenames depend on a variable parameter of the search condition. For that, I need to access the spider class's parameters.
So, I have three questions:
How can I access the spider class's parameters?
What is the best way to create each CSV file? The trigger is a request issued from the parse function for a new search result.
logger = logging.getLogger(__name__) is not working in pipelines.py.
How can I print that information?
Below is my logging code style:
logger.log(logging.INFO, '\n======= %s ========\n', filename)
I have searched Google many times but couldn't find a solution.
I tried to use the from_crawler function, but couldn't find a fitting example.
Scrapy 1.6.0
Python 3.7.3
OS: Windows 7 / 32-bit
Code:
class CensusGetitemSpider(scrapy.Spider):
    name = 'census_getitem'
    startmonth = 1
    filename = None

    def parse(self, response):
        for x in data:
            self.filename = str(self.startmonth + 1)
            ...
            yield item
            yield scrapy.Request(link, callback=self.parse)
You can access spider class and instance attributes from pipelines.py through the spider argument passed to most pipeline methods.
For example:
def open_spider(self, spider):
    self.filename = spider.name
You can see more about item pipelines here https://docs.scrapy.org/en/latest/topics/item-pipeline.html
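Putting it together for the per-file CSV part, here is a minimal pipeline sketch; it assumes the spider exposes a filename attribute (as CensusGetitemSpider above does) and that items behave like dicts, so adjust the field handling to your items:
# pipelines.py (sketch)
import csv

class PerSearchCsvPipeline:
    def open_spider(self, spider):
        self.files = {}                      # filename -> (file handle, csv writer)

    def process_item(self, item, spider):
        name = '%s.csv' % spider.filename    # built from the spider attribute
        if name not in self.files:
            f = open(name, 'w', newline='')
            self.files[name] = (f, csv.writer(f))
        self.files[name][1].writerow(item.values())
        spider.logger.info('======= %s ========', name)   # logging via the spider
        return item

    def close_spider(self, spider):
        for f, _ in self.files.values():
            f.close()
Remember to enable the pipeline in ITEM_PIPELINES; the class name here is just an example.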
You can save it directly from the command line, just define a filename:
scrapy crawl yourspider -o output.csv
But if you really need it to be set from the spider, you can use a custom setting per spider, for example:
class YourSpider(scrapy.Spider):
    name = 'yourspider'
    start_urls = ['http://www.yoursite.com']

    custom_settings = {
        'FEED_URI': 'output.csv',
        'FEED_FORMAT': 'csv',
    }
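If the filename only needs to vary per spider (not per request), note that FEED_URI also accepts placeholders such as %(name)s and %(time)s, so a sketch like this avoids hard-coding the name:
custom_settings = {
    # %(name)s and %(time)s are expanded by the feed exporter when the feed is created
    'FEED_URI': 'output/%(name)s_%(time)s.csv',
    'FEED_FORMAT': 'csv',
}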
Use spider.logger.info('Your message')
I am new to Python. I want to add my own instance variables variable_1 and variable_2 to a Scrapy spider class. The following code works fine:
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    url = 'url example'        # this class variable works fine
    variable_1 = 'info_1'      # this class variable works fine
    variable_2 = 'info_2'      # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# start running the class
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()
But I want to make these instance variables, so that I do not have to modify their values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the spider, expecting to run it with SpiderTest1(url, variable_1, variable_2). The following is new code that I expected to behave like the code above, but it is not working:
class SpiderTest1(scrapy.Spider):
    name = 'main run'

    # the following __init__ is the new change, but it is not working
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# input values for the variables
url = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the class
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # it seems this code doesn't work
process.start()
It results in:
TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'
Thanks to anyone who can tell me how to achieve this.
Thanks, my code works fine with your approach.
But I found something slightly different from Common Practices.
This is our code:
process.crawl(SpiderTest1, url, variable_1, variable_2)
This is from Common Practices:
process.crawl('followall', domain='scrapinghub.com')
The first argument, as you suggest, uses the class name SpiderTest1, but the other one uses the string 'followall'.
What does 'followall' refer to? Does it refer to the file testspiders/testspiders/spiders/followall.py, or just the class attribute name = 'followall' inside followall.py?
I am asking because I am still confused about when I should pass a string and when I should pass a class name to run a Scrapy spider.
Thanks.
According to Common Practices and API documentation, you should call the crawl method like this to pass arguments to the spider constructor:
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
UPDATE:
The documentation also mentions this form of running the spider:
process.crawl('followall', domain='scrapinghub.com')
In this case, 'followall' is the name of the spider in the project (i.e. the value of the name attribute of the spider class). In your specific case, where you define the spider as follows:
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...
you would use this code to run your spider using spider name:
process = CrawlerProcess(get_project_settings())
process.crawl('main run', url, variable_1, variable_2)
process.start()
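For completeness, here is a hedged sketch of an __init__ that works with both calling forms; the keyword defaults are illustrative, and the super() call keeps Scrapy's own initialization intact:
class SpiderTest1(scrapy.Spider):
    name = 'main run'

    def __init__(self, url=None, variable_1=None, variable_2=None, *args, **kwargs):
        super().__init__(*args, **kwargs)   # let Scrapy finish its own setup
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2
With this in place, process.crawl(SpiderTest1, url=url, variable_1=variable_1, variable_2=variable_2) reaches the same constructor through keyword arguments.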
I've created a test spider. This spider gets one object which has url and xpath attributes. It scrapes the URL and then populates the self.result dictionary accordingly, so self.result can be {'success': True, 'httpresponse': 200} or {'success': False, 'httpresponse': 404}, etc.
The problem is that I don't know how to access spider.result, since there is no spider object.
..
def test(self):
    from scrapy.crawler import CrawlerProcess
    ts = TestSpider
    process = CrawlerProcess({...})
    process.crawl(ts, [object, ])
    process.start()
    print ts.result
I tried:
def test(self):
    from scrapy.crawler import CrawlerProcess
    ts = TestSpider(object)
    process = CrawlerProcess({...})
    process.crawl(ts)
    process.start()
    print ts.result
But it says that crawl needs 2 arguments.
Do you know how to do that? I don't want to save results into the file or db.
That's how you call crawl:
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider, arg1=val1, arg2=val2)
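If you also need to read spider.result back after the crawl (the original problem), one hedged approach is to hook the spider_closed signal; everything here except TestSpider and its url/xpath object is standard Scrapy API, and the handler name is just an illustration:
from scrapy import signals
from scrapy.crawler import CrawlerProcess

results = {}

def on_spider_closed(spider):
    # copy what we need before the process shuts down
    results.update(spider.result)

process = CrawlerProcess()
crawler = process.create_crawler(TestSpider)
crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
process.crawl(crawler, [object, ])   # same argument as in the question
process.start()
print(results)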
This is my code:
class Test(Spider):
    self.settings.overrides['JOBDIR'] = "seen"
I got:
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 46, in settings
return self.crawler.settings
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 41, in crawler
assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
AssertionError: Spider not bounded to any crawler
I am extending Spider, and I am not using CrawlSpider because I don't have links or rules to follow.
I am guessing that my problem is that I didn't load the settings properly, and I need your help, please.
In order to change the settings in the spider you can:
class TestSpider(Spider):
    def set_crawler(self, crawler):
        super(TestSpider, self).set_crawler(crawler)
        crawler.settings.set('JOBDIR', 'seen')

    # rest of spider code
According to the documentation, individual settings for each spider can be set as the class attribute custom_settings, which should be a dictionary. In your case it will look like this:
class TestSpider(Spider):
    custom_settings = {'JOBDIR': 'seen'}

    # The rest of the spider goes here
Not sure if this will work with early versions of scrapy.
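As a quick sanity check (assuming a Scrapy version where the spider exposes its merged settings as self.settings and has a built-in logger), you can read the value back from inside the spider:
class TestSpider(Spider):
    name = 'test'
    custom_settings = {'JOBDIR': 'seen'}

    def start_requests(self):
        # self.settings is the final settings object, with custom_settings applied
        self.logger.info('JOBDIR is %s', self.settings.get('JOBDIR'))
        return []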
I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes; these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.
WARNING: This answer was for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.
in mybot/settings.py:
SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
in mybot/spidermanager.py:
from mybot.spider import MyParametrizedSpider

class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # Put here any code you want to run before the spider is closed
        pass

    def _get_spider_info(self, name):
        # query your backend (maybe a sqldb) using `name` as primary key,
        # and return start_urls, extra_domains and regexes
        ...
        return (start_urls, extra_domains, regexes)
and now your custom spider class, in mybot/spider.py:
from scrapy.spider import BaseSpider

class MyParametrizedSpider(BaseSpider):
    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        self.regexes = regexes

    def parse(self, response):
        ...
Notes:
You can extend CrawlSpider too if you want to take advantage of its Rules system
To run a spider use ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key to retrieve more spider info from the backend system.
Since this solution overrides the default SpiderManager, coding a classic spider (a Python module per spider) doesn't work, but I think this is not an issue for you. More info on the default spider manager: TwistedPluginSpiderManager.
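For illustration only, _get_spider_info could be backed by something as small as a SQLite table; the table name, schema and JSON-encoded columns below are assumptions, not anything Scrapy provides:
import json
import sqlite3

def _get_spider_info(self, name):
    # hypothetical table: spiders(name TEXT PRIMARY KEY,
    #                             start_urls TEXT, extra_domains TEXT, regexes TEXT)
    # where each list column is stored as a JSON string
    conn = sqlite3.connect('spiders.db')
    row = conn.execute(
        'SELECT start_urls, extra_domains, regexes FROM spiders WHERE name = ?',
        (name,),
    ).fetchone()
    conn.close()
    return tuple(json.loads(col) for col in row)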
What you need is to dynamically create spider classes, subclassing your favorite generic spider class supplied by Scrapy (CrawlSpider subclasses with your rules added, XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc.), as you get or deduce them from your GUI (or config file, or whatever).
Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:
from scrapy import spider

def makespider(domain_name, start_urls,
               basecls=spider.BaseSpider):
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls})

allspiders = []
for domain, urls in listofdomainurlpairs:
    allspiders.append(makespider(domain, urls))
This gives you a list of very bare-bone spider classes -- you'll probably want to add parse methods to them before you instantiate them. Season to taste...;-).
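For example, the factory could also attach a minimal parse method so the generated classes are immediately usable; the body below is just a placeholder to be replaced with real extraction logic:
def makespider(domain_name, start_urls, basecls=spider.BaseSpider):
    def parse(self, response):
        # placeholder: log the visit and return nothing; season to taste
        self.log('visited %s' % response.url)
        return []
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls,
                 'parse': parse})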
Shameless self-promotion of domo! You'll need to instantiate the crawler as given in the examples, for your project.
You'll also need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding the settings at runtime when the configuration changes.
Now it is extremely easy to configure Scrapy for these purposes:
For the first URLs to visit, you can pass them as attributes on the spider call with -a, and use the start_requests function to set up how the spider starts.
You don't need to set up the allowed_domains variable for the spiders. If you don't include that class variable, the spider will allow every domain.
It should end up as something like:
class MySpider(Spider):
    name = "myspider"

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse)

    def parse(self, response):
        ...
and you should call it with:
scrapy crawl myspider -a start_url="http://example.com"
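The allowed URL pattern can be passed the same way; the allow_pattern attribute below is just an illustration of filtering links against a user-supplied regex:
import re

from scrapy import Request, Spider

class MySpider(Spider):
    name = "myspider"

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse)

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            # only follow links matching the user-supplied pattern
            if re.search(getattr(self, 'allow_pattern', ''), url):
                yield Request(url, callback=self.parse)
and call it with something like:
scrapy crawl myspider -a start_url="http://example.com" -a allow_pattern=".*/articles/.*"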