I have the following code:
# FirstSpider.py
class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.basesite.com']
    next_urls = []

    def parse(self, response):
        for url in response.css('bunch > of > css > here'):
            self.next_urls.append(url.css('more > css > here'))
            l = Loader(item=Item(), selector=url.css('more > css'))
            l.add_css('add', 'more > css')
            ...
            ...
            yield l.load_item()

        for url in self.next_urls:
            new_urls = self.start_urls[0] + url
            yield scrapy.Request(new_urls, callback=SecondSpider.parse_url)
# SecondSpider.py
class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['https://www.basesite.com']

    def parse_url(self):
        """Parse team data."""
        return self
        # self is a HtmlResponse, not a 'response' object

    def parse(self, response):
        """Parse all."""
        summary = self.parse_url(response)
        return summary

# ThirdSpider.py
class ThirdSpider(scrapy.Spider):
    # take links from second spider, continue:
I want to be able to pass the URL scraped in Spider 1 to Spider 2 (in a different script). I'm curious as to why, when I do, the 'response' is an HtmlResponse and not a Response object (when doing something similar with a method in the same class as Spider 1, I don't have this issue).
What am I missing here? How do I just pass the original response(s) to the second spider (and from the second on to the third, etc.)?
You could use Redis as a shared resource between all spiders: https://github.com/rmax/scrapy-redis
Run all N spiders (without closing on idle state), so each of them is connected to the same Redis instance and waits for tasks (URL, request headers) from there.
As a side effect, push the task data to Redis from spider X under a specific key (spider Y's name).
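A rough sketch of that hand-off, assuming scrapy-redis and redis-py are installed (the key name second:start_urls and the item fields are illustrative choices, not anything from the question):

# second_spider.py -- idles and consumes URLs pushed into a Redis list
from scrapy_redis.spiders import RedisSpider


class SecondSpider(RedisSpider):
    name = 'second'
    redis_key = 'second:start_urls'  # the Redis list this spider polls for work

    def parse(self, response):
        # response is a normal Response for the URL taken from Redis
        yield {'url': response.url}


# In the producer spider (e.g. FirstSpider), push URLs onto that list
# instead of yielding Requests with another spider's callback:
import redis

r = redis.Redis(host='localhost', port=6379)
r.lpush('second:start_urls', 'https://www.basesite.com/some/path')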
What about using inheritance? The "parse" function names should be different.
If your first spider inherits from the second, it will be able to set the callback to self.parse_function_spider2.
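For example, a rough sketch of that inheritance idea (parse_url_spider2 is just an illustrative stand-in for whatever the second spider's callback is actually named):

import scrapy


class SecondSpider(scrapy.Spider):
    name = 'second'

    def parse_url_spider2(self, response):
        """Second spider's parsing logic for the detail page."""
        yield {'url': response.url}


class FirstSpider(SecondSpider):
    name = 'first'
    start_urls = ['https://www.basesite.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # FirstSpider inherits parse_url_spider2, so it can be used
            # as a bound callback like any other method of this spider.
            yield response.follow(href, callback=self.parse_url_spider2)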
Related
I previously used some code like this to visit a page and change the url around a bit to generate a second request which gets passed to a second parse method:
from scrapy.http import Request

def parse_final_page(self, response):
    # do scraping here:

def get_next_page(self, response, new_url):
    req = Request(
        url=new_url,
        callback=self.parse_final_page,
    )
    yield req

def parse(self, response):
    if 'substring' in response.url:
        new_url = 'some_new_url'
        yield from self.get_next_page(response, new_url)
    else:
        pass
        # continue..
        # scraping items
        # yield
This snippet is pretty old (2 years or so) and I'm currently using Scrapy 2.2, although I'm not sure if that's relevant. Note that get_next_page gets called, but parse_final_page never runs, which I don't get...
Why is parse_final_page not being called? Or, more to the point, is there an easier way for me to just generate a new request on the fly? I would prefer not to use a middleware or change start_urls in this context.
1 - "Why is parse_final_page not being called?"
Your script works fine for me on Scrapy v2.2.1, so it's probably an issue with the specific request you're trying to make.
2 - "...is there an easier way for me to just generate a new request on the fly?"
You could try this variation where you return the request from the get_next_page callback, instead of yielding it (note I removed the from keyword and did not send the response object to the callback):
def parse(self, response):
    if 'substring' in response.url:
        new_url = ''
        yield self.get_next_page(new_url)
    else:
        # continue..
        # scraping items
        # yield

def get_next_page(self, new_url):
    req = Request(
        url=new_url,
        callback=self.parse_final_page,
    )
    return req

def parse_final_page(self, response):
    # do scraping here:
I am trying to parse a list from a javascript website. When I run it, it only gives me back one entry on each column and then the spider shuts down. I have already set up my middleware settings. I am not sure what is going wrong. Thanks in advance!
import scrapy
from scrapy_splash import SplashRequest


class MalrusSpider(scrapy.Spider):
    name = 'malrus'
    allowed_domains = ['backgroundscreeninginrussia.com']
    start_urls = ['http://www.backgroundscreeninginrussia.com/publications/new-citizens-of-malta-since-january-2015-till-december-2017/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')

    def parse(self, response):
        russians = response.xpath('//table[@id="tablepress-8"]')
        for russian in russians:
            yield {'name': russian.xpath('//*[@class="column-1"]/text()').extract_first(),
                   'source': russian.xpath('//*[@class="column-2"]/text()').extract_first()}

        script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(0.3)
            button = splash:select("a[class=paginate_button next] a")
            splash:set_viewport_full()
            splash:wait(0.1)
            button:mouse_click()
            splash:wait(1)
            return {url = splash:url(),
                    html = splash:html()}
        end"""

        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})
The .extract_first() (now .get()) method you used will always return the first result. It's not an iterator, so there is no point in calling it several times. You should try the .getall() method instead. That will be something like:
names = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-1"]/text()').getall()
sources = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-2"]/text()').getall()
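If you then want one dict per table row instead of two flat lists, a small sketch that reuses the names and sources lists from above would be:

# Pair each name with its source, yielding one dict per table row.
for name, source in zip(names, sources):
    yield {'name': name, 'source': source}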
I am new to Python. I want to create my own instance variables variable_1 and variable_2 in a Scrapy spider class. The following code works fine.
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    url = 'url example'       # this class variable is working fine
    variable_1 = 'info_1'     # this class variable is working fine
    variable_2 = 'info_2'     # this class variable is working fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()
But I want to make them instance variables, so that I do not have to modify the variables' values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the Scrapy spider, and I expected to be able to run it with SpiderTest1(url, variable_1, variable_2). The following is the new code that I expected to behave like the code above, but it is not working:
class SpiderTest1(scrapy.Spider):
    name = 'main run'

    # the following __init__ is the new change, but it is not working
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# input values for the variables
url = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # this line doesn't seem to work
process.start()
It results in:
TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'
Thanks to anyone who can tell me how to achieve this.
Thanks, my code works fine with your approach.
But I find it slightly different from Common Practices.
This is our code:
process.crawl(SpiderTest1, url, variable_1, variable_2)
This is from Common Practices:
process.crawl('followall', domain='scrapinghub.com')
The first argument, as you suggest, uses the class name SpiderTest1, but the other one uses the string 'followall'.
What does 'followall' refer to?
Does it refer to the file testspiders/testspiders/spiders/followall.py, or just to the class attribute name = 'followall' inside followall.py?
I am asking because I am still confused about when I should pass a spider's name string and when the class itself.
Thanks.
According to Common Practices and API documentation, you should call the crawl method like this to pass arguments to the spider constructor:
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
UPDATE:
The documentation also mentions this form of running the spider:
process.crawl('followall', domain='scrapinghub.com')
In this case, 'followall' is the name of the spider in the project (i.e. the value of the name attribute of the spider class). In your specific case, where you define the spider as follows:
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...
you would use this code to run your spider using spider name:
process = CrawlerProcess(get_project_settings())
process.crawl('main run', url, variable_1, variable_2)
process.start()
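For completeness, crawl() also forwards keyword arguments to the spider's __init__, so with the __init__ signature from the question this (untested) variant should be equivalent:

process = CrawlerProcess(get_project_settings())
process.crawl('main run', url=url, variable_1=variable_1, variable_2=variable_2)
process.start()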
I'm interested in using Scrapy-Redis to store scraped items in Redis. In particular, the Redis-based request duplicates filter seems like a useful feature.
To start off, I adapted the spider at https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider as follows:
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
                       'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
                       'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}}

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
where I generated the project using scrapy startproject tutorial at the command line and defined QuoteItem in items.py as
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Basically, I've applied the settings from the "Usage" section of the README as per-spider settings and made the spider yield an Item object instead of a regular Python dictionary. (I figured this would be necessary to trigger the Item Pipeline.)
Now, if I crawl the spider using scrapy crawl quotes from the command line and then do redis-cli, I see a quotes:items key:
127.0.0.1:6379> keys *
1) "quotes:items"
which is a list of length 20:
127.0.0.1:6379> llen quotes:items
(integer) 20
If I run scrapy crawl quotes again, the length of the list doubles to 40:
127.0.0.1:6379> llen quotes:items
(integer) 40
However, I would expect the length of quotes:items to still be 20, since I have simply re-scraped the same pages. Am I doing something wrong here?
Scrapy-redis doesn't filter duplicate items automatically.
The (requests) dupefilter is about the requests in a crawl. What you want seems to be something similar to the deltafetch middleware: https://github.com/scrapy-plugins/scrapy-deltafetch
You would need to adapt deltafetch to work with distributed storage; perhaps Redis' bitmap feature will fit this case.
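As a rough illustration, item-level deduplication could also be done in a custom item pipeline backed by a Redis set (this is not part of scrapy-redis; the class name, the key name, and the use of the text field as a fingerprint are only assumptions for the sketch):

import redis
from scrapy.exceptions import DropItem


class RedisItemDedupPipeline:
    """Drop items whose fingerprint is already stored in a Redis set."""

    def open_spider(self, spider):
        self.server = redis.Redis(host='localhost', port=6379)
        self.key = '%s:items_seen' % spider.name

    def process_item(self, item, spider):
        fingerprint = item['text']  # illustrative: the quote text as a crude fingerprint
        # SADD returns 0 when the member was already present in the set.
        if self.server.sadd(self.key, fingerprint) == 0:
            raise DropItem('Duplicate item: %r' % fingerprint)
        return item

Registering it in ITEM_PIPELINES with a value below RedisPipeline's 300 would drop repeats before they are stored.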
Here is how I fixed the problem in the end. First of all, as pointed out to me in a separate question, How to implement a custom dupefilter in Scrapy?, using the start_urls class variable results in an implementation of start_requests in which the yielded Request objects have dont_filter=True. To disable this and use the default dont_filter=False instead, I implemented start_requests directly:
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
        'DUPEFILTER_CLASS': 'tutorial.dupefilter.RedisDupeFilter',
        'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}
    }

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
Secondly, as pointed out by Rolando, the fingerprints aren't by default persisted across different crawls. To implement this, I subclassed Scrapy-Redis' RFPDupeFilter class:
import scrapy_redis.dupefilter
from scrapy_redis.connection import get_redis_from_settings


class RedisDupeFilter(scrapy_redis.dupefilter.RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        key = "URLs_seen"  # Use a fixed key instead of one containing a timestamp
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server=server, key=key, debug=debug)

    def request_seen(self, request):
        added = self.server.sadd(self.key, request.url)
        return added == 0

    def clear(self):
        pass  # Don't delete the key from Redis
The main differences are (1) the key is set to a fixed value (not one containing a time stamp) and (2) the clear method, which in Scrapy-Redis' implementation deletes the key from Redis, is effectively disabled.
Now, when I run scrapy crawl quotes the second time, I see the expected log output
2017-05-05 15:13:46 [scrapy_redis.dupefilter] DEBUG: Filtered duplicate request <GET http://quotes.toscrape.com/page/1/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
and no items are scraped.
My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have GUI that takes parameters like domain, keywords, tag names, etc. and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of scrapy, by either overriding the spider manager class or by dynamically creating a spider. Which method is preferred and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I pared it down, so hopefully I didn't remove anything crucial to understanding it.
class MySpider(CrawlSpider):
    name = 'MySpider'
    allowed_domains = ['somedomain.com', 'sub.somedomain.com']
    start_urls = ['http://www.somedomain.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
        Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
    )

    def parse_item(self, response):
        contentTags = []
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll('p', itemprop="myProp")
        for contentTag in contentTags:
            matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)
        pass
You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so:
>>> a = open("test.py")
>>> from compiler import compile
>>> d = compile(a.read(), 'spider.py', 'exec')
>>> eval(d)
>>> MySpider
<class '__main__.MySpider'>
>>> print MySpider.start_urls
['http://www.somedomain.com']
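Note that the compiler module only exists in Python 2; a rough Python 3 sketch of the same idea (assuming the generated source lives in test.py and defines MySpider) would use the built-in compile/exec pair:

# Compile and execute the generated spider source in a scratch namespace.
namespace = {}
with open("test.py") as f:
    code = compile(f.read(), "test.py", "exec")
exec(code, namespace)

MySpider = namespace["MySpider"]   # the class defined in the generated file
print(MySpider.start_urls)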
I use the Scrapy Extensions approach to extend the Spider class to a class named Masterspider that includes a generic parser.
Below is the very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a JavaScript engine (such as Selenium or BeautifulSoup) as soon as you start working on pages using AJAX, plus a lot of additional code to manage differences between sites (scrape based on column title, handle relative vs. absolute URLs, manage different kinds of data containers, etc.).
What is interesting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit, but I never had to. The MasterSpider class checks whether certain methods have been defined (e.g. parse_start, next_url_parser...) on the site-specific spider class to allow the handling of site specifics: sending a form, constructing the next_url request from elements on the page, etc.
As I'm scraping very different sites, there are always specifics to manage. That's why I prefer to keep a class for each scraped site, so that I can write specific methods to handle them (pre-/post-processing other than pipelines, request generators, ...).
masterspider/sitespider/settings.py
EXTENSIONS = {
    'masterspider.masterspider.MasterSpider': 500
}
masterspider/masterspider/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem


class MasterSpider(Spider):

    def start_requests(self):
        if hasattr(self, 'parse_start'):  # First page requiring a specific parser
            fcallback = self.parse_start
        else:
            fcallback = self.parse
        return [ Request(self.spd['start_url'],
                         callback=fcallback,
                         meta={'itemfields': {}}) ]

    def parse(self, response):
        sel = Selector(response)
        lines = sel.xpath(self.spd['xlines'])
        # ...
        for line in lines:
            item = genspiderItem(response.meta['itemfields'])
            # ...
            # Get request_url of detailed page and scrape basic item info
            # ...
            yield Request(request_url,
                          callback=self.parse_item,
                          meta={'item': item, 'itemfields': response.meta['itemfields']})

        for next_url in sel.xpath(self.spd['xnext_url']).extract():
            if hasattr(self, 'next_url_parser'):  # Need to process the next page URL before?
                yield self.next_url_parser(next_url, response)
            else:
                yield Request(
                    request_url,
                    callback=self.parse,
                    meta=response.meta)

    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
            item[itemname] = "\n".join(sel.xpath(xitemname).extract())
        return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider


class targetsiteSpider(MasterSpider):
    name = "targetsite"
    allowed_domains = ["www.targetsite.com"]
    spd = {
        'start_url': "http://www.targetsite.com/startpage",  # Start page
        'xlines': "//td[something...]",
        'xnext_url': "//a[contains(@href,'something?page=')]/@href",  # Next pages
        'x_ondetailpage': {
            "itemprop123": u"id('someid')//text()"
        }
    }

    # def next_url_parser(self, next_url, response):  # OPTIONAL next_url regexp pre-processor
    #     ...
Instead of having the variables name, allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from it passing the necessary arguments, and set name, allowed_domains, etc. per object.
MyProp and keywords also should be set within your __init__. So in the end you should have something like below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
class MySpider(CrawlSpider):

    def __init__(self, allowed_domains=[], start_urls=[],
                 rules=[], findtag='', finditemprop='', keywords='', **kwargs):
        CrawlSpider.__init__(self, **kwargs)
        self.allowed_domains = allowed_domains
        self.start_urls = start_urls
        self.rules = rules
        self.findtag = findtag
        self.finditemprop = finditemprop
        self.keywords = keywords

    def parse_item(self, response):
        contentTags = []
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
        for contentTag in contentTags:
            matchedResult = re.search(self.keywords, contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)
I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just override the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own __init__ function, or I pull that data in from a database where I have previously saved a list of links to be appended to start_urls.
As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
class MySpider(scrapy.Spider):
    """
    This spider will try to crawl whatever is passed in `start_urls` which
    should be a comma-separated string of fully qualified URIs.

    Example: start_urls=http://localhost,http://example.com
    """
    def __init__(self, name=None, **kwargs):
        if 'start_urls' in kwargs:
            self.start_urls = kwargs.pop('start_urls').split(',')
        super(MySpider, self).__init__(name, **kwargs)