I am trying to use the Rule class to go to the next page in my crawler. Here is my code
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import GDReview
class GdSpider(CrawlSpider):
name = "gd"
allowed_domains = ["glassdoor.com"]
start_urls = [
"http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
]
rules = (
# Extract next links and parse them with the spider's method parse_item
Rule(SgmlLinkExtractor(restrict_xpaths=('//li[#class="next"]/a/#href',)), follow= True)
)
def parse(self, response):
company_name = response.xpath('//*[#id="EIHdrModule"]/div[3]/div[2]/p/text()').extract()
'''loop over every review in this page'''
for sel in response.xpath('//*[#id="EmployerReviews"]/ol/li'):
review = Item()
review['company_name'] = company_name
review['id'] = str(sel.xpath('#id').extract()[0]).split('_')[1] #sel.xpath('#id/text()').extract()
review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()
yield review
My question is about the rules section. In this rule, the link extracted doesn't contain the domain name. For example, it will return something like
"/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
How can I make sure that my crawler will append the domain to the returned link?
Thanks
You can be sure since this is the default behavior of link extractors in Scrapy (source code).
Also, the restrict_xpaths argument should not point to #href attribute, but instead it should either point to a elements or containers having a elements as descendants. Plus, restrict_xpaths can be defined as string.
In other words, replace:
restrict_xpaths=('//li[#class="next"]/a/#href',)
with:
restrict_xpaths='//li[#class="next"]/a'
Besides, you need to switch to LxmlLinkExtractor from SgmlLinkExtractor:
SGMLParser based link extractors are unmantained and its usage is
discouraged. It is recommended to migrate to LxmlLinkExtractor if you
are still using SgmlLinkExtractor.
Personally, I usually use the LinkExractor shortcut to LxmlLinkExtractor:
from scrapy.contrib.linkextractors import LinkExtractor
To summarize, this is what I would have in rules:
rules = [
Rule(LinkExtractor(restrict_xpaths='//li[#class="next"]/a'), follow=True)
]
Related
I want to extract the link I want only on the first page, and I set DEPTH_LIMIT to 1 in the crawler, and the parameter rule() in the matching rule follows=False, but I still initiated multiple requests, I I don't know why. I hope someone can answer my doubts.
Thanks in advance.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.spiders import CrawlSpider,Rule
from scrapy.linkextractors import LinkExtractor
class OfficialSpider(CrawlSpider):
name = 'official'
allowed_domains = ['news.chd.edu.cn','www.chd.edu.cn']
start_urls = ['http://www.chd.edu.cn']
custom_settings = {
'DOWNLOAD_DELAY':0,
'DEPTH_LIMIT':1,
}
rules = (
# Rule(LinkExtractor(allow=('http://news.chd.edu.cn/',)),callback='parse_news',follow=False),
Rule(LinkExtractor(allow=('http://www.chd.edu.cn/')),callback='parse_item',follow=False),
Rule(LinkExtractor(allow=("",)),follow=False),
)
def parse_news(self,response):
print(response.url)
return {}
def parse_item(self,response):
self.log("item链接:")
self.log(response.url)
output:
enter image description here
From the docs:
follow is a boolean which specifies if links should be followed from each response extracted with this rule.
This means that follow=False will only stop the crawler from following links found when processing the response created by this rule, it can't affect those found when parsing the result of start_urls.
There would be no point to the follow argument disabling a rule completely; if you don't want to use a rule, why would you create it at all?
I've written a basic Scrapy spider to crawl a website which seems to run fine other than the fact it doesn't want to stop, i.e. it keeps revisiting the same urls and returning the same content - I always end up having to stop it. I suspect it's going over the same urls over and over again. Is there a rule that will stop this? Or is there something else I have to do? Maybe middleware?
The Spider is as below:
class LsbuSpider(CrawlSpider):
name = "lsbu6"
allowed_domains = ["lsbu.ac.uk"]
start_urls = [
"http://www.lsbu.ac.uk"
]
rules = [
Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
]
def parse_item(self, response):
join = Join()
sel = Selector(response)
bits = sel.xpath('//*')
scraped_bits = []
for bit in bits:
scraped_bit = LsbuItem()
scraped_bit['title'] = scraped_bit.xpath('//title/text()').extract()
scraped_bit['desc'] = join(bit.xpath('//*[#id="main_content_main_column"]//text()').extract()).strip()
scraped_bits.append(scraped_bit)
return scraped_bits
My settings.py file looks like this
BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'
Any help/ guidance/ instruction on stopping it running continuously would be greatly appreciated...
As I'm a newbie to this; any comments on tidying the code up would also be helpful (or links to good instruction).
Thanks...
The DupeFilter is enabled by default: http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class and it's based on the request url.
I tried a simplified version of your spider on a new vanilla scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say you have something wrong on your settings or on your scrapy version. I'd suggest you to upgrade to scrapy 1.0, just to be sure :)
$ pip install scrapy --pre
The simplified spider I tested:
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field
from scrapy.spiders import Rule
class LsbuItem(Item):
title = Field()
url = Field()
class LsbuSpider(CrawlSpider):
name = "lsbu6"
allowed_domains = ["lsbu.ac.uk"]
start_urls = [
"http://www.lsbu.ac.uk"
]
rules = [
Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
]
def parse_item(self, response):
scraped_bit = LsbuItem()
scraped_bit['url'] = response.url
yield scraped_bit
Your design makes the crawl go in circles. For examples, there is a page http://www.lsbu.ac.uk/business-and-partners/business, which when opened contains the link to "http://www.lsbu.ac.uk/business-and-partners/partners, and that one contains again the link to the first one. Thus, you go in circles indefinitely.
In order to overcome this, you need to create better rules, eliminating the circular references.
And also, you have two identical rules defined, which is not needed. If you want the follow you can always put it on the same rule, you don't need a new rule.
This is driving me nuts. It drove me to consolidate and simplify a lot of code, but I just can't fix the problem. Here is an example of two spiders I wrote. The top one has a memory leak that causes the memory to slowly expand until its full.
They are almost Identical and they use the same items and everything else outside of the spider so I do not think there is anything in the rest of my code to blame. I have also isolated bits of code here and there, tried deleting variables towards the end. I've looked over the scrapy docs and I am still stumped. Anyone have any magic to work?
import scrapy
from wordscrape.items import WordScrapeItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json
class EnglishWikiSpider(CrawlSpider):
name='englishwiki'
allowed_domains = ['en.wikipedia.org']
start_urls = [
'http://en.wikipedia.org/wiki/'
]
rules = (
Rule(SgmlLinkExtractor(allow=('/wiki/', )), callback='parse_it', follow=True),
)
def parse_it(self, response):
the_item = WordScrapeItem()
# This takes all the text that is in that div and extracts it, only the text, not html tags (see: //text())
# if it meets the conditions of my regex
english_text = response.xpath('//*[#id="mw-content-text"]//text()').re(ur'[a-zA-Z\'-]+')
english_dict = {}
for i in english_text:
if len(i) > 1:
word = i.lower()
if word in english_dict:
english_dict[word] += 1
else:
english_dict[word] = 1
# Dump into json string and put it in the word item, it will be ['word': {<<jsondict>>}, 'site' : url, ...]
jsondump = json.dumps(english_dict)
the_item['word'] = jsondump
the_item['site'] = response.url
return the_item
The second, and stable spider:
import scrapy
from wordscrape.items import WordScrapeItem
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json
class NaverNewsSpider(CrawlSpider):
name='navernews'
allowed_domains = ['news.naver.com']
start_urls = [
'http://news.naver.com',
'http://news.naver.com/main/read.nhn?oid=001&sid1=102&aid=0007354749&mid=shm&cid=428288&mode=LSD&nh=20150114125510',
]
rules = (
Rule(SgmlLinkExtractor(allow=('main/read\.nhn', )), callback='parse_it', follow=True),
)
def parse_it(self, response):
the_item = WordScrapeItem()
# gets all the text from the listed div and then applies the regex to find all word objects in hanul range
hangul_syllables = response.xpath('//*[#id="articleBodyContents"]//text()').re(ur'[\uac00-\ud7af]+')
# Go through all hangul syllables found and adds to value or adds key
hangul_dict = {}
for i in hangul_syllables:
if i in hangul_dict:
hangul_dict[i] += 1
else:
hangul_dict[i] = 1
jsondump = json.dumps(hangul_dict)
the_item['word'] = jsondump
the_item['site'] = response.url
return the_item
I think Jepio is right in his comment. I think the spidder is finding too many links to follow and therefore having to store them all in the interim perdiod.
EDIT: So, the problem is that it is storing all of those links in memory instead of on disk and it eventually fills up all my memory. The solution was to run scrapy with a job directory, and that forces them to be stored on disk where there is plenty of space.
$ scrapy crawl spider -s JOBDIR=somedirname
My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have GUI that takes parameters like domain, keywords, tag names, etc. and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of scrapy, by either overriding the spider manager class or by dynamically creating a spider. Which method is preferred and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I paired it down so hopefully didn't remove anything crucial to understand it.
class MySpider(CrawlSpider):
name = 'MySpider'
allowed_domains = ['somedomain.com', 'sub.somedomain.com']
start_urls = ['http://www.somedomain.com']
rules = (
Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
)
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll('p', itemprop="myProp")
for contentTag in contentTags:
matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)
pass
You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so:
a = open("test.py")
from compiler import compile
d = compile(a.read(), 'spider.py', 'exec')
eval(d)
MySpider
<class '__main__.MySpider'>
print MySpider.start_urls
['http://www.somedomain.com']
I use the Scrapy Extensions approach to extend the Spider class to a class named Masterspider that includes a generic parser.
Below is the very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a Javascript engine (such as Selenium or BeautifulSoup) a as soon as you start working on pages using AJAX. And a lot of additional code to manage differences between sites (scrap based on column title, handle relative vs long URL, manage different kind of data containers, etc...).
What is interresting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit but I never had to. The Masterspider class checks if some methods have been created (eg. parser_start, next_url_parser...) under the site specific spider class to allow the management of specificies: send a form, construct the next_url request from elements in the page, etc.
As I'm scraping very different sites, there's always specificities to manage. That's why I prefer to keep a class for each scraped site so that I can write some specific methods to handle it (pre-/post-processing except PipeLines, Request generators...).
masterspider/sitespider/settings.py
EXTENSIONS = {
'masterspider.masterspider.MasterSpider': 500
}
masterspider/masterspdier/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
class MasterSpider(Spider):
def start_requests(self):
if hasattr(self,'parse_start'): # First page requiring a specific parser
fcallback = self.parse_start
else:
fcallback = self.parse
return [ Request(self.spd['start_url'],
callback=fcallback,
meta={'itemfields': {}}) ]
def parse(self, response):
sel = Selector(response)
lines = sel.xpath(self.spd['xlines'])
# ...
for line in lines:
item = genspiderItem(response.meta['itemfields'])
# ...
# Get request_url of detailed page and scrap basic item info
# ...
yield Request(request_url,
callback=self.parse_item,
meta={'item':item, 'itemfields':response.meta['itemfields']})
for next_url in sel.xpath(self.spd['xnext_url']).extract():
if hasattr(self,'next_url_parser'): # Need to process the next page URL before?
yield self.next_url_parser(next_url, response)
else:
yield Request(
request_url,
callback=self.parse,
meta=response.meta)
def parse_item(self, response):
sel = Selector(response)
item = response.meta['item']
for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
item[itemname] = "\n".join(sel.xpath(xitemname).extract())
return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider
class targetsiteSpider(MasterSpider):
name = "targetsite"
allowed_domains = ["www.targetsite.com"]
spd = {
'start_url' : "http://www.targetsite.com/startpage", # Start page
'xlines' : "//td[something...]",
'xnext_url' : "//a[contains(#href,'something?page=')]/#href", # Next pages
'x_ondetailpage' : {
"itemprop123" : u"id('someid')//text()"
}
}
# def next_url_parser(self, next_url, response): # OPTIONAL next_url regexp pre-processor
# ...
Instead of having the variables name,allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from that passing the necessary arguments, and setting name, allowed_domains etc. per object.
MyProp and keywords also should be set within your __init__. So in the end you should have something like below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
class MySpider(CrawlSpider):
def __init__(self, allowed_domains=[], start_urls=[],
rules=[], findtag='', finditemprop='', keywords='', **kwargs):
CrawlSpider.__init__(self, **kwargs)
self.allowed_domains = allowed_domains
self.start_urls = start_urls
self.rules = rules
self.findtag = findtag
self.finditemprop = finditemprop
self.keywords = keywords
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
for contentTag in contentTags:
matchedResult = re.search(self.keywords, contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)
I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just override the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own init-function or I pull in that data from a database where I have saved a list of links to be appended to start_urls earlier.
As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
class MySpider(scrapy.Spider):
"""
This spider will try to crawl whatever is passed in `start_urls` which
should be a comma-separated string of fully qualified URIs.
Example: start_urls=http://localhost,http://example.com
"""
def __init__(self, name=None, **kwargs):
if 'start_urls' in kwargs:
self.start_urls = kwargs.pop('start_urls').split(',')
super(Spider, self).__init__(name, **kwargs)
Since nothing so far is working I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, and created the folders, and a new spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u
class NuSpider(CrawlSpider):
domain_name = "wcase"
start_urls = ['http://www.whitecase.com/aabbas/']
names = hxs.select('//td[#class="altRow"][1]/a/#href').re('/.a\w+')
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
def parse(self, response):
self.log('Hi, this is an item page! %s' % response.url)
hxs = HtmlXPathSelector(response)
item = Item()
item['school'] = hxs.select('//td[#class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
return item
SPIDER = NuSpider()
and when I run
C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase
I get
[Nu] ERROR: Could not find spider for domain: wcase
The other spiders at least are recognized by Scrapy, this one is not. What am I doing wrong?
Thanks for your help!
Please also check the version of scrapy. The latest version uses "name" instead of "domain_name" attribute to uniquely identify a spider.
Have you included the spider in SPIDER_MODULES list in your scrapy_settings.py?
It's not written in the tutorial anywhere that you should to this, but you do have to.
These two lines look like they're causing trouble:
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
Only one rule will be followed each time the script is run. Consider creating a rule for each URL.
You haven't created a parse_item callback, which means that the rule does nothing. The only callback you've defined is parse, which changes the default behaviour of the spider.
Also, here are some things that will be worth looking into.
CrawlSpider doesn't like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings. You'll see that this is the preferred way to override the default parse method for your starting URLs.
NuSpider.hxs is called before it's defined.
I believe you have syntax errors there. The name = hxs... will not work because you don't get defined before the hxs object.
Try running python yourproject/spiders/domain.py to get syntax errors.
You are overriding the parse method, instead of implementing a new parse_item method.