Stop Scrapy crawling the same URLs - python

I've written a basic Scrapy spider to crawl a website. It seems to run fine, except that it doesn't want to stop: it keeps revisiting the same URLs and returning the same content, and I always end up having to kill it. I suspect it's going over the same URLs over and over again. Is there a rule that will stop this? Or is there something else I have to do? Maybe middleware?
The Spider is as below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import Join
from scrapy.selector import Selector

from lsbu.items import LsbuItem  # assuming the item is defined in lsbu/items.py


class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        join = Join()
        sel = Selector(response)
        bits = sel.xpath('//*')
        scraped_bits = []
        for bit in bits:
            scraped_bit = LsbuItem()
            scraped_bit['title'] = bit.xpath('//title/text()').extract()
            scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
            scraped_bits.append(scraped_bit)
        return scraped_bits
My settings.py file looks like this
BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'
Any help/guidance/instruction on stopping it from running continuously would be greatly appreciated...
As I'm a newbie to this, any comments on tidying the code up would also be helpful (or links to good instruction).
Thanks...

The DupeFilter is enabled by default (http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class) and it's based on the request URL.
I tried a simplified version of your spider on a new vanilla Scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say you have something wrong in your settings or in your Scrapy version. I'd suggest upgrading to Scrapy 1.0, just to be sure :)
$ pip install scrapy --pre
The simplified spider I tested:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field


class LsbuItem(Item):
    title = Field()
    url = Field()


class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]

    rules = [
        Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        scraped_bit = LsbuItem()
        scraped_bit['url'] = response.url
        yield scraped_bit
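One concrete thing to check in your settings.py: the dupe filter is on by default, so the explicit DUPEFILTER_CLASS line isn't needed at all; and if you keep it, note that newer Scrapy versions (1.0+) moved the class to scrapy.dupefilters (plural), so the old scrapy.dupefilter path may warn or fail depending on your version. A minimal settings.py for this project could look like:

# settings.py - the default RFPDupeFilter is used automatically,
# so no DUPEFILTER_CLASS line is required
BOT_NAME = 'lsbu6'

SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'

# only if you want to keep the explicit setting on Scrapy 1.0+
# (note the plural module name):
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = True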

Your design makes the crawl go in circles. For example, there is a page http://www.lsbu.ac.uk/business-and-partners/business, which contains a link to http://www.lsbu.ac.uk/business-and-partners/partners, and that one in turn contains a link back to the first one. Thus, you go in circles indefinitely.
In order to overcome this, you need to create better rules, eliminating the circular references.
Also, you have two identical rules defined, which is not needed. If you want follow, you can put it on the same rule; you don't need a second rule.
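For illustration, a single, tighter rule might look something like the sketch below (the deny pattern is purely hypothetical; adjust it to whatever sections keep linking back into each other):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = ["http://www.lsbu.ac.uk"]

    # one rule only: callback and follow can live on the same Rule
    rules = [
        Rule(
            LinkExtractor(
                allow=[r'lsbu\.ac\.uk/business-and-partners/.+'],
                # hypothetical deny pattern to prune a circular section
                deny=[r'/business-and-partners/partners'],
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        # parse the page here
        pass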

Related

Scrapy - Selecting and crawling a specific type of sitemap nodes

This is the sitemap of the website I'm crawling. The 3rd and 4th <sitemap> nodes contain the URLs that lead to the item details. Is there any way to apply the crawling logic only to those nodes (for example, by selecting them by their indices)?
class MySpider(SitemapSpider):
    name = 'myspider'

    sitemap_urls = [
        'https://www.dfimoveis.com.br/sitemap_index.xml',
    ]

    sitemap_rules = [
        ('/somehow targeting the 3rd and 4th node', 'parse_item')
    ]

    def parse_item(self, response):
        # scraping the item
You don't need to use SitemapSpider; just use a regex and a standard Spider.
def start_requests(self):
    sitemap = 'https://www.dfimoveis.com.br/sitemap_index.xml'
    yield scrapy.Request(url=sitemap, callback=self.parse_sitemap)

def parse_sitemap(self, response):
    sitemap_links = re.findall(r"<loc>(.*?)</loc>", response.text, re.DOTALL)
    sitemap_links = sitemap_links[2:4]  # Only 3rd and 4th nodes.
    for sitemap_link in sitemap_links:
        yield scrapy.Request(url=sitemap_link, callback=self.parse)
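Put together, a self-contained version of that idea might look like the sketch below (the spider name and the final parse callback are placeholders):

import re

import scrapy


class DfImoveisSpider(scrapy.Spider):
    name = 'dfimoveis'  # placeholder name

    def start_requests(self):
        sitemap = 'https://www.dfimoveis.com.br/sitemap_index.xml'
        yield scrapy.Request(url=sitemap, callback=self.parse_sitemap)

    def parse_sitemap(self, response):
        # pull every <loc> URL out of the sitemap index
        sitemap_links = re.findall(r"<loc>(.*?)</loc>", response.text, re.DOTALL)
        sitemap_links = sitemap_links[2:4]  # only the 3rd and 4th nodes
        for sitemap_link in sitemap_links:
            yield scrapy.Request(url=sitemap_link, callback=self.parse)

    def parse(self, response):
        # scrape the item here
        pass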
Scrapy's Spider subclasses, including SitemapSpider, are meant to make very common scenarios very easy.
You want to do something rather uncommon, so you should read the source code of SitemapSpider, try to understand what it does, and either subclass SitemapSpider, overriding the behavior you want to change, or write your own spider from scratch based on the code of SitemapSpider.

Scrapy only scrapes the first start url in a list of 15 start urls

I am new to Scrapy and am attempting to teach myself the basics. I have compiled a code that goes to the Louisiana Department of Natural Resources website to retrieve the serial number for certain oil wells.
I have each well's link listed in start_urls, but Scrapy only downloads data from the first URL. What am I doing wrong?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector

from mike.items import MikeItem


class SonrisSpider(Spider):
    name = "sspider"

    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def parse(self, response):
        item = MikeItem()
        item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
        yield item
Thank you for any help you might be able to provide. If I have not explained my problem thoroughly, please let me know and I will attempt to clarify.
I think this code might help.
By default Scrapy filters duplicate requests. Since only the query parameters differ in your start URLs, Scrapy considers the rest of them duplicates of the first one, which is why your spider stops after fetching the first URL. In order to parse the rest of the URLs, enable the dont_filter flag on the request (check start_requests() below).
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from mike.items import MikeItem


class SonrisSpider(scrapy.Spider):
    name = "sspider"
    allowed_domains = ["sonlite.dnr.state.la.us"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_data, dont_filter=True)

    def parse_data(self, response):
        item = MikeItem()
        serial = response.xpath(
            '/html/body/table[1]/tr[2]/td[1]/text()').extract()
        serial = serial[0] if serial else 'n/a'
        item['serial'] = serial
        yield item
Sample output returned by this spider is as follows:
{'serial': u'207899'}
{'serial': u'971683'}
{'serial': u'214206'}
{'serial': u'159420'}
{'serial': u'248942'}
{'serial': u'243671'}
Your code looks good; try adding this function:
class SonrisSpider(Spider):
    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)

    # the rest of your code goes here
The URLs should be printed now. Test it; if it doesn't work, please say so.

How to use the Rule class in scrapy

I am trying to use the Rule class to go to the next page in my crawler. Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from crawler.items import GDReview


class GdSpider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
    ]

    rules = (
        # Extract next links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]/a/@href',)), follow=True),
    )

    def parse(self, response):
        company_name = response.xpath('//*[@id="EIHdrModule"]/div[3]/div[2]/p/text()').extract()

        # loop over every review on this page
        for sel in response.xpath('//*[@id="EmployerReviews"]/ol/li'):
            review = GDReview()  # use the imported item class (a bare Item has no fields)
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1]  # sel.xpath('@id/text()').extract()
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()
            yield review
My question is about the rules section. In this rule, the link extracted doesn't contain the domain name. For example, it will return something like
"/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
How can I make sure that my crawler will append the domain to the returned link?
Thanks
You can be sure, since this is the default behavior of link extractors in Scrapy (source code).
Also, the restrict_xpaths argument should not point to the @href attribute; it should instead point to a elements, or to containers that have a elements as descendants. Plus, restrict_xpaths can be defined as a string.
In other words, replace:
restrict_xpaths=('//li[@class="next"]/a/@href',)
with:
restrict_xpaths='//li[@class="next"]/a'
Besides, you need to switch to LxmlLinkExtractor from SgmlLinkExtractor:
SGMLParser based link extractors are unmaintained and their usage is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.
Personally, I usually use the LinkExtractor shortcut to LxmlLinkExtractor:
from scrapy.contrib.linkextractors import LinkExtractor
To summarize, this is what I would have in rules:
rules = [
    Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True)
]

Example of two Scrapy spiders, one has a memory leak and I can't find it

This is driving me nuts. It drove me to consolidate and simplify a lot of code, but I just can't fix the problem. Here is an example of two spiders I wrote. The top one has a memory leak that causes the memory to slowly expand until it's full.
They are almost identical, and they use the same items and everything else outside of the spider, so I do not think there is anything in the rest of my code to blame. I have also isolated bits of code here and there, and tried deleting variables towards the end. I've looked over the Scrapy docs and I am still stumped. Does anyone have any magic to fix this?
import scrapy
from wordscrape.items import WordScrapeItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json


class EnglishWikiSpider(CrawlSpider):
    name = 'englishwiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        'http://en.wikipedia.org/wiki/'
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/wiki/', )), callback='parse_it', follow=True),
    )

    def parse_it(self, response):
        the_item = WordScrapeItem()

        # This takes all the text that is in that div and extracts it, only the text,
        # not html tags (see: //text()), if it meets the conditions of my regex
        english_text = response.xpath('//*[@id="mw-content-text"]//text()').re(ur'[a-zA-Z\'-]+')

        english_dict = {}
        for i in english_text:
            if len(i) > 1:
                word = i.lower()
                if word in english_dict:
                    english_dict[word] += 1
                else:
                    english_dict[word] = 1

        # Dump into json string and put it in the word item, it will be
        # ['word': {<<jsondict>>}, 'site': url, ...]
        jsondump = json.dumps(english_dict)
        the_item['word'] = jsondump
        the_item['site'] = response.url
        return the_item
The second, stable spider:
import scrapy
from wordscrape.items import WordScrapeItem
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json


class NaverNewsSpider(CrawlSpider):
    name = 'navernews'
    allowed_domains = ['news.naver.com']
    start_urls = [
        'http://news.naver.com',
        'http://news.naver.com/main/read.nhn?oid=001&sid1=102&aid=0007354749&mid=shm&cid=428288&mode=LSD&nh=20150114125510',
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('main/read\.nhn', )), callback='parse_it', follow=True),
    )

    def parse_it(self, response):
        the_item = WordScrapeItem()

        # gets all the text from the listed div and then applies the regex
        # to find all word objects in the hangul range
        hangul_syllables = response.xpath('//*[@id="articleBodyContents"]//text()').re(ur'[\uac00-\ud7af]+')

        # Go through all hangul syllables found and add to the value or add the key
        hangul_dict = {}
        for i in hangul_syllables:
            if i in hangul_dict:
                hangul_dict[i] += 1
            else:
                hangul_dict[i] = 1

        jsondump = json.dumps(hangul_dict)
        the_item['word'] = jsondump
        the_item['site'] = response.url
        return the_item
I think Jepio is right in his comment. I think the spider is finding too many links to follow and therefore having to store them all in the interim period.
EDIT: So the problem is that it is storing all of those links in memory instead of on disk, and it eventually fills up all my memory. The solution was to run Scrapy with a job directory, which forces them to be stored on disk, where there is plenty of space.
$ scrapy crawl spider -s JOBDIR=somedirname
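For reference, JOBDIR is an ordinary Scrapy setting, so it can also be set in settings.py instead of being passed with -s on the command line; the directory name here is arbitrary:

# settings.py
# persist scheduler and dupefilter state on disk instead of in memory
JOBDIR = 'crawls/englishwiki-1'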

Scrapy spider is not working

Since nothing so far is working, I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, created the folders, and wrote a new spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

from Nu.items import NuItem
from urls import u


class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']

    names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
    u = names.pop()

    rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

    def parse(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)

        item = Item()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
and when I run
C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase
I get
[Nu] ERROR: Could not find spider for domain: wcase
The other spiders at least are recognized by Scrapy, this one is not. What am I doing wrong?
Thanks for your help!
Please also check the version of Scrapy. The latest version uses the "name" attribute instead of "domain_name" to uniquely identify a spider.
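In other words, on a recent version, replace
    domain_name = "wcase"
with
    name = "wcase"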
Have you included the spider in the SPIDER_MODULES list in your scrapy_settings.py?
It's not written anywhere in the tutorial that you should do this, but you do have to.
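Something along these lines, assuming the project package is called Nu (as created by startproject above):

# scrapy_settings.py
SPIDER_MODULES = ['Nu.spiders']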
These two lines look like they're causing trouble:
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
Only one rule will be followed each time the script is run. Consider creating a rule for each URL.
You haven't created a parse_item callback, which means that the rule does nothing. The only callback you've defined is parse, which changes the default behaviour of the spider.
Also, here are some things that are worth looking into:
CrawlSpider doesn't like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings. You'll see that this is the preferred way to override the default parse method for your starting URLs.
NuSpider.hxs is called before it's defined.
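For the first point, here is a minimal sketch of the parse_start_url pattern (the link pattern and log messages are only illustrative):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class NuSpider(CrawlSpider):
    name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']

    rules = (
        # hypothetical pattern - extract whatever links you actually need
        Rule(SgmlLinkExtractor(allow=('/aabbas/', )), callback='parse_item'),
    )

    def parse_start_url(self, response):
        # handle the start URLs here instead of overriding parse()
        self.log('Start URL: %s' % response.url)
        return []

    def parse_item(self, response):
        # the callback named by the Rule
        self.log('Item page: %s' % response.url)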
I believe you have errors there. The names = hxs... line will not work because hxs is not defined at that point.
Try running python yourproject/spiders/domain.py to surface those errors.
You are overriding the parse method, instead of implementing a new parse_item method.
