Since nothing so far is working I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, and created the folders, and a new spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']

    names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
    u = names.pop()

    rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

    def parse(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
and when I run
C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase
I get
[Nu] ERROR: Could not find spider for domain: wcase
The other spiders are at least recognized by Scrapy; this one is not. What am I doing wrong?
Thanks for your help!
Please also check your version of Scrapy. The latest versions use a "name" attribute instead of "domain_name" to uniquely identify a spider.
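For example, on a recent Scrapy release the spider would be identified like this (a minimal sketch; only the identifying attribute changes):

    class NuSpider(CrawlSpider):
        name = "wcase"            # newer Scrapy releases
        # domain_name = "wcase"   # older Scrapy releases
        start_urls = ['http://www.whitecase.com/aabbas/']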
Have you included the spider in the SPIDER_MODULES list in your scrapy_settings.py?
The tutorial doesn't say anywhere that you need to do this, but you do.
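For reference, that setting looks something like this, assuming the project layout from the tutorial with the spiders package under Nu/spiders:

    # scrapy_settings.py (settings.py in newer project templates)
    SPIDER_MODULES = ['Nu.spiders']
    NEWSPIDER_MODULE = 'Nu.spiders'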
These two lines look like they're causing trouble:
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
Only one rule will be followed each time the script is run. Consider creating a rule for each URL.
You haven't created a parse_item callback, which means that the rule does nothing. The only callback you've defined is parse, which changes the default behaviour of the spider.
Also, here are some things that will be worth looking into.
CrawlSpider doesn't like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings. You'll see that this is the preferred way to override the default parse method for your starting URLs.
NuSpider.hxs is called before it's defined.
I believe you have errors there: the names = hxs... line will not work, because hxs is not defined before it is used. Try running python yourproject/spiders/domain.py directly to surface those errors.
You are overriding the parse method, instead of implementing a new parse_item method.
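Putting those points together, here is a minimal sketch of the corrected structure. It is not a drop-in fix: the allow pattern is a placeholder, and the link extraction that currently sits at class level would have to move into a method (for example parse_start_url) or be replaced by the rule itself.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from Nu.items import NuItem

    class NuSpider(CrawlSpider):
        domain_name = "wcase"  # name = "wcase" on newer Scrapy versions
        start_urls = ['http://www.whitecase.com/aabbas/']

        # Placeholder pattern: let the rule find the attorney pages instead of
        # computing URLs at class level.
        rules = (
            Rule(SgmlLinkExtractor(allow=(r'/aabbas/\w+', )), callback='parse_item'),
        )

        def parse_item(self, response):
            # Note: parse_item, not parse, so CrawlSpider's own parse() stays intact.
            self.log('Hi, this is an item page! %s' % response.url)
            hxs = HtmlXPathSelector(response)
            item = NuItem()
            item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re(r'(?<=(JD,\s))(.*?)(\d+)')
            return item

    SPIDER = NuSpider()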
Related
I am using Scrapy with python to scrape a website and I face some difficulties with filling the item that I have created.
The products are properly scraped and everything is working well as long as the info is located within the response.xpath mentioned in the for loop.
'trend' and 'number' are properly added to the Item using ItemLoader.
However, the date of the product is not located within the response.xpath cited below, but in the page title, i.e. response.css('title').
import scrapy
import datetime
from trends.items import Trend_item
from scrapy.loader import ItemLoader

# Initiate the spider
class trendspiders(scrapy.Spider):
    name = 'milk'
    start_urls = ['https://thewebsiteforthebestmilk/ireland/2022-03-16/7/']

    def parse(self, response):
        for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            return l.load_item()
How can I add the 'date' (found in response.css('title')) to my item, please?
I have tried adding l.add_css('date', "response.css('title')") in the for loop, but it returns an error.
Should I create a new parsing function? If yes then how to send the info to the same Item?
I hope I’ve made myself clear.
Thank you very much for your help,
Since the date is outside of the selector you are using for each row, what you should do is extract that first before your for loop, since it doesn't need to be updated on each iteration.
Then with your item loader you can just use l.add_value to load it with the rest of the fields.
For example:
import scrapy
from scrapy.loader import ItemLoader
from trends.items import Trend_item

class trendspiders(scrapy.Spider):
    name = 'trends'
    start_urls = ['https://getdaytrends.com/ireland/2022-03-16/7/']

    def parse(self, response):
        # Grab the date once from the page <title>, outside the row loop.
        date_str = response.xpath("//title/text()").get()
        for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            l.add_value('date', date_str)
            yield l.load_item()
If response.css('title').get() gives you the answer you need, why not use the same CSS with add_css:
l.add_css('date', 'title')
Also, .add_css('date', "response.css('title')") is invalid because the second argument must be a valid CSS selector, not a Python expression.
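If the field should end up as a real date rather than the raw title string, you can also parse it before loading. A minimal sketch, assuming (hypothetically) that the title text contains an ISO-style date such as 2022-03-16; adjust the pattern to whatever your real title looks like:

    import re
    from datetime import datetime

    def extract_date(title_text):
        # Pull a YYYY-MM-DD date out of the <title> text, if one is present.
        match = re.search(r"\d{4}-\d{2}-\d{2}", title_text or "")
        return datetime.strptime(match.group(), "%Y-%m-%d") if match else None

    # Inside parse(), before the loop:
    #     date_value = extract_date(response.css('title::text').get())
    # and then in the loop:
    #     l.add_value('date', date_value)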
I want to extract only the links I need from the first page. I set DEPTH_LIMIT to 1 in the crawler and follow=False on the matching Rule(), but the spider still issues multiple requests, and I don't know why. I hope someone can answer my doubts.
Thanks in advance.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class OfficialSpider(CrawlSpider):
    name = 'official'
    allowed_domains = ['news.chd.edu.cn', 'www.chd.edu.cn']
    start_urls = ['http://www.chd.edu.cn']

    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'DEPTH_LIMIT': 1,
    }

    rules = (
        # Rule(LinkExtractor(allow=('http://news.chd.edu.cn/',)), callback='parse_news', follow=False),
        Rule(LinkExtractor(allow=('http://www.chd.edu.cn/',)), callback='parse_item', follow=False),
        Rule(LinkExtractor(allow=("",)), follow=False),
    )

    def parse_news(self, response):
        print(response.url)
        return {}

    def parse_item(self, response):
        self.log("item link:")
        self.log(response.url)
Output: (screenshot omitted; the log shows multiple requests being made.)
From the docs:
follow is a boolean which specifies if links should be followed from each response extracted with this rule.
This means that follow=False only stops the crawler from following links found when processing the responses produced by this rule; it can't affect the links extracted when parsing the responses to the start_urls requests.
There would be no point to the follow argument disabling a rule completely; if you don't want to use a rule, why would you create it at all?
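If the goal really is just to collect the matching links from the first page and nothing else, a plain scrapy.Spider with no rules is a simpler fit than CrawlSpider. A minimal sketch under that assumption (the spider name here is made up for the example):

    import scrapy

    class FirstPageSpider(scrapy.Spider):
        name = 'official_firstpage'  # hypothetical name for this sketch
        allowed_domains = ['www.chd.edu.cn']
        start_urls = ['http://www.chd.edu.cn']

        def parse(self, response):
            # Only the start page is parsed; no further requests are scheduled.
            for href in response.css('a::attr(href)').getall():
                yield {'link': response.urljoin(href)}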
I've written a basic Scrapy spider to crawl a website. It seems to run fine, except that it doesn't want to stop: it keeps revisiting the same URLs and returning the same content, and I always end up having to kill it. I suspect it's going over the same URLs over and over again. Is there a rule that will stop this? Or is there something else I have to do? Maybe middleware?
The Spider is as below:
class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        join = Join()
        sel = Selector(response)
        bits = sel.xpath('//*')
        scraped_bits = []
        for bit in bits:
            scraped_bit = LsbuItem()
            scraped_bit['title'] = bit.xpath('//title/text()').extract()
            scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
            scraped_bits.append(scraped_bit)
        return scraped_bits
My settings.py file looks like this
BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'
Any help/ guidance/ instruction on stopping it running continuously would be greatly appreciated...
As I'm a newbie to this; any comments on tidying the code up would also be helpful (or links to good instruction).
Thanks...
The DupeFilter is enabled by default: http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class and it's based on the request url.
I tried a simplified version of your spider on a new vanilla Scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say you have something wrong in your settings or in your Scrapy version. I'd suggest you upgrade to Scrapy 1.0, just to be sure :)
$ pip install scrapy --pre
The simplified spider I tested:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class LsbuItem(Item):
    title = Field()
    url = Field()

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]

    rules = [
        Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        scraped_bit = LsbuItem()
        scraped_bit['url'] = response.url
        yield scraped_bit
Your design makes the crawl go in circles. For example, there is a page http://www.lsbu.ac.uk/business-and-partners/business which, when opened, contains a link to http://www.lsbu.ac.uk/business-and-partners/partners, and that page in turn links back to the first one. Thus, you go in circles indefinitely.
In order to overcome this, you need to create better rules, eliminating the circular references.
Also, you have two identical rules defined, which is not needed. If you want follow, you can set it on the same rule; you don't need a new one.
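As a rough sketch of what that could look like, keep a single rule with both the callback and follow=True, and, while debugging, bound the crawl with settings such as DEPTH_LIMIT or CLOSESPIDER_PAGECOUNT (the values below are only illustrative):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LsbuSpider(CrawlSpider):
        name = "lsbu6"
        allowed_domains = ["lsbu.ac.uk"]
        start_urls = ["http://www.lsbu.ac.uk"]

        # Optional safety nets while debugging (custom_settings needs Scrapy 1.0+).
        custom_settings = {
            "DEPTH_LIMIT": 3,
            "CLOSESPIDER_PAGECOUNT": 500,
        }

        rules = [
            # One rule: extract matching links, parse them, and keep following them.
            Rule(LinkExtractor(allow=[r'lsbu\.ac\.uk/business-and-partners/.+']),
                 callback='parse_item', follow=True),
        ]

        def parse_item(self, response):
            yield {'url': response.url}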
I am trying to use the Rule class to go to the next page in my crawler. Here is my code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import GDReview

class GdSpider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
    ]

    rules = (
        # Extract next links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]/a/@href',)), follow=True)
    )

    def parse(self, response):
        company_name = response.xpath('//*[@id="EIHdrModule"]/div[3]/div[2]/p/text()').extract()

        # loop over every review in this page
        for sel in response.xpath('//*[@id="EmployerReviews"]/ol/li'):
            review = Item()
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1]  # sel.xpath('@id/text()').extract()
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()
            yield review
My question is about the rules section. In this rule, the link extracted doesn't contain the domain name. For example, it will return something like
"/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
How can I make sure that my crawler will append the domain to the returned link?
Thanks
You can be sure since this is the default behavior of link extractors in Scrapy (source code).
Also, the restrict_xpaths argument should not point to the @href attribute; it should point either to a elements or to containers that have a elements as descendants. Plus, restrict_xpaths can be defined as a string.
In other words, replace:
restrict_xpaths=('//li[@class="next"]/a/@href',)
with:
restrict_xpaths='//li[@class="next"]/a'
Besides, you need to switch to LxmlLinkExtractor from SgmlLinkExtractor:
SGMLParser based link extractors are unmaintained and their usage is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.
Personally, I usually use the LinkExtractor shortcut to LxmlLinkExtractor:
from scrapy.contrib.linkextractors import LinkExtractor
To summarize, this is what I would have in rules:
rules = [
    Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True)
]
I've made a lot of headway with this spider; I'm just growing accustomed to coding and am enjoying every minute of it. However, since I'm still learning, the majority of my programming is problem solving. Here's my current error:
My spider shows all of the data I want in the terminal window. When I go to output, nothing shows up. Here is my code.
import re
import json
from urlparse import urlparse

from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from database.items import databaseItem
from scrapy.log import *

class CommonSpider(CrawlSpider):
    name = 'fenders.py'
    allowed_domains = ['usedprice.com']
    start_urls = ['http://www.usedprice.com/items/guitars-musical-instruments/fender/?ob=model_asc#results']

    rules = (
        Rule(LinkExtractor(allow=( )), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = []
        data = hxs.select('//tr[@class="oddItemColor baseText"]')
        tmpNextPage = hxs.select('//div[@class="baseText blue"]/span[@id="pnLink"]/a/@href').extract()
        for attr in data:
            # item = RowItem()
            instrInfo = attr.select('//td[@class="itemResult"]/text()').extract()
            print "Instrument Info: ", instrInfo
            yield instrInfo
As JoeLinux said, you're yielding a string instead of returning an item. If you're mostly working off the tutorial, you probably have an "items.py" file someplace (maybe with some other name) where your item is defined; it appears to be called "RowItem()". There you've got several fields, or maybe just one.
What you need to do is figure out how you want to store the data in the item. So, making a gross assumption, you probably want RowItem() to include a field called instrInfo. So your items.py file might include something like this:
class RowItem(scrapy.Item):
    instrInfo = scrapy.Field()
Then your spider should include something like:
item = RowItem()
item['instrInfo'] = []
data = hxs.select('//tr[@class="oddItemColor baseText"]')
for attr in data:
    instrInfo = attr.select('//td[@class="itemResult"]/text()').extract()
    item['instrInfo'].append(instrInfo)
return item
This will send the item off to your pipeline for processing.
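For completeness, a minimal (hypothetical) pipeline that just collects the items could look like this; it would also need to be enabled via ITEM_PIPELINES in settings.py, using whatever module path your project actually has:

    # pipelines.py (illustrative only)
    class CollectInstrInfoPipeline(object):
        def open_spider(self, spider):
            self.items = []

        def process_item(self, item, spider):
            # Every item returned/yielded by the spider passes through here.
            self.items.append(dict(item))
            return item

        def close_spider(self, spider):
            spider.log("Collected %d items" % len(self.items))

    # settings.py (path is an assumption for this sketch):
    # ITEM_PIPELINES = {'yourproject.pipelines.CollectInstrInfoPipeline': 300}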
As I said, some gross assumptions about what you're trying to do, and the format of your information, but hopefully this gets you started.
Separately, the print function probably isn't necessary. When the item is returned, it's displayed in the terminal (or log) as the spider runs.
Good luck!