Scrapy - basic scraper example returns no output - python

I am running scrapy on Anaconda and have tried to run example code from this DigitalOcean guide as shown below:
import scrapy
from scrapy import Spider

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
I am a beginner with Scrapy, so keep this in mind. This code executes, but no output is shown. There is supposed to be output according to the article I got the code from. Please let me know how to view the information the spider gathers. I am running the module from IDLE; if I try to do "runspider" in cmd, it says it cannot find my Python file even though I can see the file in the directory and open it in IDLE. Thanks in advance.

Your spider is missing a callback method to handle the response from http://brickset.com/sets/year-2016.
Try defining a callback method like this:
import scrapy
from scrapy import Spider

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        self.log('I visited: {}'.format(response.url))
By default, Scrapy calls the parse method defined in your spider to handle the responses for the requests that your spider generates.
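If you also want to see the data the spider gathers rather than just a log message, you can yield items from parse. Here is a minimal sketch (the title selector is a generic assumption, not taken from the DigitalOcean guide):

import scrapy

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        # Anything yielded here shows up in the crawl log and in the
        # file you pass with -o on the command line.
        yield {
            'url': response.url,
            'title': response.css('title::text').extract_first(),
        }

Save it as, for example, brickset_spider.py and start it with scrapy runspider brickset_spider.py -o sets.json from the folder containing the file. Running the file directly from IDLE only defines the class; the crawl has to be started by the scrapy command line tool (or programmatically via scrapy.crawler.CrawlerProcess), which is why you saw no output.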
Have a look at the official Scrapy tutorial too: https://doc.scrapy.org/en/latest/intro/tutorial.html

Related

Why does Scrapy give an "Unable to load" error?

So I am working on a small crawler using Scrapy and Python on this website: https://www.theverge.com/reviews. From there I am trying to extract the reviews based on the rules I have set, which should match links that look like this:
example: https://www.theverge.com/22274747/tern-hsd-p9-ebike-review-electric-cargo-bike-price-specs
I want to extract the URL of the review page, the title of the page, the name of whoever wrote the review, and the link to their profile. However, I assume there is something wrong either with my code or with the way my files are organized, because I get this error when I try to run it:
runspider: error: Unable to load 'spiders/vergespider.py': No module named 'oblig3.oblig3'
My folders look like this.
My intended results should look something like this, visiting up to 20 pages (which I don't quite understand how to limit through the Scrapy settings, but that is another problem):
authorlink,authorname,title,url
"https://www.theverge.com/authors/cameron-faulkner,https://www.twitter.com/camfaulkner",Cameron Faulkner,"Gigabyte’s Aorus 15G is great at gaming, but not much else",https://www.theverge.com/22299226/gigabyte-aorus-15g-review-gaming-laptop-price-specs-features
So my question is: what could be causing the error I am getting, and why am I not getting any CSV output from this code? I am fairly new to Python and Scrapy, so any tips or improvements to the code are appreciated. I would like to keep the solutions within Scrapy and Python, as those are the things I am trying to learn at the moment.
Edit:
This is what I use to run the code: scrapy runspider spiders/vergespider.py -o vergetest.csv -t csv. And this is what I have coded so far.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from oblig3.items import VergeReview

class VergeSpider(CrawlSpider):
    name = 'verge'
    allowed_domains = ['theverge.com']
    start_urls = ['https://www.theverge.com/reviews']

    rules = [
        Rule(LinkExtractor(allow=r'^(https://www.theverge.com/)(/d+)/([^/]+$)'),
             callback='parse_items', follow=True),
        Rule(LinkExtractor(allow=r'.*'),
             callback='parse_items', cb_kwargs={'is_verge': False})
    ]

    def parse(self, response, is_verge):
        if is_verge:
            verge = VergeReview()
            verge['url'] = response.url
            verge['title'] = response.xpath("//h1/text()").extract_first()
            verge['authorname'] = response.xpath("//span[@class='c-byline__author-name']/text()").extract()
            verge['authorlink'] = response.xpath("//*/span[@class = 'c-byline__item'][1]/a/@href").extract()
            yield verge
        else:
            # Do something else
            pass
My items file
import scrapy

class VergeReview(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    authorname = scrapy.Field()
    authorlink = scrapy.Field()
And my settings file is unchanged, though I should implement CLOSESPIDER_PAGECOUNT = 20 but don't know how.
The error you have is:
runspider: error ..... No module named 'oblig3.oblig3'
What I can see from your screenshot is that oblig3 is the name of your project.
This is a common error when you try to run your spider using:
scrapy runspider spider_file.py
If you are running your spider this way, you need to change the way you are running the spider:
First, make sure that you are in the directory where scrapy.cfg is located
then run
scrapy list
This should give you a list of all the spiders it found.
After that, you should use this command to run your spider.
scrapy crawl <spidername>
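With the names from your question, that would be roughly (assuming the oblig3 folder that contains scrapy.cfg is the project root):

cd oblig3                     # the directory that contains scrapy.cfg
scrapy list                   # should print: verge
scrapy crawl verge -o vergetest.csv -t csv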
If this does not solve your problem, you need to share the code and share the details about how you are running your spider.

Scrapy selector working in shell, but not in the spider itself

I'm relatively new to python and scrapy; just started learning from tutorials a few days ago.
Like the title says, I've been trying to get a simple text scraping operation done as practice, by scraping chapters from fanfiction.net. However, I've run into a roadblock where even though my test selectors in the shell work perfectly fine, the spider itself still returns nothing when I run the command scrapy crawl fanfiction -o fanfiction.json.
Here is the code that I have so far; it's essentially a modified version of the tutorial code from doc.scrapy.org.
import scrapy

class FFSpider(scrapy.Spider):
    name = "fanfiction"
    start_urls = [
        'https://www.fanfiction.net/s/12580108/1/Insane-Gudako',
        'https://www.fanfiction.net/s/12580108/2/Insane-Gudako',
    ]

    def parse(self, response):
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        for chapter in response.css('div.fanfiction'):
            yield {
                'summary': chapter.css('.storytext p').xpath('text()').extract()
            }
In the inline shell call, testing using chapter.css('.storytext p').xpath('text()').extract() returns the text properly, but once the spider finishes crawling, fanfiction.json is still an empty file.
What is the problem here?

Scrapy callback doesn't work at all in my script

I've written a script in Python Scrapy to parse the names and prices of different items available on a webpage. I tried to implement the logic in my script the way I've learnt so far. However, when I execute it, I get the following error. I suppose I can't make the callback method work properly. Here is the script I've tried with:
The spider named "sth.py" contains:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http.request import Request

class SephoraSpider(CrawlSpider):
    name = "sephorasp"

    def start_requests(self):
        yield Request(url="https://www.sephora.ae/en/stores/", callback=self.parse_pages)

    def parse_pages(self, response):
        for link in response.xpath('//ul[@class="nav-primary"]//a[contains(@class,"level0")]/@href').extract():
            yield Request(url=link, callback=self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for links in response.xpath('//li[contains(@class,"amshopby-cat")]/a/@href').extract():
            yield Request(url=links, callback=self.target_page)

    def target_page(self, response):
        for titles in response.xpath('//div[@class="product-info"]'):
            product = titles.xpath('.//div[contains(@class,"product-name")]/a/text()').extract_first()
            rate = titles.xpath('.//span[@class="price"]/text()').extract_first()
            yield {'Name': product, 'Price': rate}
"items.py" contains:
import scrapy

class SephoraItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
Partial error looks like:
if cookie.secure and request.type != "https":
AttributeError: 'WrappedRequest' object has no attribute 'type'
Here is the total error log:
"https://www.dropbox.com/s/kguw8174ye6p3q9/output.log?dl=0"
Looks like you are running Scrapy v1.1 when the current release is v1.4. As far as I remember, there was a bug in some early 1.x version involving the WrappedRequest object used for handling cookies.
Try upgrading to v1.4:
pip install scrapy --upgrade
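You can confirm which version is actually installed, before and after upgrading, with:

scrapy version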

Parse single URL without crawling

I'm trying to scrape a single URL with Scrapy. I don't want it to crawl, just parse the item, run the pipelines, and return. My pipeline just updates the database. The following code is what I've done so far; it takes around 3 seconds, but it seems like most of the time is spent loading Scrapy. Is there a better way to do this?
Ideally I want to parse a single URL from a Python script, not the command line.
def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.start_urls = [kwargs.get('start_url')]

def parse(self, response):
    if 'item.asp' in response.url:
        yield Request(response.url, callback=self.parse_item)
Then I'm running it from the command line like the following:
time scrapy crawl --loglevel=DEBUG MySpider -a start_url="www.example.com"
I also tried the following, but it never worked with the pipeline parameter:
time scrapy parse "www.example.com" --spider=MySpider --callback parse_item --pipelines AddToDB
Check the documentation for scrapy parse: http://doc.scrapy.org/en/latest/topics/commands.html?highlight=parse#std:command-parse
In your case you are misunderstanding the --pipelines argument: it enables all of the pipelines defined in settings.py,
so just run without AddToDB.
If you want to keep some pipelines from running, it can be tricky; you might want to just create a child class of your spider, add the class attribute custom_settings, and restrict the pipelines in it.
So in your case something like:
class MySpider2(MySpider):
    name = 'spider2'
    # ITEM_PIPELINES is a dict mapping each pipeline path to a priority
    custom_settings = {'ITEM_PIPELINES': {'project.pipelines.AddToDB': 100}}
and then use scrapy parse 'http://example.com' --spider=spider2 --pipelines.
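Since you mentioned wanting to do this from a Python script rather than the command line, here is a rough sketch using CrawlerProcess (the import path for MySpider is an assumption; adjust it to your project layout, and your pipelines still need to be enabled in settings.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.myspider import MySpider  # hypothetical path, adjust to your project

# get_project_settings() reads settings.py, including ITEM_PIPELINES,
# so the AddToDB pipeline runs just as it does under `scrapy crawl`.
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider, start_url='http://www.example.com')
process.start()  # blocks until the crawl finishes

This keeps everything inside one Python process, so you pay the Scrapy start-up cost once instead of on every invocation of the command line tool.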

Scrapy spider is not working

Since nothing so far is working, I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, created the folders, and added a new spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']

    names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
    u = names.pop()

    rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

    def parse(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
and when I run
C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase
I get
[Nu] ERROR: Could not find spider for domain: wcase
The other spiders are at least recognized by Scrapy; this one is not. What am I doing wrong?
Thanks for your help!
Please also check the version of Scrapy. The latest version uses the "name" attribute instead of "domain_name" to uniquely identify a spider.
Have you included the spider in the SPIDER_MODULES list in your scrapy_settings.py?
It's not written anywhere in the tutorial that you should do this, but you do have to.
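For example, something like this in scrapy_settings.py (the module path is an assumption based on the project name Nu from the question):

SPIDER_MODULES = ['Nu.spiders']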
These two lines look like they're causing trouble:
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
Only one rule will be followed each time the script is run. Consider creating a rule for each URL.
You haven't created a parse_item callback, which means that the rule does nothing. The only callback you've defined is parse, which changes the default behaviour of the spider.
Also, here are some things that will be worth looking into.
CrawlSpider doesn't like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings. You'll see that this is the preferred way to override the default parse method for your starting URLs.
NuSpider.hxs is called before it's defined.
I believe you have syntax errors there: the names = hxs... line will not work because hxs is not defined at that point.
Try running python yourproject/spiders/domain.py to get syntax errors.
You are overriding the parse method, instead of implementing a new parse_item method.
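Pulling those points together, a rough sketch of what the spider might look like (keeping the question's old-style imports and assuming NuItem defines a school field):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Nu.items import NuItem

class NuSpider(CrawlSpider):
    name = "wcase"  # recent Scrapy versions use `name` rather than `domain_name`
    start_urls = ['http://www.whitecase.com/aabbas/']

    # Build the rule from a pattern instead of from scraped data;
    # there is no hxs object available at class-definition time.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/.a\w+',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Implement the callback the rule names instead of overriding parse()
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re(r'(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()  # kept from the question; very old Scrapy versions load spiders via this variable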
