Why does Scrapy give an "Unable to load" error? - python

So I am working on a small crawler using Scrapy and Python on this website https://www.theverge.com/reviews. From there I am trying to extract the reviews based on the rules I have set, which should match links that fit this pattern:
example: https://www.theverge.com/22274747/tern-hsd-p9-ebike-review-electric-cargo-bike-price-specs
From the review page I want to extract the URL, the title of the page, the name of whoever wrote the review, and the link to their profile. However, I assume there is something wrong either with my code or with the way I have my files organized, because I get this error when I try to run it:
runspider: error: Unable to load 'spiders/vergespider.py': No module named 'oblig3.oblig3'
My folders look like this.
My intended results should look something like this, visiting up to 20 pages (which I don't quite understand how to limit through the Scrapy settings, but that is another problem):
authorlink,authorname,title,url
"https://www.theverge.com/authors/cameron-faulkner,https://www.twitter.com/camfaulkner",Cameron
Faulkner,"Gigabyte’s Aorus 15G is great at gaming, but not much
else",https://www.theverge.com/22299226/gigabyte-aorus-15g-review-gaming-laptop-price-specs-features
So my question is: what could be causing the error I am getting, and why am I not getting any CSV output from this code? I am fairly new to Python and Scrapy, so any tips or improvements to the code are appreciated. I would like to keep the solutions within Scrapy and Python, as those are the things I am trying to learn at the moment.
Edit:
This is the command I use to run the code: scrapy runspider spiders/vergespider.py -o vergetest.csv -t csv. And this is what I have coded so far.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from oblig3.items import VergeReview


class VergeSpider(CrawlSpider):
    name = 'verge'
    allowed_domains = ['theverge.com']
    start_urls = ['https://www.theverge.com/reviews']

    rules = [
        Rule(LinkExtractor(allow=r'^(https://www.theverge.com/)(\d+)/([^/]+)$'),
             callback='parse_items', follow=True),
        Rule(LinkExtractor(allow=r'.*'),
             callback='parse_items', cb_kwargs={'is_verge': False})
    ]

    def parse(self, response, is_verge):
        if is_verge:
            verge = VergeReview()
            verge['url'] = response.url
            verge['title'] = response.xpath("//h1/text()").extract_first()
            verge['authorname'] = response.xpath("//span[@class='c-byline__author-name']/text()").extract()
            verge['authorlink'] = response.xpath("//*/span[@class = 'c-byline__item'][1]/a/@href").extract()
            yield verge
        else:
            # Do something else
            pass
My items file
import scrapy


class VergeReview(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    authorname = scrapy.Field()
    authorlink = scrapy.Field()
And my settings file is unchanged, though I should probably set CLOSESPIDER_PAGECOUNT = 20 there; I don't know how.

The error you have is:
runspider: error ..... No module named 'oblig3.oblig3'
What I can see from your screenshot is that oblig3 is the name of your project.
This is a common error when you try to run your spider using:
scrapy runspider spider_file.py
If you are running your spider this way, you need to change how you run it:
First, make sure that you are in the directory where scrapy.cfg is located
then run
scrapy list
This should give you a list of all the spiders it found.
After that, you should use this command to run your spider.
scrapy crawl <spidername>
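If you also want the CSV file that the original runspider command was producing, the feed-export flag works the same way with scrapy crawl; assuming a reasonably recent Scrapy (2.1+, where -O overwrites the output file on each run):
scrapy crawl verge -O vergetest.csv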
If this does not solve your problem, please share your code and the details of how you are running your spider.
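As for the side question about stopping after roughly 20 pages: CLOSESPIDER_PAGECOUNT is a built-in setting of Scrapy's close-spider extension, so a minimal sketch of the relevant line in your project's settings.py would be:
# stop the crawl after about 20 responses have been downloaded
CLOSESPIDER_PAGECOUNT = 20
The limit is approximate, because requests that are already in flight when the threshold is reached may still be processed.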

Related

I don't understand how to print scrapy data in a table

I've looked at several things, but I'm not able to get this into a table or a .csv file, or to print the table on the screen. Can anyone help me?
I'm lost.
import scrapy


class SinonimoSpider(scrapy.Spider):
    name = 'sinonimo'
    start_urls = ['https://www.sinonimos.com.br/pedido/']

    def parse(self, response):
        for i in response.css('.sinonimo'):
            yield {
                'sinonimo': i.css('a.sinonimo ::text').get()
            }
Do you mean you are unable to see the data and want the spider to store the data in CSV?
There are many ways to do it.
The most popular one is to pass an output file when running the spider from the terminal:
$ scrapy crawl sinonimo -O sinonimo.csv # in case of CSV
$ scrapy crawl sinonimo -O sinonimo.json # in case of json
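If you would rather have the spider write the CSV itself, without command-line flags, another option is the FEEDS setting; a minimal sketch, assuming Scrapy 2.1 or newer (the output filename is just an example):
import scrapy


class SinonimoSpider(scrapy.Spider):
    name = 'sinonimo'
    start_urls = ['https://www.sinonimos.com.br/pedido/']

    # FEEDS tells Scrapy where to export the scraped items and in which format
    custom_settings = {
        'FEEDS': {
            'sinonimo.csv': {'format': 'csv'},
        },
    }

    def parse(self, response):
        for i in response.css('.sinonimo'):
            yield {
                'sinonimo': i.css('a.sinonimo ::text').get()
            }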
If you need any help, just leave a comment.

Only getting one result when running two spiders sequentially with scrapy

I have two spiders in my spider.py file, and I want to run them and generate a csv file.
Below is the structure of my spider.py
class tmallSpider(scrapy.Spider):
    name = 'tspider'
    ...

class jdSpider(scrapy.Spider):
    name = 'jspider'
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(tmallSpider)
    yield runner.crawl(jdSpider)
    reactor.stop()

crawl()
reactor.run()
Below is the structure for my items.py
class TmallspiderItem(scrapy.Item):
    # define the fields for your item here like:
    product_name_tmall = scrapy.Field()
    product_price_tmall = scrapy.Field()

class JdspiderItem(scrapy.Item):
    product_name_jd = scrapy.Field()
    product_price_jd = scrapy.Field()
I want to generate a csv file with four columns:
product_name_tmall | product_price_tmall | product_name_jd | product_price_jd
I ran scrapy crawl -o prices.csv in PyCharm's terminal, but nothing is generated.
I scrolled up and found that only the jd items are printed in the terminal; I do not see any tmall items.
However, if I add an open_in_browser command for the tmall spider, the browser DOES open. I guess the code was executed, but somehow the data is not recorded?
If I run scrapy crawl tspider and scrapy crawl jspider individually, everything is correct and the csv file is generated.
Is this a problem with how I ran the program or is there a problem with my code? Any ideas how to fix it?
I think the problem is in how you are initiating the spider runs.
You can simply use CrawlerProcess to initiate jobs.
You can have a look at this page https://docs.scrapy.org/en/latest/topics/practices.html for the usage of CrawlerProcess.
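A minimal sketch of what that could look like, assuming both spider classes can be imported from your spider.py and that you want all four columns in a single prices.csv (the FEEDS and FEED_EXPORT_FIELDS settings require Scrapy 2.1+):
from scrapy.crawler import CrawlerProcess

from spider import tmallSpider, jdSpider  # assuming spider.py is importable as a module

process = CrawlerProcess(settings={
    # write all items from both spiders into one CSV file
    'FEEDS': {'prices.csv': {'format': 'csv'}},
    # pin the column order; otherwise the CSV exporter takes its headers
    # from the first item it sees and may drop the other spider's fields
    'FEED_EXPORT_FIELDS': [
        'product_name_tmall', 'product_price_tmall',
        'product_name_jd', 'product_price_jd',
    ],
})
process.crawl(tmallSpider)
process.crawl(jdSpider)
process.start()  # blocks here until both crawls are finished
CrawlerProcess starts and stops the Twisted reactor for you, so the manual reactor.run()/reactor.stop() calls are no longer needed.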

Scrapy - basic scraper example returns no output

I am running scrapy on Anaconda and have tried to run example code from this DigitalOcean guide as shown below:
import scrapy
from scrapy import Spider


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
I am a beginner with Scrapy, so keep this in mind. This code executes but no output is shown, and there is supposed to be output based on the article I got the code from. Please let me know how to view the information the spider gathers. I am running the module from IDLE; if I try to do "runspider" in cmd, it says it cannot find my Python file even though I can see the file in the directory and open it in IDLE. Thanks in advance.
Your spider is missing a callback method to handle the response from http://brickset.com/sets/year-2016.
Try defining a callback method like this:
import scrapy
from scrapy import Spider


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        self.log('I visited: {}'.format(response.url))
By default, Scrapy calls the parse method defined in your spider to handle the responses for the requests that your spider generates.
Have a look at the official Scrapy tutorial too: https://doc.scrapy.org/en/latest/intro/tutorial.html
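To actually collect data rather than just log lines, the parse method has to yield something, and an output file can be given on the command line. A rough sketch in the spirit of that guide (the CSS selectors are placeholders and may need adjusting to the real page markup):
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        # '.set' is assumed to be the container element for each LEGO set
        for brickset in response.css('.set'):
            yield {
                'name': brickset.css('h1 a::text').get(),
                'image': brickset.css('img::attr(src)').get(),
            }
Running it with, for example, scrapy runspider brickset_spider.py -o sets.csv from the folder that contains the file will write the yielded dictionaries to sets.csv.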

Scrapy can't find spider

I'm inching my way through this tutorial.
I'm working in a folder I created as a scrapy project from the command line:
Users/myname/Desktop/MyProject/MyProject/Spider/MyProject_spider.py
My code is
import [everything necessary]


class myProjectSpider(CrawlSpider):
    name = 'myProject'
    allowed_domains = ['http://www.reddit.com/r/listentothis']
    start_urls = ['http://www.reddit.com/r/listentothis']
    rules = (
        Rule(LinkExtractor(allow=('http://www.reddit.com/r/listentothis/.+'),
                           deny_domains=('www.youtube.com', 'www.soundcloud.com', 'www.reddit.com/user/.+')),
             'parse_start_url', follow=False),
    )

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        title1 = hxs.select('class="Title"').extract(text)
        yield request
In the command line, I navigate to Desktop>MyProject and enter
scrapy crawl myProject
The error I always get is
"Spider not found: myProject."
I've tried using different names (making the spider name match the class name, making the class name match the file name, making the file name match the project name, and every combination of the above), and I've tried calling the command from different directories in the project.
From the folder that contains the file, you can run scrapy runspider MyProject_spider.py.
If you want to use scrapy crawl, you need a project: place MyProject_spider.py in the spiders directory, then go to the top-level directory (the one containing scrapy.cfg) and run scrapy crawl myProject.
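For reference, a sketch of the layout scrapy crawl expects (the default structure that scrapy startproject MyProject creates, with the spider file from the question dropped into it):
MyProject/                      <- run 'scrapy crawl myProject' from here
    scrapy.cfg
    MyProject/
        __init__.py
        items.py
        settings.py
        spiders/
            __init__.py
            MyProject_spider.py  <- contains name = 'myProject'
scrapy crawl looks the spider up by its name attribute, not by the file or class name, so the command has to use myProject exactly as it appears in name = 'myProject'.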

Scrapy spider is not working

Since nothing so far is working, I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, created the folders, and wrote a new spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u


class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']

    names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
    u = names.pop()
    rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

    def parse(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
and when I run
C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase
I get
[Nu] ERROR: Could not find spider for domain: wcase
The other spiders are at least recognized by Scrapy; this one is not. What am I doing wrong?
Thanks for your help!
Please also check your version of Scrapy. The latest version uses the "name" attribute instead of "domain_name" to uniquely identify a spider.
Have you included the spider's module in the SPIDER_MODULES list in your scrapy_settings.py?
It's not written anywhere in the tutorial that you should do this, but you do have to.
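In the settings file, that would look roughly like this (assuming the project package is named Nu, as created by the startproject command above):
SPIDER_MODULES = ['Nu.spiders']
NEWSPIDER_MODULE = 'Nu.spiders'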
These two lines look like they're causing trouble:
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
Only one rule will be followed each time the script is run. Consider creating a rule for each URL.
You haven't created a parse_item callback, which means that the rule does nothing. The only callback you've defined is parse, which changes the default behaviour of the spider.
Also, here are some things that will be worth looking into.
CrawlSpider doesn't like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings. You'll see that this is the preferred way to override the default parse method for your starting URLs.
NuSpider.hxs is called before it's defined.
I believe you have errors there: the names = hxs... line will not work because hxs is not defined at that point.
Try running python yourproject/spiders/domain.py to get syntax errors.
You are overriding the parse method, instead of implementing a new parse_item method.
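Putting those points together, here is a rough sketch of how the spider could look on a current Scrapy version (the link pattern and the extraction expressions are assumptions carried over from the question, not verified against the site):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NuSpider(CrawlSpider):
    name = "wcase"  # current Scrapy uses 'name' instead of 'domain_name'
    allowed_domains = ["whitecase.com"]
    start_urls = ['http://www.whitecase.com/aabbas/']

    # let the rule discover the attorney pages; the pattern is a guess
    rules = (
        Rule(LinkExtractor(allow=(r'/aabbas/.+',)), callback='parse_item'),
    )

    # parse_item, not parse: CrawlSpider needs its own parse method intact
    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        yield {
            # regex carried over from the question, slightly simplified
            'school': response.xpath('//td[@class="mainColumnTDa"]').re(r'(?<=JD,\s)(.*?)(\d+)'),
        }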
