I'm inching my way through this (1) tutorial.
I'm working in a folder I created as a scrapy project from the command line:
Users/myname/Desktop/MyProject/MyProject/Spider/MyProject_spider.py
My code is
import [everything necessary]
class myProjectSpider(CrawlSpider):
name = 'myProject'
allowed_domains = ['http://www.reddit.com/r/listentothis']
start_urls = ['http://www.reddit.com/r/listentothis']
rules = (Rule(LinkExtractor(allow=('http://www.reddit.com/r/listentothis/.+'), deny_domains=('www.youtube.com', 'www.soundcloud.com', 'www.reddit.com/user/.+')),'parse_start_url',follow=False),)
def parse_start_url(self, response):
hxs = HtmlXPathSelector(response)
title1 = hxs.select('class="Title"').extract(text)
yield request
In the command line, I navigate to Desktop>MyProject and enter
scrapy crawl myProject
The error I always get is
"Spider not found: myProject."
I've tried using different names (making the spider name match the class name, making the class lame match the file name, making the file name match the project name, and every combination of the above), and I tried calling the command from different files in the project.
From the current folder you need to run scrapy runspider MyProject_spider
and if you want to crawl you need to create a project,place MyProject_Spider.py in the spider directory and then go to the top level directory and run scrapy crawl myProject.
Related
So i am working on a small crawler using scrapy and python on this website https://www.theverge.com/reviews. From there i am trying to extract the reviews based on the rules i have set which should match links that match this criteria:
example: https://www.theverge.com/22274747/tern-hsd-p9-ebike-review-electric-cargo-bike-price-specs
Extracting the url from the review page, title of the page, name of who made the review and the link to their profile. However i assume there is something either wrong with my code or something wrong with the way i have my files sorted. Because this error when i try to run it:
runspider: error: Unable to load 'spiders/vergespider.py': No module named 'oblig3.oblig3'
My folders look like this.
So my intended results should look something like this. Visiting up to 20 pages, which i don't quite understand how to fix through the scrapy settings, but that is another problem.
authorlink,authorname,title,url
"https://www.theverge.com/authors/cameron-faulkner,https://www.twitter.com/camfaulkner",Cameron
Faulkner,"Gigabyte’s Aorus 15G is great at gaming, but not much
else",https://www.theverge.com/22299226/gigabyte-aorus-15g-review-gaming-laptop-price-specs-features
So my question is what could be causing the error i am getting why am i not getting any csv output from this code. I am fairly new at python and scrapy oo any tips or improvement to the code are appreciated. I would like to keep the "solutions" through scrapy and python as those are the things i am trying to learn atm.
Edit:
This is what i use to run the code with scrapy runspider spiders/vergespider.py -o vergetest.csv -t csv. And this is what i have coded so far.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from oblig3.items import VergeReview
class VergeSpider(CrawlSpider):
name = 'verge'
allowed_domains = ['theverge.com']
start_urls = ['https://www.theverge.com/reviews']
rules = [
Rule(LinkExtractor(allow=r'^(https://www.theverge.com/)(/d+)/([%5E/]+$)%27'),
callback='parse_items', follow=True),
Rule(LinkExtractor(allow=r'.*'),
callback='parse_items', cb_kwargs={'is_verge': False})
]
def parse(self, response, is_verge):
if is_verge:
verge = VergeReview()
verge['url'] = response.url
verge['title'] = response.xpath("//h1/text()").extract_first()
verge['authorname'] = response.xpath("//span[#class='c-byline__author-name']/text()").extract()
verge['authorlink'] = response.xpath("//*/span[#class = 'c-byline__item'][1]/a/#href").extract()
yield verge
else:
# Do something else
pass
My items file
import scrapy
class VergeReview(scrapy.Item):
url = scrapy.Field()
title = scrapy.Field()
authorname = scrapy.Field()
authorlink = scrapy.Field()
And my settings file is unchanged though i should implement CLOSESPIDER_PAGECOUNT = 20 but idk how.
The error you have is:
runspider: error ..... No module named 'oblig3.oblig3'
What I can see from your screenprint is that oblig3 is the name of your project.
This is a common error when you try to run your spider using:
scrapy runspider spider_file.py
If you are running your spider this way, you need to change the way you are running the spider:
First, make sure that you are in the directory where scrapy.cfg is located
then run
scrapy list
This should give you a list of all the spiders it found.
After that, you should use this command to run your spider.
scrapy crawl <spidername>
If this does not solve your problem, you need to share the code and share the details about how you are running your spider.
I have two spiders in my spider.py class, and I want to run them and generate a csv file.
Below is the structure of my spider.py
class tmallSpider(scrapy.Spider):
name = 'tspider'
...
class jdSpider(scrapy.Spider):
name = 'jspider'
...
configure_logging()
runner = CrawlerRunner()
#defer.inlineCallbacks
def crawl():
yield runner.crawl(tmallSpider)
yield runner.crawl(jdSpider)
reactor.stop()
crawl()
reactor.run()
Below is the structure for my items.py
class TmallspiderItem(scrapy.Item):
# define the fields for your item here like:
product_name_tmall = scrapy.Field()
product_price_tmall = scrapy.Field()
class JdspiderItem(scrapy.Item):
product_name_jd = scrapy.Field()
product_price_jd = scrapy.Field()
I want to generate a csv file with four columns:
product_name_tmall | product_price_tmall | product_name_jd | product_price_jd
I did scrapy crawl -o prices.csv in pycharm's terminal but nothing is generated.
I scrolled up and find out only the jd items are printed in terminal, I do not see any tmall items printed.
However, if I add a open_in_browser command for the tmall spider, the brower DOES open. I guess the code was executed, but somehow the data is not recorded?
If I run scrapy crawl tspider and scrapy crawl jspider individually, everything is correct and the csv file is generated.
Is this a problem with how I ran the program or is there a problem with my code? Any ideas how to fix it?
I think it's going wrong how you are initiating spider runs.
You can simply use CrawlerProcess to initiate jobs.
You can have a look at this page https://docs.scrapy.org/en/latest/topics/practices.html for the usage of CrawlerProcess.
I have 3 number of spider files and classes. And I want to save item informations at csv file which has different filename defendant the variable parameter of searching condition. For that, I need to access the spider class parameter.
So, my questions are three.
How can I access the spider class's parameter?
What is the best way to make each csv files? The trigger condition is that will call request at parse function for new searching result.
logger = logging.getLogger(__name__) it's not working in pipelines.py
How can I print that information?
Bellow is my log code style
logger.log(logging.INFO,'\n======= %s ========\n', filename)
I had been searching the ways in google so many times. But I couldn't find the solution.
I did try to use from_crawler function, but I couldn't find the adapt case
Scrapy 1.6.0
python 3.7.3
os window 7 / 32bit
Code:
class CensusGetitemSpider(scrapy.Spider):
name = 'census_getitem'
startmonth=1
filename = None
def parse(self, response):
for x in data:
self.filename = str(startmonth+1)
.
.
.
yield item
yield scrapy.Request(link, callback=self.parse)
you can access spider class and instance attributes from pipeline.py using the spider parameter passed in most of pipeline methods.
For example, :
open_spider(self, spider):
self.filename = spider.name
You can see more about item pipelines here https://docs.scrapy.org/en/latest/topics/item-pipeline.html
You can save it directly from the command line, just define a filename:
scrapy crawl yourspider -o output.csv
But if you really need it to be set from the spider, you can use a custom setting per spider, for example:
class YourSpider(scrapy.Spider):
name = 'yourspider'
start_urls = 'www.yoursite.com'
custom_settings = {
'FEED_URI':'output.csv',
'FEED_FORMAT': 'csv',
}
Use spider.logger.info('Your message')
I am running scrapy on Anaconda and have tried to run example code from this DigitalOcean guide as shown below:
import scrapy
from scrapy import Spider
class BrickSetSpider(scrapy.Spider):
name = "brickset_spider"
start_urls = ['http://brickset.com/sets/year-2016']
I am a beginner with Scrapy so keep this in mind.This code executes but no output is shown. There is supposed to be output based on the article I got the code from. Please let me know how to view the information the spider gathers. I am running the module off my IDLE, if I try to do "runspider" in cmd it says it cannot find my python file even though I can see the file directory and open it on IDLE.Thanks in advance.
Your spider is missing a callback method to handle the response from http://brickset.com/sets/year-2016.
Try defining a callback method like this:
import scrapy
from scrapy import Spider
class BrickSetSpider(scrapy.Spider):
name = "brickset_spider"
start_urls = ['http://brickset.com/sets/year-2016']
def parse(self, response):
self.log('I visited: {}'.format(response.url))
By default, Scrapy calls the parse method defined in your spider to handle the responses for the requests that your spider generates.
Have a look at the official Scrapy tutorial too: https://doc.scrapy.org/en/latest/intro/tutorial.html
For example i had a site "www.example.com"
Actually i want to scrape the html of this site by saving on to local system.
so for testing i saved that page on my desktop as example.html
Now i had written the spider code for this as below
class ExampleSpider(BaseSpider):
name = "example"
start_urls = ["example.html"]
def parse(self, response):
print response
hxs = HtmlXPathSelector(response)
But when i run the above code i am getting this error as below
ValueError: Missing scheme in request url: example.html
Finally my intension is to scrape the example.html file that consists of www.example.com html code saved in my local system
Can any one suggest me on how to assign that example.html file in start_urls
Thanks in advance
You can crawl a local file using an url of the following form:
file:///path/to/file.html
You can use the HTTPCacheMiddleware, which will give you the ability to to a spider run from cache. The documentation for the HTTPCacheMiddleware settings is located here.
Basically, adding the following settings to your settings.py will make it work:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire
This however requires to do an initial spider run from the web to populate the cache.
In scrapy, You can scrape local file using:
class ExampleSpider(BaseSpider):
name = "example"
start_urls = ["file:///path_of_directory/example.html"]
def parse(self, response):
print response
hxs = HtmlXPathSelector(response)
I suggest you check it using scrapy shell 'file:///path_of_directory/example.html'
Just to share the way that I like to do this scraping with local files:
import scrapy
import os
LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
class ExampleSpider(scrapy.Spider):
name = "example"
start_urls = [
f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
]
I'm using f-strings (python 3.6+)(https://www.python.org/dev/peps/pep-0498/), but you can change with %-formatting or str.format() as you prefer.
scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"
this works for me today on Windows 10.
I have to put the full path without the ////.
You can simple do
def start_requests(self):
yield Request(url='file:///path_of_directory/example.html')
If you view source code of scrapy Request for example github . You can understand what scrapy send request to http server and get needed page in response from server. Your filesystem is not http server. For testing purpose with scrapy you must setup http server. And then you can assign urls to scrapy like
http://127.0.0.1/example.html