I've looked at several examples, but I can't get this data into a table or a .csv file so I can print the table on the screen. Can anyone help me?
I'm lost.
import scrapy

class SinonimoSpider(scrapy.Spider):
    name = 'sinonimo'
    start_urls = ['https://www.sinonimos.com.br/pedido/']

    def parse(self, response):
        for i in response.css('.sinonimo'):
            yield {
                'sinonimo': i.css('a.sinonimo ::text').get()
            }
Do you mean you are unable to see the data and want the spider to store the data in CSV?
There are many ways to do it.
The most popular one is to pass an output file when running the spider from the terminal:
$ scrapy crawl sinonimo -O sinonimo.csv   # in case of CSV
$ scrapy crawl sinonimo -O sinonimo.json  # in case of JSON
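Another option, if you prefer not to pass command-line flags every time, is to declare the export file inside the spider itself through the FEEDS setting. This is only a minimal sketch, and it assumes Scrapy 2.1 or newer (where FEEDS exists); adapt it to your spider:
import scrapy

class SinonimoSpider(scrapy.Spider):
    name = 'sinonimo'
    start_urls = ['https://www.sinonimos.com.br/pedido/']
    # Assumption: Scrapy 2.1+, where the FEEDS setting is available.
    custom_settings = {
        'FEEDS': {
            'sinonimo.csv': {'format': 'csv'},  # written when the crawl finishes
        },
    }

    def parse(self, response):
        for i in response.css('.sinonimo'):
            yield {'sinonimo': i.css('a.sinonimo ::text').get()}
With that in place, a plain scrapy crawl sinonimo produces sinonimo.csv without any extra flags.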
If you need any help, just leave a comment.
So I am working on a small crawler using Scrapy and Python on this website: https://www.theverge.com/reviews. From there I am trying to extract the reviews based on the rules I have set, which should match links that fit this criterion:
example: https://www.theverge.com/22274747/tern-hsd-p9-ebike-review-electric-cargo-bike-price-specs
I want to extract the URL of the review page, the title of the page, the name of whoever wrote the review, and the link to their profile. However, I assume there is something wrong either with my code or with the way my files are organized, because I get this error when I try to run it:
runspider: error: Unable to load 'spiders/vergespider.py': No module named 'oblig3.oblig3'
My folders look like this.
So my intended results should look something like this, visiting up to 20 pages (which I don't quite understand how to limit through the Scrapy settings, but that is another problem):
authorlink,authorname,title,url
"https://www.theverge.com/authors/cameron-faulkner,https://www.twitter.com/camfaulkner",Cameron
Faulkner,"Gigabyte’s Aorus 15G is great at gaming, but not much
else",https://www.theverge.com/22299226/gigabyte-aorus-15g-review-gaming-laptop-price-specs-features
So my question is: what could be causing the error I am getting, and why am I not getting any CSV output from this code? I am fairly new to Python and Scrapy, so any tips or improvements to the code are appreciated. I would like to keep the "solutions" within Scrapy and Python, as those are the things I am trying to learn at the moment.
Edit:
This is what I use to run the code: scrapy runspider spiders/vergespider.py -o vergetest.csv -t csv. And this is what I have coded so far.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from oblig3.items import VergeReview

class VergeSpider(CrawlSpider):
    name = 'verge'
    allowed_domains = ['theverge.com']
    start_urls = ['https://www.theverge.com/reviews']

    rules = [
        Rule(LinkExtractor(allow=r'^(https://www.theverge.com/)(/d+)/([^/]+$)'),
             callback='parse_items', follow=True),
        Rule(LinkExtractor(allow=r'.*'),
             callback='parse_items', cb_kwargs={'is_verge': False})
    ]

    def parse(self, response, is_verge):
        if is_verge:
            verge = VergeReview()
            verge['url'] = response.url
            verge['title'] = response.xpath("//h1/text()").extract_first()
            verge['authorname'] = response.xpath("//span[@class='c-byline__author-name']/text()").extract()
            verge['authorlink'] = response.xpath("//*/span[@class='c-byline__item'][1]/a/@href").extract()
            yield verge
        else:
            # Do something else
            pass
My items file:
import scrapy

class VergeReview(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    authorname = scrapy.Field()
    authorlink = scrapy.Field()
And my settings file is unchanged, though I should implement CLOSESPIDER_PAGECOUNT = 20 but I don't know how.
The error you have is:
runspider: error ..... No module named 'oblig3.oblig3'
What I can see from your screenshot is that oblig3 is the name of your project.
This is a common error when you try to run your spider using:
scrapy runspider spider_file.py
If you are running your spider this way, you need to change how you run it:
First, make sure that you are in the directory where scrapy.cfg is located,
then run
scrapy list
This should give you a list of all the spiders it found.
After that, use this command to run your spider:
scrapy crawl <spidername>
If this does not solve your problem, you need to share the code and share the details about how you are running your spider.
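As for the CLOSESPIDER_PAGECOUNT = 20 part of your question: you can set it per spider through custom_settings, and the built-in CloseSpider extension will stop the crawl once that many pages have been downloaded. A minimal sketch with a hypothetical stripped-down spider, just to show where the setting goes (the setting name is real; everything else is illustrative):
import scrapy

class LimitedVergeSpider(scrapy.Spider):
    # Hypothetical minimal spider; in your project this would go on VergeSpider.
    name = 'verge_limited'
    start_urls = ['https://www.theverge.com/reviews']
    # The built-in CloseSpider extension stops the crawl after ~20 responses.
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 20,
    }

    def parse(self, response):
        yield {'url': response.url}
Alternatively, put CLOSESPIDER_PAGECOUNT = 20 in settings.py to apply it to every spider in the project.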
I am running scrapy on Anaconda and have tried to run example code from this DigitalOcean guide as shown below:
import scrapy
from scrapy import Spider

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
I am a beginner with Scrapy, so keep this in mind. This code executes, but no output is shown, even though there is supposed to be output according to the article I got the code from. Please let me know how to view the information the spider gathers. I am running the module from IDLE; if I try to do "runspider" in cmd, it says it cannot find my Python file, even though I can see the file in its directory and open it in IDLE. Thanks in advance.
Your spider is missing a callback method to handle the response from http://brickset.com/sets/year-2016.
Try defining a callback method like this:
import scrapy
from scrapy import Spider

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        self.log('I visited: {}'.format(response.url))
By default, Scrapy calls the parse method defined in your spider to handle the responses for the requests that your spider generates.
Have a look at the official Scrapy tutorial too: https://doc.scrapy.org/en/latest/intro/tutorial.html
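Since you mention running the module from IDLE: you can also start the spider from an ordinary Python script using CrawlerProcess, instead of the command line. A minimal sketch (treat it as an illustration, not the guide's exact code):
import scrapy
from scrapy.crawler import CrawlerProcess

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        # Log something visible so you can confirm the spider actually ran.
        self.log('I visited: {}'.format(response.url))

# Running from a plain script (e.g. opened in IDLE) instead of the command line:
process = CrawlerProcess()
process.crawl(BrickSetSpider)
process.start()  # blocks here until the crawl is finished
Save this as a normal .py file and run it with the regular Python interpreter; the log output appears in the console.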
I'm relatively new to python and scrapy; just started learning from tutorials a few days ago.
Like the title says, I've been trying to get a simple text scraping operation done as practice, by scraping chapters from fanfiction.net. However, I've run into a roadblock where even though my test selectors in the shell work perfectly fine, the spider itself still returns nothing when I run the command scrapy crawl fanfiction -o fanfiction.json.
Here is the code that I have so far; it's essentially a modified version of the tutorial code from doc.scrapy.org.
import scrapy

class FFSpider(scrapy.Spider):
    name = "fanfiction"
    start_urls = [
        'https://www.fanfiction.net/s/12580108/1/Insane-Gudako',
        'https://www.fanfiction.net/s/12580108/2/Insane-Gudako',
    ]

    def parse(self, response):
        from scrapy.shell import inspect_response
        inspect_response(response, self)

        for chapter in response.css('div.fanfiction'):
            yield {
                'summary': chapter.css('.storytext p').xpath('text()').extract()
            }
In the inline shell call, testing using chapter.css('.storytext p').xpath('text()').extract() returns the text properly, but once the spider finishes crawling, fanfiction.json is still an empty file.
What is the problem here?
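One thing worth checking (a debugging sketch, not a confirmed diagnosis): if the outer selector div.fanfiction matches nothing on the live page, the for loop in parse() never executes, so nothing is ever yielded and fanfiction.json stays empty, even though the inner .storytext selector works on its own in the shell. You can verify this directly from the same inspect_response shell:
# Inside the shell opened by inspect_response(response, self):
# If this prints 0, the loop body in parse() never runs and no items are yielded.
len(response.css('div.fanfiction'))

# Compare with the selector you tested on its own, evaluated against the whole response:
len(response.css('.storytext p'))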
I'm trying to scrape a single URL with Scrapy. I don't want it to crawl, just parse the item, run the pipelines, and return. My pipeline just updates the database. The following code is what I've done so far; it takes around 3 seconds, but it seems like most of the time is spent loading Scrapy. Is there a better way to do this?
Ideally I want to parse a single URL from a Python script, not the command line.
def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.start_urls = [kwargs.get('start_url')]

def parse(self, response):
    if 'item.asp' in response.url:
        yield Request(response.url, callback=self.parse_item)
Then I'm running it from the command line like the following:
time scrapy crawl --loglevel=DEBUG MySpider -a start_url="www.example.com"
I also tried the following, but it never worked with the pipeline parameter:
time scrape parse "www.example.com" --spider=MySpider --callback parse_item --pipelines AddToDB
Check the documentation for scrapy parse: http://doc.scrapy.org/en/latest/topics/commands.html?highlight=parse#std:command-parse
In your case you are misunderstanding the --pipelines argument: it enables all of the pipelines defined in settings.py,
so just run it without AddToDB.
If you want to keep some pipelines from running, that can be tricky; you might want to create a child class of your spider, add the class attribute custom_settings, and restrict the pipelines in it.
So in your case something like:
class MySpider2(MySpider):
    name = 'spider2'
    custom_settings = {'ITEM_PIPELINES': {'project.pipelines.AddToDB': 100}}
and then use scrapy parse 'http://example.com' --spider=spider2 --pipelines.
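If you also want to drop the command line entirely, as you mention, one option is to start the crawl from a Python script with CrawlerProcess and pass the URL in through the same start_url argument your __init__ already accepts. A minimal sketch, reusing the example.com placeholder from your question; the import path for MySpider is hypothetical:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
# Hypothetical import path; adjust to wherever MySpider actually lives.
from myproject.spiders.myspider import MySpider

# get_project_settings() picks up settings.py (and therefore your pipelines),
# as long as this script runs from inside the Scrapy project directory.
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider, start_url='www.example.com')  # same as -a start_url=... on the CLI
process.start()  # blocks until the crawl and pipelines have finished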
I have nearly 2,500 unique links, and I want to run BeautifulSoup on each of them and gather some text captured in paragraphs on each of the 2,500 pages. I could create a variable for each link, but having 2,500 of them is obviously not the most efficient course of action. The links are contained in a list like the following:
linkslist = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
Should I just write a for loop like the following?
for link in linkslist:
    opened_url = urllib2.urlopen(link).read()
    soup = BeautifulSoup(opened_url)
    ...
I'm looking for any constructive criticism. Thanks!
This is a good use case for Scrapy - a popular web-scraping framework based on Twisted:
Scrapy is written with Twisted, a popular event-driven networking
framework for Python. Thus, it’s implemented using a non-blocking (aka
asynchronous) code for concurrency.
Set the start_urls property of your spider and parse the page inside the parse() callback:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.website.com/category/item1",
                  "http://www.website.com/category/item2",
                  "http://www.website.com/category/item3", ...]
    allowed_domains = ["website.com"]

    def parse(self, response):
        print(response.xpath("//title/text()").extract())
How about writing a function that would treat each URL separately?
def processURL(url):
    # Your code here
    pass
map(processURL, linkslist)
This will run your function on each URL in your list. If you want to speed things up, it is easy to run in parallel:
from multiprocessing import Pool
list(Pool(processes = 10).map(processURL, linkslist))
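For completeness, here is a minimal sketch of what processURL could look like, reusing the urllib2/BeautifulSoup approach from the question (Python 2; on Python 3 you would use urllib.request instead). The paragraph-text extraction is an assumption about what you want to gather:
import urllib2
from bs4 import BeautifulSoup

def processURL(url):
    # Fetch the page and parse it with BeautifulSoup.
    opened_url = urllib2.urlopen(url).read()
    soup = BeautifulSoup(opened_url)
    # Gather the text of every <p> tag on the page.
    return [p.get_text() for p in soup.find_all("p")]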