I am trying to run the scrapy exporter with a custom delimiter via CLI like this:
scrapy runspider beneficiari_2016.py -o beneficiari_2016.csv -t csv -a CSV_DELIMITER="\n"
The export works perfectly, but the delimiter is still the default comma (",").
Please let me know if you have any idea how it can be fixed. Thank you!
The code:
import scrapy
from scrapy.item import Item, Field
import urllib.parse

class anmdm(Item):
    nume_beneficiar = Field()

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.anm.ro/sponsorizari/afisare-2016/beneficiari?page=1']

    def parse(self, response):
        doctor = anmdm()
        doctors = []
        for item in response.xpath('//tbody/tr'):
            doctor['nume_beneficiar'] = item.xpath('td[5]//text()').extract_first()
            yield doctor
        next_page = response.xpath("//ul/li[@class='active']/following-sibling::li/a/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            print(next_page)
            yield response.follow(next_page, self.parse)
CSV_DELIMITER needs to be changed in the settings, not passed as a spider argument with -a.
To change a setting on the command line, use -s:
scrapy runspider beneficiari_2016.py -o beneficiari_2016.csv -t csv -s CSV_DELIMITER="\n"
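If you prefer to keep the delimiter with the spider itself rather than on the command line, a minimal sketch (assuming the same spider, and that the CSV_DELIMITER setting works for your Scrapy version as described above) is to override it via custom_settings:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    # per-spider setting overrides, applied on top of the project settings
    custom_settings = {
        'CSV_DELIMITER': '\n',
    }

You can then export with plain scrapy runspider beneficiari_2016.py -o beneficiari_2016.csv, without the -s flag.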
When I put
scrapy runspider divar.py -o data.json
in the terminal, I get an empty file. Am I doing something wrong here? I want to get the categories and subcategories from the URL in start_urls, append them to a result list, print it, and also get a JSON file (mostly the JSON file).
import scrapy

class ws(scrapy.Spider):
    name = 'wsDivar'
    result = []
    start_urls = ["https://divar.ir/s/tehran"]

    def parse(self, response):
        for category in response.xpath("//*/ul[@class='kt-accordion-item__header']"):
            x = {'cats': category.xpath("//*/ul[@class='kt-accordion-item__header']/a").extract_first()}
            result.append(x)
            yield(x)
        print(result)
        next_L = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_L is not None:
            next_link = response.urljoin(next_L)
            yield scrapy.Request(url=next_link, callback=self.parse)
import scrapy

class ws(scrapy.Spider):
    name = 'wsDivar'
    start_urls = ["https://divar.ir/s/tehran"]

    def parse(self, response):
        for category in response.xpath("//*/ul[@class='kt-accordion-item__header']"):
            x = {'cats': category.xpath("//*/ul[@class='kt-accordion-item__header']/a").extract_first()}
            yield(x)
        next_L = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_L is not None:
            next_link = response.urljoin(next_L)
            yield scrapy.Request(url=next_link, callback=self.parse)
If your XPath works fine, the yielded items will also be printed in the console log.
Instead of this:
scrapy runspider divar.py -o data.json
use this:
scrapy crawl wsDivar -o data.json
Also, run the command in the project directory, which is supposed to contain the scrapy.cfg file.
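If you still want to collect every scraped item into a list, as the original code tries to do with result, a minimal sketch (same spider, with the list kept as an instance attribute so methods can reach it through self) would be:

import scrapy

class ws(scrapy.Spider):
    name = 'wsDivar'
    start_urls = ["https://divar.ir/s/tehran"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # collect items on the instance instead of a bare module-level name
        self.result = []

    def parse(self, response):
        for category in response.xpath("//*/ul[@class='kt-accordion-item__header']"):
            x = {'cats': category.xpath("//*/ul[@class='kt-accordion-item__header']/a").extract_first()}
            self.result.append(x)
            yield x
        # the spider's built-in logger is a cleaner alternative to print()
        self.logger.info("collected %d categories so far", len(self.result))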
Hi, I'm trying to pass a list of arguments to a Scrapy spider command. I am able to run it for one argument, but not for a list of arguments. Please help. Here is what I tried.
# -*- coding: utf-8 -*-
import scrapy
import json

class AirbnbweatherSpider(scrapy.Spider):
    name = 'airbnbweather'
    allowed_domains = ['www.wunderground.com']

    def __init__(self, geocode):
        self.geocode = geocode.split(',')
        pass

    def start_requests(self):
        yield scrapy.Request(url="https://api.weather.com/v3/wx/forecast/daily/10day?apiKey=6532d6454b8aa370768e63d6ba5a832e&geocode={0}{1}{2}&units=e&language=en-US&format=json".format(self.geocode[0], "%2C", self.geocode[1]))

    def parse(self, response):
        resuturant = json.loads(response.body)
        yield {
            'temperatureMax': resuturant.get('temperatureMax'),
            'temperatureMin': resuturant.get('temperatureMin'),
            'validTimeLocal': resuturant.get('validTimeLocal'),
        }
I'm able to run it using this command
scrapy crawl airbnbweather -o BOSTON.json -a geocode="42.361","-71.057"
It's working fine, but how can I iterate over a list of geocodes?
list = [("42.361","-71.057"),("29.384","-94.903"),("30.384", "-84.903")]
You can only use strings as spider arguments (https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments), so you should pass the list as a string and do the parsing in your code.
The following seems to do the trick:
import scrapy
import json
import ast

class AirbnbweatherSpider(scrapy.Spider):
    name = 'airbnbweather'
    allowed_domains = ['www.wunderground.com']

    def __init__(self, geocode, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.geocodes = ast.literal_eval(geocode)

    def start_requests(self):
        for geocode in self.geocodes:
            yield scrapy.Request(
                url="https://api.weather.com/v3/wx/forecast/daily/10day?apiKey=6532d6454b8aa370768e63d6ba5a832e&geocode={0}{1}{2}&units=e&language=en-US&format=json".format(geocode[0], "%2C", geocode[1]))
You can then run the crawler like this:
scrapy crawl airbnbweather -o BOSTON.json -a geocode='[("42.361","-71.057"),("29.384","-94.903"),("30.384", "-84.903")]'
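If you would rather avoid ast.literal_eval, another option is to pass a plain delimited string and split it yourself. A rough sketch (the semicolon-separated format and this exact __init__ are my own convention, not from the original code):

# run as: scrapy crawl airbnbweather -o BOSTON.json -a geocode="42.361,-71.057;29.384,-94.903;30.384,-84.903"
import scrapy

class AirbnbweatherSpider(scrapy.Spider):
    name = 'airbnbweather'

    def __init__(self, geocode, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # split on ';' for pairs, then on ',' for latitude/longitude
        self.geocodes = [tuple(pair.split(',')) for pair in geocode.split(';')]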
I'm brand new to Python so I apologize if there's a dumb mistake here...I've been scouring the web for days, looking at similar issues and combing through Scrapy docs and nothing seems to really resolve this for me...
I have a Scrapy project which successfully scrapes the source website, returns the required items, and then uses an ImagePipeline to download (and then rename accordingly) the images from the returned image links... but only when I run from the terminal with "runspider".
Whenever I use "crawl" from the terminal or CrawlerProcess to run the spider from within the script, it returns the items but does not download the images and, I assume, completely misses the ImagePipeline.
I read that I needed to import my settings when running this way in order to properly load the pipeline, which makes sense after looking into the differences between "crawl" and "runspider" but I still cannot get the pipeline working.
There are no error messages, but I notice that it does return "[scrapy.middleware] INFO: Enabled item pipelines: []", which I assume shows that it is still missing my pipeline?
Here's my spider.py:
import scrapy
from scrapy2.items import Scrapy2Item
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):
        items = Scrapy2Item()
        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()
        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar
        yield items

process = CrawlerProcess(get_project_settings())
process.crawl(spider1)
process.start()
Here is my items.py:
import scrapy

class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()
Here is my pipelines.py:
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']
Here is my settings.py:
BOT_NAME = 'scrapy2'

SPIDER_MODULES = ['scrapy2.spiders']
NEWSPIDER_MODULE = 'scrapy2.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy2.pipelines.Scrapy2Pipeline': 1,
}

IMAGES_STORE = 'images'
Thank you to anybody that looks at this or even attempts to help me out. It's greatly appreciated.
Since you are running your spider as a script, there is no Scrapy project environment, so get_project_settings won't work (aside from grabbing the default settings).
The script must be self-contained, i.e. contain everything you need to run your spider (or import it from your python search path, like any regular old python code).
I've reformatted that code for you, so that it runs, when you execute it with the plain python interpreter: python3 script.py.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'scrapy2'
ROBOTSTXT_OBEY = True
IMAGES_STORE = 'images'

class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()

class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):
        items = Scrapy2Item()
        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()
        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar
        yield items

if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    settings = Settings(values={
        'BOT_NAME': BOT_NAME,
        'ROBOTSTXT_OBEY': ROBOTSTXT_OBEY,
        'ITEM_PIPELINES': {
            '__main__.Scrapy2Pipeline': 1,
        },
        'IMAGES_STORE': IMAGES_STORE,
        'TELNETCONSOLE_ENABLED': False,
    })

    process = CrawlerProcess(settings=settings)
    process.crawl(spider1)
    process.start()
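Alternatively, if you want to keep your existing project layout and pipeline files, one option (assuming your settings module really is scrapy2.settings and is importable from where you run the script) is to point Scrapy at the project settings through the SCRAPY_SETTINGS_MODULE environment variable before calling get_project_settings:

import os

# must be set before get_project_settings() is called
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy2.settings')

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('spider1')  # spider located via SPIDER_MODULES from the project settings
process.start()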
I'm very new to Scrapy, so it's hard for me to figure out why I'm getting no results in the CSV file. I can see results in the console, though. Here is what I tried:
Main folder is named "realyp".
Spider file is named "yp.py" and the code:
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from realyp.items import RealypItem

class MySpider(BaseSpider):
    name = "YellowPage"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=2"]

    def parse(self, response):
        title = Selector(response)
        page = title.xpath('//div[@class="info"]')
        items = []
        for titles in page:
            item = RealypItem()
            item["name"] = titles.xpath('.//span[@itemprop="name"]/text()').extract()
            item["address"] = titles.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()').extract()
            item["phone"] = titles.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()').extract()
            items.append(item)
        return items
"items.py" file includes:
from scrapy.item import Item, Field

class RealypItem(Item):
    name = Field()
    address = Field()
    phone = Field()
To get the csv output my command line is:
cd desktop
cd realyp
scrapy crawl YellowPage -o items.csv -t csv
Any help will be greatly appreciated.
As stated by @Granitosauros, you should use yield instead of return. The yield should be inside the for loop.
Inside the for loop, if an XPath starts with // then it selects all elements in the document that fulfill the criteria, not just those under the current node (see here).
Here's some (rough) code that works for me:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from realyp.items import RealypItem

class MySpider(BaseSpider):
    name = "YellowPage"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=2"]

    def parse(self, response):
        for titles in response.xpath('//div[@class="result"]/div'):
            item = RealypItem()
            item["name"] = titles.xpath('div[2]/div[2]/h2/a/span[@itemprop="name"]/text()').extract()
            item["address"] = titles.xpath('string(div[2]/div[2]/div/p[@itemprop="address"])').extract()
            item["phone"] = titles.xpath('div[2]/div[2]/div/div[@itemprop="telephone" and @class="phones phone primary"]/text()').extract()
            yield item
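To illustrate the // point above with the selectors from the original spider: inside a loop, an absolute // path searches the whole document, while a path starting with . searches only below the current node.

for row in response.xpath('//div[@class="info"]'):
    # '//span[...]' matches against the entire document, so every row returns the same first name
    wrong = row.xpath('//span[@itemprop="name"]/text()').extract_first()
    # './/span[...]' is relative to the current row, which is usually what you want
    right = row.xpath('.//span[@itemprop="name"]/text()').extract_first()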
I have written a crawler in Scrapy, but I want to initiate the crawling by using a main method.
import sys, getopt
import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request
import re

class TutsplusItem(scrapy.Item):
    title = scrapy.Field()

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/"]

    def __init__(self, *args):
        try:
            opts, args = getopt.getopt(args, "hi:o:", ["ifile=", "ofile="])
        except getopt.GetoptError:
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit(2)
        super(MySpider, self).__init__(*args)

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We stored already crawled links in this list
        crawledLinks = []
        # Pattern to check proper link
        # I only want to get the tutorial posts
        # linkPattern = re.compile("^\/tutorials\?page=\d+")
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            # if linkPattern.match(link) and not link in crawledLinks:
            if not link in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        count = 0
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item
Instead of using scrapy runspider Crawler.py arg1 arg2
I would like to have a separate class with a main function and initiate Scrapy from there. How do I do this?
There are different ways to approach this, but I suggest the following:
Have a main.py file in the same directory that opens a new process and launches the spider with the parameters you need.
The main.py file would have something like the following:
import subprocess
scrapy_command = 'scrapy runspider {spider_name} -a param_1="{param_1}"'.format(spider_name='your_spider', param_1='your_value')
process = subprocess.Popen(scrapy_command, shell=True)
With this code, you just need to call your main file.
python main.py
Hope it helps.
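Another option, if you prefer to stay inside a single Python process rather than shelling out, is Scrapy's CrawlerProcess API. A minimal sketch (the main() wrapper is illustrative, and I'm assuming the spider above lives in Crawler.py and is importable):

from scrapy.crawler import CrawlerProcess
from Crawler import MySpider  # assumption: the spider module is importable as Crawler

def main():
    process = CrawlerProcess(settings={'TELNETCONSOLE_ENABLED': False})
    # extra keyword arguments to crawl() are forwarded to the spider's __init__
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes

if __name__ == "__main__":
    main()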