Python + Scrapy: Issues running "ImagesPipeline" when running crawler from script

I'm brand new to Python, so I apologize if there's a dumb mistake here. I've been scouring the web for days, looking at similar issues and combing through the Scrapy docs, and nothing seems to resolve this for me.
I have a Scrapy project which successfully scrapes the source website, returns the required items, and then uses an ImagesPipeline to download (and rename accordingly) the images from the returned image links... but only when I run it from the terminal with "runspider".
Whenever I use "crawl" from the terminal, or CrawlerProcess to run the spider from within the script, it returns the items but does not download the images and, I assume, completely misses the ImagesPipeline.
I read that I needed to import my settings when running this way in order to properly load the pipeline, which makes sense after looking into the differences between "crawl" and "runspider", but I still cannot get the pipeline working.
There are no error messages, but I notice that it returns "[scrapy.middleware] INFO: Enabled item pipelines: []", which I assume means it is still missing my pipeline.
Here's my spider.py:
import scrapy
from scrapy2.items import Scrapy2Item
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):
        items = Scrapy2Item()
        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()
        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar
        yield items


process = CrawlerProcess(get_project_settings())
process.crawl(spider1)
process.start()
Here is my items.py:
import scrapy


class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()
Here is my pipelines.py:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']
Here is my settings.py:
BOT_NAME = 'scrapy2'
SPIDER_MODULES = ['scrapy2.spiders']
NEWSPIDER_MODULE = 'scrapy2.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'scrapy2.pipelines.Scrapy2Pipeline': 1,
}
IMAGES_STORE = 'images'
Thank you to anybody that looks at this or even attempts to help me out. It's greatly appreciated.

Since you are running your spider as a script, there is no Scrapy project environment, so get_project_settings won't do anything beyond grabbing the default settings.
The script must be self-contained, i.e. contain everything needed to run your spider (or import it from your Python search path, like any regular old Python code).
I've reformatted that code for you so that it runs when you execute it with the plain Python interpreter: python3 script.py.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'scrapy2'
ROBOTSTXT_OBEY = True
IMAGES_STORE = 'images'


class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()


class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']


class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):
        items = Scrapy2Item()
        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()
        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar
        yield items


if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    settings = Settings(values={
        'BOT_NAME': BOT_NAME,
        'ROBOTSTXT_OBEY': ROBOTSTXT_OBEY,
        'ITEM_PIPELINES': {
            '__main__.Scrapy2Pipeline': 1,
        },
        'IMAGES_STORE': IMAGES_STORE,
        'TELNETCONSOLE_ENABLED': False,
    })

    process = CrawlerProcess(settings=settings)
    process.crawl(spider1)
    process.start()
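If you would rather keep the project layout (settings.py, pipelines.py and so on), the alternative is to make sure Scrapy can actually locate those project settings before the CrawlerProcess is built. A minimal sketch, assuming the launcher script sits next to scrapy.cfg and the scrapy2 package is importable:
#!/usr/bin/env python3
# launcher sketch -- not part of the original answer
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Point Scrapy at the project's settings module explicitly, so that
# get_project_settings() returns scrapy2.settings (with ITEM_PIPELINES
# and IMAGES_STORE) instead of the bare defaults.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy2.settings')

process = CrawlerProcess(get_project_settings())
process.crawl('spider1')  # spiders can be referenced by name once SPIDER_MODULES is known
process.start()
Running such a script from the project root (the folder containing scrapy.cfg) usually has the same effect, because get_project_settings falls back to reading scrapy.cfg to find the settings module.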

Related

How to run multiple spiders through individual pipelines?

Total noob, just getting started with Scrapy.
My directory structure looks like this...
# FYI: running on Scrapy 2.4.1
WebScraper/
    Webscraper/
        spiders/
            spider.py    # (NOTE: contains spider1 and spider2 classes.)
        items.py
        middlewares.py
        pipelines.py     # (NOTE: contains spider1Pipeline and spider2Pipeline)
        settings.py      # (NOTE: I wrote here:
                         #     ITEM_PIPELINES = {
                         #         'WebScraper.pipelines.spider1_pipelines': 300,
                         #         'WebScraper.pipelines.spider2_pipelines': 300,
                         #     }
                         # )
    scrapy.cfg
And spider2.py resembles...
class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["url1.com",]
        yield scrapy.Request(
            url="http://url1.com",
            callback=self.parse
        )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff


class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["url2.com",]
        yield scrapy.Request(
            url="http://url2.com",
            callback=self.parse
        )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
With pipelines.py looking like...
import csv


class spider1_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header1'])
        row.append(item['header2'])
        self.csvwriter.writerow(row)


class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header_a'])  # NOTE: this is not the same as header1
        row.append(item['header_b'])  # NOTE: this is not the same as header2
        self.csvwriter.writerow(row)
I have a question about running spider1 and spider2 on different urls with one terminal command:
nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log
Note: this is an extension of a previous question specific to this stack overflow post (2018).
Desired result: spider1.csv with data from spider1, spider2.csv with data from spider2.
Current result: spider1.csv with data from spider1; spider2.csv breaks, but the error log contains spider2's data and a KeyError: 'header1', even though spider2's item does not include header1, only header_a.
Does anyone know how to run one spider after the other on different URLs, and plug the data fetched by spider1, spider2, etc. into pipelines specific to that spider, i.e. spider1 -> spider1Pipeline -> spider1.csv, spider2 -> spider2Pipeline -> spider2.csv?
Or perhaps this is a matter of specifying the spider1_item and spider2_item from items.py? I wonder if I can specify where to insert spider2's data that way.
Thank you!
You can implement this using the custom_settings spider attribute to set settings individually per spider:
# spider2.py
class OneSpider(scrapy.Spider):
    name = "spider1"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300}
    }
    ...


class TwoSpider(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300}
    }
    ...
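With the pipelines wired up per spider like this, you can also run the two spiders one after the other from a single script instead of two crawl commands, which is what the question asks about. A sketch based on the sequential-crawl example from the Scrapy docs; the import path for the spider classes is an assumption taken from the directory tree above:
# run_spiders.py -- sketch only, assumed to live next to scrapy.cfg
from twisted.internet import reactor, defer

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from WebScraper.spiders.spider import OneSpider, TwoSpider

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # Each spider's custom_settings selects its own ITEM_PIPELINES,
    # so spider1's items go to spider1.csv and spider2's to spider2.csv.
    yield runner.crawl(OneSpider)
    yield runner.crawl(TwoSpider)
    reactor.stop()

crawl()
reactor.run()  # blocks until both crawls have finished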

Can't download images with Scrapy

I can't download images. I have several problems (I have tried many variations). This is my code (I guess it has many errors).
The goal is to crawl the start URL, save all the product images, and rename them by SKU number. The spider also has to click the "next" button to repeat the task on every page (there are around 24,000 products).
The problems that I noticed are:
I don't know the exact configuration for the item pipelines.
The images don't download to the folder set in settings.py.
I want to filter images by resolution and use thumbnails. What is the recommended configuration?
The images are located on another server. Is this a problem?
SETTINGS.PY
BOT_NAME = 'soarimages'
SPIDER_MODULES = ['soarimages.spiders']
NEWSPIDER_MODULE = 'soarimages.spiders'
DEFAULT_ITEM_CLASS = 'soarimages.items'
ITEM_PIPELINES = {'soarimages.pipelines.soarimagesPipeline': 1}
IMAGES_STORE = '/soarimages/images'
ITEMS.PY
import scrapy
class soarimagesItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
PIPELINES.PY
import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline


class soarimagesPipeline(ImagesPipeline):
    def set_filename(self, response):
        # add a regex here to check the title is valid for a filename.
        return 'full/{0}.jpg'.format(response.meta['title'][0])

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'title': item['title']})

    def get_images(self, response, request, info):
        for key, image, buf in super(soarimagesPipeline, self).get_images(response, request, info):
            key = self.set_filename(response)
            yield key, image, buf
Productphotos.PY (Spider)
# import the necessary packages
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from soarimages.items import soarimagesItem


class soarimagesSpider(scrapy.Spider):
    name = 'productphotos'
    allowed_domains = ['http://sodimac.com.ar','http://sodimacar.scene7.com']
    start_urls = ['http://www.sodimac.com.ar/sodimac-ar/search/']
    rules = [Rule(LinkExtractor(allow=['http://sodimacar.scene7.com/is/image//SodimacArgentina/.*']), 'parse')]

    def parse(self, response):
        SECTION_SELECTOR = '.one-prod'
        for soarimages in response.css(SECTION_SELECTOR):
            image = soarimagesItem()
            image['title'] = response.xpath('.//p[@class="sku"]/text()').re_first(r'SKU:\s*(.*)').strip(),
            rel = response.xpath('//div/a/img/@data-original').extract_first()
            image['image_urls'] = ['http:'+rel[0]]
            yield image
        NEXT_PAGE_SELECTOR = 'a.next ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
This is my code (I guess it has many errors)
In fact, I could spot at least one error: allowed_domains should list only the domains. You must not include any http:// prefix:
allowed_domains = ['sodimac.com.ar', 'sodimacar.scene7.com']
You may want to fix this and test your spider. If new questions arise, create a specific question for each specific problem; this makes it easier to help you. See also: how to ask.
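For the resolution and thumbnail part of the question, which the answer above does not touch: ImagesPipeline has built-in settings for size filtering and thumbnail generation, so no custom code should be needed for that. A minimal sketch of the relevant settings.py entries (the sizes are example values, not recommendations):
# settings.py additions -- example values only
IMAGES_STORE = '/soarimages/images'

# Skip images smaller than this (an image below either dimension is dropped).
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

# Generate thumbnails next to the full-size images,
# stored under thumbs/small/ and thumbs/big/ inside IMAGES_STORE.
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
Images hosted on another server are not a problem in themselves; the pipeline only needs to be able to reach those URLs.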

Scrapy generate csv file (UTF-8)

I'm trying to generate a CSV file with the results of the crawler. Because the content is German, I need it to be UTF-8 encoded (ä, ö, etc.). This is my result so far:
spider.py
import scrapy
from scrapy.spiders import BaseSpider
from scrapy.selector import Selector
from Polizeimeldungen.items import PolizeimeldungenItem


class PoliceSpider(scrapy.Spider):
    name = "pm"
    allowed_domains = ["berlin.de"]
    start_urls = ["https://www.berlin.de/polizei/polizeimeldungen/archiv/2014/?page_at_1_0=1"]

    def parse(self, response):
        for sel in response.css('.row-fluid'):
            item = PolizeimeldungenItem()
            item['title'] = sel.css('a ::text').extract_first().encode('utf-8')
            item['link'] = sel.css('a ::text').extract_first().encode('utf-8')  # this is wrong, but it is easy to fix
            yield item
items.py
import scrapy
class PolizeimeldungenItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
pipelines.py
import csv
class PolizeimeldungenPipeline(object):
    def __init__(self):
        self.myCsv = csv.writer(open('Item.csv', 'wb'))
        self.myCsv.writerow(['title', 'link'])

    def process_item(self, item, spider):
        self.myCsv.writerow([item['title'], item['link']])
        return item
Settings.py
BOT_NAME = 'Polizeimeldungen'
SPIDER_MODULES = ['Polizeimeldungen.spiders']
NEWSPIDER_MODULE = 'Polizeimeldungen.spiders'
ITEM_PIPELINES = {'Polizeimeldungen.pipelines.PolizeimeldungenPipeline': 100}
As the result, after running:
scrapy crawl pm
I get this error message:
TypeError: a bytes-like object is required, not 'str'
Thanks for your help!!
UPDATE: Python 3.6.0 :: Anaconda 4.3.1
I assume that you are using Python 3 (this solution won't work with Python 2).
You need to change two things:
Open the output file in text mode, with the desired output encoding.
In the PolizeimeldungenPipeline's constructor, write:
self.myCsv = csv.writer(open('Item.csv', 'w', encoding='utf-8'))
Don't encode the cells (as in PoliceSpider.parse):
item['title'] = sel.css('a ::text').extract_first()
etc.
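Putting both changes together, the pipeline would look roughly like this (a sketch of the corrected file, not a drop-in from the original answer):
# pipelines.py -- corrected sketch
import csv

class PolizeimeldungenPipeline(object):
    def __init__(self):
        # Text mode with an explicit encoding: csv writes str, not bytes, in Python 3.
        # newline='' avoids blank lines on Windows; it is an extra nicety, not part of the answer above.
        self.myCsv = csv.writer(open('Item.csv', 'w', encoding='utf-8', newline=''))
        self.myCsv.writerow(['title', 'link'])

    def process_item(self, item, spider):
        # Cells go in as plain str; the .encode('utf-8') calls in the spider are removed.
        self.myCsv.writerow([item['title'], item['link']])
        return item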

Scrapy put two spiders in single file

I have written two spiders in a single file. When I run scrapy runspider two_spiders.py, only the first spider is executed. How can I run both of them without splitting the file into two files?
two_spiders.py:
import scrapy
class MySpider1(scrapy.Spider):
    # first spider definition
    ...


class MySpider2(scrapy.Spider):
    # second spider definition
    ...
Let's read the documentation:
Running multiple spiders in the same process
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
(There are a few more examples in the documentation.)
From your question it is not clear how you have put the two spiders into one file; simply concatenating the contents of two single-spider files is not enough.
Try to do what is written in the documentation, or at least show us your code. Without it we can't help you.
Here is a full Scrapy project with 2 spiders in one file.
# quote_spiders.py
import json
import string

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field


class TextCleaningPipeline(object):
    def _clean_text(self, text):
        text = text.replace('“', '').replace('”', '')
        table = str.maketrans({key: None for key in string.punctuation})
        clean_text = text.translate(table)
        return clean_text.lower()

    def process_item(self, item, spider):
        item['text'] = self._clean_text(item['text'])
        return item


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open(spider.settings['JSON_FILE'], 'a')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


class QuoteItem(Item):
    text = Field()
    author = Field()
    tags = Field()
    spider = Field()


class QuotesSpiderOne(scrapy.Spider):
    name = "quotes1"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            item['spider'] = self.name
            yield item


class QuotesSpiderTwo(scrapy.Spider):
    name = "quotes2"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/2/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            item['spider'] = self.name
            yield item


if __name__ == '__main__':
    settings = dict()
    settings['USER_AGENT'] = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    settings['HTTPCACHE_ENABLED'] = True
    settings['JSON_FILE'] = 'items.jl'
    settings['ITEM_PIPELINES'] = dict()
    settings['ITEM_PIPELINES']['__main__.TextCleaningPipeline'] = 800
    settings['ITEM_PIPELINES']['__main__.JsonWriterPipeline'] = 801

    process = CrawlerProcess(settings=settings)
    process.crawl(QuotesSpiderOne)
    process.crawl(QuotesSpiderTwo)
    process.start()
Install Scrapy and run the script
$ pip install Scrapy
$ python quote_spiders.py
No other file is needed.
This example, combined with the graphical debugger of PyCharm or VS Code, can help you understand the Scrapy workflow and makes debugging easier.

Scrapy spider results can't be pipelined into database when ran from a script [duplicate]

This question already has answers here:
Getting scrapy project settings when script is outside of root directory
(6 answers)
Closed 9 months ago.
I've written a Scrapy spider that I am trying to run from a Python script located in another directory. The code I'm using from the docs seems to run the spider, but when I check the PostgreSQL table, it hasn't been created. The spider only pipelines the scraped data properly if I use the scrapy crawl command. I've tried placing the script in the directory right above the Scrapy project and also in the same directory as the config file, and neither seems to create the table.
The code for the script is below, followed by the code for the spider. I think the problem involves the directory in which the script should be placed and/or the code that I use within the spider file to enable the spider to be run from a script, but I'm not sure. Does it look like there is a problem with the function that is being called in the script, or is there something that needs to be changed within the settings file? I can provide the code for the pipelines file if necessary. Thanks.
Script file (only 3 lines)
from ticket_city_scraper import *
from ticket_city_scraper.spiders import tc_spider
tc_spider.spiderCrawl()
Spider file
import scrapy
import re
import json
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider , Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from ticket_city_scraper.items import ComparatorItem
from urlparse import urljoin
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging
bandname = raw_input("Enter bandname\n")
tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"
class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]
    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'

    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"
        self.start_urls = [tc_url]
        #return tc_url

    tickets_list_xpath = './/div[@class = "vevent"]'

    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract()
        eventCity = ''.join(event_City)
        loader.add_value('eventCity', eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract()
        eventState = ''.join(event_State)
        loader.add_value('eventState', eventState)
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract()
        eventDate = ''.join(event_Date)
        loader.add_value('eventDate', eventDate)
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list = re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id = "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback=self.parse_json, dont_filter=True)

    def parse(self, response):
        """
        # """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName', './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation', './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink', './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime', '//div[@id="divEventDate"]/@title')  # datetime type
            #loader.add_xpath('eventTime', './/*[@class = "productionsTime"]/text()')
            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback=self.parse_price, dont_filter=True)


def spiderCrawl():
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider3)
    process.start()
It's because your settings object only contains a user agent. Your project settings are what determine which pipeline gets run. From the Scrapy docs:
You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.
More info here: http://doc.scrapy.org/en/latest/topics/practices.html
Read more than the first example.
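A minimal sketch of what that looks like here, assuming the launcher script sits next to scrapy.cfg so that the project settings (and therefore your pipeline) are picked up:
# script.py -- sketch only
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ticket_city_scraper.spiders.tc_spider import MySpider3

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider3)  # or process.crawl('comparator') to reference the spider by name
process.start()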
The settings are not being read if you run from a parent folder.
This answer helped me:
Getting scrapy project settings when script is outside of root directory
