I have seen similar questions here, but I don't understand them yet.
Actually, with the code below I do what I need, except renaming the image, so I tried changing the name in the items.py file; please check the comments inside.
settings.py
SPIDER_MODULES = ['xxx.spiders']
NEWSPIDER_MODULE = 'xxx.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/home/magicnt/xxx/images'
items.py
class XxxItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = scrapy.Field()
image_urls = scrapy.Field()
    #images = scrapy.Field()  # <-- with this line the images are saved under the default names
    images = title  # <-- I tried to rename here, but it does not work
spider.py
from xxx.items import XxxItem
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
class CoverSpider(scrapy.Spider):
name = "pyimagesearch-cover-spider"
start_urls = ['https://xxx.com.br/product']
def parse(self, response):
for bimb in response.css('#mod_imoveis_result'):
            imageURL = bimb.xpath('./div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src').extract_first()
title = bimb.css('#titulo_imovel::text').extract_first()
yield {
'image_urls' : [response.urljoin(imageURL)],
'title' : title
}
        next_page = response.xpath('//a[contains(@class, "num_pages") and contains(@class, "pg_number_next")]/@href').extract_first()
yield response.follow(next_page, self.parse)
My goal is to rename the downloaded images with the title from the item. Any tips toward this goal are welcome.
I'm totally new to Python and OO; I usually scrape with procedural PHP, but I can see how good Scrapy can be, so I ask for a little patience and help.
My code is based on Scrapy Image Pipeline: How to rename images? I tested it a week ago and it works on my own spiders.
# This pipeline is designed for an item with multiple images
class ImagesWithNamesPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
        # values in the "image_names" field must have the ".jpg" suffix
        # you can change "image_names" to your own image-name field (e.g. "images"),
# however it should be a list
for (image_url, image_name) in zip(item[self.IMAGES_URLS_FIELD], item["image_names"]):
yield scrapy.Request(url=image_url, meta={"image_name": image_name})
def file_path(self, request, response=None, info=None):
image_name = request.meta["image_name"]
return image_name
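To wire this into the question's project, here is a rough sketch (the pipelines module path xxx.pipelines and the ".jpg" suffix are assumptions, not something from the original code):
# settings.py -- point ITEM_PIPELINES at the custom pipeline instead of the stock one
ITEM_PIPELINES = {'xxx.pipelines.ImagesWithNamesPipeline': 1}
# items.py -- add the extra field the pipeline reads
image_names = scrapy.Field()
# spider.py -- build one file name per URL, e.g. from the title
yield {
    'title': title,
    'image_urls': [response.urljoin(imageURL)],
    'image_names': [title + '.jpg'],
}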
Here is how the ImagePipeline works:
The pipeline will execute image_downloaded -> get_images -> file_path in order. ("->" means invokes)
image_downloaded: save images that get_images return by invoking persist_file
get_images: convert images to JPEG
file_path: return the relative path of image
I scanned through the source code of ImagesPipeline and found no special field for renaming an image. Scrapy will name the file this way:
def file_path(self, request, response=None, info=None):
image_guid = hashlib.sha1(to_bytes(url)).hexdigest() # change to request.url after deprecation
return 'full/%s.jpg' % (image_guid)
Therefore we should override the file_path method. According to the source code of FilesPipeline, which ImagesPipeline inherits from, we only need to return a relative path, and persist_file will take care of the rest.
I'm brand new to Python, so I apologize if there's a dumb mistake here... I've been scouring the web for days, looking at similar issues and combing through the Scrapy docs, and nothing seems to really resolve this for me...
I have a Scrapy project which successfully scrapes the source website, returns the required items, and then uses an ImagePipeline to download (and then rename accordingly) the images from the returned image links... but only when I run from the terminal with "runspider".
Whenever I use "crawl" from the terminal, or CrawlerProcess to run the spider from within the script, it returns the items but does not download the images and, I assume, completely misses the ImagePipeline.
I read that I needed to import my settings when running this way in order to properly load the pipeline, which makes sense after looking into the differences between "crawl" and "runspider" but I still cannot get the pipeline working.
There are no error messages, but I notice that it returns "[scrapy.middleware] INFO: Enabled item pipelines: []" ... which I assume shows that it is still missing my pipeline?
Here's my spider.py:
import scrapy
from scrapy2.items import Scrapy2Item
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
class spider1(scrapy.Spider):
name = "spider1"
domain = "https://www.amazon.ca/s?k=821826022317"
def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)
def parse(self, response):
items = Scrapy2Item()
titlevar = response.css('span.a-text-normal ::text').extract_first()
imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()
items['title'] = titlevar
items['image_urls'] = imgvar
items['sku'] = skuvar
yield items
process = CrawlerProcess(get_project_settings())
process.crawl(spider1)
process.start()
Here is my items.py:
import scrapy
class Scrapy2Item(scrapy.Item):
title = scrapy.Field()
image_urls = scrapy.Field()
sku = scrapy.Field()
Here is my pipelines.py:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
class Scrapy2Pipeline(ImagesPipeline):
def get_media_requests(self, item, info):
return [scrapy.Request(x, meta={'image_name': item['sku']})
for x in item.get('image_urls', [])]
def file_path(self, request, response=None, info=None):
return '%s.jpg' % request.meta['image_name']
Here is my settings.py:
BOT_NAME = 'scrapy2'
SPIDER_MODULES = ['scrapy2.spiders']
NEWSPIDER_MODULE = 'scrapy2.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
'scrapy2.pipelines.Scrapy2Pipeline': 1,
}
IMAGES_STORE = 'images'
Thank you to anybody that looks at this or even attempts to help me out. It's greatly appreciated.
Since you are running your spider as a script, there is no Scrapy project environment, so get_project_settings won't work (aside from grabbing the default settings).
The script must be self-contained, i.e. contain everything you need to run your spider (or import it from your python search path, like any regular old python code).
I've reformatted that code for you, so that it runs, when you execute it with the plain python interpreter: python3 script.py.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline
BOT_NAME = 'scrapy2'
ROBOTSTXT_OBEY = True
IMAGES_STORE = 'images'
class Scrapy2Item(scrapy.Item):
title = scrapy.Field()
image_urls = scrapy.Field()
sku = scrapy.Field()
class Scrapy2Pipeline(ImagesPipeline):
def get_media_requests(self, item, info):
return [scrapy.Request(x, meta={'image_name': item['sku']})
for x in item.get('image_urls', [])]
def file_path(self, request, response=None, info=None):
return '%s.jpg' % request.meta['image_name']
class spider1(scrapy.Spider):
name = "spider1"
domain = "https://www.amazon.ca/s?k=821826022317"
def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)
def parse(self, response):
items = Scrapy2Item()
titlevar = response.css('span.a-text-normal ::text').extract_first()
imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()
items['title'] = titlevar
items['image_urls'] = imgvar
items['sku'] = skuvar
yield items
if __name__ == "__main__":
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
settings = Settings(values={
'BOT_NAME': BOT_NAME,
'ROBOTSTXT_OBEY': ROBOTSTXT_OBEY,
'ITEM_PIPELINES': {
'__main__.Scrapy2Pipeline': 1,
},
'IMAGES_STORE': IMAGES_STORE,
'TELNETCONSOLE_ENABLED': False,
})
process = CrawlerProcess(settings=settings)
process.crawl(spider1)
process.start()
I can't download images. I have several problems (I have tried so many variations). This is my code (I guess it has many errors).
The goal is to crawl the start URL, save all the product images, and rename them by SKU number. Also, the spider has to follow the "next" button to do the same task on all the pages (there are around 24,000 products).
The problems that I noticed are:
I don't know the exact configuration for the item pipelines
The images don't download to the folder set in settings.py
I want to filter images by resolution and use thumbnails. Which configuration is recommended?
The images are located on another server. Is this a problem?
SETTINGS.PY
BOT_NAME = 'soarimages'
SPIDER_MODULES = ['soarimages.spiders']
NEWSPIDER_MODULE = 'soarimages.spiders'
DEFAULT_ITEM_CLASS = 'soarimages.items'
ITEM_PIPELINES = {'soarimages.pipelines.soarimagesPipeline': 1}
IMAGES_STORE = '/soarimages/images'
ITEMS.PY
import scrapy
class soarimagesItem(scrapy.Item):
title = scrapy.Field()
image_urls = scrapy.Field()
images = scrapy.Field()
PIPELINES.PY
import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
class soarimagesPipeline(ImagesPipeline):
def set_filename(self, response):
#add a regex here to check the title is valid for a filename.
return 'full/{0}.jpg'.format(response.meta['title'][0])
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
yield scrapy.Request(image_url, meta={'title': item['title']})
def get_images(self, response, request, info):
for key, image, buf in super(soarimagesPipeline, self).get_images(response, request, info):
key = self.set_filename(response)
yield key, image, buf
Productphotos.PY (Spider)
# import the necessary packages
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from soarimages.items import soarimagesItem
class soarimagesSpider(scrapy.Spider):
name = 'productphotos'
allowed_domains = ['http://sodimac.com.ar','http://sodimacar.scene7.com']
start_urls = ['http://www.sodimac.com.ar/sodimac-ar/search/']
rules = [Rule(LinkExtractor(allow=['http://sodimacar.scene7.com/is/image//SodimacArgentina/.*']), 'parse')]
def parse(self, response):
SECTION_SELECTOR = '.one-prod'
for soarimages in response.css(SECTION_SELECTOR):
image = soarimagesItem()
            image['title'] = response.xpath('.//p[@class="sku"]/text()').re_first(r'SKU:\s*(.*)').strip(),
            rel = response.xpath('//div/a/img/@data-original').extract_first()
image['image_urls'] = ['http:'+rel[0]]
yield image
NEXT_PAGE_SELECTOR = 'a.next ::attr(href)'
next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
if next_page:
yield scrapy.Request(
response.urljoin(next_page),
callback=self.parse
)
This is my code (I guess it has many errors)
In fact I could spot at least one error: allowed_domains should list only the domains. You must not include any http:// prefix:
allowed_domains = ['sodimac.com.ar', 'sodimacar.scene7.com']
You may want to fix this and test your spider. If new questions arise, create specific questions for each specific problem; this makes it easier to help you. See also how to ask.
I wrote code in Scrapy to scrape coffee shops from Yellow Pages. The total amount of data is around 870 records, but I'm getting around 1,200, with a number of duplicates. Moreover, in the CSV output the data get placed in every alternate row. I'm hoping someone can provide me with a workaround. Thanks in advance.
The folder name is "yellpg", and "items.py" contains:
from scrapy.item import Item, Field
class YellpgItem(Item):
name= Field()
address = Field()
phone= Field()
The spider is named "yellsp.py" and contains:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from yellpg.items import YellpgItem
class YellspSpider(CrawlSpider):
name = "yellsp"
allowed_domains = ["yellowpages.com"]
start_urls = (
'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=1',
)
rules = (Rule(LinkExtractor(allow=('\&page=.*',)),callback='parse_item',follow=True),)
def parse_item(self, response):
        page = response.xpath('//div[@class="info"]')
for titles in page:
item = YellpgItem()
item["name"] = titles.xpath('.//span[#itemprop="name"]/text()').extract()
item["address"] = titles.xpath('.//span[#itemprop="streetAddress" and #class="street-address"]/text()').extract()
item["phone"] = titles.xpath('.//div[#itemprop="telephone" and #class="phones phone primary"]/text()').extract()
yield item
To get the CSV output, this is the command line I'm using:
scrapy crawl yellsp -o items.csv
I could recommend creating a pipeline that stores items to later check if the new ones are duplicates, but that isn't a real solution here, as it could create memory problems.
The real solution here is that you should avoid "storing" duplicates in your final database.
Define which field of your item is going to act as the index in your database, and everything should work.
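For completeness, here is a minimal sketch of the in-memory de-duplication pipeline mentioned above (the pipeline and module names are illustrative; it keeps every seen name in memory, which is the trade-off noted earlier):
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    # drop items whose 'name' was already seen during this crawl
    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        # the question's spider stores a list in 'name', so make it hashable first
        name = tuple(item.get('name', []))
        if name in self.seen_names:
            raise DropItem('Duplicate item found: %s' % (name,))
        self.seen_names.add(name)
        return item
It would be enabled in settings.py next to the export pipeline, e.g. 'yellpg.pipelines.DuplicatesPipeline': 200.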
The best way would be to use CsvItemExporter in your pipeline. Create a file named pipelines.py inside your Scrapy project and add the code lines below.
from scrapy import signals
from scrapy.exporters import CsvItemExporter
class CSVExportPipeline(object):
def __init__(self):
self.files = {}
    @classmethod
def from_crawler(cls, crawler):
pipeline = cls()
crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
return pipeline
def spider_opened(self, spider):
        file = open('%s_coffee_shops.csv' % spider.name, 'w+b')  # hard-coded filename, not a good idea
self.files[spider] = file
self.exporter = CsvItemExporter(file)
self.exporter.start_exporting()
def spider_closed(self, spider):
self.exporter.finish_exporting()
file = self.files.pop(spider)
file.close()
def process_item(self, item, spider):
self.exporter.export_item(item)
return item
Now add these lines to settings.py:
ITEM_PIPELINES = {
'your_project_name.pipelines.CSVExportPipeline': 300
}
This custom CSV exporter pipeline will export your data in CSV style. If you are not getting the data as expected, you can modify the process_item method to suit your needs.
I have tried every Google search result and Stack Overflow solution, but I am not able to find a solution.
I am creating a Scrapy spider to extract images; please find the code below.
My items.py
class MyntraItem(scrapy.Item):
product_urls=scrapy.Field()
files=scrapy.Field()
image_urls=scrapy.Field()
images = scrapy.Field()
My settings.py
BOT_NAME = 'hello'
SPIDER_MODULES = ['myntra.spiders']
NEWSPIDER_MODULE = 'myntra.spiders'
FILES_STORE = '/home/swapnil/Desktop/AI/myntra/'
ITEM_PIPELINES = {
#'myntra.pipelines.SomePipeline': 300,
'scrapy.pipelines.images.FilesPipeline': 1,
}
My first.py
class FirstSpider(CrawlSpider):
name = "first"
allowed_domains = ["myntra.com"]
start_urls = [
'http://www.myntra.com/men-sports-tshirts-menu?src=tNav&f=Pattern_article_attr%3Astriped',
]
    rules = [Rule(LinkExtractor(restrict_xpaths=['//*[@class="product-link"]']), callback='parse_lnk', follow=True)]
#rules = [Rule(LinkExtractor(allow=['.*']),callback='parse_lnk',follow=True)]
def parse_lnk(self, response):
item=MyntraItem()
item['product_urls']=response.url
        item['files'] = response.xpath('//*[@class="thumbnails-selected-image"]/@src')
item['image_urls']=item['files']
#print '666666666666666666',item['files']
return item
Please help: My intention is to download the images.
By default, FilesPipeline expects file URLs to be available from the value of an item's "file_urls" key.
(...) if a spider returns a dict with the URLs key ("file_urls" or
"image_urls", for the Files or Images Pipeline respectively), the
pipeline will put the results under respective key ("files" or "images").
It seems you are using "product_urls". To change where the pipeline looks for URLs, you need to set FILES_URLS_FIELD = "product_urls".
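As a sketch, the relevant settings under that suggestion might look like this (note that FilesPipeline actually lives in scrapy.pipelines.files, not scrapy.pipelines.images as in the question's settings):
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/home/swapnil/Desktop/AI/myntra/'
FILES_URLS_FIELD = 'product_urls'    # field the pipeline reads the URLs from
FILES_RESULT_FIELD = 'files'         # field the download results are written to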
Use ImagesPipeline instead, and extract the images using regex.
In my first.py (with import re at the top):
item['files'] = re.findall(r'front":\{"path":"(.+?)"', response.text)
In settings.py:
IMAGES_STORE = '/home/swapnil/Desktop/AI/myntra/'
ITEM_PIPELINES = {'myntra.pipelines.SomePipeline': 300,
'scrapy.pipelines.images.ImagesPipeline': 1,}
This should work like a charm.
I have an array of links that defines the structure of a website. While downloading images from these links, I want to simultaneously place the downloaded images in a folder structure similar to the website structure, and not just rename them (as answered in Scrapy image download how to use custom filename).
My code for this is as follows:
class MyImagesPipeline(ImagesPipeline):
"""Custom image pipeline to rename images as they are being downloaded"""
page_url=None
def image_key(self, url):
page_url=self.page_url
image_guid = url.split('/')[-1]
return '%s/%s/%s' % (page_url,image_guid.split('_')[0],image_guid)
def get_media_requests(self, item, info):
#http://store.abc.com/b/n/s/m
os.system('mkdir '+item['sku'][0].encode('ascii','ignore'))
self.page_url = urlparse(item['start_url']).path #I store the parent page's url in start_url Field
for image_url in item['image_urls']:
yield Request(image_url)
It creates the required folder structure, but when I go deeper into the folders, I see that the files have been misplaced.
I suspect this is happening because the "get_media_requests" and "image_key" functions might be executing asynchronously, hence the value of "page_url" changes before it is used by the "image_key" function.
You are absolutely right that asynchronous Item processing prevents using class variables via self within the pipeline. You will have to store your path in each Request and override a few more methods (untested):
def image_key(self, url, page_url):
image_guid = url.split('/')[-1]
return '%s/%s/%s' % (page_url, image_guid.split('_')[0], image_guid)
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
yield Request(image_url, meta=dict(page_url=urlparse(item['start_url']).path))
def get_images(self, response, request, info):
key = self.image_key(request.url, request.meta.get('page_url'))
...
def media_to_download(self, request, info):
...
key = self.image_key(request.url, request.meta.get('page_url'))
...
def media_downloaded(self, response, request, info):
...
try:
key = self.image_key(request.url, request.meta.get('page_url'))
...
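On recent Scrapy versions the same idea can be expressed more compactly by overriding file_path, which receives the request (and therefore its meta) directly; a rough sketch along those lines, assuming the item carries start_url and image_urls as in the question (the class name is only illustrative):
from urllib.parse import urlparse

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class FolderImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # carry the parent page's path with each request instead of using self
        page_path = urlparse(item['start_url']).path
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'page_path': page_path})

    def file_path(self, request, response=None, info=None, *, item=None):
        # rebuild the same page_url/prefix/name layout as the original image_key
        image_name = request.url.split('/')[-1]
        return '%s/%s/%s' % (request.meta['page_path'], image_name.split('_')[0], image_name)
The images store creates intermediate directories on its own, so the os.system('mkdir ...') call should not be needed.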
This scrapy pipeline extension provides an easy way to store downloaded files into a folder tree.
You have to install it:
pip install scrapy_folder_tree
and then, add the pipeline on your configuration:
ITEM_PIPELINES = {
'scrapy_folder_tree.ImagesHashTreePipeline': 300
}
Disclaimer: I'm the author of scrapy-folder-tree