I want to crawl the image of each bottle of wine from the Vinnicolas website and save the results in a CSV file.
Unfortunately, I got some errors:
Spider: https://gist.github.com/anonymous/6424305
pipelines.py: https://gist.github.com/nahali/6434932
settings.py:
Your parse_wine_page callback does not set the "image_urls" field on the items, so the images pipeline will not download any images:
import urlparse
...
def parse_wine_page(self, response):
    ...
    hxs = HtmlXPathSelector(response)
    content = hxs.select('//*[@id="glo_right"]')
    for res in content:
        ...
        #item["Image"] = map(unicode.strip, res.select('//div[@class="pro_detail_tit"]//div[@class="pro_titre"]/h1/text()').extract())
        item['image_urls'] = map(lambda src: urlparse.urljoin(response.url, src), res.select('./div[@class="pro_col_left"]/img/@src').extract())
        items.append(item)
    return items
Also make sure your Projetvinnicolas3Item class has "images" and "image_urls" Field() attributes.
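For reference, here is a minimal sketch of what the item class and the image-pipeline settings could look like. Field names other than image_urls/images and the storage path are placeholders, and older Scrapy versions spell the pipeline scrapy.contrib.pipeline.images.ImagesPipeline:
# items.py -- only image_urls and images are required by the images pipeline
import scrapy

class Projetvinnicolas3Item(scrapy.Item):
    Image = scrapy.Field()        # any extra field you scrape (placeholder)
    image_urls = scrapy.Field()   # URLs the pipeline should download
    images = scrapy.Field()       # filled in by the pipeline with download results

# settings.py -- enable the images pipeline and tell it where to store files
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/images'  # placeholder path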
I have created a crawler using Scrapy. The crawler crawls the website and fetches the URLs.
Technology used: Python, Scrapy
Issue: I am getting duplicate URLs.
What I need the output to be:
I want the crawler to crawl the website and fetch the URLs, but not crawl duplicate URLs.
Sample code:
I have added this line to my settings.py file:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
When I ran it, it said the module was not found.
import scrapy
import os
import scrapy.dupefilters

class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'inputLinks2.csv'
    }
    filePath = 'inputLinks2.csv'
    if os.path.exists(filePath):
        os.remove(filePath)
    else:
        print("Cannot delete the file as it doesn't exist")

    start_urls = ['https://www.mytravelexp.com/']

    def parse(self, response):
        titles = response.xpath("//a/@href").extract()
        for title in titles:
            yield {'title': title}

    def __getid(self, url):
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
Please help!!
Scrapy filters duplicate requests by default, so you normally do not need to set DUPEFILTER_CLASS at all. The default value is already 'scrapy.dupefilters.RFPDupeFilter' (note dupefilters, plural); the singular path 'scrapy.dupefilter.RFPDupeFilter' does not exist in current Scrapy versions, which is why you got a module-not-found error.
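If you do need custom de-duplication (for example, treating URLs that differ only after a &refer parameter as the same request), the request_seen/__getid logic belongs in a dupefilter class rather than in the spider. Here is a minimal sketch, assuming the &refer rule from your code; the myproject package and class name are placeholders:
# myproject/dupefilters.py -- sketch of a custom duplicate filter
import os
from scrapy.dupefilters import RFPDupeFilter

class RefererStrippingDupeFilter(RFPDupeFilter):
    """Considers two URLs duplicates when they match up to '&refer'."""

    def _getid(self, url):
        return url.split("&refer")[0]

    def request_seen(self, request):
        fp = self._getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False

# settings.py -- point Scrapy at the custom filter
DUPEFILTER_CLASS = 'myproject.dupefilters.RefererStrippingDupeFilter'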
Since today, my spider won't save any information to my DuifpicturesItem items.
I have almost the same spider written for a different customer, but this one won't save anything and I don't know why. My items.py has only two fields: Images and Link.
In my console I can see that it collects the right data, but it doesn't save it.
My code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DuifpicturesItem
from scrapy.http import Request, FormRequest
import csv

class DuifLogin(CrawlSpider):
    name = "duiflogin"
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Price', 'Link', 'Title_small', 'NL_PL_PC', 'Description']}

    with open("duifonlylinks.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [items['Link'] for items in reader]

    rules = (
        Rule(
            LinkExtractor(),
            callback='parse_page',
            follow=True
        ),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        return FormRequest.from_response(response, formdata={
            'username': '****',
            'password': '****',
            'submit': ''
        }, callback=self.after_loging)

    def after_loging(self, response):
        accview = response.xpath('//div[@class="c-accountbox clearfix js-match-height"]/h3')
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No product', response.url)
        for a in productpage:
            items = DuifpicturesItem()
            items['Link'] = response.url
            items['Images'] = response.xpath('//div[@class="inner"]/img/@src').getall()
            yield items
My console:
Here you can see that it scrapes links and images like I want it to, but the .csv/.json file is still empty.
P.S.
The login data isn't correct, but for this process I don't have to be logged in, so I guess it doesn't affect the crawling process.
Not sure what you mean by "save it". Since you made no mention of a pipeline, I'm assuming you don't have one handling your items, so your items are being kept in memory only.
If you want to save your scraped items into a file you need to use a feed export. The simplest way would be:
scrapy crawl myspider -o items.json
It supports other formats as well; check the documentation.
If you meant saving into a DB or doing something else with the data, check the item pipelines.
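If you do go the pipeline route, here is a minimal sketch of a pipeline that writes each item to a JSON-lines file; the file name and the myproject module path are placeholders:
# pipelines.py -- sketch only
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item subclasses such as DuifpicturesItem
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

# settings.py (or custom_settings) -- enable the pipeline
ITEM_PIPELINES = {'myproject.pipelines.JsonWriterPipeline': 300}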
I have been tasked with building a web crawler that downloads all the .pdf files on a given site. The spider runs on my local machine and on Scrapinghub. For some reason it only downloads some, but not all, of the PDFs. This can be seen by looking at the items in the output JSON.
I have set MEDIA_ALLOW_REDIRECTS = True and tried running it on Scrapinghub as well as locally.
Here is my spider:
import scrapy
from scrapy.loader import ItemLoader
from poc_scrapy.items import file_list_Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PdfCrawler(CrawlSpider):
    # loader = ItemLoader(item=file_list_Item())
    downloaded_set = {''}
    name = 'example'
    allowed_domains = ['www.groton.org']
    start_urls = ['https://www.groton.org']

    rules = (
        Rule(LinkExtractor(allow='www.groton.org'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print('parsing', response)
        pdf_urls = []
        link_urls = []
        other_urls = []
        # print("this is the response", response.text)
        all_href = response.xpath('/html/body//a/@href').extract()
        # classify all links
        for href in all_href:
            if len(href) < 1:
                continue
            if href[-4:] == '.pdf':
                pdf_urls.append(href)
            elif href[0] == '/':
                link_urls.append(href)
            else:
                other_urls.append(href)
        # get the links that have pdfs and send them to the item pipeline
        for pdf in pdf_urls:
            if pdf[0:5] != 'http':
                # relative link, make it absolute
                new_pdf = response.urljoin(pdf)
            else:
                # already an absolute URL
                new_pdf = pdf
            if new_pdf in self.downloaded_set:
                # we have seen it before, don't do anything
                # print('skipping ', new_pdf)
                continue
            loader = ItemLoader(item=file_list_Item())
            # print(self.downloaded_set)
            self.downloaded_set.add(new_pdf)
            loader.add_value('file_urls', new_pdf)
            loader.add_value('base_url', response.url)
            yield loader.load_item()
settings.py
MEDIA_ALLOW_REDIRECTS = True
BOT_NAME = 'poc_scrapy'
SPIDER_MODULES = ['poc_scrapy.spiders']
NEWSPIDER_MODULE = 'poc_scrapy.spiders'
ROBOTSTXT_OBEY = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'poc_scrapy.middlewares.UserAgentMiddlewareRotator': 400,
}
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'pdfs/'
AUTOTHROTTLE_ENABLED = True
Here is a small portion of the output:
{
    "file_urls": [
        "https://www.groton.org/ftpimages/542/download/download_3402393.pdf"
    ],
    "base_url": [
        "https://www.groton.org/parents/business-office"
    ],
    "files": []
},
As you can see, the PDF file is listed in file_urls but was not downloaded. There are 5 warning messages indicating that some of the files could not be downloaded, but there are over 20 missing files.
Here is the warning message I get for some of the files:
[scrapy.pipelines.files] File (code: 301): Error downloading file from <GET http://groton.myschoolapp.com/ftpimages/542/download/Candidate_Statement_2013.pdf> referred in <None>
[scrapy.core.downloader.handlers.http11] Received more bytes than download warn size (33554432) in request <GET https://groton.myschoolapp.com/ftpimages/542/download/download_1474034.pdf>
I would expect that all the files would be downloaded, or at least that there would be a warning message for every file that is not downloaded. Maybe there is a workaround.
Any feedback is greatly appreciated. Thanks!
UPDATE: I realized that the problem was that robots.txt was not allowing me to visit some of the PDFs. This could be fixed by using another service to download them, or by not following robots.txt.
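If ignoring robots.txt is acceptable for this crawl, that is a one-line change in settings.py; a minimal sketch:
# settings.py -- only do this if you are allowed to ignore robots.txt for this site
ROBOTSTXT_OBEY = False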
I can't download images. I have several problems (I have tried many variations). This is my code (I guess it has many errors).
The goal is to crawl the start URL, save all the product images and rename them by SKU number. The spider also has to click the "next" button to do the same task on every page (there are around 24,000 products).
The problems that I noticed are:
I don't know the exact configuration of the item pipelines.
The images don't download to the folder set in settings.py.
I want to filter images by resolution and use thumbnails. Which configuration is recommended?
The images are located on another server. Is this a problem?
SETTINGS.PY
BOT_NAME = 'soarimages'
SPIDER_MODULES = ['soarimages.spiders']
NEWSPIDER_MODULE = 'soarimages.spiders'
DEFAULT_ITEM_CLASS = 'soarimages.items'
ITEM_PIPELINES = {'soarimages.pipelines.soarimagesPipeline': 1}
IMAGES_STORE = '/soarimages/images'
ITEMS.PY
import scrapy

class soarimagesItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
PIPELINES.PY
import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline

class soarimagesPipeline(ImagesPipeline):
    def set_filename(self, response):
        # add a regex here to check the title is valid for a filename.
        return 'full/{0}.jpg'.format(response.meta['title'][0])

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'title': item['title']})

    def get_images(self, response, request, info):
        for key, image, buf in super(soarimagesPipeline, self).get_images(response, request, info):
            key = self.set_filename(response)
            yield key, image, buf
Productphotos.PY (Spider)
# import the necessary packages
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from soarimages.items import soarimagesItem

class soarimagesSpider(scrapy.Spider):
    name = 'productphotos'
    allowed_domains = ['http://sodimac.com.ar','http://sodimacar.scene7.com']
    start_urls = ['http://www.sodimac.com.ar/sodimac-ar/search/']
    rules = [Rule(LinkExtractor(allow=['http://sodimacar.scene7.com/is/image//SodimacArgentina/.*']), 'parse')]

    def parse(self, response):
        SECTION_SELECTOR = '.one-prod'
        for soarimages in response.css(SECTION_SELECTOR):
            image = soarimagesItem()
            image['title'] = response.xpath('.//p[@class="sku"]/text()').re_first(r'SKU:\s*(.*)').strip(),
            rel = response.xpath('//div/a/img/@data-original').extract_first()
            image['image_urls'] = ['http:'+rel[0]]
            yield image

        NEXT_PAGE_SELECTOR = 'a.next ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
This is my code (I guess it has many errors)
In fact I could spot at least one error: allowed_domains should list only the domains. You must not include any http:// prefix:
allowed_domains = ['sodimac.com.ar', 'sodimacar.scene7.com']
You may want to fix this and test your spider. If new questions arise, create a specific question for each specific problem; that makes it easier to help you. See also how to ask.
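Regarding the resolution filter and thumbnails asked about above: the stock ImagesPipeline already supports both through settings. Here is a minimal sketch with example sizes (swap the pipeline key for your own soarimagesPipeline subclass if you keep it):
# settings.py -- image handling options for the images pipeline (example values)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/soarimages/images'

# drop images smaller than these dimensions (pixels)
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

# generate thumbnails alongside the full-size images
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}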
I am new to Scrapy and am attempting to teach myself the basics. I have put together a spider that goes to the Louisiana Department of Natural Resources website to retrieve the serial number of certain oil wells.
I have each well's link listed in start_urls, but Scrapy only downloads data from the first URL. What am I doing wrong?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from mike.items import MikeItem
class SonrisSpider(Spider):
    name = "sspider"
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def parse(self, response):
        item = MikeItem()
        item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
        yield item
Thank you for any help you might be able to provide. If I have not explained my problem thoroughly, please let me know and I will attempt to clarify.
I think this code might help.
By default Scrapy prevents duplicate requests. Since only the query parameters differ in your start URLs, Scrapy considers the rest of the URLs in start_urls to be duplicates of the first one. That's why your spider stops after fetching the first URL. In order to parse the rest of the URLs, enable the dont_filter flag on the Scrapy request (check start_requests() below).
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from mike.items import MikeItem
class SonrisSpider(scrapy.Spider):
    name = "sspider"
    allowed_domains = ["sonlite.dnr.state.la.us"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_data, dont_filter=True)

    def parse_data(self, response):
        item = MikeItem()
        serial = response.xpath(
            '/html/body/table[1]/tr[2]/td[1]/text()').extract()
        serial = serial[0] if serial else 'n/a'
        item['serial'] = serial
        yield item
Sample output returned by this spider is as follows:
{'serial': u'207899'}
{'serial': u'971683'}
{'serial': u'214206'}
{'serial': u'159420'}
{'serial': u'248942'}
{'serial': u'243671'}
Your code looks good; try adding this function:
class SonrisSpider(Spider):
    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)
    #the result of your code goes here
The URLs should be printed now. Test it, and if not, please say so.