I'm using scrapy to get the data from
http://www.bbb.org/greater-san-francisco/business-reviews/architects/klopf-architecture-in-san-francisco-ca-152805
So I created some items to save the information, but I don't get all the data every time I run the script; usually some items come back empty, so I have to run the script again until I get them all.
This is the code of the spider
import scrapy
from tutorial.items import Product
from scrapy.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["bbb.org/"]
    start_urls = [
        "http://www.bbb.org/greater-san-francisco/business-reviews/architects/klopf-architecture-in-san-francisco-ca-152805"
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/a-d-architects-in-oakland-ca-133229"
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/aecom-in-concord-ca-541360"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        producto = Product()
        #producto['name'] = response.xpath('//*[@id="business-detail"]/div/h1')
        producto = Product(
            Name=response.xpath('//*[@id="business-detail"]/div/h1/text()').extract(),
            Telephone=response.xpath('//*[@id="business-detail"]/div/p/span[1]/text()').extract(),
            Address=response.xpath('//*[@id="business-detail"]/div/p/span[2]/span[1]/text()').extract(),
            Description=response.xpath('//*[@id="business-description"]/p[2]/text()').extract(),
            BBBAccreditation=response.xpath('//*[@id="business-accreditation-content"]/p[1]/text()').extract(),
            Complaints=response.xpath('//*[@id="complaint-sort-container"]/text()').extract(),
            Reviews=response.xpath('//*[@id="complaint-sort-container"]/p/text()').extract(),
            WebPage=response.xpath('//*[@id="business-detail"]/div/p/span[3]/a/text()').extract(),
            Rating=response.xpath('//*[@id="accedited-rating"]/img/text()').extract(),
            ServiceArea=response.xpath('//*[@id="business-additional-info-text"]/span[4]/p/text()').extract(),
            ReasonForRating=response.xpath('//*[@id="reason-rating-content"]/ul/li[1]/text()').extract(),
            NumberofEmployees=response.xpath('//*[@id="business-additional-info-text"]/p[8]/text()').extract(),
            LicenceNumber=response.xpath('//*[@id="business-additional-info-text"]/p[6]/text()').extract(),
            Contact=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BBBFileOpened=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BusinessStarted=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
        )
        #producto.add_xpath('name', '//*[@id="business-detail"]/div/h1')
        #product.add_value('name', 'today')  # you can also use literal values
        #product.load_item()
        return producto
This page requires setting a user agent, so I have a file of user agents. Could it be that some of them are wrong?
Yes, some of your user agents could be wrong (maybe some are old or deprecated) and the site rejects them. If there is no problem with using only one user agent, you could add it to settings.py:
USER_AGENT="someuseragent"
Remember to also remove or disable the random user-agent middleware in settings.py.
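A minimal sketch of what that could look like in settings.py; the middleware path is an assumption, since it depends on which random user-agent package or custom middleware you were using:

# settings.py -- sketch with assumed names
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

DOWNLOADER_MIDDLEWARES = {
    # setting an entry to None disables it; replace the path below with
    # that of your actual random user-agent middleware
    'myproject.middlewares.RandomUserAgentMiddleware': None,
}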
Since today, my spider won't save any information to my items "DuifpicturesItem".
I have almost the same spider created for a different customer, but this one won't save anything, and I don't know why. My items.py only has two fields: Images and Link.
In my console I can see that it collects the right data, but it doesn't save it.
My Code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DuifpicturesItem
from scrapy.http import Request, FormRequest
import csv

class DuifLogin(CrawlSpider):
    name = "duiflogin"
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Price', 'Link', 'Title_small', 'NL_PL_PC', 'Description']}

    with open("duifonlylinks.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [items['Link'] for items in reader]

    rules = (
        Rule(
            LinkExtractor(),
            callback='parse_page',
            follow=True
        ),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        return FormRequest.from_response(response, formdata={
            'username': '****',
            'password': '****',
            'submit': ''
        }, callback=self.after_loging)

    def after_loging(self, response):
        accview = response.xpath('//div[@class="c-accountbox clearfix js-match-height"]/h3')
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No product', response.url)
        for a in productpage:
            items = DuifpicturesItem()
            items['Link'] = response.url
            items['Images'] = response.xpath('//div[@class="inner"]/img/@src').getall()
            yield items
My console
Here you can see that it scrapes the links and images like I want it to, but the .csv/.json file is still empty.
P.S.
The login data isn't correct, but for this process I don't have to be logged in, so I guess it doesn't affect the crawling process.
Not sure what you mean by "save it". Since you made no mention of a pipeline, I'm assuming you don't have one for handling your items, so your items are being kept in memory only.
If you want to save your scraped items into a file, you need to use the feed exports. The simplest way would be:
scrapy crawl myspider -o items.json
It supports other formats as well; check the documentation.
If you meant to save the data into a DB or do something else with it, check the ItemPipelines.
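If you are on a recent Scrapy version (2.1+), you can also configure the output in settings.py (or in the spider's custom_settings) instead of on the command line. A sketch, with example file names:

# settings.py -- sketch, file names are just examples
FEEDS = {
    'items.csv': {'format': 'csv'},
    'items.json': {'format': 'json'},
}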
I am using scrapy to get the content inside some urls on a page, similar to this question here:
Use scrapy to get list of urls, and then scrape content inside those urls
I am able to get the sub-URLs from my start URLs (first def). However, my second def doesn't seem to be passing through, and the result file is empty. I have tested the content inside the function in the scrapy shell and it gets the info I want, but not when I am running the spider.
import scrapy
from scrapy.selector import Selector
#from scrapy import Spider
from WheelsOnlineScrapper.items import Dealer
from WheelsOnlineScrapper.url_list import urls
import logging
from urlparse import urljoin

logger = logging.getLogger(__name__)

class WheelsonlinespiderSpider(scrapy.Spider):
    logger.info('Spider starting')
    name = 'wheelsonlinespider'
    rotate_user_agent = True  # lives in middleware.py and settings.py
    allowed_domains = ["https://wheelsonline.ca"]
    start_urls = urls  # this list is created in url_list.py
    logger.info('URLs retrieved')

    def parse(self, response):
        subURLs = []
        partialURLs = response.css('.directory_name::attr(href)').extract()
        for i in partialURLs:
            subURLs = urljoin('https://wheelsonline.ca/', i)
            yield scrapy.Request(subURLs, callback=self.parse_dealers)
            logger.info('Dealer ' + subURLs + ' fetched')

    def parse_dealers(self, response):
        logger.info('Beginning of page')
        dlr = Dealer()
        # Extracting the content using css selectors
        try:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
        except TypeError:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
        dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
        dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()
        logger.info('Dealer fetched ' + dlr['DealerName'])
        yield dlr
        logger.info('End of page')
Your allowed_domains list contains the protocol (https). It should have only the domain name as per the documentation:
allowed_domains = ["wheelsonline.ca"]
Also, you should've received a message in your log:
URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains
I can't download images. I have several problems and have tried many variations. This is my code (I guess it has many errors).
The goal is to crawl the start URL and save all the product images, renaming each one to its SKU number. The spider also has to click the "next" button to do the same task on all the pages (there are around 24,000 products).
The problems that I noticed are:
I don't know the exact configuration for the item pipelines
The images don't download to the folder set in settings.py
I want to filter images by resolution and use thumbnails. Which is the recommended configuration?
The images are located on another server. Is this a problem?
SETTINGS.PY
BOT_NAME = 'soarimages'
SPIDER_MODULES = ['soarimages.spiders']
NEWSPIDER_MODULE = 'soarimages.spiders'
DEFAULT_ITEM_CLASS = 'soarimages.items'
ITEM_PIPELINES = {'soarimages.pipelines.soarimagesPipeline': 1}
IMAGES_STORE = '/soarimages/images'
ITEMS.PY
import scrapy
class soarimagesItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
PIPELINES.PY
import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
class soarimagesPipeline(ImagesPipeline):
    def set_filename(self, response):
        # add a regex here to check the title is valid for a filename.
        return 'full/{0}.jpg'.format(response.meta['title'][0])

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'title': item['title']})

    def get_images(self, response, request, info):
        for key, image, buf in super(soarimagesPipeline, self).get_images(response, request, info):
            key = self.set_filename(response)
            yield key, image, buf
Productphotos.PY (Spider)
# import the necessary packages
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from soarimages.items import soarimagesItem
class soarimagesSpider(scrapy.Spider):
    name = 'productphotos'
    allowed_domains = ['http://sodimac.com.ar', 'http://sodimacar.scene7.com']
    start_urls = ['http://www.sodimac.com.ar/sodimac-ar/search/']
    rules = [Rule(LinkExtractor(allow=['http://sodimacar.scene7.com/is/image//SodimacArgentina/.*']), 'parse')]

    def parse(self, response):
        SECTION_SELECTOR = '.one-prod'
        for soarimages in response.css(SECTION_SELECTOR):
            image = soarimagesItem()
            image['title'] = response.xpath('.//p[@class="sku"]/text()').re_first(r'SKU:\s*(.*)').strip(),
            rel = response.xpath('//div/a/img/@data-original').extract_first()
            image['image_urls'] = ['http:' + rel[0]]
            yield image

        NEXT_PAGE_SELECTOR = 'a.next ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
This is my code (I guess it has many errors)
In fact I could spot at least one error: allowed_domains should list only the domains. You must not include any http:// prefix:
allowed_domains = ['sodimac.com.ar', 'sodimacar.scene7.com']
You may want to fix this and test your spider. If new questions arise, create specific questions for each specific problem; this makes it easier to help you. See also how to ask.
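As a reference for the pipeline questions, here is a minimal sketch of the image settings for size filtering and thumbnails. The values are only examples; if you keep your custom soarimagesPipeline, leave that entry in ITEM_PIPELINES instead of the built-in one, and note that the built-in pipeline lives in scrapy.pipelines.images in newer Scrapy versions (scrapy.contrib.pipeline.images in older ones):

# settings.py -- sketch with example values
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/soarimages/images'

# drop images smaller than this (in pixels)
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

# also generate thumbnails alongside the full-size images
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}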
I'm very new to scrapy, so it's hard for me to figure out what I am doing wrong when I get no results in the csv file. I can see results in the console, though. Here is what I tried:
The main folder is named "realyp".
The spider file is named "yp.py" and contains the following code:
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from realyp.items import RealypItem

class MySpider(BaseSpider):
    name = "YellowPage"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=2"]

    def parse(self, response):
        title = Selector(response)
        page = title.xpath('//div[@class="info"]')
        items = []
        for titles in page:
            item = RealypItem()
            item["name"] = titles.xpath('.//span[@itemprop="name"]/text()').extract()
            item["address"] = titles.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()').extract()
            item["phone"] = titles.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()').extract()
            items.append(item)
        return items
"items.py" file includes:
from scrapy.item import Item, Field

class RealypItem(Item):
    name = Field()
    address = Field()
    phone = Field()
To get the csv output my command line is:
cd desktop
cd realyp
scrapy crawl YellowPage -o items.csv -t csv
Any help will be greatly appreciated.
As stated by @Granitosauros, you should use yield instead of return, and the yield should be inside the for loop.
Inside the for loop, if an XPath starts with // then it selects all matching elements in the whole document, not just those under the current node (see here).
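For illustration, a minimal sketch of the difference inside a loop (the selectors are only examples):

for row in response.xpath('//div[@class="info"]'):
    # starts with //: searches the whole document again, so every row
    # yields the same global matches
    all_names = row.xpath('//span[@itemprop="name"]/text()').extract()
    # starts with .//: searches only inside the current row
    row_name = row.xpath('.//span[@itemprop="name"]/text()').extract()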
Here's a (rough) piece of code that works for me:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from realyp.items import RealypItem

class MySpider(BaseSpider):
    name = "YellowPage"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=2"]

    def parse(self, response):
        for titles in response.xpath('//div[@class = "result"]/div'):
            item = RealypItem()
            item["name"] = titles.xpath('div[2]/div[2]/h2/a/span[@itemprop="name"]/text()').extract()
            item["address"] = titles.xpath('string(div[2]/div[2]/div/p[@itemprop="address"])').extract()
            item["phone"] = titles.xpath('div[2]/div[2]/div/div[@itemprop="telephone" and @class="phones phone primary"]/text()').extract()
            yield item
This question already has answers here:
Getting scrapy project settings when script is outside of root directory
I've written a Scrapy spider that I am trying to run from a Python script located in another directory. The code I'm using from the docs seems to run the spider, but when I check the PostgreSQL table, it hasn't been created. The spider only properly pipelines the scraped data if I use the scrapy crawl command. I've tried placing the script in the directory right above the scrapy project and also in the same directory as the config file, and neither seems to create the table.
The code for the script is below, followed by the code for the spider. I think the problem involves the directory in which the script should be placed and/or the code that I use within the spider file to enable the spider to be run from a script, but I'm not sure. Does it look like there is a problem with the function that is being called in the script, or is there something that needs to be changed within the settings file? I can provide the code for the pipelines file if necessary, thanks.
Script file (only 3 lines)
from ticket_city_scraper import *
from ticket_city_scraper.spiders import tc_spider
tc_spider.spiderCrawl()
Spider file
import scrapy
import re
import json
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from ticket_city_scraper.items import ComparatorItem
from urlparse import urljoin
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging

bandname = raw_input("Enter bandname\n")
tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]
    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'

    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"
        self.start_urls = [tc_url]
        #return tc_url

    tickets_list_xpath = './/div[@class = "vevent"]'

    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract()
        eventCity = ''.join(event_City)
        loader.add_value('eventCity', eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract()
        eventState = ''.join(event_State)
        loader.add_value('eventState', eventState)
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract()
        eventDate = ''.join(event_Date)
        loader.add_value('eventDate', eventDate)
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list = re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id = "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback=self.parse_json, dont_filter=True)

    def parse(self, response):
        """
        # """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName', './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation', './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink', './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime', '//div[@id="divEventDate"]/@title')  #datetime type
            #loader.add_xpath('eventTime', './/*[@class = "productionsTime"]/text()')
            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback=self.parse_price, dont_filter=True)

def spiderCrawl():
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider3)
    process.start()
It's because your settings object only contains a user agent. Your project settings are what determine which pipelines get run. From the scrapy docs:
You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.
More info here: http://doc.scrapy.org/en/latest/topics/practices.html (read more than the first example).
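Applied to your spiderCrawl function, a sketch of what that could look like (this assumes the script is run somewhere Scrapy can find scrapy.cfg, so the project settings are picked up):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spiderCrawl():
    # load the project's settings.py (including ITEM_PIPELINES)
    # instead of passing an ad-hoc settings dict
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider3)
    process.start()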
The settings are not being read if you run from a parent folder.
This answer helped me:
Getting scrapy project settings when script is outside of root directory