Trouble scraping with scrapy - python

Here is my code. To explain: first I scraped the listing links, then I followed each link to go through every listing and parse some info, e.g. name, address, price, number. While running it in the terminal I get errors such as:
price = response.css('div.article_right_price::text').get().strip()
AttributeError: 'NoneType' object has no attribute 'strip'
Still, I can export to CSV without a problem. But one more thing: the listings are in Georgian, and when I export to CSV I only see symbols that are not Georgian :)) I would be grateful if someone could help me.
import scrapy


class SsHomesSpider(scrapy.Spider):
    name = 'ss_home'
    start_urls = ['https://ss.ge/ka/udzravi-qoneba/l/bina/qiravdeba?CurrentUserId=&Query=&MunicipalityId=95&CityIdList=95&subdistr=&stId=&PrcSource=2&StatusField.FieldId=34&StatusField.Type=SingleSelect&StatusField.StandardField=Status&StatusField.SelectedValues=2&QuantityFrom=&QuantityTo=&PriceType=false&CurrencyId=2&PriceFrom=300&PriceTo=500&Context.Request.Query%5BQuery%5D=&IndividualEntityOnly=true&Fields%5B3%5D.FieldId=151&Fields%5B3%5D.Type=SingleSelect&Fields%5B3%5D.StandardField=None&Fields%5B4%5D.FieldId=150&Fields%5B4%5D.Type=SingleSelect&Fields%5B4%5D.StandardField=None&Fields%5B5%5D.FieldId=152&Fields%5B5%5D.Type=SingleSelect&Fields%5B5%5D.StandardField=None&Fields%5B6%5D.FieldId=29&Fields%5B6%5D.Type=SingleSelect&Fields%5B6%5D.StandardField=None&Fields%5B7%5D.FieldId=153&Fields%5B7%5D.Type=MultiSelect&Fields%5B7%5D.StandardField=None&Fields%5B8%5D.FieldId=30&Fields%5B8%5D.Type=SingleSelect&Fields%5B8%5D.StandardField=None&Fields%5B0%5D.FieldId=48&Fields%5B0%5D.Type=Number&Fields%5B0%5D.StandardField=None&Fields%5B0%5D.ValueFrom=&Fields%5B0%5D.ValueTo=&Fields%5B1%5D.FieldId=146&Fields%5B1%5D.Type=Number&Fields%5B1%5D.StandardField=None&Fields%5B1%5D.ValueFrom=&Fields%5B1%5D.ValueTo=&Fields%5B2%5D.FieldId=28&Fields%5B2%5D.Type=Number&Fields%5B2%5D.StandardField=Floor&Fields%5B2%5D.ValueFrom=&Fields%5B2%5D.ValueTo=&Fields%5B9%5D.FieldId=15&Fields%5B9%5D.Type=Group&Fields%5B9%5D.StandardField=None&Fields%5B9%5D.Values%5B0%5D.Value=35&Fields%5B9%5D.Values%5B1%5D.Value=36&Fields%5B9%5D.Values%5B2%5D.Value=37&Fields%5B9%5D.Values%5B3%5D.Value=38&Fields%5B9%5D.Values%5B4%5D.Value=39&Fields%5B9%5D.Values%5B5%5D.Value=40&Fields%5B9%5D.Values%5B6%5D.Value=41&Fields%5B9%5D.Values%5B7%5D.Value=42&Fields%5B9%5D.Values%5B8%5D.Value=24&Fields%5B9%5D.Values%5B9%5D.Value=27&Fields%5B9%5D.Values%5B10%5D.Value=22&Fields%5B9%5D.Values%5B11%5D.Value=20&Fields%5B9%5D.Values%5B12%5D.Value=8&Fields%5B9%5D.Values%5B13%5D.Value=6&Fields%5B9%5D.Values%5B14%5D.Value=4&Fields%5B9%5D.Values%5B15%5D.Value=5&Fields%5B9%5D.Values%5B16%5D.Value=9&Fields%5B9%5D.Values%5B17%5D.Value=3&Fields%5B9%5D.Values%5B18%5D.Value=120&AgencyId=&VipStatus=&Fields%5B9%5D.Values%5B0%5D.Selected=false&Fields%5B9%5D.Values%5B1%5D.Selected=false&Fields%5B9%5D.Values%5B2%5D.Selected=false&Fields%5B9%5D.Values%5B3%5D.Selected=false&Fields%5B9%5D.Values%5B4%5D.Selected=false&Fields%5B9%5D.Values%5B5%5D.Selected=false&Fields%5B9%5D.Values%5B6%5D.Selected=false&Fields%5B9%5D.Values%5B7%5D.Selected=false&Fields%5B9%5D.Values%5B8%5D.Selected=false&Fields%5B9%5D.Values%5B9%5D.Selected=false&Fields%5B9%5D.Values%5B10%5D.Selected=false&Fields%5B9%5D.Values%5B11%5D.Selected=false&Fields%5B9%5D.Values%5B12%5D.Selected=false&Fields%5B9%5D.Values%5B13%5D.Selected=false&Fields%5B9%5D.Values%5B14%5D.Selected=false&Fields%5B9%5D.Values%5B15%5D.Selected=false&Fields%5B9%5D.Values%5B16%5D.Selected=false&Fields%5B9%5D.Values%5B17%5D.Selected=false&Fields%5B9%5D.Values%5B18%5D.Selected=false']

    def parse(self, response):
        all_listing = response.css('div.latest_desc a::attr(href)')
        for listing in all_listing:
            yield response.follow(listing.get(), callback=self.parse_listings)

    def parse_listings(self, response):
        name = response.css('div.article_in_title h1::text').get()
        price = response.css('div.article_right_price::text').get().strip()
        square_m = response.css('div.WholeFartBlock text::text').get().strip()
        street = response.css('div.StreeTaddressList a::text').get().strip()
        number = response.css('div.UserMObileNumbersBlock a::attr(href)').get().strip("tel':")
        yield {
            'name': name,
            'price': price,
            'square_m': square_m,
            'street': street,
            'number': number,
        }

The error you are getting is not Scrapy-related: you are calling .strip() on None. Your selectors are returning None instead of the string value you are expecting. Check your selectors again, and also consider using Scrapy item loaders to clean your items.
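As for the second issue (Georgian text showing up as unreadable symbols in the CSV): Scrapy writes feed exports as UTF-8 by default, so the garbling usually comes from opening the file in a spreadsheet program that guesses a different encoding. A common fix is a one-line change in the project settings; this is a sketch of that setting, not a change to the spider itself:

```python
# settings.py
# 'utf-8-sig' prepends a byte-order mark so that spreadsheet applications
# such as Excel detect the encoding and render Georgian text correctly.
FEED_EXPORT_ENCODING = 'utf-8-sig'
```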

Related

Scrapy one item with multiple parsing functions

I am using Scrapy with Python to scrape a website, and I'm having some difficulty filling the item that I have created.
The products are properly scraped, and everything works well as long as the info is located within the response.xpath mentioned in the for loop.
'trend' and 'number' are properly added to the item using ItemLoader.
However, the date of the product is not located within that XPath but in the page title, accessible via response.css('title').
import scrapy
import datetime
from trends.items import Trend_item
from scrapy.loader import ItemLoader


# Initiate the spider
class trendspiders(scrapy.Spider):
    name = 'milk'
    start_urls = ['https://thewebsiteforthebestmilk/ireland/2022-03-16/7/']

    def parse(self, response):
        for milk_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=milk_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            return l.load_item()
How can I add the 'date' (found in response.css('title')) to my item, please?
I have tried adding l.add_css('date', "response.css('title')") in the for loop, but it returns an error.
Should I create a new parsing function? If yes, how do I send the info to the same item?
I hope I’ve made myself clear.
Thank you very much for your help,
Since the date is outside of the selector you are using for each row, what you should do is extract that first before your for loop, since it doesn't need to be updated on each iteration.
Then with your item loader you can just use l.add_value to load it with the rest of the fields.
For example:
class trendspiders(scrapy.Spider):
    name = 'trends'
    start_urls = ['https://getdaytrends.com/ireland/2022-03-16/7/']

    def parse(self, response):
        date_str = response.xpath("//title/text()").get()
        for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            l.add_value('date', date_str)
            yield l.load_item()
If response.css('title').get() gives you the value you need, why not use the same CSS selector with add_css:
l.add_css('date', 'title')
Also, .add_css('date', "response.css('title')") is invalid because the second argument must be a valid CSS selector string, not a Python expression.
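If the raw title string then needs cleaning before it is loaded, the standard library is enough. The title text below is a hypothetical stand-in (the real page's title format may differ), so treat this only as a sketch of the parsing step:

```python
from datetime import datetime

# Hypothetical title text, as pulled via response.xpath("//title/text()").get()
title = "Twitter trends in Ireland on Wednesday, 16 March 2022"

# Keep only the part after " on ", then parse it into a date object.
date_part = title.rsplit(" on ", 1)[-1]
parsed = datetime.strptime(date_part, "%A, %d %B %Y")
print(parsed.date().isoformat())  # 2022-03-16
```

The parsed date (rather than the raw string) can then be passed to l.add_value('date', ...).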

Python & Scrapy output: "\r\n\t\t\t\t\t\t\t"

I'm learning scraping with Scrapy and having some issues with code giving me a weird output that I don't understand. Can someone explain why I am getting a bunch of "\r\n\t\t\t\t\t\t\t" strings?
I found this solution on Stack overflow:
Remove an '\\n\\t\\t\\t'-element from list
But I want to learn what is causing it.
Here is the code causing the issue. The .strip() method from the link above solves it, but, as mentioned, I don't understand where the characters are coming from.
import scrapy
import logging
import re


class CitySpider(scrapy.Spider):
    name = 'city'
    allowed_domains = ['www.a-tembo.nl']
    start_urls = ['https://www.a-tembo.nl/themas/category/city/']

    def parse(self, response):
        titles = response.xpath("//div[@class='hikashop_category_image']/a")
        for title in titles:
            series = title.xpath(".//@title").get()
            link = title.xpath(".//@href").get()
            # absolute_url = f"https://www.a-tembo.nl{link}"
            # absolute_url = response.urljoin(link)
            yield response.follow(link, callback=self.parse_title)

    def parse_title(self, response):
        rows = response.xpath("//table[@class='hikashop_products_table adminlist table']/tbody/tr")
        for row in rows:
            product_code = row.xpath(".//span[@class='hikashop_product_code']/text()").get()
            product_name = row.xpath(".//span[@class='hikashop_product_name']/a/text()").get()
            yield {
                "Product_code": product_code,
                "Product_name": product_name
            }
Sequences like \n are called escape sequences.
For example:
\n indicates a newline and \t signifies a tab. Website source code is full of them (they come from the indentation of the HTML source), although you never see them rendered unless you inspect the page. That whitespace travels along with the text nodes you extract, which is why .strip() removes it. If you want to learn more about escape sequences in Python you can read about them here. I hope that answers your question.
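The effect is easy to reproduce in plain Python; the string below is a made-up example of what an extracted text node can look like before and after stripping:

```python
# Indentation from the HTML source travels along with the text node.
raw = "\r\n\t\t\t\t\t\t\tSome product name\r\n\t\t\t"

# str.strip() removes leading and trailing whitespace, which
# includes \r (carriage return), \n (newline) and \t (tab).
cleaned = raw.strip()
print(repr(cleaned))  # 'Some product name'
```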

Scrapy CSV incorrectly formatted

I'm new to the scrapy package, and here's my problem:
import scrapy


class simpleSpider(scrapy.Spider):
    name = "simple_spider"
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        token = response.css("input[name=csrf_token] ::attr(value)").extract_first()
        formdata = {
            'csrf_token': token,
            'username': 'rseiji',
            'password': 'seiji1234'
        }
        yield scrapy.FormRequest(response.url, formdata=formdata, callback=self.parse_logged)

    def parse_logged(self, response):
        yield {
            'text': response.css('span.text::text').extract(),
            'author': response.css('small.author::text').extract(),
            'tags': response.css('div.tags a.tag::text').extract()
        }
This is my spider, and it does work. But when I run:
scrapy crawl simple_spider -o mySpider.csv
the .csv file doesn't seem to be correctly formatted. It extracts only the "text" column.
What's wrong?
Thank you!
Edited: this is my .csv file:
text,author,tags
"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,“It is our choices, Harry, that show what we truly are, far more than our abilities.”,“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”,“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”,“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”,“Try not to become a man of success. Rather become a man of value.”,“It is better to be hated for what you are than to be loved for what you are not.”,“I have not failed. I've just found 10,000 ways that won't work.”,“A woman is like a tea bag; you never know how strong it is until it's in hot water.”,“A day without sunshine is like, you know, night.”","Albert Einstein,J.K. Rowling,Albert Einstein,Jane Austen,Marilyn Monroe,Albert Einstein,André Gide,Thomas A. Edison,Eleanor Roosevelt,Steve Martin","change,deep-thoughts,thinking,world,abilities,choices,inspirational,life,live,miracle,miracles,aliteracy,books,classic,humor,be-yourself,inspirational,adulthood,success,value,life,love,edison,failure,inspirational,paraphrased,misattributed-eleanor-roosevelt,humor,obvious,simile"
...
I've now figured out that it's not that there are empty columns; the .csv file is simply not well formed. Everything came out in just one row!
Solved!
import scrapy


class simpleSpider(scrapy.Spider):
    name = "simple_spider"
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        formdata = {
            'username': 'rseiji',
            'password': 'seiji1234'
        }
        yield scrapy.FormRequest.from_response(response, formdata=formdata, callback=self.parse_logged)

    def parse_logged(self, response):
        # Get a list of Selector objects and loop through them
        for quote in response.css('div.quote'):
            # yield each item individually
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'author_goodreads_url': quote.css('span a[href*="goodreads.com"]::attr(href)').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract()
            }
The problem was that I was calling extract() on the whole page. What I wanted was a list of Selector objects.
extract() always produces a list as output: you get a list of strings of the HTML matched by your selector, or a single string when using extract_first(). By using neither, you keep a list of Selector objects, which you can then iterate through and chain new selectors on, allowing you to pick out each individual item.
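The CSV symptom itself can be reproduced with just the standard library: the feed exporter writes one row per yielded item, so yielding one dict per quote produces one row per quote, while a single dict holding lists collapses everything into one giant row. A minimal sketch with made-up data:

```python
import csv
import io

# One yielded item per quote -> one CSV row per quote.
items = [
    {"text": "quote one", "author": "Author A"},
    {"text": "quote two", "author": "Author B"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["text", "author"])
writer.writeheader()
writer.writerows(items)

print(buf.getvalue())
# text,author
# quote one,Author A
# quote two,Author B
```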

Can scrapy skip the error of empty data and keeping scraping?

I want to scrape product pages from a sitemap. The product pages are similar, but not all of them are the same.
For example:
Product A
https://www.vitalsource.com/products/environment-the-science-behind-the-stories-jay-h-withgott-matthew-v9780134446400
Product B
https://www.vitalsource.com/products/abnormal-psychology-susan-nolen-hoeksema-v9781259765667
As we can see, Product A has a subtitle but Product B doesn't.
So I get errors when I try to scrape all the product pages.
My question is: is there a way to let the spider skip the error caused by missing data and keep scraping?
There is a simple way to bypass it, which is not using strip().
But I am wondering if there is a better way to do the job.
import scrapy
import re
from VitalSource.items import VitalsourceItem
from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider


class VsSpider(SitemapSpider):
    name = 'VS'
    allowed_domains = ['vitalsource.com']
    sitemap_urls = ['https://storage.googleapis.com/vst-stargate-production/sitemap/sitemap1.xml.gz']
    sitemap_rules = [
        ('/products/', 'parse_product'),
    ]

    def parse_product(self, response):
        selector = Selector(response=response)
        item = VitalsourceItem()
        item['Ebook_Title'] = response.css('.product-overview__title-header::text').extract()[1].strip
        item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
        print(item)
        return item
Error message:
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
AttributeError: 'list' object has no attribute 'strip'
Since you need only one subtitle, you can use get() with the default value set to an empty string. This saves you from the error of applying strip() to a missing element.
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get('').strip()
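Why this works can be sketched without Scrapy at all. get_first below is a hypothetical stand-in for SelectorList.get(): it returns the supplied default instead of None when nothing matched, so the chained .strip() is always called on a string:

```python
def get_first(results, default=None):
    """Mimic SelectorList.get(): first match, or the default when empty."""
    return results[0] if results else default

# Page without a subtitle: empty result list -> '' -> .strip() is safe.
assert get_first([], '').strip() == ''

# Subtitle present: stripped text as usual.
assert get_first(['  Abnormal Psychology  '], '').strip() == 'Abnormal Psychology'
```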
In general Scrapy will not stop crawling if a callback raises an exception, e.g.:

def start_requests(self):
    for i in range(10):
        yield Request(
            f'http://example.org/page/{i}',
            callback=self.parse,
            errback=self.errback,
        )

def parse(self, response):
    # first page
    if 'page/1' in response.request.url:
        raise ValueError()
    yield {'url': response.url}

def errback(self, failure):
    print(f"oh no, failed to parse {failure.request}")

In this example 10 requests will be made and 9 items will be scraped, but 1 will fail and go to the errback.
In your case you have nothing to fear: any request that does not raise an exception will be scraped as it should; for the ones that do, you'll just see an exception traceback in your terminal/logs.
You could check whether a value is returned before extracting:

if response.css("div.subtitle.subtitle-pdp::text"):
    item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get().strip()

That way the subtitle line only runs if a value was returned.

Scraping Japanese website using Scrapy but no data in output file

I am new to Scrapy. I want some data scraped from a Japanese website, but when I run the following spider it doesn't show any data in the exported file. Can someone help me, please?
Exporting to CSV doesn't show any results in the shell either, just [].
Here is my code.
import scrapy


class suumotest(scrapy.Spider):
    name = "testsecond"
    start_urls = [
        'https://suumo.jp/jj/chintai/ichiran/FR301FC005/?tc=0401303&tc=0401304&ar=010&bs=040'
    ]

    def parse(self, response):
        # for following property link
        for href in response.css('.property_inner-title+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_info)

    # defining parser to extract data
    def parse_info(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'Title': extract_with_css('h1.section_title::text'),
            'Fee': extract_with_css('td.detailinfo-col--01 span.detailvalue-item-accent::text'),
            'Fee Description': extract_with_css('td.detailinfo-col--01 span.detailvalue-item-text::text'),
            'Prop Description': extract_with_css('td.detailinfo-col--03::text'),
            'Prop Address': extract_with_css('td.detailinfo-col--04::text'),
        }
Your first CSS selector in the parse method is faulty:
response.css('.property_inner-title+a::attr(href)').extract()
The + (adjacent-sibling combinator) is the fault here. Just replace it with a space (the descendant combinator), like:
response.css('.property_inner-title a::attr(href)').extract()
Another issue is in your defined extract_with_css() function:

def parse_info(self, response):
    def extract_with_css(query):
        return response.css(query).extract_first().strip()

The problem is that extract_first() returns None by default when no values are found, and .strip() is a method of the string class; since you're not getting a string, this throws an error.
To fix that, set the default value of extract_first() to an empty string instead:

def parse_info(self, response):
    def extract_with_css(query):
        return response.css(query).extract_first('').strip()
