I am new to Scrapy. I want to scrape some data from a Japanese website, but when I run the following spider, it doesn't show any data in the exported file. Can someone help me, please?
Exporting to CSV format doesn't show any results in the shell either, just [].
Here is my code.
import scrapy

class suumotest(scrapy.Spider):
    name = "testsecond"
    start_urls = [
        'https://suumo.jp/jj/chintai/ichiran/FR301FC005/?tc=0401303&tc=0401304&ar=010&bs=040'
    ]

    def parse(self, response):
        # for following property link
        for href in response.css('.property_inner-title+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_info)

    # defining parser to extract data
    def parse_info(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'Title': extract_with_css('h1.section_title::text'),
            'Fee': extract_with_css('td.detailinfo-col--01 span.detailvalue-item-accent::text'),
            'Fee Description': extract_with_css('td.detailinfo-col--01 span.detailvalue-item-text::text'),
            'Prop Description': extract_with_css('td.detailinfo-col--03::text'),
            'Prop Address': extract_with_css('td.detailinfo-col--04::text'),
        }
Your first CSS selector, in the parse method, is faulty:
response.css('.property_inner-title+a::attr(href)').extract()
The + is the problem: it is the adjacent-sibling combinator, so it only matches an a element that immediately follows .property_inner-title. Replace it with a space (the descendant combinator):
response.css('.property_inner-title a::attr(href)').extract()
Another issue is in your defined extract_with_css() function:
def parse_info(self, response):
    def extract_with_css(query):
        return response.css(query).extract_first().strip()
The problem here is that extract_first() returns None by default when no values are found, and .strip() is a method of the str class; since you won't always get a string back, this will throw an AttributeError.
To fix that, you can set the default value of extract_first() to an empty string instead:
def parse_info(self, response):
    def extract_with_css(query):
        return response.css(query).extract_first('').strip()
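For newer Scrapy versions, a minimal equivalent sketch using the .get() API (which replaced .extract_first()):
def parse_info(self, response):
    def extract_with_css(query):
        # .get(default='') returns '' instead of None when nothing matches,
        # so .strip() is always safe to call
        return response.css(query).get(default='').strip()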
I am using Scrapy with Python to scrape a website, and I am having some difficulty filling the item that I have created.
The products are properly scraped, and everything works well as long as the info is located within the response.xpath mentioned in the for loop.
'trend' and 'number' are properly added to the item using ItemLoader.
However, the date of the product is not located within the response.xpath cited below, but in the page title, which I can get with response.css('title').
import scrapy
import datetime
from trends.items import Trend_item
from scrapy.loader import ItemLoader

# Initiate the spider
class trendspiders(scrapy.Spider):
    name = 'milk'
    start_urls = ['https://thewebsiteforthebestmilk/ireland/2022-03-16/7/']

    def parse(self, response):
        for milk_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=milk_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            return l.load_item()
How can I add the 'date' (found in response.css('title')) to my item, please?
I have tried adding l.add_css('date', "response.css('title')") in the for loop, but it returns an error.
Should I create a new parsing function? If so, how do I send the info to the same item?
I hope I’ve made myself clear.
Thank you very much for your help,
Since the date is outside of the selector you are using for each row, you should extract it before your for loop, since it doesn't need to be updated on each iteration.
Then with your item loader you can just use l.add_value to load it with the rest of the fields.
For example:
import scrapy
from scrapy.loader import ItemLoader
from trends.items import Trend_item

class trendspiders(scrapy.Spider):
    name = 'trends'
    start_urls = ['https://getdaytrends.com/ireland/2022-03-16/7/']

    def parse(self, response):
        date_str = response.xpath("//title/text()").get()
        for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            l.add_value('date', date_str)
            yield l.load_item()
If response.css('title').get() gives you the answer you need, why not use the same CSS with add_css:
l.add_css('date', 'title')
Also, .add_css('date', "response.css('title')") is invalid because the second argument must be a valid CSS selector.
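As a small sketch of that idea (assuming the item's date field accepts the raw string), the ::text pseudo-element limits the match to the title's text rather than the whole <title> element:
l.add_css('date', 'title::text')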
I am trying to get a JSON field with key "longName" with Scrapy, but I am receiving the error: "Spider must return request, item, or None, got 'str'".
The JSON I'm trying to scrape looks something like this:
{
    "id": 5355,
    "code": 9594,
    "longName": "..."
}
This is my code:
import scrapy
import json

class NotesSpider(scrapy.Spider):
    name = 'notes'
    allowed_domains = ['blahblahblah.com']
    start_urls = ['https://blahblahblah.com/api/123']

    def parse(self, response):
        data = json.loads(response.body)
        yield from data['longName']
I get the above error when I run "scrapy crawl notes" at the prompt. Can anyone point me in the right direction?
If you only want longName, modifying your parse method like this should do the trick. (The problem is that yield from iterates over the string value and yields it one character at a time; each character is a str rather than an item, which is why Scrapy complains.)
def parse(self, response):
    data = json.loads(response.body)
    yield {"longName": data["longName"]}
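On Scrapy 2.2 or newer (an assumption about the setup here), response.json() can replace the manual json.loads call; a minimal sketch:
def parse(self, response):
    # response.json() parses the response body as JSON (Scrapy >= 2.2)
    data = response.json()
    yield {"longName": data["longName"]}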
Here is my code. To explain: first I scrape the listing links, then I follow each listing's link and parse some info, e.g. name, address, price, and number. While running it in the terminal I get some errors such as:
price = response.css('div.article_right_price::text').get().strip()
AttributeError: 'NoneType' object has no attribute 'strip'
I can still export to CSV without a problem, but one thing: this particular language is Georgian, and when I export to CSV I only see symbols which are not Georgian. :) I would be grateful if someone could help me.
import scrapy

class SsHomesSpider(scrapy.Spider):
    name = 'ss_home'
    start_urls = ['https://ss.ge/ka/udzravi-qoneba/l/bina/qiravdeba?CurrentUserId=&Query=&MunicipalityId=95&CityIdList=95&subdistr=&stId=&PrcSource=2&StatusField.FieldId=34&StatusField.Type=SingleSelect&StatusField.StandardField=Status&StatusField.SelectedValues=2&QuantityFrom=&QuantityTo=&PriceType=false&CurrencyId=2&PriceFrom=300&PriceTo=500&Context.Request.Query%5BQuery%5D=&IndividualEntityOnly=true&Fields%5B3%5D.FieldId=151&Fields%5B3%5D.Type=SingleSelect&Fields%5B3%5D.StandardField=None&Fields%5B4%5D.FieldId=150&Fields%5B4%5D.Type=SingleSelect&Fields%5B4%5D.StandardField=None&Fields%5B5%5D.FieldId=152&Fields%5B5%5D.Type=SingleSelect&Fields%5B5%5D.StandardField=None&Fields%5B6%5D.FieldId=29&Fields%5B6%5D.Type=SingleSelect&Fields%5B6%5D.StandardField=None&Fields%5B7%5D.FieldId=153&Fields%5B7%5D.Type=MultiSelect&Fields%5B7%5D.StandardField=None&Fields%5B8%5D.FieldId=30&Fields%5B8%5D.Type=SingleSelect&Fields%5B8%5D.StandardField=None&Fields%5B0%5D.FieldId=48&Fields%5B0%5D.Type=Number&Fields%5B0%5D.StandardField=None&Fields%5B0%5D.ValueFrom=&Fields%5B0%5D.ValueTo=&Fields%5B1%5D.FieldId=146&Fields%5B1%5D.Type=Number&Fields%5B1%5D.StandardField=None&Fields%5B1%5D.ValueFrom=&Fields%5B1%5D.ValueTo=&Fields%5B2%5D.FieldId=28&Fields%5B2%5D.Type=Number&Fields%5B2%5D.StandardField=Floor&Fields%5B2%5D.ValueFrom=&Fields%5B2%5D.ValueTo=&Fields%5B9%5D.FieldId=15&Fields%5B9%5D.Type=Group&Fields%5B9%5D.StandardField=None&Fields%5B9%5D.Values%5B0%5D.Value=35&Fields%5B9%5D.Values%5B1%5D.Value=36&Fields%5B9%5D.Values%5B2%5D.Value=37&Fields%5B9%5D.Values%5B3%5D.Value=38&Fields%5B9%5D.Values%5B4%5D.Value=39&Fields%5B9%5D.Values%5B5%5D.Value=40&Fields%5B9%5D.Values%5B6%5D.Value=41&Fields%5B9%5D.Values%5B7%5D.Value=42&Fields%5B9%5D.Values%5B8%5D.Value=24&Fields%5B9%5D.Values%5B9%5D.Value=27&Fields%5B9%5D.Values%5B10%5D.Value=22&Fields%5B9%5D.Values%5B11%5D.Value=20&Fields%5B9%5D.Values%5B12%5D.Value=8&Fields%5B9%5D.Values%5B13%5D.Value=6&Fields%5B9%5D.Values%5B14%5D.Value=4&Fields%5B9%5D.Values%5B15%5D.Value=5&Fields%5B9%5D.Values%5B16%5D.Value=9&Fields%5B9%5D.Values%5B17%5D.Value=3&Fields%5B9%5D.Values%5B18%5D.Value=120&AgencyId=&VipStatus=&Fields%5B9%5D.Values%5B0%5D.Selected=false&Fields%5B9%5D.Values%5B1%5D.Selected=false&Fields%5B9%5D.Values%5B2%5D.Selected=false&Fields%5B9%5D.Values%5B3%5D.Selected=false&Fields%5B9%5D.Values%5B4%5D.Selected=false&Fields%5B9%5D.Values%5B5%5D.Selected=false&Fields%5B9%5D.Values%5B6%5D.Selected=false&Fields%5B9%5D.Values%5B7%5D.Selected=false&Fields%5B9%5D.Values%5B8%5D.Selected=false&Fields%5B9%5D.Values%5B9%5D.Selected=false&Fields%5B9%5D.Values%5B10%5D.Selected=false&Fields%5B9%5D.Values%5B11%5D.Selected=false&Fields%5B9%5D.Values%5B12%5D.Selected=false&Fields%5B9%5D.Values%5B13%5D.Selected=false&Fields%5B9%5D.Values%5B14%5D.Selected=false&Fields%5B9%5D.Values%5B15%5D.Selected=false&Fields%5B9%5D.Values%5B16%5D.Selected=false&Fields%5B9%5D.Values%5B17%5D.Selected=false&Fields%5B9%5D.Values%5B18%5D.Selected=false']

    def parse(self, response):
        all_listing = response.css('div.latest_desc a::attr(href)')
        for list in all_listing:
            yield response.follow(list.get(), callback=self.parse_listings)

    def parse_listings(self, response):
        name = response.css('div.article_in_title h1::text').get()
        price = response.css('div.article_right_price::text').get().strip()
        square_m = response.css('div.WholeFartBlock text::text').get().strip()
        street = response.css('div.StreeTaddressList a::text').get().strip()
        number = response.css('div.UserMObileNumbersBlock a::attr(href)').get().strip("tel':")
        yield {
            'name': name,
            'price': price,
            'square_m': square_m,
            'street': street,
            'number': number
        }
The error you are getting is not Scrapy-related. You are calling the strip() method on a None object: your selectors are returning None instead of the string values you are expecting. Check your selectors again, and also consider using Scrapy item loaders to clean your items.
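As a hedged sketch of both issues: .get(default='') makes .strip() safe when a selector matches nothing, and (assuming the Georgian text looks broken because the CSV viewer, e.g. Excel, doesn't detect UTF-8 without a BOM) setting FEED_EXPORT_ENCODING in settings.py usually fixes the export:
# settings.py -- 'utf-8-sig' writes a BOM so spreadsheet apps detect UTF-8
FEED_EXPORT_ENCODING = 'utf-8-sig'

# in the spider -- default='' means .strip() always receives a string, never None
def parse_listings(self, response):
    price = response.css('div.article_right_price::text').get(default='').strip()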
I am trying to parse a list from a JavaScript website. When I run it, it only gives me back one entry for each column and then the spider shuts down. I have already set up my middleware settings. I am not sure what is going wrong. Thanks in advance!
import scrapy
from scrapy_splash import SplashRequest

class MalrusSpider(scrapy.Spider):
    name = 'malrus'
    allowed_domains = ['backgroundscreeninginrussia.com']
    start_urls = ['http://www.backgroundscreeninginrussia.com/publications/new-citizens-of-malta-since-january-2015-till-december-2017/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')

    def parse(self, response):
        russians = response.xpath('//table[@id="tablepress-8"]')
        for russian in russians:
            yield {'name': russian.xpath('//*[@class="column-1"]/text()').extract_first(),
                   'source': russian.xpath('//*[@class="column-2"]/text()').extract_first()}
        script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(0.3)
            button = splash:select("a[class=paginate_button next] a")
            splash:set_viewport_full()
            splash:wait(0.1)
            button:mouse_click()
            splash:wait(1)
            return {url = splash:url(),
                    html = splash:html()}
        end"""
        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})
The .extract_first() (now .get()) you used will always return only the first result. It's not an iterator, so there is no sense in calling it several times. You should try the .getall() method instead. That will be something like:
names = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-1"]/text()').getall()
sources = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-2"]/text()').getall()
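From there, a minimal sketch of pairing the two lists into items (assuming the columns line up one-to-one):
for name, source in zip(names, sources):
    yield {'name': name, 'source': source}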
I want to scrape the table on this page, but the scraped data ends up in only one column, and in some cases the data doesn't appear at all. I also used the shell to check whether the XPath is correct (I used the XPath Helper extension to identify these XPaths).
import scrapy

class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'scrape-xpath'
    start_urls = [
        'http://explorer.eu/contents/food/28?utf8=/',
    ]

    def parse(self, response):
        for flv in response.xpath('//html/body/main/div[4]'):
            yield {
                'Titulo': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[3]/th/strong/a/text()').extract(),
                'contenido': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[3]/a[2]/text()').extract(),
                'clase': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[1]/text()').extract(),
                'Subclase': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[2]/a/text()').extract(),
            }
From the example URL given, it's not exactly obvious what the values should be or how extraction should generalize to a page containing more records, so I tried a different page with multiple records; let's see if the result gets you what you need. Here's ready-to-run code:
# -*- coding: utf-8 -*-
import scrapy

class PhenolExplorerSpider(scrapy.Spider):
    name = 'phenol-explorer'
    start_urls = ['http://phenol-explorer.eu/contents/food/29?utf8=/']

    def parse(self, response):
        chromatography = response.xpath('//div[@id="chromatography"]')
        title = chromatography.xpath('.//tr/th[@class="outer"]/strong/a/text()').extract_first()
        for row in chromatography.xpath('.//tr[not(@class="header")]'):
            class_ = row.xpath('./td[@rowspan]/text()').extract_first()
            if not class_:
                class_ = row.xpath('./preceding-sibling::tr[td[@rowspan]][1]/td[@rowspan]/text()').extract_first()
            subclass = row.xpath('./td[not(@rowspan)][1]/a/text()').extract_first()
            #content = row.xpath('./td[not(@rowspan)][2]/a[2]/text()').extract_first()
            content = row.xpath('./td[not(@rowspan)][2]/text()').extract_first()
            yield {
                'title': title.strip(),
                'class': class_.strip(),
                'subclass': subclass.strip(),
                'content': content.strip(),
            }
Basically, it iterates over individual rows of the table and extracts the data from corresponding fields, yielding an item once complete information is collected.
Try this:
for row in response.css('#chromatography table tr:not(.header)'):
    yield {'titulo': row.xpath('./preceding-sibling::tr/th[contains(@class, "outer")]//a/text()').extract_first().strip(),
           'clase': row.xpath('./preceding-sibling::tr/th[contains(@class, "inner")]//text()').extract_first().strip(),
           'subclase': row.xpath('./td[2]//text()').extract_first().strip(),
           'contenido': row.css('.content_value a::text').extract_first().strip()}
Remember that the inner-loop selectors should also be relative to the node (flv in your case); selecting with // is a global selector, so it will grab matches from the entire document.
It's also better to inspect the real HTML source, because the browser might render code different from the actual HTML received (for example, the tbody tags).
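A quick way to check the real HTML that Scrapy receives (rather than the browser-rendered DOM) is the Scrapy shell; a small sketch:
scrapy shell 'http://explorer.eu/contents/food/28?utf8=/'
>>> response.xpath('//table/tbody').get()  # often None: tbody is usually added by the browser
>>> view(response)  # opens the page as Scrapy downloaded it in your browser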