Python & Scrapy output: "\r\n\t\t\t\t\t\t\t"

I'm learning scraping with Scrapy and having some issues with code giving me weird output that I don't understand. Can someone explain why I am getting a bunch of "\r\n\t\t\t\t\t\t\t" strings?
I found this solution on Stack Overflow:
Remove an '\n\t\t\t'-element from list
But I want to learn what is causing it.
Here is the code that is causing my issue. The strip() method from the link above solves it, but as mentioned, I don't understand where the characters come from.
import scrapy
import logging
import re

class CitySpider(scrapy.Spider):
    name = 'city'
    allowed_domains = ['www.a-tembo.nl']
    start_urls = ['https://www.a-tembo.nl/themas/category/city/']

    def parse(self, response):
        titles = response.xpath("//div[@class='hikashop_category_image']/a")
        for title in titles:
            series = title.xpath(".//@title").get()
            link = title.xpath(".//@href").get()
            #absolute_url = f"https://www.a-tembo.nl{link}"
            #absolute_url = response.urljoin(link)
            yield response.follow(link, callback=self.parse_title)

    def parse_title(self, response):
        rows = response.xpath("//table[@class='hikashop_products_table adminlist table']/tbody/tr")
        for row in rows:
            product_code = row.xpath(".//span[@class='hikashop_product_code']/text()").get()
            product_name = row.xpath(".//span[@class='hikashop_product_name']/a/text()").get()
            yield {
                "Product_code": product_code,
                "Product_name": product_name
            }

Characters like \n are called escape sequences.
For example, \n indicates a newline and \t a tab. HTML source is full of them: they are the invisible whitespace used to indent the markup, and you never see them without inspecting the page source. When a text() selector grabs a text node, that formatting whitespace comes along with (or instead of) the visible text, which is exactly what strip() removes. If you want to learn more about escape sequences in Python you can read about them here. I hope that answers your question.
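To see exactly where those characters come from and how to get rid of them at extraction time, here is a minimal self-contained sketch (the HTML snippet is made up for demonstration; normalize-space() is a standard XPath function and strip() a standard Python string method):

from scrapy.selector import Selector

# Indented markup, as served by a typical pretty-printed page
html = "<div>\n\t\t\tABC-123\n</div>"

sel = Selector(text=html)

# The raw text node includes the surrounding formatting whitespace
raw = sel.xpath("//div/text()").get()
print(repr(raw))    # '\n\t\t\tABC-123\n'

# Option 1: clean it up in Python
print(raw.strip())  # 'ABC-123'

# Option 2: let XPath collapse the whitespace for you
print(sel.xpath("normalize-space(//div)").get())  # 'ABC-123'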

Related

Scrapy one item with multiple parsing functions

I am using Scrapy with Python to scrape a website, and I am having some difficulty filling the item that I have created.
The products are properly scraped and everything works well as long as the info is located within the response.xpath mentioned in the for loop.
'trend' and 'number' are properly added to the item using ItemLoader.
However, the date of the product is not located within the response.xpath cited below but in the page title, reachable via response.css('title').
import scrapy
import datetime
from trends.items import Trend_item
from scrapy.loader import ItemLoader

# Initiate the spider
class trendspiders(scrapy.Spider):
    name = 'milk'
    start_urls = ['https://thewebsiteforthebestmilk/ireland/2022-03-16/7/']

    def parse(self, response):
        for milk_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=milk_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            yield l.load_item()
How can I add the 'date' (found in response.css('title')) to my item, please?
I have tried adding l.add_css('date', "response.css('title')") in the for loop, but it returns an error.
Should I create a new parsing function? If yes, how do I send the info to the same item?
I hope I've made myself clear.
Thank you very much for your help.
Since the date is outside of the selector you are using for each row, you should extract it once, before your for loop; it doesn't need to be re-extracted on each iteration.
Then with your item loader you can use l.add_value to load it alongside the rest of the fields.
For example:
import scrapy
from scrapy.loader import ItemLoader
from trends.items import Trend_item

class trendspiders(scrapy.Spider):
    name = 'trends'
    start_urls = ['https://getdaytrends.com/ireland/2022-03-16/7/']

    def parse(self, response):
        date_str = response.xpath("//title/text()").get()
        for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            l.add_value('date', date_str)
            yield l.load_item()
If response.css('title').get() gives you the value you need, why not use the same CSS selector with add_css:
l.add_css('date', 'title')
Also, .add_css('date', "response.css('title')") is invalid because the second argument must be a valid CSS selector string, not a Python expression.
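If you need the date as an actual date object rather than the raw title text, you can parse it after extraction. A minimal sketch, assuming the page title contains a date in a form like '16 March 2022' (the exact title format is an assumption here):

import re
from datetime import datetime

def parse_date_from_title(title_text):
    # Assumption: the title embeds a date like "16 March 2022"
    match = re.search(r"\d{1,2} \w+ \d{4}", title_text)
    if match:
        return datetime.strptime(match.group(), "%d %B %Y").date()
    return None

# Example (hypothetical title string):
# parse_date_from_title("Trends in Ireland on 16 March 2022")  ->  datetime.date(2022, 3, 16)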

Trouble scraping with Scrapy

Here is my code, guys. To explain: first I scraped the listing links, then I yielded requests to go through every listing link and parse some info, e.g. name, address, price, number. While running it in the terminal I get some errors such as:
price = response.css('div.article_right_price::text').get().strip()
AttributeError: 'NoneType' object has no attribute 'strip'
but I can still export to CSV without a problem. One more thing: this particular language is Georgian, and when I export to CSV I only see symbols which are not Georgian :)) I would be grateful if someone could help me.
import scrapy

class SsHomesSpider(scrapy.Spider):
    name = 'ss_home'
start_urls = ['https://ss.ge/ka/udzravi-qoneba/l/bina/qiravdeba?CurrentUserId=&Query=&MunicipalityId=95&CityIdList=95&subdistr=&stId=&PrcSource=2&StatusField.FieldId=34&StatusField.Type=SingleSelect&StatusField.StandardField=Status&StatusField.SelectedValues=2&QuantityFrom=&QuantityTo=&PriceType=false&CurrencyId=2&PriceFrom=300&PriceTo=500&Context.Request.Query%5BQuery%5D=&IndividualEntityOnly=true&Fields%5B3%5D.FieldId=151&Fields%5B3%5D.Type=SingleSelect&Fields%5B3%5D.StandardField=None&Fields%5B4%5D.FieldId=150&Fields%5B4%5D.Type=SingleSelect&Fields%5B4%5D.StandardField=None&Fields%5B5%5D.FieldId=152&Fields%5B5%5D.Type=SingleSelect&Fields%5B5%5D.StandardField=None&Fields%5B6%5D.FieldId=29&Fields%5B6%5D.Type=SingleSelect&Fields%5B6%5D.StandardField=None&Fields%5B7%5D.FieldId=153&Fields%5B7%5D.Type=MultiSelect&Fields%5B7%5D.StandardField=None&Fields%5B8%5D.FieldId=30&Fields%5B8%5D.Type=SingleSelect&Fields%5B8%5D.StandardField=None&Fields%5B0%5D.FieldId=48&Fields%5B0%5D.Type=Number&Fields%5B0%5D.StandardField=None&Fields%5B0%5D.ValueFrom=&Fields%5B0%5D.ValueTo=&Fields%5B1%5D.FieldId=146&Fields%5B1%5D.Type=Number&Fields%5B1%5D.StandardField=None&Fields%5B1%5D.ValueFrom=&Fields%5B1%5D.ValueTo=&Fields%5B2%5D.FieldId=28&Fields%5B2%5D.Type=Number&Fields%5B2%5D.StandardField=Floor&Fields%5B2%5D.ValueFrom=&Fields%5B2%5D.ValueTo=&Fields%5B9%5D.FieldId=15&Fields%5B9%5D.Type=Group&Fields%5B9%5D.StandardField=None&Fields%5B9%5D.Values%5B0%5D.Value=35&Fields%5B9%5D.Values%5B1%5D.Value=36&Fields%5B9%5D.Values%5B2%5D.Value=37&Fields%5B9%5D.Values%5B3%5D.Value=38&Fields%5B9%5D.Values%5B4%5D.Value=39&Fields%5B9%5D.Values%5B5%5D.Value=40&Fields%5B9%5D.Values%5B6%5D.Value=41&Fields%5B9%5D.Values%5B7%5D.Value=42&Fields%5B9%5D.Values%5B8%5D.Value=24&Fields%5B9%5D.Values%5B9%5D.Value=27&Fields%5B9%5D.Values%5B10%5D.Value=22&Fields%5B9%5D.Values%5B11%5D.Value=20&Fields%5B9%5D.Values%5B12%5D.Value=8&Fields%5B9%5D.Values%5B13%5D.Value=6&Fields%5B9%5D.Values%5B14%5D.Value=4&Fields%5B9%5D.Values%5B15%5D.Value=5&Fields%5B9%5D.Values%5B16%5D.Value=9&Fields%5B9%5D.Values%5B17%5D.Value=3&Fields%5B9%5D.Values%5B18%5D.Value=120&AgencyId=&VipStatus=&Fields%5B9%5D.Values%5B0%5D.Selected=false&Fields%5B9%5D.Values%5B1%5D.Selected=false&Fields%5B9%5D.Values%5B2%5D.Selected=false&Fields%5B9%5D.Values%5B3%5D.Selected=false&Fields%5B9%5D.Values%5B4%5D.Selected=false&Fields%5B9%5D.Values%5B5%5D.Selected=false&Fields%5B9%5D.Values%5B6%5D.Selected=false&Fields%5B9%5D.Values%5B7%5D.Selected=false&Fields%5B9%5D.Values%5B8%5D.Selected=false&Fields%5B9%5D.Values%5B9%5D.Selected=false&Fields%5B9%5D.Values%5B10%5D.Selected=false&Fields%5B9%5D.Values%5B11%5D.Selected=false&Fields%5B9%5D.Values%5B12%5D.Selected=false&Fields%5B9%5D.Values%5B13%5D.Selected=false&Fields%5B9%5D.Values%5B14%5D.Selected=false&Fields%5B9%5D.Values%5B15%5D.Selected=false&Fields%5B9%5D.Values%5B16%5D.Selected=false&Fields%5B9%5D.Values%5B17%5D.Selected=false&Fields%5B9%5D.Values%5B18%5D.Selected=false']
    def parse(self, response):
        all_listing = response.css('div.latest_desc a::attr(href)')
        for listing in all_listing:
            yield response.follow(listing.get(), callback=self.parse_listings)

    def parse_listings(self, response):
        name = response.css('div.article_in_title h1::text').get()
        price = response.css('div.article_right_price::text').get().strip()
        square_m = response.css('div.WholeFartBlock text::text').get().strip()
        street = response.css('div.StreeTaddressList a::text').get().strip()
        number = response.css('div.UserMObileNumbersBlock a::attr(href)').get().strip("tel':")
        yield {
            'name': name,
            'price': price,
            'square_m': square_m,
            'street': street,
            'number': number
        }
The error you are getting is not Scrapy related: you are calling the method strip() on a None object. Your selectors are returning None instead of the string value you are expecting. Check your selectors again, and also consider using Scrapy item loaders to clean your items.
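For the two symptoms specifically, here is a minimal sketch using standard Scrapy features (the selector is copied from the question and may itself still need fixing): get(default='') substitutes an empty string when nothing matches, so strip() can no longer run on None, and FEED_EXPORT_ENCODING = 'utf-8-sig' prefixes the CSV with a byte-order mark so spreadsheet programs detect the UTF-8 encoding and render the Georgian text; the file is most likely valid UTF-8 already, and it is the viewer that misreads it.

# In settings.py: the BOM lets Excel and similar tools detect UTF-8
FEED_EXPORT_ENCODING = 'utf-8-sig'

# In parse_listings: default to '' so .strip() never receives None
price = response.css('div.article_right_price::text').get(default='').strip()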

Scrapy CSV incorrectly formatted

I'm new to the scrapy package, and here's my problem:
import scrapy

class simpleSpider(scrapy.Spider):
    name = "simple_spider"
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        token = response.css("input[name=csrf_token] ::attr(value)").extract_first()
        formdata = {
            'csrf_token': token,
            'username': 'rseiji',
            'password': 'seiji1234'
        }
        yield scrapy.FormRequest(response.url, formdata=formdata, callback=self.parse_logged)

    def parse_logged(self, response):
        yield {
            'text': response.css('span.text::text').extract(),
            'author': response.css('small.author::text').extract(),
            'tags': response.css('div.tags a.tag::text').extract()
        }
This is my spider, and it does work. But when I try:
scrapy crawl simple_spider -o mySpider.csv
the .csv file doesn't seem to be correctly formatted. It extracts only the "text" column.
What's wrong?
Thank you!
Edited: this is my .csv file:
text,author,tags
"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,“It is our choices, Harry, that show what we truly are, far more than our abilities.”,“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”,“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”,“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”,“Try not to become a man of success. Rather become a man of value.”,“It is better to be hated for what you are than to be loved for what you are not.”,“I have not failed. I've just found 10,000 ways that won't work.”,“A woman is like a tea bag; you never know how strong it is until it's in hot water.”,“A day without sunshine is like, you know, night.”","Albert Einstein,J.K. Rowling,Albert Einstein,Jane Austen,Marilyn Monroe,Albert Einstein,André Gide,Thomas A. Edison,Eleanor Roosevelt,Steve Martin","change,deep-thoughts,thinking,world,abilities,choices,inspirational,life,live,miracle,miracles,aliteracy,books,classic,humor,be-yourself,inspirational,adulthood,success,value,life,love,edison,failure,inspirational,paraphrased,misattributed-eleanor-roosevelt,humor,obvious,simile"
...
I figured out now that it is not that there are empty columns; the .csv file is just not well structured. Everything came out in just one row!
Solved!
import scrapy

class simpleSpider(scrapy.Spider):
    name = "simple_spider"
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        formdata = {
            'username': 'rseiji',
            'password': 'seiji1234'
        }
        yield scrapy.FormRequest.from_response(response, formdata=formdata, callback=self.parse_logged)

    def parse_logged(self, response):
        # Get list of Selector objects and loop through them
        for quote in response.css('div.quote'):
            # yield each item individually
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'author_goodreads_url': quote.css('span a[href*="goodreads.com"]::attr(href)').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract()
            }
The problem was that I was using extract() on the whole response. What I wanted was a list of Selector objects, one per quote.
extract() always produces a list as output: you get a list of strings of the HTML (or text) you requested with the selector, while extract_first() gives you a single string. By using neither extract() nor extract_first() you keep a list of selectors, which you can iterate through and chain new selectors on, allowing you to pick out each individual item and yield one item (and therefore one CSV row) per quote.
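To make the difference concrete, here is a minimal self-contained sketch (the HTML is a made-up stand-in for the quotes page):

from scrapy.selector import Selector

html = ('<div class="quote"><span class="text">Quote one</span></div>'
        '<div class="quote"><span class="text">Quote two</span></div>')
sel = Selector(text=html)

# extract() on the whole page flattens everything into parallel lists,
# which the CSV exporter then writes as one giant row:
print(sel.css('span.text::text').extract())  # ['Quote one', 'Quote two']

# Keeping one Selector per quote groups the fields, so each yielded
# dict becomes its own CSV row:
for quote in sel.css('div.quote'):
    print(quote.css('span.text::text').extract_first())  # 'Quote one', then 'Quote two'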

Scrapy Table column and rows doesn't work

I want to scrape the table on this page, but the scraped data ends up in only one column, and in some cases the data doesn't appear at all. I also used the shell to check that the XPath is correct (I used the XPath Helper extension to identify these XPaths).
import scrapy

class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'scrape-xpath'
    start_urls = [
        'http://explorer.eu/contents/food/28?utf8=/',
    ]

    def parse(self, response):
        for flv in response.xpath('//html/body/main/div[4]'):
            yield {
                'Titulo': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[3]/th/strong/a/text()').extract(),
                'contenido': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[3]/a[2]/text()').extract(),
                'clase': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[1]/text()').extract(),
                'Subclase': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[2]/a/text()').extract(),
            }
From the example URL given, it's not exactly obvious what the values should be or how extraction should generalize for a page containing more records, so I tried a different page with multiple records; let's see if the result gets you what you need. Here's ready-to-run code:
# -*- coding: utf-8 -*-
import scrapy

class PhenolExplorerSpider(scrapy.Spider):
    name = 'phenol-explorer'
    start_urls = ['http://phenol-explorer.eu/contents/food/29?utf8=/']

    def parse(self, response):
        chromatography = response.xpath('//div[@id="chromatography"]')
        title = chromatography.xpath('.//tr/th[@class="outer"]/strong/a/text()').extract_first()
        for row in chromatography.xpath('.//tr[not(@class="header")]'):
            class_ = row.xpath('./td[@rowspan]/text()').extract_first()
            if not class_:
                class_ = row.xpath('./preceding-sibling::tr[td[@rowspan]][1]/td[@rowspan]/text()').extract_first()
            subclass = row.xpath('./td[not(@rowspan)][1]/a/text()').extract_first()
            #content = row.xpath('./td[not(@rowspan)][2]/a[2]/text()').extract_first()
            content = row.xpath('./td[not(@rowspan)][2]/text()').extract_first()
            yield {
                'title': title.strip(),
                'class': class_.strip(),
                'subclass': subclass.strip(),
                'content': content.strip(),
            }
Basically, it iterates over the individual rows of the table, extracts the data from the corresponding fields, and yields an item once the complete information is collected.
Try this:
for row in response.css('#chromatography table tr:not(.header)'):
    yield {
        'titulo': row.xpath('./preceding-sibling::tr/th[contains(@class, "outer")]//a/text()').extract_first().strip(),
        'clase': row.xpath('./preceding-sibling::tr/th[contains(@class, "inner")]//text()').extract_first().strip(),
        'subclase': row.xpath('./td[2]//text()').extract_first().strip(),
        'contenido': row.css('.content_value a::text').extract_first().strip()
    }
Remember that the inner-loop selectors should be relative to the node (flv in your case): selecting with // is a global selector, so it grabs everything in the whole document rather than just the subtree of the current node.
It's also better to inspect the real HTML source, because the browser may render code that differs from the HTML actually received (for example, browsers insert tbody tags).
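A minimal self-contained sketch of that global-versus-relative difference (element names are illustrative):

from scrapy.selector import Selector

html = "<div id='a'><span>one</span></div><div id='b'><span>two</span></div>"
sel = Selector(text=html)

for div in sel.xpath('//div'):
    # '//' ignores the current node and searches the whole document
    print(div.xpath('//span/text()').getall())   # ['one', 'two'] both times
    # './/' stays inside the current div
    print(div.xpath('.//span/text()').getall())  # ['one'], then ['two']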

Scrapy Pull Same Data from Multiple Pages

This is related to a previous question I wrote here. I am trying to pull the same data from multiple pages on the same domain. A small explanation: I'm trying to pull data like offensive yards, turnovers, etc. from a bunch of different box scores linked from a main page. Pulling the data from individual pages works properly, as does generating the URLs, but when I try to have the spider cycle through all of the pages, nothing is returned. I've looked through many other questions people have asked, and the documentation, and I can't figure out what is not working. Code is below. Thanks in advance to anyone who's able to help.
import scrapy
from scrapy import Selector
from nflscraper.items import NflscraperItem

class NFLScraperSpider(scrapy.Spider):
    name = "pfr"
    allowed_domains = ['www.pro-football-reference.com/']
    start_urls = [
        "http://www.pro-football-reference.com/years/2015/games.htm"
        #"http://www.pro-football-reference.com/boxscores/201510110tam.htm"
    ]

    def parse(self, response):
        for href in response.xpath('//a[contains(text(),"boxscore")]/@href'):
            item = NflscraperItem()
            url = response.urljoin(href.extract())
            request = scrapy.Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request

    def parse_dir_contents(self, response):
        item = response.meta['item']
        # Code to pull out JS comment - https://stackoverflow.com/questions/38781357/pro-football-reference-team-stats-xpath/38781659#38781659
        extracted_text = response.xpath('//div[@id="all_team_stats"]//comment()').extract()[0]
        new_selector = Selector(text=extracted_text[4:-3].strip())
        # Item population
        item['home_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[2]/td[last()]/text()').extract()[0].strip()
        item['away_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[1]/td[last()]/text()').extract()[0].strip()
        item['home_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[2]/text()').extract()[0].strip()
        item['away_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[1]/text()').extract()[0].strip()
        item['home_dyds'] = item['away_oyds']
        item['away_dyds'] = item['home_oyds']
        item['home_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[2]/text()').extract()[0].strip()
        item['away_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[1]/text()').extract()[0].strip()
        yield item
The subsequent requests you make are filtered as offsite; fix your allowed_domains setting:
allowed_domains = ['pro-football-reference.com']
With the trailing slash the value is not a valid domain name, so the offsite middleware never matches it and drops every followed request. Worked for me.
