Scrapy: ascii' codec can't encode characters

Scrapy: ascii' codec can't encode characters - python

I am having problem on running my crawler
UnicodeEncodeError: 'ascii' codec can't encode characters in position
I am using this code
author = str(info.css(".author::text").extract_first())
but still I am having that error any idea how can solve it?
Thank you!
Here's the error
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line
102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/middlewares.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/app/__main__.egg/teslamotorsclub_spider/spiders/teslamotorsclub.py", line 40, in parse
author = str(info.css(".author::text").extract_first())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Try:
author = info.css(".author::text").extract_first().decode('utf-8')
The reason for this is extract_first returns a raw bytes object. To convert this to a string, python makes no guesses as to how it's encoded, therefore, you need to make that explicit. Utf-8 will handle just about anything you throw at it.

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 49: for textacy

I am using the textacy method to get synonyms.
import textacy.resources
rs = textacy.resources.ConceptNet()
syn=rs.get_synonyms('happy')
I get the below error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\resources\concept_net.py", line 353, in get_synonyms
return self._get_relation_values(self.synonyms, term, lang=lang, sense=sense)
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\resources\concept_net.py", line 338, in synonyms
self._synonyms = self._get_relation_data("/r/Synonym", is_symmetric=True)
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\resources\concept_net.py", line 162, in _get_relation_data
for row in rows:
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\io\csv.py", line 96, in read_csv
for row in csv_reader:
File "C:\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 49: character maps to <undefined>
I have tried to enforce encoding='utf8' in both concept_net.py", line 162, and io\csv.py", line 96, in read_csv, but that gives another error
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
What can be done ?

Giving an error when extracting cities from a text using geograpy(Python)

I am trying to extract city name from a text but it is giving an error.
Here is my code:
import geograpy
text = 'I am from Delhi'
places = geograpy.get_place_context(text=text)
print(places.cities)
ERROR:
Traceback (most recent call last):
File "C:/Users/M.B.C. Kadawatha/PycharmProjects/NewsFeed/NLP.py", line 17, in <module>
places = geograpy.get_place_context(text=text)
File "C:\Users\M.B.C. Kadawatha\PycharmProjects\NewsFeed\venv\lib\site-packages\geograpy\__init__.py", line 11, in get_place_context
pc.set_cities()
File "C:\Users\M.B.C. Kadawatha\PycharmProjects\NewsFeed\venv\lib\site-packages\geograpy\places.py", line 137, in set_cities
self.populate_db()
File "C:\Users\M.B.C. Kadawatha\PycharmProjects\NewsFeed\venv\lib\site-packages\geograpy\places.py", line 30, in populate_db
for row in reader:
File "C:\Users\M.B.C. Kadawatha\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 274: character maps to <undefined>

This should be a bug in the package geograpy.
In geograpy/places.py, revise:
with open(cur_dir+"/data/GeoLite2-City-Locations.csv", "rb") as info:
to
with open(cur_dir+"/data/GeoLite2-City-Locations.csv", "r", encoding="utf-8") as info:
And the problem of encoding should go away.

rendering a file in python using pygal - ascii code error

I am trying to create a pygal chart in python and saving it to a .svg file.
#Creating pygal charts
pie_chart = pygal.Pie(style=DarkSolarizedStyle, legend_box_size = 20, pretty_print=True)
pie_chart.title = 'Github-Migration Status Chart (in %)'
pie_chart.add('Intro', int(intro))
pie_chart.add('Parallel', int(parallel))
pie_chart.add('In Progress', int(in_progress) )
pie_chart.add('Complete', int(complete))
pie_chart.render_to_file('../../../../../usr/share/nginx/html/TeamFornax/githubMigration/OverallProgress/overallProgress.svg')
This simple piece of code seems to give the error -
> Traceback (most recent call last): File
> "/home/ec2-user/githubr/migrationcharts.py", line 161, in <module>
> pie_chart.render_to_file('../../../../../usr/share/nginx/html/TeamFornax/githubMigration/OverallProgress/overallProgress.svg')
> File "/usr/lib/python2.6/site-packages/pygal/ghost.py", line 149, in
> render_to_file
> f.write(self.render(is_unicode=True, **kwargs)) File "/usr/lib/python2.6/site-packages/pygal/ghost.py", line 112, in render
> .render(is_unicode=is_unicode)) File "/usr/lib/python2.6/site-packages/pygal/graph/base.py", line 293, in
> render
> is_unicode=is_unicode, pretty_print=self.pretty_print) File "/usr/lib/python2.6/site-packages/pygal/svg.py", line 271, in render
> self.root, **args) File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1010, in
> tostring
> return string.join(data, "") File "/usr/lib64/python2.6/string.py", line 318, in join
> return sep.join(words) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 40: ordinal not in range(128)
Any idea why ?

try to decode the path string to unicode that you send to render_to_file.
such like:
pie_chart.render_to_file('path/to/overallProgress.svg'.decode('utf-8'))
the decoding charset should be consistent with your file encoding.

How to avoid scrapy UnicodeEncodeError

I have the following code in my parse_item callback:
sel = Selector(response)
item['name'] = sel.xpath('//div[#class="productDescriptionBlock"]/h2/text()').extract()[0]
return item
But I get UnicodeEncodeError:
exceptions.UnicodeEncodeError: 'charmap' codec can't encode character u'\uff01' in position 271761: character maps to <undefined>
I also tried adding .encode('utf-8') but still get the same error.
Traceback (most recent call last):
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
for x in result:
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 124, in extract_links
).encode(response.encoding)
File "/home/scraper/.fakeroot/lib/python2.7/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
exceptions.UnicodeEncodeError: 'charmap' codec can't encode character u'\x99' in position 349751: character maps to <undefined>

I've seen this before. If I'm not wrong, you are using the restrict_xpaths parameter in your rule's link extractor.
Possible solutions are:
Avoid to use restrict_xpaths for that particular site. This happens because the page content contains characters not defined in the declared encoding.
Identify the invalid characters and replace them before the rule acts on it. This can be tricky, though.
Use the middleware in this answer to re-encode the response into its declared encoding: UnicodeEncodeError after setting restrict_xpaths settings

Scrapy response.replace encoding error

I am trying to replace the response body of a search result block of a search result page of google using response.replace() and I face some encoding issues.
scrapy shell "http://www.google.de/search?q=Zuckerccc"
>>> srb = hxs.select("//li[#class='g']").extract()
>>> body = '<html><body>' + srb[0] + '</body></html>' # get only 1st search result block
>>> b = response.replace(body = body)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 54, in replace
return Response.replace(self, *args, **kwargs)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 77, in replace
return cls(*args, **kwargs)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 31, in __init__
super(TextResponse, self).__init__(*args, **kwargs)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 19, in __init__
self._set_body(body)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 48, in _set_body
self._body = body.encode(self._encoding)
File "../local_1/Linux-2.6c2.5-x86_64/Python/Python-147.0-0/lib/python2.6/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0131' in position 529: character maps to <undefined>
I tried to create my own response as well,
>>> x = HtmlResponse("http://www.google.de/search?q=Zuckerccc", body = body, encoding = response.encoding)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 31, in __init__
super(TextResponse, self).__init__(*args, **kwargs)
self._set_body(body)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 48, in _set_body
self._body = body.encode(self._encoding)
File "../local_1/Linux-2.6c2.5-x86_64/Python/Python-147.0-0/lib/python2.6/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0131' in position 529: character maps to <undefined>
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 19, in __init__
Also, when I use _body_declared_encoding() for encoding in replace() function, it works.
replace(body = body, encoding = response._body_declared_encoding())
I don't understand why response._body_declared_encoding() and response.encoding are different. Can anybody please shed some light on this.
So, what will be a good way to fix this ?

I successfully replaced the response body with these lines of code:
scrapy shell "http://www.google.de/search?q=Zuckerccc"
>>> google_result = response.xpath('//li[#class="g"]').extract()[0]
>>> body = '<html><body>' + google_result + '</body></html>'
>>> b = response.replace(body = body)

I check the source code from scrapy.http.response.text , when we use TextResponse, we need to tell self._encoding first. So we can do like this:
>>>response._encoding='utf8'
>>>response._set_body("aaaaaa")
>>>response.body
>>>'aaaaaa'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scrapy: ascii' codec can't encode characters - python

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 49: for textacy

Giving an error when extracting cities from a text using geograpy(Python)

rendering a file in python using pygal - ascii code error

How to avoid scrapy UnicodeEncodeError

Scrapy response.replace encoding error

Categories

Resources