Print succeeds but logging module throws exception - python

I'm trying to log the contents of a file, but I get some funny behavior from the logging module (and not only that one).
Here are the file contents:
"Testing …"
Testing å¨'æøöä
"Testing å¨'æøöä"
And here is how I open and log it:
with codecs.open(f, "r", encoding="utf-8") as myfile:
    script = myfile.read()
    log.debug("Script type: {}".format(type(script)))
    print(script)
    log.debug("{}".format(script.encode("utf8")))
The line where I log the type of the object shows up as follows in my logs:
Script type: <type 'unicode'>
Then the print ... line prints the contents correctly to the console, but the logging module throws an exception:
Traceback (most recent call last):
File "/usr/lib/python2.7/logging/__init__.py", line 882, in emit
stream.write(fs % msg.encode("UTF-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 63: ordinal not in range(128)
When I remove the .encode("utf8") bit from that last line, I get the expected exception:
'ascii' codec can't encode character u'\u2026' in position 9: ordinal not in range(128)
This is just to demonstrate the problem. It's not only the logging module: the rest of my code also throws similar exceptions when dealing with this "unicode" string.
What am I doing wrong?

Logging handles Unicode values just fine:
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> script = u'"Testing …"'
>>> logging.debug(script)
DEBUG:root:"Testing …"
(Writing to a log file will result in UTF-8 encoded messages).
Where you went wrong is by mixing byte strings and Unicode values while using str.format():
>>> "{}".format(script)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 9: ordinal not in range(128)
If you use a unicode format string, you avoid the forced implicit encoding:
>>> u"{}".format(script)
u'"Testing \u2026"'
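A minimal sketch of the same idea applied to the logging calls, assuming a hypothetical logger named `unicode_demo` that writes to an in-memory stream so the result can be inspected. It is written to run on Python 3; the `u` prefix is what matters on Python 2.7 and is simply accepted by Python 3.3+.

```python
import io
import logging

# Hypothetical logger writing to an in-memory text stream (not the asker's setup).
stream = io.StringIO()
log = logging.getLogger("unicode_demo")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(logging.StreamHandler(stream))

script = u'"Testing \u2026"'

# Unicode format string: the text value is never implicitly encoded to ASCII.
log.debug(u"Script: {}".format(script))

# Alternative: let logging interpolate the argument itself.
log.debug(u"Script: %s", script)

logged = stream.getvalue()
```

Either call keeps the message as text all the way to the handler, which is then the only place where encoding happens.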

Related

Error when printing random line from text file

I need to print a random line from the file "Long films".
My code is:
import random
with open('Long films') as f:
    lines = f.readlines()
print(random.choice(lines))
But it prints this error:
Traceback (most recent call last):
line 3, in <module>
lines = f.readlines()
line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)
What do I need to do in order to avoid this error?
The problem is not with printing, it is with reading. Your file contains non-ASCII bytes, and it is being decoded as ASCII. Try opening the file with an explicit encoding:
with open('Long films', encoding='latin-1') as f:
    ...
Also, have you changed your locale settings? Have you set an encoding declaration at the top of your file? Ordinarily, Python 3 decodes text files using the locale's preferred encoding, which is UTF-8 on most systems, so you typically should not get this error.
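As a rough illustration of how the encoding argument changes the outcome, here is a sketch that writes hypothetical sample titles to a temporary file (standing in for "Long films") and reads them back three ways. Byte 0xe2 is a common lead byte in UTF-8 (curly quotes and the ellipsis start with it), so 'utf-8' is usually worth trying before 'latin-1', which never fails but may return the wrong characters.

```python
import os
import random
import tempfile

# Hypothetical sample data; the real file contents are unknown.
titles = ["S\u00e1t\u00e1ntang\u00f3\n", "Shoah\n", "Out 1 \u2026\n"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Long films")
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(titles)

    # An ASCII decoder rejects the first non-ASCII byte it meets.
    try:
        with open(path, encoding="ascii") as f:
            f.readlines()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Naming the encoding the file was written in recovers the text exactly.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    # latin-1 maps every byte to a character, so it never raises,
    # but UTF-8 multi-byte sequences come back as mojibake.
    with open(path, encoding="latin-1") as f:
        garbled = f.readlines()

chosen = random.choice(lines)
```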

VADER-Sentiment-Analysis toolkit and decoding to UTF-8

I'm trying out this awesome sentiment analysis toolkit for python called Vader (https://github.com/cjhutto/vaderSentiment#python-code-example). However, I'm not even able to run their examples, because of a decoding problem (?).
I've tried the .decode('utf-8'), but it still gives me this error code:
Traceback (most recent call last):
File "/Users/solari/Codes/EmotionalTwitter/vader.py", line 22, in <module>
analyzer = SentimentIntensityAnalyzer()
File "/usr/local/lib/python3.6/site-packages/vaderSentiment/vaderSentiment.py", line 199, in __init__
self.lexicon_full_filepath = f.read()
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6573: ordinal not in range(128)
[Finished in 0.5s with exit code 1]
Why does it complain about this "ascii codec"? If I've read their documentation correctly, this should be UTF-8 anyway. Also, I'm using Python 3.6.2.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 81201: ordinal not in range(128)

I am trying to read a dictionary_format_file.txt but I keep getting the error shown below.
I read other posts and they made complete sense, but I still could not fix my problem.
Any help is appreciated.
import ast
path = '/Users/xyz/Desktop/final/'
filename = 'dictionary_format_text_file.txt'
with open((path+filename), 'r') as f:
    s=f.read()
s=s.encode('ascii', 'ignore').decode('ascii')
Error:
Traceback (most recent call last):
File "/Users/xyz/Desktop/final/boolean_query.py", line 347, in <module>
s=s.encode('ascii', 'ignore').decode('ascii')
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
builtins.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 81201: ordinal not in range(128)
f.read() is returning a byte string, not a Unicode string. When you call encode on it, Python 2 first tries to decode it using the 'ascii' codec with strict error handling (Python 3 would simply raise an error without trying to decode). It is that hidden decode that generates the error. You can avoid it by dropping the redundant encode:
s=s.decode('ascii', 'ignore')
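The paths in the traceback point at Python 3.4, where open() in text mode already returns str and does the decoding itself. Under that assumption, a minimal sketch (using hypothetical sample bytes written to a temporary file) of two ways to drop non-ASCII bytes with a single, explicit decode might look like this:

```python
import os
import tempfile

# Hypothetical file contents containing UTF-8 encoded non-ASCII characters.
raw = "caf\u00e9 \u2018quoted\u2019".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dictionary_format_text_file.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Option 1: read bytes, then decode once and ignore non-ASCII bytes.
    with open(path, "rb") as f:
        s = f.read().decode("ascii", "ignore")

    # Option 2: let open() decode, with the same codec and error handler.
    with open(path, "r", encoding="ascii", errors="ignore") as f:
        t = f.read()

    # If the non-ASCII characters should be kept, decode as UTF-8 instead.
    with open(path, "r", encoding="utf-8") as f:
        u = f.read()
```

Both of the first two options discard the accented letter and the curly quotes; the third keeps them, which is usually preferable when the file is known to be UTF-8.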

Decoding HappyBase data from HBase

While trying to decode values from HBase, I am getting an error. Python apparently thinks the data is not UTF-8, but the Java application that put the data into HBase encoded it as UTF-8:
a = '\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y'
a.decode("UTF-8")
Traceback (most recent call last):
File "", line 1, in
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 9: invalid continuation byte
Any thoughts?
That data is not valid UTF-8: byte 0xe8 starts a multi-byte sequence, but the next byte, 0x02, is not a continuation byte. If you really retrieved it as such from the database, you should check who or what put it in there.
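One way to confirm this is to catch the exception and look at the offending byte, then dump the value as hex rather than as text. The sketch below uses the exact bytes from the question, written as a Python 3 bytes literal; what the bytes actually represent is not known from the question, so nothing here attempts to interpret them.

```python
# The value from the question, as a bytes literal.
raw = b'\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y'

try:
    raw.decode("utf-8")
    position = None
    bad_byte = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded.
    position = exc.start
    bad_byte = raw[exc.start]

# A hex dump is a safer way to inspect data that may be binary.
hex_dump = raw.hex()
```

A run of leading zero bytes like this often indicates a binary-serialized value rather than text, but that is only a guess without knowing how the Java side wrote it.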

All of a sudden str.decode('unicode_escape') stopped working [2.7.3]

This snippet is taken from my recent Python work, and it used to work just fine:
strr = "What is th\u00e9 point?"
print strr.decode('unicode_escape')
But now it throws the unicode decoding error:
Traceback (most recent call last):
File "C:\Users\Lenon\Documents\WorkDir\pyWork\ocrFinale\F1\tests.py", line 49, in <module>
print strr.decode('unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128)
What is the possible cause of this?
You have either enabled unicode literals or created a Unicode object by some other means, by mistake.
The strr value is already a unicode object, so in order to decode the value Python first tries to encode to a byte string.
If you have an actual byte string your code works:
>>> strr = "What is th\u00e9 point?"
>>> strr.decode('unicode_escape')
u'What is th\xe9 point?'
but as soon as strr is in fact a Unicode object, you get the error as Python tries to encode the object using the default ASCII codec first:
>>> strr.decode('unicode_escape').decode('unicode_escape')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128)
It could be that you enabled unicode_literals, for example:
>>> from __future__ import unicode_literals
>>> strr = "What is th\u00e9 point?"
>>> type(strr)
<type 'unicode'>
>>> strr.decode('unicode_escape')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128)
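A defensive sketch, assuming a hypothetical helper named `unescape` (not part of any library): it only decodes when the value is still a byte string, so a value that is already Unicode is never pushed through the implicit ASCII encode. It is written to behave the same on Python 2.7, where bytes is an alias for str, and on Python 3.

```python
def unescape(value):
    """Decode backslash escapes only if value is still a byte string."""
    if isinstance(value, bytes):
        return value.decode('unicode_escape')
    # Already text: decoding again would force an implicit encode on Python 2.
    return value


from_bytes = unescape(b"What is th\\u00e9 point?")
from_text = unescape(u"What is th\xe9 point?")
```

Both calls yield the same text value, regardless of whether unicode_literals is in effect where the string was created.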
