Python UnicodeDecodeError: ascii vs utf-8 - python

Why does the following code still use "ascii" to decode the string?
Didn't I tell Python to use "utf-8" to decode the string? Plus,
how come 'ignore' did not work?
print data.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 12355:

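I assume data is a str (a byte string in Python 2), so
print isinstance(data,str)
should print True.
encode wants a unicode object, so Python first tries to decode your str to unicode using the ascii codec.
That is why you get a UnicodeDecodeError rather than a UnicodeEncodeError, and why 'ignore' has no effect: it applies only to the encode step, not to the implicit decode.
Try decoding instead: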
print data.decode("utf-8","ignore")
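For readers on Python 3, where bytes and str are separate types and the implicit ASCII step no longer exists, the same idea can be sketched roughly as follows. The sample bytes in `data` are made up for illustration; they are not the asker's data.

```python
# Hypothetical sample: UTF-8 bytes containing a non-ASCII character
# (0xc2 0xa9 is the copyright sign encoded as UTF-8).
data = b"copyright \xc2\xa9 2015"

# This mirrors what Python 2 does implicitly before str.encode():
# decoding the bytes as ASCII fails on the first byte >= 0x80.
try:
    data.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding explicitly as UTF-8 works; "ignore" only drops bytes that are invalid.
text = data.decode("utf-8", "ignore")
```

Here `ascii_error` holds the same kind of exception as in the question, raised at the position of the 0xc2 byte.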

Related

Is there a way to encode a csv file to UTF-8 in pandas?

My code: data = pd.read_csv('Downloads/samplefile.csv',low_memory=False, encoding='utf-8')
I receive the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 258663: invalid continuation byte
Any help is appreciated.
Your data file is probably not encoded in UTF-8. The byte 0xd1 is Ñ in ISO 8859-1,
so try that encoding instead:
data = pd.read_csv('Downloads/samplefile.csv',low_memory=False, encoding='iso8859-1')
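If the encoding of a file is unknown, one possible approach is to try candidate encodings in order on the raw bytes before handing the file to pandas. The sketch below uses only the standard library; the helper name `decode_with_fallback` and the sample bytes are hypothetical. Note that ISO 8859-1 accepts every byte value, so it never fails and belongs last in the list.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso8859-1")):
    """Return (text, encoding) for the first encoding that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

# Hypothetical CSV bytes: 0xd1 followed by "a" is not valid UTF-8,
# but 0xd1 is the letter N-with-tilde in ISO 8859-1.
raw = b"name,city\nPe\xd1a,Madrid\n"
text, used = decode_with_fallback(raw)
```

The encoding that succeeds (`used`) could then be passed as the `encoding` argument to `pd.read_csv`.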

How to convert a byte string with a unicode character to normal text in Python?

I have the following string in Python 3:
bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
How do I get this to a string with normal characters? That is:
'Zeer geïnteresseerd naar iemands verhalen luisteren.'
I've already tried decoding it using:
bytestring.decode('utf-8')
But when I try to print that value to the console Python gives me the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xef' in position 7: ordinal not in range(128)
Any help appreciated.
SOLUTION
I solved the problem by typing the following in the terminal:
export PYTHONIOENCODING=UTF-8
After that I was able to print the decoded bytestring to the console.
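The decode itself succeeds here; it is the print that fails, because the console stream is using ASCII. As a rough sketch for cases where the environment variable cannot be changed, the text can be escaped for whatever encoding stdout reports. The helper name `printable_for` is made up for this example.

```python
import sys

bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
text = bytestring.decode('utf-8')

def printable_for(text, encoding):
    """Re-encode text for a stream, escaping characters the encoding cannot represent."""
    return text.encode(encoding, 'backslashreplace').decode(encoding)

# sys.stdout.encoding is the value PYTHONIOENCODING overrides; it can be None for some streams.
stdout_encoding = sys.stdout.encoding or 'ascii'
print(printable_for(text, stdout_encoding))
```

With a UTF-8 console the text prints unchanged; with an ASCII console the ï is shown as an escape instead of raising UnicodeEncodeError.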
It seems like you are working with a unicode object rather than a byte string. See if this helps (Python 2 syntax). The custom function below tries to decode with UTF-8 first and falls back to Latin-1; the result is then encoded to ASCII. Note that encoding to ASCII with 'ignore' drops the ï instead of preserving it.
def CustomDecode(mystring):
    '''Accepts string and tries decode with UTF8 first and then Latin1'''
    # Rebuild a byte string from the individual character codes
    c = ''.join(map(lambda x: chr(ord(x)), mystring))
    decval = None
    try:
        decval = c.decode('utf8')
    except UnicodeDecodeError:
        decval = c.decode('latin1')
    return decval
CustomDecode(mystring).encode('ascii', 'ignore')
Result:
'Zeer genteresseerd naar iemands verhalen luisteren.'

base64 encode in js and decode in python. Unicode issue

I have the following string in JS:
"form-uploads/2015 Perry's Awärds Letter.jpg"
It contains an ä character.
When I encode it in JS using btoa (in Chrome) I get the following:
"Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw=="
And when I try to decode it in Python I get the following:
In[16]: base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==')
Out[16]: "form-uploads/2015 Perry's Aw\xe4rds Letter.jpg"
So the ä got lost, and if I try to decode that string as UTF-8 I get an error:
In[18]: base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 28: invalid continuation byte
How can I get a proper UTF-8 ä in Python after decoding?
You need to decode with the latin1 encoding and then print the resulting Unicode string:
>>> print base64.b64decode(u'Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('latin1')
form-uploads/2015 Perry's Awärds Letter.jpg
Try latin1. It can't be UTF-8, because in UTF-8 no single-byte character has its most significant bit set (like \xe4); such a byte is only valid as part of a multi-byte sequence.
base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('latin1')
Also, btoa does not work well with Unicode in general:
https://developer.mozilla.org/en/docs/Web/API/WindowBase64/Base64_encoding_and_decoding#The_Unicode_Problem
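In Python 3 the same decode can be sketched as below. The second half is an assumption about an alternative design: if the sender base64-encodes UTF-8 bytes instead of relying on btoa's one-byte-per-character behaviour, the receiver can decode as UTF-8.

```python
import base64

encoded = 'Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw=='

# b64decode returns bytes; the a-umlaut arrives as the single byte 0xe4.
raw = base64.b64decode(encoded)

# latin1 maps every byte to the code point with the same value.
name = raw.decode('latin1')

# Assumed alternative: encode the text as UTF-8 before base64, decode as UTF-8 after.
utf8_encoded = base64.b64encode(name.encode('utf-8'))
round_trip = base64.b64decode(utf8_encoded).decode('utf-8')
```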

Figuring out unicode: 'ascii' codec can't decode

I currently use Sublime 2 and run my Python code there.
When I try to run the code below, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the Python documentation on unicode, and as far as I understand this should work. Or is it the console that's not working?
Edit: Using s = u'abcdefö' as the string produces almost the same result. The error I get is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. By the time the script runs it has been compiled, and the string literal has been stored as an encoded byte string. When Python then tries to decode that string, it uses ascii by default. As the string is actually UTF-8 encoded, this fails.
You can write s = u'abcdefö', which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing at runtime.
However, this does not necessarily mean that you can print s now. First the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails to figure out the right character set and defaults to ascii again, so you get an error when the string contains non-ASCII characters. The Python Wiki describes a few ways to set up stdout properly.
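The decoding step described above can be reproduced in Python 3 terms with explicit bytes. This is only a sketch of the mechanism; the variable names are illustrative.

```python
# The bytes a UTF-8 source file holds for the literal: b'abcdef\xc3\xb6'
raw = 'abcdef\u00f6'.encode('utf-8')

# Decoding with ascii is what Python 2 does by default in unicode('...').
try:
    raw.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    # Record the offending byte and its position.
    failure = (exc.object[exc.start], exc.start)

# Naming the encoding is the runtime equivalent of unicode('...', 'utf8').
s = raw.decode('utf8')
```

The recorded failure is byte 0xc3 at position 6, the same byte and position reported in the question's traceback.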
You need to mark the string as a unicode string:
s = u'abcdefö'
s = 'abcdefö'
Do not call unicode() if the string is already unicode, i.e. unicode(s) is wrong.
If type(s) == str but it contains non-ASCII characters:
First convert to unicode:
str_val = unicode(s,'utf-8')
or, to substitute undecodable bytes:
str_val = unicode(s,'utf-8','replace')
Finally encode to string
str_val.encode('utf-8')
Now you can print:
print s

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 17710: ordinal not in range(128)

I'm trying to print a string from an archived web crawl, but when I do I get this error:
print page['html']
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 17710: ordinal not in range(128)
When I try print unicode(page['html']) I get:
print unicode(page['html'],errors='ignore')
TypeError: decoding Unicode is not supported
Any idea how I can properly code this string, or at least get it to print? Thanks.
You need to encode the unicode you saved to display it, not decode it -- unicode is the unencoded form. You should always specify an encoding, so that your code will be portable. The "usual" pick is utf-8:
print page['html'].encode('utf-8')
If you don't specify an encoding, whether or not it works will depend on what you're printing to -- your editor, OS, terminal program, etc.
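A small Python 3 sketch of the same point: encoding explicitly as UTF-8 always succeeds, while writing to a destination that only accepts ASCII fails. The string `html` is a hypothetical stand-in for page['html'], and the ASCII stream only simulates a console that defaults to ascii.

```python
import io

# Hypothetical stand-in for page['html']; \u00e7 is the c-cedilla from the traceback.
html = 'fran\u00e7ais'

# Explicit encoding to UTF-8 yields bytes and cannot fail for valid text.
utf8_bytes = html.encode('utf-8')

# Simulate a destination that only accepts ASCII.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    ascii_stream.write(html)
    ascii_stream.flush()
    write_failed = False
except UnicodeEncodeError:
    write_failed = True
```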
