I have the following string in JavaScript:
*"form-uploads/2015 Perry's Awärds Letter.jpg"*
It contains an ä character.
When I encode it in JavaScript using btoa (in Chrome) I get the following:
"Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw=="
And when I try to decode it in Python I get the following:
In[16]: base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==')
Out[16]: "form-uploads/2015 Perry's Aw\xe4rds Letter.jpg"
So the ä got lost, and if I try to decode that string as UTF-8 I get an error:
In[18]: base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 28: invalid continuation byte
How can I get a proper UTF-8 ä in Python after decoding?
You need to decode with the latin1 encoding and then print the resulting Unicode string:
>>> print base64.b64decode(u'Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('latin1')
form-uploads/2015 Perry's Awärds Letter.jpg
Try latin1. The data can't be UTF-8, because UTF-8 has no single-byte characters with the most significant bit set to 1 (like \xe4).
base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('latin1')
Also, btoa does not work well with Unicode in general:
https://developer.mozilla.org/en/docs/Web/API/WindowBase64/Base64_encoding_and_decoding#The_Unicode_Problem
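A minimal Python 3 sketch of the decode path discussed in these answers. It assumes the Base64 text came from btoa, which turns each character with a code point up to 255 into a single byte; that mapping is the same as Latin-1. The variable names are illustrative only.

```python
import base64

# The Base64 text produced by btoa in the question.
encoded = 'Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw=='

raw = base64.b64decode(encoded)   # bytes; the ä is the single byte 0xe4
text = raw.decode('latin-1')      # Latin-1 maps every byte to the same code point
utf8_bytes = text.encode('utf-8') # re-encode if UTF-8 bytes are what you need

print(text)
```

Once the text has been decoded, `text.encode('utf-8')` yields the two-byte UTF-8 form of ä (0xc3 0xa4) for storage or transmission.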
I've received a .xlsx file where foreign language characters appear to have been originally encoded as utf-8 and then decoded as utf-16.
I don't have access to the original file.
Ã© and Ã³, for example, should be read as é and ó, respectively, which are encoded as 0xC3 0xA9 and 0xC3 0xB3. Instead each single utf-8 character has been decoded at some point as two utf-16 characters.
I've tried encoding them to bytes and decoding them with UTF-8, but that doesn't translate correctly.
This:
s = "Ã³".encode("utf-16")
uni = s.decode("utf-8")
print(uni)
returns this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I've tried the above encode/decode sequence with a variety of different parameters including UTF-16be and UTF-16le, and every error parameter of decode, and just slicing off the BOM, none of which worked.
Is there a way to fix this? Some library that can make this a less painful process than doing a literal replacement by reading every string and replacing Ã³ with ó?
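One possible repair, offered as a sketch rather than a confirmed diagnosis: the two-character sequences described here are what UTF-8 bytes look like when each byte is read as a Latin-1 character (0xC3 0xA9 becomes Ã©), so encoding back to Latin-1 and decoding as UTF-8 may undo the damage. The helper name `fix_mojibake` and its `wrong_encoding` parameter are hypothetical; if the file went through Windows-1252 instead, passing 'cp1252' may be the better guess.

```python
def fix_mojibake(text, wrong_encoding='latin-1'):
    """Undo a UTF-8 -> wrong_encoding mis-decode.

    Returns the text unchanged when the round trip does not apply,
    for example when the text was never damaged in the first place.
    """
    try:
        return text.encode(wrong_encoding).decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake('Ã©'), fix_mojibake('Ã³'))
```

Applied cell by cell, this avoids maintaining a table of literal replacements, although strings that mix damaged and undamaged text would still need manual attention.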
I have the following string in Python 3:
bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
How do I get this to a string with normal characters? That is:
'Zeer geïnteresseerd naar iemands verhalen luisteren.'
I've already tried decoding it using:
bytestring.decode('utf-8')
But when I try to print that value to the console Python gives me the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xef' in position 7: ordinal not in range(128)
Any help appreciated.
SOLUTION
I solved the problem by typing the following in the terminal:
export PYTHONIOENCODING=UTF-8
After that I was able to print the decoded bytestring to the console.
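A sketch of why the environment variable helps: the decode step succeeds, and it is the encode step inside print that fails when the console encoding is ASCII. The example below imitates both kinds of console with in-memory streams; `ascii_console` and `utf8_console` are made-up names, not real terminals.

```python
import io

bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
text = bytestring.decode('utf-8')  # succeeds: the bytes are valid UTF-8

# Stand-in for a console whose encoding is ASCII.
ascii_console = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    ascii_console.write(text)
    ascii_console.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True  # same error class as in the question

# Stand-in for a console whose encoding is UTF-8.
utf8_console = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
utf8_console.write(text)
utf8_console.flush()
written = utf8_console.buffer.getvalue()  # the bytes a terminal would receive
```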
It seems you are working with unicode rather than str. See if this helps: the custom function below decodes first with UTF-8 and falls back to Latin-1; the result is then encoded to ASCII, which drops any non-ASCII characters.
def CustomDecode(mystring):
    '''Accepts string and tries decode with UTF8 first and then Latin1'''
    c = ''.join(map(lambda x: chr(ord(x)), mystring))
    decval = None
    try:
        decval = c.decode('utf8')
    except UnicodeDecodeError:
        decval = c.decode('latin1')
    return decval

CustomDecode(mystring).encode('ascii', 'ignore')
Result:
'Zeer genteresseerd naar iemands verhalen luisteren.'
Why does the following code still use "ascii" to decode the string?
Didn't I tell Python to use "utf-8"? Also,
why did ignore not work?
print data.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 12355:
I assume data is a str
print isinstance(data,str)
should print True.
encode expects a unicode object, so Python 2 first tries to decode your str to unicode using the ascii codec,
which is why you get a UnicodeDecodeError rather than a UnicodeEncodeError.
Try:
print data.decode("utf-8","ignore")
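The hidden step can be made explicit in Python 3, where bytes have no encode method and the decode has to be written out. This is only an illustration: `data` below is a made-up sample, not the asker's input.

```python
# Made-up sample: UTF-8 bytes for "price: £ 10" (0xc2 0xa3 is the pound sign).
data = b'price: \xc2\xa3 10'

# Roughly what Python 2 did behind the scenes before str.encode(...):
try:
    data.decode('ascii')
    implicit_step_failed = False
except UnicodeDecodeError:
    implicit_step_failed = True  # fails before 'utf-8' or 'ignore' is ever used

# Decoding explicitly with the real encoding works.
text = data.decode('utf-8', 'ignore')
print(text)
```

The 'ignore' handler in the original call never came into play because it was passed to the encode step, while the failure happened in the implicit decode that ran first.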
I currently use Sublime 2 and run my python code there.
When I try to run this code, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the Python documentation on unicode and as far as I understand this should work. Or is it the console that's not working?
Edit: Using s = u'abcdefö' as a string produces almost the same result. The result I get is
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the byte string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. By the time the script runs it has been compiled, and the literal has been stored as an encoded byte string. When Python then decodes that string without being given an encoding, it uses ascii by default. As the string is actually UTF-8 encoded, this fails.
You can write s = u'abcdefö', which tells the compiler to decode the literal with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing at runtime.
However, this does not necessarily mean that you can print s now. First the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails to figure out the right character set and defaults to ascii again, so you get an error when the string contains non-ASCII characters. The Python Wiki describes a few ways to set up stdout properly.
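The decode stage can be reproduced in Python 3, where byte strings and text are separate types. In this sketch `raw` stands in for what a Python 2 'abcdefö' literal holds when the file is saved as UTF-8; the names are illustrative.

```python
# Bytes that a UTF-8 source file stores for the literal 'abcdefö'.
raw = 'abcdefö'.encode('utf-8')

# Decoding with the ascii default fails at the first non-ASCII byte.
try:
    raw.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the offending byte

# Decoding with the encoding the file was actually saved in works.
s = raw.decode('utf-8')

print(failed_at, len(raw), len(s))
```

The failing index matches the "position 6" in the question's traceback: ö occupies two bytes in UTF-8, and the first of them (0xc3) sits right after the six ASCII letters.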
You need to mark the string as a unicode string:
s = u'abcdefö'
s = 'abcdefö'
Do NOT call unicode() if the string is already unicode; i.e. unicode(s) is wrong in that case.
If type(s) == str but it contains non-ASCII characters:
First convert to unicode:
str_val = unicode(s, 'utf-8')
or, to substitute undecodable bytes instead of raising an error:
str_val = unicode(s, 'utf-8', 'replace')
Finally encode back to a byte string:
str_val = str_val.encode('utf-8')
Now you can print:
print str_val
I'm trying to print a string from an archived web crawl, but when I do I get this error:
print page['html']
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 17710: ordinal not in range(128)
When I try to convert it with unicode() and errors='ignore', I get:
print unicode(page['html'],errors='ignore')
TypeError: decoding Unicode is not supported
Any idea how I can properly code this string, or at least get it to print? Thanks.
You need to encode the unicode you saved to display it, not decode it -- unicode is the unencoded form. You should always specify an encoding, so that your code will be portable. The "usual" pick is utf-8:
print page['html'].encode('utf-8')
If you don't specify an encoding, whether or not it works will depend on what you're printing to -- your editor, OS, terminal program, etc.
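A small Python 3 sketch of encoding explicitly before output. The helper `encode_for_output` is hypothetical, and `html` is a made-up stand-in for page['html']; the 'backslashreplace' handler is one way to keep output going when the target encoding cannot represent a character.

```python
def encode_for_output(text, encoding='utf-8'):
    """Encode text for a byte-oriented sink.

    Characters the codec cannot represent are written as backslash
    escapes instead of raising UnicodeEncodeError.
    """
    return text.encode(encoding, errors='backslashreplace')


html = 'fran\xe7ais'  # contains u'\xe7', the character from the traceback

as_utf8 = encode_for_output(html)            # b'fran\xc3\xa7ais'
as_ascii = encode_for_output(html, 'ascii')  # b'fran\\xe7ais'
```

With UTF-8 the handler is effectively never triggered for ordinary text, so the fallback only matters for narrower encodings such as ASCII.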