I've received a .xlsx file where foreign language characters appear to have been originally encoded as UTF-8 and then decoded as UTF-16.
I don't have access to the original file.
Ã© and Ã³, for example, should be read as é and ó, respectively, which are encoded in UTF-8 as 0xC3 0xA9 and 0xC3 0xB3. Instead, each single UTF-8 character has been decoded at some point as two UTF-16 characters.
I've tried encoding them to bytes and decoding them with UTF-8, but that doesn't translate correctly.
This:
s = "Ã³".encode("utf-16")
uni = s.decode("utf-8")
print(uni)
returns this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I've tried the above encode/decode sequence with a variety of different parameters including UTF-16be and UTF-16le, and every error parameter of decode, and just slicing off the BOM, none of which worked.
Is there a way to fix this? Is there some library that makes this less painful than doing a literal replacement, reading every string and replacing Ã³ with ó?
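One possible repair, sketched under the assumption that the text was actually mis-decoded with a single-byte codec such as cp1252 or Latin-1 rather than UTF-16 (the Ã³ pattern is typical of that): re-encode each string with the wrong codec to recover the original bytes, then decode those bytes as UTF-8. The name fix_mojibake and its wrong_codec parameter are made up for this illustration.

```python
def fix_mojibake(text, wrong_codec="cp1252"):
    """Undo a UTF-8 -> wrong_codec mis-decoding, if the text fits that pattern.

    wrong_codec is an assumption about what went wrong upstream; the
    text is returned unchanged when the round trip is not possible.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled_o = "\u00c3\u00b3"  # the two characters "Ã³"
garbled_e = "\u00c3\u00a9"  # the two characters "Ã©"
fixed_o = fix_mojibake(garbled_o)
fixed_e = fix_mojibake(garbled_e)
print(fixed_o, fixed_e)
```

Strings that are already correct usually fail the round trip and come back untouched, but that is not guaranteed for every input, so it may be worth spot-checking the output. The third-party ftfy package is aimed at this kind of repair and may be worth a look if the damage is less uniform.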
I have the following string in Python 3:
bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
How do I get this to a string with normal characters? That is:
'Zeer geïnteresseerd naar iemands verhalen luisteren.'
I've already tried decoding it using:
bytestring.decode('utf-8')
But when I try to print that value to the console Python gives me the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xef' in position 7: ordinal not in range(128)
Any help appreciated.
SOLUTION
I solved the problem by typing the following in the terminal:
export PYTHONIOENCODING=UTF-8
After that I was able to print the decoded bytestring to the console.
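A small sketch of why the environment variable matters: the decode step was already correct, and the failure comes from the output stream's encoding. Here an in-memory stream stands in for the console (an assumption made so the example runs anywhere); write_to_stream is a hypothetical helper.

```python
import io

bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
text = bytestring.decode('utf-8')  # decoding itself succeeds


def write_to_stream(s, encoding):
    """Write s to an in-memory text stream that uses the given encoding."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(s)
    stream.flush()
    return buffer.getvalue()


# A UTF-8 stream accepts the text and yields the original bytes.
utf8_output = write_to_stream(text, 'utf-8')

# An ASCII stream, like a console with an ASCII default, rejects the "ï".
try:
    write_to_stream(text, 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Setting PYTHONIOENCODING=UTF-8 has the effect of giving sys.stdout the first kind of stream instead of the second.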
It seems like you are working with unicode rather than a byte string. See if this helps: the custom function below (Python 2) tries to decode with UTF-8 first and falls back to Latin-1; the result is then encoded to ASCII.
def CustomDecode(mystring):
    '''Accepts string and tries decode with UTF8 first and then Latin1'''
    c = ''.join(map(lambda x: chr(ord(x)), mystring))
    decval = None
    try:
        decval = c.decode('utf8')
    except UnicodeDecodeError:
        decval = c.decode('latin1')
    return decval
CustomDecode(mystring).encode('ascii', 'ignore')
Result:
'Zeer genteresseerd naar iemands verhalen luisteren.'
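Since the question is about Python 3, here is a rough Python 3 sketch of the same fallback idea, operating on bytes. The name decode_with_fallback and its encodings parameter are illustrative only. Note that encoding to ASCII with 'ignore' silently drops the ï, which is probably not what the asker wants.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return data decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reachable with a custom encodings tuple: latin-1 accepts any byte.
    raise ValueError("none of the encodings could decode the data")


bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
decoded = decode_with_fallback(bytestring)

# Lossy step from the answer above: non-ASCII characters are discarded.
ascii_only = decoded.encode("ascii", "ignore").decode("ascii")
```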
I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))
I am not really sure why this doesn't work; I thought that with line.encode('utf-8') the whole string would be encoded.
I also tried to encode my m.groups to UTF-8, but I got the same UnicodeDecodeError.
[unicodedecodeerror: 'ascii' codec can't decode byte in position
ordinal not in range(128)]
Sample input:
Start: myUsername: myÜsername:
What am I missing ?
EDIT_
Traceback (most recent call last):
File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.
You have two problems; one you're hitting now, and one you'll hit if you fix your current code.
Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not unicode, so encoding it implicitly decodes with Python's default encoding (ASCII; this isn't locale specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if not specified). Basically, line was already UTF-8 encoded and you told it to encode again as UTF-8; that's nonsensical, so Python tried to decode as ASCII first, and failed before it even tried to encode as you instructed.
The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.
The reason:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode and encode are mostly pointless: all they do is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
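The implicit decode-then-encode can be imitated in Python 3 with bytes, which makes both the failure and the "expensive no-op" visible. This is only a rough model of the Python 2 behaviour described here; py2_style_encode and its default parameter are invented for the illustration.

```python
def py2_style_encode(raw, codec='utf-8', default='ascii'):
    """Model of Python 2 str.encode: implicit decode, then the real encode."""
    return raw.decode(default).encode(codec)


raw = 'my\u00dcsername'.encode('utf-8')  # UTF-8 bytes, like a Python 2 str

# With the ASCII default, the implicit decode fails before any encoding.
try:
    py2_style_encode(raw)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# With the default switched to UTF-8 (what setdefaultencoding does),
# the round trip just hands back the same bytes.
round_tripped = py2_style_encode(raw, default='utf-8')
```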
To fix the second problem, just change:
m.group(4).encode()
to:
m.group(4)
That leaves your final code as:
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:
try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))
which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
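For comparison, a sketch of the same substitution under Python 3, which is not the asker's environment. There, str is Unicode text and hashlib only accepts bytes, so the situation is reversed: the line stays a str and only the matched group is encoded before hashing. The function name hash_last_field is hypothetical.

```python
import hashlib
import re

PATTERN = re.compile(r'(Start:\s*)([^:]+)(:\s*)([^:]+)')


def hash_last_field(line):
    """Replace the fourth group with the SHA-512 hex digest of its UTF-8 bytes."""
    def replace(m):
        digest = hashlib.sha512(m.group(4).encode('utf-8')).hexdigest()
        return m.group(1) + m.group(2) + m.group(3) + digest
    return PATTERN.sub(replace, line)


line = 'Start: myUsername: my\u00dcsername:'  # the sample input from the question
result = hash_last_field(line)
```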
I found what is, in my eyes, a workaround.
It doesn't feel right, but it does the job.
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
I thought it could be done with .encode('utf-8')
file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()
This is because a unicode object must be encoded to bytes before hashing.
I have the following string in JS.
*"form-uploads/2015 Perry's Awärds Letter.jpg"*
It contains an ä character.
When I encode it in JS using btoa (in Chrome), I get the following:
"Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw=="
And when I try to decode it in Python, I get the following:
In[16]: base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==')
Out[16]: "form-uploads/2015 Perry's Aw\xe4rds Letter.jpg"
So the ä got lost, and if I try to decode that string as UTF-8 I get an error.
In[18]: base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 28: invalid continuation byte
How can I get a proper UTF-8 ä in Python code after decoding?
You need to decode with the latin1 encoding and then print the resulting Unicode string:
>>> print base64.b64decode(u'Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('latin1')
form-uploads/2015 Perry's Awärds Letter.jpg
Try latin1. It can't be UTF-8, because UTF-8 has no single-byte characters with the most significant bit set to 1 (like \xe4).
base64.b64decode('Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw==').decode('latin1')
Also, btoa does not work well with Unicode in general:
https://developer.mozilla.org/en/docs/Web/API/WindowBase64/Base64_encoding_and_decoding#The_Unicode_Problem
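Putting the two answers together, a minimal Python 3 sketch: decode the base64 payload, interpret the bytes as Latin-1 (which matches what btoa produced for this particular string), and re-encode as UTF-8 if UTF-8 bytes are what you need downstream. The variable names are illustrative.

```python
import base64

payload = 'Zm9ybS11cGxvYWRzLzIwMTUgUGVycnkncyBBd+RyZHMgTGV0dGVyLmpwZw=='

raw = base64.b64decode(payload)   # bytes; the ä is the single byte 0xe4
text = raw.decode('latin1')       # Unicode text with a real ä
utf8_bytes = text.encode('utf-8') # the ä becomes the two bytes 0xc3 0xa4

print(text)
```

This only holds while every character in the JS string fits in one byte; for anything outside that range btoa raises an error in the browser, which is the problem the MDN page above discusses.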
I think I'm just fundamentally confused about char sets that are not ascii.
I have a python file that I have declared at the top to be # -*- coding: cp1252 -*-.
In the file I have question = "what is your borther’s name", for example.
type(question)
>> str
question
>> 'what is your borther\xe2\x80\x99s name'
And I cannot convert to unicode at this point, presumably because you can't go from ASCII to Unicode.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 20: ordinal not in range(128)
If I declare it as unicode to begin with:
question = u"what is your borther’s name"
>> u'what is your borther\u2019s name'
How do I get "what is your borther’s name" back? Or is that just how the Python interpreter displays unicode strings, and it will in fact be encoded correctly when I pass it to a Unicode-aware application (in this case, Office)?
I need to preserve the special characters but I still need to do a string comparison using Levenshtein library (pip install python-Levenshtein).
Levenshtein.ratio takes str or unicode for both of its arguments, but not mixed.
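A short Python 3 sketch of the two points raised here: the \u2019 form is only an escaped display of the same character, and mixed argument types can be avoided by converting everything to text before comparing. It assumes the bytes shown above are UTF-8 (the \xe2\x80\x99 sequence is the UTF-8 encoding of ’); to_text is a hypothetical helper, and the Levenshtein call itself is left out because that package is third-party.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b'what is your borther\xe2\x80\x99s name'
question = to_text(raw)

# ascii() shows the escaped form; the string itself holds the real character.
escaped = ascii(question)
print(escaped)
print(question)

# Both arguments now have the same type, so a comparison function that
# rejects mixed str/bytes input can be given to_text(a), to_text(b).
same_type = type(to_text(raw)) is type(to_text(question))
```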
I have a plain text file that I have declared at the top to be # -*- coding: cp1252 -*-.
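That does nothing for a plain text file; a coding declaration only tells Python how to decode a Python source file. Open the file with an explicit encoding instead:
import codecs
with codecs.open(..., encoding='cp1252') as fp:
    ...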
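A self-contained sketch of the same idea that can be run as-is: it writes a cp1252-encoded temporary file (a stand-in for the asker's text file, whose real path is not given) and reads it back with an explicit encoding. io.open is used here because it behaves the same on Python 2 and 3; on Python 3 the built-in open accepts the same encoding argument.

```python
import io
import os
import tempfile

text = u'what is your borther\u2019s name'

# Create a cp1252-encoded file; in cp1252 the curly apostrophe is byte 0x92.
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
    tmp.write(text.encode('cp1252'))
    path = tmp.name

try:
    # Decoding happens while reading, so read_back is already Unicode text.
    with io.open(path, encoding='cp1252') as fp:
        read_back = fp.read()
finally:
    os.remove(path)
```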