Running this code on Ubuntu 10.10 with Python 3.1.1, I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
And the position of the error changes depending on when I run the following code:
(not the real keys or secret)
sandboxAPIKey = "wed23hf5yxkbmvr9jsw323lkv5g"
sandboxSharedSecret = "98HsIjh39z"
import hashlib
import time

def buildAuthParams():
    authHash = hashlib.md5()
    # encode because md5().update() needs a binary representation of the string
    temp = (sandboxAPIKey + sandboxSharedSecret + repr(int(time.time()))).encode()
    print(temp)
    authHash.update(temp)
    # look at the string representation of the binary digest
    print(authHash.digest())
    # now I want to look at the string representation of the digest
    print(bytes.decode(authHash.digest()))
Here is the output of a run (with the signature and key information changed from the real output):
b'sdwe5yxkwewvr9j343434385gkbH4343h4343dz129443643474'
b'\x945EM3\xf5\xa6\xf6\x92\xd1\r\xa5K\xa3IO'
print(bytes.decode(authHash.digest()))
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
I assume I am not getting something right with my call to decode, but I cannot figure out what it is. The printed authHash.digest() looks valid to me.
I would really appreciate any ideas on how to get this to work.
When you try to decode a bytes object into a string, Python tries to match the bytes sequentially against valid characters of an encoding (by default, UTF-8). The exception is raised because the byte sequence cannot be matched to a valid character in UTF-8.
The same happens if you try to decode it as ASCII: any value greater than 127 is an invalid ASCII character.
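For instance, reproducing the failure with a single byte in a Python 3 session:

>>> b'\x94'.decode('utf8')     # 0x94 cannot start a UTF-8 sequence
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
>>> b'\x94'.decode('ascii')    # and 0x94 > 127, so ASCII fails too
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0: ordinal not in range(128)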
So, if you want a printable version of the MD5 hash, use hexdigest(). This is the standard way of printing any kind of hash: each byte is represented by two hexadecimal digits.
In order to do this you can use:
authHash.hexdigest()
If you need to use it in a URL, you probably want to encode the digest bytes as base64:
base64.b64encode(authHash.digest())
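A minimal sketch in Python 3 putting both together (the hash input here is a stand-in, not the real key material):

import base64
import hashlib

authHash = hashlib.md5(b'example key material')
print(authHash.hexdigest())                  # 32 hex characters, always printable
print(base64.b64encode(authHash.digest()))   # shorter, but may contain '+' and '/'
# for URLs, base64.urlsafe_b64encode(authHash.digest()) avoids those characters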
Related
I'm a beginner at Python. I would like to read multiple CSV files, and when I read them with encoding="ISO-8859-1" I get characters like "D°faut" in my CSV file. So I tried utf-8 instead and got this error: 'utf-8' codec can't decode byte 0xb0 in position 14: invalid start byte.
Can someone help me please ?
Thank you !
If you decode with utf-8 you should also encode with utf-8.
Depending on the Unicode character you want to store (basically everything except basic Latin letters, digits and the usual symbols), UTF-8 needs multiple bytes. Since the file is read byte by byte, the decoder needs to know whether the current character spans more than one byte, and that is signalled in the high bits of each byte. 0xb0 is 1011 0000 in binary; a byte whose top bits are 10 is a continuation byte in UTF-8, which may only appear inside a multi-byte sequence, never at the start of a character. Since your data was actually encoded with iso-8859-1, where 0xb0 stands alone for °, decoding it as UTF-8 fails with "invalid start byte".
The degree symbol (°), for example, is encoded in UTF-8 as the two bytes 0xC2 0xB0.
In any case: always encode with the same encoding you want to decode with. If you need characters outside the code page, use utf-8. In general, any of the UTF encodings is a good choice.
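A quick illustration of the mismatch in a Python 3 session:

>>> '°'.encode('iso-8859-1')
b'\xb0'
>>> '°'.encode('utf-8')
b'\xc2\xb0'
>>> b'\xb0'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte
>>> b'\xb0'.decode('iso-8859-1')
'°'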
I need to decode a byte string into unicode, and insert it into a unicode string (python 2.7). When I later encode that unicode string back into bytes, the byte array must be equal to the original bytes. My question is which encoding I should use to achieve this.
Example:
#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("ascii"))
backToBytes = unicodeString.encode("ascii")
assert byteString==backToBytes
This fails with the infamous:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 128: ordinal not in range(128)
What encoding should I use here (instead of 'ascii') to preserve my byte values?
I am using "ascii" in this (currently broken) example, because it is my default encoding:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
It turns out that the 'latin1' (a.k.a. 'iso-8859-1') encoding preserves every byte literally. This link mentions the fact, although other sources led me to believe it was false. I can confirm that the following code runs on Python 2.7, demonstrating that 'iso-8859-1' does indeed preserve every possible byte:
#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("iso-8859-1"))
backToBytes = unicodeString.encode("iso-8859-1")
assert byteString==backToBytes
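For what it's worth, the Python 3 equivalent is even simpler, since bytes(range(256)) builds the same byte string directly (a minimal sketch):

# every possible byte, Python 3
byteString = bytes(range(256))
unicodeString = byteString.decode("iso-8859-1")
backToBytes = unicodeString.encode("iso-8859-1")
assert byteString == backToBytes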
The string I have is
u'3.4\xa2 / each'
The '\xa2' is the "cent" symbol, and I want to show it that way.
I tried
i = "3.4\xa2 / each"
print unicode(i, errors='replace')
In the result, the cent symbol is shown as a question mark inside a solid circle.
I also tried
i = "3.4\xa2 / each"
print i.encode('utf-8')
I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa2 in position 3: ordinal not in range(128)
So what is the right way to accomplish this?
'\xa2' is a byte. It may be interpreted as a cent symbol, but only if you specify the right codec; only then can you decode it to the equivalent Unicode codepoint. Latin-1 would do:
>>> print '\xa2'.decode('latin1')
¢
There are, however, a whole series of encodings that encode the cent codepoint ¢ as the byte 0xA2.
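For example, in a Python 2 session (cp1252 agrees with Latin-1 at this codepoint):

>>> u'\xa2'.encode('latin1')
'\xa2'
>>> u'\xa2'.encode('cp1252')
'\xa2'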
Alternatively, start with a Unicode string to begin with. \xa2 in a Unicode string literal is the same thing as \u00a2, which happens to be the right codepoint:
>>> print u'\xa2'
¢
>>> print u'\u00a2'
¢
That's because the first 256 codepoints of the Unicode standard happen to match the Latin-1 (ISO-8859-1) standard.
You may still have trouble printing: when you use a terminal or console, print is supposed to automatically encode Unicode data to match the terminal configuration, but that configuration may not use a codec that can handle the characters you are trying to print!
Note that I decoded. If you encode instead, Python tries to be helpful by first decoding the bytes to a Unicode object so they can be encoded afterwards. Because \xa2 is not a valid ASCII byte, that implicit decode fails.
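You can see that implicit decode fail directly in a Python 2 session:

>>> '\xa2'.encode('utf8')   # really does '\xa2'.decode('ascii').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa2 in position 0: ordinal not in range(128)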
You may want to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
before continuing.
A few points:
encode is a method that converts unicode strings to bytes. If you call encode on a byte string, Python 2 first tries to decode it with ASCII and then encode it. That is where your error comes from.
Your string cannot be decoded with UTF-8, because not every sequence of bytes is valid UTF-8.
Demo:
>>> "3.4\xa2 / each".decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 3: invalid start byte
You can use the latin-1 encoding here, because it maps every byte to the corresponding unicode ordinal.
Demo:
>>> print("3.4\xa2 / each".decode('latin-1'))
3.4¢ / each
You can try:
print "3.4" + u"\u00A2" + " / each"
Works for me.
I have a byte array as a str and want to send it (looking at it in a debugger shows that File.body is a str). To do that I have to build the message to send.
request_text += '\n'.join([
'',
'--%s' % boundary_id,
attachment_headers,
File.body,
])
But as soon as it tries to join in the file body, I get this exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Still, the example I took this from was implemented this way. How do I make a Python string work with a byte string? Should I decode it somehow? But how, when it is not text, just bytes?
You are probably getting the error because your string contains non-ASCII characters. The following is an example of how to encode/decode a string containing non-ASCII characters:
1) Convert the string to unicode:
string = "helloé"
u = unicode(string, 'utf-8')
2) Encode the string to utf-8 before sending it over the network:
encoded = u.encode('utf-8')
3) Decode it from utf-8 back to unicode on the other side:
encoded.decode('utf-8')
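Putting the three steps together (a minimal Python 2 sketch; the literal assumes the source file itself is saved as UTF-8):

# -*- coding: utf-8 -*-
string = "helloé"                      # byte string, as read from a UTF-8 source
u = unicode(string, 'utf-8')           # 1) to unicode
encoded = u.encode('utf-8')            # 2) to UTF-8 bytes for the network
assert encoded.decode('utf-8') == u    # 3) the receiver gets the unicode back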
I am receiving a JSON string, pass it through json.loads, and end up with an array of unicode strings. That's all well and good. One of the strings in the array is:
u'\xc3\x85sum'
This should translate into 'Åsum' when decoded using decode('utf8'), but instead I get an error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
To test what's wrong, I did the following:
>>> 'Åsum'.encode('utf8')
'\xc3\x85sum'
>>> print '\xc3\x85sum'.decode('utf8')
Åsum
So that worked fine, but if I make it a unicode string, as json.loads does, I get the same error:
>>> print u'\xc3\x85sum'.decode('utf8')
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
I tried doing json.loads(jsonstring, encoding='utf8'), but that changes nothing.
Is there a way to solve it? Can I make json.loads not produce unicode, or make it decode using 'utf8' as I ask it to?
Edit:
The original string I receive looks like this, or at least the part that causes trouble does:
"\\u00c3\\u0085sum"
You already have a Unicode value, so trying to decode it forces an encode first, using the default codec.
It looks like you received malformed JSON instead; JSON values are already unicode. If you have UTF-8 bytes stored in your Unicode values, the only way to recover is to encode to Latin-1 (which maps the first 256 codepoints to bytes one-to-one), then decode from that as UTF-8:
>>> print u'\xc3\x85sum'.encode('latin1').decode('utf8')
Åsum
The better solution is to fix the JSON source, however; it should not doubly-encode to UTF-8. The correct representation would be:
>>> json.dumps(u'Åsum')
'"\\u00c5sum"'