Python unicode: what codec will preserve literal byte values?

I need to decode a byte string into unicode, and insert it into a unicode string (python 2.7). When I later encode that unicode string back into bytes, the byte array must be equal to the original bytes. My question is which encoding I should use to achieve this.
Example:
#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("ascii"))
backToBytes = unicodeString.encode("ascii")
assert byteString==backToBytes
This fails with the infamous:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 128: ordinal not in range(128)
What encoding should I use here (instead of 'ascii') to preserve my byte values?
I am using "ascii" in this (currently broken) example, because it is my default encoding:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

It turns out that the 'latin1' (aka 'iso-8859-1') encoding will preserve every byte literally. This link mentions this fact, although other sources led me to believe this was false. I can confirm that running this code on python 2.7 works, demonstrating that 'iso-8859-1' does indeed preserve every possible byte:
#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("iso-8859-1"))
backToBytes = unicodeString.encode("iso-8859-1")
assert byteString==backToBytes
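The reason this works is that Latin-1 maps each byte value 0..255 to the code point with the same number, so no byte is undefined and none is altered. As a rough sketch, the property can be checked for any candidate codec. The helper name round_trips is made up for illustration, and the bytes are built through bytearray so that the snippet should run on both Python 2.7 and Python 3:

```python
# All 256 possible byte values; bytes(bytearray(...)) works on Python 2.7 and 3.
all_bytes = bytes(bytearray(range(256)))

def round_trips(codec):
    """Return True if decoding then re-encoding with `codec` preserves every byte."""
    try:
        return all_bytes.decode(codec).encode(codec) == all_bytes
    except UnicodeError:
        # The codec rejects at least one byte value or byte sequence.
        return False

latin1_text = all_bytes.decode("latin-1")
# Each byte becomes the code point with the same ordinal.
assert [ord(ch) for ch in latin1_text] == list(range(256))

for codec in ("latin-1", "ascii", "utf-8"):
    print("%s round-trips: %s" % (codec, round_trips(codec)))
```

Of the three codecs tried here, only latin-1 passes: ascii stops at the first byte above 127, and utf-8 rejects bytes that do not form a valid sequence.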

Related

Python string cent symbol conversion

The string I have is
u'3.4\xa2 / each'
The '\xa2' is the "cent" symbol, and I want to show it that way.
I tried
i= "3.4\xa2 / each"
print unicode(i, errors='replace')
In the result, the cent symbol is shown as a question mark inside a solid circle.
I also tried
i= "3.4\xa2 / each"
print i.encode('utf-8')
I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa2 in position 3: ordinal not in range(128)
So what is the right way to accomplish this?
'\xa2' is a byte. It may be interpreted as a cent symbol, but only if you specify the right codec. Only by specifying the right codec can you decode it to the equivalent Unicode codepoint. Latin-1 would do:
>>> print '\xa2'.decode('latin1')
¢
There are a whole series of encodings that encode the ¢ cent codepoint as A2, however.
Alternatively, start with a Unicode string. \xa2 in a Unicode string literal is the same thing as \u00a2, which happens to be the right codepoint:
>>> print u'\xa2'
¢
>>> print u'\u00a2'
¢
That's because the first 256 codepoints of the Unicode standard happen to match the Latin-1 (ISO-8859-1) standard.
You may have trouble printing; if you are using a terminal or console, print is supposed to automatically encode Unicode data to match your terminal or console configuration, but that may not always be correct or be set to a codec that can handle the characters you are trying to print!
Note that I decoded. If you encode, Python tries to be helpful and decodes the bytes to a Unicode object first, so that it can be encoded afterwards. Because \xa2 is not a valid ASCII byte, that decoding failed.
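To make the point about codecs concrete, here is a small sketch comparing how a few codecs interpret the byte A2. The choice of cp1252 and iso-8859-2 as comparison codecs is mine, not the answer's, and the u"" literals are used so the snippet should run on Python 2.7 as well as Python 3.3+:

```python
raw = b"\xa2"

# Latin-1 and Windows-1252 both place the cent sign at byte 0xA2.
assert raw.decode("latin-1") == u"\u00a2"
assert raw.decode("cp1252") == u"\u00a2"

# ISO-8859-2 assigns a different character (U+02D8, breve) to the same byte.
assert raw.decode("iso-8859-2") == u"\u02d8"

# In a Unicode literal, \xa2 and \u00a2 denote the same code point.
assert u"\xa2" == u"\u00a2"
assert ord(u"\xa2") == 0xA2

cent = raw.decode("latin-1")
```

The byte only becomes a cent sign once a codec that defines it that way is named explicitly.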
You may want to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
before continuing.
A few points:
encode is a method to convert unicode strings to bytes. If you call encode on a byte string, Python 2 will first try to decode it with ASCII and then encode the result. That's where your error is coming from.
Your string cannot be decoded with UTF-8, because not every sequence of bytes is valid UTF-8.
Demo:
>>> "3.4\xa2 / each".decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 3: invalid start byte
You can use the latin-1 encoding here, because it maps every byte to the corresponding unicode ordinal.
Demo:
>>> print("3.4\xa2 / each".decode('latin-1'))
3.4¢ / each
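The three behaviours discussed for this string can be put side by side. This is only a sketch: it uses UTF-8 for the failing and replacing cases, whereas the question relied on the ASCII default, and it avoids the print statement so it should run on Python 2.7 and Python 3 alike:

```python
raw = b"3.4\xa2 / each"

# Strict UTF-8 decoding fails: 0xA2 is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The "replace" error handler substitutes U+FFFD, the replacement character,
# which many fonts draw as a question mark inside a dark shape.
replaced = raw.decode("utf-8", "replace")

# Latin-1 interprets 0xA2 as the cent sign.
text = raw.decode("latin-1")

# Encode explicitly when bytes in a particular encoding are required.
utf8_bytes = text.encode("utf-8")
```

The replacement character is most likely what the question describes as a question mark inside a solid shape.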
You can try:
print "3.4" + u"\u00A2" +"each"
Works for me.

Encode byte string in utf8

How do I encode a byte string with non-ascii bytes into the utf8 format? For example:
x = zlib.compress(pickle.dumps(numpy.random.rand(10, 10)))
# What to do here?
y = x.encode('utf8')
This will give me an error saying that some bytes are not in range(128). What am I supposed to do?
You have to decide what code-points a non-ASCII byte refers to. For instance, what code-point does the byte 0xA1 refer to?
For instance, you could use any of the iso-8859-X encodings:
bytes = chr(161)
utf8 = bytes.decode('iso-8859-1').encode('utf-8')
# compare with: utf8 = bytes.decode('iso-8859-2').encode('utf-8')
Note that the choice of encoding makes a difference: decoded as iso-8859-1, the byte 0xA1 ends up as the UTF-8 bytes '\xc2\xa1', but decoded as iso-8859-2 it ends up as '\xc4\x84'.
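A sketch of the two conversions side by side, with the reverse step added to show that the original byte can be recovered when the same codec is used in both directions. The helper name recode_to_utf8 is hypothetical:

```python
def recode_to_utf8(raw, source_codec):
    """Decode `raw` with `source_codec`, then encode the text as UTF-8."""
    return raw.decode(source_codec).encode("utf-8")

raw = b"\xa1"
via_latin1 = recode_to_utf8(raw, "iso-8859-1")  # U+00A1 -> b'\xc2\xa1'
via_latin2 = recode_to_utf8(raw, "iso-8859-2")  # U+0104 -> b'\xc4\x84'

# Reversing the same two steps recovers the original byte.
restored = via_latin1.decode("utf-8").encode("iso-8859-1")
```

For arbitrary binary data such as the compressed pickle in the question, iso-8859-1 is a safe choice for the intermediate step because it is defined for every byte value.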

Append bytes to string in python

I have a byte array stored as a str and want to send it (in a debugger, File.body shows up as str). To do that, I have to build the message to send.
request_text += '\n'.join([
'',
'--%s' % boundary_id,
attachment_headers,
File.body,
])
But as soon as it tries to join the file body, I get an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Still, the example I took this from was implemented this way. How do I get Python to treat the string as a byte string? Should I decode it somehow? But how, if it is not text, just bytes?
You are probably getting the error as your string has non-ascii characters. The following is an example on how to encode/decode a string containing non-ascii characters
1) convert the string to unicode
string="helloé"
u=unicode(string, 'utf-8')
2) Encode the string to utf-8 before sending it on the network
encoded = u.encode('utf-8')
3) Decode it from utf-8 to unicode on the other side
encoded.decode('utf-8')
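For the multipart case in the question, one plausible cause (an assumption, since the question does not show the types involved) is that request_text, boundary_id or attachment_headers is a unicode object, which makes Python 2 decode File.body with ASCII during the join. A minimal sketch of the alternative is to encode the text parts and join everything as bytes. All values below are hypothetical stand-ins for the names in the question:

```python
# Hypothetical stand-ins for the names used in the question.
boundary_id = u"example-boundary"
attachment_headers = u'Content-Disposition: form-data; name="file"; filename="photo.jpg"'
file_body = b"\xff\xd8\xff\xe0"  # raw bytes, not text

parts = [
    b"",
    (u"--%s" % boundary_id).encode("ascii"),
    attachment_headers.encode("ascii"),
    file_body,  # already bytes: no encode or decode
]
request_body = b"\n".join(parts)
```

The binary payload is never decoded here; only the textual pieces are encoded, so no implicit ASCII conversion can be triggered.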

Convert hash.digest() to unicode

import hashlib
string1 = u'test'
hashstring = hashlib.md5()
hashstring.update(string1)
string2 = hashstring.digest()
unicode(string2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 1: ordinal not in range(128)
The string HAS to be unicode for it to be any use to me, can this be done?
Using python 2.7 if that helps...
Ignacio just gave the perfect answer. Just a complement: when you convert some string from an encoding which has chars not found in ASCII to unicode, you have to pass the encoding as a parameter:
>>> unicode("órgão")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> unicode("órgão", "UTF-8")
u'\xf3rg\xe3o'
If you cannot say what the original encoding is (UTF-8 in my example), you really cannot convert to Unicode. That is a sign that something is not quite right in what you are trying to do.
Last but not least, encodings are pretty confusing stuff. This comprehensive text about them can make them clear.
The result of .digest() is a bytestring¹, so converting it to Unicode is pointless. Use .hexdigest() if you want a readable representation.
¹ Some bytestrings can be converted to Unicode, but the bytestrings returned by .digest() do not contain textual data. They can contain any byte including the null byte: they're usually not printable without using escape sequences.
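As a sketch of both options for the digest in the question: binascii.hexlify is used here in place of .hexdigest() only so that the same lines yield a unicode string on Python 2.7 and a str on Python 3, and the Latin-1 variant is included as the reversible but unreadable alternative.

```python
import binascii
import hashlib

digest = hashlib.md5(u"test".encode("utf-8")).digest()  # 16 raw bytes

# Hex text: the same characters .hexdigest() returns, decoded to unicode.
hex_text = binascii.hexlify(digest).decode("ascii")

# Reversible but unreadable: one code point per byte via Latin-1.
opaque_text = digest.decode("latin-1")
assert opaque_text.encode("latin-1") == digest
```

The hex form is the one meant for display or storage as text; the Latin-1 form only makes sense if the exact bytes must be recovered later.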

Encoding and decoding in Python with MD5( )

Running this code on Ubuntu 10.10 in Python 3.1.1
I am getting the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
And the position of the error changes depending on when I run the following code:
(not the real keys or secret)
import hashlib
import time
sandboxAPIKey = "wed23hf5yxkbmvr9jsw323lkv5g"
sandboxSharedSecret = "98HsIjh39z"
def buildAuthParams():
    authHash = hashlib.md5()
    # encoding because the update on md5() needs a binary rep of the string
    temp = str.encode(sandboxAPIKey + sandboxSharedSecret + repr(int(time.time())))
    print(temp)
    authHash.update(temp)
    # look at the string representation of the binary digest
    print(authHash.digest())
    # now I want to look at the string representation of the digest
    # (this is the line that raises UnicodeDecodeError)
    print(bytes.decode(authHash.digest()))
Here is the output of a run (with the sig and key information changed from the real output)
b'sdwe5yxkwewvr9j343434385gkbH4343h4343dz129443643474'
b'\x945EM3\xf5\xa6\xf6\x92\xd1\r\xa5K\xa3IO'
print(bytes.decode(authHash.digest()))
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
I assume I am getting something wrong in my call to decode, but I cannot figure out what it is. The printed output of authHash.digest() looks valid to me.
I would really appreciate any ideas on how to get this to work.
When you decode a bytes object into a string, Python matches the bytes sequentially against the valid sequences of the chosen encoding (UTF-8 by default). The exception is raised because a digest is arbitrary binary data, and some of its bytes do not form a valid UTF-8 sequence.
The same happens if you try to decode it as ASCII: any byte value greater than 127 is not a valid ASCII character.
So, if you want a printable version of the MD5 hash, use its hex digest. This is the standard way of printing a hash: each byte is represented by two hexadecimal digits.
In order to do this you can use:
authHash.hexdigest()
If you need to use it in a URL, you probably need to encode the digest bytes as base64:
base64.b64encode(authHash.digest())
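A short Python 3 sketch of both text forms follows. The input bytes are placeholders, and the use of urlsafe_b64encode instead of b64encode is my assumption: it avoids the "+" and "/" characters of the standard alphabet, though the "=" padding may still need escaping depending on where the value goes. Whether the receiving service expects hex or base64 is not stated in the question.

```python
import base64
import hashlib

# Placeholder input; the real key, secret and timestamp are not shown here.
auth_hash = hashlib.md5(b"example-key" + b"example-secret" + b"1300000000")

hex_text = auth_hash.hexdigest()  # 32 hexadecimal characters
b64_text = base64.urlsafe_b64encode(auth_hash.digest()).decode("ascii")

# Both forms are plain ASCII text and convert back to the same 16 bytes.
assert bytes.fromhex(hex_text) == auth_hash.digest()
assert base64.urlsafe_b64decode(b64_text) == auth_hash.digest()
```

Decoding the base64 output with "ascii" is safe because base64 only ever produces ASCII characters, unlike the raw digest.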
