Encode byte string in UTF-8 - Python

How do I encode a byte string containing non-ASCII bytes into UTF-8? For example:
import numpy, pickle, zlib
x = zlib.compress(pickle.dumps(numpy.random.rand(10, 10)))
# What to do here?
y = x.encode('utf8')
This gives me an error saying that some bytes are not in range(128). What am I supposed to do?

You have to decide which code point each non-ASCII byte refers to. For instance, what code point does the byte 0xA1 stand for?
You could use any of the iso-8859-X encodings, for example:
byte_str = chr(161)
utf8 = byte_str.decode('iso-8859-1').encode('utf-8')
# compare with: utf8 = byte_str.decode('iso-8859-2').encode('utf-8')
Note that the choice of encoding makes a difference: under iso-8859-1 the byte 0xA1 decodes to U+00A1 (¡), whose UTF-8 encoding is '\xc2\xa1', but under iso-8859-2 it decodes to U+0104 (Ą), whose UTF-8 encoding is '\xc4\x84'.
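For the compressed pickle from the question, a Latin-1 round trip preserves every byte. A Python 2 sketch (whether text is the right transport for binary data is another matter; base64 is the usual choice there):
import numpy, pickle, zlib
x = zlib.compress(pickle.dumps(numpy.random.rand(10, 10)))
# Latin-1 maps byte N to code point U+00N, so the round trip is lossless
y = x.decode('iso-8859-1').encode('utf-8')
assert y.decode('utf-8').encode('iso-8859-1') == x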

Related

Error : 'utf-8' codec can't decode byte 0xb0 in position 14: invalid start byte

I'm a beginner at Python. I would like to read multiple CSV files, and when I read them with encoding="ISO-8859-1" I get characters like "D°faut" in my CSV file. So I tried reading with UTF-8 instead and got this error: 'utf-8' codec can't decode byte 0xb0 in position 14: invalid start byte.
Can someone help me please?
Thank you!
If you want to decode with utf-8, the data must also have been encoded with utf-8.
Depending on the Unicode character you want to store (basically everything except basic Latin letters, digits and the usual symbols), utf-8 needs multiple bytes. Since the file is read byte by byte, the decoder uses the high bits of each byte to tell its role apart: bytes starting with 0 are plain ASCII, bytes starting with 110, 1110 or 11110 begin a multi-byte character, and bytes starting with 10 are continuation bytes that may only appear inside one. 0xb0 is 1011 0000 in binary: it starts with 10, so it is a continuation byte, and a continuation byte can never begin a character. Your file was evidently encoded with iso-8859-1, where 0xb0 is the single-byte degree symbol (°), so the utf-8 decoder finds that lone byte where a start byte is expected and fails.
If you want to encode the degree symbol (°) in utf-8, it becomes the two bytes 0xC2 0xB0.
In any case: always decode with the same encoding the data was encoded with. If you need characters outside the ASCII range, use utf-8. In general, using any of the UTF encodings is good advice.
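A short Python 3 sketch of the byte-level details above:
# '°' is U+00B0: one byte in Latin-1, two bytes in UTF-8
assert u'\xb0'.encode('latin-1') == b'\xb0'
assert u'\xb0'.encode('utf-8') == b'\xc2\xb0'
# 0xB0 = 0b10110000; the leading bits "10" mark a continuation byte,
# which can never begin a character:
try:
    b'\xb0'.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)  # ... invalid start byte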

When UTF is multibyte & latin1 is single byte, why do I get an error?

I am reading a CSV file with pandas.read_csv(). When I specify encoding='utf-8' or 'utf-16', it gives an error:
'utf-8' codec can't decode byte 0xa3 in position 127: invalid start byte
My question is: since UTF is a multibyte encoding and latin1 is a single-byte encoding, why do I get an error with UTF-8 or UTF-16, while latin1 works fine?
Shouldn't UTF be superior and able to decode all characters?
Thanks in advance.
I tried encoding='latin1', 'cp1252' and 'iso-8859-15'.
UTF-8 is self-synchronizing; you can tell where you are in a multibyte character without examining the neighboring bytes. So if the decoder hits a continuation byte where a start byte is expected, it knows the data either isn't UTF-8 or is corrupt.
UTF-8 isn't magic; you can encode just about anything to UTF-8, but you can only decode as UTF-8 when you actually have UTF-8 bytes.
Latin-1 decodes everything because latin-1, like most one-byte-per-character ASCII-superset encodings, is dumb. It just maps every byte value to a single character (the equivalent Unicode ordinal, in latin-1's case). So no matter what garbage you throw at it, latin-1 will decode it, but the result is also garbage unless the text really is latin-1 (or ASCII, of which latin-1 is a superset). This is why one-byte-per-character ASCII supersets are generally a bad idea: if you use the Windows locale's chosen ASCII superset, it works on your machine and on anyone else's with the same locale, but as soon as the file lands on a machine with a different locale, they silently get garbage.
Short answer: Your data isn't UTF-8 encoded, or it's corrupted. You need to figure out what it really is.
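To see both behaviors with the byte from the error (a Python 3 sketch; 0xA3 is '£' in Latin-1):
data = b'\xa3'                 # the offending byte
print(data.decode('latin-1'))  # '£' -- Latin-1 accepts any byte
try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)  # 0xA3 = 0b10100011 is a continuation-byte pattern
# If the file really is Latin-1/cp1252, say so explicitly:
# pandas.read_csv(path, encoding='latin-1')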

UTF-8 encoding in str type Python 2

I have Python 2.7 code that retrieves a base64-encoded response from a server. The response is decoded with the base64 module (the b64decode / decodestring functions, which return str). The decoded content contains the Unicode code points of the original strings.
I need to convert these Unicode code points to UTF-8.
The original string contains the substring "Não". When I decode the response, I see:
>>> encoded_str = ... # server response
>>> decoded_str = base64.b64decode(encoded_str)
>>> type(decoded_str)
<type 'str'>
>>> decoded_str[x:y]
'N\xe3o'
When I try to encode it to UTF-8, I get an error:
>>> (decoded_str[x:y]).encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 2: ordinal not in range(128)
However, when the same string is written manually as a unicode literal, I can convert it to my desired UTF-8 string just fine.
>>> test_str = u'N\xe3o'
>>> test_str.encode('utf-8')
'N\xc3\xa3o'
I have to retrieve this response from the server and correctly produce a UTF-8 string that prints as "Não". How can I do this in Python 2?
You want to decode, not encode the byte string.
Think of it like this: a Unicode string was encoded into bytes, and these bytes were further encoded into base64.
To reverse this, you need to reverse both encodings, in the opposite order.
However, the sample you show most definitely isn't a valid UTF-8 byte string - 0xE3 in isolation is not a valid UTF-8 encoding. Most likely, the Unicode string was encoded using Latin-1 or a related encoding (the sample is much too small to establish this conclusively; other common candidates are the fugly Windows code page CP1252 and Latin-9).
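Under that assumption (Latin-1 rather than UTF-8), the fix is a decode followed by an encode. A Python 2 sketch, using a stand-in for the server response:
import base64
encoded_str = base64.b64encode('N\xe3o')     # stand-in for the server response
decoded_str = base64.b64decode(encoded_str)  # str: 'N\xe3o'
text = decoded_str.decode('latin-1')         # unicode: u'N\xe3o'
print text.encode('utf-8')                   # 'N\xc3\xa3o', displayed as Não on a UTF-8 terminal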

python unicode: what codec will preserve literal byte values?

I need to decode a byte string into unicode, and insert it into a unicode string (python 2.7). When I later encode that unicode string back into bytes, the byte array must be equal to the original bytes. My question is which encoding I should use to achieve this.
Example:
#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("ascii"))
backToBytes = unicodeString.encode("ascii")
assert byteString==backToBytes
This fails with the infamous:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 128: ordinal not in range(128)
What encoding should I use here (instead of 'ascii') to preserve my byte values?
I am using "ascii" in this (currently broken) example, because it is my default encoding:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
It turns out that the 'latin1' (aka 'iso-8859-1') encoding will preserve every byte literally. This link mentions this fact, although other sources led me to believe this was false. I can confirm that running this code on python 2.7 works, demonstrating that 'iso-8859-1' does indeed preserve every possible byte:
#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("iso-8859-1"))
backToBytes = unicodeString.encode("iso-8859-1")
assert byteString==backToBytes
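The reason this works is that Latin-1 maps each byte value N directly to code point U+00N, so both directions are trivially lossless:
# byte N <-> code point N for all 256 byte values (Python 2)
assert all(ord(chr(ii).decode("iso-8859-1")) == ii for ii in xrange(256))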

Append bytes to string in Python

I have a byte array as a str and want to send it (looking at it in a debugger shows that File.body is a str). To do that, I build the message to send:
request_text += '\n'.join([
    '',
    '--%s' % boundary_id,
    attachment_headers,
    File.body,
])
But as soon as it tries to join the file body, I get this exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Yet the example I took this from did it exactly this way. How do I make the Python string work with the byte string? Should I decode it somehow? But how, when it's not text, just raw bytes?
You are probably getting the error because your string contains non-ASCII characters. The following example shows how to encode/decode a string containing non-ASCII characters.
1) Convert the string to unicode:
string = "helloé"
u = unicode(string, 'utf-8')
2) Encode the string as utf-8 before sending it over the network:
encoded = u.encode('utf-8')
3) Decode it from utf-8 back to unicode on the other side:
decoded = encoded.decode('utf-8')
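That covers text, but the exception in the question most likely comes from mixing types: in Python 2, if any piece passed to '\n'.join() is unicode, the str pieces are implicitly decoded with the ascii codec, which fails on byte 0xff. A sketch of the failure and two possible fixes (the names below are stand-ins for the question's variables):
body = '\xff\xd8\xff\xe0'              # raw bytes, e.g. a JPEG header
headers = u'Content-Type: image/jpeg'  # unicode sneaking into the join
# u'\n'.join([headers, body]) raises UnicodeDecodeError
# Fix 1: keep every piece a byte string
msg = '\n'.join([headers.encode('utf-8'), body])
# Fix 2: decode the bytes losslessly with latin-1, join as unicode,
# and encode back just before sending
msg_u = u'\n'.join([headers, body.decode('latin-1')])
wire = msg_u.encode('latin-1')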
