Properly split unicode string on byte count [duplicate] - python

This question already has answers here:
Split unicode string into 300 byte chunks without destroying characters
(5 answers)
Closed 8 years ago.
I want to truncate a unicode string to a maximum of 255 bytes and get the result back as unicode:
# s = arbitrary-length-unicode-string
s.encode('utf-8')[:255].decode('utf-8')
The problem with this snippet is that if the 255th byte falls in the middle of a multi-byte UTF-8 character, I get an error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 254: unexpected end of data
Even if I handle the error, I am left with unwanted garbage at the end of the string.
How can I solve this more elegantly?

One very nice property of UTF-8 is that trailing bytes can easily be differentiated from starting bytes. Just work backwards until you've deleted a starting byte.
trunc_s = s.encode('utf-8')[:256]      # keep one byte more than the limit
if len(trunc_s) > 255:
    # walk backwards over continuation bytes (0b10xxxxxx), then drop the start byte as well
    final = -1
    while ord(trunc_s[final]) & 0xc0 == 0x80:
        final -= 1
    trunc_s = trunc_s[:final]
trunc_s = trunc_s.decode('utf-8')
Edit: Check out the answers in the question identified as a duplicate, too.
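If the input is known to be valid UTF-8 text (a Python 3 str, or a Python 2 unicode object), a shorter alternative is to let the decoder silently drop the incomplete trailing sequence; a minimal sketch:

def truncate_utf8(s, max_bytes=255):
    # errors='ignore' drops the partial character at the cut point, if any;
    # this relies on the rest of the input being valid UTF-8 to begin with
    return s.encode('utf-8')[:max_bytes].decode('utf-8', errors='ignore')

print(truncate_utf8(u'д' * 200))   # 400 bytes encoded; 127 characters (254 bytes) survive, the dangling lead byte is dropped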

Related

'UTF-8' decoding error while using unireedsolomon package

I have been writing code that uses the unireedsolomon package. The package adds parity bytes, which are mostly extended-ASCII characters. I am applying bit-level errors after converting the 'special character' parities using the following code:
def str_to_byte(padded):
    # encode the string to bytes, then view those bytes as one big integer in binary
    byte_array = padded.encode()
    binary_int = int.from_bytes(byte_array, "big")
    binary_string = bin(binary_int)
    without_b = binary_string[2:]          # strip the leading '0b'
    return without_b

def byte_to_str(without_b):
    # parse the bit string back into an integer, then into bytes, then decode
    binary_int = int(without_b, 2)
    byte_number = (binary_int.bit_length() + 7) // 8   # round up to whole bytes; note the parentheses, otherwise 7 // 8 evaluates to 0
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    padded_char = ascii_text[:]
    return padded_char
After converting the string to a bit-stream I apply errors at random, and there are instances where I cannot retrieve those special characters (or ordinary characters) and I hit the 'utf' error before I can even decode the message.
If I flip a bit or so, the value should still be within the 255 ASCII character values, but somehow I am getting errors. Is there any way to rectify this?
It's a bit odd that an error-correction package works with Unicode strings. It's better to encode byte data, since the payload may not be text at all. There is also no need to work with actual binary strings (Unicode 1s and 0s); flip bits in the byte strings directly.
Below I've wrapped the encode/decode routines so that encode takes Unicode text and returns bytes, and decode does the reverse. There is also a corrupt function that flips bits in the encoded result so you can see the error correction in action:
import unireedsolomon as rs
import random

def corrupt(encoded):
    '''Flip up to 3 bits (might pick the same bit more than once).'''
    b = bytearray(encoded)                # convert to writable bytes
    for _ in range(3):
        index = random.randrange(len(b))  # pick a random byte
        bit = random.randrange(8)         # pick a random bit
        b[index] ^= 1 << bit              # flip it
    return bytes(b)                       # back to read-only bytes, but not necessary

def encode(coder, msg):
    '''Convert msg to UTF-8-encoded bytes and encode with "coder". Return as bytes.'''
    return coder.encode(msg.encode('utf8')).encode('latin1')

def decode(coder, encoded):
    '''Decode the encoded message with "coder", convert the result to bytes and decode as UTF-8.'''
    return coder.decode(encoded)[0].encode('latin1').decode('utf8')

coder = rs.RSCoder(20, 13)
msg = 'hello(你好)'  # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
encoded = encode(coder, msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder, corrupted)
print(decoded)
Output. Note that the first l in hello (ASCII 0x6C) was corrupted to 0xEC, the second l changed to an h (ASCII 0x68), and another byte changed from 0xE5 to 0xF5. You can actually randomly change any 3 bytes (not just bits), including error-correcting bytes, and the message will still decode.
b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)
A note about .encode('latin1'): The encoder is using Unicode strings and the Unicode code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec will convert a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.
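A quick, purely illustrative check of that 1:1 mapping:

# code points U+0000..U+00FF map one-to-one to byte values 0x00..0xFF under latin1
text = ''.join(chr(i) for i in range(256))
assert text.encode('latin1') == bytes(range(256))
assert bytes(range(256)).decode('latin1') == text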
UTF-8 uses a variable-length encoding that ranges from 1 to 4 bytes. As you've already found, flipping random bits can result in invalid encodings. Take a look at
https://en.wikipedia.org/wiki/UTF-8#Encoding
Reed-Solomon normally uses fixed-size elements, in this case probably 8-bit elements, in a bit string. For longer messages it could use 10-bit, 12-bit, or 16-bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero-padded to an element boundary, and then perform Reed-Solomon encoding to append parity elements to the bit string. When reading, the bit string should be corrected (or an uncorrectable error detected) via Reed-Solomon before attempting to convert the bit string back to UTF-8.

Convert unicode string into byte string [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I have a string like:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
I need to be able to get the corresponding byte literal of that unicode (for pickle.loads):
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Here the solution of using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I get an incorrect result, b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04', which has two backslashes (actually representing a single one) for each one it should have.
Also here and here a similar solution is proposed, but it does not work for me either; I end up with the double backslashes again. Why does this occur? How do I get the bytes result I want?
You do not have processed byte escape codes as shown below (length 9), or you wouldn't get the doubled-backslash result:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You have literal escape codes (length 36); note the r prefix for a raw string, which prevents the escape codes from being interpreted as bytes:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
Note the difference. \\ is an escape code indicating a literal, single backslash:
>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36
The following gets the desired byte string by converting each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. It then decodes the literal escape codes, which yields a Unicode string again, so it encodes with latin1 once more to convert 1:1 back to bytes:
s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)
Output:
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
If you did have s_str with the escape codes already processed (the length-9 version), a simple .encode('latin1') would convert it:
>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
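Another option, assuming the string contains only well-formed \xNN escapes and no quote characters, is to let Python parse it as a bytes literal; an illustrative sketch:

import ast

s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_bytes = ast.literal_eval("b'%s'" % s_str)   # build a bytes literal and evaluate it safely
print(s_bytes)                                # b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'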
I was about to post the question when I encountered a valid solution almost by chance. The combination that works for me is:
s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")
As I said, I have no idea why this works, so feel free to explain it if you know why.
You might simply use .encode("utf-8") to get a bytes object, though note that non-ASCII code points such as \xc0 are expanded to two bytes, so the result is not the same as the desired byte string:
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)
output
b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'

How to convert byte to string in python 3? [duplicate]

This question already has answers here:
What's the correct way to convert bytes to a hex string in Python 3?
(9 answers)
Closed 4 years ago.
I have a variable b whose value is b'\xac\xed\x05sr\x00'.
How can I convert it to 'aced05737200'?
s and r are converted to 73 and 72 respectively because their ASCII codes are 0x73 and 0x72.
b.decode('utf-8') gives me this error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte
Simply use the .hex() method
>>> b = b'\xac\xed\x05sr\x00'
>>> b.hex()
'aced05737200'
to get the wanted result; this is not a decoding or encoding problem. Your byte string is fine as it is, it just needs to be rendered as hexadecimal digits.
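For completeness, the conversion is reversible, and binascii offers the same operation if you prefer it; a small illustrative check:

import binascii

b = b'\xac\xed\x05sr\x00'
assert b.hex() == 'aced05737200'
assert binascii.hexlify(b) == b'aced05737200'   # same digits, returned as bytes
assert bytes.fromhex('aced05737200') == b       # and back again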

Convert ASCII to Unicode encoding issue [duplicate]

This question already has answers here:
Python string to unicode [duplicate]
(3 answers)
Closed 6 years ago.
I have a question about Python 2 encoding. I am trying to decode an ASCII string that contains the Unicode escape code of a letter into Unicode, and then encode it back to UTF-8, but with no success. Here is an illustration:
In[27]: d = u'\u010d'
In[28]: print d.encode('utf-8')
č
In[29]: d1 = '\u010d'
In[30]: d1.decode('ascii').encode('utf-8')
Out[30]: '\\u010d'
I would like to convert '\u010d' to 'č'. Are there any built-in solutions to avoid custom string replacement?
When you do
d1 = '\u010d'
you actually get this string:
In [3]: d1
Out[3]: '\\u010d'
This is because "normal" (non-Unicode) strings don't recognize the \unnnn escape sequence and therefore convert it to a literal backslash, followed by unnnn.
In order to decode that, you need to use the unicode_escape codec:
In [4]: print d1.decode("unicode_escape").encode('utf-8')
č
But of course you shouldn't use Unicode escape sequences in non-Unicode strings in the first place.
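For reference, under Python 3 unicode_escape is a bytes-to-str codec, so the equivalent (illustrative) snippet is:

d1 = '\\u010d'                                      # six characters: backslash, u, 0, 1, 0, d
print(d1.encode('ascii').decode('unicode_escape'))  # č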

How to remove last utf8 char of a python string

I have a string containing utf-8 encoded text. I need to remove the last utf-8 character.
So far I did
msg = msg[:-1]
but this only removes the last byte. It works as long as the last character is ASCII, but it fails when the last character is a multi-byte character.
The simplest way is to decode your UTF-8 bytes to Unicode text:
without_last = msg.decode('utf8')[:-1]
You can always encode it again.
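A tiny round-trip sketch of that approach (shown here with Python 3 bytes; the same decode/slice/encode works on a Python 2 byte string):

msg = 'abcé'.encode('utf8')          # b'abc\xc3\xa9' – the last character is two bytes
without_last = msg.decode('utf8')[:-1].encode('utf8')
print(without_last)                  # b'abc'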
The alternative would be for you to search for a UTF-8 start byte; UTF-8 byte sequences always start with a byte with the most significant bit set to 0, or the two most significant bits set to 1, while continuation bytes always start with 10:
# find the starting byte of the last codepoint
pos = len(msg) - 1
while pos > -1 and ord(msg[pos]) & 0xC0 == 0x80:
    # character at pos is a continuation byte (bit 7 set, bit 6 not)
    pos -= 1
msg = msg[:pos]
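For reference, a sketch of the same scan under Python 3, where indexing a bytes object already yields ints, so ord() is not needed:

msg = 'abcé'.encode('utf8')                     # example bytes: b'abc\xc3\xa9'
pos = len(msg) - 1
while pos > -1 and msg[pos] & 0xC0 == 0x80:     # 0b10xxxxxx marks a continuation byte
    pos -= 1
msg = msg[:pos]                                 # b'abc'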
