Fixed-length data field and variable-length UTF-8 encoding - Python

I have a Python project with a fixed byte-length text field (NOT a fixed char-length field) in a comm protocol that contains a UTF-8 encoded, NULL-padded, NULL-terminated string.
I need to ensure that a string fits into the fixed byte-length field. Since UTF-8 is a variable-width encoding, brute-force truncation at a fixed byte length is dicey: you could leave part of a multi-byte character dangling at the end.
Is there a module/method/function/etc. that can help me with truncating UTF-8 variable-width encoded strings to a fixed byte length?
Something that does NULL padding and termination would be a bonus.
This seems like a nut that would already have been cracked. I don't want to reinvent something if it already exists.

Let Python detect and eliminate any partial or invalid characters.
byte_str = uni_str.encode('utf-8')
byte_str = byte_str[:size].decode('utf-8', 'ignore').encode('utf-8')  # 'ignore' drops any trailing partial character
This works because the UTF-8 spec encodes the number of following bytes in the first byte of each multi-byte character, so missing trailing bytes are easily detected.
Edit: Here are the results from this code, using a random Chinese string I pulled from another question. The first number is the maximum size, the second is the actual number of bytes in the truncated UTF-8 string.
45 45 具有靜電產生裝置之影像輸入裝置
44 42 具有靜電產生裝置之影像輸入裝
43 42 具有靜電產生裝置之影像輸入裝
42 42 具有靜電產生裝置之影像輸入裝
41 39 具有靜電產生裝置之影像輸入
40 39 具有靜電產生裝置之影像輸入
39 39 具有靜電產生裝置之影像輸入
38 36 具有靜電產生裝置之影像輸
37 36 具有靜電產生裝置之影像輸
36 36 具有靜電產生裝置之影像輸
35 33 具有靜電產生裝置之影像
34 33 具有靜電產生裝置之影像
33 33 具有靜電產生裝置之影像
32 30 具有靜電產生裝置之影
31 30 具有靜電產生裝置之影
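The question also asked for NULL padding and termination as a bonus; here is a minimal sketch wrapping the trick above (the name utf8_truncate_pad is my own, and it reserves one byte so the field always ends in a NULL):

def utf8_truncate_pad(uni_str, size):
    """Fit uni_str into a size-byte field: UTF-8, NULL-terminated, NULL-padded."""
    # Reserve 1 byte for the terminator, then let 'ignore' drop any
    # trailing partial multi-byte character.
    byte_str = uni_str.encode('utf-8')[:size - 1].decode('utf-8', 'ignore').encode('utf-8')
    return byte_str + b'\0' * (size - len(byte_str))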

It is very easy to see in a UTF-8 stream whether a given byte is at the start (or not) of a given character's byte sequence. If the byte is of the form 10xxxxxx it is a non-initial (continuation) byte of a character, if it is of the form 0xxxxxxx it is a single-byte character, and any other byte is the initial byte of a multi-byte character.
As such, you can build your own function without too much difficulty. Just ensure that the last byte you add to your field is either of the form 0xxxxxxx, or is of the form 10xxxxxx where the next byte (which you're not adding) is not of the form 10xxxxxx. I.e. make sure you've just added a one-byte UTF-8 character, or the last byte of a multi-byte UTF-8 character. You can then add 0s to fill in the rest of your field, as in the sketch below.
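A minimal sketch of that byte-scanning approach (untested against the question's protocol; the name truncate_utf8 is made up for illustration):

def truncate_utf8(byte_str, size):
    """Truncate UTF-8 bytes to at most size bytes without splitting a character, then NULL-pad."""
    if len(byte_str) <= size:
        return byte_str + b'\0' * (size - len(byte_str))
    end = size
    # Back up past continuation bytes (10xxxxxx) so the character
    # straddling the boundary is dropped whole.
    while end > 0 and (byte_str[end] & 0b11000000) == 0b10000000:
        end -= 1
    return byte_str[:end] + b'\0' * (size - end)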

def fit(s, l):
    u = s.encode("utf8")
    while len(u) > l:
        s = s[:-1]              # drop the last character...
        u = s.encode("utf8")    # ...and re-encode until it fits
    return u + b"\0" * (l - len(u))
should be about the thing you need. Maybe you have to refine it; it is untested.
I edited because I accidentally answered in C. I also changed the algorithm to a not-so-optimal one, but one that is easier to understand.
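A quick usage example (my own, just to show the truncation and padding):

>>> fit('héllo', 8)   # 'é' encodes to two bytes (0xC3 0xA9); the rest is NULL padding
b'h\xc3\xa9llo\x00\x00'
>>> fit('héllo', 5)   # dropping the final 'o' makes the string fit exactly
b'h\xc3\xa9ll'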

Related

What is this hexadecimal in the UTF-16 format?

print(bytes('ba', 'utf-16'))
Result :
b'\xff\xfeb\x00a\x00'
I understand that UTF-16 means every character takes 16 bits, i.e. 00000000 00000000 in binary, and I understand there are 16 bits here: \x00a means \x00 = 00000000 and a = 01100001, so together they give \x00a. That much is clear in my mind, but here is the confusion:
\xff\xfeb
1 - What is this?
2 - Why fe? It should be \x00.
I have read a lot of Wikipedia articles but it is still not clear.
You have,
b'\xff\xfeb\x00a\x00'
This is what you asked for; it contains three characters.
b'\xff\xfe' # 0xff 0xfe
b'b\x00' # 0x62 0x00
b'a\x00' # 0x61 0x00
The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.
It is just confusing to read because the 'b' and 'a' look like they're hexadecimal digits, but they're not.
If you don't want the BOM, you can use utf-16le or utf-16be.
>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'
The problem is that you can get some garbage if you decode as the wrong one. If you use UTF-16 with BOM, you're more likely to get the right result when decoding.
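For example, decoding little-endian data with the big-endian codec silently produces the wrong characters (a quick illustration of the garbage mentioned above):

>>> b'b\x00a\x00'.decode('utf-16-le')   # correct byte order
'ba'
>>> b'b\x00a\x00'.decode('utf-16-be')   # wrong byte order: U+6200 and U+6100
'戀愀'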
I think you are misinterpreting the printout.
You have 3 16-bit words:
FFFE: This is the byte-order mark required in UTF-16 (Byte order mark - Wikipedia).
The 8-bit encoding of 'b' (shown as the character b instead of an \x escape sequence), followed by 00: this is the little-endian 16-bit representation of 'b'.
The 8-bit encoding of 'a', followed by 00: this is the little-endian 16-bit representation of 'a'.
You already got your answer; I just wanted to explain it in my own words for future readers.
In UTF-16 encoding, 'a' should occupy 16 bits, or 2 bytes. The 'a' itself needs only 8 bits. The question is: should I put the remaining zeroes before the value of 'a' or after it? There are two possible ways:
First: 01100001|00000000
Second: 00000000|01100001
If I don't tell you anything and just hand you these bytes, this would happen:
First = b"0110000100000000"
print(hex(int(First, 2))) # 0x6100
print(chr(int(First, 2))) # 愀
Second = b"0000000001100001"
print(hex(int(Second, 2))) # 0x61
print(chr(int(Second, 2))) # a
So you can't say anything just by looking at these bytes. Did I mean to send you 愀 or a?
First Solution:
I tell you about it verbally, i.e. about the "ordering"! Here is where "big-endian" and "little-endian" come into play:
bytes_ = b"a\x00" # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le")) # a - Correct.
print(bytes_.decode("utf-16-be")) # 愀
So If I tell you about the endianness, you can get to the correct character.
You see, without any extra character we were able to achieve this.
Second Solution:
I can "embed" the byte ordering into the bytes itself without explicitly telling you! It is called BOM(Byte Order Mark).
ordering1 = b"\xfe\xff"
ordering2 = b"\xff\xfe"
print((ordering1 + b"\x00a").decode("utf-16")) # a
print((ordering2 + b"a\x00").decode("utf-16")) # a
Now just passing "utf-16" to .decode() is enough: it can figure out the correct byte order on its own. There is no need to say le or be; that information is already there.

'UTF-8' decoding error while using unireedsolomon package

I have been writing code using the unireedsolomon package. The package adds parity bytes, which are mostly extended-ASCII characters. I am applying bit-level errors after converting the 'special character' parities to a bit string using the following code:
def str_to_byte(padded):
    byte_array = padded.encode()
    binary_int = int.from_bytes(byte_array, "big")
    binary_string = bin(binary_int)
    without_b = binary_string[2:]   # strip the '0b' prefix
    return without_b

def byte_to_str(without_b):
    binary_int = int(without_b, 2)
    byte_number = (binary_int.bit_length() + 7) // 8   # parentheses needed: // binds tighter than +
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    padded_char = ascii_text[:]
    return padded_char
After conversion from string to bit-stream I try to apply errors randomly, and there are instances where I am not able to retrieve those special characters (or other characters) back; I encounter the UTF-8 error before I can even decode the message.
If I flip a bit or so, the value should still be inside the 255 ASCII character values, but somehow I am getting errors. Is there any way to rectify this?
It's a bit odd that an error-correction package works with Unicode strings. It is better to encode byte data, since it may not be only text that is encoded/decoded. There is also no need to work with actual binary strings (Unicode 1s and 0s); flip bits in the byte strings directly.
Below I've wrapped the encode/decode routines so they take Unicode text and return byte strings, or vice versa. There is also a corrupt function that flips bits in the encoded result, to see the error correction in action:
import unireedsolomon as rs
import random

def corrupt(encoded):
    '''Flip up to 3 bits (might pick the same bit more than once).'''
    b = bytearray(encoded)                # convert to writable bytes
    for _ in range(3):
        index = random.randrange(len(b))  # pick random byte
        bit = random.randrange(8)         # pick random bit
        b[index] ^= 1 << bit              # flip it
    return bytes(b)                       # back to read-only bytes (not strictly necessary)

def encode(coder, msg):
    '''Convert msg to UTF-8-encoded bytes and encode with "coder". Return as bytes.'''
    return coder.encode(msg.encode('utf8')).encode('latin1')

def decode(coder, encoded):
    '''Decode the encoded message with "coder", convert the result to bytes and decode UTF-8.'''
    return coder.decode(encoded)[0].encode('latin1').decode('utf8')

coder = rs.RSCoder(20, 13)
msg = 'hello(你好)'  # 9 Unicode characters, but 13 (the maximum) bytes when encoded to UTF-8.
encoded = encode(coder, msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder, corrupted)
print(decoded)
Output below. Note that the first l in hello (ASCII 0x6C) was corrupted to 0xEC, the second l changed to an h (ASCII 0x68), and another byte changed from 0xE5 to 0xF5. You can actually randomly change any 3 bytes (not just bits), including the error-correcting bytes, and the message will still decode.
b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)
A note about .encode('latin1'): The encoder is using Unicode strings and the Unicode code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec will convert a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.
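A quick round-trip check of that 1:1 property (my own illustration, not from the original answer):

>>> data = bytes(range(256))                      # every possible byte value
>>> data.decode('latin1').encode('latin1') == data
True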
UTF-8 uses a variable-length encoding that ranges from 1 to 4 bytes per character. As you've already found, flipping random bits can result in invalid encodings. Take a look at
https://en.wikipedia.org/wiki/UTF-8#Encoding
Reed-Solomon normally uses fixed-size elements, in this case probably 8-bit elements, in a bit string. For longer messages, it could use 10-bit, 12-bit, or 16-bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero-padded to an element boundary, and then perform Reed-Solomon encoding to append parity elements to the bit string. When reading, the bit string should be corrected (or an uncorrectable error detected) via Reed-Solomon before attempting to convert the bit string back to UTF-8.
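A minimal sketch of just the padding step described above (the element size and function name are illustrative assumptions; the Reed-Solomon coding itself is left out):

def utf8_to_element_bits(msg, element_bits=10):
    """Convert msg to a bit string, zero-padded to a whole number of elements."""
    bits = ''.join(f'{byte:08b}' for byte in msg.encode('utf-8'))
    remainder = len(bits) % element_bits
    if remainder:
        bits += '0' * (element_bits - remainder)  # pad up to the element boundary
    return bits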

Why is a whitespace character only represented by 6 bits in ASCII?

I have written code in Python to represent strings using their ASCII counterparts. I have noticed that every character is replaced by 7 bits (as I expected). The problem is that every time I include a space in the string, it is represented by only 6 bits instead of 7. This is a bit of a problem for a Vernam cipher program I am writing, where my ASCII code ends up a few bits smaller than my key due to spaces. Here are the code and output:
string = 'Hello t'
ASCII = ""
for c in string:
    ASCII += bin(ord(c))
ASCII = ASCII.replace('0b', ' ')
print(ASCII)
Output: 1001000 1100101 1101100 1101100 1101111 100000 1110100
As can be seen in the output the 6th sequence of bits which represents the space character has only 6 bits and not 7 like the rest of the characters.
Instead of bin(ord(c)), which automatically strips leading zero bits, use string formatting to ensure a minimum width:
f'{ord(c):07b}'
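For example (a quick REPL check of my own):

>>> f'{ord(" "):07b}'   # space is 0x20 = 0b100000, now zero-padded to 7 bits
'0100000'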
The problem lies within your "conversion": the value for whitespace happens to need only 6 bits, and the bin built-in simply doesn't pad on the left with zeros. That is why you are getting 7 bits for the other chars, but it would really be more comfortable to use 8 bits for everything.
One way to go is, instead of calling bin, to use string formatting operators: these can, besides the base conversion, pad the missing bits with 0s:
string = 'Hello t'
# Aggregate the values in a list so that you can easily separate the
# binary strings with a " " by using ".join".
bin_strings = []
for code in string.encode("ASCII"):
    # You really should do this from bytes, which are encoded text.
    # Moreover, iterating a bytes object yields 0-255 ints, so there
    # is no need to call "ord".
    bin_strings.append(f"{code:08b}")  # format code in base 2 (b), with 8 digits, padding with 0s
ASCII = ' '.join(bin_strings)
Or, as a one-liner:
ASCII = " ".join(f"{code:08b}" for code in "Hello World".encode("ASCII"))

hexdump function in Python

With the Python code below:
zero = '00000000000000000000000000000000'
print(bytes.fromhex(zero))
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
three = '33333333333333333333333333333333'
print(bytes.fromhex(three))
b'3333333333333333'
When printing bytes.fromhex(zero), all 32 hex digits appear to be printed (as \x00 escapes). But when printing bytes.fromhex(three), only 16 characters are printed. The zero and three strings are both of the same length: 32.
The method fromhex(string) returns a bytes object, decoding the given string object. The string must contain two hexadecimal digits per byte; ASCII whitespace is ignored.
This means the method reads the one-byte hex string "33" as the byte 0x33, and 0x33 is the ASCII code for the character 3, so the repr shows it as 3.
Try "34343434" or "32323232"; you will see 4444 or 2222.
Have a look at https://docs.python.org/3/library/stdtypes.html

Length of a single character encoded in UTF-32

Wikipedia tells me that UTF-32 uses 32 bits per character, so why does this give me a 64-bit length?
>>> Bits(bytes = 'a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32')).bin)
64
UTF-32 is supposed to be a 4-byte fixed-length character encoding, so my understanding is that every character would be represented within a fixed 32 bits; yet the output of the code above is 64. How is this?
Encoding to UTF-32 usually includes a Byte Order Mark; you have two characters encoded to UTF-32. The BOM is usually required as it lets the decoder know if the data was encoded in little endian or big endian order. The BOM is really just the U+FEFF ZERO WIDTH NO-BREAK SPACE codepoint, which is encoded to '11111111111111100000000000000000' (little-endian) in your example.
Encode to one of the two endian-specific variants Python provides ('utf-32-le' or 'utf-32-be') to get a single character:
>>> Bits(bytes = 'a'.encode('utf-32-le')).bin
'01100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32-le')).bin)
32
The -le and -be variants let you encode or decode UTF-32 without a BOM, because you explicitly set the byte order.
Had you encoded more than one character, you'd have noticed that there are always 4 bytes more than the number of characters would require:
>>> len('abcd'.encode('utf-32')) # (BOM + 4 chars) * 4 bytes == 20 bytes
20
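You can see the BOM directly by slicing off the first four bytes (a quick check of my own; little-endian here, matching the question's output):

>>> 'a'.encode('utf-32')[:4]                 # the BOM: U+FEFF, little-endian
b'\xff\xfe\x00\x00'
>>> 'a'.encode('utf-32').decode('utf-32')    # the decoder consumes the BOM
'a'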
