Wikipedia tells me that UTF-32 uses 32 bits per code point, so why does this give me a 64-bit result?
>>> from bitstring import Bits
>>> Bits(bytes = 'a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32')).bin)
64
UTF-32 is supposed to be a fixed-length, 4-byte encoding, so by my understanding every character should be represented within 32 bits. Yet the output of the code above is 64 bits long. How is this?
Encoding to UTF-32 usually includes a Byte Order Mark, so you have two code points encoded to UTF-32. The BOM is needed because it lets the decoder know whether the data was encoded in little-endian or big-endian order. The BOM is really just the U+FEFF ZERO WIDTH NO-BREAK SPACE code point, which encodes to '11111111111111100000000000000000' (little-endian) in your example.
Encode to one of the two endian-specific variants Python provides ('utf-32-le' or 'utf-32-be') to get a single character:
>>> Bits(bytes = 'a'.encode('utf-32-le')).bin
'01100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32-le')).bin)
32
The -le and -be variants let you encode or decode UTF-32 without a BOM, because you explicitly set the byte order.
Had you encoded more than one character, you'd have noticed that there are always 4 bytes more than the number of characters would require:
>>> len('abcd'.encode('utf-32')) # (BOM + 4 chars) * 4 bytes == 20 bytes
20
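In case it helps to see the BOM directly, here is a short sketch of what the 'utf-32' codec produces on a little-endian machine (the exact BOM bytes depend on your platform's byte order), and of the fact that decoding consumes the BOM again:
>>> data = 'abcd'.encode('utf-32')
>>> data[:4]                 # the BOM: U+FEFF in little-endian UTF-32
b'\xff\xfe\x00\x00'
>>> data.decode('utf-32')    # decoding detects and consumes the BOM
'abcd'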
print(bytes('ba', 'utf-16'))
Result:
b'\xff\xfeb\x00a\x00'
I understand that UTF-16 means every character takes 16 bits, i.e. 00000000 00000000 in binary, and I can see those 16 bits here: in a\x00, the \x00 is 00000000 and the a is 01100001, so together they form one 16-bit character. That much is clear to my mind, but here is the confusion:
\xff\xfeb
1 - What is this?
2 - Why \xfe? It should be \x00.
I have read a lot of Wikipedia articles, but it is still not clear.
You have,
b'\xff\xfeb\x00a\x00'
This is what you asked for; it has three characters.
b'\xff\xfe' # 0xff 0xfe
b'b\x00' # 0x62 0x00
b'a\x00' # 0x61 0x00
The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.
It is just confusing to read because the 'b' and 'a' look like they're hexadecimal digits, but they're not.
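For example, viewing each byte as an integer makes the grouping clearer:
>>> list(b'\xff\xfeb\x00a\x00')   # each byte as a number
[255, 254, 98, 0, 97, 0]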
If you don't want the BOM, you can use utf-16le or utf-16be.
>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'
The problem is that you can get some garbage if you decode as the wrong one. If you use UTF-16 with BOM, you're more likely to get the right result when decoding.
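For instance, a quick sketch of that garbage, misreading the little-endian bytes as big-endian (the code-point values show how the byte pairs get reinterpreted):
>>> garbled = b'b\x00a\x00'.decode('utf-16be')   # LE bytes misread as BE
>>> [hex(ord(c)) for c in garbled]
['0x6200', '0x6100']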
I think you are misinterpreting the printout.
You have 3 16-bit words:
FFFE: This is the byte-order mark used in UTF-16 (Byte order mark - Wikipedia).
The 8-bit encoding of 'b' (shown as the character 'b' instead of an \x escape sequence), followed by 00: this is the little-endian 16-bit representation of 'b'.
The 8-bit encoding of 'a', followed by 00: this is the little-endian 16-bit representation of 'a'.
You already got your answer; I just wanted to explain it in my own words for future readers.
In UTF-16 encoding, 'a' occupies 16 bits, or 2 bytes, yet the 'a' itself needs only 8 bits. The question is: should the remaining zeroes go before the value of 'a' or after it? There are two possible ways:
First: 01100001|00000000
Second: 00000000|01100001
If I don't tell you anything and just hand you these, this would happen:
First = b"0110000100000000"
print(hex(int(First, 2))) # 0x6100
print(chr(int(First, 2))) # 愀
Second = b"0000000001100001"
print(hex(int(Second, 2))) # 0x61
print(chr(int(Second, 2))) # a
So you can't say anything just by looking at these bytes. Did I mean to send you 愀 or a?
First Solution:
I tell you about the ordering explicitly, out of band. Here is where "big-endian" and "little-endian" come into play:
bytes_ = b"a\x00" # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le")) # a - Correct.
print(bytes_.decode("utf-16-be")) # 愀
So If I tell you about the endianness, you can get to the correct character.
You see, we achieved this without adding any extra bytes.
Second Solution:
I can "embed" the byte ordering into the bytes themselves, without telling you explicitly! It is called a BOM (Byte Order Mark).
ordering1 = b"\xfe\xff"
ordering2 = b"\xff\xfe"
print((ordering1 + b"\x00a").decode("utf-16")) # a
print((ordering2 + b"a\x00").decode("utf-16")) # a
Now just passing "utf-16" to .decode() is enough: it figures out the correct byte order by itself. There is no need to say le or be; that information is already there.
I have been writing code using the unireedsolomon package. The package adds parity bytes that are mostly extended-ASCII characters. I am applying bit-level errors after converting the 'special character' parities using the following code:
def str_to_byte(padded):
    byte_array = padded.encode()
    binary_int = int.from_bytes(byte_array, "big")
    binary_string = bin(binary_int)
    without_b = binary_string[2:]
    return without_b

def byte_to_str(without_b):
    binary_int = int(without_b, 2)
    byte_number = (binary_int.bit_length() + 7) // 8  # parentheses matter: without them, 7 // 8 is evaluated first
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    padded_char = ascii_text[:]
    return padded_char
After conversion from string to a bit stream I apply errors at random positions, and there are instances where I cannot retrieve the special characters (or any characters) back: I hit a UnicodeDecodeError before I can even decode the message.
If I flip a bit or two, the result should still fall within the 256 possible byte values (0-255), but somehow I am getting errors. Is there any way to rectify this?
It's a bit odd that an error-correction package works with Unicode strings. It is better to operate on byte data, since what is encoded/decoded may not be text at all. There is also no need to work with actual binary strings (Unicode 1s and 0s); flip bits in the byte strings directly.
Below I've wrapped the encode/decode routines so they take Unicode text and return byte strings, or vice versa. There is also a corrupt function that flips bits in the encoded result so you can see the error correction in action:
import unireedsolomon as rs
import random
def corrupt(encoded):
    '''Flip up to 3 bits (might pick the same bit more than once).'''
    b = bytearray(encoded)                # convert to writable bytes
    for _ in range(3):
        index = random.randrange(len(b))  # pick a random byte
        bit = random.randrange(8)         # pick a random bit
        b[index] ^= 1 << bit              # flip it
    return bytes(b)                       # back to read-only bytes (not strictly necessary)
def encode(coder, msg):
    '''Convert msg to UTF-8-encoded bytes, encode with "coder", and return the result as bytes.'''
    return coder.encode(msg.encode('utf8')).encode('latin1')

def decode(coder, encoded):
    '''Decode the encoded message with "coder", convert the result to bytes, and decode UTF-8.'''
    return coder.decode(encoded)[0].encode('latin1').decode('utf8')
coder = rs.RSCoder(20,13)
msg = 'hello(你好)' # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
encoded = encode(coder,msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder,corrupted)
print(decoded)
Output. Note that the first l in hello (ASCII 0x6C) was corrupted to 0xEC, the second l changed to an h (ASCII 0x68), and another byte changed from 0xE5 to 0xF5. You can actually randomly change any 3 bytes (not just bits), including the error-correcting bytes, and the message will still decode.
b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)
A note about .encode('latin1'): The encoder is using Unicode strings and the Unicode code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec will convert a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.
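A quick sanity check of that 1:1 mapping:
>>> all(chr(i).encode('latin1') == bytes([i]) for i in range(256))
True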
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. As you've already found, flipping random bits can produce invalid encodings. Take a look at
https://en.wikipedia.org/wiki/UTF-8#Encoding
Reed-Solomon normally uses fixed-size elements, in this case probably 8-bit elements, in a bit string. For longer messages, it could use 10-, 12-, or 16-bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero-padded to an element boundary, and then perform Reed-Solomon encoding to append parity elements to the bit string (a sketch of the padding step follows). When reading, the bit string should be corrected (or an uncorrectable error detected) via Reed-Solomon before attempting to convert the bit string back to UTF-8.
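Here is a minimal sketch of just that padding step, with no RS library involved; the helper name utf8_to_element_bits is purely for illustration, and the 10-bit element size is an assumption (with 8-bit elements, UTF-8 bytes are already aligned):

def utf8_to_element_bits(text, element_bits=10):
    # One '0'/'1' character per bit of the UTF-8 encoding.
    bits = ''.join(f'{byte:08b}' for byte in text.encode('utf-8'))
    remainder = len(bits) % element_bits
    if remainder:  # zero-pad up to an element boundary
        bits += '0' * (element_bits - remainder)
    return bits

# 13 UTF-8 bytes = 104 bits, zero-padded to 110 bits = 11 ten-bit elements.
print(len(utf8_to_element_bits('hello(你好)')))  # 110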
I'm working on a project in which I have to perform some byte operations using Python, and I'd like to understand some basic principles before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or str.encode(), I get an extra "\xc2" in front of the value. What does it mean? Is it supposed to appear? And if so, how can I get rid of it without using the first method?
Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, the actual raw bytes this corresponds to depend on the encoding. Using utf-8 it is a two-byte encoding:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is a single byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise and easiest way is to use a bytes literal.
In a string literal, \xhh (where h is a hex digit) selects the corresponding Unicode character in the range U+0000 to U+00FF, with U+00AC being ¬, the "not sign". When encoding to UTF-8, all code points above 0x7F take two or more bytes; \xc2\xac is the UTF-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'
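As an aside (assuming the goal really is a single 0xAC byte starting from a str rather than from a bytes literal), encoding with 'latin-1' gives the 1:1 mapping of the first 256 code points to byte values:
>>> "\xAC".encode('latin-1')
b'\xac'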
I have a Python project where I have a fixed byte-length text field (NOT FIXED CHAR-LENGTH FIELD) in a comm protocol that contains a utf-8 encoded, NULL padded, NULL terminated string.
I need to ensure that a string fits into the fixed byte-length field. Since utf-8 is a variable width encoding, this makes using brute force to truncate the string at a fixed byte length dicey since you could possibly leave part of a multi-byte character dangling at the end.
Is there a module/method/function/etc that can help me with truncating utf-8 variable width encoded strings to a fixed byte-length?
Something that does Null padding and termination would be a bonus.
This seems like a nut that would have already been cracked. I don't want to reinvent something if it already exists.
Let Python detect and eliminate any partial or invalid characters.
byte_str = uni_str.encode('utf-8')
byte_str = byte_str[:size].decode('utf-8', 'ignore').encode('utf-8')
This works because the UTF-8 spec encodes the number of following bytes in the first byte of a character, so the missing bytes can be easily detected.
Edit: Here are the results from this code using a string of Chinese characters I pulled from another question. The first number is the maximum size, the second is the actual number of bytes in the UTF-8 string.
45 45 具有靜電產生裝置之影像輸入裝置
44 42 具有靜電產生裝置之影像輸入裝
43 42 具有靜電產生裝置之影像輸入裝
42 42 具有靜電產生裝置之影像輸入裝
41 39 具有靜電產生裝置之影像輸入
40 39 具有靜電產生裝置之影像輸入
39 39 具有靜電產生裝置之影像輸入
38 36 具有靜電產生裝置之影像輸
37 36 具有靜電產生裝置之影像輸
36 36 具有靜電產生裝置之影像輸
35 33 具有靜電產生裝置之影像
34 33 具有靜電產生裝置之影像
33 33 具有靜電產生裝置之影像
32 30 具有靜電產生裝置之影
31 30 具有靜電產生裝置之影
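Since the question also asked about NULL padding and termination as a bonus, here is a hedged sketch combining the truncation above with both; fit_field is a hypothetical helper, and it assumes size is the fixed byte width of the field with at least one NUL terminator required:

def fit_field(uni_str, size):
    data = uni_str.encode('utf-8')[:size - 1]               # reserve 1 byte for the NUL terminator
    data = data.decode('utf-8', 'ignore').encode('utf-8')   # drop any dangling partial character
    return data.ljust(size, b'\x00')                        # NUL-terminate and pad to the field width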
It is very easy to see in a UTF-8 stream whether a given byte is at the start of a character's byte sequence. A byte of the form 10xxxxxx is a non-initial byte of a character, a byte of the form 0xxxxxxx is a single-byte character, and all other bytes are the initial bytes of a multi-byte character.
As such, you can build your own function without too much difficulty. Just ensure that the last byte you add to your field is either of the form 0xxxxxxx, or of the form 10xxxxxx where the next byte (which you're not adding) is not of the form 10xxxxxx. That is, make sure you've just added a one-byte UTF-8 character or the last byte of a multi-byte UTF-8 character. You can then add zero bytes to fill in the rest of your field, as in the sketch below.
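A minimal sketch of this byte-level approach, assuming the input is valid UTF-8; truncate_utf8 is an illustrative name, with data as bytes and size as the field width:

def truncate_utf8(data, size):
    data = data[:size]
    end = len(data)
    # Back up past any trailing continuation bytes (10xxxxxx).
    while end > 0 and (data[end - 1] & 0xC0) == 0x80:
        end -= 1
    if end > 0 and data[end - 1] & 0x80:  # ends with (part of) a multi-byte sequence
        lead = data[end - 1]
        # Expected sequence length, read from the lead byte's high bits.
        n = 2 if lead >> 5 == 0b110 else 3 if lead >> 4 == 0b1110 else 4
        if len(data) - (end - 1) < n:     # the sequence was cut short; drop it
            data = data[:end - 1]
    return data.ljust(size, b'\x00')      # zero-fill the rest of the field

print(truncate_utf8('具有靜電'.encode('utf-8'), 10))  # keeps 3 whole characters plus a NUL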
def fit(s, l):
    u = s.decode("utf8")
    while True:
        if len(s) <= l:
            return s + b"\0" * (l - len(s))
        u = u[:-1]
        s = u.encode("utf8")
This should be about what you need. You may have to refine it; it is untested.
I edited because I accidentally answered in C. I also changed the algorithm to a less optimal one that is easier to understand.
Having a UTF-8 string like this:
mystring = "işğüı"
is it possible to get its (in-memory) size in bytes with Python (2.5)?
Assuming you mean the number of UTF-8 bytes (and not the extra bytes that Python requires to store the object), it’s the same as for the length of any other string. A string literal in Python 2.x is a string of encoded bytes, not Unicode characters.
Byte strings:
>>> mystring = "işğüı"
>>> print "length of {0} is {1}".format(repr(mystring), len(mystring))
length of 'i\xc5\x9f\xc4\x9f\xc3\xbc\xc4\xb1' is 9
Unicode strings:
>>> myunicode = u"işğüı"
>>> print "length of {0} is {1}".format(repr(myunicode), len(myunicode))
length of u'i\u015f\u011f\xfc\u0131' is 5
It’s good practice to maintain all of your strings in Unicode, and only encode when communicating with the outside world. In this case, you could use len(myunicode.encode('utf-8')) to find the size it would be after encoding.
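For example, continuing the Unicode session above:
>>> len(myunicode.encode('utf-8'))
9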