What is this hexadecimal in the utf16 format?

What is this hexadecimal in the utf16 format? - python

print(bytes('ba', 'utf-16'))
Result :
b'\xff\xfeb\x00a\x00'
I understand utf-16 means every character will take 16 bits means 00000000 00000000 in binary and i understand there are 16 bits here x00a means x00 = 00000000 and a = 01000001 so both gives x00a it is clear to my mind like this but here is the confusion:
\xff\xfeb
1 - What is this ?????????
2 - Why fe ??? it should be x00
i have read a lot of wikipedia articles but it is still not clear

You have,
b'\xff\xfeb\x00a\x00'
This is what you asked for, it has three characters.
b'\xff\xfe' # 0xff 0xfe
b'b\x00' # 0x62 0x00
b'a\x00' # 0x61 0x00
The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.
It is just confusing to read because the 'b' and 'a' look like they're hexadecimal digits, but they're not.
If you don't want the BOM, you can use utf-16le or utf-16be.
>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'
The problem is that you can get some garbage if you decode as the wrong one. If you use UTF-16 with BOM, you're more likely to get the right result when decoding.

I think you are misinterpreting the printout.
You have 3 16-bit words:
FFFE: This is the byte-order mark required in UTF-16 (Byte order mark - Wikipedia).
00, followed by the 8-bit encoding of 'b' (that is shown as the character 'b' instead of using an \x escape sequence): This is the 16-bit representation of 'b'.
00, followed by the 8-bit encoding of 'a': This is the 16-bit representation of 'a'.

You already got your answer I just wanted to explain it in my own words for future readers.
In UTF-16 encoding, It seems that 'a' should occupy 16 bits or 2 bytes. The 'a' itself needs 8 bits. The question is should I put the remaining zeroes before the value of 'a' or after it? There are two possible ways:
First: 01100001|00000000
Second: 00000000|01100001
If I don't tell you anything and just hand you these, this would happen:
First = b"0110000100000000"
print(hex(int(First, 2))) # 0x6100
print(chr(int(First, 2))) # 愀
Second = b"0000000001100001"
print(hex(int(Second, 2))) # 0x61
print(chr(int(Second, 2))) # a
So you can't say anything just by looking at these bytes. Did I mean to send you 愀 or a ?
First Solution:
I myself tell you about this verbally. About the "Ordering"! Here is where "big-endian" and "little-endian" come into play:
bytes_ = b"a\x00" # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le")) # a - Correct.
print(bytes_.decode("utf-16-be")) # 愀
So If I tell you about the endianness, you can get to the correct character.
You see, without any extra character we were able to achieve this.
Second Solution
I can "embed" the byte ordering into the bytes itself without explicitly telling you! It is called BOM(Byte Order Mark).
ordering1 = b"\xfe\xff"
ordering2 = b"\xff\xfe"
print((ordering1 + b"\x00a").decode("utf-16")) # a
print((ordering2 + b"a\x00").decode("utf-16")) # a
Now just passing "utf-16" to .decode() is enough. It can figure the correct byte out correctly. There is no need to tell about le or be it's already there.

Related

'UTF-8' decoding error while using unireedsolomon package

I have been writing a code using the unireedsolomon package. The package adds parity bytes which are mostly extended ASCII characters. I am applying bit-level errors after converting the 'special character' parities using the following code:
def str_to_byte(padded):
byte_array = padded.encode()
binary_int = int.from_bytes(byte_array, "big")
binary_string = bin(binary_int)
without_b = binary_string[2:]
return without_b
def byte_to_str(without_b):
binary_int = int(without_b, 2)
byte_number = binary_int.bit_length() + 7 // 8
binary_array = binary_int.to_bytes(byte_number, "big")
ascii_text = binary_array.decode()
padded_char = ascii_text[:]
return padded_char
After conversion from string to a bit-stream I try to apply errors randomly and there are instances when I am not able to retrieve those special-characters (or characters) back and I encounter the 'utf' error before I could even decode the message.
If I flip a bit or so it has to be inside the 255 ASCII character values but somehow I am getting errors. Is there any way to rectify this ?

It's a bit odd that the encryption package works with Unicode strings. Better to encrypt byte data since it may not be only text that is encrypted/decrypted. Also no need for working with actual binary strings (Unicode 1s and 0s). Flip bits in the byte strings.
Below I've wrapped the encode/decode routines so they take either Unicode text and return byte strings or vice versa. There is also a corrupt function that will flip bits in the encoded result to see the error correction in action:
import unireedsolomon as rs
import random
def corrupt(encoded):
'''Flip up to 3 bits (might pick the same bit more than once).
'''
b = bytearray(encoded) # convert to writable bytes
for _ in range(3):
index = random.randrange(len(b)) # pick random byte
bit = random.randrange(8) # pic random bit
b[index] ^= 1 << bit # flip it
return bytes(b) # back to read-only bytes, but not necessary
def encode(coder,msg):
'''Convert the msg to UTF-8-encoded bytes and encode with "coder". Return as bytes.
'''
return coder.encode(msg.encode('utf8')).encode('latin1')
def decode(coder,encoded):
'''Decode the encoded message with "coder", convert result to bytes and decode UTF-8.
'''
return coder.decode(encoded)[0].encode('latin1').decode('utf8')
coder = rs.RSCoder(20,13)
msg = 'hello(你好)' # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
encoded = encode(coder,msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder,corrupted)
print(decoded)
Output. Note that the first l in hello (ASCII 0x6C) corrupted to 0xEC, then second l changed to an h (ASCII 0x68) and another byte changed from 0xE5 to 0xF5. You can actually randomly change any 3 bytes (not just bits) including error-correcting bytes and the message will still decode.
b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)
A note about .encode('latin1'): The encoder is using Unicode strings and the Unicode code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec will convert a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.

UTF-8 uses a variable length encoding that ranges from 1 to 4 bytes. As you're already found, flipping random bits can result in invalid encodings. Take a look at
https://en.wikipedia.org/wiki/UTF-8#Encoding
Reed Solomon normally uses fixed size elements, in this case probably 8 bit elements, in a bit string. For longer messages, it could use 10 bit, 12 bit, or 16 bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero padded to an element boundary, and then perform Reed Solomon encoding to append parity elements to the bit string. When reading, the bit string should be corrected (or uncorrectable error detected) via Reed Solomon before attempting to convert the bit string back to UTF-8.

Bytes operations in Python

I'm working on a project in which I have to perform some byte operations using python and I'd like to understand some basic principals before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or decode, I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?

Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, they actual raw bytes this would correspond to depend on the encoding. Using utf-8 it is a two-byte encoding:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is a 1 byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise/easy way then is to use a bytes-literal.

In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'

How to extract numerical values after expression '\xb' in byte string

I am attempting to extract numerical values from a byte string transmitted from an RS-232 port. Here is an example:
b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
If I attempt to decode the byte string as 'utf-8', I receive the following output:
x = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
x.decode('utf-8', errors='ignore')
>>> 'SS3.6\n'
What I ideally want is 23.67, which is observed after every \xb pattern. How could I extract 23.67 from this byte string?

As mentioned in https://stackoverflow.com/a/59416410/3319460, your input actually doesn't really represent the output you seek. But just to fulfil your requirements, of course, we might set semantics onto the input such that
numbers or '.' sign is allowed, others are skipped
if the byte is non-ASCII character such whether the first four bytes are 0xB. If it is the case then we will simply take the ASCII part of the byte (b & 0b01111111)
That is quite easily done in Python.
def _filter(char):
return char & 0xF0 == 0xB0 or chr(char) == "." or 48 <= char <= 58
def filter_xbchars(value: bytes) -> str:
return "".join(chr(ch & 0b01111111) for ch in value if _filter(ch))
import pytest
#pytest.mark.parametrize(
"value, expected",
[(b"S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n", "23.67")],
)
def test_simple(value, expected):
assert filter_xbchars(value) == expected
Please be aware that even though code above satisfies the requirements it is an example of a poorly described task and as a result quite nonsensical solution. The code solves the task as you asked for it but we should firstly reconsider whether it even makes sense. I advise you to check the data you will test against and the meaning of the data (protocol).
Good luck :)

If you just want to get 23.67 from that byte string try this:
a = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
b = repr(a)[2:-1]
c = b.split("\\")
d = ''
e = []
for i in c:
if "xb" in i:
e.append(i[2:])
d = "".join(e)
print(d)

Please notice that \xHH is an escape code representing hexadecimal value HH and as such your string '\xb23.6\xb7' does not contain "23.67" but rater "(0xB2)3.6(0xB7)", those value cannot be extracted using a regular expression because it's not present in the string in the first place.
'\xb23.6\xb7' is not a valid UTF-8 sequence, and in Latin-1 extended ASCII it would represent "²3.6·"; the presence of many 0xA0 values would suggest a Latin-1 encoding as it represent a non-breaking space in that encoding (a fairly common character) while in UTF-8 it does not encode a meaningful sequence.

How to get line of hex codes (\x) to hex address (0x) in Python?

How do I get bytes this line of hex ASCII to a hex address as shown below?
I've managed to separate them to each line, but am having issues with the python 3 hex() as it throws errors such as str type when I try to do something like say line.hex().
Input
\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00
Output
0xDDCCBBAA
0x2211FFEE
0x66554433
0x00998877
My Code
import re
a=r"\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00"
r = '\n'.join(re.findall('................|.$', a))
for line in r.splitlines():
print(line)

s = "\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00"
print("\n".join([hex(ord(c)) for c in s]))
Will output:
0xaa
0xbb
0xcc
0xdd
0xee
0xff
0x11
0x22
0x33
0x44
0x55
0x66
0x77
0x88
0x99
0x0
Is this what you needed? This assumes the input string s is a string consisting of Unicode characters. It would be quite helpful to know what you actually want to achieve.

I always verify that my input is correct before I start doing anything. So using regexp, I came up with the following:
import re
a=r"\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00"
# matches a word consisting of 4 bytes encoded as \xHH where H is any hex character
listOfFourHexBytes = re.findall(r'(?:\\x[0-9a-fA-F]{2}){4}', a)
if (len(a) != 4 * 16 or len(listOfFourHexBytes) != 4):
raise Exception("Input does not consist of 16 hexadecimal bytes")
for fourHexbytes in listOfFourHexBytes:
justFourHexBytes = re.sub(r'\\x', '', fourHexbytes)
print("0x" + justFourHexBytes)
# not asked for, but generate number from hex value first
n = int(justFourHexBytes, 16)
print("0x%08X" % n)
It first looks for actual hex bytes in the format you asked for, in groups of 4 hex encoded bytes. Then it checks if there are indeed 4 of those, making sure that the input string is of the correct size (no spurious skipped characters). Then it iterates over those, removing the \x from the strings. The next step is of course easy as pie, just add 0x to the result and print it out.
Usually you need to do something with the values, so I added the int conversion for free.

Length of a single character encoded in UTF-32

Wikipedia tells me that the number of bits used by the UTF-32 encoding is 32 bits, so why does this give me a 64 bit length?
>>> Bits(bytes = 'a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32')).bin)
64
UTF-32 is supposed to be a 4-byte fixed length character set, which according to my understanding is that every character would have fixed length representing it within 32 bits, yet, the output of above code is 64. How is this?

Encoding to UTF-32 usually includes a Byte Order Mark; you have two characters encoded to UTF-32. The BOM is usually required as it lets the decoder know if the data was encoded in little endian or big endian order. The BOM is really just the U+FEFF ZERO WIDTH NO-BREAK SPACE codepoint, which is encoded to '11111111111111100000000000000000' (little-endian) in your example.
Encode to one of the two endian-specific variants Python provides ('utf-32-le' or 'utf-32-be') to get a single character:
>>> Bits(bytes = 'a'.encode('utf-32-le')).bin
'01100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32-le')).bin)
32
The -le and -be variants let you encode or decode UTF-32 without a BOM, because you explicitly set the byte order.
Had you encoded more than one character, you'd have noticed that there are always 4 bytes more than the number of characters would require:
>>> len('abcd'.encode('utf-32')) # (BOM + 4 chars) * 4 bytes == 20 bytes
20

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

What is this hexadecimal in the utf16 format? - python

Related

'UTF-8' decoding error while using unireedsolomon package

Bytes operations in Python

How to extract numerical values after expression '\xb' in byte string

How to get line of hex codes (\x) to hex address (0x) in Python?

Length of a single character encoded in UTF-32

Categories

Resources