Python - Reading a UTF-8 encoded string byte-by-byte

I have a device that returns a UTF-8 encoded string. I can only read from it byte-by-byte and the read is terminated by a byte of value 0x00.
I'm writing a Python 2.7 function that others can use to access my device and get the string back.
In a previous design when the device just returned ASCII, I used this in a loop:
x = read_next_byte()
if x == 0:
    break
my_string += chr(x)
Where x is the latest byte value read from the device.
Now the device can return a UTF-8 encoded string, but I'm not sure how to convert the bytes that I get back into a UTF-8 encoded string/unicode.
chr(x) understandably causes an error when x > 127, so I thought using unichr(x) might work, but that assumes the value passed is a full Unicode code point, whereas I only have one byte (0-255) of a character at a time.
So how can I convert the bytes that I get back from the device into a string that can be used in Python and still handle the full UTF-8 string?
Likewise, if I was given a UTF-8 string in Python, how would I break that down into individual bytes to send to my device and still maintain UTF-8?

The correct solution is to read until you hit the terminating byte, then decode all of the accumulated bytes as UTF-8 at that point (so you have complete characters):
mybytes = bytearray()
while True:
    x = read_next_byte()
    if x == 0:
        break
    mybytes.append(x)
my_string = mybytes.decode('utf-8')
The above is the most direct translation of your original code. Interestingly, this is one of those cases where the two-argument form of iter() can dramatically simplify the code: it turns your C-style stateful byte-reader function into a Python iterator, letting you do the work in one line:
# If this were Python 3 code, you'd use the bytes constructor instead of bytearray
my_string = bytearray(iter(read_next_byte, 0)).decode('utf-8')
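For the reverse direction asked about in the question, a minimal sketch (the send_byte() function is a hypothetical counterpart to read_next_byte()): encode the text to UTF-8 first, then send the raw byte values followed by the 0x00 terminator:
def send_string(my_string, send_byte):
    # Encode the unicode string to UTF-8 bytes first...
    for b in bytearray(my_string.encode('utf-8')):
        send_byte(b)  # ...then send each byte value (0-255) individually
    send_byte(0)      # terminate with the 0x00 sentinel
Note that no byte of a UTF-8-encoded character other than U+0000 itself is ever 0x00, so the terminator cannot collide with the payload.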

Related

<bytes> to escaped <str> Python 3

Currently, I have Python 2.7 code that receives <str> objects over a socket connection. All across the code we use <str> objects and comparisons. In an effort to convert to Python 3, I've found that socket connections now return <bytes> objects, which requires us to change all literals to the b'abc' form for literal comparisons and so on. This is a lot of work, and although it is apparent why this change was made in Python 3, I am curious whether there are any simpler workarounds.
Say I receive <bytes> b'\xf2a27' over a socket connection. Is there a simple way to convert these <bytes> into a <str> object with the same escapes in Python 3.6? I have looked into some solutions myself to no avail.
a = b'\xf2a27'.decode('utf-8', errors='backslashreplace')
Above yields '\\xf2a27' with len(a) = 7 instead of the original len(b'\xf2a27') = 4. Indexing is wrong too; this just won't work, but it seems to be headed down the right path.
a = b'\xf2a27'.decode('latin1')
Above yields 'òa27' which contains Unicode characters that I would like to avoid. Although in this case len(a) = 5 and comparisons like a[0] == '\xf2' work, but I'd like to keep the information escaped in representation if possible.
Is there perhaps a more elegant solution that I am missing?
You really have to think about what the data you receive represents and Python 3 makes a strong point in that direction. There's an important difference between a string of bytes that actually represent a collection of bytes and a string of (abstract, unicode) characters.
You may have to consider each piece of data individually, since different pieces can have different representations.
Let's take your example of b'\xf2a27', which in the raw form you receive from the socket is just a string of 4 bytes: 0xf2, 0x61, 0x32, 0x37 in hex, or 242, 97, 50, 55 in decimal.
Let's say you actually want 4 bytes out of that. You could either keep it as a byte string or convert it into a list or tuple of bytes if that serves you better:
raw_bytes = b'\xf2a27'
list_of_bytes = list(raw_bytes)
tuple_of_bytes = tuple(raw_bytes)
if raw_bytes == b'\xf2a27':
    pass
if list_of_bytes == [0xf2, 0x61, 0x32, 0x37]:
    pass
if tuple_of_bytes == (0xf2, 0x61, 0x32, 0x37):
    pass
Let's say this actually represents a 32-bit integer, in which case you should convert it into a Python int. Choose whether it is encoded in little or big endian byte order, and make sure you pick correctly between signed and unsigned.
import struct

raw_bytes = b'\xf2a27'
# struct.unpack and int.from_bytes are two equivalent ways to do each conversion
signed_little_endian, = struct.unpack('<i', raw_bytes)
signed_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=True)
unsigned_little_endian, = struct.unpack('<I', raw_bytes)
unsigned_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=False)
signed_big_endian, = struct.unpack('>i', raw_bytes)
signed_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=True)
unsigned_big_endian, = struct.unpack('>I', raw_bytes)
unsigned_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=False)

if signed_little_endian == 926048754:
    pass
Let's say it's actually text. Think about what encoding it comes in. In your case it cannot be UTF-8: the 0xf2 byte would have to start a multi-byte sequence, but the bytes that follow it are not valid continuation bytes, so the data cannot be decoded as UTF-8. If it's latin1 a.k.a. iso8859-1 and you're sure about it, that's fine.
raw_bytes = b'\xf2a27'
character_string = raw_bytes.decode('iso8859-1')
if character_string == '\xf2a27':
    pass
If your choice of encoding was correct, having a '\xf2' or 'ò' character inside the string is also correct. It's still a single character: 'ò', '\xf2', '\u00f2' and '\U000000f2' are just four different ways to write the same single character in a (unicode) string literal. Also, the len will be 4, not 5.
print(ord(character_string[0])) # will be 242
print(hex(ord(character_string[0]))) # will be 0xf2
print(len(character_string)) # will be 4
If you actually observed a length of 5, you may have measured it at the wrong point, perhaps after encoding the character string to UTF-8, or after it was implicitly encoded to UTF-8 by printing to a UTF-8 terminal.
Note the difference of the number of bytes output to the shell when changing the default I/O encoding:
PYTHONIOENCODING=UTF-8 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
# will output 5
PYTHONIOENCODING=latin1 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
# will output 4
Ideally, you should perform your comparisons after converting the raw bytes to the correct data type they represent. That makes your code more readable and easier to maintain.
As a general rule of thumb, you should always convert raw bytes to their actual (abstract) data type as soon as you receive them. Then keep it in that abstract data type for processing as long as possible. If necessary, convert it back to some raw data on output.
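A minimal sketch of that rule of thumb (the socket handling and the iso8859-1 choice here are illustrative assumptions, not part of the question):
def handle_message(sock):
    raw = sock.recv(1024)                  # raw bytes arrive at the boundary
    text = raw.decode('iso8859-1')         # decode once, immediately (assumed encoding)
    reply = text.upper()                   # all processing works on the abstract str type
    sock.send(reply.encode('iso8859-1'))   # encode once, on the way back out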

Comparing special characters in Python

I have a string whose value is 'Opérations'. My script reads a file and does some comparisons. When comparing strings, the string that I copied from the source and placed in my Python script does NOT equal the same string read from the file in my script. Printing both strings gives me 'Opérations'. However, when I encode them to UTF-8 I see the difference.
b'Ope\xcc\x81rations'
b'Op\xc3\xa9rations'
My question is what do I do to ensure that the special character in my python script is the same as the file content's when comparing such strings.
Good to know:
You are dealing with two types of strings: byte strings and unicode strings. Each has a method to convert it to the other type. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. That is:
unicode.encode() ----> bytes
and
bytes.decode() -----> unicode
UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes per code point: the higher the code point value, the more bytes it needs.
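For instance, a small sketch (Python 2, matching the rest of this answer):
print len(u'e'.encode('utf-8'))       # 1 byte  (U+0065, ASCII)
print len(u'\xe9'.encode('utf-8'))    # 2 bytes (U+00E9, e with acute accent)
print len(u'\u20ac'.encode('utf-8'))  # 3 bytes (U+20AC, the euro sign)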
Get to the point:
If you redefine your strings as two byte strings and two unicode strings, as follows:
a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\xcc\x81rations'
and
b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xc3\xa9rations'
you'll see:
print 'a_byte length is: ', len(a_byte.decode("utf-8"))
#print 'a_unicode length is: ', len(a_unicode.encode("utf-8"))
print 'b_byte length is: ', len(b_byte.decode("utf-8"))
#print 'b_unicode length is: ', len(b_unicode.encode("utf-8"))
output:
a_byte length is: 11
b_byte length is: 10
So you see they are not the same.
My solution:
If you don't want to be misled by the output, use repr(). While print a_byte, b_byte prints Opérations for both, this:
print repr(a_byte),repr(b_byte)
will return:
'Ope\xcc\x81rations','Op\xc3\xa9rations'
You can also normalize the unicode before comparison, as in @Daniel's answer:
from unicodedata import normalize
from functools import partial

a_byte = 'Opérations'              # byte string (UTF-8 encoded in the source file)
norm = partial(normalize, 'NFC')   # NFC composes base char + combining accent into one code point
your_string = norm(a_byte.decode('utf8'))
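To see why this fixes the comparison, a minimal sketch using the two byte strings from the question: both decode to strings that print identically, but they only compare equal after normalizing both to the same form:
from unicodedata import normalize

nfd = b'Ope\xcc\x81rations'.decode('utf-8')   # 'e' + combining acute accent (decomposed)
nfc = b'Op\xc3\xa9rations'.decode('utf-8')    # precomposed 'e with acute'

print nfd == nfc                                      # False: different code points
print normalize('NFC', nfd) == normalize('NFC', nfc)  # True: same after normalization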

Python convert strings of bytes to byte array

For example, given an arbitrary string, which could be chars or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course, just typing a b prefix in front of the string literal would do, but I want to do this at runtime, from a variable containing the string of bytes.
If the given string were 'AAAA' or some other known characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding
The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) doesn't help either: in Python 3 it is simply equivalent to .encode('utf-8'), which produces exactly the unwanted result shown in the question.
I found a working solution
import struct

def convert_string_to_bytes(string):
    result = b''
    for ch in string:
        result += struct.pack("B", ord(ch))  # one byte per code point; fails if ord(ch) > 255
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'
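For what it's worth, this loop is just a slower spelling of the Latin-1 trick above; a quick sanity check (using the convert_string_to_bytes() function defined above):
s = '\xf0\x9f\xa4\xb1'
assert convert_string_to_bytes(s) == s.encode('latin-1')  # same bytes either way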

Validate that a stream of bytes is valid UTF-8 (or other encoding) without copy

This is perhaps a micro-optimization, but I would like to check that a stream of given bytes is valid UTF-8 as it passes through my application, but I don't want to keep the resulted decoded code points. In other words, if I were to call large_string.decode('utf-8'), assuming the encoding succeeds I have no desire to keep the unicode string returned by decoding, and would prefer not to waste memory on it.
There are various ways I could do this, for example read a few bytes at a time, attempt to decode(), then append more bytes until decode() succeeds (or I've exhausted the maximum number of bytes for a single character in the encoding). But ISTM it should be possible to use the existing decoder in a way that simply throws away the decoded unicode characters and not have to roll my own. But nothing immediately comes to mind scouring the stdlib docs.
You can use the incremental decoder provided by the codecs module:
import codecs

utf8_decoder = codecs.getincrementaldecoder('utf8')()
This is an IncrementalDecoder instance. You can then feed this decoder data in order and validate the stream:
# for each partial chunk of data:
try:
    utf8_decoder.decode(chunk)
except UnicodeDecodeError:
    pass  # invalid data; reject the stream here
The decoder returns the data decoded so far (minus partial multi-byte sequences, those are kept as state for the next time you decode a chunk). Those smaller strings are cheap to create and discard, you are not creating a large string here.
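Putting that together, a minimal sketch (the validate_utf8_stream() name and the chunks iterable are illustrative); note the last decode() call with final=True, which catches a stream that ends in the middle of a multi-byte sequence:
import codecs

def validate_utf8_stream(chunks):
    decoder = codecs.getincrementaldecoder('utf8')()
    try:
        for chunk in chunks:
            decoder.decode(chunk)        # small decoded pieces are discarded immediately
        decoder.decode(b'', final=True)  # reject a truncated trailing multi-byte sequence
    except UnicodeDecodeError:
        return False
    return True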
You can't start the above loop in the middle of a stream, because UTF-8 uses a variable number of bytes per character; a chunk taken from the middle is liable to begin with invalid data (stray continuation bytes).
If you can't validate from the start, then your first chunk may start with up to three continuation bytes. You could just remove those first:
first_chunk = b'....'
for _ in range(3):
    if first_chunk[0] & 0xc0 == 0x80:
        # remove a leading continuation byte
        first_chunk = first_chunk[1:]
    else:
        break
Now, UTF-8 is structured enough that you could also validate the stream entirely in Python code using more such binary tests, but you are simply not going to match the speed of the built-in decoder.
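For illustration only, here is what such a hand-rolled structural check might look like (a sketch: it verifies lead/continuation byte patterns but, unlike the real decoder, does not reject overlong encodings or surrogate code points):
def looks_like_utf8(data):
    expected = 0  # continuation bytes still owed by the current sequence
    for byte in data:
        if expected:
            if byte & 0xc0 != 0x80:   # must be a 10xxxxxx continuation byte
                return False
            expected -= 1
        elif byte < 0x80:             # 0xxxxxxx: plain ASCII
            continue
        elif byte & 0xe0 == 0xc0:     # 110xxxxx: start of a 2-byte sequence
            expected = 1
        elif byte & 0xf0 == 0xe0:     # 1110xxxx: start of a 3-byte sequence
            expected = 2
        elif byte & 0xf8 == 0xf0:     # 11110xxx: start of a 4-byte sequence
            expected = 3
        else:
            return False
    return expected == 0              # False if the data ends mid-sequence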

Python-3.x - Converting a string representation of a bytearray back to a string

The back-story here is a little verbose, but basically I want to take a string like b'\x04\x0e\x1d' and cast it back into a bytearray.
I am working on a basic implementation of a one-time pad, where I take a plaintext A and a shared key B to generate a ciphertext C according to the equation A⊕B=C. Then I reverse the process with the equation C⊕B=A.
I've already found plenty of python3 functions to encode strings as bytes and then xor the bytes, such as the following:
def xor_strings(xs, ys):
    return "".join(chr(ord(x) ^ ord(y)) for x, y in zip(xs, ys)).encode()
A call to xor_strings() then returns a bytearray:
print( xor_strings("foo", "bar"))
But when I print it to the screen, what I'm shown is actually a string. So I'm assuming that python is just calling some str() function on the bytearray, and I get something that looks like the following:
b'\x04\x0e\x1d'
Herein lies the problem. I want to create a new bytearray from that string. Normally I would just call decode() on the bytearray. But if I enter b'\x04\x0e\x1d' as input, Python sees it as a string, not a bytearray!
How can I take a string like b'\x04\x0e\x1d' as user input and cast it back into a bytearray?
As discussed in the comments, use base64 to send binary data in text form.
import base64
def xor_strings(xs, ys):
    return "".join(chr(ord(x) ^ ord(y)) for x, y in zip(xs, ys)).encode()
# ciphertext is bytes
ciphertext = xor_strings("foo", "bar")
# >>> b'\x04\x0e\x1d'
# ciphertext_b64 is *still* bytes, but only "safe" ones (in the printable ASCII range)
ciphertext_b64 = base64.encodebytes(ciphertext)
# >>> b'BA4d\n'
Now we can transfer the bytes:
# ...we could interpret them as ASCII and print them somewhere
safe_string = ciphertext_b64.decode('ascii')
# >>> BA4d
# ...or write them to a file (or a network socket)
with open('/tmp/output', 'wb') as f:
f.write(ciphertext_b64)
And the recipient can retrieve the original message by:
# ...reading bytes from a file (or a network socket)
with open('/tmp/output', 'rb') as f:
ciphertext_b64_2 = f.read()
# ...or by reading bytes from a string
ciphertext_b64_2 = safe_string.encode('ascii')
# >>> b'BA4d\n'
# and finally decoding them into the original message
ciphertext_2 = base64.decodebytes(ciphertext_b64_2)
# >>> b'\x04\x0e\x1d'
Of course, when it comes to writing bytes to a file or to the network, encoding them as base64 first is superfluous; you can write/read the ciphertext directly if it is the only file content. Only if the ciphertext is part of a larger textual structure (JSON, XML, a config file...) does encoding it as base64 become necessary again.
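For instance, a minimal sketch of carrying the ciphertext inside JSON, reusing the ciphertext variable from above (the field name is an illustrative assumption):
import base64
import json

# bytes cannot go into JSON directly, so base64-encode to ASCII text first
payload = json.dumps({'ciphertext': base64.encodebytes(ciphertext).decode('ascii')})

# and recover the original bytes on the other side
ciphertext_again = base64.decodebytes(json.loads(payload)['ciphertext'].encode('ascii'))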
A note on the use of the words "decode" and "encode".
To encode a string means to turn it from its abstract meaning ("a list of characters") into a storable representation ("a list of bytes"). The exact result of this operation depends on the byte encoding that is being used. For example:
ASCII encoding maps one character to one byte (as a trade-off it can't map all characters that can exist in a Python string).
UTF-8 encoding maps one character to one to four bytes, depending on the character.
To decode a byte array means turning it from "a list of bytes" back into "a list of characters" again. This of course requires prior knowledge of what the byte encoding originally was.
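A small sketch of that round trip:
text = 'Bä'
data = text.encode('utf-8')   # 'B' becomes 1 byte, 'ä' becomes 2: b'B\xc3\xa4'
back = data.decode('utf-8')   # works only because we know the encoding was UTF-8
assert back == text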
ciphertext_b64 above is a list of bytes and is represented as b'BA4d\n' on the Python console.
Its string equivalent, safe_string, looks very similar ('BA4d\n') when printed to the console, because base64 output is a subset of ASCII.
The data types however are still fundamentally different. Don't let the console output deceive you.
Responding to that final question only.
>>> type(b'\x04\x0e\x1d')
<class 'bytes'>
>>> bytearray(b'\x04\x0e\x1d')
bytearray(b'\x04\x0e\x1d')
>>> type(bytearray(b'\x04\x0e\x1d'))
<class 'bytearray'>
