How to read UTF16-BE encoded bytes with length header - python

I want to decode a series of variable-length strings that have been encoded in UTF-16-BE, each preceded by a two-byte big-endian integer giving half the byte length of the string that follows (that is, the number of 16-bit code units), e.g.:
Length String (encoded) Length String (encoded) ...
\x00\x05 \x00H\x00e\x00l\x00l\x00o \x00\x06 \x00W\x00o\x00r\x00l\x00d\x00! ...
All these strings and their length headers are concatenated in one big bytestring.
I have the encoded bytestring as bytes object in memory. I would like to have an iterable function which would yield strings until it reaches the end of the ByteString.

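Not a huge improvement, but your code can be streamlined a bit.
import io
from typing import Iterator
def decode_strings(byte_string: bytes) -> Iterator[str]:
    with io.BytesIO(byte_string) as stream:
        # an empty read means the end of the input has been reached
        while (s := stream.read(2)):
            # the header counts 16-bit units, so the payload is twice as many bytes
            length = int.from_bytes(s, byteorder="big")
            yield stream.read(length * 2).decode("utf_16_be")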

Currently I do it like this, but somehow I'm imagining Raymond Hettinger's "There must be a better way!".
import io
import functools
from typing import ByteString
from typing import Iterable
# Decoders
int_BE = functools.partial(int.from_bytes, byteorder="big")
utf16_BE = functools.partial(bytes.decode, encoding="utf_16_be")
encoded_strings = b"\x00\x05\x00H\x00e\x00l\x00l\x00o\x00\x06\x00W\x00o\x00r\x00l\x00d\x00!"
header_length = 2
def decode_strings(byte_string: ByteString) -> Iterable[str]:
    stream = io.BytesIO(byte_string)
    while True:
        length = int_BE(stream.read(header_length))
        if length:
            text = utf16_BE(stream.read(length * 2))
            yield text
        else:
            break
    stream.close()
if __name__ == "__main__":
    for text in decode_strings(encoded_strings):
        print(text)
Thanks for any suggestions.
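The framing described in the question can also be walked with a plain offset and struct.unpack_from, without an intermediate stream. This is only a sketch of one alternative: the name iter_length_prefixed_utf16 is made up for illustration, and raising ValueError on truncated input is an assumption about the desired behaviour, not something the question specifies.

```python
import struct
from typing import Iterator


def iter_length_prefixed_utf16(buffer: bytes) -> Iterator[str]:
    """Yield each UTF-16-BE string from a buffer of (length, payload) records."""
    offset = 0
    while offset < len(buffer):
        if offset + 2 > len(buffer):
            raise ValueError("truncated length header")
        # ">H" is a big-endian unsigned 16-bit integer: the count of code units.
        (units,) = struct.unpack_from(">H", buffer, offset)
        offset += 2
        end = offset + units * 2  # two bytes per UTF-16 code unit
        if end > len(buffer):
            raise ValueError("truncated string payload")
        yield buffer[offset:end].decode("utf_16_be")
        offset = end


encoded = b"\x00\x05\x00H\x00e\x00l\x00l\x00o\x00\x06\x00W\x00o\x00r\x00l\x00d\x00!"
for text in iter_length_prefixed_utf16(encoded):
    print(text)
```

Unlike the loop in the question, this version treats a zero-length record as an empty string rather than as the end of the data.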

Related

Exclude escaped byte char from serial.read_until()

I'm writing code to communicate back and forth with a module over serial which returns specific byte values to indicate the start/end of its communication. The length of the data returned can vary as can all content between the start header and end footer.
In an ideal scenario, I'd be able to use the following code to receive all data from the module:
import serial
start = b'\x5a'
end = b'\x5b'
max_size = 1024
def get_from_serial(ser: serial.Serial) -> bytes:
    with ser:
        _ = ser.read_until(expected=start, size=max_size)
        data = ser.read_until(expected=end, size=max_size)
    return start + data
Unfortunately, there are circumstances where the data sent by the module includes bytes that match either the start or end byte values. In these instances, the module prepends an escape character to them:
valid_start = b'\x5a'
valid_end = b'\x5b'
escaped_start = b'\x5c\x5a'
escaped_end = b'\x5c\x5b'
A valid start/end byte can be preceded by ANY byte value other than an escape one:
good_result = b'\x5a\xff\x5c\x5b\xff\x5b'
bad_result = b'\x5a\xff\x5c\x5b' # missed b'\xff\x5b'
Is there a way to configure ser.read_until() to ignore any escaped instance of a start/end byte and only return when encountering a valid start/end byte?
There's probably a way to do this with a loop that checks if data[-2:-1] == b'\x5c' each time ser.read_until() returns something, though I feel it could get complicated if the module returns multiple escaped start/end bytes scattered throughout the data.
Any thoughts or suggestions would be greatly appreciated.
Edit:
I'm starting to think this isn't actually possible from inside ser.read_until(), so I have added a check before returning the data.
import serial
start = b'\x5a'
end = b'\x5b'
escape = b'\x5c'
max_size = 1024
def get_from_serial(ser: serial.Serial) -> bytes:
    with ser:
        _ = ser.read_until(expected=start, size=max_size)
        data = ser.read_until(expected=end, size=max_size)
    # the first read_until() consumed the start byte, so restore it before validating
    packet = start + data
    if valid_packet(packet):
        return packet
    raise Exception("Invalid packet")
def valid_packet(packet: bytes) -> bool:
    header = packet[:1]
    footer = packet[-1:]
    escape_check = packet[-2:-1]
    valid_header = header == start
    valid_footer = footer == end
    not_escaped = escape_check != escape
    return all([
        valid_header,
        valid_footer,
        not_escaped
    ])
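One way to handle escaped bytes scattered anywhere in the data is to stop using read_until() and read one byte at a time while tracking whether the previous byte was an escape. The following is a hedged sketch, not pySerial functionality: read_frame is a made-up helper, it accepts any object with a read(size) method (a serial.Serial port or, as here, io.BytesIO), and it assumes that a literal escape byte is itself sent escaped as b'\x5c\x5c', which the question does not state.

```python
import io

START = b"\x5a"
END = b"\x5b"
ESCAPE = b"\x5c"


def read_frame(stream, max_size: int = 1024) -> bytes:
    """Return one frame from START to the first unescaped END, escapes left in place."""
    # Phase 1: discard bytes until an unescaped START byte is seen.
    escaped = False
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("stream ended before a start byte")
        if byte == START and not escaped:
            break
        # A byte is escaped only if the previous byte was an unescaped ESCAPE.
        escaped = byte == ESCAPE and not escaped

    # Phase 2: collect bytes until an unescaped END byte is seen.
    frame = bytearray(START)
    escaped = False
    while len(frame) < max_size:
        byte = stream.read(1)
        if not byte:
            raise ValueError("stream ended before an end byte")
        frame += byte
        if byte == END and not escaped:
            return bytes(frame)
        escaped = byte == ESCAPE and not escaped
    raise ValueError("frame exceeds max_size")


good_result = b"\x5a\xff\x5c\x5b\xff\x5b"
print(read_frame(io.BytesIO(b"\x00" + good_result)) == good_result)
```

With a real port, a read timeout would make read(1) return an empty bytes object, which this sketch reports as a ValueError.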

How can I combine several base64 audio chunks (from a microphone)?

I get base64 chunks from a microphone.
I need to concatenate them and send them to the Google API as one base64 string for speech recognition. Roughly speaking, the first chunk encodes the word Hello and the second encodes world!. I need to glue the two chunks together, send them to the Google API as a single string, and receive Hello world! in response.
You can see Google Speech-to-Text as an example. Google also sends microphone data as a base64 string over websockets (see Network).
Unfortunately, I don't have a microphone at hand, so I can't check it, and it has to be done now.
Suppose I get
chunk1 = "TgvsdUvK ...."
chunk2 = "UZZxgh5V ...."
Do I understand correctly that it would be enough to just call
base64.b64encode(chunk1 + chunk2)
Or do I need to know something else? Unfortunately, everything hinges on not having a microphone.
Your example of encoding chunk1 + chunk2 wouldn't work, since base64 strings usually have padding at the end. If you just concatenate two base64 strings, the result generally can't be decoded correctly.
For example, the strings StringA and StringB, when their ascii or utf-8 representations are encoded in base64, are the following: U3RyaW5nQQ== and U3RyaW5nQg==. Each one of those can be decoded fine. But, if you concatenated them, your result would be U3RyaW5nQQ==U3RyaW5nQg==, which is invalid:
import base64
concatenated_b64_strings = 'U3RyaW5nQQ==U3RyaW5nQg=='
concatenated_b64_strings_bytes = concatenated_b64_strings.encode('ascii')
decoded_strings = base64.b64decode(concatenated_b64_strings_bytes)
print(decoded_strings.decode('ascii')) # just outputs 'StringA', which is incorrect
So, in order to take those two strings (which I'm using as an example in place of binary data) and concatenate them together, starting with only their base64 representations, you have to decode them:
import base64
string1_base64 = 'U3RyaW5nQQ=='
string2_base64 = 'U3RyaW5nQg=='
# need to convert the strings to bytes first in order to decode them
base64_string1_bytes = string1_base64.encode('ascii')
base64_string2_bytes = string2_base64.encode('ascii')
# now, decode them into the actual bytes the base64 represents
base64_string1_bytes_decoded = base64.decodebytes(base64_string1_bytes)
base64_string2_bytes_decoded = base64.decodebytes(base64_string2_bytes)
# combine the bytes together
combined_bytes = base64_string1_bytes_decoded + base64_string2_bytes_decoded
# now, encode these bytes as base64
combined_bytes_base64 = base64.encodebytes(combined_bytes)
# finally, decode these bytes so you're left with a base64 string:
combined_bytes_base64_string = combined_bytes_base64.decode('ascii')
print(combined_bytes_base64_string) # output: U3RyaW5nQVN0cmluZ0I=
# let's prove that it concatenated successfully (you wouldn't do this in your actual code)
base64_combinedstring_bytes = combined_bytes_base64_string.encode('ascii')
base64_combinedstring_bytes_decoded_bytes = base64.decodebytes(base64_combinedstring_bytes)
base64_combinedstring_bytes_decoded_string = base64_combinedstring_bytes_decoded_bytes.decode('ascii')
print(base64_combinedstring_bytes_decoded_string) # output: StringAStringB
In your case, you'd be combining more than just two input base64 strings, but the process is the same. Take all the strings, encode each one to ascii bytes, decode them via base64.decodebytes(), and then add them all together via the += operator:
import base64
input_strings = ['U3RyaW5nQQ==', 'U3RyaW5nQg==']
input_strings_bytes = [input_string.encode('ascii') for input_string in input_strings]
input_strings_bytes_decoded = [base64.decodebytes(input_string_bytes) for input_string_bytes in input_strings_bytes]
combined_bytes = bytes()
for decoded in input_strings_bytes_decoded:
    combined_bytes += decoded
combined_bytes_base64 = base64.encodebytes(combined_bytes)
combined_bytes_base64_string = combined_bytes_base64.decode('ascii')
print(combined_bytes_base64_string) # output: U3RyaW5nQVN0cmluZ0I=
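The same decode, join, re-encode steps can be condensed into a small helper. This is a sketch under the same assumptions as the answer above (each chunk is a complete, independently padded base64 string); the name combine_base64_chunks is illustrative. It uses base64.b64encode rather than encodebytes so the result has no embedded line breaks.

```python
import base64
from typing import Iterable


def combine_base64_chunks(chunks: Iterable[str]) -> str:
    """Decode each base64 chunk, join the raw bytes, and re-encode them once."""
    raw = b"".join(base64.b64decode(chunk) for chunk in chunks)
    return base64.b64encode(raw).decode("ascii")


print(combine_base64_chunks(["U3RyaW5nQQ==", "U3RyaW5nQg=="]))
```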

Encoding a file with ord function

I'm trying to encode a file and write the encoded output to a new file, but I got this error:
TypeError: ord() expected string of length 1, but int found
My code:
from sys import argv, exit
def encode(data):
    encoded = ''
    while data:
        current = data[0]
        count = 1
        for i in data[1:]:
            if i == current:
                count += 1
            else:
                break
            if count == 255:
                break
        encoded += '{}{}'.format(chr(ord(current) & 255), chr(count & 255)) #error occurs here.
        data = data[count:]
    return encoded
if __name__ == '__main__':
    if len(argv) < 2:
        print('Please specify input file!')
        exit(0)
    with open(argv[1], 'rb') as (f):
        data = f.read()
    with open(argv[1] + '.out', 'wb') as (f):
        f.write(encode(data))
Additional question: How do I decode the encoded file?
You are reading bytes (open(..., 'rb')), so when you take one element of the byte string you get an int, i.e. the byte's numeric value. That number already is the character code, so just leave out the ord. Alternatively, you could open the file without the b modifier (open(..., 'r')), which returns a string; I would advise keeping it as a byte string though, or you could run into encoding issues when parsing anything non-ASCII.
You will run into a similar problem when saving your file: you cannot write a string to a file opened with the b modifier. Since you have values outside the ASCII range (128 and above), writing them as a string is not a good idea: Python would encode the characters (e.g. as UTF-8) and you would end up with completely different bytes. The best solution is therefore probably not to concatenate your data into a string in the loop (the '{}{}'.format(...) part) but to collect the values in a list (encoded = [], then encoded.append(current) and encoded.append(count)) and convert it to a byte string with bytes(encoded) after the loop. You can then pass that to write without a problem.
As for decoding, open the file the same way you do for encoding, read two bytes b1 and b2 at a time, append [b1] * b2 to your output (again a list), and convert the result to a byte string with bytes().
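The bytes-based approach described above might look like the following. It is a sketch rather than a drop-in replacement: rle_encode and rle_decode are illustrative names, while the (value, count) pair order and the cap of 255 per run are taken from the question's code.

```python
def rle_encode(data: bytes) -> bytes:
    """Run-length encode data as (value, count) byte pairs, count capped at 255."""
    encoded = bytearray()
    i = 0
    while i < len(data):
        current = data[i]  # indexing bytes gives an int, so no ord() is needed
        count = 1
        while i + count < len(data) and data[i + count] == current and count < 255:
            count += 1
        encoded += bytes([current, count])
        i += count
    return bytes(encoded)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (value, count) byte pairs back into the original data."""
    decoded = bytearray()
    # Step through complete pairs; a dangling final byte is ignored.
    for i in range(0, len(encoded) - 1, 2):
        value, count = encoded[i], encoded[i + 1]
        decoded += bytes([value]) * count
    return bytes(decoded)


sample = b"aaab" + b"\xff" * 300
print(rle_decode(rle_encode(sample)) == sample)
```

Both functions take and return bytes, so their results can be passed straight to a file opened in 'wb' mode.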

Python String Prefix by 4 Byte Length

I'm trying to write a server in Python to communicate with a pre-existing client whose message packets are ASCII strings, but prepended by four-byte unsigned integer values representative of the length of the remaining string.
I've written a receiver, but I'm sure there's a more Pythonic way. Until I find it, I haven't written the sender. I can easily calculate the message length, convert it to bytes and transmit the message. The bit I'm struggling with is creating an integer that is an array of four bytes.
Let me clarify: If my string is 260 characters in length, I wish to prepend a big-endian four byte integer representation of 260. So, I don't want the ASCII string "0260" in front of the string, rather, I want four (non-ASCII) bytes representative of 0x00000104.
My code to receive the length prepended string from the client looks like this:
sizeBytes = 4 # size of the integer representing the string length
# receive big-endian 4 byte integer from client (this runs inside the receive loop)
data = conn.recv(sizeBytes)
if not data:
    break
dLen = 0
for i in range(sizeBytes):
    # each byte is worth 256 times the byte to its right
    dLen = dLen + pow(256, i) * data[sizeBytes-i-1]
data = str(conn.recv(dLen),'UTF-8')
I could simply do the reverse. I'm new to Python and feel that what I've done is probably longhand!!
1) Is there a better way of receiving and decoding the length?
2) What's the "sister" method to encode the length for transmission?
Thanks.
The struct module is helpful here.
For writing:
import struct
msg = 'some message containing 260 ascii characters'
length = len(msg)
encoded_length = struct.pack('>I', length)
For a 260-character message, encoded_length will be the 4-byte bytes object b'\x00\x00\x01\x04'.
For reading:
length = struct.unpack('>I', received_msg[:4])[0]
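Putting the two struct calls together for a blocking socket could look like this. It is a sketch, not the answerer's code: send_message, recv_exactly and recv_message are illustrative names, and the loop in recv_exactly reflects the fact that socket.recv() may return fewer bytes than requested. The demo uses socket.socketpair() only to stand in for a real client connection.

```python
import socket
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian unsigned length


def send_message(sock: socket.socket, text: str) -> None:
    """Send text as ASCII, prefixed with its length as a 4-byte big-endian integer."""
    payload = text.encode("ascii")
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exactly(sock: socket.socket, count: int) -> bytes:
    """Keep calling recv() until exactly count bytes have arrived."""
    received = bytearray()
    while len(received) < count:
        chunk = sock.recv(count - len(received))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        received += chunk
    return bytes(received)


def recv_message(sock: socket.socket) -> str:
    """Read one length-prefixed ASCII message."""
    (length,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return recv_exactly(sock, length).decode("ascii")


# Demo over a connected pair of local sockets.
left, right = socket.socketpair()
send_message(left, "x" * 260)
echoed = recv_message(right)
print(len(echoed))
left.close()
right.close()
```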
An example using asyncio:
import asyncio
import struct
def send_message(writer, message):
    data = message.encode()
    size = struct.pack('>L', len(data))
    writer.write(size + data)
async def receive_message(reader):
    data = await reader.readexactly(4)
    size = struct.unpack('>L', data)[0]
    data = await reader.readexactly(size)
    return data.decode()

Check if a string is encoded in base64 using Python

Is there a good way to check if a string is encoded in base64 using Python?
I was looking for a solution to the same problem, then a very simple one just struck me in the head. All you need to do is decode, then re-encode. If the re-encoded string is equal to the encoded string, then it is base64 encoded.
Here is the code:
import base64
def isBase64(s):
    try:
        return base64.b64encode(base64.b64decode(s)) == s
    except Exception:
        return False
That's it!
Edit: Here's a version of the function that works with both the string and bytes objects in Python 3:
import base64
def isBase64(sb):
    try:
        if isinstance(sb, str):
            # If there's any unicode here, an exception will be thrown and the function will return false
            sb_bytes = bytes(sb, 'ascii')
        elif isinstance(sb, bytes):
            sb_bytes = sb
        else:
            raise ValueError("Argument must be string or bytes")
        return base64.b64encode(base64.b64decode(sb_bytes)) == sb_bytes
    except Exception:
        return False
import base64
import binascii
try:
    base64.decodebytes(b"foo")
except binascii.Error:
    print("no correct base64")
This isn't possible. The best you could do would be to verify that a string might be valid Base 64, although many strings consisting of only ASCII text can be decoded as if they were Base 64.
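To illustrate the point about ordinary text decoding as if it were base64, here is a small sketch. The helper name decodes_as_base64 is made up for this example; it only reports whether base64.b64decode accepts the input in strict mode.

```python
import base64
import binascii


def decodes_as_base64(text: str) -> bool:
    """Return True if text is accepted by a strict base64 decode."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# Plain words whose length is a multiple of 4 pass just like real base64 does.
for candidate in ["test", "abcd", "Hello world!", "U3RyaW5nQQ=="]:
    print(candidate, decodes_as_base64(candidate))
```

Here "test" and "abcd" are accepted even though nobody encoded them, while "Hello world!" is rejected only because it contains characters outside the base64 alphabet.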
The solution I used is based on one of the prior answers, but uses more up to date calls.
In my code, the my_image_string is either the image data itself in raw form or it's a base64 string. If the decode fails, then I assume it's raw data.
Note the validate=True keyword argument to b64decode. It is required for the decoder to raise binascii.Error on characters outside the base64 alphabet; without it, such characters are silently discarded and no error is reported for them.
import base64, binascii
try:
    image_data = base64.b64decode(my_image_string, validate=True)
except binascii.Error:
    image_data = my_image_string
Using a Python regular expression:
import re
txt = "VGhpcyBpcyBlbmNvZGVkIHRleHQ="
x = re.search("^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$", txt)
if x:
    print("Encoded")
else:
    print("Non encoded")
Before trying to decode, I like to do a formatting check first, as it is the cheapest check and follows fail-fast coding principles. It does not reject well-formed, unwrapped standard base64, though it can still accept ordinary text that happens to fit the pattern.
Here is a utility function for this task:
import re
RE_BASE64 = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"
def likeBase64(s: str) -> bool:
    return s is not None and re.search(RE_BASE64, s) is not None
If the length of the encoded string is not a multiple of 4, it cannot be valid padded base64, so the length makes a cheap first check:
len(base64.encodebytes(b"whatever you say").strip()) % 4 == 0
So you can check the length first; a string that passes is more likely to decode, although decoding can still fail on other malformed input:
if len(the_base64string.strip()) % 4 == 0:
    # worth attempting to decode
    base64.b64decode(the_base64string)
@geoffspear is correct that this is not 100% possible, but you can get pretty close by checking whether the string's format matches that of a base64-encoded string (re: How to check whether a string is base64 encoded or not).
import re
# check if a string is base64 encoded.
def isBase64Encoded(s):
    pattern = re.compile("^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$")
    if not s:
        return False
    return bool(pattern.match(s))
Also note that in my case I wanted to return False if the string is empty, as there's no use in decoding nothing.
I know I'm almost 8 years late, but you can use a regular expression to verify whether a given input is base64.
import re
encoding_type = 'Encoding type: '
base64_encoding = 'Base64'
def is_base64():
    element = input("Enter encoded element: ")
    expression = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"
    matches = re.match(expression, element)
    if matches:
        print(f"{encoding_type + base64_encoding}")
    else:
        print("Unknown encoding type.")
is_base64()
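import base64
import binascii
def is_base64(s):
    # drop the line breaks that wrapped base64 text contains
    s = ''.join(line.strip() for line in s.split("\n"))
    try:
        # b64encode returns bytes, so convert back to str before comparing
        enc = base64.b64encode(base64.b64decode(s)).decode('ascii')
        return enc == s
    except (binascii.Error, ValueError):
        return False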
In my case, my input, s, had newlines which I had to strip before the comparison.
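import base64
import binascii
x = 'possibly base64 encoded string'
result = x
try:
    decoded = base64.b64decode(x, validate=True)
    # accept the decode only if re-encoding reproduces the input exactly
    if x == base64.b64encode(decoded).decode('ascii'):
        result = decoded
except (binascii.Error, ValueError):
    pass
This code puts the decoded bytes into result if x really is base64-encoded, and leaves result as x otherwise. Just trying to decode doesn't always work, because some strings that are not base64 still decode without an error.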
