Unpacking a struct ending with an ASCIIZ string - python

I am trying to use struct.unpack() to take apart a data record that ends with an ASCII string.
The record (it happens to be a TomTom ov2 record) has this format (stored little-endian):
1 byte
4 byte int for total record size (including this field)
4 byte int
4 byte int
variable-length string, null-terminated
unpack() requires that the string's length be included in the format you pass it. I can use the second field and the known size of the rest of the record -- 13 bytes -- to get the string length:
str_len = struct.unpack("<xi", record[:5])[0] - 13
fmt = "<biii{0}s".format(str_len)
then proceed with the full unpacking, but since the string is null-terminated, I really wish unpack() would do it for me. It'd also be nice to have this should I run across a struct that doesn't include its own size.
How can I make that happen?

I made two new functions that should be useable as drop-in replacements for the standard pack and unpack functions. They both support the 'z' character to pack/unpack an ASCIIZ string. There are no restrictions to the location or number of occurrences of the 'z' character in the format string:
import struct
def unpack(format, buffer):
    # Replace each 'z' with '<n>sx': an n-byte string plus one pad byte for the NUL
    while True:
        pos = format.find('z')
        if pos < 0:
            break
        asciiz_start = struct.calcsize(format[:pos])
        # Search with a bytes literal so this also works on Python 3
        asciiz_len = buffer[asciiz_start:].find(b'\0')
        format = '%s%dsx%s' % (format[:pos], asciiz_len, format[pos+1:])
    return struct.unpack(format, buffer)
def pack(format, *args):
    new_format = ''
    arg_number = 0
    for c in format:
        if c == 'z':
            # One extra byte so struct pads the string with a terminating NUL
            new_format += '%ds' % (len(args[arg_number]) + 1)
            arg_number += 1
        else:
            new_format += c
            if c in 'cbB?hHiIlLqQfdspP':
                arg_number += 1
    return struct.pack(new_format, *args)
Here's an example of how to use them:
>>> from struct_z import pack, unpack
>>> line = pack ('<izizi', 1, 'Hello', 2, ' world!', 3)
>>> print line.encode('hex')
0100000048656c6c6f000200000020776f726c64210003000000
>>> print unpack ('<izizi',line)
(1, 'Hello', 2, ' world!', 3)
>>>

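To see what the 'z' handling in the unpack wrapper above amounts to, here is a minimal Python 3 sketch that performs only the format expansion and returns the rewritten format string. The helper name `expand_z` is hypothetical (it is not part of the answer), and the sketch assumes every 'z' field really is NUL-terminated in the buffer.

```python
import struct

def expand_z(fmt, buf):
    """Rewrite each 'z' in fmt as '<n>sx', using buf to find each string length."""
    while True:
        pos = fmt.find('z')
        if pos < 0:
            return fmt
        # Offset of the string = size of everything before the 'z'
        start = struct.calcsize(fmt[:pos])
        # bytes.index raises ValueError if no NUL terminator is present
        length = buf.index(b'\0', start) - start
        fmt = '%s%dsx%s' % (fmt[:pos], length, fmt[pos + 1:])

line = bytes.fromhex('0100000048656c6c6f000200000020776f726c64210003000000')
expanded = expand_z('<izizi', line)
result = struct.unpack(expanded, line)
print(expanded)  # <i5sxi7sxi
print(result)    # (1, b'Hello', 2, b' world!', 3)
```

On Python 3 the strings come back as bytes objects, so decode them if text is needed.
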
The size-less record is fairly easy to handle, actually, since struct.calcsize() will tell you the length it expects. You can use that and the actual length of the data to construct a new format string for unpack() that includes the correct string length.
This function is just a wrapper for unpack(), allowing a new format character in the last position that will drop the terminal NUL:
import struct
def unpack_with_final_asciiz(fmt, dat):
    """
    Unpack binary data, handling a null-terminated string at the end
    (and only at the end) automatically.
    The first argument, fmt, is a struct.unpack() format string with the
    following modifications:
    If fmt's last character is 'z', the returned string will drop the NUL.
    If it is 's' with no length, the string including NUL will be returned.
    If it is 's' with a length, behavior is identical to normal unpack().
    """
    # Just pass on if no special behavior is required
    if fmt[-1] not in ('z', 's') or (fmt[-1] == 's' and fmt[-2].isdigit()):
        return struct.unpack(fmt, dat)
    # Use format string to get size of contained string and rest of record
    non_str_len = struct.calcsize(fmt[:-1])
    str_len = len(dat) - non_str_len
    # Set up new format string
    # If passed 'z', treat terminating NUL as a "pad byte"
    if fmt[-1] == 'z':
        str_fmt = "{0}sx".format(str_len - 1)
    else:
        str_fmt = "{0}s".format(str_len)
    new_fmt = fmt[:-1] + str_fmt
    return struct.unpack(new_fmt, dat)
>>> dat = b'\x02\x1e\x00\x00\x00z\x8eJ\x00\xb1\x7f\x03\x00Down by the river\x00'
>>> unpack_with_final_asciiz("<biiiz", dat)
(2, 30, 4886138, 229297, b'Down by the river')
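Another way to handle a trailing NUL-terminated string, sketched here under the assumption that the fixed part of the record is always "<biii", is to unpack only the header with struct and split the remainder at the first NUL. The names `HEADER` and `unpack_record` are hypothetical, chosen for illustration.

```python
import struct

# Fixed-size part of the record: 1 byte followed by three little-endian ints
HEADER = struct.Struct("<biii")

def unpack_record(dat):
    """Unpack the fixed header, then take the bytes up to the first NUL."""
    fields = HEADER.unpack_from(dat, 0)
    tail = dat[HEADER.size:]
    name, sep, _rest = tail.partition(b"\x00")
    if not sep:
        raise ValueError("no NUL terminator found")
    return fields + (name,)

dat = b'\x02\x1e\x00\x00\x00z\x8eJ\x00\xb1\x7f\x03\x00Down by the river\x00'
record = unpack_record(dat)
print(record)  # (2, 30, 4886138, 229297, b'Down by the river')
```

This does not need the record length at all, and it ignores anything that follows the terminator.
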

Related

Python Print Hex variable

I have a hex variable that I want to print as hex:
data = '\x99\x02'
print (data)
Result is: ™
I want Python to print 0x9902.
Thank you for your help
Please check this one.
data = r'\x99\x02'
a, b = [ x for x in data.split(r'\x') if x]
d = int(a+b, base=16)
print('%#x'%d)
You have to convert every char to its number with ord(char), convert every number to a two-digit hex value with '{:02x}'.format(), join these values into one string, and put '0x' in front.
data = '\x99\x02'
print('0x' + ''.join('{:02x}'.format(ord(char)) for char in data))
EDIT: The same but first string is converted to bytes using encode('raw_unicode_escape')
data = '\x99\x02'
print('0x' + ''.join('{:02x}'.format(code) for code in data.encode('raw_unicode_escape')))
and if you have already bytes then you don't have to encode()
data = b'\x99\x02'
print('0x' + ''.join('{:02x}'.format(code) for code in data))
BTW: In a similar way you can convert to binary using {:08b}
data = '\x99\x02'
print(''.join('{:08b}'.format(code) for code in data.encode('raw_unicode_escape')))
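If the data is already a bytes object on Python 3, the standard library can produce the hex digits directly. The following is a short sketch of three equivalent ways to get 0x9902; the variable names are only for illustration.

```python
import binascii

data = b'\x99\x02'

# bytes.hex() returns the hex digits without a prefix
via_hex = '0x' + data.hex()

# binascii.hexlify() returns bytes, so decode before concatenating
via_hexlify = '0x' + binascii.hexlify(data).decode('ascii')

# Reading the bytes as one big-endian integer also works, but any
# leading zero bytes would be dropped from the printed value
via_int = '{:#x}'.format(int.from_bytes(data, 'big'))

print(via_hex)      # 0x9902
print(via_hexlify)  # 0x9902
print(via_int)      # 0x9902
```
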

Convert String to hex and send via serial in Python

I want to convert the string 400AM49L01 to hexadecimal form (and then into bytes), b'\x34\x30\x30\x41\x4d\x34\x39\x4c\x30\x31', so I can write it with pySerial.
I already tried converting the elements of a list containing the individual hexadecimal values, like 0x34 (the character 4), into bytes, but this results in b'400AM49L01'.
import binascii
device = '400AM49L01'
device = device.encode()
device = bytes(device)
device = str(binascii.hexlify(device), 'ascii')
code = '0x'
text = []
count = 0
for i in device:
    if count % 2 == 0 and count != 0:
        text.append(code)
        code = '0x'
        count = 0
    code += i
    count += 1
text.append((code))
result = bytes([int(x, 0) for x in text])
Really looking forward to your help!
The following code should give you the hex representation of the string:
my_str = '400AM49L01'
"".join(hex(ord(c)) for c in my_str).encode()
# Output (on Python 3, where .encode() returns bytes)
# b'0x340x300x300x410x4d0x340x390x4c0x300x31'
What is it doing?
In order to convert a string to hex, you need to convert each character to the integer value from the ascii table using ord().
Convert each int value to hex using the function hex().
Concatenate all hex value generated using join().
Encode the str to bytes using .encode().
Regards!
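It may help to note that b'400AM49L01' and the hex-escaped form are the same bytes object: Python simply displays printable ASCII bytes as characters. The sketch below checks this, assuming the device expects the plain ASCII bytes of the string; the name `payload` is only for illustration.

```python
device = '400AM49L01'

# Encoding the text as ASCII already yields the bytes 0x34 0x30 0x30 ...
payload = device.encode('ascii')

print(payload)        # b'400AM49L01'
print(payload == b'\x34\x30\x30\x41\x4d\x34\x39\x4c\x30\x31')  # True
print(payload.hex())  # 343030414d34394c3031
```

Under that assumption, `payload` can be passed as-is to a function that writes bytes to the serial port.
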

Convert a 8bit list to a 32 bit integer array in python

What I have:
textdata = "this is my test data"
DataArray = [ord(c) for c in textdata]
Now I want to transform this into 32-bit integers by combining 4 elements of the list at a time.
Example: DataArray[0:4] would become one 32-bit integer, the next 4 elements would become the next one, and so on. In the end, I would have an array of 32-bit integers holding all the results.
How can I do this in Python without iterating over the whole string? Is there a simple way to do this?
Using numpy:
>>> import numpy as np
>>> a = np.frombuffer(b'this is my test data', dtype=np.int32)
>>> a
array([1936287860, 544434464, 1948285293, 544502629, 1635017060], dtype=int32)
>>> a.tobytes()
b'this is my test data'
Use '<i4' or similar as dtype for endianness that's portable between machines.
I'm assuming that you can keep your initial data as bytes rather than unicode, because you really should try hard to do that.
As long as the length of your string is a multiple of 4, you can use NumPy in a very efficient way:
import numpy as np
data = np.fromstring(textdata, dtype='>u4')
# array([1952999795, 543781664, 1836654708, 1702065184, 1684108385])
'>u4' means 'big-endian unsigned 4-byte integer'.
Edit: If you use NumPy >= 1.14, then np.fromstring is deprecated, and the right way to process your text is by calling np.frombuffer(textdata.encode(), dtype='>u4').
You can use the built-in struct module:
from struct import unpack
textdata = "this is my test data"
# unpack() needs bytes on Python 3, so encode the text first
data = list(unpack('i'*(len(textdata)//4), textdata.encode()))
Result:
[1936287860, 544434464, 1948285293, 544502629, 1635017060]
You won't need to iterate over the string and you can find other Format Characters if you want to use unsigned integers for example.
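The 'i' format above uses the machine's native byte order, so the result can differ between platforms. As a small sketch, an explicit '<' or '>' prefix and a repeat count make the byte order part of the format; the variable names here are only for illustration.

```python
import struct

data = b"this is my test data"
count = len(data) // 4  # number of 4-byte groups

# '<' forces little-endian, '>' forces big-endian; the number is a repeat count
little = list(struct.unpack('<%di' % count, data))
big = list(struct.unpack('>%dI' % count, data))

print(little)  # [1936287860, 544434464, 1948285293, 544502629, 1635017060]
print(big)     # [1952999795, 543781664, 1836654708, 1702065184, 1684108385]
```
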
You could use something like the following, which uses bit manipulation (big-endian):
def chunk2int(chunk):
    """ Converts a chunk (string) into an int, 8 bits per character """
    val = 0
    for c in chunk:
        val = (val << 8) | (ord(c) & 0xFF)
    return val
def int2chunk(val):
    """ Converts an int into a chunk, consuming 8 bits per character """
    rchunk = []
    while val:
        rchunk.append(val & 0xFF)
        val >>= 8
    return ''.join(chr(c) for c in reversed(rchunk))
textdata = "this is my test data"
chunks = [textdata[i:i + 4] for i in range(0, len(textdata), 4)]
print(chunks)
data = [chunk2int(c) for c in chunks]
print(data)
chunks = [int2chunk(d) for d in data]
print(chunks)
Produces:
['this', ' is ', 'my t', 'est ', 'data']
[1952999795, 543781664, 1836654708, 1702065184, 1684108385]
['this', ' is ', 'my t', 'est ', 'data']
If you're using characters with 1 <= ord(c) <= 255 in your input text, this will work. If there are null bytes in your string, the int2chunk method may terminate early, in which case you'd have to pad the chunks.
There's also the struct module, which may be worth looking into, and where you can change the endianness much more simply.
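On Python 3, the same big-endian packing can be sketched with int.from_bytes and int.to_bytes, assuming the data is kept as bytes. Because to_bytes writes a fixed width, leading NUL bytes survive the round trip, which avoids the early-termination issue mentioned above.

```python
data = b"this is my test data"

# Split into 4-byte chunks and read each one as a big-endian integer
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
values = [int.from_bytes(chunk, 'big') for chunk in chunks]

# Convert back; the fixed width of 4 preserves any leading zero bytes
restored = [value.to_bytes(4, 'big') for value in values]
with_nuls = int.from_bytes(b'\x00\x00ab', 'big').to_bytes(4, 'big')

print(values)     # [1952999795, 543781664, 1836654708, 1702065184, 1684108385]
print(restored)   # [b'this', b' is ', b'my t', b'est ', b'data']
print(with_nuls)  # b'\x00\x00ab'
```
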

python Incorrect formatting Cyrillic

def inp(text):
    tmp = str()
    arr = ['.' for x in range(1, 40 - len(text))]
    tmp += text + ''.join(arr)
    print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
Output:
tester.................................
om.....................................
sup....................................
jope...................................
тестер...........................
ом...................................
суп.................................
жопа...............................
Why is Python not handling Cyrillic properly? The right edge of the output is ragged instead of straight. Using string formatting gives the same result. How can this be corrected? Thanks.
Read this:
http://docs.python.org/2/howto/unicode.html
Basically, what you have in the text parameter of the inp function is a string. In Python 2.7, strings are bytes by default. Cyrillic characters are not mapped 1-to-1 to bytes when encoded in, e.g., UTF-8; each one requires more than one byte (usually 2 in UTF-8), so len(text) gives you the number of bytes, not the number of characters.
In order to get the number of characters, you need to know your encoding. Assuming it's UTF-8, you can decode text from that encoding and it will print right:
#!/usr/bin/python
# coding=utf-8
def inp(text):
    tmp = str()
    utext = text.decode('utf-8')
    l = len(utext)
    arr = ['.' for x in range(1, 40 - l)]
    tmp += text + ''.join(arr)
    print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
The important lines are these two:
utext = text.decode('utf-8')
l = len(utext)
where you first decode the text, which results in a unicode string. After that, you can use the built-in len to get the length in characters, which is what you want.
Hope this helps.
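For comparison, here is a minimal Python 3 sketch of the same padding. In Python 3, str is already Unicode, so len() and str.ljust() count characters rather than bytes and no decoding is needed. The function name `pad` and the width of 39 (the line width the original loop produces) are assumptions for illustration.

```python
def pad(text, width=39):
    """Pad text on the right with dots up to width characters."""
    return text.ljust(width, '.')

words = ['tester', 'om', 'sup', 'тестер', 'ом', 'суп']
lines = [pad(word) for word in words]

# Every padded line has the same number of characters
print([len(line) for line in lines])  # [39, 39, 39, 39, 39, 39]
```

Printing each `pad(word)` then gives a straight right edge, as long as every character occupies one column in the terminal; wide or combining characters would still need extra handling.
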

Howto Remove Garbage Data from String

I'm in a situation where I have to use Python to read and write to an EEPROM on an embedded device. The first page (256 bytes) is used for non-volatile data storage. My problem is that the variables can vary in length, and I need to read a fixed amount.
For example, a string is stored at address 30 and can be anywhere from 6 to 10 bytes in length. I need to read the maximum possible length, because I don't know where it ends. As a result, I get excess garbage at the end of the string.
data_str = ee_read(bytecount)
dbgmsg("Reading from EEPROM: addr = " + str(addr_low) + " value = " + str(data_str))
> Reading from EEPROM: addr = 30 value = h11c13����
I am fairly new to Python. Is there a way to automatically chop off that data in the string after it's been read in?
Do you mean something like:
>>> s = 'Reading from EEPROM: addr = 30 value = h11c13����'
>>> s
'Reading from EEPROM: addr = 30 value = h11c13\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd'
>>> filter(lambda x: ord(x)<128,s)
'Reading from EEPROM: addr = 30 value = h11c13'
On python3, you'll need to join the result back into a string:
''.join(filter(lambda x: ord(x)<128,s))
A version which works for python2 and python3 would be:
''.join(x for x in s if ord(x) < 128)
Finally, it is conceivable that the excess garbage could contain printing characters. In that case you might want to take only characters until you read a non-printing character; itertools.takewhile could be helpful...
import string # string.printable is available on both python2 and python3
from itertools import takewhile
printable = set(string.printable)
''.join(takewhile(lambda x: x in printable, s))
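If the EEPROM read returns raw bytes on Python 3, the same takewhile idea can be applied before decoding. This is only a sketch: the function name `clean` is hypothetical, and the sample value assumes the garbage consists of 0xFF bytes, which the question does not confirm.

```python
import string
from itertools import takewhile

# Byte values (ints) of all printable ASCII characters
PRINTABLE = set(string.printable.encode('ascii'))

def clean(raw):
    """Keep bytes up to the first non-printable one and decode them as ASCII."""
    return bytes(takewhile(lambda b: b in PRINTABLE, raw)).decode('ascii')

raw = b'h11c13\xff\xff\xff\xff'  # assumed sample: value followed by garbage
value = clean(raw)
print(value)  # h11c13
```

Cutting at the first non-printable byte avoids decoding the garbage at all, so no replacement characters end up in the string.
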
