How does Python store datetime internally?

I found _datetimemodule.c which seems to be the right file, but I need a bit of help as C is not my strength.
>>> import datetime
>>> import sys
>>> d = datetime.datetime.now()
>>> sys.getsizeof(d)
48
>>> d = datetime.datetime(2018, 12, 31, 23, 59, 59, 123)
>>> sys.getsizeof(d)
48
So a timezone-unaware datetime object needs 48 bytes. Looking at PyDateTime_DateTimeType, it seems to be a PyDateTime_DateType and a PyDateTime_TimeType. Maybe also _PyDateTime_BaseTime?
From looking at the code, I have the impression that one component is stored for each field in YYYY-mm-dd HH:MM:ss, meaning:
Year: e.g. int16_t (16 bits)
Month: e.g. int8_t
Day: e.g. int8_t
Hour: e.g. int8_t
Minute: e.g. int8_t
Second: e.g. int8_t
Microsecond: e.g. uint16_t
But that would be 2 * 16 + 5 * 8 = 72 bits = 9 bytes, not the 48 bytes Python reports.
Where is my assumption about the internal structure of datetime wrong? How can I see this in the code?
(I guess this might differ between Python implementations - if so, please focus on CPython.)

You're missing a key part of the picture: the actual datetime struct definitions, which lie in Include/datetime.h. There are also important comments in there. Here are some key excerpts:
/* Fields are packed into successive bytes, each viewed as unsigned and
 * big-endian, unless otherwise noted:
 *
 * byte offset
 *  0      year     2 bytes, 1-9999
 *  2      month    1 byte,  1-12
 *  3      day      1 byte,  1-31
 *  4      hour     1 byte,  0-23
 *  5      minute   1 byte,  0-59
 *  6      second   1 byte,  0-59
 *  7      usecond  3 bytes, 0-999999
 * 10
 */
...
/* # of bytes for year, month, day, hour, minute, second, and usecond. */
#define _PyDateTime_DATETIME_DATASIZE 10
...
/* The datetime and time types have hashcodes, and an optional tzinfo member,
* present if and only if hastzinfo is true.
*/
#define _PyTZINFO_HEAD \
PyObject_HEAD \
Py_hash_t hashcode; \
char hastzinfo; /* boolean flag */
...
/* All datetime objects are of PyDateTime_DateTimeType, but that can be
* allocated in two ways too, just like for time objects above. In addition,
* the plain date type is a base class for datetime, so it must also have
* a hastzinfo member (although it's unused there).
*/
...
#define _PyDateTime_DATETIMEHEAD \
_PyTZINFO_HEAD \
unsigned char data[_PyDateTime_DATETIME_DATASIZE];
typedef struct
{
_PyDateTime_DATETIMEHEAD
} _PyDateTime_BaseDateTime; /* hastzinfo false */
typedef struct
{
_PyDateTime_DATETIMEHEAD
unsigned char fold;
PyObject *tzinfo;
} PyDateTime_DateTime; /* hastzinfo true */
Additionally, note the following lines in Modules/_datetimemodule.c:
static PyTypeObject PyDateTime_DateTimeType = {
PyVarObject_HEAD_INIT(NULL, 0)
"datetime.datetime", /* tp_name */
sizeof(PyDateTime_DateTime), /* tp_basicsize */
That tp_basicsize line says sizeof(PyDateTime_DateTime), not sizeof(_PyDateTime_BaseDateTime), and the type doesn't implement any special __sizeof__ handling. That means the datetime.datetime type reports its instance size as the size of a time-zone aware datetime, even for unaware instances.
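You can check the reported instance size from Python itself, since a type's __basicsize__ attribute exposes tp_basicsize (the value below is from a typical 64-bit build):
>>> import datetime
>>> datetime.datetime.__basicsize__
48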
The 48-byte count you're seeing breaks down as follows:
8-byte refcount
8-byte type pointer
8-byte cached hash
1-byte "hastzinfo" flag
10-byte manually packed unsigned char[10] containing datetime data
1-byte "fold" flag (DST-related)
4-byte padding, to align the tzinfo pointer
8-byte tzinfo pointer
This is true even though the actual memory layout of your unaware instance doesn't have a fold flag or tzinfo pointer.
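You can confirm that naive and aware instances report the same size:
>>> import sys, datetime
>>> naive = datetime.datetime(2018, 12, 31, 23, 59, 59, 123)
>>> aware = naive.replace(tzinfo=datetime.timezone.utc)
>>> sys.getsizeof(naive), sys.getsizeof(aware)
(48, 48)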
This is, of course, all implementation details. It may be different on a different Python implementation, or a different CPython version, or a 32-bit CPython build, or a CPython debug build (there's extra stuff in the PyObject_HEAD when CPython is compiled with Py_TRACE_REFS defined).


C++ byte array (struct) interpreted by Python

I am trying to pass a C++ struct from my arduino to my raspberry pi. I have a struct that looks like this:
struct node_status
{
char *node_type = "incubator";
char *sub_type; // set the type of incubator
int sub_type_id;
bool sleep = false; // set to sleep
int check_in_time = 1000; // set check in time
bool LOCK = false; // set if admin control is true/false
} nodeStatus;
I tried using the python module named struct
from struct import *
print("Rcvd Node Status msg from 0{:o}".format(header.from_node))
print("node_type: {}".format(unpack("10s",payload[0]))) #node_type
node_type = unpack("10s",payload[0])
print("sub_type: {}".format(unpack("10s",payload[1]), header.from_node)) #sub_type
sub_type = unpack("10s",payload[1])
print("sub_type_id: {}".format(unpack("b",payload[2])))
sub_type_id = unpack("b",payload[2])
print("sleep: {}".format(unpack("?",payload)[3])) #sleep
sleep = unpack("?",payload[3])
print("check_in_time: {}".format(unpack("l",payload[4]))) #check_in_time
check_in_time = unpack("l",payload[4])
print("Lock: {}".format(unpack("?",payload[5]))) #LOCK
Lock = unpack("?",payload[5])
but I am not having much luck. I was even looking at just using the ctypes module, but that didn't seem to go anywhere either:
from ctypes import *
class interpret_nodes_status(Structure):
    _fields_ = [('node_type', c_char_p),
                ('sub_type', c_char_p),
                ('sub_type_id', c_int),
                ('sleep', c_bool),
                ('check_in_time', c_int),
                ('LOCK', c_bool)]
nodestatus = interpret_nodes_status(payload)
but that just gives me an error
TypeError: bytes or integer address expected instead of bytearray instance
What can I do? WHERE am I going wrong with this?
EDIT:
I am using the RF24Mesh Library from
https://github.com/nRF24/RF24Mesh
The way I send the message is this:
RF24NetworkHeader header();
if (!mesh.write(&nodeStatus, /*type*/ 126, sizeof(nodeStatus), /*to node*/ 000))
{ // Send the data
if ( !mesh.checkConnection() )
{
Serial.println("Renewing Address");
mesh.renewAddress();
}
}
else
{
Serial.println("node status msg Sent");
return;
}
}
Your C program is just sending the struct, but the struct doesn't contain any of the string data. It only includes pointers (addresses) which are not usable by any other process (different address spaces).
You would need to determine a way to send all the required data, which would likely mean sending the length of each string and its data.
One way to do that would be to use a maximum length and just store the strings in your struct:
struct node_status
{
char node_type[48];
char sub_type[48]; // set the type of incubator
int sub_type_id;
bool sleep = false; // set to sleep
int check_in_time = 1000; // set check in time
bool LOCK = false; // set if admin control is true/false
} nodeStatus;
You would then need to copy strings into those buffers instead of assigning them, and check for buffer overflow. If the strings are ever entered by users, this has security implications.
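On the Python side, a struct with fixed-size arrays like this can then be decoded in a single call (a sketch; payload is the received byte buffer, and the format assumes a little-endian MCU with 4-byte int, 1-byte bool, and no padding in the struct layout, which you should verify against the actual bytes):
import struct

fmt = "<48s48si?i?"  # node_type, sub_type, sub_type_id, sleep, check_in_time, LOCK
fields = struct.unpack_from(fmt, payload)
node_type = fields[0].split(b"\0", 1)[0].decode()  # strip NUL padding from the char array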
Another approach is to pack the data into a single block just when you send it.
You could use multiple writes, as well, but I don't know this mesh library or how you would set the type parameter to do that. Using a buffer is something like:
// be sure to check for null on your strings, too.
int lennodetype = strlen(nodeStatus.node_type);
int lensubtype = strlen(nodeStatus.sub_type);
int bufsize = sizeof(nodeStatus) + lennodetype + lensubtype;
byte* buffer = new byte[bufsize];
int offset = 0;
memcpy(buffer+offset, &lennodetype, sizeof(int));
offset += sizeof(int);
memcpy(buffer+offset, nodeStatus.node_type, lennodetype * sizeof(char));
offset += lennodetype * sizeof(char);
memcpy(buffer+offset, &lensubtype, sizeof(int));
offset += sizeof(int);
memcpy(buffer+offset, nodeStatus.sub_type, lensubtype * sizeof(char));
offset += lensubtype * sizeof(char);
// this still copies the pointers, which aren't needed, but simplifies the code
// and 8 unused bytes shouldn't matter too much. You could adjust this line to
// eliminate it if you wanted.
memcpy(buffer+offset, &nodeStatus, sizeof(nodeStatus));
if (!mesh.write(buffer,
/*type*/ 126,
bufsize,
/*to node*/ 000))
{ // Send the data
if ( !mesh.checkConnection() )
{
Serial.println("Renewing Address");
mesh.renewAddress();
}
}
else
{
Serial.println("node status msg Sent");
}
delete [] buffer;
Now that the data is actually SENT (a prerequisite for reading it), everything you need should be in the payload array. You still need to unpack it, but you can't pass unpack a single indexed byte; it needs the array (and an offset):
length, = struct.unpack_from("<i", payload, 0)  # unpack returns a tuple; "<i" assumes a little-endian sender
offset = 4
node_type, = struct.unpack_from("{}s".format(length), payload, offset)
offset += length
length, = struct.unpack_from("<i", payload, offset)
offset += 4
sub_type, = struct.unpack_from("{}s".format(length), payload, offset)
offset += length
...
I upvoted Garr Godfrey's answer, as it is a good one indeed. However, it will increase the struct's size. That is neither good nor bad in itself, but if for some reason you would like to keep the solution based on char* pointers instead of arrays (e.g. you don't know the maximum length of the strings), it can be achieved the following way (my code assumes a 4-byte little-endian int, a 1-byte bool, and a 1-byte char):
//_Static_assert(sizeof(int)==4u, "Int size has to be 4 bytes");
//the above one is C11, the one below is C++:
//feel free to ifdef that if you need it
static_assert(sizeof(int)==4u, "Int size has to be 4 bytes");
struct node_status
{
char* node_type;
char* sub_type; // set the type of incubator
int sub_type_id;
bool sleep; // set to sleep
int check_in_time; // set check in time
bool LOCK; // set if admin control is true/false
};
size_t serialize_node_status(const struct node_status* st, char* buffer)
{
//this bases on the assumption buffer is large enough
//and string pointers are not null
size_t offset=0u;
size_t l = 0;
l = strlen(st->node_type)+1;
memcpy(buffer+offset, st->node_type, l);
offset += l;
l = strlen(st->sub_type)+1;
memcpy(buffer+offset, st->sub_type, l);
offset += l;
l = sizeof(st->sub_type_id);
memcpy(buffer+offset, &st->sub_type_id, l);
offset += l;
l = sizeof(st->sleep);
memcpy(buffer+offset, &st->sleep, l);
offset += l;
l = sizeof(st->check_in_time);
memcpy(buffer+offset, &st->check_in_time, l);
offset += l;
l = sizeof(st->LOCK);
memcpy(buffer+offset, &st->LOCK, l);
offset += l;
return offset;
}
// sending:
char buf[100] = {0}; // pick the needed size or allocate it dynamically
struct node_status nodeStatus = {"abcz", "x", 20, true, 999, false};
size_t serialized_bytes = serialize_node_status(&nodeStatus, buf);
mesh.write(buf, /*type*/ 126, serialized_bytes, /*to node*/ 000);
Side note: assigning string literals directly to char pointers is not valid C++.
So the string members should either be const char* (e.g. const char* node_type), or the file should be compiled as C (where you can get away with it). Arduino often has its own compilation options set, so it is likely to work due to a compiler extension (or just a suppressed warning). Since I'm not sure what exactly will be used, I wrote a C11-compatible version.
And then on Python's end:
INT_SIZE = 4

class node_status:
    def __init__(self,
                 nt: str,
                 st: str,
                 stid: int,
                 sl: bool,
                 cit: int,
                 lck: bool):
        self.node_type = nt
        self.sub_type = st
        self.sub_type_id = stid
        self.sleep = sl
        self.check_in_time = cit
        self.LOCK = lck

    def __str__(self):
        s = f'node_type={self.node_type} sub_type={self.sub_type}'
        s += f' sub_type_id={self.sub_type_id} sleep={self.sleep}'
        s += f' check_in_time={self.check_in_time} LOCK={self.LOCK}'
        return s

    @classmethod
    def from_bytes(cls, b: bytes):
        offset = b.index(0x00) + 1
        nt = str(b[:offset], 'utf-8')
        b = b[offset:]
        offset = b.index(0x00) + 1
        st = str(b[:offset], 'utf-8')
        b = b[offset:]
        stid = int.from_bytes(b[:INT_SIZE], 'little')
        b = b[INT_SIZE:]
        sl = bool(b[0])
        b = b[1:]
        cit = int.from_bytes(b[:INT_SIZE], 'little')
        b = b[INT_SIZE:]
        lck = bool(b[0])
        b = b[1:]
        assert len(b) == 0
        return cls(nt, st, stid, sl, cit, lck)

# and the deserialization goes like this:
fromMesh1 = bytes([0x61,0x62,0x63,0x0,0x78,0x79,0x7A,0x0,0x14,0x0,0x0,0x0,0x1,0xE7,0x3,0x0,0x0,0x1])
fromMesh2 = bytes([0x61,0x62,0x63,0x0,0x78,0x79,0x7A,0x0,0x14,0x0,0x0,0x0,0x1,0xE7,0x3,0x0,0x0,0x0])
fromMesh3 = bytes([0x61,0x62,0x63,0x7A,0x0,0x78,0x0,0x14,0x0,0x0,0x0,0x1,0xE7,0x3,0x0,0x0,0x0])
print(node_status.from_bytes(fromMesh1))
print(node_status.from_bytes(fromMesh2))
print(node_status.from_bytes(fromMesh3))
These are all good answers, but not quite what was required; a more in-depth knowledge of the RF24Mesh library was needed. I was able to find the answer with the help of some RF24 pros. Here is my solution:
I had to change the struct to specific sizes using char name[10] on the C++ arduino side.
struct node_status
{
char node_type[10] = "incubator";
char sub_type[10] = "chicken"; // set the type of incubator
int sub_type_id = 1;
bool sleep = false; // set to sleep
int check_in_time = 1000; // set check in time
bool LOCK = false; // set if admin control is true/false
} nodeStatus;
Unfortunately, it looks like read() returns the payload with the length you passed to the read() function, not the length actually received. This is unintuitive and should be improved; the parameter specifying the length of the payload to return ought to be optional.
Until they get a fix for this, I have to slice the payload down to the length that struct.unpack() needs (which can be determined from the format specifier string). So, basically:
# get the max sized payload despite what was actually received
head, payload = network.read(144)
# unpack 30 bytes
(
node_type,
sub_type,
sub_type_id,
sleep,
check_in_time,
LOCK,
) = struct.unpack("<10s10si?i?", payload[:30])
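Rather than hardcoding the 30, struct.calcsize() can compute the slice length from the same format string:
import struct

fmt = "<10s10si?i?"
size = struct.calcsize(fmt)  # 30 for this format
fields = struct.unpack(fmt, payload[:size])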
I finally got it to work using this method. I want to be fair about awarding the points and would like your opinion on who came closest to this method. Please comment below.

How to find out internal string encoding?

From PEP 393 I understand that Python can use multiple encodings internally when storing strings: latin1, UCS-2, UCS-4. Is it possible to find out what encoding is used to store a particular string, e.g. in the interactive interpreter?
There is a CPython C API function for the kind of the unicode object: PyUnicode_KIND.
In case you have Cython and IPython1 you can easily access that function:
In [1]: %load_ext cython
...:
In [2]: %%cython
   ...:
   ...: cdef extern from "Python.h":
   ...:     int PyUnicode_KIND(object o)
   ...:
   ...: cpdef unicode_kind(astring):
   ...:     if type(astring) is not str:
   ...:         raise TypeError('astring must be a string')
   ...:     return PyUnicode_KIND(astring)
In [3]: a = 'a'
...: b = 'Ǧ'
...: c = '😀'
In [4]: unicode_kind(a), unicode_kind(b), unicode_kind(c)
Out[4]: (1, 2, 4)
Where 1 represents latin-1 and 2 and 4 represent UCS-2 and UCS-4 respectively.
You could then use a dictionary to map these numbers into a string that represents the encoding.
1 It's also possible without Cython and/or IPython, the combination is just very handy, otherwise it would be more code (without IPython) and/or require a manual installation (without Cython).
The only way you can test this from the Python layer (without resorting to manually mucking about with object internals via ctypes or Python extension modules) is by checking the ordinal value of the largest character in the string, which determines whether the string is stored as ASCII/latin-1, UCS-2 or UCS-4. A solution would be something like:
def get_bpc(s):
    maxordinal = ord(max(s, default='\0'))
    if maxordinal < 256:
        return 1
    elif maxordinal < 65536:
        return 2
    else:
        return 4
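For example, the thresholds line up with the latin-1/UCS-2/UCS-4 boundaries:
>>> get_bpc('test')
1
>>> get_bpc('Ǧ')
2
>>> get_bpc('😀')
4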
You can't actually rely on sys.getsizeof, because for non-ASCII strings (even one-byte-per-character strings that fit in the latin-1 range) the string may or may not have populated its cached UTF-8 representation. Tricks like adding an extra character and comparing sizes can then actually show the size decrease. The caching can also happen "at a distance", so you're not directly responsible for the existence of the cached UTF-8 form on the string you're checking. For example:
>>> e = 'é'
>>> sys.getsizeof(e)
74
>>> sys.getsizeof(e + 'a')
75
>>> class é: pass # One of several ways to trigger creation/caching of UTF-8 form
>>> sys.getsizeof(e)
77 # !!! Grew three bytes even though it's the same variable
>>> sys.getsizeof(e + 'a')
75 # !!! Adding a character shrunk the string!
One way of finding out which exact internal encoding CPython uses for a specific unicode string is to peek in the actual (CPython) object.
According to PEP 393 (Specification section), all unicode string objects start with PyASCIIObject:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
} PyASCIIObject;
Character size is stored in the kind bit-field, as described in the PEP, as well as in the code comments in unicodeobject:
00 => str is not initialized (data are in wstr)
01 => 1 byte (Latin-1)
10 => 2 byte (UCS-2)
11 => 4 byte (UCS-4);
After we get the address of the string with id(string), we can use the ctypes module to read the object's bytes (and the kind field):
import ctypes
mystr = "x"
first_byte = ctypes.c_uint8.from_address(id(mystr)).value
The offset from the object's start to kind is PyObject_HEAD + Py_ssize_t length + Py_hash_t hash; that is, two Py_ssize_t fields (ob_refcnt and length), the ob_type pointer, and a Py_hash_t hash (which has the same size as a pointer here):
offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)
(which is 32 on x86-64)
All put together:
import ctypes

def bytes_per_char(s):
    offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)
    kind = ctypes.c_uint8.from_address(id(s) + offset).value >> 2 & 3
    size = {0: ctypes.sizeof(ctypes.c_wchar), 1: 1, 2: 2, 3: 4}
    return size[kind]
Gives:
>>> bytes_per_char('test')
1
>>> bytes_per_char('đžš')
2
>>> bytes_per_char('😀')
4
Note we had to handle the special case of kind == 0, because then the character type is exactly wchar_t (which is 16 or 32 bits, depending on the platform).

Why does sizeof vary in this example of ctypes struct packing?

I would really appreciate an explanation of the output of the following piece of code. I don't understand why sizeof(struct_2) and sizeof(my_struct_2) are different, given that sizeof(struct_1) and sizeof(c_int) are the same.
Does ctypes pack a struct within a struct in some different way?
from ctypes import *

class struct_1(Structure):
    pass

int8_t = c_int8
int16_t = c_int16
uint8_t = c_uint8

struct_1._fields_ = [
    ('tt1', int16_t),
    ('tt2', uint8_t),
    ('tt3', uint8_t),
]

class struct_2(Structure):
    pass

int8_t = c_int8
int16_t = c_int16
uint8_t = c_uint8

struct_2._fields_ = [
    ('t1', int8_t),
    ('t2', uint8_t),
    ('t3', uint8_t),
    ('t4', uint8_t),
    ('t5', int16_t),
    ('t6', struct_1),
    ('t7', struct_1 * 6),
]

class my_struct_2(Structure):
    #_pack_ = 1 # This will give answer as 34
    #_pack_ = 4 # 36
    _fields_ = [
        ('t1', c_int8),
        ('t2', c_uint8),
        ('t3', c_uint8),
        ('t4', c_uint8),
        ('t5', c_int16),
        ('t6', c_int),
        ('t7', c_int * 6),
    ]
print "size of c_int : ", sizeof(c_int)
print "size of struct_1 : ", sizeof(struct_1)
print "size of my struct_2 : ", sizeof(my_struct_2)
print "siz of origional struct_2: ", sizeof(struct_2)
OUTPUT:
size of c_int : 4
size of struct_1 : 4
size of my struct_2 : 36
siz of origional struct_2: 34 ==> why not 36 ??
EDIT:
Renamed t6 -> t7 (the array of struct_1) and removed _pack_ = 2 from struct_2, but I still see different sizes for struct_2 and my_struct_2.
The difference arises from the presence or absence of padding between or after elements in the structure layout: when the sizes differ, the larger one accounts for more bytes than the individual members require. Members t5 and t6 alone are sufficient to demonstrate the difference, and there is no difference if t5 (only) is omitted.
A little experimentation shows that by default (i.e. when the _pack_ member is not specified), ctypes provides 2-byte alignment for structure type struct_1, but 4-byte alignment for type c_int. Or it does on my system, anyway. The ctypes documentation claims that by default it lays out structures the same way that the system's C compiler (by default) does, and that indeed seems to be the case. Consider this C program:
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

int main() {
    struct s {
        int16_t x;
        int8_t y;
        uint8_t z;
    };
    struct t1 {
        int16_t x;
        struct s y;
    };
    struct t2 {
        int16_t x;
        int y;
    };

    printf("The size of int is %zu\n", sizeof(int));
    printf("The size of struct s is %zu\n", sizeof(struct s));
    printf("The size of struct t1 is %zu\n", sizeof(struct t1));
    printf("The size of struct t2 is %zu\n", sizeof(struct t2));
    printf("\nThe offset of t1.y is %zu\n", offsetof(struct t1, y));
    printf("The offset of t2.y is %zu\n", offsetof(struct t2, y));
}
Its output for me (on CentOS 7 w/ GCC 4.8 on x86_64) is:
The size of int is 4
The size of struct s is 4
The size of struct t1 is 6
The size of struct t2 is 8
The offset of t1.y is 2
The offset of t2.y is 4
Observe that the sizes of int and struct s are the same (4 bytes), but the compiler is aligning the struct s on a 2-byte boundary within struct t1, whereas it aligns the int on a 4-byte boundary within struct t2. This matches perfectly with the behavior of ctypes on the same system.
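You can query these alignments from Python directly via ctypes.alignment(), which reports the alignment ctypes will use for a type (using struct_1 as defined in the question; values are from a typical x86-64 build):
import ctypes

print(ctypes.alignment(ctypes.c_int))  # 4
print(ctypes.alignment(struct_1))      # 2: the largest member alignment, int16_t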
As for why GCC chooses the alignment it does, I observe that if I add a member of type int to struct s then GCC switches to using 4-byte alignment for the struct, as well as arranging (by default) for the offset of the int within the structure to be a multiple of 4 bytes. It is reasonable to conclude that GCC is laying out members within a struct and choosing the alignment of the struct overall so that all members of every aligned struct instance are themselves naturally aligned. Do note, however, that this is just an example. C implementations are largely at their own discretion with respect to choosing structure layout and alignment requirements.

How to unpack a C-style structure inside another structure?

I am receiving data via socket interface from an application (server) written in C. The data being posted has the following structure. I am receiving data with a client written in Python.
struct hdr
{
int Id;
char PktType;
int SeqNo;
int Pktlength;
};
struct trl
{
char Message[16];
long long info;
};
struct data
{
char value[10];
double result;
long long count;
short int valueid;
};
typedef struct
{
struct hdr hdr_buf;
struct data data_buf[100];
struct trl trl_buf;
} trx_unit;
How do I unpack the received data to access my inner data buffer?
Using the struct library is the way to go. However, you will have to know a bit more about the C program that is serializing the data. Consider the hdr structure. If the C program is sending it using the naive approach:
struct hdr header;
send(sd, &hdr, sizeof(header), 0);
Then your client cannot safely interpret the bytes that are sent to it because there is an indeterminate amount of padding inserted between the struct members. In particular, I would expect three bytes of padding following the PktType member.
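The struct module can show you that padding, since native mode ('@', the default) mirrors the platform C compiler's layout while standard mode ('=') packs without padding. A quick check for the hdr layout (values are for a typical x86-64 build):
import struct

print(struct.calcsize('@iBii'))  # 16: three padding bytes follow the 'B' (PktType)
print(struct.calcsize('=iBii'))  # 13: standard size, no padding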
The safest way to approach sending around binary data is to have the server and client serialize the bytes directly to ensure that there is no additional padding and to make the byte ordering of multibyte integers explicit. For example:
/*
* Send a header over a socket.
*
* The header is sent as a stream of packed bytes with
* integers in "network" byte order. For example, a
* header value of:
* Id: 0x11223344
* PktType: 0xff
* SeqNo: 0x55667788
* PktLength: 0x99aabbcc
*
* is sent as the following byte stream:
* 11 22 33 44 ff 55 66 77 88 99 aa bb cc
*/
void
send_header(int sd, struct hdr const* header)
{ /* NO ERROR HANDLING */
uint32_t num = htonl((uint32_t)header->Id);
send(sd, &num, sizeof(num), 0);
send(sd, &header->PktType, sizeof(header->PktType), 0);
num = htonl((uint32_t)header->SeqNo);
send(sd, &num, sizeof(num), 0);
num = htonl((uint32_t)header->PktLength);
send(sd, &num, sizeof(num), 0);
}
This will ensure that your client can safely decode it using the struct module:
buf = s.recv(13) # packed data is 13 bytes long
id_, pkt_type, seq_no, pkt_length = struct.unpack('>IBII', buf)
If you cannot modify the C code to fix the serialization indeterminacy, then you will have to read the data from the stream, figure out where the C compiler inserts padding, and build struct format strings to match, using the padding byte format character ('x') to skip the padding values.
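For instance, if the server sends the raw struct as-is from a typical little-endian x86-64 build, the header might decode with a format like this (a sketch reusing the socket s from above; the '3x' padding guess must be verified against the real byte stream):
import struct

buf = s.recv(16)  # sizeof(struct hdr) on the assumed build
id_, pkt_type, seq_no, pkt_length = struct.unpack('<iB3xii', buf)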
I usually write a decoder class in Python that reads a complete value from the socket. In your case it would look something like:
class PacketReader(object):
    def __init__(self, sd):
        self._socket = sd

    def read_packet(self):
        id_, pkt_type, seq_no, pkt_length = self._read_header()
        data_bufs = [self._read_data_buf() for _ in range(0, 100)]
        message, info = self._read_trl()
        return {'id': id_, 'pkt_type': pkt_type, 'seq_no': seq_no,
                'data_bufs': data_bufs, 'message': message,
                'info': info}

    def _read_header(self):
        """
        Read and unpack a ``hdr`` structure.

        :returns: a :class:`tuple` of the header data values
            in order - *Id*, *PktType*, *SeqNo*, and *PktLength*

        The header is assumed to be packed as 13 bytes with
        integers in network byte order.
        """
        buf = self._socket.read(13)
        # >  Multibyte values in network order
        # I  Id as 32-bit unsigned integer value
        # B  PktType as 8-bit unsigned integer value
        # I  SeqNo as 32-bit unsigned integer value
        # I  PktLength as 32-bit unsigned integer value
        return struct.unpack('>IBII', buf)

    def _read_data_buf(self):
        """
        Read and unpack a single ``data`` structure.

        :returns: a :class:`tuple` of data values in order -
            *value*, *result*, *count*, and *valueid*

        The data structure is assumed to be packed as 28 bytes
        with integers in network byte order and doubles encoded
        as IEEE 754 binary64 in network byte order.
        """
        buf = self._socket.read(28)  # assumes double is binary64
        # >    Multibyte values in network order
        # 10s  value bytes
        # d    result encoded as IEEE 754 binary64 value
        # q    count encoded as a 64-bit signed integer
        # H    valueid as a 16-bit unsigned integer value
        return struct.unpack('>10sdqH', buf)

    def _read_trl(self):
        """
        Read and unpack a ``trl`` structure.

        :returns: a :class:`tuple` of trl values in order -
            *Message* as byte string, *info*

        The structure is assumed to be packed as 24 bytes with
        integers in network byte order.
        """
        buf = self._socket.read(24)
        # >    Multibyte values in network order
        # 16s  message bytes
        # q    info encoded as a 64-bit signed value
        return struct.unpack('>16sq', buf)
Mind you, this is untested and probably contains syntax errors, but it is how I would approach the problem.
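Usage would then look something like this (hypothetical; it assumes sd exposes a read(n) method that returns exactly n bytes, e.g. the object returned by socket.makefile('rb')):
reader = PacketReader(sd)
packet = reader.read_packet()
print(packet['seq_no'], len(packet['data_bufs']))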
The struct library has all you need to do this.

Accessing bitfields while reading/writing binary data structures

I'm writing a parser for a binary format. This binary format involves different tables which are again in binary format containing varying field sizes usually (somewhere between 50 - 100 of them).
Most of these structures will have bitfields and will look something like these when represented in C:
struct myHeader
{
    unsigned char fieldA : 3;
    unsigned char fieldB : 2;
    unsigned char fieldC : 3;
    unsigned short fieldD : 14;
    unsigned char fieldE : 4;
};
I came across the struct module, but its lowest resolution is a byte rather than a bit; otherwise the module would be pretty much the right fit for this work.
I know bitfields are supported using ctypes, but I'm not sure how to interface ctypes structs containing bitfields here.
My other option is to manipulate the bits myself and feed them into bytes for use with the struct module - but since I have close to 50-100 different types of such structures, writing that code by hand becomes error-prone. I'm also worried about efficiency, since this tool might be used to parse gigabytes of binary data.
Thanks.
Using bitstring (which you mention you're looking at) it should be easy enough to implement. First to create some data to decode:
>>> myheader = "3, 2, 3, 14, 4"
>>> a = bitstring.pack(myheader, 1, 0, 5, 1000, 2)
>>> a.bin
'00100101000011111010000010'
>>> a.tobytes()
'%\x0f\xa0\x80'
And then decoding it again is just
>>> a.readlist(myheader)
[1, 0, 5, 1000, 2]
Your main concern might well be the speed. The library is well optimised Python, but that's not nearly as fast as a C library would be.
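For comparison, here is what the hand-rolled approach looks like for just this one header, using plain integer arithmetic on the same bytes the example above produced (a sketch assuming fields are packed most-significant-bit first and the 26 bits are left-aligned in 4 bytes):
def unpack_myheader(data):
    # 26 bits of fields, left-aligned in 4 bytes (as produced by a.tobytes())
    bits = int.from_bytes(data[:4], 'big') >> (32 - 26)
    fieldE = bits & 0xF
    bits >>= 4
    fieldD = bits & 0x3FFF
    bits >>= 14
    fieldC = bits & 0x7
    bits >>= 3
    fieldB = bits & 0x3
    bits >>= 2
    fieldA = bits & 0x7
    return fieldA, fieldB, fieldC, fieldD, fieldE

print(unpack_myheader(b'%\x0f\xa0\x80'))  # -> (1, 0, 5, 1000, 2)
Multiplied across 50-100 structure types, this is exactly the error-prone boilerplate the question wants to avoid, which is the argument for a declarative library.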
I haven't rigorously tested this, but it seems to work with unsigned types (edit: it works with signed byte/short types, too).
Edit 2: This is really hit or miss. It depends on the way the library's compiler packed the bits into the struct, which is not standardized. For example, with gcc 4.5.3 it works as long as I don't use the attribute to pack the struct, i.e. __attribute__ ((__packed__)) (so instead of 6 bytes it gets packed into 4 bytes, which you can check with __alignof__ and sizeof). I can make it almost work by adding _pack_ = True to the ctypes Structure definition, but it fails for fieldE. gcc notes: "Offset of packed bit-field ‘fieldE’ has changed in GCC 4.4".
import ctypes

class MyHeader(ctypes.Structure):
    _fields_ = [
        ('fieldA', ctypes.c_ubyte, 3),
        ('fieldB', ctypes.c_ubyte, 2),
        ('fieldC', ctypes.c_ubyte, 3),
        ('fieldD', ctypes.c_ushort, 14),
        ('fieldE', ctypes.c_ubyte, 4),
    ]

lib = ctypes.cdll.LoadLibrary('C/bitfield.dll')
hdr = MyHeader()
lib.set_header(ctypes.byref(hdr))
for x in hdr._fields_:
    print("%s: %d" % (x[0], getattr(hdr, x[0])))
Output:
fieldA: 3
fieldB: 1
fieldC: 5
fieldD: 12345
fieldE: 9
C:
typedef struct _MyHeader {
    unsigned char fieldA : 3;
    unsigned char fieldB : 2;
    unsigned char fieldC : 3;
    unsigned short fieldD : 14;
    unsigned char fieldE : 4;
} MyHeader, *pMyHeader;

int set_header(pMyHeader hdr) {
    hdr->fieldA = 3;
    hdr->fieldB = 1;
    hdr->fieldC = 5;
    hdr->fieldD = 12345;
    hdr->fieldE = 9;
    return(0);
}
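If you don't have a compiled DLL handy, you can exercise the same ctypes layout purely from Python by setting the fields and dumping the raw bytes (a sketch; the resulting layout still follows whatever bit-packing convention the host C compiler uses):
import ctypes

hdr = MyHeader()
hdr.fieldA, hdr.fieldB, hdr.fieldC, hdr.fieldD, hdr.fieldE = 3, 1, 5, 12345, 9
raw = ctypes.string_at(ctypes.byref(hdr), ctypes.sizeof(hdr))
print(ctypes.sizeof(MyHeader), raw.hex())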
