Encoding/Decoding (Berkeley database records) in python3

Encoding/Decoding (Berkeley database records) in python3 - python

I have a pre-existing berkeley database written to and read from a program written in C++. I need to sidestep using this program and write to the database directly using python.
I can do this, but am having a heck of a time trying to encode my data properly such that it is in the proper form and can then be read by the original C++ program. In fact, I can't figure out how to decode the existing data when I know what the values are.
The keys of the key value pairs in the database should be timestamps in the form YYYYMMDDHHmmSS. The values should be five doubles and an int mashed together, by which I mean (from the source code of the C++ program), the following structure(?) DVALS
typedef struct
{
double d1;
double d2;
double d3;
double d4;
double d5;
int i1;
} DVALS;
is written to the database as the value of the key value pair like so:
DBT data;
memset(&data, 0, sizeof(DBT));
DVALS dval;
memset(&dval, 0, sizeof(DVALS));
data.data = &dval;
data.size = sizeof(DVALS);
db->put(db, NULL, &key, &data, 0);
Luckily, I can know what the values are. So if I run from the command line
db_dump myfile
the final record is:
323031393033313431353533303000
ae47e17a140e4040ae47e17a140e4040ae47e17a140e4040ae47e17a140e40400000000000b6a4400000000000000000
Using python's bsddb3 module I can pull this record out also:
from bsddb3 import db
myDB = db.DB()
myDB.open('myfile', None, db.DB_BTREE)
cur = myDB.cursor()
kvpair = cur.last()
With kvpair now holding the following information:
(b'20190314155300\x00', b'\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\x00\x00\x00\x00\x00\xb6\xa4#\x00\x00\x00\x00\x00\x00\x00\x00')
The timestamp is easy to read and in this case the actual values are as follows:
d1 = d2 = d3 = d4 = 32.11
d5 = 2651
i1 = 0
As the '\xaeG\xe1z\x14\x0e##' sequence is repeated 4 times I think it corresponds to the value 32.11
So I think my question may just be about encoding/decoding, but perhaps there is more to it, hence the backstory.
kvpair[1].decode('utf-8')
Using a variety of encodings just gives errors similar to this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte

The value data is binary so it may be unpacked using Python's struct module.
>>> import struct
>>> bs = b'\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\x00\x00\x00\x00\x00\xb6\xa4#\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(bs)
48
>>> struct.unpack('<5di4x', bs)
(32.11, 32.11, 32.11, 32.11, 2651.0, 0)
struct.unpack takes two arguments: a format specifier that defines the data format and types and the data to be unpacked. The format '<5di4x' describes:
<: little endian order
5d: five doubles (8 bytes each)
i: one signed int (4 bytes; I for unsigned)
4x: four pad bytes
Data can be packed in the same way, using struct.pack.
>>> nums = [32.11, 32.11, 32.11, 32.11, 2651, 0]
>>> format_ = '5di4x'
>>> packed = struct.pack(format_, *nums)
>>> packed == bs
True
>>>

Related

Can pandas read c++ binary file directly?

I have a large file, which is outputed by my c++ code.
it save struct into file with binary format.
For example:
Struct A {
char name[32]:
int age;
double height;
};
output code is like:
std::fstream f;
for (int i = 0; i < 10000000; ++ i)
A a;
f.write(&a, sizeof(a));
I want to handle it in python with pandas DataFrame.
Is there any good methos that can read it elegantly?

Searching for read_bin I found this
issue that suggests using np.fromfile to load the data into a numpy array, then converting to a dataframe:
import numpy as np
import pandas as pd
dt = np.dtype(
[
("name", "S32"), # 32-length zero-terminated bytes
("age", "i4"), # 32-bit signed integer
("height", "f8"), # 64-bit floating-point number
],
)
records = np.fromfile("filename.bin", dt)
df = pd.DataFrame(records)
Please note that I have not tested this code, so there could be some problems in the data types I picked:
the byte order might be different (big/small endian dt = np.dtype([('big', '>i4'), ('little', '<i4')]))
the type for the char array is a null terminated byte array, that I think will result in a bytes type object in python, so you might want to convert that to string (using df['name'] = df['name'].str.decode('utf-8'))
More info on the data types can be found in the numpy docs.
Cheers!

Untested, based on a quick review of the Python struct module's documentation.
import struct
def reader(filehandle):
"""
Accept an open filehandle; read and yield tuples according to the
specified format (see the source) until the filehandle is exhausted.
"""
mystruct = struct.Struct("32sid")
while True:
buf = filehandle.read(mystruct.size)
if len(buf) == 0:
break
name, age, height = mystruct.unpack(buf)
yield name, age, height
Usage:
with open(filename, 'rb') as data:
for name, age, height in reader(data):
# do things with those values
I don't know enough about C++ endianness conventions to decide if you should add a modifier to swap around the byte order somewhere. I'm guessing if C++ and Python are both running on the same machine, you would not have to worry about this.

Why python has different types of bytes

I have two variables, one is b_d, the other is b_test_d.
When I type b_d in the console, it shows:
b'\\\x8f\xc2\xf5(\\\xf3?Nb\x10X9\xb4\x07#\x00\x00\x00\x00\x00\x00\xf0?'
when I type b_test_d in the console, it shows:
b'[-2.1997713216,-1.4249271187,-1.1076795391,1.5224958034,-0.1709796203,0.3663875698,0.14846441,-0.7415930061,-1.7602231949,0.126605689,0.6010934792,-0.466415358,1.5675525816,1.00836295,1.4332792992,0.6113384254,-1.8008540571,-0.9443408896,1.0943670356,-1.0114642686,1.443892627,-0.2709427287,0.2990462512,0.4650133591,0.2560791327,0.2257600462,-2.4077429827,-0.0509983213,1.0062187148,0.4315075795,-0.6116110033,0.3495131413,-0.3249903375,0.3962305931,-0.1985757285,1.165792433,-1.1171953063,-0.1732557874,-0.3791600654,-0.2860519953,0.7872658859,0.217728374,-0.4715179983,-0.4539613811,-0.396353657,1.2326862425,-1.3548659354,1.6476230786,0.6312713442,-0.735444661,-0.6853447369,-0.8480631975,0.9538606574,0.6653542368,-0.2833696021,0.7281604648,-0.2843872095,0.1461980484,-2.3511731773,-0.3118047948,-1.6938613893,-0.0359659687,-0.5162134311,-2.2026641552,-0.7294895084,0.7493073213,0.1034096968,0.6439803068,-0.2596155272,0.5851323455,1.0173285542,-0.7370464113,1.0442954406,-0.5363832595,0.0117795359,0.2225617514,0.067571974,-0.9154681906,-0.293808596,1.3717113798,0.4919516922,-0.3254944005,1.6203744532,-0.1810222279,-0.6111596457,1.344064259,-0.4596893179,-0.2356197144,0.4529942046,1.6244603294,0.1849995925,0.6223061217,-0.0340662398,0.8365900535,-0.6804201929,0.0149665385,0.4132453788,0.7971962667,-1.9391525531,0.1440486871,-0.7103617816,0.9026539637,0.6665798363,-1.5885073458,1.4084493329,-1.397040825,1.6215697667,1.7057148522,0.3802647045,-0.4239271483,1.4773614536,1.6841461329,0.1166845529,-0.3268795898,-0.9612751672,0.4062399443,0.357209662,-0.2977362702,-0.3988147401,-0.1174652196,0.3350589818,-1.8800423584,0.0124169787,1.0015110265,0.789541751,-0.2710408983,1.4987300181,-1.1726824468,-0.355322591,0.6567978423,0.8319110558,0.8258835069,-1.1567887763,1.9568551122,1.5148655075,1.0589021915,-0.4388232953,-0.7451680183,-2.1897621693,0.4502135234,-1.9583089063,0.1358789518,-1.7585860897,0.452259777,0.7406800349,-1.3578980418,1.108740204,-1.1986272667,-1.0273598206,-1.8165822264,1.0853600894,-0.273943514,0.8589890805,1.3639094329,-0.6121993589,-0.0587067992,0.0798457584,1.0992814648,-1.0455733611,1.4780003064,0.5047157705,0.1565451605,0.9656886956,-0.5998330255,0.4846727299,0.8790524818,1.0288893846,-2.0842447397,0.4074607421,2.1523241756,-1.1268047125,-0.6016001524,-1.3302141561,1.1869516954,1.0988060125,0.7405900405,1.1813110811,0.8685330644,2.0927140519,-1.7171952009,0.9231993147,0.320874115,0.7465845079,-0.1034484959,-0.4776822499,0.436218328,-0.4083564542,0.4835567895,1.0733230373,-0.858658902,-0.4493571034,0.4506418221,1.6696649735,-0.9189799982,-1.1690356499,-1.0689397924,0.3174297583,1.0403701444,0.5440082812,-0.1128248996]'
Both of them are bytes type, but I can use numpy.frombuffer to read the b_d, but not the b_test_d. And they look very different. Why do I have these two types of bytes?
Thank you.

[A]nyone can point out how to use Json marshall to convert the byte to the same type of bytes as the first one?
This isn't the right question, but I think I know what you're asking. You say you're getting the 2nd array via JSON marshalling, but that it's also not under your control:
it was obtained by json marshal (convert a received float array to byte array, and then convert the result to base64 string, which is done by someone else)
That's fine though, you just have to do a few steps of processing to get to a state equivalent to the first set of bytes.
First, some context to what's going on. You've already seen that numpy can understand your first set of bytes.
>>> numpy.frombuffer(data)
[1.21 2.963 1. ]
Based on its output, it looks like numpy is interpreting your data as 3 doubles, with 8 bytes each (24 bytes total)...
>>> data = b'\\\x8f\xc2\xf5(\\\xf3?Nb\x10X9\xb4\x07#\x00\x00\x00\x00\x00\x00\xf0?'
>>> len(data)
24
...which the struct module can also interpret.
# Separate into 3 doubles
x, y, z = data[:8], data[8:16], data[16:]
print([struct.unpack('d', i) for i in (x, y, z)])
[(1.21,), (2.963,), (1.0,)
There's actually (at least) 2 ways you can get a numpy array out of this.
Short way
1. Convert to string
# Original JSON data (snipped)
junk = b'[-2.1997713216,-1.4249271187,-1.1076795391,...]'
# Decode from bytes to a string (defaults to utf-8), then
# trim off the brackets (first and last characters in the string)
as_str = junk.decode()[1:-1]
2. Use numpy.fromstring
numpy.fromstring(as_str, dtype=float, sep=',')
# Produces:
array([-2.19977132, -1.42492712, -1.10767954, 1.5224958 , -0.17097962,
0.36638757, 0.14846441, -0.74159301, -1.76022319, 0.12660569,
0.60109348, -0.46641536, 1.56755258, 1.00836295, 1.4332793 ,
0.61133843, -1.80085406, -0.94434089, 1.09436704, -1.01146427,
1.44389263, -0.27094273, 0.29904625, 0.46501336, 0.25607913,
0.22576005, -2.40774298, -0.05099832, 1.00621871, 0.43150758,
... ])
Long way
Note: I found the fromstring method after writing this part up, figured I'd leave it here to at least help explain the byte differences.
1. Convert the JSON data into an array of numeric values.
# Original JSON data (snipped)
junk = b'[-2.1997713216,-1.4249271187,-1.1076795391,...]'
# Decode from bytes to a string - defaults to utf-8
junk = junk.decode()
# Trim off the brackets - First and last characters in the string
junk = junk[1:-1]
# Separate into values
junk = junk.split(',')
# Convert to numerical values
doubles = [float(val) for val in junk]
# Or, as a one-liner
doubles = [float(val) for val in junk.decode()[1:-1].split(',')]
# "doubles" currently holds:
[-2.1997713216,
-1.4249271187,
-1.1076795391,
1.5224958034,
...]
2. Use struct to get byte-representations for the doubles
import struct
as_bytes = [struct.pack('d', val) for val in doubles]
# "as_bytes" currently holds:
[b'\x08\x9b\xe7\xb4!\x99\x01\xc0',
b'\x0b\x00\xe0`\x80\xcc\xf6\xbf',
b'+ ..\x0e\xb9\xf1\xbf',
b'hg>\x8f$\\\xf8?',
...]
3. Join all the double values (as bytes) into a single byte-string, then submit to numpy
new_data = b''.join(as_bytes)
numpy.frombuffer(new_data)
# Produces:
array([-2.19977132, -1.42492712, -1.10767954, 1.5224958 , -0.17097962,
0.36638757, 0.14846441, -0.74159301, -1.76022319, 0.12660569,
0.60109348, -0.46641536, 1.56755258, 1.00836295, 1.4332793 ,
0.61133843, -1.80085406, -0.94434089, 1.09436704, -1.01146427,
1.44389263, -0.27094273, 0.29904625, 0.46501336, 0.25607913,
0.22576005, -2.40774298, -0.05099832, 1.00621871, 0.43150758,
... ])

A bytes object can be in any format. It is "just a bunch of bytes" without context. For display Python will represent byte values <128 as their ASCII value, and use hex escape codes (\x##) for others.
The first looks like IEEE 754 double precision floating point. numpy or struct can read it. The second one is in JSON format. Use the json module to read it:
import numpy as np
import json
import struct
b1 = b'\\\x8f\xc2\xf5(\\\xf3?Nb\x10X9\xb4\x07#\x00\x00\x00\x00\x00\x00\xf0?'
b2 = b'[-2.1997713216,-1.4249271187,-1.1076795391,1.5224958034]'
j = json.loads(b2)
n = np.frombuffer(b1)
s = struct.unpack('3d',b1)
print(j,n,s,sep='\n')
# To convert b2 into a b1 format
b = struct.pack('4d',*j)
print(b)
Output:
[-2.1997713216, -1.4249271187, -1.1076795391, 1.5224958034]
[1.21 2.963 1. ]
(1.21, 2.963, 1.0)
b'\x08\x9b\xe7\xb4!\x99\x01\xc0\x0b\x00\xe0`\x80\xcc\xf6\xbf+ ..\x0e\xb9\xf1\xbfhg>\x8f$\\\xf8?'

Why can't Python struct module pack (or unpack) multi bytes with little endian

I'm dealing with some multi bytes issues. For example, I have a variable a = b'\x00\x01\x02\x03', it is a bytes object rather than int. I'd like to struct.pack it to form a package with little endian, but <4s didn't work. In fact, <4s and >4s get the same results. What to do if I'd like the result to be b'\x03\x02\x01\x00.
I know I could use struct.pack('<L', struct.unpack('>L', a)), but is it the only and correct way to deal with multi bytes objects?
Example:
import struct
import secrets
mhdr = b'\x20'
joineui = b'\x00\x01\x02\x03\x04\x05\x06\x07'
deveui = b'\x08\x09\x10\x11\x12\x13\x14\x15'
devnonce = secrets.token_bytes(2)
joinreq = struct.pack(
'<s8s8s2s',
mhdr,
joineui,
deveui,
devnonce,
)
# The expected joinreq should be b'\x20\x07\x06\x05\x04\x03\x02\x01\x00\x15\x14\x13\x12\x11\x10\x09\x08...'

It seems to me you do not want to have 4 single chars, but instead 1 integer.
So instead of '4s' you should try using 'i' or 'I' (whether it is signed or unsigned).
Your example should look like
import struct
import secrets
mhdr = b'\x20'
joineui = b'\x00\x01\x02\x03\x04\x05\x06\x07'
deveui = b'\x08\x09\x10\x11\x12\x13\x14\x15'
devnonce = secrets.token_bytes(2)
joinreq = struct.pack(
'<BQQH', #use small letters if the values are signed instead of unsigned
mhdr,
joineui,
deveui,
devnonce,
)
"Q" stands for long long unsigned (8byte). If you want to use float instead you can use d for double float precision (8byte).
You can see the meaning of all letters in the documentation of struct.

How to find out internal string encoding?

From PEP 393 I understand that Python can use multiple encodings internally when storing strings: latin1, UCS-2, UCS-4. Is it possible to find out what encoding is used to store a particular string, e.g. in the interactive interpreter?

There is a CPython C API function for the kind of the unicode object: PyUnicode_KIND.
In case you have Cython and IPython1 you can easily access that function:
In [1]: %load_ext cython
...:
In [2]: %%cython
...:
...: cdef extern from "Python.h":
...: int PyUnicode_KIND(object o)
...:
...: cpdef unicode_kind(astring):
...: if type(astring) is not str:
...: raise TypeError('astring must be a string')
...: return PyUnicode_KIND(astring)
In [3]: a = 'a'
...: b = 'Ǧ'
...: c = '😀'
In [4]: unicode_kind(a), unicode_kind(b), unicode_kind(c)
Out[4]: (1, 2, 4)
Where 1 represents latin-1 and 2 and 4 represent UCS-2 and UCS-4 respectively.
You could then use a dictionary to map these numbers into a string that represents the encoding.
1 It's also possible without Cython and/or IPython, the combination is just very handy, otherwise it would be more code (without IPython) and/or require a manual installation (without Cython).

The only way you can test this from the Python layer (without resorting to manually mucking about with object internals via ctypes or Python extension modules) is by checking the ordinal value of the largest character in the string, which determines whether the string is stored as ASCII/latin-1, UCS-2 or UCS-4. A solution would be something like:
def get_bpc(s):
maxordinal = ord(max(s, default='\0'))
if maxordinal < 256:
return 1
elif maxordinal < 65536:
return 2
else:
return 4
You can't actually rely on sys.getsizeof because, for non-ASCII strings (even one byte per character strings that fit in the latin-1 range), the string might or might not have populated the UTF-8 representation of the string, and tricks like adding an extra character to it and comparing sizes could actually show the size decrease, and it can actually happen "at a distance", so you're not directly responsible for the existence of the cached UTF-8 form on the string you're checking. For example:
>>> e = 'é'
>>> sys.getsizeof(e)
74
>>> sys.getsizeof(e + 'a')
75
>>> class é: pass # One of several ways to trigger creation/caching of UTF-8 form
>>> sys.getsizeof(e)
77 # !!! Grew three bytes even though it's the same variable
>>> sys.getsizeof(e + 'a')
75 # !!! Adding a character shrunk the string!

One way of finding out which exact internal encoding CPython uses for a specific unicode string is to peek in the actual (CPython) object.
According to PEP 393 (Specification section), all unicode string objects start with PyASCIIObject:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
struct {
unsigned int interned:2;
unsigned int kind:2;
unsigned int compact:1;
unsigned int ascii:1;
unsigned int ready:1;
} state;
wchar_t *wstr;
} PyASCIIObject;
Character size is stored in the kind bit-field, as described in the PEP, as well as in the code comments in unicodeobject:
00 => str is not initialized (data are in wstr)
01 => 1 byte (Latin-1)
10 => 2 byte (UCS-2)
11 => 4 byte (UCS-4);
After we get the address of the string with id(string), we can use the ctypes module to read the object's bytes (and the kind field):
import ctypes
mystr = "x"
first_byte = ctypes.c_uint8.from_address(id(mystr)).value
The offset from the object's start to kind is PyObject_HEAD + Py_ssize_t length + Py_hash_t hash, which in turn is Py_ssize_t ob_refcnt + pointer to ob_type + Py_ssize_t length + size of another pointer for the hash type:
offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)
(which is 32 on x64)
All put together:
import ctypes
def bytes_per_char(s):
offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)
kind = ctypes.c_uint8.from_address(id(s) + offset).value >> 2 & 3
size = {0: ctypes.sizeof(ctypes.c_wchar), 1: 1, 2: 2, 3: 4}
return size[kind]
Gives:
>>> bytes_per_char('test')
1
>>> bytes_per_char('đžš')
2
>>> bytes_per_char('😀')
4
Note we had to handle the special case of kind == 0, because than the character type is exactly wchar_t (which is 16 or 32 bits, depending on the platform).

Unpacking and packing back a struct consisting of single bytes

I am getting struct.error: bad char in struct format when packing bytes back in the struct even without making any changes to them.
I am trying to do bitwise operations on each byte in RGBTRIPLE of a 24-bit BMP image. For the sake of simplicity, I am posting the code with just one sample bytes sequence representing a pixel in a Bitmap; I don't make any bitwise operations on it, just try to pack it back.
from struct import *
from collections import namedtuple
def main():
RGBTRIPLE = namedtuple('RGBTRIPLE', 'rgbtRed rgbtGreen rgbtBlue')
rgbt_fmt = '=BBB'
rgbt_size = calcsize(rgbt_fmt)
rgbt_buffer = b'\x1c\x1e\x1f'
rgbt = RGBTRIPLE._make(unpack(rgbt_fmt, rgbt_buffer))
rgbtRed = rgbt.rgbtRed
rgbtGreen = rgbt.rgbtGreen
rgbtBlue = rgbt.rgbtBlue
rgbt_buffer = pack('rgbt_fmt', rgbtRed, rgbtGreen, rgbtBlue)
if __name__ == "__main__":
main()
From what I understand, the problem is that when I am unpacking bytes, I am getting ints with size > 1 byte. What is the best way to fix the size of those ints at 1 byte, so I can pack them back using the same =BBB struct format?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Encoding/Decoding (Berkeley database records) in python3 - python

Related

Can pandas read c++ binary file directly?

Why python has different types of bytes

Why can't Python struct module pack (or unpack) multi bytes with little endian

How to find out internal string encoding?

Unpacking and packing back a struct consisting of single bytes

Categories

Resources