Can pandas read c++ binary file directly?

Can pandas read c++ binary file directly? - python

I have a large file, which is outputed by my c++ code.
it save struct into file with binary format.
For example:
Struct A {
char name[32]:
int age;
double height;
};
output code is like:
std::fstream f;
for (int i = 0; i < 10000000; ++ i)
A a;
f.write(&a, sizeof(a));
I want to handle it in python with pandas DataFrame.
Is there any good methos that can read it elegantly?

Searching for read_bin I found this
issue that suggests using np.fromfile to load the data into a numpy array, then converting to a dataframe:
import numpy as np
import pandas as pd
dt = np.dtype(
[
("name", "S32"), # 32-length zero-terminated bytes
("age", "i4"), # 32-bit signed integer
("height", "f8"), # 64-bit floating-point number
],
)
records = np.fromfile("filename.bin", dt)
df = pd.DataFrame(records)
Please note that I have not tested this code, so there could be some problems in the data types I picked:
the byte order might be different (big/small endian dt = np.dtype([('big', '>i4'), ('little', '<i4')]))
the type for the char array is a null terminated byte array, that I think will result in a bytes type object in python, so you might want to convert that to string (using df['name'] = df['name'].str.decode('utf-8'))
More info on the data types can be found in the numpy docs.
Cheers!

Untested, based on a quick review of the Python struct module's documentation.
import struct
def reader(filehandle):
"""
Accept an open filehandle; read and yield tuples according to the
specified format (see the source) until the filehandle is exhausted.
"""
mystruct = struct.Struct("32sid")
while True:
buf = filehandle.read(mystruct.size)
if len(buf) == 0:
break
name, age, height = mystruct.unpack(buf)
yield name, age, height
Usage:
with open(filename, 'rb') as data:
for name, age, height in reader(data):
# do things with those values
I don't know enough about C++ endianness conventions to decide if you should add a modifier to swap around the byte order somewhere. I'm guessing if C++ and Python are both running on the same machine, you would not have to worry about this.

Related

How do you wrap a C function that returns a pointer to a malloc'd array with ctypes?

I have a C function that reads a binary file and returns a dynamically sized array of unsigned integers (the size is based off metadata from the binary file):
//example.c
#include <stdio.h>
#include <stdlib.h>
__declspec(dllexport)unsigned int *read_data(char *filename, size_t* array_size){
FILE *f = fopen(filename, "rb");
fread(array_size, sizeof(size_t), 1, f);
unsigned int *array = (unsigned int *)malloc(*array_size * sizeof(unsigned int));
fread(array, sizeof(unsigned int), *array_size, f);
fclose(f);
return array;
}
This answer appears to be saying that the correct way to pass the created array from C to Python is something like this:
# example_wrap.py
from ctypes import *
import os
os.add_dll_directory(os.getcwd())
indexer_dll = CDLL("example.dll")
def read_data(filename):
filename = bytes(filename, 'utf-8')
size = c_size_t()
ptr = indexer_dll.read_data(filename, byref(size))
return ptr[:size]
However, when I run the python wrapper, the code silently fails at ptr[:size] as if I'm trying to access an array out of bounds, and I probably am, but what is the correct way to pass this dynamically size array?

A few considerations:
First, you need to properly set the prototype of the C function so that ctypes can properly convert between the C and Python types.
Second, since size is actually a ctypes.c_size_t object, you actually need to use size.value to access the numeric value of the array size.
Third, since ptr[:size.value] actually copies the array contents to a Python list, you'll want to make sure you also free() the allocated C array since you're not going to use it anymore.
(Perhaps copying the array to a Python list is not ideal here, but I'll assume it's ok here, since otherwise you have more complexity in handling the C array in Python.)
This should work:
from ctypes import *
import os
os.add_dll_directory(os.getcwd())
indexer_dll = CDLL("example.dll")
indexer_dll.read_data.argtypes = [c_char_p, POINTER(c_size_t)
indexer_dll.read_data.restype = POINTER(c_int)
libc = cdll.msvcrt
def read_data(filename):
filename = bytes(filename, 'utf-8')
size = c_size_t()
ptr = indexer_dll.read_data(filename, byref(size))
result = ptr[:size.value]
libc.free(ptr)
return result

Why can't Python struct module pack (or unpack) multi bytes with little endian

I'm dealing with some multi bytes issues. For example, I have a variable a = b'\x00\x01\x02\x03', it is a bytes object rather than int. I'd like to struct.pack it to form a package with little endian, but <4s didn't work. In fact, <4s and >4s get the same results. What to do if I'd like the result to be b'\x03\x02\x01\x00.
I know I could use struct.pack('<L', struct.unpack('>L', a)), but is it the only and correct way to deal with multi bytes objects?
Example:
import struct
import secrets
mhdr = b'\x20'
joineui = b'\x00\x01\x02\x03\x04\x05\x06\x07'
deveui = b'\x08\x09\x10\x11\x12\x13\x14\x15'
devnonce = secrets.token_bytes(2)
joinreq = struct.pack(
'<s8s8s2s',
mhdr,
joineui,
deveui,
devnonce,
)
# The expected joinreq should be b'\x20\x07\x06\x05\x04\x03\x02\x01\x00\x15\x14\x13\x12\x11\x10\x09\x08...'

It seems to me you do not want to have 4 single chars, but instead 1 integer.
So instead of '4s' you should try using 'i' or 'I' (whether it is signed or unsigned).
Your example should look like
import struct
import secrets
mhdr = b'\x20'
joineui = b'\x00\x01\x02\x03\x04\x05\x06\x07'
deveui = b'\x08\x09\x10\x11\x12\x13\x14\x15'
devnonce = secrets.token_bytes(2)
joinreq = struct.pack(
'<BQQH', #use small letters if the values are signed instead of unsigned
mhdr,
joineui,
deveui,
devnonce,
)
"Q" stands for long long unsigned (8byte). If you want to use float instead you can use d for double float precision (8byte).
You can see the meaning of all letters in the documentation of struct.

Encoding/Decoding (Berkeley database records) in python3

I have a pre-existing berkeley database written to and read from a program written in C++. I need to sidestep using this program and write to the database directly using python.
I can do this, but am having a heck of a time trying to encode my data properly such that it is in the proper form and can then be read by the original C++ program. In fact, I can't figure out how to decode the existing data when I know what the values are.
The keys of the key value pairs in the database should be timestamps in the form YYYYMMDDHHmmSS. The values should be five doubles and an int mashed together, by which I mean (from the source code of the C++ program), the following structure(?) DVALS
typedef struct
{
double d1;
double d2;
double d3;
double d4;
double d5;
int i1;
} DVALS;
is written to the database as the value of the key value pair like so:
DBT data;
memset(&data, 0, sizeof(DBT));
DVALS dval;
memset(&dval, 0, sizeof(DVALS));
data.data = &dval;
data.size = sizeof(DVALS);
db->put(db, NULL, &key, &data, 0);
Luckily, I can know what the values are. So if I run from the command line
db_dump myfile
the final record is:
323031393033313431353533303000
ae47e17a140e4040ae47e17a140e4040ae47e17a140e4040ae47e17a140e40400000000000b6a4400000000000000000
Using python's bsddb3 module I can pull this record out also:
from bsddb3 import db
myDB = db.DB()
myDB.open('myfile', None, db.DB_BTREE)
cur = myDB.cursor()
kvpair = cur.last()
With kvpair now holding the following information:
(b'20190314155300\x00', b'\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\x00\x00\x00\x00\x00\xb6\xa4#\x00\x00\x00\x00\x00\x00\x00\x00')
The timestamp is easy to read and in this case the actual values are as follows:
d1 = d2 = d3 = d4 = 32.11
d5 = 2651
i1 = 0
As the '\xaeG\xe1z\x14\x0e##' sequence is repeated 4 times I think it corresponds to the value 32.11
So I think my question may just be about encoding/decoding, but perhaps there is more to it, hence the backstory.
kvpair[1].decode('utf-8')
Using a variety of encodings just gives errors similar to this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte

The value data is binary so it may be unpacked using Python's struct module.
>>> import struct
>>> bs = b'\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\xaeG\xe1z\x14\x0e##\x00\x00\x00\x00\x00\xb6\xa4#\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(bs)
48
>>> struct.unpack('<5di4x', bs)
(32.11, 32.11, 32.11, 32.11, 2651.0, 0)
struct.unpack takes two arguments: a format specifier that defines the data format and types and the data to be unpacked. The format '<5di4x' describes:
<: little endian order
5d: five doubles (8 bytes each)
i: one signed int (4 bytes; I for unsigned)
4x: four pad bytes
Data can be packed in the same way, using struct.pack.
>>> nums = [32.11, 32.11, 32.11, 32.11, 2651, 0]
>>> format_ = '5di4x'
>>> packed = struct.pack(format_, *nums)
>>> packed == bs
True
>>>

How to print values of a string full of "chaos question marks"

I'm debugging with python audio, having a hard time with the audio coding.
Here I have a string full of audio data, say, [10, 20, 100].
However the data is stored in a string variable,
data = "����������������"
I want to inspect the values of this string.
Below is the things I tried
Print as int
I tried to use print "%i" % data[0]
ended up with
Traceback (most recent call last):
File "wire.py", line 28, in <module>
print "%i" % data[i]
TypeError: %d format: a number is required, not str
Convert to int
int(data[0]) ended up with
Traceback (most recent call last):
File "wire.py", line 27, in <module>
print int(data[0])
ValueError: invalid literal for int() with base 10: '\xd1'
Any idea on this? I want to print the string in a numerical way since the string is actually an array of sound wave.
EDIT
All your answers turned out to be really helpful.
The string is actually generated from the microphone so I believe it to be raw wave form, or vibration data. Further this should be referred to the audio API document, PortAudio.
After looking into PortAudio, I find this helpful example.
** This routine will be called by the PortAudio engine when audio is needed.
** It may called at interrupt level on some machines so don't do anything
** that could mess up the system like calling malloc() or free().
static int patestCallback( const void *inputBuffer, void *outputBuffer,
unsigned long framesPerBuffer,
const PaStreamCallbackTimeInfo* timeInfo,
PaStreamCallbackFlags statusFlags,
void *userData )
{
paTestData *data = (paTestData*)userData;
float *out = (float*)outputBuffer;
unsigned long i;
(void) timeInfo; /* Prevent unused variable warnings. */
(void) statusFlags;
(void) inputBuffer;
for( i=0; i<framesPerBuffer; i++ )
{
*out++ = data->sine[data->left_phase]; /* left */
*out++ = data->sine[data->right_phase]; /* right */
data->left_phase += 1;
if( data->left_phase >= TABLE_SIZE ) data->left_phase -= TABLE_SIZE;
data->right_phase += 3; /* higher pitch so we can distinguish left and right. */
if( data->right_phase >= TABLE_SIZE ) data->right_phase -= TABLE_SIZE;
}
return paContinue;
}
This indicates that there is some way that I can interpret the data as float

To be clear, your audio data is a byte string. The byte string is a representation of the bytes stored in the audio file. You are not going to simply be able to convert those bytes into meaningful values without knowing what is in the binary first.
As an example, the mp3 specification says that each mp3 contains header frames (described here: http://en.wikipedia.org/wiki/MP3). To read the header you would either need to use something like bitstring, or if you feel comfortable doing the bitwise manipulation yourself then you would just need to unpack an integer (4 bytes) and do some math to figure out the values of the 32 individual bits.
It really all depends on what you are trying to read, and how the data was generated. If you have whole byte numbers, then struct will serve you well.

If you're ok with the \xd1 mentioned above:
for item in data: print repr(item),
Note that for x in data will iterate over each value in the list rather than its location. If you want the location you can use for i in range(len(data)): ...
If you want them in numerical form, replace repr(item) with ord(item).

It is better if you use the new {}.format method:
data = "����������������"
print '{0}'.format(data[3])

You could use ord to map each byte to its numeric value between 0-255:
print map(ord, data)
Or, for Python 3 compatibility, do:
print([ord(c) for c in data])
It will also work with Unicode glyphs, which might not be what you want, so make sure you have a bytearray or an actual str or bytes object in Python 2.

how to convert wav file to float amplitude

so I asked everything in the title:
I have a wav file (written by PyAudio from an input audio) and I want to convert it in float data corresponding of the sound level (amplitude) to do some fourier transformation etc...
Anyone have an idea to convert WAV data to float?

I have identified two decent ways of doing this.
Method 1: using the wavefile module
Use this method if you don't mind installing some extra libraries which involved a bit of messing around on my Mac but which was easy on my Ubuntu server.
https://github.com/vokimon/python-wavefile
import wavefile
# returns the contents of the wav file as a double precision float array
def wav_to_floats(filename = 'file1.wav'):
w = wavefile.load(filename)
return w[1][0]
signal = wav_to_floats(sys.argv[1])
print "read "+str(len(signal))+" frames"
print "in the range "+str(min(signal))+" to "+str(max(signal))
Method 2: using the wave module
Use this method if you want less module install hassles.
Reads a wav file from the filesystem and converts it into floats in the range -1 to 1. It works with 16 bit files and if they are > 1 channel, will interleave the samples in the same way they are found in the file. For other bit depths, change the 'h' in the argument to struct.unpack according to the table at the bottom of this page:
https://docs.python.org/2/library/struct.html
It will not work for 24 bit files as there is no data type that is 24 bit, so there is no way to tell struct.unpack what to do.
import wave
import struct
import sys
def wav_to_floats(wave_file):
w = wave.open(wave_file)
astr = w.readframes(w.getnframes())
# convert binary chunks to short
a = struct.unpack("%ih" % (w.getnframes()* w.getnchannels()), astr)
a = [float(val) / pow(2, 15) for val in a]
return a
# read the wav file specified as first command line arg
signal = wav_to_floats(sys.argv[1])
print "read "+str(len(signal))+" frames"
print "in the range "+str(min(signal))+" to "+str(max(signal))

I spent hours trying to find the answer to this. The solution turns out to be really simple: struct.unpack is what you're looking for. The final code will look something like this:
rawdata=stream.read() # The raw PCM data in need of conversion
from struct import unpack # Import unpack -- this is what does the conversion
npts=len(rawdata) # Number of data points to be converted
formatstr='%ih' % npts # The format to convert the data; use '%iB' for unsigned PCM
int_data=unpack(formatstr,rawdata) # Convert from raw PCM to integer tuple
Most of the credit goes to Interpreting WAV Data. The only trick is getting the format right for unpack: it has to be the right number of bytes and the right format (signed or unsigned).

Most wave files are in PCM 16-bit integer format.
What you will want to:
Parse the header to known which format it is (check the link from Xophmeister)
Read the data, take the integer values and convert them to float
Integer values range from -32768 to 32767, and you need to convert to values from -1.0 to 1.0 in floating points.
I don't have the code in python, however in C++, here is a code excerpt if the PCM data is 16-bit integer, and convert it to float (32-bit):
short* pBuffer = (short*)pReadBuffer;
const float ONEOVERSHORTMAX = 3.0517578125e-5f; // 1/32768
unsigned int uFrameRead = dwRead / m_fmt.Format.nBlockAlign;
for ( unsigned int i = 0; i < uFrameCount * m_fmt.Format.nChannels; ++i )
{
short i16In = pBuffer[i];
out_pBuffer[i] = (float)i16In * ONEOVERSHORTMAX;
}
Be careful with stereo files, as the stereo PCM data in wave files is interleaved, meaning the data looks like LRLRLRLRLRLRLRLR (instead of LLLLLLLLRRRRRRRR). You may or may not need to de-interleave depending what you do with the data.

This version reads a wav file from the filesystem and converts it into floats in the range -1 to 1. It works with files of all sample widths and it will interleave the samples in the same way they are found in the file.
import wave
def read_wav_file(filename):
def get_int(bytes_obj):
an_int = int.from_bytes(bytes_obj, 'little', signed=sampwidth!=1)
return an_int - 128 * (sampwidth == 1)
with wave.open(filename, 'rb') as file:
sampwidth = file.getsampwidth()
frames = file.readframes(-1)
bytes_samples = (frames[i : i+sampwidth] for i in range(0, len(frames), sampwidth))
return [get_int(b) / pow(2, sampwidth * 8 - 1) for b in bytes_samples]
Also here is a link to the function that converts floats back to ints and writes them to desired wav file:
https://gto76.github.io/python-cheatsheet/#writefloatsamplestowavfile

The Microsoft WAVE format is fairly well documented. See https://ccrma.stanford.edu/courses/422/projects/WaveFormat/ for example. It wouldn't take much to write a file parser to open and interpret the data to get the information you require... That said, it's almost certainly been done before, so I'm sure someone will give an "easier" answer ;)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can pandas read c++ binary file directly? - python

Related

How do you wrap a C function that returns a pointer to a malloc'd array with ctypes?

Why can't Python struct module pack (or unpack) multi bytes with little endian

Encoding/Decoding (Berkeley database records) in python3

How to print values of a string full of "chaos question marks"

how to convert wav file to float amplitude

Categories

Resources