Different compression sizes using Python and C++

I am using zlib to compress some data, and I am running into a weird issue: the data compressed with Python is smaller than the same data compressed with C++. I have 130MB of simulation data I want to save compressed (there are too many files for all the necessary data).
Using C++, I have something of the sort:
// calculate inputData (double * 256 * 256 * 256)
unsigned int length = inputLength;
unsigned int outLength = length + length/1000 + 12 + 1;
printf("Length: %d %d\n", length, outLength);
Byte *outData = new Byte[outLength];

z_stream strm;
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.next_in = (Byte *) inputData;
strm.avail_in = length;

deflateInit(&strm, -1);
do {
    strm.next_out = outData;
    strm.avail_out = outLength;
    deflate(&strm, Z_FINISH);
    unsigned int have = outLength - strm.avail_out;
    fwrite(outData, 1, have, output);
} while (strm.avail_out == 0);
deflateEnd(&strm);
delete[] outData;
The result using C++ is around 120MB, which is hardly what I expect, as the original is close to 130MB.
In Python:
import array
import zlib

# read data from the uncompressed file into `data`
arrD = array.array('d', data)
file.write(zlib.compress(arrD))
The result using Python is around 50MB with the same input data, less than half. The C++ code is mostly based on the one used in Python's own implementation, which makes this issue even weirder.
For C++, I am using Visual Studio 2010 Professional with zlib 1.2.8 that I compiled myself.
For Python, I am using the official Python 3.4.2.
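To rule out an input mismatch, both outputs can be decompressed from Python and compared (a sketch; the file names are placeholders):

import zlib

# hypothetical file names for the two compressed outputs
with open("out_cpp.bin", "rb") as f:
    cpp_data = zlib.decompress(f.read())
with open("out_py.bin", "rb") as f:
    py_data = zlib.decompress(f.read())
print(len(cpp_data), len(py_data), cpp_data == py_data)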

Related

How to decode (from base64) a python np-array and reload it in c++ as a vector of floats?

In my project I work with word vectors as numpy arrays with a dimension of 300. I want to store the processed arrays in a MongoDB database, base64-encoded, because this saves a lot of storage space.
Python code
import base64
import numpy as np
vector = np.zeros(300, dtype=np.float32) # represents some word-vector
vector = base64.b64encode(vector) # base64 encoding
# Saving vector to MongoDB...
In MongoDB it is saved as binary data. In C++ I would like to load this binary data as a std::vector<float>. Therefore I have to decode the data first and then load it correctly. I was able to get the binary data into the C++ program with mongocxx and have it as a uint8_t* with a size of 1600 - but now I don't know what to do and would be happy if someone could help me. Thank you (:
C++ Code
const bsoncxx::document::element elem_vectors = doc["vectors"];
const bsoncxx::types::b_binary vectors = elem_vectors.get_binary();
const uint32_t b_size = vectors.size; // == 1600
const uint8_t* first = vectors.bytes;
// How To parse this as a std::vector<float> with a size of 300?
Solution
I added these lines to my C++ code and was able to load a vector with 300 elements and all correct values.
const std::string encoded(reinterpret_cast<const char*>(first), b_size);
std::string decoded = decodeBase64(encoded);
std::vector<float> vec(300);
for (size_t i = 0; i < decoded.size() / sizeof(float); ++i) {
    vec[i] = *(reinterpret_cast<const float*>(decoded.c_str() + i * sizeof(float)));
}
To mention: thanks to @Holt's info, it is not wise to base64-encode a NumPy array and then store it as binary. It is much better to call .tobytes() on the NumPy array and store that in MongoDB directly, because this reduces the document size from 1.7 kB (base64) to 1.2 kB (tobytes()) and also saves computation time, since the encoding (and decoding!) no longer has to be computed.
Thanks @Holt for pointing out my mistake.
First, you can't save storage space by using base64 encoding. On the contrary, it wastes your storage. For an array with 300 floats, the raw storage is only 300 * 4 = 1200 bytes, while after you encode it, the storage grows to 1600 bytes! See more about base64 here.
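A quick size check in Python bears this out (a sketch; the 300-float array mirrors the question):

import base64
import numpy as np

vector = np.zeros(300, dtype=np.float32)
raw = vector.tobytes()           # 300 * 4 = 1200 bytes of raw float32 data
encoded = base64.b64encode(raw)  # grows to 1600 bytes after encoding
print(len(raw), len(encoded))    # -> 1200 1600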
Second, you want to parse the bytes into a vector<float>. You need to decode the bytes if you still use the base64 encoding. I suggest you use some third-party library or try this question. Suppose you already have the decode function.
std::string base64_decode(std::string const& encoded_string); // or something like that.
You need to use reinterpret_cast to get the value.
const std::string encoded(reinterpret_cast<const char*>(first), b_size);
std::string decoded = base64_decode(encoded);
std::vector<float> vec(300);
for (size_t i = 0; i < decoded.size() / sizeof(float); ++i) {
    vec[i] = *(reinterpret_cast<const float*>(decoded.c_str()) + i);
}

Freeing memory when using ctypes

I am using ctypes to try and speed up my code.
My problem is similar to the one in this tutorial : https://cvstuff.wordpress.com/2014/11/27/wraping-c-code-with-python-ctypes-memory-and-pointers/
As pointed out in the tutorial, I should free the memory after using the C function. Here is my C code:
// C functions
double* getStuff(double *R_list, int items){
    double results[items];
    double* results_p;
    for(int i = 0; i < items; i++){
        double res = calculation; // do some calculation
        results[i] = res;
    }
    results_p = results;
    printf("C allocated address %p \n", results_p);
    return results_p;
}

void free_mem(double *a){
    printf("freeing address: %p\n", a);
    free(a);
}
Which I compile with gcc -shared -Wl,-lgsl,-soname, simps -o libsimps.so -fPIC simps.c
And Python:
# Python
from ctypes import *
import numpy as np
mydll = CDLL("libsimps.so")
mydll.getStuff.restype = POINTER(c_double)
mydll.getStuff.argtypes = [POINTER(c_double),c_int]
mydll.free_mem.restype = None
mydll.free_mem.argtypes = [POINTER(c_double)]
R = np.logspace(np.log10(0.011),1, 100, dtype = float) #input
tracers = c_int(len(R))
R_c = R.ctypes.data_as(POINTER(c_double))
for_list = mydll.getStuff(R_c,tracers)
print 'Python allocated', hex(for_list)
for_list_py = np.array(np.fromiter(for_list, dtype=np.float64, count=len(R)))
mydll.free_mem(for_list)
Up to the last line the code does what I want it to, and the for_list_py values are correct. However, when I try to free the memory, I get a segmentation fault, and on closer inspection the address associated with for_list (via hex(for_list)) is different from the one allocated to results_p within the C part of the code.
As pointed out in this question (Python ctypes: how to free memory? Getting invalid pointer error), for_list will return the same address if mydll.getStuff.restype is set to c_void_p. But then I struggle to put the actual values I want into for_list_py. This is what I've tried:
cast(for_list, POINTER(c_double) )
for_list_py = np.array(np.fromiter(for_list, dtype=np.float64, count=len(R)))
mydll.free_mem(for_list)
where the cast operation seems to change for_list into an integer. I'm fairly new to C and very confused. Do I need to free that chunk of memory? If so, how do I do that whilst also keeping the output in a numpy array? Thanks!
Edit: It appears that the address allocated in C and the one I'm trying to free are the same, though I still receive a Segmentation fault.
C allocated address 0x7ffe559a3960
freeing address: 0x7ffe559a3960
Segmentation fault
If I do print for_list I get <__main__.LP_c_double object at 0x7fe2fc93ab00>
Conclusion
Just to let everyone know, I've struggled with ctypes for a bit.
I ended up opting for SWIG instead of ctypes, and found that the code runs faster on the whole (compared to the version presented here). I found this documentation on memory deallocation in SWIG very useful: https://scipy-cookbook.readthedocs.io/items/SWIG_Memory_Deallocation.html. SWIG also gives you a very easy way of dealing with numpy n-dimensional arrays.
After the getStuff function exits, the stack memory occupied by the results array is no longer valid, and it was never allocated with malloc in the first place, so when you try to free it, the program crashes.
Try this instead:
double* getStuff(double *R_list, int items)
{
    double* results_p = malloc(sizeof(*results_p) * (items + 1));
    if (results_p == NULL)
    {
        // handle error
    }
    for(int i = 0; i < items; i++)
    {
        double res = calculation; // do some calculation
        results_p[i] = res;
    }
    printf("C allocated address %p \n", results_p);
    return results_p;
}
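On the Python side, the returned buffer can then be copied into a NumPy array before being released (a sketch that assumes the corrected getStuff above together with the free_mem wrapper from the question):

import numpy as np
from ctypes import CDLL, POINTER, c_double, c_int

mydll = CDLL("libsimps.so")
mydll.getStuff.restype = POINTER(c_double)
mydll.getStuff.argtypes = [POINTER(c_double), c_int]
mydll.free_mem.restype = None
mydll.free_mem.argtypes = [POINTER(c_double)]

R = np.logspace(np.log10(0.011), 1, 100)
ptr = mydll.getStuff(R.ctypes.data_as(POINTER(c_double)), len(R))
for_list_py = np.array(ptr[:len(R)])  # copy into Python-owned memory first
mydll.free_mem(ptr)                   # safe now: the buffer came from malloc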

Passing C++ double array to Python Results in a Crash

I'm running into an issue while trying to pass a double array from C++ to Python. I run a script to create a binary file with data, then read that data back into an array and try to pass the array to Python. I've followed the advice here: how to return array from c function to python using ctypes, among other pages I have found through Google. I can write a generic example that works fine (like a similar array to the link above), but when I try to pass the array read from a binary file (code below), the program crashes with "Unhandled exception at ADDR (ucrtbase.dll) in python.exe: An invalid parameter was passed to a function that considers invalid parameters fatal." So, I'm wondering if anyone has any insight.
A word on methodology:
Right now, I'm just trying to learn - that's why I'm going through the convoluted process of saving to disk, loading, and passing to Python. Eventually, I will use this in scientific simulations where the data read from disk needs to be generated by distributed computing/a supercomputer. I would like to use Python for its ease of plotting (matplotlib) and C++ for its speed (iterative calculations, etc.).
So, on to my code. This generates the binary file:
for (int zzz = 0; zzz < arraysize; ++zzz)
{
    for (int yyy = 0; yyy < arraysize; ++yyy)
    {
        for (int xxx = 0; xxx < arraysize; ++xxx)
        {   // totalBatP returns a 3 element std::vector<double>; dblArray3_t is basically that with a few overloaded operators (+, -, etc)
            dblArray3_t BatP = B.totalBatP({ -5 + xxx * stepsize, -5 + yyy * stepsize, -5 + zzz * stepsize }, 37);
            for (int bbb = 0; bbb < 3; ++bbb)
            {
                dataarray[loopind] = BatP[bbb];
                ++loopind;
            }
        }
    }
}

FILE* binfile;
binfile = fopen("MBdata.bin", "wb");
fwrite(dataarray, 8, 3 * arraysize * arraysize * arraysize, binfile);
The code that reads the file:
DLLEXPORT double* readDblBin(const std::string filename, unsigned int numOfDblsToRead)
{
    char* buffer = new char[numOfDblsToRead];
    std::ifstream binFile;
    binFile.open(filename, std::ios::in | std::ios::binary);
    binFile.read(buffer, numOfDblsToRead);
    double* dataArray = (double*)buffer;
    binFile.close();
    return dataArray;
}
And the Python Code that receives the array:
def readBDataWrapper(filename, numDblsToRead):
    fileIO = ctypes.CDLL('./fileIO.dll')
    fileIO.readDblBin.argtypes = (ctypes.c_char_p, ctypes.c_uint)
    fileIO.readDblBin.restype = ctypes.POINTER(ctypes.c_double)
    return fileIO.readDblBin(filename, numDblsToRead)
One possible problem is here
char* buffer = new char[numOfDblsToRead];
Here you allocate numOfDblsToRead bytes. You probably want numOfDblsToRead * sizeof(double).
The same goes for the reading from the file: you only read numOfDblsToRead bytes.
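As a quick cross-check of the file itself, independent of the DLL, NumPy can read the doubles back directly (a sketch using the file name from the question):

import numpy as np

# read the doubles directly, bypassing the DLL, to check the file itself
data = np.fromfile("MBdata.bin", dtype=np.float64)
print(data.size, data[:3])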
I figured it out - at least it appears to be working. The problem was with the binary files generated by the first code block. I swapped the C-style writing for std::ofstream; my assumption is that I was somehow using the C API incorrectly when writing to disk. Anyway, it appears to work now.
Replaced:
FILE* binfile;
binfile = fopen("MBdata.bin", "wb");
fwrite(dataarray, 8, 3 * arraysize * arraysize * arraysize, binfile);
With:
std::ofstream binfile;
binfile.open("MBdata.bin", std::ios::binary | std::ios::out);
binfile.write(reinterpret_cast<const char*>(dataarray), std::streamsize(totaliter * sizeof(double)));
binfile.close();

Accessing bitfields while reading/writing binary data structures

I'm writing a parser for a binary format. This format involves different tables, also in binary, each containing fields of varying sizes (usually somewhere between 50 and 100 of them).
Most of these structures will have bitfields and will look something like these when represented in C:
struct myHeader
{
    unsigned char fieldA : 3;
    unsigned char fieldB : 2;
    unsigned char fieldC : 3;
    unsigned short fieldD : 14;
    unsigned char fieldE : 4;
};
I came across the struct module, but realized that its lowest resolution is a byte, not a bit; otherwise the module would pretty much be the right fit for this work.
I know bitfields are supported using ctypes, but I'm not sure how to interface ctypes structs containing bitfields here.
My other option is to manipulate the bits myself and feed them into bytes for use with the struct module (sketched below) - but since I have close to 50-100 different types of such structures, writing the code for that becomes more error-prone. I'm also worried about efficiency, since this tool might be used to parse gigabytes of binary data.
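For illustration, the manual masking would look something like this (a sketch only; it assumes the 26 bits are packed LSB-first into a little-endian word with no padding, and the real layout depends on the compiler):

import struct

def parse_myheader(raw):
    # Assumed layout: fieldA in bits 0-2, fieldB in 3-4, fieldC in 5-7,
    # fieldD in 8-21, fieldE in 22-25. This per-structure boilerplate is
    # exactly what I'd have to repeat 50-100 times.
    (val,) = struct.unpack('<I', raw[:4])
    return {
        'fieldA': val & 0x7,
        'fieldB': (val >> 3) & 0x3,
        'fieldC': (val >> 5) & 0x7,
        'fieldD': (val >> 8) & 0x3FFF,
        'fieldE': (val >> 22) & 0xF,
    }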
Thanks.
Using bitstring (which you mention you're looking at) it should be easy enough to implement. First to create some data to decode:
>>> myheader = "3, 2, 3, 14, 4"
>>> a = bitstring.pack(myheader, 1, 0, 5, 1000, 2)
>>> a.bin
'00100101000011111010000010'
>>> a.tobytes()
'%\x0f\xa0\x80'
And then decoding it again is just
>>> a.readlist(myheader)
[1, 0, 5, 1000, 2]
Your main concern might well be the speed. The library is well optimised Python, but that's not nearly as fast as a C library would be.
I haven't rigorously tested this, but it seems to work with unsigned types (edit: it works with signed byte/short types, too).
Edit 2: This is really hit or miss. It depends on the way the library's compiler packed the bits into the struct, which is not standardized. For example, with gcc 4.5.3 it works as long as I don't use the attribute to pack the struct, i.e. __attribute__ ((__packed__)) (so instead of 6 bytes it gets packed into 4 bytes, which you can check with __alignof__ and sizeof). I can make it almost work by adding _pack_ = True to the ctypes Structure definition, but it fails for fieldE. gcc notes: "Offset of packed bit-field ‘fieldE’ has changed in GCC 4.4".
import ctypes

class MyHeader(ctypes.Structure):
    _fields_ = [
        ('fieldA', ctypes.c_ubyte, 3),
        ('fieldB', ctypes.c_ubyte, 2),
        ('fieldC', ctypes.c_ubyte, 3),
        ('fieldD', ctypes.c_ushort, 14),
        ('fieldE', ctypes.c_ubyte, 4),
    ]

lib = ctypes.cdll.LoadLibrary('C/bitfield.dll')
hdr = MyHeader()
lib.set_header(ctypes.byref(hdr))
for x in hdr._fields_:
    print("%s: %d" % (x[0], getattr(hdr, x[0])))
Output:
fieldA: 3
fieldB: 1
fieldC: 5
fieldD: 12345
fieldE: 9
C:
typedef struct _MyHeader {
    unsigned char fieldA : 3;
    unsigned char fieldB : 2;
    unsigned char fieldC : 3;
    unsigned short fieldD : 14;
    unsigned char fieldE : 4;
} MyHeader, *pMyHeader;

int set_header(pMyHeader hdr) {
    hdr->fieldA = 3;
    hdr->fieldB = 1;
    hdr->fieldC = 5;
    hdr->fieldD = 12345;
    hdr->fieldE = 9;
    return(0);
}

What is the fastest way to read in a large data file of text columns?

I have a data file of almost 9 million lines (soon to be more than 500 million lines) and I'm looking for the fastest way to read it in. The five aligned columns are padded and separated by spaces, so I know where on each line to look for the two fields that I want.
My Python routine takes 45 secs:
import sys, time

start = time.time()
filename = 'test.txt'  # space-delimited, aligned columns
trans = []
numax = 0
for line in open(filename, 'r'):
    nu = float(line[-23:-11]); S = float(line[-10:-1])
    if nu > numax: numax = nu
    trans.append((nu, S))
end = time.time()
print len(trans), 'transitions read in %.1f secs' % (end-start)
print 'numax =', numax
whereas the routine I've come up with in C is a more pleasing 4 secs:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define BPL 47
#define FILENAME "test.txt"
#define NTRANS 8858226
int main(void) {
    size_t num;
    unsigned long i;
    char buf[BPL];
    char* sp;
    double *nu, *S;
    double numax;
    FILE *fp;
    time_t start, end;

    nu = (double *)malloc(NTRANS * sizeof(double));
    S = (double *)malloc(NTRANS * sizeof(double));

    start = time(NULL);
    if ((fp = fopen(FILENAME, "rb")) != NULL) {
        i = 0;
        numax = 0.;
        do {
            if (i == NTRANS) { break; }
            num = fread(buf, 1, BPL, fp);
            buf[BPL-1] = '\0';
            sp = &buf[BPL-10]; S[i] = atof(sp);
            buf[BPL-11] = '\0';
            sp = &buf[BPL-23]; nu[i] = atof(sp);
            if (nu[i] > numax) { numax = nu[i]; }
            ++i;
        } while (num == BPL);
        fclose(fp);
        end = time(NULL);
        fprintf(stdout, "%d lines read; numax = %12.6f\n", (int)i, numax);
        fprintf(stdout, "that took %.1f secs\n", difftime(end, start));
    } else {
        fprintf(stderr, "Error opening file %s\n", FILENAME);
        free(nu); free(S);
        return EXIT_FAILURE;
    }
    free(nu); free(S);
    return EXIT_SUCCESS;
}
Solutions in Fortran, C++ and Java take intermediate amounts of time (27 secs, 20 secs, 8 secs).
My question is: have I made any outrageous blunders in the above (particularly the C-code)? And is there any way to speed up the Python routine? I quickly realised that storing my data in an array of tuples was better than instantiating a class for each entry.
Some points:
Your C routine is cheating; it is being tipped off with the filesize, and is pre-allocating ...
Python: consider using array.array('d') ... one each for S and nu. Then try pre-allocation.
Python: write your routine as a function and call it -- accessing function-local variables is rather faster than accessing module-global variables.
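Combining the second and third points might look something like this (a sketch; ntrans plays the role of the C code's NTRANS and would come from a line count or the file size):

from array import array

def read_transitions(filename, ntrans):
    # function-local names, plus pre-allocated array.array('d') columns
    # instead of a growing list of tuples
    nu = array('d', [0.0]) * ntrans
    S = array('d', [0.0]) * ntrans
    numax = 0.0
    i = 0
    for line in open(filename, 'r'):
        v = float(line[-23:-11])
        nu[i] = v
        S[i] = float(line[-10:-1])
        if v > numax:
            numax = v
        i += 1
    return nu, S, numax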
An approach that could probably be applied to the C, C++ and Python versions would be to memory-map the file. The most significant benefit is that it can reduce the amount of double-handling of data as it is copied from one buffer to another. In many cases there are also benefits due to the reduction in the number of system calls for I/O.
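In Python, for instance, the idea might look like this (a sketch that assumes the fixed 47-byte records from the C code; not benchmarked here):

import mmap

BPL = 47  # fixed record length, as in the C code

with open('test.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    numax = 0.0
    trans = []
    for offset in range(0, mm.size(), BPL):
        line = mm[offset:offset + BPL]  # slicing copies only this record
        nu = float(line[-23:-11])
        S = float(line[-10:-1])
        if nu > numax:
            numax = nu
        trans.append((nu, S))
    mm.close()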
In the C implementation, you could try swapping the fopen()/fread()/fclose() library functions for the lower-level system calls open()/read()/close(). A speedup may come from the fact that fread() does a lot of buffering, whereas read() does not.
Additionally, calling read() less often with bigger chunks will reduce the number of system calls and therefore you'll have less switching between userspace and kernelspace. What the kernel does when you issue a read() system call (doesn't matter if it was invoked from the fread() library function) is read the data from the disk and then copy it to the userspace. The copying part becomes expensive if you issue the system call very often in your code. By reading in larger chunks you'll end up with less context switches and less copying.
Keep in mind though that read() isn't guaranteed to return a block of the exact number of bytes you wanted. This is why in a reliable and proper implementation you always have to check the return value of the read().
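For comparison, the same chunked pattern is available from Python's os module, where the partial-read caveat applies too (a sketch with an arbitrary chunk size):

import os

CHUNK = 1 << 20  # read 1 MiB per system call (illustrative size)
BPL = 47         # fixed record length

fd = os.open('test.txt', os.O_RDONLY)
try:
    leftover = b''
    while True:
        block = os.read(fd, CHUNK)  # may return fewer bytes than requested
        if not block:
            break
        data = leftover + block
        nrecords = len(data) // BPL
        # ... process the nrecords complete BPL-byte lines in data ...
        leftover = data[nrecords * BPL:]  # keep any partial record
finally:
    os.close(fd)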
You have the 1 and the BPL arguments the wrong way around in fread() (the way you have it, it could read a partial line, which you don't test for). You should also be testing the return value of fread() before you try and use the returned data.
You might be able to speed the C version up a bit by reading more than one line at a time:
#define LINES_PER_READ 1000
char buf[LINES_PER_READ][BPL];

/* ... */

while (i < NTRANS && (num = fread(buf, BPL, LINES_PER_READ, fp)) > 0) {
    int line;
    for (line = 0; i < NTRANS && line < num; line++)
    {
        buf[line][BPL-1] = '\0';
        sp = &buf[line][BPL-10]; S[i] = atof(sp);
        buf[line][BPL-11] = '\0';
        sp = &buf[line][BPL-23]; nu[i] = atof(sp);
        if (nu[i] > numax) { numax = nu[i]; }
        ++i;
    }
}
On systems supporting posix_fadvise(), you should also do this upfront, after opening the file:
posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);
Another possible speed-up, given the number of times you need to do it, is to use pointers to S and nu instead of indexing into arrays, e.g.,
double *pS = S, *pnu = nu;
...
*pS++ = atof(sp);
*pnu++ = atof(sp);
...
Also, since you are always converting from char to double at the same locations in buf, pre-compute the addresses outside of your loop instead of computing them each time in the loop.
