Reading large binary files (>2GB) with Python

I am writing a program to process some binary files.
I used to use numpy.fromfile and everything worked fine until I came across some big binary files (>2 GB) that numpy can't read in one go (memory problems). After trying unsuccessfully with h5py (I didn't get how to convert my files to h5 files), I tried to use open(), read() and struct.unpack_from in order to reconstruct the data as I would have done in C++.
My binary files represent 32-bit floats that are to be paired into 64-bit complex numbers.
The problem at the moment is that, even though from the info I gathered struct.unpack_from() should return a tuple with all the data of the specified type in the buffer, it only returns the first element of the file.
The code:
import struct

f1 = open(IQ_File, 'rb')
a1 = f1.read()
f = struct.unpack_from('f', a1)
print(f)
What I am expecting here is an output with the binary converted back to floats; however, my output is only:
(-0.057812511920928955,)
-- a tuple containing only the first float of the file.
I really don't understand what I am doing wrong here.
What should I be doing differently?

Pack/unpack format characters can be prefixed with a repeat count to pack/unpack that many items at once. Just divide the data size by the size of a float and put that number in the format:
nf = len(a1) // struct.calcsize('f')
f = struct.unpack(f"{nf}f", a1)
Mind that a tuple is a very inefficient way to store numeric array data in Python. On 64-bit CPython, a tuple of N floats uses 24 + N*8 bytes (the PyVarObject header plus N pointers) for the tuple itself, plus N*24 bytes (the PyObject header plus one double) for the float objects (Python floats are stored internally as doubles), i.e. 24 + N*32 bytes in total. That's 8 times the size of the binary data!
A better option is to use numpy.fromfile() and explicitly provide the count and possibly offset arguments in order to read the file in chunks. If you need to know in advance how many floats there are in total in the file, use os.stat():
nf = os.stat(IQ_File).st_size // struct.calcsize('f')
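For illustration, here is a minimal sketch of that chunked approach, assuming numpy >= 1.17 (for the offset argument of numpy.fromfile) and that, as in the question, consecutive float32 pairs form complex64 samples; the chunk size is an arbitrary (even) choice, not part of the original answer:
import os
import numpy as np

CHUNK = 1_000_000                          # floats per read; arbitrary, kept even
n_total = os.stat(IQ_File).st_size // 4    # float32 is 4 bytes

for start in range(0, n_total, CHUNK):
    count = min(CHUNK, n_total - start)
    # offset is in bytes, count is in items of the given dtype
    block = np.fromfile(IQ_File, dtype=np.float32,
                        count=count, offset=start * 4)
    iq = block.view(np.complex64)          # pair consecutive floats as (re, im)
    # ... process this chunk of complex samples here ...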

unpack_from('f', data) reads a single float from data. You are probably looking for struct.iter_unpack(), which yields one tuple per value:
for (f,) in struct.iter_unpack('f', a1):
    print(f)
You can probably make this more efficient by reading only a small amount of the file (say, 64 KB at a time) in a separate loop, along the lines of the sketch below.
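A rough sketch of that chunked loop, assuming the file size is a multiple of 4 bytes (which it should be if it really contains only float32 values); the 64 KB figure is just the suggestion above:
import struct

values = []
with open(IQ_File, 'rb') as f1:
    while True:
        chunk = f1.read(64 * 1024)     # 64 KB is a multiple of 4, so every chunk
        if not chunk:                  # stays aligned to float32 boundaries
            break
        values.extend(v for (v,) in struct.iter_unpack('f', chunk))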

Related

Reading file with pandas.read_csv: why is a particular column read as a string rather than float?

A little background:
I am running a binary stellar evolution code and storing the evolution histories as gzipped .dat files. I used to have a smaller dataset resulting in some ~2000 .dat files, which I read during post-processing by appending the list data from each file to create a 3d list. Each .dat file looks somewhat like this file.
But recently I started working with a larger dataset and the number of evolutionary history files rose to ~100000. So I decided to compress the .dat files as gzips and save them in a zipped folder, the reason being that I am doing all this on a remote server and have a limited disk quota.
Main query:
During post-processing, I try to read data using pandas from all these files as 2d numpy arrays which are stacked to form a 3d list (each file has a different length so I could not use numpy.append and have to use lists instead). To achieve this, I use this:
def read_evo_history(EvoHist, zipped, z):
    ehists = []
    for i in range(len(EvoHist)):
        if zipped == True:
            try:
                ehists.append(pd.read_csv(z.open(EvoHist[i]), delimiter="\t",
                                           compression='gzip', header=None).to_numpy())
            except pd.errors.EmptyDataError:
                pass
    return ehists
outdir = "plots"
indir = "OutputFiles_allsys"

z = zipfile.ZipFile(indir + '.zip')

EvoHist = []
for filename in z.namelist():
    if not os.path.isdir(filename):
        # read the file
        if filename[0:len("OutputFiles_allsys/EvoHist")] == "OutputFiles_allsys/EvoHist":
            EvoHist.append(filename)

zipped = True
ehists = read_evo_history(EvoHist, zipped, z)
del z  # Cleanup (if there's no further use of it after this)
The problem I am now facing is that one particular column in the data is being read as a list of strings, rather than floats. Do I need to somehow convert the datatype while reading the file? Or is this being caused by datatype inconsistencies in the files being read? Is there a way to get the data as a 3d list of numpy arrays of floats?
P.S.: If this is being caused by inconsistencies in the input files, then I am afraid I won't be able to run my binary stellar evolution code again as it takes days to produce all these files.
I will be more than happy to clarify more on this if needed. Thanks in advance for your time and energy.
Edit:
I noticed that only the 16th column of some files is being read in as a string. And I think this is because there are some NaN values in there, but I may be wrong.
This image shows the raw data with the NaN values pointed out. A demonstration showing that particular column being read as strings can be seen here. However, another column is read as floats: image.
The workaround for overcoming a missing value was simple: pandas.read_csv has a parameter called na_values which allows users to pass specified values that they want to be read as NaNs. From the pandas docs:
na_values: scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default, the following values are interpreted as NaN: '', ... 'NA', ...
Pandas itself is smart enough to automatically recognize those values without us explicitly stating it. But in my case, the file had NaN values written as 'nan ' (yeah, with a space!), which is why I was facing this issue. A minor change in the code fixed this:
pd.read_csv(z.open(EvoHist[i]), delimiter = "\t",
compression='gzip', header=None, na_values = 'nan ').to_numpy()

How can I partially read 2d satellite file in Python? (binary, fromfile)

I have a lot of satellite data that consists of two dimensions.
(I convert the H5 files to 2d array data that does not include latitude information;
I made the Lat/Lon information data separately.)
I know the real Lat/Lon coordinates and the grid coordinates in one data file.
How can I partially read a 2d satellite file in Python?
"numpy.fromfile" was usually used to read binary file.
if I use option as count in numpy.fromfile, I can read binary file partially.
However I want to skip front records in one data for save memory.
For example, I have 3x3 2d data as follows:
Python
a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
I just want to read the element at row 3, column 1, i.e. a[2][0] in zero-based Python indexing (result = 7).
When I read the file in Fortran, I use recl and rec:
Fortran
open(1, file='example.bin', access='direct', recl=4) ! recl=4 means 4 bytes
read(1, rec=lat*x-lon) filename
close(1)
lat means the position of latitude in the data
(lat = 3 in the above example; indexing starts at 1 in Fortran).
lon means the position of longitude in the data
(lon = 1 in the above example; indexing starts at 1 in Fortran).
x is the number of rows
(x = 3 in the above example; the array is 3x3).
This way I can read the file and use only 4 bytes of memory.
I want to know a similar method in Python.
Please give me some information on how to save time and memory.
Thank you for reading my question.
2016.10.28.
Solution
Python
import numpy as np

# the file `name` contains [1, 2, 3, 4, 5, 6, 7, 8, 9] stored as int8
a = np.memmap(name, dtype='int8', mode='r', shape=(1,), offset=6)
print(a[0])
# result: 7
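For the 4-byte records in the question (analogous to recl=4 in the Fortran snippet), a sketch of the same memmap idea might look like this, assuming the file stores float32 values row by row; the file name and the 3x3 shape are just the example from the question:
import numpy as np

nrows, ncols = 3, 3            # shape of the 2d data in the file
lat, lon = 3, 1                # 1-based position we want (value 7)

# byte offset of element (lat, lon) in a row-major float32 file
offset = ((lat - 1) * ncols + (lon - 1)) * np.dtype(np.float32).itemsize

a = np.memmap('example.bin', dtype=np.float32, mode='r',
              shape=(1,), offset=offset)
print(a[0])                    # only the page containing this element is actually read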
To read .h5 files:
import h5py
ds = h5py.File(filename, "r")
variable = ds['variable_name']
It's hard to follow your description. Some proper code indentation would help overcome your English language problems.
So you have data in an H5 file. The simplest approach is to use h5py to load it into a Python/numpy session, and select the necessary data from those arrays.
But it sounds as though you have written a portion of this data to a 'plain' binary file. It might help to know how you did it. Also in what way is this 2d?
np.fromfile reads a file as though it was 1d. Can you read this file, up to some count? And with a correct dtype?
np.fromfile accepts an open file. So I think you can open the file, use seek to skip forward, and then read count items from there. But I haven't tested that idea.
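An untested sketch of that seek-then-read idea, assuming row-major float32 data and reusing the 3x3 example from the question (names are illustrative):
import numpy as np

nrows, ncols = 3, 3
lat, lon = 3, 1                            # 1-based target position
itemsize = np.dtype(np.float32).itemsize   # 4 bytes

with open('example.bin', 'rb') as fh:
    # skip all records before the one we want
    fh.seek(((lat - 1) * ncols + (lon - 1)) * itemsize)
    value = np.fromfile(fh, dtype=np.float32, count=1)[0]

print(value)   # 7.0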

get specific content from file python

I have a file test.txt which has an array:
array = [3,5,6,7,9,6,4,3,2,1,3,4,5,6,7,8,5,3,3,44,5,6,6,7]
Now what I want to do is get the content of array and perform some calculations with the array. But the problem is that when I do open("test.txt") it outputs the content as a string. Actually the array is very big, and if I do a loop it might not be efficient. Is there any way to get the content without splitting on ,? Any new ideas?
I recommend that you save the file as json instead, and read it in with the json module. Either that, or make it a .py file, and import it as python. A .txt file that looks like a python assignment is kind of odd.
Does your text file need to look like python syntax? A list of comma separated values would be the usual way to provide data:
1,2,3,4,5
Then you could read/write with the csv module or the numpy functions mentioned above. There's a lot of documentation about how to read csv data in efficiently. Once you had your csv reader data object set up, data could be stored with something like:
data = [list(map(float, row)) for row in csvreader]
If you want to store a python-like expression in a file, store only the expression (i.e. without array =) and parse it using ast.literal_eval().
However, consider using a different format such as JSON. Depending on the calculations you might also want to consider using a format where you do not need to load all data into memory at once.
Must the array be saved as a string? Could you use a pickle file and save it as a Python list?
If not, could you try lazy evaluation? Maybe only process sections of the array as needed.
Possibly, if there are calculations on the entire array that you must always do, it might be a good idea to pre-compute those results and store them in the txt file either in addition to the list or instead of the list.
You could also use numpy to load the data from the file using numpy.genfromtxt or numpy.loadtxt. Both are pretty fast and both have the ability to do the recasting on load. If the array is already loaded though, you can use numpy to convert it to an array of floats, and that is really fast.
import numpy as np
a = np.array(["1", "2", "3", "4"])
a = a.astype(float)  # np.float is deprecated; plain float means float64 here
You could write a parser. They are very straightforward. And much much faster than regular expressions, please don't do that. Not that anyone suggested it.
# open up the file in read-only text mode (so splitting on ',' works in Python 3)
stream = open("file_full_of_numbers.txt", "r")
prefix = ''  # end of the last chunk
full_number_list = []
# get a chunk of the file at a time
while True:
    # just a small 1k chunk
    buffer = stream.read(1024)
    # no more data is left in the file
    if buffer == '':
        break
    # delimit this chunk of data by a comma
    split_result = buffer.split(",")
    # append the end of the last chunk to the first number
    split_result[0] = prefix + split_result[0]
    # save the end of the buffer (a partial number perhaps) for the next loop
    prefix = split_result[-1]
    # only work with full results, so skip the last one
    numbers = split_result[0:-1]
    # do something with the numbers we got (like save it into a full list)
    full_number_list += numbers
# now full_number_list contains all the numbers in text format
You'll also have to add some logic to use the prefix when the buffer is blank. But I'll leave that code up to you.
OK, so the following methods ARE dangerous. Since they are used to attack systems by injecting code into them, use them at your own risk.
array = eval(open("test.txt", 'r').read().strip('array = '))
execfile('test.txt') # this is the fastest but most dangerous; Python 2 only (use exec(open('test.txt').read()) in Python 3).
Safer methods.
import ast
array = ast.literal_eval(open("test.txt", 'r').read().strip('array = '))
...
array = [float(value) for value in open('test.txt', 'r').read().strip('array = [').strip('\n]').split(',')]
The easiest way to serialize Python objects so you can load them later is to use pickle, assuming you don't want a human-readable format, since that adds major overhead; otherwise, csv is fast and json is flexible.
import pickle
import random
array = random.sample(range(10**3), 20)
pickle.dump(array, open('test.obj', 'wb'))
loaded_array = pickle.load(open('test.obj', 'rb'))
assert array == loaded_array
pickle does have some overhead, and if you need to serialize large objects you can specify the pickle protocol; the default (protocol 0 in Python 2) is a verbose text format, and you can set it to pickle.HIGHEST_PROTOCOL for a more compact binary format: pickle.dump(array, open('test.obj', 'wb'), pickle.HIGHEST_PROTOCOL)
If you are working with large numerical or scientific data sets then use numpy.tofile/numpy.fromfile or scipy.io.savemat/scipy.io.loadmat; they have little overhead, but again only if you are already using numpy/scipy.
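A minimal sketch of the numpy.tofile/numpy.fromfile round trip (the file name is illustrative); note that tofile writes raw bytes with no dtype or shape metadata, so you must pass the matching dtype back to fromfile:
import numpy as np

array = np.arange(20, dtype=np.float64)

array.tofile('test.bin')                           # raw binary, no metadata
loaded = np.fromfile('test.bin', dtype=np.float64)

assert np.array_equal(array, loaded)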
good luck.

Fastest way to write HDF5 files with Python?

Given a large (10s of GB) CSV file of mixed text/numbers, what is the fastest way to create an HDF5 file with the same content, while keeping the memory usage reasonable?
I'd like to use the h5py module if possible.
In the toy example below, I've found an incredibly slow and incredibly fast way to write data to HDF5. Would it be best practice to write to HDF5 in chunks of 10,000 rows or so? Or is there a better way to write a massive amount of data to such a file?
import h5py

n = 10000000
f = h5py.File('foo.h5', 'w')
dset = f.create_dataset('int', (n,), 'i')

# this is terribly slow
for i in xrange(n):
    dset[i] = i

# instantaneous
dset[...] = 42
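For reference, here is a sketch of the middle ground the question asks about, writing the dataset one large slab at a time rather than element by element; the 10,000-row block size is the figure from the question, not a recommendation:
import numpy as np
import h5py

n = 10000000
block = 10000

with h5py.File('foo.h5', 'w') as f:
    dset = f.create_dataset('int', (n,), dtype='i')
    for start in range(0, n, block):
        stop = min(start + block, n)
        # one slice assignment per block instead of one assignment per element
        dset[start:stop] = np.arange(start, stop, dtype='i')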
I would avoid chunking the data and would instead store the data as a series of single-array datasets (along the lines of what Benjamin is suggesting). I just finished loading the output of an enterprise app I've been working on into HDF5, and was able to pack about 4.5 billion compound datatypes as 450,000 datasets, each containing a 10,000-element array of data. Writes and reads now seem fairly instantaneous, but were painfully slow when I initially tried to chunk the data.
Just a thought!
Update:
These are a couple of snippets lifted from my actual code (I'm coding in C vs. Python, but you should get the idea of what I'm doing) and modified for clarity. I'm just writing long unsigned integers in arrays (10,000 values per array) and reading them back when I need an actual value.
This is my typical writer code. In this case, I'm simply writing a long unsigned integer sequence into a sequence of arrays and loading each array into hdf5 as it is created.
//Our dummy data: a rolling count of long unsigned integers
long unsigned int k = 0UL;
//We'll use this to store our dummy data, 10,000 at a time
long unsigned int kValues[NUMPERDATASET];
//Create the SS data files.
hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
//NUMPERDATASET = 10,000, so we get a 1 x 10,000 array
hsize_t dsDim[1] = {NUMPERDATASET};
//Create the data space.
hid_t dSpace = H5Screate_simple(1, dsDim, NULL);
//NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
for (unsigned long int i = 0UL; i < NUMDATASETS; i++){
    for (unsigned long int j = 0UL; j < NUMPERDATASET; j++){
        kValues[j] = k;
        k += 1UL;
    }
    //Create the data set.
    dssSet = H5Dcreate2(ssdb, g_strdup_printf("%lu", i), H5T_NATIVE_ULONG, dSpace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    //Write data to the data set.
    H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);
    //Close the data set.
    H5Dclose(dssSet);
}
//Release the data space.
H5Sclose(dSpace);
//Close the data files.
H5Fclose(ssdb);
This is a slightly modified version of my reader code. There are more elegant ways of doing this (i.e., I could use hyperslabs to get the value), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.
unsigned long int getValueByIndex(unsigned long int nnValue){
    //NUMPERDATASET = 10,000
    unsigned long int ssValue[NUMPERDATASET];
    //MAXSSVALUE = 4,500,000,000; i takes the smaller value of MAXSSVALUE or nnValue
    //to avoid an index-out-of-range error
    unsigned long int i = MIN(MAXSSVALUE-1, nnValue);
    //Open the data file in read-only mode.
    hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);
    //Open the data set. In this case, each dataset consists of an array of 10,000
    //unsigned long int and is named according to the integer division value of i
    //divided by the number per data set.
    hid_t dSet = H5Dopen(db, g_strdup_printf("%lu", i / NUMPERDATASET), H5P_DEFAULT);
    //Read the data set array.
    H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);
    //Close the data set.
    H5Dclose(dSet);
    //Close the data file.
    H5Fclose(db);
    //Return the indexed value by using the modulus of i divided by the number per dataset.
    return ssValue[i % NUMPERDATASET];
}
The main take-away is the inner loop in the writing code and the integer division and mod operations to get the index of the dataset array and index of the desired value in that array. Let me know if this is clear enough so you can put together something similar or better in h5py. In C, this is dead simple and gives me significantly better read/write times vs. a chunked dataset solution. Plus since I can't use compression with compound datasets anyway, the apparent upside of chunking is a moot point, so all my compounds are stored the same way.
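A rough h5py translation of that scheme, just to make the integer-division/modulus indexing concrete; the file name and the (much smaller) number of datasets are illustrative, not from the answer above:
import numpy as np
import h5py

NUMPERDATASET = 10000
NUMDATASETS = 1000             # far fewer than the 450,000 above, for illustration

# writer: one fixed-size array dataset per block, named by its block index
with h5py.File('blocks.h5', 'w') as f:
    k = 0
    for i in range(NUMDATASETS):
        f.create_dataset(str(i), data=np.arange(k, k + NUMPERDATASET, dtype=np.uint64))
        k += NUMPERDATASET

# reader: integer division picks the dataset, modulus picks the element
def get_value_by_index(n):
    with h5py.File('blocks.h5', 'r') as f:
        return f[str(n // NUMPERDATASET)][n % NUMPERDATASET]

print(get_value_by_index(123456))   # 123456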
Using the flexibility of numpy.loadtxt will get the data from the file into a numpy array, which in turn is perfect for initializing the hdf5 dataset:
import h5py
import numpy as np
d = np.loadtxt('data.txt')
h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)
I'm not sure if this is the most efficient way (and I've never used it; I'm just pulling together some tools I've used independently), but you could read the csv file into a numpy recarray using the matplotlib helper methods for csv.
You can probably find a way to read the csv files in chunks as well to avoid loading the whole thing to disk. Then use the recarray (or slices therein) to write the whole (or large chunks of it) to the h5py dataset. I'm not exactly sure how h5py handles recarrays, but the documentation indicates that it should be ok.
Basically if possible, try to write big chunks of data at once instead of iterating over individual elements.
Another possibility for reading the csv file is just numpy.genfromtxt
You can grab the columns you want using the keyword usecols, and then only read in a specified set of lines by properly setting the skip_header and skip_footer keywords.
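A small sketch of that numpy.genfromtxt usage (the file name, delimiter, column indices, and line counts are made up for illustration):
import numpy as np

# read only columns 0 and 2, skipping a 1-line header and a 2-line footer
data = np.genfromtxt('data.csv', delimiter=',',
                     usecols=(0, 2), skip_header=1, skip_footer=2)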

Fast 'Record Update' To Binary Files?

I have 3000 binary files (each of size 40 MB) of known format (5,000,000 'records' of 'int32,float32' each). They were created using numpy's tofile() method.
A method that I use, WhichShouldBeUpdated(), determines which file (out of the 3000) should be updated, and also, which records in this file should be changed. The method's output is the following:
(1) path_to_file_name_to_update
(2) a numpy record array with N records (N is the number of records to update), in the following format: [(recordID1, newIntValue1, newFloatValue1), (recordID2, newIntValue2, newFloatValue2), .....]
As can be seen:
(1) the file to update is known only at running time
(2) the records to update are also only known at running time
What would be the most efficient approach to updating the file with the new values for these records?
Since the records are of fixed length you can just open the file and seek to the position of each record (its index times the record size). To encode the ints and floats as binary you can use struct.pack. Update: given that the files are originally generated by numpy, the fastest way may be numpy.memmap.
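A sketch of that memmap approach under the assumptions above; the record dtype matches the question's int32,float32 layout, and the function and field names are illustrative:
import numpy as np

# each record is an int32 followed by a float32, as described in the question
record_dtype = np.dtype([('i', np.int32), ('f', np.float32)])

def update_records(path_to_file, updates):
    # updates: iterable of (recordID, newIntValue, newFloatValue) tuples
    mm = np.memmap(path_to_file, dtype=record_dtype, mode='r+')
    for record_id, new_int, new_float in updates:
        mm[record_id] = (new_int, new_float)   # only the touched pages are written back
    mm.flush()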
You're probably not interested in data conversion, but I've had very good experiences with HDF5 and pytables for large binary files. HDF5 is designed for large scientific data sets, so it is quick and efficient.
