Numpy savez / load thousands of arrays, but not in one step - python

I would like to store approximately 4000 numpy arrays (of 1.5 MB each) in a serialized uncompressed file (approx. 6 GB of data). Here is an example with 2 small arrays:
import numpy
d1 = { 'array1' : numpy.array([1,2,3,4]), 'array2': numpy.array([5,4,3,2]) }
numpy.savez('myarrays', **d1)
d2 = numpy.load('myarrays.npz')
for k in d2:
    print d2[k]
It works, but I would like to do the same thing, only not in a single step:
When saving, I would like to be able to save 10 arrays, then do some other task (that may take a few seconds), then write 100 other arrays, then do something else, then write another 50 arrays, etc.
When loading: likewise, I would like to be able to load some arrays, then do some other task, then continue loading.
How can I do this with numpy.savez / numpy.load?

I don't think you can do this with np.savez. This, however, is the perfect use-case for hdf5. See either:
http://www.h5py.org
or
http://www.pytables.org
As an example of how to do this in h5py:
import h5py
import numpy as np

h5f = h5py.File('test.h5', 'w')
h5f.create_dataset('array1', data=np.array([1,2,3,4]))
h5f.create_dataset('array2', data=np.array([5,4,3,2]))
h5f.close()
# Now open it back up and read data
h5f = h5py.File('test.h5', 'r')
a = h5f['array1'][:]
b = h5f['array2'][:]
h5f.close()
print a
print b
# [1 2 3 4]
# [5 4 3 2]
And of course there are more sophisticated ways of doing this, organizing arrays via groups, adding metadata, pre-allocating space in the hdf5 file and then filling it later, etc.
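For instance, a rough sketch of those extras might look like the following; the group, dataset and attribute names here are invented for illustration:

import h5py
import numpy as np

with h5py.File('test.h5', 'w') as h5f:
    grp = h5f.create_group('run1')                                    # organize arrays via groups
    grp.attrs['description'] = 'example metadata'                     # attach metadata as attributes
    dset = grp.create_dataset('big', shape=(4000, 1000), dtype='f8')  # pre-allocate space...
    dset[0, :] = np.arange(1000)                                      # ...and fill it later, piece by piece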

savez in current numpy just writes the arrays to temporary files with numpy.save and then adds them to a zip file (with or without compression).
If you're not using compression, you might as well skip step 2 and just save your arrays one by one, keeping all 4000 of them in a single folder.
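A minimal sketch of that one-file-per-array approach (the folder name and the batching are made up for illustration); since every array lives in its own .npy file, any subset can be saved or loaded at any time, which gives the incremental behaviour asked for above:

import os
import numpy as np

outdir = 'arrays'                      # hypothetical folder holding one .npy file per array
os.makedirs(outdir, exist_ok=True)

# save a first batch, go do some other task, then come back and save more
for name, arr in [('array1', np.array([1, 2, 3, 4])),
                  ('array2', np.array([5, 4, 3, 2]))]:
    np.save(os.path.join(outdir, name + '.npy'), arr)

# later, load only the arrays you need, a few at a time
loaded = {os.path.splitext(f)[0]: np.load(os.path.join(outdir, f))
          for f in sorted(os.listdir(outdir)) if f.endswith('.npy')}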

Related

Adding big matrices stored in HDF5 datasets

I have two HDF5 files with an identical structure, each storing a matrix of the same shape. I need to create a third HDF5 file with a matrix representing the element-wise sum of the two matrices mentioned above. Given that the matrices are extremely large (in the GB to TB range), what would be the best way to do it, preferably in parallel? I am using the h5py interface to the HDF5 library. Are there any libraries capable of doing this?
Yes, this is possible. The key is to access slices of the data from file 1 and file 2, do your element-wise sum, then write that slice of new data to file 3. You can do this with h5py or PyTables (aka tables). No other libraries are required. I only have passing knowledge of parallel computing. I know h5py supports an MPI interface through the mpi4py Python package. Details here: h5py docs: Parallel HDF5
Here is a simple example. It creates 2 files with a dataset of random floats, shape=(10,10,10). It then creates a new file with an empty dataset of the same shape. The loop reads a slice of data from file1 and file2, sums them, then writes to the same slice in file3. To test with large data, you can modify the shapes to match your file.
21-Jan-2021 Update:
I added code to get the dataset shapes from file1 and file2, and compare them (to be sure they are equal). If the shapes aren't equal, I exit. If they match, I create the new file, then create a dataset of matching shape. (If you really want to be robust, you could do the same with the dtype.) I also use the value of shape[2] as the slice iterator over the dataset.
import h5py
import numpy as np
import random
import sys

arr = np.random.random(10**3).reshape(10,10,10)
with h5py.File('file1.h5', 'w') as h5fw:
    h5fw.create_dataset('data_1', data=arr)

arr = np.random.random(10**3).reshape(10,10,10)
with h5py.File('file2.h5', 'w') as h5fw:
    h5fw.create_dataset('data_2', data=arr)

h5fr1 = h5py.File('file1.h5', 'r')
f1shape = h5fr1['data_1'].shape
h5fr2 = h5py.File('file2.h5', 'r')
f2shape = h5fr2['data_2'].shape

if f1shape != f2shape:
    print('Datasets shapes do not match')
    h5fr1.close()
    h5fr2.close()
    sys.exit('Exiting due to error.')
else:
    with h5py.File('file3.h5', 'w') as h5fw:
        ds3 = h5fw.create_dataset('data_3', shape=f1shape, dtype='f')
        for i in range(f1shape[2]):
            arr1_slice = h5fr1['data_1'][:,:,i]
            arr2_slice = h5fr2['data_2'][:,:,i]
            arr3_slice = arr1_slice + arr2_slice
            ds3[:,:,i] = arr3_slice
            # alternately, you can slice and sum in 1 line
            # ds3[:,:,i] = h5fr1['data_1'][:,:,i] + \
            #              h5fr2['data_2'][:,:,i]
    print('Done.')
    h5fr1.close()
    h5fr2.close()

Fastest way to read a binary file with a defined format?

I have large binary data files that have a predefined format, originally written by a Fortran program as little-endian. I would like to read these files in the fastest, most efficient manner, so using the array package seemed right up my alley, as suggested in Improve speed of reading and converting from binary file?.
The problem is the pre-defined format is non-homogeneous. It looks something like this:
['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']
with each integer i taking up 4 bytes, and each double d taking 8 bytes.
Is there a way I can still use the super efficient array package (or another suggestion) but with the right format?
Use struct. In particular, struct.unpack.
result = struct.unpack("<2i5d...", buffer)
Here buffer holds the given binary data.
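As a sketch, the formats listed in the question concatenate into a single format string, and struct.calcsize confirms the record size; 'input' is a placeholder filename:

import struct

record_fmt = "<2i5d2idi3d2i3didi3d"          # the question's layout, concatenated
record_size = struct.calcsize(record_fmt)    # 9 ints + 16 doubles = 164 bytes

with open('input', 'rb') as f:               # placeholder filename
    buffer = f.read(record_size)             # read exactly one record
result = struct.unpack(record_fmt, buffer)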
It's not clear from your question whether you're concerned about the actual file reading speed (and building data structure in memory), or about later data processing speed.
If you are reading only once, and doing heavy processing later, you can read the file record by record (if your binary data is a recordset of repeated records with identical format), parse it with struct.unpack and append it to a [double] array:
import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles

with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        values = struct.unpack("<2i5d...", record)
        data.extend(values)
This is under the assumption that you are allowed to cast all your ints to doubles and are willing to accept the increase in allocated memory size (a 22% increase for the record from the question: 25 values stored as 8-byte doubles take 200 bytes versus the original 164).
If you are reading the data from file many times, it could be worthwhile to convert everything to one large array of doubles (like above) and write it back to another file from which you can later read with array.fromfile():
import array
import os

data = array.array('d')
with open('preprocessed', 'rb') as fin:
    n = os.fstat(fin.fileno()).st_size // 8
    data.fromfile(fin, n)
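The write-back step mentioned above could be as simple as this sketch, reusing the data array built in the first listing ('preprocessed' is a placeholder filename):

with open('preprocessed', 'wb') as fout:
    data.tofile(fout)    # array.array.tofile dumps the raw doubles for a later array.fromfile()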
Update. Thanks to a nice benchmark by #martineau, now we know for a fact that preprocessing the data and turning it into a homogeneous array of doubles ensures that loading such data from file (with array.fromfile()) is ~20x to ~40x faster than reading it record by record, unpacking and appending to an array (as shown in the first code listing above).
A faster (and more standard) variation of record-by-record reading in #martineau's answer, which appends to a list and doesn't upcast to double, is only ~6x to ~10x slower than the array.fromfile() method and seems like a better reference benchmark.
Major Update: Modified to use proper code for reading in a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.
To determine what method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the longish side, so to avoid distraction, I'm only posting the code tested and related results. (If there's sufficient interest in the methodology, I'll post the whole script.)
Here are the snippets of code that were compared:
#TESTCASE('Read and constuct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures

#TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size
    return structures

#TESTCASE('Convert to array (#randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles (standard sizes)
    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)
    return data

#TESTCASE('Read array file (#randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data

def create_preprocessed_file():
    """ Save array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)
And here were the results of running them on my system:
Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.06430 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative 6.16x ( 516.36% slower)
Read and constuct piecemeal with struct: 0.43283 secs, relative 6.73x ( 573.09% slower)
Convert to array (#randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)
Interestingly, most of the snippets are actually faster in Python 2...
Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.03586 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative 7.77x ( 677.17% slower)
Read and constuct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
Convert to array (#randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
Take a look at the documentation for numpy's fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing
Simplest example:
import numpy as np
data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))
Read more about "Structured Arrays" in numpy and how to specify their data type(s) here: https://docs.scipy.org/doc/numpy/user/basics.rec.html#
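As a sketch for the record layout from this question, a matching structured dtype could look like the following; the field names are invented for illustration:

import numpy as np

# Shapes and types follow the question's format
# ['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d'].
record_dtype = np.dtype([
    ('ints1', '<i4', 2), ('dbls1', '<f8', 5),
    ('ints2', '<i4', 2), ('dbl2',  '<f8'),
    ('int3',  '<i4'),    ('dbls3', '<f8', 3),
    ('ints4', '<i4', 2), ('dbls4', '<f8', 3),
    ('int5',  '<i4'),    ('dbl5',  '<f8'),
    ('int6',  '<i4'),    ('dbls6', '<f8', 3),
])
assert record_dtype.itemsize == 164          # 9 ints + 16 doubles, no padding

data = np.fromfile('binary_file', dtype=record_dtype)   # one element per 164-byte record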
There are a lot of good and helpful answers here, but I think the best solution needs more explanation. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.
import numpy as np

line_cols = 20                  # For example
line_rows = 40000               # For example
data_fmt = 15*'f8,' + 5*'f4,'   # For example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 4*5         # For example
# filename is the path to the binary data file
with open(filename, 'rb') as f:
    data = np.ndarray(shape=(1, line_rows),
                      dtype=np.dtype(data_fmt),
                      buffer=f.read(line_rows*data_bsize)
                      )[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows, line_cols)[:, :-1]
Here, we open the file as a binary file using the 'rb' option in open. Then, we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray to a 1D array by taking its zeroth index, which is where all our data is hiding. Then, we reshape the array using the astype, view and reshape methods. This is because reshape doesn't like having data with mixed dtypes, and I'm okay with having my integers expressed as doubles.
This method is ~100x faster than looping line-for-line through the data, and could potentially be compressed down into a single line of code.
In the future, I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.

Efficiently store list of matrices

I have a large list of images stored as numpy matrices. The images have different sizes, e.g.
import numpy as np
from numpy.random import rand
data = [rand(100,200), rand(1024, 768)]
I am looking for a way to store this list of matrices such that it can be read fast (writing the data can be slow). I tried pickle/numpy.savez, but reading the data was slower than loading the raw images again.
I think hdf5 may be fast, however I cannot figure out how to store this list. Not mandatory, but useful, would be a data format that allows appending data, so that the whole list does not have to be in memory at once.
Edit:
Based on the answers so far, I tried to time some of the suggestions:
data = [rand(1024, 768) for i in np.arange(100)]

def timenp():
    np.savez("test.npz", *data)
    d = np.load('test.npz')
    loaded = [d[f] for f in d]

def timebinary():
    with file("tmp.bin", "wb") as f:
        np.save(f, len(data))
        for img in data:
            np.save(f, img)
    with file("tmp.bin", "rb") as f:
        n = np.load(f)
        loaded = []
        for i in np.arange(n):
            loaded.append(np.load(f))

import h5py
def timeh5py():
    with h5py.File('foo.hdf5', 'w') as f:
        dt = h5py.special_dtype(vlen=np.dtype('float32'))
        dset = f.create_dataset('data', (len(data),), dtype=dt)
        shapes = f.create_dataset('shapes', (len(data), 2), dtype='int32')
        dset[...] = [img.flatten() for img in data]
        shapes[...] = [img.shape for img in data]
    with h5py.File('foo.hdf5', 'r') as f:
        loaded = []
        for (img, shape) in zip(f['data'], f['shapes']):
            loaded.append(np.reshape(img, shape))
python -m cProfile timenp.py
452906 function calls (451141 primitive calls) in 9.256 seconds
python -m cProfile timebinary.py
73085 function calls (71340 primitive calls) in 4.945 seconds
python -m cProfile timeh5py.py
33151 function calls (32568 primitive calls) in 4.384 seconds
Try using the numpy savez function, which comes in both compressed and uncompressed versions.
In [276]: alist=[np.arange(10), np.arange(3), np.arange(100)]
If I save this as np.savez('test', alist), it saves the list as one object. If instead I expand the list with *, then it puts each list element in a separate file in the archive.
In [277]: np.savez('test',*alist)
In [278]: d=np.load('test.npz')
In [279]: list(d.keys())
Out[279]: ['arr_2', 'arr_1', 'arr_0']
In [280]: d['arr_0']
Out[280]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
With np.save (and by extension savez), arrays are stored in their own compact format, which consists of a header block with shape information and a data block that is essentially a byte copy of the array's data buffer. So an np.save of an array should be as efficient as any other method.
If you give np.save a non-array object it will use that object's pickle method. But note that the pickle method for an array is the save method I just described. So a pickle of an array should still be efficient.
Keep in mind that npz files are lazy-loaded.
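A quick sketch of what that lazy loading means in practice, continuing the session above: np.load only opens the archive, and each member's data is read and parsed when its key is first accessed.

In [281]: d['arr_1']   # this member is only read from the archive at this point
Out[281]: array([0, 1, 2])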
With h5py, arrays are saved to named datasets. In a sense it is like the savez above - the elements of the list have to have names, whether generated automatically or by your code.
I don't know how h5py speeds compare with save(z).
h5py can handle arrays that are ragged in one dimension. I've explored that in previous SO questions. Storing multidimensional variable length array with h5py
How to save list of numpy.arrays of different shape with h5py?

Comparing two methods for writing a numpy array to disk

I compare two simple methods for writing a numpy array into a raw binary file:
# method 1
import numpy
A = numpy.random.randint(1000, size=512*1024*1024) # 2 GB
with open('blah.bin', 'wb') as f:
    f.write(A)
and
# method 2
import numpy
A = numpy.random.randint(1000, size=512*1024*1024) # 2 GB
raw_input()
B = A.tostring() # check memory usage of the current process here : 4 GB are used !!
raw_input()
with open('blah.bin', 'wb') as f:
    f.write(B)
With the second method, the memory usage is doubled (4 GB here) !
Why is .tostring() often used for writing numpy arrays to file ?
(in http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tofile.html, it is explained that numpy.ndarray.tofile() may be equivalent to file.write(a.tostring()))
Is method 1 as correct as method 2 for writing such an array to disk ?
The documentation does not say that .tofile() is equivalent to file.write(a.tostring()); it only mentions the latter to explain how the argument sep will behave if its value is "".
In the second method you are creating a copy of the array A, storing it in B, and only after that do you write it to the file, while in the first method this intermediate copy is avoided.
You should also have a look at:
np.savetxt()
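If the goal is a raw binary dump without the intermediate copy that tostring() creates, numpy.ndarray.tofile() (mentioned in the question) is another option; a sketch with a smaller array for illustration:

import numpy
A = numpy.random.randint(1000, size=10**6)   # smaller than the question's array, for illustration
A.tofile('blah.bin')                         # writes A's raw bytes to disk without a tostring() copy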

Big Satellite Image Processing

I'm trying to run Mort Canty's (http://mcanty.homepage.t-online.de/) Python iMAD implementation on bitemporal RapidEye multispectral images, which basically calculates the canonical correlation for the two images and then subtracts them. The problem I'm having is that the images are 5000 x 5000 x 5 (bands) pixels. If I try to run this on the whole image I get a memory error.
Would the use of something like PyTables help me with this?
What Mort Canty's code tries to do is load the images using gdal and then store them in a 10 x 25,000,000 array.
# initial weights
wt = ones(cols*rows)
# data array (transposed so observations are columns)
dm = zeros((2*bands, cols*rows))
k = 0
for b in pos:
    band1 = inDataset1.GetRasterBand(b+1)
    band1 = band1.ReadAsArray(x0, y0, cols, rows).astype(float)
    dm[k, :] = ravel(band1)
    band2 = inDataset2.GetRasterBand(b+1)
    band2 = band2.ReadAsArray(x0, y0, cols, rows).astype(float)
    dm[bands+k, :] = ravel(band2)
    k += 1
Even just creating a 10 x 25,000,000 numpy array of floats throws a memory error. Anyone have a good idea of how to get around this? This is my first post ever so any advice on how to post would also be welcome.
Greetings
numpy uses float64 by default, so your dm array takes up 2 GB of memory (8*10*25,000,000); the other arrays are probably about 200 MB (~8*5000*5000) each.
astype(float) returns a new array, so you need memory for that as well - and it is probably not even needed, since the type is implicitly converted when copying the data into the result array.
When the memory used in the for-loop is freed depends on garbage collection, and this doesn't account for the memory overhead of GetRasterBand and ReadAsArray.
Are you sure your input data uses 64-bit floats? If it uses 32-bit floats, you could easily halve the memory usage by specifying dtype='f' on your arrays.
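As a sketch of how the loop from the question could be made lighter on memory, assuming 32-bit precision is acceptable; bands, cols, rows, pos, x0, y0, inDataset1 and inDataset2 are the variables from the question's snippet:

import numpy as np

# allocate dm as float32 instead of the float64 default: ~1 GB instead of ~2 GB
dm = np.zeros((2*bands, cols*rows), dtype='f')
k = 0
for b in pos:
    band1 = inDataset1.GetRasterBand(b+1)
    # assigning into the float32 array converts on the fly,
    # so the extra astype(float) copy is not needed
    dm[k, :] = band1.ReadAsArray(x0, y0, cols, rows).ravel()
    band2 = inDataset2.GetRasterBand(b+1)
    dm[bands+k, :] = band2.ReadAsArray(x0, y0, cols, rows).ravel()
    k += 1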
