Memory error while reading a large .h5 file - python

I have created a .h5 file from a numpy array:
h5f = h5py.File('/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5', 'w')
h5f.create_dataset('JZ3WPpxpypz', data=all, compression="gzip")
HDF5 dataset "JZ3WPpxpypz": shape (19494500, 376), type "f8"
But I get a memory error while reading the .h5 file back into a numpy array:
filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
h5 = h5py.File(filename,'r')
h5.keys()
[u'JZ3WPpxpypz']
data = h5['JZ3WPpxpypz']
If I try to view the array, it gives me a memory error:
data[:]
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-33-629f56f97409> in <module>()
----> 1 data[:]
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
/home/debo/env_autoencoder/local/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in __getitem__(self, args)
560 single_element = selection.mshape == ()
561 mshape = (1,) if single_element else selection.mshape
--> 562 arr = numpy.ndarray(mshape, new_dtype, order='C')
563
564 # HDF5 has a bug where if the memory shape has a different rank
MemoryError:
Is there any memory-efficient way to read a .h5 file into a numpy array?
Thanks,
Debo.

You don't need to call numpy.ndarray() to get an array.
Try this:
arr = h5['JZ3WPpxpypz'][()]
# or
arr = data[()]
Adding [()] returns the entire array (different from your data variable, which simply references the HDF5 dataset). Either method should give you an array with the same dtype and shape as the original array. You can also use numpy slicing operations to get subsets of the array.
A clarification is in order. I initially overlooked that numpy.ndarray() is called behind the scenes whenever the dataset contents are read into memory, whether via data[:] or data[()].
Here are type checks to show the difference in the returns from the 2 calls:
# check type for each variable:
data = h5['JZ3WPpxpypz']
print (type(data))
# versus
arr = data[()]
print (type(arr))
Output will look like this:
<class 'h5py._hl.dataset.Dataset'>
<class 'numpy.ndarray'>
In general, h5py dataset behavior is similar to numpy arrays (by design). However, they are not the same. When you tried to print the dataset contents with data[:], h5py had to convert the dataset to a numpy array in the background, allocating it with numpy.ndarray(). It would have worked if you had a smaller dataset or sufficient memory.
My takeaway: arr = h5['JZ3WPpxpypz'][()] still builds the entire numpy array in memory; it does not avoid that allocation.
When you have very large datasets, you may run into situations where you can't create an array with arr = h5['dataset'][()] because the dataset is too large to fit into memory as a numpy array. When this occurs, you can create the h5py dataset object and then access subsets of the data with slicing notation, as in this trivial example:
data = h5['JZ3WPpxpypz']
arr1 = data[0:100000]
arr2 = data[100000:200000]
# etc
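A minimal sketch of a chunked read loop, so that only one slice is ever in memory (the chunk size of 100000 rows and the mean() are just illustrative; the path is the one from the question):
import h5py

filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
chunk_rows = 100000  # illustrative; pick what fits in memory

with h5py.File(filename, 'r') as h5:
    data = h5['JZ3WPpxpypz']                     # dataset handle, nothing loaded yet
    for start in range(0, data.shape[0], chunk_rows):
        chunk = data[start:start + chunk_rows]   # at most (100000, 376) float64 in memory
        print(start, chunk.mean())               # process / accumulate per chunk here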

Related

ValueError: could not broadcast input array from shape (13,41) into shape (13)

I am trying to use the tslearn library to analyze an audio numpy file. The file has 45K rows (45K audio samples) and 1 column, but each row holds a nested object of shape (N, 13). So the length of each sample differs while the number of features is the same (13). I want to stretch them all to 93 rows, so that printing the shape of any of them returns (93, 13).
data example:
first nested object in the dataset, shape (43,13)
second nested object in the dataset, shape (30,13)
I tried to use this tslearn library: https://tslearn.readthedocs.io/en/latest/gen_modules/preprocessing/tslearn.preprocessing.TimeSeriesResampler.html#tslearn.preprocessing.TimeSeriesResampler
but it only changes the columns instead of the rows: if I have an array that is (44, 13), it changes the shape to (44, 93) instead of (93, 13). So I tried to rotate each array by 90 degrees and redo the analysis, but since the dataset itself is only 1D with 45K nested objects, I had to make an empty list, use a for loop to take out each object, rotate it by 90 degrees, and put it back into the list. Then I tried to convert the list back to an array, since tslearn.preprocessing.TimeSeriesResampler only accepts arrays as parameters. However, it gives me 'ValueError: could not broadcast input array from shape (13,41) into shape (13)' while converting the list back to an array.
import numpy as np
spoken_train = np.load("spoken_train.npy", allow_pickle=True)
lis = []
for i in range(len(spoken_train)):
    lis.append(np.rot90(spoken_train[i]))
myarray = np.asarray(lis)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-65-440f2eba9eba> in <module>
2 for i in range(len(spoken_train)):
3 lis.append(np.rot90(spoken_train[i]))
----> 4 myarray = np.asarray(lis)
/anaconda3/lib/python3.7/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: could not broadcast input array from shape (13,41) into shape (13)
What should I do? If there is any easier way to rotate the nested array, please let me know as well. Thank you!
Does this fit the bill:
lis = []
for i in range(len(spoken_train)):
    item = spoken_train[i]
    lis.append(item + np.zeros((1, item.shape[-1])))
myarray = np.concatenate(lis)
The items in the loop must have the same number of columns though. According to your examples, all the arrays in spoken_train have a last dimension of 13.
lis = np.copy(spoken_train)  # an object array of the same length, so it can hold the rotated arrays
for i in range(len(spoken_train)):
    lis[i] = np.rot90(spoken_train[i])
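Separately, if the goal is really a (93, 13) shape per sample, here is a hedged sketch with the TimeSeriesResampler from the question: passing each (N, 13) sample as a single multivariate series of shape (1, N, 13) should make tslearn resample along the time axis instead of the feature axis (this assumes tslearn's (n_ts, sz, d) dataset convention).
import numpy as np
from tslearn.preprocessing import TimeSeriesResampler

spoken_train = np.load("spoken_train.npy", allow_pickle=True)
resampler = TimeSeriesResampler(sz=93)

resampled = []
for sample in spoken_train:                                        # each sample is (N, 13)
    stretched = resampler.fit_transform(sample[np.newaxis, :, :])  # (1, 93, 13)
    resampled.append(stretched[0])                                 # (93, 13)
myarray = np.stack(resampled)                                      # (45000, 93, 13)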

Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

I have a huge list of numpy arrays, specifically 113287, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 Gigabytes.
As of now, I have serialized these arrays into a giant HDF5 file. The problem is that retrieving individual arrays from this hdf5 file takes an excruciatingly long time (north of 10 minutes) per access.
How can I speed this up? This is very important for my implementation, since I have to index into this list several thousand times to feed a deep neural network.
Here's how I index into hdf5 file:
In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')
In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'
In [6]: group_key = list(hf.keys())[0]
In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">
# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)
Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?
Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve the arrays in the same order as I put them in when I created the hdf5 file).
According to Out[7], "img_feats" is a large 3d dataset with shape (113287, 36, 2048).
Define ds as the dataset (doesn't load anything):
ds = hf[group_key]
x = ds[0] # should be a (36, 2048) array
arr = ds[:] # should load the whole dataset into memory.
arr = ds[:n] # load a subset, slice
According to h5py-reading-writing-data :
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.
I don't see any point in wrapping that in list(); that is, in splitting the 3d array into a list of 113287 2d arrays. There's a clean mapping between 3d datasets in the HDF5 file and numpy arrays.
h5py-fancy-indexing warns that fancy indexing of a dataset is slower, i.e. loading, say, subarrays [1, 1000, 3000, 6000] of that large dataset.
You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
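As a hedged illustration of reading per batch without the list() wrapper (the file and key names are taken from the question; the batch indices are made up):
import h5py
import numpy as np

hf = h5py.File('train_ids.hdf5', 'r')
ds = hf['img_feats']                   # dataset handle; nothing read yet

sample = ds[42]                        # one (36, 2048) array, a single hyperslab read
batch = ds[1000:1032]                  # a contiguous (32, 36, 2048) batch

# fancy indexing also works, but h5py wants the indices sorted and unique,
# and it is slower than a contiguous slice
idx = np.sort(np.random.choice(ds.shape[0], size=32, replace=False))
random_batch = ds[idx]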
One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes so long because it tries to load the entire dataset into a list (which it has to read from disk). Re-organizing the h5 file so that each sample sits under its own path, e.g.
group
    sample
        36 x 2048 dataset
may help the indexing speed.
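A minimal sketch of that reorganization, copying the existing dataset sample by sample (the new file name is just an example):
import h5py

with h5py.File('train_ids.hdf5', 'r') as src, \
     h5py.File('train_ids_by_sample.hdf5', 'w') as dst:
    ds = src['img_feats']
    grp = dst.create_group('img_feats')
    for i in range(ds.shape[0]):
        grp.create_dataset(str(i), data=ds[i])   # one (36, 2048) dataset per sample

# later, a single sample is one small, direct read:
with h5py.File('train_ids_by_sample.hdf5', 'r') as hf:
    arr = hf['img_feats/1234'][()]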

Numpy Reshape Memory error

I want to reshape a list of images; train_data has 639976 entries.
This is how I am importing the images:
train_data = []
for img in tqdm(os.listdir('Images/train/images')):
    path = os.path.join('Images/train/images/', img)
    image = cv2.imread(path, cv2.IMREAD_COLOR)
    image = cv2.resize(image, (28, 28)).astype('float32') / 255
    train_data.append(image)
return train_data
np.reshape(train_data,(-1,28,28,3))
I am getting a memory error here:
np.reshape(train_data,(-1,28,28,3))
Error:
return array(a, dtype, copy=False, order=order)
MemoryError
Looks like train_data is a large list of small arrays. I'm not familiar with cv2, so I'm guessing that the
image=cv2.resize(image, (28,28)).astype('float32')/255
creates a (28,28) or (28,28,3) array of floats. By itself, not very big. Apparently that works.
The error is in:
np.reshape(train_data,(-1,28,28,3))
Since train_data is a list, reshape first has to create an array, probably with np.array(train_data). If all the components are (28,28,3), this array will already have shape (n,28,28,3). But that's where the memory error occurs. Apparently there are so many of these small(ish) arrays that it doesn't have the memory to assemble them into one big array.
I'd experiment with a subset of the files.
In [1]: 639976*28*28*3
Out[1]: 1505223552 # floats
In [2]: _*8
Out[2]: 12041788416 # bytes
What's that, a 12 GB array (about 6 GB if the float32 dtype is preserved instead of float64)? I'm not surprised you get a memory error. The list of arrays takes up more than that space, but it can be scattered in small blocks throughout memory and swap. Make an array from the list and you double the memory usage.
Just for fun, try to make a blank array of that size:
np.ones((639976,28,28,3), 'float32')
If that works, try to make two.
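If the list of images fits in memory but the extra copy made by np.array/np.reshape does not, one hedged workaround is to preallocate the final float32 array and fill it while reading, so the data is only held once (the directory path is the one from the question; error handling for unreadable files is omitted):
import os
import cv2
import numpy as np
from tqdm import tqdm

files = sorted(os.listdir('Images/train/images'))
train_data = np.empty((len(files), 28, 28, 3), dtype='float32')   # roughly 6 GB for 639976 images

for i, img in enumerate(tqdm(files)):
    path = os.path.join('Images/train/images/', img)
    image = cv2.imread(path, cv2.IMREAD_COLOR)
    train_data[i] = cv2.resize(image, (28, 28)).astype('float32') / 255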

How should I read a 9.4GB numpy array without a memory error

I have a precomputed numpy array that takes up just under 9.5 GB. I have saved it both as an npy file and, using h5py, as an hdf5 file. Although I can read this array in using either format when working interactively with an interpreter, when I read it in while actually running a module I get a MemoryError:
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 440, in __getitem__
arr = numpy.ndarray(mshape, new_dtype, order='C')
MemoryError
This happens whether I save/read an npy file or an hdf5 file.
I have tried using numpy.memmap so I could substitute disk memory for RAM, but do not seem to be able to read the array accurately:
>>> import numpy as np
>>> zz=np.load('VGG16_l19_val.npy')
>>> zz.dtype
dtype('float64')
>>> zz.shape
(50000, 25088)
# So, I've read in the array using np.load and know its dtype and shape
>>> from numpy import unravel_index
>>> unravel_index(zz.argmax(), zz.shape)
(41232, 8208)
>>> zz[41232,8208]
937.5606689453125
# I now know the max value of zz and where it occurs
>>> zz2=np.memmap('VGG16_l19_val.npy', mode = 'r', dtype=np.float64, shape= (50000,25088))
>>> zz2.dtype
dtype('float64')
>>> zz2.shape
(50000, 25088)
# I've read a memmap version of the array and have the correct dtype and shape, but ...
>>> zz2[41232,8208]
0.0
>>> zz2.max()
memmap(8.447400968892931e+252)
>>>
# It doesn't appear that zz2 == zz
What don't I understand about np.memmap? Can I use it to read in this numpy array?
If not, what should I do, other than break up the array and save it in several files?
Why can I read the array without a problem when I'm in the interpreter, or in pdb, but can't read it without a MemoryError when I read it within a module?
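One note on the memmap result above: np.memmap reads a file as raw bytes and knows nothing about the .npy format, so it maps the .npy header as data and everything lands at the wrong offset, which is why zz2 does not match zz. The usual way to memory-map an .npy file is to let np.load handle the header; a sketch:
import numpy as np

# np.load parses the .npy header and memory-maps the data section that follows it
zz2 = np.load('VGG16_l19_val.npy', mmap_mode='r')
print(zz2.dtype, zz2.shape)      # float64, (50000, 25088)
print(zz2[41232, 8208])          # should now match the in-memory value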

Memory Error when using float32 in dask array

I am trying to import a 1.25 GB dataset into Python using dask.array.
The file is a 1312 x 2500 x 196 array of uint16's. I need to convert this to a float32 array for later processing.
I have managed to stitch together this Dask array in uint16, however when I try to convert to float32 I get a memory error.
It doesn't matter what I do to the chunk size, I will always get a memory error.
I create the array by concatenating it in blocks of 100 lines (breaking the 2500 dimension up into little pieces of 100 lines). Since dask can't natively read .RAW imaging files, I have to use numpy.memmap() to read the file and then build the array.
Below I will supply an "as short as possible" code snippet:
I have tried two methods:
1) Create the full uint16 array and then try to convert to float32:
(note: the memmap is a 1312x100x196 array and lines ranges from 0 to 24)
for i in range(lines):
    NewArray = da.concatenate([OldArray, Memmap], axis=0)
    OldArray = NewArray
return NewArray
and then I use
Float32Array = FinalArray.map_blocks(lambda FinalArray: FinalArray * 1.,dtype=np.float32)
In method 2:
for i in range(lines):
    NewArray = da.concatenate([OldArray, np.float32(Memmap)], axis=0)
    OldArray = NewArray
return NewArray
Both methods result in a memory error.
Is there any reason for this?
I read that dask array is capable of doing up to 100 GB dataset calculations.
I tried all chunk sizes (from as small as 10x10x10 to a single line)
You can create a dask.array from a numpy memmap array directly with the da.from_array function:
x = load_memmap_numpy_array_from_raw_file(filename)
d = da.from_array(x, chunks=...)
You can change the dtype with the astype method:
d = d.astype(np.float32)
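A minimal end-to-end sketch under the shapes from the question (the raw file name, the chunking, and the final reduction are illustrative):
import numpy as np
import dask.array as da

# raw uint16 file with the shape given in the question
x = np.memmap('data.raw', dtype=np.uint16, mode='r', shape=(1312, 2500, 196))

d = da.from_array(x, chunks=(1312, 100, 196))   # chunk along the 2500 axis
d = d.astype(np.float32)                        # still lazy; no memory error here

# only materialize pieces that fit in memory
first_block_mean = d[:, :100, :].mean().compute()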
