I want to reshape a list of images; train_data holds 639976 images.
This is how I am importing the images:
import os
import cv2
import numpy as np
from tqdm import tqdm

train_data = []
for img in tqdm(os.listdir('Images/train/images')):
    path = os.path.join('Images/train/images/', img)
    image = cv2.imread(path, cv2.IMREAD_COLOR)
    image = cv2.resize(image, (28, 28)).astype('float32') / 255
    train_data.append(image)

np.reshape(train_data, (-1, 28, 28, 3))
I am getting a memory error here:
np.reshape(train_data,(-1,28,28,3))
Error:
return array(a, dtype, copy=False, order=order)
MemoryError
Looks like train_data is a large list of small arrays. I'm not familiar with cv2, so I'm guessing that the
image=cv2.resize(image, (28,28)).astype('float32')/255
creates a (28,28) or (28,28,3) array of floats. By itself that's not very big, and apparently it works.
The error is in:
np.reshape(train_data,(-1,28,28,3))
Since train_data is a list, reshape first has to create an array, probably with np.array(train_data). If all the components are (28,28,3), that array will already have shape (n,28,28,3). But that's where the memory error occurs: apparently there are so many of these small(ish) arrays that there isn't enough memory to assemble them into one big array.
I'd experiment with a subset of the files.
In [1]: 639976*28*28*3
Out[1]: 1505223552 # floats
In [2]: _*4
Out[2]: 6020894208 # bytes as float32
That's a roughly 6 GB array as float32 (12 GB if anything upcasts it to float64). I'm not surprised you get a memory error. The list of small arrays already takes up at least that much space, though those pieces can be scattered in small blocks throughout memory and swap. Making one big array from the list roughly doubles the peak memory usage.
Just for fun, try to make a blank array of that size:
np.ones((639976,28,28,3), 'float32')
If that works, try to make two.
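A memory-friendlier way to build the big array in the first place (a sketch, assuming every file in the directory is a readable color image) is to preallocate the float32 array once and fill it in place, so you never hold both the Python list and the reshaped copy:

import os
import cv2
import numpy as np
from tqdm import tqdm

files = sorted(os.listdir('Images/train/images'))
# one allocation of the final (n, 28, 28, 3) float32 array
train_data = np.empty((len(files), 28, 28, 3), dtype='float32')
for i, img in enumerate(tqdm(files)):
    path = os.path.join('Images/train/images/', img)
    image = cv2.imread(path, cv2.IMREAD_COLOR)
    train_data[i] = cv2.resize(image, (28, 28)).astype('float32') / 255

This still needs roughly 6 GB for the final array, but it avoids keeping a second full-size copy around during the reshape.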
Related
I have created a .h5 file from a numpy array:
h5f = h5py.File('/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5', 'w')
h5f.create_dataset('JZ3WPpxpypz', data=all, compression="gzip")
HDF5 dataset "JZ3WPpxpypz": shape (19494500, 376), type "f8"
But I am getting a memory error while reading the .h5 file into a numpy array:
filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
h5 = h5py.File(filename,'r')
h5.keys()
[u'JZ3WPpxpypz']
data = h5['JZ3WPpxpypz']
If I try to view the array, it gives me a memory error:
data[:]
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-33-629f56f97409> in <module>()
----> 1 data[:]
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
/home/debo/env_autoencoder/local/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in __getitem__(self, args)
560 single_element = selection.mshape == ()
561 mshape = (1,) if single_element else selection.mshape
--> 562 arr = numpy.ndarray(mshape, new_dtype, order='C')
563
564 # HDF5 has a bug where if the memory shape has a different rank
MemoryError:
Is there any memory-efficient way to read a .h5 file into a numpy array?
Thanks,
Debo.
You don't need to call numpy.ndarray() to get an array.
Try this:
arr = h5['JZ3WPpxpypz'][()]
# or
arr = data[()]
Adding [()] returns the entire array (unlike your data variable, which simply references the HDF5 dataset). Either method should give you an array with the same dtype and shape as the original. You can also use NumPy slicing operations to read subsets of the array.
A clarification is in order: I overlooked that numpy.ndarray() is called internally as part of reading data[:].
Here are type checks to show the difference in the returns from the 2 calls:
# check type for each variable:
data = h5['JZ3WPpxpypz']
print (type(data))
# versus
arr = data[()]
print (type(arr))
Output will look like this:
<class 'h5py._hl.dataset.Dataset'>
<class 'numpy.ndarray'>
In general, h5py dataset behavior is similar to numpy arrays (by design). However, they are not the same. When you tried to view the dataset contents with data[:], h5py had to convert the dataset to a numpy array in the background, which is where numpy.ndarray() was called. It would have worked with a smaller dataset or more memory.
My takeaway: data = h5['JZ3WPpxpypz'] only creates a dataset object; the numpy array isn't allocated until you actually read from it with [()] or a slice, and that read needs enough free memory to hold whatever you ask for.
When you have very large datasets, you may run into situations where you can't create an array with arr= h5f['dataset'][()] because the dataset is too large to fit into memory as a numpy array. When this occurs, you can create the h5py dataset object, then access subsets of the data with slicing notation, like this trivial example:
data = h5['JZ3WPpxpypz']
arr1 = data[0:100000]
arr2 = data[100000:200000]
# etc
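If you never need the whole array in memory at once, you can also loop over the dataset in blocks and process each piece before moving on. A minimal sketch (the block size and the process() function are placeholders, not from the original post):

data = h5['JZ3WPpxpypz']
block = 100000
for start in range(0, data.shape[0], block):
    chunk = data[start:start + block]   # only this slice is read from disk
    process(chunk)                      # placeholder for whatever you do with it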
I have a huge list of numpy arrays, specifically 113287, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 Gigabytes.
As of now, I have serialized these arrays into a giant HDF5 file. The problem is that retrieving individual arrays from this hdf5 file takes an excruciatingly long time (north of 10 minutes) per access.
How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.
Here's how I index into hdf5 file:
In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')
In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'
In [6]: group_key = list(hf.keys())[0]
In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">
# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)
Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?
Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve in the same order as I put it when I created the hdf5 file)
According to Out[7], "img_feats" is a large 3d array. (113287, 36, 2048) shape.
Define ds as the dataset (doesn't load anything):
ds = hf[group_key]
x = ds[0] # should be a (36, 2048) array
arr = ds[:] # should load the whole dataset into memory.
arr = ds[:n] # load a subset, slice
According to h5py-reading-writing-data:
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.
I don't see any point in wrapping that in list(), that is, in splitting the 3d array into a list of 113287 2d arrays. There's a clean mapping between a 3d dataset in the HDF5 file and a numpy array.
h5py-fancy-indexing warns that fancy indexing of a dataset is slower, that is, loading, say, subarrays [1, 1000, 3000, 6000] of that large dataset.
You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
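Concretely, for the access in In [8], indexing the dataset directly avoids materializing all 113287 sub-arrays. A sketch of the idea:

ds = hf[group_key]   # dataset object; nothing is loaded yet
last = ds[-1]        # reads only the last (36, 2048) slab from the file
print(last.shape)    # (36, 2048)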
One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes so long because it tries to load the entire data set into a list (which it has to read from disk). Re-organizing the h5 file such that

group
    sample
        36 x 2048

may help with indexing speed.
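A rough sketch of what that re-organization could look like (the output file and group names here are illustrative, and the copy is done one slab at a time to keep memory use low):

import h5py

with h5py.File('train_ids.hdf5', 'r') as src, h5py.File('train_ids_grouped.hdf5', 'w') as dst:
    ds = src['img_feats']
    grp = dst.create_group('samples')
    for i in range(ds.shape[0]):
        grp.create_dataset(str(i), data=ds[i])   # one (36, 2048) slab at a time

# later, read a single sample directly by its key
with h5py.File('train_ids_grouped.hdf5', 'r') as hf:
    sample = hf['samples']['42'][()]             # shape (36, 2048)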
When trying to get code to work using different frameworks and sources, I've stumbled across this multiple times:
Python NumPy arrays A and B that are content-wise the same, but one has A.shape == [x, y] and the other B.shape == [x, y, 1]. From dealing with it several times, I know that I can solve the issue with squeeze:
A == numpy.squeeze(B)
But currently I have to redesign a lot of code that errors out due to "inconsistent" arrays in that regard (some images with len(img.shape) = 2, i.e. [1024, 1024], and some with len(img.shape) = 3, i.e. [1024, 1024, 1]).
Now I have to pick one and I'm leaning towards [1024, 1024, 1], but since this code should be memory-efficient I'm wondering:
Do arrays with single-dimensional entries consume more memory than squeezed arrays? Or is there any other reason why I should avoid single-dimensional entries?
Do arrays with single-dimensional entries consume more memory than squeezed arrays?
They take the same amount of memory.
NumPy arrays have a property called nbytes that represents the number of bytes used by the array itself. Using it, you can easily verify this:
>>> import numpy as np
>>> arr = np.ones((1024, 1024, 1))
>>> arr.nbytes
8388608
>>> arr.squeeze().nbytes
8388608
The reason they take the same amount of memory is actually simple: NumPy arrays aren't real multi-dimensional arrays. They are one-dimensional arrays that use strides to "emulate" multidimensionality. The strides give the number of bytes to step in memory along each dimension:
>>> arr.strides
(8192, 8, 8)
>>> arr.squeeze().strides
(8192, 8)
So by removing the length-one dimension you effectively removed an offset that was always zero bytes anyway, since the index along an axis of length one can only ever be 0.
Or is there any other reason why I should avoid single-dimensional entries?
It depends. In some cases you actually create these yourself to utilize broadcasting with NumPy arrays. However in some cases they are annoying.
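For instance, a length-one axis is exactly what lets a column broadcast against a row; a small illustration:

>>> col = np.arange(3).reshape(3, 1)   # shape (3, 1)
>>> row = np.arange(4)                 # shape (4,)
>>> (col * row).shape                  # broadcasts to a full 2D table
(3, 4)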
Note that there is in fact a small memory difference because NumPy has to store one stride and shape integer for each dimension:
>>> import sys
>>> sys.getsizeof(arr)
8388736
>>> sys.getsizeof(arr.squeeze().copy()) # remove one dimension
8388720
>>> sys.getsizeof(arr[:, None].copy()) # add one dimension
8388752
However, 16 bytes per dimension isn't much compared to the roughly 8 million bytes the array itself takes, or to the ~100 bytes a view uses (squeeze returns a view, which is why I had to copy it above).
I am trying to import a 1.25 GB dataset into python using dask.array
The file is a 1312x2500x196 array of uint16s. I need to convert this to a float32 array for later processing.
I have managed to stitch together this Dask array in uint16, however when I try to convert to float32 I get a memory error.
No matter what I do to the chunk size, I always get a memory error.
I create the array by concatenating it in pieces of 100 lines (breaking the 2500 dimension up into little pieces of 100 lines each). Since dask can't natively read .RAW imaging files, I have to use numpy.memmap() to read the file and then build the array.
Below I will supply a "as short as possible" code snippet:
I have tried two methods:
1) Create the full uint16 array and then try to convert to float32:
(note: the memmap is a 1312x100x196 array and lines ranges from 0 to 24)
for i in range(lines):
    NewArray = da.concatenate([OldArray, Memmap], axis=0)
    OldArray = NewArray
return NewArray
and then I use
Float32Array = FinalArray.map_blocks(lambda FinalArray: FinalArray * 1.,dtype=np.float32)
In method 2:
for i in range(lines):
    NewArray = da.concatenate([OldArray, np.float32(Memmap)], axis=0)
    OldArray = NewArray
return NewArray
Both methods result in a memory error.
Is there any reason for this?
I read that dask.array is capable of handling calculations on datasets of up to 100 GB.
I tried all chunk sizes (from as small as 10x10x10 to a single line).
You can create a dask.array from a numpy memmap array directly with the da.from_array function:
x = load_memmap_numpy_array_from_raw_file(filename)
d = da.from_array(x, chunks=...)
You can change the dtype with the astype method:
d = d.astype(np.float32)
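Putting it together, here is a sketch for this particular file; the shape, dtype, and 100-line chunking are taken from your description (filename is the path to your .RAW file, as above), and the final reduction is just an example of triggering the computation:

import numpy as np
import dask.array as da

# memory-map the raw file instead of reading it into RAM
x = np.memmap(filename, dtype=np.uint16, mode='r', shape=(1312, 2500, 196))

# wrap the memmap in a dask array, chunked in 100-line pieces along the 2500 axis
d = da.from_array(x, chunks=(1312, 100, 196))

# lazy dtype conversion; nothing is computed or loaded yet
d = d.astype(np.float32)

# computations then run chunk by chunk, for example:
result = d.mean().compute()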
I have a list of several hundred 10x10 arrays that I want to stack together into a single Nx10x10 array. At first I tried a simple
newarray = np.array(mylist)
But that returned with "ValueError: setting an array element with a sequence."
Then I found the online documentation for dstack(), which looked perfect: "...This is a simple way to stack 2D arrays (images) into a single 3D array for processing." Which is exactly what I'm trying to do. However,
newarray = np.dstack(mylist)
tells me "ValueError: array dimensions must agree except for d_0", which is odd because all my arrays are 10x10. I thought maybe the problem was that dstack() expects a tuple instead of a list, but
newarray = np.dstack(tuple(mylist))
produced the same result.
At this point I've spent about two hours searching here and elsewhere to find out what I'm doing wrong and/or how to go about this correctly. I've even tried converting my list of arrays into a list of lists of lists and then back into a 3D array, but that didn't work either (I ended up with lists of lists of arrays, followed by the "setting array element as sequence" error again).
Any help would be appreciated.
newarray = np.dstack(mylist)
should work. For example:
import numpy as np
# Here is a list of five 10x10 arrays:
x = [np.random.random((10,10)) for _ in range(5)]
y = np.dstack(x)
print(y.shape)
# (10, 10, 5)
# To get the shape to be Nx10x10, you could use rollaxis:
y = np.rollaxis(y,-1)
print(y.shape)
# (5, 10, 10)
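If you want the (N, 10, 10) shape directly, np.stack stacks along a new leading axis in one step (assuming all the list elements really have the same shape):

y = np.stack(x)
print(y.shape)
# (5, 10, 10)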
np.dstack returns a new array. Thus, using np.dstack requires as much additional memory as the input arrays. If you are tight on memory, an alternative to np.dstack which requires less memory is to
allocate space for the final array first, and then pour the input arrays into it one at a time.
For example, if you had 58 arrays of shape (159459, 2380), then you could use
y = np.empty((159459, 2380, 58))
for i in range(58):
    # instantiate the input arrays one at a time
    x = np.random.random((159459, 2380))
    # copy x into y
    y[..., i] = x