I am trying to import a 1.25 GB dataset into Python using dask.array.
The file is a 1312 x 2500 x 196 array of uint16 values. I need to convert this to a float32 array for later processing.
I have managed to stitch together the Dask array in uint16, but when I try to convert it to float32 I get a memory error.
It doesn't matter what I do to the chunk size; I always get a memory error.
I create the array by concatenating it in blocks of 100 lines (breaking the 2500 dimension up into little pieces of 100 lines each). Since dask can't natively read .RAW imaging files, I have to use numpy.memmap() to read the file and then build the array from the memmaps.
Below is an as-short-as-possible code snippet:
I have tried two methods:
1) Create the full uint16 array and then try to convert to float32:
(note: each memmap is a 1312x100x196 array and the loop index i ranges from 0 to 24)
for i in range(lines):
    NewArray = da.concatenate([OldArray, Memmap], axis=0)
    OldArray = NewArray
return NewArray
and then I use
Float32Array = FinalArray.map_blocks(lambda FinalArray: FinalArray * 1.,dtype=np.float32)
In method 2:
for i in range(lines):
    NewArray = da.concatenate([OldArray, np.float32(Memmap)], axis=0)
    OldArray = NewArray
return NewArray
Both methods result in a memory error.
Is there any reason for this?
I read that dask.array is capable of handling datasets of up to 100 GB.
I have tried every chunk size, from as small as 10x10x10 up to a single line.
You can create a dask.array from a numpy memmap array directly with the da.from_array function:
x = load_memmap_numpy_array_from_raw_file(filename)
d = da.from_array(x, chunks=...)
You can change the dtype with the astype method:
d = d.astype(np.float32)
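For completeness, here is a minimal end-to-end sketch of that approach, assuming the .RAW file is plain uint16 data laid out as (1312, 2500, 196); the filename and the chunking along the 2500-line axis are illustrative, not taken from the question:

import numpy as np
import dask.array as da

# memory-map the raw file instead of reading it into RAM (filename is hypothetical)
x = np.memmap('scan.raw', dtype=np.uint16, mode='r', shape=(1312, 2500, 196))

# wrap the memmap in a dask array, chunked along the 2500-line axis
d = da.from_array(x, chunks=(1312, 100, 196))

# the cast is lazy; each chunk is converted to float32 only when computed
d = d.astype(np.float32)

# reductions now stream through memory one chunk at a time
result = d.mean().compute()

This way the full float32 array never has to exist in memory at once; only the chunks involved in a given computation do.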
Related
I am trying to calculate a mean value across a large numpy array. Originally, I tried:
data = (np.ones((10**6, 133)) for _ in range(100))
np.stack(data).mean(axis=0)
but I was getting
numpy.core._exceptions.MemoryError: Unable to allocate xxx GiB for an array with shape (100, 1000000, 133) and data type float32
In the original code data is a generator of more meaningful vectors.
I thought about using dask for such an operation, hoping it will split my data into chunks backed by disk.
import dask.array as da
import numpy as np
data = (np.ones((10**6, 133)) for _ in range(100))
x = da.stack(da.from_array(arr, chunks="auto") for arr in data)
x = da.mean(x, axis=0)
y = x.compute()
However, when I run it, the process terminates with "Killed".
How can I resolve this issue on a single machine?
You can try a running-sum approach, which only ever keeps the accumulator and the current array in memory:
agg_sum = np.zeros((10**6, 133))
total = 100
for dt in data:
    agg_sum = agg_sum + dt
_mean = agg_sum / total
An alternative solution I found is to store all the arrays in a disk-backed file, using numpy.memmap.
import numpy as np

total = 100
shape = (10 ** 6, 133)
c = np.memmap(
    "total.array", dtype="float64", mode="w+", shape=(total, *shape), order="C"
)
for idx, arr in enumerate(data):
    c[idx, :, :] = arr[:]
    del arr
c.mean(axis=0)
The important thing here is to del arr, so memory is not exhausted before the garbage collector reclaims the already-copied arrays.
Note that this solution requires around 100 GB of disk space, while the solution by #MSS requires much less space by keeping only the running sum.
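If you still want the dask workflow from the question, one hedged option (a sketch, not a drop-in answer) is to point dask at the filled memmap, so the mean is computed chunk by chunk from disk; the chunk shape here is an arbitrary choice of roughly 250 MB per chunk:

import dask.array as da

d = da.from_array(c, chunks=(1, 250000, 133))  # wraps the memmap lazily
mean = d.mean(axis=0).compute()                # streams blocks from disk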
I have created a .h5 file from a numpy array:
import h5py

h5f = h5py.File('/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5', 'w')
h5f.create_dataset('JZ3WPpxpypz', data=all, compression="gzip")  # `all` holds the numpy array
HDF5 dataset "JZ3WPpxpypz": shape (19494500, 376), type "f8"
But I am getting a memory error while reading the .h5 file back into a numpy array.
filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
h5 = h5py.File(filename,'r')
h5.keys()
[u'JZ3WPpxpypz']
data = h5['JZ3WPpxpypz']
If I try to view the array, I get a memory error:
data[:]
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-33-629f56f97409> in <module>()
----> 1 data[:]
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
/home/debo/env_autoencoder/local/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in __getitem__(self, args)
560 single_element = selection.mshape == ()
561 mshape = (1,) if single_element else selection.mshape
--> 562 arr = numpy.ndarray(mshape, new_dtype, order='C')
563
564 # HDF5 has a bug where if the memory shape has a different rank
MemoryError:
Is there any memory-efficient way to read a .h5 file into a numpy array?
Thanks,
Debo.
You don't need to call numpy.ndarray() to get an array.
Try this:
arr = h5['JZ3WPpxpypz'][()]
# or
arr = data[()]
Adding [()] returns the entire array (unlike your data variable, which simply references the HDF5 dataset). Either method gives you an array of the same dtype and shape as the original. You can also use numpy slicing operations to get subsets of the array.
A clarification is in order. I had overlooked that numpy.ndarray() is called behind the scenes whenever the full dataset is read, whether with data[:] or data[()].
Here are type checks to show the difference in the returns from the 2 calls:
# check type for each variable:
data = h5['JZ3WPpxpypz']
print (type(data))
# versus
arr = data[()]
print (type(arr))
Output will look like this:
<class 'h5py._hl.dataset.Dataset'>
<class 'numpy.ndarray'>
In general, h5py dataset behavior is similar to numpy arrays (by design). However, they are not the same. When you tried to read the dataset contents with data[:], h5py converted the dataset to a numpy array in the background with numpy.ndarray(). It would have worked if you had a smaller dataset or sufficient memory.
My takeaway: arr = h5['JZ3WPpxpypz'][()] also materializes the full numpy array, so it hits the same limit when the data does not fit in memory.
When you have very large datasets, you may run into situations where you can't create an array with arr = h5f['dataset'][()] because the dataset is too large to fit into memory as a numpy array. When this occurs, keep the h5py dataset object and access subsets of the data with slicing notation, like this trivial example:
data = h5['JZ3WPpxpypz']
arr1 = data[0:100000]
arr2 = data[100000:200000]
# etc
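Building on that, here is a minimal hedged sketch of processing the whole dataset in slices, so that only one block is ever held in memory; the block size is an arbitrary choice:

import h5py

with h5py.File(filename, 'r') as h5:
    dset = h5['JZ3WPpxpypz']                    # (19494500, 376) float64, stays on disk
    block_size = 1000000
    for start in range(0, dset.shape[0], block_size):
        block = dset[start:start + block_size]  # roughly 3 GB per block at this size
        # ... process block here ...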
I have a huge list of numpy arrays, specifically 113287, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 Gigabytes.
As of now, I have serialized these arrays as one giant HDF5 file. The problem is that retrieving individual arrays from this hdf5 file takes an excruciatingly long time (north of 10 minutes) for each access.
How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.
Here's how I index into hdf5 file:
In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')
In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'
In [6]: group_key = list(hf.keys())[0]
In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">
# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)
Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?
Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve the arrays in the same order as I put them in when I created the hdf5 file).
According to Out[7], "img_feats" is a large 3d array. (113287, 36, 2048) shape.
Define ds as the dataset (doesn't load anything):
ds = hf[group_key]
x = ds[0] # should be a (36, 2048) array
arr = ds[:] # should load the whole dataset into memory.
arr = ds[:n] # load a subset, slice
According to the h5py documentation on reading and writing data:
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.
I don't see any point in wrapping that in list(); that is, in splitting the 3d array into a list of 113287 2d arrays. There's a clean mapping between 3d datasets in the HDF5 file and numpy arrays.
The h5py documentation on fancy indexing warns that fancy indexing of a dataset is slower, that is, loading, say, the [1, 1000, 3000, 6000] subarrays of that large dataset.
You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
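As a small hedged illustration of the difference, both reads below pull only the requested samples from disk; the index values are made up:

ds = hf[group_key]
one_sample = ds[1234]              # a single (36, 2048) array, fast
batch = ds[[1, 1000, 3000, 6000]]  # fancy indexing: works, but slower than a contiguous slice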
One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes so long because it tries to load the entire data set into a list (which it has to read from disk). Re-organizing the h5 file such that

group
    sample
        36 x 2048

may help in indexing speed.
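As a rough sketch of that reorganization (the output file name is made up, and the per-sample dataset naming is an assumption, not part of the original answer):

import h5py

with h5py.File('train_ids.hdf5', 'r') as src, \
     h5py.File('train_ids_per_sample.hdf5', 'w') as dst:
    grp = dst.create_group('img_feats')
    ds = src['img_feats']
    for i in range(ds.shape[0]):
        # one small dataset per sample; reading it back touches only (36, 2048) floats
        grp.create_dataset(str(i), data=ds[i])

# later, a single sample is read directly:
with h5py.File('train_ids_per_sample.hdf5', 'r') as hf2:
    sample = hf2['img_feats/123'][()]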
I want to reshape a list of images; train_data has 639976 entries.
This is how I am importing the images:
import os
import cv2
import numpy as np
from tqdm import tqdm

train_data = []
for img in tqdm(os.listdir('Images/train/images')):
    path = os.path.join('Images/train/images/', img)
    image = cv2.imread(path, cv2.IMREAD_COLOR)
    image = cv2.resize(image, (28, 28)).astype('float32') / 255
    train_data.append(image)
# return train_data  (in the original this loop lives inside a loader function)
I am getting a memory error on this line:
np.reshape(train_data, (-1, 28, 28, 3))
Error:
return array(a, dtype, copy=False, order=order)
MemoryError
It looks like train_data is a large list of small arrays. I'm not familiar with cv2, so I'm guessing that
image = cv2.resize(image, (28, 28)).astype('float32') / 255
creates a (28, 28) or (28, 28, 3) array of floats. By itself that's not very big, and apparently that part works.
The error is in:
np.reshape(train_data,(-1,28,28,3))
Since train_data is a list, reshape first has to create an array, probably with np.array(train_data). If all the components are (28, 28, 3), this array will already have shape (n, 28, 28, 3). But that's where the memory error occurs: there are so many of these small(ish) arrays that there isn't enough memory to assemble them into one big array.
I'd experiment with a subset of the files.
In [1]: 639976*28*28*3
Out[1]: 1505223552 # elements
In [2]: _*4
Out[2]: 6020894208 # bytes as float32
What's that, a 6 GB array? I'm not surprised you get a memory error. The list of arrays already takes up more than that space, but those small arrays can be scattered in blocks throughout memory and swap. Make one array from the list and you roughly double the memory usage.
Just for fun, try to make a blank array of that size:
np.ones((639976,28,28,3), 'float32')
If that works, try to make two.
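If a single array of that size does fit, a hedged way to avoid the doubling is to preallocate the big array and fill it while reading, so the intermediate Python list never exists (the file listing and shapes follow the question; sorting the listing is an assumption):

import os
import cv2
import numpy as np

files = sorted(os.listdir('Images/train/images'))
train_data = np.empty((len(files), 28, 28, 3), dtype='float32')  # one ~6 GB block
for i, img in enumerate(files):
    image = cv2.imread(os.path.join('Images/train/images/', img), cv2.IMREAD_COLOR)
    train_data[i] = cv2.resize(image, (28, 28)).astype('float32') / 255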
The code is too complicated to paste here, but I have a numpy array shaped (800, 800, 1300), i.e. 1300 matrices shaped (800, 800). This is about 5 GB.
I pass this array into a function, whereby the function
multiplies each "matrix" in the above array by a float in a (1300,) shaped array
sums the array into one "matrix", shaped (800, 800)
and takes the inverse of the matrix
This program runs at 20.2 GB RAM! Is that possible? I cannot see any memory leaks. I am simply taking numpy arrays, and passing them through a function. I then save the resulting arrays.
I'll try to post the code.
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.io
import os
data_file1 = "filename1.npy"
data_file2 = "filename2.npy"
data_file3 = "filename3.npy"
data1 = np.load(data_file1)
data2 = np.load(data_file2)
data3 = np.load(data_file3)
data_total = np.concatenate((data1, data2, data3)) # This array is shape (800,800,1300), around 6 GB.
array1 = np.arange(1300) + 1
vector = np.arange(800) + 1
def function_matrix(data_total, vector):
    Multi_matrix = array1[:, None, None] * data_total  # step 1: scales each (800, 800) matrix
    Sum_matrix = np.sum(Multi_matrix, axis=0)           # step 2: sum into one matrix
    mTCm = np.array([np.dot(vector.T, np.linalg.solve(Sum_matrix, vector))])
    return mTCm
draw_pointsA = np.asarray([[function_matrix(data_total[i], vector[j]) for i in np.arange(0,100)] for j in np.arange(0,100)])
filename = "save_datapoints.npy"
np.save(filename, draw_pointsA)
EDIT 2:
See below. It is actually 12 GB RAM, 20.1 GB virtual size of process.
This doesn't answer your question, but proposes a way to avoid the problem from the start.
Step 1 is sequential -- you only need 1 matrix loaded at a time.
Change your code to process each matrix independently.
By Step 2 your memory requirement is down to 800 * 800 * sizeof(datum), which is a few megabytes, and you can certainly afford to keep that in memory.
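A minimal sketch of that idea, assuming the data is stacked as (1300, 800, 800) so that each weight in array1 matches one matrix (the stacking axis is an assumption; the question stores it as (800, 800, 1300)):

import numpy as np

Sum_matrix = np.zeros((800, 800))
for weight, matrix in zip(array1, data_total):
    # only one scaled (800, 800) temporary exists at a time (about 5 MB)
    Sum_matrix += weight * matrix

The huge (1300, 800, 800) intermediate Multi_matrix never has to be materialized this way.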
It sounds like this could be a dtype issue, i.e. the values in the matrices were converted to a different type. Perhaps you stored the original matrices as int16 or float32 values, and after multiplying by a float they are stored as double-precision matrices (which need two to four times more memory).
You can use the dtype argument to set the value type for the matrix.
Other possible reasons could be that some additional matrices are created along the way. That's obviously impossible to tell unless you post the code.
A possible solution to your memory problem is to use HDF5 files and write the matrices to disk. Then you can load the matrices one at a time. This is easy with h5py, as the matrices can be compressed and/or sliced using numpy-style syntax.
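For illustration, a hedged sketch of that workflow with h5py; the file and dataset names are made up, and random data stands in for the real matrices:

import h5py
import numpy as np

# write the stack to disk one matrix at a time, chunked so single-matrix reads are cheap
with h5py.File('matrices.h5', 'w') as f:
    dset = f.create_dataset('stack', shape=(1300, 800, 800), dtype='float64',
                            chunks=(1, 800, 800), compression='gzip')
    for i in range(1300):
        dset[i] = np.random.rand(800, 800)  # stand-in for the real (800, 800) matrices

# later, read back a single matrix without loading the other 1299
with h5py.File('matrices.h5', 'r') as f:
    one = f['stack'][0]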