I have an array of shape: (40000, 240, 320)
It's an image array, and I want to normalize each pixel value as follows:
X = X/255
When I try to run the above statement, it throws following error:
MemoryError: Unable to allocate array with shape (40000, 240, 320) and data type float64
How can I work with large numpy arrays in such cases?
You can use augmented assignment with division (/=), which will modify X in-place:
X /= 255
Your current code attempts to allocate a temporary object:
X = X/255
# is actually executed like:
tmp = X / 255  # new object!
X = tmp
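One caveat, assuming X currently holds integer pixel values (the question doesn't show its dtype): in-place true division on an integer array raises a TypeError, because the float result cannot be written back into integer storage. A minimal sketch of a workaround is to convert once to float32, which needs half the memory of a float64 temporary, and then divide in place:
import numpy as np

X = np.random.randint(0, 256, (40000, 240, 320), dtype=np.uint8)  # stand-in data (large; shrink for a quick test)
X = X.astype(np.float32)  # one allocation, half the size of a float64 temporary
X /= 255                  # in-place, no further temporaries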
I am getting TypeError: Object dtype dtype('O') has no native HDF5 equivalent.
Here is my Python code. The dtypes for mel_train, mfcc_train, and y_train are all float32.
Array shapes are: mfcc_train: (6398,); mel_train: (6398,); and y_train: (6398, 16).
The statements that fail are:
with h5py.File(train_file,'w') as f:
    f['mfcc_train'] = mfcc_train
    f['mel_train'] = mel_train
    f['y_train'] = y_train
The full code is:
import os
import h5py
import pickle
import numpy as np
import pandas as pd
from scipy.io import wavfile  # assumed source of wavfile
from utils import change_label, make_dir

# open training, validation, and testing sets from csv files
train_csv = pd.read_csv("training.csv", index_col=0)
valid_csv = pd.read_csv("validation.csv", index_col=0)
test_csv = pd.read_csv("testing.csv", index_col=0)

feature_dir = 'features'  # directory to store the extracted features
make_dir(feature_dir)     # create directory if it does not exist

mel_train = []
mel_valid = []
mel_test = []
mfcc_train = []
mfcc_valid = []
mfcc_test = []

change = change_label()  # to encode string label (class name) into binary matrix or vice versa

for i in range(train_csv.shape[0]):
    sr, audio = wavfile.read(train_csv.iloc[i,0])
    audio = pad_input(audio)
    mel = normalise_feature(extract_mel(audio))
    mfcc = normalise_feature(extract_mfcc(audio))
    mel_train.append(mel.T)
    mfcc_train.append(mfcc.T)

mel_train = np.asarray(mel_train)
print(mel_train.shape)
mfcc_train = np.asarray(mfcc_train)
print(mfcc_train.shape)

y = train_csv.iloc[:,1].to_list()
y_train = change.str2bin(y)
print(y_train.shape)

train_file = os.path.join(feature_dir, 'mel_mfcc_train.h5')
print("Storing extracted features and associated label "
      "from training set into a file: " + train_file)
with h5py.File(train_file, 'w') as f:
    f['mel_train'] = mel_train
    f['mfcc_train'] = mfcc_train
    f['y_train'] = y_train
OK, I think I know what's going on (educated guess). You extract the audio data to arrays mel and mfcc, then add to lists mel_train and mfcc_train (looping over 6398 audio files). After you exit the loop, you convert the lists to arrays. If every mel and mfcc array has the same shape (say (m,n)) the new arrays would be shape (6398,m,n), where 6398 is len(mel_train). However, I suspect each mel and mfcc array has a different shape. As a result, when you convert the list of differently shaped arrays to a single array, you will get an array shape of (6398,) with dtype=object (where the objects are float32 arrays).
To demonstrate the difference, I created 2 nearly identical examples:
1. Example 1 creates 5 arrays of identical 2D shape (10,2), adds them to a list, then converts the list to an array. Note how the final array has shape (5,10,2) and dtype float64. You can create an HDF5 dataset directly from this array.
2. Example 2 creates 5 arrays of variable 2D shape, adds them to a list, then converts the list to an array. Note how the final array has shape (5,) and dtype object. You cannot create an HDF5 dataset directly from this array, which is why you get TypeError: Object dtype dtype('O') has no native HDF5 equivalent.
Note: I added dtype=object to the np.asarray() function for the second method to avoid the VisibleDeprecationWarning.
Examples 2A and 2B below show 2 methods to load the variable-shape data. They continue from Example 1 and load into the same HDF5 file. After you run them, you can compare dataset mel_train1, group mel_train2, and dataset mel_train3. Each has a "Note" attribute describing the data.
Code below:
Example 1 - constant shape arrays:
import h5py
import numpy as np

train_file = 'mel_mfcc_train.h5'

## Example 1 - Create arrays of constant shape
a0, a1, n = 10, 2, 5
mel_train = []
for i in range(n):
    arr = np.random.random(a0*a1).reshape(a0,a1)
    mel_train.append(arr)

print('\nFor mel_train arrays of constant size:')
print(f'Size of mel_train list: {len(mel_train)}')
mel_train = np.asarray(mel_train)
print(f'For mel_train array: Dtype: {mel_train.dtype}; Shape: {mel_train.shape}')

with h5py.File(train_file,'w') as f:
    f['mel_train1'] = mel_train
    f['mel_train1'].attrs['Note'] = f'{n} Constant shaped arrays: {a0} x {a1}'
Example 2 - variable shape arrays:
## Example 2 - Create arrays of random shape
mel_train = []
for i in range(n):
    a0 = np.random.randint(6,10)  # set a0 dimension to random length
    ##a1 = np.random.randint(3,6)
    arr = np.random.random(a0*a1).reshape(a0,a1)
    mel_train.append(arr)

print('\nFor mel_train arrays of random size:')
print(f'Size of mel_train list: {len(mel_train)}')
# mel_train = np.asarray(mel_train)
mel_train = np.asarray(mel_train, dtype=object)
print(f'For mel_train array: Dtype: {mel_train.dtype}; Shape: {mel_train.shape}')
for i, arr in enumerate(mel_train):
    print(f'\tFor a0= {i}; shape: {arr.shape}')
Loading Example 2 data as-is will throw an exception
# Creating a dataset with arrays of different sizes will throw
# an exception (exception trapped and printed in code below)
try:
    with h5py.File(train_file,'a') as f:
        f['mel_train2'] = mel_train
except Exception as e:
    print(f'\nh5py Exception: {e}\n')
Recommended method to load Example 2 data
## Example 2A
# To avoid the exception, write each object/array to a separate dataset in 1 group
with h5py.File(train_file,'a') as f:
    grp = f.create_group('mel_train2')
    grp.attrs['Note'] = f'1 group and {n} datasets for variable shaped arrays'
    for i, arr in enumerate(mel_train):
        grp[f'dataset_{i:04}'] = arr
Alternate method to load Example 2 data (not recommended)
## Example 2B - for completeness; NOT recommended
# Alternately, size a single dataset to hold the largest array.
# The dataset will have zeros where the smaller arrays are loaded.
# Named 'mel_train3' because 'mel_train2' already exists as a group from Example 2A.
ds_dtype = mel_train[0].dtype
ds_a0 = mel_train.shape[0]
ds_a1, ds_a2 = 0, 0
for arr in mel_train:
    ds_a1 = max(ds_a1, arr.shape[0])
    ds_a2 = max(ds_a2, arr.shape[1])

with h5py.File(train_file,'a') as f:
    ds2 = f.create_dataset('mel_train3', dtype=ds_dtype, shape=(ds_a0,ds_a1,ds_a2))
    ds2.attrs['Note'] = f'1 dataset sized to the largest of {ds_a0} variable shaped arrays; smaller arrays zero-padded'
    for i, arr in enumerate(mel_train):
        j, k = arr.shape
        ds2[i,0:j,0:k] = arr
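If you use method 2B and later need the original (unpadded) arrays back, a small optional sketch (the companion dataset name mel_train3_shapes is my own invention, not part of the code above) stores each true shape alongside the padded data:
# optional: store each array's true shape so the zero padding
# can be stripped on read-back (dataset name is hypothetical)
with h5py.File(train_file,'a') as f:
    f['mel_train3_shapes'] = np.array([arr.shape for arr in mel_train])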
Typical output from running code above:
For mel_train arrays of constant size:
Size of mel_train list: 5
For mel_train array: Dtype: float64; Shape: (5, 10, 2)
For mel_train arrays of random size:
Size of mel_train list: 5
For mel_train array: Dtype: object; Shape: (5,)
For a0= 0; shape: (6, 2)
For a0= 1; shape: (7, 2)
For a0= 2; shape: (8, 2)
For a0= 3; shape: (6, 2)
For a0= 4; shape: (9, 2)
h5py Exception: Object dtype dtype('O') has no native HDF5 equivalent
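For completeness, here is a minimal read-back sketch (my addition, assuming the file created above): it loads the variable-shaped arrays from group mel_train2 back into a Python list:
import h5py

# read each per-array dataset in the group back into a Python list;
# sorted() works because the dataset names are zero-padded
with h5py.File(train_file, 'r') as f:
    grp = f['mel_train2']
    mel_restored = [grp[name][()] for name in sorted(grp.keys())]
print(len(mel_restored), mel_restored[0].shape)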
I am trying to calculate a mean value across a large numpy array. Originally, I tried:
data = (np.ones((10**6, 133)) for _ in range(100))
np.stack(data).mean(axis=0)
but I was getting
numpy.core._exceptions.MemoryError: Unable to allocate xxx GiB for an array with shape (100, 1000000, 133) and data type float32
In the original code data is a generator of more meaningful vectors.
I thought about using dask for such an operation, hoping it would split my data into disk-backed chunks.
import dask.array as da
import numpy as np
data = (np.ones((10**6, 133)) for _ in range(100))
x = da.stack(da.from_array(arr, chunks="auto") for arr in data)
x = da.mean(x, axis=0)
y = x.compute()
However, when I run it, the process terminates with "Killed".
How can I resolve this issue on a single machine?
You can try this approach:
agg_sum = np.zeros((10**6, 133))
total = 100
for dt in data:
    agg_sum += dt  # accumulate in place; the full (100, 10**6, 133) stack is never built
_mean = agg_sum / total
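If the total number of chunks is not known in advance (the question fixes it at 100, so this is just a hedged variant), you can count while accumulating:
import numpy as np

agg_sum = np.zeros((10**6, 133))
n = 0
for dt in data:  # `data` is the generator from the question
    agg_sum += dt
    n += 1
_mean = agg_sum / n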
An alternative solution I found is to store all the arrays in a disk-backed file, using numpy.memmap.
import numpy as np
total = 100
shape = (10 ** 6, 133)
c = np.memmap(
"total.array", dtype="float64", mode="w+", shape=(total, *shape), order="C"
)
for idx, arr in enumerate(data):
    c[idx,:,:] = arr[:]
    del arr
c.mean(axis=0)
The important thing here is the del arr, which avoids filling the whole memory before the garbage collector reclaims the unused arrays.
Note: this solution requires around 100 GB of disk space, while the solution of @MSS requires much less space by keeping only the running sum.
I am trying to get a 2D array by randomly generating its rows and appending:
import numpy as np
my_nums = np.array([])
for i in range(100):
    x = np.random.rand(2, 1)
    my_nums = np.append(my_nums, np.array(x))
But I do not get what I want; instead I get a 1D array.
What is wrong?
Transposing x did not help either.
You could do this by using np.append(axis=0) or np.vstack. This, however, requires the appended rows to have the same length as the rows already in the array.
You cannot use the same code to append a row with two values to an empty array, and to append a row to an already existing 2D array: numpy will throw a
ValueError: all the input arrays must have same number of dimensions.
You could initialize my_nums to work around this:
my_nums = np.random.rand(1, 2)
for i in range(99):
    x = np.random.rand(1, 2)
    my_nums = np.append(my_nums, x, axis=0)
Note the decrease in the range by one due to the initialization row. Also note that I changed the dimensions to (1, 2) to get actual row vectors.
Of course, much easier than appending row by row is to create the array in the desired final shape right away:
my_nums = np.random.rand(100, 2)
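As a side note, a common and usually faster idiom (a sketch, not part of the answer above) when the rows really do arrive one at a time is to collect them in a Python list and stack once at the end, since np.append copies the whole array on every call:
import numpy as np

rows = []
for _ in range(100):
    rows.append(np.random.rand(2))  # each row has shape (2,)
my_nums = np.vstack(rows)           # final shape: (100, 2)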
I have a numpy array batch of shape (32,5). Each element of the batch is itself a numpy array batch_elem = [s,_,_,_,_], where s = [img,val1,val2] is a numpy array with 3 elements and the _ are simply scalar values.
img is an image (numpy array) with dimensions (84,84,3)
I would like to create a numpy array with the shape (32,84,84,3). Basically I want to extract the image information within each batch and transform it into a 4-dimensional array.
I tried the following:
b = np.vstack(batch[:,0]) #this yields a b with shape (32,3), type: <class 'numpy.ndarray'>
Now I would like to access the images (first index in second dimension)
img_batch = b[:,0] # this returns an array of shape (32,), type: <class 'numpy.ndarray'>
How can I best access the image data and get a shape (32,84,84,3)?
Note:
s = b[0] #first s of the 32 in batch: shape (3,) , type: <class 'numpy.ndarray'>
Edit:
This should be a minimal example:
img = np.zeros([5,5,3])
s = np.array([img,1,1], dtype=object)             # dtype=object is needed for these ragged elements
batch_elem = np.array([s,1,1,1,1], dtype=object)
batch = np.array([batch_elem for _ in range(32)])
Assuming I understand the problem correctly, you can stack twice on the last axis.
res = np.stack(np.stack(batch[:,0])[...,0])
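A quick check of this one-liner against the minimal example from the question (the images there are 5x5x3 stand-ins):
import numpy as np

img = np.zeros([5, 5, 3])
s = np.array([img, 1, 1], dtype=object)
batch_elem = np.array([s, 1, 1, 1, 1], dtype=object)
batch = np.array([batch_elem for _ in range(32)])

res = np.stack(np.stack(batch[:, 0])[..., 0])
print(res.shape)  # (32, 5, 5, 3)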
import numpy as np

# fabricate some data: a 1D object array of 32 [image, _, _] lists
# (note: np.array((32, 1), dtype=object) would create a 2-element array, not 32 slots)
batch = np.empty(32, dtype=object)
for i in range(len(batch)):
    batch[i] = [np.random.rand(84, 84, 3), None, None]

# select images
result = np.array([img for img, _, _ in batch])

# double check!
for i in range(len(batch)):
    assert np.all(result[i, :, :, :] == batch[i][0])
I have the following code that works as expected, but I'm curious whether the loop can be replaced by a native numpy function/method for better performance. What I have is one array holding RGB values that I use as a lookup table, and two 2D arrays holding greyscale values (0-255). Each value of these two arrays corresponds to the value of one axis of the lookup table.
As mentioned, what would be really nice is getting rid of the (slow) loop in python and using a faster numpy method.
#!/usr/bin/env python3
from PIL import Image
import numpy as np
dim = (2000, 2000)
rows, cols = dim
# holding a 256x256 RGB color lookup table
color_map = np.random.randint(0, 256, (256,256,3))  # random_integers is deprecated; randint's upper bound is exclusive
# image 1 greyscale values
color_map_idx_row = np.random.randint(0, 255, dim)
# image 2 greyscale values
color_map_idx_col = np.random.randint(0, 255, dim)
# output image data
result_data = np.zeros((rows, cols, 3), dtype=np.uint8)
# is there any built in function in numpy that could
# replace this loop?
# -------------------------------------------------------
for i in range(rows):
    for j in range(cols):
        row_idx = color_map_idx_row.item(i, j)
        col_idx = color_map_idx_col.item(i, j)
        rgb_color = color_map[row_idx, col_idx]
        result_data[i, j] = rgb_color
img = Image.fromarray(result_data, 'RGB')
img.save('result.png')
You can replace the double-for loop with fancy-indexing:
In [33]: result_alt = color_map[color_map_idx_row, color_map_idx_col]
This confirms the result is the same:
In [36]: np.allclose(result_data, result_alt)
Out[36]: True
You can reshape the 3D array into a 2D array, with axis=1 holding the three channels. Then use row indexing, with the row indices computed as linear indices from the row-index and column-index arrays. Note that the reshaped array is only a view, so it won't burden the workspace memory. Thus, we would have -
m = color_map.shape[0]
out = color_map.reshape(-1,3)[color_map_idx_row*m + color_map_idx_col]
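A quick sanity check (assuming the arrays from the question are in scope) that the linear-index variant matches plain fancy indexing:
# verify the linear-index variant against plain fancy indexing
out_fancy = color_map[color_map_idx_row, color_map_idx_col]
print(np.array_equal(out, out_fancy))  # True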