I am getting TypeError: Object dtype dtype('O') has no native HDF5 equivalent.
The dtypes of mel_train, mfcc_train, and y_train are all float32, and the array shapes are: mfcc_train: (6398,); mel_train: (6398,); y_train: (6398, 16).
The error is raised by this block:
with h5py.File(train_file,'w') as f:
    f['mfcc_train'] = mfcc_train
    f['mel_train'] = mel_train
    f['y_train'] = y_train
Here is my full Python code:
import os
import pickle

import h5py
import numpy as np
import pandas as pd
from scipy.io import wavfile

from utils import change_label, make_dir
# pad_input, extract_mel, extract_mfcc and normalise_feature are helper
# functions defined elsewhere in the project
# open training, validation, and testing sets from csv files
train_csv = pd.read_csv("training.csv",index_col=0)
valid_csv = pd.read_csv("validation.csv",index_col=0)
test_csv = pd.read_csv("testing.csv",index_col=0)
feature_dir = 'features' # directory to store the extracted features
make_dir(feature_dir) # create directory if it does not exist
mel_train = []
mel_valid = []
mel_test = []
mfcc_train = []
mfcc_valid = []
mfcc_test = []
change = change_label() # to encode string label (class name) into binary matrix or vice versa
for i in range(train_csv.shape[0]):
    sr, audio = wavfile.read(train_csv.iloc[i,0])
    audio = pad_input(audio)
    mel = normalise_feature(extract_mel(audio))
    mfcc = normalise_feature(extract_mfcc(audio))
    mel_train.append(mel.T)
    mfcc_train.append(mfcc.T)
mel_train = np.asarray(mel_train)
print(mel_train.shape)
mfcc_train = np.asarray(mfcc_train)
print(mfcc_train.shape)
y = train_csv.iloc[:,1].to_list()
y_train = change.str2bin(y)
print(y_train.shape)
train_file = os.path.join(feature_dir,'mel_mfcc_train.h5')
print ("Storing extracted features and associated label
from training set into a file: "+train_file)
with h5py.File(train_file,'w') as f:
    f['mel_train'] = mel_train
    f['mfcc_train'] = mfcc_train
    f['y_train'] = y_train
OK, I think I know what's going on (educated guess). You extract the audio data to arrays mel and mfcc, then append them to lists mel_train and mfcc_train (looping over 6398 audio files). After you exit the loop, you convert the lists to arrays. If every mel and mfcc array had the same shape (say (m,n)), the new arrays would have shape (6398,m,n), where 6398 is len(mel_train). However, I suspect each mel and mfcc array has a different shape. As a result, when you convert the list of differently shaped arrays to a single array, you get an array of shape (6398,) with dtype=object (where the objects are float32 arrays).
To demonstrate the difference, I created 2 nearly identical examples:
Example 1 creates 5 arrays of identical 2d shape (10,2), adds them to a list, then converts the list to an array. Note how the final array has shape (5,10,2) and dtype float64. You can create a HDF5 dataset directly from this array.
Example 2 creates 5 arrays of variable 2d shape, adds them to a list, then converts the list to an array. Note how the final array has shape (5,) and dtype object. You cannot create a HDF5 dataset directly from this array. This is why you get TypeError: Object dtype dtype('O') has no native HDF5 equivalent.
Note: I added dtype=object to the np.asarray() function for the second method to avoid the VisibleDeprecationWarning.
Examples 2A and 2B show two methods to load the variable shaped data. They continue from Example 1 and load into the same HDF5 file. After you run them, you can compare dataset mel_train1, group mel_train2, and dataset mel_train3. Each has a "Note" attribute to describe the data.
Code below:
Example 1 - constant shape arrays:
import h5py
import numpy as np

train_file = 'mel_mfcc_train.h5'

## Example 1 - Create arrays of constant shape
a0, a1, n = 10, 2, 5
mel_train = []
for i in range(n):
    arr = np.random.random(a0*a1).reshape(a0,a1)
    mel_train.append(arr)
print('\nFor mel_train arrays of constant size:')
print(f'Size of mel_train list: {len(mel_train)}')
mel_train = np.asarray(mel_train)
print(f'For mel_train array: Dtype: {mel_train.dtype}; Shape: {mel_train.shape}')
with h5py.File(train_file,'w') as f:
    f['mel_train1'] = mel_train
    f['mel_train1'].attrs['Note'] = f'{n} Constant shaped arrays: {a0} x {a1}'
Example 2 - variable shape arrays:
## Example 2 - Create arrays of random shape
mel_train = []
for i in range(n):
    a0 = np.random.randint(6,10) # set a0 dimension to random length
    ##a1 = np.random.randint(3,6)
    arr = np.random.random(a0*a1).reshape(a0,a1)
    mel_train.append(arr)
print('\nFor mel_train arrays of random size:')
print(f'Size of mel_train list: {len(mel_train)}')
# mel_train = np.asarray(mel_train) # would raise the VisibleDeprecationWarning
mel_train = np.asarray(mel_train,dtype=object)
print(f'For mel_train array: Dtype: {mel_train.dtype}; Shape: {mel_train.shape}')
for i, arr in enumerate(mel_train):
    print(f'\tFor a0= {i}; shape: {arr.shape}')
Loading Example 2 data as-is will throw an exception
# Creating a dataset with arrays of different sizes will throw
# an exception (exception trapped and printed in code below)
try:
    with h5py.File(train_file,'a') as f:
        f['mel_train2'] = mel_train
except Exception as e:
    print(f'\nh5py Exception: {e}\n')
Recommended method to load Example 2 data
## Example 2A
# To avoid the exception, write each object/array to a separate dataset in 1 group
with h5py.File(train_file,'a') as f:
    grp = f.create_group('mel_train2')
    grp.attrs['Note'] = f'1 group and {n} datasets for variable shaped arrays'
    for i, arr in enumerate(mel_train):
        f[f'mel_train2/dataset_{i:04}'] = arr
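A minimal read-back sketch for this layout (assuming the file written above): because the dataset names are zero-padded, sorted() restores the original insertion order.
with h5py.File(train_file,'r') as f:
    grp = f['mel_train2']
    # dataset names are zero-padded, so sorted() preserves insertion order
    arrays = [grp[name][()] for name in sorted(grp)]
print(len(arrays), arrays[0].shape)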
Alternate method to load Example 2 data (not recommended)
## Example 2B - for completeness; NOT recommended
# Alternately, size a single dataset to hold the largest array.
# The dataset will have zeros where the smaller arrays are loaded.
ds_dtype = mel_train[0].dtype
ds_a0 = mel_train.shape[0]
ds_a1, ds_a2 = 0, 0
for arr in mel_train:
    ds_a1 = max(ds_a1, arr.shape[0])
    ds_a2 = max(ds_a2, arr.shape[1])
with h5py.File(train_file,'a') as f:
    # use a new name: 'mel_train2' already exists as a group (Example 2A)
    ds3 = f.create_dataset('mel_train3',dtype=ds_dtype,shape=(ds_a0,ds_a1,ds_a2))
    ds3.attrs['Note'] = f'1 zero-padded dataset for {n} variable shaped arrays'
    for i, arr in enumerate(mel_train):
        j,k = arr.shape[0], arr.shape[1]
        ds3[i,0:j,0:k] = arr
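If you do use method 2B, one way to make the zero padding recoverable is to also store the original shapes, for example in an attribute (a sketch of my own; the 'Orig_Shapes' name is not part of the example above):
# store the original shapes next to the padded data
with h5py.File(train_file,'a') as f:
    # 'Orig_Shapes' is a hypothetical attribute name
    f['mel_train3'].attrs['Orig_Shapes'] = np.array([arr.shape for arr in mel_train])

# read back and strip the padding
with h5py.File(train_file,'r') as f:
    shapes = f['mel_train3'].attrs['Orig_Shapes']
    arrays = [f['mel_train3'][i, :j, :k] for i, (j, k) in enumerate(shapes)]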
Typical output from running code above:
For mel_train arrays of constant size:
Size of mel_train list: 5
For mel_train array: Dtype: float64; Shape: (5, 10, 2)
For mel_train arrays of random size:
Size of mel_train list: 5
For mel_train array: Dtype: object; Shape: (5,)
For a0= 0; shape: (6, 2)
For a0= 1; shape: (7, 2)
For a0= 2; shape: (8, 2)
For a0= 3; shape: (6, 2)
For a0= 4; shape: (9, 2)
h5py Exception: Object dtype dtype('O') has no native HDF5 equivalent
Related
I have a list of differently shaped arrays that I wish to stack. Of course, np.stack doesn't work here because of the different shapes, so is there a way to handle this using np.stack on axis=1?
Is it possible to stack these arrays with different shapes along the second dimension so I would have a result array with shape [-, 2, 5]? I want the result to be 3d.
data = [np.random.randn(2, 5), np.random.randn(3, 5)]
stacked = np.stack(data, axis=1)
I tried another solution:
f, s = data[0], data[1]
stacked = np.concatenate((f[:, None, :], s[:, None, :]), axis=1)
where I insert a new axis, but I also get this error:
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 1 has size 3 and the array at index 0 has size 2
I also tried torch.cat, which didn't work either:
l = torch.cat((f[:, None, :], s[:, None, :]))
the expected output should have shape [-, 2, 5]
Stacking 2d arrays as in your example, to become 3d, would require you to impute some missing data. There is not enough info to create the 3d array if the dimensions of your input data don't match.
I see two options:
Option 1: concatenate along axis = 0 to get shape (5, 5):
a = data[0]
b = data[1]
combined = np.concatenate((a, b)) # shape (5, 5)
Option 2: add dummy rows to data[0] to be able to create a 3d result:
a = data[0]
b = data[1]
a = np.concatenate((a, np.zeros((b.shape[0] - a.shape[0], a.shape[1]))))
combined = np.stack((a, b)) # shape (2, 3, 5)
Another option could be to delete rows from data[1] to do something similar as option 2), but deleting data is in general not recommended.
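For more than two arrays, option 2 generalizes into a small helper that zero-pads everything to the longest row count before stacking (a sketch of my own, assuming all arrays share the same number of columns):
import numpy as np

def pad_stack(arrays):
    """Zero-pad 2d arrays to a common row count, then stack to 3d."""
    n_rows = max(a.shape[0] for a in arrays)
    padded = [np.concatenate((a, np.zeros((n_rows - a.shape[0], a.shape[1]))))
              for a in arrays]
    return np.stack(padded)

data = [np.random.randn(2, 5), np.random.randn(3, 5)]
print(pad_stack(data).shape)  # (2, 3, 5)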
I have an array of shape (40000, 240, 320).
It's an image array and I want to normalize each pixel value as follows:
X = X/255
When I try to run the above statement, it throws following error:
MemoryError: Unable to allocate array with shape (40000, 240, 320) and data type float64
How can I work with large numpy arrays in such cases?
You can use augmented assignment with division (/=), which will modify X in-place:
X /= 255
Your current code attempts to allocate a temporary object:
X = X/255
# Is actually executed like:
tmp = X / 255 # new object!
X = tmp
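One caveat worth adding (my own note, not part of the answer above): in-place true division only works if X is already a float array. If X holds uint8 pixels, a float32 copy still halves the memory compared to the float64 that X / 255 would allocate:
import numpy as np

# small stand-in for the (40000, 240, 320) uint8 image array
X = np.random.randint(0, 256, (1000, 240, 320), dtype=np.uint8)

X = X.astype(np.float32)  # one float32 allocation (half the size of float64)
X /= 255                  # now the division really is in-place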
I am trying to get a 2d array by randomly generating its rows and appending:
import numpy as np
my_nums = np.array([])
for i in range(100):
    x = np.random.rand(2, 1)
    my_nums = np.append(my_nums, np.array(x))
But I do not get what I want; instead I get a 1d array.
What is wrong?
Transposing x did not help either.
You could do this by using np.append(axis=0) or np.vstack. This however requires the rows appended to have the same length as the rows already in the array.
You cannot use the same code to append a row with two values to an empty array, and to append a row to an already existing 2D array: numpy will throw a
ValueError: all the input arrays must have same number of dimensions.
You could initialize my_nums to work around this:
my_nums = np.random.rand(1, 2)
for i in range(99):
    x = np.random.rand(1, 2)
    my_nums = np.append(my_nums, x, axis=0)
Note the decrease in the range by one due to the initialization row. Also note that I changed the dimensions to (1, 2) to get actual row vectors.
Much easier than appending row-wise will of course be to create the array in the wanted final shape:
my_nums = np.random.rand(100, 2)
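If the rows genuinely have to be generated one at a time, preallocating and filling by index is still much cheaper than repeated np.append (a small sketch along the same lines):
import numpy as np

my_nums = np.empty((100, 2))        # allocate the final shape once
for i in range(100):
    my_nums[i] = np.random.rand(2)  # write row i in place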
I have a numpy array batch of shape (32,5). Each element of the batch consists of a numpy array batch_elem = [s,_,_,_,_] where s = [img,val1,val2] is a numpy array holding 3 elements and _ are simply scalar values.
img is an image (numpy array) with dimensions (84,84,3)
I would like to create a numpy array with the shape (32,84,84,3). Basically I want to extract the image information within each batch and transform it into a 4-dimensional array.
I tried the following:
b = np.vstack(batch[:,0]) #this yields a b with shape (32,3), type: <class 'numpy.ndarray'>
Now I would like to access the images (first index in second dimension)
img_batch = b[:,0] # this returns an array of shape (32,), type: <class 'numpy.ndarray'>
How can I best access the image data and get a shape (32,84,84,3)?
Note:
s = b[0] #first s of the 32 in batch: shape (3,) , type: <class 'numpy.ndarray'>
Edit:
This should be a minimal example:
img = np.zeros([5,5,3])
s = np.array([img,1,1], dtype=object)             # ragged contents need object dtype
batch_elem = np.array([s,1,1,1,1], dtype=object)
batch = np.array([batch_elem for _ in range(32)])
Assuming I understand the problem correctly, you can stack twice on the last axis.
res = np.stack(np.stack(batch[:,0])[...,0])
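Breaking the double stack into steps, using the minimal example above (shapes shown assume its 5x5x3 img):
inner = np.stack(batch[:, 0])  # (32, 3) object array: [img, val1, val2] per row
imgs = inner[..., 0]           # (32,) object array of images
res = np.stack(imgs)           # (32, 5, 5, 3)
print(res.shape)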
import numpy as np

# fabricate some data (np.empty with dtype=object, so each slot can hold a list)
batch = np.empty(32, dtype=object)
for i in range(len(batch)):
    batch[i] = [np.random.rand(84, 84, 3), None, None]

# select images
result = np.array([img for img, _, _ in batch])

# double check!
for i in range(len(batch)):
    assert np.all(result[i, :, :, :] == batch[i][0])
I have the following code that works as expected, but I'm curious if the loop can be replaced by a native numpy function/method for better performance. What I have is one array holding RGB values that I use as a lookup table and two 2d arrays holding greyscale values (0-255). Each value of these two arrays corresponds to an index along one axis of the lookup table.
As mentioned, what would be really nice is getting rid of the (slow) loop in python and using a faster numpy method.
#!/usr/bin/env python3
from PIL import Image
import numpy as np
dim = (2000, 2000)
rows, cols = dim
# holding a 256x256 RGB color lookup table
color_map = np.random.randint(0, 256, (256,256,3)) # random_integers was removed from modern NumPy
# image 1 greyscale values
color_map_idx_row = np.random.randint(0, 255, dim)
# image 2 greyscale values
color_map_idx_col = np.random.randint(0, 255, dim)
# output image data
result_data = np.zeros((rows, cols, 3), dtype=np.uint8)
# is there any built in function in numpy that could
# replace this loop?
# -------------------------------------------------------
for i in range(rows):
    for j in range(cols):
        row_idx = color_map_idx_row.item(i, j)
        col_idx = color_map_idx_col.item(i, j)
        rgb_color = color_map[row_idx,col_idx]
        result_data[i,j] = rgb_color
img = Image.fromarray(result_data, 'RGB')
img.save('result.png')
You can replace the double-for loop with fancy-indexing:
In [33]: result_alt = color_map[color_map_idx_row, color_map_idx_col]
This confirms the result is the same:
In [36]: np.allclose(result_data, result_alt)
Out[36]: True
You can reshape the 3D array into a 2D array, with axis=1 holding the three channels. Then use row-indexing, with the row indices calculated as linear indices from the row and column index arrays. Note that the reshaped array is only a view, so it won't burden the workspace memory. Thus, we would have -
m = color_map.shape[0]
out = color_map.reshape(-1,3)[color_map_idx_row*m + color_map_idx_col]
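As a quick sanity check (my own sketch, with small random inputs), the linear-index version matches plain fancy indexing:
import numpy as np

color_map = np.random.randint(0, 256, (256, 256, 3))
idx_row = np.random.randint(0, 256, (50, 50))
idx_col = np.random.randint(0, 256, (50, 50))

m = color_map.shape[0]
out = color_map.reshape(-1, 3)[idx_row * m + idx_col]
assert np.array_equal(out, color_map[idx_row, idx_col])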