Copying values from a numpy array to balance the dataset

Copying values from a numpy array to balance the dataset - python

I have a dataset where one of the similar looking class is imbalanced. It is a number dataset where class labels go from 1 to 10.
Grouping by label (y) on the training set gives the following output:
(array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint8), array([13861, 10585, 8497, 7458, 6882, 5727, 5595, 5045, 4659,
4948]))
As could be seen 1 has 13861 data-points and 7 has only 5595 data-points.
To avoid the class imbalance between 1 and 7 I want to put some extra images for 7 class.
Here is train set:
from scipy.io import loadmat
train = loadmat('train.mat')
extra = loadmat('extra.mat')
Both train and extra are dictionaries with 2 keys X and y each.
Here is the shape of train and extra:
train['X'] --> (32, 32, 3, 73257)
# 73257 images of 32x32x3
train['y'] --> (73257,1)
# 73257 labels of corresponding images
extra['X'] --> (32, 32, 3, 531131)
# 531131 images of 32x32x3
extra['y'] --> (531131, 1)
# 531131 labels of corresponding images
Now, I want to update train dataset with labels from extra, primarily taking x% of data with label 7 in extra into train. How could I do this?
I tried the following:
arr, _ = np.where(extra['y'] == 7)
c = np.concatenate(X_train, extra['X'][arr])
But I get an error saying IndexError: index 32 is out of bounds for axis 0 with size 32

Here is a working example on just numpy arrays that easily translates to your case. As you have edited, use numpy.where to find the labels you want on extra['y'] and keep these indices. These are then used together with numpy.append to concatenate (last axis for X and first axis for y) your original dataset with the extra one.
import numpy as np
np.random.seed(100)
# First find the indices of your y_extra with label 7
x_extra = np.random.rand(32, 32, 3, 10)
y_extra = np.random.randint(0, 9, size=(10,1))
indices = np.where(y_extra==7)[0] # indices [3,4] are 7 with seed=100
# Now use this indices to concatenate them in the original datase
np.random.seed(101)
x_original = np.random.rand(32, 32, 3, 10)
y_original = np.random.randint(1, 10, size=(10,1))
print(x_original.shape, x_extra[..., indices].shape) # (32, 32, 3, 10) (32, 32, 3, 2)
print(y_original.shape, y_extra[indices].shape) # (10, 1) (2, 1)
x_final = np.append(x_original, x_extra[..., indices], axis=-1)
y_final = np.append(y_original, y_extra[indices], axis=0)
print(x_final.shape, y_final.shape) # (32, 32, 3, 12) (12, 1)

Related

Iterating over a multi-dimentional array

I have array of shape (3,5,96,96), where channels= 3, number of frames = 5 and height and width = 96
I want to iterate over dimension 5 to get images with size (3,96,96). The code which I have tried is below.
b = frame.shape[1]
for i in range(b):
fr = frame[:,i,:,:]
But this is not working.

You could swap axis (using numpy.swapaxes(a, axis1, axis2) to get the second (frame) in first position
import numpy as np
m = np.zeros((3, 5, 96, 96))
n = np.swapaxes(m, 0, 1)
print(n.shape)
(5, 3, 96, 96)

You need to iterate over the first axis to achieve your desired result, this means you need to move the axis you want to iterate over to the first position. You can achieve this with np.moveaxis
m = np.zeros((3, 5, 96, 96))
np.moveaxis(m, 1, 0).shape
(5, 3, 96, 96)

Multidimensional Tensor slicing

First things first: I'm relatively new to TensorFlow.
I'm trying to implement a custom layer in tensorflow.keras and I'm having relatively hard time when I try to achieve the following:
I've got 3 Tensors (x,y,z) of shape (?,49,3,3,32) [where ? is the batch size]
On each Tensor I compute the sum over the 3rd and 4th axes [thus I end up with 3 Tensors of shape (?,49,32)]
By doing an argmax (A)on the above 3 Tensors (?,49,32) I get a single (?,49,32) Tensor
Now I want to use this tensor to select slices from the initial x,y,z Tensors in the following form:
Each element in the last dimension of A corresponds to the selected Tensor.
(aka: 0 = X, 1 = Y, 2 = Z)
The index of the last dimension of A corresponds to the slice that I would like to extract from the Tensor last dimension.
I've tried to achieve the above using tf.gather but I had no luck. Then I tried using a series of tf.map_fn, which is ugly and computationally costly.
To simplify the above:
let's say we've got an A array of shape (3,3,3,32). Then the numpy equivalent of what I try to achieve is this:
import numpy as np
x = np.random.rand(3,3,32)
y = np.random.rand(3,3,32)
z = np.random.rand(3,3,32)
x_sums = np.sum(np.sum(x,axis=0),0);
y_sums = np.sum(np.sum(y,axis=0),0);
z_sums = np.sum(np.sum(z,axis=0),0);
max_sums = np.argmax([x_sums,y_sums,z_sums],0)
A = np.array([x,y,z])
tmp = []
for i in range(0,len(max_sums)):
tmp.append(A[max_sums[i],:,:,i)
output = np.transpose(np.stack(tmp))
Any suggestions?
ps: I tried tf.gather_nd but I had no luck

This is how you can do something like that with tf.gather_nd:
import tensorflow as tf
# Make example data
tf.random.set_seed(0)
b = 10 # Batch size
x = tf.random.uniform((b, 49, 3, 3, 32))
y = tf.random.uniform((b, 49, 3, 3, 32))
z = tf.random.uniform((b, 49, 3, 3, 32))
# Stack tensors together
data = tf.stack([x, y, z], axis=2)
# Put reduction axes last
data_t = tf.transpose(data, (0, 1, 5, 2, 3, 4))
# Reduce
s = tf.reduce_sum(data_t, axis=(4, 5))
# Find largest sums
idx = tf.argmax(s, 3)
# Make gather indices
data_shape = tf.shape(data_t, idx.dtype)
bb, ii, jj = tf.meshgrid(*(tf.range(data_shape[i]) for i in range(3)), indexing='ij')
# Gather result
output_t = tf.gather_nd(data_t, tf.stack([bb, ii, jj, idx], axis=-1))
# Reorder axes
output = tf.transpose(output_t, (0, 1, 3, 4, 2))
print(output.shape)
# TensorShape([10, 49, 3, 3, 32])

tensorflow equivalent of torch.gather

I have a tensor of shape (16, 4096, 3). I have another tensor of indices of shape (16, 32768, 3). I am trying to collect the values along dim=1. This was initially done in pytorch using gather function as shown below-
# a.shape (16L, 4096L, 3L)
# idx.shape (16L, 32768L, 3L)
b = a.gather(1, idx)
# b.shape (16L, 32768L, 3L)
Please note that the size of output b is the same as that of idx. However, when I apply gather function of tensorflow, I get a completely different output. The output dimension was found mismatching as shown below-
b = tf.gather(a, idx, axis=1)
# b.shape (16, 16, 32768, 3, 3)
I also tried using tf.gather_nd but got in vain. See below-
b = tf.gather_nd(a, idx)
# b.shape (16, 32768)
Why am I getting different shapes of tensors? I want to get the tensor of the same shape as calculated by pytorch.
In other words, I want to know the tensorflow equivalent of torch.gather.

For 2D case,there is a method to do it:
# a.shape (16L, 10L)
# idx.shape (16L,1)
idx = tf.stack([tf.range(tf.shape(idx)[0]),idx[:,0]],axis=-1)
b = tf.gather_nd(a,idx)
However,For ND case,this method maybe very complex

This "should" be a general solution using tf.gather_nd (I've only tested for rank 2 and 3 tensors along the last axis):
def torch_gather(x, indices, gather_axis):
# if pytorch gather indices are
# [[[0, 10, 20], [0, 10, 20], [0, 10, 20]],
# [[0, 10, 20], [0, 10, 20], [0, 10, 20]]]
# tf nd_gather needs to be
# [[0,0,0], [0,0,10], [0,0,20], [0,1,0], [0,1,10], [0,1,20], [0,2,0], [0,2,10], [0,2,20],
# [1,0,0], [1,0,10], [1,0,20], [1,1,0], [1,1,10], [1,1,20], [1,2,0], [1,2,10], [1,2,20]]
# create a tensor containing indices of each element
all_indices = tf.where(tf.fill(indices.shape, True))
gather_locations = tf.reshape(indices, [indices.shape.num_elements()])
# splice in our pytorch style index at the correct axis
gather_indices = []
for axis in range(len(indices.shape)):
if axis == gather_axis:
gather_indices.append(gather_locations)
else:
gather_indices.append(all_indices[:, axis])
gather_indices = tf.stack(gather_indices, axis=-1)
gathered = tf.gather_nd(x, gather_indices)
reshaped = tf.reshape(gathered, indices.shape)
return reshaped

For the last-axis gathering, we can use the 2D-reshape trick for general ND cases, and then employ #LiShaoyuan 2D code above
# last-axis gathering only - use 2D-reshape-trick for Torch's style nD gathering
def torch_gather(param, id_tensor):
# 2d-gather torch equivalent from #LiShaoyuan above
def gather2d(target, id_tensor):
idx = tf.stack([tf.range(tf.shape(id_tensor)[0]),id_tensor[:,0]],axis=-1)
result = tf.gather_nd(target,idx)
return tf.expand_dims(result,axis=-1)
target = tf.reshape(param, (-1, param.shape[-1])) # reshape 2D
target_shape = id_tensor.shape
id_tensor = tf.reshape(id_tensor, (-1, 1)) # also 2D-index
result = gather2d(target, id_tensor)
return tf.reshape(result, target_shape)

Create a map of images in python numpy

I have a numpy array of size image_stack 64x28x28x3 which correspond to 64 images of size 28x28x3. What I want is to construct an image of size 224x224x3 which will contain all my images that are in the initial array. How can I do so in numpy? So far I have the code for stacking the images in the same line, however I want 8 lines of 8 columns instead. My code so far:
def tile_images(image_stack):
"""Given a stacked tensor of images, reshapes them into a horizontal tiling for display."""
assert len(image_stack.shape) == 4
image_list = [image_stack[i, :, :, :] for i in range(image_stack.shape[0])]
tiled_images = np.concatenate(image_list, axis=1)
return tiled_images

Does the following reshape, transpose, reshape trick work?
x.shape # (64, 28, 28, 3)
mosaic = x.reshape(8, 8, 28, 28, 3).transpose((0, 2, 1, 3, 4)).reshape(224, 224, 3)
The first reshape breaks your 64 into lines and columns. Transpose rearranges their order so that we can collapse them in a meaningful way.
Your function would then look like:
def tile_images(x):
dims = x.shape
assert len(dims) == 4
stack_dim = int(np.sqrt(dims[0]))
res = x.reshape(stack_dim, stack_dim, *dims[1:]).transpose((0, 2, 1, 3, 4))
tile_size = res.shape[0] * res.shape[1]
return res.reshape(tile_size, tile_size, -1)

Binning data based on each row in numpy array

a = train_images.reshape(60000, 28, 28)
print(np.shape(a))
a_view = a.reshape(60000, 7, 4, 7, 4)
print(np.shape(a_view))
binned_out = []
for x in range(len(a)) :
bin_v = a_view[x].mean(axis=3).mean(axis=1)
binned_out.append(bin_v)
In the above code, is there any other way to change the for loop to decrease the execution time and getting the binned data for all the elements in numpy array.
My train_images shape is (60000, 784)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Copying values from a numpy array to balance the dataset - python

Related

Iterating over a multi-dimentional array

Multidimensional Tensor slicing

tensorflow equivalent of torch.gather

Create a map of images in python numpy

Binning data based on each row in numpy array

Categories

Resources