Exploding tensor after using Dataset and .batch - python

I have a numpy array of shape (100,4,30). This represents 100 groups of 4 samples, where each sample is an encoding of length 30. The 4 samples in each group (row) are related.
I want to get a TensorFlow dataset, batched, where related samples are in the same batch.
Here is what I'm trying:
first, I use np.vsplit to get a list of length 100, where each element of the list holds the 4 related samples.
Now if I call tf.data.Dataset.from_tensor_slices(...).batch(1) on this list of lists, I get a batch that contains a tensor of shape (4,1,30).
I want this batch to contain 4 tensors of shape (1,30).
How can I achieve this?

I may have misunderstood you, but if you just leave out the "vsplit":
import numpy as np
import tensorflow as tf
data = np.zeros((100, 4, 30))
data_ds = tf.data.Dataset.from_tensor_slices(data).batch(1)
for element in data_ds.take(1):
    print(element.shape)
you will get:
(1, 4, 30)
(so one batch contains all 4 related encodings).
If you really want the dimensions inside a batch to be 4 times (1, 30) you can do:
data = np.expand_dims(data, axis=2)
before dataset creation.
EDIT:
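I think I just understood your question. You want every batch to have 4 elements, and those are the related encodings? You can achieve this by flattening the first two axes and batching by 4:
# reshape directly: rows 4*i .. 4*i+3 are the 4 related encodings of group i
# (swapping axes 0 and 1 first would put unrelated encodings next to each other)
data = np.reshape(data, (100*4, -1))
data_ds = tf.data.Dataset.from_tensor_slices(data).batch(4)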
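If you want to check which rows end up together, here is a minimal NumPy-only sketch (no TensorFlow needed). The stand-in data is hypothetical: every value in group i is set to i, so each row reveals its group. Slicing in chunks of 4 only mimics what .batch(4) does; it is not the Dataset API itself. It compares a direct reshape with a reshape done after swapping the first two axes.

```python
import numpy as np

# Hypothetical stand-in data: every value in group i equals i,
# so a row's value tells us which group it came from.
data = np.broadcast_to(np.arange(100)[:, None, None], (100, 4, 30)).astype(np.float64)

# Flatten the first two axes directly (row-major order keeps each group's rows adjacent).
flat = data.reshape(100 * 4, -1)
batches = [flat[i:i + 4] for i in range(0, len(flat), 4)]  # mimics .batch(4)
direct_ok = all(len(np.unique(b)) == 1 for b in batches)

# Swapping the first two axes before flattening interleaves the groups instead.
swapped = np.swapaxes(data, 0, 1).reshape(100 * 4, -1)
first_swapped_batch_groups = np.unique(swapped[:4]).tolist()

print(direct_ok)                   # True
print(first_swapped_batch_groups)  # [0.0, 1.0, 2.0, 3.0]
```

With the direct reshape every chunk of 4 rows comes from a single group, while the swapped version mixes groups 0 to 3 in the first chunk.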

Related

Tensorflow: Is there a way to create multiple gather() outputs and stack them parallely in a compute-and-memory-efficient manner?

I'm trying to essentially create a 3-D tensor from the indexed rows of a 2-D tensor. For example, assuming I have:
A = tensor(shape=[200, 256]) # 2-D Tensor.
Aidx = tensor(shape=[1000, 10]) # 2-D Tensor holding row indices of A for each of 1000 batches.
I wish to create:
B = tensor(shape=[1000, 10, 256]) # 3-D Tensor with each batch being of dims (10, 256) selected from A.
Right now, I'm doing this in a memory-inefficient manner by doing a tf.broadcast_to() and then using a tf.gather(). This is very fast, but also takes up a lot of RAM:
A = tf.broadcast_to(A, [1000, A.shape[0], A.shape[1]])
A = tf.gather(A, Aidx, axis=1, batch_dims=1)
Is there a more memory efficient way of doing the above operation? Naively, one can make use of a for loop, but that is very compute inefficient for my use case. Thanks in advance!
You have to extract 10,000 rows, correct? (10 rows, 1000 different times.)
Reshape the [1000, 10] index array into a 1-dimensional array of shape [10000].
See this answer:
How to fetch specific rows from a tensor in Tensorflow?
Gathering rows with the flattened indices will give you an output of shape [10000, 256].
Then reshape the output into your final form, [1000, 10, 256].
I haven't tried it.
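The flatten, gather, reshape idea can be sketched in NumPy, where plain integer indexing plays the role of the gather. The random data and the names flat_idx, rows and B_direct are made up for illustration. As far as I know, tf.gather also accepts multi-dimensional indices, so tf.gather(A, Aidx) on the original 2-D A may give the [1000, 10, 256] result directly without the broadcast, but treat that as an assumption to verify on your TensorFlow version.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 256))            # 2-D source, shapes as in the question
Aidx = rng.integers(0, 200, size=(1000, 10))   # row indices of A for each of 1000 batches

flat_idx = Aidx.reshape(-1)       # shape (10000,)
rows = A[flat_idx]                # gather 10,000 rows -> shape (10000, 256)
B = rows.reshape(1000, 10, 256)   # final form

# Indexing with the 2-D index array gives the same result in one step.
B_direct = A[Aidx]

print(B.shape)  # (1000, 10, 256)
```

Neither variant materialises the [1000, 200, 256] broadcast copy of A, which is where the extra memory went.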

How to stack matrices with different size

I have a list of matrices with size of (63,32,1,600,600). When I try to stack it with torch.stack(matrices).cpu().detach().numpy(), it raises this error:
"stack expects each tensor to be equal size, but got [32, 1, 600, 600] at entry 0 and [16, 1, 600, 600] at entry 62". I tried resizing, but it did not work. I appreciate any recommendations.
If I understand correctly what you're trying to do is stack the outputted mini-batches together into a single batch. My bet is that your last batch is partially filled (only has 16 elements instead of 32).
Instead of using torch.stack (which creates a new axis), I would simply concatenate with torch.cat on the batch axis (axis=0), assuming matrices is a list of torch.Tensors:
torch.cat(matrices).cpu().detach().numpy()
This works as written because torch.cat concatenates on axis=0 by default.
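The stack-versus-concatenate distinction can be reproduced without PyTorch. Below is a small NumPy analogue, where np.stack and np.concatenate play the roles of torch.stack and torch.cat. The shapes are shrunk from 600x600 to 6x6 purely to keep the example light, and the three-element batches list is a made-up stand-in for your 63 mini-batches.

```python
import numpy as np

# Two full mini-batches of 32 and a partially filled last one of 16.
batches = [np.zeros((32, 1, 6, 6)), np.zeros((32, 1, 6, 6)), np.zeros((16, 1, 6, 6))]

# Stacking needs identical shapes because it creates a new leading axis.
try:
    np.stack(batches)
    stack_failed = False
except ValueError:
    stack_failed = True

# Concatenating along the existing batch axis only needs the other axes to match.
merged = np.concatenate(batches, axis=0)

print(stack_failed, merged.shape)  # True (80, 1, 6, 6)
```

The merged batch axis is simply the sum of the individual batch sizes (32 + 32 + 16).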
When we have tensors that differ in size only on the first dimension, as of PyTorch v1.7.0, we can use torch.vstack() to stack it along axis 0. Using torch.stack() fails here because it expects all the tensors to be of same shape.
Here is a reproducible illustration matching your problem description:
# sample tensors (as per your size)
In [65]: t1 = torch.randn([32, 1, 600, 600])
In [66]: t2 = torch.randn([16, 1, 600, 600])
# vertical stacking (i.e., stacking along axis 0)
In [67]: stacked = torch.vstack([t1, t2])
# check shape of output
In [68]: stacked.shape
Out[68]: torch.Size([48, 1, 600, 600])
We get 48 (32 + 16) as the size of the first dimension in the result because we're stacking the tensors along that dimension.
Note:
You can also initialize the result tensor, say stacked, by explicitly calculating the shape and pass this tensor as a parameter to out= kwarg of torch.vstack() if you want to write the result to a specific tensor, for instance updating the values of existing tensor (of same shape). However, this is optional.
# calculate new shape of stacking
In [80]: newshape = (t1.shape[0] + t2.shape[0], *t1.shape[1:])
# allocate an empty tensor, filled with garbage values
In [81]: stacked = torch.empty(newshape)
# stack it along axis 0 and write the result to `stacked`
In [83]: torch.vstack([t1, t2], out=stacked)
# check shape/size
In [84]: stacked.shape
Out[84]: torch.Size([48, 1, 600, 600])

rolling statistics in numpy or pytorch

I have tensors of sensor data; each tensor is of shape (4,1500).
This is 1500 timepoints and for each time point I have 4 features.
I want to "smooth" the sequences with rolling average or other rolling statistics. The end goal is to try to improve an lstm autoencoder with rolling statistics instead of the long raw sequence.
I am familiar with rolling windows of pandas and currently I am doing this:
#tensor shape:
data.shape
(4,1500)
#convert data to numpy array and then to dataframe and perform rolling mean
rolled_data=pd.DataFrame(data.numpy().swapaxes(1,0)).rolling(10).mean()[::10]
rolled_data.shape
(150, 4)
# convert back the dataframe to tensor
tensor_rolled_data=torch.Tensor(rolled_data.to_numpy().swapaxes(1,0))
tensor_rolled_data.shape
torch.Size([4, 150])
My question is: is there a better way to do it? Is there a function in numpy/torch that can do rolling statistics in a cleaner or more efficient way?
Since you're striding the output by the size of the window, this is actually more akin to downsampling by averaging than to computing rolling statistics. We can take advantage of the fact that there are no overlaps by simply reshaping the initial tensor.
Using Tensor.reshape
Assuming the time dimension of your data tensor has a length divisible by 10, you can just reshape the tensor to shape (4, 150, 10) and compute the statistic along the last dimension. For example
win_size = 10
tensor_rolled_data = data.reshape(data.shape[0], -1, win_size).mean(dim=2)
This solution doesn't give exactly the same results as your tensor_rolled_data, since here the first entry contains the mean of the first 10 samples, the second entry contains the mean of the second 10 samples, and so on. The pandas solution is a "causal filter", so its first entry contains the mean of the 10 most recent samples up to and including sample 0, the second contains the mean of the 10 most recent samples up to and including sample 10, and so on. (Note that the first entry is nan in the pandas solution since fewer than 10 samples exist up to that point.)
If this difference is unacceptable you can recreate the pandas result by first padding with 9 nan values and clipping off the last 9 samples.
import torch.nn.functional as F
win_size = 10
# pad with `nan` to match behavior of pandas
data_padded = F.pad(data[None, :, :-(win_size - 1)], (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
# find mean of groups of N samples
tensor_rolled_data = data_padded.reshape(data.shape[0], -1, win_size).mean(dim=2)
Using Tensor.unfold
To address the comment about what to do when there are overlaps: if you're only interested in the mean statistic, then there are a number of ways to compute this (e.g. convolution, average pooling, tensor unfolding). That said, Tensor.unfold gives the most general solution since it can be used to compute any statistic over a window. For example
# same as first example above
win_size = 10
tensor_rolled_data = data.unfold(dimension=1, size=win_size, step=win_size).mean(dim=2)
or
# same as second example above
import torch.nn.functional as F
win_size = 10
data_padded = F.pad(data.unsqueeze(0), (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
tensor_rolled_data = data_padded.unfold(dimension=1, size=win_size, step=win_size).mean(dim=2)
In the above cases, unfolding produces the same result as reshape since size and step are equal. However, unlike reshape, unfolding also supports size != step.
win_size = 10
stride = 2
tensor_rolled_data = data.unfold(1, win_size, stride).mean(dim=2)
# produces shape [4, 746]
or you can pad the front of the features with win_size - 1 values to achieve the same result as pandas.
import torch.nn.functional as F
win_size = 10
stride = 2
data_padded = F.pad(data.unsqueeze(0), (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
tensor_rolled_data = data_padded.unfold(1, win_size, stride).mean(dim=2)
# produces shape [4, 750]
Note: In practice you probably don't want to pad with NaN, since this will probably become quite a headache. Instead you could use zero padding, 'replicate' padding, or 'reflect' (mirror) padding.
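Since the question also asks about numpy, here is a rough NumPy counterpart of the reshape and unfold approaches. It assumes NumPy 1.20 or newer for sliding_window_view, and the arange-based data is only a placeholder for the sensor tensor. The names block_mean, rolling_mean and rolling_max are made up for this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Placeholder data with the same shape as in the question: 4 features, 1500 time points.
data = np.arange(4 * 1500, dtype=np.float64).reshape(4, 1500)
win_size, stride = 10, 2

# Non-overlapping windows: reshape and reduce over the last axis.
block_mean = data.reshape(data.shape[0], -1, win_size).mean(axis=2)   # shape (4, 150)

# Overlapping windows: take a view of every window, then keep every stride-th one.
windows = sliding_window_view(data, win_size, axis=1)[:, ::stride]    # shape (4, 746, 10)
rolling_mean = windows.mean(axis=2)                                   # shape (4, 746)
rolling_max = windows.max(axis=2)                                     # any other statistic works too

print(block_mean.shape, rolling_mean.shape)  # (4, 150) (4, 746)
```

Like Tensor.unfold, the window view does not copy the data until a reduction is applied, so swapping mean for another statistic stays cheap.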

Get the value at a position from all layers in python

I have 3 numpy arrays of shape (224, 224, 20). I want to go through each of (224, 224) values in all 20 layers (dimensions) and compare them to get the highest among them. For 3 Dimensional, I am able to come up with this:
arr1 = np.array([[[1,2,3],[4,5,6]],[[10,11,12],[15,16,17]]])
for x in range(0,2):
    for y in range(0,2):
        print(arr1[:,x,y])
But, I somehow couldn't understand how to convert it for (224,224,20) shaped arrays.
I also need the index of the layer which contains the maximum value.
To get the max values along one dimension, you can use numpy.amax; check out:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.amax.html
You can do this with numpy.max instead of a for loop:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.max.html
np.max(arr1, axis=2)
To get the index, use numpy.argmax
https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
np.argmax(arr1, axis=2)
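For the (224, 224, 20) case, a minimal sketch could look like the following. It assumes the 20 layers sit on the last axis, and the random array is only a stand-in for your data. The take_along_axis line is just a sanity check that the indices really point at the maxima.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((224, 224, 20))   # stand-in data: 20 layers on the last axis

max_vals = arr.max(axis=2)         # highest value at each (x, y) position, shape (224, 224)
max_layer = arr.argmax(axis=2)     # index of the layer holding that value, shape (224, 224)

# Sanity check: picking the argmax layer at each position recovers the max values.
recovered = np.take_along_axis(arr, max_layer[..., None], axis=2)[..., 0]

print(max_vals.shape, max_layer.shape)  # (224, 224) (224, 224)
```

If your layers are on the first axis instead, as in the small arr1 loop from the question, use axis=0 in both calls.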

Python: How to restructure a vector into multiple 3 dimensional matrices

So I am a little new to using matrices in Python, and I am looking for the best way to perform the following operation.
Say I have a vector of an arbitrary length, like this:
data = np.array(range(255))
And I want to fit this data inside a matrix with a shape like so:
concept = np.zeros((3, 9, 6))
Now, obviously this will not fit, and results in an error:
ValueError: cannot reshape array of size 255 into shape (3,9,6)
What would be the best way to go about fitting as much of the data vector inside the first matrix with the shape (3, 9, 6) while making sure any "overflow" is stored in a second (or third, fourth, etc.) matrix?
Does this make sense?
Basically, I want to be able to take a vector of any size and produce an arbitrary amount of matrices that have the data shaped according to the 3, 9, 6 dimensions.
Thank you for your help.
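import numpy as np
def each_matrix(a, dims):
    size = dims.prod()
    # pad with size-1 zeros so the last, partially filled chunk is complete
    padded = np.concatenate([a, np.zeros(size - 1)])
    # integer division, since range() needs an int in Python 3
    for i in range(len(padded) // size):
        yield padded[i*size : (i+1)*size].reshape(dims)
for matrix in each_matrix(np.array(range(255)),
                          dims=np.array([3, 9, 6])):
    print(str(matrix) + '\n\n-------\n')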
This will fill the last matrix with zeros.
Here is a rough solution to your problem.
import numpy as np
def split_padded(a, n):
    # zeros needed to reach the next multiple of n (0 if len(a) is already a multiple)
    padding = -len(a) % n
    numOfsplit = (len(a) + padding) // n
    print(padding, numOfsplit)
    return np.split(np.concatenate((a, np.zeros(padding))), numOfsplit)
data = np.array(range(255))
splitnum = 3*9*6
splitdata = split_padded(data, splitnum)
for mat in splitdata:
    print(mat.reshape(3, 9, 6))
It is very rough and works only for 1-D input arrays.
First, it calculates the number of zeros needed for padding (padding), then the number of matrices we can get out of the input data (numOfsplit), and finally does the splitting in the last line of the function.
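Both answers boil down to "pad to a multiple of 3*9*6, then cut into blocks". A compact variant of the same idea, using np.pad plus a single reshape, might look like this. The helper name to_blocks is made up for illustration and, like the code above, it assumes 1-D input.

```python
import numpy as np

def to_blocks(vec, dims):
    """Zero-pad a 1-D vector and reshape it into as many `dims`-shaped blocks as needed."""
    size = int(np.prod(dims))
    pad = -len(vec) % size            # zeros needed to reach a multiple of `size`
    padded = np.pad(vec, (0, pad))    # default mode pads with zeros at the end
    return padded.reshape(-1, *dims)  # first axis indexes the blocks

blocks = to_blocks(np.arange(255), (3, 9, 6))
print(blocks.shape)  # (2, 3, 9, 6)
```

Here blocks[0] is the first (3, 9, 6) matrix and blocks[1] holds the overflow followed by zeros.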
