Scale a 3d numpy array column wise along axis of first dimension

I have a 3d numpy array representing time series data, i.e. its shape is [number of samples, time steps, features].
I would like to scale each feature between -1 and 1. However, each feature should be scaled with respect to the maximum and minimum of all samples in the first dimension of my array. For example, my array is of shape:
multi_data.shape
(66, 5004, 2)
I tried the following:
data_min = multi_data.min(axis=1, keepdims=True)
data_max = multi_data.max(axis=1, keepdims=True)
multi_data = (2*(multi_data-data_min)/(data_max-data_min))-1
The problem is that this scales each "batch" (the first dimension of my array) independently. What I am trying to do is find, for each of my two features, the maximum and minimum values across all 66 batches, and then scale each feature using those values, but I can't quite work out how to achieve this. Any pointers would be very welcome.

How about chaining that with another min/max:
data_min = multi_data.min(axis=1, keepdims=True).min(axis=0, keepdims=True)
data_max = multi_data.max(axis=1, keepdims=True).max(axis=0, keepdims=True)
multi_data = (2*(multi_data-data_min)/(data_max-data_min))-1
Or:
data_min = multi_data.min(axis=(0,1), keepdims=True)
data_max = multi_data.max(axis=(0,1), keepdims=True)
multi_data = (2*(multi_data-data_min)/(data_max-data_min))-1
Since you're taking the min/max over the first two dimensions, you can also drop keepdims and rely on broadcasting: the result has shape (2,), which lines up with the last (feature) axis of multi_data:
data_min = multi_data.min(axis=(0,1))
data_max = multi_data.max(axis=(0,1))
multi_data = (2*(multi_data-data_min)/(data_max-data_min))-1
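
To check that this scaling really is global per feature, here is a small self-contained sketch. The toy shape (6, 50, 2) and the random data are assumptions standing in for the (66, 5004, 2) array, and the result is written to a new name, scaled, rather than overwriting multi_data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 6 samples, 50 time steps, 2 features on very different scales.
multi_data = rng.normal(size=(6, 50, 2)) * [1.0, 100.0]

# One min and one max per feature, taken over samples and time steps together.
data_min = multi_data.min(axis=(0, 1))   # shape (2,)
data_max = multi_data.max(axis=(0, 1))   # shape (2,)
scaled = 2 * (multi_data - data_min) / (data_max - data_min) - 1

# Each feature spans [-1, 1] over the whole array...
print(scaled.min(axis=(0, 1)), scaled.max(axis=(0, 1)))
# ...but an individual sample no longer has to reach +1 or -1 on its own.
per_sample_max = scaled.max(axis=1)      # shape (6, 2)
print(per_sample_max)
```

This differs from the axis=1 version in the question, where every sample reaches both -1 and 1 for every feature.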

Related

rolling statistics in numpy or pytorch

I have sensor data stored as tensors, each of shape (4,1500).
That is 1500 time points, with 4 features at each time point.
I want to "smooth" the sequences with a rolling average or other rolling statistics. The end goal is to try to improve an LSTM autoencoder by feeding it rolling statistics instead of the long raw sequence.
I am familiar with rolling windows of pandas and currently I am doing this:
#tensor shape:
data.shape
(4,1500)
#convert data to numpy array and then to dataframe and perform rolling mean
rolled_data=pd.DataFrame(data.numpy().swapaxes(1,0)).rolling(10).mean()[::10]
rolled_data.shape
(150, 4)
# convert back the dataframe to tensor
tensor_rolled_data=torch.Tensor(rolled_data.to_numpy().swapaxes(1,0))
tensor_rolled_data.shape
torch.Size([4, 150])
My question is: is there a better way to do this? Is there a function in numpy/torch that can do rolling statistics in a cleaner or more efficient way?

Since you're striding the output by the size of the window, this is actually more akin to downsampling by averaging than to computing rolling statistics. We can take advantage of the fact that there are no overlaps by simply reshaping the initial tensor.
Using Tensor.reshape
Assuming the length of the time axis of your data tensor is divisible by 10, you can just reshape the tensor to shape (4, 150, 10) and compute the statistic along the last dimension. For example:
win_size = 10
tensor_rolled_data = data.reshape(data.shape[0], -1, win_size).mean(dim=2)
This solution doesn't give exactly the same results as your tensor_rolled_data. Here the first entry contains the mean of the first 10 samples, the second entry the mean of the second 10 samples, and so on. The pandas solution is a "causal filter": its first entry contains the mean of the 10 most recent samples up to and including sample 0, its second the mean of the 10 most recent samples up to and including sample 10, and so on. (Note that the first entry is nan in the pandas solution, since fewer than 10 preceding samples exist.)
If this difference is unacceptable, you can recreate the pandas result by first padding the front with 9 nan values and clipping off the last 9 samples:
import torch.nn.functional as F
win_size = 10
# pad with `nan` to match behavior of pandas
data_padded = F.pad(data[None, :, :-(win_size - 1)], (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
# find mean of groups of N samples
tensor_rolled_data = data_padded.reshape(data.shape[0], -1, win_size).mean(dim=2)
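
The difference between the two windowings is easier to see on a tiny array. The sketch below uses NumPy instead of torch, and a made-up (4, 30) array in place of the (4, 1500) tensor; both choices are assumptions for illustration only.

```python
import numpy as np

win_size = 10
# Toy stand-in for the sensor tensor: row 0 is simply 0, 1, ..., 29.
data = np.arange(4 * 30, dtype=float).reshape(4, 30)

# Block means: window k covers samples 10k .. 10k + 9.
block_means = data.reshape(data.shape[0], -1, win_size).mean(axis=2)

# Causal (pandas-like) means: window k ends at sample 10k.
# Drop the last 9 samples and prepend 9 NaNs to shift every window back.
pad = np.full((data.shape[0], win_size - 1), np.nan)
shifted = np.concatenate([pad, data[:, :-(win_size - 1)]], axis=1)
causal_means = shifted.reshape(data.shape[0], -1, win_size).mean(axis=2)

print(block_means[0])    # means of 0..9, 10..19, 20..29 -> 4.5, 14.5, 24.5
print(causal_means[0])   # NaN, then means of 1..10 and 11..20 -> 5.5, 15.5
```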
Using Tensor.unfold
To address the comment about what to do when the windows overlap: if you're only interested in the mean, there are a number of ways to compute it (e.g. convolution, average pooling, tensor unfolding). That said, Tensor.unfold gives the most general solution, since it can be used to compute any statistic over a window. For example:
# same as first example above
win_size = 10
tensor_rolled_data = data.unfold(dimension=1, size=win_size, step=win_size).mean(dim=2)
or
# same as second example above
import torch.nn.functional as F
win_size = 10
data_padded = F.pad(data.unsqueeze(0), (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
tensor_rolled_data = data_padded.unfold(dimension=1, size=win_size, step=win_size).mean(dim=2)
In the above cases, unfolding produces the same result as reshape since size and step are equal. However, unlike reshape, unfolding also supports size != step.
win_size = 10
stride = 2
tensor_rolled_data = data.unfold(1, win_size, stride).mean(dim=2)
# produces shape [4, 746]
Or you can pad the front of each feature with win_size - 1 values to mimic the pandas result:
import torch.nn.functional as F
win_size = 10
stride = 2
data_padded = F.pad(data.unsqueeze(0), (win_size - 1, 0), 'constant', float('nan')).squeeze(0)
tensor_rolled_data = data_padded.unfold(1, win_size, stride).mean(dim=2)
# produces shape [4, 750]
Note: In practice you probably don't want to pad with NaN, since it propagates through most downstream computations and will probably become quite a headache. Instead you could use zero padding, 'replicate' padding, or 'reflect' (mirror) padding.
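
Since the question also asks about numpy: a rough NumPy counterpart of Tensor.unfold is numpy.lib.stride_tricks.sliding_window_view (available from NumPy 1.20). This is a swapped-in technique, not part of the torch answer above, and the random (4, 1500) array is an assumption standing in for the real data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

win_size = 10
stride = 2
data = np.random.default_rng(0).random((4, 1500))

# Every length-10 window along the time axis, as a read-only view
# of shape (4, 1491, 10); no data is copied at this point.
windows = sliding_window_view(data, window_shape=win_size, axis=1)

# Keep every second window (the equivalent of step=2), then reduce
# over the window axis with whatever statistic is needed.
strided = windows[:, ::stride]
rolled_mean = strided.mean(axis=-1)
rolled_median = np.median(strided, axis=-1)

print(rolled_mean.shape)   # (4, 746)
```

As with unfold, any reduction that accepts an axis argument can replace the mean here.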

Repmat operation in python

I want to calculate the mean of a 3D array along two axes and subtract this mean from the array.
In Matlab I use the repmat function to achieve this as follows
% A is an array of size 100x50x100
mean_A = mean(mean(A,3),1); % mean_A is 1D of length 50
Am = repmat(mean_A,[100,1,100]) % Am is 3D 100x50x100
flc_A = A - Am % flc_A is 3D 100x50x100
Now, I am trying to do the same with python.
mean_A = numpy.mean(numpy.mean(A,axis=2),axis=0);
gives me the 1D array. However, I cannot find a way to copy this to form a 3D array using numpy.tile().
Am I missing something or is there another way to do this in python?

You could set keepdims to True in both cases so the resulting shape is broadcastable and use np.broadcast_to to broadcast to the shape of A:
np.broadcast_to(np.mean(np.mean(A,2,keepdims=True),axis=0,keepdims=True), A.shape)
Note that you can also specify a tuple of axes along which to take the successive means:
np.broadcast_to(np.mean(A,axis=tuple([2,0]), keepdims=True), A.shape)
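
For the subtraction itself, the broadcast array does not even have to be built explicitly, because NumPy broadcasts the (1, 50, 1) mean against A automatically. A minimal sketch, using a small random array as an assumed stand-in for the 100x50x100 data:

```python
import numpy as np

A = np.random.default_rng(0).random((10, 5, 10))   # toy stand-in for A

# Mean over the first and last axes, kept as shape (1, 5, 1).
mean_A = A.mean(axis=(0, 2), keepdims=True)

# Explicit repmat-like array: a read-only view, no data is copied.
Am = np.broadcast_to(mean_A, A.shape)

# Broadcasting gives the same result without ever forming Am.
flc_A = A - mean_A

print(Am.shape, flc_A.shape)   # (10, 5, 10) (10, 5, 10)
```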

numpy.tile is not the same as Matlab's repmat. You could refer to this question. However, there is an easy way to reproduce what you did in Matlab, and you don't really have to understand how numpy.tile works in Python.
import numpy as np
A = np.random.rand(100, 50, 100)
# keep the dims of the array when calculating mean values
B = np.mean(A, axis=2, keepdims=True)
C = np.mean(B, axis=0, keepdims=True) # now the shape of C is (1, 50, 1)
# then simply duplicate C in the first and the third dimensions
D = np.repeat(C, 100, axis=0)
D = np.repeat(D, 100, axis=2)
D is the 3D array you want.
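
If you do want to stay close to the repmat call, numpy.tile works once the 1D mean has been reshaped so that its length sits on the middle axis. This is only a sketch on a small made-up array; the names mirror the question.

```python
import numpy as np

A = np.random.default_rng(0).random((10, 5, 10))   # toy stand-in for A

# 1D mean of length 5, as in the question.
mean_A = np.mean(np.mean(A, axis=2), axis=0)

# Put the values on the middle axis, then repeat along the other two,
# mirroring repmat(mean_A, [100, 1, 100]) in Matlab.
Am = np.tile(mean_A.reshape(1, -1, 1), (A.shape[0], 1, A.shape[2]))
flc_A = A - Am

print(Am.shape)   # (10, 5, 10)
```

Unlike the broadcasting approaches, np.tile and np.repeat allocate the full-size array.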

Numpy Hstack dimensions not aligned

I am trying to run a PCA over a dataset representing the 3 bands of an image. The dataset has shape (300000,3): one row per pixel and one column per band. I compute the eigenvalues and eigenvectors and store them as (eigenvalue, eigenvector) tuples in a list called eig_pairs. I then calculate the explained variance to determine how many bands to use for PCA.
I determine that I wish to use 2 bands.
My eig_pairs is a list of 3 tuples.
Following this tutorial, I need to reshape everything to go from the original number of dimensions (3) to the number of dimensions I wish to keep (2). Their example goes from 7 to 4, as shown here:
matrix_w = np.hstack((eig_pairs[0][1].reshape(7,1),
eig_pairs[1][1].reshape(7,1),
eig_pairs[2][1].reshape(7,1),
eig_pairs[3][1].reshape(7,1)))
Following this logic I changed my own to:
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1),
eig_pairs[1][1].reshape(3,1)))
However I get the error ValueError: shapes (3131892,3) and (2,3) not aligned: 3 (dim 1) != 2 (dim 0)
#read in image
img = cv2.imread('/Volumes/EXTERNAL/Stitched-Photos-for-Chris/p7_0015_20161005-949am-75m-pass-1.jpg.png',1)
row,col = img.shape[:2]
b,g,r = cv2.split(img)
# Pandas dataset
# samples = 3000000, features = 3
dataSet = pd.DataFrame({'bBnad':b.flat[:],'gBnad':g.flat[:],'rBnad':r.flat[:]})
print(dataSet.head())
# Standardize the data
X = dataSet.values
X_std = StandardScaler().fit_transform(X) # converts data from uint8 to float64
#Calculating Eigenvectors and eigenvalues of Covariance matrix
meanVec = np.mean(X_std, axis=0)
covarianceMatx = np.cov(X_std.T)
eigVals, eigVecs = np.linalg.eig(covarianceMatx)
# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [ (np.abs(eigVals[i]),eigVecs[:,i]) for i in range(len(eigVals))]
# Sort from high to low
eig_pairs.sort(key = lambda x: x[0], reverse= True)
# Determine how many PC going to choose for new feature subspace via
# the explained variance measure which is calculated from eigen vals
# The explained variance tells us how much information (variance) can
# be attributed to each of the principal components
tot = sum(eigVals)
var_exp = [(i / tot)*100 for i in sorted(eigVals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
#convert 3 dimension space to 2 dimensional space therefore getting a 2x3 matrix W
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1),
eig_pairs[1][1].reshape(3,1)))
Appreciate any help.
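
For reference, the shapes involved can be checked on synthetic data. In the sketch below the random X_std is a hypothetical stand-in for the standardised pixels. The hstack call yields a (3, 2) matrix, which multiplies cleanly with an (n, 3) array. The reported error is what NumPy raises when the right-hand operand has shape (2, 3), for example the transpose of that matrix; whether that is what happened here is a guess, since the line performing the projection is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((1000, 3))   # hypothetical stand-in for the pixels

eig_vals, eig_vecs = np.linalg.eig(np.cov(X_std.T))
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Two eigenvectors stacked as columns.
matrix_w = np.hstack((eig_pairs[0][1].reshape(3, 1),
                      eig_pairs[1][1].reshape(3, 1)))
print(matrix_w.shape)    # (3, 2)

# (1000, 3) . (3, 2) -> (1000, 2)
projected = X_std.dot(matrix_w)
print(projected.shape)   # (1000, 2)

# A (2, 3) right-hand side does not line up with the 3 columns of X_std.
try:
    X_std.dot(matrix_w.T)
except ValueError as err:
    print(err)
```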

Demean and standardise a 4d array within a 3d mask?

import numpy as np
ts = np.random.rand(40,45,40,1000)
mask = np.random.randint(2, size=(40,45,40),dtype=bool)
#creating a masked array
ts_m = np.ma.array(ts, mask=ts*~mask[:,:,:,np.newaxis])
#demeaning
ts_md = ts_m - ts_m.mean(axis=3)[:,:,:,np.newaxis]
#standardisation
ts_mds = ts_md / ts_md.std(ddof=1,axis=3)[:,:,:,np.newaxis]
I would like to demean ts (along axis 3), and divide by its standard deviation (along axis 3), all within the mask.
Am I doing this correctly?
Is there a faster method?

You have a couple of options available to you.
The first is to use masked arrays as you are doing, but provide a proper mask and use the masked functions. Right now, your code is computing all the means and standard deviations, and slapping a mask on the result. To skip masked elements, use np.ma.mean and np.ma.std, and thereby avoid doing a whole lot of extra work.
As you correctly understood, the size of the mask must match that of the data. While multiplying by the data gives you the correct size, it is expensive and gives the wrong result in the general case since the mask will be zero whenever either data or mask is zero. A better approach would be to create a view of the mask repeated along the last (new) dimension. You can use np.broadcast_to if you get the trailing dimensions to match up first:
ts = np.random.rand(40, 45, 40, 1000)
mask = np.random.randint(2, size=(40, 45, 40), dtype=bool)
#creating a masked array
ts_m = np.ma.array(ts, mask=np.broadcast_to(mask[..., None], ts.shape))
#demeaning
ts_md = ts_m - np.ma.mean(ts_m, axis=3)[..., None]
#standardisation
ts_mds = ts_md / np.ma.std(ts_m, ddof=1,axis=3)[..., None]
The mask is read only, and because it likely has a dimension with zero stride, can sometimes do unexpected things. The broadcasted version here is roughly equivalent to
np.lib.stride_tricks.as_strided(mask, ts.shape, (*mask.strides, 0), writeable=False)
Both versions create views to the original data, so are very fast. They just allocate a new array object that points to the existing data, which is not copied. Keep in mind that np.lib.stride_tricks.as_strided is a sledgehammer that should be used with the utmost care. It will crash your interpreter any day if you let it.
Note: The mask in a masked array is interpreted as True being masked, while Boolean indexing arrays are interpreted with False masked. Depending on how it's obtained and its meaning in your real code, you may want to invert the mask:
mask=np.broadcast_to(~mask[..., None], ...)
Another option is to implement the masking yourself. There are two ways you can do that. If you do it up-front, the mask will be applied to the leading dimensions of your data:
ts = np.random.rand(40, 45, 40, 1000)
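mask = np.random.randint(2, size=(40, 45, 40), dtype=bool)
mask = ~mask # optional, see note above
# select the elements to process; the result has shape (mask.sum(), 1000)
ts_m = ts[mask]
#demeaning (keepdims so the per-row means broadcast against ts_m)
ts_md = ts_m - ts_m.mean(axis=-1, keepdims=True)
#standardisation
ts_mds = ts_md / ts_md.std(ddof=1, axis=-1, keepdims=True)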
# reshaping
result = np.empty_like(ts) # alternatively, np.zeros_like
result[mask] = ts_mds
This option may be cheaper than a masked array because the initial masking step creates a smaller array of shape (number of selected elements in mask, 1000), and the result is only placed back into the selected area when finished, instead of operating on the full-sized data and preserving shape throughout.
The third option is only really useful if you have only a small number of elements masked out. It's essentially what your original code is doing: perform all the computations, and apply the mask to the result.
More Tips
Ellipsis is a special object that means "all the remaining dimensions". It's usually abbreviated ... in slice notation. np.newaxis is an alias for None. Combine those pieces of information, and you get that [:, :, :, np.newaxis] can be written more cleanly and elegantly as [..., None]. The latter is more general since it works for an arbitrary number of dimensions.
Numpy allows for negative axis indices. A nicer way to say "last axis" is generally axis=-1.
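
As a sanity check, the masked-array route and the boolean-indexing route can be compared on a small array. This is a sketch under assumptions: the toy shape (4, 5, 4, 50) replaces the real one, keep is a hypothetical name for a mask in which True marks the elements to process, and the untouched positions of the result are filled with NaN.

```python
import numpy as np

rng = np.random.default_rng(0)
ts = rng.random((4, 5, 4, 50))
keep = rng.integers(2, size=(4, 5, 4)).astype(bool)   # True = process this element

# Masked-array route: np.ma hides entries where the mask is True,
# so the inverse of keep is passed.
ts_m = np.ma.array(ts, mask=np.broadcast_to(~keep[..., None], ts.shape))
ts_md = ts_m - ts_m.mean(axis=-1)[..., None]
z_ma = ts_md / ts_md.std(ddof=1, axis=-1)[..., None]

# Boolean-indexing route: work on the (keep.sum(), 50) selection only.
sel = ts[keep]
z_sel = (sel - sel.mean(axis=-1, keepdims=True)) / sel.std(ddof=1, axis=-1, keepdims=True)
result = np.full_like(ts, np.nan)
result[keep] = z_sel

# Both routes agree on the processed elements.
print(np.allclose(z_ma.filled(np.nan), result, equal_nan=True))
```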

import numpy as np
ts = np.random.rand(40,45,40,1000)
mask = np.random.randint(2, size=(40,45,40)).astype(bool)
#creating a masked array
ts_m = np.ma.array(ts, mask=np.broadcast_to(~mask.reshape(40,45,40,1),ts.shape))
#demeaning
ts_md = ts_m - ts_m.mean(axis=3)[:,:,:,np.newaxis]
#standardisation
ts_mds = ts_md / ts_md.std(ddof=1,axis=3)[:,:,:,np.newaxis]

Multi-dimensional filtering using scipy.ndimage_generic_filter

I would like to use a generic filter to calculate the mean of values within a given window (or kernel), for values that fulfill a couple of conditions. I expected the following code to produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation.
from scipy import ndimage
import numpy as np
#some test data
tstArr = np.random.rand(3,7,7)
tstArr = tstArr*10
tstArr = np.int_(tstArr)
tstArr[1] = tstArr[1]*100
tstArr[2] = tstArr[2] *1000
#mean function
def testFun(tstData,processLayer,nLayers,kernelSize):
    funData= tstData.reshape((nLayers,kernelSize,kernelSize))
    meanLayer = funData[processLayer]
    maskedData = meanLayer[(funData[1]>1)&(funData[2]<9000)]
    returnMean = np.mean(maskedData)
    return returnMean
#number of layers in the array
nLayers = np.shape(tstArr)[0]
#window size
kernelSize = 5
#create a sampling window of 5x5 elements from each array
footprnt = np.ones((nLayers,kernelSize,kernelSize),dtype = int)
# calculate the mean of the first layer in the array (other two are for masking)
processLayer = 0
tstOut = ndimage.generic_filter(tstArr, testFun, footprint=footprnt, extra_arguments = (processLayer,nLayers,kernelSize))
I thought this would yield a 7x7 array of masked mean values from the first layer in the input array. The output is a 3x7x7 array, and I don't understand what the values represent. I'm not sure how to produce the "masked" mean-filtered array, or how to interpret the output as given.

Your code does produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation. You will find the result in tstOut[1].
What is going on? When you call ndimage.generic_filter with tstArr of shape (3, 7, 7) and footprint=np.ones((3, 5, 5)), then for all i from 0 to 2, for all j from 0 to 6 and for all k from 0 to 6, testFun is called with the subarray of tstArr centered at (i, j, k) and of shape (3, 5, 5) (the array is reflected at the boundary to supply missing values).
In the end:
tstOut[0] is the mean filter of tstArr[0] with tstArr[0] and tstArr[1] as masks
tstOut[1] is the mean filter of tstArr[0] with tstArr[1] and tstArr[2] as masks
tstOut[2] is the mean filter of tstArr[1] with tstArr[2] and tstArr[2] as masks
Again, the wanted result is in tstOut[1].
I hope this will help you.
