I have a data set called DATA which gathers 3D tables from N=173 files, each of shape (4, 4, 64), so the resulting numpy array DATA has shape (173, 4, 4, 64). Each individual file contains a boolean column that specifies whether the data are good or bad. To filter my data I then use boolean conditions:
cond = DATA[:,3,:,:]==False
DATA_filtered = DATA[:,1,:,:][cond]
with the following shapes:
np.shape(DATA)
Out[854]: (173, 4, 4, 64)
np.shape(cond)
Out[855]: (173, 4, 64)
But with this technique I end up with a 1D array, and all the structure of the initial DATA set is lost. One option is the reshape function of numpy arrays, but that only works if the final dimensions are known in advance. When the boolean conditions produce tables of variable size, we can no longer predict the result and ask for a reshape. So is there a way to filter the data while keeping its global shape, with sizes that can vary depending on the flag stored in the data?
Here is a minimal example:
TEST = np.ones((173,4,4,64))
FLAG = np.random.choice(a=[False, True], size=(173,4,64))
cond = FLAG==False
data = TEST[:,0,:,:][cond]
Output :
np.shape(data)
Out[868]: (22167,)
Expected Output:
np.shape(data)
Out[868]: (173,4,)
where, for example, data[:,1,:] would be a subset whose arrays have unequal sizes between 0 and 64 across the 173 tables, depending on which data have been flagged.
Thank you in advance
Masked Array is your solution
In many circumstances, datasets can be incomplete or tainted by the presence of invalid data. For example, a sensor may have failed to record a value, or recorded an invalid one. The numpy.ma module provides a convenient way to address this issue by introducing masked arrays.
A masked array is the combination of a standard numpy.ndarray and a mask.
import numpy as np
import numpy.ma as ma
x = np.array([1, 2, 3, -1, 5])
mx = ma.masked_array(x, mask=[0, 0, 0, 1, 0])
mx.mean() # without taking the invalid data into account
Output
2.75
All of the above is taken from the Masked array documentation.
So you might as well read it from there.
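Applied to the minimal example from the question, a sketch might look like the following (assuming entries where FLAG is True should be treated as invalid; use ~FLAG instead if the flag marks the good data):
import numpy as np
import numpy.ma as ma

# Arrays from the minimal example in the question
TEST = np.ones((173, 4, 4, 64))
FLAG = np.random.choice(a=[False, True], size=(173, 4, 64))

# Mask the slice along axis 1 while keeping its (173, 4, 64) shape;
# masked entries are simply ignored by reductions such as mean/sum
data = ma.masked_array(TEST[:, 0, :, :], mask=FLAG)

print(np.shape(data))       # (173, 4, 64) -- the structure is preserved
print(data.count(axis=-1))  # number of valid points per table and row, anywhere from 0 to 64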
Related
I have data in the following shape: (127260, 2, 1250)
The type of this data is <HDF5 dataset "data": shape (127260, 2, 1250), type "<f8">
The first dimension (127260) is the number of signals, the second dimension (2) is the type of signal, and the third dimension (1250) is the amount of points in each of the signals.
What I want to do is reduce the number of points in each signal by cutting it in half, leaving 625 points per signal, which doubles the number of signals.
How do I convert the HDF5 dataset to something like a numpy array, and how do I do this reshape?
If I understand correctly, you want a new dataset with shape (2*127260, 2, 625). If so, it's fairly simple to read 2 slices of the dataset into 2 NumPy arrays, create a new array from the slices, then write it to a new dataset. Note: reading slices is simple and fast. I would leave the data as-is and slice it on the fly unless you have a compelling reason to create a new dataset.
Code to do this (where h5f is the h5py file object):
new_arr = np.empty((2*127260, 2, 625))
arr1 = h5f['dataset_name'][:,:, :625]
arr2 = h5f['dataset_name'][:,:, 625:]
new_arr[:127260,:,:] = arr1
new_arr[127260:,:,:] = arr2
h5f.create_dataset('new_dataset_name',data=new_arr)
Alternatively, you can do this (combining 2 steps):
new_arr = np.empty((2*127260, 2, 625))
new_arr[:127260,:,:] = h5f['dataset_name'][:,:, :625]
new_arr[127260:,:,:] = h5f['dataset_name'][:,:, 625:]
h5f.create_dataset('new_dataset_name',data=new_arr)
Here is a 3rd method. It is the most direct way, and reduces the memory overhead. This is important when you have very large datasets that won't fit in memory.
h5f.create_dataset('new_dataset_name',shape=(2*127260, 2, 625),dtype=float)
h5f['new_dataset_name'][:127260,:,:] = h5f['dataset_name'][:,:, :625]
h5f['new_dataset_name'][127260:,:,:] = h5f['dataset_name'][:,:, 625:]
Whichever method you choose, I suggest adding an attribute to note the data source for future reference:
h5f['new_dataset_name'].attrs['Data Source'] = 'data sliced from dataset_name'
As I was going through the pytorch documentation I came across the term layout = torch.strided in many of the functions. Can anyone help me understand where it is used and how? The description says it's the desired layout of the returned Tensor. What does layout mean, and how many types of layout are there?
torch.rand(*sizes, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False)
A stride is the number of steps (or jumps) needed to go from one element to the next element along a given dimension. In computer memory, the data is stored linearly in a contiguous block; what we view is just a (re)presentation.
Let's take an example tensor for understanding this:
# a 2D tensor
In [62]: tensor = torch.arange(1, 16).reshape(3, 5)
In [63]: tensor
Out[63]:
tensor([[ 1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10],
        [11, 12, 13, 14, 15]])
With this tensor in place, the strides are:
# get the strides
In [64]: tensor.stride()
Out[64]: (5, 1)
What this resultant tuple (5, 1) says is:
to traverse along the 0th dimension/axis (Y-axis), let's say we want to jump from 1 to 6, we should take 5 steps (or jumps)
to traverse along the 1st dimension/axis (X-axis), let's say we want to jump from 7 to 8, we should take 1 step (or jump)
The order (or index) of 5 & 1 in the tuple represents the dimension/axis. You can also pass the dimension, for which you want the stride, as an argument:
# get stride for axis 0
In [65]: tensor.stride(0)
Out[65]: 5
# get stride for axis 1
In [66]: tensor.stride(1)
Out[66]: 1
With that understanding, we might ask why this extra parameter is needed when we create tensors. The answer is efficiency: how can we store/read/access the elements of a (possibly sparse) tensor most efficiently?
With sparse tensors (tensors where most of the elements are zeroes), we don't want to store all those zero values; we only store the non-zero values and their indices. Given a desired shape, the rest of the values can then be filled in with zeroes, yielding the desired sparse tensor.
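As a small illustrative sketch (an assumed example, not from the question): a COO sparse tensor is built from just the indices and values of its non-zero entries:
import torch

# Two non-zero entries of a 3x3 tensor, stored as indices + values only
indices = torch.tensor([[0, 2],   # row indices of the non-zero entries
                        [1, 0]])  # column indices of the non-zero entries
values = torch.tensor([3.0, 5.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

print(sparse.layout)             # torch.sparse_coo
print(sparse.to_dense())         # zeroes everywhere except (0, 1) and (2, 0)
print(torch.zeros(3, 3).layout)  # torch.strided (the default dense layout)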
For further reading on this, the following articles might be of help:
numpy.ndarray.strides
torch.layout
torch.sparse
P.S: I guess there's a typo in the torch.layout documentation which says
Strides are a list of integers ...
The composite data type returned by tensor.stride() is a tuple, not a list.
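A quick check with the tensor from above confirms it:
# stride() returns a tuple
In [67]: type(tensor.stride())
Out[67]: tuple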
For quick understanding, layout=torch.strided corresponds to dense tensors while layout=torch.sparse_coo corresponds to sparse tensors.
From another perspective, we can understand it together with torch.tensor.view.
A tensor that can be viewed is contiguous. If we change the view of a tensor, the strides change accordingly, but the data stays the same. More specifically, view returns a new tensor with the same data but a different shape, and the strides are compatible with that view so they indicate how to access the data in memory.
For example
In [1]: import torch
In [2]: a = torch.arange(15)
In [3]: a.data_ptr()
Out[3]: 94270437164688
In [4]: a.stride()
Out[4]: (1,)
In [5]: a = a.view(3, 5)
In [6]: a.data_ptr() # share the same data pointer
Out[6]: 94270437164688
In [7]: a.stride() # the stride changes as the view changes
Out[7]: (5, 1)
In addition, the idea of torch.strided is basically the same as strides in numpy.
See this question for a more detailed understanding:
How to understand numpy strides for layman?
As per the official pytorch documentation here,
A torch.layout is an object that represents the memory layout of a
torch.Tensor. Currently, we support torch.strided (dense Tensors) and
have experimental support for torch.sparse_coo (sparse COO Tensors).
torch.strided represents dense Tensors and is the memory layout that
is most commonly used. Each strided tensor has an associated
torch.Storage, which holds its data. These tensors provide
multi-dimensional, strided view of a storage. Strides are a list of
integers: the k-th stride represents the jump in the memory necessary
to go from one element to the next one in the k-th dimension of the
Tensor. This concept makes it possible to perform many tensor
operations efficiently.
Example:
>>> x = torch.Tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
>>> x.stride()
(5, 1)
>>> x.t().stride()
(1, 5)
layout means the way memory organizes the elements of that tensor. I think there are currently 2 types of layout for storing tensors:
one is torch.strided and the other is torch.sparse_coo.
strided means the elements are arranged one by one in a very dense way; think of troops standing in a square formation, so each soldier has neighbours.
sparse_coo, I think, is meant for sparse matrices. I'm not sure about the exact storage structure, but I guess it just stores the non-zero elements' indices and values.
The two types need to be kept separate because for a sparse matrix there is no need to arrange the elements one by one in a dense form, since it might take a hundred steps for one non-zero element to reach the next non-zero element.
In a related question I learned that if I have an array of shape MxMxN, and I want to select based on a boolean matrix of shape MxM, I can simply do
data[select, ...]
and be done with it. Unfortunately, now I have my data in a different order:
import numpy as np
data = np.arange(36).reshape((3, 4, 3))
select = np.random.choice([0, 1], size=9).reshape((3, 3)).astype(bool)
Each element of data with indices i0, i1, i2 should be selected if select[i0, i2] == True.
How can I proceed with my selection without having to do something inefficient like
data.flatten()[np.repeat(select[:, None, :], 4, axis=1).flatten()]
One way would be to simply use np.broadcast_to to broadcast the mask without actual replication and use that broadcasted mask directly to select the required elements -
mask = np.broadcast_to(select[:,None,:], data.shape)
out = data[mask]
Another, and probably faster, way would be to get the indices and then index with those. The elements thus obtained are grouped along axis=1. The implementation would look something like this -
idx = np.argwhere(select)
out = data[idx[:,0], :, idx[:,1]]
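As a quick sanity check (a sketch reusing the data and select arrays from the question), both approaches pick exactly the same elements; only their ordering differs:
import numpy as np

data = np.arange(36).reshape((3, 4, 3))
select = np.random.choice([0, 1], size=9).reshape((3, 3)).astype(bool)

# Approach 1: broadcast the mask to data's shape
mask = np.broadcast_to(select[:, None, :], data.shape)
out1 = data[mask]

# Approach 2: index with the coordinates of the True entries
idx = np.argwhere(select)
out2 = data[idx[:, 0], :, idx[:, 1]].ravel()

print(np.array_equal(np.sort(out1), np.sort(out2)))  # True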
I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays using numpy).
I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set chunksize=(6000, 1, 1, 1) so I have a separate chunk for each grid point.
This is my function for getting the coefficients of the cubic polynomial (the time_axis axis values are a global 1D numpy array defined elsewhere):
def my_polyfit(data):
    return numpy.polyfit(data.squeeze(), time_axis, 3)
(So in this case, numpy.polyfit returns an array of length 4.)
and this is the command I thought I'd need to apply it to each chunk:
dask_array.map_blocks(my_polyfit, chunks=(4, 1, 1, 1), drop_axis=0, new_axis=0).compute()
Whereby the time axis is now gone (hence drop_axis=0) and there's a new coefficient axis of length 4 in its place.
When I run this command I get IndexError: tuple index out of range, so I'm wondering where/how I've misunderstood the use of map_blocks?
I suspect that your experience will be smoother if your function returns an array of the same dimension that it consumes. E.g. you might consider defining your function as follows:
def my_polyfit(data):
    return np.polyfit(data.squeeze(), ...)[:, None, None, None]
Then you can probably ignore the new_axis, drop_axis bits.
Performance-wise you might also want to consider using a larger chunksize. At 6000 numbers per chunk you have over a million chunks, which means you'll probably spend more time in scheduling than in actual computation. Generally I shoot for chunks that are a few megabytes in size. Of course, increasing chunksize would cause your mapped function to become more complex.
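For instance, a hedged sketch of such a rechunk for the array from the question (assuming float64 data, so a (6000, 1, 10, 10) chunk is roughly 4.8 MB):
# Keep the full time axis in each block, but group 10x10 grid points per chunk
dask_array = dask_array.rechunk((6000, 1, 10, 10))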
Example
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b.squeeze(), np.arange(5), 3)[:, None, None, None]
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: x.map_blocks(f, chunks=(4, 1, 1, 1)).compute()
Out[5]:
array([[[[ -1.29058580e+02,  2.21410738e+02,  1.00721521e+01],
         [ -2.22469851e+02, -9.14889627e+01, -2.86405832e+02],
         [  1.40415805e+02,  3.58726232e+02,  6.47166710e+02]],
...
Kind of late to the party, but figured this could use an alternative answer based on new features in Dask. In particular, we added apply_along_axis, which behaves basically like NumPy's apply_along_axis except for Dask Arrays instead. This results in somewhat simpler syntax. Also it avoids the need to rechunk your data before applying your custom function to each 1-D piece and makes no real requirements of your initial chunking, which it tries to preserve in the end result (excepting the axis that is either reduced or replaced).
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b, np.arange(len(b)), 3)
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: da.apply_along_axis(f, 0, x).compute()
Out[5]:
array([[[[  2.13570599e+02,   2.28924503e+00,   6.16369231e+01],
         [  4.32000311e+00,   7.01462518e+01,  -1.62215514e+02],
         [  2.89466687e+02,  -1.35522215e+02,   2.86643721e+02]],
...
I have the following numpy array:
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
import numpy as np
# NumPy array comprising associate metrics
# i.e. Open TA's, Open SR's, Open SE's
associateMetrics = np.array([[11, 28, 21],
                             [27, 17, 20],
                             [19, 31, 3],
                             [17, 24, 17]]).astype(np.float64)
print("raw metrics=", associateMetrics)
Now, I want to assign a different weight to every column in the above array and later normalize it. For example, let's say I want to give the 1st column a higher weight by multiplying it by 5, multiply column 2 by 3, and the last column by 2.
How do I do this in Python? Sorry, I'm a bit new to Python and numpy.
I have tried this for just 1 column but it won't work:
# Assign weights to metrics
weightedMetrics = associateMetrics
np.multiply(2, weightedMetrics[:,0])
print("weighted metrics=", weightedMetrics)
You should make use of numpy's array broadcasting. This means that lower-dimensional arrays can be automatically expanded to perform a vectorized operation with an array of higher (but compatible) dimensions. In your specific case, you can multiply your (4,3)-shaped array with a 1d weight array of shape (3,) and obtain what you want:
weightedMetrics = associateMetrics * np.array([5,3,2])
The trick is that you can imagine numpy ndarrays to have leading singleton dimensions, along which broadcasting is automatic. By this I mean that your 1d numpy weight array of shape (3,) can be thought to have a leading singleton dimension (but only from the point of view of broadcasting!). And it's easy to see how the array of shape (4,3) and (1,3) should be multiplied: each element of the latter has to be used for full columns of the former.
In the very general case, you can even use arithmetic operations on, say, an array of shape (3,1,3,1,4) and one of shape (2,3,4,4). What's important is that the dimensions aligned from the trailing end should either agree, or one of the arrays should have a singleton dimension at that place; one of the arrays is also allowed to be longer (at the front).
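A minimal sketch making the implicit singleton dimension explicit (using the array from the question):
import numpy as np

associateMetrics = np.array([[11, 28, 21],
                             [27, 17, 20],
                             [19, 31,  3],
                             [17, 24, 17]], dtype=np.float64)
weights = np.array([5, 3, 2])

# (4, 3) * (3,) broadcasts exactly like (4, 3) * (1, 3)
weighted = associateMetrics * weights
weighted_explicit = associateMetrics * weights[None, :]

print(np.array_equal(weighted, weighted_explicit))  # True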
I found my answer. This is what I used:
print("weighted metrics=", np.multiply([ 1, 2, 3], associateMetrics))