Related
import numpy as np
arr = np.random.random((5, 3))
labels = [1, 1, 2, 2, 3]
arr
Out[136]:
array([[0.20349907, 0.1330621 , 0.78268978],
[0.71883378, 0.24783927, 0.35576746],
[0.17760916, 0.25003952, 0.29058267],
[0.90379712, 0.78134806, 0.49941208],
[0.08025936, 0.01712403, 0.53479622]])
labels
Out[137]: [1, 1, 2, 2, 3]
assume I have this dataset.
I would like, using the labels as indicators, to perform np.mean over the rows.
(The labels here indicates the class of each row.
labels could also be [0, 1, 1, 0, 4, 1, 4] So have no assumptions over them.)
So the output here will be an average over the:
1st and 2nd row.
3rd and 4th row.
5th row.
in the most efficient way numpy offers. like so:
[np.mean(arr[:2], axis=0),
np.mean(arr[2:4], axis=0),
np.mean(arr[4:], axis=0)]
Out[180]:
[array([0.46116642, 0.19045069, 0.56922862]),
array([0.54070314, 0.51569379, 0.39499737]),
array([0.08025936, 0.01712403, 0.53479622])]
(in real life scenario the matrix dimensions could be (100000, 256))
First we would like to sort our label and matrix:
labels = np.array(labels)
# Getting the indices of a sorted array
sorted_indices = np.argsort(labels)
# Use the indices to sort both labels and matrix
sorted_labels = labels[sorted_indices]
sorted_matrix = matrix[sorted_indices]
Then, we calculate the "steps" or pairs of indices, (from, to) we want to calculate average over, We sum them and divide by their count.
# Here we're getting the amount of rows per label to average (over the sorted_matrix).
# Infact, we're getting the start and end indices per label.
label_indices = np.concatenate(([0], np.where(np.diff(sorted_labels) != 0)[0] + 1, [len(sorted_labels)]))
# using add + reduceat to add all rows with regard to the label indices
group_sums = np.add.reduceat(sorted_matrix, label_indices[:-1], axis=0)
# getting count for each group using the diff in label_indices
group_counts = np.diff(label_indices)
# Calculating the mean
group_means = group_sums / group_counts[:, np.newaxis]
Example:
matrix
Out[265]:
array([[0.69524902, 0.22105336, 0.65631557, 0.54823511, 0.25248685],
[0.61675048, 0.45973729, 0.22410694, 0.71403135, 0.02391662],
[0.02559926, 0.41640708, 0.27931808, 0.29139379, 0.76402121],
[0.27166955, 0.79121862, 0.23512671, 0.32568048, 0.38712154],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.01685432, 0.8395658 , 0.73460083, 0.08056013, 0.02522956],
[0.27274409, 0.64602305, 0.05698037, 0.23214598, 0.75130743],
[0.65069115, 0.32383729, 0.86316629, 0.69659358, 0.26667206],
[0.91971818, 0.02011127, 0.91776206, 0.79474582, 0.39678431],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
labels
Out[266]: array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])
group_means
Out[267]:
array([[0.33377274, 0.58170155, 0.79888356, 0.38857686, 0.14595081],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.40602051, 0.36084713, 0.41802017, 0.43942853, 0.63737099],
[0.52788969, 0.49066976, 0.37184974, 0.52931565, 0.221175 ],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
and the results are suited for: np.unique(sorted_labels)
np.unique(sorted_labels)
Out[271]: array([0, 1, 2, 3, 5])
I did not understand the labels part in your question. but there is a way to calculate the mean of each row in a matrix.
use --> np.mean(arr, axis = 1).
If lables to be used, please go through below mentioned script.
import numpy as np
arr = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[1,2,3],
[4,5,6]])
labels =np.array([0, 1, 1, 0, 4])
#print(arr)
#print('LABEL IS :', labels)
#print('MEAN VALUES ARE : ',np.mean(arr[:2], axis = 1))
id = labels.argsort()
eq_lal = labels[id]
print(eq_lal)
print(arr[eq_lal])
print(np.mean(arr[eq_lal], axis = 1))
I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print np.array([np.roll(row, x) for row,x in zip(A, r)])
[[0 0 4]
[1 2 3]
[0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing tricks?
Sure you can do it using advanced indexing, whether it is the fastest way probably depends on your array size (if your rows are large it may not be):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Use always a negative shift, so that column_indices are valid.
# (could also use module operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]
result = A[rows, column_indices]
numpy.lib.stride_tricks.as_strided stricks (abbrev pun intended) again!
Speaking of fancy indexing tricks, there's the infamous - np.lib.stride_tricks.as_strided. The idea/trick would be to get a sliced portion starting from the first column until the second last one and concatenate at the end. This ensures that we can stride in the forward direction as needed to leverage np.lib.stride_tricks.as_strided and thus avoid the need of actually rolling back. That's the whole idea!
Now, in terms of actual implementation we would use scikit-image's view_as_windows to elegantly use np.lib.stride_tricks.as_strided under the hoods. Thus, the final implementation would be -
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r):
# Concatenate with sliced to cover all rolls
a_ext = np.concatenate((a,a[:,:-1]),axis=1)
# Get sliding windows; use advanced-indexing to select appropriate ones
n = a.shape[1]
return viewW(a_ext,(1,n))[np.arange(len(r)), (n-r)%n,0]
Here's a sample run -
In [327]: A = np.array([[4, 0, 0],
...: [1, 2, 3],
...: [0, 0, 5]])
In [328]: r = np.array([2, 0, -1])
In [329]: strided_indexing_roll(A, r)
Out[329]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
Benchmarking
# #seberg's solution
def advindexing_roll(A, r):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
return A[rows, column_indices]
Let's do some benchmarking on an array with large number of rows and columns -
In [324]: np.random.seed(0)
...: a = np.random.rand(10000,1000)
...: r = np.random.randint(-1000,1000,(10000))
# #seberg's solution
In [325]: %timeit advindexing_roll(a, r)
10 loops, best of 3: 71.3 ms per loop
# Solution from this post
In [326]: %timeit strided_indexing_roll(a, r)
10 loops, best of 3: 44 ms per loop
In case you want more general solution (dealing with any shape and with any axis), I modified #seberg's solution:
def indep_roll(arr, shifts, axis=1):
"""Apply an independent roll for each dimensions of a single axis.
Parameters
----------
arr : np.ndarray
Array of any shape.
shifts : np.ndarray
How many shifting to use for each dimension. Shape: `(arr.shape[axis],)`.
axis : int
Axis along which elements are shifted.
"""
arr = np.swapaxes(arr,axis,-1)
all_idcs = np.ogrid[[slice(0,n) for n in arr.shape]]
# Convert to a positive shift
shifts[shifts < 0] += arr.shape[-1]
all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]
result = arr[tuple(all_idcs)]
arr = np.swapaxes(result,-1,axis)
return arr
I implement a pure numpy.lib.stride_tricks.as_strided solution as follows
from numpy.lib.stride_tricks import as_strided
def custom_roll(arr, r_tup):
m = np.asarray(r_tup)
arr_roll = arr[:, [*range(arr.shape[1]),*range(arr.shape[1]-1)]].copy() #need `copy`
strd_0, strd_1 = arr_roll.strides
n = arr.shape[1]
result = as_strided(arr_roll, (*arr.shape, n), (strd_0 ,strd_1, strd_1))
return result[np.arange(arr.shape[0]), (n-m)%n]
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
r = np.array([2, 0, -1])
out = custom_roll(A, r)
Out[789]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
By using a fast fourrier transform we can apply a transformation in the frequency domain and then use the inverse fast fourrier transform to obtain the row shift.
So this is a pure numpy solution that take only one line:
import numpy as np
from numpy.fft import fft, ifft
# The row shift function using the fast fourrier transform
# rshift(A,r) where A is a 2D array, r the row shift vector
def rshift(A,r):
return np.real(ifft(fft(A,axis=1)*np.exp(2*1j*np.pi/A.shape[1]*r[:,None]*np.r_[0:A.shape[1]][None,:]),axis=1).round())
This will apply a left shift, but we can simply negate the exponential exponant to turn the function into a right shift function:
ifft(fft(...)*np.exp(-2*1j...)
It can be used like that:
# Example:
A = np.array([[1,2,3,4],
[1,2,3,4],
[1,2,3,4]])
r = np.array([1,-1,3])
print(rshift(A,r))
Building on divakar's excellent answer, you can apply this logic to 3D array easily (which was the problematic that brought me here in the first place). Here's an example - basically flatten your data, roll it & reshape it after::
def applyroll_30(cube, threshold=25, offset=500):
flattened_cube = cube.copy().reshape(cube.shape[0]*cube.shape[1], cube.shape[2])
roll_matrix = calc_roll_matrix_flattened(flattened_cube, threshold, offset)
rolled_cube = strided_indexing_roll(flattened_cube, roll_matrix, cube_shape=cube.shape)
rolled_cube = triggered_cube.reshape(cube.shape[0], cube.shape[1], cube.shape[2])
return rolled_cube
def calc_roll_matrix_flattened(cube_flattened, threshold, offset):
""" Calculates the number of position along time axis we need to shift
elements in order to trig the data.
We return a 1D numpy array of shape (X*Y, time) elements
"""
# armax(...) finds the position in the cube (3d) where we are above threshold
roll_matrix = np.argmax(cube_flattened > threshold, axis=1) + offset
# ensure we don't have index out of bound
roll_matrix[roll_matrix>cube_flattened.shape[1]] = cube_flattened.shape[1]
return roll_matrix
def strided_indexing_roll(cube_flattened, roll_matrix_flattened, cube_shape):
# Concatenate with sliced to cover all rolls
# otherwise we shift in the wrong direction for my application
roll_matrix_flattened = -1 * roll_matrix_flattened
a_ext = np.concatenate((cube_flattened, cube_flattened[:, :-1]), axis=1)
# Get sliding windows; use advanced-indexing to select appropriate ones
n = cube_flattened.shape[1]
result = viewW(a_ext,(1,n))[np.arange(len(roll_matrix_flattened)), (n - roll_matrix_flattened) % n, 0]
result = result.reshape(cube_shape)
return result
Divakar's answer doesn't do justice to how much more efficient this is on large cube of data. I've timed it on a 400x400x2000 data formatted as int8. An equivalent for-loop does ~5.5seconds, Seberg's answer ~3.0seconds and strided_indexing.... ~0.5second.
This question already has answers here:
Convert NumPy array to 0 or 1 based on threshold
(2 answers)
Closed 4 years ago.
If I have a given 2d numpy array how can I efficiently make a mask of this array using 0s and 1s depending on where the values of this array are over a given threshold?
So far I made a working code that do this job like this:
import numpy as np
def maskedarray(data, threshold):
#creating an array of zeros:
zeros = np.zeros((np.shape(data)[0], np.shape(data)[1]))
#going over each index of the data
for i in range(np.shape(data)[0]):
for j in range(np.shape(data)[1]):
if data[i][j] > threshold:
zeros[i][j] = 1
return(zeros)
#creating a test array
test = np.random.rand(5,5)
#using the function above defined
mask = maskedarray(test,0.5)
I refuse myself to believe that there isn't a smarter way to do it without needing to use two nested FOR loops.
Thanks
The fastest way is simply:
def masked_array(data, threshold):
return (data > threshold).astype(int)
Example:
data = np.random.random((5,5))
threshold = 0.5
>>> data
array([[0.42966975, 0.94785801, 0.31750045, 0.75944551, 0.05430315],
[0.91475934, 0.65683185, 0.09019139, 0.85717157, 0.63074349],
[0.33160746, 0.82455941, 0.50801804, 0.81087228, 0.01561161],
[0.6932717 , 0.12741425, 0.17863726, 0.36682108, 0.95817187],
[0.88320599, 0.51243802, 0.90219452, 0.78954102, 0.96708252]])
>>> masked_array(data, threshold)
array([[0, 1, 0, 1, 0],
[1, 1, 0, 1, 1],
[0, 1, 1, 1, 0],
[1, 0, 0, 0, 1],
[1, 1, 1, 1, 1]])
I have a list of numpy vectors of the format:
[array([[-0.36314615, 0.80562619, -0.82777381, ..., 2.00876354,2.08571887, -1.24526026]]),
array([[ 0.9766923 , -0.05725135, -0.38505339, ..., 0.12187988,-0.83129255, 0.32003683]]),
array([[-0.59539878, 2.27166874, 0.39192573, ..., -0.73741573,1.49082653, 1.42466276]])]
here, only 3 vectors in the list are shown. I have 100s..
The maximum number of elements in one vector is around 10 million
All the arrays in the list have unequal number of elements but the maximum number of elements is fixed.
Is it possible to create a sparse matrix using these vectors in python such that I have zeros in place of elements for the vectors which are smaller than the maximum size?
Try this:
from scipy import sparse
M = sparse.lil_matrix((num_of_vectors, max_vector_size))
for i,v in enumerate(vectors):
M[i, :v.size] = v
Then take a look at this page: http://docs.scipy.org/doc/scipy/reference/sparse.html
The lil_matrix format is good for constructing the matrix, but you'll want to convert it to a different format like csr_matrix before operating on them.
In this approach you replace the elements below your thresold by 0 and then create a sparse matrix out of them. I am suggesting the coo_matrix since it is the fastest to convert to the other types according to your purposes. Then you can scipy.sparse.vstack() them to build your matrix accounting all elements in the list:
import scipy.sparse as ss
import numpy as np
old_list = [np.random.random(100000) for i in range(5)]
threshold = 0.01
for a in old_list:
a[np.absolute(a) < threshold] = 0
old_list = [ss.coo_matrix(a) for a in old_list]
m = ss.vstack( old_list )
A little convoluted, but I would probably do it like this:
>>> import scipy.sparse as sps
>>> a = [np.arange(5), np.arange(7), np.arange(3)]
>>> lens = [len(j) for j in a]
>>> cols = np.concatenate([np.arange(j) for j in lens])
>>> rows = np.concatenate([np.repeat(j, len_) for j, len_ in enumerate(lens)])
>>> data = np.concatenate(a)
>>> b = sps.coo_matrix((data,(rows, cols)))
>>> b.toarray()
array([[0, 1, 2, 3, 4, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 0, 0, 0, 0]])
How do I transform a Boolean array into an iterable of indexes?
E.g.,
import numpy as np
import itertools as it
x = np.array([1,0,1,1,0,0])
y = x > 0
retval = [i for i, y_i in enumerate(y) if y_i]
Is there a nicer way?
Try np.where or np.nonzero.
x = np.array([1, 0, 1, 1, 0, 0])
np.where(x)[0] # returns a tuple hence the [0], see help(np.where)
# array([0, 2, 3])
x.nonzero()[0] # in this case, the same as above.
See help(np.where) and help(np.nonzero).
Possibly worth noting that in the np.where page it mentions that for 1D x it's basically equivalent to your longform in the question.