perform numpy mean over matrix using labels as indicators - python

import numpy as np
arr = np.random.random((5, 3))
labels = [1, 1, 2, 2, 3]
arr
Out[136]:
array([[0.20349907, 0.1330621 , 0.78268978],
[0.71883378, 0.24783927, 0.35576746],
[0.17760916, 0.25003952, 0.29058267],
[0.90379712, 0.78134806, 0.49941208],
[0.08025936, 0.01712403, 0.53479622]])
labels
Out[137]: [1, 1, 2, 2, 3]
assume I have this dataset.
I would like, using the labels as indicators, to perform np.mean over the rows.
(The labels here indicates the class of each row.
labels could also be [0, 1, 1, 0, 4, 1, 4] So have no assumptions over them.)
So the output here will be an average over the:
1st and 2nd row.
3rd and 4th row.
5th row.
in the most efficient way numpy offers. like so:
[np.mean(arr[:2], axis=0),
np.mean(arr[2:4], axis=0),
np.mean(arr[4:], axis=0)]
Out[180]:
[array([0.46116642, 0.19045069, 0.56922862]),
array([0.54070314, 0.51569379, 0.39499737]),
array([0.08025936, 0.01712403, 0.53479622])]
(in real life scenario the matrix dimensions could be (100000, 256))

First we would like to sort our label and matrix:
labels = np.array(labels)
# Getting the indices of a sorted array
sorted_indices = np.argsort(labels)
# Use the indices to sort both labels and matrix
sorted_labels = labels[sorted_indices]
sorted_matrix = matrix[sorted_indices]
Then, we calculate the "steps" or pairs of indices, (from, to) we want to calculate average over, We sum them and divide by their count.
# Here we're getting the amount of rows per label to average (over the sorted_matrix).
# Infact, we're getting the start and end indices per label.
label_indices = np.concatenate(([0], np.where(np.diff(sorted_labels) != 0)[0] + 1, [len(sorted_labels)]))
# using add + reduceat to add all rows with regard to the label indices
group_sums = np.add.reduceat(sorted_matrix, label_indices[:-1], axis=0)
# getting count for each group using the diff in label_indices
group_counts = np.diff(label_indices)
# Calculating the mean
group_means = group_sums / group_counts[:, np.newaxis]
Example:
matrix
Out[265]:
array([[0.69524902, 0.22105336, 0.65631557, 0.54823511, 0.25248685],
[0.61675048, 0.45973729, 0.22410694, 0.71403135, 0.02391662],
[0.02559926, 0.41640708, 0.27931808, 0.29139379, 0.76402121],
[0.27166955, 0.79121862, 0.23512671, 0.32568048, 0.38712154],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.01685432, 0.8395658 , 0.73460083, 0.08056013, 0.02522956],
[0.27274409, 0.64602305, 0.05698037, 0.23214598, 0.75130743],
[0.65069115, 0.32383729, 0.86316629, 0.69659358, 0.26667206],
[0.91971818, 0.02011127, 0.91776206, 0.79474582, 0.39678431],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
labels
Out[266]: array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])
group_means
Out[267]:
array([[0.33377274, 0.58170155, 0.79888356, 0.38857686, 0.14595081],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.40602051, 0.36084713, 0.41802017, 0.43942853, 0.63737099],
[0.52788969, 0.49066976, 0.37184974, 0.52931565, 0.221175 ],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
and the results are suited for: np.unique(sorted_labels)
np.unique(sorted_labels)
Out[271]: array([0, 1, 2, 3, 5])

I did not understand the labels part in your question. but there is a way to calculate the mean of each row in a matrix.
use --> np.mean(arr, axis = 1).
If lables to be used, please go through below mentioned script.
import numpy as np
arr = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[1,2,3],
[4,5,6]])
labels =np.array([0, 1, 1, 0, 4])
#print(arr)
#print('LABEL IS :', labels)
#print('MEAN VALUES ARE : ',np.mean(arr[:2], axis = 1))
id = labels.argsort()
eq_lal = labels[id]
print(eq_lal)
print(arr[eq_lal])
print(np.mean(arr[eq_lal], axis = 1))

Related

Complex indexing of a multidimensional array with indices of lower dimensional arrays in Python

Problem:
I have a numpy array of 4 dimensions:
x = np.arange(1000).reshape(5, 10, 10, 2 )
If we print it:
I want to find the indices of the 6 largest values of the array in the 2nd axis but only for the 0th element in the last axis (red circles in the image):
indLargest2ndAxis = np.argpartition(x[...,0], 10-6, axis=2)[...,10-6:]
These indices have a shape of (5,10,6) as expected.
I want to obtain the values of the array for these indices in the 2nd axis but now for the 1st element in the last axis (yellow circles in the image). They should have a shape of (5,10,6). Without vectorizing, this could be done with:
np.array([ [ [ x[i, j, k, 1] for k in indLargest2ndAxis[i,j]] for j in range(10) ] for i in range(5) ])
However, I would like to achieve it vectorizing. I tried indexing with:
x[indLargest2ndAxis, 1]
But I get IndexError: index 5 is out of bounds for axis 0 with size 5. How can I manage this indexing combination in a vectorized way?
Ah, I think I now get what you are after. Fancy indexing is documented here in detail. Be warned though that - in its full generality - this is quite heavy stuff. In a nutshell, fancy indexing allows you to take elements from a source array (according to some idx) and place them into a new array (fancy indexing allways returns a copy):
source = np.array([10.5, 21, 42])
idx = np.array([0, 1, 2, 1, 1, 1, 2, 1, 0])
# this is fancy indexing
target = source[idx]
expected = np.array([10.5, 21, 42, 21, 21, 21, 42, 21, 10.5])
assert np.allclose(target, expected)
What is nice about this is that you can control the shape of the resulting array using the shape of the index array:
source = np.array([10.5, 21, 42])
idx = np.array([[0, 1], [1, 2]])
target = source[idx]
expected = np.array([[10.5, 21], [21, 42]])
assert np.allclose(target, expected)
assert target.shape == (2,2)
Where things get a little more interesting is if source has more than one dimension. In this case, you need to specify the indices of each axis so that numpy knows which elements to take:
source = np.arange(4).reshape(2,2)
idxA = np.array([0, 1])
idxB = np.array([0, 1])
# this will take (0,0) and (1,1)
target = source[idxA, idxB]
expected = np.array([0, 3])
assert np.allclose(target, expected)
Observe that, again, the shape of target matches the shape of the index used. What is awesome about fancy indexing is that index shapes are broadcasted if necessary:
source = np.arange(4).reshape(2,2)
idxA = np.array([0, 0, 1, 1]).reshape((4,1))
idxB = np.array([0, 1]).reshape((1,2))
target = source[idxA, idxB]
expected = np.array([[0, 1],[0, 1],[2, 3],[2, 3]])
assert np.allclose(target, expected)
At this point, you can understand where your exception comes from. Your source.ndim is 4; however, you try to index it with a 2-tuple (indLargest2ndAxis, 1). Numpy will interpret this as you trying to index the first axis using indLargest2ndAxis, the second axis using 1, and all other axis using :. Clearly, this doesn't work. All values of indLargest2ndAxis would have to be between 0 and 4 (inclusive), since they would have to refer to positions along the first axis of x.
What my suggestion of x[..., indLargest2ndAxis, 1] does is tell numpy that you wish to index the last two axes of x, i.e., you wish to index the third axis using indLargest2ndAxis, the fourth axis using 1, and : for anything else.
This will produce a result since all elements of indLargest2ndAxis are in [0, 10), but will produce a shape of (5, 10, 5, 10, 6) (which is not what you want). Being a bit hand-wavy, the first part of the shape (5, 10) comes from the ellipsis (...), aka. select everything, the middle part (5, 10, 6) comes from indLargest2ndAxis selecting elements along the third axis of x according to the shape of indLargest2ndAxis and the final part (which you don't see because it is squeezed) comes from selecting index 1 along the fourth axis.
Moving on to your actual problem, you can entirely dodge the fancy indexing bullet and do the following:
x = np.arange(1000).reshape(5, 10, 10, 2)
order = x[..., 0]
values = x[..., 1]
idx = np.argpartition(order, 4)[..., 4:]
result = np.take_along_axis(values, idx, axis=-1)
Edit: Of course, you can also use fancy indexing; however, it is more cryptic and doesn't scale as nicely to different shapes:
x = np.arange(1000).reshape(5, 10, 10, 2)
indLargest2ndAxis = np.argpartition(x[..., 0], 4)[..., 4:]
result = x[np.arange(5)[:, None, None], np.arange(10)[None, :, None], indLargest2ndAxis, 1]

How to concatenate an array into first position of n-d array in numpy?

I need to add extra element to first position of n-d array in numpy.
Here is the code:
tmp_boxes3d = [None,6]
tmp_scores = [None]
Now I need to add tmp_scores element into first position of tmp_boxes3d.
For better understanding
boxes = np.array([[1,2,3,4],[5,6,7,8]])
scores = np.array([0,0])
boxes_scores = [[0,1,2,3,4],[0,5,6,7,8]] # result
So I need to do for [None,6] shape.
Can anyone help me with this?
Try np.concatenate
boxes = np.array([[1,2,3,4],[5,6,7,8]])
scores = np.array([0,0])
np.concatenate((scores[:, np.newaxis], boxes), axis=1)
array([[0, 1, 2, 3, 4],
[0, 5, 6, 7, 8]])
Or, np.hstack
np.hstack((scores[:, np.newaxis], boxes))

Measure correlation without counting some values

I have an array:
a = np.array([[1,2,3], [0,0,3], [1,2,0],[0,2,3]])
which looks like:
array([[1, 2, 3],
[0, 0, 3],
[1, 2, 0],
[0, 2, 3]])
I need to calculate paired correlations, but without taking 0s in considerations. So, for example correlations between "1" and "2" should be calculated between arrays:
array([[1, 2],
[1, 2]])
Problem: Numpy and pandas method will consider zeros and i can't remind them.
So, I need a faster, willingly built-n method for this.
Though, i wrote mine algorithm, but it works really slow on large arrays.
correlations = np.zeros((1000,1000))
for i, column_i in enumerate(np.transpose(array_data)):
for j, column_j in enumerate(np.transpose(array_data[:,i+1:])):
if i != j:
column_i = np.reshape(column_i,(column_i.shape[0], 1))
column_j = np.reshape(column_j,(column_j.shape[0], 1))
values = np.concatenate([column_i, column_j],axis=1)
values = [row for row in values if (row[0] != 0) & (row[1] != 0)]
values = np.array(values)
correlation = np.corrcoef(values[:,0], values[:,1])[0][1]
correlations[i,j] = correlation
Actually, i decieded to change all zeros in data to np.nan
for i,e_i in enumerate(array_data):
for j, e_j in enumerate(e_i):
if e_j == 0:
array_data[i,j] = np.NaN
and then, pandas.corr() worked fine...

Numpy array multiple mask

Trying to slice and average a numpy array multiple times, based on an integer mask array:
i.e.
import numpy as np
data = np.arange(11)
mask = np.array([0, 1, 1, 1, 0, 2, 2, 3, 3, 3, 3])
results = list()
for maskid in range(1,4):
result = np.average(data[mask==maskid])
results.append(result)
output = np.array(result)
Is there a way to do this faster, aka without the "for" loop?
One approach using np.bincount -
np.bincount(mask, data)/np.bincount(mask)
Another one with np.unique for a generic case when the elements in mask aren't necessarily sequential starting from 0 -
_,ids, count = np.unique(mask, return_inverse=1, return_counts=1)
out = np.bincount(ids, data)/count

Computing average for numpy array

I have a 2d numpy array (6 x 6) elements. I want to create another 2D array out of it, where each block is the average of all elements within a blocksize window. Currently, I have the foll. code:
import os, numpy
def avg_func(data, blocksize = 2):
# Takes data, and averages all positive (only numerical) numbers in blocks
dimensions = data.shape
height = int(numpy.floor(dimensions[0]/blocksize))
width = int(numpy.floor(dimensions[1]/blocksize))
averaged = numpy.zeros((height, width))
for i in range(0, height):
print i*1.0/height
for j in range(0, width):
block = data[i*blocksize:(i+1)*blocksize,j*blocksize:(j+1)*blocksize]
if block.any():
averaged[i][j] = numpy.average(block[block>0])
return averaged
arr = numpy.random.random((6,6))
avgd = avg_func(arr, 3)
Is there any way I can make it more pythonic? Perhaps numpy has something which does it already?
UPDATE
Based on M. Massias's soln below, here is an update with fixed values replaced by variables. Not sure if it is coded right. it does seem to work though:
dimensions = data.shape
height = int(numpy.floor(dimensions[0]/block_size))
width = int(numpy.floor(dimensions[1]/block_size))
t = data.reshape([height, block_size, width, block_size])
avrgd = numpy.mean(t, axis=(1, 3))
To compute some operation slice by slice in numpy, it is very often useful to reshape your array and use extra axes.
To explain the process we'll use here: you can reshape your array, take the mean, reshape it again and take the mean again.
Here I assume blocksize is 2
t = np.array([[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],])
t = t.reshape([6, 3, 2])
t = np.mean(t, axis=2)
t = t.reshape([3, 2, 3])
np.mean(t, axis=1)
outputs
array([[ 0.5, 2.5, 4.5],
[ 0.5, 2.5, 4.5],
[ 0.5, 2.5, 4.5]])
Now that it's clear how this works, you can do it in one pass only:
t = t.reshape([3, 2, 3, 2])
np.mean(t, axis=(1, 3))
works too (and should be quicker since means are computed only once - I guess). I'll let you substitute height/blocksize, width/blocksize and blocksize accordingly.
See #askewcan nice remark on how to generalize this to any dimension.

Categories

Resources