Related
import numpy as np
arr = np.random.random((5, 3))
labels = [1, 1, 2, 2, 3]
arr
Out[136]:
array([[0.20349907, 0.1330621 , 0.78268978],
[0.71883378, 0.24783927, 0.35576746],
[0.17760916, 0.25003952, 0.29058267],
[0.90379712, 0.78134806, 0.49941208],
[0.08025936, 0.01712403, 0.53479622]])
labels
Out[137]: [1, 1, 2, 2, 3]
assume I have this dataset.
I would like, using the labels as indicators, to perform np.mean over the rows.
(The labels here indicates the class of each row.
labels could also be [0, 1, 1, 0, 4, 1, 4] So have no assumptions over them.)
So the output here will be an average over the:
1st and 2nd row.
3rd and 4th row.
5th row.
in the most efficient way numpy offers. like so:
[np.mean(arr[:2], axis=0),
np.mean(arr[2:4], axis=0),
np.mean(arr[4:], axis=0)]
Out[180]:
[array([0.46116642, 0.19045069, 0.56922862]),
array([0.54070314, 0.51569379, 0.39499737]),
array([0.08025936, 0.01712403, 0.53479622])]
(in real life scenario the matrix dimensions could be (100000, 256))
First we would like to sort our label and matrix:
labels = np.array(labels)
# Getting the indices of a sorted array
sorted_indices = np.argsort(labels)
# Use the indices to sort both labels and matrix
sorted_labels = labels[sorted_indices]
sorted_matrix = matrix[sorted_indices]
Then, we calculate the "steps" or pairs of indices, (from, to) we want to calculate average over, We sum them and divide by their count.
# Here we're getting the amount of rows per label to average (over the sorted_matrix).
# Infact, we're getting the start and end indices per label.
label_indices = np.concatenate(([0], np.where(np.diff(sorted_labels) != 0)[0] + 1, [len(sorted_labels)]))
# using add + reduceat to add all rows with regard to the label indices
group_sums = np.add.reduceat(sorted_matrix, label_indices[:-1], axis=0)
# getting count for each group using the diff in label_indices
group_counts = np.diff(label_indices)
# Calculating the mean
group_means = group_sums / group_counts[:, np.newaxis]
Example:
matrix
Out[265]:
array([[0.69524902, 0.22105336, 0.65631557, 0.54823511, 0.25248685],
[0.61675048, 0.45973729, 0.22410694, 0.71403135, 0.02391662],
[0.02559926, 0.41640708, 0.27931808, 0.29139379, 0.76402121],
[0.27166955, 0.79121862, 0.23512671, 0.32568048, 0.38712154],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.01685432, 0.8395658 , 0.73460083, 0.08056013, 0.02522956],
[0.27274409, 0.64602305, 0.05698037, 0.23214598, 0.75130743],
[0.65069115, 0.32383729, 0.86316629, 0.69659358, 0.26667206],
[0.91971818, 0.02011127, 0.91776206, 0.79474582, 0.39678431],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
labels
Out[266]: array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])
group_means
Out[267]:
array([[0.33377274, 0.58170155, 0.79888356, 0.38857686, 0.14595081],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.40602051, 0.36084713, 0.41802017, 0.43942853, 0.63737099],
[0.52788969, 0.49066976, 0.37184974, 0.52931565, 0.221175 ],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
and the results are suited for: np.unique(sorted_labels)
np.unique(sorted_labels)
Out[271]: array([0, 1, 2, 3, 5])
I did not understand the labels part in your question. but there is a way to calculate the mean of each row in a matrix.
use --> np.mean(arr, axis = 1).
If lables to be used, please go through below mentioned script.
import numpy as np
arr = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[1,2,3],
[4,5,6]])
labels =np.array([0, 1, 1, 0, 4])
#print(arr)
#print('LABEL IS :', labels)
#print('MEAN VALUES ARE : ',np.mean(arr[:2], axis = 1))
id = labels.argsort()
eq_lal = labels[id]
print(eq_lal)
print(arr[eq_lal])
print(np.mean(arr[eq_lal], axis = 1))
Note that this question is not about multiple conditions within a single np.where(), see this thread for that.
I have a numpy array arr1 with some numbers (without a particular structure):
arr0 = \
np.array([[0,3,0],
[1,3,2],
[1,2,0]])
and a list of all the entries in this array:
entries = [0,1,2,3]
I also have another array, arr1:
arr1 = \
np.array([[4,5,6],
[6,2,4],
[3,7,9]])
I would like to perform some function on multiple subsets of elements of arr1. A subset consts of numbers which are at the same position as arr0 entries with a cetrain value. Let this function be finding the max value. Performing the function on each subset via a list comprehension:
res = [np.where(arr0==index,arr1,0).max() for index in entries]
res is [9, 6, 7, 5]
As expected: 0 in arr0 is on the top left, top right, bottom right corner, and the biggest number from the top left, top right, bottom right entries of arr1 (ie 4, 6, 9) is 9. Rest follow with a similar logic.
How can I achieve this without iteration?
My actual arrays are much bigger than these examples.
With broadcasting
res = np.where(arr0[...,None] == entries, arr1[...,None], 0).max(axis=(0, 1))
The result of np.where(...) is a (3, 3, 4) array, where slicing [...,0] would give you the same 3x3 array you get by manually doing the np.where with just entries[0], etc. Then taking the max of each 3x3 subarray leaves you with the desired result.
Timings
Apparently this method doesn't scale well for bigger arrays. The other answer using np.unique is more efficient because it reduces the maximum operation down to a few unique value regardless of how big the original arrays are.
import timeit
import matplotlib.pyplot as plt
import numpy as np
def loops():
return [np.where(arr0==index,arr1,0).max() for index in entries]
def broadcast():
return np.where(arr0[...,None] == entries, arr1[...,None], 0).max(axis=(0, 1))
def numpy_1d():
arr0_1D = arr0.ravel()
arr1_1D = arr1.ravel()
arg_idx = np.argsort(arr0_1D)
u, idx = np.unique(arr0_1D[arg_idx], return_index=True)
return np.maximum.reduceat(arr1_1D[arg_idx], idx)
sizes = (3, 10, 25, 50, 100, 250, 500, 1000)
lengths = (4, 10, 25, 50, 100)
methods = (loops, broadcast, numpy_1d)
fig, ax = plt.subplots(len(lengths), sharex=True)
for i, M in enumerate(lengths):
entries = np.arange(M)
times = [[] for _ in range(len(methods))]
for N in sizes:
arr0 = np.random.randint(1000, size=(N, N))
arr1 = np.random.randint(1000, size=(N, N))
for j, method in enumerate(methods):
times[j].append(np.mean(timeit.repeat(method, number=1, repeat=10)))
for t in times:
ax[i].plot(sizes, t)
ax[i].legend(['loops', 'broadcasting', 'numpy_1d'])
ax[i].set_title(f'Entries size {M}')
plt.xticks(sizes)
fig.text(0.5, 0.04, 'Array size (NxN)', ha='center')
fig.text(0.04, 0.5, 'Time (s)', va='center', rotation='vertical')
plt.show()
It's more convenient to work in 1D case. You need to sort your arr0 then find starting indices for every group and use np.maximum.reduceat.
arr0_1D = np.array([[0,3,0],[1,3,2],[1,2,0]]).ravel()
arr1_1D = np.array([[4,5,6],[6,2,4],[3,7,9]]).ravel()
arg_idx = np.argsort(arr0_1D)
>>> arr0_1D[arg_idx]
array([0, 0, 0, 1, 1, 2, 2, 3, 3])
u, idx = np.unique(arr0_1D[arg_idx], return_index=True)
>>> idx
array([0, 3, 5, 7], dtype=int64)
>>> np.maximum.reduceat(arr1_1D[arg_idx], idx)
array([9, 6, 7, 5], dtype=int32)
I'm trying to mask bad pixels in a dataset taken from a detector. In my attempt to come up with a general way to do this so I can run the same code across different images, I tried a few different methods, but none of them ended up working. I'm pretty new with coding and data analysis in Python, so I could use a hand putting things in terms that the computer will understand.
As an example, consider the matrix
A = np.array([[3,5,50],[30,2,6],[25,1,1]])
What I'm wanting to do is set any element in A that is two standard deviations away from the mean equal to zero. The reason for this is that later in the code, I'm defining a function that only uses the nonzero values for the calculation, since the zeros are part of the mask.
I know this masking technique works, but I tried extending the following code to work with the standard deviation:
mask = np.ones(np.shape(A))
mask.flat[A.flat > 20] = 0
What I tried was:
mask = np.ones(np.shape(A))
for i,j in A:
mask.flat[A[i,j] - 2*np.std(A) < np.mean(A) < A[i,j] + 2*np.std(A)] = 0
Which throws the error:
ValueError: too many values to unpack (expected 2)
If anyone has a better technique to statistically remove bad pixels in an image, I'm all ears. Thanks for the help!
==========
EDIT
After some trial and error, I got to a place that could help clarify my question. The new code is:
for i in A:
for j in i:
mask.flat[ j - 2*np.std(A) < np.mean(A) < j + 2*np.std(A)] = 0
This throws an error saying 'unsupported iterator index'. What I'm wanting to happen is that the for loop iterates across each element in the array, checks if it's less/greater than 2 standard deviations from the mean, and it is, sets it to zero.
Here is an approach that will be sligthly faster on larger images:
import numpy as np
import matplotlib.pyplot as plt
# generate dummy image
a = np.random.randint(1,5, (5,5))
# generate dummy outliers
a[4,4] = 20
a[2,3] = -6
# initialise mask
mask = np.ones_like(a)
# subtract mean and normalise to standard deviation.
# then any pixel in the resulting array that has an absolute value > 2
# is more than two standard deviations away from the mean
cond = (a-np.mean(a))/np.std(a)
# find those pixels and set them to zero.
mask[abs(cond) > 2] = 0
Inspection:
a
array([[ 1, 1, 3, 4, 2],
[ 1, 2, 4, 1, 2],
[ 1, 4, 3, -6, 1],
[ 2, 2, 1, 3, 2],
[ 4, 1, 3, 2, 20]])
np.round(cond, 2)
array([[-0.39, -0.39, 0.11, 0.36, -0.14],
[-0.39, -0.14, 0.36, -0.39, -0.14],
[-0.39, 0.36, 0.11, -2.12, -0.39],
[-0.14, -0.14, -0.39, 0.11, -0.14],
[ 0.36, -0.39, 0.11, -0.14, 4.32]])
mask
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 0]])
You A is three dimensional so you need to unpack using three variables like below.
A = np.array([[3,5,50],[30,2,6],[25,1,1]])
for i in A:
for j in i:
print(j)
Trying to slice and average a numpy array multiple times, based on an integer mask array:
i.e.
import numpy as np
data = np.arange(11)
mask = np.array([0, 1, 1, 1, 0, 2, 2, 3, 3, 3, 3])
results = list()
for maskid in range(1,4):
result = np.average(data[mask==maskid])
results.append(result)
output = np.array(result)
Is there a way to do this faster, aka without the "for" loop?
One approach using np.bincount -
np.bincount(mask, data)/np.bincount(mask)
Another one with np.unique for a generic case when the elements in mask aren't necessarily sequential starting from 0 -
_,ids, count = np.unique(mask, return_inverse=1, return_counts=1)
out = np.bincount(ids, data)/count
I have a 2d numpy array (6 x 6) elements. I want to create another 2D array out of it, where each block is the average of all elements within a blocksize window. Currently, I have the foll. code:
import os, numpy
def avg_func(data, blocksize = 2):
# Takes data, and averages all positive (only numerical) numbers in blocks
dimensions = data.shape
height = int(numpy.floor(dimensions[0]/blocksize))
width = int(numpy.floor(dimensions[1]/blocksize))
averaged = numpy.zeros((height, width))
for i in range(0, height):
print i*1.0/height
for j in range(0, width):
block = data[i*blocksize:(i+1)*blocksize,j*blocksize:(j+1)*blocksize]
if block.any():
averaged[i][j] = numpy.average(block[block>0])
return averaged
arr = numpy.random.random((6,6))
avgd = avg_func(arr, 3)
Is there any way I can make it more pythonic? Perhaps numpy has something which does it already?
UPDATE
Based on M. Massias's soln below, here is an update with fixed values replaced by variables. Not sure if it is coded right. it does seem to work though:
dimensions = data.shape
height = int(numpy.floor(dimensions[0]/block_size))
width = int(numpy.floor(dimensions[1]/block_size))
t = data.reshape([height, block_size, width, block_size])
avrgd = numpy.mean(t, axis=(1, 3))
To compute some operation slice by slice in numpy, it is very often useful to reshape your array and use extra axes.
To explain the process we'll use here: you can reshape your array, take the mean, reshape it again and take the mean again.
Here I assume blocksize is 2
t = np.array([[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],[0, 1, 2, 3, 4, 5],])
t = t.reshape([6, 3, 2])
t = np.mean(t, axis=2)
t = t.reshape([3, 2, 3])
np.mean(t, axis=1)
outputs
array([[ 0.5, 2.5, 4.5],
[ 0.5, 2.5, 4.5],
[ 0.5, 2.5, 4.5]])
Now that it's clear how this works, you can do it in one pass only:
t = t.reshape([3, 2, 3, 2])
np.mean(t, axis=(1, 3))
works too (and should be quicker since means are computed only once - I guess). I'll let you substitute height/blocksize, width/blocksize and blocksize accordingly.
See #askewcan nice remark on how to generalize this to any dimension.