Numpy: mean calculation results in nan values - python

I have an array of values x:
x=numpy.array([[-0.11361818 -0.113618185 -0.98787775 -0.09719566],
[-0.11361818 -0.04173076 -0.98787775 -0.09719566],
[-0.11361818 -0.04173076 -0.98787775 -0.09719566],
[-0.62610493 -0.71682393 -0.24673653 -0.18242028],
[-0.62584854 -0.71613061 -0.24904998 -0.18287883],
[-0.62538661 -0.71551038 -0.25160676 -0.18338629]])
and an array of corresponding classes labels y:
y=numpy.array([1, 1, 2, 3, 4, 4])
The first class label 1 in y belongs to the first row in array x, the second class label 1 in y belongs to the second row in array x and so on.
Now I want to calculate the mean values for each class 1-4. For example, row 1 and 2 in x both belong to class 1, so I calculate the mean of row 1 and 2.
I have the following code:
means = numpy.array([x[y == i].mean(axis=0) for i in xrange(4)])
When I do this I end up with this result:
array([[ nan],
[-1.27636606],
[-1.24042235],
[-1.77208567]])
If I take xrange(6), I have this result:
array([[ nan],
[-1.27636606],
[-1.24042235],
[-1.77208567],
[-1.774899 ],
[ nan]])
Why is this the case and how do I get rid off the nans and end up with my 4 mean values only?
I have the code from here, where they took the number of classes as argument for xrange(), and I don't quite see what I did differently.
Thanks in advance for your help!

xrange(4) results in the values [0, 1, 2, 3]. Your first value in means is nan because you don't have a y value equal to zero.
Instead, do:
In [49]: means = numpy.array([x[y == i].mean(axis=0) for i in xrange(1, 5)])
In [50]: means
Out[50]:
array([[-1.27636606],
[-1.24042235],
[-1.77208567],
[-1.774899 ]])

Related

Outer product calculation by numpy einsum

I am trying to dive into the einsum notation. This question and answers have helped me a lot.
But now I can't grasp the machinery of the einsum when calculating outer product:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
np.einsum('i,j->ij', x, y)
array([[ 4., 5., 6.],
[ 8., 10., 12.],
[12., 15., 18.]])
That answer gives a following rule:
By repeating the label i in both input arrays, we are telling einsum
that these two axes should be multiplied together.
I can't understand how this multiplication happened if we hadn't provided any repeated axis label in np.einsum('i,j->ij', x, y)?
Could you please give a steps that np.einsum took in this example?
Or maybe more broader question how einsum works when no matching axis labels are given?
In the output of np.einsum('i,j->ij', x, y), element [i,j] is simply the product of element i in x and element j in y. In other words, np.einsum('i,j->ij', x, y)[i,j] = x[i]*y[j].
Compare it to np.einsum('i,i->i', x, y) were element i of output is x[i]*y[i]:
np.einsum('i,i->i', x, y)
[ 4 10 18]
And if a label in input is missing in output, it means the output has calculated the sum along the missing labels axis. Here is a simple example:
np.einsum('i,j->i', x, y)
[15 30 45]
Here the label j in input is missing in output, which is equivalent to summation along axis=1 (corresponding to label j):
np.sum(np.einsum('i,j->ij', x, y), axis=1)
[15 30 45]
In general, you can understand the einsum first by know exactly the dimension or the shape of the input and output the einsum notation is expected and calculating.
To facilitate the explanation, let say x.shape = [3] , y.shape = [4]
x = np.array([1, 2, 3])
y = np.array([4, 5, 6, 7])
np.einsum('i,j->ij', x, y)
array([[ 4, 5, 6, 7],
[ 8, 10, 12, 14],
[12, 15, 18, 21]])
Dimensionality
For cross product np.einsum('i,j->ij', x, y), the 1st input is a single character i. You can think the number of character is the number of dimenion of that input. So here the first input x has only 1 dimension. Same for j, the 2nd input is also just one character j so it has only 1 dimension. Lastly the output ij has 2 characters, so it has 2 dimension and that dimension must be [3,4], because the number of element in the first input is i which has 3 elements, and the number of element in the 2nd input is j which has 4 elements.
Each Element in the result array
Then, you focus will be on the result notation ij. Now we know that it is a 2D array, or a 3 by 4 matrix, ij is talking about how does ONE element calculated in the location of i row j column. Element must be calculated from product of inputs. Here means that particular element on location [i,j] is the product of input a of it's location i and input b of it's location j
So, element on location [0,0] is the calculated by taking 1st input location 0, which is your x[0] = 1,and the 2nd input location [0], which is y[0] = 4, the result of that ONE element [0,0] = 1 * 4 = 4.
Same, element on the result location [2,3] is taking the x[2] and y[3] = 3 * 7 = 21
In short, think ij of i,j->ij to be i times j of that ONE element of a result of 2 dimensions (because of 2 characters). The actually element you take from input i and input j is according to the location index of ij
You can find the transpose of the outer product in one line
That means, the transpose of the outer product is simply as i,j->ji. Here, we have two characters in the result so it is a 2D array. The number of element of the 1st dimention must be size of j, because it come first. and it is the 2nd input which has 4 elements. Same logic for the 2nd dimension so we know that the resulting array is the shape of (4,3).
Then, the ONE element at location of [3,2] of the result 2D array, is ji, meaning input j times input i, so it is the element 3 of j = y[3] = 7 , and the element 2 of i = x[2] = 3. The result is 7 * 3 = 21
Hence, the result is
np.einsum('i,j->ji', x, y)
array([[ 4, 8, 12],
[ 5, 10, 15],
[ 6, 12, 18],
[ 7, 14, 21]])

Convert numpy array with values into array with frequency for each observation in each row

I have a numpy array as follows:
array = np.random.randint(6, size=(50, 400))
This array has the cluster that each value belongs to, with each row representing a sample and each column representing a feature, but I would like to create a 5 dimensional array with the frequency of each cluster (in each sample, represented as a row in this matrix).
However, in the frequency calculation, I want to ignore 0, meaning that the frequency of all values except 0 (1-5) should add to 1.
Essentially what I want is a array with each row being a cluster (1-5) in this case, and each row still contains a single sample.
How can this be done?
Edit:
small input:
input = np.random.randint(6, size=(2, 5))
array([[0, 4, 2, 3, 0],
[5, 5, 2, 5, 3]])
output:
1 2 3 4 5
0 .33 .33 .33 0
0 .2 .2 0 .6
Where 1-5 are the row names, and the bottom two rows are the desired output in a numpy array.
This is a simple application of bincount. Does this do what you want?
def freqs(x):
counts = np.bincount(x, minlength=6)[1:]
return counts/counts.sum()
frequencies = np.apply_along_axis(freqs, axis=1, arr=array)
If you were wondering about the speed implications of apply_along_axis, this method using tricky indexing is marginally slower in my tests:
counts = (array[:, :, None] == values[None, None, :]).sum(axis=1)
frequencies2 = counts/counts.sum(axis=1)[:, None]

Find the indices of the lowest closest neighbors between two lists in python

Given 2 numpy arrays of unequal size: A (a presorted dataset) and B (a list of query values). I want to find the closest "lower" neighbor in array A to each element of array B. Example code below:
import numpy as np
A = np.array([0.456, 2.0, 2.948, 3.0, 7.0, 12.132]) #pre-sorted dataset
B = np.array([1.1, 1.9, 2.1, 5.0, 7.0]) #query values, not necessarily sorted
print A.searchsorted(B)
# RESULT: [1 1 2 4 4]
# DESIRED: [0 0 1 3 4]
In this example, B[0]'s closest neighbors are A[0] and A[1]. It is closest to A[1], which is why searchsorted returns index 1 as a match, but what i want is the lower neighbor at index 0. Same for B[1:4], and B[4] should be matched with A[4] because both values are identical.
I could do something clunky like this:
desired = []
for b in B:
id = -1
for a in A:
if a > b:
if id == -1:
desired.append(0)
else:
desired.append(id)
break
id+=1
print desired
# RESULT: [0, 0, 1, 3, 4]
But there's gotta be a prettier more concise way to write this with numpy. I'd like to keep my solution in numpy because i'm dealing with large data sets, but i'm open to other options.
You can introduce the optional argument side and set it to 'right' as mentioned in the docs. Then, subtract the final indices result by 1 for the desired output, like so -
A.searchsorted(B,side='right')-1
Sample run -
In [63]: A
Out[63]: array([ 0.456, 2. , 2.948, 3. , 7. , 12.132])
In [64]: B
Out[64]: array([ 1.1, 1.9, 2.1, 5. , 7. ])
In [65]: A.searchsorted(B,side='right')-1
Out[65]: array([0, 0, 1, 3, 4])
In [66]: A.searchsorted(A,side='right')-1 # With itself
Out[66]: array([0, 1, 2, 3, 4, 5])
Here's one way to do this. np.argmax stops at the first True it encounters, so as long as A is sorted this provides the desired result.
[np.argmax(A>b)-1 for b in B]
Edit: I got the inequality wrong initially, it works now.

Sum identical elements of an array based on their indicies

How would be elegant solution for summing all 2's from an array based on their indices?
I have this array x = [2 2 2 3 2 2 2 2 3 3 2 3 2 2 3 3 2]
Then I found their positions with
y = where(isclose(x,2))
and get another array like this y = (array([ 0, 1, 2, 4, 5, 6, 7, 10, 12, 13, 16])
So how I can use with numpy to calculate sum of elements in x based on indices in y.
You can simply use a simple indexing to get the corresponding items then use np.sum:
>>> np.sum(x[np.where(x==2)[0]])
22
Also note that you don't need alclose within np.where you can just use x=2.And as said in comment this is not the proper way to doing this task if this is the only problem.
You don't need to use np.where for this - an array of booleans, like the one returned by np.isclose or the various comparison operators, works as an index to another array (provided the sizes match). This means you get all of the 2's with:
>>> x[np.isclose(x, 2)]
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
And sum them directly:
>>> x[np.isclose(x, 2)].sum()
22
If x contains only non-negative ints, you could sum the occurrences of each value with
total = np.bincount(x, weights=x)
# array([ 0., 0., 22., 18.])
The value of total[2] is 22 since there are 11 two's in x.
The value of total[3] is 18 since there are 3 three's in x.

Finding the row with the highest average in a numpy array

Given the following array:
complete_matrix = numpy.array([
[0, 1, 2, 4],
[1, 0, 3, 5],
[2, 3, 0, 6],
[4, 5, 6, 0]])
I would like to identify the row with the highest average, excluding the diagonal zeros.
So, in this case, I would be able to identify complete_matrix[:,3] as being the row with the highest average.
Note that the presence of the zeros doesn't affect which row has the highest mean because all rows have the same number of elements. Therefore, we just take the mean of each row, and then ask for the index of the largest element.
#Take the mean along the 1st index, ie collapse into a Nx1 array of means
means = np.mean(complete_matrix, 1)
#Now just get the index of the largest mean
idx = np.argmax(means)
idx is now the index of the row with the highest mean!
You don't need to worry about the 0s, they shouldn't effect how the averages compare since there will presumably be one in each row. Hence, you can do something like this to get the index of the row with the highest average:
>>> import numpy as np
>>> complete_matrix = np.array([
... [0, 1, 2, 4],
... [1, 0, 3, 5],
... [2, 3, 0, 6],
... [4, 5, 6, 0]])
>>> np.argmax(np.mean(complete_matrix, axis=1))
3
Reference:
numpy.mean
numpy.argmax
As pointed out by a lot of people, presence of zeros isn't an issue as long as you have the same number of zeros in each column. Just in case your intention was to ignore all the zeros, preventing them from participating in the average computation, you could use weights to suppress the contribution of the zeros. The following solution assigns 0 weight to zero entries, 1 otherwise:
numpy.argmax(numpy.average(complete_matrix,axis=0, weights=complete_matrix!=0))
You can always create a weight matrix where the weight is 0 for diagonal entries, and 1 otherwise.
You will see that this answer actually would fit better to your other question that was marked as duplicated to this one (and don't know why because it is not the same question...)
The presence of zeros can indeed affect the columns' or rows' average, for instance:
a = np.array([[ 0, 1, 0.9, 1],
[0.9, 0, 1, 1],
[ 1, 1, 0, 0.5]])
Without eliminating the diagonals, it would tell that the column 3 has the highest average, but eliminating the diagonals the highest average belongs to column 1 and now column 3 has the least average of all columns!
You can correct the calculated mean using the lcm (least common multiple) of the number of lines with and without the diagonals, by guaranteeing that where a diagonal element does not exist the correction is not applied:
correction = column_sum/lcm(len(column), len(column)-1)
new_mean = mean + correction
I copied the algorithm for lcm from this answer and proposed a solution for your case:
import numpy as np
def gcd(a, b):
"""Return greatest common divisor using Euclid's Algorithm."""
while b:
a, b = b, a % b
return a
def lcm(a, b):
"""Return lowest common multiple."""
return a * b // gcd(a, b)
def mymean(a):
if len(a.diagonal()) < a.shape[1]:
tmp = np.hstack((a.diagonal()*0+1,0))
else:
tmp = a.diagonal()*0+1
return np.mean(a, axis=0) + np.sum(a,axis=0)*tmp/lcm(a.shape[0],a.shape[0]-1)
Testing with the a given above:
mymean(a)
#array([ 0.95 , 1. , 0.95 , 0.83333333])
With another example:
b = np.array([[ 0, 1, 0.9, 0],
[0.9, 0, 1, 1],
[ 1, 1, 0, 0.5],
[0.9, 0.2, 1, 0],
[ 1, 1, 0.7, 0.5]])
mymean(b)
#array([ 0.95, 0.8 , 0.9 , 0.5 ])
With the corrected average you just use np.argmax() to get the column index with the highest average. Similarly, np.argmin() to get the index of the column with the least average:
np.argmin(mymean(a))

Categories

Resources