numpy.nanmean of subsets of elements - python

I want to take subsets of elements and quickly apply nanmean to the associated columns, without looping.
For specificity, consider the reduction array r=[0,2,3], and the data array
a = np.array([
    [2, 3, 4],
    [3, np.nan, 5],
    [16, 66, 666],
    [2, 2, 5],
    [np.nan, 3, 4],
    [np.nan, 4, 5],
    [np.nan, 5, 4],
    [3, 6, 4.5],
])
then I want to get back
b = np.array([
[2.5,3,4.5],
[16,66,666],
[2.5,4,4.5],
])
The top answer to this question solves the problem (for a single column) by using reduceat. Unfortunately for me, since nanmean is not a ufunc that trick does not work.

I don't think there's a one-liner to do this, because there are no nan-aware ufuncs in numpy.
But you can do something based on reduceat, after (temporarily) replacing all the nans in a:
For example, here's a quick function that accomplishes what you want:
def nanmean_reduceat(x, indices):
    mask = np.isnan(x)
    # use try-finally to make sure x is reset
    # to its original state even if an error is raised.
    try:
        x[mask] = 0
        return np.add.reduceat(x, indices) / np.add.reduceat(~mask, indices)
    finally:
        x[mask] = np.nan
then you can call
>>> nanmean_reduceat(a, [0, 2, 3])
array([[ 2.5, 3. , 4.5],
[ 16. , 66. , 666. ],
[ 2.5, 4. , 4.5]])
Hope that helps!
Edit: for brevity, I removed the empty except block and moved the return statement inside the try block. Because of the way finally statements work, the resetting of x is still executed!
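If mutating the input is a concern, the same idea can be sketched without touching x at all. This is only a variant of the function above, not the answerer's code: the name nanmean_reduceat_copy is made up for illustration, and it trades the in-place reset for one temporary array built with np.where.

```python
import numpy as np

def nanmean_reduceat_copy(x, indices):
    """Segment-wise nanmean along axis 0 (hypothetical helper name)."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)
    # Sum each segment with NaNs replaced by 0 in a temporary copy.
    sums = np.add.reduceat(np.where(mask, 0.0, x), indices, axis=0)
    # Count the non-NaN entries per segment; cast to int explicitly.
    counts = np.add.reduceat((~mask).astype(int), indices, axis=0)
    return sums / counts

a = np.array([
    [2, 3, 4],
    [3, np.nan, 5],
    [16, 66, 666],
    [2, 2, 5],
    [np.nan, 3, 4],
    [np.nan, 4, 5],
    [np.nan, 5, 4],
    [3, 6, 4.5],
])
result = nanmean_reduceat_copy(a, [0, 2, 3])
expected = np.array([
    [2.5, 3, 4.5],
    [16, 66, 666],
    [2.5, 4, 4.5],
])
```

A segment that contains only NaNs in some column ends up as 0/0, which gives nan together with a RuntimeWarning.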

Related

Using `numpy.vectorize` to create multidimensional array results in ValueError: setting an array element with a sequence

This problem only seems to arise when my dummy function returns an array and thus, a multidimensional array is being created.
I reduced the issue to the following example:
def dummy(x):
    y = np.array([np.sin(x), np.cos(x)])
    return y
x = np.array([0, np.pi/2, np.pi])
The code I want to optimize looks like this:
y = []
for x_i in x:
    y_i = dummy(x_i)
    y.append(y_i)
y = np.array(y)
So I thought, I could use vectorize to get rid of the slow loop:
y = np.vectorize(dummy)(x)
But this results in
ValueError: setting an array element with a sequence.
Where even is the sequence, which the error is talking about?!
Your function returns an array when given a scalar:
In [233]: def dummy(x):
     ...:     y = np.array([np.sin(x), np.cos(x)])
     ...:     return y
     ...:
In [234]: dummy(1)
Out[234]: array([0.84147098, 0.54030231])
In [235]: f = np.vectorize(dummy)
In [236]: f([0,1,2])
...
ValueError: setting an array element with a sequence.
vectorize constructs an empty result array, and tries to put the result of each calculation in it. But a cell of the target array cannot accept an array.
If we specify an otypes parameter, it does work:
In [237]: f = np.vectorize(dummy, otypes=[object])
In [238]: f([0,1,2])
Out[238]:
array([array([0., 1.]), array([0.84147098, 0.54030231]),
array([ 0.90929743, -0.41614684])], dtype=object)
That is, each dummy array is put in an element of a shape (3,) result array.
Since the component arrays all have the same shape, we can stack them:
In [239]: np.stack(_)
Out[239]:
array([[ 0. , 1. ],
[ 0.84147098, 0.54030231],
[ 0.90929743, -0.41614684]])
But as noted, vectorize does not promise a speedup. I suspect we could also use the newer signature parameter, but that's even slower.
vectorize makes some sense if your function takes several scalar arguments, and you'd like to take advantage of numpy broadcasting when feeding sets of values. But as replacement for a simple iteration over a 1d array, it isn't an improvement.
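For reference, here is a rough sketch of the signature route mentioned above, next to the plain loop and a direct NumPy expression. The signature string "()->(n)" tells vectorize that each scalar input yields a 1-d output; the variable names are only for illustration, and no timing claims are made here.

```python
import numpy as np

def dummy(x):
    return np.array([np.sin(x), np.cos(x)])

x = np.array([0, np.pi / 2, np.pi])

# Reference result: the explicit loop from the question, shape (3, 2).
looped = np.array([dummy(x_i) for x_i in x])

# vectorize with a signature: one scalar in, one length-n vector out.
via_signature = np.vectorize(dummy, signature="()->(n)")(x)

# No vectorize at all: np.sin and np.cos already work on whole arrays.
direct = np.stack([np.sin(x), np.cos(x)], axis=-1)
```

All three arrays have shape (3, 2) and hold the same values. The last form is the one that actually avoids the Python-level loop.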
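I don't really understand the error either, but with python 3.6.3 you can just write:
y = dummy(x)
so it is automatically vectorized, because np.sin and np.cos already accept arrays. Note that this gives an array of shape (2, 3), the transpose of what the loop builds, so use dummy(x).T if you need the same layout.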
Also in the official documentation there is written the following:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
I hope this was at least a little help.

How to threshold values in python without if statement (to zero if below threshold, same if above)

I want to do an inline comparison without writing 'if statements' in Python. If the value meets the threshold condition, it should be unchanged. If it doesn't, the value should be set to 0.
In Python I don't seem to be allowed to apply a boolean operator to a list directly. In Matlab, it's convenient that 'True' gives a '1' and 'False' gives a '0' in array operations. The following is Matlab-like, but won't work in Python (maybe it would with numpy?). Pseudocode example:
a = [1.5, 1.3, -1.4, -1.2]
a_test_positive = a>0 # Gives [1, 1, 0, 0]
positive_a_only = a.*a>0
Desired result:
positive_a_only>> [1.5, 1.3, 0, 0]
What is the best way to do this in python?
You need -
a = [1.5, 1.3, -1.4, -1.2]
positive_a_only = [i if i>0 else 0 for i in a]
print(positive_a_only)
Output
[1.5, 1.3, 0, 0]
This is known as a List Comprehension
According to your input and expected output, this is a "pythonic" way to do this
List comprehensions provide a concise way to create lists. Common
applications are to make new lists where each element is the result of
some operations applied to each member of another sequence or
iterable, or to create a subsequence of those elements that satisfy a
certain condition.
Your use case is kind of made for this :)
It may be worth looking at Numpy if you are working with numerical arrays.
import numpy as np
a = np.array([1.5, 1.3, -1.4, -1.2])
a[a < 0] = 0
# [ 1.5 1.3 0. 0. ]
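The Matlab-style expression from the question does carry over to NumPy arrays almost literally. The sketch below shows three non-mutating ways to get the same result; the variable names are just for illustration.

```python
import numpy as np

a = np.array([1.5, 1.3, -1.4, -1.2])

# Multiply by the boolean mask: True acts as 1, False as 0.
# Negative entries become -0.0, which compares equal to 0.
masked = a * (a > 0)

# np.where picks from a where the condition holds, else 0.
selected = np.where(a > 0, a, 0.0)

# For a threshold of exactly 0, an elementwise maximum also works.
clipped = np.maximum(a, 0.0)
```

Unlike a[a < 0] = 0, none of these change a itself; each returns a new array.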
The best answer I have found so far is to enumerate and loop through the list, using a Python comparison operator for the threshold logic.
The key is to multiply each element by the result of the logical comparison, since True counts as 1 and False as 0, e.g.
a = 1.5
a_positive = a * (a>0)
print(a_positive)
This prints 1.5 as expected, and would print -0.0 (which equals 0) if a were negative.
Here's the example then with the full list:
a = [1.5, 1.3, -1.4, -1.2]
for i, element in enumerate(a):
    a[i] = element*(element>0)
print(a)
[1.5, 1.3, -0.0, -0.0]
Hope that helps someone!

Python, numpy.array slicing, altering array values with slices

I have a task for numerical integration in which we approximate an integral with a quadrature formula. My problem is that the task requires me to avoid loops and use a vectorized variant, which would be a slice?!
I have an np.array object with n values and I have to alter each value of this array using a specific formula. The problem is that the value of the array at position i is used in the formula that alters that same position. With a for loop it would be easy:
x = np.array([...])
for i in range(0,n):
    x[i]=f(x[i]+a)*b
(a and b are some other variables)
How do I do this with slices? I have to do this for all elements of the array, so it would be something like:
x[:]=f(x[???]+a)*b
And how do I get the right position from my array into the formula? A slicing instruction like x[:] just runs through my whole object. Is there a way to somehow save the index I am currently at?
I tried to search but found nothing. The other problem is that I do not even know how to phrase the search request...
You may be confusing two issues:
- modifying all elements of an array
- calculating values for all elements of an array
In
x = np.array([...])
for i in range(0,n):
    x[i]=f(x[i]+a)*b
you change elements of x one by one, and also pass them one by one to f.
x[:] = ... lets you change all elements of x at once, but the source (the right hand side of the equation) has to generate all those values. But usually you don't need to assign values. Instead just use x = .... It's just as fast and memory efficient.
Using x[:] on the RHS does nothing for you. If x is a list this makes a copy; if x is an array it just returns a view, an array with the same values.
The key question is, what does your f(...) function accept? If it uses operations like +, * and functions like np.sin, you can give it an array, and it will return an array.
But if it only works with scalars (that includes using functions like math.sin), then you have to feed it scalars, i.e. x[i].
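To make the distinction concrete, here is a small sketch with two stand-in functions. f_array and f_scalar are hypothetical names, and the values of a, b and x are arbitrary.

```python
import math
import numpy as np

a, b = 0.5, 2.0
x = np.linspace(0.0, 1.0, 5)

def f_array(t):
    # Built from NumPy functions, so it accepts scalars and arrays alike.
    return np.sin(t)

def f_scalar(t):
    # math.sin only accepts a single number.
    return math.sin(t)

# Array-aware f: the whole update is one expression, no index needed.
vectorized = f_array(x + a) * b

# Scalar-only f: the elements have to be fed in one at a time.
looped = np.array([f_scalar(x_i + a) * b for x_i in x])
```

Both give the same values. Calling f_scalar(x + a) directly would raise a TypeError, because math.sin cannot take a 5-element array.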
Let's try to unpack that comment (which might be better as an edit to the original question)
I have an interval which has to be cut in pieces.
x = np.linspace(start,end,pieceAmount)
function f
quadrature formula
b (weights or factors)
c (function values)
b1*f(x[i]+c1)+...+bn*f(x[i]+cn)
For example
In [1]: x = np.arange(5)
In [2]: b = np.arange(3)
In [6]: c = np.arange(4,7)*.1
We can do the x[i]+c for all x and c with broadcasting
In [7]: xc = x + c[:,None]
In [8]: xc
Out[8]:
array([[ 0.4, 1.4, 2.4, 3.4, 4.4],
[ 0.5, 1.5, 2.5, 3.5, 4.5],
[ 0.6, 1.6, 2.6, 3.6, 4.6]])
If f is a function like np.sin that takes any array, we can pass xc to that, getting back a like sized array.
Again with broadcasting we can do the b[n]*f(x[i]+c[n]) calculation
In [9]: b[:,None]* np.sin(xc)
Out[9]:
array([[ 0. , 0. , 0. , -0. , -0. ],
[ 0.47942554, 0.99749499, 0.59847214, -0.35078323, -0.97753012],
[ 1.12928495, 1.99914721, 1.03100274, -0.88504089, -1.98738201]])
and then we can sum, getting back an array shaped just like x:
In [10]: np.sum(_, axis=0)
Out[10]: array([ 1.60871049, 2.99664219, 1.62947489, -1.23582411, -2.96491212])
That's the dot or matrix product:
In [11]: b.dot(np.sin(xc))
Out[11]: array([ 1.60871049, 2.99664219, 1.62947489, -1.23582411, -2.96491212])
And as I noted earlier we can complete the action with
x = b.dot(f(x+c[:,None]))
The key to a simple expression like this is f taking an array.
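Putting the steps together, the calculation could be wrapped up roughly as follows and checked against an explicit double loop. The function name apply_quadrature is made up for this sketch, and f is assumed to accept arrays (np.sin is used here, as in the session above).

```python
import numpy as np

def apply_quadrature(f, x, b, c):
    """Return sum_n b[n] * f(x[i] + c[n]) for every x[i] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    # x + c[:, None] has shape (len(c), len(x)); the dot product with b
    # sums over the first axis, leaving one value per element of x.
    return b.dot(f(x + c[:, None]))

x = np.arange(5)
b = np.arange(3)
c = np.arange(4, 7) * 0.1

vectorized = apply_quadrature(np.sin, x, b, c)

# Same calculation with explicit loops, for comparison only.
looped = np.array([
    sum(b[n] * np.sin(x[i] + c[n]) for n in range(len(b)))
    for i in range(len(x))
])
```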

Trying to Remove for-Loops from Python code, Performing Operations with a Look-up Table On Matrices

I feel like this is a similar problem to the one I asked before, but I can't figure it out. How can I convert these two lines of code into one line with no for-loop?
for i in range(X.shape[0]):
    dW[:,y[i]] -= X[i]
In English, every row in matrix X should be subtracted from a corresponding column in matrix dW given by the vector y.
I should mention dW is D x C and X is N x D, so the transpose of X does not have the same shape as dW; otherwise I could reorder the rows of X and take the transpose directly. However, it is possible for a column in dW to have multiple corresponding rows which need to be subtracted.
I feel like I do not have a firm grasp of how indexing in python is supposed to work, which makes it difficult to remove unnecessary for-loops, or even to know what for-loops are possible to remove.
The straightforward way to vectorize would be:
dW[:,y] -= X.T
Except, though not very obvious or well-documented, this will give problems with repeated indices in y. For these situations there is the ufunc.at method (elementwise operations in numpy are implemented as "ufunc's" or "universal functions"). Quote from the docs:
ufunc.at(a, indices, b=None)
Performs unbuffered in place operation on operand ‘a’ for elements specified by ‘indices’. For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once. For example, a[[0,0]] += 1 will only increment the first element once because of buffering, whereas add.at(a, [0,0], 1) will increment the first element twice.
So in your case:
np.subtract.at(dW.T, y, X)
Unfortunately, ufunc.at is relatively inefficient as far as vectorization techniques go, so the speedup compared to the loop might not be that impressive.
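The difference between the buffered and unbuffered update is easy to see on a small made-up example. The shapes and the label vector y below are arbitrary choices for illustration, with repeated labels on purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 6, 4, 3
X = rng.random((N, D))
y = np.array([1, 2, 2, 0, 1, 2])  # labels 1 and 2 repeat

# Reference: the loop from the question.
dW_loop = np.zeros((D, C))
for i in range(N):
    dW_loop[:, y[i]] -= X[i]

# Unbuffered update: repeated indices accumulate. dW_at.T is a view,
# so the in-place operation writes through to dW_at.
dW_at = np.zeros((D, C))
np.subtract.at(dW_at.T, y, X)

# Plain fancy indexing: for a repeated column only one subtraction survives.
dW_fancy = np.zeros((D, C))
dW_fancy[:, y] -= X.T
```

Here dW_at matches the loop, while dW_fancy does not, because columns 1 and 2 each receive only a single row of X.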
Approach #1 Here's a one-liner vectorized approach with matrix-multiplication using np.dot and NumPy broadcasting -
dWout -= (np.arange(dW.shape[1])[:,None] == y).dot(X).T
Explanation : Take a small example to understand what's going on -
Inputs :
In [259]: X
Out[259]:
array([[ 0.80195208, 0.40566743, 0.62585574, 0.53571781],
[ 0.56643339, 0.4635662 , 0.4290103 , 0.14457036],
[ 0.31823491, 0.12329964, 0.41682841, 0.09544716]])
In [260]: y
Out[260]: array([1, 2, 2])
First off, we create the 2D mask of y indices spread across the length of dW's second axis.
Let dW be a 4 x 5 shaped array. So, the mask would be :
In [261]: mask = (np.arange(dW.shape[1])[:,None] == y)
In [262]: mask
Out[262]:
array([[False, False, False],
[ True, False, False],
[False, True, True],
[False, False, False],
[False, False, False]], dtype=bool)
This is using NumPy broadcasting here to create a 2D mask.
Next up, we use matrix-multiplication to sum-aggregate the same indices from y -
In [264]: mask.dot(X)
Out[264]:
array([[ 0. , 0. , 0. , 0. ],
[ 0.80195208, 0.40566743, 0.62585574, 0.53571781],
[ 0.8846683 , 0.58686584, 0.84583872, 0.24001752],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
Thus, corresponding to the third row of the mask that has True values at second and third columns, we would sum up the second and third rows from X with that matrix-multiplication. This would be put as the third row in the multiplication output.
Since, in the original loopy code we are updating dW across columns, we need to transpose the multiplication result and then update.
Approach #2 Here's another vectorized way, though not a one-liner using np.add.reduceat -
sidx = y.argsort()
unq,shift_idx = np.unique(y[sidx],return_index=True)
dWout[:,unq] -= np.add.reduceat(X[sidx],shift_idx,axis=0).T
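A quick way to convince yourself that both approaches reproduce the loop is to run them on random data. In this sketch each approach works on its own copy of dW (that is an assumption about how the output array is meant to be set up), and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 7, 4, 5
X = rng.random((N, D))
y = rng.integers(0, C, size=N)
dW = rng.random((D, C))

# Reference: the original loop.
expected = dW.copy()
for i in range(N):
    expected[:, y[i]] -= X[i]

# Approach #1: boolean mask of shape (C, N) times X of shape (N, D).
out1 = dW.copy()
out1 -= (np.arange(C)[:, None] == y).dot(X).T

# Approach #2: sort rows by label, then sum each run with reduceat.
out2 = dW.copy()
sidx = y.argsort()
unq, shift_idx = np.unique(y[sidx], return_index=True)
out2[:, unq] -= np.add.reduceat(X[sidx], shift_idx, axis=0).T
```

Approach #2 is safe with fancy indexing because unq contains each column index only once.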

Getting index of numpy.ndarray

I have a one-dimensional array of the type numpy.ndarray and I want to know the index of its max entry. After finding the max, I used
peakIndex = numpy.where(myArray==max)
to find the peak's index. But instead of the index, my script spits out
peakIndex = (array([1293]),)
I want my code to spit out just the integer 1293. How can I clean up the output?
Rather than using numpy.where, you can use numpy.argmax.
peakIndex = numpy.argmax(myArray)
numpy.argmax returns a single number, the flattened index of the first occurrence of the maximum value. If myArray is multidimensional you might want to convert the flattened index to an index tuple:
peakIndexTuple = numpy.unravel_index(numpy.argmax(myArray), myArray.shape)
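As a small illustration, here are both calls on made-up data. The arrays are only examples, and the last line shows one way to get every position of the maximum when ties matter.

```python
import numpy as np

my_array = np.array([1.0, 1.1, 1.6, 1.0, 1.6, 0.8])

# Index of the first occurrence of the maximum, as a plain Python int.
peak_index = int(np.argmax(my_array))

grid = np.array([[0.0, 1.1, 1.5],
                 [1.1, 1.5, 0.7],
                 [0.2, 1.2, 1.5]])

# Convert the flattened index into a (row, column) tuple.
peak_index_tuple = np.unravel_index(np.argmax(grid), grid.shape)

# All positions holding the maximum, as a flat integer array.
all_peaks = np.flatnonzero(my_array == my_array.max())
```

Here peak_index is 2, peak_index_tuple points at row 0, column 2, and all_peaks holds the indices 2 and 4.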
To find the max value of an array, you can use the array.max() method. This will probably be more efficient than the for loop described in another answer, which, in addition to not being pythonic, isn't actually written in Python. (If you wanted to take items out of the array one by one to compare, you could use ndenumerate, but you'd be sacrificing some of the performance benefits of arrays.)
The reason that numpy.where() yields results as tuples is that more than one position could be equal to the max... and it's that edge case that would make something simple (like taking array[0]) prone to bugs. Per Is there a Numpy function to return the first index of something in an array?,
"The result is a tuple with first all the row indices, then all the
column indices".
Your example uses a 1-D array, so you'd get the results you want directly from the array provided. It's a tuple with one element (one array of indices), and although you can iterate over ind_1d[0] directly, I converted it to a list solely for readability.
>>> peakIndex_1d
array([ 1. , 1.1, 1.6, 1. , 1.6, 0.8])
>>> ind_1d = numpy.where( peakIndex_1d == peakIndex_1d.max() )
>>> ind_1d
(array([2, 4]),)
>>> list( ind_1d[0] )
[2, 4]
For a 2-D array with 3 values equal to the max, you could use:
>>> peakIndex
array([[ 0. , 1.1, 1.5],
[ 1.1, 1.5, 0.7],
[ 0.2, 1.2, 1.5]])
>>> indices = numpy.where( peakIndex == peakIndex.max() )
>>> ind2d = list(zip(indices[0], indices[1]))
>>> ind2d
[(0, 2), (1, 1), (2, 2)]
