I have two 5D matrices which I would like to add elementwise. The matrices have the exact same dimensions and number of elements, but they both contain randomly distributed NaN values.
I would like to add these two matrices elementwise in an efficient way. I am currently adding them by looping through them elementwise, but this loop takes about 40 minutes and I just thought there must be a more efficient way of doing it.
What I think would be an efficient way is if it was possible to use numpy.nansum to add them, but from what I can find, numpy.nansum only works with 1D arrays.
I would prefer it if the adding went down as it does with numpy.nansum (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nansum.html). Namely, (1) if two values are added I want the sum to be a value, (2) if a value and a NaN are added I want the sum to be the value and (3) if two NaN are added I want the sum to be NaN.
Below is some example code:
import numpy as np
# Creating fake data (float arrays, since NaN cannot be stored in an integer array)
A = np.arange(0, 720, dtype=float).reshape(2, 3, 4, 5, 6)
B = np.arange(720, 1440, dtype=float).reshape(2, 3, 4, 5, 6)
# Assigning some elements as NaN
A[0, 1, 2, 3, 4] = np.nan
A[1, 2, 3, 4, 5] = np.nan
B[1, 2, 3, 4, 5] = np.nan
So, if I now add A and B (let's say C = A + B), I want element C[0,1,2,3,4] to be the value of B[0,1,2,3,4], element C[1,2,3,4,5] to be NaN, and every other element of C to be the sum of the corresponding elements of A and B.
Does anyone have an efficient solution for this addition?
np.where(np.isnan(A), B, A + np.nan_to_num(B))
This works in two parts:
Where A is NaN, we fill in the values from B. If A and B are NaN at the same position, the stored value will be NaN; if B is not NaN where A is, B's value is taken.
Where A is not NaN, we fill in A + np.nan_to_num(B). np.nan_to_num(B) turns B's NaNs into 0, so A + np.nan_to_num(B) is not NaN even where B is.
Thanks to Paul Panzer for the correction.
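As a quick check, here is that one-liner on float versions of the arrays from the question (the NaN assignments require float arrays; a minimal sketch):
import numpy as np

A = np.arange(0, 720, dtype=float).reshape(2, 3, 4, 5, 6)
B = np.arange(720, 1440, dtype=float).reshape(2, 3, 4, 5, 6)
A[0, 1, 2, 3, 4] = np.nan
A[1, 2, 3, 4, 5] = np.nan
B[1, 2, 3, 4, 5] = np.nan

C = np.where(np.isnan(A), B, A + np.nan_to_num(B))

print(C[0, 1, 2, 3, 4])   # 922.0 -- the value of B alone, since A is NaN here
print(C[1, 2, 3, 4, 5])   # nan   -- both A and B are NaN here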
I was thinking of something more prosaic
In [22]: A=np.arange(10.) # make sure A is float
In [23]: B=np.arange(100,110.)
In [24]: A[[1,3,9]]=np.nan
In [25]: B[[2,5,9]]=np.nan
In [26]: A
Out[26]: array([ 0., nan, 2., nan, 4., 5., 6., 7., 8., nan])
In [27]: B
Out[27]: array([100., 101., nan, 103., 104., nan, 106., 107., 108., nan])
In [29]: C=A+B
In [30]: C
Out[30]: array([100., nan, nan, nan, 108., nan, 112., 114., 116., nan])
In [31]: mask1 = np.isnan(A) & ~np.isnan(B)
In [32]: C[mask1] = B[mask1]
In [33]: mask2 = np.isnan(B) & ~np.isnan(A)
In [34]: C[mask2] = A[mask2]
In [35]: C
Out[35]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., nan])
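The same steps can be wrapped into a reusable function (a sketch; the function name is mine):
def add_ignore_single_nans(A, B):
    # value + value -> sum, NaN + value -> value, NaN + NaN -> NaN
    C = A + B
    mask1 = np.isnan(A) & ~np.isnan(B)
    C[mask1] = B[mask1]
    mask2 = np.isnan(B) & ~np.isnan(A)
    C[mask2] = A[mask2]
    return C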
I like the stack and nansum approach, but I'm not sure it's faster:
In [36]: s=np.stack((A,B))
In [37]: C1 = np.nansum(s, axis=0)
In [38]: C1
Out[38]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., 0.])
In [40]: C1[np.all(np.isnan(s), axis=0)] = np.nan
In [41]: C1
Out[41]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., nan])
Look at s if this approach is puzzling:
In [42]: s
Out[42]:
array([[  0.,  nan,   2.,  nan,   4.,   5.,   6.,   7.,   8.,  nan],
       [100., 101.,  nan, 103., 104.,  nan, 106., 107., 108.,  nan]])
s is a new array with a new leading dimension; summing over that dimension is the same as A + B. Stacking lets us take advantage of nansum. Unfortunately, you still want to keep some NaNs, so we still have to do a masked assignment to handle that detail.
s = np.stack((A, B))
C = np.nansum(s, axis=0)
C[np.all(np.isnan(s), axis=0)] = np.nan
This will treat np.nan as 0.0 for purposes of summing, and the final line then restores NaN at the places where every entry along the new "depth" axis (spanning A and B) was NaN.
Note that this last operation is necessary for NumPy versions > 1.8, as it says in the documentation:
In NumPy versions <= 1.8.0 Nan is returned for slices that are all-NaN or empty. In later versions zero is returned.
If you can guarantee NumPy version <= 1.8, then just the nansum part alone would suffice.
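To make this version-proof, the two steps can be wrapped in a small helper (the function name is mine) that behaves the same regardless of the nansum convention:
def nansum_keep_all_nan(A, B):
    # elementwise A + B, ignoring single NaNs but keeping all-NaN positions as NaN
    s = np.stack((A, B))
    C = np.nansum(s, axis=0)            # on recent NumPy, all-NaN slices sum to 0
    C[np.all(np.isnan(s), axis=0)] = np.nan  # no-op on NumPy <= 1.8, a fix afterwards
    return C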
Just add a new axis before summing:
np.nansum(np.concatenate((A[None,:],B[None,:])),axis=0)
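Note that on recent NumPy this has the same all-NaN caveat as the stack approach above: positions where both A and B are NaN come out as 0.0 unless they are masked afterwards. A sketch, assuming A and B as in the question:
s = np.concatenate((A[None, :], B[None, :]))   # join along a new leading axis
C = np.nansum(s, axis=0)
C[np.all(np.isnan(s), axis=0)] = np.nan        # restore all-NaN positions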
Related
I'm trying to obtain the ranks in a 2D array, along axis=1, with no repeated ranks.
Suppose I have the array below:
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])
I would expect the following result, for a 'hi-lo' rank:
array([[ 2.,  1.,  3.,  4.],
       [ 3., nan,  1.,  2.],
       [nan,  1.,  2.,  3.]])
And the following result, for a 'lo-hi' rank:
array([[ 2.,  4.,  3.,  1.],
       [ 1., nan,  3.,  2.],
       [nan,  3.,  1.,  2.]])
I've been using scipy.stats.rankdata, but this solution is very time-consuming for large arrays. Also, the code I'm using (shown below) relies on np.apply_along_axis, which I know is not very efficient. I know that scipy.stats.rankdata accepts an axis argument, but the code behind it uses exactly np.apply_along_axis (see here).
from scipy.stats import rankdata

def f(array, order='hi-lo'):
    array = np.asarray(array)
    lo_hi_rank = np.apply_along_axis(rankdata, 1, array, 'ordinal')
    lo_hi_rank = lo_hi_rank.astype(float)
    lo_hi_rank[np.isnan(array)] = np.nan
    if order == 'lo-hi':
        return lo_hi_rank
    else:
        return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1
Does anyone know a faster implementation?
Update
I've compared the execution time of all the options suggested so far.
Option 1 below is the explicit-loop version of the code I suggested above (repeated below as Option 2).
def option1(a, order='ascending'):
    ranks = np.empty_like(a)
    for row in range(ranks.shape[0]):
        lo_hi_rank = rankdata(a[row], method='ordinal')
        lo_hi_rank = lo_hi_rank.astype(float)
        lo_hi_rank[np.isnan(a[row])] = np.nan
        if order == 'ascending':
            ranks[row] = lo_hi_rank.copy()
        else:
            ranks[row] = np.nanmax(lo_hi_rank) - lo_hi_rank + 1
    return ranks
def option2(a, order='ascending'):
    a = np.asarray(a)
    lo_hi_rank = np.apply_along_axis(rankdata, 1, a, 'ordinal')
    lo_hi_rank = lo_hi_rank.astype(float)
    lo_hi_rank[np.isnan(a)] = np.nan
    if order == 'ascending':
        return lo_hi_rank
    else:
        return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1
Options 3-6 were suggested by Divakar:
def option3(a, order='ascending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf, a)
    out = b.argsort(1, 'stable').argsort(1) + 1. - sm
    out[out <= 0] = np.nan
    return out
def option4(a, order='ascending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf, a)
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    sidx = np.empty((m, n), dtype=float)
    np.put_along_axis(sidx, idx, np.arange(1, n + 1), axis=1)
    out = sidx - sm
    out[out <= 0] = np.nan
    return out
def option5(a, order='descending'):
    b = -a if order == 'descending' else a
    out = b.argsort(1, 'stable').argsort(1) + 1.
    return np.where(np.isnan(a), np.nan, out)
def option6(a, order='descending'):
    b = -a if order == 'descending' else a
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    out = np.empty((m, n), dtype=float)
    np.put_along_axis(out, idx, np.arange(1, n + 1), axis=1)
    return np.where(np.isnan(a), np.nan, out)
Option 6 seems to be the cleanest and is indeed the fastest (~40% improvement vs. Option 2). See below the average execution times over 100 iterations, with array.shape=(5348, 1225):
>> TIME COMPARISON
>> 100 iterations | array.shape=(5348, 1225)
>> Option1: 0.4838 seconds
>> Option2: 0.3404 seconds
>> Option3: 0.3355 seconds
>> Option4: 0.2331 seconds
>> Option5: 0.3145 seconds
>> Option6: 0.2114 seconds
It can also be extended to a generic axis and generic n-dim arrays, as proposed by Divakar. However, it is still too time-consuming for what I'm trying to achieve (since I'll have to run this function millions of times within a loop). Is there a faster alternative? Or have we reached the limit of what's feasible with Python?
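For reference, a timing harness along these lines could reproduce the comparison (a sketch; the NaN fraction, random seed, and use of time.perf_counter are my assumptions, not the original setup):
import time
import numpy as np

def benchmark(funcs, shape=(5348, 1225), nan_frac=0.1, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.random(shape)
    a[rng.random(shape) < nan_frac] = np.nan   # sprinkle NaNs at random positions
    print(f'TIME COMPARISON\n{n_iter} iterations | array.shape={shape}')
    for fn in funcs:
        start = time.perf_counter()
        for _ in range(n_iter):
            fn(a)
        print(f'{fn.__name__}: {(time.perf_counter() - start) / n_iter:.4f} seconds')

benchmark([option1, option2, option3, option4, option5, option6])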
Method #1
Here's one way -
def rank_with_nans(a, order='descending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf, a)
    out = b.argsort(1, 'stable').argsort(1) + 1. - sm
    out[out <= 0] = np.nan
    return out
We can optimize on the double argsort part with a variation based on this post, shown below -
def rank_with_nans_v2(a, order='descending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf, a)
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    sidx = np.empty((m, n), dtype=float)
    np.put_along_axis(sidx, idx, np.arange(1, n + 1), axis=1)
    out = sidx - sm
    out[out <= 0] = np.nan
    return out
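(The gain comes from replacing the second argsort, which costs O(n log n) per row, with a direct scatter of the rank values via np.put_along_axis, which is O(n).)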
Sample runs -
In [338]: a
Out[338]:
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])
In [339]: rank_with_nans(a, order='descending')
Out[339]:
array([[ 2.,  1.,  3.,  4.],
       [ 3., nan,  1.,  2.],
       [nan,  1.,  2.,  3.]])
In [340]: rank_with_nans(a, order='ascending')
Out[340]:
array([[ 2.,  4.,  3.,  1.],
       [ 1., nan,  3.,  2.],
       [nan,  3.,  1.,  2.]])
Method #2
Without inf conversion, here's with double-argsort -
def rank_with_nans_v3(a, order='descending'):
    b = -a if order == 'descending' else a
    out = b.argsort(1, 'stable').argsort(1) + 1.
    return np.where(np.isnan(a), np.nan, out)
Again, with the argsort-skip trick -
def rank_with_nans_v4(a, order='descending'):
    b = -a if order == 'descending' else a
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    out = np.empty((m, n), dtype=float)
    np.put_along_axis(out, idx, np.arange(1, n + 1), axis=1)
    return np.where(np.isnan(a), np.nan, out)
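As a sanity check, v4 reproduces the Method #1 results on the sample array (a sketch, reusing a from the sample runs above):
print(rank_with_nans_v4(a, order='descending'))
# [[ 2.  1.  3.  4.]
#  [ 3. nan  1.  2.]
#  [nan  1.  2.  3.]]
print(rank_with_nans_v4(a, order='ascending'))
# [[ 2.  4.  3.  1.]
#  [ 1. nan  3.  2.]
#  [nan  3.  1.  2.]]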
Bonus: Extend to generic axis and generic n-dim array
We can extend the proposed solutions to incorporate an axis argument, so that the ranking can be applied along that axis. The last solution, v4, seems to be the most efficient one. Let's use it to make things generic -
def rank_with_nans_along_axis(a, order='descending', axis=-1):
    axis = axis % a.ndim  # normalize negative axis values for the indexer below
    b = -a if order == 'descending' else a
    idx = b.argsort(axis=axis, kind='stable')
    out = np.empty(idx.shape, dtype=float)
    indexer = tuple(None if i != axis else Ellipsis for i in range(a.ndim))
    np.put_along_axis(out, idx, np.arange(1, a.shape[axis] + 1, dtype=float)[indexer], axis=axis)
    return np.where(np.isnan(a), np.nan, out)
Sample run -
In [227]: a
Out[227]:
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])
In [228]: rank_with_nans_along_axis(a, order='descending', axis=0)
Out[228]:
array([[ 1.,  1.,  2.,  2.],
       [ 2., nan,  1.,  1.],
       [nan,  2.,  3.,  3.]])
In [229]: rank_with_nans_along_axis(a, order='ascending', axis=0)
Out[229]:
array([[ 2.,  2.,  2.,  2.],
       [ 1., nan,  3.,  3.],
       [nan,  1.,  1.,  1.]])
I have a huge 2d numpy array of lists (dtype object) that I want to convert into a 2d numpy array of dtype float, stacking the dimension represented by lists onto the 0th axis (rows). The lists within each row always have the exact same length, and have at least one element.
Here is a minimal reproduction of the situation:
import numpy as np
current_array = np.array(
    [[[0.0], [1.0]],
     [[2.0, 3.0], [4.0, 5.0]]],
    dtype=object  # ragged inner lists need an explicit object dtype on recent NumPy
)
desired_array = np.array(
    [[0.0, 1.0],
     [2.0, 4.0],
     [3.0, 5.0]]
)
I looked around for solutions; the stack and dstack functions work only if the first level is a tuple, and reshape would require the third level to be part of the array. I wonder, is there any relatively efficient way to do it?
Currently, I am just counting the dimensions, creating an empty array, and filling in the values one by one, which honestly does not seem like a good solution.
In [321]: current_array = np.array(
     ...:     [[[0.0], [1.0]],
     ...:      [[2.0, 3.0], [4.0, 5.0]]],
     ...:     dtype=object
     ...: )
In [322]: current_array
Out[322]:
array([[list([0.0]), list([1.0])],
       [list([2.0, 3.0]), list([4.0, 5.0])]], dtype=object)
In [323]: _.shape
Out[323]: (2, 2)
Rework the two rows:
In [328]: current_array[1,:]
Out[328]: array([list([2.0, 3.0]), list([4.0, 5.0])], dtype=object)
In [329]: np.stack(current_array[1,:],1)
Out[329]:
array([[2., 4.],
       [3., 5.]])
In [330]: np.stack(current_array[0,:],1)
Out[330]: array([[0., 1.]])
combine them:
In [331]: np.vstack((_330, _329))
Out[331]:
array([[0., 1.],
       [2., 4.],
       [3., 5.]])
in one line:
In [333]: np.vstack([np.stack(row, 1) for row in current_array])
Out[333]:
array([[0., 1.],
       [2., 4.],
       [3., 5.]])
Author of the question here.
I found a slightly more elegant (and faster) way than filling the array one by one, which is:
desired = np.array([np.concatenate([np.array(d) for d in lis]) for lis in current_array.T]).T
print(desired)
'''
[[0. 1.]
 [2. 4.]
 [3. 5.]]
'''
But it still performs quite a number of operations. It transposes the array so that the neighboring 'dimensions' (one of them being the lists) can be joined with np.concatenate, then converts the result to an np.array and transposes it back.
My situation: I have a pandas DataFrame and, for each row, I have to compute the following:
1) Get the first value, NaNs excluded (df.apply(lambda x: x.dropna().iloc[0]))
2) Get the last value, NaNs excluded (df.apply(lambda x: x.dropna().iloc[-1]))
3) Count the non-NaN values (df.apply(lambda x: len(x.dropna())))
Sample case and expected output:
x = np.array([[1,2,np.nan], [4,5,6], [np.nan, 8,9]])
1) [1, 4, 8]
2) [2, 6, 9]
3) [2, 3, 2]
And I need to keep it optimized, so I turned to numpy and looked for a way to apply y = x[~numpy.isnan(x)] on an NxK array as a first step. Then, I would use what was shown here (Vectorized way of accessing row specific elements in a numpy array) for 1) and 2), but I am still empty-handed for 3).
Here's one way -
In [756]: x
Out[756]:
array([[  1.,   2.,  nan],
       [  4.,   5.,   6.],
       [ nan,   8.,   9.]])
In [768]: m = ~np.isnan(x)
In [769]: first_idx = m.argmax(1)
In [770]: last_idx = m.shape[1] - m[:,::-1].argmax(1) - 1
In [771]: x[np.arange(len(first_idx)), first_idx]
Out[771]: array([ 1., 4., 8.])
In [772]: x[np.arange(len(last_idx)), last_idx]
Out[772]: array([ 2., 6., 9.])
In [773]: m.sum(1)
Out[773]: array([2, 3, 2])
Alternatively, we could make use of cumulative-summation to get those indices, like so -
In [787]: c = m.cumsum(1)
In [788]: first_idx = (c==1).argmax(1)
In [789]: last_idx = c.argmax(1)
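Putting the pieces together, a small helper (the function name is mine) returns all three results in one call; a sketch:
def first_last_count(x):
    # first non-NaN, last non-NaN, and non-NaN count per row
    m = ~np.isnan(x)
    first_idx = m.argmax(1)
    last_idx = m.shape[1] - m[:, ::-1].argmax(1) - 1
    rows = np.arange(len(x))
    return x[rows, first_idx], x[rows, last_idx], m.sum(1)

first, last, count = first_last_count(x)
# first -> [1. 4. 8.], last -> [2. 6. 9.], count -> [2 3 2]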
I have a bunch of matrices eq1, eq2, etc. defined like
from numpy import meshgrid, sqrt, arange
# from numpy import isnan, logical_not
xs = arange(-7.25, 7.25, 0.01)
ys = arange(-5, 5, 0.01)
x, y = meshgrid(xs, ys)
eq1 = ((x/7.0)**2.0*sqrt(abs(abs(x)-3.0)/(abs(x)-3.0))+(y/3.0)**2.0*sqrt(abs(y+3.0/7.0*sqrt(33.0))/(y+3.0/7.0*sqrt(33.0)))-1.0)
eq2 = (abs(x/2.0)-((3.0*sqrt(33.0)-7.0)/112.0)*x**2.0-3.0+sqrt(1-(abs(abs(x)-2.0)-1.0)**2.0)-y)
where eq1, eq2, eq3, etc. are large square matrices. As you can see, there are many nan elements surrounding a 'block' of plot-able values. I want to remove all the nan values whilst keeping the shape of the block of the valid values in the matrix. Note that these 'blocks' can be located anywhere in the eq1, eq2 matrix.
I've looked at answers given in Removing nan values from an array and Removing NaN elements from a matrix, but these don't seem to be completely relevant to my case.
IIUC, you can use boolean indexing with np.isnan to keep the slices. There are probably slicker ways to do this, but starting from something like:
>>> eq = np.zeros((5,6)) + np.nan
>>> eq[2:4, 1:3].flat = [1,np.nan,3,4]
>>> eq
array([[ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,   1.,  nan,  nan,  nan,  nan],
       [ nan,   3.,   4.,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan]])
You could select the rows and columns with data using something like
>>> eq = eq[:,~np.isnan(eq).all(0)]
>>> eq = eq[~np.isnan(eq).all(1)]
>>> eq
array([[  1.,  nan],
       [  3.,   4.]])
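Wrapped up as a small helper (the name is mine), using np.ix_ to apply both masks in a single indexing step; a sketch:
def crop_nan_border(eq):
    # keep only the rows and columns that contain at least one non-NaN value
    rows = ~np.isnan(eq).all(axis=1)
    cols = ~np.isnan(eq).all(axis=0)
    return eq[np.ix_(rows, cols)]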
Short and sweet,
eq1_c = eq1[~np.isnan(eq1)]
np.isnan returns a bool array that can be used to index your original array. Take its negation and you will get back the non-NaN values. Note, though, that the result is a flattened 1-D array, since the remaining values no longer form a rectangle.
One option is to manually iterate through the grid and check for NaN values. A NaN value is easy to spot, because comparing it to itself results in False. You could use this to set all NaN values to 0.0, for example.
for x in range(len(eq1)):
    for y in range(len(eq1[x])):
        v = eq1[x][y]
        if v != v:  # only NaN is unequal to itself
            eq1[x][y] = 0.0
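The same replacement can be done in one vectorized step with a boolean mask, which should be far faster on large grids (a sketch, assuming numpy is imported as np):
eq1[np.isnan(eq1)] = 0.0   # masked assignment: set every NaN to 0.0 in place
# or, to leave eq1 untouched:
# eq1_clean = np.nan_to_num(eq1)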
I have two matrices, A and B:
A = array([[2., 13., 25., 1.], [ 18., 5., 1., 25.]])
B = array([[2, 1], [0, 3]])
I want to index each row of A with each row of B, producing the slice:
array([[25., 13.], [18., 25.]])
That is, I essentially want something like:
array([A[i,b] for i,b in enumerate(B)])
Is there a way to fancy-index this directly? The best I can do is this "flat-hack":
A.flat[B + arange(0,A.size,A.shape[1])[:,None]]
@Ophion's answer is great, and deserves the credit, but I wanted to add some explanation and offer a more intuitive construction.
Instead of transposing B and then transposing the result back, it's better to just transpose the arange. I think this gives the most intuitive solution, even if it takes more characters:
A[((0,),(1,)), B]
or equivalently
A[np.arange(2)[:, None], B]
This works because what's really going on here is that you're making an i array and a j array, each of which has the same shape as your desired result.
i = np.array([[0, 0],
              [1, 1]])
j = B
But you can use just
i = np.array([[0],
              [1]])
Because it will broadcast to match B (this is what np.arange(2)[:,None] gives).
Finally, to make it more general (without hard-coding 2 as the arange size), you could also generate i from B with
i = np.indices(B.shape)[0]
However you build i and j, you just call it like
>>> A[i, j]
array([[ 25.,  13.],
       [ 18.,  25.]])
Not pretty but:
A[np.arange(2),B.T].T
array([[ 25.,  13.],
       [ 18.,  25.]])
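For what it's worth, this is just the transposed twin of the arange construction above; a quick equivalence check (a sketch, assuming A and B as defined in the question):
import numpy as np

A = np.array([[2., 13., 25., 1.], [18., 5., 1., 25.]])
B = np.array([[2, 1], [0, 3]])

out1 = A[np.arange(2), B.T].T        # the 'not pretty' version
out2 = A[np.arange(2)[:, None], B]   # broadcasting the row indices instead
assert (out1 == out2).all()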