I have a 1-D NumPy array where I create a rolling window and then compute the np.nanstd:
import numpy as np

def rolling_window(a, window):
    a = np.asarray(a)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

if __name__ == "__main__":
    n = 100_000_000
    nan_indices = np.random.choice(np.arange(n), size=1000, replace=False)
    T = np.random.rand(n)
    T[nan_indices] = np.nan
    m = 50
    np.nanstd(rolling_window(T, m), axis=T.ndim)
However, I noticed that not only is this extremely time consuming, it also uses a lot of memory. Is there a way to improve both the memory and speed performance (Numba is an option)?
NumPy vectorized
After grinding through the math, here's what I ended up with: a few np.convolve calls and some masking give a vectorized NumPy solution -
def nanstd(a, W):
    k = np.ones(W, dtype=int)
    m = ~np.isnan(a)                                  # mask of valid entries
    a0 = np.where(m, a, 0)                            # NaNs replaced by 0
    n = np.convolve(m, k, 'valid')                    # per-window count of valid entries
    c1 = np.convolve(a0, k, 'valid')                  # per-window sum
    f2 = c1**2
    p2 = f2/n**2                                      # squared per-window mean
    f1 = np.convolve((a0**2)*m, k, 'valid') + n*p2    # sum of squares + n*mean**2
    out = np.sqrt((f1 - (2/n)*f2)/n)
    return out
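A quick sanity check against the strided approach from the question (a small n here, just to confirm the two agree; windows with at least one valid entry assumed):
T_chk = np.random.rand(10_000)
T_chk[np.random.choice(10_000, size=10, replace=False)] = np.nan
print(np.allclose(nanstd(T_chk, 50),
                  np.nanstd(rolling_window(T_chk, 50), axis=1)))  # True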
A complete explanation is at the end of this post.
Pandas equivalent
Here's the equivalent pandas version, which isn't too bad on performance -
import pandas as pd

def pdroll(T, m):
    return pd.Series(T).rolling(m).std(ddof=0).values[m-1:]
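A minimal agreement check on NaN-free data. Note that with NaNs present, rolling's default min_periods (equal to the window size) turns any window containing a NaN into NaN, unlike np.nanstd; pass min_periods=1 to rolling for a closer match.
T_chk = np.random.rand(1_000)  # no NaNs
print(np.allclose(pdroll(T_chk, 20), nanstd(T_chk, 20)))  # True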
Benchmarking
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
def setup(n):
    nan_indices = np.random.choice(np.arange(n), size=10, replace=False)
    T = np.random.rand(n)
    T[nan_indices] = np.nan
    return T
import benchit
f = {'rolling': lambda T, m: np.nanstd(rolling_window(T, m), axis=T.ndim),
     'pdroll': pdroll, 'conv': nanstd}
in_ = {(n, w): (setup(n), w) for n in 10**np.arange(2, 6) for w in [5, 10, 20, 50, 80, 100]}
t = benchit.timings(f, in_, multivar=True)
t.plot(logx=True, sp_ncols=2, save='timings.png', dpi=200)
The NumPy version does well on smaller window sizes, while pandas pulls ahead on larger ones.
NumPy vectorized : explanation of the NumPy nanstd version
Basically, np.nanstd computes the std while ignoring NaNs. Now, std can be computed from the mean.
Thus, for an array a with no NaNs, it would be :
np.sqrt(np.mean((a-np.mean(a))**2)) # (1)
Let's prove it :
In [43]: a = np.arange(1,6).astype(float)
In [44]: np.nanstd(a)
Out[44]: 1.4142135623730951
In [45]: np.sqrt(np.mean((a-np.mean(a))**2))
Out[45]: 1.4142135623730951
Now, let's say, we have a NaN in it :
In [46]: a[2] = np.nan
In [47]: a
Out[47]: array([ 1., 2., nan, 4., 5.])
The std with nanstd would be :
In [48]: np.nanstd(a)
Out[48]: 1.5811388300841898
Let's figure out the equivalent one based on (1).
Let's start with (a-np.mean(a))**2.
This one : ?
In [72]: (a-np.mean(a))**2
Out[72]: array([nan, nan, nan, nan, nan])
No!
This one : ?
In [73]: (a0 - np.sum(a0)/n)**2
Out[73]: array([4., 1., 9., 1., 4.])
No! Because a is :
In [76]: a
Out[76]: array([ 1., 2., nan, 4., 5.])
We need to make the NaN position one 0.
This one : ?
In [75]: m*((a0 - np.sum(a0)/n)**2)
Out[75]: array([4., 1., 0., 1., 4.])
Yes!
Then, what's np.mean((a-np.mean(a))**2)? It would be the sum of those in [75] divided by n :
In [77]: np.sum(m*((a0-np.sum(a0)/n)**2))/n
Out[77]: 2.5
Hence, the final std value :
In [78]: np.sqrt(np.sum(m*((a0-np.sum(a0)/n)**2))/n)
Out[78]: 1.5811388300841898
Summarizing :
In [55]: m = ~np.isnan(a) # (2)
...: a0 = np.where(m, a, 0)
...: n = m.sum()
...: out0 = np.sqrt(np.sum(m*((a0-np.sum(a0)/n)**2))/n)
In [56]: out0
Out[56]: 1.5811388300841898
The next part is incorporating the sliding nature. So, we need to do (2) in a sliding fashion; the first two steps remain the same.
Hence, it starts off with :
m = ~np.isnan(a)
a0 = np.where(m, a, 0)
But the last two would change, let's see how.
Let's focus on the final step to compute out0. We have :
m*((a0-np.sum(a0)/n)**2)
Then, we compute the summation :
np.sum(m*((a0-np.sum(a0)/n)**2))
We have : (a-b)**2 = a**2 + b**2 - 2*a*b. So, the earlier step becomes :
np.sum(m*(a0**2 + (np.sum(a0)/n)**2 - 2*a0*np.sum(a0)/n))
Further re-arranging (note that m*a0 == a0, so the mask drops out of the cross term) leads to :
np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - np.sum(2*a0*np.sum(a0)/n)
np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - 2*np.sum(a0*np.sum(a0))/n
np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - (2/n)*np.sum(a0*np.sum(a0)) # (3)
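A quick numeric check of (3) against the direct form, reusing m, a0, n from In [55] above:
lhs = np.sum(m*((a0 - np.sum(a0)/n)**2))
rhs = np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - (2/n)*np.sum(a0*np.sum(a0))
print(np.isclose(lhs, rhs))  # True, both are 10.0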
Let's focus on the first two parts for the summation.
Also, let's take a sample case to make things concrete. We will set up two datasets - one for the complete array and another for a windowed version of it.
Setup :
#=========================== 1. Complete setup
a = np.arange(1,10).astype(float)
a[[2,5]] = np.nan
W = 5
k = np.ones(W, dtype=int)

m_comp = ~np.isnan(a)
a0_comp = np.where(m_comp, a, 0)
n_comp = np.convolve(m_comp, k, 'valid')
c1 = np.convolve(a0_comp, k, 'valid')
c2 = np.convolve((a0_comp**2)*m_comp, k, 'valid')

#=========================== 2. Windowed setup
a1 = np.arange(1,6).astype(float)
a1[2] = np.nan

m = ~np.isnan(a1)
a0 = np.where(m, a1, 0)
n = m.sum()
out0 = np.sqrt(np.sum(m*((a0-np.sum(a0)/n)**2))/n)
From the windowed setup, we have :
In [51]: np.sum(m*(a0**2 + (np.sum(a0)/n)**2))
Out[51]: 82.0
In [52]: np.sum(m*(a0**2) + m*((np.sum(a0)/n)**2))
Out[52]: 82.0
In [53]: np.sum(m*(a0**2)) + np.sum(m*((np.sum(a0)/n)**2))
Out[53]: 82.0
First summation part :
In [86]: np.sum(m*(a0**2))
Out[86]: 46.0
# complete setup version :
In [87]: c2
Out[87]: array([ 46., 45., 90., 154., 219.])
Second summation part :
In [54]: np.sum(m*((np.sum(a0)/n)**2))
Out[54]: 36.0
# complete setup version :
In [55]: n_comp*(c1/n_comp)**2
Out[55]:
array([ 36. , 40.33333333, 85.33333333, 144. ,
210.25 ])
The remaining piece of the puzzle from (3) is :
In [79]: (2/n)*np.sum(a0*np.sum(a0))
Out[79]: 72.0
Let's focus on the meat of it :
In [80]: np.sum(a0*np.sum(a0))
Out[80]: 144.0
On the complete setup, it would correspond to :
In [81]: c1**2
Out[81]: array([144., 121., 256., 576., 841.])
Thus, for the entire remaining piece :
In [82]: (2/n)*np.sum(a0*np.sum(a0))
Out[82]: 72.0
# complete setup version :
In [83]: (2/n_comp)*c1**2
Out[83]:
array([ 72. , 80.66666667, 170.66666667, 288. ,
420.5 ])
Hence, (3) and its complete version counterpart would be :
In [89]: np.sum(m*(a0**2 + (np.sum(a0)/n)**2)) - (2/n)*np.sum(a0*np.sum(a0))
Out[89]: 10.0
In [90]: c2 + n_comp*(c1/n_comp)**2 - (2/n_comp)*c1**2
Out[90]: array([10. , 4.66666667, 4.66666667, 10. , 8.75 ])
To get the final std values, we need to divide by the count of valid ones per window and then apply sqrt :
In [99]: np.sqrt((c2 + n_comp*(c1/n_comp)**2 - (2/n_comp)*c1**2)/n_comp)
Out[99]: array([1.58113883, 1.24721913, 1.24721913, 1.58113883, 1.47901995])
Hence, with some cleanup, we end up with the final nanstd version.
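As a final sanity check, the complete-setup expression matches a plain loop of np.nanstd over the windows (reusing a, W, c1, c2, n_comp from the setup above):
ref = np.array([np.nanstd(a[i:i+W]) for i in range(len(a)-W+1)])
out = np.sqrt((c2 + n_comp*(c1/n_comp)**2 - (2/n_comp)*c1**2)/n_comp)
print(np.allclose(out, ref))  # True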
Related
I have a matrix A and a tensor b of size (1,3) - so a vector of size 3.
I want to compute
C = b1 * A + b2 * A^2 + b3 * A^3 where ^n is the n-th power of A.
At the end, C should have the same shape as A. How can I do this efficiently?
Let's try:
import torch

A = torch.ones(1,2,3)
b_vals = torch.tensor([2,3,4])
powers = torch.tensor([1,2,3])

# A[...,None]**powers has shape (1,2,3,3): the last axis holds A, A**2, A**3.
# Weight each power by the matching b value, then sum that axis away.
C = (A[...,None]**powers * b_vals).sum(-1)
Output:
tensor([[[9., 9., 9.],
         [9., 9., 9.]]])
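As a quick check against the explicit formula (same A and b_vals as above; with the all-ones A every entry is 2+3+4 = 9):
C_ref = b_vals[0]*A + b_vals[1]*A**2 + b_vals[2]*A**3
print(torch.allclose(C, C_ref))  # True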
This is what model.predict returns. How can I convert this tuple into columns of a DataFrame?
(array([1., 1., 1., ..., 1., 1., 1.]), array([[0.46502338, 0.53497662],
[0.47072865, 0.52927135],
[0.4696557 , 0.5303443 ],
...,
[0.47139825, 0.52860175],
[0.46367829, 0.53632171],
[0.46586898, 0.53413102]]))
<class 'tuple'>
Neither of these is working for me:
pd.DataFrame(dict(class_pred=tuple[0], prob_0=tuple[1], prob_1=tuple[2]))
pd.DataFrame(np.column_stack(tuple),columns=['class_pred','prob_0','prob_1'])
I would like to obtain something like this:
class_pred prob_0 prob_1
1 0.470728 0.5292713
AniSkywalker's solution works perfectly.
type(data)
tuple
print(data)
(array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
array([[0.46502338, 0.53497662],
[0.47072865, 0.52927135],
[0.4696557 , 0.5303443 ],
[0.46511921, 0.53488079],
[0.46739934, 0.53260066],
[0.47387646, 0.52612354],
[0.4737461 , 0.5262539 ],
[0.47052631, 0.52947369],
[0.47658316, 0.52341684],
[0.47222654, 0.52777346]]))
df_pred = pd.DataFrame(data=dict(pred=data[0], prob_0=data[1][:,0], prob_1=data[1][:,1]))
print(df_pred)
pred prob_0 prob_1
0 1.0 0.465023 0.534977
1 1.0 0.470729 0.529271
2 1.0 0.469656 0.530344
3 1.0 0.465119 0.534881
4 1.0 0.467399 0.532601
5 1.0 0.473876 0.526124
6 1.0 0.473746 0.526254
7 1.0 0.470526 0.529474
8 1.0 0.476583 0.523417
9 1.0 0.472227 0.527773
I'm assuming your data is of the form ((n), (n, 2)) so that:
import numpy as np
n = 5
data = (np.random.rand(n), np.random.rand(n, 2))
provides a reasonable estimate of what your output looks like.
Let's say that data is:
(array([0.27856312, 0.66255123, 0.47976175, 0.59381106, 0.82096555]), array([[0.53719357, 0.55803381],
[0.5749893 , 0.09712089],
[0.91607789, 0.21579499],
[0.50163898, 0.39188127],
[0.60427654, 0.07801227]]))
Your dict method actually works with one modification:
import pandas as pd
df = pd.DataFrame(data=dict(class_pred=data[0], prob_0=data[1][:,0], prob_1=data[1][:,1]))
Notice that prob_0 and prob_1 are both derived from the second tuple element, but using NumPy's column indexing we can split out the individual columns as you described.
Let's take data[1][:,0], for example: first, we select the second element of the data tuple, which is the (n, 2) matrix. Then, we select the first column (0) from all rows (:). The result is a vector of the first element of every row in that matrix.
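In isolation, the column indexing looks like this (a tiny made-up matrix, just for illustration):
mat = np.array([[0.1, 0.9],
                [0.2, 0.8],
                [0.3, 0.7]])
mat[:, 0]  # array([0.1, 0.2, 0.3]) - first column, all rows
mat[:, 1]  # array([0.9, 0.8, 0.7]) - second column, all rows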
Using my made-up numbers, df.head() should give you:
class_pred prob_0 prob_1
0 0.278563 0.537194 0.558034
1 0.662551 0.574989 0.097121
2 0.479762 0.916078 0.215795
3 0.593811 0.501639 0.391881
4 0.820966 0.604277 0.078012
x = np.arange(0.3, 12.5, 0.6)
print(x)
[ 0.3 0.9 1.5 2.1 2.7 3.3 3.9 4.5 5.1 5.7 6.3 6.9 7.5 8.1 8.7 9.3 9.9 10.5 11.1 11.7 12.3]
x = np.arange(0.3, 12.5, 0.6, int)
print(x)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
When dtype=int is specified, arange converts the start, stop, and step to that type first.
So it works with int(start), int(stop), int(step).
Hence, in your case, int(0.3) == 0 and int(0.6) == 0, so the start and step both become 0 and you get an array full of 0s.
This problem has been discussed with explanation here:
https://github.com/numpy/numpy/issues/2457
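A rough sketch of that logic in Python (an approximation for illustration, not the actual C implementation): the output length comes from the original float arguments, but the values are then generated with the integer-cast start and step -
start, stop, step = 0.3, 12.5, 0.6
length = int(np.ceil((stop - start) / step))              # 21, computed from the floats
print([int(start) + i*int(step) for i in range(length)])  # 21 zeros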
First let's skip the complexity of a float step, and use a simple integer start and stop:
In [141]: np.arange(0,5)
Out[141]: array([0, 1, 2, 3, 4])
In [142]: np.arange(0,5, dtype=int)
Out[142]: array([0, 1, 2, 3, 4])
In [143]: np.arange(0,5, dtype=float)
Out[143]: array([0., 1., 2., 3., 4.])
In [144]: np.arange(0,5, dtype=complex)
Out[144]: array([0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j])
In [145]: np.arange(0,5, dtype='datetime64[D]')
Out[145]:
array(['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-04',
'1970-01-05'], dtype='datetime64[D]')
Even bool works, within a certain range:
In [149]: np.arange(0,1, dtype=bool)
Out[149]: array([False])
In [150]: np.arange(0,2, dtype=bool)
Out[150]: array([False, True])
In [151]: np.arange(0,3, dtype=bool)
ValueError: no fill-function for data-type.
In [156]: np.arange(0,3).astype(bool)
Out[156]: array([False, True, True])
There are only 2 possible boolean values, so asking for a longer range should produce some sort of error.
arange is compiled code, so we can't readily examine its logic (but you are welcome to search for the C code on github).
The examples show that it does, in some sense, convert the parameters to the corresponding dtype and perform the iteration on that. It doesn't simply generate the range and convert to the dtype at the end.
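You can see the difference by doing the end-conversion yourself: the iteration then stays in float and only the final values are truncated:
np.arange(0.3, 12.5, 0.6).astype(int)
# array([ 0,  0,  1,  2,  2,  3,  3,  4,  5,  5,  6,  6,  7,  8,  8,  9,  9,
#        10, 11, 11, 12])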
I was wondering if there is any pandas equivalent to cumsum() or cummax() etc. for median: e.g. cummedian().
So that if I have, for example this dataframe:
   a
1  5
2  7
3  6
4  4
what I want is something like:
df['a'].cummedian()
which should output:
5
6
6
5.5
You can use expanding.median -
df.a.expanding().median()
1 5.0
2 6.0
3 6.0
4 5.5
Name: a, dtype: float64
Timings
df = pd.DataFrame({'a' : np.arange(1000000)})
%timeit df['a'].apply(cummedian())
1 loop, best of 3: 1.69 s per loop
%timeit df.a.expanding().median()
1 loop, best of 3: 838 ms per loop
The winner is expanding.median by a huge margin. Divakar's method is memory intensive and suffers memory blowout at this size of input.
We could create NaN-filled subarrays as rows with a strides-based function, like so -
def nan_concat_sliding_windows(x):
    n = len(x)
    add_arr = np.full(n-1, np.nan)        # NaN padding in front
    x_ext = np.concatenate((add_arr, x))
    strided = np.lib.stride_tricks.as_strided
    nrows = len(x_ext)-n+1
    s = x_ext.strides[0]
    return strided(x_ext, shape=(nrows,n), strides=(s,s))
Sample run -
In [56]: x
Out[56]: array([5, 6, 7, 4])
In [57]: nan_concat_sliding_windows(x)
Out[57]:
array([[ nan, nan, nan, 5.],
[ nan, nan, 5., 6.],
[ nan, 5., 6., 7.],
[ 5., 6., 7., 4.]])
Thus, to get the expanding (cumulative) median values for an array x, we would have a vectorized solution, like so -
np.nanmedian(nan_concat_sliding_windows(x), axis=1)
Hence, the final solution would be -
In [54]: df
Out[54]:
   a
1  5
2  7
3  6
4  4
In [55]: pd.Series(np.nanmedian(nan_concat_sliding_windows(df.a.values), axis=1))
Out[55]:
0 5.0
1 6.0
2 6.0
3 5.5
dtype: float64
A faster solution for the specific cumulative median
In [1]: import timeit
In [2]: setup = """import bisect
...: import pandas as pd
...: def cummedian():
...: l = []
...: info = [0, True]
...: def inner(n):
...: bisect.insort(l, n)
...: info[0] += 1
...: info[1] = not info[1]
...: median = info[0] // 2
...: if info[1]:
...: return (l[median] + l[median - 1]) / 2
...: else:
...: return l[median]
...: return inner
...: df = pd.DataFrame({'a': range(20)})"""
In [3]: timeit.timeit("df['cummedian'] = df['a'].apply(cummedian())",setup=setup,number=100000)
Out[3]: 27.11604686321956
In [4]: timeit.timeit("df['expanding'] = df['a'].expanding().median()",setup=setup,number=100000)
Out[4]: 48.457676260100335
In [5]: 48.4576/27.116
Out[5]: 1.7870482372031273
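For reference, applying the closure to the example column from the question (assuming the cummedian definition from the setup string above is in scope) gives the expected values:
f = cummedian()                      # fresh sorted-list state per column
print([f(v) for v in [5, 7, 6, 4]])  # [5, 6.0, 6, 5.5]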
I am looking for a succinct way to go from:
a = numpy.array([1,4,1,numpy.nan,2,numpy.nan])
to:
b = numpy.array([1,5,6,numpy.nan,8,numpy.nan])
The best I can do currently is:
b = numpy.insert(numpy.cumsum(a[numpy.isfinite(a)]), (numpy.argwhere(numpy.isnan(a)) - numpy.arange(len(numpy.argwhere(numpy.isnan(a))))), numpy.nan)
Is there a shorter way to accomplish the same? What about doing a cumsum along an axis of a 2D array?
Pandas is a library built on top of numpy. Its
Series class has a cumsum method, which preserves the NaNs and is considerably faster than the solution proposed by DSM:
In [15]: a = np.arange(10000.0)
In [16]: a[1] = np.nan
In [17]: %timeit a*0 + np.nan_to_num(a).cumsum()
1000 loops, best of 3: 465 us per loop
In [18]: s = pd.Series(a)
In [19]: s.cumsum()
Out[19]:
0 0
1 NaN
2 2
3 5
...
9996 49965005
9997 49975002
9998 49985000
9999 49994999
Length: 10000
In [20]: %timeit s.cumsum()
10000 loops, best of 3: 175 us per loop
How about (for not-too-big arrays):
In [34]: import numpy as np
In [35]: a = np.array([1,4,1,np.nan,2,np.nan])
In [36]: a*0 + np.nan_to_num(a).cumsum()
Out[36]: array([ 1., 5., 6., nan, 8., nan])
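The same trick extends to the 2-D axis case asked about, since both the multiply and nan_to_num work elementwise:
a2 = np.array([[1., np.nan, 2.], [3., 4., np.nan]])
a2*0 + np.nan_to_num(a2).cumsum(axis=1)
# array([[ 1., nan,  3.],
#        [ 3.,  7., nan]])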
Masked arrays are for just this type of situation.
>>> import numpy as np
>>> from numpy import ma
>>> a = np.array([1,4,1,np.nan,2,np.nan])
>>> b = ma.masked_array(a,mask = (np.isnan(a) | np.isinf(a)))
>>> b
masked_array(data = [1.0 4.0 1.0 -- 2.0 --],
mask = [False False False True False True],
fill_value = 1e+20)
>>> c = b.cumsum()
>>> c
masked_array(data = [1.0 5.0 6.0 -- 8.0 --],
mask = [False False False True False True],
fill_value = 1e+20)
>>> c.filled(np.nan)
array([ 1., 5., 6., nan, 8., nan])
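As a side note, ma.masked_invalid builds the same mask (NaN and inf) in one call:
>>> c = ma.masked_invalid(a).cumsum()
>>> c.filled(np.nan)
array([ 1.,  5.,  6., nan,  8., nan])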