Sum elements of list ignoring NaN without numpy - python

I can't use numpy. How can I emulate the .nansum method? Example:
x = [1, float('nan'), 2, 10]

You can pass a generator expression to sum that uses math.isnan to ignore NaN elements.
from math import isnan
print(sum(e for e in x if not isnan(e)))
Alternatively, using the fact that NaN is not equal to itself:
print(sum(e for e in x if e == e))
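Putting it together, a minimal runnable sketch with the example list (using math.nan for the NaN entries):
from math import isnan, nan
x = [1, nan, 2, 10]
# Skip NaN entries explicitly with math.isnan
print(sum(e for e in x if not isnan(e)))  # 13
# Or rely on the fact that NaN != NaN
print(sum(e for e in x if e == e))  # 13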

Cleaning outliers inside a column with interpolation

I'm trying to do the following.
I have some data with wrong values (x<=0 or x>=1100) inside a dataframe.
I am trying to change those values to values inside an acceptable range.
For the time being, this is what I do, code-wise:
def while_non_nan(A, k):
    init = k
    if k+1 >= len(A)-1:
        return A.iloc[k-1]
    while np.isnan(A[k+1]):
        k += 1
    # Calculate the value.
    n = k - init + 1
    value = (n*A.iloc[init-1] + A.iloc[k]) / (n+1)
    return value

evoli.loc[evoli['T1'] >= 1100, 'T1'] = np.nan
evoli.loc[evoli['T1'] <= 0, 'T1'] = np.nan
inds = np.where(np.isnan(evoli))
# Place column means in the indices. Align the arrays using take
for k in inds[0]:
    evoli['T1'].iloc[k] = while_non_nan(evoli['T1'], k)
I transform the outlier values into NaN.
Afterwards, I get the positions of those NaN.
Finally, I replace each NaN with the mean of the previous value and the next one.
Since several NaN can sit next to each other, while_non_nan searches for the next non-NaN value and computes a weighted mean.
Example of what I'm hoping to get:
Input :
[nan 0 1 2 nan 4 nan nan 7 nan ]
Output:
[0 0 1 2 3 4 5 6 7 7 ]
Hope it is clear enough. Thanks!
Pandas has a built-in interpolation you could use after setting your out-of-range values to NaN:
from numpy import NaN
import pandas as pd
df = pd.DataFrame({"T1": [1, 2, NaN, 3, 5, NaN, NaN, 4, NaN]})
df["T1"] = df["T1"].interpolate(method='linear', axis=0).ffill().bfill()
print(df)
interpolate is a DataFrame (and Series) method that fills NaN values using the specified interpolation method (linear in this case). Calling .ffill() (forward fill) and .bfill() (backward fill) afterwards ensures the first and last items are also replaced if needed, using the nearest interpolated value. If you want a fancier strategy for the first and last items, you need to write it yourself.
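Applied to the example from the question, a quick sketch showing this reproduces the expected output:
import numpy as np
import pandas as pd
s = pd.Series([np.nan, 0, 1, 2, np.nan, 4, np.nan, np.nan, 7, np.nan])
# Interior NaNs are interpolated linearly; the leading NaN is then back-filled
# and any trailing NaN forward-filled from the nearest valid value.
filled = s.interpolate(method='linear').ffill().bfill()
print(filled.tolist())  # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.0]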

Pandas groupby mean() not ignoring NaNs

If I calculate the mean of a groupby object and one of the groups contains a NaN (or several), the NaNs are ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive the following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I probably could write my own aggregation function to return NaN as soon as a NaN is within the group. That function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips the NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
There is mean(skipna=False), but it does not work here
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on the comments by Serge Ballesta and RoelAdriaans:
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply but slower than the default sum:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
method3 = c.groupby('b').sum()
# group keys ('b' values) whose 'a' contains at least one NaN
nan_groups = c.loc[c['a'].isna(), 'b'].to_list()
method3.loc[method3.index.isin(nan_groups)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group (the mask is True only where the count of non-NaN values equals the group size):
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in the group are NaN (cnt / cnt is 1 for groups with at least one valid value and NaN for groups with none):
cnt = gb.count()
out = gb.sum() * (cnt / cnt)
out
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN

np.logical_and operator not working as it should (numpy)

I have a dataset (ndarray, float 32), for example:
[-3.4028235e+38 -3.4028235e+38 -3.4028235e+38 ... 1.2578617e-01
1.2651859e-01 1.3053264e-01] ...
I want to remove all values below 0 or greater than 1, so I use:
with rasterio.open(raster_file) as src:
    h = src.read(1)
i = h[0]
i[np.logical_and(i >= 0.0, i <= 1.0)]
Obviously the first entries (i.e. -3.4028235e+38) should be removed, but they still appear after the operator is applied. I'm wondering if this is related to the scientific notation and whether a pre-processing step is required, but I can't see what exactly. Any ideas?
To simplify this, here is the code again:
pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 1.2578617e-01, 1.2651859e-01, 1.3053264e-01]
pp[np.logical_and(pp >= 0.0, pp <= 1.0)]
print (pp)
And the result
pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 0.12578617, 0.12651859, 0.13053264]
So the first 3 entries still remain.
The problem is that you are not removing the values you selected; you are just selecting them.
If you want to remove them, you should probably convert them to NaN, like so:
from numpy import random, nan, logical_and
a = random.randn(10, 3)
print(a)
a[logical_and(a > 0, a < 1)] = nan
print(a)
Output example
[[-0.95355719 nan nan]
[-0.21268393 nan -0.24113676]
[-0.58929128 nan nan]
[ nan -0.89110972 nan]
[-0.27453321 1.07802157 1.60466863]
[-0.34829213 nan 1.51556019]
[-0.4890989 nan -1.08481203]
[-2.17016962 nan -0.65332871]
[ nan 1.58937678 1.79992471]
[ nan -0.91716538 1.60264461]]
Alternatively, you can look into masked arrays.
Silly mistake, I had to wrap the array in a numpy array, then assign a variable to the newly constructed array, like so:
j = np.array(pp)
mask = j[np.logical_and(j >= 0.0, j <= 1.0)]
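With the data from the question, a quick check of what that selection returns (only the values inside [0, 1] remain in mask):
import numpy as np
pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 1.2578617e-01, 1.2651859e-01, 1.3053264e-01]
j = np.array(pp)
mask = j[np.logical_and(j >= 0.0, j <= 1.0)]
print(mask)  # [0.12578617 0.12651859 0.13053264]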

Replace (every) element in a list by the median of the nearest neighbors

I have an array A, say:
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
And I wish to create a new array B by replacing each element in A by the median of its four nearest neighbors, without taking into account the value at the given position... for example:
B[2] = np.median([A[0], A[1], A[3], A[4]]) (=3)
The thing is that I need to perform this on a gigantic A and I want to optimize times, so I want to avoid for loops or similar. And... I don't care about the result at the edges.
I already tried scipy.ndimage.filters.median_filter but it is not producing the desired output:
import scipy.ndimage
B = scipy.ndimage.filters.median_filter(A,footprint=[1,1,0,1,1],mode='wrap')
which produces B=[7,4,4,5,6,7,6,6], which is clearly not the correct answer.
Any idea is welcome.
One way could be to use np.roll to shift the numbers in your array, such as:
A_1 = np.roll(A,1)
# output: array([8, 1, 2, 3, 4, 5, 6, 7])
And then the same thing with rolling by -2, -1 and 2:
A_2 = np.roll(A,2)
A_m1 = np.roll(A,-1)
A_m2 = np.roll(A,-2)
Now you just need to sum your 4 arrays, as for each index you have the 4 neighbors in one of them:
B = (A_1 + A_2 + A_m1 + A_m2)/4.
And as you said you don't care about the edges, I think it works for you!
EDIT: I guess I was so focused on the rolling idea that I mixed up mean and median. The median can be calculated with B = np.median([A_1, A_2, A_m1, A_m2], axis=0).
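Putting the rolled arrays and the median together, a minimal end-to-end sketch (the edge values wrap around and can be ignored, as discussed):
import numpy as np
A = np.array([1, 2, 3, 4, 5, 6, 7, 8])
# Each roll lines up one of the four neighbors with the original index.
A_1, A_2 = np.roll(A, 1), np.roll(A, 2)
A_m1, A_m2 = np.roll(A, -1), np.roll(A, -2)
B = np.median([A_1, A_2, A_m1, A_m2], axis=0)
print(B)  # interior values are 3. 4. 5. 6.; the two entries at each edge wrap around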
I'd make a rolling, central window of length 5 in pandas, and apply the median function to the values of the window, the middle one masked away:
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
mask = np.array(np.ones(5), bool)
mask[5//2] = False
import pandas as pd
df = pd.DataFrame(A)
r5 = df.rolling(5, center=True)
result = r5.apply(lambda x: np.median(x[mask]))
result
0
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 NaN
7 NaN

Interpolation on DataFrame in pandas

I have a DataFrame, say a volatility surface with index as time and columns as strike. How do I do two-dimensional interpolation? I can reindex, but how do I deal with NaN? I know we can fillna(method='pad'), but that is not even linear interpolation. Is there a way we can plug in our own method to do the interpolation?
You can use DataFrame.interpolate to get a linear interpolation.
In : df = pandas.DataFrame(numpy.random.randn(5,3), index=['a','c','d','e','g'])
In : df
Out:
0 1 2
a -1.987879 -2.028572 0.024493
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
g -1.632493 0.938456 0.492695
In : df2 = df.reindex(['a','b','c','d','e','f','g'])
In : df2
Out:
0 1 2
a -1.987879 -2.028572 0.024493
b NaN NaN NaN
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
f NaN NaN NaN
g -1.632493 0.938456 0.492695
In : df2.interpolate()
Out:
0 1 2
a -1.987879 -2.028572 0.024493
b 0.052363 -1.729055 0.114652
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
f -1.330113 1.134579 0.000958
g -1.632493 0.938456 0.492695
For anything more complex, you need to roll out your own function that deals with a Series object, fills NaN values as you like, and returns another Series object.
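A minimal sketch of that idea, using a hypothetical fill_column helper (the strategy here, filling with the column mean, is only a placeholder):
import numpy as np
import pandas as pd

def fill_column(s):
    # Hypothetical custom filler: replace NaN with the column mean.
    # Swap in whatever interpolation logic your data needs.
    return s.fillna(s.mean())

surface = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 2.0, 4.0]})
print(surface.apply(fill_column))  # apply passes each column to fill_column as a Series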
Old thread, but I thought I would share my solution for 2D extrapolation/interpolation, respecting the index values, which also works on demand. The code ended up a bit weird, so let me know if there is a better solution:
import pandas
from numpy import nan
import numpy
dataGrid = pandas.DataFrame({1: {1: 1, 3: 2},
                             2: {1: 3, 3: 4}})

def getExtrapolatedInterpolatedValue(x, y):
    global dataGrid
    if x not in dataGrid.index:
        dataGrid.loc[x] = nan
        dataGrid = dataGrid.sort_index()
        dataGrid = dataGrid.interpolate(method='index', axis=0).ffill(axis=0).bfill(axis=0)
    if y not in dataGrid.columns.values:
        dataGrid = dataGrid.reindex(columns=numpy.append(dataGrid.columns.values, y))
        dataGrid = dataGrid.sort_index(axis=1)
        dataGrid = dataGrid.interpolate(method='index', axis=1).ffill(axis=1).bfill(axis=1)
    return dataGrid[y][x]

print(getExtrapolatedInterpolatedValue(2, 1.4))
>>2.3
