Pandas groupby mean() not ignoring NaNs

When I calculate the mean of a groupby object and one of the groups contains NaNs, the NaNs are ignored. Even with np.mean, the result is still the mean of the valid numbers only. I would expect NaN to be returned as soon as a group contains a single NaN. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive following result:
a
b
1 1.5
2 NaN
I am aware that I can replace the NaNs beforehand, and that I can probably write my own aggregation function that returns NaN as soon as a group contains one. That function wouldn't be optimized, though.
Do you know of an argument that achieves the desired behaviour with the optimized functions?
By the way, I think the desired behaviour was implemented in a previous version of pandas.

By default, pandas skips NaN values. You can make it include them by passing skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN

There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have a skipna parameter, which is meant for exactly this task, but currently (May 2020) there seems to be a bug (issue opened in March 2020) that prevents it from working correctly.
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.

Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))

Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf

There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply, but slower than the built-in mean:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
# built-in mean (skips NaN), then overwrite the groups that contain a NaN in 'a'
method3 = c.groupby('b').mean()
nan_groups = c.loc[c['a'].isna(), 'b'].unique()
method3.loc[method3.index.isin(nan_groups)] = np.nan

I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
'a': [1, 2, 1, 2],
'b': [1, np.nan, 2, 3],
'c': [1, np.nan, 2, np.nan],
'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
out = gb.sum() * (cnt / cnt)  # cnt / cnt is 1 where cnt > 0 and NaN where cnt == 0
out
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
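The count/size trick above can be wrapped into a small helper for the original question (a group mean that becomes NaN as soon as the group contains one NaN). This is only a sketch: the name strict_group_mean is made up for illustration, and it is shown on the real-valued columns only.

```python
import numpy as np
import pandas as pd


def strict_group_mean(frame, key):
    """Group mean that is NaN for any group/column containing a NaN.

    Hypothetical helper, built only from groupby sum/count/size.
    """
    gb = frame.groupby(key)
    cnt = gb.count()                 # non-NaN values per group and column
    siz = gb.size()                  # number of rows per group
    complete = cnt.eq(siz, axis=0)   # True where the group has no NaN
    return (gb.sum() / cnt).where(complete)


df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
})
out = strict_group_mean(df, 'a')
print(out)
```

Group 1 has no missing values, so it keeps its ordinary mean; group 2 has a NaN in both columns, so both entries become NaN.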

Related

Pandas pct_change with moving average

I would like to use pandas' pct_change to compute the rate of change between each value and the previous rolling average (before that value). Here is what I mean:
If I have:
import pandas as pd
df = pd.DataFrame({'data': [1, 2, 3, 7]})
I would expect to get, for window size of 2:
0 NaN
1 NaN
2 1
3 1.8
because roc(3, avg(1, 2)) = (3 - 1.5) / 1.5 = 1, and likewise roc(7, avg(2, 3)) = (7 - 2.5) / 2.5 = 1.8. Using pct_change with the periods parameter just compares against the value n entries back, so it doesn't do the job.
Any ideas on how to do this in an elegant pandas way for any window size?

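Here is one way to do it, using rolling and shift:
# rolling mean of the current and previous value
df['avg'] = df['data'].rolling(2).mean()
# compare each value with the rolling mean that ends one row earlier
prev_avg = df['avg'].shift(1)
df['poc'] = (df['data'] - prev_avg) / prev_avg
df.drop(columns='avg')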
data poc
0 1 NaN
1 2 NaN
2 3 1.0
3 7 1.8
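For an arbitrary window size, the same rolling-then-shift idea can be written as a small function. This is a sketch, and the function name pct_change_vs_rolling_mean is an assumption made for illustration, not a pandas API.

```python
import pandas as pd


def pct_change_vs_rolling_mean(series, window):
    """Rate of change of each value relative to the mean of the
    `window` values that precede it (hypothetical helper)."""
    prev_avg = series.rolling(window).mean().shift(1)
    return (series - prev_avg) / prev_avg


df = pd.DataFrame({'data': [1, 2, 3, 7]})
roc = pct_change_vs_rolling_mean(df['data'], 2)
print(roc)
```

The first `window` entries are NaN because no complete preceding window exists for them.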

How to apply rolling function when all variables in window from multiple columns are required

I'm trying to calculate a rolling statistic that requires all variables in a window from two input columns.
My only solution involves a for loop. Is there a more efficient way, perhaps using Pandas' rolling and apply functions?
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])[1]
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.rolling(2).apply(lambda x: f(x), raw=False) # KeyError: 'a'
I get KeyError: 'a' because df gets passed to f() one series (column) at a time. Specifying axis=1 sends one row and all columns to f(), but neither approach provides the required set of observations.

You could try rolling, mean and sum:
df['result'] = df.rolling(2).mean().sum(axis=1)
a b result
0 1 5 0.0
1 2 6 7.0
2 3 7 9.0
3 4 8 11.0
EDIT
Adding a different answer based upon new information in the question by OP.
Set up the function.
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])
Create the data and dataframe:
a_data = [1,2,3,4]
b_data = [5,6,7,8]
df = pd.DataFrame(data={'a': a_data, 'b': b_data})
a b
0 1 5
1 2 6
2 3 7
3 4 8
I gather after researching coint that you are trying to pass two rolling arrays to f['a'] and f['b']. The following will create the arrays and dataframe.
n=2
arr_a = [df['a'].shift(x).values[::-1][:n] for x in range(len(df['a']))[::-1]]
arr_b = [df['b'].shift(x).values[::-1][:n] for x in range(len(df['b']))[::-1]]
df1 = pd.DataFrame(data={'a': arr_a, 'b': arr_b})
n is the size of the rolling window.
df1
a b
0 [1.0, nan] [5.0, nan]
1 [2.0, 1.0] [6.0, 5.0]
2 [3.0, 2.0] [7.0, 6.0]
3 [4, 3] [8, 7]
Then you can use .apply(f) to send in the rows of arrays.
df1.iloc[(n-1):,].apply(f, axis=1)
Your output is as follows:
1 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
2 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
3 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
dtype: object
When I run this I do get an error for perfectly collinear data, but I suspect that will disappear with real data.
Also, I know a purely vectorized solution might have been faster. I wonder what the performance of this will be like, if it is what you are looking for?
Hats off to @Zero, who really had the solution for this problem here.
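Another way to hand every column of a window to one function is to roll over the row positions and slice the DataFrame inside the callback. The sketch below is an illustration under assumptions: rolling_two_columns and window_stat are made-up names, and window_stat is a simple stand-in for coint so that the example runs without statsmodels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})


def window_stat(win):
    # Stand-in for coint(win['a'], win['b'])[1]; it must return one float.
    return float((win['a'] * win['b']).sum())


def rolling_two_columns(frame, window, func):
    """Apply func to each rolling window of frame, all columns at once."""
    # Roll over the row positions instead of over the data itself.
    positions = pd.Series(np.arange(len(frame)), index=frame.index, dtype=float)
    return positions.rolling(window).apply(
        lambda pos: func(frame.iloc[pos.astype(int)]), raw=True)


result = rolling_two_columns(df, 2, window_stat)
print(result)
```

The callback is still called once per window from Python, so this is a convenience rather than a vectorized speed-up.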

I tried placing the sum before the rolling:
import pandas as pd
import time
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.copy()
s = time.time()
df2.loc[:, 'mean1'] = df.sum(axis = 1).rolling(2).mean()
print(time.time() - s)
s = time.time()
df2.loc[:, 'mean2'] = df.rolling(2).mean().sum(axis=1)
print(time.time() - s)
df2
0.003737926483154297
0.005460023880004883
a b mean1 mean2
0 1 5 NaN 0.0
1 2 6 7.0 7.0
2 3 7 9.0 9.0
3 4 8 11.0 11.0
It is slightly faster than the previous answer and gives the same values, except for the first row (NaN instead of 0.0). On large datasets the difference in speed might be significant.
You can modify it to select the columns of interest only:
s = time.time()
print(df[['a', 'b']].sum(axis = 1).rolling(2).mean())
print(time.time() - s)
0 NaN
1 7.0
2 9.0
3 11.0
dtype: float64
0.0033559799194335938

Why does max() sometimes return nan and sometimes ignores it?

This question is motivated by an answer I gave a while ago.
Let's say I have a dataframe like this
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c':[np.nan, 5, 34]})
a b c
0 1.0 3.0 NaN
1 2.0 NaN 5.0
2 NaN 10.0 34.0
and I want to replace the NaN by the maximum of the row, I can do
df.apply(lambda row: row.fillna(row.max()), axis=1)
which gives me the desired output
a b c
0 1.0 3.0 3.0
1 2.0 5.0 5.0
2 34.0 10.0 34.0
When I, however, use
df.apply(lambda row: row.fillna(max(row)), axis=1)
for some reason it is replaced correctly only in two of three cases:
a b c
0 1.0 3.0 3.0
1 2.0 5.0 5.0
2 NaN 10.0 34.0
Indeed, if I check by hand
max(df.iloc[0, :])
max(df.iloc[1, :])
max(df.iloc[2, :])
Then it prints
3.0
5.0
nan
When doing
df.iloc[0, :].max()
df.iloc[1, :].max()
df.iloc[2, :].max()
it prints the expected
3.0
5.0
34.0
My question is why max() fails in 1 of three cases but not in all 3. Why are the NaN sometimes ignored and sometimes not?

The reason is that max works by taking the first value as the "max seen so far", and then checking each other value to see if it is bigger than the max seen so far. But nan is defined so that comparisons with it always return False --- that is, nan > 1 is false but 1 > nan is also false.
So if you start with nan as the first value in the array, every subsequent comparison will check whether some_other_value > nan. This will always be false, so nan will retain its position as "max seen so far". On the other hand, if nan is not the first value, then when it is reached, the comparison nan > max_so_far will again be false. But in this case that means the current "max seen so far" (which is not nan) will remain the max seen so far, so the nan will always be discarded.
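The running-maximum logic described above can be imitated in a few lines. naive_max is a hypothetical re-implementation written for illustration, not the actual CPython source, but it shows the same order dependence as the built-in max.

```python
import math


def naive_max(values):
    """Running maximum that only uses the > comparison."""
    iterator = iter(values)
    best = next(iterator)        # the first value is the initial "max seen so far"
    for value in iterator:
        if value > best:         # always False when either side is nan
            best = value
    return best


nan = float('nan')
nan_last = naive_max([1.0, 2.0, nan])    # nan is reached later and discarded
nan_first = naive_max([nan, 2.0, 3.0])   # nan is first and never replaced
print(nan_last, nan_first)
print(max([1.0, 2.0, nan]), max([nan, 2.0, 3.0]))
```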

In the first case you are using the pandas Series.max method, which knows how to handle numpy.nan and skips it by default.
In the second case you are using the builtin max function from Python, which is not aware of how to handle numpy.nan. Presumably this effect is due to the fact that any comparison (>, <, == etc.) of numpy.nan with a float gives False. An obvious way to implement max would be to iterate over the iterable (the row in this case), check whether each value is larger than the largest seen so far, and store it as the new maximum if so. Since this larger-than comparison is always False when one of the compared values is numpy.nan, whether the recorded maximum is the number you want or numpy.nan depends entirely on whether the first value is numpy.nan or not.

This is due to the ordering of the elements in the list. First off, if you type
max([1, 2, np.nan])
the result is 2, while
max([np.nan, 2, 3])
gives np.nan. The reason is that the max function goes through the values in the list one by one, with a comparison like this:
if a > b:
Any comparison with nan is False: both np.nan > 2 and 1 > np.nan give False. So when nan comes first it is never replaced as the running maximum, and when it comes later it never replaces the running maximum.

The two are different: max() vs df.max().
max() is the Python built-in function; its argument must be a non-empty iterable. See:
https://docs.python.org/2/library/functions.html#max
The pandas method df.max(skipna=...) has a skipna parameter whose default value is True, which means NA/null values are excluded. See:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html
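As a small illustration of the difference, the sketch below runs the built-in max, Series.max with both skipna settings, and numpy.nanmax on the third row from the question. The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

row = pd.Series([np.nan, 10.0, 34.0])          # third row of the question's frame

builtin_result = max(row)                      # order dependent: nan comes first
skipping = row.max()                           # skipna=True is the default
propagating = row.max(skipna=False)            # a NaN in the row gives NaN
numpy_nanmax = np.nanmax(row.to_numpy())       # NumPy's NaN-ignoring maximum

print(builtin_result, skipping, propagating, numpy_nanmax)
```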

If the problem might be caused by inf values, try replacing them as well as the NaNs:
df[column] = df[column].replace([np.inf, -np.inf], 0.0)
df[column] = df[column].replace([np.nan, -np.nan], 0.0)

pandas groupby and rolling_apply ignoring NaNs

I have a pandas dataframe and I want to calculate the rolling mean of a column (after a groupby clause). However, I want to exclude NaNs.
For instance, if the groupby returns [2, NaN, 1], the result should be 1.5 while currently it returns NaN.
I've tried the following but it doesn't seem to work:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
If I even try this:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: 1)
I'm getting NaN in the output so it must be something to do with how pandas works in the background.
Any ideas?
EDIT:
Here is a code sample with what I'm trying to do:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'], 'value' : [1, 2, 3, np.nan, 2, 3, 4, 1] })
print df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
The result is:
0 NaN
1 NaN
2 2.0
3 NaN
4 2.5
5 NaN
6 3.0
7 2.0
while I wanted to have the following:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0

As always in pandas, sticking to vectorized methods (i.e. avoiding apply) is essential for performance and scalability.
The operation you want to do is a little fiddly as rolling operations on groupby objects are not NaN-aware at present (version 0.18.1). As such, we'll need a few short lines of code:
g1 = df.groupby(['var1'])['value'] # group values
g2 = df.fillna(0).groupby(['var1'])['value'] # fillna, then group values
s = g2.rolling(2).sum() / g1.rolling(2).count() # the actual computation
s.reset_index(level=0, drop=True).sort_index() # drop/sort index
The idea is to sum the values in the window (using sum), count the non-NaN values (using count) and then divide to find the mean. This code gives the following output that matches your desired output:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
Name: value, dtype: float64
Testing this on a larger DataFrame (around 100,000 rows), the run-time was under 100ms, significantly faster than any apply-based methods I tried.
It may be worth testing the different approaches on your actual data as timings may be influenced by other factors such as the number of groups. It's fairly certain that vectorized computations will win out, though.
The approach shown above works well for simple calculations, such as the rolling mean. It will work for more complicated calculations (such as rolling standard deviation), although the implementation is more involved.
The general idea is to look at each simple routine that is fast in pandas (e.g. sum) and then fill any null values with an identity element (e.g. 0). You can then use groupby and perform the rolling operation (e.g. .rolling(2).sum()). The output is then combined with the output(s) of other operations.
For example, to implement groupby NaN-aware rolling variance (of which standard deviation is the square-root) we must find "the mean of the squares minus the square of the mean". Here's a sketch of what this could look like:
def rolling_nanvar(df, window):
    """
    Group df by 'var1' values and then calculate rolling variance,
    adjusting for the number of NaN values in the window.
    Note: user may wish to edit this function to control degrees of
    freedom (n), depending on their overall aim.
    """
    g1 = df.groupby(['var1'])['value']
    g2 = df.fillna(0).groupby(['var1'])['value']
    # fill missing values with 0, square values and groupby
    g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])
    n = g1.rolling(window).count()
    mean_of_squares = g3.rolling(window).sum() / n
    square_of_mean = (g2.rolling(window).sum() / n)**2
    variance = mean_of_squares - square_of_mean
    return variance.reset_index(level=0, drop=True).sort_index()
Note that this function may not be numerically stable (squaring could lead to overflow). pandas uses Welford's algorithm internally to mitigate this issue.
Anyway, this function, although it uses several operations, is still very fast. Here's a comparison with the more concise apply-based method suggested by Yakym Pirozhenko:
>>> df2 = pd.concat([df]*10000, ignore_index=True) # 80000 rows
>>> %timeit df2.groupby('var1')['value'].apply(\
lambda gp: gp.rolling(7, min_periods=1).apply(np.nanvar))
1 loops, best of 3: 11 s per loop
>>> %timeit rolling_nanvar(df2, 7)
10 loops, best of 3: 110 ms per loop
Vectorization is 100 times faster in this case. Of course, depending on how much data you have, you may wish to stick to using apply since it allows you generality/brevity at the expense of performance.
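The identity behind rolling_nanvar ("the mean of the squares minus the square of the mean", with NaN filled by 0 and divided by the non-NaN count) can be checked by hand on a single window. This is a plain NumPy sketch on one made-up window, not the rolling implementation itself.

```python
import numpy as np

window = np.array([2.0, np.nan, 1.0])

valid = ~np.isnan(window)
n = valid.sum()                           # number of non-NaN values
filled = np.where(valid, window, 0.0)     # NaN replaced by 0, the identity for sum

mean_of_squares = (filled ** 2).sum() / n
square_of_mean = (filled.sum() / n) ** 2
variance = mean_of_squares - square_of_mean

# np.nanvar uses ddof=0 by default, which matches dividing by n above.
print(variance, np.nanvar(window))
```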

Does this result match your expectations?
I slightly changed your solution by adding the min_periods parameter and a correct filter for NaN.
In [164]: df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if not np.isnan(i)]), min_periods=1)
Out[164]:
0 1.0
1 2.0
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
dtype: float64

Here is an alternative implementation without list comprehension, but it also fails to populate the first entry of the output with np.nan
means = df.groupby('var1')['value'].apply(
    lambda gp: gp.rolling(2, min_periods=1).apply(np.nanmean))
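One possible way to also get NaN for the incomplete first window of each group is to compute the NaN-skipping rolling mean with min_periods=1 and then blank out the rows whose window is not yet full. The sketch below uses the sample data from the question; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, np.nan, 2, 3, 4, 1]})
window = 2

# Rolling mean per group; with min_periods=1 the NaNs inside a window are skipped.
means = df.groupby('var1')['value'].transform(
    lambda s: s.rolling(window, min_periods=1).mean())

# Keep only rows that have a full window of rows behind them within their group.
full_window = df.groupby('var1').cumcount() >= window - 1
means = means.where(full_window)
print(means)
```

Because transform returns a result aligned with the original index, no reset_index or sort_index step is needed.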

Interpolation on DataFrame in pandas

I have a DataFrame, say a volatility surface with time as the index and strike as the columns. How do I do two-dimensional interpolation? I can reindex, but how do I deal with the NaNs? I know we can use fillna(method='pad'), but that is not even linear interpolation. Is there a way to plug in our own interpolation method?

You can use DataFrame.interpolate to get a linear interpolation.
In : df = pandas.DataFrame(numpy.random.randn(5,3), index=['a','c','d','e','g'])
In : df
Out:
0 1 2
a -1.987879 -2.028572 0.024493
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
g -1.632493 0.938456 0.492695
In : df2 = df.reindex(['a','b','c','d','e','f','g'])
In : df2
Out:
0 1 2
a -1.987879 -2.028572 0.024493
b NaN NaN NaN
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
f NaN NaN NaN
g -1.632493 0.938456 0.492695
In : df2.interpolate()
Out:
0 1 2
a -1.987879 -2.028572 0.024493
b 0.052363 -1.729055 0.114652
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
f -1.330113 1.134579 0.000958
g -1.632493 0.938456 0.492695
For anything more complex, you need to write your own function that takes a Series object, fills the NaN values as you like, and returns another Series object.
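Note that the default linear method treats rows as equally spaced. When the index and column labels are numeric (as with times and strikes), interpolate(method='index') weights by the label values instead, and it can be applied along each axis in turn. The sketch below uses a tiny made-up surface; the numbers are illustrative only.

```python
import pandas as pd

# Rows are times, columns are strikes (made-up values).
surface = pd.DataFrame({90.0: [0.30, 0.26], 110.0: [0.24, 0.20]},
                       index=[1.0, 3.0])

# Add a new time and interpolate down the rows, using the index values.
surface = surface.reindex(index=[1.0, 2.0, 3.0])
surface = surface.interpolate(method='index', axis=0)

# Add a new strike and interpolate across the columns, using the column values.
surface = surface.reindex(columns=[90.0, 100.0, 110.0])
surface = surface.interpolate(method='index', axis=1)

print(surface)
```

This is linear interpolation applied one axis at a time, which for a regular grid amounts to bilinear interpolation.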

Old thread but thought I would share my solution with 2d extrapolation/interpolation, respecting index values, which also works on demand. Code ended up a bit weird so let me know if there is a better solution:
import numpy
import pandas
from numpy import nan

dataGrid = pandas.DataFrame({1: {1: 1, 3: 2},
                             2: {1: 3, 3: 4}})

def getExtrapolatedInterpolatedValue(x, y):
    global dataGrid
    if x not in dataGrid.index:
        # add the missing row (.loc replaces the removed .ix indexer)
        dataGrid.loc[x] = nan
        # sort_index() replaces the removed DataFrame.sort()
        dataGrid = dataGrid.sort_index()
        dataGrid = dataGrid.interpolate(method='index', axis=0).ffill(axis=0).bfill(axis=0)
    if y not in dataGrid.columns.values:
        # add the missing column, then interpolate across the columns
        dataGrid = dataGrid.reindex(columns=numpy.append(dataGrid.columns.values, y))
        dataGrid = dataGrid.sort_index(axis=1)
        dataGrid = dataGrid.interpolate(method='index', axis=1).ffill(axis=1).bfill(axis=1)
    return dataGrid[y][x]

print(getExtrapolatedInterpolatedValue(2, 1.4))
>>2.3
