I have a pandas dataframe and I want to calculate the rolling mean of a column (after a groupby clause). However, I want to exclude NaNs.
For instance, if the rolling window contains [2, NaN, 1], the result should be 1.5, while currently it returns NaN.
I've tried the following but it doesn't seem to work:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
If I even try this:
df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: 1)
I'm getting NaN in the output so it must be something to do with how pandas works in the background.
Any ideas?
EDIT:
Here is a code sample with what I'm trying to do:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'], 'value' : [1, 2, 3, np.nan, 2, 3, 4, 1] })
print df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))
The result is:
0 NaN
1 NaN
2 2.0
3 NaN
4 2.5
5 NaN
6 3.0
7 2.0
while I wanted to have the following:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
As always in pandas, sticking to vectorized methods (i.e. avoiding apply) is essential for performance and scalability.
The operation you want to do is a little fiddly as rolling operations on groupby objects are not NaN-aware at present (version 0.18.1). As such, we'll need a few short lines of code:
g1 = df.groupby(['var1'])['value'] # group values
g2 = df.fillna(0).groupby(['var1'])['value'] # fillna, then group values
s = g2.rolling(2).sum() / g1.rolling(2).count() # the actual computation
s.reset_index(level=0, drop=True).sort_index() # drop/sort index
The idea is to sum the values in the window (using sum), count the non-NaN values (using count) and then divide to find the mean. This code gives the following output, which matches your desired output:
0 NaN
1 NaN
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
Name: value, dtype: float64
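If you want the result attached to the original frame, a short sketch (the column name is just illustrative):
df['rolling_mean_nan_aware'] = s.reset_index(level=0, drop=True).sort_index()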
Testing this on a larger DataFrame (around 100,000 rows), the run-time was under 100ms, significantly faster than any apply-based methods I tried.
It may be worth testing the different approaches on your actual data as timings may be influenced by other factors such as the number of groups. It's fairly certain that vectorized computations will win out, though.
The approach shown above works well for simple calculations, such as the rolling mean. It will work for more complicated calculations (such as rolling standard deviation), although the implementation is more involved.
The general idea is to look at each simple routine that is fast in pandas (e.g. sum) and then fill any null values with an identity element (e.g. 0). You can then use groupby and perform the rolling operation (e.g. .rolling(2).sum()). The output is then combined with the output(s) of other operations.
For example, to implement groupby NaN-aware rolling variance (of which standard deviation is the square-root) we must find "the mean of the squares minus the square of the mean". Here's a sketch of what this could look like:
def rolling_nanvar(df, window):
"""
Group df by 'var1' values and then calculate rolling variance,
adjusting for the number of NaN values in the window.
Note: user may wish to edit this function to control degrees of
freedom (n), depending on their overall aim.
"""
g1 = df.groupby(['var1'])['value']
g2 = df.fillna(0).groupby(['var1'])['value']
# fill missing values with 0, square values and groupby
g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])
n = g1.rolling(window).count()
mean_of_squares = g3.rolling(window).sum() / n
square_of_mean = (g2.rolling(window).sum() / n)**2
variance = mean_of_squares - square_of_mean
return variance.reset_index(level=0, drop=True).sort_index()
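Example usage on the question's df (a sketch; take the square root for a NaN-aware rolling standard deviation):
variance = rolling_nanvar(df, 2)
std = np.sqrt(variance)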
Note that this function may not be numerically stable (squaring could lead to overflow). pandas uses Welford's algorithm internally to mitigate this issue.
Anyway, this function, although it uses several operations, is still very fast. Here's a comparison with the more concise apply-based method suggested by Yakym Pirozhenko:
>>> df2 = pd.concat([df]*10000, ignore_index=True) # 80000 rows
>>> %timeit df2.groupby('var1')['value'].apply(\
lambda gp: gp.rolling(7, min_periods=1).apply(np.nanvar))
1 loops, best of 3: 11 s per loop
>>> %timeit rolling_nanvar(df2, 7)
10 loops, best of 3: 110 ms per loop
Vectorization is 100 times faster in this case. Of course, depending on how much data you have, you may wish to stick to using apply, since it buys you generality/brevity at the expense of performance.
Does this result match your expectations?
I slightly changed your solution by adding the min_periods parameter and using the right filter for NaN.
In [164]: df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2, lambda x: np.mean([i for i in x if not np.isnan(i)]), min_periods=1)
Out[164]:
0 1.0
1 2.0
2 2.0
3 2.0
4 2.5
5 3.0
6 3.0
7 2.0
dtype: float64
Here is an alternative implementation without list comprehension, but it also fails to populate the first entry of the output with np.nan
means = df.groupby('var1')['value'].apply(
lambda gp: gp.rolling(2, min_periods=1).apply(np.nanmean))
I have many pairs of coordinate arrays like so
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of another coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in their values; for example, I would consider the first coordinate in a, b and c to be duplicates.
I could iterate through each array and compare the values one by one using numpy.isclose; however, that will be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
You might want to try the round() function, which will round off the numbers in your lists (by default to the nearest integer; you can also pass the number of decimal places to keep).
The next thing I'd suggest might be too extreme:
concatenate the arrays, put them into a pandas DataFrame and drop_duplicates()
this might not be the solution you want
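A rough sketch of that idea, using only the non-elided points from the question (the tolerance of 2 decimal places is an assumption):
import pandas as pd
a = [(1.001, 3), (1.334, 4.2), (17.83, 3.4)]
b = [(1.002, 3.0001), (1.67, 5.4), (17.8299, 3.4)]
c = [(1.00101, 3.002), (1.3345, 4.202), (18.6, 12.511)]
stacked = pd.concat([pd.DataFrame(arr) for arr in (a, b, c)], ignore_index=True)
unique_coords = stacked.round(2).drop_duplicates().to_numpy()  # rounded, de-duplicated coordinates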
You might want to take a look at numpy.testing if you allow for AssertionError handling.
from numpy import testing as ts
a = np.array((1.001,3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1) # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3. ])
y: array([1. , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
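If you prefer a boolean check over an exception, a small sketch wrapping it in try/except (the helper name is mine; it assumes the np/ts imports above):
def is_close_pair(p, q, decimal=2):
    # True if the two coordinates agree to `decimal` places
    try:
        ts.assert_array_almost_equal(np.asarray(p), np.asarray(q), decimal=decimal)
        return True
    except AssertionError:
        return False
is_close_pair((1.001, 3), (1.00101, 3.002))  # True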
I'm using pandas to give you an intuitive result rather than just numbers. Of course you can expand the solution to your needs.
Say you create a pd.DataFrame from each array and tag each one with the array it belongs to. I am rounding the values to 2 decimal places; you may use whatever tolerance you want:
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
# likewise build dfb and dfc from b and c, tagged 'b' and 'c'
Then, by concatenating, using duplicated and sorting, you get an intuitive DataFrame that might fulfill your needs:
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
x y arr
0 1.00 3.0 a
0 1.00 3.0 b
0 1.00 3.0 c
1 1.33 4.2 a
1 1.33 4.2 c
2 17.83 3.4 a
2 17.83 3.4 b
The indexes are duplicated, so you can simply call reset_index() at the end and use the newly generated column to indicate the corresponding index in each array, i.e.:
index x y arr
0 0 1.00 3.0 a
1 0 1.00 3.0 b
2 0 1.00 3.0 c
3 1 1.33 4.2 a
4 1 1.33 4.2 c
5 2 17.83 3.4 a
6 2 17.83 3.4 b
So, for example, line 0 indicates a duplicate coordinate, found at index 0 of arr a. Line 1 also indicates a duplicate coordinate, found at index 0 of arr b, etc.
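For reference, a minimal sketch of the chain described above (using the concatenated df built earlier):
dup = df[df.duplicated(subset=[0, 1], keep=False)].sort_values(by=[0, 1]).reset_index()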
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may use drop_duplicates:
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1. , 3. ],
[ 1.33, 4.2 ],
[17.83, 3.4 ],
[ 1.67, 5.4 ],
[18.6 , 12.51]])
If I calculate the mean of a groupby object and one of the groups contains NaN(s), the NaNs are ignored. Even when applying np.mean, it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive following result:
a
b
1 1.5
2 NaN
I am aware that I can replace the NaNs beforehand and that I could probably write my own aggregation function to return NaN as soon as a NaN is within the group. This function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips the NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on these comments: #Serge Ballesta, #RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply but slower than the default aggregation:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but it needs more code:
method3 = c.groupby('b').mean()
nan_groups = c.loc[c['a'].isna(), 'b'].unique()   # group keys whose 'a' contains NaN
method3.loc[method3.index.isin(nan_groups)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
'a': [1, 2, 1, 2],
'b': [1, np.nan, 2, 3],
'c': [1, np.nan, 2, np.nan],
'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
gb.sum() * (cnt / cnt)
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
I want to compute the expanding window of just the last few elements in a group...
df = pd.DataFrame({'B': [np.nan, np.nan, 1, 1, 2, 2, 1,1], 'A': [1, 2, 1, 2, 1, 2,1,2]})
df.groupby("A")["B"].expanding().quantile(0.5)
this gives:
1 0 NaN
2 1.0
4 1.5
6 1.0
2 1 NaN
3 1.0
5 1.5
7 1.0
I only really want the last two rows for each group though. The result should be:
1 4 1.5
6 1.0
2 5 1.5
7 1.0
I can easily calculate it all and then just get the sections I want, but this is very slow if my dataframe is thousands of elements long and I don't want to roll across the whole window... just the last two "rolls".
EDIT: I have amended the title; a lot of people are correctly answering part of the question, but ignoring what is IMO the important part (I should have been clearer).
The issue here is the time it takes. I could just "tail" the answer to get the last two; but then it involves calculating the first two "expanding windows" and then throwing away those results. If my dataframe was instead 1000s of rows long and I just needed the answer for the last few entries, much of this calculation would be wasting time. This is the main problem I have.
As I stated:
"I can easily calculate it all and then just get the sections I want" => through using tail.
Sorry for the confusion.
Also, potentially, using tail doesn't involve calculating the lot, but it still seems like it does from the timings that I have done... maybe this is not correct; it is an assumption I have made.
EDIT2: The other option I have tried was using min_periods in rolling to force it not to calculate the initial sections of the group, but this has many pitfalls: it doesn't work if the array includes NaNs, and it fails if the groups are not the same length.
EDIT3:
As a simpler problem and reasoning: it's a limitation of the expanding/rolling window, I think... Say we had an array [1,2,3,4,5]. The expanding windows are [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and if we run the max over that we get 1,2,3,4,5 (the max of each array). But if I just want the max of the last two expanding windows, I only need max[1,2,3,4] = 4 and max[1,2,3,4,5] = 5. Intuitively I don't need to calculate the max of the first three expanding window results to get the last two. But pandas' implementation might be that it calculates max[1,2,3,4] as max[max[1,2,3],max[4]] = 4, in which case the calculation of the entire window is necessary... this might be the same for the quantile example. There might be an alternate way to do this without using expanding... not sure... this is what I can't work out.
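To make that point concrete (illustration only, plain Python on the toy array above):
arr = [1, 2, 3, 4, 5]
# only the last two expanding windows are needed, so only two maxima are computed
last_two_max = [max(arr[:-1]), max(arr)]   # [4, 5]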
Maybe try using tail: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.core.groupby.GroupBy.tail.html
df.groupby('A')['B'].rolling(4, min_periods=1).quantile(0.5).reset_index(level=0).groupby('A').tail(2)
Out[410]:
A B
4 1 1.5
6 1 1.0
5 2 1.5
7 2 1.0
rolling and expanding are similar
How about this (edited 06/12/2018):
def last_two_quantile(row, q):
return pd.Series([row.iloc[:-1].quantile(q), row.quantile(q)])
df.groupby('A')['B'].apply(last_two_quantile, 0.5)
Out[126]:
A
1 0 1.5
1 1.0
2 0 1.5
1 1.0
Name: B, dtype: float64
If this (or something like it) doesn't do what you desire I think you should provide a real example of your use case.
Is this what you want?
df[-4:].groupby("A")["B"].expanding().quantile(0.5)
A
1 4 2.0
6 1.5
2 5 2.0
7 1.5
Name: B, dtype: float64
Hope it can help you.
Solution1:
newdf = df.groupby("A")["B"].expanding().quantile(0.5).reset_index()
for i in range(newdf["A"].max()+1):
print(newdf[newdf["A"]==i][-2:],'\n')
Solution2:
newdf2 = df.groupby("A")["B"].expanding().quantile(0.5)
for i in range(newdf2.index.get_level_values("A").max()+1):
print(newdf[newdf["A"]==i][-2:],'\n')
Solution3:
for i in range(df.groupby("A")["B"].expanding().quantile(0.5).index.get_level_values("A").max()+1):
print(newdf[newdf["A"]==i][-2:],'\n')
output:
Empty DataFrame
Columns: [A, level_1, B]
Index: []
A level_1 B
2 1 4 1.5
3 1 6 1.0
A level_1 B
6 2 5 1.5
7 2 7 1.0
new solution:
newdf = pd.DataFrame(columns={"A", "B"})
for i in range(len(df["A"].unique())):
newdf = newdf.append(pd.DataFrame(df[df["A"]==i+1][:-2].sum()).T)
newdf["A"] = newdf["A"]/2
for i in range(len(df["A"].unique())):
newdf = newdf.append(df[df["A"]==df["A"].unique()[i]][-2:])
#newdf = newdf.reset_index(drop=True)
newdf["A"] = newdf["A"].astype(int)
for i in range(newdf["A"].max()+1):
print(newdf[newdf["A"]==i].groupby("A")["B"].expanding().quantile(0.5)[-2:])
output:
Series([], Name: B, dtype: float64)
A
1 4 1.5
6 1.0
Name: B, dtype: float64
A
2 5 1.5
7 1.0
Name: B, dtype: float64
This question is motivated by an answer I gave a while ago.
Let's say I have a dataframe like this
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c':[np.nan, 5, 34]})
a b c
0 1.0 3.0 NaN
1 2.0 NaN 5.0
2 NaN 10.0 34.0
and I want to replace the NaN by the maximum of the row, I can do
df.apply(lambda row: row.fillna(row.max()), axis=1)
which gives me the desired output
a b c
0 1.0 3.0 3.0
1 2.0 5.0 5.0
2 34.0 10.0 34.0
When I, however, use
df.apply(lambda row: row.fillna(max(row)), axis=1)
for some reason it is replaced correctly only in two of three cases:
a b c
0 1.0 3.0 3.0
1 2.0 5.0 5.0
2 NaN 10.0 34.0
Indeed, if I check by hand
max(df.iloc[0, :])
max(df.iloc[1, :])
max(df.iloc[2, :])
Then it prints
3.0
5.0
nan
When doing
df.iloc[0, :].max()
df.iloc[1, :].max()
df.iloc[2, :].max()
it prints the expected
3.0
5.0
34.0
My question is why max() fails in one of the three cases but not in all three. Why are the NaNs sometimes ignored and sometimes not?
The reason is that max works by taking the first value as the "max seen so far", and then checking each other value to see if it is bigger than the max seen so far. But nan is defined so that comparisons with it always return False --- that is, nan > 1 is false but 1 > nan is also false.
So if you start with nan as the first value in the array, every subsequent comparison will check whether some_other_value > nan. This will always be false, so nan will retain its position as "max seen so far". On the other hand, if nan is not the first value, then when it is reached, the comparison nan > max_so_far will again be false. But in this case that means the current "max seen so far" (which is not nan) will remain the max seen so far, so the nan will always be discarded.
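A quick illustration of that argument (assuming numpy is imported as np):
max([np.nan, 1, 2])   # nan -- nan is the first "max so far" and never loses a comparison
max([1, np.nan, 2])   # 2   -- nan never wins a comparison, so it is discarded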
In the first case you are using the pandas Series.max method (row.max()), which skips NaN values by default.
In the second case you are using the built-in max function from Python. This is not aware of how to handle numpy.nan. Presumably this effect is due to the fact that any comparison (>, <, ==, etc.) of numpy.nan with a float is False. An obvious way to implement max is to iterate over the iterable (the row in this case), check whether each value is larger than the current maximum and, if so, store it as the new maximum. Since this larger-than comparison is always False when one of the compared values is numpy.nan, whether the recorded maximum is the number you want or numpy.nan depends entirely on whether the first value is numpy.nan or not.
This is due to the ordering of the elements in the list. First off, if you type
max([1, 2, np.nan])
The result is 2, while
max([np.nan, 2, 3])
gives np.nan. The reason for this is that the max function goes through the values in the list one by one with a comparison like this:
if a > b
Now, if we look at what happens when comparing with nan, np.nan > 2 and 1 > np.nan both give False, so in one case the nan remains the running maximum and in the other it never becomes the running maximum.
The two are different: max() vs df.max().
max() is the Python built-in function; its argument must be a non-empty iterable. Check here:
https://docs.python.org/2/library/functions.html#max
The pandas DataFrame method df.max(skipna=...), on the other hand, has a skipna parameter whose default value is True, which means the NA/null values are excluded. Check here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html
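A quick comparison on the third row of the question's df, which starts with a NaN (illustration only):
row = df.iloc[2]              # NaN, 10.0, 34.0
row.max()                     # 34.0 -- skipna=True by default
row.max(skipna=False)         # nan
max(row)                      # nan  -- nan comes first, so it "wins", as explained above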
If it is possibly an inf issue, try replacing inf as well as nan:
df[column] = df[column].replace([np.inf, -np.inf], 0.0)
df[column] = df[column].replace([np.nan, -np.nan], 0.0)
I have a dataframe with 3 million rows that contains values like these:
d a0 a1 a2
1.5 10.0 5.0 1.0
0.8 10.0 2.0 0.0
I want to fill a fourth column with a linear interpolation of (a0, a1, a2), evaluated at the value in the "d" column:
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 3.0
0.8 10.0 2.0 0.0 3.6
newcol is the weighted average between a[int(d)] and a[int(d+1)], e.g. when d = 0.8, newcol = 0.2 * a0 + 0.8 * a1 because 0.8 is 80% of the way between 0 and 1
I found that np.interp can be used, but I could not find a way to pass the three columns as the values vector:
df["newcol"]=np.interp(df["d"],[0,1,2], [100,200,300])
will indeed give me
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 250.0
0.8 10.0 2.0 0.0 180.0
BUT I have no way to specify that the values vector changes from row to row:
df["newcol"]=np.interp(df["d"],[0,1,2], df[["a0","a1","a2"]])
gives me the following traceback:
File "C:\Python27\lib\site-packages\numpy\lib\function_base.py", line 1271, in interp
return compiled_interp(x, xp, fp, left, right)
ValueError: object too deep for desired array
Is there any way to use a different values vector for each line? Can you think of any workaround?
Basically, I could find no way to create this new column based on the definition:
What is the value, at x = column "d", of the function that is piecewise linear
between the given points and whose values at these points are described in the columns "ai"?
Edit: Before, I used scipy.interp1d, which is not memory efficient; the comment helped me to partially solve my problem.
Edit2:
I tried the approach from ev-br, who stated that I had to try to code the loop myself:
for i in range(len(tps)):
    columns = ["a0", "a1", "a2"]
    length = len(columns)
    x = np.maximum(0, np.minimum(df.ix[i, "d"], length - 1))
    xint = min(int(x), length - 2)   # keep xint + 1 inside the columns list
    xfrac = x - xint
    name1 = columns[xint]
    name2 = columns[xint + 1]
    tps.ix[i, "Multiplier"] = df.ix[i, name1] + xfrac * (df.ix[i, name2] - df.ix[i, name1])
The above loop runs at around 50 iterations per second, so I guess I have a major optimisation issue. What am I doing wrong when working on a DataFrame?
It might come a bit too late, but I would use np.interp with pandas' apply function. Creating the DataFrame from your example:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
Then comes the apply function:
t.apply(lambda x: np.interp(x.d, [0,1,2], x['a0':]), axis=1)
which yields:
0 3.0
1 3.6
dtype: float64
This is perfectly usable on "normal" datasets. However, the size of your DataFrame might call for a better/more optimized solution. The processing time scales linearly; my machine clocks in at about 10,000 lines per second, which means 5 minutes for 3 million...
OK, I have a second solution, which uses the numexpr module. This method is much more specific, but also much faster. I've measured the complete process to take 733 milliseconds for 1 million lines, which is not bad...
So we have the original DataFrame as before:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
We import the module and use it, but it requires that we separate the two cases where we will use 'a0' and 'a1' or 'a1' and 'a2' as lower/upper limits for the linear interpolation. We also prepare the numbers so they can be fed to the same evaluation (hence the -1). We do that by creating 3 arrays with the interpolation value (originally: 'd') and the limits, depending on the value of "d". So we have:
import numexpr as ne
lim = np.where(t.d > 1, [t.d-1, t.a1, t.a2], [t.d, t.a0, t.a1])
Then we evaluate the simple linear interpolation expression and finally add it as a new column like that:
x = ne.evaluate('(1-x)*a+x*b', local_dict={'x': lim[0], 'a': lim[1], 'b': lim[2]})
t['IP'] = x
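For comparison, a plain-NumPy sketch of the same row-wise interpolation without numexpr, assuming numpy is imported as np, d stays within [0, 2] and the columns are a0..a2 (the column name IP_np is just illustrative):
vals = t[['a0', 'a1', 'a2']].values
d = t['d'].values
lo = np.clip(d.astype(int), 0, vals.shape[1] - 2)   # index of the lower bracketing column
frac = d - lo                                       # fractional position within the bracket
rows = np.arange(len(t))
t['IP_np'] = vals[rows, lo] * (1 - frac) + vals[rows, lo + 1] * frac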