expanding mean on rows in pandas

I am trying to calculate expanding mean on rows in my dataframe using pandas.
All seems to be working fine if calculating for columns:
>>> t = pd.DataFrame([1,2,3,4,5,np.nan])
>>> t
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
>>> t.expanding(min_periods=2, axis=0).mean()
0
0 NaN
1 1.5
2 2.0
3 2.5
4 3.0
5 3.0
However, if I try the same on rows, I get wrong results (it seems a window of size 2 is applied throughout):
>>> t.T
0 1 2 3 4 5
0 1.0 2.0 3.0 4.0 5.0 NaN
>>> t.T.expanding(min_periods=2, axis=1).mean()
0 1 2 3 4 5
0 NaN 1.5 2.5 3.5 4.5 NaN
This seems like a bug to me, but maybe I'm missing something. Any clues, please?

It is indeed a bug, tracked in the pandas GitHub issues for both expanding and rolling (github-pandas-expanding and github-pandas-rolling).
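On a pandas version affected by the bug, one way around it is to transpose, expand down the columns (which the question shows works correctly), and transpose back. This is a minimal sketch reusing the question's t; the names wide and row_expanding are only illustrative.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame([1, 2, 3, 4, 5, np.nan])
wide = t.T  # one row, six columns, as in the question

# Expand along axis 0 on the transposed frame, then transpose the result back.
row_expanding = wide.T.expanding(min_periods=2).mean().T
print(row_expanding)
```

The single row of the result then matches the column-wise output shown in the question (NaN, 1.5, 2.0, 2.5, 3.0, 3.0).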

Related

How to assign each element in an array column its ordered position?

I have a dataframe that looks like this:
df = pd.DataFrame({'group':[1,1,1,1,1,2,2,2,2,3,3,4,4],
'x':[np.nan,np.nan,3,np.nan,2,np.nan,3,3,4,2,1,1,3],
'y':[np.nan,np.nan,2,np.nan,1,np.nan,1,1,5,1,5,1,1]})
group x y
1 nan nan
1 nan nan
1 3.0 2.0
1 nan nan
1 2.0 1.0
2 nan nan
2 3.0 1.0
2 3.0 1.0
2 4.0 5.0
3 2.0 1.0
3 1.0 5.0
4 1.0 1.0
4 3.0 1.0
Basically, let's say I have 4 groups and each group contains points with x,y coordinates. Points can have the same coordinates. For example, (3,1) exists twice in group 2 and also in group 4. Furthermore, if x is nan then y is also nan.
I want to assign each pair (x,y) its position in the sorted list of unique tuples. If x=y=nan then zero should be returned.
Hence the output should be:
group x y label_global
1 nan nan 0
1 nan nan 0
1 3.0 2.0 5
1 nan nan 0
1 2.0 1.0 3
2 nan nan 0
2 3.0 1.0 4
2 3.0 1.0 4
2 4.0 5.0 6
3 2.0 1.0 3
3 1.0 5.0 2
4 1.0 1.0 1
4 3.0 1.0 4
What I have done is the following:
centroids = sorted(set([x for x in zip(df.dropna().x.values, df.dropna().y.values)]))
df['label_global'] = [centroids.index(d) + 1 if d[1]==d[1] else 0 for d in zip(df.x.values, df.y.values)]
Is there a better way to do this, please? My dataframe is about 2 million rows long and the task takes around 3 minutes to complete.
As a side note: in the last list comprehension, the expression if d[1]==d[1] else is meant to filter out tuples with nan, since np.nan==np.nan evaluates to False. I had initially tried if np.nan not in d else, i.e.:
df['label_global'] = [centroids.index(d) + 1 if np.nan not in d else 0 for d in zip(df.x.values, df.y.values)]
but that doesn't work and I have no idea why. It raises a ValueError:
ValueError: (nan, nan) is not in list
which to me indicates that the conditional expression hasn't filtered out the nan tuples. Any insights are very much welcome.
I find it also a bit strange that
(np.nan, np.nan)==(np.nan, np.nan) returns True
or even
(np.nan,)==(np.nan,) returns True
but
np.nan==np.nan returns False
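On the side question: Python's tuple comparison and the in operator check object identity before falling back to ==. A tuple that holds the np.nan object itself therefore compares equal, while values pulled out of a NumPy array are separate np.float64 objects that are neither identical nor equal to np.nan. The sketch below illustrates this; the variable names are made up for the example.

```python
import numpy as np

values = np.array([np.nan, 1.0])
from_array = (values[0],)  # a fresh np.float64 NaN, not the np.nan object
literal = (np.nan,)        # holds the np.nan object itself

found_in_literal = np.nan in literal        # identity match
found_in_from_array = np.nan in from_array  # no identity match, and nan != nan

# A check that does not depend on object identity:
has_nan = any(np.isnan(v) for v in from_array)
print(found_in_literal, found_in_from_array, has_nan)
```

This is why np.nan not in d was True for the (nan, nan) tuples coming from zip(df.x.values, df.y.values), and centroids.index(d) was then called on them.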

Sort by x,y pairs, setting nan first, and use cumsum to set group numbers
# any(axis=1): newer pandas versions require the axis to be passed by keyword
df['label_global'] = (df.sort_values(['x','y'], na_position='first')[['x','y']]
                        .fillna(0).diff().ne([0,0]).any(axis=1).cumsum() - 1)
group x y label_global
0 1 NaN NaN 0
1 1 NaN NaN 0
2 1 3.0 2.0 5
3 1 NaN NaN 0
4 1 2.0 1.0 3
5 2 NaN NaN 0
6 2 3.0 1.0 4
7 2 3.0 1.0 4
8 2 4.0 5.0 6
9 3 2.0 1.0 3
10 3 1.0 5.0 2
11 4 1.0 1.0 1
12 4 3.0 1.0 4
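An alternative that avoids the repeated list.index lookups of the original attempt is to build a small lookup table of the unique sorted pairs and merge it back. This is only a sketch under the question's assumptions (x and y are NaN together, default integer index); the names pairs and out are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1,1,1,1,1,2,2,2,2,3,3,4,4],
                   'x': [np.nan,np.nan,3,np.nan,2,np.nan,3,3,4,2,1,1,3],
                   'y': [np.nan,np.nan,2,np.nan,1,np.nan,1,1,5,1,5,1,1]})

# Unique non-NaN (x, y) pairs in sorted order, numbered from 1.
pairs = (df[['x', 'y']].dropna().drop_duplicates()
         .sort_values(['x', 'y']).reset_index(drop=True))
pairs['label_global'] = np.arange(1, len(pairs) + 1)

# A left merge keeps the row order of df; rows without a match (the NaN rows) get 0.
out = df.merge(pairs, on=['x', 'y'], how='left')
out['label_global'] = out['label_global'].fillna(0).astype(int)
print(out)
```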

Perform arithmetic operations on null values

When I try to do arithmetic operations involving two or more columns, I run into a problem with null values.
I should also mention that I don't want to fill the missing/null values.
What I want is something like 1 + np.nan = 1, but it gives np.nan. I tried to solve it with np.nansum but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df["c"] = df.a + df.b
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

np.nansum([df.a + df.b]) first computes df.a + df.b, where NaN propagates, and then sums everything that is left into a single number (2 + 4 = 6), which is assigned to every row. You probably want to apply np.nansum across the two columns instead:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected result:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
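Note that df.sum(axis=1) adds up every numeric column, so it would also include a c or d column if one already exists. Two variants that name the columns explicitly are sketched below, using the question's data: a row-wise sum over a column subset, and Series.add with fill_value, which treats a missing operand as 0 unless both operands are missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, np.nan, np.nan]})

# Row-wise sum restricted to the two input columns (NaN is skipped by default).
df["c"] = df[["a", "b"]].sum(axis=1)

# Element-wise addition where a missing value on one side counts as 0.
df["d"] = df["a"].add(df["b"], fill_value=0)
print(df)
```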

pandas rolling apply to allow nan

I have a very simple Pandas Series:
xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])
If I run this I get what I want:
>>> xx.rolling(3,1).mean()
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
But if I have to use .apply(), I cannot get it to ignore NaNs in the mean() operation:
>>> xx.rolling(3,1).apply(np.mean)
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
>>> xx.rolling(3,1).apply(lambda x : np.mean(x))
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
What should I do to both use .apply() and get the result in the first output? My actual problem is more complicated and requires .apply(), but it boils down to this issue.

You can use np.nanmean()
xx.rolling(3,1).apply(lambda x : np.nanmean(x))
Out[59]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
If you have to process the nans explicitly, you can do:
xx.rolling(3,1).apply(lambda x : np.mean(x[~np.isnan(x)]))
Out[94]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
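Rolling.apply also accepts a raw argument; with raw=True each window is handed to the function as a NumPy array, which is the situation the question's NaN output reflects. The sketch below assumes a pandas version that has this argument and makes both min_periods and raw explicit.

```python
import numpy as np
import pandas as pd

xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])

# Each window arrives as an ndarray; np.nanmean ignores the NaNs inside it.
result = xx.rolling(3, min_periods=1).apply(np.nanmean, raw=True)
print(result)
```

Any custom function that accepts an array and handles NaN itself can be substituted for np.nanmean in the same way.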

Pandas Aggregate Method on RollingGroupby

Question: Does the .agg method work on a RollingGroupby object? It seems like it should, and IPython autocompletes this method, but I'm getting an error.
Documentation: I did not see anything specific to RollingGroupby objects. I am probably looking in the wrong place, but I looked at Standard moving window functions and GroupBy.
Sample Data:
# test data
df = pd.DataFrame({
'animal':np.random.choice( ['panda','python','shark'], 12),
'period':np.repeat(range(3), 4 ),
'value':np.tile(range(2), 6 ),
})
# this works as expected
df.groupby(['animal', 'period'])['value'].rolling(2).count()
animal period
panda 0 2 1.0
2 8 1.0
10 2.0
python 0 0 1.0
1 2.0
1 6 1.0
2 11 1.0
shark 0 3 1.0
1 4 1.0
5 2.0
7 2.0
2 9 1.0
Name: value, dtype: float64
# this works as expected
df.groupby(['animal', 'period'])['value'].rolling(2).mean()
animal period
panda 0 2 NaN
2 8 NaN
10 0.0
python 0 0 NaN
1 0.5
1 6 NaN
2 11 NaN
shark 0 3 NaN
1 4 NaN
5 0.5
7 1.0
2 9 NaN
Name: value, dtype: float64
This does not work for me.
df.groupby(['animal', 'period'])['value'].rolling(2).agg(['count', 'mean'])
The short exception is:
Exception: Column(s) value already selected
The desired DataFrame is below. I got this from merging the two DataFrames that worked above, but this seems cumbersome.
animal period level_2 value_x value_y
0 panda 0 2 1.0 NaN
1 panda 2 8 1.0 NaN
2 panda 2 10 2.0 0.0
3 python 0 0 1.0 NaN
4 python 0 1 2.0 0.5
5 python 1 6 1.0 NaN
6 python 2 11 1.0 NaN
7 shark 0 3 1.0 NaN
8 shark 1 4 1.0 NaN
9 shark 1 5 2.0 0.5
10 shark 1 7 2.0 1.0
11 shark 2 9 1.0 NaN

Jeff (one of the main Pandas developers) said:
sophisticated .agg was never explicitly implemented on
.groupby.rolling, so not surprising this doesn't work.
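If .agg with a list fails on the installed version, a less cumbersome alternative to merging is to compute each statistic separately and line them up with pd.concat, since both results share the same index. This is a sketch with a seeded random sample standing in for the question's data; the column labels count and mean are chosen here, and the exact count values may depend on how the pandas version handles min_periods.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sample is reproducible
df = pd.DataFrame({
    'animal': rng.choice(['panda', 'python', 'shark'], 12),
    'period': np.repeat(range(3), 4),
    'value': np.tile(range(2), 6),
})

roll = df.groupby(['animal', 'period'])['value'].rolling(2)

# The dict keys become the column names of the combined frame.
combined = pd.concat({'count': roll.count(), 'mean': roll.mean()}, axis=1)
print(combined)
```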

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in position and length...

set_index and reset_index are your friends.
df = pd.DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index, here the missing data is filled in with nans. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
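Put together as a self-contained script, the three steps above might look like the sketch below. The step of 0.5 and the use of the column's min and max to derive the range are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})

step = 0.5  # assumed spacing of column A
# np.arange excludes its end value, so extend the end by one step.
new_index = pd.Index(np.arange(df["A"].min(), df["A"].max() + step, step), name="A")

# Move A to the index, reindex onto the full grid, move A back to a column.
filled = df.set_index("A").reindex(new_index).reset_index()
print(filled)
```

Because reindex matches labels exactly, this relies on the generated floats coinciding with the values in A; that holds for multiples of 0.5 but may not for steps such as 0.1.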

Building on EdChum's merge-based answer in this thread, I created the following function:
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    # range_to is exclusive, as in np.arange
    full_range = pd.DataFrame({field: np.arange(range_from, range_to, range_step)})
    return (df.merge(full_range, how='right', on=field)
              .sort_values(by=field)
              .reset_index(drop=True)
              .fillna(fill_with))
Example usage (the end of the range is exclusive, so pass 5.0 to include the 4.5 row):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)

In this case I generate a new dataframe holding the full range for column A and right-merge it with your original df, then re-sort:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index(drop=True)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes a start value, an end value and a step. Note that I added 0.5 to the end because the range is half-open: the end value itself is excluded.
A more general method could be like this:
In [197]:
df = df.set_index('A').reindex(np.arange(df['A'].iloc[0], df['A'].iloc[-1] + 0.5, 0.5))
df = df.rename_axis('A').reset_index()
df
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A, reindex the df using the arange function, and then move the index back into a column named A.

This question was asked a long time ago, but there is a simple point worth mentioning: to mark a single cell as missing, you can assign NumPy's NaN to it. For instance, with row label i and column label j:
import numpy as np
df.loc[i, j] = np.nan
will do the trick.
