pandas rolling apply to allow nan

I have a very simple Pandas Series:
xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])
If I run this I get what I want:
>>> xx.rolling(3,1).mean()
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
But if I have to use .apply(), I cannot get the mean() operation to ignore NaNs:
>>> xx.rolling(3,1).apply(np.mean)
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
>>> xx.rolling(3,1).apply(lambda x : np.mean(x))
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
What should I do to use .apply() and still get the result shown in the first output? My actual problem is more complicated and requires .apply(), but it boils down to this issue.

You can use np.nanmean()
xx.rolling(3,1).apply(lambda x : np.nanmean(x))
Out[59]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
If you have to process the nans explicitly, you can do:
xx.rolling(3,1).apply(lambda x : np.mean(x[~np.isnan(x)]))
Out[94]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
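The two answers above can be combined into a small named function. This is only a sketch: nan_aware_mean is a hypothetical helper name, and raw=True is passed on the assumption that the function should receive a plain NumPy array (in pandas versions where raw defaults to False, the function receives a Series instead, which can change how NaN is handled).

```python
import numpy as np
import pandas as pd

xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])


def nan_aware_mean(window):
    """Mean of the non-NaN entries of a NumPy window (hypothetical helper)."""
    valid = window[~np.isnan(window)]
    # Guard against an empty selection so np.mean never sees a zero-length array.
    return valid.mean() if valid.size else np.nan


# raw=True hands each window to the function as a NumPy array.
result = xx.rolling(3, min_periods=1).apply(nan_aware_mean, raw=True)
print(result)
```

Any custom reduction can be swapped in for the mean as long as it filters out NaN itself before computing.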

Related

Pandas: Custom fillna() function?

Let's say I have data like this:
>>> df = pd.DataFrame({'values': [5, np.nan, 2, 2, 2, 5, np.nan, 4, 5]})
>>> print(df)
values
0 5.0
1 NaN
2 2.0
3 2.0
4 2.0
5 5.0
6 NaN
7 4.0
8 5.0
I know that I can use fillna() with arguments such as fillna(method='ffill') to fill missing values with the previous value. Is there a way of writing a custom method for fillna? Let's say I want every NaN value to be replaced by the arithmetic mean of the previous 2 values and the next 2 values. How would I do that? (I am not saying that this is a good method of filling the values, but I want to know if it can be done.)
Example for what the output would have to look like:
0 5.0
1 3.0
2 2.0
3 2.0
4 2.0
5 5.0
6 4.0
7 4.0
8 5.0

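You can combine ffill and bfill to average the nearest valid value on each side of every gap:
df['values'] = df['values'].ffill().add(df['values'].bfill()).div(2)
print(df)
values
0 5.0
1 3.5
2 2.0
3 2.0
4 2.0
5 5.0
6 4.5
7 4.0
8 5.0
Note that this uses only one neighbour on each side, so the filled values are 3.5 and 4.5 rather than the 3.0 and 4.0 that a window of two values on each side would give. To apply the same fill to every column, replace df['values'] with df.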
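To reproduce the window described in the question (up to two values on each side of the gap), one option is a centred rolling mean used as the fill source. This is a sketch rather than a built-in fillna mode; the names window_mean and filled are illustrative, and the window size of 5 is an assumption derived from "2 before, the NaN itself, 2 after".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [5, np.nan, 2, 2, 2, 5, np.nan, 4, 5]})

# Centred window of 5 rows: two before, the row itself, two after.
# min_periods=1 lets partial windows at the edges and around NaN still produce a mean.
window_mean = df['values'].rolling(window=5, center=True, min_periods=1).mean()

# Only the missing entries are replaced; existing values are left untouched.
filled = df['values'].fillna(window_mean)
print(filled)
```

Because the rolling mean skips NaN, the missing value itself does not contribute to its own replacement.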

How to assign each element in an array column its ordered position?

I have a dataframe that looks like this:
df = pd.DataFrame({'group':[1,1,1,1,1,2,2,2,2,3,3,4,4],
'x':[np.nan,np.nan,3,np.nan,2,np.nan,3,3,4,2,1,1,3],
'y':[np.nan,np.nan,2,np.nan,1,np.nan,1,1,5,1,5,1,1]})
group x y
1 nan nan
1 nan nan
1 3.0 2.0
1 nan nan
1 2.0 1.0
2 nan nan
2 3.0 1.0
2 3.0 1.0
2 4.0 5.0
3 2.0 1.0
3 1.0 5.0
4 1.0 1.0
4 3.0 1.0
Basically, let's say I have 4 groups and each group contains points with x,y coordinates. Points can have the same coordinates. For example, (3,1) exists twice in group 2 and also in group 4. Furthermore, if x is nan then y is also nan.
I want to assign each pair (x,y) its position (starting from 1) in the sorted list of unique tuples. If x=y=nan then zero should be returned.
Hence the output should be:
group x y label_global
1 nan nan 0
1 nan nan 0
1 3.0 2.0 5
1 nan nan 0
1 2.0 1.0 3
2 nan nan 0
2 3.0 1.0 4
2 3.0 1.0 4
2 4.0 5.0 6
3 2.0 1.0 3
3 1.0 5.0 2
4 1.0 1.0 1
4 3.0 1.0 4
What I have done is the following:
centroids = sorted(set([x for x in zip(df.dropna().x.values, df.dropna().y.values)]))
df['label_global'] = [centroids.index(d) + 1 if d[1]==d[1] else 0 for d in zip(df.x.values, df.y.values)]
Is there a better way to do this, please? My dataframe is about 2 million rows long and the task takes around 3 minutes to complete.
As a sidenote: In the last list comprehension, the expression if d[1]==d[1] else is meant to filter out tuples with nan since np.nan==np.nan evaluates to False. I had initially tried with if np.nan not in d else, ie:
df['label_global'] = [centroids.index(d) + 1 if np.nan not in d else 0 for d in zip(df.x.values, df.y.values)]
but that doesn't work and I have no idea why. It raises a ValueError:
ValueError: (nan, nan) is not in list
which to me indicates that the conditional expression hasn't filtered out the nan tuples. Any insights are very much welcome.
I find it also a bit strange that
(np.nan, np.nan)==(np.nan, np.nan) returns True
or even
(np.nan,)==(np.nan,) returns True
but
np.nan==np.nan returns False
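A short sketch of what is likely going on in the side note: Python's tuple comparison and the in operator check object identity before equality, so the same nan object compares as "equal" inside a container, while a NaN pulled out of a NumPy array is a different object and fails both checks. The variable names below are illustrative only.

```python
import numpy as np

nan = np.nan

plain_equal = (nan == nan)            # equality only: NaN never equals itself
tuple_equal = ((nan,) == (nan,))      # identity is checked first, so this is True

# A NaN read from an array is a separate np.float64 object, not np.nan itself.
from_array = np.array([np.nan, 1.0])[0]
same_object = from_array is np.nan
found_by_in = np.nan in (from_array,)  # neither identical nor equal

# A check that does not depend on identity: NaN is the only value unequal to itself.
has_nan = any(v != v for v in (from_array, 1.0))

print(plain_equal, tuple_equal, same_object, found_by_in, has_nan)
```

This is why the d[1]==d[1] test works on values taken from df.x.values while np.nan not in d does not.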

Sort by (x, y) pairs with the nan rows first, then use cumsum over the row-to-row changes to number the groups:
df['label_global'] = (df.sort_values(['x', 'y'], na_position='first')[['x', 'y']]
    .fillna(0).diff().ne([0, 0]).any(axis=1).cumsum() - 1)
group x y label_global
0 1 NaN NaN 0
1 1 NaN NaN 0
2 1 3.0 2.0 5
3 1 NaN NaN 0
4 1 2.0 1.0 3
5 2 NaN NaN 0
6 2 3.0 1.0 4
7 2 3.0 1.0 4
8 2 4.0 5.0 6
9 3 2.0 1.0 3
10 3 1.0 5.0 2
11 4 1.0 1.0 1
12 4 3.0 1.0 4
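An alternative sketch, not taken from the answer above, builds the lookup table of unique sorted pairs once and merges it back. The names pairs and out are illustrative. It avoids the per-row list.index call, which is probably what makes the original approach slow on millions of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
                   'x': [np.nan, np.nan, 3, np.nan, 2, np.nan, 3, 3, 4, 2, 1, 1, 3],
                   'y': [np.nan, np.nan, 2, np.nan, 1, np.nan, 1, 1, 5, 1, 5, 1, 1]})

# Unique, sorted (x, y) pairs with the nan rows removed.
pairs = (df[['x', 'y']].dropna()
         .drop_duplicates()
         .sort_values(['x', 'y'])
         .reset_index(drop=True))
pairs['label_global'] = pairs.index + 1  # labels start at 1

# A left merge keeps the original row order; nan rows find no match and become 0.
out = df.merge(pairs, on=['x', 'y'], how='left')
out['label_global'] = out['label_global'].fillna(0).astype(int)
print(out)
```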

Perform arithmetic operations on null values

When I try to do an arithmetic operation involving two or more columns, I run into a problem with null values.
I should also mention that I don't want to fill the missing/null values.
What I want is something like 1 + np.nan = 1, but the result is np.nan. I tried to solve it with np.nansum, but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And:
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

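Here np.nansum receives the single Series df.a + df.b, skips its NaN entries and reduces it to one scalar (2 + 4 = 6), which is then broadcast to every row. You do not want that; you probably want to call np.nansum on the two columns and sum along axis 0, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected result: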
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

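Simply use DataFrame.sum over axis=1, which skips NaN by default. Select only the columns to be added so that any existing result columns are not included in the sum:
df['c'] = df[['a', 'b']].sum(axis=1)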
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
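A third option, sketched here as an addition to the two answers, is Series.add with fill_value=0. It treats a NaN on one side as 0 for the purpose of the addition without modifying the stored data. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, np.nan, np.nan]})

# fill_value=0 substitutes 0 when only one of the two operands is missing.
total = df["a"].add(df["b"], fill_value=0)
print(total.tolist())

# When both operands are missing, the result stays missing.
both_missing = pd.Series([np.nan]).add(pd.Series([np.nan]), fill_value=0)

# DataFrame.sum returns 0 for an all-NaN row unless min_count is set.
all_nan_row = pd.DataFrame({"a": [np.nan], "b": [np.nan]})
lenient = all_nan_row.sum(axis=1)
strict = all_nan_row.sum(axis=1, min_count=1)
```

The difference only matters for rows where every operand is NaN, so which variant fits depends on whether such rows should yield 0 or stay missing.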

expanding mean on rows in pandas

I am trying to calculate expanding mean on rows in my dataframe using pandas.
All seems to be working fine if calculating for columns:
>>> t = pd.DataFrame([1,2,3,4,5,np.nan])
>>> t
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
>>> t.expanding(min_periods=2, axis=0).mean()
0
0 NaN
1 1.5
2 2.0
3 2.5
4 3.0
5 3.0
However, if I try the same on rows, I get wrong results (it seems that a window of size 2 is applied throughout):
>>> t.T
0 1 2 3 4 5
0 1.0 2.0 3.0 4.0 5.0 NaN
>>> t.T.expanding(min_periods=2, axis=1).mean()
0 1 2 3 4 5
0 NaN 1.5 2.5 3.5 4.5 NaN
This seems like a bug to me, but maybe I'm missing something. Any clues, please?

It is indeed a bug, listed on github-pandas-expanding and github-pandas-rolling.
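Until the behaviour is fixed in the installed version, one possible workaround (a sketch, not something proposed in the linked issues) is to transpose, expand along the default axis, and transpose back. The names wide and row_wise are illustrative.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame([1, 2, 3, 4, 5, np.nan])
wide = t.T  # one row, six columns

# Expand down the rows of the transposed frame, then restore the original shape.
row_wise = wide.T.expanding(min_periods=2).mean().T
print(row_wise)
```

This gives the same numbers as the column-wise call shown in the question, laid out along the row.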

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in both position and length.

set_index and reset_index are your friends.
df = pd.DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index, here the missing data is filled in with nans. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3

Building on EdChum's answer, I created the following function:
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    # range_to is exclusive, as in np.arange
    full_range = pd.DataFrame({field: np.arange(range_from, range_to, range_step)})
    return (df.merge(full_range, how='right', on=field)
              .sort_values(by=field)
              .reset_index(drop=True)
              .fillna(fill_with))
Example usage (the end value is 5.0 so that the 4.5 row is kept):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)

Here I generate a new dataframe holding the complete range of A values, right-merge it with your original df, and then re-sort the result:
In [177]:
df.merge(how='right', on='A', right=pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index(drop=True)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arguments to arange, which takes a start value, an end value and a step. Note that I added 0.5 to the end because the range is half-open: the end value itself is excluded.
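The half-open behaviour of arange is easy to check directly. The sketch below also shows np.linspace as an alternative that includes the end point by construction; using linspace here is a suggestion, not part of the answer, and the variable names are illustrative.

```python
import numpy as np

start, stop, step = 0.0, 4.5, 0.5

# The end value is excluded, so the last generated value is 4.0.
without_end = np.arange(start, stop, step)

# Adding one step to the end value makes 4.5 part of the result.
with_end = np.arange(start, stop + step, step)

# linspace takes the number of points instead of a step and includes the end point.
n_points = int(round((stop - start) / step)) + 1
grid = np.linspace(start, stop, n_points)

print(without_end[-1], with_end[-1], grid[-1])
```

For steps that are not exactly representable in binary floating point, linspace tends to be the safer choice because the number of points is fixed up front.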
A more general method is to reindex on the full range:
In [197]:
full_range = pd.Index(np.arange(df['A'].iloc[0], df['A'].iloc[-1] + 0.5, 0.5), name='A')
df.set_index('A').reindex(full_range).reset_index()
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we move column A into the index, reindex against the full range generated by arange (which inserts NaN rows for the missing values), and then move A back into the columns.

This question was asked a long time ago, but there is a simple point worth mentioning: to mark an individual cell as missing, you can assign NumPy's NaN to it. For instance, with row label i and column label j:
import numpy as np
df.loc[i, j] = np.nan
will do the trick.
