I am looking to perform a forward fill on some DataFrame columns.
The ffill method replaces missing values (NaN) with the previously observed value.
In my case, I would like to do the same kind of forward fill, except that the value to fill over is not NaN but a specific placeholder (say "*").
Here's an example:
import pandas as pd
import numpy as np
d = [{"a":1, "b":10},
{"a":2, "b":"*"},
{"a":3, "b":"*"},
{"a":4, "b":"*"},
{"a":np.nan, "b":50},
{"a":6, "b":60},
{"a":7, "b":70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
If I replace "*" with np.nan and then ffill, the fill would also be applied to column a, where the NaN at index 4 should stay.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over every column, checking whether it contains "*", then replacing and forward filling.
You can combine df.mask, df.isin, and df.replace: build a forward-filled copy in which "*" has been turned into NaN, then use mask to take values from that copy only at the positions that originally held "*":
df.mask(df.isin(['*']), df.replace('*', np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.nan, "<special>").replace("*", np.nan).ffill().replace("<special>", np.nan)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.nan).ffill()
df[original_nan] = np.nan
Related
Let's say I have data like this:
>>> df = pd.DataFrame({'values': [5, np.nan, 2, 2, 2, 5, np.nan, 4, 5]})
>>> print(df)
values
0 5.0
1 NaN
2 2.0
3 2.0
4 2.0
5 5.0
6 NaN
7 4.0
8 5.0
I know that I can use fillna() with arguments such as fillna(method='ffill') to fill missing values with the previous value. Is there a way of writing a custom method for fillna? Let's say I want every NaN value to be replaced by the arithmetic mean of the previous 2 values and the next 2 values; how would I do that? (I am not saying that this is a good way of filling the values, but I want to know whether it can be done.)
Example for what the output would have to look like:
0 5.0
1 3.0
2 2.0
3 2.0
4 2.0
5 5.0
6 4.0
7 4.0
8 5.0
You can fill each NaN with the mean of a centered five-value window, i.e. the two values before and the two values after it (the rolling mean skips any NaNs inside the window):
df['values'] = df['values'].fillna(df['values'].rolling(5, center=True, min_periods=1).mean())
print(df)
values
0 5.0
1 3.0
2 2.0
3 2.0
4 2.0
5 5.0
6 4.0
7 4.0
8 5.0
Just change df['values'] to df to apply this over the whole DataFrame!
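The fillna-with-a-Series pattern generalizes: any rule you can compute as an aligned Series can be passed to fillna. As a sketch of another rule, filling each NaN with the average of the nearest valid value before it and the nearest valid value after it:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, np.nan, 2, 2, 2, 5, np.nan, 4, 5])

# fillna accepts any Series aligned on the index, so a "custom method"
# reduces to computing the replacement values first.
result = s.fillna((s.ffill() + s.bfill()) / 2)
```

With this rule the NaN at index 1 becomes 3.5 (the mean of 5 and 2) rather than 3.0, since only one neighbour on each side is consulted.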
When I try to do an arithmetic operation involving two or more columns, I run into problems with null values.
One more thing I want to mention: I don't want to fill the missing/null values in the original columns.
What I actually want is behaviour like 1 + np.nan = 1, but the addition gives np.nan. I tried to solve it with np.nansum, but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
   a    b
0  1  1.0
1  2  2.0
2  3  NaN
3  4  NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
np.nansum here calculated a single sum over the entire array and broadcast it. That is not what you want; instead call np.nansum on the two columns with axis=0, so the NaN-ignoring sum is taken element-wise:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
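A related pandas option, if you only ever add two columns: Series.add accepts a fill_value, which substitutes 0 for a missing operand before adding (positions where both operands are NaN still come out NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, np.nan, np.nan]})

# fill_value=0 treats a missing operand as 0, so 3 + NaN -> 3.0
df["c"] = df["a"].add(df["b"], fill_value=0)
```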
Let's say I have the following pandas dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([1,2,4, None, None, None, None, -1, 1, None, None])
>>> df
0
0 1.0
1 2.0
2 4.0
3 NaN
4 NaN
5 NaN
6 NaN
7 -1.0
8 1.0
9 NaN
10 NaN
I want to fill the missing values with an exponential decay starting from the previous value, giving:
>>> df_result
0
0 1.0
1 2.0
2 4.0
3 4.0 # NaN replaced with previous value
4 2.0 # NaN replaced previous value / 2
5 1.0 # NaN replaced previous value / 2
6 0.5 # NaN replaced previous value / 2
7 -1.0
8 1.0
9 1.0 # NaN replaced previous value
10 0.5 # NaN replaced previous value / 2
With fillna I can use method='pad', but I cannot express my formula there.
With interpolate, I'm not sure I can supply a specific exponential-decay formula that takes only the last non-NaN value into account.
I'm thinking of creating a separate dataframe df_replacements initialised with 0.5 in place of the NaNs and 0 elsewhere, doing a cumprod (somehow I would need to reset the running product to 1 at every first NaN), and then df_result = df.fillna(df_replacements).
Is there a simple way to achieve this replacement with pandas?
In your case, forward fill first; then group the rows with notnull().cumsum() so every observed value starts a new group, and use cumcount within each group to build the decay exponent:
s = df[0].ffill()
df[0].fillna(s[df[0].isnull()].mul((1/2)**(df[0].groupby(df[0].notnull().cumsum()).cumcount()-1)))
Out[655]:
0 1.0
1 2.0
2 4.0
3 4.0
4 2.0
5 1.0
6 0.5
7 -1.0
8 1.0
9 1.0
10 0.5
Name: 0, dtype: float64
Edit by OP: the same solution with more explicit variable names:
ffilled = df[0].ffill()
is_na = df[0].isnull()
group_ids = df[0].notnull().cumsum()
mul_factors = (1 / 2) ** (df[0].groupby(group_ids).cumcount() - 1)
result = df[0].fillna(ffilled[is_na].mul(mul_factors))
I am an R user learning how to use Python's dfply, the Python equivalent to R's dplyr. My problem: in dfply, I am unable to mask on multiple conditions in a pipe. I seek a solution involving dfply pipes rather than multiple lines of subsetting.
My code:
# Import
import pandas as pd
import numpy as np
from dfply import *
# Create data frame and mask it
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)
Here is the original data frame, df:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
4 5.0 NaN 1
And here is the result of the piped mask, df2:
a b c
0 NaN 6.0 5
4 5.0 NaN 1
However, I expect this instead:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
Why don't the "|" and "~" operators result in rows in which column "a" is either NaN or column "b" is not NaN?
By the way, I also tried np.logical_or():
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
print(df)
print(df2)
But this resulted in error:
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__
Edit: tweak the second condition to X.b.notnull(). No idea why the tilde is ignored after the pipe.
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >> mask((X.a.isnull()) | (X.b.notnull())))
print(df2)
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
How about filter_by?
df >> filter_by((X.a.isnull()) | (X.b.notnull()))
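For comparison, the same filter written as plain pandas boolean indexing (no dfply) behaves as expected:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# Keep rows where a is NaN or b is not NaN
df2 = df[df['a'].isnull() | df['b'].notnull()]
```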
I have a dataframe with ones and NaN values and would like to set the two rows following each one to two and three.
import pandas as pd
df=pd.DataFrame({"b" : [1,None,None,None,None,1,None,None,None]})
print(df)
b
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Like this:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
I know I can use df.loc[df['b']==1] to retrieve the ones, but I don't know how to set the two rows below them.
You can create a grouping variable in which each 1 in b starts a new group, then forward fill 2 rows within each group and take the cumulative sum:
g = (df.b == 1).cumsum()
df.b.groupby(g).apply(lambda x: x.ffill(limit=2).cumsum())
#0 1.0
#1 2.0
#2 3.0
#3 NaN
#4 NaN
#5 1.0
#6 2.0
#7 3.0
#8 NaN
#Name: b, dtype: float64
One without groupby:
temp = df.ffill(limit=2).cumsum()
temp - temp.mask(df.b.isnull()).ffill(limit=2) + 1
Out[91]:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
Using your current line of thinking, you simply need the indices of the rows after the 1s and to set them to the appropriate values (note this relies on the default RangeIndex, since np.where returns positions):
df.loc[np.where(df['b']==1)[0]+1, 'b'] = 2
df.loc[np.where(df['b']==1)[0]+2, 'b'] = 3
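A label-safe variant of the same idea uses shift, which avoids relying on the default RangeIndex that the np.where positions assume (a sketch; both masks are computed from the untouched original column):

```python
import pandas as pd

df = pd.DataFrame({"b": [1, None, None, None, None, 1, None, None, None]})

b = df["b"].copy()
b[df["b"].shift(1) == 1] = 2  # the row right after each 1
b[df["b"].shift(2) == 1] = 3  # two rows after each 1
df["b"] = b
```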