Interpolate / fillna with a decay formula in pandas - python

Let's say I have the following pandas dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([1,2,4, None, None, None, None, -1, 1, None, None])
>>> df
0
0 1.0
1 2.0
2 4.0
3 NaN
4 NaN
5 NaN
6 NaN
7 -1.0
8 1.0
9 NaN
10 NaN
I want to fill the missing values with an exponential decay starting from the previous value, giving:
>>> df_result
0
0 1.0
1 2.0
2 4.0
3 4.0 # NaN replaced with previous value
4 2.0 # NaN replaced with previous value / 2
5 1.0 # NaN replaced with previous value / 2
6 0.5 # NaN replaced with previous value / 2
7 -1.0
8 1.0
9 1.0 # NaN replaced with previous value
10 0.5 # NaN replaced with previous value / 2
With fillna, there is method='pad', but I cannot fit my formula into it.
With interpolate, I'm not sure I can supply a specific exponential decay formula that takes only the last non-NaN value into account.
I'm thinking of creating a separate dataframe df_replacements, initialised with 0.5 where df has NaN and 0 elsewhere, doing a cumprod (somehow I would need to reset the running product to 1 at every first NaN), and then df_result = df.fillna(df_replacements).
Is there a simple way to achieve this replacement with pandas?

In your case: forward-fill the NaNs, then group by the cumulative count of non-NaN values so that each run of consecutive NaNs forms one group, and use cumcount within the group:
s = df[0].ffill()
df[0].fillna(s[df[0].isnull()].mul((1/2) ** (df[0].groupby(df[0].notnull().cumsum()).cumcount() - 1), 0))
Out[655]:
0 1.0
1 2.0
2 4.0
3 4.0
4 2.0
5 1.0
6 0.5
7 -1.0
8 1.0
9 1.0
10 0.5
Name: 0, dtype: float64
Edit by OP: the same solution with more explicit variable names:
ffilled = df[0].ffill()
is_na = df[0].isnull()
group_ids = df[0].notnull().cumsum()
mul_factors = (1 / 2) ** (df[0].groupby(group_ids).cumcount() - 1)
result = df[0].fillna(ffilled[is_na].mul(mul_factors, 0))
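For reuse, the same logic can be wrapped in a small function with the decay factor as a parameter. This is just a sketch along the lines of the answer above; fill_decay and decay are names introduced here:
import pandas as pd

def fill_decay(s, decay=0.5):
    # Fill each NaN run with last_valid * decay**(n-1), where n is the
    # position inside the run (the first NaN repeats the last valid value).
    ffilled = s.ffill()
    group_ids = s.notnull().cumsum()         # one group per non-NaN anchor
    steps = s.groupby(group_ids).cumcount()  # 0 at the anchor, 1, 2, ... on the NaNs
    return s.fillna(ffilled * decay ** (steps - 1))

df_result = fill_decay(df[0], decay=0.5)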

Related

More efficient way to build dataset than using lists

I am building a dataset for a sequence-to-point conv network, where each window is moved by one timestep.
Basically this loop is doing it:
x_train = []
y_train = []
for i in range(window, len(input_train)):
    x_train.append(input_train[i-window:i].tolist())
    y = target_train[i-window:i]
    y = y[int(len(y)/2)]
    y_train.append(y)
When I'm using a big value for window, e.g. 500, I get a memory error.
Is there a way to build the training dataset more efficiently?
You should use pandas. It still might take too much space, but you can try:
import pandas as pd
# if input_train isn't a pd.Series already
input_train = pd.Series(input_train)
rolling_data = (w.reset_index(drop=True) for w in input_train.rolling(window))
x_train = pd.DataFrame(rolling_data).iloc[window - 1:]
y_train = target_train[window//2 : len(target_train) - window + window//2 + 1]  # the midpoint of each window, moving one step at a time
Some explanations with an example:
Assuming a simple series:
>>> input_train = pd.Series([1, 2, 3, 4, 5])
>>> input_train
0 1
1 2
2 3
3 4
4 5
dtype: int64
We can create a dataframe with the windowed data like so:
>>> pd.DataFrame(input_train.rolling(2))
0 1 2 3 4
0 1.0 NaN NaN NaN NaN
1 1.0 2.0 NaN NaN NaN
2 NaN 2.0 3.0 NaN NaN
3 NaN NaN 3.0 4.0 NaN
4 NaN NaN NaN 4.0 5.0
The problem with this is that values in each window have their own indices (0 has 0, 1 has 1, etc.) so they end up in corresponding columns. We can fix this by resetting indices for each window:
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2))
0 1
0 1.0 NaN
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 4.0 5.0
The only thing left to do is to remove the first window - 1 rows, because they are incomplete (that is just how rolling works):
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2)).iloc[2-1:] # .iloc[1:]
0 1
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 4.0 5.0
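Since the original problem was a memory error, a NumPy alternative may also be worth a look: numpy.lib.stride_tricks.sliding_window_view (available since NumPy 1.20) builds the window matrix as a read-only view, so no per-window copies are made. A sketch, assuming input_train and target_train are aligned 1-D array-likes as in the loop above:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

input_train = np.asarray(input_train)
target_train = np.asarray(target_train)

x_train = sliding_window_view(input_train, window)[:-1]  # [:-1] mirrors the loop, which stops one window short
y_train = target_train[window//2 : len(target_train) - window + window//2]  # midpoint of each window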

pandas filling nan with previous row value multiplied with another column

I have a dataframe for which I want to fill NaN with the value from the previous row multiplied by the pct_change column:
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 nan 0.5
4 nan 1.3
5 nan 2
6 5 3
So for the 3rd row, 10 * 0.5 = 5, and that filled value is used to fill the next rows if they are NaN.
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 5 0.5
4 6.5 1.3
5 13 2
6 5 3
I have used this:
while df['col_to_fill'].isna().sum() > 0:
    df.loc[df['col_to_fill'].isna(), 'col_to_fill'] = df['col_to_fill'].shift(1) * df['pct_change']
but it is taking too much time, since each pass only fills rows whose previous row is non-NaN.
Try cumprod after ffill:
s = df.col_to_fill.ffill()*df.loc[df.col_to_fill.isna(),'pct_change'].cumprod()
df.col_to_fill.fillna(s, inplace=True)
df
Out[90]:
col_to_fill pct_change
0 1.0 NaN
1 2.0 1.0
2 10.0 0.5
3 5.0 0.5
4 6.5 1.3
5 13.0 2.0
6 5.0 3.0
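Note that the cumprod above runs across all the NaN rows at once, which works here because there is a single NaN run; with several separate runs the product has to restart at each run. A sketch of a grouped variant (na, run_ids, factors and compounded are names introduced here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_to_fill': [1, 2, 10, np.nan, np.nan, np.nan, 5],
                   'pct_change': [np.nan, 1.0, 0.5, 0.5, 1.3, 2.0, 3.0]})

na = df['col_to_fill'].isna()
run_ids = df['col_to_fill'].notna().cumsum()     # a new group starts at every valid value
factors = df['pct_change'].where(na, 1.0)        # only compound on the NaN rows
compounded = factors.groupby(run_ids).cumprod()  # restarts at each run
df['col_to_fill'] = df['col_to_fill'].fillna(df['col_to_fill'].ffill() * compounded)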

Forward fill on custom value in pandas dataframe

I am looking to perform forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the previous filled value.
In my case, I would like to perform a forward fill, with the difference that it should trigger not on NaN but on a specific value (say "*").
Here's an example
import pandas as pd
import numpy as np
d = [{"a": 1, "b": 10},
     {"a": 2, "b": "*"},
     {"a": 3, "b": "*"},
     {"a": 4, "b": "*"},
     {"a": np.nan, "b": 50},
     {"a": 6, "b": 60},
     {"a": 7, "b": 70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
If I replace "*" with np.nan and then ffill, that would also apply ffill to column a, whose NaN should be preserved.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each contains "*", then replacing and ffilling.
You can use df.mask with df.isin and df.replace:
df.mask(df.isin(['*']),df.replace('*',np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.nan, "<special>").replace("*", np.nan).ffill().replace("<special>", np.nan)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.nan).ffill()
df[original_nan] = np.nan
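With hundreds of columns, another option is to restrict the work to the columns that actually contain the marker, leaving everything else (including the NaN in column a) untouched. A sketch building on the mask-based answer above; marker and cols are names introduced here:
import numpy as np

marker = "*"
cols = [c for c in df.columns if df[c].eq(marker).any()]  # only columns containing the marker
df[cols] = df[cols].mask(df[cols].eq(marker),
                         df[cols].replace(marker, np.nan).ffill())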

pandas take average on odd rows

I want to fill in data between each row in a dataframe with the average of the current and next row (the columns are numeric).
starting data:
time value value_1 value-2
0 0 0 4 3
1 2 1 6 6
intermediate df:
time value value_1 value-2
0 0 0 4 3
1 1 0 4 3 #duplicate of row 0
2 2 1 6 6
3 3 1 6 6 #duplicate of row 2
I would like to create df_1:
time value value_1 value-2
0 0 0 4 3
1 1 0.5 5 4.5 #average of row 0 and 2
2 2 1 6 6
3 3 2 8 8 #average of row 2 and 4
To do this, I appended a copy of the starting dataframe to create the intermediate dataframe shown above:
df = df_0.append(df_0)
df.sort_values(['time'], ascending=[True], inplace=True)
df = df.reset_index()
df['value_shift'] = df['value'].shift(-1)
df['value_shift_1'] = df['value_1'].shift(-1)
df['value_shift_2'] = df['value-2'].shift(-1)
then I was thinking of applying a function to each column:
def average_vals(row):
    # average every odd row
    if int(row.name) % 2 != 0:
        # take the average of value and value_shift for each value,
        # but this way I need to create 3 separate functions
        ...
Is there a way to do this without writing a separate function for each column and applying to each column one by one (in real data I have tens of columns)?
How about this method using DataFrame.reindex and DataFrame.interpolate
df.reindex(np.arange(len(df.index) * 2) / 2).interpolate().reset_index(drop=True)
Explanation
Reindex in half steps: reindex(np.arange(len(df.index) * 2) / 2)
This gives a DataFrame like this:
time value value_1 value-2
0.0 0.0 0.0 4.0 3.0
0.5 NaN NaN NaN NaN
1.0 2.0 1.0 6.0 6.0
1.5 NaN NaN NaN NaN
Then use DataFrame.interpolate to fill in the NaN values; the default is linear interpolation, so the mean of the two neighbours in this case.
Finally, use .reset_index(drop=True) to fix your index.
This should give (note that the last row repeats the previous one, since linear interpolation does not extrapolate beyond the last valid value):
time value value_1 value-2
0 0.0 0.0 4.0 3.0
1 1.0 0.5 5.0 4.5
2 2.0 1.0 6.0 6.0
3 2.0 1.0 6.0 6.0
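The same reindex/interpolate pattern generalises to inserting more than one row between each pair. A sketch (upsample_with_means and k are names introduced here), with the same caveat that the trailing partial rows repeat the last row:
import numpy as np
import pandas as pd

def upsample_with_means(df, k=2):
    # Insert k-1 linearly interpolated rows between each pair of rows.
    new_index = np.arange(len(df.index) * k) / k
    return df.reindex(new_index).interpolate().reset_index(drop=True)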

Calculate the two rows following a row with a certain value

I have a dataframe with ones and NaN values and would like to set the two rows following each 1 to 2 and 3.
import pandas as pd
df=pd.DataFrame({"b" : [1,None,None,None,None,1,None,None,None]})
print(df)
b
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Like this:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
I know I can use df.loc[df['b']==1] to retrieve the ones, but I don't know how to compute the two rows below them.
You can create a grouping variable where each 1 in b starts a new group, then forward fill 2 rows within each group and take a cumsum:
g = (df.b == 1).cumsum()
df.b.groupby(g).apply(lambda g: g.ffill(limit = 2).cumsum())
#0 1.0
#1 2.0
#2 3.0
#3 NaN
#4 NaN
#5 1.0
#6 2.0
#7 3.0
#8 NaN
#Name: b, dtype: float64
One without groupby:
temp = df.ffill(limit=2).cumsum()                   # running total over the 1s, filled two rows forward
temp - temp.mask(df.b.isnull()).ffill(limit=2) + 1  # subtract each run's baseline (the cumsum at its 1) and add 1
Out[91]:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
Using your current line of thinking, you simply need the indices of the rows after the 1s and to set them to the appropriate values:
import numpy as np

df.loc[np.where(df['b']==1)[0]+1, 'b'] = 2
df.loc[np.where(df['b']==1)[0]+2, 'b'] = 3
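One caveat with this approach: if a 1 occurs in the last one or two rows, index + 1 / index + 2 point past the end of the frame, and .loc would silently create new rows there. A guarded sketch (ones, offset and val are names introduced here), assuming the default RangeIndex so positions match labels:
ones = np.where(df['b'] == 1)[0]
for offset, val in ((1, 2), (2, 3)):
    idx = ones + offset
    df.loc[idx[idx < len(df)], 'b'] = val  # drop positions past the last row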
