pandas: Perform multiple rolling computations efficiently? - python

Say I have a dataset that is indexed on date:
id, date, col1, col2
1, 4, 1, 12
1, 5, 2, 13
1, 6, 6, 14
2, 4, 20, 16
2, 5, 8, 17
2, 6, 11, 18
...
and I wish to compute the rolling mean, sum, min, max for col1 and col2, grouped by id, with window sizes 2 and 3. I can do that in a loop like so:
def multi_rolling(df, winsize, column):
    return [df.groupby("id")[column].rolling(winsize).mean(),
            df.groupby("id")[column].rolling(winsize).sum(),
            df.groupby("id")[column].rolling(winsize).min(),
            df.groupby("id")[column].rolling(winsize).max(),
            df.groupby("id")[column].rolling(winsize).count()]
Then I just have to call the above in a loop. But this feels inefficient. Is there a way to call it for all combinations of functions, columns and window sizes more efficiently? E.g. run them in parallel?

Use pandas.DataFrame.agg:
new_df = df.groupby("id").rolling(2)[["col1","col2"]].agg(['mean','sum','min','max','count'])
print(new_df)
Output:
        mean          sum          min          max         count
        col1  col2   col1  col2   col1  col2   col1  col2   col1  col2
id
1  0     NaN   NaN    NaN   NaN    NaN   NaN    NaN   NaN    1.0   1.0
   1     1.5  12.5    3.0  25.0    1.0  12.0    2.0  13.0    2.0   2.0
   2     4.0  13.5    8.0  27.0    2.0  13.0    6.0  14.0    2.0   2.0
2  3     NaN   NaN    NaN   NaN    NaN   NaN    NaN   NaN    1.0   1.0
   4    14.0  16.5   28.0  33.0    8.0  16.0   20.0  17.0    2.0   2.0
   5     9.5  17.5   19.0  35.0    8.0  17.0   11.0  18.0    2.0   2.0
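The question also asks for window sizes 2 and 3. One way to cover both in a single result is to run the same agg per window size and concatenate the pieces, with the window size as an extra column level. A minimal sketch (the windows list and the dict-keyed concat layout are my own choices, not part of the answer above):
windows = [2, 3]
multi = pd.concat(
    {w: df.groupby("id").rolling(w)[["col1", "col2"]].agg(['mean', 'sum', 'min', 'max', 'count'])
     for w in windows},
    axis=1)  # the window size becomes the outermost column level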

Because your question is ambiguous, I'm not sure I understand what you need the output to look like.
But see if this one-liner helps:
df.groupby("id")[column].rolling(winsize).agg(['mean','sum','min','max','count'])
Because your loop groups repeatedly, it is bound to be inefficient.

Related

Is there a function to add non-null columns in Python?

d = pd.DataFrame({'col1': [1, np.nan, np.nan, 4, 7],
                  'col2': [4, 5, np.nan, 9, 5]})
I want the sum to be null when both values (col1 and col2) are null. How can this be achieved?
d['SUM'] = d[['col1', 'col2']].sum(axis=1)
d
With the sum function I got 0 as the sum of the all-null row:
col1 col2 SUM
0 1.0 4.0 5.0
1 NaN 5.0 5.0
2 NaN NaN 0.0
3 4.0 9.0 13.0
4 7.0 5.0 12.0
You can mask according to your rule:
cols = ['col1', 'col2']
df['SUM'] = df[cols].sum(axis=1).mask(df[cols].isna().all(1))
output:
col1 col2 SUM
0 1.0 4.0 5.0
1 NaN 5.0 5.0
2 NaN NaN NaN
3 4.0 9.0 13.0
4 7.0 5.0 12.0
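Another option that follows the same rule (NaN only when every value in the row is NaN) is the min_count parameter of sum, which returns NaN whenever fewer than min_count non-NaN values are present:
cols = ['col1', 'col2']
df['SUM'] = df[cols].sum(axis=1, min_count=1)  # NaN when the row has no non-NaN value at all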
If you want any NaN to yield NaN:
cols = ['col1', 'col2']
df['SUM'] = df[cols].sum(axis=1, skipna=False)
output:
col1 col2 SUM
0 1.0 4.0 5.0
1 NaN 5.0 NaN
2 NaN NaN NaN
3 4.0 9.0 13.0
4 7.0 5.0 12.0

Pandas: Fillna with local average if a condition is met

Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value, if both of them are not NaN?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if they are all not NaN)?
We can shift the dataframe forward and backward, add the two shifted frames together, divide by two, and use the result to fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
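For the follow-up question about averaging the previous n and succeeding n values, the same shift idea can be generalized with rolling windows. A sketch, assuming a value should only be filled when all 2n neighbours are non-NaN (the helper name and the min_periods choice are mine):
def fill_with_neighbour_mean(df, n=1):
    # mean of the n values directly above each row; NaN unless all n are present
    before = df.shift(1).rolling(n, min_periods=n).mean()
    # same for the n values directly below, computed on the reversed frame
    after = df[::-1].shift(1).rolling(n, min_periods=n).mean()[::-1]
    return df.fillna((before + after) / 2)
With n=1 this reproduces the shift-based result above.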

How to delete the first and last rows with NaN of a dataframe and replace the remaining NaN with the average of the values below and above?

Let's take this dataframe as a simple example:
df = pd.DataFrame(dict(Col1=[np.nan,1,1,2,3,8,7], Col2=[1,1,np.nan,np.nan,3,np.nan,4], Col3=[1,1,np.nan,5,1,1,np.nan]))
Col1 Col2 Col3
0 NaN 1.0 1.0
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
5 8.0 NaN 1.0
6 7.0 4.0 NaN
I would first like to remove rows from the top and the bottom until the first and last rows no longer contain any NaN.
Intermediate expected output :
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 NaN NaN
3 2.0 NaN 5.0
4 3.0 3.0 1.0
Then, I would like to replace each remaining NaN with the mean of the nearest non-NaN value below and the nearest non-NaN value above.
Final expected output :
Col1 Col2 Col3
0 1.0 1.0 1.0
1 1.0 2.0 3.0
2 2.0 2.0 5.0
3 3.0 3.0 1.0
I know I can have the positions of NaN in my dataframe through
df.isna()
But I can't solve my problem. How could I do this?
My approach:
# flag rows that contain no NaN at all
s = df.notnull().all(1)
# trim rows containing NaN from the beginning and the end:
new_df = df.loc[s.idxmax():s[::-1].idxmax()]
# fill remaining NaN with the mean of the nearest values above and below:
new_df = (new_df.ffill() + new_df.bfill()) / 2
Output:
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
Another option would be to use DataFrame.interpolate with round:
nans = df.notna().all(axis=1).cumsum().drop_duplicates()
low, high = nans.idxmin(), nans.idxmax()
df.loc[low+1: high].interpolate().round()
Col1 Col2 Col3
1 1.0 1.0 1.0
2 1.0 2.0 3.0
3 2.0 2.0 5.0
4 3.0 3.0 1.0
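The first answer's two steps can also be wrapped into one small helper. A sketch (trim_and_fill is just an illustrative name; reset_index only serves to match the 0-based index of the expected output):
def trim_and_fill(df):
    full = df.notna().all(axis=1)                         # True for rows with no NaN at all
    trimmed = df.loc[full.idxmax():full[::-1].idxmax()]   # drop leading/trailing rows that contain NaN
    filled = (trimmed.ffill() + trimmed.bfill()) / 2      # mean of nearest non-NaN above and below
    return filled.reset_index(drop=True)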

How to concat between columns keeping sequence unchanged in 2 dataframes (pandas)

I have 2 dataframes and I want to concat them as follows:
df1:
index 394 min FIC-2000 398 min FFC
0 Recycle Gas min 20K20 Compressor min 20k
1 TT date kg/h AT date ..
2 nan 2011-03-02 08:00:00 -20.7 2011-03-02 08:00:00
3 nan 2011-03-02 08:00:10 -27.5 ...
df2:
index Unnamed:0 0 1 .. 394 395 .....
0 Service Prop Prop1 Recycle Gas RecG
the output df3 should be like this:
df3
index Unnamed:0 0 .. 394 395..
0 Service Prop Recycle Gas RecG
1 Recycle Gas min FIC-2000
2 min 20K20
3 TT date kg/h
4 nan 2011-03-02 08:00:00 -20.7
5 nan 2011-03-02 08:00:10 -27.5
I've tried to use this code:
df3 = pd.concat([df1, df2], axis=1)
but this just concats index 394, and the rest of df1 is appended to the end of df2.
Any idea how to do this?
Just change to axis=0.
Consider this:
Input:
>>> df
col1 col2 col3
0 1 4 2
1 2 1 5
2 3 6 319
>>> df_1
col4 col5 col6
0 1 4 12
1 32 12 3
2 3 2 319
>>> df_2
col1 col3 col6
0 12 14 2
1 4 132 3
2 23 22 9
Concat mismatched (per column name)
>>> pd.concat([df, df_1], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
Concat matching:
>>> pd.concat([df, df_1, df_2], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
0 12.0 NaN 14.0 NaN NaN 2.0
1 4.0 NaN 132.0 NaN NaN 3.0
2 23.0 NaN 22.0 NaN NaN 9.0
Concat matched, fill NaNs (analogously, you can fill Nones):
>>> pd.concat([df, df_1, df_2], axis=0).fillna(0)  # to prettify; for string columns you might use .fillna('') instead
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 0.0 0.0 0.0
1 2.0 1.0 5.0 0.0 0.0 0.0
2 3.0 6.0 319.0 0.0 0.0 0.0
0 0.0 0.0 0.0 1.0 4.0 12.0
1 0.0 0.0 0.0 32.0 12.0 3.0
2 0.0 0.0 0.0 3.0 2.0 319.0
0 12.0 0.0 14.0 0.0 0.0 2.0
1 4.0 0.0 132.0 0.0 0.0 3.0
2 23.0 0.0 22.0 0.0 0.0 9.0
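Note that the concatenated frames keep their original row labels, which is why 0, 1, 2 repeat in the index above. If a fresh RangeIndex is preferred, ignore_index can be passed (a small variation on the same call):
>>> pd.concat([df, df_1, df_2], axis=0, ignore_index=True).fillna(0)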
EDIT
Prompted by the conversation with the OP in the comments.
So you do:
(1) To concat the dataframes:
df3 = pd.concat([df1, df2], axis=0)
(2) To join another dataframe onto them:
df5 = pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
(you may want to consider the suffixes argument if you think it's relevant)
REF: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
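For reference, the suffixes argument only matters when both frames share column names besides the join key. A toy sketch with throwaway frames (left and right are not the OP's data):
left = pd.DataFrame({"FIC": [1, 2], "min": [10, 20]})
right = pd.DataFrame({"FIC": [1, 2], "min": [30, 40]})
pd.merge(left, right, on="FIC", how="outer", suffixes=("_df3", "_df4"))
# the overlapping column "min" becomes min_df3 and min_df4; "FIC" stays as the join key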

pandas shift rows NaNs

Say we have a dataframe set up as follows:
x = pd.DataFrame(np.random.randint(1, 10, 30).reshape(5, 6),
                 columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan
col0 col1 col2 col3 col4 col5 col6 col7
0 6 5 1 5 2 4 NaN NaN
1 8 8 9 6 7 2 NaN NaN
2 8 3 9 6 6 6 NaN NaN
3 8 4 4 4 8 9 NaN NaN
4 5 3 4 3 8 7 NaN NaN
When calling x.shift(2, axis=1), columns col2 through col5 shift correctly, but col6 and col7 stay as NaN.
How can I overwrite the NaN values in col6 and col7 with the values from col4 and col5? Is this a bug or intended behaviour?
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 NaN NaN
1 NaN NaN 8.0 8.0 9.0 6.0 NaN NaN
2 NaN NaN 8.0 3.0 9.0 6.0 NaN NaN
3 NaN NaN 8.0 4.0 4.0 4.0 NaN NaN
4 NaN NaN 5.0 3.0 4.0 3.0 NaN NaN
It's possible this is a bug; in any case, you can use np.roll to achieve this:
In[11]:
x.apply(lambda x: np.roll(x, 2), axis=1)
Out[11]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
Speed-wise, it's probably quicker to construct a new df, reusing the existing columns and passing the result of np.roll as the data argument to the DataFrame constructor:
In[12]:
x = pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
x
Out[12]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
timings
In[13]:
%timeit pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
%timeit x.fillna(0).astype(int).shift(2, axis=1)
10000 loops, best of 3: 117 µs per loop
1000 loops, best of 3: 418 µs per loop
So constructing a new df from the result of np.roll is quicker than first filling the NaN values, casting to int, and then shifting.
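One caveat when swapping shift for np.roll: roll wraps values around instead of dropping them, which only happens to be harmless here because col6 and col7 already hold NaN. A minimal illustration of the difference (d is a throwaway example frame):
d = pd.DataFrame([[1, 2, 3, 4]], columns=list('abcd'))
print(d.shift(1, axis=1))                # the trailing 4 is dropped:  NaN  1.0  2.0  3.0
print(np.roll(d.to_numpy(), 1, axis=1))  # the trailing 4 wraps around: [[4 1 2 3]]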
