I have a pandas data frame like this;
>df
leg speed
1 10
1 11
1 12
1 13
1 12
1 15
1 19
1 12
2 10
2 10
2 12
2 15
2 19
2 11
: :
I want to make a new column roll_speed where it takes a rolling average speed of the last 5 positions. But I wanna put more detailed condition in it.
Groupby leg(it doesn't take into account the speed of the rows in different leg.
I want the rolling window to be changed from 1 to 5 maximum according to the available rows. For example in leg == 1, in the first row there is only one row to calculate, so the rolling speed should be 10/1 = 10. For the second row, there are only two rows available for calculation, the rolling speed should be (10+11)/2 = 10.5.
leg speed roll_speed
1 10 10 # 10/1
1 11 10.5 # (10+11)/2
1 12 11 # (10+11+12)/3
1 13 11.5 # (10+11+12+13)/4
1 12 11.6 # (10+11+12+13+12)/5
1 15 12.6 # (11+12+13+12+15)/5
1 19 14.2 # (12+13+12+15+19)/5
1 12 14.2 # (13+12+15+19+12)/5
2 10 10 # 10/1
2 10 10 # (10+10)/2
2 12 10.7 # (10+10+12)/3
2 15 11.8 # (10+10+12+15)/4
2 19 13.2 # (10+10+12+15+19)/5
2 11 13.4 # (10+12+15+19+11)/5
: :
My attempt:
df['roll_speed'] = df.speed.rolling(5).mean()
But it just returns NA for rows where less than five rows are available for calculation. How should I solve this problem? Thank you for any help!
Set the parameter min_periods to 1
df['roll_speed'] = df.groupby('leg').speed.rolling(5, min_periods = 1).mean()\
.round(1).reset_index(drop = True)
leg speed roll_speed
0 1 10 10.0
1 1 11 10.5
2 1 12 11.0
3 1 13 11.5
4 1 12 11.6
5 1 15 12.6
6 1 19 14.2
7 1 12 14.2
8 2 10 10.0
9 2 10 10.0
10 2 12 10.7
11 2 15 11.8
12 2 19 13.2
13 2 11 13.4
Using rolling(5) will get you your results for all but the first 4 occurences of each group. We can fill the remaining values with the expanding mean:
(df.groupby('leg').speed.rolling(5)
.mean().fillna(df.groupby('leg').speed.expanding().mean())
).reset_index(drop=True)
0 10.000000
1 10.500000
2 11.000000
3 11.500000
4 11.600000
5 12.600000
6 14.200000
7 14.200000
8 10.000000
9 10.000000
10 10.666667
11 11.750000
12 13.200000
13 13.400000
Name: speed, dtype: float64
Related
given the following example table
Index
Date
Weekday
Value
1
05/12/2022
2
10
2
06/12/2022
3
20
3
07/12/2022
4
40
4
09/12/2022
6
10
5
10/12/2022
7
60
6
11/12/2022
1
30
7
12/12/2022
2
40
8
13/12/2022
3
50
9
14/12/2022
4
60
10
16/12/2022
6
20
11
17/12/2022
7
50
12
18/12/2022
1
10
13
20/12/2022
3
20
14
21/12/2022
4
10
15
22/12/2022
5
40
I want to calculate a rolling average of the last three observations (at least) a week ago. I cannot use .shift as some dates are randomly missing, and .shift would therefore not produce a reliable output.
Desired output example for last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!
Thanks!
MOSTLY inspired from #Aidis you could, make his solution an apply:
df['mean']=df.apply(lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
or spliting the data at each call which may run faster if you have lots of data (to be tested):
df['mean']=df.apply(lambda y: df.loc[:y.name, "Value"][ df.loc[:y.name,'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000
I apologize for this ugly code. But it seems to work:
df = df.set_index("Index")
df['Date'] = df['Date'].astype("datetime64")
for id in df.index:
dfs = df.loc[:id]
mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
print(id, mean)
Result:
1 nan
2 10.0
3 15.0
4 23.333333333333332
5 23.333333333333332
6 36.666666666666664
7 33.333333333333336
8 33.333333333333336
9 33.333333333333336
10 33.333333333333336
11 33.333333333333336
12 33.333333333333336
13 40.0
14 50.0
15 50.0
need to get output in column <5_Days_Up> like the image.
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 5
25-May-21 7 6
26-May-21 8 7
27-May-21 9 8
28-May-21 10 9
29-May-21 11 10
30-May-21 12 11
31-May-21 13 12
1-Jun-21 14 13
2-Jun-21 15 14
But, got the output like this.
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 6
25-May-21 7 7
26-May-21 8 8
27-May-21 9 9
28-May-21 10 10
29-May-21 11 11
30-May-21 12 12
31-May-21 13 13
1-Jun-21 14 14
2-Jun-21 15 15
Here, in python pandas, I am using
df['5_Days_Up'] = df['price'].rolling(window=5).max()
is there a way to get the maximum value of the last 5 periods after skipping the today's price using the same rolling() or any other?
Your data has only 4 (instead of 5) previous entries before the entry on date 24-May-21 with price equals 6 (owing to there is no price equals 3 in the data sample.) Therefore, your first entry to show non-NaN value will start from the date 25-May-21 with price equals 7.
To include up to the previous entry (exclude current entry), you can use the parameter closed='left' to achieve this:
df['5_Days_Up'] = df['price'].rolling(window=5, closed='left').max()
Result:
Date price 5_Days_Up
0 20-May-21 1 NaN
1 21-May-21 2 NaN
2 22-May-21 4 NaN
3 23-May-21 5 NaN
4 24-May-21 6 NaN
5 25-May-21 7 6.0
6 26-May-21 8 7.0
7 27-May-21 9 8.0
8 28-May-21 10 9.0
9 29-May-21 11 10.0
10 30-May-21 12 11.0
11 31-May-21 13 12.0
12 1-Jun-21 14 13.0
13 2-Jun-21 15 14.0
So I have a dataframe that has the beginning and end times of certain activities in subsequent rows that have the same id and activity. Every now and then there is a row without an end that I want to drop evtl. (id 3 & 5 in this example). The rows that are paired (with id/act pairs: 1/10,2/10 & 1/10 at a different time) can be merged, i.e. the second row can be dropped. I can add the end times simply by shifting one column, but I am having a hard time getting rid of the unnecessary rows without iterating through the whole dataframe.
import pandas as pd
df = pd.DataFrame([[1,10,20],[1,10,25],[2,10,40],[2,10,41],[3,10,42],[1,10,45],[1,10,45],[5,10,50]], columns=['id','act','time'])
df["time 2"]=df["time"].shift(-1)
Thank yo uso much for the quick reply, but I actually fixed this myself with a very simple solution:
df = pd.DataFrame([[1,10,20],[1,10,25],[2,10,40],[2,10,41],[3,10,42],[1,10,45],[1,10,45],[5,10,50]], columns=['id','act','time'])
id act time
0 1 10 20
1 1 10 25
2 2 10 40
3 2 10 41
4 3 10 42
5 1 10 45
6 1 10 45
7 5 10 50
df["end"]=df["time"].shift(-1)
df["id 2"]=df["id"].shift(-1)
df["act 2"]=df["act"].shift(-1)
df.drop(df.index[len(df)-1],inplace=True)
id act time time 2 id 2 act 2
0 1 10 20 25.0 1.0 10.0
1 1 10 25 40.0 2.0 10.0
2 2 10 40 41.0 2.0 10.0
3 2 10 41 42.0 3.0 10.0
4 3 10 42 45.0 1.0 10.0
5 1 10 45 45.0 1.0 10.0
6 1 10 45 50.0 5.0 10.0
df=df.loc[(df["id"]==df["id 2"])== (df["act"]==df["act 2"])]
df.drop(columns=["id 2","act 2"],axis=0,inplace=True)
id act time end
0 1 10 20 25.0
2 2 10 40 41.0
5 1 10 45 45.0
I have a DataFrame, say df, which looks like this:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 NaN
Now, I need the pro column to have the same value as the property_type column, whenever the property_type1 column has a NaN value. This is how it should be:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 29
That is, in line 11, where property_type1 is NaN, the value of the pro column becomes 29, which is the value of property_type. How can I do this?
ix is deprecated, don't use it.
Option 1
I'd do this with np.where -
df = df.assign(pro=np.where(df.pro.isnull(), df.property_type, df.pro))
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Option 2
If you want to perform in-place assignment, use loc -
m = df.pro.isnull()
df.loc[m, 'pro'] = df.loc[m, 'property_type']
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Compute the mask just once, and use it to index multiple times, which should be more efficient than computing it twice.
Find the rows where property_type1 column is NaN, and for those rows: assign the property_type values to the pro column.
df.ix[df.property_type1.isnull(), 'pro'] = df.ix[df.property_type1.isnull(), 'property_type']
I'm working a Pandas Dataframe, that looks like this:
0 Data
1
2
3
4 5
5
6
7
8 21
9
10 2
11
12
13
14
15
I'm trying to fill the blank with next valid values by: df.fillna(method='backfill'). This works, but then I need to add the previous valid value to the next valid value, from the bottom up, such as:
0 Data
1 28
2 28
3 28
4 28
5 23
6 23
7 23
8 23
9 2
10 2
11
12
13
14
15
I can get this to work by looping over it, but is there a method within pandas that can do this?
Thanks a lot!
You could reverse the df, then fillna(0) and then cumsum and reverse again:
In [12]:
df = df[::-1].fillna(0).cumsum()[::-1]
df
Out[12]:
Data
0 28.0
1 28.0
2 28.0
3 28.0
4 23.0
5 23.0
6 23.0
7 23.0
8 2.0
9 2.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
here we use slicing notation to reverse the df, then replace all NaN with 0, perform cumsum and reverse back
Another simple way to do that : df.sum()-df.fillna(0).cumsum()