Moving average with pandas using the 2 prior occurrences - python

I was able to find the proper formula for a Moving average here: Moving Average SO Question
The issue is that it uses the one prior occurrence plus the current row's value. I am trying to use the two occurrences prior to the row I am trying to predict.
import pandas as pd
import numpy as np
df = pd.DataFrame({'person': ['john','mike','john','mike','john','mike'],
                   'pts': [10,9,2,2,5,5]})
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean())
OUTPUT:
  person  pts  avg
0   john   10  NaN
1   mike    9  NaN
2   john    2  6.0
3   mike    2  5.5
4   john    5  3.5
5   mike    5  3.5
From the output we see that John's second entry averages his first entry and the current row. What I am looking for is John's and Mike's last occurrences to be John: 6 and Mike: 5.5, using the prior two values rather than the previous one plus the current row's input. I am using this for a prediction and would not know the current row's pts because they haven't happened yet. I'm new to Machine Learning and this was my first thought for a feature.

If you want the shift applied per group, add Series.shift to the lambda function:
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean().shift())
print (df)
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5
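An equivalent variant (an assumption worth verifying on your own data) is to shift before rolling, so the window only ever contains rows prior to the current one:
# shift first, then take the rolling mean of the two prior rows
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.shift().rolling(2).mean())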

Try:
df['avg'] = df.groupby('person').rolling(3)['pts'].sum().reset_index(level=0, drop=True)
df['avg'] = df['avg'].sub(df['pts']).div(2)
The rolling window of 3 includes the current row, so subtracting the current pts and dividing by 2 leaves the mean of the two prior values.
Outputs:
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5

Related

Delete rows between NaN and a change in the column value

I am stuck on a problem which looks simple but for which I cannot find a proper solution.
Consider a given Pandas dataframe df, composed of multiple columns A1, A2, etc., and let Ai be one of its columns, filled for example as follows:
Ai
25
30
30
NaN
12
15
15
NaN
I would like to delete all the rows in df whose Ai values lie between a NaN and a "further change" in the value, so that my output (for column Ai) would be:
Ai
25
NaN
12
NaN
Any idea on how to do so would be very much appreciated. Thank you very much in advance.
Update
Similar to the previous solution, but with a filter per group to keep the early duplicates:
m = df['Ai'].isna()
df.loc[(m | m.shift(fill_value=True))
       .groupby(df['Ai'].ne(df['Ai'].shift()).cumsum())
       .filter(lambda d: d.sum() > 0)
       .index]
output:
Ai
0 25.0
1 25.0
2 25.0
5 NaN
6 30.0
7 30.0
9 NaN
Original answer
This is equivalent to selecting the NaNs and the row just below each NaN (plus the first row). You could use a mask:
m = df['Ai'].isna()
df[m|m.shift(fill_value=True)]
Output:
Ai
0 25.0
3 NaN
4 12.0
7 NaN
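To see why the mask works, here is a minimal reproduction assuming the sample column from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Ai': [25, 30, 30, np.nan, 12, 15, 15, np.nan]})

m = df['Ai'].isna()                  # True on the NaN rows
keep = m | m.shift(fill_value=True)  # NaNs, the row after each NaN, and the first row
print(df[keep])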

Apply a softmax function on groupby in the same pandas dataframe

I have been looking to apply the following softmax function from https://machinelearningmastery.com/softmax-activation-function-with-python/
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
[0.09003057 0.66524096 0.24472847]
I am trying to apply this to a dataframe which is split by groups, and return the probabilities row by row for each group.
My dataframe is:
import pandas as pd
#Create DF
d = {
    'EventNo': ['10', '10', '12', '12', '12'],
    'Name': ['Joe', 'Jack', 'John', 'James', 'Jim'],
    'Rating': [30, 32, 2.5, 3, 4],
}
df = pd.DataFrame(data=d)
df
EventNo Name Rating
0 10 Joe 30.0
1 10 Jack 32.0
2 12 John 2.5
3 12 James 3.0
4 12 Jim 4
In this instance there are two different events (10 and 12) where for event 10 the values are data = [30,32] and event 12 data = [2.5,3,4]
My expected result would be a new column probabilities with the results:
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.1192
1 10 Jack 32.0 0.8807
2 12 John 2.5 0.1402
3 12 James 3.0 0.2312
4 12 Jim 4 0.6285
Any help on how to do this on all groups in the dataframe would be much appreciated! Thanks!
You can use groupby followed by transform which returns results indexed by the original dataframe. A simple way to do it would be
df["Probabilities"] = df.groupby('EventNo')["Rating"].transform(softmax)
The result is
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.119203
1 10 Jack 32.0 0.880797
2 12 John 2.5 0.140244
3 12 James 3.0 0.231224
4 12 Jim 4.0 0.628532
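If scipy is not available, a minimal sketch of the same idea with plain numpy (subtracting the group maximum keeps the exponentials numerically stable and does not change the result):
import numpy as np

df["Probabilities"] = df.groupby("EventNo")["Rating"].transform(
    lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()
)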

Pandas - Count consecutive rows with column values greater than a threshold limit

I have a dataframe where the speed of several persons is recorded on a specific time frame. Below is a simplified version:
df = pd.DataFrame([["Mary",0,2.3], ["Mary",1,1.8], ["Mary",2,3.2],
["Mary",3,3.0], ["Mary",4,2.6], ["Mary",5,2.2],
["Steve",0,1.6], ["Steve",1,1.7], ["Steve",2,2.5],
["Steve",3,2.7], ["Steve",4,2.3], ["Steve",5,1.8],
["Jane",0,1.9], ["Jane",1,2.7], ["Jane",2,2.3],
["Jane",3,1.9], ["Jane",4,2.2], ["Jane",5,2.1]],
columns = [ "name","time","speed (m/s)" ])
print(df)
name  time  speed (m/s)
0 Mary 0 2.3
1 Mary 1 1.8
2 Mary 2 3.2
3 Mary 3 3.0
4 Mary 4 2.6
5 Mary 5 2.2
6 Steve 0 1.6
7 Steve 1 1.7
8 Steve 2 2.5
9 Steve 3 2.7
10 Steve 4 2.3
11 Steve 5 1.8
12 Jane 0 1.9
13 Jane 1 2.7
14 Jane 2 2.3
15 Jane 3 1.9
16 Jane 4 2.2
17 Jane 5 2.1
I'm looking for a way to count, for each name, how many times the speed is greater than 2 m/s for 2 consecutive records or more, and the average duration of these lapses. The real dataframe has more than 1.5 million rows, making loops inefficient.
The result I expect looks like this:
name count average_duration(s)
0 Mary 1 4 # from 2 to 5s (included) - 1 time, 4/1 = 4s
1 Steve 1 3 # from 2 to 4s (included) - 1 time, 3/1 = 3s
2 Jane 2 2 # from 1 to 2s & from 4 to 5s (included) - 2 times, 4/2 = 2s
I've spent more than a day on this problem, without success...
Thanks in advance for your help!
So here's my go:
df['over2'] = df['speed (m/s)']>2
df['streak_id'] = (df['over2'] != df['over2'].shift(1)).cumsum()
streak_groups = df.groupby(['name','over2','streak_id'])["time"].agg(['min','max']).reset_index()
positive_streaks = streak_groups[streak_groups['over2'] & (streak_groups['min'] != streak_groups['max'])].copy()
positive_streaks["duration"] = positive_streaks["max"] - positive_streaks["min"] + 1
result = positive_streaks.groupby('name')['duration'].agg(['size', 'mean']).reset_index()
print(result)
Output:
name size mean
0 Jane 2 2
1 Mary 1 4
2 Steve 1 3
I'm basically giving each False/True streak a unique ID to be able to group by it, so each group is one such consecutive run.
Then I simply take the duration as max time - min time + 1, get rid of the streaks of length 1, and then take the size and mean after grouping by name.
If you want to understand each step better, I suggest printing the intermediate DataFrames I have along the way.
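To match the column names in the expected output, the result can be renamed afterwards (a cosmetic step, not part of the original answer):
result = result.rename(columns={'size': 'count', 'mean': 'average_duration(s)'})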
Here is another version. It checks the condition (greater than 2) and creates a helper series s to keep track of consecutive runs; using Series.where and Series.duplicated we group this result by name, aggregate count and nunique (number of unique values), then divide:
c = df['speed (m/s)'].gt(2)
s = c.ne(c.shift()).cumsum()
u = (s.where(c & s.duplicated(keep=False))
      .groupby(df['name'], sort=False)
      .agg(['count', 'nunique']))
out = (u.join(u['count'].div(u['nunique']).rename("Avg_duration"))
        .reset_index()
        .drop(columns="count")
        .rename(columns={"nunique": "Count"}))
print(out)
name Count Avg_duration
0 Mary 1 4.0
1 Steve 1 3.0
2 Jane 2 2.0
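To follow what the helpers contain, they can be attached to the frame for inspection (a debugging aid, not part of the answer above):
# show the threshold flag, the run id, and which runs survive the duplicate filter
print(df.assign(over2=c, run_id=s, kept=s.where(c & s.duplicated(keep=False))))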
Interesting question! I found it quite difficult to come up with a nice solution using pandas, but if you happen to know R and the dplyr package, then you could write something like this:
library(tidyverse)
df %>%
mutate(indicator = `speed_(m/s)` > 2.0) %>%
group_by(name) %>%
mutate(streak = cumsum(!indicator)) %>%
group_by(streak, .add = TRUE) %>%
summarise(duration = sum(indicator)) %>%
filter(duration >= 2) %>%
summarise(count = n(), mean_duration = mean(duration))
#> # A tibble: 3 x 3
#> name count mean_duration
#> <chr> <int> <dbl>
#> 1 Jane 2 2
#> 2 Mary 1 4
#> 3 Steve 1 3
Created on 2020-08-31 by the reprex package (v0.3.0)
I apologize in advance if this is too off-topic, but I thought that other R-users (or maybe pandas-wizards) would find it interesting.

extract word and fillna in specific range, between two points in pandas

my df:
A,B
hello my world, adam
i like my turbo1, nan
with love,nan
good morning, john
enev one,nan
turbo2,nan
good to you,nan
man too,emily
I want to extract the words turbo1 and turbo2 into column B, and then fill the NaNs around those words, but only up and down until another word appears in column B.
expected output:
A,B
hello my world, adam
i like my turbo1, turbo1
with love,turbo1
good morning, john
enev one,turbo2
turbo2,turbo2
good to you,turbo2
man too,emily
my code:
df['B'] = df['A'].str.extract(r'(turbo1|turbo2)', expand=False).fillna(method='bfill').fillna(method='ffill')
The problem I have is that I cannot restrict the fill to the range between the already existing words.
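For reference, a minimal construction of the sample frame (assuming NaN for the empty B entries shown above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['hello my world', 'i like my turbo1', 'with love', 'good morning',
          'enev one', 'turbo2', 'good to you', 'man too'],
    'B': ['adam', np.nan, np.nan, 'john', np.nan, np.nan, np.nan, 'emily'],
})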
If you need to replace the missing values within each consecutive group of NaNs, use:
m = df['B'].notna()
# for older pandas versions
#m = df['B'].notnull()
g = m.cumsum()[~m]
s = df['A'].str.extract(r'(turbo1|turbo2)', expand=False)
df.loc[~m, 'B'] = df['B'].fillna(s).groupby(g).apply(lambda x: x.ffill().bfill())
print (df)
A B
0 hello my world adam
1 i like my turbo1 turbo1
2 with love turbo1
3 good morning john
4 enev one turbo2
5 turbo2 turbo2
6 good to you turbo2
7 man too emily
Details:
First replace the missing values in B with the values extracted from A, then create unique group ids for the consecutive NaNs, and fill the missing values per group with forward and back filling:
print(df.assign(filled=df['B'].fillna(s),
                cumsum=m.cumsum(),
                g=m.cumsum()[~m]))
A B filled cumsum g
0 hello my world adam adam 1 NaN
1 i like my turbo1 NaN turbo1 1 1.0
2 with love NaN NaN 1 1.0
3 good morning john john 2 NaN
4 enev one NaN NaN 2 2.0
5 turbo2 NaN turbo2 2 2.0
6 good to you NaN NaN 2 2.0
7 man too emily emily 3 NaN

Fill values of a column based on mean of another column

I have a pandas DataFrame. I'm trying to fill the nans of the Price column based on the average price of the corresponding level in the Section column. What's an efficient and elegant way to do this? My data looks something like this
Name Sex Section Price
Joe M 1 2
Bob M 1 nan
Nancy F 2 5
Grace F 1 6
Jen F 2 3
Paul M 2 nan
You could combine groupby, transform, and mean. Note that I've modified your example because otherwise both Sections would have the same mean value. Starting from
In [21]: df
Out[21]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 NaN
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 NaN
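For anyone following along, a minimal construction of this modified frame (names and values taken from the display above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Joe', 'Bob', 'Nancy', 'Grace', 'Jen', 'Paul'],
    'Sex': ['M', 'M', 'F', 'F', 'F', 'M'],
    'Section': [1, 1, 2, 1, 2, 2],
    'Price': [2, np.nan, 5, 6, 10, np.nan],
})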
we can use
df["Price"] = (df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))
to produce
In [23]: df
Out[23]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
This works because we can compute the mean by Section:
In [29]: df.groupby("Section")["Price"].mean()
Out[29]:
Section
1 4.0
2 7.5
Name: Price, dtype: float64
and, using transform, broadcast this back up to a full Series that we can pass to fillna():
In [30]: df.groupby("Section")["Price"].transform("mean")
Out[30]:
0 4.0
1 4.0
2 7.5
3 4.0
4 7.5
5 7.5
Name: Price, dtype: float64
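An equivalent one-liner (a variant, not part of the original answer) pushes the fillna inside the transform:
df['Price'] = df.groupby('Section')['Price'].transform(lambda s: s.fillna(s.mean()))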
pandas, surgical but slower
Refer to @DSM's answer for a quicker pandas solution.
This is a more surgical approach that may provide some perspective, possibly useful:
use groupby
calculate our mean for each Section
means = df.groupby('Section').Price.mean()
identify nulls
use isnull to use for boolean slicing
nulls = df.Price.isnull()
use map
slice the Section column to limit to just those rows with null Price
fills = df.Section[nulls].map(means)
use loc
fill in the spots in df only where nulls are
df.loc[nulls, 'Price'] = fills
All together
means = df.groupby('Section').Price.mean()
nulls = df.Price.isnull()
fills = df.Section[nulls].map(means)
df.loc[nulls, 'Price'] = fills
print(df)
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
by "corresponding level" i am assuming you mean with equal section value.
if so, you can solve this by
for section_value in sorted(set(df.Section)):
    mask = df['Section'] == section_value
    df.loc[mask, 'Price'] = df.loc[mask, 'Price'].fillna(df.loc[mask, 'Price'].mean())
hope it helps! peace
