Pandas - Count consecutive rows with column values greater than a threshold limit - python

I have a dataframe where the speed of several persons is recorded on a specific time frame. Below is a simplified version:
df = pd.DataFrame([["Mary",0,2.3], ["Mary",1,1.8], ["Mary",2,3.2],
["Mary",3,3.0], ["Mary",4,2.6], ["Mary",5,2.2],
["Steve",0,1.6], ["Steve",1,1.7], ["Steve",2,2.5],
["Steve",3,2.7], ["Steve",4,2.3], ["Steve",5,1.8],
["Jane",0,1.9], ["Jane",1,2.7], ["Jane",2,2.3],
["Jane",3,1.9], ["Jane",4,2.2], ["Jane",5,2.1]],
columns = [ "name","time","speed (m/s)" ])
print(df)
name  time  speed (m/s)
0 Mary 0 2.3
1 Mary 1 1.8
2 Mary 2 3.2
3 Mary 3 3.0
4 Mary 4 2.6
5 Mary 5 2.2
6 Steve 0 1.6
7 Steve 1 1.7
8 Steve 2 2.5
9 Steve 3 2.7
10 Steve 4 2.3
11 Steve 5 1.8
12 Jane 0 1.9
13 Jane 1 2.7
14 Jane 2 2.3
15 Jane 3 1.9
16 Jane 4 2.2
17 Jane 5 2.1
I'm looking for a way to count, for each name, how many times the speed is greater than 2 m/s for 2 or more consecutive records, and the average duration of these stretches. The real dataframe has more than 1.5 million rows, making loops inefficient.
The result I expect looks like this:
name count average_duration(s)
0 Mary 1 4 # from 2 to 5s (included) - 1 time, 4/1 = 4s
1 Steve 1 3 # from 2 to 4s (included) - 1 time, 3/1 = 3s
2 Jane 2 2 # from 1 to 2s & from 4 to 5s (included) - 2 times, 4/2 = 2s
I've spent more than a day on this problem, without success...
Thanks in advance for your help!

So here's my go:
df['over2'] = df['speed (m/s)'] > 2
# a new id starts every time the over2 flag flips
df['streak_id'] = (df['over2'] != df['over2'].shift(1)).cumsum()
streak_groups = df.groupby(['name', 'over2', 'streak_id'])["time"].agg(['min', 'max']).reset_index()
# keep only the True streaks that span more than one record
positive_streaks = streak_groups[streak_groups['over2'] & (streak_groups['min'] != streak_groups['max'])].copy()
positive_streaks["duration"] = positive_streaks["max"] - positive_streaks["min"] + 1
result = positive_streaks.groupby('name')['duration'].agg(['size', 'mean']).reset_index()
print(result)
Output:
name size mean
0 Jane 2 2
1 Mary 1 4
2 Steve 1 3
I'm basically giving each False/True streak a unique ID so I can group by it; each group is then one consecutive run.
Then I take the duration as max time - min time + 1, drop the streaks of length 1, and take the size and mean per name.
If you want to understand each step better, I suggest printing the intermediate DataFrames I have along the way.
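For example, a quick sketch for inspecting those intermediates on the sample data (this assumes the snippet above has already been run; it only prints, it doesn't change anything):
# each flip of the over2 flag starts a new streak_id
print(df[['name', 'time', 'speed (m/s)', 'over2', 'streak_id']])
# per-streak time ranges, before dropping single-row streaks
print(streak_groups)
# only the runs of 2 or more consecutive rows above the threshold
print(positive_streaks)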

Here is another version. It checks for the condition (greater than 2) and creates a helper series s to keep track of duplicates later. Then, using Series.where and Series.duplicated, we group this result on name, aggregate count and nunique (number of unique values), and divide:
c = df['speed (m/s)'].gt(2)
s = c.ne(c.shift()).cumsum()
u = (s.where(c & s.duplicated(keep=False))
      .groupby(df['name'], sort=False)
      .agg(['count', 'nunique']))
out = (u.join(u['count'].div(u['nunique']).rename("Avg_duration"))
        .reset_index()
        .drop(columns="count")
        .rename(columns={"nunique": "Count"}))
print(out)
name Count Avg_duration
0 Mary 1 4.0
1 Steve 1 3.0
2 Jane 2 2.0

Interesting question! I found it quite difficult to come up with a nice solution using pandas, but if you happen to know R and the dplyr package, then you could write something like this:
library(tidyverse)

df %>%
  mutate(indicator = `speed_(m/s)` > 2.0) %>%
  group_by(name) %>%
  mutate(streak = cumsum(!indicator)) %>%
  group_by(streak, .add = TRUE) %>%
  summarise(duration = sum(indicator)) %>%
  filter(duration >= 2) %>%
  summarise(count = n(), mean_duration = mean(duration))
#> # A tibble: 3 x 3
#> name count mean_duration
#> <chr> <int> <dbl>
#> 1 Jane 2 2
#> 2 Mary 1 4
#> 3 Steve 1 3
Created on 2020-08-31 by the reprex package (v0.3.0)
I apologize in advance if this is too off-topic, but I thought that other R-users (or maybe pandas-wizards) would find it interesting.

Related

Remove duplicates based on combination of two columns in Pandas

I need to delete duplicated rows based on the combination of two columns (person1 and person2), which contain strings.
For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and carry the same value in the messages column. I need to drop one of these two rows and return the non-duplicated rows as well.
Code to recreate df
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
df
person1 person2 messages
0 0 ryan delta 1
1 1 delta ryan 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 bravo delta 3
5 5 alpha ryan 9
6 6 ryan alpha 9
Answer df should be:
finaldf
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Try as follows:
res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
       .reset_index(drop=True))
print(res)
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Explanation
First, we use df.filter to select just the columns with person*.
For these columns only we use df.apply to turn each row (axis=1) into a frozenset. So, at this stage, we are looking at a pd.Series like this:
0 (ryan, delta)
1 (ryan, delta)
2 (alpha, delta)
3 (bravo, delta)
4 (bravo, delta)
5 (alpha, ryan)
6 (alpha, ryan)
dtype: object
Now we find the duplicate rows using Series.duplicated and prefix ~ to the resulting boolean series, so that we select the inverse (the non-duplicated rows) from the original df.
Finally, we reset the index with df.reset_index.
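If it helps, here is a minimal sketch of just that masking step on the same df (the variable names pairs and keep are only illustrative):
pairs = df.filter(like='person').apply(frozenset, axis=1)  # unordered person pairs
keep = ~pairs.duplicated()                                 # True for the first occurrence of each pair
print(keep)
res = df[keep].reset_index(drop=True)
print(res)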
Here's a less general approach than the one given by @ouroboros1; this only works for your two-column case.
# make a Series of strings: min of person1/person2 concatenated to max of person1/person2
sorted_p1p2 = df[['person1','person2']].min(axis=1) + '_' + df[['person1','person2']].max(axis=1)
# keep only the rows that are not duplicated according to that Series
dedup_df = df[~sorted_p1p2.duplicated()]
You can put the two person columns in order within each row, then drop duplicates.
import pandas as pd

df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})
print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
df = df.drop_duplicates(subset=['person1', 'person2'])
print(df)
After the swap:
person1 person2 messages
0 0 ryan delta 1
1 1 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 delta bravo 3
5 5 ryan alpha 9
6 6 ryan alpha 9
After dropping duplicates:
person1 person2 messages
0 0 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
5 5 ryan alpha 9

How do I create a new column of max values of a column (corresponding to a specific name) using pandas?

I'm wondering if it is possible to use pandas to create a new column holding the max value of another column per name, so that each name has its own max value.
For example:
name value max
Alice 1 9
Linda 1 1
Ben 3 5
Alice 4 9
Alice 9 9
Ben 5 5
Linda 1 1
So for Alice, we are picking the max of 1, 4, and 9, which is 9. For Linda max(1,1) = 1, and for Ben max(3,5) = 5.
I was thinking of using .loc to select the rows where name == "Alice", then get the max value of those rows, then create the new column. But since I'm dealing with a large dataset, this does not seem like a good option. Is there a smarter way to do this so that I don't need to know the specific names?
Group by name and take the max to get the max per name, then merge it back onto the original df:
df.merge(df.groupby(['name'])['value'].max().reset_index(),
         on='name').rename(columns={'value_x': 'value',
                                    'value_y': 'max'})
name value max
0 Alice 1 9
1 Alice 4 9
2 Alice 9 9
3 Linda 1 1
4 Linda 1 1
5 Ben 3 5
6 Ben 5 5
You could use transform or map
df['max'] = df.groupby('name')['value'].transform('max')
or
df['max'] = df['name'].map(df.groupby('name')['value'].max())
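For what it's worth, a minimal self-contained sketch of the transform approach on the example data (the DataFrame construction here is assumed from the table in the question):
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Linda', 'Ben', 'Alice', 'Alice', 'Ben', 'Linda'],
                   'value': [1, 1, 3, 4, 9, 5, 1]})

# transform('max') returns a Series aligned with the original index,
# so it can be assigned directly as a new column
df['max'] = df.groupby('name')['value'].transform('max')
print(df)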

Take a random sample from a dataframe, making sure that every column has at least one non-zero value in the sample

So I have a dataframe that looks like this:
Player   Points  Assists  Rebounds  Steals  Blocks  Wins
Bryant       35        5         5       1       0     1
James        24       11         9       2       1     0
Durant       31        2        12       0       0     0
Curry        29        4         2       2       0     0
Harden       13       12         0       0       1     0
Doncic       12        5         3       0       0     1
Buttler      24        0         2       1       0     0
Paul          0       12         3       3       0     1
And I want to take a random sample from that dataframe, but in such a way that in the resulting sample each column has at least one value different from 0. So for example, if I decide to take a random sample of 3 players, those 3 players can't be James, Durant and Curry, since all three of them have zeros in the Wins column. They also couldn't be Bryant, Doncic and Paul, since they all have zero blocks.
How can I do this ?
FYI: This dataframe is just a simplification; mine has a lot more rows and columns, hence I need a generic answer or method.
Thanks!
Try this. I took the liberty of adding a new player:
import pandas as pd
df = pd.read_csv('./data/players.csv')
_cols = list(df.columns)
_cols.remove('Player')
df['sum'] = df[_cols].sum(axis=1)
df
samples = 3
df[(df['sum']!=0)].sample(samples)
Unfortunately Marcello will never be sampled.
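For reference, here is a self-contained sketch of the same idea that builds the question's DataFrame inline instead of reading players.csv (which isn't included in the post, and without the extra player added above):
import pandas as pd

df = pd.DataFrame({'Player': ['Bryant', 'James', 'Durant', 'Curry', 'Harden', 'Doncic', 'Buttler', 'Paul'],
                   'Points': [35, 24, 31, 29, 13, 12, 24, 0],
                   'Assists': [5, 11, 2, 4, 12, 5, 0, 12],
                   'Rebounds': [5, 9, 12, 2, 0, 3, 2, 3],
                   'Steals': [1, 2, 0, 2, 0, 0, 1, 3],
                   'Blocks': [0, 1, 0, 0, 1, 0, 0, 0],
                   'Wins': [1, 0, 0, 0, 0, 1, 0, 1]})

# total of all stat columns per player; players whose total is 0 can never be drawn
_cols = [c for c in df.columns if c != 'Player']
df['sum'] = df[_cols].sum(axis=1)

samples = 3
print(df[df['sum'] != 0].sample(samples))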
IIUC, you can try something like this:
def sample_df(df, n=3):
    while True:
        dfs = df.sample(n)
        # print(dfs)  # uncomment to see the samples rejected due to all-zero columns
        if not dfs.iloc[:, 1:].sum().eq(0).any():
            return dfs

sample_df(df)
Output:
Player Points Assists Rebounds Steals Blocks Wins
1 James 24 11 9 2 1 0
0 Bryant 35 5 5 1 0 1
2 Durant 31 2 12 0 0 0

Moving average with pandas using the 2 prior occurrences

I was able to find the proper formula for a Moving average here: Moving Average SO Question
The issue is that it uses the one prior occurrence plus the current row's input. I am trying to use the two occurrences prior to the row I am trying to predict.
import pandas as pd
import numpy as np
df = pd.DataFrame({'person': ['john', 'mike', 'john', 'mike', 'john', 'mike'],
                   'pts': [10, 9, 2, 2, 5, 5]})
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean())
OUTPUT:
  person  pts  avg
0   john   10  NaN
1   mike    9  NaN
2   john    2  6.0
3   mike    2  5.5
4   john    5  3.5
5   mike    5  3.5
From the output we see that John's second entry uses his first entry and the current row for the average. What I am looking for is for John's and Mike's last occurrences to be John: 6 and Mike: 5.5, using the two prior rows rather than the previous one plus the current row's input. I am using this for a prediction and would not know the current row's pts because they haven't happened yet. I'm new to machine learning and this was my first thought for a feature.
If you want the shift per group, add Series.shift to the lambda function:
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean().shift())
print (df)
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5
Try this: the rolling sum over 3 rows includes the current row plus the two prior rows for each person, so subtracting the current pts and dividing by 2 leaves the mean of the two prior occurrences:
df['avg'] = df.groupby('person').rolling(3)['pts'].sum().reset_index(level=0, drop=True)
df['avg'] = df['avg'].sub(df['pts']).div(2)
Outputs:
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5

Fill values of a column based on mean of another column

I have a pandas DataFrame. I'm trying to fill the nans of the Price column based on the average price of the corresponding level in the Section column. What's an efficient and elegant way to do this? My data looks something like this
Name Sex Section Price
Joe M 1 2
Bob M 1 nan
Nancy F 2 5
Grace F 1 6
Jen F 2 3
Paul M 2 nan
You could combine groupby, transform, and mean. Note that I've modified your example because otherwise both Sections have the same mean value. Starting from
In [21]: df
Out[21]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 NaN
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 NaN
we can use
df["Price"] = (df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))
to produce
In [23]: df
Out[23]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
This works because we can compute the mean by Section:
In [29]: df.groupby("Section")["Price"].mean()
Out[29]:
Section
1 4.0
2 7.5
Name: Price, dtype: float64
and broadcast this back up to a full Series we can pass to fillna() using transform:
In [30]: df.groupby("Section")["Price"].transform("mean")
Out[30]:
0 4.0
1 4.0
2 7.5
3 4.0
4 7.5
5 7.5
Name: Price, dtype: float64
Pandas, surgical but slower
Refer to @DSM's answer for a quicker pandas solution.
This is a more surgical approach that may provide some perspective, and is possibly useful.
use groupby
calculate our mean for each Section
means = df.groupby('Section').Price.mean()
identify nulls
use isnull to use for boolean slicing
nulls = df.Price.isnull()
use map
slice the Section column to limit to just those rows with null Price
fills = df.Section[nulls].map(means)
use loc
fill in the spots in df only where nulls are
df.loc[nulls, 'Price'] = fills
All together
means = df.groupby('Section').Price.mean()
nulls = df.Price.isnull()
fills = df.Section[nulls].map(means)
df.loc[nulls, 'Price'] = fills
print(df)
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
by "corresponding level" i am assuming you mean with equal section value.
if so, you can solve this by
for section_value in sorted(set(df.Section)):
df.loc[df['Section']==section_value, 'Price'] = df.loc[df['Section']==section_value, 'Price'].fillna(df.loc[df['Section']==section_value, 'Price'].mean())
hope it helps! peace
