How to compare group sizes in pandas - python

Maybe I'm thinking about this the wrong way, but I cannot think of an easy way to do this in pandas. I am trying to filter a dataframe by the relation between the count of values above a setpoint and the count of those below it. It is further complicated by the fact that this has to be done per group.
Contrived example: Let's say I have a dataset of people and their test scores over several tests:
Person | day | test score
--------------------------
Bob    |  1  |     10
Bob    |  2  |     40
Bob    |  3  |     45
Mary   |  1  |     30
Mary   |  2  |     35
Mary   |  3  |     45
I want to filter this dataframe by the number of test scores >= 40 compared to the total, but per person. Let's say I set the threshold to 50%: Bob would have 2/3 of his test scores at or above it, but Mary would only have 1/3 and would be excluded.
My end goal would be to have a groupby object to do means/etc. on those that matched the threshold. So in this case it would look like this:
         test score
Person | above_count | total | score mean
------------------------------------------
Bob    |      2      |   3   |   31.67
I have tried the following but couldn't figure out what to do with my groupby object.
df = pd.read_csv("all_data.csv")
gb = df.groupby('Person')
df2 = df[df['test_score'] >= 40]
gb2 = df2.groupby('Person')
# This would get me the count for each person but how to compare it?
gb.size()

import pandas as pd
import numpy as np

df = pd.DataFrame({'Person': ['Bob'] * 3 + ['Mary'] * 4,
                   'day': [1, 2, 3, 1, 2, 3, 4],
                   'test_score': [10, 40, 45, 30, 35, 45, 55]})
>>> df
Person day test_score
0 Bob 1 10
1 Bob 2 40
2 Bob 3 45
3 Mary 1 30
4 Mary 2 35
5 Mary 3 45
6 Mary 4 55
In a groupby operation, you can pass different functions to perform on the same column via a dictionary.
result = df.groupby('Person').test_score.agg(
    {'total': pd.Series.count,
     'test_score_above_mean': lambda s: s.ge(40).sum(),
     'score mean': np.mean})
>>> result
test_score_above_mean total score mean
Person
Bob 2 3 31.666667
Mary 2 4 41.250000
>>> result[result.test_score_above_mean.gt(result.total * .5)]
test_score_above_mean total score mean
Person
Bob 2 3 31.666667
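Note: passing a dict of functions to a SeriesGroupBy for renaming (as above) was deprecated and then removed in pandas 1.0, where it raises a SpecificationError. On a newer version, a rough equivalent using named aggregation (available since pandas 0.25) would be:
result = df.groupby('Person')['test_score'].agg(
    total='count',
    above_count=lambda s: s.ge(40).sum(),
    score_mean='mean')
result[result['above_count'] > result['total'] * 0.5]
The last line applies the same 50% threshold filter as above, just with the new column names.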

Sum and mean can be done with .agg() on a groupby object, but the threshold function forces you to do a flexible apply.
Untested, but something like this should work:
df.groupby('Person').apply(lambda g: pd.Series({'above_count': (g['test_score'] >= 40).sum(),
                                                'total': len(g), 'score_mean': g['test_score'].mean()}))
You could make the lambda function a more complicated, regular function that implements all the criteria/functionality you want.
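For example, a hedged sketch of such a regular function (summarize and the column names are just placeholders), including the 50% threshold filter the question ultimately wants:
def summarize(group):
    # group is the sub-DataFrame for one person
    scores = group['test_score']
    return pd.Series({'above_count': (scores >= 40).sum(),
                      'total': scores.count(),
                      'score_mean': scores.mean()})

summary = df.groupby('Person').apply(summarize)
print(summary[summary['above_count'] / summary['total'] >= 0.5])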

I think it might make sense to use groupby and aggregations to generate each of your columns as pd.Series, and then paste them together at the end.
import pandas as pd
import numpy as np

df = pd.DataFrame([['Bob', 1, 10], ['Bob', 2, 40], ['Bob', 3, 45],
                   ['Mary', 1, 30], ['Mary', 2, 35], ['Mary', 3, 45]],
                  columns=['Person', 'day', 'test score'])
df_group = df.groupby('Person')
above_count = df_group.apply(lambda x: x[x['test score'] >= 40]['test score'].count())
above_count.name = 'test score above_count'
total_count = df_group['test score'].agg(np.size)
total_count.name = 'total'
test_mean = df_group['test score'].agg(np.mean)
test_mean.name = 'score mean'
# axis=1 lines the three Series up as columns of one DataFrame
results = pd.concat([above_count, total_count, test_mean], axis=1)
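The question's 50% threshold can then be applied to the combined frame; a minimal sketch using the column names assigned above:
passed = results[results['test score above_count'] / results['total'] >= 0.5]
print(passed)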

There is an easy way of doing this ...
import pandas as pd
import numpy as np
data = '''Bob 1 10
Bob 2 40
Bob 3 45
Mary 1 30
Mary 2 35
Mary 3 45'''
data = [d.split() for d in data.split('\n')]
data = pd.DataFrame(data, columns=['Name', 'day', 'score'])
data.score = data.score.astype(float)
data['pass'] = (data.score >=40)*1
data['total'] = 1
You add two columns for easy computation to data. The result should look like this ...
Name day score pass total
0 Bob 1 10 0 1
1 Bob 2 40 1 1
2 Bob 3 45 1 1
3 Mary 1 30 0 1
4 Mary 2 35 0 1
5 Mary 3 45 1 1
Now you summarize the data ...
summary = data.groupby('Name')[['score', 'pass', 'total']].sum().reset_index()
summary['mean score'] = summary['score'] / summary['total']
summary['pass ratio'] = summary['pass'] / summary['total']
print(summary)
The result looks like this ...
Name score pass total mean score pass ratio
0 Bob 95 2 3 31.666667 0.666667
1 Mary 110 1 3 36.666667 0.333333
Now, you can always filter out the names based on pass ratio ...
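For example, a minimal sketch of that last filtering step, keeping only the names whose pass ratio meets the 50% threshold from the question:
passed = summary[summary['pass ratio'] >= 0.5]
print(passed)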

Related

Apply a softmax function on groupby in the same pandas dataframe

I have been looking to apply the following softmax function from https://machinelearningmastery.com/softmax-activation-function-with-python/
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
[0.09003057 0.66524096 0.24472847]
I am trying to apply this to a dataframe which is split by groups, and return the probabilities row by row for each group.
My dataframe is:
import pandas as pd
#Create DF
d = {
    'EventNo': ['10', '10', '12', '12', '12'],
    'Name': ['Joe', 'Jack', 'John', 'James', 'Jim'],
    'Rating': [30, 32, 2.5, 3, 4],
}
df = pd.DataFrame(data=d)
df
EventNo Name Rating
0 10 Joe 30.0
1 10 Jack 32.0
2 12 John 2.5
3 12 James 3.0
4 12 Jim 4
In this instance there are two different events (10 and 12) where for event 10 the values are data = [30,32] and event 12 data = [2.5,3,4]
My expected result would be a new column probabilities with the results:
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.1192
1 10 Jack 32.0 0.8807
2 12 John 2.5 0.1402
3 12 James 3.0 0.2312
4 12 Jim 4 0.6285
Any help on how to do this on all groups in the dataframe would be much appreciated! Thanks!
You can use groupby followed by transform which returns results indexed by the original dataframe. A simple way to do it would be
df["Probabilities"] = df.groupby('EventNo')["Rating"].transform(softmax)
The result is
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.119203
1 10 Jack 32.0 0.880797
2 12 John 2.5 0.140244
3 12 James 3.0 0.231224
4 12 Jim 4.0 0.628532
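If scipy is not available, the same per-group result can be sketched with a plain numpy softmax passed to transform (softmax_np below is just an illustrative helper name, not part of the answer above):
import numpy as np

def softmax_np(s):
    # subtract the max for numerical stability, then normalize the exponentials
    e = np.exp(s - s.max())
    return e / e.sum()

df["Probabilities"] = df.groupby("EventNo")["Rating"].transform(softmax_np)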

Create multiple DataFrames from one pandas DataFrame by grouping by column values [duplicate]

This question already has answers here:
Split pandas dataframe based on groupby
(4 answers)
Closed 2 years ago.
So I have the following dataframe, but with a much larger number of rows (100, 1000, etc.):
#  Person1  Person2  Age
1  Alex     Maria     20
2  Paul     Peter     20
3  Klaus    Hans      30
4  Victor   Otto      30
5  Gerry    Justin    30
Problem:
Now I want to print separate dataframes, each containing all the people of the same age, so the output should look like this:
DF1:
#  Person1  Person2  Age
1  Alex     Maria     20
2  Paul     Peter     20
DF2:
#  Person1  Person2  Age
3  Klaus    Hans      30
4  Victor   Otto      30
5  Gerry    Justin    30
I've tried this with the following functions:
Try1:
def groupAge(data):
    x = -1
    for x in range(len(data)):
        #q = len(data[data["Age"] == data.loc[x, "Age"]])
        b = data[data["Age"] == data.loc[x, "Age"]]
        x = x + 1
        print(b, x)
    return b
Try2:
def groupAge(data):
    x = 0
    for x in range(len(data)):
        q = len(data[data["Age"] == data.loc[x, "Age"]])
        x = x + 1
        for k in range(0, q, q):
            b = data[data["Age"] == data.loc[k, "Age"]]
            print(b)
    return b
Neither of them produced the right output. Try1 prints a few groups, all of them twice, but doesn't go through the entire dataframe, and Try2 only prints the first Age "group", also twice.
I can't figure out why it always prints the output twice, nor why it doesn't work through the entire dataframe.
Can anyone help?
In your first try, you are looping over the length of the dataframe and repeating the line below every time, replacing x with 0, 1, 2, 3 and 4, respectively. On a side note, x = x + 1 is not required; range already takes care of that.
b = data[data["Age"] == data.loc[x,"Age"]]
It will obviously print them twice every time because you are scanning through the entire dataframe data and executing duplicate commands. For example:
print(data.loc[0, 'Age'])
print(data.loc[1, 'Age'])
20
20
Both the above statements print 20, so by substituting 20 in the loop, essentially you will be executing the following commands twice.
b = data[data["Age"] == 20]
I think all you need is this,
unq_age = data['Age'].unique()
df1 = df.loc[df['Age'] == unq_age[0]]
df2 = df.loc[df['Age'] == unq_age[1]]
df1
# Person1 Person2 Age
0 1 Alex Maria 20
1 2 Paul Peter 20
df2
# Person1 Person2 Age
2 3 Klaus Hans 30
3 4 Victor Otto 30
4 5 Gerry Justin 30
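If the number of distinct ages is not known in advance, a more general sketch (in the spirit of the linked duplicate) is to collect one DataFrame per age into a dictionary, using data for the DataFrame as in the first line of the answer above:
# one DataFrame per distinct age, keyed by the age value
dfs = {age: group for age, group in data.groupby('Age')}
for age, group in dfs.items():
    print(f"Age {age}:")
    print(group)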

Pandas - Count consecutive rows with column values greater than a threshold limit

I have a dataframe where the speed of several persons is recorded on a specific time frame. Below is a simplified version:
df = pd.DataFrame([["Mary",0,2.3], ["Mary",1,1.8], ["Mary",2,3.2],
["Mary",3,3.0], ["Mary",4,2.6], ["Mary",5,2.2],
["Steve",0,1.6], ["Steve",1,1.7], ["Steve",2,2.5],
["Steve",3,2.7], ["Steve",4,2.3], ["Steve",5,1.8],
["Jane",0,1.9], ["Jane",1,2.7], ["Jane",2,2.3],
["Jane",3,1.9], ["Jane",4,2.2], ["Jane",5,2.1]],
columns = [ "name","time","speed (m/s)" ])
print(df)
name time (s) speed (m/s)
0 Mary 0 2.3
1 Mary 1 1.8
2 Mary 2 3.2
3 Mary 3 3.0
4 Mary 4 2.6
5 Mary 5 2.2
6 Steve 0 1.6
7 Steve 1 1.7
8 Steve 2 2.5
9 Steve 3 2.7
10 Steve 4 2.3
11 Steve 5 1.8
12 Jane 0 1.9
13 Jane 1 2.7
14 Jane 2 2.3
15 Jane 3 1.9
16 Jane 4 2.2
17 Jane 5 2.1
I'm looking for a way to count, for each name, how many times the speed is greater than 2 m/s for 2 consecutive records or more, and the average duration of these stretches. The real dataframe has more than 1.5 million rows, making loops inefficient.
The result I expect looks like this:
name count average_duration(s)
0 Mary 1 4 # from 2 to 5s (included) - 1 time, 4/1 = 4s
1 Steve 1 3 # from 2 to 4s (included) - 1 time, 3/1 = 3s
2 Jane 2 2 # from 1 to 2s & from 4 to 5s (included) - 2 times, 4/2 = 2s
I've spent more than a day on this problem, without success...
Thanks in advance for your help!
So here's my go:
df['over2'] = df['speed (m/s)']>2
df['streak_id'] = (df['over2'] != df['over2'].shift(1)).cumsum()
streak_groups = df.groupby(['name','over2','streak_id'])["time"].agg(['min','max']).reset_index()
positive_streaks = streak_groups[streak_groups['over2'] & (streak_groups['min'] != streak_groups['max'])].copy()
positive_streaks["duration"] = positive_streaks["max"] - positive_streaks["min"] + 1
result = positive_streaks.groupby('name')['duration'].agg(['size', 'mean']).reset_index()
print(result)
Output:
name size mean
0 Jane 2 2
1 Mary 1 4
2 Steve 1 3
I'm basically giving each False/True streak a unique ID to be able to group by it, so each group is such a consecutive result.
Then I simply take the duration as the max time - min time, get rid of the streaks of len 1, and then get the size and mean of grouping by the name.
If you want to understand each step better, I suggest printing the intermediate DataFrames I have along the way.
Here is another version. It checks for the condition (greater than 2) and creates a helper series s to keep track of duplicates later; then, using Series.where and Series.duplicated, we group on name with this result, aggregate count and nunique (number of unique values), and then divide:
c = df['speed (m/s)'].gt(2)
s = c.ne(c.shift()).cumsum()
u = (s.where(c & s.duplicated(keep=False))
      .groupby(df['name'], sort=False)
      .agg(['count', 'nunique']))
out = (u.join(u['count'].div(u['nunique']).rename("Avg_duration"))
        .reset_index()
        .drop(columns="count")
        .rename(columns={"nunique": "Count"}))
print(out)
name Count Avg_duration
0 Mary 1 4.0
1 Steve 1 3.0
2 Jane 2 2.0
Interesting question! I found it quite difficult to come up with a nice solution using pandas, but if you happen to know R and the dplyr package, then you could write something like this:
library(tidyverse)
df %>%
  mutate(indicator = `speed_(m/s)` > 2.0) %>%
  group_by(name) %>%
  mutate(streak = cumsum(!indicator)) %>%
  group_by(streak, .add = TRUE) %>%
  summarise(duration = sum(indicator)) %>%
  filter(duration >= 2) %>%
  summarise(count = n(), mean_duration = mean(duration))
#> # A tibble: 3 x 3
#> name count mean_duration
#> <chr> <int> <dbl>
#> 1 Jane 2 2
#> 2 Mary 1 4
#> 3 Steve 1 3
Created on 2020-08-31 by the reprex package (v0.3.0)
I apologize in advance if this is too off-topic, but I thought that other R-users (or maybe pandas-wizards) would find it interesting.

How to perform a multiple groupby and transform count with a condition in pandas

This is an extension of the question here: here
I am trying add an extra column to the grouby:
# Import pandas library
import pandas as pd
import numpy as np
# data
data = [['tom', 10, 2, 'c', 100, 'x'], ['tom', 16, 3, 'a', 100, 'x'], ['tom', 22, 2, 'a', 100, 'x'],
        ['matt', 10, 1, 'c', 100, 'x'], ['matt', 15, 5, 'b', 100, 'x'], ['matt', 14, 1, 'b', 100, 'x']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category','Rating','Other'])
df['AttemptsbyRating'] = df.groupby(by=['Rating','Other'])['Attempts'].transform('count')
df
Then I try to add another column for the number of rows that have a Score greater than 1 (which should equal 4):
df['scoregreaterthan1'] = df['Score'].gt(1).groupby(by=df[['Rating','Other']]).transform('sum')
But i am getting a
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Any ideas? Thanks very much!
df['Score'].gt(1) returns a boolean Series rather than a DataFrame, and df[['Rating','Other']] passed as the grouper is a two-column DataFrame, which is why pandas complains that the grouper is not 1-dimensional. It is simpler to build a filtered DataFrame first and then group by the relevant columns.
use:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
df
output:
Name Attempts Score Category Rating Other AttemptsbyRating scoregreaterthan1
0 tom 10 2 c 100 x 6 4
1 tom 16 3 a 100 x 6 4
2 tom 22 2 a 100 x 6 4
4 matt 15 5 b 100 x 6 4
If you want to keep the people who have a score that is not greater than one, then instead of this:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
do this:
df['scoregreaterthan1'] = df[df['Score'].gt(1)].groupby(['Rating','Other'])['Score'].transform('count')
df['scoregreaterthan1'] = df['scoregreaterthan1'].ffill().astype(int)
output 2:
Name Attempts Score Category Rating Other AttemptsbyRating scoregreaterthan1
0 tom 10 2 c 100 x 6 4
1 tom 16 3 a 100 x 6 4
2 tom 22 2 a 100 x 6 4
3 matt 10 1 c 100 x 6 4
4 matt 15 5 b 100 x 6 4
5 matt 14 1 b 100 x 6 4
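If you would rather not filter at all, another option (a sketch, not the only way) is to group the boolean Series by the two key columns passed as a list of Series, each of which is a valid 1-dimensional grouper, and broadcast the count back with transform:
# count rows with Score > 1 per (Rating, Other) group, broadcast back to every row
df['scoregreaterthan1'] = (df['Score'].gt(1)
                             .groupby([df['Rating'], df['Other']])
                             .transform('sum'))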

Pandas Slow. Want first occurrence in DataFrame

I have a DataFrame of people. One of the columns in this DataFrame is a place_id. I also have a DataFrame of places, where one of the columns is place_id and another is weather. For every person, I am trying to find the corresponding weather. Importantly, many people have the same place_ids.
Currently, my setup is this:
def place_id_to_weather(pid):
    return place_df[place_df['place_id'] == pid]['weather'].item()

person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
But this is untenably slow. I would like to speed this up. I suspect that I could achieve a speedup like this:
Instead of returning place_df[...].item(), which does a search for place_id == pid for that entire column and returns a series, and then grabbing the first item in that series, I really just want to curtail the search in place_df after the first match place_df['place_id']==pid has been found. After that, I don't need to search any further. How do I limit the search to first occurrences only?
Are there other methods I could use to achieve a speedup here? Some kind of join-type method?
I think you need drop_duplicates with merge. If place_id and weather are the only common columns in both DataFrames, you can omit the on parameter (it depends on the data; maybe on='place_id' is necessary):
df1 = place_df.drop_duplicates(['place_id'])
print (df1)
print (pd.merge(person_df, df1))
Sample data:
person_df = pd.DataFrame({'place_id': ['s','d','f','s','d','f'],
                          'A': [4,5,6,7,8,9]})
print (person_df)
A place_id
0 4 s
1 5 d
2 6 f
3 7 s
4 8 d
5 9 f
place_df = pd.DataFrame({'place_id': ['s','d','f','s','d','f'],
                         'weather': ['y','e','r','h','u','i']})
print (place_df)
place_id weather
0 s y
1 d e
2 f r
3 s h
4 d u
5 f i
def place_id_to_weather(pid):
    # for the first occurrence, add .iloc[0]
    return place_df[place_df['place_id'] == pid]['weather'].iloc[0]
person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
print (person_df)
A place_id weather
0 4 s y
1 5 d e
2 6 f r
3 7 s y
4 8 d e
5 9 f r
# keep='first' is the default, so it can be omitted
print (place_df.drop_duplicates(['place_id']))
place_id weather
0 s y
1 d e
2 f r
print (pd.merge(person_df, place_df.drop_duplicates(['place_id'])))
A place_id weather
0 4 s y
1 7 s y
2 5 d e
3 8 d e
4 6 f r
5 9 f r
The map function is your quickest method; its purpose is to avoid repeatedly scanning an entire dataframe to run some function. That is what your function ended up doing, i.e. searching the whole place_df on every call, which is fine once but not when repeated. Tweaking your code just a little will significantly speed up the process and only touch the place_df dataframe once:
person_df['weather'] = person_df['place_id'].map(dict(zip(place_df.place_id, place_df.weather)))
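A hedged variant of the same idea: since dict(zip(...)) silently keeps the last weather for a duplicated place_id, you can be explicit about keeping the first occurrence by mapping against a Series indexed by place_id (weather_lookup is just an illustrative name):
# lookup Series: unique place_id index -> weather
weather_lookup = place_df.drop_duplicates('place_id').set_index('place_id')['weather']
person_df['weather'] = person_df['place_id'].map(weather_lookup)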
You can use merge to do the operation :
people = pd.DataFrame([['bob', 1], ['alice', 2], ['john', 3], ['paul', 2]], columns=['name', 'place'])
# name place
#0 bob 1
#1 alice 2
#2 john 3
#3 paul 2
weather = pd.DataFrame([[1, 'sun'], [2, 'rain'], [3, 'snow'], [1, 'rain']], columns=['place', 'weather'])
# place weather
#0 1 sun
#1 2 rain
#2 3 snow
#3 1 rain
pd.merge(people, weather, on='place')
# name place weather
#0 bob 1 sun
#1 bob 1 rain
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow
In case you have several weather entries for the same place, you may want to use drop_duplicates; then you have the following result:
pd.merge(people, weather, on='place').drop_duplicates(subset=['name', 'place'])
# name place weather
#0 bob 1 sun
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow
