Apply a softmax function on groupby in the same pandas dataframe - python

I have been looking to apply the following softmax function from https://machinelearningmastery.com/softmax-activation-function-with-python/
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
[0.09003057 0.66524096 0.24472847]
I am trying to apply this to a dataframe which is split by groups, and return the probabilities row by row for each group.
My dataframe is:
import pandas as pd
#Create DF
d = {
    'EventNo': ['10', '10', '12', '12', '12'],
    'Name': ['Joe', 'Jack', 'John', 'James', 'Jim'],
    'Rating': [30, 32, 2.5, 3, 4],
}
df = pd.DataFrame(data=d)
df
EventNo Name Rating
0 10 Joe 30.0
1 10 Jack 32.0
2 12 John 2.5
3 12 James 3.0
4 12 Jim 4.0
In this instance there are two different events (10 and 12), where for event 10 the values are data = [30, 32] and for event 12, data = [2.5, 3, 4].
My expected result would be a new column probabilities with the results:
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.1192
1 10 Jack 32.0 0.8807
2 12 John 2.5 0.1402
3 12 James 3.0 0.2312
4 12 Jim 4.0 0.6285
Any help on how to do this on all groups in the dataframe would be much appreciated! Thanks!

You can use groupby followed by transform, which returns results indexed like the original dataframe. A simple way to do it would be:
df["Probabilities"] = df.groupby('EventNo')["Rating"].transform(softmax)
The result is
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.119203
1 10 Jack 32.0 0.880797
2 12 John 2.5 0.140244
3 12 James 3.0 0.231224
4 12 Jim 4.0 0.628532
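For reference, a self-contained version of the above, using only the imports and data already shown in the question:
import pandas as pd
from scipy.special import softmax

d = {
    'EventNo': ['10', '10', '12', '12', '12'],
    'Name': ['Joe', 'Jack', 'John', 'James', 'Jim'],
    'Rating': [30, 32, 2.5, 3, 4],
}
df = pd.DataFrame(data=d)

# transform applies softmax within each EventNo group and returns
# a result aligned with the original index, so it can be assigned directly
df["Probabilities"] = df.groupby('EventNo')["Rating"].transform(softmax)
print(df)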

Related

Select top n items in a pandas groupby and calculate the mean

I have the following dataframe:
df = pd.DataFrame({'Value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'Jim', 'John', 'Jim', 'John']})
df
Value Name
0 0 John
1 1 Jim
2 2 John
3 3 Jim
4 4 John
5 5 Jim
6 6 Jim
7 7 John
8 8 Jim
9 9 John
I would like to select the top n items by Name and find the mean from the Value column.
I have tried this:
df['Top2Mean'] = df.groupby(['Name'])['Value'].nlargest(2).transform('mean')
But I get the following error:
ValueError: transforms cannot produce aggregated results
My expected result is a new column called Top2Mean with an 8 next to John and a 7 next to Jim.
Thanks in advance!
Let us calculate the mean on level=0, then map the calculated mean values to the Name column to broadcast the aggregated results:
top2 = df.groupby('Name')['Value'].nlargest(2).mean(level=0)
df['Top2Mean'] = df['Name'].map(top2)
If we need to group on multiple columns, for example Name and City, then we take the mean on level=['Name', 'City'] and map the calculated mean values using MultiIndex.map:
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).mean(level=c)
df['Top2Mean'] = df.set_index(c).index.map(top2)
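Note that Series.mean(level=...) has since been deprecated and removed in newer versions of pandas. If the snippets above raise an error, a sketch of the equivalent using an explicit groupby on the index level:
# single-column case
top2 = df.groupby('Name')['Value'].nlargest(2).groupby(level=0).mean()
df['Top2Mean'] = df['Name'].map(top2)

# multi-column case (assuming the hypothetical City column from above)
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).groupby(level=c).mean()
df['Top2Mean'] = df.set_index(c).index.map(top2)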
Alternative approach with groupby and transform using a custom lambda function
df['Top2Mean'] = df.groupby('Name')['Value']\
.transform(lambda v: v.nlargest(2).mean())
Value Name Top2Mean
0 0 John 8
1 1 Jim 7
2 2 John 8
3 3 Jim 7
4 4 John 8
5 5 Jim 7
6 6 Jim 7
7 7 John 8
8 8 Jim 7
9 9 John 8

How to perform a multiple groupby and transform count with a condition in pandas

This is an extension of the question here.
I am trying to add an extra column to the groupby:
# Import pandas library
import pandas as pd
import numpy as np
# data
data = [['tom', 10, 2, 'c', 100, 'x'], ['tom', 16, 3, 'a', 100, 'x'], ['tom', 22, 2, 'a', 100, 'x'],
        ['matt', 10, 1, 'c', 100, 'x'], ['matt', 15, 5, 'b', 100, 'x'], ['matt', 14, 1, 'b', 100, 'x']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category','Rating','Other'])
df['AttemptsbyRating'] = df.groupby(by=['Rating','Other'])['Attempts'].transform('count')
df
Then I try to add another column counting the rows that have a Score greater than 1 (which should equal 4):
df['scoregreaterthan1'] = df['Score'].gt(1).groupby(by=df[['Rating','Other']]).transform('sum')
But I am getting:
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Any ideas? Thanks very much!
df['Score'].gt(1) returns a boolean Series, and df[['Rating','Other']] is a two-column DataFrame, which groupby cannot use as a single (1-dimensional) grouper. Filter the dataframe first, and then group by the column names.
Use:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
df
output:
Name Attempts Score Category Rating Other AttemptsbyRating scoregreaterthan1
0 tom 10 2 c 100 x 6 4
1 tom 16 3 a 100 x 6 4
2 tom 22 2 a 100 x 6 4
4 matt 15 5 b 100 x 6 4
If you want to keep the people who have a score that is not greater than one, then instead of this:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
do this:
df['scoregreaterthan1'] = df[df['Score'].gt(1)].groupby(['Rating','Other'])['Score'].transform('count')
df['scoregreaterthan1'] = df['scoregreaterthan1'].ffill().astype(int)
output 2:
Name Attempts Score Category Rating Other AttemptsbyRating scoregreaterthan1
0 tom 10 2 c 100 x 6 4
1 tom 16 3 a 100 x 6 4
2 tom 22 2 a 100 x 6 4
3 matt 10 1 c 100 x 6 4
4 matt 15 5 b 100 x 6 4
5 matt 14 1 b 100 x 6 4
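Alternatively, the asker's original one-liner can be fixed by passing a list of Series as the grouper instead of a two-column DataFrame; this keeps every row, so no ffill is needed. A sketch:
# group the boolean Series by the Rating and Other columns directly
df['scoregreaterthan1'] = (df['Score'].gt(1)
                           .groupby([df['Rating'], df['Other']])
                           .transform('sum'))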

Moving average with pandas using the 2 prior occurrences

I was able to find the proper formula for a moving average here: Moving Average SO Question
The issue is that it uses the one prior occurrence plus the current row's input. I am trying to use the two occurrences prior to the row I am trying to predict.
import pandas as pd
import numpy as np
df = pd.DataFrame({'person': ['john', 'mike', 'john', 'mike', 'john', 'mike'],
                   'pts': [10, 9, 2, 2, 5, 5]})
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean())
OUTPUT:
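person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 6.0
3 mike 2 5.5
4 john 5 3.5
5 mike 5 3.5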
From the output we see that John's second entry uses his first entry and the current row for the average. What I am looking for is John's and Mike's last occurrences to be John: 6 and Mike: 5.5, using the prior two occurrences, not the previous one plus the current row's input. I am using this for a prediction and would not know the current row's pts because they haven't happened yet. New to Machine Learning, and this was my first thought for a feature.
If you want the shift per group, add Series.shift to the lambda function:
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean().shift())
print (df)
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5
Alternatively, take a rolling sum of 3 per group (the two prior rows plus the current one), then subtract the current row's pts and halve it to recover the mean of just the two prior occurrences:
df['avg'] = df.groupby('person').rolling(3)['pts'].sum().reset_index(level=0, drop=True)
df['avg']=df['avg'].sub(df['pts']).div(2)
Outputs:
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5

Fill values of a column based on mean of another column

I have a pandas DataFrame. I'm trying to fill the nans of the Price column based on the average price of the corresponding level in the Section column. What's an efficient and elegant way to do this? My data looks something like this
Name Sex Section Price
Joe M 1 2
Bob M 1 nan
Nancy F 2 5
Grace F 1 6
Jen F 2 3
Paul M 2 nan
You could combine groupby, transform, and mean. Note that I've modified your example, because otherwise both Sections would have the same mean value. Starting from
In [21]: df
Out[21]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 NaN
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 NaN
we can use
df["Price"] = (df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))
to produce
In [23]: df
Out[23]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
This works because we can compute the mean by Section:
In [29]: df.groupby("Section")["Price"].mean()
Out[29]:
Section
1 4.0
2 7.5
Name: Price, dtype: float64
and broadcast this back up to a full Series, which we can pass to fillna(), using transform:
In [30]: df.groupby("Section")["Price"].transform("mean")
Out[30]:
0 4.0
1 4.0
2 7.5
3 4.0
4 7.5
5 7.5
Name: Price, dtype: float64
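A closely related one-liner (a minor variation, not from the original answer) does the fill inside transform itself:
# fill each group's NaNs with that group's own mean
df["Price"] = df.groupby("Section")["Price"].transform(lambda s: s.fillna(s.mean()))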
pandas: surgical but slower
Refer to #DSM's answer for a quicker pandas solution.
This is a more surgical approach that may provide some perspective, possibly useful.
use groupby
calculate our mean for each Section
means = df.groupby('Section').Price.mean()
identify nulls
use isnull for boolean slicing
nulls = df.Price.isnull()
use map
slice the Section column to limit to just those rows with null Price
fills = df.Section[nulls].map(means)
use loc
fill in the spots in df only where nulls are
df.loc[nulls, 'Price'] = fills
All together
means = df.groupby('Section').Price.mean()
nulls = df.Price.isnull()
fills = df.Section[nulls].map(means)
df.loc[nulls, 'Price'] = fills
print(df)
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
by "corresponding level" i am assuming you mean with equal section value.
if so, you can solve this by
for section_value in sorted(set(df.Section)):
df.loc[df['Section']==section_value, 'Price'] = df.loc[df['Section']==section_value, 'Price'].fillna(df.loc[df['Section']==section_value, 'Price'].mean())
hope it helps! peace

How to compare group sizes in pandas

Maybe I'm thinking of this in the wrong way, but I cannot think of an easy way to do this in pandas. I am trying to get a dataframe that is filtered by the relation between the count of values above a setpoint compared to those below it.
Contrived example: Let's say I have a dataset of people and their test scores over several tests:
Person | day | test score |
----------------------------
Bob 1 10
Bob 2 40
Bob 3 45
Mary 1 30
Mary 2 35
Mary 3 45
I want to filter this dataframe by the number of test scores >= 40 compared to the total, but for each person. Let's say I set the threshold to 50%. So Bob would have 2/3 of test scores, but Mary would have 1/3 and would be excluded.
My end goal would be to have a groupby object to do means/etc. on those that matched the threshold. So in this case it would look like this:
test score
Person | above_count | total | score mean |
-------------------------------------------
Bob 2 3 31.67
I have tried the following but couldn't figure out what to do with my groupby object.
df = pd.read_csv("all_data.csv")
gb = df.groupby('Person')
df2 = df[df['test_score'] >= 40]
gb2 = df2.groupby('Person')
# This would get me the count for each person but how to compare it?
gb.size()
import pandas as pd
df = pd.DataFrame({'Person': ['Bob'] * 3 + ['Mary'] * 4,
                   'day': [1, 2, 3, 1, 2, 3, 4],
                   'test_score': [10, 40, 45, 30, 35, 45, 55]})
>>> df
Person day test_score
0 Bob 1 10
1 Bob 2 40
2 Bob 3 45
3 Mary 1 30
4 Mary 2 35
5 Mary 3 45
6 Mary 4 55
In a groupby operation, you can pass several functions to apply to the same column via a dictionary.
import numpy as np

result = df.groupby('Person').test_score.agg(
    {'total': pd.Series.count,
     'test_score_above_mean': lambda s: s.ge(40).sum(),
     'score mean': np.mean})
>>> result
test_score_above_mean total score mean
Person
Bob 2 3 31.666667
Mary 2 4 41.250000
>>> result[result.test_score_above_mean.gt(result.total * .5)]
test_score_above_mean total score mean
Person
Bob 2 3 31.666667
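As an aside, renaming outputs by passing a dict to a single-column agg was deprecated and later removed from pandas; in modern versions (0.25+), named aggregation expresses the same thing (keys must be valid identifiers, hence score_mean). A sketch of the equivalent:
result = df.groupby('Person').agg(
    total=('test_score', 'count'),
    test_score_above_mean=('test_score', lambda s: s.ge(40).sum()),
    score_mean=('test_score', 'mean'))
result[result.test_score_above_mean.gt(result.total * 0.5)]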
Sum and mean can be done with .agg() on a groupby object, but the threshold function forces you to do a flexible apply.
Untested, but something like this should work:
df.groupby('Person')['test_score'].apply(lambda x: pd.Series(
    {'above_count': (x >= 40).sum(), 'total': x.size, 'score mean': x.mean()}))
You could replace the lambda with a more complicated regular function that implements all the criteria/functionality you want.
I think it might make sense to use groupby and aggregations to generate each of your columns as pd.Series, and then paste them together at the end.
df = pd.DataFrame([['Bob', 1, 10], ['Bob', 2, 40], ['Bob', 3, 45],
                   ['Mary', 1, 30], ['Mary', 2, 35], ['Mary', 3, 45]],
                  columns=['Person', 'day', 'test score'])
df_group = df.groupby('Person')
above_count = df_group.apply(lambda x: x[x['test score'] >= 40]['test score'].count())
above_count.name = 'test score above_count'
total_count = df_group['test score'].agg(np.size)
total_count.name = 'total'
test_mean = df_group['test score'].agg(np.mean)
test_mean.name = 'score mean'
results = pd.concat([above_count, total_count, test_mean], axis=1)
There is an easy way of doing this ...
import pandas as pd
import numpy as np
data = '''Bob 1 10
Bob 2 40
Bob 3 45
Mary 1 30
Mary 2 35
Mary 3 45'''
data = [d.split() for d in data.split('\n')]
data = pd.DataFrame(data, columns=['Name', 'day', 'score'])
data.score = data.score.astype(float)
data['pass'] = (data.score >=40)*1
data['total'] = 1
You add two columns to data for easy computation. The result should look like this ...
Name day score pass total
0 Bob 1 10 0 1
1 Bob 2 40 1 1
2 Bob 3 45 1 1
3 Mary 1 30 0 1
4 Mary 2 35 0 1
5 Mary 3 45 1 1
Now you summarize the data ...
summary = data.groupby('Name')[['score', 'pass', 'total']].sum().reset_index()
summary['mean score'] = summary['score'] / summary['total']
summary['pass ratio'] = summary['pass'] / summary['total']
print(summary)
The result looks like this ...
Name score pass total mean score pass ratio
0 Bob 95 2 3 31.666667 0.666667
1 Mary 110 1 3 36.666667 0.333333
Now, you can always filter out the names based on pass ratio ...
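For example, assuming the 50% threshold from the question:
# keep only names whose share of passing scores meets the threshold
print(summary[summary['pass ratio'] >= 0.5])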
