How to divide 2 aggregated columns using groupby in Pandas? - python

In the titanic dataset, I wish to calculate the percentage of passengers who survived with each of Passenger class (Pclass) 1,2 & 3. I figured out how to get the count of passengers and no. of passengers who survived using group by as below:
train[['PassengerId','Pclass','Survived']]\
.groupby('Pclass')\
.agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
SurvivedPassengerCount=pd.NamedAgg(column='Survived',aggfunc='sum'))
So, I get the below output:
PassengerCount SurvivedPassengerCount
Pclass
1 216 136
2 184 87
3 491 119
But how do I get a percentage column? I want the output as below:
PassengerCount SurvivedPassengerCount PercSurvived
Pclass
1 216 136 62.9%
2 184 87 47.3%
3 491 119 24.2%
Thanks in advance!

Since you only need to divide SurvivedPassengerCount by PassengerCount, you can do this using the .assign method:
result = train[['PassengerId','Pclass','Survived']]\
.groupby('Pclass')\
.agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
SurvivedPassengerCount=pd.NamedAgg(column='Survived',aggfunc='sum'))\
result = result.assign(PercSurvived=df['PassengerCount']/df['SurvivedPassengerCount'])

Related

Adding a matching value from 3 different DataFrames, not the entire column Python

I have three different DateFrames (df2019, df2020, and df2021) and the all have the same columns(here are a few) with some overlapping 'BrandID':
BrandID StockedOutDays Profit SalesQuantity
243 01-02760 120 516452.76 64476
138 01-01737 96 603900.0 80520
166 01-02018 125 306796.8 52896
141 01-01770 109 297258.6 39372
965 02-35464 128 214039.2 24240
385 01-03857 92 326255.16 30954
242 01-02757 73 393866.4 67908
What I'm trying to do is add the value from one column for a specific BrandID from each of the 3 DataFrame's. In my specific case, I'd like to add the value of 'Sales Quantity' for 'BrandID' = 01-02757 from df2019, df2020 and df2021 and get a line I can run to see a single number.
I've searched around and tried a bunch of different things, but am stuck. Please help, thank you!
EDIT *** I'm looking for something like this I think, I just don't know how to sum them all together:
df2021.set_index('BrandID',inplace=True)
df2020.set_index('BrandID',inplace=True)
df2019.set_index('BrandID',inplace=True)
df2021.loc['01-02757']['SalesQuantity']+df2020.loc['01-02757']['SalesQuantity']+
df2019.loc['01-02757']['SalesQuantity']
import pandas as pd
df2019 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":120, "Profit":516452.76, "SalesQuantity":64476},
{"BrandID":"01-01737", "StockedOutDays":96, "Profit":603900.0, "SalesQuantity":80520}])
df2020 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":123, "Profit":76481.76, "SalesQuantity":2457},
{"BrandID":"01-01737", "StockedOutDays":27, "Profit":203014.0, "SalesQuantity":15648}])
df2019["year"] = 2019
df2020["year"] = 2020
df = pd.DataFrame.append(df2019, df2020)
df_sum = df.groupby("BrandID").agg("sum").drop("year",axis=1)
print(df)
print(df_sum)
df:
BrandID StockedOutDays Profit SalesQuantity year
0 01-02760 120 516452.76 64476 2019
1 01-01737 96 603900.00 80520 2019
0 01-02760 123 76481.76 2457 2020
1 01-01737 27 203014.00 15648 2020
df_sum:
StockedOutDays Profit SalesQuantity
BrandID
01-01737 123 806914.00 96168
01-02760 243 592934.52 66933

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The methods that come closest to mind for me is calculating the total number of users, working out 20% of this ,sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler and elegant way.
Can anybody propose a way for this?
Thank you!
Suppose You have dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flatten revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Only 2 top users generate 23.3% of total revenue.
This seems to be the case for df.quantile, from pandas documentation if you are looking for the top 20% all you need to do is pass the correct quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')
This would print the top 2 rows in value:
user_id revenue
0.8 2873 489
1.0 8942 28934
I usually find useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd
def n_percent_revenue_generating_users(df, col, n_percent):
df.sort_values(by=[col], ascending=False, inplace=True)
df[f'{col}_cs'] = df[col].cumsum()
df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
df_ = df[df[f'{col}_csp'] > n_percent]
index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
threshold_revenue = df_.loc[index_nearest, col]
output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
return output
n_percent_revenue_generating_users(df, 'revenue', 20)

Pandas: how to apply weight column to create a new dataframe with weighted data

I have this dataset from US Census Bureau with weighted data:
Weight Income ......
2 136 72000
5 18 18000
10 21 65000
11 12 57000
23 43 25700
The first person represents 136 people, the second 18 and so on. There are a lot of other columns and I need to do several charts and calculations. I will be too much work to apply the weight every time I need to do a chart, pivot table, etc.
Ideally, I would like to use this:
df2 = df.iloc [np.repeat (df.index.values, df.PERWT )]
To create an unweighted or flat dataframe.
This produces a new large (1.4GB) dataframe:
Weight Wage
0 136 72000
0 136 72000
0 136 72000
0 136 72000
0 136 72000
.....
The thing is that using all the columns of the dataset, my computer runs out of memory.
Any idea on how to use the weights to create a new weighted dataframe?
I've tied this:
df2 = df.sample(frac=1, weights=df['Weight'])
But it seems to produce the same data. Changing frac to 0.5 could be a solution, but I'll lose 50% of the information.
Thanks!

Pandas/Python - update dataframe

I have the following dataframe:
df:
Wins Ratio
id
234 10 None
143 32 None
678 2 None
I'm running a model to find out Ratio for each id.
My model is finding Ratio, it is in another data frame, that looks like this:
result:
143
Wins 32
Ratio 987
However, I'm struggling to update df with ratio. I'm looking for a function that simply updates df for the id 143. Tryed to use the pd.dataframe.update() but seems it doesn't work that way (or I was unable to make it work). Can someone help on that?
Where:
df
Outputs:
Wins Ratio
id
234 10 None
143 32 None
678 2 None
And:
result
Outputs:
143
Wins 32
Ratio 98
You can update df using combine_first:
df.replace('None',np.nan).combine_first(result.T)
Output:
Wins Ratio
143 32 98.0
234 10 NaN
678 2 NaN

pandas - percentage of matr

Afternoon,
I am trying to recreate a table but replacing the raw numbers with percentage of the total column. For instance, i have:
Code 03/31/2016 12/31/2015 09/30/2015
F55 425 387 369
F554 109 106 106
F508 105 105 106
the desired output is a new dataframe, with the numbers replaced by the percentage with the total being the sum of the column (03/31/2016 = 425+109+105)
Code 03/31/2016 12/31/2015 09/30/2015
F55 66.5% 64.7% 63.5%
F554 17% 17.7% 18.2%
F508 16.4% 17.5% 18.2%
thanks for your help
I'm sure there's a more elegant answer somewhere but this will work:
df['03/31/2016'].apply(lambda x : x/df['03/31/2016'].sum())
or if you want to do this for the entire dataframe:
df.apply(lambda x : x/x.sum(), axis=0)

Categories

Resources