I have the following dataframe:
df:
Wins Ratio
id
234 10 None
143 32 None
678 2 None
I'm running a model to find the Ratio for each id.
The model's output is in another dataframe that looks like this:
result:
143
Wins 32
Ratio 98
However, I'm struggling to update df with the ratio. I'm looking for a function that simply updates df for id 143. I tried to use pd.DataFrame.update(), but it seems it doesn't work that way (or I was unable to make it work). Can someone help with that?
Where:
df
Outputs:
Wins Ratio
id
234 10 None
143 32 None
678 2 None
And:
result
Outputs:
143
Wins 32
Ratio 98
You can update df using combine_first:
import numpy as np
df.replace('None', np.nan).combine_first(result.T)
Output:
Wins Ratio
143 32 98.0
234 10 NaN
678 2 NaN
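For reference, pd.DataFrame.update (which the question mentions) can also work here, as long as result is transposed so its index and columns line up with df. A minimal sketch, reconstructing the frames from the post:

```python
import pandas as pd

# Frames reconstructed from the question (values assumed from the post)
df = pd.DataFrame({'Wins': [10, 32, 2], 'Ratio': [None, None, None]},
                  index=pd.Index([234, 143, 678], name='id'))
result = pd.DataFrame({143: [32, 98]}, index=['Wins', 'Ratio'])

# update() aligns on index and columns, so transpose result first;
# it modifies df in place and only touches matching labels
df.update(result.T)
```

After this, only the row for id 143 has its Ratio filled in; the other rows are left untouched.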
I have three different DataFrames (df2019, df2020, and df2021) and they all have the same columns (here are a few) with some overlapping 'BrandID':
BrandID StockedOutDays Profit SalesQuantity
243 01-02760 120 516452.76 64476
138 01-01737 96 603900.0 80520
166 01-02018 125 306796.8 52896
141 01-01770 109 297258.6 39372
965 02-35464 128 214039.2 24240
385 01-03857 92 326255.16 30954
242 01-02757 73 393866.4 67908
What I'm trying to do is add the value from one column for a specific BrandID from each of the 3 DataFrames. In my specific case, I'd like to add the value of 'SalesQuantity' for 'BrandID' = 01-02757 from df2019, df2020 and df2021, and get a line I can run to see a single number.
I've searched around and tried a bunch of different things, but am stuck. Please help, thank you!
EDIT *** I'm looking for something like this I think, I just don't know how to sum them all together:
df2021.set_index('BrandID',inplace=True)
df2020.set_index('BrandID',inplace=True)
df2019.set_index('BrandID',inplace=True)
(df2021.loc['01-02757']['SalesQuantity'] + df2020.loc['01-02757']['SalesQuantity']
 + df2019.loc['01-02757']['SalesQuantity'])
import pandas as pd
df2019 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":120, "Profit":516452.76, "SalesQuantity":64476},
{"BrandID":"01-01737", "StockedOutDays":96, "Profit":603900.0, "SalesQuantity":80520}])
df2020 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":123, "Profit":76481.76, "SalesQuantity":2457},
{"BrandID":"01-01737", "StockedOutDays":27, "Profit":203014.0, "SalesQuantity":15648}])
df2019["year"] = 2019
df2020["year"] = 2020
df = pd.concat([df2019, df2020])
df_sum = df.groupby("BrandID").agg("sum").drop("year",axis=1)
print(df)
print(df_sum)
df:
BrandID StockedOutDays Profit SalesQuantity year
0 01-02760 120 516452.76 64476 2019
1 01-01737 96 603900.00 80520 2019
0 01-02760 123 76481.76 2457 2020
1 01-01737 27 203014.00 15648 2020
df_sum:
StockedOutDays Profit SalesQuantity
BrandID
01-01737 123 806914.00 96168
01-02760 243 592934.52 66933
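To then read off a single number for one brand (the question's goal), you can filter before summing instead of looking it up in df_sum. A sketch reusing the same two toy frames from the answer above:

```python
import pandas as pd

df2019 = pd.DataFrame([{"BrandID": "01-02760", "SalesQuantity": 64476},
                       {"BrandID": "01-01737", "SalesQuantity": 80520}])
df2020 = pd.DataFrame([{"BrandID": "01-02760", "SalesQuantity": 2457},
                       {"BrandID": "01-01737", "SalesQuantity": 15648}])

# Stack the yearly frames, then sum one column for one brand
df = pd.concat([df2019, df2020])
total = df.loc[df["BrandID"] == "01-02760", "SalesQuantity"].sum()
```

This gives a single scalar per brand without needing to set_index on each frame first.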
I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The method that comes closest to mind for me is calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler and more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Only 2 top users generate 23.3% of total revenue.
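To actually select those top users rather than just inspect the table, filter on the cumulative share. A sketch with the same flattened numbers as above (threshold of 0.25 chosen to capture the first ~23.3%):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942, 498, 2345, 239, 4985, 947],
                   'revenue': [21, 20, 23, 23, 28, 22, 20, 24, 21, 25]})

# Sort descending, accumulate, and keep users within the cumulative cutoff
df = df.sort_values(by='revenue', ascending=False)
df['%revenue_cum'] = df['revenue'].cumsum() / df['revenue'].sum()
top_users = df[df['%revenue_cum'] <= 0.25]
```

The same pattern scales to the 300k-row case; only the threshold changes.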
This seems to be a case for df.quantile; per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')
This returns the per-column values at the 0.8 and 1.0 quantiles:
user_id revenue
0.8 2873 489
1.0 8942 28934
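Note that DataFrame.quantile computes each column's quantile independently, so the user_id shown above is not the id of the top-revenue row. To keep whole rows, compute the cutoff on revenue alone and filter. A sketch with the same five sample rows:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})

# Cutoff at the 0.8 revenue quantile, then keep rows at or above it
cutoff = df['revenue'].quantile(0.8, interpolation='nearest')
top = df[df['revenue'] >= cutoff]
```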
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd
def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output
n_percent_revenue_generating_users(df, 'revenue', 20)
In the titanic dataset, I wish to calculate the percentage of passengers who survived for each passenger class (Pclass) 1, 2 & 3. I figured out how to get the count of passengers and the number of passengers who survived using groupby as below:
train[['PassengerId','Pclass','Survived']]\
.groupby('Pclass')\
.agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
SurvivedPassengerCount=pd.NamedAgg(column='Survived',aggfunc='sum'))
So, I get the below output:
PassengerCount SurvivedPassengerCount
Pclass
1 216 136
2 184 87
3 491 119
But how do I get a percentage column? I want the output as below:
PassengerCount SurvivedPassengerCount PercSurvived
Pclass
1 216 136 62.9%
2 184 87 47.3%
3 491 119 24.2%
Thanks in advance!
Since you only need to divide SurvivedPassengerCount by PassengerCount, you can do this using the .assign method:
result = train[['PassengerId','Pclass','Survived']]\
    .groupby('Pclass')\
    .agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
         SurvivedPassengerCount=pd.NamedAgg(column='Survived', aggfunc='sum'))
result = result.assign(PercSurvived=100*result['SurvivedPassengerCount']/result['PassengerCount'])
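End to end, the same pattern looks like this on a tiny made-up sample (not the real titanic data, just four hypothetical passengers):

```python
import pandas as pd

# Hypothetical mini-sample standing in for the titanic 'train' frame
train = pd.DataFrame({'PassengerId': [1, 2, 3, 4],
                      'Pclass': [1, 1, 2, 2],
                      'Survived': [1, 0, 1, 1]})

result = train[['PassengerId', 'Pclass', 'Survived']]\
    .groupby('Pclass')\
    .agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
         SurvivedPassengerCount=pd.NamedAgg(column='Survived', aggfunc='sum'))
# Divide survivors by total passengers per class, scaled to percent
result = result.assign(
    PercSurvived=100 * result['SurvivedPassengerCount'] / result['PassengerCount'])
```

To get the "62.9%" display style, round and append a percent sign, e.g. result['PercSurvived'].round(1).astype(str) + '%'.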
I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df) as shown below.
There are several cell-number-based files with distance values, shown in matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df will have the data for all rows of the current cell# that we are iterating for in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell# based data is
use an apply function to go row by row
then use a join based on column B in inp_df with matrix_df, where the matrix df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a pandas-idiomatic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since for each cell the number of columns in matrix_df varies.
If it's any help, the matrix files are the distance-based outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the value of column B is unique and values of column A may or may not be unique
Also, matrix_df's first column header was empty, and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
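An alternative that avoids the row-wise apply entirely is to melt each matrix into long form (one row per B/column pair) and merge on both A and B. A sketch with the frames from the question, assuming the wide column headers are the A values read in as strings:

```python
import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})
m1 = pd.DataFrame({'B': [200, 270, 266], '100': [7.5, 6.8, 58.0],
                   '115': [80.7, 53.0, 84.0], '199': [67.8, 92.0, 31.0],
                   'avg_distance': [52.0, 50.0, 57.0]})
m2 = pd.DataFrame({'B': [255], '145': [74.9], '121': [77.53], '166': [8.0],
                   'avg_distance': [53.47]})

def to_long(matrix_df):
    # One row per (B, column) pair; the column names become A values
    long = matrix_df.melt(id_vars=['B', 'avg_distance'],
                          var_name='A', value_name='distance')
    long['A'] = long['A'].astype(int)
    return long

# Works regardless of how many A columns each cell file has
out_df = inp_df.merge(pd.concat([to_long(m1), to_long(m2)]),
                      on=['A', 'B'], how='left')
```

Because the merge key includes A, a varying number of matrix columns per cell is handled naturally, and the left merge preserves inp_df's row order.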
Afternoon,
I am trying to recreate a table but replacing the raw numbers with percentages of the total column. For instance, I have:
Code 03/31/2016 12/31/2015 09/30/2015
F55 425 387 369
F554 109 106 106
F508 105 105 106
the desired output is a new dataframe, with the numbers replaced by their percentage of the column total (e.g. for 03/31/2016 the total is 425+109+105):
Code 03/31/2016 12/31/2015 09/30/2015
F55 66.5% 64.7% 63.5%
F554 17% 17.7% 18.2%
F508 16.4% 17.5% 18.2%
Thanks for your help.
I'm sure there's a more elegant answer somewhere but this will work:
df['03/31/2016'].apply(lambda x : x/df['03/31/2016'].sum())
or if you want to do this for the entire dataframe:
df.apply(lambda x : x/x.sum(), axis=0)
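Equivalently, df.div broadcasts the column sums in one step, and rounding gets close to the desired display. A sketch with the question's numbers:

```python
import pandas as pd

df = pd.DataFrame({'03/31/2016': [425, 109, 105],
                   '12/31/2015': [387, 106, 105],
                   '09/30/2015': [369, 106, 106]},
                  index=['F55', 'F554', 'F508'])

# Divide every column by its own sum, then scale to percent
pct = df.div(df.sum(axis=0), axis=1).mul(100).round(1)
```

For the literal "66.5%" strings, append a percent sign afterwards, e.g. pct.astype(str) + '%'.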