I can't seem to get this right... here's what I'm trying to do:
import pandas as pd
df = pd.DataFrame({
'item_id': [1,1,3,3,3],
'contributor_id': [1,2,1,4,5],
'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
'metric_1': [80, 90, 100, 92, 50],
'metric_2': [180, 190, 200, 192, 150]
})
--->
item_id contributor_id contributor_role metric_1 metric_2
0 1 1 sing 80 180
1 1 2 laugh 90 190
2 3 1 laugh 100 200
3 3 4 sing 92 192
4 3 5 sing 50 150
And I want to reshape it into:
item_id SING_1_contributor_id SING_1_metric_1 SING_1_metric_2 SING_2_contributor_id SING_2_metric_1 SING_2_metric_2 ... LAUGH_1_contributor_id LAUGH_1_metric_1 LAUGH_1_metric_2 ... <LAUGH_2_...>
0 1 1 80 180 N/A N/A N/A ... 2 90 190 ... N/A..
1 3 4 92 192 5 50 150 ... 1 100 200 ... N/A..
Basically, for each item_id, I want to collect all relevant data into a single row. Each item can have multiple types of contributors, and there is a maximum per type (e.g. max SING contributors = A per item, max LAUGH contributors = B per item). There is a set of metrics tied to each contributor (but for the same contributor, the values can differ across items / contributor types).
I can probably achieve this through some seemingly inefficient methods (e.g. looping and matching, then populating a template df), but I was wondering whether there is a more efficient way, potentially by cleverly specifying the index / values / columns in a pivot operation (or any other method).
Thanks in advance for any suggestions!
EDIT:
Ended up adapting Ben's script below into the following:
# Number each contributor within (item_id, contributor_role), starting at 1
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
# Suffix the role with its counter, e.g. 'sing_1', 'sing_2', 'laugh_1'
df['contributor_role'] = df.apply(lambda row: row['contributor_role'] + '_' + row['role_count'], axis=1)
# Pivot to one row per item_id, with one column block per suffixed role
df = df.set_index(['item_id','contributor_role']).unstack()
# Flatten the MultiIndex columns, e.g. ('contributor_id', 'sing_1') -> 'contributor_id_sing_1'
df.columns = ['_'.join(x) for x in df.columns.values]
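As a side note, the row-wise apply in the second line isn't strictly needed; the same suffixed key can be built with vectorized string concatenation, a minimal sketch assuming the same column names:

df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df['contributor_role'] + '_' + df['role_count']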
You can create the additional key with cumcount, then unstack:
# Number each contributor within item_id to create a second key
df['newkey'] = df.groupby('item_id').cumcount().add(1).astype(str)
df['contributor_id'] = df['contributor_id'].astype(str)
# Pivot to one row per item_id and flatten the column MultiIndex
s = df.set_index(['item_id','newkey']).unstack().sort_index(level=1, axis=1)
s.columns = s.columns.map('_'.join)
s
Out[38]:
contributor_id_1 contributor_role_1 ... metric_1_3 metric_2_3
item_id ...
1 1 sing ... NaN NaN
3 1 messaround ... 50.0 150.0
I have a csv dataset with a column named "Types of Incidents" and another column named "Number of units".
Using Python and pandas, I am trying to find the average of "Number of units" for the rows where the incident type is 111 (which occurs multiple times).
I have tried searching through various pandas methods but couldn't work out how to do this on a huge dataset.
Here is the question:
What is the ratio of the average number of units that arrive to a scene of an incident classified as '111 - Building fire' to the number that arrive for '651 - Smoke scare, odor of smoke'?
An alternative to ML-Nielsen's value-specific answer:
df.groupby('Types of Incidents')['Number of units'].mean()
This will provide the average Number of units for all Incident Types.
You can specify multiple columns as well if needed.
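For instance, a sketch with a hypothetical additional numeric column 'Response_time' (not present in the original data):

df.groupby('Types of Incidents')[['Number of units', 'Response_time']].mean()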
Reproducible Example:
import pandas as pd

data = {
    "Incident_Type": [111, 380, 390, 111, 651, 651],
    "Number_of_units": [50, 40, 45, 99, 12, 13]
}
data = pd.DataFrame(data)
data
Incident_Type Number_of_units
0 111 50
1 380 40
2 390 45
3 111 99
4 651 12
5 651 13
data.groupby('Incident_Type')['Number_of_units'].mean()
Incident_Type
111 74.5
380 40.0
390 45.0
651 12.5
Name: Number_of_units, dtype: float64
Now, if you wish to find the ratio of the units, you will need to store this result as a dataframe.
average_units = data.groupby('Incident_Type')['Number_of_units'].mean().to_frame()
average_units = average_units.reset_index()
average_units
Incident_Type Number_of_units
0 111 74.5
1 380 40.0
2 390 45.0
3 651 12.5
So we have our result stored in a dataframe called average_units.
incident1_units = average_units[average_units['Incident_Type']==111]['Number_of_units'].values[0]
incident2_units = average_units[average_units['Incident_Type']==651]['Number_of_units'].values[0]
incident1_units / incident2_units
5.96
If I understand correctly, you probably have to first select the right rows and then calculate the mean. Something like this:
df.loc[df['Types of Incidents']==111, 'Number of units'].mean()
This will give you the mean of Number of units where the condition df['Types of Incidents']==111 is true.
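To get the ratio asked about in the question directly, the same pattern can be applied to both incident types; a sketch, assuming the column holds the numeric codes 111 and 651 (if it holds the full strings like '111 - Building fire', adjust the conditions accordingly):

# Conditional means for each incident type, then their ratio
mean_111 = df.loc[df['Types of Incidents'] == 111, 'Number of units'].mean()
mean_651 = df.loc[df['Types of Incidents'] == 651, 'Number of units'].mean()
print(mean_111 / mean_651)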
I'm using Python 3.x and I have a pandas DataFrame df that looks like:
Name Maths Science Social studies
abc 80 70 90
cde 90 60 80
xyz 100 80 85
...
...
I would like to generate a pandas DataFrame that stores the student name, the highest marks, and the subject that contributed those marks. If the highest mark is 100, the next highest should be used instead. So my output DataFrame will look like:
Name Highest_Marks Subject_contributed_Max
abc 90 Social Studies
cde 90 Maths
xyz 85 Social Studies
Can you suggest how to do it?
You can use:
# Mask out perfect scores of 100 so they are ignored by max/idxmax
df2 = df.drop(columns='Name').mask(df.eq(100))
df['Highest_Marks'] = df2.max(axis=1)
df['Subject_contributed_Max'] = df2.idxmax(axis=1)
output:
Name Maths Science Social studies Highest_Marks Subject_contributed_Max
0 abc 80 70 90 90.0 Social studies
1 cde 90 60 80 90.0 Maths
2 xyz 100 80 85 85.0 Social studies
For efficiency, to avoid computing both the max and the idxmax reductions, you can compute only the idxmax and then use a lookup:
import numpy as np

s = (df
     .drop(columns='Name')
     .mask(df.eq(100))
     .idxmax(axis=1)
)
# factorize gives, per row, the position of its winning column, so a single
# NumPy lookup retrieves the value without a second max() pass
idx, cols = pd.factorize(s)
df['Highest_Marks'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
df['Subject_contributed_Max'] = s
This will work
df_melt = df.melt('Name')
df_melt = df_melt.loc[df_melt['value'] < 100]
df_melt['RN'] = df_melt.sort_values(['value'], ascending=False).groupby(['Name']).cumcount() + 1
df_melt.loc[df_melt['RN'] == 1].sort_values('Name')
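If you want the result to carry the requested column names, the filtered frame can be renamed afterwards; a sketch based on the melt output above (melt's default column names 'variable' and 'value'):

out = (df_melt.loc[df_melt['RN'] == 1]
       .sort_values('Name')
       .rename(columns={'variable': 'Subject_contributed_Max', 'value': 'Highest_Marks'})
       [['Name', 'Highest_Marks', 'Subject_contributed_Max']])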
This will do it:
df = (df.set_index('Name')
        .stack()
        .reset_index()
        .rename(columns={'level_1': 'Subject_contributed_Max', 0: 'Highest_Marks'})
        .sort_values(['Name', 'Highest_Marks']))
df = (df[df['Highest_Marks'] != 100]
      .groupby('Name')
      .last()
      .reset_index()[['Name', 'Highest_Marks', 'Subject_contributed_Max']])
Input:
Name Maths Science Social studies
0 abc 80 70 90
1 cde 90 60 80
2 xyz 100 80 85
Output:
Name Highest_Marks Subject_contributed_Max
0 abc 90 Social studies
1 cde 90 Maths
2 xyz 85 Social studies
UPDATE:
Here is a faster approach than my original answer above. It's similar to one of the suggestions in #mozway's answer, though it uses where() instead of mask() (and it also only returns the high marks, not the individual columns of marks).
df2 = df.set_index('Name')
df2 = df2.where(df2 < 100)
df2 = df2.assign(**{'Highest_Marks':df2.max(axis=1).astype(int),
'Subject_contributed_Max':df2.idxmax(axis=1)}).reset_index()[[
'Name', 'Highest_Marks', 'Subject_contributed_Max']]
Prompted by a comment from the OP, I ran some benchmarks on the answers by #mozway and myself (I also tried adding #ArchAngelPwn's answer, but it doesn't seem to give comparable output in its current form).
Here are the results for a 1000 row by 9000 column dataframe:
Timeit results:
foo_1 (orig) ran in 2.4102477666456252 seconds using 3 iterations
foo_2 (refined) ran in 2.256996333327455 seconds using 3 iterations
foo_3 (where) ran in 1.1588773333545153 seconds using 3 iterations
foo_4 (mozway mask) ran in 1.17148056665125 seconds using 3 iterations
foo_5 (mozway mask lookup) ran in 1.1049298333236948 seconds using 3 iterations
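The exact harness isn't shown above; for reference, here is a minimal stand-in sketch of how such a timeit comparison could be set up, with randomly generated test data roughly matching the 1000 x 9000 size and the where()-based variant as the measured function:

import timeit
import numpy as np
import pandas as pd

# Hypothetical test frame: 1000 students x 9000 mark columns with random marks 0-100
rng = np.random.default_rng(0)
marks = pd.DataFrame(rng.integers(0, 101, size=(1000, 9000)),
                     columns=[f'subj_{i}' for i in range(9000)])
marks.insert(0, 'Name', [f'student_{i}' for i in range(1000)])

def foo_where(df):
    # The where()-based variant from the update above
    d = df.set_index('Name')
    d = d.where(d < 100)
    return d.assign(Highest_Marks=d.max(axis=1),
                    Subject_contributed_Max=d.idxmax(axis=1)).reset_index()

seconds = timeit.timeit(lambda: foo_where(marks), number=3)
print(f"foo_where ran in {seconds} seconds using 3 iterations")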
I have two dataframes as follows
transactions
buy_date buy_price
0 2018-04-16 33.23
1 2018-05-09 33.51
2 2018-07-03 32.74
3 2018-08-02 33.68
4 2019-04-03 33.58
and
cii
from_fy to_fy score
0 2001-04-01 2002-03-31 100
1 2002-04-01 2003-03-31 105
2 2003-04-01 2004-03-31 109
3 2004-04-01 2005-03-31 113
4 2005-04-01 2006-03-31 117
In the transactions dataframe I need to create a new column cii_score based on the following condition:
if transactions['buy_date'] is between cii['from_fy'] and cii['to_fy'], take the cii['score'] value for transactions['cii_score'].
I have tried a list comprehension but it didn't work.
I'd appreciate any input on how to tackle this.
First, we set up your dfs. Note that I modified the dates in transactions in this short example to make it more interesting.
import pandas as pd
from io import StringIO
trans_data = StringIO(
"""
,buy_date,buy_price
0,2001-04-16,33.23
1,2001-05-09,33.51
2,2002-07-03,32.74
3,2003-08-02,33.68
4,2003-04-03,33.58
"""
)
cii_data = StringIO(
"""
,from_fy,to_fy,score
0,2001-04-01,2002-03-31,100
1,2002-04-01,2003-03-31,105
2,2003-04-01,2004-03-31,109
3,2004-04-01,2005-03-31,113
4,2005-04-01,2006-03-31,117
"""
)
tr_df = pd.read_csv(trans_data, index_col = 0)
tr_df['buy_date'] = pd.to_datetime(tr_df['buy_date'])
cii_df = pd.read_csv(cii_data, index_col = 0)
cii_df['from_fy'] = pd.to_datetime(cii_df['from_fy'])
cii_df['to_fy'] = pd.to_datetime(cii_df['to_fy'])
The main step is the following calculation: for each row of tr_df, find the index of the row in cii_df that satisfies the condition. The list comprehension below computes this match; each element of the list is the appropriate row index of cii_df:
match = [ [(f<=d) & (d<=e) for f,e in zip(cii_df['from_fy'],cii_df['to_fy']) ].index(True) for d in tr_df['buy_date']]
match
produces
[0, 0, 1, 2, 2]
Now we can merge on this (numpy is needed to turn the match positions into an array):
import numpy as np

tr_df.merge(cii_df, left_on = np.array(match), right_index = True)
so that we get
key_0 buy_date buy_price from_fy to_fy score
0 0 2001-04-16 33.23 2001-04-01 2002-03-31 100
1 0 2001-05-09 33.51 2001-04-01 2002-03-31 100
2 1 2002-07-03 32.74 2002-04-01 2003-03-31 105
3 2 2003-08-02 33.68 2003-04-01 2004-03-31 109
4 2 2003-04-03 33.58 2003-04-01 2004-03-31 109
The score column is what you asked for.
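As an alternative to the explicit list comprehension, the range lookup can also be done with an IntervalIndex; a sketch reusing tr_df and cii_df from the setup above, under the assumption that the fiscal-year intervals do not overlap:

# Build one interval per fiscal year and locate each buy_date within it
intervals = pd.IntervalIndex.from_arrays(cii_df['from_fy'], cii_df['to_fy'], closed='both')
pos = intervals.get_indexer(tr_df['buy_date'])  # row position in cii_df, or -1 if no interval matches
tr_df['cii_score'] = cii_df['score'].to_numpy()[pos]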
I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows that account for the top 20% of the revenue (hence giving the top 20% revenue-generating users).
The method that comes to mind is calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have the dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to illustrate the idea.
Now calculate step by step:
df = pd.read_clipboard()
# Sort from highest to lowest revenue
df = df.sort_values(by = 'revenue', ascending = False)
# Running total of revenue and its share of the overall total
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
The top 2 users alone generate 23.3% of the total revenue.
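To actually pull out the rows needed to cover the first 20% of total revenue, you can then filter on the cumulative share column created above, for example:

# Keep rows until the cumulative share first reaches 20% (including the row that crosses it)
top_20 = df[df['%revenue_cum'].shift(fill_value=0) < 0.20]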
This looks like a case for df.quantile; per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')
This would print the top 2 rows in value:
user_id revenue
0.8 2873 489
1.0 8942 28934
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted, with a clear indication of the top contributing rows, and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the users who cumulatively generate the top 20% of revenue. Here is a function that will get you the expected output and more. Just specify your dataframe, the name of the revenue column, and the n_percent you are looking for:
import pandas as pd
def n_percent_revenue_generating_users(df, col, n_percent):
    # Sort by revenue and compute the cumulative share of the total (in percent)
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
    # Find the row whose cumulative share is closest to n_percent from above
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    # Return every user at or above that revenue threshold, dropping the helper columns
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output
n_percent_revenue_generating_users(df, 'revenue', 20)
I want to have an extra column with the maximum relative difference (dimensionless) between the row values and the mean of those rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to get the maximum values from the tuple and then divide them by the mean in the proper Pythonic way.
I feel like the whole calculation can be:
s = df.filter(like='y')
s.sub(s.mean(axis=1), axis=0).abs().max(axis=1) / s.mean(axis=1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
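To attach the result as the column the question asks for, the same expression can simply be assigned back to df, for example:

# Keep only the year columns, then compute the max absolute deviation from the row mean, relative to that mean
s = df.filter(like='y')
df['max_rel_dif'] = s.sub(s.mean(axis=1), axis=0).abs().max(axis=1) / s.mean(axis=1)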