Pandas - Grouping and summing rows using a certain cell identifier - python

What I'm looking to do is take a table similar to the one below, identify all rows where Type = 'Fee', and add the Total of each such row to the row where the other columns match (so take the Total from the Fee rows, find the row where the Week, Store, and ID match, and add the Total to that row). I should note that the matching row where Week, Store, and ID match and Type is NOT 'Fee' will be unique (there is only one of them), but there may be multiple Fee rows that we want to group into it. As a single-row example, the third row in the table below has the following:
Week = 15
Store = US1
ID = T3400
Total = 13
What I want to do is find the row that matches those criteria and add the sum to it; in this case, that would be row 1.
Within this data there will be multiple Type = 'Fee' rows that I want to collapse into that one row; the thing I am struggling to do is keep the non-Fee Type unchanged while doing so.
I've given what the expected output would be below. In the expected output:
Row 1 Total = 1098 = 200 (starting) + 13 (row 3 from input) + 885 (row 8 from input)
Row 2 Total = 287 = 189 (starting) + 98 (row 5 from input)
Row 3 Total = 15 (Did not change from input as there were no Fee where the ID matched)
Row 4 Total = 581 = 146 (starting) + 435 (row 6 from input)
Row 5 Total = 189 (Did not change because even though the Store and ID matches, it is from a different week)
As you can see, it finds the rows with Fee, matches on the other three columns, and sums the Total, so that no rows with Type = 'Fee' remain in the dataset. Obviously this is only a small snippet of the data; in total it will have about 20,000 rows to go through.
Input:

Week  Store  Type  ID     Total
15    US1    RE-G  T3400  200
15    US4    TO    T656   189
15    US1    Fee   T3400  13
16    US4    RD    T173   15
15    US4    Fee   T656   98
16    US4    Fee   T1121  435
17    US4    TO    T656   189
15    US1    Fee   T3400  885
16    US4    MX    T1121  146
Expected output:

Week  Store  Type  ID     Total
15    US1    RE-G  T3400  1098
15    US4    TO    T656   287
16    US4    RD    T173   15
16    US4    MX    T1121  581
17    US4    TO    T656   189

It looks like you want to groupby Week, Store, and ID and take the sum of Total. You can also use first on Type, after replacing Fee with nulls, to get the correct type.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Week': [15, 15, 15, 16, 15, 16, 17, 15, 16],
                   'Store': ['US1', 'US4', 'US1', 'US4', 'US4', 'US4', 'US4', 'US1', 'US4'],
                   'Type': ['RE-G', 'TO', 'Fee', 'RD', 'Fee', 'Fee', 'TO', 'Fee', 'MX'],
                   'ID': ['T3400', 'T656', 'T3400', 'T173', 'T656', 'T1121', 'T656', 'T3400', 'T1121'],
                   'Total': [200, 189, 13, 15, 98, 435, 189, 885, 146]})

df['Type'].replace('Fee', np.nan, inplace=True)
df = df.groupby(['Week', 'Store', 'ID'], as_index=False).agg({'Type': 'first', 'Total': 'sum'})
print(df)
Output
Week Store ID Type Total
0 15 US1 T3400 RE-G 1098
1 15 US4 T656 TO 287
2 16 US4 T1121 MX 581
3 16 US4 T173 RD 15
4 17 US4 T656 TO 189
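If you would rather not modify df in place, a minimal equivalent sketch (starting again from the original df built above, before the replace and groupby) masks the 'Fee' values on the fly:

out = (df.assign(Type=df['Type'].mask(df['Type'].eq('Fee')))   # 'Fee' -> NaN, original df untouched
         .groupby(['Week', 'Store', 'ID'], as_index=False)
         .agg({'Type': 'first', 'Total': 'sum'}))
print(out)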

Related

Values summed in a previous group won't sum in a new group

When I group by 'time_interval_code', the 'vehicles_real' values in the result file are correct only for the first group, not for the others. Once a 'time_interval_code' has been used in a previous group, it seems it is no longer included in the sum of the new group. How can I make sure the 'vehicles_real' values are available to sum in every group?
The idea of 'time_interval_code' was to get rid of the time format. I have 8 time intervals in the morning (07:00 - 07:15 is 1, 07:15 - 07:30 is 2, etc., up to 8).
I want to check the maximum flow rate in an hour, shifting by 15 minutes each time, for every junction and every direction from which cars enter the junction. The measurements are given in 15-minute intervals, so I need to look at 4 intervals at a time. The result should be 'junction_id', 'source_direction' and the sum of 'vehicles_real' for that junction, direction and group of 'time_interval_code'.
To solve this I created groups that each contain 4 time intervals. The problem I have is that when I group by 'time_interval_code', the 'vehicles_real' values in the result file are correct only for the first group (1, 2, 3, 4), but not for the others.
import pandas as pd

data = pd.read_excel("traffic.xlsx")
# Create a DataFrame from the list of data
df = pd.DataFrame(data)

# Define a function to get the morning group for each time interval code
def get_morning_group(time_interval_code):
    morning_groups = [(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6), (4, 5, 6, 7), (5, 6, 7, 8)]
    for group in morning_groups:
        if time_interval_code in group:
            return group

# Add a new column to the DataFrame that contains the morning group for each time interval code
df['morning_groups'] = df['time_interval_code'].apply(get_morning_group)
# Group data by values
grouped_data = df.groupby(['junction_id', 'source_direction', 'morning_groups'])
# Calculate the sum of the vehicles_real values for each group
grouped_data = grouped_data['vehicles_real'].sum()
# Convert the grouped data back into a DataFrame
df = grouped_data.reset_index()
# Create the pivot table
pivot_table = df.pivot_table(index=['junction_id', 'source_direction'], columns=['morning_groups'], values='vehicles_real')
# Save the pivot table to a new Excel file
pivot_table.to_excel('max_flow_rate.xlsx')
The traffic.xlsx file has ca. 140k records. Every junction has at least 2 'source_direction' values, and each junction/'source_direction' combination has 'vehicles_real' values for every 'time_interval_code'. The file looks like this:
id  time_interval_code  junction_id  source_direction  vehicles_real
1   3                   1001         N                 140
2   1                   2002         E                 10
18  2                   2011         W                 41
21  5                   2030         S                 2
33  8                   2030         N                 140
35  7                   2150         E                 10
41  6                   2150         W                 41
52  5                   2150         S                 2
The format of the output I get is fine, but the values are correct only for (1, 2, 3, 4).
junction_id  source_direction  (1,2,3,4)  (2,3,4,5)  (3,4,5,6)  (4,5,6,7)  (5,6,7,8)
1001         N                 257        95         69         61         59
1001         S                 456        120        136        153        111
1002         N                 2597       676        670        619        645
1002         S                 2571       552        641        656        595
1003         N                 586        181        148        127        142
1003         S                 711        174        147        157        141
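Note that get_morning_group as written returns only the first window that contains a given code, so each interval is counted toward a single window. A sketch of one possible fix (same column names as above, not tested against the real file) is to map each code to all windows that contain it and then explode, so each row is counted once per window:

windows = [(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6), (4, 5, 6, 7), (5, 6, 7, 8)]

# Map every time_interval_code to *all* windows that contain it (a list per row) ...
df['morning_groups'] = df['time_interval_code'].apply(lambda c: [w for w in windows if c in w])
# ... then duplicate each row once per window so every window sees its value
df = df.explode('morning_groups')

result = (df.groupby(['junction_id', 'source_direction', 'morning_groups'])['vehicles_real']
            .sum()
            .reset_index())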

Pandas unique values in two columns?

I am very new to pandas. I have two dataframes related to a two-player game.
DF1: matches # match information
match_num winner_id loser_id points
270 201504 201595 28
271 201514 201426 19
272 201697 211901 21
273 201620 211539 30
274 214981 203564 10
For match_num 270, both players (201504, the winner, and 201595, the loser) get 28 points each.
I need to find out which player(s) got the highest number of points overall.
I am using a hashmap to solve this problem:
from collections import defaultdict

hmap = defaultdict(int)
for index, row in matches_df.iterrows():
    hmap[row["winner_id"]] += row["points"]
    hmap[row["loser_id"]] += row["points"]
max_key = max(hmap, key=hmap.get)
Can this be solved in a pandas (SQL-like) way?
Use melt to stack the two id columns, then groupby:
(df[['winner_id','loser_id','points']]
.melt('points', value_name='id')
.groupby('id')['points'].sum()
)
Output:
id
201426.0 19
201504.0 28
201514.0 19
201595.0 28
201620.0 30
201697.0 21
203564.0 10
211539.0 30
211901.0 21
214981.0 10
Name: points, dtype: int64
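If only the top scorer(s) are needed, a small follow-up on that result (a sketch) keeps the maximum, including ties:

totals = (df[['winner_id', 'loser_id', 'points']]
          .melt('points', value_name='id')
          .groupby('id')['points'].sum())
top_players = totals[totals == totals.max()]   # here both 201620 and 211539, with 30 points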

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue-generating users).
The method that comes closest to mind for me is calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler and more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have the dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Just the top 2 users generate 23.3% of total revenue.
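To actually keep only the rows that make up, say, the first 20% of revenue, one can then filter on the cumulative column built above (a small follow-up sketch; with this flat example it keeps just the single largest user):

top_20_pct = df[df['%revenue_cum'] <= 0.20]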
This seems to be a case for df.quantile. Going by the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')
This would print the top 2 rows in value:
user_id revenue
0.8 2873 489
1.0 8942 28934
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue-generating users. Here is a function that will help you get the expected output and more. Just specify your dataframe, the column name of the revenue, and the n_percent you are looking for:
import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)

efficient way of sumproduct at row level based on column headers

I have a dataframe that looks somewhat like the one below (please note there are columns beyond COST and UNITS):
TIME COST1 UNITS1_1 COST2 UNITS2_1 .... COSTN UNITSN_1
21:55:51 25 100 20 50 .... 22 130
22:55:51 23 100 24 150 .... 22 230
21:58:51 28 100 22 250 .... 22 430
I am looking at computing a sumproduct (new column) for each row, such that (COST1*UNITS1_1) + (COST2*UNITS2_1) + ... + (COSTN*UNITSN_1) is computed and stored in this column.
Could you advise an efficient way to do this?
The approaches I am thinking of are looping through the column names based on a filter condition for the columns and/or using a lambda function to compute the necessary number.
Select columns by position, convert them to numpy arrays with DataFrame.to_numpy or DataFrame.values, multiply them, and finally sum:
#pandas 0.24+
df['new'] = (df.iloc[:, ::2].to_numpy() * df.iloc[:, 1::2].to_numpy()).sum(axis=1)
#pandas lower
#df['new'] = (df.iloc[:, ::2].values * df.iloc[:, 1::2].values).sum(axis=1)
Or use DataFrame.filter for select columns:
df['new'] = (df.filter(like='COST').to_numpy()*df.filter(like='UNITS').to_numpy()).sum(axis=1)
df['new'] = (df.filter(like='COST').values*df.filter(like='UNITS').values).sum(axis=1)
print (df)
COST1 UNITS1_1 COST2 UNITS2_1 COSTN UNITSN_1 new
TIME
21:55:51 25 100 20 50 22 130 6360
22:55:51 23 100 24 150 22 230 10960
21:58:51 28 100 22 250 22 430 17760
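For reference, a sketch that reproduces the example frame above (with TIME as the index, so the positional iloc slicing lines up with the alternating COST/UNITS columns) and makes the snippets runnable end to end:

import pandas as pd

# Hypothetical reconstruction of the sample data shown in the question
df = pd.DataFrame(
    {'COST1':    [25, 23, 28],
     'UNITS1_1': [100, 100, 100],
     'COST2':    [20, 24, 22],
     'UNITS2_1': [50, 150, 250],
     'COSTN':    [22, 22, 22],
     'UNITSN_1': [130, 230, 430]},
    index=pd.Index(['21:55:51', '22:55:51', '21:58:51'], name='TIME'))

df['new'] = (df.filter(like='COST').to_numpy() * df.filter(like='UNITS').to_numpy()).sum(axis=1)
print(df)   # 'new' comes out as 6360, 10960, 17760 for these rows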

calculate values between two pandas dataframe based on a column value

EDITED: let me copy the whole data set
df is the store sales/inventory data
branch daqu store store_name style color size stocked sold in_stock balance
0 huadong wenning C301 EE #��#��##�� EEBW52301M 39 160 7 4 3 -5
1 huadong wenning C301 EE #��#��##�� EEBW52301M 39 165 1 0 1 1
2 huadong wenning C301 EE #��#��##�� EEBW52301M 39 170 6 3 3 -3
dh is the transaction data (move 'amount' from store 'from' to store 'to')
branch daqu from to style color size amount box_sum
8 huadong shanghai C306 C30C EEOM52301M 59 160 1 162
18 huadong shanghai C306 C30C EEOM52301M 39 160 1 162
25 huadong shanghai C306 C30C EETJ52301M 52 160 9 162
26 huadong shanghai C306 C30C EETJ52301M 52 155 1 162
32 huadong shanghai C306 C30C EEOW52352M 19 160 2 162
What I want is the store inventory data after the transactions, in exactly the same format as df, but with only the 'in_stock' numbers changed from the original df according to the numbers in dh.
Below is what I tried:
df['full_code'] = df['store'] + df['style'] + df['color'].astype(str) + df['size'].astype(str)
dh['from_code'] = dh['from'] + dh['style'] + dh['color'].astype(str) + dh['size'].astype(str)
dh['to_code'] = dh['to'] + dh['style'] + dh['color'].astype(str) + dh['size'].astype(str)

# subtract from 'from' store
dh_from = pd.DataFrame(dh.groupby('from_code')['amount'].sum())
for code, stock in dh_from.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] - stock

# add to 'to' store
dh_to = pd.DataFrame(dh.groupby('to_code')['amount'].sum())
for code, stock in dh_to.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock

df.to_csv('d:/after_dh.csv')
But when I open the csv file, the 'in_stock' values for the rows where a transaction occurred are all blank.
I think df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock is the part with the problem. What's the correct way of updating the value?
ORIGINAL: I have two pandas dataframe: df1 is for the inventory, df2 is for the transaction
df1 look something like this:
full_code in_stock
1 AAA 200
2 BBB 150
3 CCC 150
df2 look something like this:
from to full_code amount
1 XX XY AAA 30
2 XX XZ AAA 35
3 ZY OI BBB 50
4 AQ TR AAA 15
What I want is the inventory after all transactions are done.
In this case,
full_code in_stock
1 AAA 120
2 BBB 100
3 CCC 150
Note that full_code is unique in df1, but not unique in df2.
Is there any pandas way of doing this? I got muddled between the original dataframe and a view of it, and eventually solved it by turning everything into numpy arrays and finding matching full_codes, but the resulting code is a mess and I wonder if there is a simpler way of doing this without converting everything to numpy arrays.
What I would do is set the index of df1 to the 'full_code' column and then call sub to subtract the other df.
What we pass for the values is the result of grouping on 'full_code' and calling sum on the 'amount' column.
An additional param for sub is fill_value; this is because product 'CCC' does not exist on the rhs, so we want this value to be preserved, otherwise it becomes NaN:
In [25]:
total = df1.set_index('full_code')['in_stock'].sub(df2.groupby('full_code')['amount'].sum(), fill_value=0)
total.reset_index()

Out[25]:
full_code in_stock
0 AAA 120
1 BBB 100
2 CCC 150
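For the edited version with both a 'from' and a 'to' store, the same alignment idea can be applied twice. A sketch, assuming df and dh already have the full_code / from_code / to_code columns built as in the question:

# Amount leaving each store-SKU and amount arriving at each store-SKU
out_amt = dh.groupby('from_code')['amount'].sum()
in_amt = dh.groupby('to_code')['amount'].sum()

# Net change per code; fill_value=0 keeps codes that only appear on one side
net_change = in_amt.sub(out_amt, fill_value=0)

# Map the net change onto each inventory row; codes with no transactions get 0
df['in_stock'] = df['in_stock'] + df['full_code'].map(net_change).fillna(0)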
