Adding Subtotals to Pandas Groupby - python

I am looking for a cleaner way to add subtotals to Pandas groupby.
Here is my DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 50),
    'Sub-Category': np.random.choice(['X', 'Y'], 50),
    'Product': np.random.choice(['Product 1', 'Product 2'], 50),
    'Units_Sold': np.random.randint(1, 100, size=50),
    'Dollars_Sold': np.random.randint(100, 1000, size=50),
    'Date': np.random.choice(pd.date_range('1/1/2011', '03/31/2011',
                                           freq='D'), 50, replace=False)})
From there, I create a new Groupby Dataframe as such:
# pd.TimeGrouper was removed from pandas; pd.Grouper(key=..., freq=...) is the replacement
df1 = (df.groupby(['Category', 'Sub-Category', 'Product', pd.Grouper(key='Date', freq='M')])
         .agg({'Units_Sold': 'sum', 'Dollars_Sold': 'sum'})
         .unstack()
         .fillna(0))
I would like to provide sub-totals for both Category & Sub-Category. I can do this using this code:
df2 = df1.groupby(level=[0, 1]).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0),
                                       df2.index.get_level_values(1) + ' Total',
                                       len(df2) * ['']])
df3 = df1.groupby(level=[0]).sum()
df3.index = pd.MultiIndex.from_arrays([df3.index.get_level_values(0) + ' Total',
                                       len(df3) * [''],
                                       len(df3) * ['']])
pd.concat([df1, df2, df3]).sort_index()
This gives me the DataFrame I want:
[Image: Final DataFrame]
My question - is there a more pythonic way to do this than to have to create a new DataFrame for each level then concat together? I have researched this, but can not find a better way. I have to do this for many different MultiIndex dataframes & am seeking a better solution.
Thanks in advance for your help!
EDIT WITH ADDITIONAL INFORMATION:
Thank you to both @Wen and @DaFanat for their replies. I tried the approach from the link @Wen provided (Python (Pandas) Add subtotal on each lvl of multiindex dataframe) on my data:
pd.concat([df.assign(\
**{x: 'Total' for x in "CategorySub-CategoryProduct"[i:]}\
).groupby(list('abc')).sum() for i in range(1,4)])\
.sort_index()
This sums the totals, but it ignores the dates that make up the second level of the columns. It leaves me with this outcome:
[Image: resulting DataFrame]
I've tried to add in a TimeGrouper with the groupby, but that returns an error. Any help would be greatly appreciated. Thanks!

I can get you a lot closer by aligning your attempt above with the example from @piRSquared.
The list must match the MultiIndex. Try this instead:
iList = ['Category', 'Sub-Category', 'Product']
pd.concat([
    df1.assign(
        **{x: '' for x in iList[i:]}
    ).groupby(iList).sum() for i in range(1, 4)
]).sort_index()
It doesn't present the word "Total" in the right place, nor are the totals at the bottom of each group, but at least it's more-or-less functionally correct. My totals won't match because the values in the DataFrame are random.
It took me a while to work through the original answer provided in Python (Pandas) Add subtotal on each lvl of multiindex dataframe. But the same logic applies here.
The assign() replaces the values in the columns with what is in the dict that is returned by the dict comprehension executed over the elements of the list of MultiIndex columns.
Then groupby() only finds unique values for those non-blanked-out columns and sums them accordingly.
These groupbys are enclosed in a list comprehension, so pd.concat() then just combines these sets of rows.
And sort_index() puts the index labels in ascending order.
(Yes, you still get a warning about "both a column name and an index level," but it still works.)
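For readers on a recent pandas release, here is a minimal sketch of the same blank-out-and-regroup idea applied to the flat DataFrame instead of df1, which sidesteps the "both a column name and an index level" ambiguity. The helper name with_subtotals, the seeded generator, and the reduced column set are assumptions made for illustration; the Date level is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Small reproducible stand-in for the question's data (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Category": rng.choice(["Group A", "Group B"], 50),
    "Sub-Category": rng.choice(["X", "Y"], 50),
    "Product": rng.choice(["Product 1", "Product 2"], 50),
    "Units_Sold": rng.integers(1, 100, 50),
    "Dollars_Sold": rng.integers(100, 1000, 50),
})

keys = ["Category", "Sub-Category", "Product"]
value_cols = ["Units_Sold", "Dollars_Sold"]


def with_subtotals(frame, keys, value_cols):
    """Hypothetical helper: detail rows plus one subtotal row per outer level."""
    parts = []
    for i in range(1, len(keys) + 1):
        # Blank out the keys below level i so those rows collapse into one group.
        blanked = frame.assign(**{k: "" for k in keys[i:]})
        parts.append(blanked.groupby(keys)[value_cols].sum())
    # Blank labels sort first, so each subtotal lands above its detail rows.
    return pd.concat(parts).sort_index()


result = with_subtotals(df, keys, value_cols)
print(result)
```

Rows whose Sub-Category and Product are blank hold the Category totals, and rows with only Product blank hold the Sub-Category totals.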

Related

Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if(df_stations['code'][j] == df['start_station_code'][i]):
            df['start_station'][i] = df_stations['name'][j]
        if(df_stations['code'][j] == df['end_station_code'][i]):
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is equivalent to a JOIN (an INNER JOIN by default; pass how="left" for a LEFT JOIN):
cols = ["code", "name"]
result = (
    second_df
    .merge(first_df[cols], left_on="start_station_code", right_on="code")
    .merge(first_df[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by @Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes, df and df_stations, the code needed to produce required_df is:
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=['code_x', 'code_y'])
)
As you may notice, the merge means that the dataframe acquires duplicate 'code' columns, which get suffixed automatically; this is a built-in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
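As a hedged alternative to the double merge, a lookup Series can be mapped onto each code column. This keeps every row of df even when a code has no match, whereas the merges above default to an inner join and would drop such rows. The sample values are invented for illustration, and the sketch assumes station codes are unique in df_stations.

```python
import pandas as pd

# Made-up sample data using the column names from the question.
df_stations = pd.DataFrame({
    "code": [1, 2, 3],
    "name": ["Alpha", "Bravo", "Charlie"],
})
df = pd.DataFrame({
    "start_station_code": [1, 3, 2],
    "end_station_code": [2, 2, 9],  # 9 has no matching station
})

# Series indexed by station code; Series.map looks each code up in this index.
code_to_name = df_stations.set_index("code")["name"]

df["start_station"] = df["start_station_code"].map(code_to_name)
df["end_station"] = df["end_station_code"].map(code_to_name)  # unmatched codes become NaN
print(df)
```

Both versions replace the nested Python loops with vectorised lookups, which is where the speed-up comes from.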

Extracting top-N occurrences in a grouped dataframe using pandas

I've been trying to find the three most frequent restaurant names under each type of restaurant.
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by="url", ascending=False).head(3))
             ['url'].reset_index().rename(columns={'url': 'count'}))
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to group by rest_type again for the datas variable after grouping on it earlier? Should it not give a missing-column error? The second groupby operation is a bit confusing to me.
What does the first generated column, level_0, signify? I tried the code with as_index=True and it created both an index and a column for rest_type, so I couldn't reset the index. Output below:
Thank you
You can group by rest_type a second time because it is a level of the index of df_1, and groupby accepts index level names as keys.
level_0 comes from the reset_index command, because your index has an unnamed level.
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })

def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
edit: here is an alternative to format the result as a dataframe with informative column names
(df.groupby('rest_type')
   .apply(lambda x: x['name'].value_counts().nlargest(3))
   .reset_index().rename(columns={'name': 'counts', 'level_1': 'name'})
)
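The column names that value_counts() and reset_index() produce can differ between pandas versions, so the rename above may not line up everywhere. A sketch that avoids relying on those generated names is to count with size() and then keep the first three rows per group. The sample data below is invented for illustration.

```python
import pandas as pd

# Made-up data: "Cafe" has four distinct names, "Bar" has two.
df = pd.DataFrame({
    "rest_type": ["Cafe"] * 10 + ["Bar"] * 3,
    "name": ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"] + ["X"] + ["Y"] * 2,
    "url": range(13),
})

# One row per (rest_type, name) pair with an explicitly named count column.
counts = (
    df.groupby(["rest_type", "name"])
      .size()
      .reset_index(name="count")
)

# Sort so the most frequent names come first within each type, then keep three.
top3 = (
    counts.sort_values(["rest_type", "count"], ascending=[True, False])
          .groupby("rest_type")
          .head(3)
          .reset_index(drop=True)
)
print(top3)
```

Because the count column is named explicitly in reset_index(name="count"), no rename step is needed afterwards.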
I have a similar case where the above query only partially works: in my case the cooccurrence value always comes out as 1.
Here is my input data frame.
And my query is below:
top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)
In the result I get, the cooccurrence is always 1.

What is the most efficient way to compare two DataFrame columns and match the similar rows based on a function?

Assume we have two DataFrames, each containing a column of somewhat similar string-based values. What is the most efficient and/or effective way to match the rows with similar values, based on a comparison function such as textdistance's implementation of Jaro-Winkler?
Example DataFrames:
first_df = pd.DataFrame( ['Cars and cats', 'Spaceship', 'Captain Marvel', 'Dune','Bucks in 6'], columns=['Title'])
second_df = pd.DataFrame( ['Captain Harlock', 'Cats and dogs', 'Buccuneers', 'Dune buggy','Milwaukee Bucks'], columns=['Title'])
What I'm thinking is:
1. Create a Cartesian product based on each DataFrame's column of interest.
2. Apply the comparison function and store the result in a new column. Let's call it similarity_score.
3. Sort the new DataFrame from best value to worst (depending on the algorithm).
4. Drop the duplicates of the column we are mostly interested in.
Implementation:
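import textdistance

# Both example frames name their column 'Title', so rename before the cross join
comparison_df = (first_df.rename(columns={'Title': 'First DataFrame Titles'})
                 .merge(second_df.rename(columns={'Title': 'Second DataFrame Titles'}), how='cross'))
comparison_df['similarity_score'] = comparison_df.apply(lambda row: textdistance.jaro_winkler.normalized_similarity(row['First DataFrame Titles'], row['Second DataFrame Titles']), axis=1)
display(comparison_df)
comparison_df = comparison_df.sort_values('similarity_score', ascending=False).drop_duplicates(subset=['First DataFrame Titles'], keep='first')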
Any suggestions are welcome. Thank you in advance.
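The four steps above can be sketched end to end as follows. To keep the example self-contained, it swaps in difflib.SequenceMatcher from the standard library in place of textdistance's Jaro-Winkler, so the scores will differ from the ones the question would produce. The column names first_title and second_title and the similarity helper are assumptions for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

first_df = pd.DataFrame(
    {"first_title": ["Cars and cats", "Spaceship", "Captain Marvel", "Dune", "Bucks in 6"]}
)
second_df = pd.DataFrame(
    {"second_title": ["Captain Harlock", "Cats and dogs", "Buccuneers", "Dune buggy", "Milwaukee Bucks"]}
)


def similarity(a, b):
    """Stand-in comparison function: ratio in [0, 1], higher means more similar."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Step 1: Cartesian product of the two title columns.
pairs = first_df.merge(second_df, how="cross")

# Step 2: score every pair; a list comprehension avoids row-wise DataFrame.apply.
pairs["similarity_score"] = [
    similarity(a, b) for a, b in zip(pairs["first_title"], pairs["second_title"])
]

# Steps 3 and 4: keep the best-scoring candidate for each title in first_df.
best = pairs.loc[pairs.groupby("first_title")["similarity_score"].idxmax()]
best = best.reset_index(drop=True)
print(best)
```

The cross join grows as the product of the two lengths, so for large inputs some form of pre-filtering before scoring is usually worth considering.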

Python dataframe groupby by dictionary list then sum

I have two dataframes. The first, named mergedcsv, is of the format:
[Image: mergedcsv dataframe]
The second dataframe, named idgrp_df, is in a dictionary-like format that holds, for each region ID, a list of corresponding string IDs.
[Image: idgrp_df dataframe, keys with lists]
For each row in mergedcsv (and the corresponding row in idgrp_df) I wish to select the columns of mergedcsv whose labels appear in the list in idgrp_df for that row, then sum those values and add the result to a column in mergedcsv. The function will iterate through all rows in mergedcsv (582 rows x 600 columns).
My attempt at this is:
mergedcsv['TotRegFlows'] = mergedcsv.groupby([idgrp_df],as_index=False).numbers.apply(lambda x: x.iat[0].sum())
It returns a ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
This relates to the input dataframe for the groupby. How can I access the list for each row as the input for the groupby?
So for example, for the first row in mergedcsv I wish to select the columns with labels F95RR04, F95RR06 and F95RR15 (reading from the list in the first row of idgrp_df). Sum the values in these columns for that row and insert the sum value into TotRegFlows column.
Any ideas as to how I can utilize the list would be very much appreciated.
Edits:
Many thanks IanS. Your solution is useful. After modifying the line of code based on this advice, I realised that (as suggested) the indexes of the two dataframes are out of sync. I checked the indices: mergedcsv had None and idgrp_df has the 'REG_ID' column as its index, so I set the index of mergedcsv to 'REG_ID' as well. I then realised that mergedcsv has 582 rows (REG_ID is not unique) while idgrp_df has 220 rows (REG_ID is unique). I therefore think I am missing a groupby based on the REG_ID index in mergedcsv.
I have modified the code as follows:
mergedcsv.set_index('REG_ID', inplace=True)
print mergedcsv.index.name
print idgrp_df.index.name
mergedcsvgroup = mergedcsv.groupby('REG_ID')[mergedcsv.columns].apply(lambda y: y.tolist())
mergedcsvgroup['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
This gives a KeyError: 'REG_ID'.
Any further recommendations are most welcome. Would it be more efficient to combine the groupby and apply into one line?
I am new to working with pandas and trying to build experience in python
Further amendments:
Without an index for mergedcsv:
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID').sum(), axis=1)
this throws a KeyError: (the label[0] is not in the [index], u 'occurred at index 0')
With an index for mergedcsv:
mergedcsv.set_index('REG_ID', inplace=True)
columnlist = list(mergedcsv.columns.values)
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID')[columnlist].transform().sum(), axis=1)
this throws a TypeError: ("unhashable type:'list'", u'occurred at index 7')
Or finally separating the groupby function:
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID')
mergedcsv['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum())
This throws a TypeError: unhashable type list. The axis=1 argument is also not available with groupby apply.
Any ideas how I can use the lists with the apply function? I've explored tuples in the apply code but have not had any success.
Any suggestions much appreciated.
If I understand correctly, I have a simple solution with apply:
Setup
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]})
lists = pd.Series([['A', 'B'], ['A', 'C'], ['C']])
Solution
I apply a lambda function that gets the list of columns to be summed from the lists series:
df.apply(lambda row: row[lists[row.name]].sum(), axis=1)
The trick is that, when iterating over rows (axis=1), row.name is the original index of the dataframe df. I use that to access the list from the lists series.
Notes
This solution assumes that both dataframes share the same index, which appears not to be the case in the screenshots you included. You have to address that.
Also, if idgrp_df is a dataframe and not a series, then you need to access its values with .loc.
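When the two objects do not share an index, as in the question where REG_ID repeats in mergedcsv but is unique in idgrp_df, one option is to look the list up by the row's REG_ID value instead of by row.name. The sketch below uses invented column labels and values, and represents idgrp_df as a Series named id_lists; both are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for mergedcsv: REG_ID repeats, flow columns are numeric.
mergedcsv = pd.DataFrame({
    "REG_ID": ["R1", "R1", "R2"],
    "F1": [1, 2, 3],
    "F2": [4, 5, 6],
    "F3": [7, 8, 9],
})

# Made-up stand-in for idgrp_df: one list of column labels per unique REG_ID.
id_lists = pd.Series({"R1": ["F1", "F2"], "R2": ["F3"]})

# For each row, fetch the labels for its REG_ID and sum only those columns.
mergedcsv["TotRegFlows"] = mergedcsv.apply(
    lambda row: row[id_lists[row["REG_ID"]]].sum(), axis=1
)
print(mergedcsv)
```

No groupby is needed here, because the lookup is keyed on the REG_ID value and works the same whether or not that value repeats.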

pandas appending df1 to df2 get 0s/NaNs in result

I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row - specialSauce
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns but NaN for the df1 (and vice versa if df1/df2 switched in append)
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with map() and doStuff, while df1 does not have that column. When appending one pd.DataFrame to the other, the result will have all columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(); both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
columns=["winnings", "returns",
"spent", "runs",
"wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
winnings returns spent runs wins expected specialSauce
0 0.166085 0.781964 0.852285 -0.707071 -0.931657 0.886661 -0.752067
1 -0.055704 1.163688 0.079710 0.155916 -1.212917 -0.045265 -1.102266
2 -0.554241 1.928014 0.271214 -0.462848 0.452802 1.692924 1.682878
3 0.627985 3.047389 -1.594841 -1.099262 -0.308115 4.356977 2.949601
4 0.796156 3.228755 -0.273482 -0.661442 -0.111355 2.827409 2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame that is being appended to. You would have wanted to use pd.concat() with axis=1.
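To make the difference concrete, here is a small sketch contrasting row-wise and column-wise concatenation. It uses pd.concat throughout, since DataFrame.append was removed in pandas 2.0. The do_stuff function and the sample numbers are placeholders, as the original doStuff was never shown.

```python
import pandas as pd

df1 = pd.DataFrame({
    "wins": [1, 2, 3],
    "runs": [4, 5, 6],
    "expected": [0.5, 1.0, 1.5],
})


def do_stuff(w, r, e):
    """Placeholder for the question's undefined doStuff."""
    return w + r + e


df2 = pd.DataFrame({
    "specialSauce": [
        do_stuff(w, r, e)
        for w, r, e in zip(df1["wins"], df1["runs"], df1["expected"])
    ]
})

# Row-wise (what append did): rows are stacked, missing columns are filled with NaN.
stacked = pd.concat([df1, df2])

# Column-wise: rows are aligned on the index, so specialSauce sits beside df1.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```

The column-wise result only lines up because df1 and df2 share the same default integer index; with different indexes, the alignment would introduce NaN's again.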
