Code Optimization for groupby - python

I have the below code that basically performs a group by operation, followed by a sum.
grouped = df.groupby(by=['Cabin'], as_index=False)['Fare'].sum()
I then rename the columns
grouped.columns = ['Cabin', 'testCol']
And I then merge the "grouped" dataframe with my original dataframe to calculate the aggregate.
df2 = df.merge(grouped, on='Cabin')
This populates my initial dataframe with the 'testCol' column from my "grouped" dataframe.
Can this code be optimized to fit in one line or something similar?

It seems you need GroupBy.transform to create a new column of sums:
df['testCol'] = df.groupby('Cabin')['Fare'].transform('sum')
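A minimal sketch on made-up data (the Cabin/Fare values are invented; only the column names come from the question) showing that the merge version and the transform version agree:

import pandas as pd

df = pd.DataFrame({'Cabin': ['A', 'A', 'B'], 'Fare': [10.0, 20.0, 5.0]})

# merge-based version: three steps and an extra dataframe
grouped = df.groupby(by=['Cabin'], as_index=False)['Fare'].sum()
grouped.columns = ['Cabin', 'testCol']
df2 = df.merge(grouped, on='Cabin')

# transform-based version: one line, result is index-aligned with df
df['testCol'] = df.groupby('Cabin')['Fare'].transform('sum')

print(df2.equals(df))  # True - same result, without the intermediate merge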

Related

Return dataframe to original state from pivot state

I am currently trying to return a dataframe to its original state after performing some operations on the pivoted dataframe.
I basically have a dataframe which looks like this: [image of the original dataframe]
After transforming the dataframe using pivot and performing some operations on it, the dataframe looks like this: every row is represented by a date, every column is a unique combination of appkey+cc, and the value is the target.
Besides that, I have also added aggregations of the sum of target: total, which sums up the daily target, and appkey_total, which sums up the daily target but only for the appkey.
The idea is to return the pivoted table to its original state, plus the total and appkey_total as added columns.
My problem is that I don't keep appkey and cc as columns in the pivot table; I concatenate the appkey and cc, so I'm not sure how to get them back.
I can't melt it because I don't have the original column names.
Any help will be appreciated, thanks!
Edit
After trying what @jezrael suggested, I got the following output:
As can be seen, the appkey was added to the index, while the 3 unique appkey values stayed as column names.
Add total to the index to create a MultiIndex, then split the column names and reshape with DataFrame.stack:
df1 = df.set_index('total', append=True)
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.rename_axis(['cc','appkey'], axis=1).stack([0, 1]).reset_index()
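
A small runnable sketch of that recipe on an invented pivoted frame (the dates, cc/appkey names and values are made up for illustration):

import pandas as pd

# pivoted frame: one row per date, one column per '<cc>_<appkey>' pair, plus a daily total
df = pd.DataFrame({'us_app1': [1, 2], 'us_app2': [3, 4], 'il_app1': [5, 6]},
                  index=pd.Index(['2021-01-01', '2021-01-02'], name='date'))
df['total'] = df.sum(axis=1)

df1 = df.set_index('total', append=True)               # keep total next to the date index
df1.columns = df1.columns.str.split('_', expand=True)  # 'us_app1' -> ('us', 'app1')
df1 = df1.rename_axis(['cc', 'appkey'], axis=1).stack([0, 1]).reset_index(name='target')
print(df1)  # long format again: date, total, cc, appkey, target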

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

import pandas as pd

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com',
                  'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])
My data frame looks something like this:
[image of the data frame]
For the sake of simplicity, I omitted the other columns while making the example. What I need to do is group by the column called order_num, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: [image of expected output]
but I do not wish to group by the other columns. Is there any other alternative to performing the list operation? I tried using transform but it threw a size-mismatch error, and I don't think it's the right way to go either.
If there are a lot of other columns and you need to group by order_num only, use Series.map to fill a new column with lists, then remove duplicates with DataFrame.drop_duplicates by column order_num, and finally sort if necessary:
df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
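
Put together with the sample data from the question, this is a runnable sketch (the timestamp values below are invented, since the question's snippet omits that column):

# assume df is the frame built from the question's sample data above
df['timestamp'] = pd.to_datetime(['2021-01-01 10:00', '2021-01-01 09:00', '2021-01-01 11:00',
                                  '2021-01-01 12:00', '2021-01-01 13:00', '2021-01-01 14:00'])

df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
print(df[['order_num', 'email', 'product_code']])
# one row per order_num; product_code now holds the full list, e.g. [rdcf1, 34fgdf] for 123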

Is there any advice on how to tweak my code to return the correct table as a dataframe?

I am trying to write a function that takes a dataframe, groups it by a column, and then orders the groups from largest to smallest by the average of a second column. I am trying to return a dataframe. I am using both seaborn and pandas.
This is what I have so far:
def table(df, columnone, columntwo):
    dfnew = df.groupby([columnone])[columntwo].nlargest()
    return dfnew
I am not very sure what I am missing or what I should be looking for. I am pretty new to Python and any help would be appreciated.
I think you are looking for this:
def table(df, columnone, columntwo):
    return df.groupby([columnone])\
             .mean()\
             .sort_values(by=[columntwo], ascending=False)
Here groupby creates the groups, mean averages the values in the other columns, and sort_values sorts the dataframe that results from the groupby.
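
A quick usage sketch with invented data (the column names are placeholders):

import pandas as pd

df = pd.DataFrame({'category': ['a', 'b', 'a', 'b', 'c'],
                   'value': [1.0, 5.0, 3.0, 7.0, 4.0]})
print(table(df, 'category', 'value'))
#           value
# category
# b           6.0
# c           4.0
# a           2.0

Note that on recent pandas versions mean() raises on non-numeric columns, so you may need mean(numeric_only=True) or to select columntwo before aggregating.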

Finding first repeated consecutive entries in pandas dataframe

I have a dataframe with two columns, Stock and DueDate, where I need to select the first row of each run of repeated consecutive entries based on the Stock column.
df: [image of the input dataframe]
I am expecting output like below.
Expected output: [image of the expected output]
My Approach
The approach I tried is to first mark which rows repeat based on the Stock column by creating a new column repeated_yes, and then subset the first row of any run that repeats.
I used the code below to create the new column "repeated_yes":
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
so the new updated dataframe looks like this:
df_new: [image of the updated dataframe]
But I am stuck on subsetting only rows number 3 and 8 in order to attain the result. If there is any other effective approach, it would be helpful.
Edited:
Forgot to include part of the question: if there are any other rows below the last row in the dataframe df, it should not display any output for them.
Chain another mask, created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
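
A runnable sketch on made-up data (the Stock values and dates are invented):

import pandas as pd

df = pd.DataFrame({'Stock': ['A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'],
                   'DueDate': pd.date_range('2021-01-01', periods=8)})

ss = df.Stock.ne(df.Stock.shift())        # True at the start of each consecutive run
ss1 = ss.cumsum().duplicated(keep=False)  # True for runs longer than one row
print(df[ss & ss1])                       # keeps rows 0, 2 and 6: the first row of each repeated run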

How to do sorting after groupby and aggregation on a Pandas Dataframe

I have a Pandas Dataframe and I'm doing a groupby on two columns, with a couple of aggregate functions on a third column. Here is what my code looks like:
df2 = df[[X, Y, Z]].groupby([X, Y]).agg([np.mean, np.max, np.min]).reset_index()
It finds the aggregates of column Z.
I need to sort by, let's say, the min column (i.e. sort_values('min')), but it keeps complaining that the 'min' column does not exist. How can I do that?
Since agg is generating a pd.MultiIndex for the columns, you must use a tuple in sort_values.
Try:
df2.sort_values(('Z','amin'))
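
For example, with hypothetical X/Y/Z columns (note that the 'amin'/'amax' labels come from the numpy function names; depending on your pandas/numpy versions they may show up as 'min'/'max' instead):

import numpy as np
import pandas as pd

df = pd.DataFrame({'X': ['a', 'a', 'b', 'b'],
                   'Y': ['p', 'p', 'q', 'q'],
                   'Z': [4, 1, 3, 2]})

df2 = df[['X', 'Y', 'Z']].groupby(['X', 'Y']).agg([np.mean, np.max, np.min]).reset_index()
# df2.columns is now a MultiIndex: ('X', ''), ('Y', ''), ('Z', 'mean'), ('Z', 'amax'), ('Z', 'amin')
print(df2.sort_values(('Z', 'amin')))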
