Sort values intra group [duplicate] - python

This question already has an answer here:
Pandas groupby sort each group values and order dataframe groups based on max of each group
(1 answer)
Closed 1 year ago.
Suppose I have this dataframe:
import pandas as pd

df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
   price category
0      2   shirts
1     13    pants
2     24   shirts
3     15     tops
4     11      hat
5     44     tops
I want to sort values in such a way that:
1. Find the highest price in each category.
2. Sort the categories by their highest price (here, in descending order: tops, shirts, pants, hat).
3. Within each category, sort rows by price in descending order.
The final dataframe would look like:
   price category
0     44     tops
1     15     tops
2     24   shirts
3      2   shirts
4     13    pants
5     11      hat
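For reference, these three steps can also be done in a single chain (not from the original thread) by broadcasting each category's max with groupby().transform and sorting on it; a sketch using the df defined above:
out = (
    df.assign(max_cat_price=df.groupby('category')['price'].transform('max'))
      .sort_values(['max_cat_price', 'price'], ascending=False)
      .drop(columns='max_cat_price')
)
print(out)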

I'm not a big fan of one-liners, so here's my solution:
# Add a column with the max price for each category
df = df.merge(df.groupby('category')['price'].max().rename('max_cat_price'),
              left_on='category', right_index=True)
# Sort by each category's max price first, then by price within the category
df = df.sort_values(['max_cat_price', 'price'], ascending=False)
# Drop the column that has the max price for each category
df = df.drop('max_cat_price', axis=1)
print(df)
   price category
5     44     tops
3     15     tops
2     24   shirts
0      2   shirts
1     13    pants
4     11      hat

You can use .groupby and .sort_values:
df.join(df.groupby("category").agg("max"), on="category", rsuffix="_r").sort_values(
    ["price_r", "price"], ascending=False
)
Output
   price category  price_r
5     44     tops       44
3     15     tops       44
2     24   shirts       24
0      2   shirts       24
1     13    pants       13
4     11      hat       11
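If you don't want the helper price_r column in the result, you can chain .drop(columns='price_r') onto the expression above.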

I used get_group inside a DataFrame apply to get the max price for each category:
df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
grouped = df.groupby('category')
df['price_r'] = df['category'].apply(lambda cat: grouped.get_group(cat).price.max())
# Sort by each category's max price, then by price within the category
df = df.sort_values(['price_r', 'price'], ascending=False)
print(df)
Output
   price category  price_r
5     44     tops       44
3     15     tops       44
2     24   shirts       24
0      2   shirts       24
1     13    pants       13
4     11      hat       11
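As a side note (not part of the original answer), the row-wise apply re-scans its group for every row; a vectorized transform computes each group's max once and is the more idiomatic route:
df['price_r'] = df.groupby('category')['price'].transform('max')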

Related

Summing different dates and categories

So I have a pandas DataFrame that is grouped by date and a particular category, with the sum of another column. What I would like to do is take the number for a particular category on a particular day, add it to the next day's number, and keep carrying that running total forward. For example, say the category is apples, the date is 5-26-2021, and the cost is $5. The next day, 5-27-2021, is $6, so 5-27-2021 should show a cost of $11. Then 5-28-2021 has a cost of $3, which should be added to the $11, so it should show up as $14. How can I go about doing this? There are multiple categories besides just the apples. Thank you!
Expected Output:
(the output is not the most accurate and this data frame is not the most accurate so feel free to ask questions)
Use groupby then cumsum
data = [
    [2021, 'apple', 1],
    [2022, 'apple', 2],
    [2021, 'banana', 3],
    [2022, 'cherry', 4],
    [2022, 'banana', 5],
    [2023, 'cherry', 6],
]
columns = ['date', 'category', 'cost']
df = pd.DataFrame(data, columns=columns)
>>> df
   date category  cost
0  2021    apple     1
1  2022    apple     2
2  2021   banana     3
3  2022   cherry     4
4  2022   banana     5
5  2023   cherry     6
df.sort_values(['category','date'], inplace=True)
df.reset_index(drop=True, inplace=True)
df['CostCsum'] = df.groupby(['category'])['cost'].cumsum()
   date category  cost  CostCsum
0  2021    apple     1         1
1  2022    apple     2         3
2  2021   banana     3         3
3  2022   banana     5         8
4  2022   cherry     4         4
5  2023   cherry     6        10
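The question actually describes daily dates rather than years, but the same pattern applies once the column holds real datetimes. A sketch, assuming the dates arrive as strings like '5-26-2021':
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['category', 'date'])
df['CostCsum'] = df.groupby('category')['cost'].cumsum()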

Melt DataFrame with two value variables

I have a dataframe with inventory and purchases across multiple stores and regions. I am trying to stack the dataframe using melt, but I need to have two value columns, inventory and purchases, and can't figure out how to do that. The dataframe looks like this:
Region | Store | Inventory_Item_1 | Inventory_Item_2 | Purchase_Item_1 | Purchase_Item_2
-----------------------------------------------------------------------------------------
North     A            15                 20                  5                 6
North     B            20                 25                  7                 8
North     C            18                 22                  6                10
South     D            10                 15                  9                 7
South     E            12                 12                 10                 8
The format I am trying to get the dataframe into looks like this:
Region | Store | Item             | Inventory | Purchases
----------------------------------------------------------
North     A     Inventory_Item_1       15          5
North     A     Inventory_Item_2       20          6
North     B     Inventory_Item_1       20          7
North     B     Inventory_Item_2       25          8
North     C     Inventory_Item_1       18          6
North     C     Inventory_Item_2       22         10
South     D     Inventory_Item_1       10          9
South     D     Inventory_Item_2       15          7
South     E     Inventory_Item_1       12         10
South     E     Inventory_Item_2       12          8
This is what I have written, but I don't know how to create columns for Inventory and Purchases. Note that my full dataframe is considerably larger (50+ regions, 140+ stores, 15+ items).
df_1 = df.melt(id_vars = ['Store','Region'],value_vars = ['Inventory_Item_1','Inventory_Item_2'])
Any help or advice would be appreciated!
I would do these with hierarchical indexes on the rows and columns.
For the rows, you can set_index(['Region', 'Store']) easily enough.
You have to get a little tricksy for the columns though. Since you need access to the non-index columns that result from setting the index on Region and Store, you need to pipe it to a custom function that builds the desired tuples and creates a named multi-level column index.
After that, you can stack the columns into the row index and optionally reset the full row index to make everything a normal column again.
df = pd.DataFrame({
    'Region': ['North', 'North', 'North', 'South', 'South'],
    'Store': ['A', 'B', 'C', 'D', 'E'],
    'Inventory_Item_1': [15, 20, 18, 10, 12],
    'Inventory_Item_2': [20, 25, 22, 15, 12],
    'Purchase_Item_1': [5, 7, 6, 9, 10],
    'Purchase_Item_2': [6, 8, 10, 7, 8]
})
output = (
    df.set_index(['Region', 'Store'])
      .pipe(lambda df:
          df.set_axis(df.columns.str.split('_', n=1, expand=True), axis='columns')
      )
      .rename_axis(['Status', 'Product'], axis='columns')
      .stack(level='Product')
      .reset_index()
)
Which gives me:
Region Store Product  Inventory  Purchase
North      A  Item_1         15         5
North      A  Item_2         20         6
North      B  Item_1         20         7
North      B  Item_2         25         8
North      C  Item_1         18         6
North      C  Item_2         22        10
South      D  Item_1         10         9
South      D  Item_2         15         7
South      E  Item_1         12        10
South      E  Item_2         12         8
You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(
    index=["Region", "Store"],
    names_to=(".value", "item"),
    names_pattern=r"(Inventory|Purchase)_(.+)",
    sort_by_appearance=True,
)
  Region Store    item  Inventory  Purchase
0  North     A  Item_1         15         5
1  North     A  Item_2         20         6
2  North     B  Item_1         20         7
3  North     B  Item_2         25         8
4  North     C  Item_1         18         6
5  North     C  Item_2         22        10
6  South     D  Item_1         10         9
7  South     D  Item_2         15         7
8  South     E  Item_1         12        10
9  South     E  Item_2         12         8
It works by passing a regex containing groups to the names_pattern parameter. The '.value' in names_to ensures that Inventory and Purchase are kept as column headers, while the other group (Item_1 and Item_2) is collated into a new column, item.
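A similar reshape is also possible in plain pandas with pd.wide_to_long, treating Inventory and Purchase as stub names; a sketch under the same column-naming assumptions:
out = pd.wide_to_long(
    df,
    stubnames=['Inventory', 'Purchase'],
    i=['Region', 'Store'],
    j='Item',
    sep='_',
    suffix=r'.+',
).reset_index()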
You can get there with these steps:
# please always provide minimal working code - we as helpers and answerers
# otherwise have to invest extra time to generate working starter code,
# and that is unfair - we already spend enough time solving the problem:
df = pd.DataFrame([
    ["North", "A", 15, 20, 5, 6],
    ["North", "B", 20, 25, 7, 8],
    ["North", "C", 18, 22, 6, 10],
    ["South", "D", 10, 15, 9, 7],
    ["South", "E", 12, 12, 10, 8]],
    columns=["Region", "Store", "Inventory_Item_1", "Inventory_Item_2",
             "Purchase_Item_1", "Purchase_Item_2"])
# melt the dataframe completely first
df_final = pd.melt(df, id_vars=['Region', 'Store'],
                   value_vars=['Inventory_Item_1', 'Inventory_Item_2',
                               'Purchase_Item_1', 'Purchase_Item_2'])
# extract inventory and purchase sub data frames;
# they have the "variable" column (the item number!) in common,
# so make it look identical in both frames by stripping the prefixes
# (the .copy() calls avoid SettingWithCopyWarning on the later assignments)
df_inventory = df_final.loc[df_final.variable.str.startswith("Inventory"), :].copy()
df_inventory.variable = df_inventory.variable.str.replace("Inventory_", "", regex=False)
df_purchase = df_final.loc[df_final.variable.str.startswith("Purchase"), :].copy()
df_purchase.variable = df_purchase.variable.str.replace("Purchase_", "", regex=False)
# rename the columns to prepare for merging
df_inventory.columns = ["Region", "Store", "variable", "Inventory"]
df_purchase.columns = ["Region", "Store", "variable", "Purchase"]
# merge by the three common columns
df_final_1 = pd.merge(df_inventory, df_purchase, how="left",
                      on=["Region", "Store", "variable"])
# sort by the three common columns
df_final_1 = df_final_1.sort_values(by=["Region", "Store", "variable"])
This returns:
  Region Store variable  Inventory  Purchase
0  North     A   Item_1         15         5
5  North     A   Item_2         20         6
1  North     B   Item_1         20         7
6  North     B   Item_2         25         8
2  North     C   Item_1         18         6
7  North     C   Item_2         22        10
3  South     D   Item_1         10         9
8  South     D   Item_2         15         7
4  South     E   Item_1         12        10
9  South     E   Item_2         12         8

% Difference Pivot Table python

I have a sample dataframe/table as below and I would like to do a simple pivot table in Python to calculate the % difference from the previous year.
DataFrame
Year  Month  Count  Amount  Retailer
2019      5     10     100       ABC
2019      3      8      80       XYZ
2020      3      8      80       ABC
2020      5      7      70       XYZ
...
Expected Output
     MONTH   %Diff
ABC      7  -0.200
XYZ      8  -0.125
Thanks,
EDIT: To reiterate, I would like to create the expected output table shown above directly, not do a join of the two tables.
It looks like you need a groupby, not a pivot:
gdf = df.groupby('Retailer')[['Amount']].pct_change()
Then rename and merge with the original df:
df = gdf.rename(columns={'Amount': '%Diff'}).dropna().merge(df, how='left',
                                                            left_index=True, right_index=True)
   %Diff  Year  Month  Count  Amount Retailer
2 -0.200  2020      3      8      80      ABC
3 -0.125  2020      5      7      70      XYZ
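A more direct variant (not from the original answer) adds the column in place and then drops each retailer's first year, which has no previous value to compare against:
df['%Diff'] = df.groupby('Retailer')['Amount'].pct_change()
result = df.dropna(subset=['%Diff'])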

Set multiple columns to zero based on a value in another column [duplicate]

This question already has answers here:
Change one value based on another value in pandas
(7 answers)
Closed 2 years ago.
I have a sample dataset here. In the real case there are a train and a test dataset; both have around 300 columns and 800 rows. I want to find all rows with a certain value in one column, and then set all values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
        'Age': [27, 24, 22, 32, 66, 43],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment': [15, 20, 40, 50, 3, 23],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']}
df = pd.DataFrame(data)
df
Here is the output of the sample dataset:
     Name  Age    Address  Payment Qualification
0     Jai   27      Delhi       15           Msc
1  Princi   24     Kanpur       20            MA
2  Gaurav   22  Allahabad       40           MCA
3  Princi   32    Kannauj       50           Phd
4    Anuj   66     Katauj        3            MA
5   Nancy   43    vbinauj       23            MS
As you can see, the Name column contains the value "Princi". So if I find rows whose Name column value == "Princi", I want to set the "Address" and "Payment" columns in those rows to zero.
Here is the expected output:
     Name  Age    Address  Payment Qualification
0     Jai   27      Delhi       15           Msc
1  Princi   24          0        0            MA   # this row
2  Gaurav   22  Allahabad       40           MCA
3  Princi   32          0        0           Phd   # this row
4    Anuj   66     Katauj        3            MA
5   Nancy   43    vbinauj       23            MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # selects all those columns
and
train.loc[df['column_wanted'] == "that value"]  # selects all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor; df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
     Name  Age    Address  Payment Qualification
0     Jai   27      Delhi       15           Msc
1  Princi   24          0        0            MA
2  Gaurav   22  Allahabad       40           MCA
3  Princi   32          0        0           Phd
4    Anuj   66     Katauj        3            MA
5   Nancy   43    vbinauj       23            MS
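Applied to the asker's real dataset, the same pattern combines the row mask and the column slice in a single statement (using the column names from the question):
train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0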

Python - Performing Max Function on Multiple Groupby

I have a data frame below that shows the price of wood and steel from two different suppliers.
I would like to add a column that shows the highest price for the opposite item (i.e. if the line is wood, it would pull steel) from the same supplier.
For example, the "Steel" row for "Tom" would show his highest wood price, which is 42.
The code I have so far simply returns the highest price for the original item (i.e. not the opposite), so for Tom's steel row it returns 24 where I wanted 42.
I think this is an issue with how I am pulling the max value for a multi-column group. I have tried a number of different ways but just cannot seem to get it.
Any thoughts would be greatly appreciated.
import pandas as pd
import numpy as np

data = {'Supplier': ['Tom', 'Tom', 'Tom', 'Bill', 'Bill', 'Bill'],
        'Item': ['Wood', 'Wood', 'Steel', 'Steel', 'Steel', 'Wood'],
        'Price': [42, 33, 24, 16, 12, 18]}
df = pd.DataFrame(data)
df['Opp_Item'] = np.where(df['Item'] == "Wood", "Steel", "Wood")
df['Opp_Item_Max'] = df.groupby(['Supplier', 'Opp_Item'])['Price'].transform(max)
print(df)
  Supplier   Item  Price Opp_Item  Opp_Item_Max
0      Tom   Wood     42    Steel            42
1      Tom   Wood     33    Steel            42
2      Tom  Steel     24     Wood            24
3     Bill  Steel     16     Wood            16
4     Bill  Steel     12     Wood            16
5     Bill   Wood     18    Steel            18
If you can find the per supplier+item maximum, then you can just swap the values and assign them back through a join:
v = df.groupby(['Supplier', 'Item']).Price.max().unstack(-1)
# This reversal operation works under the assumption that
# there are only two items and that they are opposites of each other.
v[:] = v.values[:, ::-1]
df = (df.set_index(['Supplier', 'Item'])
        .join(v.stack().to_frame('Opp_Item_Max'), how='left')
        .reset_index())
print(df)
print(df)
  Supplier   Item  Price  Opp_Item_Max
0     Bill  Steel     16            18
1     Bill  Steel     12            18
2     Bill   Wood     18            16
3      Tom  Steel     24            42
4      Tom   Wood     42            24
5      Tom   Wood     33            24
Note: Ordering of your data will not be preserved after the join.
You could map to the opposite values before a groupby, and then merge this back to the original DataFrame.
d = {'Steel': 'Wood', 'Wood': 'Steel'}
df.merge(df.assign(Item=df.Item.map(d))
           .groupby(['Supplier', 'Item'], as_index=False).max(),
         on=['Supplier', 'Item'],
         how='left',
         suffixes=['', '_Opp_Item'])
  Supplier   Item  Price  Price_Opp_Item
0      Tom   Wood     42              24
1      Tom   Wood     33              24
2      Tom  Steel     24              42
3     Bill  Steel     16              18
4     Bill  Steel     12              18
5     Bill   Wood     18              16
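For completeness, here is a sketch (not from the thread) of how the asker's own Opp_Item column could be used: compute the per supplier-and-item max separately, relabel it as a lookup keyed on the opposite item, and merge it back, so each row picks up the other item's max instead of its own group's:
# df here has the Supplier, Item, Price and Opp_Item columns built in the question
m = (df.groupby(['Supplier', 'Item'], as_index=False)['Price'].max()
       .rename(columns={'Item': 'Opp_Item', 'Price': 'Opp_Item_Max'}))
df = df.drop(columns='Opp_Item_Max', errors='ignore').merge(m, on=['Supplier', 'Opp_Item'], how='left')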
