Melt DataFrame with two value variables - python

I have a dataframe with inventory and purchases across multiple stores and regions. I am trying to stack the dataframe using melt, but I need to have two value columns, inventory and purchases, and can't figure out how to do that. The dataframe looks like this:
Region | Store | Inventory_Item_1 | Inventory_Item_2 | Purchase_Item_1 | Purchase_Item_2
-----------------------------------------------------------------------------------------
North  |   A   |               15 |               20 |               5 |               6
North  |   B   |               20 |               25 |               7 |               8
North  |   C   |               18 |               22 |               6 |              10
South  |   D   |               10 |               15 |               9 |               7
South  |   E   |               12 |               12 |              10 |               8
The format I am trying to get the dataframe into looks like this:
Region | Store | Item             | Inventory | Purchases
----------------------------------------------------------
North  |   A   | Inventory_Item_1 |        15 |         5
North  |   A   | Inventory_Item_2 |        20 |         6
North  |   B   | Inventory_Item_1 |        20 |         7
North  |   B   | Inventory_Item_2 |        25 |         8
North  |   C   | Inventory_Item_1 |        18 |         6
North  |   C   | Inventory_Item_2 |        22 |        10
South  |   D   | Inventory_Item_1 |        10 |         9
South  |   D   | Inventory_Item_2 |        15 |         7
South  |   E   | Inventory_Item_1 |        12 |        10
South  |   E   | Inventory_Item_2 |        12 |         8
This is what I have written, but I don't know how to create columns for Inventory and Purchases. Note that my full dataframe is considerably larger (50+ regions, 140+ stores, 15+ items).
df_1 = df.melt(id_vars=['Store', 'Region'], value_vars=['Inventory_Item_1', 'Inventory_Item_2'])
Any help or advice would be appreciated!

I would do this with hierarchical indexes on the rows and columns.
For the rows, you can set_index(['Region', 'Store']) easily enough.
You have to get a little tricky for the columns, though. Since you need access to the non-index columns that result from setting the index on Region and Store, you need to pipe the frame to a custom function that builds the desired tuples and creates a named multi-level column index.
After that, you can stack the columns into the row index and optionally reset the full row index to make everything a normal column again.
df = pd.DataFrame({
    'Region': ['North', 'North', 'North', 'South', 'South'],
    'Store': ['A', 'B', 'C', 'D', 'E'],
    'Inventory_Item_1': [15, 20, 18, 10, 12],
    'Inventory_Item_2': [20, 25, 22, 15, 12],
    'Purchase_Item_1': [5, 7, 6, 9, 10],
    'Purchase_Item_2': [6, 8, 10, 7, 8]
})

output = (
    df.set_index(['Region', 'Store'])
      # split each remaining column name on the first '_' to build a
      # two-level column index: ('Inventory', 'Item_1'), ('Purchase', 'Item_2'), ...
      .pipe(lambda df:
          df.set_axis(df.columns.str.split('_', n=1, expand=True), axis='columns')
      )
      .rename_axis(['Status', 'Product'], axis='columns')
      # move the 'Product' level into the row index, leaving one column
      # per 'Status' value
      .stack(level='Product')
      .reset_index()
)
Which gives me:
  Region Store Product  Inventory  Purchase
0  North     A  Item_1         15         5
1  North     A  Item_2         20         6
2  North     B  Item_1         20         7
3  North     B  Item_2         25         8
4  North     C  Item_1         18         6
5  North     C  Item_2         22        10
6  South     D  Item_1         10         9
7  South     D  Item_2         15         7
8  South     E  Item_1         12        10
9  South     E  Item_2         12         8
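
As a side note, pandas also ships pd.wide_to_long, which handles exactly this stub-plus-suffix column naming; here is a minimal sketch against the same df, assuming the column layout above:

import pandas as pd

out = (
    pd.wide_to_long(
        df,
        stubnames=['Inventory', 'Purchase'],  # the value-column prefixes
        i=['Region', 'Store'],                # identifier columns
        j='Product',                          # name for the new suffix column
        sep='_',
        suffix=r'Item_\d+',                   # non-numeric suffix needs an explicit pattern
    )
    .reset_index()
)

This yields the same Region/Store/Product/Inventory/Purchase shape as the stack-based pipeline.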

You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(
    index=["Region", "Store"],
    names_to=(".value", "item"),
    names_pattern=r"(Inventory|Purchase)_(.+)",
    sort_by_appearance=True,
)
  Region Store    item  Inventory  Purchase
0  North     A  Item_1         15         5
1  North     A  Item_2         20         6
2  North     B  Item_1         20         7
3  North     B  Item_2         25         8
4  North     C  Item_1         18         6
5  North     C  Item_2         22        10
6  South     D  Item_1         10         9
7  South     D  Item_2         15         7
8  South     E  Item_1         12        10
9  South     E  Item_2         12         8
It works by passing a regex containing capture groups to the names_pattern parameter. The '.value' in names_to ensures that the first group's matches (Inventory and Purchase) are kept as column headers, while the second group's matches (Item_1 and Item_2) are collected into a new column, item.
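
If you want to sanity-check what the two capture groups match before running pivot_longer, a quick sketch with plain pandas (the column names here are just the ones from the question):

import pandas as pd

cols = pd.Series(['Inventory_Item_1', 'Purchase_Item_2'])
# group 0 becomes the '.value' headers, group 1 becomes the 'item' column
print(cols.str.extract(r'(Inventory|Purchase)_(.+)'))
#            0       1
# 0  Inventory  Item_1
# 1   Purchase  Item_2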

You can get there with these steps:
# please always provide minimal working code - we as helpers and answerers
# otherwise have to invest extra time to generate working starter code,
# and that is unfair - we already spend enough time solving the problem:
df = pd.DataFrame([
    ["North", "A", 15, 20, 5, 6],
    ["North", "B", 20, 25, 7, 8],
    ["North", "C", 18, 22, 6, 10],
    ["South", "D", 10, 15, 9, 7],
    ["South", "E", 12, 12, 10, 8]],
    columns=["Region", "Store", "Inventory_Item_1", "Inventory_Item_2",
             "Purchase_Item_1", "Purchase_Item_2"])

# melt the dataframe completely first
df_final = pd.melt(df, id_vars=['Region', 'Store'],
                   value_vars=['Inventory_Item_1', 'Inventory_Item_2',
                               'Purchase_Item_1', 'Purchase_Item_2'])

# extract the inventory and purchase sub data frames;
# they share the "variable" column (the item number!), so make it look
# identical in both frames by stripping the unnecessary prefixes
# (.copy() avoids SettingWithCopyWarning on the assignments below)
df_inventory = df_final.loc[df_final['variable'].str.startswith('Inventory')].copy()
df_inventory['variable'] = df_inventory['variable'].str.replace('Inventory_', '')
df_purchase = df_final.loc[df_final['variable'].str.startswith('Purchase')].copy()
df_purchase['variable'] = df_purchase['variable'].str.replace('Purchase_', '')

# copy the data frames (just to keep the old results inspectable)
df_purchase_ = df_purchase.copy()
df_inventory_ = df_inventory.copy()

# rename the columns to prepare for merging
df_inventory_.columns = ["Region", "Store", "variable", "Inventory"]
df_purchase_.columns = ["Region", "Store", "variable", "Purchase"]

# merge on the three common columns
df_final_1 = pd.merge(df_inventory_, df_purchase_, how="left",
                      on=["Region", "Store", "variable"])

# sort by the three common columns (sort_values returns a new frame,
# so the result has to be assigned back)
df_final_1 = df_final_1.sort_values(by=["Region", "Store", "variable"])
This returns
  Region Store variable  Inventory  Purchase
0  North     A   Item_1         15         5
5  North     A   Item_2         20         6
1  North     B   Item_1         20         7
6  North     B   Item_2         25         8
2  North     C   Item_1         18         6
7  North     C   Item_2         22        10
3  South     D   Item_1         10         9
8  South     D   Item_2         15         7
4  South     E   Item_1         12        10
9  South     E   Item_2         12         8
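
For what it's worth, the same split-and-recombine can be spelled more compactly with the .str accessor and a single pivot; a sketch assuming the df defined above:

long = df.melt(id_vars=['Region', 'Store'])
# split 'Inventory_Item_1' into 'Inventory' and 'Item_1' on the first '_'
long[['Status', 'Item']] = long['variable'].str.split('_', n=1, expand=True)
# pivot with a list of index columns needs pandas >= 1.1
df_final_1 = (
    long.pivot(index=['Region', 'Store', 'Item'], columns='Status', values='value')
        .reset_index()
        .rename_axis(None, axis='columns')
)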

Related

Sort values intra group [duplicate]

Suppose I have this dataframe:
df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
   price category
0      2   shirts
1     13    pants
2     24   shirts
3     15     tops
4     11      hat
5     44     tops
I want to sort the values in such a way that:
1. Find the highest price of each category.
2. Sort the categories according to their highest price (in this case, descending: tops, shirts, pants, hat).
3. Sort within each category by price, descending.
The final dataframe would look like:
   price category
0     44     tops
1     15     tops
2     24   shirts
3      2   shirts
4     13    pants
5     11      hat
I'm not a big fan of one-liners, so here's my solution:
# Add a column with the max price of each category
df = df.merge(df.groupby('category')['price'].max().rename('max_cat_price'),
              left_on='category', right_index=True)
# Sort by the per-category max first, then by price within the category
# (sort_values returns a new frame, so assign the result back)
df = df.sort_values(['max_cat_price', 'price'], ascending=False)
# Drop the column that has the max price for each category
df.drop('max_cat_price', axis=1, inplace=True)
print(df)
   price category
5     44     tops
3     15     tops
2     24   shirts
0      2   shirts
1     13    pants
4     11      hat
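
The merge can also be avoided entirely with groupby().transform, which broadcasts the per-category max back onto the rows; a sketch of the same idea, starting from the original df:

out = (
    df.assign(max_cat_price=df.groupby('category')['price'].transform('max'))
      .sort_values(['max_cat_price', 'price'], ascending=False)
      .drop(columns='max_cat_price')
)
print(out)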
You can use .groupby and .sort_values:
df.join(df.groupby("category").agg("max"), on="category", rsuffix="_r").sort_values(
    ["price_r", "price"], ascending=False
)
Output
   price category  price_r
5     44     tops       44
3     15     tops       44
2     24   shirts       24
0      2   shirts       24
1     13    pants       13
4     11      hat       11
I used get_group in a DataFrame apply to get the max price of each row's category:
df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
grouped = df.groupby('category')
# look up the max price of each row's category
df['price_r'] = df['category'].apply(lambda row: grouped.get_group(row).price.max())
# sort by the per-category max first, then by price within the category
df = df.sort_values(['price_r', 'price'], ascending=False)
print(df)
Output:
   price category  price_r
5     44     tops       44
3     15     tops       44
2     24   shirts       24
0      2   shirts       24
1     13    pants       13
4     11      hat       11
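
One caveat: calling get_group once per row rescans the group each time; groupby().transform('max') computes the same price_r column in a single pass:

df['price_r'] = df.groupby('category')['price'].transform('max')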

How to get top 3 sales in data frame after using group by and sorting in python?

Recently I have been working with this data set:
import pandas as pd

data = {'Product': ['Box', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Pen',
                    'Markers', 'Bottles', 'Box', 'Markers', 'Markers', 'Pen'],
        'State': ['Alaska', 'California', 'Texas', 'North Carolina', 'California',
                  'Texas', 'Alaska', 'Texas', 'North Carolina', 'Alaska',
                  'California', 'Texas'],
        'Sales': [14, 24, 31, 12, 13, 7, 9, 31, 18, 16, 18, 14]}
df1 = pd.DataFrame(data, columns=['Product', 'State', 'Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
So I have a dataframe like this
Now, I want to find the top 3 states with the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by='Sales', ascending=False)).head(3)
# it gives me only the first three rows overall
grouped_df1.apply(lambda x: x.sort_values(by='Sales', ascending=False)).max()
# it only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
Thus, how can I fix it? Sometimes a single state can hold all of the top sales values; for example, Alaska might have the 3 largest individual sales, so simply sorting all rows and taking the first three could return Alaska three times instead of 3 different states.
Many thanks!
You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform(max)
>>> df1
     Product           State  Sales  Sales_Max_For_State
0        Box          Alaska     14                   16
1    Bottles      California     24                   24
2        Pen           Texas     31                   31
3    Markers  North Carolina     12                   18
4    Bottles      California     13                   24
5        Pen           Texas      7                   31
6    Markers          Alaska      9                   16
7    Bottles           Texas     31                   31
8        Box  North Carolina     18                   18
9    Markers          Alaska     16                   16
10   Markers      California     18                   24
11       Pen           Texas     14                   31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
            State  Sales_Max_For_State
2           Texas                   31
1      California                   24
3  North Carolina                   18
I think there are a few ways to do this:

1. df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2. df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)

The first returns a dataframe:

                Sales
State
Texas              31
California         24
North Carolina     18
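
Selecting the Sales column before aggregating gives an even shorter spelling of option 2, since groupby on a single column returns a Series directly:

df1.groupby('State')['Sales'].max().nlargest(3)
# State
# Texas             31
# California        24
# North Carolina    18
# Name: Sales, dtype: int64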

Get latest value looked up from other dataframe

My first data frame
product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop', 'New Watch'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics', 'Electronics'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
    'Seller_City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Bengalore', 'New York']
})
My 2nd data frame has transactions
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic', 'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'age': [20, 25, 15, 10, 30, 65, 35, 18, 23],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
    'Purchased_Product': ['Watch', 'NA', 'Oil', 'NA', 'Shoes', 'Smartphone', 'NA', 'NA', 'Laptop'],
    'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Chennai', 'Delhi', 'Kolkata', 'Delhi', 'Mumbai']
})
I want Price from the 1st data frame to appear in the merged dataframe, the common element being 'Product_ID'. Note that against Product_ID 101 there are 2 prices, 299.00 and 9898.00. I want the latter one, 9898.0, in the merged data set, since it is the latest price.
Currently my code is not giving the right answer; it is giving both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
   id    name  age  Product_ID Purchased_Product    City   Price
0   1  Olivia   20         101             Watch  Mumbai   299.0
1   1  Olivia   20         101             Watch  Mumbai  9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
Result:
   id     name  age  Product_ID Purchased_Product       City    Price
1   1   Olivia   20         101             Watch     Mumbai   9898.0
2   2   Aditya   25           0                NA      Delhi      NaN
3   3     Cory   15         106               Oil  Bangalore    110.0
4   4  Isabell   10           0                NA    Chennai      NaN
5   5  Dominic   30         103             Shoes    Chennai   2999.0
6   6    Tyler   65         104        Smartphone      Delhi  14999.0
7   7   Samuel   35           0                NA    Kolkata      NaN
8   8   Daniel   18           0                NA      Delhi      NaN
9   9   Jeremy   23         107            Laptop     Mumbai  79999.0
Note the keep = 'last' argument, since we are keeping only the last price registered.
If you care about performance or the dataset is huge, deduplication should be done before merging instead:
product = product.drop_duplicates(subset=['Product_ID'], keep='last')
In your data frame there is no indicator of the latest entry, so you first need to remove the first entry for ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
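
One caveat with that final merge: pd.merge defaults to an inner join, so customers with no matching product (Product_ID 0) disappear from the result. If you want to keep them, merge left from the customer side; a sketch:

result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
# how='left' keeps every customer row, filling Price with NaN where
# there is no matching product
merged = customer.merge(result_product[['Product_ID', 'Price']],
                        on='Product_ID', how='left')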

Retrieve the numbers from the file corresponding to the given regions specified in the file

Below is my dataframe:
   Sno       Name    Region          Num
0    1      Rubin    Indore  79744001550
1    2      Rahul     Delhi  89824304549
2    3      Rohit     Noida  91611611478
3    4     Chirag     Delhi  85879761557
4    5       Shan    Bharat  95604535786
5    6      Jordi    Russia  80777784005
6    7         El    Russia  70008700104
7    8       Nino     Spain  87707101233
8    9       Mark       USA  98271377772
9   10  Pattinson  Hawk Eye  87888888889
I want to retrieve the numbers from the given CSV file and store them region-wise.
delhi_list = []
for i in range(len(data)):
    if data.loc[i]['Region'] == 'Delhi':
        delhi_list.append(data.loc[i]['Num'])
I am getting the results, but I want to achieve this by using a dictionary in Python. Can I use one?
IIUC, you can use groupby, apply the list aggregation then use to_dict:
data.groupby('Region')['Num'].apply(list).to_dict()
[out]
{'Bharat': [95604535786],
 'Delhi': [89824304549, 85879761557],
 'Hawk Eye': [87888888889],
 'Indore': [79744001550],
 'Noida': [91611611478],
 'Russia': [80777784005, 70008700104],
 'Spain': [87707101233],
 'USA': [98271377772]}
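
The same mapping can also be built by iterating the groupby object directly, which makes the dictionary construction explicit:

# each iteration yields (region, Series of Nums for that region)
{region: nums.tolist() for region, nums in data.groupby('Region')['Num']}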

How can I change this form of dictionary to pandas dataframe?

I'm currently processing tweet data with the Python pandas module, and I'm stuck on a problem.
I want to make a frequency table (pandas dataframe) from this dictionary:
d = {"Nigeria": 9, "India": 18, "Saudi Arabia": 9, "Japan": 60, "Brazil": 3, "United States": 38, "Spain": 5, "Russia": 3, "Ukraine": 3, "Azerbaijan": 5, "China": 1, "Germany": 3, "France": 12, "Philippines": 8, "Thailand": 5, "Argentina": 9, "Indonesia": 3, "Netherlands": 8, "Turkey": 2, "Mexico": 9, "Italy": 2}
desired output is:
>>> import pandas as pd
>>> df = pd.DataFrame(?????)
>>> df
Country Count
Nigeria 9
India 18
Saudi Arabia 9
.
.
.
(no matter if there's index from 0 to n at the leftmost column)
Can anyone help me to deal with this problem?
Thank you in advance!
You have only a single series (a column of data with index values), really, so this works:
pd.Series(d, name='Count')
You can then construct a DataFrame if you want:
df = pd.DataFrame(pd.Series(d, name='Count'))
df.index.name = 'Country'
Now you have:
            Count
Country
Argentina       9
Azerbaijan      5
Brazil          3
...
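
If you would rather have Country as a regular column (matching the desired output) instead of the index, the same Series can be converted in one chain:

df = pd.Series(d, name='Count').rename_axis('Country').reset_index()
# gives columns ['Country', 'Count'] with a 0..n integer index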
Use the DataFrame constructor and pass the keys and values separately to the columns:
df = pd.DataFrame({'Country': list(d.keys()),
                   'Count': list(d.values())}, columns=['Country', 'Count'])
print(df)
          Country  Count
0      Azerbaijan      5
1       Indonesia      3
2         Germany      3
3          France     12
4          Mexico      9
5           Italy      2
6           Spain      5
7          Brazil      3
8        Thailand      5
9       Argentina      9
10        Ukraine      3
11  United States     38
12         Turkey      2
13        Nigeria      9
14   Saudi Arabia      9
15    Philippines      8
16          China      1
17          Japan     60
18         Russia      3
19          India     18
20    Netherlands      8
Pass it as a list:
pd.DataFrame([d]).T.rename(columns={0:'count'})
That gets the job done but hurts performance, since it first treats the keys as columns and then transposes the result. Since d.items() gives us the tuples directly, we can instead do:
df = pd.DataFrame(list(d.items()), columns=['country', 'count'])
df.head()
        country  count
0       Germany      3
1   Philippines      8
2        Mexico      9
3       Nigeria      9
4  Saudi Arabia      9
