Pivot Multiple Pandas Rows into Columns Based on Groupby Max - python

I'm fairly new with Python and pandas and have a problem I'm not quite sure how to solve. I have a pandas DataFrame that contains hockey players who have played for multiple teams in the same year:
Player    Season  Team  GP  G   A  TP
Player A  2020    A     10   8  3  11
Player A  2020    B     25  10  5  15
Player A  2020    C      6   4  7  11
Player B  2020    A     30  20  6  26
Player B  2020    B     25  18  5  23
I want to combine rows for the same player and season into a single row, ordering the team columns by how many games the player played for each team. In the example above, all of Team B's numbers would come first because Player A played the most games for Team B, followed by Team A and then Team C. If a player played for fewer than three teams, I'd like NA filled in for the remaining columns.
For example the df above would turn into (Team1 stands for highest team):
Player    Season  Team1  GP1  G1  A1  TP1  Team2  GP2  G2  A2  TP2  Team3  GP3  G3  A3  TP3
Player A  2020    B       25  10   5   15  A       10   8   3   11  C        6   4   7   11
Player B  2020    A       30  20   6   26  B       25  18   5   23  NA      NA  NA  NA   NA
The initial way I can think of attacking this problem is with a series of groupby max operations, but I'm not sure that will achieve the desired outcome. Any help would be greatly appreciated!

You could sort, then pivot:
a = (df.sort_values('GP', ascending=False)  # team with the most games played comes first
       .assign(col=lambda d: d.groupby(['Player', 'Season']).cumcount() + 1)
       .pivot_table(index=['Player', 'Season'], columns='col', aggfunc='first')
    )
# flatten the MultiIndex columns, e.g. ('GP', 1) -> 'GP1':
a.columns = [f'{x}{y}' for x, y in a.columns]
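If you also want the columns grouped per team, as in the expected output, a small follow-up sketch (assuming at most three teams per player and season, as in the sample):
# reorder the flattened columns to Team1 GP1 G1 A1 TP1, Team2 ..., Team3 ...
order = [f'{c}{i}' for i in (1, 2, 3) for c in ('Team', 'GP', 'G', 'A', 'TP')]
out = a.reindex(columns=order).reset_index()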

Related

% Difference Pivot Table python

I have a sample dataframe/table as below and I would like to do a simple pivot table in Python to calculate the % difference from the previous year.
DataFrame
Year Month Count Amount Retailer
2019 5 10 100 ABC
2019 3 8 80 XYZ
2020 3 8 80 ABC
2020 5 7 70 XYZ
...
Expected Output
     MONTH  %Diff
ABC      7  -0.2
XYZ      8  -0.125
Thanks,
EDIT: I would like to reiterate that I want to produce the expected output table above, not to do a join of the two tables.
It looks like you need a groupby, not a pivot:
gdf = df.groupby('Retailer')[['Amount']].pct_change()  # % change from the previous row within each retailer (rows assumed sorted by Year)
Then rename and merge with the original df:
df = gdf.rename(columns={'Amount': '%Diff'}).dropna().merge(df, how='left', left_index=True, right_index=True)
   %Diff  Year  Month  Count  Amount Retailer
2 -0.200  2020      3      8      80      ABC
3 -0.125  2020      5      7      70      XYZ
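If you only want one row per retailer, as in the expected output, a minimal follow-up sketch (assuming df now holds the merged result from the step above):
# keep just the columns requested, indexed by retailer
out = df.set_index('Retailer')[['Month', '%Diff']]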

Set multiple columns to zero based on a value in another column [duplicate]

This question already has answers here:
Change one value based on another value in pandas
(7 answers)
Closed 2 years ago.
I have a sample dataset here. In the real case there are a train and a test dataset, both with around 300 columns and 800 rows. I want to select rows based on a certain value in one column and then set all values in those rows from, say, column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav','Princi','Anuj','Nancy'],
'Age':[27, 24, 22, 32,66,43],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
'Payment':[15,20,40,50,3,23],
'Qualification':['Msc', 'MA', 'MCA', 'Phd','MA','MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, some rows have the value "Princi" in the first column. For the rows where the Name column == "Princi", I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']# get those columns # I could select all those columns
and train.loc[df['column_wanted'] == "that value"] # I got all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor; df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
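For the real dataset described in the question, the same pattern combines the row filter and the column range in a single .loc call; a sketch assuming the column names 'column_wanted', 'got' and 'tod' from the question:
# zero out the 'got' through 'tod' columns on rows matching the condition
train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0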

Get latest value looked up from other dataframe

My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want the Price from the 1st data frame to appear in the merged dataframe, with 'Product_ID' as the common key. Note that for Product_ID 101 there are two prices, 299.0 and 9898.0. I want the latter one in the merged dataset, i.e. 9898.0 (since this is the latest price).
Currently my code is not giving the right answer; it is giving both prices:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume the row order of the dataframe reflects recency. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Note the keep='last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe, as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
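If you want to keep customers with no matching product (Product_ID 0), a left merge from customer, as in the original attempt, may be the better fit; a minimal sketch:
# dedupe products first, then left-merge so every customer row is kept
latest_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
customerpur = pd.merge(customer, latest_product[['Product_ID', 'Price']],
                       on='Product_ID', how='left')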

Sum based on grouping in pandas dataframe?

I have a pandas dataframe df which contains:
major men women rank
Art 5 4 1
Art 3 5 3
Art 2 4 2
Engineer 7 8 3
Engineer 7 4 4
Business 5 5 4
Business 3 4 2
Basically I need to find the total number of students, men and women combined, per major, regardless of the rank column. For Art, for example, the total should be all men + women, i.e. 23; Engineer 26; Business 17.
I have tried
df.groupby(['major_category']).sum()
But this separately sums the men and women rather than combining their totals.
Just add both columns and then groupby:
(df.men+df.women).groupby(df.major).sum()
major
Art 23
Business 17
Engineer 26
dtype: int64
melt() then groupby():
df.drop(columns='rank').melt('major').groupby('major', as_index=False)['value'].sum()
major value
0 Art 23
1 Business 17
2 Engineer 26
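An equivalent sketch that groups first and then combines the two column totals, in case you prefer to start from the groupby:
# sum men and women per major, then add the two column totals per row
totals = df.groupby('major')[['men', 'women']].sum().sum(axis=1)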

Pandas compare the same columns between merged dfs

I have two dfs that look like the following:
Df1:
area      team    score
ontario   team 1     60
ontario   team 3     30
ontario   team 2     50
new york  team 1     90
new york  team 2     30
Df2:
area      team    score
ontario   team 1     60
ontario   team 3     30
ontario   team 2     50
new york  team 1     90
new york  team 2     70
If I do the following:
merge = pd.merge(df1, df2, on=['area', 'team'])
I get:
merge:
area      team    score_x  score_y
ontario   team 1       60       60
ontario   team 3       30       30
ontario   team 2       50       50
new york  team 1       90       90
new york  team 2       30       70
It can be noted that the score in the last row of both dfs is different.
I would like to find what the percent difference is in between score_x and score_y.
However, I actually have hundreds of metrics such as "score". How can I find the percent difference for each column of the merged df that shared the same name before the merge, i.e. before the _x and _y suffixes are appended?
What's the best way to do this? I guess I could get a list of the common column names, append _x and _y to each, and then loop through the list computing the percent difference of each pair, but is there a better way?
Just set 'area' and 'team' as the frame index and do the "normal" math:
df1.set_index(['area','team'], inplace=True)
df2.set_index(['area','team'], inplace=True)
(df1 - df2) / df1
# score
#area team
#ontario team 1 0.000000
# team 3 0.000000
# team 2 0.000000
#new york team 1 0.000000
# team 2 -1.333333
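If you have already merged and are working with the _x/_y columns instead, a sketch (assuming every shared metric column ends up with both suffixes after the merge):
# percent difference for every metric that received _x/_y suffixes in the merge
metrics = [c[:-2] for c in merge.columns if c.endswith('_x')]
for m in metrics:
    merge[f'{m}_pct_diff'] = (merge[f'{m}_x'] - merge[f'{m}_y']) / merge[f'{m}_x']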
