Pandas compare the same colums between merged dfs

Pandas compare the same colums between merged dfs - python

I have two dfs that look like the following:
Df1:
area team score
ontario team 1 60
ontario team 3 30
ontario team 2 50
new york team 1 90
new york team 2 30
Df2:
area team score
ontario team 1 60
ontario team 3 30
ontario team 2 50
new york team 1 90
new york team 2 70
If I do the following:
merge = pd.merge(df1, df2, on=['area', 'team'])
I get:
merge:
area team score_x score_y
ontario team 1 60 60
ontario team 3 30 30
ontario team 2 50 50
new york team 1 90 90
new york team 2 30 70
It can be noted that the score in the last row of both dfs is different.
I would like to find what the percent difference is in between score_x and score_y.
However I actually have hundreds of metrics such as "score". How can I find the percent difference of each column of the merged df which has the same key before the merge is done and the _x and _y are apended?
Whats the best way to do this? I guess I could just get a list of the common keys and append a _y and _x to each and then go through the list and check the percent difference of both columns, but is there a better way to do this?

Just set 'area' and 'team' as the frame index and do the "normal" math:
df1.set_index(['area','team'], inplace=True)
df2.set_index(['area','team'], inplace=True)
(df1 - df2) / df1
# score
#area team
#ontario team 1 0.000000
# team 3 0.000000
# team 2 0.000000
#new york team 1 0.000000
# team 2 -1.333333

Related

Group by when condition is not just by row in Python Pandas

I have two dataframes one at the lower level and one that summarizes the data at a higher level. I'm trying to add a new column to the summary table that sums the total spending of all people who are fans of a particular sport. IE in the summary row of soccer I do NOT want to sum the total soccer spending, but the total sports spending of anyone who spends anything on soccer.
df = pd.DataFrame({'Person': [1,2,3,3,3],
'Sport': ['Soccer','Tennis','Tennis','Football','Soccer'],
'Ticket_Cost': [10,20,10,10,20]})
df2 = pd.DataFrame({'Sport': ['Soccer','Tennis','Football']})
I can currently do this in many steps, but I'm sure there is a more efficient/quicker way. Here is how I currently do it.
#Calculate the total spend for each person in an temporary dataframe
df_intermediate = df.groupby(['Person'])['Ticket_Cost'].sum()
df_intermediate= df_intermediate.rename("Total_Sports_Spend")
Person Total_Sports_Spend
1 10
2 20
3 40
#place this total in the detailed table
df = pd.merge(df,df_intermediate,how='left',on='Person')
#Create a second temporary dataframe
df_intermediate2 = df.groupby(['Sport'])['Total_Sports_Spend'].sum()
Sport Total_Sports_Spend
Football 40
Soccer 50
Tennis 60
#Merge this table with the summary table
df2 = pd.merge(df2,df_intermediate2,how='left',on='Sport')
Sport Total_Sports_Spend
0 Soccer 50
1 Tennis 60
2 Football 40
Finally, I clean up the temporary dataframes and remove the extra column from the detailed table. I'm sure there is a better way.

You might want to rotate your DataFrame in 2D:
df2 = df.pivot_table(index = 'Person', columns = 'Sport', values = 'Ticket_Cost')
You get
Sport Football Soccer Tennis
Person
1 NaN 10.0 NaN
2 NaN NaN 20.0
3 10.0 20.0 10.0
Now you can compute the total spending per person:
total = df2.sum(axis=1)
which is
Person
1 10.0
2 20.0
3 40.0
dtype: float64
Finally you place the total spending values of total in the cells of df2 where the cell has a positive value:
df3 = (df2>0).mul(total, axis=0)
which is here:
Sport Football Soccer Tennis
Person
1 0.0 10.0 0.0
2 0.0 0.0 20.0
3 40.0 40.0 40.0
Finally you just have to sum along columns to get what you want:
spending = df3.sum(axis=0)
and will get what you expect.

Get latest value looked up from other dataframe

My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want Price from 1st data frame to come in the merged dataframe. Common element being 'Product_ID'. Note that against product_ID 101, there are 2 prices - 299.00 and 9898.00. I want the later one to come in the merged data set i.e. 9898.0 (Since this is latest price)
Currently my code is not giving the right answer. It is giving both
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0

There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note keep = 'last' argument since we are keeping only last price registered.
Deduplication should be done before merging if Yuo care about performace or dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')

In your data frame there is no indicator of latest entry, so you might need to first remove the the first entry for id 101 from product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')

Pivot Multiple Pandas Rows into Columns Based on Groupby Max

I'm fairly new with Python and pandas and have a problem I'm not quite sure how to solve. I have a pandas DataFrame that contains hockey players who have played for multiple teams in the same year:
Player Season Team GP G A TP
Player A 2020 A 10 8 3 11
Player A 2020 B 25 10 5 15
Player A 2020 C 6 4 7 11
Player B 2020 A 30 20 6 26
Player B 2020 B 25 18 5 23
I want to be able to combine rows that contain the same player from the same year, and arrange the columns by the team that player played the most for. In the above example all of Team B's numbers would be first because Player A has played the most games for Team B, followed by Team A and then Team C. If a player hasn't played for multiple teams or less than three, I'd like NA to be filled in for the given column.
For example the df above would turn into (Team1 stands for highest team):
Player Season Team1 GP1 G1 A1 TP1 Team2 GP2 G2 A2 TP2 Team3 GP3 G3 A3 TP3
Player A 2020 B 25 10 5 15 A 10 8 3 11 C 6 4 7 11
Player B 2020 A 30 20 6 26 B 25 18 5 23 NA NA NA NA NA
The initial way I can think of attacking this problem is by using a series of groupby max but I'm not sure if that will achieve the desired outcome. Any help would be greatly appreciated!

You could sort, then pivot:
a=(df.sort_values('GP')
.assign(col=df.groupby(['Player','Season']).cumcount()+1)
.pivot_table(index=['Player','Season'], columns='col', aggfunc='first')
)
# rename:
a.columns = [f'{x}{y}' for x,y in a.columns]

Multiply columns based on two columns conditions from different dataframes?

I have two dataframes as indicated below:
dfA =
Country City Pop
US Washington 1000
US Texas 5000
CH Geneva 500
CH Zurich 500
dfB =
Country City Density (pop/km2)
US Washington 10
US Texas 50
CH Geneva 5
CH Zurich 5
What I want is to compare the columns Country and City from both dataframes, and when these match such as:
US Washington & US Washington in both dataframes, it takes the Pop value and divides it by Density, as to get a new column area in dfB with the resulting division. Example of first row results dfB['area km2'] = 100
I have tried with np.where() but it is nit working. Any hints on how to achieve this?

Using index matching and div
match_on = ['Country', 'City']
dfA = dfA.set_index(match_on)
dfA.assign(ratio=dfA.Pop.div(df.set_index(['Country', 'City'])['Density (pop/km2)']))
Country City
US Washington 100.0
Texas 100.0
CH Geneva 100.0
Zurich 100.0
dtype: float64

You can also use merge to combine the two dataframes and divide as usual:
dfMerge = dfA.merge(dfB, on=['Country', 'City'])
dfMerge['area'] = dfMerge['Pop'].div(dfMerge['Density (pop/km2)'])
print(dfMerge)
Output:
Country City Pop Density (pop/km2) area
0 US Washington 1000 10 100.0
1 US Texas 5000 50 100.0
2 CH Geneva 500 5 100.0
3 CH Zurich 500 5 100.0

you can also use merge like below
dfB["Area"] = dfB.merge(dfA, on=["Country", "City"], how="left")["Pop"] / dfB["Density (pop/km2)"]
dfB

Merge two dataframes based on a column

I want to compare name column in two dataframes df1 and df2 , output the matching rows from dataframe df1 and store the result in new dataframe df3. How do i do this in Pandas ?
df1
place name qty unit
NY Tom 2 10
TK Ron 3 15
Lon Don 5 90
Hk Sam 4 49
df2
place name price
PH Tom 7
TK Ron 5
Result:
df3
place name qty unit
NY Tom 2 10
TK Ron 3 15

Option 1
Using df.isin:
In [1362]: df1[df1.name.isin(df2.name)]
Out[1362]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 2
Performing an inner-join with df.merge:
In [1365]: df1.merge(df2.name.to_frame())
Out[1365]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 3
Using df.eq:
In [1374]: df1[df1.name.eq(df2.name)]
Out[1374]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15

You want something called an inner join.
df1.merge(df2,on = 'name')
place_x name qty unit place_y price
NY Tom 2 10 PH 7
TK Ron 3 15 TK 5
The _xand _y happens when you have a column in both data frames being merged.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas compare the same colums between merged dfs - python

Related

Group by when condition is not just by row in Python Pandas

Get latest value looked up from other dataframe

Pivot Multiple Pandas Rows into Columns Based on Groupby Max

Multiply columns based on two columns conditions from different dataframes?

Merge two dataframes based on a column

Categories

Resources