I have a data frame below that shows the price of wood and steel from two different suppliers.
I would like to add a column that shows the highest price for the opposite item (i.e. if line is wood, it would pull steel) from the same supplier.
For example, the "Steel" row for "Tom" would show his highest wood price which is 42.
The code I have so far simply returns the highest price for the original item (i.e. not the opposite, so for Tom's steel row returns 24 but I would have wanted it to return 42).
I think this is an issue with pulling the max value for a multi-group. I have tried a number of different ways but just cannot seem to get it.
Any thoughts would be greatly appreciated.
import pandas as pd
import numpy as np
data = {'Supplier':['Tom', 'Tom', 'Tom', 'Bill','Bill','Bill'],'Item':['Wood','Wood','Steel','Steel','Steel','Wood'],'Price':[42,33,24,16,12,18]}
df = pd.DataFrame(data)
df['Opp_Item'] = np.where(df['Item']=="Wood", "Steel", "Wood")
df['Opp_Item_Max'] = df.groupby(['Supplier','Opp_Item'])['Price'].transform(max)
print(df)
Supplier Item Price Opp_Item Opp_Item_Max
0 Tom Wood 42 Steel 42
1 Tom Wood 33 Steel 42
2 Tom Steel 24 Wood 24
3 Bill Steel 16 Wood 16
4 Bill Steel 12 Wood 16
5 Bill Wood 18 Steel 18
If you can find the per supplier+item maximum, then you can just swap the values and assign them back through a join:
v = df.groupby(['Supplier', 'Item']).Price.max().unstack(-1)
# This reversal operation works under the assumption that
# there are only two items and that they are opposites of each other.
v[:] = v.values[:, ::-1]
df = (df.set_index(['Supplier', 'Item'])
.join(v.stack().to_frame('Opp_Item_Max'), how='left')
.reset_index())
print(df)
Supplier Item Price Opp_Item_Max
0 Bill Steel 16 18
1 Bill Steel 12 18
2 Bill Wood 18 16
3 Tom Steel 24 42
4 Tom Wood 42 24
5 Tom Wood 33 24
Note: Ordering of your data will not be preserved after the join.
You could map to the opposite values before a groupby, and then merge this back to the original DataFrame.
d = {'Steel': 'Wood', 'Wood': 'Steel'}
df.merge(df.assign(Item = df.Item.map(d))
.groupby(['Supplier', 'Item'], as_index=False).max(),
on=['Supplier', 'Item'],
how='left',
suffixes=['', '_Opp_Item'])
Supplier Item Price Price_Opp_Item
0 Tom Wood 42 24
1 Tom Wood 33 24
2 Tom Steel 24 42
3 Bill Steel 16 18
4 Bill Steel 12 18
5 Bill Wood 18 16
Related
Hello I have this Pandas code (look below) but turn out it give me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I wanted to change the element values from column 'Age', imagine if I wanted to increase the column elements from 'Age' by 1, how do i do that? (With Number of Children as well)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try:
df['Age'] = df['Age'] - 12
df['Number of Children'] = df['Number of Children'] - 1
I have some American Football data in a DataFrame like below:
df = pd.DataFrame({'Green Bay Packers' : ['30-18-0', '5-37', '10-71' ],
'Chicago Bears' : ['45-26-1', '5-20', '10-107']},
index=['Att - Comp - Int', 'Sacked - Yds Lost', 'Penalties - Yards'])
Green Bay Packers Chicago Bears
Att - Comp - Int 30-18-0 45-26-1
Sacked - Yds Lost 5-37 5-20
Penalties - Yards 10-71 10-107
You can see above that each row contains multiple data points that need to be split off.
What I'd like to do is find some way to split the rows up so that each data point is it's own row. The final output would like like:
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
Is there a way to do this efficiently? I tried some Regex but it just turned into a mess. Sorry if my formatting isn't perfect...2nd question ever posted here.
Try:
df = df.reset_index().apply(lambda x: x.str.split("-"))
df = pd.DataFrame(
{c: df[c].explode().str.strip() for c in df.columns},
).set_index("index")
df.index.name = None
print(df)
Prints:
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
First reset the index, then stack all the columns and split them on -, You can also additionally apply to remove any left over whitespace characters after after using split, then unstack again, then apply pd.Series.explode finally reset the index, and drop any left-over unrequired column.
out = (df.reset_index()
.stack().str.split('-').apply(lambda x:[i.strip() for i in x])
.unstack()
.apply(pd.Series.explode)
.reset_index()
.drop(columns='level_0'))
index Green Bay Packers Chicago Bears
0 Att 30 45
1 Comp 18 26
2 Int 0 1
3 Sacked 5 5
4 Yds Lost 37 20
5 Penalties 10 10
6 Yards 71 107
Assuming you have same number of splits for every row, with pandas >= 1.3.0, you can explode multiple columns at the same time:
df = df.reset_index().apply(lambda s: s.str.split(' *- *'))
df.explode(df.columns.tolist()).set_index('index')
Green Bay Packers Chicago Bears
index
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
Use .apply() on each column (including index) and for each column:
use .str.split() to split data points and
use .explode() to create rows for each split data point element
df_out = (df.reset_index()
.apply(lambda x: x.str.split(r'\s*-\s*').explode())
.set_index('index').rename_axis(index=None)
)
Result:
print(df_out)
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
recently I am doing with this data set
import pandas as pd
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','State','Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
So I have a dataframe like this
Now, I want to find the top 3 State that have the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).head(3)
# It gives me the first three rows
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).max()
#It only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
Thus, how can I fix it? Because sometimes, a State can have 3 top sales, for example Alaska may have 3 top sales. When I simply sort it, the top 3 will be Alaska, and it cannot find 2 other groups.
Many thanks!
You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform(max)
>>> df1
Product State Sales Sales_Max_For_State
0 Box Alaska 14 16
1 Bottles California 24 24
2 Pen Texas 31 31
3 Markers North Carolina 12 18
4 Bottles California 13 24
5 Pen Texas 7 31
6 Markers Alaska 9 16
7 Bottles Texas 31 31
8 Box North Carolina 18 18
9 Markers Alaska 16 16
10 Markers California 18 24
11 Pen Texas 14 31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
State Sales_Max_For_State
2 Texas 31
1 California 24
3 North Carolina 18
I think there are a few ways to do this:
1-
df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2-df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)
Sales
State
Texas 31
California 24
North Carolina 18
question to choose value based on two df.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
expected result is all rows that value 'age' is higher than df['age'] based on column 'name.
expected result
age name
12 55 Anna
Per comments, use merge and filter dataframe:
df.merge(df2, on='name', suffixes={'','_y'}).query('age > age_y')[['name','age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
Try this:
index = df[df['age'] > age].index
df.loc[index]
There are a few edge cases you don't mention how you would like to resolve, but generally what you want to do is iterate down the df and compare ages and use the larger. You could do so in the following manner:
df3 = pd.DataFrame(columns = ['age', 'name'])
for x in len(df):
if df['age'][x] > df2['age'][x]:
df3['age'][x] = df['age'][x]
df3['name'][x] = df['name'][x]
else:
df3['age'][x] = df2['age'][x]
df3['name'][x] = df2['name'][x]
Although you will need to modify this to reflect how you want to resolve names that are only in one list, or if the lists are of different sizes.
One solution comes to my mind is merge and drop
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do i compute the mean hours of exercise of female students with a 'status' of passing...?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Anyone please help me in solving this.
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))