How to do this with pandas?
I have this dataset, which consists of a list of cars and their colors (cars and colors may repeat):
Color Car
Blue Honda
Green Honda
Green Honda
Blue fiat
Black fiat
....
Yellow nissan
I would like to create a column for each car with its respective colors (without duplicated colors for each car). In the example, Honda and Green occur together twice, but in the Honda column "Green" would appear only once.
Something like this:
+----------------------+------------+----------------------+---------+
| Color | Car | Honda | Fiat |
+----------------------+------------+----------------------+---------+
| Blue | Honda |Blue |Blue
| Green | Honda Green |Black
| Green | Honda |Yellow
| Blue | fiat
| Black | fiat
….
| Yellow | nissan
+-----------------------------------+------------+--------+
I would also like to know how many distinct colors each car has (the number of unique items in the "Color" column for each item in the "Car" column).
Try DataFrame.join with pd.crosstab:
df1 = df.join(
pd.crosstab(df.index, df["Car"], df["Color"], aggfunc="first").fillna(" ")
)
print(df1)
Color Car Honda fiat nissan
0 Blue Honda Blue
1 Green Honda Green
2 Green Honda Green
3 Blue fiat Blue
4 Black fiat Black
5 Yellow nissan Yellow
For unique colors according to your example output we can create a boolean mask and apply this back to the values parameter in pd.crosstab
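import numpy as np

# Blank out repeats of a (Car, Color) pair so each color is kept only at its first occurrence
unique_color = np.where(
    df.groupby(["Car", "Color"]).cumcount().ge(1), "", df["Color"]
)
df1 = df.join(
    pd.crosstab(df.index, df["Car"], unique_color, aggfunc="first").fillna(" ")
)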
print(df1)
Color Car Honda fiat nissan
0 Blue Honda Blue
1 Green Honda Green
2 Green Honda
3 Blue fiat Blue
4 Black fiat Black
5 Yellow nissan Yellow
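The question also asks how many distinct colors each car has, which the crosstab output does not show directly. A minimal sketch, assuming the six rows printed above are the whole dataset (the names color_counts and unique_colors are only illustrative):

```python
import pandas as pd

# Sample rows taken from the printed output above
df = pd.DataFrame({
    "Color": ["Blue", "Green", "Green", "Blue", "Black", "Yellow"],
    "Car": ["Honda", "Honda", "Honda", "fiat", "fiat", "nissan"],
})

# Number of distinct colors per car
color_counts = df.groupby("Car")["Color"].nunique()
print(color_counts)

# The distinct colors themselves, one list per car, in order of first appearance
unique_colors = (
    df.drop_duplicates(["Car", "Color"])
      .groupby("Car")["Color"]
      .agg(list)
)
print(unique_colors)
```

With this sample, Honda and fiat each have 2 distinct colors and nissan has 1.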
df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
I tried using the below code
df.groupby(['CATEGORY', 'COLOR']).size().unstack(fill_value=0)
which does not give me what I need. The output I am expecting is
CATEGORY COLOR ITEM
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
I believe you need DataFrame.set_index with DataFrame.sort_index:
df1 = df.set_index(['CATEGORY', 'COLOR']).sort_index()
print (df1)
ITEM
CATEGORY COLOR
BIKE BLACK 54519
BLUE 23661
CAR BLACK 14582
RED 48684
JEEP WHITE 45685
If the original order is important, convert both columns to ordered categoricals:
cols = ['CATEGORY', 'COLOR']
for c in cols:
    df[c] = pd.Categorical(df[c], categories=df[c].drop_duplicates(), ordered=True)
df1 = df.set_index(cols).sort_index()
print (df1)
ITEM
CATEGORY COLOR
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
I will use Categorical first, since this will keep the original CATEGORY order
df.CATEGORY=pd.Categorical(df.CATEGORY,df.CATEGORY.unique())
s=df.sort_values(['CATEGORY','COLOR']).set_index(['CATEGORY','COLOR'])
ITEM
CATEGORY COLOR
CAR BLACK 14582
RED 48684
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
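Note that the last output differs from the expected one: only CATEGORY was made categorical there, so COLOR is still sorted alphabetically (BLACK before RED). A small self-contained sketch comparing the two variants, assuming the five sample rows from the question (the names partial and both are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE"],
})

# Variant 1: only CATEGORY keeps first-appearance order; COLOR sorts alphabetically
partial = df.copy()
partial["CATEGORY"] = pd.Categorical(
    partial["CATEGORY"], categories=partial["CATEGORY"].unique(), ordered=True
)
partial = partial.sort_values(["CATEGORY", "COLOR"]).set_index(["CATEGORY", "COLOR"])

# Variant 2: both columns keep first-appearance order
both = df.copy()
for c in ["CATEGORY", "COLOR"]:
    both[c] = pd.Categorical(both[c], categories=both[c].unique(), ordered=True)
both = both.set_index(["CATEGORY", "COLOR"]).sort_index()

print(partial["ITEM"].tolist())  # [14582, 48684, 54519, 23661, 45685]
print(both["ITEM"].tolist())     # [48684, 14582, 54519, 23661, 45685]
```

Only the second variant reproduces the row order requested in the question.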
df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
df.loc[df.duplicated(['CATEGORY', 'COLOR','ITEM']), 'ITEM'] = 'ITEM' does not give me the required output. I need the output as below.
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23661 BIKE BLUE
54519 BIKE BLACK
If CATEGORY and COLOR are the same, the ITEM number should be replaced with the first value for that pair.
Use GroupBy.transform with GroupBy.first to broadcast the first ITEM of each group to all of its rows:
df['ITEM'] = df.groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
If there are many unique rows and few duplicates, you can improve performance by processing only the duplicated rows: build a mask with DataFrame.duplicated on the 2 columns with keep=False, apply groupby only to the rows selected by boolean indexing, and assign the result back to the same rows of ITEM:
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
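To check that the masked version gives the same result as the plain transform, here is a self-contained sketch using the seven sample rows from the question (the names full, masked and keys are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})
keys = ["CATEGORY", "COLOR"]

# Plain version: transform over every row
full = df.copy()
full["ITEM"] = full.groupby(keys)["ITEM"].transform("first")

# Masked version: touch only rows whose (CATEGORY, COLOR) pair occurs more than once
masked = df.copy()
m = masked.duplicated(keys, keep=False)
masked.loc[m, "ITEM"] = masked[m].groupby(keys)["ITEM"].transform("first")

print(full["ITEM"].tolist())  # [48684, 54519, 14582, 45685, 23661, 23661, 54519]
print(full["ITEM"].tolist() == masked["ITEM"].tolist())  # True
```

Whether the mask actually saves time depends on the share of duplicated rows, so it is worth timing on the real data.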
I have a master dataframe, df:
Colour Item Price
Blue Car 40
Red Car 30
Green Truck 50
Green Bike 30
I then have a price correction dataframe, df_pc:
Colour Item Price
Red Car 60
Green Bike 70
If there is a match on Colour and Item in the price correction dataframe, I want to replace the price in the master df, so the expected output is:
Colour Item Price
Blue Car 40
Red Car 60
Green Truck 50
Green Bike 70
I can't find a way of doing this currently
Use Index.isin to filter out the rows of df_pc that have no match in df, and then DataFrame.combine_first:
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
Another data test:
print (df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 <- not matched row
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 30.0
2 Green Truck 50.0
3 Red Car 60.0
Here is a way using combine_first():
df_pc.set_index(['Colour','Item']).combine_first(
df.set_index(['Colour','Item'])).reset_index()
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
EDIT:
If you want only matching items, we can also use merge with fillna:
print(df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 #changed row not matching
(df.merge(df_pc, on = ['Colour','Item'],how='left',suffixes=('_x',''))
.assign(Price=lambda x:x['Price'].fillna(x['Price_x'])).reindex(df.columns,axis=1))
Colour Item Price
0 Blue Car 40.0
1 Red Car 60.0
2 Green Truck 50.0
3 Green Bike 30.0
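In the outputs above Price comes back as float (40.0), because the unmatched rows pass through NaN. A possible variation that keeps the original row order and casts Price back to its original dtype, as a sketch assuming each (Colour, Item) pair appears at most once in df_pc (the Price_new suffix and the name result are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Colour": ["Blue", "Red", "Green", "Green"],
    "Item": ["Car", "Car", "Truck", "Bike"],
    "Price": [40, 30, 50, 30],
})
df_pc = pd.DataFrame({
    "Colour": ["Red", "Orange"],
    "Item": ["Car", "Bike"],
    "Price": [60, 70],  # Orange Bike has no match in df
})

keys = ["Colour", "Item"]

# Left merge keeps every row of df; corrections land in Price_new
merged = df.merge(df_pc, on=keys, how="left", suffixes=("", "_new"))

# Prefer the corrected price where present, then restore the original dtype
merged["Price"] = merged["Price_new"].fillna(merged["Price"]).astype(df["Price"].dtype)
result = merged.drop(columns="Price_new")
print(result)
```

Here only Red Car changes (30 becomes 60); the unmatched Orange Bike correction is ignored.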
I want to generate a dataframe that contains lists of a person's potential favorite crayon colors, based on their favorite color. I have two dataframes that contain the necessary information:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
I want to reference one dataframe against the other by matching the df1 color entry to the df2 color entry, and returning the corresponding possible_crayons values as a list in a new column in df1. Any entries that did not find a match would be labeled NaN. So the desired output would be:
person favorite_color possible_crayons_list
Jeff blue [navy, aqua]
Marie purple [periwinkle, royal purple]
Jenna brown NaN
Mike green [forest green, pine]
I've tried:
mergedDF = pd.merge(df1, df2, how='left')
However, this results in the following:
person color possible_crayons
0 Jeff blue navy
1 Jeff blue aqua
2 Marie purple periwinkle
3 Marie purple royal purple
4 Jenna brown NaN
5 Mike green forest green
6 Mike green pine
Is there any way to achieve my desired output of lists?
We can use DataFrame.merge with how='left' and then GroupBy.agg with as_index=False:
new_df= ( df1.merge(df2,how='left',on='color')
.groupby(['color','person'],as_index=False).agg(list) )
Output
print(new_df)
color person possible_crayons
0 blue Jeff [navy, aqua]
1 brown Jenna [nan]
2 green Mike [forest green, pine]
3 purple Marie [periwinkle, royal purple]
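Note that Jenna gets [nan] (a one-element list) here rather than the NaN asked for, and the rows are reordered by color. One way around both, as a sketch, is to build the lists per color first and then map them onto df1 (the names crayons and result, and the column name possible_crayons_list, are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({
    "person": ["Jeff", "Marie", "Jenna", "Mike"],
    "color": ["blue", "purple", "brown", "green"],
})
df2 = pd.DataFrame({
    "possible_crayons": ["christmas red", "infra red", "scarlet", "sunset orange",
                         "neon carrot", "lemon", "forest green", "pine", "navy",
                         "aqua", "periwinkle", "royal purple"],
    "color": ["red", "red", "red", "orange", "orange", "yellow", "green", "green",
              "blue", "blue", "purple", "purple"],
})

# One list of crayons per color, indexed by color
crayons = df2.groupby("color")["possible_crayons"].agg(list)

# Look up each person's color; colors missing from df2 become NaN
result = df1.assign(possible_crayons_list=df1["color"].map(crayons))
print(result)
```

This keeps df1's row order, and Jenna's entry is a plain NaN instead of a list.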
Use this:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
tmp = df2.groupby('color')['possible_crayons'].apply(list)
mergedDF = df1.merge(tmp, how='left', left_on='color', right_index=True)
print(mergedDF)
mergedDF2 = mergedDF.groupby('color')['possible_crayons'].apply(list).reset_index(name='new_possible_crayons')