Replace duplicate values from one DF to another DF - python

df1
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
df2
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54252 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
df2 has many other columns.
The output I need is:
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
From df1 it is clear that ITEM 54252 and 54519 are the same. So based on df1 we need to replace the values in df2.

I modified the previous solution by adding a new column, orig, to remember the original ITEM values, then created a Series with DataFrame.set_index and used Series.replace to replace the values in the other DataFrame:
df = df1.assign(orig=df1['ITEM'])
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
s = df[m].set_index('orig')['ITEM']
print (s)
orig
54519 54519
23661 23661
23226 23661
54252 54519
Name: ITEM, dtype: int64
df2['ITEM'] = df2['ITEM'].replace(s)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
Another alternative, without the new column, is to replace by dictionary. Copy the original ITEM values first, because df1 is modified in place afterwards:
orig = df1['ITEM'].copy()
m = df1.duplicated(['CATEGORY', 'COLOR'], keep=False)
df1.loc[m, 'ITEM'] = df1[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df1)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
d = dict(zip(orig[m], df1.loc[m, 'ITEM']))
print (d)
{54519: 54519, 23661: 23661, 23226: 23661, 54252: 54519}
df2['ITEM'] = df2['ITEM'].replace(d)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
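For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the two frames look like the samples in the question (df2 reduced to the four columns shown) and builds the lookup directly from a groupby transform, leaving df1 untouched. The names first_item and mapping are only illustrative.

```python
import pandas as pd

# Sample data copied from the question (df2 reduced to the columns shown).
df1 = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})
df2 = pd.DataFrame({
    "USERID": [1541, 3351, 2639],
    "WEBBROWSE": ["CHROME", "EXPLORER", "MOBILE APP"],
    "ITEM": [54252, 54519, 23661],
    "PURCHASE": ["YES", "YES", "YES"],
})

# First ITEM seen for each (CATEGORY, COLOR) pair; df1 itself is not modified.
first_item = df1.groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")

# Map every original ITEM to the first ITEM of its group.
mapping = dict(zip(df1["ITEM"], first_item))

# ITEM values that are not keys of the mapping are left as they are.
df2["ITEM"] = df2["ITEM"].replace(mapping)
print(df2)
```

Because the mapping is built from a separate Series, there is no need to keep a copy of the original ITEM column.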

Related

Pandas - organize and count unique data

How to do this with pandas?
I have this dataset, which consists of a list of cars and its colors (cars and colors may repeat):
Color Car
Blue Honda
Green Honda
Green Honda
Blue fiat
Black fiat
....
Yellow nissan
I would like to create a column for each car with its respective colors (without duplicated colors per car). In the example, Honda and Green occur together twice, but in the Honda column "Green" would appear only once.
Something like this:
+----------------------+------------+----------------------+---------+
| Color | Car | Honda | Fiat |
+----------------------+------------+----------------------+---------+
| Blue | Honda |Blue |Blue
| Green | Honda Green |Black
| Green | Honda |Yellow
| Blue | fiat
| Black | fiat
….
| Yellow | nissan
+-----------------------------------+------------+--------+
I would also like to know how many distinct colors each car has (the number of unique items in the "Color" column for each item in the "Car" column).
Try a join with pd.crosstab:
df1 = df.join(
    pd.crosstab(df.index, df["Car"], df["Color"], aggfunc="first").fillna(" ")
)
print(df1)
    Color     Car  Honda   fiat  nissan
0    Blue   Honda   Blue
1   Green   Honda  Green
2   Green   Honda  Green
3    Blue    fiat           Blue
4   Black    fiat          Black
5  Yellow  nissan                Yellow
For unique colors, as in your example output, we can build a boolean mask that flags repeated Car/Color pairs, blank those entries out, and pass the result as the values parameter of pd.crosstab (this needs import numpy as np):
unique_color = np.where(
    df.groupby(['Car','Color']).cumcount().ge(1), "", df["Color"]
)
df1 = df.join(
    pd.crosstab(df.index, df["Car"], unique_color, aggfunc="first").fillna(" ")
)
print(df1)
    Color     Car  Honda   fiat  nissan
0    Blue   Honda   Blue
1   Green   Honda  Green
2   Green   Honda
3    Blue    fiat           Blue
4   Black    fiat          Black
5  Yellow  nissan                Yellow
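The question also asks how many distinct colors each car has, which the crosstab output does not show directly. A minimal sketch using GroupBy.nunique follows; the sample frame is an assumption, containing only the six rows visible in the output above (the rows elided in the question are left out).

```python
import pandas as pd

# Assumed sample: only the six rows visible in the printed output.
df = pd.DataFrame({
    "Color": ["Blue", "Green", "Green", "Blue", "Black", "Yellow"],
    "Car": ["Honda", "Honda", "Honda", "fiat", "fiat", "nissan"],
})

# Number of distinct colors per car; the repeated Honda/Green row counts once.
color_counts = df.groupby("Car")["Color"].nunique()
print(color_counts)
```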

Multi-indexing using pandas

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
I tried using the below code
df.groupby(['CATEGORY', 'COLOR']).size().unstack(fill_value=0)
which does not give me what I need. The output I am expecting is
CATEGORY COLOR ITEM
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
I believe you need DataFrame.set_index with DataFrame.sort_index:
df1 = df.set_index(['CATEGORY', 'COLOR']).sort_index()
print (df1)
ITEM
CATEGORY COLOR
BIKE BLACK 54519
BLUE 23661
CAR BLACK 14582
RED 48684
JEEP WHITE 45685
If order is important, convert both columns to ordered categoricals:
cols = ['CATEGORY', 'COLOR']
for c in cols:
    df[c] = pd.Categorical(df[c], categories=df[c].drop_duplicates(), ordered=True)
df1 = df.set_index(cols).sort_index()
print (df1)
ITEM
CATEGORY COLOR
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
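A self-contained version of the ordered-categorical approach, for checking it against the expected output. The frame is rebuilt from the question's sample, on the assumption that ITEM is an integer column.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE"],
})

cols = ["CATEGORY", "COLOR"]
for c in cols:
    # The category order follows the order of first appearance in the column.
    df[c] = pd.Categorical(df[c], categories=df[c].drop_duplicates(), ordered=True)

# sort_index now sorts by category order instead of alphabetically.
df1 = df.set_index(cols).sort_index()
print(df1)
```

The sort follows the order of first appearance in both columns, so CAR comes before BIKE and RED before BLACK.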
I would use Categorical first, since this keeps the original CATEGORY order (COLOR is still sorted alphabetically within each category):
df.CATEGORY = pd.Categorical(df.CATEGORY, df.CATEGORY.unique())
s = df.sort_values(['CATEGORY', 'COLOR']).set_index(['CATEGORY', 'COLOR'])
print (s)
ITEM
CATEGORY COLOR
CAR BLACK 14582
RED 48684
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685

Replace duplicates with first values in dataframe

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
I tried
df.loc[df.duplicated(['CATEGORY', 'COLOR','ITEM']), 'ITEM'] = 'ITEM'
but it does not give me the required output. I need the output below:
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23661 BIKE BLUE
54519 BIKE BLACK
If CATEGORY and COLOR are the same, the ITEM number should be replaced with the first value.
Use GroupBy.transform with GroupBy.first on all rows:
df['ITEM'] = df.groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
If you want to process only the duplicated rows to improve performance (useful when most rows are unique and duplicates are few), build a mask with DataFrame.duplicated on the two columns with keep=False, apply the groupby only to the rows selected by boolean indexing, and assign the result back to the filtered ITEM column:
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
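To check that the masked version gives the same result as the plain transform, here is a small sketch on the question's sample data. The names full and masked are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})

# Plain version: transform over every row.
full = df.groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")

# Masked version: only rows whose (CATEGORY, COLOR) pair occurs more than once.
masked = df.copy()
m = masked.duplicated(["CATEGORY", "COLOR"], keep=False)
masked.loc[m, "ITEM"] = (
    masked[m].groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")
)
print(masked)
```

Whether the mask actually saves time depends on the share of duplicated rows, so it is worth timing on your own data.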

Pandas replace dataframe values based on criteria

I have a master dataframe, df:
Colour Item Price
Blue Car 40
Red Car 30
Green Truck 50
Green Bike 30
I then have a price correction dataframe, df_pc:
Colour Item Price
Red Car 60
Green Bike 70
If there is a match on Colour and Item in the price correction dataframe, I want to replace the price in the master df. So the expected output is:
Colour Item Price
Blue Car 40
Red Car 60
Green Truck 50
Green Bike 70
I can't find a way of doing this currently
Use Index.isin to filter out unmatched rows and then DataFrame.combine_first:
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
Another test, with an unmatched row in df_pc:
print (df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 <- not matched row
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 30.0
2 Green Truck 50.0
3 Red Car 60.0
Here is a way using combine_first():
df_pc.set_index(['Colour','Item']).combine_first(
df.set_index(['Colour','Item'])).reset_index()
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
EDIT:
If you want only matching items, we can also use merge with fillna:
print(df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 #changed row not matching
(df.merge(df_pc, on = ['Colour','Item'],how='left',suffixes=('_x',''))
.assign(Price=lambda x:x['Price'].fillna(x['Price_x'])).reindex(df.columns,axis=1))
Colour Item Price
0 Blue Car 40.0
1 Red Car 60.0
2 Green Truck 50.0
3 Green Bike 30.0
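For reference, a self-contained sketch of the isin plus combine_first approach, using the master frame from the question and the correction frame with the unmatched Orange Bike row. The names keys, master, fixes and result are only illustrative.

```python
import pandas as pd

# Master frame from the question.
df = pd.DataFrame({
    "Colour": ["Blue", "Red", "Green", "Green"],
    "Item": ["Car", "Car", "Truck", "Bike"],
    "Price": [40, 30, 50, 30],
})
# Correction frame with one row (Orange Bike) that has no match in df.
df_pc = pd.DataFrame({
    "Colour": ["Red", "Orange"],
    "Item": ["Car", "Bike"],
    "Price": [60, 70],
})

keys = ["Colour", "Item"]
master = df.set_index(keys)
fixes = df_pc.set_index(keys)

# Keep only corrections whose key exists in the master frame.
fixes = fixes[fixes.index.isin(master.index)]

# Corrected prices win; everything else falls back to the master price.
result = fixes.combine_first(master).reset_index()
print(result)
```

Note that the row order of the result follows the combined index rather than the original order of df, as in the outputs above.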

Python Pandas dataframe cross-referencing and new column generation

I want to generate a dataframe that contains lists of a person's potential favorite crayon colors, based on their favorite color. I have two dataframes that contain the necessary information:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
I want to reference one dataframe against the other by matching the df1 color entry to the df2 color entry, and returning the corresponding possible_crayons values as a list in a new column in df1. Any terms that did not find a match would be labeled N/A. So the desired output would be:
person favorite_color possible_crayons_list
Jeff blue [navy, aqua]
Marie purple [periwinkle, royal purple]
Jenna brown NaN
Mike green [forest green, pine]
I've tried:
mergedDF = pd.merge(df1, df2, how='left')
However, this results in the following:
person color possible_crayons
0 Jeff blue navy
1 Jeff blue aqua
2 Marie purple periwinkle
3 Marie purple royal purple
4 Jenna brown NaN
5 Mike green forest green
6 Mike green pine
Is there any way to achieve my desired output of lists?
We can use DataFrame.merge with how='left' and then GroupBy.agg with as_index=False:
new_df= ( df1.merge(df2,how='left',on='color')
.groupby(['color','person'],as_index=False).agg(list) )
Output
print(new_df)
color person possible_crayons
0 blue Jeff [navy, aqua]
1 brown Jenna [nan]
2 green Mike [forest green, pine]
3 purple Marie [periwinkle, royal purple]
Use this:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
tmp = df2.groupby('color')['possible_crayons'].apply(list)
mergedDF = df1.merge(tmp, how='left', left_on='color', right_index=True)
print(mergedDF)
mergedDF2 = mergedDF.groupby('color')['possible_crayons'].apply(list).reset_index(name='new_possible_crayons')
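If a plain NaN (rather than [nan]) is wanted for colors with no crayons, one option is to build the lists per color first and then look them up with Series.map. This is only a sketch on the question's data; crayons_by_color is an illustrative name, and possible_crayons_list is taken from the desired output.

```python
import pandas as pd

df1 = pd.DataFrame({
    "person": ["Jeff", "Marie", "Jenna", "Mike"],
    "color": ["blue", "purple", "brown", "green"],
})
df2 = pd.DataFrame({
    "possible_crayons": [
        "christmas red", "infra red", "scarlet", "sunset orange",
        "neon carrot", "lemon", "forest green", "pine",
        "navy", "aqua", "periwinkle", "royal purple",
    ],
    "color": [
        "red", "red", "red", "orange", "orange", "yellow",
        "green", "green", "blue", "blue", "purple", "purple",
    ],
})

# One list of crayons per color, indexed by color.
crayons_by_color = df2.groupby("color")["possible_crayons"].agg(list)

# Colors that do not occur in df2 (brown here) come back as NaN.
df1["possible_crayons_list"] = df1["color"].map(crayons_by_color)
print(df1)
```

This keeps one row per person and the original row order of df1.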
