How to do this with pandas?
I have this dataset, which consists of a list of cars and their colors (cars and colors may repeat):
Color Car
Blue Honda
Green Honda
Green Honda
Blue fiat
Black fiat
....
Yellow nissan
I would like to create a column for each car with its respective colors (without duplicated colors for each car). In the example, Honda and Green occur together twice, but in the Honda column "Green" would appear only once.
Something like this:
+----------------------+------------+----------------------+---------+
| Color | Car | Honda | Fiat |
+----------------------+------------+----------------------+---------+
| Blue | Honda |Blue |Blue
| Green | Honda Green |Black
| Green | Honda |Yellow
| Blue | fiat
| Black | fiat
….
| Yellow | nissan
+-----------------------------------+------------+--------+
I would also like to know how many colors (without duplicates) each car has, i.e. the number of unique items in the "Color" column for each item in the "Car" column.
Try join with pd.crosstab:
df1 = df.join(
pd.crosstab(df.index, df["Car"], df["Color"], aggfunc="first").fillna(" ")
)
print(df1)
Color Car Honda fiat nissan
0 Blue Honda Blue
1 Green Honda Green
2 Green Honda Green
3 Blue fiat Blue
4 Black fiat Black
5 Yellow nissan Yellow
For unique colors, as in your example output, we can use a boolean mask to blank out repeated (Car, Color) pairs and pass the result as the values parameter of pd.crosstab:
import numpy as np
unique_color = np.where(
    df.groupby(['Car','Color']).cumcount().ge(1), "", df["Color"]
)
df1 = df.join(
    pd.crosstab(df.index, df["Car"], unique_color, aggfunc="first").fillna(" ")
)
print(df1)
Color Car Honda fiat nissan
0 Blue Honda Blue
1 Green Honda Green
2 Green Honda
3 Blue fiat Blue
4 Black fiat Black
5 Yellow nissan Yellow
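The question also asks for the number of distinct colors per car, which the crosstab output above does not show directly. A minimal sketch of one way to get it with GroupBy.nunique, assuming only the six sample rows shown in the question (the rows elided by "...." are left out); the names colors_per_car and unique_colors are illustrative, not from the original post:

```python
import pandas as pd

# Sample rows from the question (elided rows omitted)
df = pd.DataFrame({
    "Color": ["Blue", "Green", "Green", "Blue", "Black", "Yellow"],
    "Car": ["Honda", "Honda", "Honda", "fiat", "fiat", "nissan"],
})

# Number of distinct colors for each car
colors_per_car = df.groupby("Car")["Color"].nunique()

# One column per car holding its distinct colors, in order of first appearance;
# shorter columns are padded with NaN
unique_colors = pd.concat(
    {car: pd.Series(colors.unique()) for car, colors in df.groupby("Car")["Color"]},
    axis=1,
)

print(colors_per_car)
print(unique_colors)
```

Unlike the crosstab approach, unique_colors is a compact table that is not aligned with the original rows, which may or may not be what is wanted.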
Related
The solution is definitely out there, but I couldn't find it, so I am posting the question here.
I have a dataframe like this:
object_Id object_detail
0 obj00 red mug
1 obj01 red bowl
2 obj02 green mug
3 obj03 white candle holder
I want to split the column object_detail into two columns, name and object_color, based on a list that contains the color names:
COLOR = ['red', 'green', 'blue', 'white']
print(df)
# want to perform some operation so that I get this output
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
This is my first time using a dataframe, so I am not sure how to achieve this with pandas. I can do it by converting the column to a list and then applying a filter, but I think there are easier ways that I might be missing. Thanks.
Use Series.str.extract with the list values joined by | (regex OR) to capture the color, and capture everything after the following whitespace as the name in a second new column:
pat = "|".join(COLOR)
df[['object_color','name']] = df['object_detail'].str.extract(rf'({pat})\s+(.*)', expand=True)
print (df)
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
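For reference, a self-contained sketch of the same idea, assuming the four sample rows from the question. Two details here are additions rather than part of the answer above: re.escape guards against color names that contain regex metacharacters, and the ^ anchor requires the color to be the first word.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "object_Id": ["obj00", "obj01", "obj02", "obj03"],
    "object_detail": ["red mug", "red bowl", "green mug", "white candle holder"],
})
COLOR = ["red", "green", "blue", "white"]

# Build an alternation such as red|green|blue|white, escaping each entry
pat = "|".join(map(re.escape, COLOR))

# Group 1 captures the leading color, group 2 captures the rest of the text
df[["object_color", "name"]] = df["object_detail"].str.extract(rf"^({pat})\s+(.*)$")

print(df)
```

Rows whose text does not start with one of the listed colors get NaN in both new columns.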
df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
I tried using the below code
df.groupby(['CATEGORY', 'COLOR']).size().unstack(fill_value=0)
which does not give me what I need. The output I am expecting is
CATEGORY COLOR ITEM
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
I believe you need DataFrame.set_index with DataFrame.sort_index:
df1 = df.set_index(['CATEGORY', 'COLOR']).sort_index()
print (df1)
ITEM
CATEGORY COLOR
BIKE BLACK 54519
BLUE 23661
CAR BLACK 14582
RED 48684
JEEP WHITE 45685
If order is important, convert both columns to ordered categoricals:
cols = ['CATEGORY', 'COLOR']
for c in cols:
    df[c] = pd.Categorical(df[c], categories=df[c].drop_duplicates(), ordered=True)
df1 = df.set_index(cols).sort_index()
print (df1)
ITEM
CATEGORY COLOR
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
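A self-contained sketch of the ordered-categorical approach, assuming the five sample rows from the question, for anyone who wants to run it end to end. The categories are taken in order of first appearance, so sort_index follows the original order instead of alphabetical order:

```python
import pandas as pd

df = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE"],
})

cols = ["CATEGORY", "COLOR"]
for c in cols:
    # Category order = order of first appearance in the column
    df[c] = pd.Categorical(df[c], categories=df[c].drop_duplicates(), ordered=True)

# Sorting the MultiIndex now respects the categorical order
df1 = df.set_index(cols).sort_index()
print(df1)
```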
I would use Categorical first, since this keeps the original CATEGORY order:
df.CATEGORY=pd.Categorical(df.CATEGORY,df.CATEGORY.unique())
s=df.sort_values(['CATEGORY','COLOR']).set_index(['CATEGORY','COLOR'])
ITEM
CATEGORY COLOR
CAR BLACK 14582
RED 48684
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
I tried
df.loc[df.duplicated(['CATEGORY', 'COLOR','ITEM']), 'ITEM'] = 'ITEM'
but it does not give me the required output. I need the output below:
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23661 BIKE BLUE
54519 BIKE BLACK
If CATEGORY and COLOR are the same, the ITEM number should be replaced with the first value in that group.
Use GroupBy.transform with GroupBy.first to broadcast the first ITEM of each group to all of its rows:
df['ITEM'] = df.groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
To improve performance when most rows are unique and only a few are duplicated, first select the duplicated rows with DataFrame.duplicated on the 2 columns with keep=False, apply the groupby only to those rows via boolean indexing, and assign the result back to the filtered ITEM column:
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
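To check that the two variants agree, here is a small runnable sketch assuming the seven sample rows from the question. The names full and filtered are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})

# Variant 1: transform over every row
full = df.groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")

# Variant 2: transform only rows whose (CATEGORY, COLOR) pair occurs more than once
m = df.duplicated(["CATEGORY", "COLOR"], keep=False)
filtered = df["ITEM"].copy()
filtered.loc[m] = df[m].groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")

print(full.equals(filtered))
```

Whether the filtered variant is actually faster depends on the share of duplicated rows, so it is worth timing on real data.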
df1
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
df2
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54252 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
df2 has many other columns.
The output I need is:
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
From df1 it is clear that ITEM 54252 and 54519 are the same. So based on df1 we need to replace the values in df2.
I modified the previous solution by adding a new column orig to remember the original values of ITEM, then created a Series with DataFrame.set_index and used Series.replace on the values in the other DataFrame:
df = df1.assign(orig=df1['ITEM'])
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
s = df[m].set_index('orig')['ITEM']
print (s)
orig
54519 54519
23661 23661
23226 23661
54252 54519
Name: ITEM, dtype: int64
df2['ITEM'] = df2['ITEM'].replace(s)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
An alternative without a new column is to replace with a dictionary:
# copy, so the original values survive the assignment below
orig = df1['ITEM'].copy()
m = df1.duplicated(['CATEGORY', 'COLOR'], keep=False)
df1.loc[m, 'ITEM'] = df1[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df1)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
d = dict(zip(orig[m], df1.loc[m, 'ITEM']))
print (d)
{54519: 54519, 23661: 23661, 23226: 23661, 54252: 54519}
df2['ITEM'] = df2['ITEM'].replace(d)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
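A self-contained sketch of the dictionary variant, assuming the sample df1 and df2 from the question (only the four df2 columns shown). Keeping only the pairs where the ITEM actually changes is a choice made here, not part of the answer above; it keeps the mapping small and free of identity entries. The names canonical, changed and mapping are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})
df2 = pd.DataFrame(
    {
        "USERID": [1541, 3351, 2639],
        "WEBBROWSE": ["CHROME", "EXPLORER", "MOBILE APP"],
        "ITEM": [54252, 54519, 23661],
        "PURCHASE": ["YES", "YES", "YES"],
    },
    index=[1, 2, 3],
)

# First ITEM seen for each (CATEGORY, COLOR) pair, aligned with df1's rows
canonical = df1.groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")

# Keep only items that differ from their canonical value:
# maps 23226 to 23661 and 54252 to 54519
changed = df1["ITEM"] != canonical
mapping = dict(zip(df1.loc[changed, "ITEM"], canonical[changed]))

df2["ITEM"] = df2["ITEM"].replace(mapping)
print(df2)
```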
I have a master dataframe, df:
Colour Item Price
Blue Car 40
Red Car 30
Green Truck 50
Green Bike 30
I then have a price correction dataframe, df_pc:
Colour Item Price
Red Car 60
Green Bike 70
I want to say: if there is a match on Colour and Item in the price correction dataframe, then replace the price in the master df. So the expected output is:
Colour Item Price
Blue Car 40
Red Car 60
Green Truck 50
Green Bike 70
I can't find a way of doing this currently
Use Index.isin to filter out rows that do not match, and then DataFrame.combine_first:
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
Test with other data:
print (df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 <- not matched row
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 30.0
2 Green Truck 50.0
3 Red Car 60.0
Here is a way using combine_first():
df_pc.set_index(['Colour','Item']).combine_first(
df.set_index(['Colour','Item'])).reset_index()
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
EDIT:
If you want only matching items, we can also use merge with fillna:
print(df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 #changed row not matching
(df.merge(df_pc, on = ['Colour','Item'],how='left',suffixes=('_x',''))
.assign(Price=lambda x:x['Price'].fillna(x['Price_x'])).reindex(df.columns,axis=1))
Colour Item Price
0 Blue Car 40.0
1 Red Car 60.0
2 Green Truck 50.0
3 Green Bike 30.0
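A runnable sketch of the merge idea, assuming the sample master frame from the question and a correction frame that contains both matching rows plus one non-matching row (the Orange Bike price of 99 is made up for illustration). Unlike the snippet above, it suffixes the correction column instead of the master column; the suffix _new and the name result are arbitrary choices:

```python
import pandas as pd

df = pd.DataFrame({
    "Colour": ["Blue", "Red", "Green", "Green"],
    "Item": ["Car", "Car", "Truck", "Bike"],
    "Price": [40, 30, 50, 30],
})
df_pc = pd.DataFrame({
    "Colour": ["Red", "Green", "Orange"],
    "Item": ["Car", "Bike", "Bike"],
    "Price": [60, 70, 99],  # Orange Bike has no match in df (hypothetical row)
})

keys = ["Colour", "Item"]

# A left merge keeps every master row, in the master's order;
# correction prices land in Price_new, NaN where there is no match
merged = df.merge(df_pc, on=keys, how="left", suffixes=("", "_new"))

# Prefer the corrected price, fall back to the original one
merged["Price"] = merged["Price_new"].fillna(merged["Price"])
result = merged.drop(columns="Price_new")
print(result)
```

Price comes back as float because the unmatched rows hold NaN before the fill; cast with astype(int) afterwards if integer prices are needed.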