Replace duplicates with first values in dataframe

Replace duplicates with first values in dataframe - python

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
df.loc[df.duplicated(['CATEGORY', 'COLOR','ITEM']), 'ITEM'] = 'ITEM' Does not give me required output. I need the output a below.
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23661 BIKE BLUE
54519 BIKE BLACK
If the CATEGORY and COLOR are the same replace the ITEM number should be replaced with the first value.

Use GroupBy.transform with GroupBy.first by all values:
df['ITEM'] = df.groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
If want filter only duplicated for improve performance (if is more unique rows and less duplicates) add DataFrame.duplicated by 2 columns with keep=False and apply groupby only for filter rows by boolean indexing, also assign to filtered column ITEM:
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')

Related

Splitting a column into two in dataframe

It's solution is definitely out there but I couldn't find it. So posting it here.
I have a dataframe which is like
object_Id object_detail
0 obj00 red mug
1 obj01 red bowl
2 obj02 green mug
3 obj03 white candle holder
I want to split the column object_details into two columns: name, object_color based on a list that contains the color name
COLOR = ['red', 'green', 'blue', 'white']
print(df)
# want to perform some operation so that It'll get output
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
This is my first time using dataframe so I am not sure how to achieve it using pandas. I can achieve it by converting it into a list and then apply a filter. But I think there are easier ways out there that I might miss. Thanks

Use Series.str.extract with joined values of list by | for regex OR and then all another values in new column splitted by space:
pat = "|".join(COLOR)
df[['object_color','name']] = df['object_detail'].str.extract(f'({pat})\s+(.*)',expand=True)
print (df)
object_Id object_detail object_color name
0 obj00 Barbie Pink frock Barbie Pink frock
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder

Pandas - organize and count unique data

How to do this with pandas?
I have this dataset, which consists of a list of cars and its colors (cars and colors may repeat):
Color Car
Blue Honda
Green Honda
Green Honda
Blue fiat
Black fiat
....
Yellow nissan
I would like to create a column for each car with its respective color (without duplicated colors related to each car). In the example, Honda & green happens twice, but in the honda-column ” green” would appear only once.
Something like this:
+----------------------+------------+----------------------+---------+
| Color | Car | Honda | Fiat |
+----------------------+------------+----------------------+---------+
| Blue | Honda |Blue |Blue
| Green | Honda Green |Black
| Green | Honda |Yellow
| Blue | fiat
| Black | fiat
….
| Yellow | nissan
+-----------------------------------+------------+--------+
I also would like to know how many colors (no duplicate) each car has (amount of unique items in the column "Colar" related to each item in the "Car" column).

try join with pd.crosstab
df1 = df.join(
pd.crosstab(df.index, df["Car"], df["Color"], aggfunc="first").fillna(" ")
)
print(df1)
Color Car Honda fiat nissan
0 Blue Honda Blue
1 Green Honda Green
2 Green Honda Green
3 Blue fiat Blue
4 Black fiat Black
5 Yellow nissan Yellow
For unique colors according to your example output we can create a boolean mask and apply this back to the values parameter in pd.crosstab
unique_color = np.where(
df.groupby(['Car','Color']).cumcount().ge(1), "", df["Color"]
)
df1 = df.join(pd.crosstab(df.index, df["Car"], unique_color, aggfunc="first").fillna(" ")
)
print(df1)
Color Car Honda fiat nissan
0 Blue Honda Blue
1 Green Honda Green
2 Green Honda
3 Blue fiat Blue
4 Black fiat Black
5 Yellow nissan Yellow

Mutli indexing using pandas

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
I tried using the below code
df.groupby(['CATEGORY', 'COLOR']).size().unstack(fill_value=0)
which does not give me what I need. The output I am expecting is
CATEGORY COLOR ITEM
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685

I believe you need DataFrame.set_index with DataFrame.sort_index:
df1 = df.set_index(['CATEGORY', 'COLOR']).sort_index()
print (df1)
ITEM
CATEGORY COLOR
BIKE BLACK 54519
BLUE 23661
CAR BLACK 14582
RED 48684
JEEP WHITE 45685
If order is omportant convert both columns to ordered categoricals:
cols = ['CATEGORY', 'COLOR']
for c in cols:
df[c] = pd.Categorical(df[c], categories=df[c].drop_duplicates(), ordered=True)
df1 = df.set_index(cols).sort_index()
print (df1)
ITEM
CATEGORY COLOR
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685

I will use Categorical first, since this will keep the original CATEGORY order
df.CATEGORY=pd.Categorical(df.CATEGORY,df.CATEGORY.unique())
s=df.sort_values(['CATEGORY','COLOR']).set_index(['CATEGORY','COLOR'])
ITEM
CATEGORY COLOR
CAR BLACK 14582
RED 48684
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685

Replace duplicate values from one DF to another DF

df1
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
df2
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54252 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
df2 has many other columns.
The output I need is:
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
From df1 it is clear that ITEM 54252 and 54519 are the same. So based on df1 we need to replace the values in df2.

I modify previous solution with new column orig for remember original values of ITEM, create Series by DataFrame.set_index and Series.replace values in another DataFrame:
df = df1.assign(orig=df1['ITEM'])
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
s = df[m].set_index('orig')['ITEM']
print (s)
orig
54519 54519
23661 23661
23226 23661
54252 54519
Name: ITEM, dtype: int64
df2['ITEM'] = df2['ITEM'].replace(s)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
Another alternative without new column is replace by dictionary:
orig = df1['ITEM']
m = df1.duplicated(['CATEGORY', 'COLOR'], keep=False)
df1.loc[m, 'ITEM'] = df1[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df1)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
d = dict(zip(orig[m], df1.loc[m, 'ITEM']))
print (d)
{54519: 54519, 23661: 23661}
df2['ITEM'] = df2['ITEM'].replace(d)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54252 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES

Loop through pandas grouped data index

Struggling newbie, we are trying to list the items of a grouped dataframe.
To highlight the problem please see a simplified example below.
First group the items:
data = {'colour': ['red','purple','green','purple','blue','red'], 'item': ['hat','scarf','belt','belt','hat','scarf'], 'material': ['felt','wool','leather','wool','plastic','wool']}
df = pd.DataFrame(data=data)
grpd_df = df.groupby(df['item']).apply(lambda df:df.reset_index(drop=True))
grpd_df
colour item material
item
belt 0 green belt leather
1 purple belt wool
hat 0 red hat felt
1 blue hat plastic
scarf 0 purple scarf wool
1 red scarf wool
Then find all the rows that have a red item
df = grpd_df[grpd_df['colour'].eq('red').groupby(level=0).transform('any')]
print (df)
colour item material
item
hat 0 red hat felt
1 blue hat plastic
scarf 0 purple scarf wool
1 red scarf wool
We would like to then loop over a list of the items in the grpd_df i.e. hat and scarf. We've tried a df.index.levels but this outputs all the items including belt.

You can using IndexSlice and get_level_values, to achieve it .
grpd_df.loc[pd.IndexSlice[list(set(df.index.get_level_values(0).tolist())),:]]
Out[302]:
colour item material
item
hat 0 red hat felt
1 blue hat plastic
scarf 0 purple scarf wool
1 red scarf wool
If need the level of the index from df
set(df.index.get_level_values(0).tolist())
Out[303]: {'hat', 'scarf'}

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replace duplicates with first values in dataframe - python

Related

Splitting a column into two in dataframe

Pandas - organize and count unique data

Mutli indexing using pandas

Replace duplicate values from one DF to another DF

Loop through pandas grouped data index

Categories

Resources