Pandas replace dataframe values based on criteria - python

I have a master dataframe, df:
Colour Item Price
Blue Car 40
Red Car 30
Green Truck 50
Green Bike 30
I then have a price correction dataframe, df_pc:
Colour Item Price
Red Car 60
Green Bike 70
If there is a match on Colour and Item in the price correction dataframe, I want to replace the price in the master df, so the expected output is:
Colour Item Price
Blue Car 40
Red Car 60
Green Truck 50
Green Bike 70
I can't find a way of doing this currently

Use Index.isin to filter out non-matching rows, then DataFrame.combine_first:
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
Another test with different data:
print (df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 <- unmatched row
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
Colour Item Price
0 Blue Car 40.0
1 Green Bike 30.0
2 Green Truck 50.0
3 Red Car 60.0
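If mutating the master frame in place is acceptable, DataFrame.update collapses the filter-and-combine steps into one call. A minimal sketch under the same setup; update only overwrites cells where df_pc has non-NaN values, and unmatched df_pc rows are simply ignored by the index alignment:
df = df.set_index(['Colour','Item'])
df.update(df_pc.set_index(['Colour','Item']))   # in-place, index-aligned overwrite
df = df.reset_index()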

Here is a way using combine_first():
df_pc.set_index(['Colour','Item']).combine_first(
    df.set_index(['Colour','Item'])).reset_index()
Colour Item Price
0 Blue Car 40.0
1 Green Bike 70.0
2 Green Truck 50.0
3 Red Car 60.0
EDIT:
If you want only matching items, we can also use merge with fillna:
print(df_pc)
Colour Item Price
0 Red Car 60
1 Orange Bike 70 # changed to a non-matching row
(df.merge(df_pc, on=['Colour','Item'], how='left', suffixes=('_x',''))
   .assign(Price=lambda x: x['Price'].fillna(x['Price_x']))
   .reindex(df.columns, axis=1))
Colour Item Price
0 Blue Car 40.0
1 Red Car 60.0
2 Green Truck 50.0
3 Green Bike 30.0
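To see which correction rows found no match before choosing either behaviour, a left merge with indicator=True works as a quick diagnostic (a sketch on the original, un-indexed frames):
check = df_pc.merge(df[['Colour', 'Item']], on=['Colour', 'Item'], how='left', indicator=True)
print(check[check['_merge'] == 'left_only'])   # correction rows absent from the master df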

Related

Splitting a column into two in dataframe

Its solution is definitely out there, but I couldn't find it, so I'm posting it here.
I have a dataframe which is like
object_Id object_detail
0 obj00 red mug
1 obj01 red bowl
2 obj02 green mug
3 obj03 white candle holder
I want to split the column object_detail into two columns, object_color and name, based on a list that contains the color names:
COLOR = ['red', 'green', 'blue', 'white']
print(df)
# want to perform some operation to get the output below
  object_Id        object_detail object_color            name
0     obj00              red mug          red             mug
1     obj01             red bowl          red            bowl
2     obj02            green mug        green             mug
3     obj03  white candle holder        white  candle holder
This is my first time using dataframes, so I am not sure how to achieve it with pandas. I could do it by converting to a list and applying a filter, but I think there are easier ways that I might be missing. Thanks
Use Series.str.extract with the list values joined by | for regex OR; the second capture group puts everything after the separating whitespace into the new name column:
pat = "|".join(COLOR)
df[['object_color','name']] = df['object_detail'].str.extract(rf'({pat})\s+(.*)', expand=True)
print (df)
  object_Id        object_detail object_color            name
0     obj00              red mug          red             mug
1     obj01             red bowl          red            bowl
2     obj02            green mug        green             mug
3     obj03  white candle holder        white  candle holder
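If some object_detail values start with none of the listed colors, both extracted columns come back NaN. A small sketch of one possible fallback (keeping the whole detail as the name), with re.escape added in case a color ever contains regex metacharacters:
import re

pat = "|".join(map(re.escape, COLOR))
extracted = df['object_detail'].str.extract(rf'({pat})\s+(.*)', expand=True)
df['object_color'] = extracted[0]
df['name'] = extracted[1].fillna(df['object_detail'])   # no color matched -> keep the full detail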

Replace duplicates with first values in dataframe

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
df.loc[df.duplicated(['CATEGORY', 'COLOR', 'ITEM']), 'ITEM'] = 'ITEM' does not give me the required output. I need the output below:
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23661 BIKE BLUE
54519 BIKE BLACK
If CATEGORY and COLOR are the same, the ITEM number should be replaced with the first value.
Use GroupBy.transform with 'first' to broadcast each group's first value to all its rows:
df['ITEM'] = df.groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
If you want to filter only the duplicated rows to improve performance (when there are many unique rows and few duplicates), build a mask with DataFrame.duplicated on the two columns with keep=False, apply the groupby only to those rows via boolean indexing, and assign back to the filtered ITEM column:
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
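An equivalent without transform is to keep the first ITEM per (CATEGORY, COLOR) pair with drop_duplicates and merge it back; a sketch, assuming the default RangeIndex as in the example:
first_item = df.drop_duplicates(['CATEGORY', 'COLOR'])[['CATEGORY', 'COLOR', 'ITEM']]
# left merge keeps df's row order; suffixes leave the replacement ITEM unsuffixed
merged = df.merge(first_item, on=['CATEGORY', 'COLOR'], how='left', suffixes=('_orig', ''))
df['ITEM'] = merged['ITEM']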

Replace duplicate values from one DF to another DF

df1
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
df2
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54252 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
df2 has many other columns.
The output I need is:
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
From df1 it is clear that ITEM 54252 and 54519 are the same. So based on df1 we need to replace the values in df2.
I modified the previous solution: a new column orig remembers the original ITEM values, DataFrame.set_index builds a Series keyed by them, and Series.replace maps the values in the other DataFrame:
df = df1.assign(orig=df1['ITEM'])
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
s = df[m].set_index('orig')['ITEM']
print (s)
orig
54519 54519
23661 23661
23226 23661
54252 54519
Name: ITEM, dtype: int64
df2['ITEM'] = df2['ITEM'].replace(s)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
Another alternative, without the helper column, is replacing via a dictionary. Note the .copy(): without it, orig would share data with df1['ITEM'] and be overwritten by the assignment below:
orig = df1['ITEM'].copy()
m = df1.duplicated(['CATEGORY', 'COLOR'], keep=False)
df1.loc[m, 'ITEM'] = df1[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df1)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
d = dict(zip(orig[m], df1.loc[m, 'ITEM']))
print (d)
{54519: 54519, 23661: 23661, 23226: 23661, 54252: 54519}
df2['ITEM'] = df2['ITEM'].replace(d)
print (df2)
USERID WEBBROWSE ITEM PURCHASE
1 1541 CHROME 54519 YES
2 3351 EXPLORER 54519 YES
3 2639 MOBILE APP 23661 YES
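On large frames, Series.replace with a big mapping can be slow; Series.map with a fillna fallback is a common faster substitute. A sketch of that alternative to the replace call above, reusing the dictionary d:
# map is a vectorized hash lookup; unmapped values become NaN, hence the fillna
df2['ITEM'] = df2['ITEM'].map(d).fillna(df2['ITEM']).astype(int)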

How to collapse pandas rows for select column values to minimal combinations and map back to original rows

Context:
I have a pandas dataframe with 7 columns (taste, color, temperature, texture, shape, age_of_participant, name_of_participant).
Of the 7 columns, taste, color, temperature, texture and shape can have overlapping values across multiple rows (i.e taste could be sour for more than one row)
I'm trying to collapse all the rows into the lowest number of combinations given the taste, color, temperature, texture and shape values while ignoring NAs (in other words, overwriting them). The next part is to map each of these rows back to the original rows.
Mock data set:
data_set = [
{'color':'brown', 'age_of_participant':23, 'name_of_participant':'feb'},
{'taste': 'sour', 'color':'green', 'temperature': 'hot', 'age_of_participant':16,'name_of_participant': 'joe'},
{'taste': 'sour', 'color':'green', 'texture':'soft', 'age_of_participant':17,'name_of_participant': 'jane'},
{'color':'green','age_of_participant':18,'name_of_participant': 'jeff'},
{'taste': 'sweet', 'color':'red', 'age_of_participant':19,'name_of_participant': 'joke'},
{'taste': 'sweet', 'temperature': 'cold', 'age_of_participant':20,'name_of_participant': 'jolly'},
{'taste': 'salty', 'color':'purple', 'texture':'soft', 'age_of_participant':21,'name_of_participant': 'jupyter'},
{'taste': 'salty', 'color':'brown', 'age_of_participant':22,'name_of_participant': 'january'}
]
import pandas as pd
import random
data_set = random.sample(data_set, k=len(data_set))
data_frame = pd.DataFrame(data_set)
print(data_frame)
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot NaN
1 17 green jane sour NaN soft
2 18 green jeff NaN NaN NaN
3 19 red joke sweet NaN NaN
4 20 NaN jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
What I've attempted:
# These columns are used to do the grouping since age_of_participant and name_of_participant are unique per row
values_that_can_be_grouped = ['taste', 'color','temperature','texture']
sub_set = data_frame[values_that_can_be_grouped].drop_duplicates().reset_index(drop=False)
my_unique_set = sub_set.groupby('taste', as_index=False).first()
print(my_unique_set)
taste index color temperature texture
0 2 green
1 salty 6 brown
2 sour 1 green soft
3 sweet 4 cold
At this point I'm not quite sure how I can map the rows above to all the original rows except for indices 2, 6, 1, 4. I checked the pandas code, and it doesn't look like the other indices are preserved anywhere.
What I'm trying to achieve:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
data_frame.assign(color=data_frame.color.ffill()).groupby('color').apply(lambda x: x.ffill().bfill())
Out[1089]:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
IIUC, I feel that using ffill and bfill within each taste and color first, then grouping by both, is safer here:
df.taste.fillna(df.groupby('color').taste.apply(lambda x : x.ffill().bfill()),inplace=True)
df.color.fillna(df.groupby('taste').color.apply(lambda x : x.ffill().bfill()),inplace=True)
df=df.groupby(['color','taste']).apply(lambda x : x.ffill().bfill())
df
age_of_participant color ... temperature texture
0 16 green ... hot soft
1 17 green ... hot soft
2 18 green ... hot soft
3 19 red ... cold NaN
4 20 red ... cold NaN
5 21 purple ... NaN soft
6 22 brown ... NaN NaN
[7 rows x 6 columns]
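As a design note, groupby(...).apply(lambda x: x.ffill().bfill()) calls the lambda once per group and can be slow with many groups; DataFrameGroupBy.ffill and bfill do the same per-group filling directly. A sketch of the first answer's idea in that form, assuming color is forward-filled first to serve as the grouping key:
filled = data_frame.assign(color=data_frame['color'].ffill())
cols = ['taste', 'temperature', 'texture']
filled[cols] = filled.groupby('color')[cols].ffill()   # fill downwards within each color group
filled[cols] = filled.groupby('color')[cols].bfill()   # then upwards for leading NaNs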

Pandas fill in missing dates in DataFrame with multiple columns

I want to add missing dates for a specific date range, but keep all columns. I found many posts using asfreq(), resample(), and reindex(), but they seemed to be for Series and I couldn't get them to work for my DataFrame.
Given a sample dataframe:
data = [{'id' : '123', 'product' : 'apple', 'color' : 'red', 'qty' : 10, 'week' : '2019-3-7'}, {'id' : '123', 'product' : 'apple', 'color' : 'blue', 'qty' : 20, 'week' : '2019-3-21'}, {'id' : '123', 'product' : 'orange', 'color' : 'orange', 'qty' : 8, 'week' : '2019-3-21'}]
df = pd.DataFrame(data)
color id product qty week
0 red 123 apple 10 2019-3-7
1 blue 123 apple 20 2019-3-21
2 orange 123 orange 8 2019-3-21
My goal is to return below; filling in qty as 0, but fill other columns. Of course, I have many other ids. I would like to be able to specify the start/end dates to fill; this example uses 3/7 to 3/21.
color id product qty week
0 red 123 apple 10 2019-3-7
1 blue 123 apple 20 2019-3-21
2 orange 123 orange 8 2019-3-21
3 red 123 apple 0 2019-3-14
4 red 123 apple 0 2019-3-21
5 blue 123 apple 0 2019-3-7
6 blue 123 apple 0 2019-3-14
7 orange 123 orange 0 2019-3-7
8 orange 123 orange 0 2019-3-14
How can I keep the remainder of my DataFrame intact?
In your case, you just need unstack and stack plus reindex. Note the fill_value=0 in unstack: without it, the combinations that are already missing become NaN and get dropped again by stack.
df.week=pd.to_datetime(df.week)
s=pd.date_range(df.week.min(),df.week.max(),freq='7 D')
df=df.set_index(['color','id','product','week']).\
    qty.unstack(fill_value=0).reindex(columns=s,fill_value=0).stack().reset_index()
df
    color   id product    level_3   0
0    blue  123   apple 2019-03-07   0
1    blue  123   apple 2019-03-14   0
2    blue  123   apple 2019-03-21  20
3  orange  123  orange 2019-03-07   0
4  orange  123  orange 2019-03-14   0
5  orange  123  orange 2019-03-21   8
6     red  123   apple 2019-03-07  10
7     red  123   apple 2019-03-14   0
8     red  123   apple 2019-03-21   0
The last two columns can be renamed afterwards, e.g. df.rename(columns={'level_3': 'week', 0: 'qty'}).
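An alternative without reshaping is to cross the unique (color, id, product) combinations with the wanted dates and left-merge the original rows back. A sketch starting from the original df (week already converted to datetime; how='cross' needs pandas 1.2+):
keys = df[['color', 'id', 'product']].drop_duplicates()
dates = pd.DataFrame({'week': pd.date_range(df.week.min(), df.week.max(), freq='7D')})
out = (keys.merge(dates, how='cross')                        # every combination x every week
           .merge(df, on=['color', 'id', 'product', 'week'], how='left')
           .fillna({'qty': 0}))                              # missing weeks get qty 0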
One option is to use the complete function from pyjanitor to expose the implicitly missing rows; afterwards you can fill with fillna:
# pip install pyjanitor
import pandas as pd
import janitor
df.week = pd.to_datetime(df.week)
# create new dates, which will be used to expand the dataframe
new_dates = {"week": pd.date_range(df.week.min(), df.week.max(), freq="7D")}
# use the complete function
# note how color, id and product are wrapped together
# this ensures only missing values based on data in the dataframe are exposed
# if you want all combinations, then get rid of the tuple
(df
.complete(("color", "id", "product"), new_dates, sort = False)
.fillna({'qty': 0}, downcast='infer')
)
id product color qty week
0 123 apple red 10 2019-03-07
1 123 apple blue 20 2019-03-21
2 123 orange orange 8 2019-03-21
3 123 apple red 0 2019-03-14
4 123 apple red 0 2019-03-21
5 123 apple blue 0 2019-03-07
6 123 apple blue 0 2019-03-14
7 123 orange orange 0 2019-03-07
8 123 orange orange 0 2019-03-14
