Update a DataFrame with duplicate destination - python

I would like to update a DataFrame with values from another one, where a single source row may need to be written to multiple destination rows. Here is an example:
df1 = pd.DataFrame({'name':['A', 'B', 'C', 'A'], 'category':['X', 'X', 'Y', 'Y'], 'value1':[None, 1, None, None], 'value2':[None, 10, None, None]})
name category value1 value2
0 A X NaN NaN
1 B X 1.0 10.0
2 C Y NaN NaN
3 A Y NaN NaN
df2 = pd.DataFrame({'name':['A', 'C'], 'value1':[2, 3], 'value2':[11, 12]})
name value1 value2
0 A 2 11
1 C 3 12
And the desired result would be
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
I don't think DataFrame.update works here, since 'A' appears twice in my first DataFrame.
pd.merge creates extra suffixed columns, and I suspect there is a more elegant way than merging and then combining those columns manually.
Thanks in advance for your help!

You can use fillna after mapping the name column of df1 to the corresponding values from df2. For a single column (value1 here, matching the question's column names):
mapping = df2.set_index('name')['value1']
df1['value1'] = df1['value1'].fillna(df1['name'].map(mapping))
If you want to map multiple columns:
mapping = df2.set_index('name')
for col in mapping:
    df1[col] = df1[col].fillna(df1['name'].map(mapping[col]))
Alternatively you can try merge:
df = df1.merge(df2, on='name', how='left', suffixes=['', '_r'])
# strip only the literal '_r' suffix; str.rstrip('_r') would strip *any* trailing '_'/'r' characters
df.groupby(df.columns.str.replace(r'_r$', '', regex=True), axis=1, sort=False).first()
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
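Putting the mapping approach together as a runnable sketch on the question's sample frames (illustrative only; any column layout with a lookup key works the same way):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['A', 'B', 'C', 'A'],
                    'category': ['X', 'X', 'Y', 'Y'],
                    'value1': [None, 1, None, None],
                    'value2': [None, 10, None, None]})
df2 = pd.DataFrame({'name': ['A', 'C'],
                    'value1': [2, 3],
                    'value2': [11, 12]})

# One lookup Series per column, indexed by name; fillna only touches the gaps,
# so B's existing values survive even though B is absent from df2.
mapping = df2.set_index('name')
for col in mapping:
    df1[col] = df1[col].fillna(df1['name'].map(mapping[col]))

print(df1)
```

Because the lookup is a map rather than a merge, duplicate names in df1 are no problem: both 'A' rows receive the same values.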

Related

Logic to insert 0 for multiple columns in first row - Pandas

I have a df that may contain NaN values in two columns of the first row. If so, I want to replace those values with 0; if there are integers there, leave them as is. So this df should replace X and Y with 0.
df = pd.DataFrame({
    'Code1': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Code2': [np.nan, np.nan, 5, np.nan, np.nan, 10],
    'X': [np.nan, np.nan, 1, np.nan, np.nan, 3],
    'Y': [np.nan, np.nan, 2, np.nan, np.nan, 4],
})
1)
if df.loc[0, 'X':'Y'] == np.nan:
    df.loc[:0, 'X':'Y'] = 0
2)
if df.loc[[0], ['X', 'Y']].isnull():
    df.loc[:0, 'X':'Y'] = 0
else:
    pass
But this example df should not replace with 0 as integers exist:
df1 = pd.DataFrame({
    'Code1': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Code2': [np.nan, np.nan, 5, np.nan, np.nan, 10],
    'X': [5, np.nan, 1, np.nan, np.nan, 3],
    'Y': [6, np.nan, 2, np.nan, np.nan, 4],
})
df.loc[0, ['X', 'Y']] = df.loc[0, ['X', 'Y']].fillna(0)
>>> df
Code1 Code2 X Y
0 A NaN 0.0 0.0
1 A NaN NaN NaN
2 B 5.0 1.0 2.0
3 B NaN NaN NaN
4 C NaN NaN NaN
5 C 10.0 3.0 4.0
Try this:
if sum(pd.isnull(df.loc[0, 'X':'Y'])) == 2:
    df.loc[0, ['X', 'Y']] = df.loc[0, ['X', 'Y']].fillna(0)
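As a runnable sketch of the same idea, an .isna().all() check reads a little more directly than summing the null flags (identical behaviour for two columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Code1': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Code2': [np.nan, np.nan, 5, np.nan, np.nan, 10],
    'X': [np.nan, np.nan, 1, np.nan, np.nan, 3],
    'Y': [np.nan, np.nan, 2, np.nan, np.nan, 4],
})

# Fill only when every value in the first row's X/Y slice is missing;
# if either holds an integer, the row is left untouched.
if df.loc[0, ['X', 'Y']].isna().all():
    df.loc[0, ['X', 'Y']] = 0

print(df.head(2))
```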

DataFrame: take union of columns and retain first non-NaN value

DataFrame df has many thousands of columns and rows. For a subset of columns given in a particular sequence, say columns B, C, E, I want to fill NaN values in B with the first non-NaN value found in the remaining columns (C, E), searching sequentially. Finally, C and E are dropped.
Sample df can be built as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(10*(2+np.random.randn(6, 5)), columns=list('ABCDE'))
df.loc[1, 'B'] = np.nan
df.loc[2, 'B'] = np.nan
df.loc[5, 'B'] = np.nan
df.loc[2, 'C'] = np.nan
df.loc[5, 'C'] = np.nan
df.loc[2, 'D'] = np.nan
df.loc[2, 'E'] = np.nan
df.loc[4, 'E'] = np.nan
df
A B C D E
0 18.161033 6.453597 25.253036 18.542586 20.667311
1 27.629402 NaN 40.654821 22.804547 23.633502
2 15.459256 NaN NaN NaN NaN
3 19.115203 4.002131 14.167508 23.796780 29.557706
4 27.180622 NaN 20.763618 15.923794 NaN
5 17.917170 NaN NaN 21.865184 9.867743
The expected outcome is as follows:
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here is one way:
drop = ['C', 'E']
fill = 'B'
d = dict(zip(df.columns, [fill if x in drop else x for x in df.columns]))
df.groupby(d, axis=1).first()
Out[172]: (values differ from the sample above because the frame is generated without a fixed random seed)
A B D
0 14.472915 30.598602 24.528571
1 22.010242 22.215140 15.412039
2 5.383674 NaN NaN
3 38.265940 24.746673 35.367622
4 22.730089 20.244289 27.570413
5 31.216037 15.496690 9.746814
IIUC, use bfill to backfill, then drop to remove unwanted columns.
df.assign(B=df[['B', 'C', 'E']].bfill(axis=1)['B']).drop(['C', 'E'], axis=1)
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here's a slightly more generalised version of the one above,
to_drop = ['C', 'E']
upd = 'B'
df.update(df[[upd, *to_drop]].bfill(axis=1)[upd]) # in-place
df.drop(to_drop, axis=1) # not in-place, need to assign
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
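A minimal runnable sketch of the bfill recipe, on a small hand-made frame rather than the question's random data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0],
                   'B': [np.nan, 5.0],
                   'C': [3.0, np.nan],
                   'D': [7.0, 8.0],
                   'E': [4.0, 9.0]})

# Backfill left-to-right across the chosen columns, keep the first column,
# then drop the donor columns. D is untouched because it is not in the list.
out = (df.assign(B=df[['B', 'C', 'E']].bfill(axis=1)['B'])
         .drop(['C', 'E'], axis=1))
print(out)
```

Row 0's missing B is filled from C (the first non-NaN to its right in the B, C, E sequence); row 1's B already has a value and is kept.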

Fill missing value

I have a dataset with a column, Item_Weight, that has 2439 missing values.
The missing values follow a pattern: for a given Item_Identifier, some rows have Item_Weight filled in and some do not (compare the Item_Identifier and Item_Weight columns).
Since the weight for a given identifier is the same, is there a way in Python to fill the missing Item_Weight values from the other rows with the same identifier?
You can make the table into a pandas DataFrame and, for a single known value, call df['item_weight'].fillna(15.5, inplace=True); more generally, fill per group.
Reproducible example:
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'col2': [10, np.nan, np.nan, np.nan, 20, 30]})
col1 col2
0 a 10.0
1 a NaN
2 b NaN
3 b NaN
4 b 20.0
5 c 30.0
You can group by col1 and aggregate with first:
vals = df.groupby('col1').agg('first')
col2
col1
a 10.0
b 20.0
c 30.0
Then use the same indexing and fillna() to align and fill the values:
df = df.set_index('col1').fillna(vals).reset_index()
col1 col2
0 a 10.0
1 a 10.0
2 b 20.0
3 b 20.0
4 b 20.0
5 c 30.0
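An alternative worth knowing: groupby with transform('first') broadcasts each group's first non-NaN value back to the original row positions, so no set_index/reset_index round trip is needed. A small sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'col2': [10, np.nan, np.nan, np.nan, 20, 30]})

# transform('first') returns the first non-NaN per group, aligned row-by-row
# with the original frame, so it slots straight into fillna.
df['col2'] = df['col2'].fillna(df.groupby('col1')['col2'].transform('first'))
print(df)
```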

Pandas update and add rows one dataframe with key column in another dataframe

I have 2 data frames with identical columns. Column 'key' will have unique values.
DataFrame 1:
A B key C
0 1 k1 2
1 2 k2 3
2 3 k3 5
DataFrame 2:
A B key C
4 5 k1 2
1 2 k2 3
2 3 k4 5
I would like to update rows in Dataframe-1 with values in Dataframe -2 if key in Dataframe -2 matches with Dataframe -1.
Also if key is new then add entire row from Dataframe-2 to Dataframe-1.
Final Output Dataframe is like this with same columns.
A B key C
4 5 k1 2 --> update
1 2 k2 3 --> no changes
2 3 k3 5 --> no changes
2 3 k4 5 --> new row
I have tried the code below, but I need only the four columns 'A', 'B', 'key', 'C', without any suffixes after the merge.
df3 = df1.merge(df2,on='key',how='outer')
>>> df3
A_x B_x key C_x A_y B_y C_y
0 0.0 1.0 k1 2.0 4.0 5.0 2.0
1 1.0 2.0 k2 3.0 1.0 2.0 3.0
2 2.0 3.0 k3 5.0 NaN NaN NaN
3 NaN NaN k4 NaN 2.0 3.0 5.0
It seems like you're looking for combine_first.
a = df2.set_index('key')
b = df1.set_index('key')
(a.combine_first(b)
.reset_index()
.reindex(columns=df1.columns))
A B key C
0 4.0 5.0 k1 2.0
1 1.0 2.0 k2 3.0
2 2.0 3.0 k3 5.0
3 2.0 3.0 k4 5.0
Try this (DataFrame.append was removed in pandas 2.0, so use pd.concat):
df1 = {'key': ['k1', 'k2', 'k3'], 'A': [0, 1, 2], 'B': [1, 2, 3], 'C': [2, 3, 5]}
df1 = pd.DataFrame(data=df1)
print(df1)
df2 = {'key': ['k1', 'k2', 'k4'], 'A': [4, 1, 2], 'B': [5, 2, 3], 'C': [2, 3, 5]}
df2 = pd.DataFrame(data=df2)
print(df2)
df3 = pd.concat([df1, df2])
df3.drop_duplicates(subset=['key'], keep='last', inplace=True)
df3 = df3.sort_values(by=['key'], ascending=True)
print(df3)
First, you need to indicate index columns:
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
Then, combine the dataframes to get all the index keys in place (this will not update the df1 values! See: combine_first manual):
df1 = df1.combine_first(df2)
Last step is updating the values in df1 with df2 and resetting the index
df1.update(df2)
df1.reset_index(inplace=True)
Concatenate and remove duplicates (note that drop_duplicates is a DataFrame method, not a top-level pandas function, and you need subset/keep so the df2 rows win):
df3 = pd.concat([df1, df2]).drop_duplicates(subset='key', keep='last')
Assuming both DataFrames have 'key' set as the index:
df3 = df1.combine_first(df2)
df3.update(df2)
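For reference, the combine_first-then-update recipe can be run end-to-end on the question's frames (with 'key' as the index, as the answers assume):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'A': [0, 1, 2],
                    'B': [1, 2, 3], 'C': [2, 3, 5]})
df2 = pd.DataFrame({'key': ['k1', 'k2', 'k4'], 'A': [4, 1, 2],
                    'B': [5, 2, 3], 'C': [2, 3, 5]})

a = df1.set_index('key')
b = df2.set_index('key')

# combine_first brings in the rows whose keys are missing from df1 (k4),
# then update overwrites matching keys with df2's values (k1).
out = a.combine_first(b)
out.update(b)
out = out.reset_index()
print(out)
```

Note combine_first alone would not change k1: it only fills holes, which is why the update step is needed.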

Pandas: Merge two dataframe columns

Consider two dataframes:
df_a = pd.DataFrame([
    ['a', 1],
    ['b', 2],
    ['c', np.nan],
], columns=['name', 'value'])
df_b = pd.DataFrame([
    ['a', 1],
    ['b', np.nan],
    ['c', 3],
    ['d', 4],
], columns=['name', 'value'])
So looking like
# df_a
name value
0 a 1
1 b 2
2 c NaN
# df_b
name value
0 a 1
1 b NaN
2 c 3
3 d 4
I want to merge these two dataframes and fill in the NaN values of the value column with the existing values in the other column. In other words, I want out:
# DESIRED RESULT
name value
0 a 1
1 b 2
2 c 3
3 d 4
Sure, I can do this with a custom .map or .apply, but I want a solution that uses merge or the like, not writing a custom merge function. How can this be done?
I think you can use combine_first:
print (df_b.combine_first(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Or fillna:
print (df_b.fillna(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
A solution with update is less common than combine_first; note that it modifies df_b in place:
df_b.update(df_a)
print (df_b)
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
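A quick runnable check that combine_first and fillna agree on the question's frames (alignment here is on the default row index, which is what makes both one-liners work):

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame([['a', 1], ['b', 2], ['c', np.nan]],
                    columns=['name', 'value'])
df_b = pd.DataFrame([['a', 1], ['b', np.nan], ['c', 3], ['d', 4]],
                    columns=['name', 'value'])

# Both fill df_b's holes from df_a; row 3 ('d') survives because it only
# exists in df_b, and row 1's NaN is taken from df_a.
res1 = df_b.combine_first(df_a)
res2 = df_b.fillna(df_a)
print(res1)
```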
