I have two dataframes in pandas, as shown below. EmpID is the primary key in both dataframes.
df_first = pd.DataFrame([[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]], columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame([[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan], [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']], columns=['EmpID', 'Name', 'Department', 'Location'])
I want to join these two dataframes on EmpID so that:
Missing data in one dataframe is filled with the value from the other table, if it exists and the key matches
Observations with new keys are appended to the resulting dataframe
I've used the code below to achieve this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns, which I don't want, so I kept only the columns of df_second that are not in df_first (plus the key) for merging.
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])
Now I don't get duplicate columns, but I don't get the values filled in either for observations where the key matches.
I'd really appreciate it if someone could help me with this.
Regards,
Kailash Negi
It seems you need combine_first with set_index, so the matching is done on the index created from the EmpID column:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
EmpID Department Location Name Salary
0 1 HR Delhi A 1000.0
1 2 NaN NaN B NaN
2 3 Finance NaN C 3000.0
3 4 NaN NaN D 8000.0
4 5 Programming NaN E 6000.0
5 8 Admin Mumbai B NaN
6 9 Ops Banglore D NaN
7 10 Analytics Mumbai K NaN
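Note that combine_first gives priority to the calling DataFrame: its non-null values are kept, and only its NaN cells are filled from the other frame. A minimal sketch with made-up values to illustrate:

import numpy as np
import pandas as pd

# the calling frame's non-null values win; only its NaN cells are filled
s1 = pd.Series([1.0, np.nan], index=['a', 'b'])
s2 = pd.Series([9.0, 2.0], index=['a', 'b'])
print(s1.combine_first(s2))  # a -> 1.0 (kept from s1), b -> 2.0 (filled from s2)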
EDIT:
For a specific order of columns, reindex is needed:
# concatenate all column names together and remove dupes
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
              .combine_first(df_second.set_index('EmpID'))
              .reset_index()
              .reindex(columns=ColNames))
print (df)
EmpID Name Department Location Salary
0 1 A HR Delhi 1000.0
1 2 B NaN NaN NaN
2 3 C Finance NaN 3000.0
3 4 D NaN NaN 8000.0
4 5 E Programming NaN 6000.0
5 8 B Admin Mumbai NaN
6 9 D Ops Banglore NaN
7 10 K Analytics Mumbai NaN
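An equivalent way to build the same ordered column list, staying within pandas Index methods (a sketch using the same frames; the result is an Index rather than a list, which reindex also accepts):

ColNames = df_second.columns.union(df_first.columns, sort=False)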
I have 2 dataframes, and I wish to bring the values from df2 that match df1's IDs into df1 as separate columns. There are multiple columns to be added, since df2 has several different country names for a given ID.
df1 looks like below:
ID URL
A example.com/1
B example.com/2
C example.com/3
D example.com/4
df2 is like this:
ID country URL
A usa example.com/usa-A
B uk example.com/uk-B
C canada example.com/canada-C
A uk example.com/uk-A
C usa example.com/usa-C
What I am expecting df1 to look like:
ID URL uk usa canada
A example.com/1 example.com/uk-A example.com/usa-A NaN
B example.com/2 example.com/uk-B NaN NaN
C example.com/3 NaN example.com/usa-C example.com/canada-C
D example.com/4 NaN NaN NaN
In other words, if a df1 ID (say A) is found in df2 against a country, I want to bring that country's URL up next to the df1 ID, in a column for that specific country.
The way I am trying to achieve this is with a for loop and a map call, as below:
final = pd.DataFrame()
for a in countries_list:
    b = df2.loc[df2["country"] == a]
    df1[a] = df1['ID'].map(b.set_index('ID')['URL'])
    final = pd.concat([final, df1])
It runs for a certain number of countries and then starts throwing InvalidIndexError: Reindexing only valid with uniquely valued Index objects, which I tried to overcome with a reset_index() on both df1 and df2, but after a certain number of iterations it still throws the same error.
Can someone suggest a more efficient way to do this, or any way I could run it over all possible iterations?
Thanks,
Try as follows:
res = df1.merge(df2.pivot(index='ID', columns='country', values='URL'),
                left_on='ID', right_index=True, how='left')
print(res)
ID URL canada uk usa
0 A example.com/1 NaN example.com/uk-A example.com/usa-A
1 B example.com/2 NaN example.com/uk-B NaN
2 C example.com/3 example.com/canada-C NaN example.com/usa-C
3 D example.com/4 NaN NaN NaN
Explanation
First, use df.pivot on df2. We get:
print(df2.pivot(index='ID',columns='country',values='URL'))
country canada uk usa
ID
A NaN example.com/uk-A example.com/usa-A
B NaN example.com/uk-B NaN
C example.com/canada-C NaN example.com/usa-C
Next, use df.merge to merge df1 and the pivoted df2, joining the two on ID and passing 'left' to the how parameter to "use only keys from left frame". (Leave how='left' out if you are not interested in the row for an ID with only NaN values, like D.)
If you're set on a particular column order, use e.g.:
res = res.loc[:,['ID','URL','uk','usa','canada']]
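As an aside, the InvalidIndexError in your loop most likely comes from calling Series.map with a lookup Series whose index contains duplicate IDs (an ID that appears for more than one country). A minimal reproduction with made-up data:

import pandas as pd

lookup = pd.Series(['u1', 'u2', 'u3'], index=['A', 'A', 'B'])  # 'A' is duplicated
keys = pd.Series(['A', 'B'])
try:
    keys.map(lookup)
except pd.errors.InvalidIndexError as err:
    print(err)  # Reindexing only valid with uniquely valued Index objects

The pivot approach sidesteps this, because pivot requires each (ID, country) pair to map to a single URL.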
# map df1's URL onto df2
df2['URL2'] = df2['ID'].map(df1.set_index('ID')['URL'])
# pivot
(df2.pivot(index=['ID', 'URL2'], columns='country', values='URL')
    .reset_index()
    .rename_axis(columns=None)
    .rename(columns={'URL2': 'URL'}))

ID URL canada uk usa
0 A example.com/1 NaN example.com/uk-A example.com/usa-A
1 B example.com/2 NaN example.com/uk-B NaN
2 C example.com/3 example.com/canada-C NaN example.com/usa-C
I have 2 dataframes as follows:
df1 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4']})
df2 = pd.DataFrame({'Date':['2020-10-10','2020-10-09','2020-10-08','2020-10-07','2020-10-06']})
How can I get a dataframe which has df1 as rows and df2's dates as columns, with the cells generated as null values?
And the final step is to fill the cells with a join on another table (df4):
df4 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4'],'2020-10-10':[1,2,5,np.nan],'2020-10-09':[np.nan,2,3,0],'2020-10-08':[0,0,2,3],'2020-10-07':[np.nan,1,np.nan,2]})
The final df should have the values from df4 filled into those cells.
Any help is truly appreciated.
I hope I've understood your question right. You have 3 dataframes:
df1 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4']})
df2 = pd.DataFrame({'Date':['2020-10-10','2020-10-09','2020-10-08','2020-10-07','2020-10-06']})
df4 = pd.DataFrame({'Barcode':[1,2,3,4],'Store':['s1','s2','s3','s4'],'2020-10-10':[1,2,5,np.nan],'2020-10-09':[np.nan,2,3,0],'2020-10-08':[0,0,2,3],'2020-10-07':[np.nan,1,np.nan,2]})
Then:
df1 = pd.DataFrame(df1, columns=df1.columns.tolist() + df2['Date'].tolist())
df1 = df1.set_index('Barcode')
df4 = df4.set_index('Barcode')
print(df1.fillna(df4))
Prints:
Store 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06
Barcode
1 s1 1.0 NaN 0.0 NaN NaN
2 s2 2.0 2.0 0.0 1.0 NaN
3 s3 5.0 3.0 2.0 NaN NaN
4 s4 NaN 0.0 3.0 2.0 NaN
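For what it's worth, df1.fillna(df4) works here because DataFrame.fillna with a DataFrame argument aligns on both index and columns, so only cells that exist in df4 under the same Barcode and column name are used. A small sketch with made-up values:

import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [np.nan, 1.0]}, index=[10, 20])
b = pd.DataFrame({'x': [5.0, 9.0], 'y': [7.0, 8.0]}, index=[10, 20])
print(a.fillna(b))  # x at index 10 becomes 5.0; 1.0 at 20 is kept; column 'y' is ignored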
First, create a temporary DataFrame:
wrk = pd.DataFrame('', index=pd.MultiIndex.from_frame(df1),
                   columns=df2.Date.rename(None))
It is filled with empty strings and has the required column names from df2. For now, Barcode and Store are index columns. This arrangement will be needed soon.
Then update it (in-place) with the data from df4:
wrk.update(df4.set_index(['Barcode', 'Store']))
And the last step is:
result = wrk.reset_index()
The result is:
   Barcode Store 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06
0        1    s1          1                     0
1        2    s2          2          2          0          1
2        3    s3          5          3          2
3        4    s4                     0          3          2
for item in df2.Date.tolist():
    df1[item] = np.nan

dfinal = df1.fillna(df4)
dfinal = dfinal.set_index('Barcode')
display(dfinal)
I want to convert this table:
movie name rating
0 thg John 3.0
1 thg James 4.0
2 mol NaN 5.0
3 mol NaN NaN
4 lob NaN NaN
into the following tables:
df1
movie name rating
0 thg John 3.0
1 thg James 4.0
df2
movie rating
2 mol 5.0
df3
movie
3 mol
4 lob
where each dataframe has no NaN values. Also, please tell me the method if I need to separate on blank values instead of NaN.
I think that the start of a new target DataFrame should occur not only when the number of NaN values changes (compared to the previous row), but also when this number is the same but the NaN values are in different columns.
So I propose the following formula:
dfs = [g.dropna(how='all', axis=1) for _, g in
       df.groupby(df.isna().ne(df.isna().shift()).any(axis=1).cumsum())]
You can print partial DataFrames (any number of them) running:
for n, grp in enumerate(dfs):
    print(f'\ndf No {n}:\n{grp}')
The advantage of my solution over the other becomes obvious when you add
to the source DataFrame another row containing:
5 NaN NaN 3.0
It also contains 1 non-null value (like the two previous rows).
The other solution will treat all these rows as one partial DataFrame
containing:
movie rating
3 mol NaN
4 lob NaN
5 NaN 3.0
as you can see, with NaN values, whereas my solution divides these
rows into 2 separate DataFrames, without any NaN.
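To make the difference concrete, here is a sketch that rebuilds the example table with that extra row and counts the groups each formula produces (column names taken from the example above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'movie':  ['thg', 'thg', 'mol', 'mol', 'lob', np.nan],
                   'name':   ['John', 'James', np.nan, np.nan, np.nan, np.nan],
                   'rating': [3.0, 4.0, 5.0, np.nan, np.nan, 3.0]})

# the key changes whenever a row's NaN pattern differs from the previous row's
key = df.isna().ne(df.isna().shift()).any(axis=1).cumsum()
dfs = [g.dropna(how='all', axis=1) for _, g in df.groupby(key)]
print(len(dfs))  # 4 partial DataFrames: rows 0-1, row 2, rows 3-4, row 5

# grouping by the NaN count alone lumps rows 3-5 together (each has 2 NaN)
alt = [g.dropna(how='all', axis=1) for _, g in df.groupby(df.isna().sum(axis=1))]
print(len(alt))  # 3 partial DataFrames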
Create a list of dfs, with a groupby and dropna:
dfs = [g.dropna(how='all', axis=1) for _, g in df.groupby(df.isna().sum(axis=1))]
print(dfs[0], '\n\n', dfs[1], '\n\n', dfs[2])
Or a dict:
d = {f"df{e+1}": g[1].dropna(how='all', axis=1)
     for e, g in enumerate(df.groupby(df.isna().sum(axis=1)))}
print(d['df1'], '\n\n', d['df2'], '\n\n', d['df3'])  # read the keys of d
movie name rating
0 thg John 3.0
1 thg James 4.0
movie rating
2 mol 5.0
movie
3 mol
4 lob
I am trying to merge several DataFrames based on a common column. This will be done in a loop, and the original DataFrame may not have all of the columns, so an outer merge is necessary. However, when I do this over several different DataFrames, columns get duplicated with the suffixes _x and _y. I am looking for one DataFrame where the data is filled in and columns are added only if they did not previously exist.
df1=pd.DataFrame({'Company Name':['A','B','C','D'],'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
Company Name Data1 Data2
0 A 1 13
1 B 34 54
2 C 23 5354
3 D 66 443
A second DataFrame with additional information for some of the companies:
df2 = pd.DataFrame({'Company Name':['A','B'],'Address': ['str1', 'str2'], 'Phone': ['str1a', 'str2a']})
Company Name Address Phone
0 A str1 str1a
1 B str2 str2a
If I want to combine these two, it merges successfully into one using on='Company Name':
df1=pd.merge(df1,df2, on='Company Name', how='outer')
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 NaN NaN
3 D 66 443 NaN NaN
However, if I were to run this same command again in a loop, or merge with another DataFrame carrying other company information, I end up with duplicate columns similar to the following:
df1=pd.merge(df1,pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']}), on='Company Name', how='outer')
Company Name Data1 Data2 Address_x Phone_x Address_y Phone_y
0 A 1 13 str1 str1a NaN NaN
1 B 34 54 str2 str2a NaN NaN
2 C 23 5354 NaN NaN str3 str3a
3 D 66 443 NaN NaN NaN NaN
What I really want is one DataFrame with the same columns, just with any missing data populated:
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 str3 str3a
3 D 66 443 NaN NaN
Thanks in advance. I have reviewed the previous questions asked here on duplicate columns, as well as the pandas documentation, without any progress.
Since you are merging one dataframe at a time in a for loop, here is a way to do it that works whether or not the new dataframe has new company names or new columns:
df1 = pd.DataFrame({'Company Name':['A','B','C','D'],
                    'Data1':[1,34,23,66], 'Data2':[13,54,5354,443]})
list_dfo = [pd.DataFrame({'Company Name':['A','B'],
                          'Address': ['str1', 'str2'], 'Phone': ['str1a', 'str2a']}),
            pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})]

for df_other in list_dfo:
    df1 = pd.merge(df1, df_other, how='outer').groupby('Company Name').first().reset_index()
    # and other code
At the end in this example:
print(df1)
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 str1 str1a
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
Instead of first, you can use last, which keeps the last valid value in each column per group rather than the first. Which one you want depends on whose data you need: the value from df1, or the one from df_other if available. In the example above it does not change anything, but in the following case you will see the difference:
# company A has a new address
df4 = pd.DataFrame({'Company Name':['A'],'Address':['new_str1']})

# first keeps the value from df1
print(pd.merge(df1, df4, how='outer').groupby('Company Name')
        .first().reset_index())
Out[21]:
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 str1 str1a #address is str1 from df1
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
# while last keeps the value from df4
print(pd.merge(df1, df4, how='outer').groupby('Company Name')
        .last().reset_index())
Out[22]:
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 new_str1 str1a #address is new_str1 from df4
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
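The reason the groupby trick fills in columns across the merged rows is that GroupBy.first() returns the first non-null value per column within each group, not simply the first row. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'k': ['A', 'A'], 'v': [np.nan, 5.0], 'w': [1.0, np.nan]})
print(df.groupby('k').first())  # v=5.0 and w=1.0: the first non-null per column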
IIUC, you might try this:
def update_df(df1, df_next):
    if 'Company Name' not in list(df1):
        pass
    else:
        df1.set_index('Company Name', inplace=True)
    df_next.set_index('Company Name', inplace=True)
    new_cols = [item for item in set(df_next) if item not in set(df1)]
    for col in new_cols:
        df1['{}'.format(col)] = col
    df1.update(df_next)
df3 = pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})

update_df(df1, df2)
update_df(df1, df3)
df1
Data1 Data2 Address Phone
Company Name
A 1 13 str1 str1a
B 34 54 str2 str2a
C 23 5354 str3 str3a
D 66 443 Address Phone
note1: to be able to use df.update, you have to set_index to 'Company Name'; the function checks this for df1 once, and on later calls it passes. The df being added gets its index set to 'Company Name' every time.
note2: next, the function checks whether there are new columns, adds them, and fills them with the column name (you might want to change that).
note3: lastly, you perform df.update with the values you need.
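One more caveat that may matter here: by default, df.update only writes non-NA values from the other frame, so a NaN in df_next never overwrites existing data in df1. A minimal sketch:

import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, 2.0]})
b = pd.DataFrame({'x': [np.nan, 9.0]})
a.update(b)  # in-place; the NaN in b leaves a's 1.0 untouched
print(a)     # x: 1.0, 9.0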
I have two dataframes
df1
Name class value
Sri 1 5
Ram 2 8
viv 3 4
df2
Name class value
Sri 1 5
viv 4 4
My desired output is:
df
Name class value
Sri 2 10
Ram 2 8
viv 7 8
Please help, thanks in advance!
I think you need set_index for both DataFrames, then add, and last reset_index:
df = df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
print (df)
Name class value
0 Ram 2.0 8.0
1 Sri 2.0 10.0
2 viv 7.0 8.0
If the values in Name are not unique, use groupby and aggregate with sum:
df = df1.groupby('Name').sum().add(df2.groupby('Name').sum(), fill_value=0).reset_index()
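For reference, fill_value=0 treats a label that is present on only one side as 0 instead of producing NaN during the alignment; a small sketch:

import pandas as pd

s1 = pd.Series({'Ram': 8, 'Sri': 5})
s2 = pd.Series({'Sri': 5, 'viv': 4})
print(s1.add(s2, fill_value=0))  # Ram 8.0, Sri 10.0, viv 4.0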
pd.concat + groupby + sum
You can concatenate your individual dataframes and then group by your key column:
df = pd.concat([df1, df2])\
       .groupby('Name')[['class', 'value']]\
       .sum().reset_index()
print(df)
Name class value
0 Ram 2 8
1 Sri 2 10
2 viv 7 8