Merging 3 dataframes with Pandas - python

I have 3 dataframes with the same ID column. I want to combine them into a single dataframe, using the same logic as an inner join in SQL. When I try the code below, it gives the following result. It joins the first two dataframes correctly, but the last one is joined incorrectly even though its ID column matches. How can I fix this? Thank you in advance for your help.
from functools import reduce
import pandas as pd
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
output
SOLVED: The data type of the ID column in DF1 was int, while in the others it was str. Before asking the question, I had converted the ID column in DF1 to str and got the result shown above. When I converted all of them to the int data type instead, I got the result I wanted.

Your IDs are not the same dtype. With a consistent dtype, for example:
>>> DF1
ID A
0 10 1
1 20 2
2 30 3
>>> DF2
ID K
0 30 3
1 10 1
2 20 2
>>> DF3
ID P
0 20 2
1 30 3
2 10 1
Your code:
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
The output:
>>> df_final
ID A K P
0 10 1 1 1
1 20 2 2 2
2 30 3 3 3
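A minimal sketch of the dtype problem and its fix, assuming DF1 holds integer IDs while DF2 and DF3 hold the same IDs as strings (the sample values are illustrative, not the asker's data). It checks the dtype of each ID column, casts them all to int64, and then merges with how="inner", which is the join type the question asks for:

```python
from functools import reduce

import pandas as pd

# Illustrative data: DF1 has integer IDs, DF2 and DF3 have string IDs.
DF1 = pd.DataFrame({"ID": [10, 20, 30], "A": [1, 2, 3]})
DF2 = pd.DataFrame({"ID": ["30", "10", "20"], "K": [3, 1, 2]})
DF3 = pd.DataFrame({"ID": ["20", "30", "10"], "P": [2, 3, 1]})

dfs = [DF1, DF2, DF3]

# Inspect the key dtypes first; a mismatch here is the root of the problem.
print([df["ID"].dtype for df in dfs])

# Cast every ID column to the same dtype before merging.
dfs = [df.astype({"ID": "int64"}) for df in dfs]

df_final = reduce(
    lambda left, right: pd.merge(left, right, on="ID", how="inner"), dfs
)
print(df_final)
```

Casting to str instead of int64 would work equally well, as long as every frame uses the same dtype for the key.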

Use join:
# use set index to add 'join' key into the index and
# create a list of dataframes using list comprehension
l = [df.set_index('ID') for df in [df1,df2,df3]]
# pd.DataFrame.join accepts a list of dataframes as 'other'
l[0].join(l[1:])
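The join approach above can be tried end to end with a small, self-contained sketch. The sample frames are assumptions that mirror the DF1/DF2/DF3 example earlier in the thread, and the default how="left" of DataFrame.join is used, so rows follow the first frame's index:

```python
import pandas as pd

# Illustrative frames with a shared, consistently typed ID column.
df1 = pd.DataFrame({"ID": [10, 20, 30], "A": [1, 2, 3]})
df2 = pd.DataFrame({"ID": [30, 10, 20], "K": [3, 1, 2]})
df3 = pd.DataFrame({"ID": [20, 30, 10], "P": [2, 3, 1]})

# Move the join key into the index of every frame.
frames = [df.set_index("ID") for df in [df1, df2, df3]]

# DataFrame.join accepts a list of frames and aligns them on the index.
joined = frames[0].join(frames[1:])
print(joined)
```

Joining a list this way requires the non-key column names to be distinct across frames, which holds here (A, K, P).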

Related

Left joining multiple datasets with same column headers

I am not sure whether this question has already been answered, so please help me solve it.
I have tried my best to explain it.
Please refer to the images to understand my query.
I want to solve this in Python.
The query is:
I need to left merge a dataframe with 3 other dataframes.
The tricky part is that all the dataframes have the same column headers, and I want each column to overwrite the matching column from the preceding dataframe in my output dataframe.
But when I use a left merge in Python, the column headers of all the dataframes are printed with the suffixes "_x" and "_y".
The below are my 4 dataframes:
df1 = pd.DataFrame({"Fruits":['apple','banana','mango','strawberry'],
"Price":[100,50,60,70],
"Count":[1,2,3,4],
"shop_id":['A','A','A','A']})
df2 = pd.DataFrame({"Fruits":['apple','banana','mango','chicku'],
"Price":[10,509,609,1],
"Count":[8,9,10,11],
"shop_id":['B','B','B','B']})
df3 = pd.DataFrame({"Fruits":['apple','banana','chicku'],
"Price":[1000,5090,10],
"Count":[5,6,7],
"shop_id":['C','C','C']})
df4 = pd.DataFrame({"Fruits":['apple','strawberry','mango','chicku'],
"Price":[50,51,52,53],
"Count":[11,12,13,14],
"shop_id":['D','D','D','D']})
Now I want to left join df1 with df2, df3 and df4.
from functools import reduce
data_frames = [df1, df2,df3,df4]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Fruits'],
how='left'), data_frames)
But this produces the output below:
The same columns are printed in the output dataset with the suffixes _x and _y.
I want only a single Price, shop_id and Count column, like below:
It looks like what you want is combine_first, not merge:
from functools import reduce
data_frames = [df1, df2,df3,df4]
df_merged = reduce(lambda left,right: right.set_index('Fruits').combine_first(left.set_index('Fruits')).reset_index(),
data_frames)
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 chicku 53 14 D
3 mango 52 13 D
4 strawberry 51 12 D
To filter the output to get only the keys from df1:
df_merged.set_index('Fruits').loc[df1['Fruits']].reset_index()
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 mango 52 13 D
3 strawberry 51 12 D
NB. everything would actually be easier if you set Fruits as index
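As a sketch of that last remark, here is the same combine_first chain with Fruits set as the index up front, so no set_index/reset_index round trip is needed inside the reduce. The variable names indexed, combined and result are illustrative; the data is the question's:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"Fruits": ["apple", "banana", "mango", "strawberry"],
                    "Price": [100, 50, 60, 70],
                    "Count": [1, 2, 3, 4],
                    "shop_id": ["A", "A", "A", "A"]})
df2 = pd.DataFrame({"Fruits": ["apple", "banana", "mango", "chicku"],
                    "Price": [10, 509, 609, 1],
                    "Count": [8, 9, 10, 11],
                    "shop_id": ["B", "B", "B", "B"]})
df3 = pd.DataFrame({"Fruits": ["apple", "banana", "chicku"],
                    "Price": [1000, 5090, 10],
                    "Count": [5, 6, 7],
                    "shop_id": ["C", "C", "C"]})
df4 = pd.DataFrame({"Fruits": ["apple", "strawberry", "mango", "chicku"],
                    "Price": [50, 51, 52, 53],
                    "Count": [11, 12, 13, 14],
                    "shop_id": ["D", "D", "D", "D"]})

# Index every frame by the key once.
indexed = [df.set_index("Fruits") for df in (df1, df2, df3, df4)]

# Later frames take priority; earlier frames only fill in missing rows.
combined = reduce(lambda left, right: right.combine_first(left), indexed)

# Keep only the keys present in df1, in df1's order (left-join behaviour).
result = combined.loc[indexed[0].index]
print(result)
```

Because combine_first aligns on the index, numeric columns may come back as floats if any intermediate step introduces missing values.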

How do I merge two Pandas DataFrames and add the overlapping columns

I am trying to merge multiple DataFrames on the same DocID and then sum up the weights, but when I merge, it creates Weight_x and Weight_y. This would be fine for only two DataFrames, but the number of DataFrames to merge changes based on user input, so merging creates Weight_x and Weight_y multiple times. How can I merge more than 2 DataFrames so that they are merged on DocID and Weight is summed?
Example:
df1= DocID Weight
1 4
2 7
3 8
df2= DocID Weight
1 5
2 9
8 1
finalDf=
DocID Weight
1 9
2 16
You can merge, set the 'DocID' column as the index, then sum the remaining columns together. Then you can reindex and rename the columns in the resulting final_df as needed:
df_final = pd.merge(df1, df2, on=['DocID']).set_index(['DocID']).sum(axis=1)
df_final = pd.DataFrame({"DocID": df_final.index, "Weight":df_final}).reset_index(drop=True)
Output:
>>> df_final
DocID Weight
0 1 9
1 2 16
df1.set_index('DocID').add(df2.set_index('DocID')).dropna()
Weight
DocID
1 9.0
2 16.0
You can try this:
pd.merge(df1, df2, on=['DocID']).set_index(['DocID']).sum(axis=1)
You can then give any name to the sum column.
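Since the question mentions a variable number of DataFrames, one possible sketch that avoids the Weight_x/Weight_y suffixes altogether is to stack the frames with pd.concat and aggregate with groupby. This is a different technique from the merge shown in the answers, and it assumes each DocID appears at most once per frame. The frames list stands in for the user-supplied DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({"DocID": [1, 2, 3], "Weight": [4, 7, 8]})
df2 = pd.DataFrame({"DocID": [1, 2, 8], "Weight": [5, 9, 1]})

# Any number of frames can go in this list.
frames = [df1, df2]

# Stack all rows, then compute the sum and the number of frames per DocID.
stacked = pd.concat(frames, ignore_index=True)
grouped = stacked.groupby("DocID")["Weight"].agg(["sum", "count"])

# Keep only DocIDs found in every frame (inner-join behaviour).
# Assumes DocID is unique within each frame.
final_df = (
    grouped.loc[grouped["count"] == len(frames), "sum"]
    .rename("Weight")
    .reset_index()
)
print(final_df)
```

Dropping the count filter would give outer-join behaviour instead, where DocIDs missing from some frames keep the sum of the weights that do exist.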

Combining two dataframes with same column

I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines them with the code column in feelingsDF?
I'm guessing you would need to somehow use the shared feeling column to combine them and make sure the same code matches the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1,2,3,4,5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20,23,44,10,15]})
x.merge(y,how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select the other columns of feelingsDF, and then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reindex the combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
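To get the columns in the exact order requested in the question ([feeling][countryCount][code]), one option is to start the merge from countryDF. The sketch below uses made-up sample data, since the question does not show any rows:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
feelingsDF = pd.DataFrame({"feeling": ["happy", "sad", "angry"],
                           "count": [10, 5, 1],
                           "code": ["H", "S", "A"]})
countryDF = pd.DataFrame({"feeling": ["sad", "happy", "wow"],
                          "countryCount": [7, 3, 2]})

# Merge only the columns needed from feelingsDF onto countryDF.
combined_df = countryDF.merge(
    feelingsDF[["feeling", "code"]], on="feeling", how="inner"
)
print(combined_df)
```

With how="inner", feelings that appear in only one of the two frames ("angry" and "wow" here) are dropped.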
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling':1,'count':10,'code':'X'},
{'feeling':2,'count':5,'code':'Y'},{'feeling':3,'count':1,'code':'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE

Pandas concat not concatenating, but appending

I'm hoping for some help.
I am trying to concatenate three dataframes in pandas with a multiindex. Two of them work fine, but the third keeps appending, instead of concatenating.
They all have the same multiindex (I have tested this by df1.index.name == df2.index.name)
This is what I have tried:
df_final = pd.concat([df1, df2], axis = 1)
example:
df1
A B X
0 1 3
2 4
df2
A B Y
0 1 20
2 30
What I want to get is this:
df_final
A B X Y
0 1 3 20
2 4 30
But what I keep getting is this:
df_final
A B X Y
0 1 3 NaN
2 4 NaN
0 1 NaN 20
2 NaN 30
Any ideas? I have also tried
df_final = pd.concat([df1, df2], axis = 1, keys = ['A', 'B'])
But then df2 doesn't appear at all.
Thanks!
First way (and the better one in this case):
use merge:
pd.merge(left=df1, right=df2, on=['A','B'], how='inner')
Second way:
If you prefer using concat you can use groupby after it:
df_final = pd.concat([df1, df2])
df_final = df_final.groupby(['A','B']).first()
Thank you everyone for your help!
With your suggestions, I tried merging, but I got a new error:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
Which led me to find that one of the indexes in the dataframe that was appending was an object instead of an integer. So I've changed that and now the concat works!
This has taken me days to get through...
So thank you again!
Try doing
pd.merge(df1, df2)
join() may also be used for your problem, provided you add the 'key' column to all your dataframes.
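The root cause the asker found (one index level stored as object instead of integer) can be checked and repaired before concatenating. The sketch below rebuilds the example with level B deliberately stored as strings in df2, which is an assumption about how the mismatch looked; df2_fixed is an illustrative name:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0], "B": [1, 2], "X": [3, 4]}).set_index(["A", "B"])
# Assumed mismatch: level B holds strings here instead of integers.
df2 = pd.DataFrame({"A": [0, 0], "B": ["1", "2"], "Y": [20, 30]}).set_index(["A", "B"])

# Comparing index names is not enough; compare the dtype of each level.
for level in ["A", "B"]:
    print(level,
          df1.index.get_level_values(level).dtype,
          df2.index.get_level_values(level).dtype)

# Cast the offending level and rebuild the index.
df2_fixed = df2.reset_index().astype({"B": "int64"}).set_index(["A", "B"])

# With matching index dtypes, concat aligns rows side by side.
df_final = pd.concat([df1, df2_fixed], axis=1)
print(df_final)
```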

Check values in dataframe against another dataframe and append values if present

I have two dataframes as follows:
DF1
A B C
1 2 3
4 5 6
7 8 9
DF2
Match Values
1 a,d
7 b,c
I want to match DF1['A'] with DF2['Match'] and append DF2['Values'] to DF1 if the value exists
So my result will be:
A B C Values
1 2 3 a,d
7 8 9 b,c
Now I can use the following code to match the values but it's returning an empty dataframe.
df1 = df1[df1['A'].isin(df2['Match'])]
Any help would be appreciated.
Instead of doing a lookup, you can do this in one step by merging the dataframes:
pd.merge(df1, df2, how='inner', left_on='A', right_on='Match')
Specify how='inner' if you only want records that appear in both, how='left' if you want all of df1's data.
If you want to keep only the Values column:
pd.merge(df1, df2.set_index('Match')['Values'].to_frame(), how='inner', left_on='A', right_index=True)
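A runnable version of the merge on the question's data is sketched below. It drops the redundant Match column afterwards to match the desired result. The isin check at the end is a guess at why the original lookup came back empty: if the two key columns have different dtypes (for example int in one and str in the other), nothing matches.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
df2 = pd.DataFrame({"Match": [1, 7], "Values": ["a,d", "b,c"]})

# Inner merge on differently named key columns, then drop the duplicate key.
result = (
    df1.merge(df2, how="inner", left_on="A", right_on="Match")
       .drop(columns="Match")
)
print(result)

# Possible cause of an empty isin result (an assumption about the original
# data): integer keys compared against string keys never match.
mask = df1["A"].isin(["1", "7"])
print(mask.any())
```

If the dtypes do differ in the real data, casting one key column with astype before merging avoids both the empty lookup and merge errors.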
