I'm not sure if this question has already been answered; please help me solve it. I've tried my best to explain it.
The problem, which I want solved in Python, is this:
I need to left merge a dataframe with 3 other dataframes. The tricky part is that all the dataframes have the same column headers, and I want each later dataframe's columns to overwrite the preceding ones in the output dataframe. But when I use a left merge in pandas, the shared columns of all the dataframes appear in the output with the suffixes "_x" and "_y".
Below are my 4 dataframes:
import pandas as pd

df1 = pd.DataFrame({"Fruits": ['apple', 'banana', 'mango', 'strawberry'],
                    "Price": [100, 50, 60, 70],
                    "Count": [1, 2, 3, 4],
                    "shop_id": ['A', 'A', 'A', 'A']})
df2 = pd.DataFrame({"Fruits": ['apple', 'banana', 'mango', 'chicku'],
                    "Price": [10, 509, 609, 1],
                    "Count": [8, 9, 10, 11],
                    "shop_id": ['B', 'B', 'B', 'B']})
df3 = pd.DataFrame({"Fruits": ['apple', 'banana', 'chicku'],
                    "Price": [1000, 5090, 10],
                    "Count": [5, 6, 7],
                    "shop_id": ['C', 'C', 'C']})
df4 = pd.DataFrame({"Fruits": ['apple', 'strawberry', 'mango', 'chicku'],
                    "Price": [50, 51, 52, 53],
                    "Count": [11, 12, 13, 14],
                    "shop_id": ['D', 'D', 'D', 'D']})
Now I want to left join df1 with df2, df3 and df4:
from functools import reduce
data_frames = [df1, df2,df3,df4]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Fruits'], how='left'),
                   data_frames)
But this produces an output in which the shared columns are repeated with the suffixes _x and _y.
I want only a single Price, Count and shop_id column in the output.
It looks like what you want is combine_first, not merge (with reduce, each successive frame's values take priority over the previous ones):
from functools import reduce
data_frames = [df1, df2,df3,df4]
df_merged = reduce(lambda left, right: right.set_index('Fruits').combine_first(left.set_index('Fruits')).reset_index(),
                   data_frames)
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 chicku 53 14 D
3 mango 52 13 D
4 strawberry 51 12 D
To filter the output to get only the keys from df1:
df_merged.set_index('Fruits').loc[df1['Fruits']].reset_index()
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 mango 52 13 D
3 strawberry 51 12 D
NB: everything would actually be easier if you set Fruits as the index; a sketch of that variant follows.
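A minimal sketch of that index-based variant (same combine_first logic, just without the repeated set_index/reset_index):
from functools import reduce

dfs = [df.set_index('Fruits') for df in (df1, df2, df3, df4)]
df_merged = reduce(lambda left, right: right.combine_first(left), dfs)
df_merged.loc[df1['Fruits']]  # restrict to the keys of df1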
Related
I am using the code below to search a .csv file, match a column in both files, and grab a different column I want, adding it as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:  # for/else: runs when no row matched
        return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above matches on the identically named 'name' column in both files and gets the column I request ([3]) from df2. I want the code to match on both the 'name' column and the 'price' column, and only take the value from [3] when both columns match in df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, matching only on name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use the pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The merge represents missing values as NaN, so the want column's dtype changes to float64, but you can cast it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
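For instance, a short sketch of that dtype round-trip (assuming want should end up as int):
out = df1.merge(df2, on=['name', 'price'], how='left').fillna(0)
out['want'] = out['want'].astype(int)  # cast back from float64 after filling the NaNs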
If you are matching the two dataframes on name and price, you can use df.where and df.isin (note that this relies on df1 and df2 sharing the same row index; it is a row-by-row comparison, not a lookup):
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" part of df1, to which you add a constant-valued column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)].copy()  # copy to avoid SettingWithCopyWarning
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
It looks complicated, but it should be quite performant because it filters by set. I think it might also be possible to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on multi-index handling.
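A rough, untested sketch of that index-based variant (column names as in the example above):
left = df1.set_index(['name', 'price'])
right = df2.set_index(['name', 'price'])
df_inner = left.merge(right, left_index=True, right_index=True, how='inner', validate='1:1')
df_anti = left.loc[left.index.difference(df_inner.index)].copy()  # rows of df1 with no match
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]).reset_index()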
Try this code; it will give you the expected results:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

new = pd.merge(df1, df2, how='left', on=['name', 'price'])
print(new.fillna(0))
I have 3 dataframes with the same ID column. I want to combine them into a single dataframe, with inner-join logic as in SQL. When I try the code below, it correctly joins the first two dataframes on the ID column, but the last one comes out wrong. How can I fix this? Thank you for your help in advance.
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
SOLVED: The data type of the ID column in DF1 was int, while in the others it was str, so the join on ID silently failed to match. When I converted all of the ID columns to the same dtype, I got the result I wanted.
Your IDs are not the same dtype:
>>> DF1
ID A
0 10 1
1 20 2
2 30 3
>>> DF2
ID K
0 30 3
1 10 1
2 20 2
>>> DF3
ID P
0 20 2
1 30 3
2 10 1
Your code:
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
The output:
>>> df_final
ID A K P
0 10 1 1 1
1 20 2 2 2
2 30 3 3 3
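If your real frames mix dtypes like that, normalize the ID columns before merging; a sketch, assuming the IDs are all integer-like:
for df in (DF1, DF2, DF3):
    df['ID'] = df['ID'].astype(int)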
Use join:
# use set index to add 'join' key into the index and
# create a list of dataframes using list comprehension
l = [df.set_index('ID') for df in [DF1, DF2, DF3]]
# pd.DataFrame.join accepts a list of dataframes as 'other'
l[0].join(l[1:])
I am trying to merge multiple DataFrames on the same DocID and then sum up the weights, but the merge creates Weight_x and Weight_y columns. This would be fine for only two DataFrames, but the number of DataFrames to merge changes based on user input, so merging creates Weight_x and Weight_y multiple times. How can I merge more than 2 DataFrames so that they merge on DocID and Weight is summed?
Example:
df1= DocID Weight
1 4
2 7
3 8
df2= DocID Weight
1 5
2 9
8 1
finalDf=
DocID Weight
1 9
2 16
You can merge, set the 'DocID' column as the index, and sum the remaining columns together. Then you can rebuild the frame and rename the columns in the resulting df_final as needed:
df_final = pd.merge(df1, df2, on=['DocID']).set_index(['DocID']).sum(axis=1)
df_final = pd.DataFrame({"DocID": df_final.index, "Weight":df_final}).reset_index(drop=True)
Output:
>>> df_final
DocID Weight
0 1 9
1 2 16
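A slightly shorter equivalent for the rebuilding step, using Series.reset_index with a name for the values column:
df_final = pd.merge(df1, df2, on='DocID').set_index('DocID').sum(axis=1).reset_index(name='Weight')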
df1.set_index('DocID').add(df2.set_index('DocID')).dropna()  # unmatched IDs become NaN (hence the float dtype) and are dropped
Weight
DocID
1 9.0
2 16.0
Can you try this: pd.merge(df1, df2, on=['DocID']).set_index(['DocID']).sum(axis=1)
You can then give any name to the sum column.
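Since the number of dataframes depends on user input, a concat-and-groupby sketch generalizes to any number of frames (assuming they all share the DocID/Weight schema and you want only DocIDs present in every frame, as in the example):
import pandas as pd
from functools import reduce

dfs = [df1, df2]  # any number of user-supplied frames
summed = pd.concat(dfs).groupby('DocID', as_index=False)['Weight'].sum()
common = reduce(lambda a, b: a & b, (set(df['DocID']) for df in dfs))  # IDs in every frame
finalDf = summed[summed['DocID'].isin(common)].reset_index(drop=True)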
I have two dataframes.
df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df2 = pd.DataFrame(columns=['pid','gen','ethn'])
As you can see, the second dataframe (df2) is empty, but at times it may also contain a few rows of data.
What I would like to do is copy the values (only) from df1 to df2, keeping the column names of df2 unchanged.
I tried both of the below, but neither worked:
df2 = df1.copy(deep=False)
df2 = df1.copy(deep=True)
How can I achieve this? Note that I don't want the column names of df1; I only want the data.
Do:
df1.columns = df2.columns.tolist()
df2 = df2.append(df1)  # note: DataFrame.append was removed in pandas 2.0; prefer the concat form below
## OR
df2 = pd.concat([df1, df2])
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
Edit, based on the OP's comment describing the nature of the dataframes:
df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethn': ['Chinese','Indian','European']})
df2= pd.DataFrame({'pers_id':[4,5,6],'gen': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df3= pd.DataFrame({'son_id':[7,8,9],'sex': ['Male','Female','Not disclosed'],'ethnici': ['Chinese','Indian','European']})
final_df = pd.DataFrame(columns=['pid','gen','ethn'])
Now do:
frame = [df1, df2, df3]
for i in range(len(frame)):
    frame[i].columns = final_df.columns.tolist()
    final_df = final_df.append(frame[i])
print(final_df)
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
0 4 Male Chinese
1 5 Female Indian
2 6 Not disclosed European
0 7 Male Chinese
1 8 Female Indian
2 9 Not disclosed European
The cleanest solution, I think, is to just append df1 after its column names have been set properly:
df2 = df2.append(pd.DataFrame(df1.values, columns=df2.columns))
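On pandas 2.0+, where DataFrame.append no longer exists, an equivalent sketch using concat:
df2 = pd.concat([df2, pd.DataFrame(df1.to_numpy(), columns=df2.columns)], ignore_index=True)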
I would like to calculate portfolio weights with a pandas dataframe. Here is some dummy data for an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['ann', 'bob'] * 3}).sort_values('name').reset_index(drop=True)
df2 = pd.DataFrame({'stock': list('ABC') * 2})
df3 = pd.DataFrame({'val': np.random.randint(10, 100, 6)})
df = pd.concat([df1, df2, df3], axis=1)
Each person owns 3 stocks with a value val. We can calculate portfolio weights like this:
df.groupby('name').apply(lambda x: x.val / x.val.sum())
which gives the weights as a Series indexed by name and the original row index.
If I want to add a column wgt to df I need to merge this result back to df on name and index. This seems rather clunky.
Is there a way to do this in one step? Or what is the way to do this that best utilizes pandas features?
Use transform; this will return a Series with an index aligned to your original df:
In [114]:
df['wgt'] = df.groupby('name')['val'].transform(lambda x: x/x.sum())
df
Out[114]:
name stock val wgt
0 ann A 18 0.131387
1 ann B 43 0.313869
2 ann C 76 0.554745
3 bob A 16 0.142857
4 bob B 44 0.392857
5 bob C 52 0.464286
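Equivalently, and typically a bit faster since it avoids the Python-level lambda, you can divide by the transformed group sums directly:
df['wgt'] = df['val'] / df.groupby('name')['val'].transform('sum')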