I am trying to merge multiple DataFrames on the same DocID and then sum the weights, but merging creates Weight_x and Weight_y columns. That would be fine for only two DataFrames, but the number of DataFrames to merge changes based on user input, so merging creates Weight_x and Weight_y multiple times. How can I merge more than two DataFrames so that they are merged on DocID and Weight is summed?
Example:
df1= DocID Weight
1 4
2 7
3 8
df2= DocID Weight
1 5
2 9
8 1
finalDf=
DocID Weight
1 9
2 16
You can merge, set the 'DocID' column as the index, then sum the remaining columns together. Then you can reset the index and rename the columns in the resulting df_final as needed:
df_final = pd.merge(df1, df2, on=['DocID']).set_index(['DocID']).sum(axis=1)
df_final = pd.DataFrame({"DocID": df_final.index, "Weight":df_final}).reset_index(drop=True)
Output:
>>> df_final
DocID Weight
0 1 9
1 2 16
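The merge above handles two frames, while the question asks about an arbitrary number of them. One possible sketch, assuming every frame has exactly the columns DocID and Weight, DocID is unique within each frame, and no Weight is missing: stack the frames with pd.concat, sum per DocID, and keep only the DocIDs that occur in every frame (which mirrors the inner-join result shown in the question). The helper name sum_weights and the third frame df3 are made up for illustration.

```python
import pandas as pd


def sum_weights(frames):
    """Sum Weight per DocID, keeping only DocIDs present in every frame.

    Hypothetical helper (not part of pandas). Assumes each frame has the
    columns DocID and Weight, DocID is unique within a frame, and no
    Weight value is missing.
    """
    stacked = pd.concat(frames, ignore_index=True)
    grouped = stacked.groupby("DocID")["Weight"].agg(["sum", "count"])
    # A DocID that appears len(frames) times occurred in every frame.
    in_all = grouped["count"] == len(frames)
    return grouped.loc[in_all, "sum"].rename("Weight").reset_index()


df1 = pd.DataFrame({"DocID": [1, 2, 3], "Weight": [4, 7, 8]})
df2 = pd.DataFrame({"DocID": [1, 2, 8], "Weight": [5, 9, 1]})
df3 = pd.DataFrame({"DocID": [1, 2, 3], "Weight": [1, 1, 1]})  # made-up third frame

final_two = sum_weights([df1, df2])
final_three = sum_weights([df1, df2, df3])
```

Because the frames are stacked rather than merged, no Weight_x / Weight_y columns are ever created, and the list passed in can be of any length.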
df1.set_index('DocID').add(df2.set_index('DocID')).dropna()
Weight
DocID
1 9.0
2 16.0
Can you try this: pd.merge(df1, df2, on=['DocID']).set_index(['DocID']).sum(axis=1)
You can now give any name to the sum column.
I have 3 dataframes with the same ID column and want to combine them into a single dataframe, using the same logic as an inner join in SQL. When I try the code below, it gives the following result. It joins the first two dataframes correctly, but gets the last one wrong even though the ID column matches. How can I fix this? Thank you in advance for your help.
from functools import reduce
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
output
SOLVED: The data type of the ID column in DF1 was int, while in the others it was str. Before asking the question, I had converted the ID column in DF1 to str and got the result shown. When I converted all of them to int instead, I got the result I wanted.
Your IDs are not the same dtype:
>>> DF1
ID A
0 10 1
1 20 2
2 30 3
>>> DF2
ID K
0 30 3
1 10 1
2 20 2
>>> DF3
ID P
0 20 2
1 30 3
2 10 1
Your code:
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
The output:
>>> df_final
ID A K P
0 10 1 1 1
1 20 2 2 2
2 30 3 3 3
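To illustrate the dtype point, here is a small sketch. The frames are rebuilt with the mismatch described in the question's update (ID stored as int in DF1 and as str in the others); the exact values are assumed for illustration. Depending on the pandas version, merging on keys of different dtypes may raise an error or leave rows unmatched, so casting every ID column to one dtype before the reduce seems the safer route.

```python
from functools import reduce

import pandas as pd

# Illustrative frames: DF1 stores ID as int, the others as str
# (assumed from the "SOLVED" note in the question).
DF1 = pd.DataFrame({"ID": [10, 20, 30], "A": [1, 2, 3]})
DF2 = pd.DataFrame({"ID": ["30", "10", "20"], "K": [3, 1, 2]})
DF3 = pd.DataFrame({"ID": ["20", "30", "10"], "P": [2, 3, 1]})

# Cast every ID column to the same dtype before merging.
dfs = [df.astype({"ID": "int64"}) for df in (DF1, DF2, DF3)]

df_final = reduce(
    lambda left, right: pd.merge(left, right, on="ID", how="outer"),
    dfs,
)
```

Checking `df.dtypes` on each frame before merging is a quick way to spot this kind of mismatch.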
Use join:
# use set index to add 'join' key into the index and
# create a list of dataframes using list comprehension
l = [df.set_index('ID') for df in [df1,df2,df3]]
# pd.DataFrame.join accepts a list of dataframes as 'other'
l[0].join(l[1:])
I want to understand how pd.merge works. I have two dataframes of unequal length. When I try to merge them with this command
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
The length was different from expected as follows
length of comps: 4829
length of surgical: 7939
length of merged: 9531
From my understanding, the merged dataframe should have the same length as the comps dataframe, since a left join will look for matching keys in both dataframes and discard the rest. Since comps is shorter than surgical, the merged length should be 4829. Why does it have 9531 rows, more than either dataframe? Even if I change the how parameter to "right", merged is longer than expected.
Generally, I want to know how to merge two dataframes of unequal length while selecting only some columns from the right dataframe. Also, how do I validate the merge operation? This might be helpful:
comps_ls: list of complications I want to throw on surgical dataframe.
mrn, Admission: the key columns I want to merge the two dataframes on.
Note: a teammate suggests this solution
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
merged = surgical.join(merged, on=['mrn'], how='left', lsuffix="", rsuffix="_r")
The length of the output was as follows
length of comps: 4829
length of surgical: 7939
length of merged: 7939
How can this help?
The "issue" is with duplicated merge keys, which can cause the resulting merge to be larger than the original. For a left merge you can expect the result to be in between N_rows_left and N_rows_left * N_rows_right rows long. The lower bound is in the case that both the left and right DataFrames have no duplicate merge keys, and the upper bound is the case when the left and right DataFrames have the single same value for the merge keys on every row.
Here's a worked example. All DataFrames are 4 rows long, but df2 has duplicate merge keys. As a result when df2 is merged to df the output is longer than df, because for the row with 2 as the key in df, both rows in df2 are matched.
import pandas as pd
df = pd.DataFrame({'key': [1,2,3,4]})
df1 = pd.DataFrame({'row': range(4), 'key': [2,3,4,5]})
df2 = pd.DataFrame({'row': range(4), 'key': [2,2,3,3]})
# Neither frame duplicated on merge key, result is same length (4) as left.
df.merge(df1, on='key', how='left')
# key row
#0 1 NaN
#1 2 0.0
#2 3 1.0
#3 4 2.0
# df2 is duplicated on the merge keys so we get >4 rows
df.merge(df2, on='key', how='left')
# key row
#0 1 NaN
#1 2 0.0 # Both `2` rows matched
#2 2 1.0 # ^
#3 3 2.0 # Both `3` rows matched
#4 3 3.0 # ^
#5 4 NaN
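On the question of validating the merge, here is a hedged sketch that reuses df and df2 from the example above. It relies on three pandas features: DataFrame.duplicated to list the offending rows, the validate argument of merge (which raises pandas.errors.MergeError when the stated relationship does not hold), and indicator=True to see where each output row came from. The variable names dupes, keys_unique and checked are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"key": [1, 2, 3, 4]})
df2 = pd.DataFrame({"row": range(4), "key": [2, 2, 3, 3]})

# Rows of the right frame whose merge key also appears on another row.
dupes = df2[df2.duplicated("key", keep=False)]

# validate="many_to_one" asserts that the right-hand keys are unique;
# pandas raises MergeError when they are not.
try:
    df.merge(df2, on="key", how="left", validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

# indicator=True adds a "_merge" column ("left_only", "right_only", "both").
checked = df.merge(df2, on="key", how="left", indicator=True)
```

If the duplicates in the right frame are not wanted, dropping or aggregating them before the merge keeps the result the same length as the left frame.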
If the length of the merged dataframe is greater than the length of the left dataframe, it means that the right dataframe has multiple entries for the same joining key. For instance if you have these dataframes:
df1
---
id product
0 111 car
1 222 bike
df2
---
id color
0 111 blue
1 222 red
2 222 green
3 333 yellow
A left merge will produce 3 rows, because there are two possible matches for the row of df1 with id 222.
df1.merge(df2, on="id", how="left")
---
id product color
0 111 car blue
1 222 bike red
2 222 bike green
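The row count of a left merge on a single key can also be predicted before merging. This is a rough sketch, assuming the two frames above and a single merge column: each left row produces one output row per matching right row, or exactly one row if nothing matches. The names matches_per_key and expected_rows are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [111, 222], "product": ["car", "bike"]})
df2 = pd.DataFrame({
    "id": [111, 222, 222, 333],
    "color": ["blue", "red", "green", "yellow"],
})

# Number of right-hand rows per key.
matches_per_key = df2["id"].value_counts()

# Each left row yields one output row per match, or one row if unmatched.
expected_rows = int(
    df1["id"].map(matches_per_key).fillna(0).clip(lower=1).sum()
)

merged = df1.merge(df2, on="id", how="left")
```

Comparing expected_rows with len(merged) is a cheap sanity check when the merged length looks surprising.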
I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines them with the code column in feelingsDF?
I'm guessing you would need to somehow use the shared feeling column to combine them and make sure the same code matches the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1,2,3,4,5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20,23,44,10,15]})
x.merge(y,how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select the other columns of feelingsDF, and then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reindex the combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling':1,'count':10,'code':'X'},
{'feeling':2,'count':5,'code':'Y'},{'feeling':3,'count':1,'code':'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE
I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3,3],"col1":[1,1.5,2,2.5,3,3.5,4],"date":["1.1.2001", "1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002","4.1.2002"], "col3":[11,12,13,14,15,16,17],"col4":[21,22,23,24,25,26,27]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So, the output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously, and it was marked as closed; my question was marked as a duplicate of this one. However, this is not the case. The answers in the linked question suggest the use of either map or df.merge. Map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching multiple columns) does not work in my case when one of the column names in df1 and df2 that are to be merged are different ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a KeyError.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can use pd.wide_to_long after removing the underscore from the column names. This unpivots the dataframe, which you can then merge with df2 and pivot back using unstack:
m =df1.rename(columns=lambda x: x.replace('_',''))
unpiv = pd.wide_to_long(m,['colA','date'],'ID','v').reset_index()
merge_piv = (unpiv.merge(df2[['ID','date','col3']],on=['ID','date'],how='left')
.set_index(['ID','v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv,left_on='ID',right_index=True)
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
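Since the question asks for a way to name the matching columns explicitly, here is an alternative sketch that avoids reshaping. It uses the left_on / right_on arguments of merge, which let the key columns have different names on each side (this is also why the on=['ID','date1'] attempt in the question fails: 'date1' does not exist in df2). It assumes that each (ID, date) pair occurs at most once in df2; the validate argument enforces that. The dict mapping date columns to new column names is just an illustrative way to spell out the conditions.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "colA_1": [1, 2, 3],
    "date1": ["1.1.2001", "2.1.2001", "3.1.2001"],
    "colA_2": [4, 5, 6],
    "date2": ["1.1.2002", "2.1.2002", "3.1.2002"],
})
df2 = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3],
    "date": ["1.1.2001", "1.1.2002", "2.1.2001", "2.1.2002",
             "3.1.2001", "3.1.2002", "4.1.2002"],
    "col3": [11, 12, 13, 14, 15, 16, 17],
})

df3 = df1.copy()
# Map each date column in df1 to the new column it should fill.
for date_col, new_col in {"date1": "colB_1", "date2": "colB_2"}.items():
    matched = df1[["ID", date_col]].merge(
        df2[["ID", "date", "col3"]],
        how="left",
        left_on=["ID", date_col],
        right_on=["ID", "date"],
        validate="many_to_one",  # fail loudly on duplicate (ID, date) pairs in df2
    )
    # A left merge keeps the left row order, so the values align positionally.
    df3[new_col] = matched["col3"].to_numpy()
```

Rows of df2 with no counterpart in df1 (here ID 3 on 4.1.2002) are simply ignored by the left merge, and a df1 row without a match would get NaN in the new column.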
I have two dataframes as follows:
DF1
A B C
1 2 3
4 5 6
7 8 9
DF2
Match Values
1 a,d
7 b,c
I want to match DF1['A'] with DF2['Match'] and append DF2['Values'] to DF1 if the value exists
So my result will be:
A B C Values
1 2 3 a,d
7 8 9 b,c
Now I can use the following code to match the values but it's returning an empty dataframe.
df1 = df1[df1['A'].isin(df2['Match'])]
Any help would be appreciated.
Instead of doing a lookup, you can do this in one step by merging the dataframes:
pd.merge(df1, df2, how='inner', left_on='A', right_on='Match')
Specify how='inner' if you only want records that appear in both, how='left' if you want all of df1's data.
If you want to keep only the Values column:
pd.merge(df1, df2.set_index('Match')['Values'].to_frame(), how='inner', left_on='A', right_index=True)
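As for the empty result from isin: the question does not show the dtypes, so the cause is not certain, but one common reason is that the two key columns hold different types (for example int in df1['A'] and str in df2['Match']). The sketch below assumes exactly that situation, with the frames rebuilt from the question and Match deliberately stored as strings; df2_fixed is a made-up name.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
# Assumption for illustration: Match was read in as strings.
df2 = pd.DataFrame({"Match": ["1", "7"], "Values": ["a,d", "b,c"]})

# The integer 1 is not equal to the string "1", so nothing matches.
no_match = df1[df1["A"].isin(df2["Match"])]

# Align the dtypes first, then merge and drop the redundant key column.
df2_fixed = df2.astype({"Match": df1["A"].dtype})
result = (
    pd.merge(df1, df2_fixed, how="inner", left_on="A", right_on="Match")
    .drop(columns="Match")
)
```

If df1.dtypes and df2.dtypes already agree, the cause lies elsewhere (stray whitespace in string keys would be another thing to check).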