This question already has answers here: Pandas Merging 101 (8 answers). Closed 2 years ago.
I want to understand how pd.merge works. I have two dataframes of unequal length. When I try to merge them with this command
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
the length of the result differs from what I expected:
length of comps: 4829
length of surgical: 7939
length of merged: 9531
From my understanding, the merged dataframe should have the same length as the comps dataframe, since a left join looks for matching keys in both dataframes and discards the rest. Since comps is shorter than surgical, I expected the merged length to be 4829. Why is it 9531, larger than the length of both? Even when I change the how parameter to "right", merged has more rows than expected.
Generally, I want to know how to merge two dataframes of unequal length while specifying some columns from the right dataframe. Also, how do I validate the merge operation? You may find this helpful:
comps_ls: the list of complication columns I want to add to the surgical dataframe.
mrn, Admission: the key columns I want to merge the two dataframes on.
Note: a teammate suggests this solution
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
merged = surgical.join(merged, on=['mrn'], how='left', lsuffix="", rsuffix="_r")
The length of the output was as follows
length of comps: 4829
length of surgical: 7939
length of merged: 7939
How can this help?
The "issue" is duplicated merge keys, which can make the result of a merge larger than the original. For a left merge you can expect the result to be between N_rows_left and N_rows_left * N_rows_right rows long. The lower bound occurs when neither the left nor the right DataFrame has duplicate merge keys; the upper bound occurs when every row in both DataFrames has the same single value for the merge keys.
Here's a worked example. All DataFrames are 4 rows long, but df2 has duplicate merge keys. As a result, when df2 is merged to df the output is longer than df, because for the row with key 2 in df both matching rows in df2 are returned (and likewise for key 3).
import pandas as pd
df = pd.DataFrame({'key': [1,2,3,4]})
df1 = pd.DataFrame({'row': range(4), 'key': [2,3,4,5]})
df2 = pd.DataFrame({'row': range(4), 'key': [2,2,3,3]})
# Neither frame duplicated on merge key, result is same length (4) as left.
df.merge(df1, on='key', how='left')
#    key  row
# 0    1  NaN
# 1    2  0.0
# 2    3  1.0
# 3    4  2.0
# df2 is duplicated on the merge keys so we get >4 rows
df.merge(df2, on='key', how='left')
#    key  row
# 0    1  NaN
# 1    2  0.0  # both `2` rows matched
# 2    2  1.0  # ^
# 3    3  2.0  # both `3` rows matched
# 4    3  3.0  # ^
# 5    4  NaN
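To guard against this kind of silent row multiplication, merge also accepts validate and indicator arguments. A minimal sketch continuing with the frames above (df3 and df4 are new throwaway frames illustrating the upper bound):
# Upper bound: the same single key on every row of both sides -> 4 * 4 = 16 rows.
df3 = pd.DataFrame({'key': [1, 1, 1, 1]})
df4 = pd.DataFrame({'row': range(4), 'key': [1, 1, 1, 1]})
print(len(df3.merge(df4, on='key', how='left')))  # 16

# validate='many_to_one' raises MergeError when the right side has duplicate
# keys, turning a silent blow-up into a loud failure.
try:
    df.merge(df2, on='key', how='left', validate='many_to_one')
except pd.errors.MergeError as e:
    print(e)

# indicator=True adds a `_merge` column ('left_only', 'right_only' or 'both'),
# which is useful for auditing what matched.
print(df.merge(df1, on='key', how='left', indicator=True))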
If the length of the merged dataframe is greater than the length of the left dataframe, it means that the right dataframe has multiple entries for the same joining key. For instance if you have these dataframes:
df1
---
id product
0 111 car
1 222 bike
df2
---
id color
0 111 blue
1 222 red
2 222 green
3 333 yellow
A merge will render 3 rows, because there are two possible matches for the row of df1 with id 222.
df1.merge(df2, on="id", how="left")
---
id product color
0 111 car blue
1 222 bike red
2 222 bike green
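If you only want one row back per id, you can deduplicate or aggregate the right frame before merging. A minimal sketch rebuilding the frames above:
import pandas as pd

df1 = pd.DataFrame({'id': [111, 222], 'product': ['car', 'bike']})
df2 = pd.DataFrame({'id': [111, 222, 222, 333],
                    'color': ['blue', 'red', 'green', 'yellow']})

# Option 1: keep only the first color per id, so the result stays 2 rows.
print(df1.merge(df2.drop_duplicates(subset='id'), on='id', how='left'))

# Option 2: collect all colors per id into a list before merging.
colors = df2.groupby('id', as_index=False)['color'].agg(list)
print(df1.merge(colors, on='id', how='left'))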
I am trying to merge two weekly DataFrames, each made up of one column, but with different lengths.
Could I please know how to merge them, maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: a new df with 3 columns (Week, Coeff1, Coeff2); since df2 is longer, I expect some NaNs in Coeff1 to fill the gaps.
I assume your output should look somewhat like this:
Week  Coeff1     Coeff2
1     -0.456662  0.571707
1     -0.533774  0.086152
1     -0.432871  0.824832
2      3         3
2      NaN       3
Don't mind the actual numbers though.
The problem is that you won't achieve that with a join on Week, neither left nor inner, because the Week key is not unique.
So, on a left join, pandas is going to join all the Coeff values from df2 where df2.Week == 1 onto every single row in df1 where df1.Week == 1, and that is why you get these millions of rows.
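You can check this arithmetic directly: the size of an inner join on Week is the sum, over all weeks, of the products of the per-week row counts in the two frames. A quick sketch, assuming df1 and df2 as posted:
# Per-week row counts, multiplied pairwise and summed, give the inner-join size.
expected = (df1['Week'].value_counts() * df2['Week'].value_counts()).sum()
print(expected)  # should reproduce the ~13.4 million rows observed above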
I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!
Now is later:
What you actually want to do is concatenate the DataFrames "per week".
You achieve that by iterating over every week, building a per-week subset that concatenates df1[week] and df2[week] along axis=1, and then concatenating all these subsets along axis=0 afterwards:
weekly_dfs = []
for week in df1.Week.unique():
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    # df2's column is named "Coeff" in the question; rename it on the way in
    sub_df2 = (df2.loc[df2.Week == week, "Coeff"]
                  .reset_index(drop=True)
                  .rename("Coeff2"))
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
df3 = pd.concat(weekly_dfs).reset_index(drop=True)
The last reset of the index is optional but I recommend it anyways!
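A loop-free alternative sketch, assuming the same df1 and df2: number the rows within each week on both sides with groupby().cumcount(), then do an outer merge on (Week, position), so rows without a counterpart in the other frame come out as NaN:
# Number the rows within each week on both sides, then pair them up.
df1 = df1.assign(rn=df1.groupby('Week').cumcount())
df2 = df2.assign(rn=df2.groupby('Week').cumcount())
df3 = (df1.merge(df2, on=['Week', 'rn'], how='outer')
          .rename(columns={'Coeff': 'Coeff2'})  # df2's column is named Coeff
          .sort_values(['Week', 'rn'])
          .drop(columns='rn')
          .reset_index(drop=True))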
Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
What you are looking for is a left join. However, merge's default is an inner join; according to pandas' merge documentation, you can change this by passing a different how argument:
df2.merge(df1,how='left', left_on='Week', right_on='Week')
Note that this keeps the rows of the bigger df and assigns NaN to them where there is no match in the shorter df.
I have multiple DataFrames, each containing a row called 'location' and another row called 'value' (both make up the index). For example, suppose I have the following 2:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[-4, 2, 5], ['nyc', 'sf', 'chi']]),
                   columns=['col1', 'col2', 'col3'], index=['value', 'location'])
df2 = pd.DataFrame(np.array([[5, 0, -3], ['nyc', 'sf', 'chi']]),
                   columns=['col1', 'col2', 'col3'], index=['value', 'location'])
The DataFrames will be housed in a dictionary that I can iterate through. Ultimately, I want to retrieve the list of 'value's for each 'location' in a separate DataFrame.
This is a toy example; my real case will have many more DataFrames, and the source DataFrames will have other rows besides the two key ones I am interested in.
I would recommend set_index and concat:
(pd.concat([df.T.set_index('location')['value'] for df in [df1, df2]], axis=1)
   .T
   .reset_index(drop=True))
location nyc sf chi
0 -4 2 5
1 5 0 -3
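Since the question says the frames are housed in a dictionary, the same pattern extends directly; a small sketch assuming a dict named frames (a hypothetical container name):
frames = {'first': df1, 'second': df2}  # hypothetical dict of source frames

result = (pd.concat([df.T.set_index('location')['value']
                     for df in frames.values()], axis=1)
            .T
            .reset_index(drop=True))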
Using merge
df1.T.merge(df2.T,on='location').set_index('location').T
location nyc sf chi
value_x -4 2 5
value_y 5 0 -3
I am merging two data frames using pandas.merge. Even after specifying the how='left' option, I found that the merged data frame has more rows than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like there is more than one row on the right under 'name2' matching a key from the left. Using the option how='left' with pandas.DataFrame.merge() only means:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
A B
0 a AAA
1 b BBA
2 c CCF
and then another DF that looks like this (notice that there is more than one entry matching your desired key from the left):
In [360]: df_3
Out[360]:
key value
0 a 1
1 a 2
2 b 3
3 a 4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
A B key value
0 a AAA a 1.0
1 a AAA a 2.0
2 a AAA a 4.0
3 b BBA b 3.0
4 c CCF NaN NaN
This happened even though I merged with how='left'. As you can see above, there was simply more than one row to match, so the resulting pd.DataFrame has in fact more rows than the pd.DataFrame on the left.
I hope this helps!
The problem of rows doubling after a merge() (of any type, e.g. 'inner' or 'left') is usually caused by duplicates in either of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
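Before dropping anything, it can help to see which keys are actually duplicated; a quick diagnostic sketch reusing the right_df/right_key names from above:
# Every right-hand row whose key occurs more than once; these are the
# rows that multiply the merge result.
dupes = right_df[right_df.duplicated(subset=right_key, keep=False)]
print(dupes.sort_values(right_key))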
If you do not have any duplication, as indicated in the answer above, you should double-check the key values themselves. In my case, I discovered that the names of the entries were inconsistent between df1 and df2, and I solved the problem by:
df1["col1"] = df2["col2"]
I've got a dataframe df_a with id information:
unique_id lacet_number
15 5570613 TLA-0138365
24 5025490 EMP-0138757
36 4354431 DXN-0025343
and another dataframe df_b, with the same number of rows that I know correspond to the rows in df_a:
latitude longitude
0 -93.193560 31.217029
1 -93.948082 35.360874
2 -103.131508 37.787609
What I want to do is simply concatenate the two horizontally (similar to cbind in R) and get:
unique_id lacet_number latitude longitude
0 5570613 TLA-0138365 -93.193560 31.217029
1 5025490 EMP-0138757 -93.948082 35.360874
2 4354431 DXN-0025343 -103.131508 37.787609
What I have tried:
df_c = pd.concat([df_a, df_b], axis=1)
which gives me an outer join.
unique_id lacet_number latitude longitude
0 NaN NaN -93.193560 31.217029
1 NaN NaN -93.948082 35.360874
2 NaN NaN -103.131508 37.787609
15 5570613 TLA-0138365 NaN NaN
24 5025490 EMP-0138757 NaN NaN
36 4354431 DXN-0025343 NaN NaN
The problem is that the indices of the two dataframes do not match. I read the documentation for pandas.concat and saw that there is an ignore_index option, but that only applies to the concatenation axis (in my case, the columns), so it is certainly not the right choice here. So my question is: is there a simple way to achieve this?
If you're sure the index values correspond row for row, then to avoid index alignment just call reset_index(drop=True) on df_a first; this resets its index to start from 0:
df_c = pd.concat([df_a.reset_index(drop=True), df_b], axis=1)
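If df_b's index were also non-default, the same trick applies to both sides; a one-line variant:
df_c = pd.concat([df_a.reset_index(drop=True),
                  df_b.reset_index(drop=True)], axis=1)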
DataFrame.join
While concat is fine, it's simpler to join:
C = A.join(B)
This still assumes aligned indexes, so reset_index as needed. In OP's example, B's index is already default, so we only need to reset A:
C = A.reset_index(drop=True).join(B)
# unique_id lacet_number latitude longitude
# 0 5570613 TLA-0138365 -93.193560 31.217029
# 1 5025490 EMP-0138757 -93.948082 35.360874
# 2 4354431 DXN-0025343 -103.131508 37.787609
You can use set_axis to make the index labels of one frame match the other's, and then concatenate horizontally or join. Unlike reset_index, this method preserves the index labels of one of the dataframes.
joined_df = pd.concat([df_a.set_axis(df_b.index), df_b], axis=1)
# or using `join`
joined_df = df_a.set_axis(df_b.index).join(df_b)
I have two DataFrames, df1:
ID value 1
0 5 162
1 7 185
2 11 156
and df2:
ID Comment
1 5
2 7 Yes!
6 11
... which I want to join using ID, with a result that looks like this:
ID value 1 Comment
5 162
7 185 Yes!
11 156
The real DataFrames are much larger and contain more columns, and I essentially want to add the Comment column from df2 to df1. I tried using
df1 = df1.join(df2['Comment'], on='ID')
... but that only gets me a new empty Comment column in df1, like .join somehow fails to use the ID column as the index. I have also tried
df1 = df1.join(df2['Comment'])
... but that uses the default indexes, which don't match between the two DataFrames (they also have different lengths), giving me a Comment value on the wrong place.
What am I doing wrong?
You can just do a merge to achieve what you want:
In [30]:
df1.merge(df2, on='ID')
Out[30]:
ID value1 Comment
0 5 162 None
1 7 185 Yes!
2 11 156 None
[3 rows x 3 columns]
The problem with join is that by default it performs a left join on the index; since your dataframes do not have common index values that match, your Comment column ends up empty.
EDIT
Following on from the comments: if you want to retain all rows in df1 and add only the comments that are non-empty and whose IDs exist in df1, you can perform a left merge:
df1.merge(df2.dropna(subset=['Comment']), on='ID', how='left')
This drops any rows with empty comments, then uses the ID column to merge df1 and df2; the left merge retains all rows on the left-hand side while merging in the comments whose IDs match. (The default how is 'inner', which retains only IDs present in both dataframes.)
Further information on merge, and further examples, can be found in the pandas documentation.