Pandas: join 'on' failing - python

I have two DataFrames, df1:
ID value 1
0 5 162
1 7 185
2 11 156
and df2:
ID Comment
1 5
2 7 Yes!
6 11
... which I want to join using ID, with a result that looks like this:
ID value 1 Comment
5 162
7 185 Yes!
11 156
The real DataFrames are much larger and contain more columns, and I essentially want to add the Comment column from df2 to df1. I tried using
df1 = df1.join(df2['Comment'], on='ID')
... but that only gets me a new empty Comment column in df1, like .join somehow fails to use the ID column as the index. I have also tried
df1 = df1.join(df2['Comment'])
... but that uses the default indexes, which don't match between the two DataFrames (they also have different lengths), giving me Comment values in the wrong places.
What am I doing wrong?

You can just do a merge to achieve what you want:
In [30]:
df1.merge(df2, on='ID')
Out[30]:
ID value1 Comment
0 5 162 None
1 7 185 Yes!
2 11 156 None
[3 rows x 3 columns]
The problem with join is that by default it performs a left join on the index; because your dataframes do not have common index values that match, your Comment column ends up being empty.
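If you want to stick with join, a minimal sketch (using the same frames as above) is to make ID the index on the right-hand side first, so the index join lines up:
# join aligns df1's ID column against the index of the right-hand object
df1 = df1.join(df2.set_index('ID')['Comment'], on='ID')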
EDIT
Following on from the comments, if you want to retain all values in df1 and add just the comments that are not empty and have IDs that exist in df1, then you can perform a left merge:
df1.merge(df2.dropna(subset=['Comment']), on='ID', how='left')
This will drop any rows with empty comments, then use the ID column to merge df1 and df2. Performing a left merge retains all rows on the left-hand side while merging in the comments whose IDs match; the default how='inner' would keep only IDs that are present in both the left and right dfs.
See the pandas documentation on merge for further information and examples.

Related

Delete row indices based on common columns in a Dataframe

I have the following two dataframes, df1 and df2:
final raw st
abc 12 10
abc 17 15
abc 14 17
and
final raw
abc 12
abc 14
My expected output is
final raw st
abc 17 15
I would like to delete rows based on common column values.
My try:
df1.isin(df2)
This is giving me a Boolean result. I also tried
df3 = pd.merge(df1, df2, on=['final', 'raw'], how='inner') so that we get all the rows common to df1 and df2.
You are close with merge; you just need an extra step. First perform an outer join to keep all rows from both dataframes and enable the merge indicator, then filter on this indicator to keep only the rows that come from df1 (left_only). Finally, keep only the columns of df1:
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how='outer', indicator=True) \
.query("_merge == 'left_only'")[df1.columns]
print(df3)
# Output
final raw st
1 abc 17 15
You need to refer to the correct column when using isin.
result = df1[~df1['raw'].isin(df2['raw'])]
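If both columns have to match (not only raw), a sketch of one variant is a membership test on a MultiIndex built from both columns:
# drop rows of df1 whose (final, raw) pair also appears in df2
mask = df1.set_index(['final', 'raw']).index.isin(df2.set_index(['final', 'raw']).index)
result = df1[~mask]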

How to merge two dataframes with different lengths in python

I am trying to merge two weekly DataFrames, which are made up of one column each, but with different lengths.
Could I please know how to merge them, maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: A new df which has 3 columns (week, coeff1, coeff2), as df2 is longer, I expect to have some NaNs in coeff1 to fill the gaps.
I assume your output should look somewhat like this:
Week  Coeff1     Coeff2
1     -0.456662  0.571707
1     -0.533774  0.086152
1     -0.432871  0.824832
2     3          3
2     NaN        3
Don't mind the actual numbers though.
The problem is that you won't achieve that with a join on Week, neither left nor inner, because the Week key is not unique.
So, on a left join, pandas is going to join all the Coeff2 values where df2.Week == 1 onto every single row in df1 where df1.Week == 1. And that is why you get these millions of rows.
I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!
Now is later:
What you actually want to do is to concatenate the Dataframes "per week".
You achieve that by iterating over every week, building a per-week subset by concatenating df1[week] and df2[week] along axis=1, and then concatenating all these subsets along axis=0 afterwards:
weekly_dfs = []
for week in df1.Week.unique():
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    # df2's column is named "Coeff" in the question; rename it so the result has Coeff1/Coeff2
    sub_df2 = df2.loc[df2.Week == week, "Coeff"].reset_index(drop=True).rename("Coeff2")
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)

df3 = pd.concat(weekly_dfs).reset_index(drop=True)
The last reset of the index is optional but I recommend it anyways!
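For reference, a loop-free sketch of the same idea, using groupby().cumcount() to number rows within each week and then merging on that counter (column names assumed from the question):
# cumcount() gives each row its position within its Week group
df1["n"] = df1.groupby("Week").cumcount()
df2["n"] = df2.groupby("Week").cumcount()
# an outer merge on (Week, n) keeps the surplus rows of the longer frame as NaN
df3 = (df1.merge(df2, on=["Week", "n"], how="outer")
          .sort_values(["Week", "n"])
          .drop(columns="n"))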
Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
According to pandas' merge documentation, you can use merge like this:
What you are looking for is a left join. However, the default option is an inner join. You can change this by passing a different how argument:
df2.merge(df1,how='left', left_on='Week', right_on='Week')
Note that this keeps the rows of the bigger df and assigns NaN to them where there is no match in the shorter df.

Understanding the nature of merge in pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I want to understand the pd.merge work nature. I have two dataframes that have unequal length. When trying to merge them through this command
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
The lengths were different from what I expected, as follows:
length of comps: 4829
length of surgical: 7939
length of merged: 9531
From my own understanding, the merged dataframe should have the same length as the comps dataframe, since a left join will look for matching keys in both dataframes and discard the rest. As comps is shorter than surgical, the merged length should be 4829. Why is it 9531, larger than the length of either? Even if I change the how parameter to "right", merged has more rows than expected.
Generally, I want to know how to merge two dataframes that have unequal lengths while specifying some columns from the right dataframe. Also, how do I validate the merge operation? You might find this helpful:
comps_ls: list of complications I want to throw on surgical dataframe.
mrn, Admission: the key columns I want to merge the two dataframes on.
Note: a teammate suggests this solution
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
merged = surgical.join(merged, on=['mrn'], how='left', lsuffix="", rsuffix="_r")
The length of the output was as follows
length of comps: 4829
length of surgical: 7939
length of merged: 7939
How can this help?
The "issue" is with duplicated merge keys, which can cause the resulting merge to be larger than the original. For a left merge you can expect the result to be in between N_rows_left and N_rows_left * N_rows_right rows long. The lower bound is in the case that both the left and right DataFrames have no duplicate merge keys, and the upper bound is the case when the left and right DataFrames have the single same value for the merge keys on every row.
Here's a worked example. All DataFrames are 4 rows long, but df2 has duplicate merge keys. As a result when df2 is merged to df the output is longer than df, because for the row with 2 as the key in df, both rows in df2 are matched.
import pandas as pd
df = pd.DataFrame({'key': [1,2,3,4]})
df1 = pd.DataFrame({'row': range(4), 'key': [2,3,4,5]})
df2 = pd.DataFrame({'row': range(4), 'key': [2,2,3,3]})
# Neither frame duplicated on merge key, result is same length (4) as left.
df.merge(df1, on='key', how='left')
# key row
#0 1 NaN
#1 2 0.0
#2 3 1.0
#3 4 2.0
# df2 is duplicated on the merge keys so we get >4 rows
df.merge(df2, on='key', how='left')
# key row
#0 1 NaN
#1 2 0.0 # Both `2` rows matched
#2 2 1.0 # ^
#3 3 2.0 # Both `3` rows matched
#4 3 3.0 # ^
#5 4 NaN
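Since the question also asks how to validate the merge: pd.merge accepts a validate argument that raises a MergeError when the stated key relationship does not hold. A quick sketch using the frames from the example above:
# df2 has duplicate keys, so a one-to-one validation fails
try:
    df.merge(df2, on='key', how='left', validate='one_to_one')
except pd.errors.MergeError as err:
    print(err)
# one_to_many only requires the left keys to be unique, so this passes
df.merge(df2, on='key', how='left', validate='one_to_many')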
If the length of the merged dataframe is greater than the length of the left dataframe, it means that the right dataframe has multiple entries for the same joining key. For instance if you have these dataframes:
df1
---
id product
0 111 car
1 222 bike
df2
---
id color
0 111 blue
1 222 red
2 222 green
3 333 yellow
A merge will render 3 rows, because there are two possible matches for the row of df1 with id 222.
df1.merge(df2, on="id", how="left")
---
id product color
0 111 car blue
1 222 bike red
2 222 bike green

Joining 101 columns from a dictionary of dataframes

For the love of God! I have 101 single column features and I just want to join, or merge, or concatenate them so they all have the index of the first frame. I have all the frames in a dict already! I thought that would be the hard part.
Below I've done manually what I'd like to do. What I'd like to do is loop through the dict and get all 101 columns.
a=ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/1byd.xls']
b=ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/2byd.xls']
c=ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/3byd.xls']
d=a.join(b['Value'],lsuffix='_caller')
f=d.join(c['Value'],lsuffix='_caller')
f
You will need to:
Create a first variable and set it to True. The first time we iterate through our dict() we don't have anything to merge our dataframe with, so we will just assign the value to a variable.
Set the first variable to False so that next time we will just merge our dataframes together.
Call df.merge() and set the left_index and right_index parameters to True so that the join happens on these indexes.
Below is a sample code.
Input
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4]})
df1 = pd.DataFrame({'col2': [11,12,13,14]})
df2 = pd.DataFrame({'col3': [111,112,113,114]})
d = {'df':df, 'df1':df1, 'df2':df2}
first = True
for key, value in d.items():
    if first:
        n = value
        first = False
    else:
        n = n.merge(value, left_index=True, right_index=True)

n.head()
Output
col1 col2 col3
0 1 11 111
1 2 12 112
2 3 13 113
3 4 14 114
Here is a link to the merge() documentation for more information.
I would like to add that, if you want to keep the keys of the dictionary as the column headers of the final dataframe, you just need to add this at the end:
n.columns=d.keys()
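For what it's worth, a more compact sketch of the same result is to concatenate all the single-column frames in one call along axis=1 (this assumes they all share the same index, as in the sample above):
# concatenate all frames side by side in a single call
n = pd.concat(d.values(), axis=1)
# optionally relabel the columns with the dictionary keys
n.columns = d.keys()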

Number of rows changes even after `pandas.merge` with `left` option

I am merging two data frames using pandas.merge. Even after specifying the how='left' option, I found that the number of rows of the merged data frame is larger than that of the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like the right frame having more than one row under 'name2' that matches the key you have set on the left. Using the option how='left' with pandas.DataFrame.merge() only means that:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
A B
0 a AAA
1 b BBA
2 c CCF
and then another DF that looks like this (notice that there are more than one entry for your desired key on the left):
In [360]: df_3
Out[360]:
key value
0 a 1
1 a 2
2 b 3
3 a 4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
A B key value
0 a AAA a 1.0
1 a AAA a 2.0
2 a AAA a 4.0
3 b BBA b 3.0
4 c CCF NaN NaN
This happened even though I merged with how='left'. As you can see above, there was simply more than one row to merge, and as shown here the resulting pd.DataFrame has in fact more rows than the pd.DataFrame on the left.
I hope this helps!
The problem of rows doubling after each merge() (of any type) is usually caused by duplicates in either of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
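To see where the duplicates are before dropping anything, a quick sketch of a check on the merge keys (using the column names from the question above):
# duplicated keys on the right-hand side are what inflate a left merge
print(panel['Candidate_u'].duplicated().sum())
print(prof_2000['name2'].duplicated().sum())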
If you do not have any duplication, as indicated in the above answer, you should double-check the names of the key entries. In my case, I discovered that the names were inconsistent between df1 and df2, and I solved the problem with:
df1["col1"] = df2["col2"]
