From Pandas documentation:
result = pd.concat([df1, df4], axis=1, join='inner')
How can I perform a concat operation but without including the common columns twice? I only want to include them once. In this example, columns B and D are repeated twice after concat but have the same values.
Select the columns from df4 that are not in df1:
result = pd.concat([df1, df4[['F']]], axis=1, join='inner')
Or:
complementary = [c for c in df4 if c not in df1]
result = pd.concat([df1, df4[complementary], axis=1, join='inner')
The latter expression will choose the complementary columns automatically.
P.S. If the columns with the same name are different in df1 and df4 (as it seems to be in your case), you can apply the same trick symmetrically and select only the complementary columns from df1.
Related
I want to merge two dataframes like:
df1.columns = A, B, C, E, ..., D
df2.columns = A, B, C, F, ..., D
If I merge them, it merges on all columns. Also since the number of columns is high I don't want to specify them in on. I prefer to exclude the columns which I don't want to be merged. How can I do that?
mdf = pd.merge(df1, df2, exclude D)
I expect the result be like:
mdf.columns = A, B, C, E, F ..., D_x, D_y
You mentioned you mentioned you don't want to use on "since the number of columns is much".
You could still use on this way even if there are a lot of columns:
mdf = pd.merge(df1, df2, on=[i for i in df1.columns if i != 'D'])
Or
By using pd.Index.difference
mdf = pd.merge(df1, df2, on=df1.columns.difference(['D']).tolist())
Another solution can be:
mdf = pd.merge(df1, df2, on= df1.columns.tolist().remove('D')
What about dropping the unwanted column after the merge?
You can use pandas.DataFrame.drop:
mdf = pd.merge(df1, df2).drop('D', axis=1)
or dropping before the merge:
mdf = pd.merge(df1.drop('D', axis=1), df2.drop('D', axis=1))
One solution is using intersection and then difference on df1 and df2 columns:
mdf = pd.merge(df1, df2, on=df1.columns.intersection(df2.columns).difference(['D']).tolist())
The other solution could be renaming columns you want to exclude from merge:
df2.rename(columns={"D":"D_y"}, inplace=True)
mdf = pd.merge(df1, df2)
I have a:
Dataframe df1 with columns A, B and C. A is the index.
Dataframe df2 with columns D, E and F. D is the index.
What’s an efficient way to drop from df1 all rows where B is not found in df2 (in D the index)?
If need drop some not exist values it is same like select only existing values. So is possible use:
You can filter df1.B by index from df2 in Series.isin:
df3 = df1[df1.B.isin(df2.index)]
Or by DataFrame.merge with left join:
df3 = df1.merge(df2[[]], left_on='B', right_index=True, how='left')
I have two dataframe df1, df2
df1.columns
['id','a','b']
df2.columns
['id','ab','cd','ab_test','mn_test']
Expected out column is ['id','a','b','ab_test','mn_test']
How to get the all the columns from df1, and columns which contain test in the column name
pseudocode > pd.merge(df1,df2,how='id')
You can merge and use filter one the second dataframe to keep the columns of interest:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:,(df2.columns=='id')|df2.columns.str.contains('test')], on='id')
df1 = pd.DataFrame(columns=['id','a','b'])
df2 = pd.DataFrame(columns=['id','ab','cd','ab_test','mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')
I have to append two data sets. They have completely different rows and columns. I have tried the command:
df1 = pd.merge(df1, df2)but it gives an error.Data Frame 1
Data Frame 2
if they have the same number of columns and are on the same order, you could do :
df2.columns = df1.columns
df_concat = pd.concat([df1, df2], ignore_index=True)
I have two dataframes, A and B, and I want to get those in A but not in B, just like the one right below the top left corner.
Dataframe A has columns ['a','b' + others] and B has columns ['a','b' + others]. There are no NaN values. I tried the following:
1.
dfm = dfA.merge(dfB, on=['a','b'])
dfe = dfA[(~dfA['a'].isin(dfm['a']) | (~dfA['b'].isin(dfm['b'])
2.
dfm = dfA.merge(dfB, on=['a','b'])
dfe = dfA[(~dfA['a'].isin(dfm['a']) & (~dfA['b'].isin(dfm['b'])
3.
dfe = dfA[(~dfA['a'].isin(dfB['a']) | (~dfA['b'].isin(dfB['b'])
4.
dfe = dfA[(~dfA['a'].isin(dfB['a']) & (~dfA['b'].isin(dfB['b'])
but when I get len(dfm) and len(dfe), they don't sum up to dfA (it's off by a few numbers). I've tried doing this on dummy cases and #1 works, so maybe my dataset may have some peculiarities I am unable to reproduce.
What's the right way to do this?
Check out this link
df = pd.merge(dfA, dfB, on=['a','b'], how="outer", indicator=True)
df = df[df['_merge'] == 'left_only']
One liner :
df = pd.merge(dfA, dfB, on=['a','b'], how="outer", indicator=True
).query('_merge=="left_only"')
I think it would go something like the examples in: Pandas left outer join multiple dataframes on multiple columns
dfe = pd.merge(dFA, dFB, how='left', on=['a','b'], indicator=True)
dfe[dfe['_merge'] == 'left_only']