How to do left outer join exclusion in pandas - python

I have two dataframes, A and B, and I want to get the rows that are in A but not in B (a left excluding join).
Dataframe A has columns ['a','b' + others] and B has columns ['a','b' + others]. There are no NaN values. I tried the following:
1.
dfm = dfA.merge(dfB, on=['a','b'])
dfe = dfA[(~dfA['a'].isin(dfm['a'])) | (~dfA['b'].isin(dfm['b']))]
2.
dfm = dfA.merge(dfB, on=['a','b'])
dfe = dfA[(~dfA['a'].isin(dfm['a'])) & (~dfA['b'].isin(dfm['b']))]
3.
dfe = dfA[(~dfA['a'].isin(dfB['a'])) | (~dfA['b'].isin(dfB['b']))]
4.
dfe = dfA[(~dfA['a'].isin(dfB['a'])) & (~dfA['b'].isin(dfB['b']))]
but when I compute len(dfm) and len(dfe), they don't sum to len(dfA) (they're off by a few rows). I've tried this on dummy cases and #1 works there, so my real dataset must have some peculiarity I'm unable to reproduce.
What's the right way to do this?
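A note on why the isin attempts miscount: isin tests each column independently, so a row whose 'a' value and 'b' value each occur somewhere in dfB, just never together in one row, is treated as matched anyway. A minimal sketch with made-up data:
import pandas as pd
dfA = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
dfB = pd.DataFrame({'a': [1, 9], 'b': [9, 2]})
# The pair (1, 2) never occurs in dfB, yet 1 appears in dfB['a'] and
# 2 appears in dfB['b'], so attempt #4 wrongly drops that row:
mask = (~dfA['a'].isin(dfB['a'])) & (~dfA['b'].isin(dfB['b']))
print(dfA[mask])  # only the (3, 4) row survives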

Use merge with how="outer" and indicator=True, then keep the left_only rows:
df = pd.merge(dfA, dfB, on=['a','b'], how="outer", indicator=True)
df = df[df['_merge'] == 'left_only']
One-liner:
df = pd.merge(dfA, dfB, on=['a','b'], how="outer", indicator=True
).query('_merge=="left_only"')
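A quick check with small made-up frames:
import pandas as pd
dfA = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'x': [7, 8, 9]})
dfB = pd.DataFrame({'a': [1, 2], 'b': [4, 9], 'y': [0, 0]})
dfe = pd.merge(dfA, dfB, on=['a','b'], how="outer", indicator=True
    ).query('_merge=="left_only"')
print(dfe)
# keeps the (2, 5) and (3, 6) rows: the ('a','b') pairs of dfA that never occur in dfB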

I think it would go something like the examples in: Pandas left outer join multiple dataframes on multiple columns
dfe = pd.merge(dfA, dfB, how='left', on=['a','b'], indicator=True)
dfe[dfe['_merge'] == 'left_only']
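Either way, indicator=True adds a helper column named _merge; you may want to drop it once the filtering is done:
dfe = dfe[dfe['_merge'] == 'left_only'].drop(columns='_merge')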

Related

Pandas merge multiple dataframes on one temporal index, with latest value from all others

I'm merging some dataframes which have a time index.
import pandas as pd
df1 = pd.DataFrame(['a', 'b', 'c'],
                   columns=pd.MultiIndex.from_product([['target'], ['key']]),
                   index=[
                       '2022-04-15 20:20:20.000000',
                       '2022-04-15 20:20:21.000000',
                       '2022-04-15 20:20:22.000000'])
df2 = pd.DataFrame(['a2', 'b2', 'c2', 'd2', 'e2'],
                   columns=pd.MultiIndex.from_product([['feature2'], ['keys']]),
                   index=[
                       '2022-04-15 20:20:20.100000',
                       '2022-04-15 20:20:20.500000',
                       '2022-04-15 20:20:20.900000',
                       '2022-04-15 20:20:21.000000',
                       '2022-04-15 20:20:21.100000'])
df3 = pd.DataFrame(['a3', 'b3', 'c3', 'd3', 'e3'],
                   columns=pd.MultiIndex.from_product([['feature3'], ['keys']]),
                   index=[
                       '2022-04-15 20:20:19.000000',
                       '2022-04-15 20:20:19.200000',
                       '2022-04-15 20:20:20.000000',
                       '2022-04-15 20:20:20.200000',
                       '2022-04-15 20:20:23.100000'])
then I use this merge procedure:
def merge(dfs: list[pd.DataFrame], targetColumn: 'str|tuple[str]'):
    from functools import reduce
    if len(dfs) == 0:
        return None
    if len(dfs) == 1:
        return dfs[0]
    for df in dfs:
        df.index = pd.to_datetime(df.index)
    merged = reduce(
        lambda left, right: pd.merge(
            left,
            right,
            how='outer',
            left_index=True,
            right_index=True),
        dfs)
    for col in merged.columns:
        if col != targetColumn:
            merged[col] = merged[col].fillna(method='ffill')
    return merged[merged[targetColumn].notna()]
like this:
merged = merge([df1, df2, df3], targetColumn=('target', 'key'))
which produces this:
                    target feature2 feature3
                       key     keys     keys
2022-04-15 20:20:20      a      NaN       c3
2022-04-15 20:20:21      b       d2       d3
2022-04-15 20:20:22      c       e2       d3
And it all works great. The problem is efficiency: notice that in the merge() procedure I use reduce and an outer merge to join the dataframes together, which can create a HUGE interim dataframe that then gets filtered down. But what if my PC doesn't have enough RAM to handle that huge dataframe in memory? That's the problem I'm trying to avoid.
I'm wondering if there's a way to avoid expanding the data out into a huge dataframe while merging.
Of course a regular old merge isn't sufficient, because it only merges on exactly matching indexes rather than on the latest temporal index at or before the target variable's observation:
df1.merge(df2, how='left', left_index=True, right_index=True)
Has this kind of thing been solved efficiently? It seems like a common data science issue, since no one wants to leak future information into their models, and everyone has various inputs to merge together...
You're in luck: pandas.merge_asof does exactly what you need!
We use the default direction='backward' argument:
A “backward” search selects the last row in the right DataFrame whose
‘on’ key is less than or equal to the left’s key.
Using your three example DataFrames:
import pandas as pd
from functools import reduce

# Convert all indexes to datetime
for df in [df1, df2, df3]:
    df.index = pd.to_datetime(df.index)

# Perform as-of merges
res = reduce(lambda left, right:
             pd.merge_asof(left, right, left_index=True, right_index=True),
             [df1, df2, df3])
print(res)
print(res)
                    target feature2 feature3
                       key     keys     keys
2022-04-15 20:20:20      a      NaN       c3
2022-04-15 20:20:21      b       d2       d3
2022-04-15 20:20:22      c       e2       d3
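If you also want to refuse matches that are too stale, merge_asof accepts a tolerance argument; the one-second cutoff below is just for illustration:
res = reduce(lambda left, right:
             pd.merge_asof(left, right, left_index=True, right_index=True,
                           tolerance=pd.Timedelta('1s')),
             [df1, df2, df3])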
Here's some code that works for your example. I'm not sure about more general cases of multi-indexed columns, but in any event it contains the basic ideas for merging on a single temporal index.
merged = df1.copy(deep=True)
for df in [df2, df3]:
    # Position of the latest row in df at or before each timestamp of merged (-1 if none)
    idxNew = df.index.get_indexer(merged.index, method='pad')
    idxMerged = [i for i, x in enumerate(idxNew) if x != -1]
    idxNew = [x for x in idxNew if x != -1]
    n = len(merged.columns)
    merged[df.columns] = None
    merged.iloc[idxMerged, n:] = df.iloc[idxNew, :].set_index(merged.index[idxMerged])
print(merged)
Output:
                            target feature2 feature3
                               key     keys     keys
2022-04-15 20:20:20.000000       a     None       c3
2022-04-15 20:20:21.000000       b       d2       d3
2022-04-15 20:20:22.000000       c       e2       d3

pandas merge by excluding certain columns from merge

I want to merge two dataframes like:
df1.columns = A, B, C, E, ..., D
df2.columns = A, B, C, F, ..., D
If I merge them, pandas merges on all common columns. Since the number of columns is high, I don't want to list them in on; I'd rather exclude the columns that shouldn't take part in the merge. How can I do that?
mdf = pd.merge(df1, df2, exclude='D')  # pseudocode: no such argument exists
I expect the result to be like:
mdf.columns = A, B, C, E, F ..., D_x, D_y
You mentioned you don't want to use on since the number of columns is high.
You can still build the on list programmatically, even with a lot of columns, restricting it to columns that exist in both frames (df1's E is not in df2, so merging on it would raise a KeyError):
mdf = pd.merge(df1, df2, on=[c for c in df1.columns if c != 'D' and c in df2.columns])
Or by using pd.Index.difference (this assumes every non-D column of df1 also exists in df2):
mdf = pd.merge(df1, df2, on=df1.columns.difference(['D']).tolist())
Another solution can be list.remove, but note that it mutates the list in place and returns None, so it cannot be inlined into the merge call:
cols = df1.columns.tolist()
cols.remove('D')
mdf = pd.merge(df1, df2, on=cols)
What about dropping the unwanted column after the merge?
You can use pandas.DataFrame.drop:
mdf = pd.merge(df1, df2).drop('D', axis=1)
or dropping before the merge:
mdf = pd.merge(df1.drop('D', axis=1), df2.drop('D', axis=1))
Note that both variants remove D from the result entirely (and the first still merges on D, since on defaults to all common columns), so you won't get the D_x/D_y columns from your expected output.
One solution is to take the intersection of the df1 and df2 columns and then its difference with ['D']:
mdf = pd.merge(df1, df2, on=df1.columns.intersection(df2.columns).difference(['D']).tolist())
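A quick end-to-end check of this approach, with made-up frames shaped like the question:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'E': [7, 8], 'D': [9, 0]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'F': [7, 8], 'D': [1, 2]})
keys = df1.columns.intersection(df2.columns).difference(['D']).tolist()
mdf = pd.merge(df1, df2, on=keys)
print(mdf.columns.tolist())  # ['A', 'B', 'C', 'E', 'D_x', 'F', 'D_y']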
The other solution is renaming the columns you want to exclude from the merge, so their names no longer collide:
df2.rename(columns={"D":"D_y"}, inplace=True)
mdf = pd.merge(df1, df2)

How to drop dataframe rows not in another dataframe?

I have a:
Dataframe df1 with columns A, B and C. A is the index.
Dataframe df2 with columns D, E and F. D is the index.
What’s an efficient way to drop from df1 all rows where B is not found in df2 (in D the index)?
Dropping rows whose values don't exist in the other dataframe is the same as selecting only the existing values. So you can filter df1.B by the index of df2 with Series.isin:
df3 = df1[df1.B.isin(df2.index)]
Or by DataFrame.merge, whose default inner join drops the rows of df1 whose B is missing from df2's index (a left join would keep them):
df3 = df1.merge(df2[[]], left_on='B', right_index=True)
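A tiny illustration, with made-up frames shaped like the question:
import pandas as pd
df1 = pd.DataFrame({'B': ['x', 'y', 'z'], 'C': [1, 2, 3]},
                   index=pd.Index(['a1', 'a2', 'a3'], name='A'))
df2 = pd.DataFrame({'E': [10, 20], 'F': [30, 40]},
                   index=pd.Index(['x', 'z'], name='D'))
print(df1[df1.B.isin(df2.index)])
# keeps rows a1 and a3, whose B values ('x' and 'z') exist in df2's index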

Perform concat in pandas without including common column twice

From Pandas documentation:
result = pd.concat([df1, df4], axis=1, join='inner')
How can I perform a concat operation without including the common columns twice? In this example, columns B and D appear twice after the concat even though they hold the same values; I only want them once.
Select the columns from df4 that are not in df1:
result = pd.concat([df1, df4[['F']]], axis=1, join='inner')
Or:
complementary = [c for c in df4 if c not in df1]
result = pd.concat([df1, df4[complementary]], axis=1, join='inner')
The latter expression will choose the complementary columns automatically.
P.S. If the columns with the same name are different in df1 and df4 (as it seems to be in your case), you can apply the same trick symmetrically and select only the complementary columns from df1.
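As a sketch, assuming frames shaped like the docs example (values made up):
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'D': ['D0', 'D1']})
df4 = pd.DataFrame({'B': ['B0', 'B1'], 'D': ['D0', 'D1'], 'F': ['F0', 'F1']})
complementary = [c for c in df4 if c not in df1]  # ['F']
result = pd.concat([df1, df4[complementary]], axis=1, join='inner')
print(result.columns.tolist())  # ['A', 'B', 'D', 'F']; B and D appear only once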

Pandas left join - how to replace values not present in second df with NaN

I have two dataframes which I am joining like so:
df3 = df1.join(df2.set_index('id'), on='id', how='left')
But I want values for ids that are present in df1 but not in df2 to be replaced with NaN (the left join just leaves the df1 values as they are). What's the easiest way to accomplish this?
I think you need Series.where with Series.isin:
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
Or numpy.where:
df1['id'] = np.where(df1['id'].isin(df2['id']), df1['id'], np.nan)
Sample:
df1 = pd.DataFrame({
    'id': list('abc'),
})
df2 = pd.DataFrame({
    'id': list('dmna'),
})
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
print (df1)
    id
0    a
1  NaN
2  NaN
Or solution with merge and indicator parameter:
df3 = df1.merge(df2, on='id', how='left', indicator=True)
df3['id'] = df3['id'].mask(df3.pop('_merge').eq('left_only'))
print (df3)
    id
0    a
1  NaN
2  NaN
