`pd.concat` with `join='inner'` doesn't produce the intersection of pandas dataframes - python

I am trying to extract common rows from several dataframes using pd.concat:
>>> import numpy as np
>>> import pandas as pd
>>> x = np.random.random(size=(5, 3))
>>> df1 = pd.DataFrame(x)
>>> df2 = pd.DataFrame(x[1:3])
>>> df3 = pd.DataFrame(x[2:4])
>>> df1
0 1 2
0 0.257662 0.453542 0.805230
1 0.060493 0.463148 0.715994
2 0.452379 0.470137 0.965709
3 0.447546 0.964252 0.163247
4 0.187212 0.973557 0.871090
>>> df2
0 1 2
0 0.060493 0.463148 0.715994
1 0.452379 0.470137 0.965709
>>> df3
0 1 2
0 0.452379 0.470137 0.965709
1 0.447546 0.964252 0.163247
As you can see, only the row 0.452379 0.470137 0.965709 is common to all three dataframes. To extract it, I tried:
>>> pd.concat([df1, df2, df3], join='inner')
0 1 2
0 0.257662 0.453542 0.805230
1 0.060493 0.463148 0.715994
2 0.452379 0.470137 0.965709
3 0.447546 0.964252 0.163247
4 0.187212 0.973557 0.871090
0 0.060493 0.463148 0.715994
1 0.452379 0.470137 0.965709
0 0.452379 0.470137 0.965709
1 0.447546 0.964252 0.163247
Thus, join='inner' doesn't seem to work! I should also point out that ignore_index=True has no effect on the behavior. An article on Real Python suggests passing axis=1, but as far as I can tell that is wrong too:
>>> pd.concat([df1, df2, df3], join='inner', axis=1)
0 1 2 0 1 2 0 1 2
0 0.257662 0.453542 0.805230 0.060493 0.463148 0.715994 0.452379 0.470137 0.965709
1 0.060493 0.463148 0.715994 0.452379 0.470137 0.965709 0.447546 0.964252 0.163247
What is wrong with what I am doing? Also, how would I extract common rows from several dataframes if this way doesn't work? I am using Pandas version 0.25.3.

In short, go with reduce(lambda left, right: pd.merge(left, right, on=cols), dfs) (see Method #2 below; make sure to include from functools import reduce), but read on for an explanation of what pd.concat is actually doing (Method #1):
Method #1 (concat): The most robust pd.concat approach I have found is to move the columns into the index and then concatenate along axis=1. Its only real advantage over the second method below is that it avoids an extra import; you could write similar code with merge just as easily:
dfs = [df1, df2, df3]
cols = [*df1.columns]  # [*df1.columns] is equivalent to df1.columns.tolist()
for df in dfs:
    df.set_index(cols, inplace=True)  # can only use inplace when looping through dfs
pd.concat(dfs, join='inner', axis=1).reset_index()  # see the paragraph below for an explanation
Out[1]:
0 1 2
0 0.452379 0.470137 0.965709
Please note that join='inner' joins on the index (with axis=1) or on the columns (with the default axis=0), NOT on the row values. In your original call, the three frames share the same columns, so the inner join of the columns changes nothing, which is why it looks like join is ignored.
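To see the column intersection in isolation, here is a small sketch with made-up frames that share only one column; every row survives, but only the common column is kept:
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'y': [5, 6], 'z': [7, 8]})
pd.concat([a, b], join='inner')  # keeps only the shared column 'y'
# y
# 0 3
# 1 4
# 0 5
# 1 6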
Method #2 (merge with reduce):
@Anky pointed out that how='inner' is the default for merge. This was actually the first answer I posted, but I got confused about the expected output and went full circle. The simplest version:
from functools import reduce
dfs = [df1, df2, df3]
cols = [*df1.columns]
reduce(lambda left, right: pd.merge(left, right, on=cols), dfs)
Out[2]:
0 1 2
0 0.452379 0.470137 0.965709

If you are looking for rows common to the dataframes:
temp = pd.concat([df1, df2, df3])
temp[temp.duplicated()]
I'm sure there is a more elegant solution to this, however.
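One caveat: duplicated() flags any row that occurs at least twice across the stack, so a row shared by only two of the three frames is flagged as well. A sketch that keeps only rows present in every frame (de-duplicating each frame first so the counts are reliable):
temp = pd.concat([df.drop_duplicates() for df in [df1, df2, df3]])
counts = temp.groupby(list(temp.columns)).size().reset_index(name='n')
common = counts[counts['n'] == 3].drop(columns='n')  # present in all 3 frames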

Try this,
df = pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3'])  # replace with your key column names
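For the question's frames, where the keys are the integer column labels, a concrete version of the same idea (a sketch) merges on every column and chains the merges:
cols = list(df1.columns)  # [0, 1, 2] in the question's example
common = pd.merge(pd.merge(df1, df2, on=cols), df3, on=cols)  # how='inner' is the default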

# add an extra tag column
df_list = [df1, df2, df3]
for i, dfi in enumerate(df_list):
    dfi['tag'] = i + 1
# stack the DataFrames into one
df = pd.concat([df1, df2, df3], ignore_index=True)
# find the duplicated rows
cols = df.columns[:-1].tolist()
cond = df[cols].duplicated(keep=False)
obj = df[cond].groupby(cols)['tag'].agg(tuple)
# filter
cond = obj.map(len) == len(df_list)
obj[cond]
obj example:
# 0 1 2
# 0.148080 0.837398 0.565498 (1, 3)
# 0.572673 0.256735 0.620923 (1, 2, 3)
# 0.822542 0.856137 0.645639 (1, 2)
# Name: tag, dtype: object
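To turn that filtered Series back into a plain frame of the common rows, one possible final step (a sketch reusing the names above) is:
common_rows = obj[cond].reset_index()[cols]  # drop the tag tuples, keep the row values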

In a similar fashion to what @Ajay A said,
import numpy as np
import pandas as pd
x = np.random.random(size=(5, 3))
df1 = pd.DataFrame(x)
df2 = pd.DataFrame(x[1:3])
df3 = pd.DataFrame(x[2:4])
Then,
df1
Out[22]:
0 1 2
0 0.845894 0.530659 0.629198
1 0.697229 0.225557 0.314540
2 0.972633 0.685077 0.191109
3 0.069966 0.961317 0.352933
4 0.176633 0.663602 0.235032
df2
Out[23]:
0 1 2
0 0.697229 0.225557 0.314540
1 0.972633 0.685077 0.191109
df3
Out[24]:
0 1 2
0 0.972633 0.685077 0.191109
1 0.069966 0.961317 0.352933
Then you can use pd.merge with how='inner'
pd.merge(df2, df3, how='inner')
Out[25]:
0 1 2
0 0.972633 0.685077 0.191109
or if what you are looking for is the intersection of all three,
pd.merge(pd.merge(df1,df2,how='inner'), df3, how='inner')
Out[26]:
0 1 2
0 0.972633 0.685077 0.191109
Use a for loop to handle a df_list.
df_list = [df1, df2, df3]
df_intersection = df1
for df in df_list[1:]:
    df_intersection = pd.merge(df_intersection, df, how='inner')
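If any of the frames can contain repeated rows, each merge multiplies the matches; a defensive variant of the same loop (a sketch) de-duplicates as it goes:
df_intersection = df_list[0].drop_duplicates()
for df in df_list[1:]:
    df_intersection = pd.merge(df_intersection, df.drop_duplicates(), how='inner')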

Related

How to concatenate dataframes considering column orders

I want to combine two dataframes:
df1=pd.DataFrame({'A':['a','a',],'B':['b','b']})
df2=pd.DataFrame({'B':['b','b'],'A':['a','a']})
pd.concat([df1,df2],ignore_index=True)
result:
A B
0 a b
1 a b
2 a b
3 a b
But I want the output to be like this (the same behaviour as SQL's UNION/UNION ALL, stacking by position):
A B
0 a b
1 a b
2 b a
3 b a
Another way is to use numpy to stack the two dataframes and then use the pd.DataFrame constructor:
pd.DataFrame(np.vstack([df1.values,df2.values]), columns = df1.columns)
Output:
A B
0 a b
1 a b
2 b a
3 b a
Here is a proposition to do an SQL UNION ALL with pandas by using pandas.concat:
list_dfs = [df1, df2]
out = (
pd.concat([pd.DataFrame(sub_df.to_numpy()) for sub_df in list_dfs],
ignore_index=True)
.set_axis(df1.columns, axis=1)
)
# Output :
print(out)
A B
0 a b
1 a b
2 b a
3 b a
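The same idea wrapped as a small helper (a sketch; union_all is a hypothetical name, and it assumes all frames have the same number of columns, with the first frame's labels winning):
def union_all(dfs):
    # Relabel columns by position so concat stops aligning on names,
    # then restore the first frame's labels.
    parts = [d.set_axis(range(d.shape[1]), axis=1) for d in dfs]
    return pd.concat(parts, ignore_index=True).set_axis(dfs[0].columns, axis=1)

union_all([df1, df2])
# A B
# 0 a b
# 1 a b
# 2 b a
# 3 b a
Unlike the NumPy route, this keeps each column's original dtype instead of upcasting everything through a single array.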

Pandas Dataframe convert column of lists to multiple columns

I am trying to convert a dataframe that has lists of various sizes, for example something like this:
d={'A':[1,2,3],'B':[[1,2,3],[3,5],[4]]}
df = pd.DataFrame(data=d)
df
A B
0 1 [1, 2, 3]
1 2 [3, 5]
2 3 [4]
to something like this:
d1={'A':[1,2,3],'B-1':[1,0,0],'B-2':[1,0,0],'B-3':[1,1,0],'B-4':[0,0,1],'B-5':[0,1,0]}
df1 = pd.DataFrame(data=d1)
df1
A B-1 B-2 B-3 B-4 B-5
0 1 1 1 1 0 0
1 2 0 0 1 0 1
2 3 0 0 0 1 0
Thank you for the help
explode the lists, then get_dummies and sum over the original index (use max [credit to @JonClements] if you want true dummies rather than counts, in case there can be repeats). Then join the result back:
dfB = pd.get_dummies(df['B'].explode()).sum(level=0).add_prefix('B-')
#dfB = pd.get_dummies(df['B'].explode()).max(level=0).add_prefix('B-')
df = pd.concat([df['A'], dfB], axis=1)
# A B-1 B-2 B-3 B-4 B-5
#0 1 1 1 1 0 0
#1 2 0 0 1 0 1
#2 3 0 0 0 1 0
You can use pop to remove the column you explode so you don't need to specify df[list_of_all_columns_except_B] in the concat:
df = pd.concat([df, pd.get_dummies(df.pop('B').explode()).sum(level=0).add_prefix('B-')],
               axis=1)
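Note that sum(level=0) was deprecated in pandas 1.3 and removed in 2.0; on recent versions the same idea, starting from the original df, is spelled with an explicit groupby (a sketch):
dfB = pd.get_dummies(df['B'].explode()).groupby(level=0).sum().add_prefix('B-')
df = pd.concat([df['A'], dfB], axis=1)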

How to merge many DataFrames by index combining values where columns overlap?

I have many DataFrames that I need to merge.
Let's say:
base: id constraint
1 'a'
2 'b'
3 'c'
df_1: id value constraint
1 1 'a'
2 2 'a'
3 3 'a'
df_2: id value constraint
1 1 'b'
2 2 'b'
3 3 'b'
df_3: id value constraint
1 1 'c'
2 2 'c'
3 3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value value_x value_y
1 'a' 1 NaN NaN
2 'b' NaN 2 NaN
3 'c' NaN NaN 3
The desired output would be:
id constraint value
1 'a' 1
2 'b' 2
3 'c' 3
I know about combine_first and it works, but I can't use this approach because it is thousands of times slower.
Is there a merge that can replace values when columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest concatenating your dataframes first (using a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
id value
0 1 1
1 2 2
2 3 3
Update
Running the code with the new version of your question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
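The merge itself follows the same pattern as above, now keyed on both columns (the exact call is not repeated in the answer, so this is a reconstruction):
df = pd.concat([df1, df2, df3])
pd.merge(base, df, on=['id', 'constrains'])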
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Which seems to be compliant with your expected output.
You can use ffill() for the purpose:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1, df_2, df_3), axis=1)
   .ffill(axis=1)
   .iloc[:, -1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1,df2,df3)),
on=['id','constraint'],
how='left')
output:
id constraint value
0 1 'a' 1
1 2 'b' 2
2 3 'c' 3
Conclusion: you are actually looking for the option how='left' in merge
If you only need to merge all dataframes with base (based on the edited question):
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1,df_2,df_3]
for i in dataframes:
    base = base.merge(i, how='left', on=['id', 'constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any',axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
Those who simply want to do a merge that overrides the values (which is my case) can achieve that using the method below, which is really similar to Celius Stingher's answer.
Documented version is on the original gist.
import pandas as pa

def rmerge(left, right, **kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum(([x] if not isinstance(x, list) else flatten(x) for x in lst), [])

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace': 'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # the pandas merge command
    kwargs = {k: v for k, v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on', 'left_on', 'right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))
        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols, axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols, axis=1)

    df = pa.merge(left, right, **kwargs)
    return df
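A usage sketch with hypothetical frames; the right frame's value column wins wherever both frames carry it, and rows present only in left end up with NaN:
left = pa.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.DataFrame({'id': [2, 3], 'value': [99, 98]})
rmerge(left, right, on='id', how='left')
# id value
# 0 1 NaN
# 1 2 99.0
# 2 3 98.0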

How to split pandas.DataFrame by index order?

If I have two pandas.DataFrame with the same columns.
df1 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
df2 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
I concatenate them into one:
df = pd.concat([df1, df2], ignore_index = False)
The index values now are not ignored.
After I perform some data manipulation without changing the index values, how can I reverse the concatenation, so that I end up with a list of the two data frames again?
I recommend using keys in concat
df = pd.concat([df1, df2], ignore_index = False,keys=['df1','df2'])
df
Out[28]:
a b c d e f
df1 0 0.426246 0.162134 0.231001 0.645908 0.282457 0.715134
1 0.973173 0.854198 0.419888 0.617750 0.115466 0.565804
2 0.474284 0.757242 0.452319 0.046627 0.935915 0.540498
3 0.046215 0.740778 0.204866 0.047914 0.143158 0.317274
4 0.311755 0.456133 0.704235 0.255057 0.558791 0.319582
df2 0 0.449926 0.330672 0.830240 0.861221 0.234013 0.299515
1 0.552645 0.620980 0.313907 0.039247 0.356451 0.849368
2 0.159485 0.620178 0.428837 0.315384 0.910175 0.020809
3 0.687249 0.824803 0.118434 0.661684 0.013440 0.611711
4 0.576244 0.915196 0.544099 0.750581 0.192548 0.477207
Convert back
df1,df2=[y.reset_index(level=0,drop=True) for _, y in df.groupby(level=0)]
df1
Out[30]:
a b c d e f
0 0.426246 0.162134 0.231001 0.645908 0.282457 0.715134
1 0.973173 0.854198 0.419888 0.617750 0.115466 0.565804
2 0.474284 0.757242 0.452319 0.046627 0.935915 0.540498
3 0.046215 0.740778 0.204866 0.047914 0.143158 0.317274
4 0.311755 0.456133 0.704235 0.255057 0.558791 0.319582
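Because keys builds a MultiIndex, a single frame can also be sliced back out directly with .loc:
df.loc['df1']  # same as df1 above, with the outer level dropped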
If you prefer to do without groupby, you could use this.
list_dfs = [df1, df2]
df = pd.concat(list_dfs, ignore_index = False)
new_dfs = []
counter = 0
for i in list_dfs:
    new_dfs.append(df[counter:counter + len(i)])
    counter += len(i)
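The same slicing can be written a little more compactly with running offsets (a sketch using itertools, and .iloc so it works regardless of the index values):
from itertools import accumulate
bounds = [0, *accumulate(len(d) for d in list_dfs)]
new_dfs = [df.iloc[s:e] for s, e in zip(bounds, bounds[1:])]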

Element-wise ternary conditional operation on dataframes

Say given dataframes df1, df2, df3, what is the best way to get df = df1 if (df2 > 0) else df3 element-wise?
You can use df.where to achieve this:
In [3]:
df1 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df2 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df3 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
print(df1)
print(df2)
print(df3)
a b c
0 -0.378401 1.456254 -0.327311
1 0.491864 -0.757420 -0.014689
2 0.028873 -0.906428 -0.252586
3 -0.686849 1.515643 1.065322
4 0.570760 -0.857298 -0.152426
a b c
0 1.273215 1.275937 -0.745570
1 -0.460257 -0.756481 1.043673
2 0.452731 1.071703 -0.454962
3 0.418926 1.395290 -1.365873
4 -0.661421 0.798266 0.384397
a b c
0 -0.641351 -1.469222 0.160428
1 1.164031 1.781090 -1.218099
2 0.096094 0.821062 0.815384
3 -1.001950 -1.851345 0.772869
4 -1.137854 1.205580 -0.922832
In [4]:
df = df1.where(df2 > 0, df3)
df
Out[4]:
a b c
0 -0.378401 1.456254 0.160428
1 1.164031 1.781090 -0.014689
2 0.028873 -0.906428 0.815384
3 -0.686849 1.515643 0.772869
4 -1.137854 -0.857298 -0.152426
Also:
df = df1[df2 > 0].combine_first(df3)
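A third option (a sketch, assuming all three frames share the same index and columns, since this goes through NumPy and ignores alignment) is np.where:
df = pd.DataFrame(np.where(df2 > 0, df1, df3), index=df1.index, columns=df1.columns)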
