Join columns with the same names from two dataframes, pandas - python

I am looking for the fastest way to join columns with the same names using a separator.
My dataframes:
df1:
A,B,C,D
my,he,she,it
df2:
A,B,C,D
dog,cat,elephant,fish
expected output:
df:
A,B,C,D
my:dog,he:cat,she:elephant,it:fish
As you can see, I want to merge columns with the same names, combining two cells into one.
I can use this code for column A:
df = df1.merge(df2)
df['A'] = df[['A_x', 'A_y']].apply(lambda x: ':'.join(x), axis=1)
In my real dataset I have over 30 columns, and I don't want to repeat the same lines for each of them. Is there a faster way to get my expected output?

How about concat and groupby?
df3 = pd.concat([df1, df2], axis=0)
df3 = df3.groupby(df3.index).transform(lambda x: ':'.join(x)).drop_duplicates()
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
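A slightly shorter variant of the same idea, sketched on the frames from the question, replaces transform + drop_duplicates with agg (this assumes every column holds strings):
import pandas as pd
df1 = pd.DataFrame({'A': ['my'], 'B': ['he'], 'C': ['she'], 'D': ['it']})
df2 = pd.DataFrame({'A': ['dog'], 'B': ['cat'], 'C': ['elephant'], 'D': ['fish']})
# group the stacked rows by their original index and join each group's strings
df3 = pd.concat([df1, df2]).groupby(level=0).agg(':'.join)
print(df3)
#         A       B             C        D
# 0  my:dog  he:cat  she:elephant  it:fish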

How about this?
df3 = df1 + ':' + df2
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
This is good because if there are columns that don't match, you get NaN, so you can filter them out later if you want:
df1 = pd.DataFrame({'A': ['my'], 'B': ['he'], 'C': ['she'], 'D': ['it'], 'E': ['another'], 'F': ['and another']})
df2 = pd.DataFrame({'A': ['dog'], 'B': ['cat'], 'C': ['elephant'], 'D': ['fish']})
df1 + ':' + df2
A B C D E F
0 my:dog he:cat she:elephant it:fish NaN NaN
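If you then want to keep only the columns present in both frames, one possible follow-up (a sketch, assuming the unmatched columns should simply be discarded) is to drop the all-NaN columns:
df3 = (df1 + ':' + df2).dropna(axis=1, how='all')
print(df3)
#         A       B             C        D
# 0  my:dog  he:cat  she:elephant  it:fish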

You can do this by simply adding the two dataframes with a separator.
import pandas as pd
df1 = pd.DataFrame(columns=["A", "B", "C", "D"], index=[0])
df2 = pd.DataFrame(columns=["A", "B", "C", "D"], index=[0])
df1["A"] = "my"
df1["B"] = "he"
df1["C"] = "she"
df1["D"] = "it"
df2["A"] = "dog"
df2["B"] = "cat"
df2["C"] = "elephant"
df2["D"] = "fish"
print(df1)
print(df2)
df3 = df1 + ':' + df2
print(df3)
This will give you a result like:
A B C D
0 my he she it
A B C D
0 dog cat elephant fish
A B C D
0 my:dog he:cat she:elephant it:fish
Is this what you are trying to achieve? Note that this only works if you have the same columns in both dataframes; the extra columns will have NaNs. What do you want to do with the columns that are not shared between df1 and df2? Please comment below to help me understand your problem better.

You can simply do:
df = df1 + ':' + df2
print(df)
which is simple and effective. This should be your answer.

Related

pd.merge and check changed data

if I have these dataframes
df1 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','b','c','d'],
                    'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','e','f','d'],
                    'col2': ['h','e','lp','p']})
df1
index col1 col2
0 1 a h
1 2 b e
2 3 c l
3 4 d p
df2
index col1 col2
0 1 a h
1 2 e e
2 3 f lp
3 4 d p
I want to merge them and see whether or not the rows are different and get an output like this
index col1 col1_validation col2 col2_validation
0 1 a True h True
1 2 b False e True
2 3 c False l False
3 4 d True p True
How can I achieve that?
It looks like col1 and col2 in your "merged" dataframe are just taken from df1. In that case, you can simply compare col1 and col2 between the original dataframes and add the results as columns:
cols = ["col1", "col2"]
val_cols = ["col1_validation", "col2_validation"]
# (optional) new dataframe, so you don't mutate df1
df = df1.copy()
new_cols = (df1[cols] == df2[cols])
df[val_cols] = new_cols
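For completeness, here is a runnable sketch of that approach which also interleaves each validation column right after the column it checks, matching the layout in the question (the interleaving step is my own addition):
import pandas as pd
df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'b', 'c', 'd'],
                    'col2': ['h', 'e', 'l', 'p']})
df2 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'e', 'f', 'd'],
                    'col2': ['h', 'e', 'lp', 'p']})
df = df1.copy()
for c in ['col1', 'col2']:
    # insert each *_validation column directly after the column it checks
    df.insert(df.columns.get_loc(c) + 1, f'{c}_validation', df1[c] == df2[c])
print(df)
#    index col1  col1_validation col2  col2_validation
# 0      1    a             True    h             True
# 1      2    b            False    e             True
# 2      3    c            False    l            False
# 3      4    d             True    p             True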
You can merge and compare the two data frames with something similar to the following:
df1 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','b','c','d'],
                    'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','e','f','d'],
                    'col2': ['h','e','lp','p']})
# give columns unique names when merging
df1.columns = df1.columns + '_df1'
df2.columns = df2.columns + '_df2'
# merge/combine data frames
combined = pd.concat([df1, df2], axis = 1)
# add calculated columns
combined['col1_validation'] = combined['col1_df1'] == combined['col1_df2']
combined['col2_validation'] = combined['col2_df1'] == combined['col2_df2']
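With many column pairs, the validation columns could also be generated in a loop instead of one assignment per column (a sketch assuming every '_df1' column has a matching '_df2' counterpart; the index column is excluded here):
# derive the shared base column names instead of hardcoding them
base_cols = [c[:-len('_df1')] for c in combined.columns
             if c.endswith('_df1') and not c.startswith('index')]
for c in base_cols:
    combined[f'{c}_validation'] = combined[f'{c}_df1'] == combined[f'{c}_df2']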

How to remove all rows from one dataframe that are part of another dataframe? [duplicate]

This question already has answers here:
pandas - filter dataframe by another dataframe by row elements
(7 answers)
Closed 2 years ago.
I have two dataframes like this
import pandas as pd
df1 = pd.DataFrame(
    {
        'A': list('abcaewar'),
        'B': list('ghjglmgb'),
        'C': list('lkjlytle'),
        'ignore': ['stuff'] * 8
    }
)
df2 = pd.DataFrame(
    {
        'A': list('abfu'),
        'B': list('ghio'),
        'C': list('lkqw'),
        'stuff': ['ignore'] * 4
    }
)
and I would like to remove all rows in df1 where A, B and C are identical to values in df2, so in the above case the expected outcome is
A B C ignore
0 c j j stuff
1 e l y stuff
2 w m t stuff
3 r b e stuff
One way of achieving this would be
comp_columns = ['A', 'B', 'C']
df1 = df1.set_index(comp_columns)
df2 = df2.set_index(comp_columns)
keep_ind = [
    ind for ind in df1.index if ind not in df2.index
]
new_df1 = df1.loc[keep_ind].reset_index()
Does anyone see a more straightforward way of doing this which avoids the reset_index() operations and the loop to identify non-overlapping indices, e.g. by a smart way of masking? Ideally, I don't have to hardcode the columns, but can define them in a list as above, as I sometimes need 2, sometimes 3, and sometimes 4 or more columns for the removal.
Use DataFrame.merge with the optional parameter indicator=True, then use boolean masking to filter the rows in df1:
df3 = df1.merge(df2[['A', 'B', 'C']], on=['A', 'B', 'C'], indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]
Result:
# print(df3)
A B C ignore
2 c j j stuff
4 e l y stuff
5 w m t stuff
7 r b e stuff
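Since the question asks specifically about a smarter way of masking, here is an alternative sketch that builds a boolean mask from a MultiIndex membership test; it avoids both the merge and the reset_index round-trip, and comp_columns stays a configurable list:
comp_columns = ['A', 'B', 'C']
# test each (A, B, C) tuple in df1 for membership in df2's tuples
mask = pd.MultiIndex.from_frame(df1[comp_columns]).isin(
    pd.MultiIndex.from_frame(df2[comp_columns])
)
new_df1 = df1[~mask]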

pandas: dataframes row-wise comparison

I have two data frames that I would like to compare for equality in a row-wise manner. I am interested in computing the number of rows that have the same values for non-joined attributes.
For example,
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
I will be joining these two data frames on column a and b. There are two rows (first two) that have the same values for c and d in both the data frames.
I am currently using the following approach, where I first join the two data frames and then check each row's values for equality.
df = df1.merge(df2, on=['a','b'])
cols1 = [c for c in df.columns.tolist() if c.endswith("_x")]
cols2 = [c for c in df.columns.tolist() if c.endswith("_y")]
num_rows_equal = 0
for index, row in df.iterrows():
    not_equal = False
    for col1, col2 in zip(cols1, cols2):
        if row[col1] != row[col2]:
            not_equal = True
            break
    if not not_equal:  # row values are equal
        num_rows_equal += 1
num_rows_equal
Is there a more efficient (pythonic) way to achieve the same result?
A shorter way of achieving that:
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
df = df1.merge(df2, on=['a','b'])
comparison_cols = [c[:-len('_x')] for c in df.columns.tolist() if c.endswith('_x')]  # note: str.strip('_x') would also eat legitimate leading/trailing 'x' characters
num_rows_equal = (df1[comparison_cols][df1[comparison_cols] == df2[comparison_cols]].isna().sum(axis=1) == 0).sum()
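Another option is to compare the merged halves as NumPy arrays, which avoids iterrows entirely (a sketch assuming cols1 and cols2 line up pairwise, as they do after a merge; NaN compares unequal, matching the original loop):
cols1 = [c for c in df.columns if c.endswith('_x')]
cols2 = [c for c in df.columns if c.endswith('_y')]
num_rows_equal = (df[cols1].to_numpy() == df[cols2].to_numpy()).all(axis=1).sum()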
Use pandas.merge_ordered, merging with how='inner'. From there you can get your dataframe's shape and, by extension, your number of rows.
df_r = pd.merge_ordered(df1,df2,how='inner')
a b c d
0 1 2 60 50
1 2 3 20 90
no_of_rows = df_r.shape[0]
#print(no_of_rows)
#2
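For this particular case, a plain inner merge on all columns gives the same count, since only fully identical rows survive (a sketch; duplicated rows in the inputs would inflate the count):
num_rows_equal = len(df1.merge(df2))
#2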

Update a dataframe by dataframes with NaN values

I am trying to update a DataFrame
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
with another DataFrame
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
Now, my aim is to update df1 with df2 and overwrite all values (NaN values too) using
df1.update(df2)
In contrast to the common use case, it's important to me that the NaN values end up in df1.
But as far as I can see, the update returns
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
Is there a way to get
>>> df1
A B
0 1 9
1 2 NaN
2 3 11
3 4 NaN
without building df1 manually?
I am late to the party, but I was recently confronted with the same issue, i.e. trying to update a dataframe without ignoring NaN values the way the built-in pandas update method does.
For two dataframes sharing the same column names, a workaround would be to concatenate both dataframes and then remove duplicates, only keeping the last entry:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [9, np.nan, 11, np.nan]})
frames = [df1, df2]
df_concatenated = pd.concat(frames)
df1=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]
Depending on indexing, it might be necessary to sort the indices of the output dataframe:
df1=df1.sort_index()
To address your very specific example, in which df2 does not have a column A, you could run:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
frames = [df1, df2]
df_concatenated = pd.concat(frames)
df1['B']=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]['B']
It also works fine for me. You could perhaps use np.nan instead of 'nan'?
I guess you meant [9, np.nan, 11, np.nan], not the string "nan".
If you are not required to use update(), then do df1.B = df2.B instead, so that the new df1.B will contain NaN.
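Generalizing that comment to several shared columns: plain column assignment aligns on the index and copies NaNs through, unlike update() (a sketch assuming df1 and df2 share the same index):
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
# assigning whole columns overwrites everything, NaN values included
df1[df2.columns] = df2
print(df1)
#    A     B
# 0  1   9.0
# 1  2   NaN
# 2  3  11.0
# 3  4   NaN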
DataFrame.update() only updates non-NA values. See docs
Approach 1: Drop all affected columns
I achieved this by dropping the new columns and joining the data from the replacement DataFrame:
df1 = df1.drop(columns=df2.columns).join(df2)
This tells Pandas to remove the columns from df1 that you're about to recreate using the values from df2. Note that the column order changes since the new columns are appended to the end.
Approach 2: Preserve column order
Loop over all columns in the replacement DataFrame, inserting affected columns in the target DataFrame in their original place after dropping the original. If the replacement DataFrame includes a column not in the target DataFrame, it will be appended to the end.
for col in df2.columns:
    try:
        col_pos = list(df1.columns).index(col)
        df1.drop(columns=[col], inplace=True)
        df1.insert(col_pos, col, df2[col])
    except ValueError:
        df1[col] = df2[col]
Caveat
With both of these approaches, if your indices do not match between df1 and df2, the missing indices from df2 will end up NaN in your output DataFrame:
df1 = pd.DataFrame(data = {'B' : [1,2,3,4,5], 'A' : [5,6,7,8,9]}) # Note the additional row
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df1.update(df2)
Output:
>>> df1
B A
0 9.0 5
1 2.0 6
2 11.0 7
3 4.0 8
4 5.0 9
My version 1:
df1 = pd.DataFrame(data = {'A' : [1,2,3,4,5], 'B' : [5,6,7,8,9]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df1 = df1.drop(columns=df2.columns).join(df2)
Output:
>>> df1
A B
0 5 9.0
1 6 NaN
2 7 11.0
3 8 NaN
4 9 NaN
My version 2:
df1 = pd.DataFrame(data = {'A' : [1,2,3,4,5], 'B' : [5,6,7,8,9]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
for col in df2.columns:
    try:
        col_pos = list(df1.columns).index(col)
        df1.drop(columns=[col], inplace=True)
        df1.insert(col_pos, col, df2[col])
    except ValueError:
        df1[col] = df2[col]
Output:
>>> df1
B A
0 9.0 5
1 NaN 6
2 11.0 7
3 NaN 8
4 NaN 9
A usable trick is to fill with a placeholder string like 'n/a', then replace 'n/a' with np.nan, and convert the column type back to float:
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, 'n/a', 11, 'n/a']})
df1.update(df2)
df1['B'] = df1['B'].replace({'n/a':np.nan})
df1['B'] = df1['B'].apply(pd.to_numeric, errors='coerce')
Some explanation about the type conversion: after the call to replace, the result is:
A B
0 1 9.0
1 2 NaN
2 3 11.0
3 4 NaN
This looks acceptable, but actually the type of column B has changed from float to object.
df1.dtypes
will give
A int64
B object
dtype: object
To set it back to float, you can use:
df1['B'] = df1['B'].apply(pd.to_numeric, errors='coerce')
And then you will have the expected result:
df1.dtypes
will give the expected type:
A int64
B float64
dtype: object
pandas.DataFrame.update doesn't replace values with NaN by default, so to circumvent this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df2.replace(np.nan, 'NAN', inplace = True)
df1.update(df2)
df1.replace('NAN', np.nan, inplace = True)

Pandas merge creates unwanted duplicate entries

I'm new to pandas and I want to merge two datasets that have similar columns. Each column will have some values unique to it, in addition to many values identical in both. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.
import pandas as pd
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
print(pd.merge(df1,df2))
output:
A
0 2
1 2
2 2
3 2
4 3
5 4
6 5
desired/expected output:
A
0 2
1 2
2 3
3 4
4 5
Please let me know how/if I can achieve the desired output using merge, thank you!
EDIT
To clarify why I'm confused about this behavior: if I simply add another column, it doesn't make four 2's; rather, there are only two 2's, so I would expect my first example to also have only two 2's. Why does the behavior seem to change? What is pandas doing?
import pandas as pd
df1 = df2 = pd.DataFrame(
    {'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)
print(pd.merge(df1,df2))
output:
A B
0 2 red
1 2 orange
2 3 yellow
3 4 green
4 5 blue
However, based on the first example I would expect:
A B
0 2 red
1 2 orange
2 2 red
3 2 orange
4 3 yellow
5 4 green
6 5 blue
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()
df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)
print(df)
Output:
A
0 2
1 2
2 3
3 4
4 5
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df3 = df1.merge(df2).drop(columns='index')
The idea is to merge on the matching indices as well as the matching 'A' column values.
Previously, since merge matches on values alone, the first 2 in df1 was matched to both the first and second 2 in df2, and the second 2 in df1 was likewise matched to both.
If you try this, you will see what I am talking about:
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df1.merge(df2, on = 'A')
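For reference, that last merge makes the row multiplication visible: each 2 in df1 pairs with each 2 in df2, which is the Cartesian-product behavior described above (output sketched from the frames as defined):
#    A  index_x  index_y
# 0  2        0        0
# 1  2        0        1
# 2  2        1        0
# 3  2        1        1
# 4  3        2        2
# 5  4        3        3
# 6  5        4        4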
Did you try df.drop_duplicates()?
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df=pd.merge(df1,df2)
df_new=df.drop_duplicates()
print(df)
print(df_new)
It seems to give the results you want.
The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
join_cols = ['A']
merged = pd.merge(df1, df2[~df2.duplicated(subset=join_cols, keep='first')], on=join_cols)
Note that we defined join_cols to ensure the columns being joined on and the columns duplicates are removed on are the same.
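A quick check of the result (sketched; the printed frame should match the desired output from the question):
print(merged)
#    A
# 0  2
# 1  2
# 2  3
# 3  4
# 4  5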
I unfortunately stumbled upon a similar problem, which I see is now old.
I solved it by using this function in a different way, applying it to the two original tables even though there were no duplicates in those. This is an example (I apologize, I am not a professional programmer):
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1=df1.drop_duplicates()
df2 = pd.DataFrame(dict2)
df2=df2.drop_duplicates()
df=pd.merge(df1,df2)
print('df1:')
print( df1 )
print('df2:')
print( df2 )
print('df:')
print( df )
