pandas: dataframes row-wise comparison - python

I have two data frames that I would like to compare for equality in a row-wise manner. I am interested in computing the number of rows that have the same values for non-joined attributes.
For example,
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
I will be joining these two data frames on column a and b. There are two rows (first two) that have the same values for c and d in both the data frames.
I am currently using the following approach, where I first join the two data frames and then check each row's values for equality.
df = df1.merge(df2, on=['a','b'])
cols1 = [c for c in df.columns.tolist() if c.endswith("_x")]
cols2 = [c for c in df.columns.tolist() if c.endswith("_y")]
num_rows_equal = 0
for index, row in df.iterrows():
    not_equal = False
    for col1, col2 in zip(cols1, cols2):
        if row[col1] != row[col2]:
            not_equal = True
            break
    if not not_equal:  # row values are equal
        num_rows_equal += 1
num_rows_equal
Is there a more efficient (pythonic) way to achieve the same result?

A shorter way of achieving that:
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
df = df1.merge(df2, on=['a','b'])
# note: str.strip('_x') strips characters, not a suffix, and would mangle
# column names that contain 'x' or '_'; slicing the suffix off is safer
comparison_cols = [c[:-len('_x')] for c in df.columns if c.endswith('_x')]
# this compares df1 and df2 positionally, relying on the rows already being aligned on a and b
num_rows_equal = (df1[comparison_cols][df1[comparison_cols] == df2[comparison_cols]].isna().sum(axis=1) == 0).sum()
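An alternative that stays fully vectorized on the merged frame itself (a sketch; it assumes every non-key column picked up the _x/_y suffixes during the merge):

cols_x = [c for c in df.columns if c.endswith('_x')]
cols_y = [c[:-2] + '_y' for c in cols_x]
# count rows where every _x value equals its _y counterpart
num_rows_equal = (df[cols_x].to_numpy() == df[cols_y].to_numpy()).all(axis=1).sum()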

Use pd.merge_ordered with how='inner'. Since merge_ordered joins on all shared columns by default, only rows that are identical in both frames survive; from there, the dataframe's shape gives your number of rows.
df_r = pd.merge_ordered(df1,df2,how='inner')
   a  b   c   d
0  1  2  60  50
1  2  3  20  90
no_of_rows = df_r.shape[0]
#print(no_of_rows)
#2
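For what it's worth, a plain inner merge on all shared columns gives the same count, since pd.merge defaults to how='inner' and joins on every common column (a small sketch):

no_of_rows = df1.merge(df2).shape[0]  # inner join on a, b, c and d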

Python: how to 'concatenate' 2 pd.DataFrame columns? Two columns into one

I have trouble with some pandas dataframes.
It's very simple: I have 4 columns, and I want to reshape them into 2.
For 'practical' reasons, I don't want to use the header names, but I need to select the columns by index instead.
I have :
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
I want as a result :
df_res = pd.DataFrame({'NewName1': [1,2,3,4,5,6],'NewName2': [7,8,9,10,11,12]})
(in fact the name NewName1 doesn't matter, it can stay a or whatever...)
I tried with for loops, append, and concat, but couldn't figure it out...
Any suggestions?
Thanks for your help!
Bina
You can extract the desired columns and create a new pandas.DataFrame like so:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
first_col = np.concatenate((df.a.to_numpy(), df.b.to_numpy()))
second_col = np.concatenate((df.c.to_numpy(), df.d.to_numpy()))
df2 = pd.DataFrame({"NewName1": first_col, "NewName2": second_col})
>>> df2
   NewName1  NewName2
0         1         7
1         2         8
2         3         9
3         4        10
4         5        11
5         6        12
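If you would rather stay in pandas throughout, a concat-based variant (a sketch) produces the same frame:

df2 = pd.DataFrame({
    "NewName1": pd.concat([df.a, df.b], ignore_index=True),
    "NewName2": pd.concat([df.c, df.d], ignore_index=True),
})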
This is probably not the most elegant solution, but I would isolate the two dataframes and then concatenate them. I needed to rename the column axis so that the four columns could be aligned correctly.
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
af = df[['a', 'c']]
bf = df[['b', 'd']]
frames = (
    af.rename({'a': 'NewName1', 'c': 'NewName2'}, axis=1),
    bf.rename({'b': 'NewName1', 'd': 'NewName2'}, axis=1),
)
out = pd.concat(frames, ignore_index=True)  # ignore_index gives the result a fresh 0..5 index
[EDIT] Replying to the comment.
I'm not that familiar with indexing, but this might be one solution. You can avoid column names by using .iloc. Replace the af and bf frames above with these lines.
af = df.iloc[:, ::2]
bf = df.iloc[:, 1::2]
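Putting the positional selection together with the concat step, a fully name-free version could look like this (a sketch; set_axis simply relabels the columns):

out = pd.concat(
    [part.set_axis(['NewName1', 'NewName2'], axis=1)
     for part in (df.iloc[:, ::2], df.iloc[:, 1::2])],
    ignore_index=True,
)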

How to remove all rows from one dataframe that are part of another dataframe? [duplicate]

This question already has answers here:
pandas - filter dataframe by another dataframe by row elements
(7 answers)
Closed 2 years ago.
I have two dataframes like this
import pandas as pd
df1 = pd.DataFrame({
    'A': list('abcaewar'),
    'B': list('ghjglmgb'),
    'C': list('lkjlytle'),
    'ignore': ['stuff'] * 8,
})
df2 = pd.DataFrame({
    'A': list('abfu'),
    'B': list('ghio'),
    'C': list('lkqw'),
    'stuff': ['ignore'] * 4,
})
and I would like to remove all rows in df1 where A, B and C are identical to values in df2, so in the above case the expected outcome is
   A  B  C ignore
0  c  j  j  stuff
1  e  l  y  stuff
2  w  m  t  stuff
3  r  b  e  stuff
One way of achieving this would be
comp_columns = ['A', 'B', 'C']
df1 = df1.set_index(comp_columns)
df2 = df2.set_index(comp_columns)
keep_ind = [ind for ind in df1.index if ind not in df2.index]
new_df1 = df1.loc[keep_ind].reset_index()
Does anyone see a more straightforward way of doing this which avoids the reset_index() operations and the loop to identify non-overlapping indices, e.g. by a smart way of masking? Ideally I wouldn't have to hardcode the columns, but could define them in a list as above, as I sometimes need 2, sometimes 3, and sometimes 4 or more columns for the removal.
Use DataFrame.merge with optional parameter indicator=True, then use boolean masking to filter the rows in df1:
df3 = df1.merge(df2[['A', 'B', 'C']], on=['A', 'B', 'C'], indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]
Result:
# print(df3)
   A  B  C ignore
2  c  j  j  stuff
4  e  l  y  stuff
5  w  m  t  stuff
7  r  b  e  stuff
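If you want to leave df1's index untouched and skip the merge entirely, a membership test on a MultiIndex built from the comparison columns is another option (a sketch, starting from the original df1 and df2):

comp_columns = ['A', 'B', 'C']
# True where df1's (A, B, C) tuple also occurs in df2
in_df2 = pd.MultiIndex.from_frame(df1[comp_columns]).isin(
    pd.MultiIndex.from_frame(df2[comp_columns])
)
new_df1 = df1[~in_df2]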

Applying Function to Rows of a Dataframe in Python

I have a dataframe, and within one of the columns is a nested dictionary. I want to create a function that takes each row and a column name and json_normalizes that column into a dataframe. However, I keep getting the error 'function takes 2 positional arguments, 6 were given'. There are more than 6 columns in the dataframe and more than 6 columns in row[col] (see below), so I am confused as to how 6 arguments are being provided.
import pandas as pd
from pandas.io.json import json_normalize
def fix_row_(row, col):
    if type(row[col]) == list:
        df = json_normalize(row[col])
        df['id'] = row['id']
    else:
        df = pd.DataFrame()
    return df

new_df = data.apply(lambda x: fix_row_(x, 'Items'), axis=1)
So new_df will be a dataframe of dataframes. In the example below, it would just be a dataframe with A,B,C as columns and 1,2,3 as the values.
Quasi-reproducible example:
my_dict = {'A': 1, 'B': 2, 'C': 3}
ids = pd.Series(['id1','id2','id3'],name='ids')
data= pd.DataFrame(ids)
data['my_column']=''
m = data['ids'].eq('id1')
data.loc[m, 'my_column'] = [my_dict] * m.sum()
Pass the whole row along with the column name, keeping axis=1 so apply iterates row-wise:
new_df = data.apply(lambda x: fix_row_(x, 'my_column'), axis=1)
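For reference, a minimal end-to-end sketch against the quasi-reproducible example above; it uses pd.json_normalize (the non-deprecated spelling) and accepts dicts as well as lists, since the example stores a single dict, and the id column here is named 'ids':

import pandas as pd

def fix_row_(row, col):
    # normalize the nested value into its own dataframe, tagged with the row id
    if isinstance(row[col], (list, dict)):
        out = pd.json_normalize(row[col])
        out['id'] = row['ids']
    else:
        out = pd.DataFrame()
    return out

my_dict = {'A': 1, 'B': 2, 'C': 3}
data = pd.DataFrame(pd.Series(['id1', 'id2', 'id3'], name='ids'))
data['my_column'] = ''
m = data['ids'].eq('id1')
data.loc[m, 'my_column'] = [my_dict] * m.sum()

# a Series of dataframes, one per row; empty where my_column held no nested data
new_df = data.apply(lambda x: fix_row_(x, 'my_column'), axis=1)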

Join two same columns from two dataframes, pandas

I am looking for the fastest way to join columns with the same names using a separator.
my dataframes:
df1:
A,B,C,D
my,he,she,it
df2:
A,B,C,D
dog,cat,elephant,fish
expected output:
df:
A,B,C,D
my:dog,he:cat,she:elephant,it:fish
As you can see, I want to merge columns with the same names, joining the two cells into one.
I can use this code for the A column:
df = df1.merge(df2)
df['A'] = df[['A_x', 'A_y']].apply(lambda x: ':'.join(x), axis=1)
In my real dataset I have over 30 columns, and I don't want to write the same lines for each of them. Is there a faster way to get my expected output?
How about concat and groupby?
df3 = pd.concat([df1, df2], axis=0)
df3 = df3.groupby(df3.index).transform(lambda x: ':'.join(x)).drop_duplicates()
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
How about this?
df3 = df1 + ':' + df2
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
This is good because columns that don't match produce NaN, so you can filter them out later if you want:
df1 = pd.DataFrame({'A': ['my'], 'B': ['he'], 'C': ['she'], 'D': ['it'], 'E': ['another'], 'F': ['and another']})
df2 = pd.DataFrame({'A': ['dog'], 'B': ['cat'], 'C': ['elephant'], 'D': ['fish']})
df1 + ':' + df2
A B C D E F
0 my:dog he:cat she:elephant it:fish NaN NaN
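If you would rather keep the unmatched columns instead of NaN, one option (a sketch) is to fill the gaps back in from df1 with combine_first:

out = (df1 + ':' + df2).combine_first(df1)  # E and F fall back to df1's values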
You can do this by simply adding the two dataframes with a separator.
import pandas as pd
df1 = pd.DataFrame(columns=["A", "B", "C", "D"], index=[0])
df2 = pd.DataFrame(columns=["A", "B", "C", "D"], index=[0])
df1["A"] = "my"
df1["B"] = "he"
df1["C"] = "she"
df1["D"] = "it"
df2["A"] = "dog"
df2["B"] = "cat"
df2["C"] = "elephant"
df2["D"] = "fish"
print(df1)
print(df2)
df3 = df1 + ':' + df2
print(df3)
This will give you a result like:
A B C D
0 my he she it
A B C D
0 dog cat elephant fish
A B C D
0 my:dog he:cat she:elephant it:fish
Is this what you are trying to achieve? Note that this only works if both dataframes have the same columns; the extra columns will hold NaNs. What do you want to do with the columns that are not shared between df1 and df2? Please comment below to help me understand your problem better.
You can simply do:
df = df1 + ':' + df2
print(df)
which is simple and effective. This should be your answer.

Compare values based on common keys in pandas

Hi, I have two data frames, both with two columns: an identifier and a weight.
What I would like is, for each key (A and B): if the weights have opposite signs across the two dataframes (one positive and one negative), create a new column holding the lower absolute value.
import pandas as pd
A = {"ID":["A", "B"], "Weight":[500,300]}
B = {"ID":["A", "B"], "Weight":[-300,100]}
dfA = pd.DataFrame(data=A)
dfB = pd.DataFrame(data=B)
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
So the expected output would be a new column on dfC with the lowest absolute value between both weight columns if they have opposite signs.
Here is one way via .loc accessor:
import numpy as np
import pandas as pd

dfA = dfA.set_index('ID')
dfB = dfB.set_index('ID')
dfC = dfA.copy()
dfC['Result'] = 0
# True where the two weights have opposite signs
mask = (dfA['Weight'] > 0) != (dfB['Weight'] > 0)
dfC.loc[mask, 'Result'] = np.minimum(dfA['Weight'].abs(), dfB['Weight'].abs())
dfC = dfC.reset_index()
dfC = dfC.reset_index()
#   ID  Weight  Result
# 0  A     500     300
# 1  B     300       0
Here is another way to get the result you want, using df.apply and df.concat
Step 1 : Create dfC with ID, WeightA and WeightB
import numpy as np
A = dfA.set_index('ID')
B = dfB.set_index('ID')
dfC = pd.concat([A, B], axis=1).reset_index()
dfC.columns = ['ID', 'WeightA', 'WeightB']
Edit :
You can use your dfC too, just rename the columns as such and use the Step2 for your result.
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
dfC.columns = ['ID', 'WeightA', 'WeightB']
Step2: Create column 'lowestAbsWeight' which is the lowest absolute of the two weights A and B
dfC['lowestAbsWeight'] = dfC.apply(
    lambda row: np.absolute(row['WeightA'])
    if np.absolute(row['WeightA']) < np.absolute(row['WeightB'])
    else np.absolute(row['WeightB']),
    axis=1,
)
The output looks like:
  ID  WeightA  WeightB  lowestAbsWeight
0  A      500     -300              300
1  B      300      100              100
Hope this helps.
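For completeness, a vectorized sketch that folds the question's sign check into this layout (assuming the WeightA/WeightB columns built above):

signs_differ = (dfC['WeightA'] * dfC['WeightB']) < 0
dfC['lowestAbsWeight'] = np.where(
    signs_differ,
    dfC[['WeightA', 'WeightB']].abs().min(axis=1),
    0,  # same sign: leave the result at zero, as the question asks
)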
