I have two dataframes in pandas, shown below. They don't have the same columns; only a few columns are shared.
df1
no  datas1  datas2  datas3  datas4
0   a       b       a       a
1   b       c       b       b
2   d       b       c       a

df2
no  datas1  datas2  datas3  data4  data5  data6
0   c       a       a       a      a      b
1   a       c       b       b      b      b
2   a       b       c       b      c      c
I'd like to know, for each shared column, how many values match when the rows are aligned on the "no" field, using pandas functions.
The results would look like:
data3 is 100% match
data4 is 66% match
or
data3 is 3 matched
data4 is 2 matched
What's the best way to do that?
You can do this: first run the equals method, and if it returns True, print that the dataframes match. Otherwise use the compare method and calculate the percentage of rows that matched between the dataframes:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [2, 2, 3], 'b': [4, 5, 7]})
if df1.equals(df2):
    print('df1 matched df2')
else:
    comp = df1.compare(df2)
    match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]
    print(f'{match_perc * 100:.4f} match')  # Out: 33.3333 match
You can simplify by using compare alone: if the dataframes match perfectly, print that they matched:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
comp = df1.compare(df3)
match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]
if match_perc == 1:
    print('dfs matched')
else:
    print(f'{match_perc * 100:.4f} match')
# Out: dfs matched
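The equals/compare approach above reports a whole-row match rate. For the per-column counts the original question asked for, one sketch (the frames below are placeholders mirroring the question's shape, not its exact data) is to align both frames on "no" and compare only the shared columns:

```python
import pandas as pd

# Placeholder frames: only some columns overlap, rows keyed by "no".
df1 = pd.DataFrame({'no': [0, 1, 2],
                    'datas3': ['a', 'b', 'c'],
                    'data4': ['a', 'b', 'a']})
df2 = pd.DataFrame({'no': [0, 1, 2],
                    'datas3': ['a', 'b', 'c'],
                    'data4': ['a', 'b', 'b'],
                    'data6': ['b', 'b', 'c']})

# Keep only the columns both frames share, aligned on "no".
shared = df1.columns.intersection(df2.columns).drop('no')
left = df1.set_index('no')[shared]
right = df2.set_index('no')[shared]

matches = (left == right).sum()          # matched rows per shared column
percent = (left == right).mean() * 100   # match percentage per shared column
print(matches)
print(percent)
```

Here `matches` gives "datas3: 3, data4: 2" and `percent` gives "datas3: 100.0, data4: 66.67", the two output styles the question describes.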
I would like to add a new column using another column's value, with a condition.
In pandas, I do this like below:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is

   a  b  c
0  1  3  1
1  2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
pl.when(pl.col("b") == 4).then(pl.col('b')).otherwise(pl.col('a')).alias("c")
)
I am trying to create a new column in a dataframe and populate it with a value from another dataframe's column, matched on a column that both dataframes have in common.
DF1        DF2
A  B       W  B
Y  2       X  2
N  4       F  4
Y  5       T  5
I thought the following could do the trick.
df2['new_col'] = df1['A'] if df1['B'] == df2['B'] else "Not found"
So result should be:
DF2
W B new_col
X 2 Y -> Because DF1['B'] == 2 and the value in the same row is Y
F 4 N
T 5 Y
but I get the error below; I believe that is because the dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects")
Can you help me understand what am I doing wrong and what is the best way to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien's solution, I still get the below:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
.drop(columns='index').rename(columns={'One': 'new_col'})
UPDATE 2
Here is the second option, but it does not seem to add the column to df2.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
.reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
b a c new_col Three
0 2 1 3 NaN NaN
1 5 4 6 NaN NaN
2 8 7 9 NaN NaN
Why is the code above not working?
Your question is not clear: why is F associated with N and T with Y? Why not F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
W B A
0 X 2 Y
1 F 4 N # What you want
2 F 4 Y # Another solution
3 T 4 N # What you want
4 T 4 Y # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
.drop(columns='index').rename(columns={'A': 'new_col'})
W B new_col
0 X 2 Y
1 F 4 N
2 T 4 Y
In fact you can consider the column B as an additional index of each dataframe.
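Putting the merge-on-index idea together with the question's original tables (DF1 and DF2 reconstructed from the question, B = [2, 4, 5] in both), a runnable sketch:

```python
import pandas as pd

# DF1/DF2 as shown in the question.
df1 = pd.DataFrame({'A': ['Y', 'N', 'Y'], 'B': [2, 4, 5]})
df2 = pd.DataFrame({'W': ['X', 'F', 'T'], 'B': [2, 4, 5]})

# Merge on both the row position ('index') and B, so each row only pairs
# with its counterpart at the same position.
res = df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
         .drop(columns='index').rename(columns={'A': 'new_col'})
print(res)
```

This yields `new_col = ['Y', 'N', 'Y']`, matching the desired output in the question.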
Using join
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
.reset_index('B').rename(columns={'A': 'new_col'})
B W new_col
0 2 X Y
1 4 F N
2 4 T Y
Setup:
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
columns=['a', 'b', 'c'])
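The merge error in UPDATE 1 ("trying to merge on int64 and object columns") comes from building df1 with `np.array` over mixed types, which coerces every value to a string, so 'b' is object in df1 but int64 in df2. Besides the list-based Setup above, a sketch of an explicit fix is to cast the key before merging:

```python
import numpy as np
import pandas as pd

# Same construction as in the updates: the mixed-type np.array turns
# df1's 'b' column into strings.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

# Cast the join key to a common dtype before merging.
df1['b'] = df1['b'].astype(int)
df2['b'] = df2['b'].astype(int)

res = df2.merge(df1[['b', 'One']], on='b', how='left') \
         .rename(columns={'One': 'new_col'})
print(res)
```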
I have a dataframe below:
df = {'a': [1, 2, 3],
'b': [77, 88, 99],
'c1': [1, 1, 1],
'c2': [2, 2, 2],
'c3': [3, 3, 3]}
df = pd.DataFrame(df)
and a function:
def test_function(row):
return row['b']
How can I apply this function on the 'c' columns (i.e. c1, c2 and c3), BUT only for specific rows whose 'a' value matches the 2nd character of the 'c' columns?
For example, for the first row, the value of 'a' is 1, so for the first row, I would like to apply this function on column 'c1'.
For the second row, the value of 'a' is 2, so for the second row, I would like to apply this function on column 'c2'. And so forth for the rest of the rows.
The desired end result should be:
df_final = {'a': [1, 2, 3],
'b': [77, 88, 99],
'c1': [77, 1, 1],
'c2': [2, 88, 2],
'c3': [3, 3, 99]}
df_final = pd.DataFrame(df_final)
Use Series.mask: compare 'a' against the c columns (selected with DataFrame.filter) and, where they match, replace with the values of b:
c_cols = df.filter(like='c').columns
def test_function(row):
    # for single-digit column numbers (0-9):
    # m = c_cols.str[1].astype(int) == row['a']
    # for column numbers up to 100:
    m = c_cols.str.extract(r'(\d+)', expand=False).astype(int) == row['a']
    row[c_cols] = row[c_cols].mask(m, row['b'])
    return row
df = df.apply(test_function, axis=1)
print(df)
a b c1 c2 c3
0 1 77 77 2 3
1 2 88 1 88 3
2 3 99 1 2 99
A faster, loop-free alternative with broadcasting (the mask must be oriented rows x columns, so broadcast 'a' down the rows):
arr = c_cols.str.extract(r'(\d+)', expand=False).astype(int).to_numpy()
m = df['a'].to_numpy()[:, None] == arr
df[c_cols] = df[c_cols].mask(m, df['b'], axis=0)
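A self-contained version of the broadcasting idea, using the question's data, so the result can be checked against df_final:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [77, 88, 99],
                   'c1': [1, 1, 1],
                   'c2': [2, 2, 2],
                   'c3': [3, 3, 3]})

# Numbers embedded in the c-column names: [1, 2, 3].
c_cols = df.filter(like='c').columns
nums = c_cols.str.extract(r'(\d+)', expand=False).astype(int).to_numpy()

# m[r, c] is True when a[r] equals the number in c-column c's name,
# so the mask lines up with df[c_cols] (rows x columns).
m = df['a'].to_numpy()[:, None] == nums
df[c_cols] = df[c_cols].mask(m, df['b'], axis=0)
print(df)
```

This reproduces df_final: c1 = [77, 1, 1], c2 = [2, 88, 2], c3 = [3, 3, 99].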
e.g. I have two dataframes:
a = pd.DataFrame({'A':[1,2,3],'B':[6,5,4]})
b = pd.DataFrame({'A':[3,2,1],'B':[4,5,6]})
I want to get a dataframe c consisting of the larger value in each position of a & b:
c = max_function(a,b) = pd.DataFrame(max(a.iloc[i,j], b.iloc[i,j]))
c = pd.DataFrame({'A':[3,2,3],'B':[6,5,6]})
I don't want to generate c by looping over each value in a & b, because the real dataframes in my work are very large.
So I wonder if there's a ready-made pandas function which can do this? Thanks!
You could use numpy.maximum:
import pandas as pd
import numpy as np
a = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]})
b = pd.DataFrame({'A': [3, 2, 1], 'B': [4, 5, 6]})
c = np.maximum(a, b)
print(c)
Output
A B
0 3 6
1 2 5
2 3 6
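An alternative sketch that stays inside pandas is DataFrame.combine, which pairs up same-named columns and applies a function to each pair; np.maximum works here too:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]})
b = pd.DataFrame({'A': [3, 2, 1], 'B': [4, 5, 6]})

# combine calls np.maximum on each pair of same-named columns.
c = a.combine(b, np.maximum)
print(c)
```

Both approaches are vectorized; np.maximum applied to the whole frames is the more direct of the two.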
import pandas as pd
df1 = pd.DataFrame({'ID':['i1', 'i2', 'i3'],
'A': [2, 3, 1],
'B': [1, 1, 2],
'C': [2, 1, 0],
'D': [3, 1, 2]})
df1 = df1.set_index('ID')
df1.head()
A B C D
ID
i1 2 1 2 3
i2 3 1 1 1
i3 1 2 0 2
df2 = pd.DataFrame({'ID':['i1-i2', 'i1-i3', 'i2-i3'],
'A': [2, 1, 1],
'B': [1, 1, 1],
'C': [1, 0, 0],
'D': [1, 1, 1]})
df2 = df2.set_index('ID')
df2
A B C D
ID
i1-i2 2 1 1 1
i1-i3 1 1 0 1
i2-i3 1 1 0 1
Given a data frame like df1, I want to compare every two different rows, take the smaller value in each column, and output the result to a new data frame like df2.
For example, comparing row i1 and row i2 gives the new row i1-i2 as 2, 1, 1, 1.
Please advise on the best way to do that in pandas.
Try this:
from itertools import combinations
import numpy as np

v = df1.values
r = pd.DataFrame([np.minimum(v[t[0]], v[t[1]])
                  for t in combinations(np.arange(len(df1)), 2)],
                 columns=df1.columns,
                 index=list(combinations(df1.index, 2)))
Result:
In [72]: r
Out[72]:
A B C D
(i1, i2) 2 1 1 1
(i1, i3) 1 1 0 2
(i2, i3) 1 1 0 1
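A self-contained version of the same idea, joining the paired row labels into the "i1-i2" style the question asked for instead of tuples:

```python
from itertools import combinations

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['i1', 'i2', 'i3'],
                    'A': [2, 3, 1],
                    'B': [1, 1, 2],
                    'C': [2, 1, 0],
                    'D': [3, 1, 2]}).set_index('ID')

# Elementwise minimum of every pair of rows, labelled "rowA-rowB".
v = df1.to_numpy()
pairs = list(combinations(range(len(df1)), 2))
df2 = pd.DataFrame([np.minimum(v[i], v[j]) for i, j in pairs],
                   columns=df1.columns,
                   index=['-'.join((df1.index[i], df1.index[j])) for i, j in pairs])
print(df2)
```

This gives rows i1-i2, i1-i3, and i2-i3 with the columnwise minima of each pair.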