I have two dataframes. (a, b, c, d) and (i, j, k) are the column names of the dataframes:
df1 =
a b c d
0 1 2 3
0 1 2 3
0 1 2 3
df2 =
i j k
0 1 2
0 1 2
0 1 2
I want to select the entries of df1 whose values also appear in df2.
I want to obtain:
df1=
a b c
0 1 2
0 1 2
0 1 2
You can use isin to compare df1 with each column of df2:
dfs = []
for i in range(len(df2.columns)):
    df = df1.isin(df2.iloc[:, i])
    dfs.append(df)
Then concat all the masks and combine them per cell, so a cell counts as matched if it matched any column of df2:
mask = pd.concat(dfs).groupby(level=0).any()
print (mask)
a b c d
0 True True True False
1 True True True False
2 True True True False
Apply boolean indexing:
print (df1.loc[:, mask.all()])
a b c
0 0 1 2
1 0 1 2
2 0 1 2
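For reference, a self-contained version of the same steps on the sample data, so it can be run directly (a sketch mirroring the approach above):
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1],
                    'c': [2, 2, 2], 'd': [3, 3, 3]})
df2 = pd.DataFrame({'i': [0, 0, 0], 'j': [1, 1, 1], 'k': [2, 2, 2]})

# one boolean mask per df2 column; isin with a Series aligns by index
dfs = [df1.isin(df2.iloc[:, i]) for i in range(len(df2.columns))]

# a cell counts as matched if any of the per-column masks is True
mask = pd.concat(dfs).groupby(level=0).any()

# keep only the df1 columns that matched in every row
print(df1.loc[:, mask.all()])
#    a  b  c
# 0  0  1  2
# 1  0  1  2
# 2  0  1  2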
Doing a column-wise comparison would give the desired result:
df1 = df1[(df1.a == df2.i) & (df1.b == df2.j) & (df1.c == df2.k)][['a','b','c']]
This keeps only those rows of df1 where the values of the first three columns are identical to those of df2, and then selects the columns 'a', 'b', 'c' from df1.
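If more column pairs were involved, the same row-wise test can be written once over the underlying arrays instead of spelling out each comparison. A sketch, assuming the two frames have the same number of rows and the column pairs line up positionally (the names pairs_left and pairs_right are just illustrative):
# compare a<->i, b<->j, c<->k in one shot
pairs_left = ['a', 'b', 'c']
pairs_right = ['i', 'j', 'k']
same = (df1[pairs_left].values == df2[pairs_right].values).all(axis=1)
df1 = df1.loc[same, pairs_left]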
I have a pandas dataframe (represented here using Excel):
Now I would like to delete all duplicates in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated to build a boolean mask and then set NaNs with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicate rows entirely by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
    'B': [1, 2, 1, 3],
    'A': [1, 5, 7, 9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
And the alternative, applied to the original df:
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
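duplicated also takes a keep parameter if the first occurrence should not survive either; a quick sketch on the same sample (keep=False marks every duplicated value, including the first one):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})
# mark all occurrences of duplicated values, not just the later ones
df.loc[df['B'].duplicated(keep=False), 'B'] = np.nan
print(df)
#    A    B
# 0  1  NaN
# 1  5  2.0
# 2  7  NaN
# 3  9  3.0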
I have two dataframes:
df1:
   A  B  C  D  E
0  1  2  3  4  5
1  1  3  4  5  0
2  3  1  2  3  5
3  2  3  1  2  6
4  2  5  1  2  3
df2:
   K  L  M  N
0  1  3  4  2
1  1  2  5  3
2  3  2  3  1
3  1  4  5  0
4  2  2  3  6
5  2  1  2  7
What I need to do is match column A of df1 with column K of df2, column C of df1 with column L of df2, and column D of df1 with column M of df2. If all three values match, the corresponding value of N in df2 should be assigned to a new column F in df1. The output should be:
   A  B  C  D  E  F
0  1  2  3  4  5  2
1  1  3  4  5  0  0
2  3  1  2  3  5  1
3  2  3  1  2  6  7
4  2  5  1  2  3  7
Use DataFrame.merge with a left join, renaming df2's columns so they match df1's:
df = df1.merge(df2.rename(columns={'K':'A','L':'C','M':'D', 'N':'F'}), how='left')
print (df)
A B C D E F
0 1 2 3 4 5 2
1 1 3 4 5 0 0
2 3 1 2 3 5 1
3 2 3 1 2 6 7
4 2 5 1 2 3 7
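The same merge, written out end to end with the sample data so it can be run directly:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 3, 2, 2],
                    'B': [2, 3, 1, 3, 5],
                    'C': [3, 4, 2, 1, 1],
                    'D': [4, 5, 3, 2, 2],
                    'E': [5, 0, 5, 6, 3]})
df2 = pd.DataFrame({'K': [1, 1, 3, 1, 2, 2],
                    'L': [3, 2, 2, 4, 2, 1],
                    'M': [4, 5, 3, 5, 3, 2],
                    'N': [2, 3, 1, 0, 6, 7]})

# after the rename, the frames share columns A, C and D,
# so merge joins on exactly those keys
df = df1.merge(df2.rename(columns={'K': 'A', 'L': 'C', 'M': 'D', 'N': 'F'}),
               how='left')
print(df)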
Alternatively, join the frames side by side and compare row by row with a loop:
df3 = df1.join(df2)
F = []
for _, row in df3.iterrows():
    # take N when all three key columns agree, otherwise 0
    if row['A'] == row['K'] and row['C'] == row['L'] and row['D'] == row['M']:
        F.append(row['N'])
    else:
        F.append(0)
df1['F'] = F
df1
Note that this only finds matches when the matching rows happen to sit at the same position in both frames, so the merge above is the more general solution.
I have two dataframes that look like this:
df1: condition
A
A
A
B
B
B
B
df2: condition value
A 1
B 2
I would like to assign to each condition its value, adding a column to df1 in order to obtain:
df1: condition value
A 1
A 1
A 1
B 2
B 2
B 2
B 2
How can I do this? Thank you in advance!
Use map with a Series created by set_index if you only need to append one column:
df1['value'] = df1['condition'].map(df2.set_index('condition')['value'])
print (df1)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
Or use merge with a left join if df2 has more columns:
df = df1.merge(df2, on='condition', how='left')
print (df)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
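If some condition in df1 had no counterpart in df2, both approaches would leave NaN in value; a one-line sketch if a default is wanted instead (the default 0 here is just an assumption):
# unmatched conditions get 0 instead of NaN
df1['value'] = df1['condition'].map(df2.set_index('condition')['value']).fillna(0)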
I have a dataframe:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
which looks like-
A B class
0 0 4 0
1 4 1 1
2 8 0 1
3 1 0 0
4 0 3 1
5 0 1 0
I want to calculate, for each column, the corresponding a, b, c, d, where:
a is the number of non-zero values in the column where class is 1,
b is the number of non-zero values in the column where class is 0,
c is the number of zeros in the column where class is 1,
d is the number of zeros in the column where class is 0.
For example, for column A the a, b, c, d are 2, 1, 1, 2.
Explanation: in column A, where column class is 1, the number of non-zero values in A is 2, therefore a=2 (indices 1, 2). Similarly b=1 (index 3).
My attempt (from when the dataframe had an equal number of rows in class 0 and class 1):
dataset = pd.read_csv('aaf.csv')
n = len(dataset.columns)  # no of columns
X = dataset.iloc[:, 1:n].values
l = len(X)  # no of rows
score = []
for i in range(n-1):
    #print(i)
    X_column = X[:, i]
    neg_array, pos_array = np.hsplit(X_column, 2)  ## hardcoded
    #print(pos_array.size)
    a = np.count_nonzero(pos_array)
    b = np.count_nonzero(neg_array)
    c = l/2 - a
    d = l/2 - b
Use:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
df = (df.set_index('class')
        .ne(0)
        .stack()
        .groupby(level=[0, 1])
        .value_counts()
        .unstack(1)
        .sort_index(level=1, ascending=False)
        .T)
print (df)
class 1 0 1 0
True True False False
A 2 1 1 2
B 2 2 1 1
df.columns = list('abcd')
print (df)
a b c d
A 2 1 1 2
B 2 2 1 1
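The same table can also be built more explicitly with boolean masks, which may be easier to follow than the stack/unstack chain (a sketch on the same sample data):
import pandas as pd

d = {'class': [0, 1, 1, 0, 1, 0], 'A': [0, 4, 8, 1, 0, 0], 'B': [4, 1, 0, 0, 3, 1]}
df = pd.DataFrame(data=d)

nz = df.drop('class', axis=1).ne(0)   # True where a value is non-zero
cls = df['class']

out = pd.DataFrame({
    'a': nz[cls.eq(1)].sum(),         # non-zero values in class 1
    'b': nz[cls.eq(0)].sum(),         # non-zero values in class 0
    'c': (~nz)[cls.eq(1)].sum(),      # zeros in class 1
    'd': (~nz)[cls.eq(0)].sum(),      # zeros in class 0
})
print(out)
#    a  b  c  d
# A  2  1  1  2
# B  2  2  1  1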
>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
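As a quick sanity check on the "melt is the inverse of pivot" reasoning, pivoting df3 back should reproduce df1's grid (a small sketch reusing the frames above):
# rows come from 'index', columns from 'variable', cells from 'value'
grid = df3.pivot(index='index', columns='variable', values='value')
print(grid)
# variable  0  1  2
# index
# 0         A  B  C
# 1         D  E  F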