compare multiple specific columns of all rows - python

I want to compare particular columns across all rows: if the values in those columns are all equal, put that value in a new column, otherwise put 0.
If the example dataframe is as follows:
A B C D E F
13348 judte 1 1 1 1
54871 kfzef 1 1 0 1
89983 hdter 4 4 4 4
7543 bgfd 3 4 4 4
The result should be as follows:
A B C D E F Result
13348 judte 1 1 1 1 1
54871 kfzef 1 1 0 1 0
89983 hdter 4 4 4 4 4
7543 bgfd 3 4 4 4 0
I would be glad to hear some suggestions.

Use:
import numpy as np
cols = ['C','D','E','F']
df['Result'] = np.where(df[cols].eq(df[cols[0]], axis=0).all(axis=1), df[cols[0]], 0)
print (df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0
Detail:
First select the columns listed in cols and compare them with the first of those columns, df[cols[0]], using eq:
print (df[cols].eq(df[cols[0]], axis=0))
C D E F
0 True True True True
1 True True False True
2 True True True True
3 True False False False
Then use all to check whether every value in a row is True:
print (df[cols].eq(df[cols[0]], axis=0).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
Finally, use numpy.where to assign the first column's value where the mask is True and 0 where it is False.
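The three steps above can be put together as a small self-contained sketch. The DataFrame is rebuilt from the sample in the question; the names first and all_equal are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'A': [13348, 54871, 89983, 7543],
    'B': ['judte', 'kfzef', 'hdter', 'bgfd'],
    'C': [1, 1, 4, 3],
    'D': [1, 1, 4, 4],
    'E': [1, 0, 4, 4],
    'F': [1, 1, 4, 4],
})

cols = ['C', 'D', 'E', 'F']
first = df[cols[0]]

# True for rows where every column in cols equals the first one
all_equal = df[cols].eq(first, axis=0).all(axis=1)

# Take the shared value where the row is uniform, otherwise 0
df['Result'] = np.where(all_equal, first, 0)
print(df)
```

Keeping the mask in its own variable makes it easy to inspect which rows were treated as uniform before the new column is written.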

I think you need apply with nunique:
df['Result'] = df[['C','D','E','F']].apply(lambda x: x.iloc[0] if x.nunique()==1 else 0, axis=1)
Or using np.where:
df['Result'] = np.where(df[['C','D','E','F']].nunique(axis=1)==1, df['C'], 0)
print(df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0
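One difference between the nunique and eq approaches may matter if the data can contain missing values: nunique ignores NaN by default, whereas a comparison with NaN is always False. The sketch below uses hypothetical data (not from the question) in which the second row has a missing value in column D.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing value in column D
df = pd.DataFrame({'C': [1, 1], 'D': [1, np.nan], 'E': [1, 1], 'F': [1, 1]})
cols = ['C', 'D', 'E', 'F']

# nunique skips NaN by default, so the second row counts as uniform
via_nunique = np.where(df[cols].nunique(axis=1) == 1, df['C'], 0)

# dropna=False counts NaN as its own value
via_nunique_strict = np.where(df[cols].nunique(axis=1, dropna=False) == 1, df['C'], 0)

# eq treats NaN as unequal to everything, so the second row is rejected
via_eq = np.where(df[cols].eq(df['C'], axis=0).all(axis=1), df['C'], 0)

print(via_nunique, via_nunique_strict, via_eq)
```

Which behaviour is wanted depends on whether a missing value should count as a mismatch.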

Related

Set value when row is maximum in group by - Python Pandas

I am trying to create a column (is_max) that is 1 if the value in column B is the maximum within its group of column A values, and 0 otherwise.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 3 0
1 4 1
2 5 1
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I am assuming A is your grouping column, since you did not state it:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A', group_keys=False)['B'].apply(lambda x: x==x.max()).astype(int)
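For comparison, here is a minimal sketch that runs the idxmax and transform approaches side by side on the data from the question. The column names is_max_idx and is_max_cmp are made up for this example. Note that idxmax marks only the first maximum in each group, while the comparison against transform('max') marks every row that ties for the maximum.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [2, 3, 4, 5]})

# Approach 1: idxmax returns the index label of the first maximum per group
df['is_max_idx'] = 0
df.loc[df.groupby('A')['B'].idxmax(), 'is_max_idx'] = 1

# Approach 2: compare each value with its group's maximum (ties are all flagged)
df['is_max_cmp'] = (df['B'] == df.groupby('A')['B'].transform('max')).astype(int)

print(df)
```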

Python Counting Same Values For Specific Columns

If I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C and count how many of the sums differ from the values in column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
There are 3 values that differ between "B+C" and "D".
Could you please help me with this?
You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} different values compare the "B+C" and "D"')
#There are 3 different values compare the "B+C" and "D"
df.insert(3,'B+C', df['B']+df['C'])
Here 3 is the position at which the new column is inserted.
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps from @yatu's answer:
df['B+C'].ne(df['D'])
0 True
1 True
2 False
3 True
dtype: bool
df['B+C'].ne(df['D']).sum()
3
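Putting the pieces together, a self-contained sketch might look like this. The DataFrame is rebuilt from the question (with its 1-based index), and mismatch and n_diff are illustrative names.

```python
import pandas as pd

# Sample data from the question, keeping its index 1..4
df = pd.DataFrame(
    {'A': [1, 1, 3, 2], 'B': [2, 1, 1, 4], 'C': [2, 2, 0, 4], 'D': [1, 1, 1, 4]},
    index=[1, 2, 3, 4],
)

# Insert the sum as a new column at position 3, i.e. just before D
df.insert(3, 'B+C', df['B'] + df['C'])

# Boolean mask of rows where the sum differs from D
mismatch = df['B+C'].ne(df['D'])

# True counts as 1, so the sum of the mask is the number of mismatches
n_diff = int(mismatch.sum())
print(df)
print(n_diff)
```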

How to check whether a column contains a specific element after grouping by?

I want to check whether the column app, grouped by the column user, contains a specific element, such as b.
import pandas as pd
df=pd.DataFrame({'user':[1,1,1,2,2,3,3],'app':['a','b','c','a','c','b','c']})
Input:
app user
0 a 1
1 b 1
2 c 1
3 a 2
4 c 2
5 b 3
6 c 3
Expected:
app user contains_b
0 a 1 1
1 b 1 1
2 c 1 1
3 a 2 0
4 c 2 0
5 b 3 1
6 c 3 1
Use transform with any:
df.assign(contains_b=df.app.eq('b').groupby(df.user).transform('any').astype(int))
app user contains_b
0 a 1 1
1 b 1 1
2 c 1 1
3 a 2 0
4 c 2 0
5 b 3 1
6 c 3 1
Use:
df['contains_b'] = df['user'].isin(df.loc[df['app'].eq('b'), 'user'].unique()).astype(int)
print (df)
user app contains_b
0 1 a 1
1 1 b 1
2 1 c 1
3 2 a 0
4 2 c 0
5 3 b 1
6 3 c 1
Details:
First filter column app with eq (==) and get the user value of every matching row:
print (df.loc[df['app'].eq('b'), 'user'])
1 1
5 3
Name: user, dtype: int64
For better performance, reduce these to unique values:
print (df.loc[df['app'].eq('b'), 'user'].unique())
[1 3]
Then test the user column for membership with isin:
print (df['user'].isin(df.loc[df['app'].eq('b'), 'user'].unique()))
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: user, dtype: bool
Finally, cast to integer so that True becomes 1 and False becomes 0.
Using isin inside a groupby transform:
df['contains_b'] = df.groupby('user').app.transform(lambda x: x.isin(['b']).any()).astype(int)
user app contains_b
0 1 a 1
1 1 b 1
2 1 c 1
3 2 a 0
4 2 c 0
5 3 b 1
6 3 c 1
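As a quick check that the approaches agree, the sketch below computes the flag two ways on the DataFrame from the question. The names is_b, via_transform and via_isin are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'user': [1, 1, 1, 2, 2, 3, 3],
                   'app': ['a', 'b', 'c', 'a', 'c', 'b', 'c']})

# Row-level mask: is this row's app equal to 'b'?
is_b = df['app'].eq('b')

# Approach 1: broadcast "any row in the group is b" back to every row
via_transform = is_b.groupby(df['user']).transform('any').astype(int)

# Approach 2: collect the users that have a 'b' row, then test membership
via_isin = df['user'].isin(df.loc[is_b, 'user'].unique()).astype(int)

df['contains_b'] = via_transform
print(df)
```

The isin variant avoids a groupby altogether, which may be faster on large frames, although that is worth measuring on real data.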

Return entries with common columns values in pandas DataFrame - python

I have a pandas DataFrame containing several entries (rows) with integer values in the columns, for example:
A B C D E F G H
0 1 2 1 0 1 2 1 2
1 0 1 1 1 1 2 1 2
2 1 2 1 2 1 2 1 3
3 0 1 1 1 1 2 1 2
4 2 2 1 2 1 2 1 3
I would like to return just the rows that have identical values in every column; the result should be:
A B C D E F G H
1 0 1 1 1 1 2 1 2
3 0 1 1 1 1 2 1 2
Thanks in advance
You can use the boolean mask from duplicated, passing the parameter keep=False:
In [3]:
df[df.duplicated(keep=False)]
Out[3]:
A B C D E F G H
1 0 1 1 1 1 2 1 2
3 0 1 1 1 1 2 1 2
Here is the mask showing which rows are duplicates. Passing keep=False marks every duplicated row; with the default keep='first', the first occurrence is not marked and only the later repeats are:
In [4]:
df.duplicated(keep=False)
Out[4]:
0 False
1 True
2 False
3 True
4 False
dtype: bool
You need duplicated with the parameter keep=False to mark all duplicates, combined with boolean indexing:
print (df.duplicated(keep=False))
0 False
1 True
2 False
3 True
4 False
dtype: bool
df = df[df.duplicated(keep=False)]
print (df)
A B C D E F G H
1 0 1 1 1 1 2 1 2
3 0 1 1 1 1 2 1 2
If you also need to drop the first or the last occurrence of each duplicated row, use:
df1 = df[df.duplicated()]
#same as keep='first', the default parameter, so it can be omitted
#df1 = df[df.duplicated(keep='first')]
print (df1)
A B C D E F G H
3 0 1 1 1 1 2 1 2
df2 = df[df.duplicated(keep='last')]
print (df2)
A B C D E F G H
1 0 1 1 1 1 2 1 2
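To see the three keep options side by side, here is a small sketch on the DataFrame from the question. The results dictionary is just a convenience for this example.

```python
import pandas as pd

# Sample data from the question: rows 1 and 3 are identical
df = pd.DataFrame(
    [[1, 2, 1, 0, 1, 2, 1, 2],
     [0, 1, 1, 1, 1, 2, 1, 2],
     [1, 2, 1, 2, 1, 2, 1, 3],
     [0, 1, 1, 1, 1, 2, 1, 2],
     [2, 2, 1, 2, 1, 2, 1, 3]],
    columns=list('ABCDEFGH'),
)

# For each keep option, record the index labels that duplicated() marks as True
results = {}
for keep in ('first', 'last', False):
    mask = df.duplicated(keep=keep)
    results[keep] = mask[mask].index.tolist()

print(results)
```

With keep='first' only row 3 is marked, with keep='last' only row 1, and with keep=False both rows are.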

Compare two columns in pandas to make them match

So I have two dataframes consisting of 6 columns each containing numbers. I need to compare 1 column from each dataframe to make sure they match and fix any values in that column that don't match. Columns are already sorted and they match in terms of length. So far I can find the differences in the columns:
df1.loc[(df1['col1'] != df2['col2'])]
then I get the index # where df1 doesn't match df2. Then I'll go to that same index # in df2 to find out what value in col2 is causing a mismatch then use this to change the value to the correct one found in df2:
df1.loc[index_number, 'col1'] = new_value
Is there a way I can automatically fix the mismatches without having to manually look up what the correct value should be in df2?
If df2 is the authoritative source, you don't need to check where df1 is equal:
df1.loc[:, 'column_name'] = df2['column_name']
But if we must check
c = 'column_name'
df1.loc[df1[c] != df2[c], c] = df2[c]
I think you need to compare with eq and then, if you need to fill in values where they don't match, use combine_first:
df1 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,6,5],
'E':[5,3,6],
'F':[1,4,3]})
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 6 3 4
2 3 6 9 5 6 3
df2 = pd.DataFrame({'A':[1,2,1],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 1 6 9 5 6 3
If you need to compare one column with the whole DataFrame:
print (df1.eq(df2.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 False False False False False False
print (df1.eq(df1.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 True False False False False True
And if you need to compare the same column, D, in both:
df1.D = df1.loc[df1.D.eq(df2.D), 'D'].combine_first(df2.D)
print (df1)
A B C D E F
0 1 4 7 1.0 5 1
1 2 5 8 3.0 3 4
2 3 6 9 5.0 6 3
But it is easier to simply assign column D from df2 to column D of df1:
df1.D = df2.D
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
If the indexes are different, you can use values to convert the df2 column to a NumPy array, so that it is assigned by position rather than by index label:
df1.D = df2.D.values
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
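If only the mismatching rows should be overwritten and the two frames do not share index labels, the comparison can also be done by position. This is a minimal sketch under the assumption stated in the question that both columns have the same length and order. The index labels on df2 are hypothetical, and other and mismatch are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'D': [1, 6, 5]})
# Hypothetical: df2 holds the correct values but has different index labels
df2 = pd.DataFrame({'D': [1, 3, 5]}, index=[10, 11, 12])

# Compare by position, ignoring index labels
other = df2['D'].to_numpy()
mismatch = df1['D'].to_numpy() != other

# Overwrite only the rows that differ
df1.loc[mismatch, 'D'] = other[mismatch]
print(df1)
```

Comparing the two Series directly with != would raise an error here, because pandas refuses to compare Series whose index labels differ.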
