I have a DataFrame like:
   value
0      1
1      2
2      2
3      3
4      4
5      4
I need to check whether each value is unique and record that boolean in a new column. The expected result would be:
   value  unique
0      1    True
1      2   False
2      2   False
3      3    True
4      4   False
5      4   False
I have tried:
df['unique'] = ""
df.loc[df["value"].is_unique, 'unique'] = True
But this throws exception:
cannot use a single bool to index into setitem
Any advice would be highly appreciated. Thanks.
Series.is_unique returns a single boolean for the whole column, which is why the .loc assignment fails. Instead, use Series.duplicated with the mask inverted by ~:
df['unique'] = ~df['value'].duplicated(keep=False)
print(df)
   value  unique
0      1    True
1      2   False
2      2   False
3      3    True
4      4   False
5      4   False
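To see why this works, print the intermediate mask on the question's data: duplicated(keep=False) marks every member of a duplicate group as True, and ~ flips it:

print(df['value'].duplicated(keep=False))
0    False
1     True
2     True
3    False
4     True
5     True
Name: value, dtype: bool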
Or with numpy.where, which picks False where duplicated and True elsewhere, i.e. the same inverted mask:

import numpy as np

df['unique'] = np.where(df['value'].duplicated(keep=False), False, True)
This works as well, counting occurrences with value_counts and merging them back (the count column is named explicitly so the lookup stays version-robust):

counts = df['value'].value_counts().rename('count')
df['unique'] = df.merge(counts, left_on='value', right_index=True)['count'].eq(1)
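For reference, the same counting idea works without a merge at all (a simpler sketch): map each value's count back onto the column and compare with 1:

df['unique'] = df['value'].map(df['value'].value_counts()).eq(1)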
I have the following dataframe:
      0      1      2      3      4      5      6      7
0  True  False  False  False  False  False  False  False

[1 rows x 8 columns]
As you can see, there is one True value, in the first column, so I want to get its column index, 0.
Likewise, if the True were in the 4th column, as in the dataframe below, then I would like to get 4.
       0      1      2      3     4      5      6      7
0  False  False  False  False  True  False  False  False

[1 rows x 8 columns]
I tried to google it but failed to find what I want. Assume the columns have no designated names.
Looking forward to your help. Thanks.
IIUC, you are looking for idxmax:
>>> df
      0      1      2      3      4      5      6      7
0  True  False  False  False  False  False  False  False
>>> df.idxmax(axis=1)
0    0
dtype: object

>>> df
       0      1      2      3     4      5      6      7
0  False  False  False  False  True  False  False  False
>>> df.idxmax(axis=1)
0    4
dtype: object
Caveat: if all values are False, idxmax still returns the first column, because column 0 is the lowest index of the (all-equal) maximum value:
>>> df
       0      1      2      3      4      5      6      7
0  False  False  False  False  False  False  False  False
>>> df.idxmax(axis=1)
0    0
dtype: object
Workaround: replace False by np.nan:
>>> df.replace(False, np.nan).idxmax(axis=1)
0   NaN
dtype: float64
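Alternatively (a sketch), keep idxmax but blank out all-False rows with a mask built by DataFrame.any:

mask = df.any(axis=1)            # True for rows containing at least one True
df.idxmax(axis=1).where(mask)    # leaves NaN where the row is all False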
If you want every column that is True:

cols_true = []
for idx, row in df.iterrows():
    for col in df.columns:      # iterate over every column label
        if row[col]:
            cols_true.append(col)
print(cols_true)
Use boolean indexing:
df.columns[df.iloc[0]]
output:
Index(['0'], dtype='object')
Or numpy.where
np.where(df)[1]
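Note that np.where returns positional indices, not labels; to translate them back to column labels (a small follow-up sketch):

df.columns[np.where(df)[1]]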
You may want to index the dataframe's index by a column itself (0 in this case), as follows:
df.index[df[0]]
You'll get:
Int64Index([0], dtype='int64')
df.loc[:, df.any()].columns[0]
# 4
If you have several True values, you can also get them all via .columns.
Generalization
Imagine we have the following dataframe (several True values in positions 4, 6 and 7):
       0      1      2      3     4      5     6     7
0  False  False  False  False  True  False  True  True
With the formula above:
df.loc[:, df.any()].columns
# Int64Index([4, 6, 7], dtype='int64')
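If positional indices rather than labels are wanted, numpy's flatnonzero on the single row gives them directly (a sketch on the same data):

np.flatnonzero(df.iloc[0])
# array([4, 6, 7])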
Selecting each row's True entries with ss.loc[ss] and taking the smallest matching label:

df.apply(lambda ss: ss.loc[ss].index.min(), axis=1).squeeze()

out:

0

or

df.loc[:, df.iloc[0]].columns.min()
I am trying to drop rows that have 0 in all 3 columns. I tried the code below, but it dropped every row that has 0 in any one of the 3 columns instead.
indexNames = news[ news['contain1']&news['contain2'] &news['contain3']== 0 ].index
news.drop(indexNames , inplace=True)
My CSV file:

contain1  contain2  contain3
       1         0         0
       0         0         0
       0         1         1
       1         0         1
       0         0         0
       1         1         1
With that code, all of my rows get deleted. Below is the result I want instead:

contain1  contain2  contain3
       1         0         0
       0         1         1
       1         0         1
       1         1         1
First compare against 0 with DataFrame.ne (not equal), then keep rows that have at least one True with DataFrame.any, so only all-0 rows are removed:
df = news[news.ne(0).any(axis=1)]
#cols = ['contain1','contain2','contain3']
#if necessary, filter only the columns in the list
#df = news[news[cols].ne(0).any(axis=1)]
print(df)
   contain1  contain2  contain3
0         1         0         0
2         0         1         1
3         1         0         1
5         1         1         1
Details:
print(news.ne(0))
   contain1  contain2  contain3
0      True     False     False
1     False     False     False
2     False      True      True
3      True     False      True
4     False     False     False
5      True      True      True

print(news.ne(0).any(axis=1))
0     True
1    False
2     True
3     True
4    False
5     True
dtype: bool
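The same condition can also be written in the negative, as "drop rows where every value equals 0" (an equivalent sketch):

df = news[~news.eq(0).all(axis=1)]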
If this is a pandas DataFrame you can sum each row with .sum(axis=1):

news_sums = news.sum(axis=1)   # row sums; axis=0 would sum whole columns instead
indexNames = news.loc[news_sums == 0].index
news.drop(indexNames, inplace=True)

(note: this assumes the values are non-negative, as in the 0/1 data here)
A simple solution is to filter on the sum of your columns: news[news.sum(axis=1) != 0]. This works here because the values are 0/1 indicators, so a row sum of 0 means every entry is 0.
Hope this will help you :)
You might want to try this. Transposing first makes .any(), which works column-wise by default, operate per original row:

news[(news.T != 0).any()]
I have a DataFrame with 2 columns, A and B. I have to find all the duplicated values across both columns using pandas and highlight them.
For example, my DataFrame looks like this:

A  B
1  1
2  3
4  4
8  8
5  6
4  7
Then the output should be:

A  B
1  1   <--- both values highlighted
2  3
4  4   <--- both values highlighted
8  8   <--- both values highlighted
5  6
4  7   <--- value in column A highlighted
How do I do that?
Thanks in advance.
You can use this:
import numpy as np
import pandas as pd

def color_dupes(x):
    c1 = 'background-color:red'
    c2 = ''
    # boolean frame marking every duplicated value across both columns
    cond = x.stack().duplicated(keep=False).unstack()
    return pd.DataFrame(np.where(cond, c1, c2), columns=x.columns, index=x.index)

df.style.apply(color_dupes, axis=None)
# if df has many columns: df.style.apply(color_dupes, axis=None, subset=['A','B'])
Example working code:
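(The original linked example is not included here; below is a minimal reconstruction from the question's data, reusing the color_dupes function above.)

df = pd.DataFrame({'A': [1, 2, 4, 8, 5, 4],
                   'B': [1, 3, 4, 8, 6, 7]})
df.style.apply(color_dupes, axis=None)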
Explanation:
First we stack the dataframe to bring all the columns into one Series, and call duplicated with keep=False so that every occurrence of a duplicate is marked True:

df.stack().duplicated(keep=False)
0  A     True
   B     True
1  A    False
   B    False
2  A     True
   B     True
3  A     True
   B     True
4  A    False
   B    False
5  A     True
   B    False
dtype: bool
After this we unstack() it, which gives a boolean dataframe with the same structure as the original:

df.stack().duplicated(keep=False).unstack()
       A      B
0   True   True
1  False  False
2   True   True
3   True   True
4  False  False
5   True  False
Once we have this, np.where assigns the background color where the mask is True and an empty style otherwise.
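To use the highlighted table outside a notebook, the Styler can be rendered to an HTML string (a sketch; Styler.to_html needs pandas 1.3+, older versions expose .render() instead):

html = df.style.apply(color_dupes, axis=None).to_html()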
Simple question: How to identify unique ID's that have only 1 true condition?
Index  ID  value  condition
    0   1      1      False
    1   1      3       True
    2   1      2      False
    3   1      1      False
    4   2      3       True
    5   2      4       True
    6   2      5       True
In the case above, only ID 1 (1 True) would be identified, while ID 2 (3 Trues) would not.
How would I go about editing the code below? I need to keep the original index and ID in a segmented data frame.
df[df['condition']==True]['ID'].unique()
Expected output:

Index  ID  value  condition
    1   1      3       True
All the best,
Thank you for your time.
Using filter:

df.groupby('ID').filter(lambda x: x['condition'].sum() == 1)

Out[685]:
   Index  ID  value  condition
0      0   1      1      False
1      1   1      3       True
2      2   1      2      False
3      3   1      1      False
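filter keeps every row of the qualifying group; since the expected output contains only the True row itself, chain one more boolean mask (a sketch building on the answer above):

out = df.groupby('ID').filter(lambda x: x['condition'].sum() == 1)
out = out[out['condition']]   # keep only the True row(s) within those IDs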
There are quite a few similar questions out there, but I am not sure any of them tackles both index and row values (this is relevant to a binary-classification df).
What I am trying to do is check that columns with the same name have the same values and index, and if not, simply raise an error.
Say DataFrame df has columns a, b and c, and df_original has columns from a to z.
How can we first find the columns whose names appear in both DataFrames, and then check that the contents of those columns match row by row, in both value and index, between df and df_original?
The contents of all the columns are numerical, which is why I want to compare the combination of index and values.
Demo:
In [1]: df
Out[1]:
   a  b  c
0  0  1  2
1  1  2  0
2  0  1  0
3  1  1  0
4  3  1  0

In [3]: df_original
Out[3]:
   a  b  c  d  e  f  g  ...
0  4  3  1  1  0  0  0
1  3  1  2  1  1  2  1
2  1  2  1  1  1  2  1
3  3  4  1  1  1  2  1
4  0  3  0  0  1  1  1
In the above example, for the columns that share a name, compare the (index, value) pairs and flag an error whenever a pair does not match.
common_cols = df.columns.intersection(df_original.columns)

for col in common_cols:
    df1_ind_val_pair = df[col].index.astype(str) + ' ' + df[col].astype(str)
    df2_ind_val_pair = df_original[col].index.astype(str) + ' ' + df_original[col].astype(str)
    if any(df1_ind_val_pair != df2_ind_val_pair):
        print('Found one or more unequal (index, value) pairs in col {}'.format(col))
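A terser per-column check (a sketch, assuming both frames share the same row index, as in the demo) is Series.equals, which compares values and index labels together:

for col in common_cols:
    if not df[col].equals(df_original[col]):
        print('Found one or more unequal (index, value) pairs in col {}'.format(col))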
IIUC:
Use pd.DataFrame.align with a join method of 'inner', then pass the resulting tuple, unpacked, to pd.DataFrame.eq:
pd.DataFrame.eq(*df.align(dfo, 'inner'))
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False   True
To see rows that have all columns True, filter with this mask:
pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)
0    False
1    False
2    False
3    False
4    False
dtype: bool
With the sample data, however, the result will be empty:
df[pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)]
Empty DataFrame
Columns: [a, b, c]
Index: []
Same answer but with clearer code:

def eq(d1, d2):
    # align both frames on their common labels, then compare elementwise
    d1, d2 = d1.align(d2, 'inner')
    return d1 == d2

eq(df, dfo)
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False   True