Conditional Formatting on duplicate values using pandas - python

I have a DataFrame with two columns, A and B, and I want to highlight every value that occurs more than once across both columns.
For example, my DataFrame looks like this:
**A  B**
1  1
2  3
4  4
8  8
5  6
4  7
Then the output should be:
**A  B**
1  1   <--- both values highlighted
2  3
4  4   <--- both values highlighted
8  8   <--- both values highlighted
5  6
4  7   <--- value in column A highlighted
How do I do that?
Thanks in advance.

You can use this:
import numpy as np
import pandas as pd

def color_dupes(x):
    c1 = 'background-color: red'
    c2 = ''
    # mark every value that occurs more than once anywhere in the frame
    cond = x.stack().duplicated(keep=False).unstack()
    df1 = pd.DataFrame(np.where(cond, c1, c2), columns=x.columns, index=x.index)
    return df1

df.style.apply(color_dupes, axis=None)
# if df has many columns: df.style.apply(color_dupes, axis=None, subset=['A', 'B'])
For reference, a minimal end-to-end sketch (assuming the example data above; the styled output renders in a Jupyter notebook, or can be written to a file):
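import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 4, 8, 5, 4],
                   'B': [1, 3, 4, 8, 6, 7]})

def color_dupes(x):
    c1 = 'background-color: red'
    c2 = ''
    cond = x.stack().duplicated(keep=False).unstack()
    return pd.DataFrame(np.where(cond, c1, c2), columns=x.columns, index=x.index)

styled = df.style.apply(color_dupes, axis=None)
styled.to_html('styled.html')  # pandas >= 1.3; on older versions use styled.render()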
Explanation:
First we stack the dataframe to bring all the columns into a single series, then call duplicated with keep=False so that every occurrence of a duplicate is marked True:
df.stack().duplicated(keep=False)
0  A     True
   B     True
1  A    False
   B    False
2  A     True
   B     True
3  A     True
   B     True
4  A    False
   B    False
5  A     True
   B    False
dtype: bool
After this we unstack(), which gives a boolean dataframe with the same structure as the original:
df.stack().duplicated(keep=False).unstack()
       A      B
0   True   True
1  False  False
2   True   True
3   True   True
4  False  False
5   True  False
Once we have this boolean mask, np.where assigns the background color where the mask is True and no style otherwise.


How to get the column index matching a specific value in Pandas?

I have the following dataframe:
   0      1      2      3      4      5      6      7
True  False  False  False  False  False  False  False
[1 rows x 8 columns]
As you can see, there is one True value, in the first column, so I want to get its column index, 0. In another case, where the True is in the 4th column, I would like to get 4 instead, as in the dataframe below.
    0      1      2      3     4      5      6      7
False  False  False  False  True  False  False  False
[1 rows x 8 columns]
I tried to google it but failed to find what I want. Assume there are no designated column names in this case.
Looking forward to your help. Thanks.
IIUC, you are looking for idxmax:
>>> df
      0      1      2      3      4      5      6      7
0  True  False  False  False  False  False  False  False
>>> df.idxmax(axis=1)
0    0
dtype: object
>>> df
       0      1      2      3     4      5      6      7
0  False  False  False  False  True  False  False  False
>>> df.idxmax(axis=1)
0    4
dtype: object
Caveat: if all values are False, pandas still returns the first column label, because column 0 is the lowest index holding the (all-equal) maximum value:
>>> df
       0      1      2      3      4      5      6      7
0  False  False  False  False  False  False  False  False
>>> df.idxmax(axis=1)
0    0
dtype: object
Workaround: replace False by np.nan:
>>> df.replace(False, np.nan).idxmax(axis=1)
0   NaN
dtype: float64
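An equivalent guard, as a sketch: keep the idxmax result only for rows that actually contain a True (df.any(axis=1) is standard pandas; the result name is just for illustration):
# NaN for rows with no True at all, the column label otherwise
result = df.idxmax(axis=1).where(df.any(axis=1))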
If you want every field that is True:
cols_true = []
for idx, row in df.iterrows():
    for col in df.columns:
        if row[col]:
            cols_true.append(col)
print(cols_true)
Use boolean indexing:
df.columns[df.iloc[0]]
output:
Index(['0'], dtype='object')
Or numpy.where (note this returns positional indices, not column labels):
np.where(df)[1]
You may want to index the dataframe's index with a boolean column itself (column 0 in this case), as follows:
df.index[df[0]]
You'll get:
Int64Index([0], dtype='int64')
df.loc[:, df.any()].columns[0]
# 4
If you have several True values, you can also get them all with .columns.
Generalization
Imagine we have the following dataframe (several True values, in positions 4, 6 and 7):
       0      1      2      3     4      5     6     7
0  False  False  False  False  True  False  True  True
With the formula above:
df.loc[:, df.any()].columns
# Int64Index([4, 6, 7], dtype='int64')
df1.apply(lambda ss: ss.loc[ss].index.min(), axis=1).squeeze()
Output:
0
or
df1.loc[:, df1.iloc[0]].columns.min()

How to forward propagate/fill a specific value in a Pandas DataFrame Column/Series?

I have a boolean column in a dataframe that looks like the following:
True
False
False
False
False
True
False
False
False
I want to forward propagate/fill the True values n number of times. e.g. 2 times:
True
True
True
False
False
True
True
True
False
ffill does something similar for NaN values, but I can't find anything for a specific value as described. Is the easiest way to do this just a standard loop that iterates over the rows and modifies the column in question with a counter?
Each row is an equidistant time-series entry.
EDIT:
The current answers all solve my specific problem with a bool column, but one answer can be modified to be more general purpose:
>>> s = pd.Series([1, 2, 3, 4, 5, 1, 2, 3])
>>> s
0    1
1    2
2    3
3    4
4    5
5    1
6    2
7    3
dtype: int64
>>> condition_mask = s == 2
>>> s.mask(~condition_mask).ffill(limit=2).fillna(s).astype(int)
0    1
1    2
2    2
3    2
4    5
5    1
6    2
7    2
dtype: int64
You can still use ffill, but first you have to mask the False values:
s.mask(~s).ffill(limit=2).fillna(s)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
8 False
Name: 0, dtype: bool
For 2 times you could have:
s = s | s.shift(1) | s.shift(2)
You could generalize to n-times from there.
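A generalization sketch, assuming a boolean Series s (shift(fill_value=False) keeps the dtype boolean instead of introducing NaN; n is only an illustrative parameter):
from functools import reduce

n = 2  # propagate each True forward n rows
out = reduce(lambda acc, k: acc | s.shift(k, fill_value=False), range(1, n + 1), s)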
Try with rolling (the window n covers the original True plus the n - 1 rows after it, so n = 3 propagates each True forward 2 times):
n = 3
s.rolling(n, min_periods=1).max().astype(bool)
Out[147]:
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
8 False
Name: s, dtype: bool

Populate a new dataframe column with True if two cell values match another smaller subset dataframe in pandas

I am looking to populate a new dataframe column with True if two cell values match another smaller subset dataframe in pandas, otherwise with a value of False.
For instance, this is the original output dataframe I am constructing.
ID Type
1 A
2 B
3 A
4 A
5 C
6 A
7 D
8 A
9 B
10 A
And the smaller subset of the dataframe selected based on some criteria:
ID Type
1 A
3 A
4 A
5 C
7 D
10 A
What I am trying to accomplish is: when ID and Type in the output dataframe match the smaller subset dataframe, populate a new column called 'Result' with True; otherwise with False.
ID Type Result
1 A True
2 B False
3 A True
4 A True
5 C True
6 A False
7 D True
8 A False
9 B False
10 A True
You can .merge() the 2 dataframes using a left merge with the original dataframe as the base, turning on the indicator= parameter to record the merge result. Then map the merge result to True for rows that appear in both dataframes and False otherwise.
df_out = df1.merge(df2, on=['ID', 'Type'], how='left', indicator='Result')
df_out['Result'] = (df_out['Result'] == 'both')
Explanation:
With the indicator= parameter turned on, pandas shows which dataframe each row came from (both, left_only, or right_only):
df_out = df1.merge(df2, on=['ID', 'Type'], how='left', indicator='Result')
print(df_out)
ID Type Result
0 1 A both
1 2 B left_only
2 3 A both
3 4 A both
4 5 C both
5 6 A left_only
6 7 D both
7 8 A left_only
8 9 B left_only
9 10 A both
Then we convert 'both' to True and everything else to False with a boolean comparison, as follows:
df_out['Result'] = (df_out['Result'] == 'both')
print(df_out)
ID Type Result
0 1 A True
1 2 B False
2 3 A True
3 4 A True
4 5 C True
5 6 A False
6 7 D True
7 8 A False
8 9 B False
9 10 A True
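An alternative sketch that avoids the merge, assuming the same df1/df2 as above: build a composite (ID, Type) key per row and test membership (pd.MultiIndex.from_frame needs pandas 0.24+):
import pandas as pd

# composite keys for every row of each frame; membership test gives Result
keys_all = pd.MultiIndex.from_frame(df1[['ID', 'Type']])
keys_sub = pd.MultiIndex.from_frame(df2[['ID', 'Type']])
df1['Result'] = keys_all.isin(keys_sub)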

Pandas: check if column value is unique

I have a DataFrame like:
value
0 1
1 2
2 2
3 3
4 4
5 4
I need to check if each value is unique or not, and mark that boolean value to new column. Expected result would be:
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
I have tried:
df['unique'] = ""
df.loc[df["value"].is_unique, 'unique'] = True
But this throws exception:
cannot use a single bool to index into setitem
Any advise would be highly appreciated. Thanks.
Use Series.duplicated with the mask inverted by ~:
df['unique'] = ~df['value'].duplicated(keep=False)
print (df)
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
Or:
df['unique'] = np.where(df['value'].duplicated(keep=False), False, True)
This works as well:
df['unique'] = df.merge(df.value_counts().to_frame(), on='value')[0] == 1
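A simpler variant of the same counting idea, as a sketch (map aligns the value counts back onto the column, so no merge is needed):
# True where the value occurs exactly once
df['unique'] = df['value'].map(df['value'].value_counts()).eq(1)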

How to randomly sample elements between 2 rows in Pandas dataframe?

I have a Pandas dataframe df in the following format:
ColumnA. ColumnB. IsMatching
0 sadasdsad. asdsadsad True
1 asdsadsadas. asdsadasd. False
2 asdsadasd. asdsadsad. False
3 dfsdfsdfi ijijiiijj. False
4 sdasdsads. asdsadsad True
5 dfsdfsdfi ijijiiijj. False
6 jijijijij. ijijijiji. False
7 assdssads. asd222sad True
I would like to create a new dataframe, say new_df, which contains n randomly sampled rows with IsMatching == False from the original df between 2 True instances. For example, randomly select n rows between indices 0 and 4, then randomly select n rows between indices 4 and 7, etc.
A sample desired output for new_df (sampling 2 rows randomly between the True instances in df) is shown below. Note that it is possible that there are fewer than 2 rows between True instances; in that case I would like new_df to contain whatever rows are there.
1 asdsadsadas. asdsadasd. False
3 dfsdfsdfi ijijiiijj. False
5 dfsdfsdfi ijijiiijj. False
6 jijijijij. ijijijiji. False
I looked into the df.sample() method in Pandas, but it doesn't seem to have a provision for sampling between 2 rows. Any help and suggestions would be appreciated.
This should work:
from io import StringIO
import pandas as pd

data = StringIO("""
ColumnA. ColumnB. IsMatching
0 sadasdsad. asdsadsad True
1 asdsadsadas. asdsadasd. False
2 asdsadasd. asdsadsad. False
3 dfsdfsdfi ijijiiijj. False
4 sdasdsads. asdsadsad True
5 dfsdfsdfi ijijiiijj. False
6 jijijijij. ijijijiji. False
7 assdssads. asd222sad True
""")
df = pd.read_csv(data, sep=r'\s+')
# label each block of rows that follows a True marker
df['between_rows_group'] = df['IsMatching'].cumsum()
# take 1 sample per block
df.query('IsMatching == False').groupby('between_rows_group').sample(1)
# take samples with replacement
df.query('IsMatching == False').groupby('between_rows_group').sample(5, replace=True)
# take as many samples as are possible
df.query('IsMatching == False').groupby('between_rows_group').apply(lambda x: x.sample(min(5, len(x)))).reset_index(drop=True)
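One usage note, as a sketch: sample() accepts random_state for a reproducible draw, and min(n, len(x)) matches the question's "whatever rows are there" requirement (n = 2 here; new_df is just an illustrative name):
# reproducible draw of up to 2 False rows per block between True markers
new_df = (df.query('IsMatching == False')
            .groupby('between_rows_group')
            .apply(lambda x: x.sample(min(2, len(x)), random_state=0))
            .reset_index(drop=True))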
