Pandas Conditional formatting by comparing the column values of dataframe


import io
import pandas as pd
csv_data = '''App_name,pre-prod,prod,stage
matching-image,nginx,nginx,nginx
mismatching-image,nginx,nginx,nginx:1.23.3-alpine'''
df = pd.read_csv(io.StringIO(csv_data), sep=",")
html_table = df.to_html()
Is there a way to compare the values of columns in a dataframe and use the result in conditional formatting? I want to check whether the 'prod', 'pre-prod' and 'stage' values mismatch; if they do, the background color should be red. I have tried the following pandas methods, but none of them work.
df.style.apply()
df.style.apply_index()
df.style.applymap()

You can add style conditionally by applying style to a subset of your dataframe like:
import io
import pandas as pd
csv_data = '''App_name,pre-prod,prod,stage
matching-image,nginx,nginx,nginx
mismatching-image,nginx,nginx,nginx:1.23.3-alpine'''
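def add_color(row):
    return ['background-color: red'] * len(row)
df = pd.read_csv(io.StringIO(csv_data), sep=",")
# Select the rows where the three environments do NOT all match.
# Note that only the selected rows appear in the styled output.
mismatch = ~((df["pre-prod"] == df["prod"]) & (df["prod"] == df["stage"]))
df.loc[mismatch].style.apply(add_color, axis=1)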

import io
import pandas as pd
csv_data = '''
App_name,pre-prod,prod,stage
matching-image,nginx,nginx,nginx
matching-image,nginx,nginx,nginx
mismatching-image,nginx,nginx,nginx:1.23.3-alpine
mismatching-image,nginx,nginx,nginx:1.23.3-alpine
'''
df = pd.read_csv(io.StringIO(csv_data), sep=",")
def match_checker(row):
    if row['prod'] == row['pre-prod'] == row['stage']:
        return [''] * len(row)
    else:
        return ['background-color: red'] * len(row)
df = df.style.apply(match_checker, axis=1)
html_table = df.to_html()
with open('testpandas.html','w+') as html_file:
    html_file.write(html_table)
Updated @PeterSmith's answer.

It's also possible to style the entire DataFrame in one go by passing axis=None to apply.
We can identify rows which have differing values in the specified columns by comparing the first column (column 0) with the remaining columns (column 1-2) and identifying where there are unequal values using ne on axis=0.
df[['prod', 'stage']].ne(df['pre-prod'], axis=0)
# prod stage
# 0 False False
# 1 False True
Then we can check across rows for any rows which have any True values (meaning there is something that's not equal in the row).
df[['prod', 'stage']].ne(df['pre-prod'], axis=0).any(axis=1)
# 0 False
# 1 True
# dtype: bool
We can then simply apply the styles anywhere there's a True value in the resulting Series.
Altogether this could look something like:
def colour_rows_that_dont_match(df_: pd.DataFrame, comparison_cols: List[str]):
    # Sanity check that comparison_cols is what we expect
    assert isinstance(comparison_cols, list) and len(comparison_cols) > 1, \
        'Must be a list and provide at least 2 columns to compare'
    # Create an empty DataFrame to hold styles of the same shape as the original df
    styles_df = pd.DataFrame('', index=df_.index, columns=df_.columns)
    # Compare the first column's (col 0) values to the remaining columns.
    # Find rows where any values are not equal (ne)
    rows_that_dont_match = df_[comparison_cols[1:]].ne(df_[comparison_cols[0]], axis=0).any(axis=1)
    # Apply styles to rows which meet the above criteria
    styles_df.loc[rows_that_dont_match, :] = 'background-color: red'
    return styles_df
df.style.apply(
    colour_rows_that_dont_match,
    # This gets passed to the function
    comparison_cols=['pre-prod', 'prod', 'stage'],
    # Apply to the entire DataFrame at once
    axis=None
).to_html(buf='test_df.html')
This writes the styled table to test_df.html, with the mismatching row highlighted in red.
Setup, version, and imports:
from typing import List
import pandas as pd # version 1.5.2
df = pd.DataFrame({
    'App_name': ['matching-image', 'mismatching-image'],
    'pre-prod': ['nginx', 'nginx'],
    'prod': ['nginx', 'nginx'],
    'stage': ['nginx', 'nginx:1.23.3-alpine']
})
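As an alternative way to find the mismatching rows, one could count the distinct values per row instead of chaining ne and any. This is only a sketch under the assumption that every comparison column holds plain strings; the names compare_cols, mismatch and styles are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'App_name': ['matching-image', 'mismatching-image'],
    'pre-prod': ['nginx', 'nginx'],
    'prod': ['nginx', 'nginx'],
    'stage': ['nginx', 'nginx:1.23.3-alpine'],
})

# Columns to compare (illustrative name)
compare_cols = ['pre-prod', 'prod', 'stage']

# A row mismatches when it holds more than one distinct value across those columns
mismatch = df[compare_cols].nunique(axis=1) > 1

# Build a frame of CSS strings with the same labels as df
styles = pd.DataFrame('', index=df.index, columns=df.columns)
styles.loc[mismatch, :] = 'background-color: red'

print(styles)
```

The resulting frame can then be handed to the Styler, for example with df.style.apply(lambda _: styles, axis=None).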

Related

Pandas include single row in df after filtering with .loc

So, in this function:
def filter_by_freq(df, frequency):
    filtered_df = df.copy()
    if frequency.upper() == 'DAY':
        pass
    else:
        date_obj = filtered_df['Date'].values[0]
        target_day = pd.to_datetime(date_obj).day
        target_month = pd.to_datetime(date_obj).month
        final_date_obj = filtered_df['Date'].values[-1]
        if frequency.upper() == 'MONTH':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
        elif frequency.upper() == 'YEAR':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.month.eq(target_month)]
    return filtered_df
How can I also include the very last row of the original df in the .loc selection? For the month frequency I tried filtered_df = filtered_df.loc[(filtered_df['Date'].dt.day.eq(target_day)) | (filtered_df['Date'].dt.date.eq(final_date_obj))], but it didn't work.
Thanks for your time!

Here's one way you could do it. In this example I have a df and I want to filter out all rows that have c1 > 0.5, but keep the last row no matter what. I create a boolean Series called lte_half to track the first condition, and then another boolean array called end_ind, which is True only for the last row. The filtered table is created by taking all rows that pass either condition, combined with the | operator.
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'c1':np.random.rand(20)})
lte_half = df['c1'].le(0.5)
end_ind = df.index == df.index[-1]
filt_df = df[lte_half | end_ind]
print(filt_df)
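The same idea can be carried over to the Date filter from the question. The following is a minimal sketch, assuming a DataFrame with a datetime column named Date and a default index; the sample data and the names same_day, is_last and monthly are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one row per day
df = pd.DataFrame({'Date': pd.date_range('2021-01-15', '2021-04-20', freq='D')})
df['value'] = range(len(df))

# Day of month of the first row, as in the question
target_day = df['Date'].iloc[0].day

# Condition 1: rows that fall on the same day of the month
same_day = df['Date'].dt.day.eq(target_day)

# Condition 2: True only for the very last row of the original frame
is_last = df.index == df.index[-1]

# Keep rows that satisfy either condition
monthly = df.loc[same_day | is_last]
print(monthly)
```

Comparing index labels avoids comparing date values of different types, which may be why the attempt in the question did not match the last row.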

Compare entire rows for equality if some condition is satisfied

Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. Using pandas, I have found where in the data these scores exist, and I can print "found" or "not found". However, I want my code to print the name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.

Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want because it does element-wise compare and returns a row of True/False. So you need to aggregate those rows with .all(axis=1) which finds rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
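The question stores the scores in a plain list rather than a Series. Here is a minimal sketch of the same comparison, assuming the list is ordered the same way as the score columns; the names csv_text, target and names are illustrative.

```python
import pandas as pd
from io import StringIO

csv_text = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(csv_text), index_col='name')

# Scores as a plain list, in the same order as the columns (assumption)
x = [1, 0, 4]

# Label the list with the column names so the comparison aligns explicitly
target = pd.Series(x, index=df.columns)

# Keep rows where every column equals the target, then read off the names
names = df[df.eq(target).all(axis=1)].index.tolist()
print(names)  # ['Charlie']
```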

You need to use DataFrame.loc, which would work like this (assuming name is a regular column rather than the index):
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])

Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2':match2, 'match3' :match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the name based on the index of mydf
# (if more than one name matches, it prints all of them; if only one matches, it prints just that one)
for i in range(0,len(mydf)):
    print(mydf['name'].iloc[i])

You can use the following. Here data is your DataFrame; change the name to match your own.
If the values [1,0,4] are stored as integers:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the columns are of object (string) type, compare against strings instead:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])

Comparing two Data Frames and getting differences

I want to compare two Data Frames and print out my differences in a selective way. The original post illustrated this with three pictures: Dataframe 1, Dataframe 2, and the desired output, Dataframe 3.
What I have tried so far?
import pandas as pd
import numpy as np
df1 = pd.read_excel("01.xlsx")
df2 = pd.read_excel("02.xlsx")
def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        print("Data Types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:  # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'Naziv usluge']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'Service Previous': changed_from, 'Service Current': changed_to},
                            index=changed.index)
df3 = diff_pd(df1, df2)
df3 = df3.fillna(0)
df3 = df3.reset_index()
print(df3)
To be fair, I found that code in another thread. It gets the job done, but I still have some issues:
My dataframes are not equal, what do I do?
I don't fully understand the code I provided.
Thank you!

How about something easier to start with ...
Try this
import pandas as pd
data1={'Name':['Tom','Bob','Mary'],'Age':[20,30,40],'Pay':[10,10,20]}
data2={'Name':['Tom','Bob','Mary'],'Age':[40,30,20]}
df1=pd.DataFrame.from_records(data1)
df2=pd.DataFrame.from_records(data2)
# Checking Columns
for col in df1.columns:
    if col not in df2.columns:
        print(f"DF2 Missing Col {col}")
# Check Col Values
for col in df1.columns:
    if col in df2.columns:
        # Ok we have the same column
        if list(df1[col]) == list(df2[col]):
            print(f"Columns {col} are the same")
        else:
            print(f"Columns {col} have differences")
It should output
DF2 Missing Col Pay
Columns Age have differences
Columns Name are the same
Python 3.6 or later is needed for the f-strings; on older versions, change the string formatting.
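To see which cells differ rather than only which columns, one option is DataFrame.compare (available from pandas 1.1). This is a sketch, not the approach from the question: it assumes both frames share the same row labels, and it restricts the comparison to the columns they have in common. The names common and diff are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Bob', 'Mary'], 'Age': [20, 30, 40], 'Pay': [10, 10, 20]})
df2 = pd.DataFrame({'Name': ['Tom', 'Bob', 'Mary'], 'Age': [40, 30, 20]})

# compare() needs identically labelled frames, so keep only the shared columns
common = df1.columns.intersection(df2.columns)

# Rows and columns without any difference are dropped from the result;
# 'self' holds the value from df1 and 'other' the value from df2
diff = df1[common].compare(df2[common])
print(diff)
```

Here only the Age column and the rows for Tom and Mary remain, because those are the only cells whose values differ.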

Pandas set element style dependent on another dataframe with multi index

I have previously asked the question Pandas set element style dependent on another dataframe, which I have a working solution to, but now I am trying to apply it to a data frame with a multi index and I am getting an error, which I do not understand.
Problem
I have a pandas df and accompanying boolean matrix. I want to highlight the df depending on the boolean matrix.
Data
import pandas as pd
import numpy as np
from datetime import datetime
date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
i = len(date)
dic = {'X':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
       'Y':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
       'Z':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B'])}
df = pd.concat(dic.values(),axis=1,keys=dic.keys())
boo = [True, False]
bool_matrix = {'X':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
               'Y':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
               'Z':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B'])}
bool_matrix =pd.concat(bool_matrix.values(),axis=1,keys=bool_matrix.keys())
My attempted solution
def highlight(value):
    return 'background-color: green'
my_style = df.style
for column in df.columns:
    for i in df[column].index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i, column])
my_style
Results
The above throws an AttributeError: 'Series' object has no attribute 'applymap'
I do not understand what is returning as a Series. This is a single value I am subsetting and this solution worked for non multi-indexed df's as shown below.
Without Multi-index
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(24)
date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
df = pd.DataFrame({'A': np.linspace(1, 100, len(date))})
df = pd.concat([df, pd.DataFrame(np.random.randn(len(date), 4), columns=list('BCDE'))],
               axis=1)
df['date'] = date
df.set_index("date", inplace = True)
boo = [True, False]
bool_matrix = pd.DataFrame(np.random.choice(boo, (len(date), 5),p=[0.3,.7]), index = date,columns=list('ABCDE'))
def highlight(value):
    return 'background-color: green'
my_style = df.style
for column in df.columns:
    for i in bool_matrix.index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i,column])
my_style
Documentation
The docs make reference to CSS Classes and say that "Index label cells include level<k> where k is the level in a MultiIndex." I am obviously indexing this wrong, but am stumped on how to proceed.

It's very nice that there is a runnable example.
You can use df.style.apply(..., axis=None) to apply a highlight function to the whole DataFrame at once.
With your df and bool_matrix, try this:
def highlight(value):
    # Start from an all-string frame so CSS text is never written into float columns
    d = pd.DataFrame('', index=value.index, columns=value.columns)
    for c in d.columns:
        for r in df.index:
            if bool_matrix.loc[r, c]:
                d.loc[r, c] = 'background-color: green'
            else:
                d.loc[r, c] = ''
    return d
df.style.apply(highlight, axis=None)
Or, to keep the code simple, you can try the following (DataFrame.applymap is named DataFrame.map in pandas 2.1 and later):
def highlight(value):
    return bool_matrix.applymap(lambda x: 'background-color: green' if x else '')
df.style.apply(highlight, axis=None)
Hope this is what you need.
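A vectorised variant of the same idea, sketched under the assumption that the boolean frame has exactly the same index and MultiIndex columns as the data. The function name build_styles, its mask parameter and the small sample frames are illustrative, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data: MultiIndex columns, date index
rng = np.random.default_rng(0)
dates = pd.date_range('2016-01-01', periods=5, freq='D')
cols = pd.MultiIndex.from_product([['X', 'Y'], ['A', 'B']])
df = pd.DataFrame(rng.standard_normal((5, 4)), index=dates, columns=cols)
bool_matrix = pd.DataFrame(rng.random((5, 4)) < 0.3, index=dates, columns=cols)

def build_styles(data, mask):
    # Pick the CSS string cell by cell from the boolean mask, with no Python loops
    css = np.where(mask.to_numpy(), 'background-color: green', '')
    # Return a frame labelled exactly like the data being styled
    return pd.DataFrame(css, index=data.index, columns=data.columns)

styles = build_styles(df, bool_matrix)
print(styles)
```

Passing it to the Styler would then look like df.style.apply(build_styles, mask=bool_matrix, axis=None), since Styler.apply forwards extra keyword arguments to the function.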

How can I count a specific value in group_by in pandas?

I have a dataframe and I use groupby to group it by Season. One of the columns of the original df is named Check and consists of True and False. My aim is to count the True values for each group and put them in the new dataframe.
import pandas as pd
df = ....
df['Check'] = df['Actual'] == df['Prediction']
grouped_per_year = df.groupby('Season')
df_2= pd.DataFrame()
df_2['Seasons'] = total_matches_per_year.keys()
df_2['Successes'] = ''
df_2['Total_Matches'] = list(grouped_per_year.size())
df_2['SR'] = df_2['Successes'] / df_2['Total_Matches']
df_2['Money_In'] = list(grouped_per_year['Money_In'].apply(sum))
df_2['Profit (%)'] = (df_profit['Money_In'] - df_profit['Total_Matches']) / df_profit['Total_Matches'] * 100.
I have tried:
successes_per_year = grouped_per_year['Pred_Check'].value_counts()
but I don't know how to get only the True count.

For counting True, you can also use sum (as True=1 and False=0 when doing a numerical operation):
grouped_per_year['Pred_Check'].sum()
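A self-contained sketch of how the summary table from the question might be built this way. The sample data is invented, and the boolean column is called Check as in the question's description; Successes, Total_Matches and SR follow the column names used there.

```python
import pandas as pd

# Invented sample data for illustration
df = pd.DataFrame({
    'Season': [2019, 2019, 2019, 2020, 2020],
    'Actual': ['H', 'A', 'D', 'H', 'A'],
    'Prediction': ['H', 'D', 'D', 'A', 'A'],
})
df['Check'] = df['Actual'] == df['Prediction']

# Summing booleans counts the True values; size counts all rows in the group
summary = df.groupby('Season')['Check'].agg(Successes='sum', Total_Matches='size')

# Success rate per season
summary['SR'] = summary['Successes'] / summary['Total_Matches']
print(summary)
```

Building the table from a single groupby keeps the seasons aligned by label, so the columns do not have to be assembled from separate lists.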
