formatting colour for cell and row in excel using pandas - python

My requirement is to highlight the cells that differ in yellow, and to highlight newly added rows in red, in the Excel output.
load1 dataset
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-04-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-05-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-05-20
7 22-04-20 29-04-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
load2 table
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-09-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-06-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-06-20
7 22-04-20 29-08-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
9 15-05-20 16-05-20 17-05-20 18-05-20
Output: I am looking for the output in Excel in the format below, with the colours highlighted.
I am new to pandas. I was able to get the differences at the dataframe level, but not in the exported Excel file, where each cell and row should be formatted with colours. Please help.
Below is the script I tried to get the differences.
import pandas as pd
import os
import numpy as np
colour1= pd.read_excel('pandas.xlsx',sheet_name='load1')
colour2= pd.read_excel('pandas.xlsx',sheet_name='load2')
colour_merge=colour1.merge(colour2,left_on='StudentID', right_on='StudentID',how='outer')
colour_merge['Visit1dif']= np.where(colour_merge['Visit1_x']==colour_merge['Visit1_y'],0,1)
colour_merge['Visit2dif']= np.where(colour_merge['Visit 2_x']==colour_merge['Visit 2_y'],0,1)
colour_merge['Visit3dif']= np.where(colour_merge['Visit 3_x']==colour_merge['Visit 3_y'],0,1)
colour_merge['Visit4dif']= np.where(colour_merge['Visit 4_x']==colour_merge['Visit 4_y'],0,1)
colour_merge[['StudentID','Visit1_x','Visit1_y','Visit1dif','Visit 2_x','Visit 2_y','Visit2dif','Visit 3_x','Visit 3_y','Visit3dif','Visit 4_x','Visit 4_y','Visit4dif']]

I think you have two sheets, load1 and load2, and want to add a third one as displayed in the picture.
To add a new sheet, pandas.ExcelWriter should be opened in append mode, and the openpyxl library is needed along with pandas.DataFrame.to_excel and openpyxl's conditional formatting functions:
import pandas as pd
import openpyxl
import xlrd
from openpyxl.styles import Alignment, Font, NamedStyle, PatternFill, colors
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule, ColorScaleRule, CellIsRule, FormulaRule
file = r'C:\\app\\excel\\pandas.xlsx'
xls = xlrd.open_workbook(file)
sht = xls.sheet_names()
colour1= pd.read_excel(file,sheet_name=sht[0])
colour2= pd.read_excel(file,sheet_name=sht[1])
colour_merge = colour1.merge(colour2,left_on='StudentID',right_on='StudentID',how='outer')
colour_merge = colour_merge[['StudentID','Visit 1_x','Visit 1_y','Visit 2_x','Visit 2_y','Visit 3_x','Visit 3_y','Visit 4_x','Visit 4_y']]
l_sid = []
mxc1 = max(colour1['StudentID'])+1
mxc2 = max(colour2['StudentID'])+1
for j in range(min(mxc1,mxc2),max(mxc1,mxc2)):
    l_sid.append(j+1)
writer_args = { 'path': file, 'mode': 'a', 'engine': 'openpyxl'}
with pd.ExcelWriter(**writer_args) as xlsx:
    colour_merge.to_excel(xlsx, 'load3', index=False)
    ws = xlsx.sheets['load3']
    mxc= ws.max_column
    title_row = '1'
    yellow = PatternFill(bgColor=colors.YELLOW)
    red = PatternFill(bgColor=colors.RED)
    i=1
    while i <= mxc*2:
        l_col = []
        l_col.append(chr(i+65))
        l_col.append(chr(i+65+1))
        for j in range(2,mxc+1):
            for k in l_col:
                if j not in l_sid:
                    ws.conditional_formatting.add(k+str(j), FormulaRule(formula=[l_col[0]+'$'+str(j)+'<>'+l_col[1]+'$'+str(j)], stopIfTrue=True, fill=yellow))
        i+=2
    r = Rule(type="expression", dxf=DifferentialStyle(fill=red), stopIfTrue=False)
    r.formula = ['1=1']
    while l_sid:
        el=l_sid.pop(-1)
        ws.conditional_formatting.add('A'+str(el)+':'+chr(64+mxc)+str(el), r)
    xlsx.save()
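A different route to the same colouring, shown only as a minimal sketch: work out in pandas which cells changed and which rows are new, then let a pandas Styler write the fills. The helper names classify_changes and write_highlighted, the output path argument and the three sample rows are assumptions made for illustration and are not part of the answer above. write_highlighted also assumes jinja2 and openpyxl are installed, and it is defined but not called here.

```python
import pandas as pd


def classify_changes(old, new, key):
    """Label every cell of `new` as 'new', 'changed' or '' (hypothetical helper)."""
    old_i = old.set_index(key)
    new_i = new.set_index(key)
    labels = pd.DataFrame("", index=new_i.index, columns=new_i.columns)

    # Rows whose key does not exist in the old sheet are new rows.
    added = ~new_i.index.isin(old_i.index)
    labels.loc[added, :] = "new"

    # For keys present in both sheets, compare cell by cell.
    common = new_i.index[~added]
    differs = new_i.loc[common] != old_i.loc[common, new_i.columns]
    labels.loc[common] = labels.loc[common].mask(differs, "changed")
    return labels


def write_highlighted(new, labels, key, path, sheet_name="load3"):
    """Write `new` to Excel: yellow for changed cells, red for new rows."""
    css = {"changed": "background-color: yellow",
           "new": "background-color: red",
           "": ""}
    styles = labels.apply(lambda col: col.map(css))
    # axis=None: the function returns one frame of CSS strings for the whole table.
    styled = new.set_index(key).style.apply(lambda _: styles, axis=None)
    styled.to_excel(path, sheet_name=sheet_name, engine="openpyxl")


# Small sample taken from the load1/load2 tables in the question.
load1 = pd.DataFrame({
    "StudentID": [3, 4],
    "Visit1": ["18-04-20", "19-04-20"],
    "Visit 2": ["25-04-20", "26-04-20"],
})
load2 = pd.DataFrame({
    "StudentID": [3, 4, 9],
    "Visit1": ["18-04-20", "19-04-20", "15-05-20"],
    "Visit 2": ["25-09-20", "26-04-20", "16-05-20"],
})

labels = classify_changes(load1, load2, "StudentID")
print(labels)
```

With the real workbook, something like write_highlighted(colour2, classify_changes(colour1, colour2, 'StudentID'), 'StudentID', 'diff.xlsx') should produce the coloured sheet in a separate file. Because StudentID becomes the index in this sketch, the ID cell itself is not filled.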

Related

Comparing 2 revisions of excel files in python pandas

I am very new to pandas. It might be a silly question to some of you.
I am looking to compare 2 excel files and output the changes or the new entries
old.csv
Product Price Description
1 1.25 Product 1
2 2.25 Product 2
3 3.25 Product 3
new.csv
Product Price Description
1 1.25 Product 1 # Product 2 not in list
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
TRIED
import pandas as pd
import numpy as np
import requests
url = '<SomeUrl>/<PriceList>.xls'
resp = requests.get(url)
df = pd.DataFrame(pd.read_excel(resp.content))
df.to_csv('new.csv')
old = pd.read_csv('old.csv')
new = pd.read_csv('new.csv')
changes = new.loc[new['Price'] != old['Price']]
changes_csv = changes[['Product', 'Price', 'Description']]
print(changes_csv)
EXPECTING
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
I get the correct results if the length matches exactly. Otherwise I get
pandas valueerror can only compare identically labeled objects
BONUS
It would be awesome if I could produce output for discontinued products
You could create a master index of all products, rebuild both the old and new dataframes against that master index, then use df.compare() to compare the two dataframes:
import pandas as pd
df1 = pd.DataFrame([[1,1.25,'Product 1'],[2,2.25,'Product 2'],[3,3.25,'Product 3']], columns=['Product','Price','Description'])
df2 = pd.DataFrame([[1,1.25,'Product 1'],[3,3.5,'Product 2'],[4,4.25,'Product 3 Change']], columns=['Product','Price','Description'])
df1product = df1[['Product']]
df2product = df2[['Product']]
dfproducts = df1product.merge(df2product, on='Product', how='outer')
df1 = dfproducts.merge(df1, how='left', on='Product')
df1.set_index(df1['Product'], inplace=True)
df2 = dfproducts.merge(df2, how='left', on='Product')
df2.set_index(df2['Product'], inplace=True)
dfcompare = df1.compare(df2, align_axis=0)
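As an alternative to df.compare(), an outer merge with indicator=True can label each product directly. This is a minimal sketch that assumes Product is a unique key in both files; the status column name and its labels are my own choices, and the frames are typed in from the old.csv/new.csv listing in the question.

```python
import pandas as pd

old = pd.DataFrame({
    "Product": [1, 2, 3],
    "Price": [1.25, 2.25, 3.25],
    "Description": ["Product 1", "Product 2", "Product 3"],
})
new = pd.DataFrame({
    "Product": [1, 3, 4],
    "Price": [1.25, 3.50, 4.25],
    "Description": ["Product 1", "Product 3", "Product 4"],
})

# indicator=True adds a _merge column: left_only, right_only or both.
merged = old.merge(new, on="Product", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

status = merged["_merge"].astype(str).map({
    "left_only": "discontinued",
    "right_only": "new",
    "both": "unchanged",
})

# Products present in both files whose price differs.
price_changed = (merged["_merge"] == "both") & (merged["Price_old"] != merged["Price_new"])
merged["status"] = status.mask(price_changed, "price update")

report = merged.loc[merged["status"] != "unchanged",
                    ["Product", "Price_old", "Price_new", "status"]]
print(report)
```

Because the comparison happens on the merged frame, the two inputs no longer need the same length or index, which is what triggered the ValueError in the question.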
I have solved the problem. Although @WCeconomics kindly took the time to type out the code, it did not give me the output I wanted, likely because I am a noob with pandas.
This is how I solved it, in case it is useful to the community.
import pandas as pd
import openpyxl # to write excel files
from openpyxl.utils.dataframe import dataframe_to_rows
old = pd.read_excel('old.xls')
new = pd.read_excel('new.xls')
# data for these is in the same format as in question, with 'Product Description' instead of 'Description'
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'))
df = merged[['Product', 'Product Description_old', 'Price_old', 'Price_new']]
changes = df.loc[(df['Price_new'] > df['Price_old'])].dropna(how='any', axis=0)
wb = openpyxl.Workbook()
ws = wb.active
for r in dataframe_to_rows(changes, index=False, header=True):
    ws.append(r)
wb.save('avp_changes.xls')

Pandas to_csv progress bar with tqdm

As the title suggests, I am trying to display a progress bar while performing pandas.to_csv.
I have the following script:
def filter_pileup(pileup, output, lists):
    tqdm.pandas(desc='Reading, filtering, exporting', bar_format=BAR_DEFAULT_VIEW)
    # Reading files
    pileup_df = pd.read_csv(pileup, '\t', header=None).progress_apply(lambda x: x)
    lists_df = pd.read_csv(lists, '\t', header=None).progress_apply(lambda x: x)
    # Filtering pileup
    intersection = pd.merge(pileup_df, lists_df, on=[0, 1]).progress_apply(lambda x: x)
    intersection.columns = [i for i in range(len(intersection.columns))]
    intersection = intersection.loc[:, 0:5]
    # Exporting filtered pileup
    intersection.to_csv(output, header=None, index=None, sep='\t')
On the first few lines I found a way to integrate a progress bar, but this method doesn't work for the last line. How can I achieve that?
You can divide the dataframe into chunks of n rows and save it to a csv chunk by chunk, using mode='w' for the first chunk and mode='a' for the rest:
Example:
import numpy as np
import pandas as pd
from tqdm import tqdm
df = pd.DataFrame(data=[i for i in range(0, 10000000)], columns = ["integer"])
print(df.head(10))
chunks = np.array_split(df.index, 100) # split into 100 chunks
for chunk, subset in enumerate(tqdm(chunks)):
    if chunk == 0: # first chunk: create the file and write the header
        df.loc[subset].to_csv('data.csv', mode='w', index=True)
    else: # later chunks: append without repeating the header
        df.loc[subset].to_csv('data.csv', header=None, mode='a', index=True)
Output:
integer
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
100%|██████████| 100/100 [00:12<00:00, 8.12it/s]

savefig png where its file name is exactly the same as the excel sheet for loop

I have an Excel file (screenshot below) with two sheets and tried to plot the data from each sheet using a for loop. I have already succeeded in creating two plots from these two sheets using the code below.
The problem is that I also want to automatically save the plots into separate PNG files, where each PNG file name is exactly the same as the sheet name in the Excel file. The PNG file names I got are '83' and '95', not 'E1' and 'E4'. Screenshot below.
Before the savefig call there are two more for loops for annotating. Do the variables of these two loops need to be changed?
Thank you in advance.
import pandas as pd
import matplotlib.pyplot as plt
path ='F:\Backup\JN\TOR\TOR HLS.xlsx'
data= pd.ExcelFile(path)
sheets = data.sheet_names
for i in sheets:
    well=pd.read_excel(data, sheet_name=i)
    plt.plot(well['T'], well['mdpl pt'], marker='o', color='blue')
    plt.plot(well['P'], well['mdpl pt'], marker='o', color='red')
    for i, txt in enumerate(well['csg']):
        plt.annotate(txt, ((well['x csg']+5)[i], well['mdpl csg'][i]))
    for i, txt in enumerate(well['liner']):
        plt.annotate(txt, ((well['x liner']+5)[i], well['mdpl liner'][i]))
    plt.savefig(str(i), dpi=300, transparent='True')
    plt.close(i)
I tried a snippet based on your code and it works for me: it creates two images named E3 and E4, since my sheet names are E3 and E4. I have also included my Excel data as output; please check it too.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
path ='HLS.xlsx'
data= pd.ExcelFile(path)
sheets = data.sheet_names
print(sheets)
#['E3', 'E4']
for i in sheets:
    well=pd.read_excel(data, sheet_name=i)
    print(well)
    plt.plot(well['A'], well['B'], color='blue')
    plt.savefig(i)
    plt.close(i)
#well
#E3 first #E4 second
""" A B
0 1 6
1 2 5
2 3 4
3 4 3
4 5 2
5 6 1
A B
0 6 1
1 5 2
2 4 3
3 3 4
4 2 5
5 1 6"""
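The snippet above has no annotation loops, which is probably why it does not reproduce the '83'/'95' file names. In the question's code the inner loops reuse i (for i, txt in enumerate(...)), so by the time savefig runs, i holds the last row index rather than the sheet name. A minimal sketch of the fix, using separate names for the sheet and the row counter; the function name plot_sheets, the temporary output directory and the sample values are made up for illustration, and only the 'csg' annotation loop is shown.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_sheets(frames, out_dir):
    """Plot each frame and save it as <sheet name>.png (hypothetical helper)."""
    saved = []
    for sheet_name, well in frames.items():
        fig, ax = plt.subplots()
        ax.plot(well["T"], well["mdpl pt"], marker="o", color="blue")
        # `row` is the counter here, so `sheet_name` is never overwritten.
        for row, txt in enumerate(well["csg"]):
            ax.annotate(str(txt), (well["x csg"].iloc[row] + 5, well["mdpl csg"].iloc[row]))
        target = os.path.join(out_dir, f"{sheet_name}.png")
        fig.savefig(target, dpi=300, transparent=True)
        plt.close(fig)
        saved.append(target)
    return saved


# Made-up data standing in for the two sheets of the workbook.
frames = {
    name: pd.DataFrame({
        "T": [20, 40, 60],
        "mdpl pt": [0, -100, -200],
        "csg": ["13 3/8", "9 5/8", "7"],
        "x csg": [10, 10, 10],
        "mdpl csg": [0, -100, -200],
    })
    for name in ["E1", "E4"]
}

saved = plot_sheets(frames, tempfile.mkdtemp())
print(saved)
```

With the real file, pd.read_excel(path, sheet_name=None) returns a dict of all sheets keyed by sheet name, which can be passed in as frames.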

How to save a pandas pivot table with xlsxwriter in excel

I want to save a pandas pivot table, properly and nicely formatted, into an Excel workbook.
I have a pandas pivot table, based on this formula:
table = pd.pivot_table(d2, values=['Wert'], index=['area', 'Name'], columns=['Monat'],
aggfunc={'Wert': np.sum}, margins=True).fillna('')
From my original dataframe:
df
Area Name2 Monat Wert
A A 1 2
A A 1 3
A B 1 2
A A 2 1
so the pivot table looks like this:
           Wert
Monat         1  2  All
Area Name
A    A        5  1    6
     B        2       2
All           7  1    8
Then I want to save this in an excel workbook with the following code:
import pandas as pd
import xlsxwriter
workbook = xlsxwriter.Workbook('myexcel.xlsx')
worksheet1 = workbook.add_worksheet('table1')
caption = 'Table1'
worksheet1.set_column(1, 14, 25) #irrelevant, just a random size right now
worksheet1.write('B1', caption)
worksheet1.add_table('B3:K100', {'data': table.values.tolist()}) #also wrong size from B3 to K100
workbook.close()
But this looks like this (with different values), so the headers are missing:
How can I adjust it and save a pivot table in excel?
If I am using the pandas command .to_excel it looks like this:
Which is fine, but the column widths do not fit the names, the background colour does not look nice, and I am also missing a caption.
I found the solution with the help of this topic:
flattened = pd.DataFrame(table.to_records())
flattened.columns = [column.replace("('Wert', ", "Monat: ").replace(")", "") for column in flattened.columns] ##only for renaming the column headers
And then:
workbook = xlsxwriter.Workbook(excelfilename, options={'nan_inf_to_errors': True})
worksheet = workbook.add_worksheet('Table1')
worksheet.set_column(0, flattened.shape[1], 25)
worksheet.add_table(0, 0, flattened.shape[0], flattened.shape[1]-1,
                    {'data': flattened.values.tolist(),
                     'columns': [{'header': c} for c in flattened.columns.tolist()],
                     'style': 'Table Style Medium 9'})
workbook.close()
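The string replacement on the record names works, but the same flat layout can probably be reached without parsing tuple strings. A minimal sketch, assuming values is passed as a plain string instead of a list so that the pivot columns stay single-level; the frame is the small example from the question and the "Monat: " prefix simply mirrors the renaming above.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["A", "A", "A", "A"],
    "Name": ["A", "A", "B", "A"],
    "Monat": [1, 1, 1, 2],
    "Wert": [2, 3, 2, 1],
})

# values="Wert" (not ["Wert"]) keeps the columns as a plain Index: 1, 2, 'All'.
table = pd.pivot_table(df, values="Wert", index=["Area", "Name"],
                       columns="Monat", aggfunc="sum", margins=True)

# Turn the row index into ordinary columns and prefix the month columns.
flattened = table.reset_index()
key_columns = ["Area", "Name"]
flattened.columns = [c if c in key_columns else f"Monat: {c}"
                     for c in flattened.columns]
print(flattened)
```

The resulting frame has one header per column, so flattened.values.tolist() and flattened.columns can be handed to worksheet.add_table exactly as in the answer. Empty cells are still NaN at this point, so either keep nan_inf_to_errors or call fillna('') first.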

for loop for openpyxl multiple chart creation

I'm trying to create a for loop to create multiple line charts in openpyxl, all at once. Certain indices in an array would be the bookends for the data the chart would draw data from. Is this possible in openpyxl?
My data in the excel spreadsheet looks like this:
1 Time Battery Voltage
2 2019-06-05 00:00:00 45
3 2019-06-05 00:01:50 49
4 2019-06-05 00:02:30 51
5 2019-06-05 00:04:58 34
...
import os
import openpyxl
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference, Series
from openpyxl.chart.axis import DateAxis
from datetime import date, datetime, timedelta, time
os.chdir('C:\\Users\\user\\test')
wb = openpyxl.load_workbook('log.xlsx')
sheet = wb['sheet2']
ws2 = wb['sheet2']
graphIntervals = [0,50,51,100,101,150] # filled with tuples of two integers,
# representing the top-left and bottom right of the rectangular
# selection of cells containing chart data I'm trying to graph
starts = graphIntervals[::2]
ends = graphIntervals[1::2]
for i in graphIntervals:
    c[i] = LineChart()
    c[i].title = "Chart Title"
    c[i].style = 12
    c[i].y_axis.crossAx = 500
    c[i].x_axis = DateAxis(crossAx=100)
    c[i].x_axis.number_format = 'd-HH-MM-SS'
    c[i].x_axis.majorTimeUnit = "days"
    c[i].y_axis.title = "Battery Voltage"
    c[i].x_axis.title = "Time"
    data = Reference(ws2, min_col=2, min_row=starts, max_col=2, max_row=ends)
    c[i].add_data(data, titles_from_data=True)
    dates = Reference(ws2, min_col=1, min_row=starts, max_row=ends)
    c[i].set_categories(dates)
    s[i] = c[i].series[0]
    s[i].graphicalProperties.line.solidFill = "BE4B48"
    s[i].graphicalProperties.line.width = 25000 # width in EMUs.
    s[i].smooth = True # Make the line smooth
    ws2.add_chart(c[i], "C[i+15]") # +15 for spacing
wb.save('log.xlsx')
Ideally I would end up making (however many values are in graphIntervals/2) charts.
I know I need to incorporate zip() in my data variable, otherwise it has no way to move on to the next set of values to create charts from. I think it would be something like zip(starts, ends) but I'm not sure.
Is any of this possible through openpyxl? Although I haven't found any, does anyone have examples I could reference?
Followed advice in the comments. Here's that function called in a for loop:
for i in range(0, len(graphIntervals), 2):
    min_row = graphIntervals[i] + 1
    max_row = graphIntervals[i+1] + 1
    # skip headers on first row
    if min_row == 1:
        min_row = 2
    dates = chart.Reference(ws2, min_col=1, min_row=min_row, max_row=max_row)
    vBat = chart.Reference(ws2, min_col=2, min_row=min_row, max_col=2, max_row=max_row)
    qBat = chart.Reference(ws2, min_col=3, min_row=min_row, max_col=3, max_row=max_row)
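The loop above only builds the references. A minimal end-to-end sketch of one chart per interval with zip(starts, ends) might look like the following. The function name add_interval_charts, the anchor column D, the 15-row spacing and the generated sample rows are assumptions for illustration; the +1 row shift and the header skip follow the loop above.

```python
from datetime import datetime, timedelta

from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference


def add_interval_charts(ws, intervals, first_anchor_row=2, spacing=15):
    """Add one line chart per (start_row, end_row) pair (hypothetical helper)."""
    charts = []
    for n, (start, end) in enumerate(intervals):
        chart = LineChart()
        chart.title = f"Battery Voltage, rows {start}-{end}"
        chart.y_axis.title = "Battery Voltage"
        chart.x_axis.title = "Time"
        values = Reference(ws, min_col=2, min_row=start, max_row=end)
        times = Reference(ws, min_col=1, min_row=start, max_row=end)
        chart.add_data(values, titles_from_data=False)
        chart.set_categories(times)
        # Place each chart `spacing` rows below the previous one.
        ws.add_chart(chart, f"D{first_anchor_row + n * spacing}")
        charts.append(chart)
    return charts


# Build a small in-memory sheet shaped like the log in the question.
wb = Workbook()
ws = wb.active
ws.append(["Time", "Battery Voltage"])
t0 = datetime(2019, 6, 5)
for k in range(150):
    ws.append([t0 + timedelta(minutes=k), 40 + k % 10])

graph_intervals = [0, 50, 51, 100, 101, 150]
starts = graph_intervals[::2]
ends = graph_intervals[1::2]
# Shift to 1-based sheet rows and never start on the header row.
intervals = [(max(s + 1, 2), e + 1) for s, e in zip(starts, ends)]

charts = add_interval_charts(ws, intervals)
print(intervals)
```

Calling wb.save('log.xlsx') afterwards would write the workbook with the three charts stacked down column D.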
