for loop for openpyxl multiple chart creation - python

I'm trying to write a for loop that creates multiple line charts in openpyxl, all at once. Certain indices in an array would be the bookends for the data each chart draws from. Is this possible in openpyxl?
My data in the excel spreadsheet looks like this:
1 Time Battery Voltage
2 2019-06-05 00:00:00 45
3 2019-06-05 00:01:50 49
4 2019-06-05 00:02:30 51
5 2019-06-05 00:04:58 34
...
import os
import openpyxl
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference, Series
from openpyxl.chart.axis import DateAxis
from datetime import date, datetime, timedelta, time
os.chdir('C:\\Users\\user\\test')
wb = openpyxl.load_workbook('log.xlsx')
sheet = wb['sheet2']
ws2 = wb['sheet2']
graphIntervals = [0,50,51,100,101,150] # filled with tuples of two integers,
# representing the top-left and bottom right of the rectangular
# selection of cells containing chart data I'm trying to graph
starts = graphIntervals[::2]
ends = graphIntervals[1::2]
for i in graphIntervals:
    c[i] = LineChart()
    c[i].title = "Chart Title"
    c[i].style = 12
    c[i].y_axis.crossAx = 500
    c[i].x_axis = DateAxis(crossAx=100)
    c[i].x_axis.number_format = 'd-HH-MM-SS'
    c[i].x_axis.majorTimeUnit = "days"
    c[i].y_axis.title = "Battery Voltage"
    c[i].x_axis.title = "Time"
    data = Reference(ws2, min_col=2, min_row=starts, max_col=2, max_row=ends)
    c[i].add_data(data, titles_from_data=True)
    dates = Reference(ws2, min_col=1, min_row=starts, max_row=ends)
    c[i].set_categories(dates)
    s[i] = c[i].series[0]
    s[i].graphicalProperties.line.solidFill = "BE4B48"
    s[i].graphicalProperties.line.width = 25000 # width in EMUs.
    s[i].smooth = True # Make the line smooth
    ws2.add_chart(c[i], "C[i+15]") # +15 for spacing
wb.save('log.xlsx')
Ideally I would end up making (however many values are in graphIntervals/2) charts.
I know I need to incorporate zip() in my data variable, otherwise it has no way to move on to the next set of values to create charts from. I think it would be something like zip(starts, ends), but I'm not sure.
Is any of this possible through openpyxl? Although I haven't found any, does anyone have examples I could reference?

Followed advice in the comments. Here's that function called in a for loop:
for i in range(0, len(graphIntervals), 2):
    min_row = graphIntervals[i] + 1
    max_row = graphIntervals[i+1] + 1
    # skip headers on first row
    if min_row == 1:
        min_row = 2
    dates = chart.Reference(ws2, min_col=1, min_row=min_row, max_row=max_row)
    vBat = chart.Reference(ws2, min_col=2, min_row=min_row, max_col=2, max_row=max_row)
    qBat = chart.Reference(ws2, min_col=3, min_row=min_row, max_col=3, max_row=max_row)
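The zip(starts, ends) idea from the question can be sketched on its own, without openpyxl, to check the row arithmetic first. The helper name chart_specs and its header_row, spacing and anchor_col parameters are hypothetical; the +1 row offset and the header skip mirror the follow-up loop.

```python
def chart_specs(intervals, header_row=1, spacing=15, anchor_col="C"):
    """Pair up a flat [start, end, start, end, ...] list into chart specs.

    Returns one (min_row, max_row, anchor_cell) tuple per chart.
    """
    starts = intervals[::2]
    ends = intervals[1::2]
    specs = []
    for n, (start, end) in enumerate(zip(starts, ends)):
        # Shift 0-based indices to 1-based sheet rows and skip the header row.
        min_row = max(start + 1, header_row + 1)
        max_row = end + 1
        # Stagger the anchors so the charts do not sit on top of each other.
        anchor = f"{anchor_col}{1 + n * spacing}"
        specs.append((min_row, max_row, anchor))
    return specs


graph_intervals = [0, 50, 51, 100, 101, 150]
specs = chart_specs(graph_intervals)
for min_row, max_row, anchor in specs:
    print(min_row, max_row, anchor)
```

Inside such a loop, each tuple would feed one fresh LineChart(): min_row and max_row go into the Reference(...) calls, and anchor goes to ws2.add_chart(chart, anchor) in place of the literal string "C[i+15]".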

Related

How to style columns in pandas without overlapping and deleting previous work

I am doing some styling to pandas columns where I want to highlight green or red values + or - 2*std of the corresponding column, but when I loop over to go to the next column, previous work is essentially deleted and only the last column shows any changes.
Function:
def color_outliers(value):
    if value <= (mean - (2*std)):
        # print(mean)
        # print(std)
        color = 'red'
    elif value >= (mean + (2*std)):
        # print(mean)
        # print(std)
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color
Code:
comp_holder = []
titles = []
i = 0
for value in names:
    titles.append(names[i])
    i+=1
#Number of Articles and Days of search
num_days = len(page_list[0]['items']) - 2
num_arts = len(titles)
arts = 0
days = 0
# print(num_days)
# print(num_arts)
#Sets index of dataframe to be timestamps of articles
for days in range(num_days):
    comp_dict = {}
    comp_dict = {'timestamp(YYYYMMDD)' : int(int(page_list[0]['items'][days]['timestamp'])/100)}
    #Adds each article from current day in loop to dictionary for row append
    for arts in range(num_arts):
        comp_dict[titles[arts]] = page_list[arts]['items'][days]['views']
    comp_holder.append(comp_dict)
comp_df = pd.DataFrame(comp_holder)
arts = 0
days = 0
outliers = comp_df
for arts in range(num_arts):
    mean = comp_df[titles[arts]].mean()
    std = comp_df[titles[arts]].std()
    outliers = comp_df.style.applymap(color_outliers, subset = [titles[arts]])
Each time I go through this for loop, the 'outliers' styling data frame resets itself and only works on the current subset, but if I remove the subset, it uses one mean and std for the entire data frame. I have tried style.apply using axis=0 but I can't get it to work.
My data frame consists of 21 columns, the first being the timestamp and the next twenty being columns of ints based upon input files. I also have two lists indexed from 0 to 19 of means and std of each column.
I would apply the function to the whole column instead of using applymap. I'm not sure I can follow your code since I don't know what your data look like, but this is what I would do:
import numpy as np
import pandas as pd
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,100, [10,3]))
# compute the statistics
stats = df.agg(['mean','std'])
# format function on columns
def color_outlier(col, thresh=2):
    # extract mean and std of the column
    mean, std = stats[col.name]
    return np.select([col<=mean-std*thresh, col>=mean+std*thresh],
                     ['color: red', 'color: green'],
                     'color: black')
# thresh changes for demonstration, remove when used
df.style.apply(color_outlier, thresh=0.5)
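The key point in this answer is that each column is judged against its own mean and standard deviation. A plain-Python sketch of that rule, using the statistics module in place of pandas; the function name outlier_colors and the sample numbers are made up for illustration.

```python
from statistics import mean, stdev


def outlier_colors(values, thresh=2):
    """Return one CSS color string per value, judged against this column only."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, like pandas' default std()
    colors = []
    for v in values:
        if v <= m - thresh * s:
            colors.append("color: red")
        elif v >= m + thresh * s:
            colors.append("color: green")
        else:
            colors.append("color: black")
    return colors


# Two hypothetical columns with very different scales.
columns = {
    "small": [1, 2, 3, 2, 1, 2, 3, 2, 1, 20],
    "large": [100, 110, 105, 95, 100, 110, 105, 95, 100, 5],
}
styled = {name: outlier_colors(vals) for name, vals in columns.items()}
print(styled["small"][-1], "|", styled["large"][-1])
```

With pandas, Styler.apply with its default axis=0 hands the function one column at a time, so the same per-column rule applies without an explicit loop over column names.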

formatting colour for cell and row in excel using pandas

My requirement is to highlight the cells that differ between the two sheets in yellow, and to highlight newly added rows in red, in the Excel output.
load1 dataset
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-04-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-05-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-05-20
7 22-04-20 29-04-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
load2 table
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-09-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-06-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-06-20
7 22-04-20 29-08-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
9 15-05-20 16-05-20 17-05-20 18-05-20
I am looking for Excel output with the differences highlighted in colour as described.
I am new to pandas. I was able to compute the differences at the DataFrame level, but not to format each cell and row with colours in the exported Excel file. Please help.
Below is the script I tried to get the differences.
import pandas as pd
import os
import numpy as np
colour1= pd.read_excel('pandas.xlsx',sheet_name='load1')
colour2= pd.read_excel('pandas.xlsx',sheet_name='load2')
colour_merge=colour1.merge(colour2,left_on='StudentID', right_on='StudentID',how='outer')
colour_merge['Visit1dif']= np.where(colour_merge['Visit1_x']==colour_merge['Visit1_y'],0,1)
colour_merge['Visit2dif']= np.where(colour_merge['Visit 2_x']==colour_merge['Visit 2_y'],0,1)
colour_merge['Visit3dif']= np.where(colour_merge['Visit 3_x']==colour_merge['Visit 3_y'],0,1)
colour_merge['Visit4dif']= np.where(colour_merge['Visit 4_x']==colour_merge['Visit 4_y'],0,1)
colour_merge[['StudentID','Visit1_x','Visit1_y','Visit1dif','Visit 2_x','Visit 2_y','Visit2dif','Visit 3_x','Visit 3_y','Visit3dif','Visit 4_x','Visit 4_y','Visit4dif']]
I think you have two sheets, load1 and load2, and want to add a third one that looks like the picture.
To add a new sheet, pandas.ExcelWriter should be opened in append mode, and the openpyxl library is needed along with pandas.DataFrame.to_excel and the conditional_formatting functions:
import pandas as pd
import openpyxl
import xlrd
from openpyxl.styles import Alignment, Font, NamedStyle, PatternFill, colors
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule, ColorScaleRule, CellIsRule, FormulaRule
file = r'C:\\app\\excel\\pandas.xlsx'
xls = xlrd.open_workbook(file)
sht = xls.sheet_names()
colour1= pd.read_excel(file,sheet_name=sht[0])
colour2= pd.read_excel(file,sheet_name=sht[1])
colour_merge = colour1.merge(colour2,left_on='StudentID',right_on='StudentID',how='outer')
colour_merge = colour_merge[['StudentID','Visit 1_x','Visit 1_y','Visit 2_x','Visit 2_y','Visit 3_x','Visit 3_y','Visit 4_x','Visit 4_y']]
l_sid = []
mxc1 = max(colour1['StudentID'])+1
mxc2 = max(colour2['StudentID'])+1
for j in range(min(mxc1,mxc2),max(mxc1,mxc2)):
    l_sid.append(j+1)
writer_args = { 'path': file, 'mode': 'a', 'engine': 'openpyxl'}
with pd.ExcelWriter(**writer_args) as xlsx:
    colour_merge.to_excel(xlsx, 'load3', index=False)
    ws = xlsx.sheets['load3']
    mxc= ws.max_column
    title_row = '1'
    yellow = PatternFill(bgColor=colors.YELLOW)
    red = PatternFill(bgColor=colors.RED)
    i=1
    while i <= mxc*2:
        l_col = []
        l_col.append(chr(i+65))
        l_col.append(chr(i+65+1))
        for j in range(2,mxc+1):
            for k in l_col:
                if j not in l_sid:
                    ws.conditional_formatting.add(k+str(j), FormulaRule(formula=[l_col[0]+'$'+str(j)+'<>'+l_col[1]+'$'+str(j)], stopIfTrue=True, fill=yellow))
        i+=2
    r = Rule(type="expression", dxf=DifferentialStyle(fill=red), stopIfTrue=False)
    r.formula = ['1=1']
    while l_sid:
        el=l_sid.pop(-1)
        ws.conditional_formatting.add('A'+str(el)+':'+chr(64+mxc)+str(el), r)
    xlsx.save()
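The rule that the conditional formatting encodes can be sketched without Excel: a cell counts as changed when the same StudentID has a different value in load2, and a row counts as new when its StudentID is missing from load1. The dictionaries below hold a hand-copied subset of the two tables from the question, and diff_sheets is a hypothetical helper name.

```python
def diff_sheets(old, new):
    """Compare two {student_id: {column: value}} mappings.

    Returns (changed_cells, new_rows): changed_cells is a sorted list of
    (student_id, column) pairs, new_rows a sorted list of student ids.
    """
    changed_cells = []
    new_rows = []
    for sid, row in new.items():
        if sid not in old:
            new_rows.append(sid)  # the whole row would be filled red
            continue
        for col, value in row.items():
            if old[sid].get(col) != value:
                changed_cells.append((sid, col))  # the cell would be filled yellow
    return sorted(changed_cells), sorted(new_rows)


load1 = {
    3: {"Visit1": "18-04-20", "Visit 2": "25-04-20"},
    7: {"Visit1": "22-04-20", "Visit 2": "29-04-20"},
    8: {"Visit1": "23-04-20", "Visit 2": "30-04-20"},
}
load2 = {
    3: {"Visit1": "18-04-20", "Visit 2": "25-09-20"},
    7: {"Visit1": "22-04-20", "Visit 2": "29-08-20"},
    8: {"Visit1": "23-04-20", "Visit 2": "30-04-20"},
    9: {"Visit1": "15-05-20", "Visit 2": "16-05-20"},
}
changed, added = diff_sheets(load1, load2)
print(changed)
print(added)
```

Keeping the comparison separate from the formatting makes it possible to check which cells and rows will be coloured before any workbook is written.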

How to detect "strikethrough" style from xlsx file in R

I have to check for data with "strikethrough" formatting when importing an Excel file in R.
Is there any method to detect it?
Both R and Python approaches are welcome.
R-solution
The tidyxl package can help.
Example: temp.xlsx, with data in A1:A4 of the first sheet:
library(tidyxl)
formats <- xlsx_formats( "temp.xlsx" )
cells <- xlsx_cells( "temp.xlsx" )
strike <- which( formats$local$font$strike )
cells[ cells$local_format_id %in% strike, 2 ]
# A tibble: 2 x 1
# address
# <chr>
# 1 A2
# 2 A4
I present below a small sample program that filters out text with strikethrough applied, using the openpyxl package (I tested it on version 2.5.6 with Python 3.7.0). Sorry it took so long to get back to you.
import openpyxl as opx
from openpyxl.styles import Font
def ignore_strikethrough(cell):
    if cell.font.strike:
        return False
    else:
        return True
wb = opx.load_workbook('test.xlsx')
ws = wb.active
colA = ws['A']
fColA = filter(ignore_strikethrough, colA)
for i in fColA:
    print("Cell {0}{1} has value {2}".format(i.column, i.row, i.value))
    print(i.col_idx)
I tested it on a new workbook with the default worksheets, with the letters a,b,c,d,e in the first five rows of column A, where I had applied strikethrough formatting to b and d. This program filters out the cells in columnA which have had strikethrough applied to the font, and then prints the cell, row and values of the remaining ones. The col_idx property returns the 1-based numeric column value.
I found a method below:
from openpyxl import load_workbook
# Assuming rows 1-10 of column A hold the value "A", and the 5th "A" has strikethrough applied
TEST_wb = load_workbook(filename = 'TEST.xlsx')
TEST_wb_s = TEST_wb.active
for i in range(1, TEST_wb_s.max_row+1):
    ck_range_A = TEST_wb_s['A'+str(i)]
    if ck_range_A.font.strikethrough == True:
        print('YES')
    else:
        print('NO')
But it doesn't report the location (in this case, the row numbers), which makes it hard to know where the strikethrough is when there are many results. How can I vectorize the result of this statement?
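To report where the struck-through cells are, one option is to collect cell coordinates rather than printing YES or NO. A sketch, assuming openpyxl cells (which expose .font.strike and .coordinate); it is exercised here with SimpleNamespace stand-ins so that it runs without a workbook, and the helper names struck_cells and fake_cell are hypothetical.

```python
from types import SimpleNamespace


def struck_cells(cells):
    """Return the coordinates of cells whose font has strikethrough set."""
    # font.strike may be None when unset, so rely on truthiness.
    return [cell.coordinate for cell in cells if cell.font.strike]


def fake_cell(coordinate, strike):
    """Stand-in with the two attributes the helper reads from a real cell."""
    return SimpleNamespace(coordinate=coordinate,
                           font=SimpleNamespace(strike=strike))


# Column A with strikethrough applied to A2 and A4.
column_a = [
    fake_cell("A1", None),
    fake_cell("A2", True),
    fake_cell("A3", False),
    fake_cell("A4", True),
]
locations = struck_cells(column_a)
print(locations)
```

With a real workbook, struck_cells(ws['A']) would take the place of the column_a stand-ins.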

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under it. To make things more complicated, the day of the week, date, and billing day is shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal is to create a simple Python script that turns the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is my script returns NaN data for the KW, KVAR, and KVA data after the first five days (which is correlated with a new instance of a for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    #starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50,0]
    val_start = 3
    val_end = 51
    date_val = [0,2]
    day_type = [1,2]
    # There are 7 row movements that need to take place.
    for row_move in range(1,8):
        day = [1,2,3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1,6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            #These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type [0]= day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output
test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
Could be pd.append requiring matched row indices for numerical values.
import pandas as pd
import numpy as np
output = pd.DataFrame(np.random.rand(10,2), columns=['a','b']) # fake data
output['c'] = list('abcdefghij') # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a','b','c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2] # generates NaN
tmp['c'] = output.iloc[0:2, 2]
pd.concat([output, tmp]) # DataFrame.append was removed in pandas 2.0
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read the CSV files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object that may make its way into temp_df. iloc slices do not report out-of-range row indices, though an out-of-range column index triggers an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the CSV files rather than dealing with indexing in a pandas DataFrame, as the original format is rather complex. Slice the data by date and later use pd.melt or DataFrame.groupby to shape it into the format you like. Alternatively, try a multi-index if you stick with pandas I/O.
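The index arithmetic in the question can be separated from the DataFrame work, which makes it easier to check. A sketch of a generator that yields the row range and the three value columns for each day block, using the offsets from the question (48 readings starting at row 3, blocks 55 rows apart, 3 columns per day); the name day_blocks and its parameter names are hypothetical.

```python
def day_blocks(n_row_moves=7, n_col_moves=5, first_row=3, n_readings=48,
               row_stride=55, first_col=1, col_stride=3):
    """Yield (row_slice, kw_col, kvar_col, kva_col) for every day block."""
    for r in range(n_row_moves):
        start = first_row + r * row_stride
        rows = slice(start, start + n_readings)
        for c in range(n_col_moves):
            kw = first_col + c * col_stride
            yield rows, kw, kw + 1, kw + 2


blocks = list(day_blocks())
print(len(blocks))
print(blocks[0])
print(blocks[5])
```

Each tuple can then be passed to df.iloc[rows, kw]. Appending .to_numpy() to that expression hands pandas positional values, so the assignment into temp_df is not re-aligned against the row labels set by the TIME column; that alignment appears to be what produces the NaN values after the first five days.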

openpyxl - change width of n columns

I am trying to change the column width for n number of columns.
I am able to do this for rows as per the below code.
rowheight = 2
while rowheight < 601:
    ws.row_dimensions[rowheight].height = 4
    rowheight += 1
The problem I have is that columns are in letters and not numbers.
As pointed out by ryachza, the answer was to use an openpyxl utility. However, the utility to use is get_column_letter and not column_index_from_string, as I want to convert a number to a letter and not vice versa.
Here is the working code
from openpyxl.utils import get_column_letter
# Start changing width from column C onwards
column = 3
while column < 601:
    i = get_column_letter(column)
    ws.column_dimensions[i].width = 4
    column += 1
To get the column index, you should be able to use:
i = openpyxl.utils.column_index_from_string(?)
And then:
ws.column_dimensions[i].width = ?
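For reference, the number-to-letter conversion that get_column_letter performs is a bijective base-26 encoding. The function below is an illustrative reimplementation, not openpyxl's own code, and the name column_letter is made up.

```python
def column_letter(index):
    """Convert a 1-based column index to its spreadsheet letters."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = []
    while index > 0:
        # Shift to 0-based before dividing, since there is no zero digit.
        index, remainder = divmod(index - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


print([column_letter(n) for n in (1, 3, 26, 27, 600)])
```

In real code, openpyxl.utils.get_column_letter remains the better choice; the sketch only shows why column 27 becomes AA rather than BA.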
