I have an .xlsx file (the calculations file) with a number of sheets in which the values for a few parameters are calculated using macros. These calculations are based on the values of a particular column in SHEET1; those column values are copied in repeatedly from a set of different files (the input files), and shortly afterwards the parameter values are copied over to another file. I am trying to automate this process with pandas. I was able to copy the column data from the input files to the calculations file, but the macros don't seem to be doing anything: I cannot see any changed values even after waiting for a minute, while these macros usually take only a few seconds. On the other hand, if I print the column from the calculations DataFrame, it shows the values copied from the input. I am not sure what I am missing here.
Here is the code I am using for the process:
import os
import time

import pandas as pd

input_file = os.path.join('./data', configs.input_file)
calculations_file = os.path.join('./resources', configs.calculations_file)

input_df = pd.read_csv(input_file)
calcs_df = pd.read_excel(calculations_file)

# The first column is what the macros use to calculate the parameter values,
# so copy the input data into this column.
calcs_df[calcs_df.columns[0]] = input_df[input_df.columns[0]]

# Wait for some time so that the macros can run and derive the parameter values.
time.sleep(90)

# Check the parameter columns for their updated values; columns 10, 11 and 12
# hold the parameters.
print(calcs_df[calcs_df.columns[0]][1], calcs_df[calcs_df.columns[10]][0],
      calcs_df[calcs_df.columns[11]][0], calcs_df[calcs_df.columns[12]][0])
I would appreciate it if anyone could help me understand whether this sort of process is even possible.
xlwings does this well for me. pandas only reads the file into memory and never opens Excel, so the macros never get a chance to run; xlwings drives a live Excel instance, where they do. Instead of using pandas I used xlwings as follows, and it worked:
import time

import xlwings as xw

nwbook = xw.Book(n_file)
nwsheet = nwbook.sheets[0]

rwbook = xw.Book(r_file)
isheet = rwbook.sheets['Test']

# Read the input column as a 2-D, transposed block.
ndata = nwsheet.range('A:A').options(ndim=2, expand='table',
                                     transpose=True).value

# Calculations file edits.
cwbook = xw.Book(c_file)
csheet = cwbook.sheets['Data']

for no, data in enumerate(ndata):
    # This is where the calculations are done: write the data back
    # in column orientation by setting transpose to True.
    csheet.range('A:A').options(ndim=2,
                                expand='table', transpose=True).value = data
    time.sleep(1)
    # By now the macros in the calculations sheet have run and the results
    # are ready: read the values and pump them into the report.
    isheet.range('A' + str(no)).value = data[0]
    isheet.range('C' + str(no)).value = csheet.range('E2').value
    isheet.range('D' + str(no)).value = csheet.range('F2').value
    isheet.range('E' + str(no)).value = csheet.range('G2').value
I have a problem with an Excel file that I want to automate with a Python script: it should complete the second column based on the information in the first column. The first column mixes dates and codec entries; for example, while the rows contain 'G711Alaw 64k' or 'G711Ulaw 64k' I want to write '1-Jan', until '2-Jan' is found in the first column, then write '2-Jan', and so on.

[screenshot: the sheet before automation]

I need it to look like this after automation:

[screenshot: the sheet after automation]

Can anyone help me solve this issue?

The file: [the excel file]

Thanks a lot for your help.
Try this. pandas reads your Jan-1 as a datetime type; if you need to change it to a string you can do that directly in the code. The following code assigns the value read to the second column:
import pandas as pd

df = pd.read_excel("add_date_column.xlsx", engine="openpyxl")

sig = []

def t(x):
    global sig
    # A non-string first cell means this row carries a date.
    if not isinstance(x.values[0], str):
        tmp_sig = x.values[0]
        if tmp_sig not in sig:
            sig = [tmp_sig]  # remember the latest date seen
    # Stamp the second column with the most recent date.
    x.values[1] = sig[-1]
    return x

new_df = df.apply(t, axis=1)
new_df.to_excel("new.xlsx", index=False)
The concept is very simple:
If the value is a date/time, copy it to the [same row, next column].
If not, [same row, next column] is copied from [previous row, next column].
You do not specifically need Python for this task. The Excel formula for this would be:
=IF(ISNUMBER(A:A),A:A,B1)
Instead of checking whether it is a date/time, I took advantage of the fact that the rest of the entries are alphanumeric (containing both letters and numbers). This formula is applied in the new column.
Of course, you might already be in Python and just want to work within the same environment. So, here's the loop:
from datetime import datetime

for i in range(len(df)):
    if type(df["Orig. Codec"][i]) is datetime:
        df["Column1"][i] = df["Orig. Codec"][i]
    else:
        df["Column1"][i] = df["Column1"][i - 1]
There might be ways to use a lambda for the same concept, but I am not aware of how to apply a lambda and a shift at the same time.
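That said, the same fill-down can be done without an explicit loop. Here is a minimal vectorized sketch using pd.to_datetime with errors='coerce' plus ffill, assuming the "Orig. Codec" / "Column1" names from the loop above:
import pandas as pd

df = pd.read_excel("add_date_column.xlsx")

# Coerce non-date entries (the codec names) to NaT, then forward-fill so
# every row inherits the most recent date above it.
dates = pd.to_datetime(df["Orig. Codec"], errors="coerce")
df["Column1"] = dates.ffill()

df.to_excel("new.xlsx", index=False)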
I am learning Python in order to automate some tasks at my work. I need to read many big TSV files and filter each one of them on a list of keys that may differ from one TSV to another, then return the result in an Excel report using a certain template. To do that, I first load all the data common to the files: the list of TSV files to process, some QC values, and the filter to be used for each TSV. Then, in a loop, I read one TSV per round into a pandas DataFrame, filter it, and fill the Excel report.
The problem is that the filter does not seem to work: I am getting Excel files of the same length (number of rows) every time. I suspect that the DataFrame does not get reinitialized on each round, but trying to del the DataFrame and release the memory does not work either. How can I solve this issue? Is there a better data structure to handle my problem? Below is part of the code.
Thanks
import pandas as pd
from openpyxl import load_workbook

# This is how I load the filter for each file on each round of the loop (as a list).
target = load_workbook(target_input)
target_sheet = target.active
targetProbes = []
for k in range(2, target_sheet.max_row + 1):
    targetProbes.append(target_sheet.cell(k, 1).value)

# This is how I load each TSV file on each round of the loop (a DataFrame).
tsv_reader = pd.read_csv(calls, delimiter='\t', comment='#', low_memory=False)

# This is how I filter the DataFrame based on the filter list.
tsv_reader = tsv_reader[tsv_reader['probeset_id'].isin(targetProbes)]
tsv_reader.reset_index(drop=True, inplace=True)

# This is how I try to del the DataFrame and clear the current filter, so I can
# re-use them for another TSV file with different values on the next round.
del tsv_reader
targetProbes.clear()
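Putting the pieces together, the loop is meant to rebuild both the filter and the DataFrame on every round, roughly like this minimal sketch (the file names here are placeholders; the real lists are loaded up front):
import pandas as pd
from openpyxl import load_workbook

# Placeholder pairs of (TSV file, filter workbook).
jobs = [('run1.tsv', 'run1_targets.xlsx'), ('run2.tsv', 'run2_targets.xlsx')]

for calls, target_input in jobs:
    # Rebuild the filter from scratch on every round.
    target_sheet = load_workbook(target_input).active
    targetProbes = [target_sheet.cell(k, 1).value
                    for k in range(2, target_sheet.max_row + 1)]

    # read_csv returns a brand-new DataFrame on every call, so nothing
    # carries over from the previous round.
    tsv_reader = pd.read_csv(calls, delimiter='\t', comment='#', low_memory=False)
    tsv_reader = tsv_reader[tsv_reader['probeset_id'].isin(targetProbes)]
    tsv_reader.reset_index(drop=True, inplace=True)
    print(calls, len(tsv_reader))  # sanity check: the row count should differ per file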
I am currently learning Python in order to create Excel files automatically.
However, I ran into some trouble with the conditional formatting part of the openpyxl library.
Here is the piece of code that does not work:
import openpyxl as xl

wb = xl.Workbook()
ws = wb.active

red_color = "ffc7ce"
red_color_font = "9c0103"
red_font = xl.styles.fonts.Font(size=14, bold=True, color=red_color_font)
red_fill = xl.styles.fills.PatternFill(start_color=red_color, end_color=red_color, fill_type="solid")

# ------
for i in range(3, 131):
    ws["B" + str(i)] = "=SUM(C{}:D{})".format(i, i)
    ws["C" + str(i)] = 0
    ws["D" + str(i)] = 0
    ws["E" + str(i)] = "=SUM(F{}:G{})".format(i, i)
    ws["F" + str(i)] = 0
    ws["G" + str(i)] = 0

    # Conditional formatting: red cells if the second application was forgotten.
    if i == 4:
        threshold = [str(int(ws["B" + str(i - 1)].value))]
        ws.conditional_formatting.add("B4",
            xl.formatting.rule.CellIsRule(operator="lessThan", formula=threshold,
                                          fill=red_fill, font=red_font))
    if i >= 5:
        threshold = [str(int(ws["B" + str(i - 1)].value) - int(ws["B" + str(i - 2)].value))]
        ws.conditional_formatting.add("B{}".format(i),
            xl.formatting.rule.CellIsRule(operator="lessThan", formula=threshold,
                                          fill=red_fill, font=red_font))
As you can see, the problem comes from the threshold variable.
When I run the code, I get the following error message:
ValueError: invalid literal for int() with base 10: '=SUM(C3:D3)'
Does anyone know how it is possible to extract the value from the formula used in column B?
Thanks in advance!
It depends on whether the Excel file has been opened. In your case, it looks like the file is generated by you from Python, so there is no evaluated value stored in it.
What you can do:
A. Since you generate the cells from Python, you know what values are assigned to them; you don't need to read those values back from Excel. You can create a DataFrame to mirror the structure, and if you want to see the value of C3, call df['C'].iloc[3].
B. Before trying to access the evaluated value of a cell, you need to open the file in Excel and save it. This way, Microsoft Excel stores both the formula and the evaluated value of each cell. Then you can read the file back in your Python code; more specifically, you can read both the formula and the value of a cell:
from openpyxl import load_workbook

wb_values = load_workbook(path_xlsx + wb_name, data_only=True)
wb_formulas = load_workbook(path_xlsx + wb_name)
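For example, a small sketch of reading both views of the same cell (B3 is just an arbitrary cell for illustration):
ws_formulas = wb_formulas.active
ws_values = wb_values.active

print(ws_formulas["B3"].value)  # the formula string, e.g. '=SUM(C3:D3)'
print(ws_values["B3"].value)    # the cached result, or None if Excel never saved the file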
I am trying to copy all the columns from a consolidated file to a summary file and run an Excel macro from Python. The summary file has columns from A to BB, and I want to copy only up to AI. I tried the code below, but it is not giving me any result:
from win32com.client import Dispatch

wbpath = 'C:\\Users\\Summary.xlsb'
excel = Dispatch("Excel.Application")
workbook = excel.Workbooks.Open(wbpath)

strcode = \
'''
Sub MacroCopy()
'
' MacroCopy Macro
'
'
Dim sourceColumn As Range, targetColumn As Range
Set sColumn = Workbooks("C:\\Users\\Consolidated.xlsx").Worksheets(1).Columns("A:AI")
Set tColumn = Workbooks("C:\\Users\\Summary.xlsb").Worksheets(2).Columns("A2")
sColumn.Copy Destination:=tColumn
End Sub
'''

excelModule = workbook.VBProject.VBComponents.Add(1)
excelModule.CodeModule.AddFromString(strcode.strip())

excel.Workbooks(1).Close(SaveChanges=1)
excel.Application.Quit()
When I ran the macro in the Excel sheet, it gave me a "subscript out of range" error. Please let me know where I am going wrong.
There are a number of errors in your VBA code. First of all
Set tColumn = Workbooks("C:\\Users\\Summary.xlsb").Worksheets(2).Columns("A2")
is not valid because A2 is a cell reference, not a column reference. But you can't copy the whole of a column from one sheet into row 2 of another sheet, because there aren't enough rows on the destination sheet. You can do one of the following:
Set sColumn = Workbooks("C:\\Users\\Consolidated.xlsx").Worksheets(1).Columns("A:AI")
Set tColumn = Workbooks("C:\\Users\\Summary.xlsb").Worksheets(2).Range("A1")
sColumn.Copy Destination:=tColumn
which will paste the whole of columns A:AI of the source sheet into the corresponding columns of the destination sheet, or:
Set sColumn = Workbooks("C:\\Users\\Consolidated.xlsx").Worksheets(1).Range("A1:AI1048575")
Set tColumn = Workbooks("C:\\Users\\Summary.xlsb").Worksheets(2).Range("A2")
sColumn.Copy Destination:=tColumn
which will copy the largest range from the source sheet that will actually fit when pasted to row 2 of the destination - since an Excel 2007 or later worksheet has a maximum of 1048576 rows.
Finally, your Dim statement declares variables called sourceColumn and targetColumn, but your code uses different variables that you haven't declared. If you aren't using Option Explicit (which is usually recommended, but can arguably be skipped for a small throwaway macro like this), you don't need to Dim the variables at all.
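On a side note, the Python script above injects the macro but never actually executes it before closing the workbook. A minimal sketch of running the injected macro through the COM interface looks like this (assuming macros and programmatic access to the VBA project are enabled in Excel's Trust Center):
from win32com.client import Dispatch

excel = Dispatch("Excel.Application")
workbook = excel.Workbooks.Open('C:\\Users\\Summary.xlsb')
# ... add the module via VBProject.VBComponents as above, then:
excel.Application.Run("MacroCopy")  # execute the injected macro
workbook.Close(SaveChanges=1)
excel.Application.Quit()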
I am writing a program that:
Reads the content of an Excel sheet row by row (90,000 rows in total)
Compares the content with another Excel sheet row by row (600,000 rows in total)
If a match occurs, writes the matching entry into a new Excel sheet
I have written the script and everything works, but the computational time is HUGE: in an hour it has processed just 200 rows from the first sheet, resulting in 200 different files being written.
I was wondering if there is a way to save the matches differently, as I am going to use them later on. Is there any way to save them in a matrix or something?
import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime

# Choose the incident excel sheet.
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# Choose the trap excel sheet.
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# Choose the features sheet.
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")

# Select the working sheets, either by name or by index.
Traps = book_2.sheet_by_name('Sheet1')
Incidents = book_1.sheet_by_name('Sheet1')
Features_Numbers = book_3.sheet_by_name('Sheet1')

# Total number of rows in the traps and incidents sheets.
Total_Number_of_Rows_Traps = Traps.nrows
Total_Number_of_Rows_Incidents = Incidents.nrows
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)

# Open a file to write down the non-matching incident numbers.
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')

# Iterate over all the rows of the incidents sheet.
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
    # Store the content of the comparable cells of the incident sheet.
    Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
    Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
    # Convert the Excel date into a Python datetime (year, month, day).
    Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
    Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
    Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
    # Extract the incident number and create a workbook for this incident.
    Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
    Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
    Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
    # Insert the first row, which contains the features.
    Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
    Insert_Row_to_Incident_Sheet = 0

# Iterate over all the rows of the traps sheet.
for Rows_Traps in range(Total_Number_of_Rows_Traps):
    # Store the content of the comparable cells of the traps sheet.
    Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
    Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
    Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
    Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')
    # If the content matches partially or fully...
    if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
            str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
            len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
            str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
            len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
            Traps_Content_Date <= Incidents_Content_Date:
        # Advance the write counter and write the incident and trap information.
        Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
        Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
        Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))

Incident_Name_Book.close()
Thanks
What you're doing is seeking/reading a little bit of data for each cell. This is very inefficient.
Try reading all the information in one go into a Python data structure that is as basic as sensible (lists, dicts, etc.), make your comparisons/operations on that data set in memory, and write all the results in one go. If not all the data fits into memory, partition it into sub-tasks.
Having to read the data set 10 times to extract a tenth of the data each time will likely still be hugely faster than reading each cell independently.
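With xlrd, which the code above already uses, that bulk read is a one-liner per sheet; a sketch (path shortened for the example):
import xlrd

book = xlrd.open_workbook('Traps.xlsx')
sheet = book.sheet_by_name('Sheet1')

# Pull the whole sheet into a plain list of lists in one pass,
# then do all comparisons on this in-memory structure.
rows = [sheet.row_values(i) for i in range(sheet.nrows)]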
I don't see how your code can work: the second loop works on variables which change for every row of the first loop, but the second loop isn't inside the first one.
That said, comparing files in this way has a complexity of O(N*M), which means that the runtime explodes quickly. In your case, you are trying to execute 54,000,000,000 (54 billion) loop iterations.
If you run into these kind of problems, the solution is always a three step process:
Transform the data to make it easier to process
Put the data into efficient structures (sorted lists, dict)
Search the data with the efficient structures
You have to find a way to get rid of the find(). Try to strip all the junk from the cells that you want to compare so that you can use plain equality. Once you have that, you can put the rows into a dict to find matches, or you could load the data into a SQL database and use SQL queries (don't forget to add indexes!).
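As an illustration, a sketch of the dict approach, using the column indices from the question and assuming the rows have already been bulk-read into lists (trap_rows and incident_rows are hypothetical names):
# Index the traps once: normalized key -> list of matching trap rows.
traps_by_key = {}
for row in trap_rows:
    key = (str(row[3]).lower(), str(row[6]).lower())  # node name, event type
    traps_by_key.setdefault(key, []).append(row)

# Each incident row then becomes a single O(1) lookup instead of a 600,000-row scan.
for row in incident_rows:
    key = (str(row[47]).lower(), str(row[29]).lower())  # affected resources, product type
    for match in traps_by_key.get(key, []):
        pass  # process the match here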
One last trick is to use sorted lists. If you can sort the data in the same way, then you can use two lists:
Sort the data from the two sheets into two lists
Use two row counters (one per list)
If the current item from the first list is less than the current one from the second list, then there is no match and you have to advance the first row counter.
If the current item from the first list is bigger than the current one from the second list, then there is no match and you have to advance the second row counter.
If the items are the same, you have a match. Process the match and advance both counters.
This allows you to process all the items in one go.
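In code, that merge step looks roughly like this (a sketch over two lists that are already sorted on the same key, with a hypothetical process_match handler):
i, j = 0, 0
while i < len(list_a) and j < len(list_b):
    if list_a[i][0] < list_b[j][0]:
        i += 1                               # no match: advance the first counter
    elif list_a[i][0] > list_b[j][0]:
        j += 1                               # no match: advance the second counter
    else:
        process_match(list_a[i], list_b[j])  # a match: process it
        i += 1
        j += 1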
I would suggest that you use pandas. This module provides a huge number of functions to compare datasets, and it also has very fast import/export algorithms for Excel files.
IMHO you should use the merge function with the arguments how='inner' and on=[list of your columns to compare]. That will create a new dataset with only the rows that occur in both tables (having the same values in the defined columns). You can then export this new dataset to your Excel file.
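A short sketch of what that looks like (the column names in on= are placeholders for whatever key columns the two sheets share):
import pandas as pd

incidents = pd.read_excel('Incidents.xlsx')
traps = pd.read_excel('Traps.xlsx')

# Inner join: keep only the rows whose key columns appear in both tables.
matches = pd.merge(incidents, traps, how='inner', on=['node_name', 'event_type'])
matches.to_excel('matches.xlsx', index=False)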