Openpyxl optimizing cells search speed - python

I need to search an Excel sheet for cells containing some pattern, and it takes more time than I can afford. The most optimized code I could write is below. Since the data patterns usually follow each other row after row, I use iter_rows(row_offset=x). Unfortunately, the code below takes longer and longer to find the given pattern on each pass of the for loop (starting at milliseconds and growing to almost a minute). What am I doing wrong?
import openpyxl
import datetime
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws.title = "test_sheet"
print("Generating quite big excel file")
for i in range(1,10000):
    for j in range(1,20):
        ws.cell(row = i, column = j).value = "Cell[{},{}]".format(i,j)
print("Saving test excel file")
wb.save('test.xlsx')
def FindXlCell(search_str, last_r):
    t = datetime.datetime.utcnow()
    for row in ws.iter_rows(row_offset=last_r):
        for cell in row:
            if (search_str == cell.value):
                print(search_str, last_r, cell.row, datetime.datetime.utcnow() - t)
                last_r = cell.row
                return last_r
    print("record not found ",search_str, datetime.datetime.utcnow() - t)
    return 1
wb = openpyxl.load_workbook("test.xlsx", data_only=True)
t = datetime.datetime.utcnow()
ws = wb["test_sheet"]
last_row = 1
print("Parsing excel file in a loop for 3 cells")
for i in range(1,100,1):
    last_row = FindXlCell("Cell[0,0]", last_row)
    last_row = FindXlCell("Cell[1000,6]", last_row)
    last_row = FindXlCell("Cell[6000,6]", last_row)

Looping over a worksheet multiple times is inefficient. The search appears to get progressively slower because each loop uses more memory. The call last_row = FindXlCell("Cell[0,0]", last_row) means that the next search will create new cells at the end of the rows: openpyxl creates cells on demand, because rows can technically be empty while the cells in them are still addressable. By the end of your script the worksheet has a total of 598000 rows, yet you always start searching from A1.
If you need to search a large file for text multiple times, it would probably make sense to build a dictionary (called matrix below) keyed by the text, with the coordinates as the value.
Something like:
matrix = {}
for row in ws:
    for cell in row:
        matrix[cell.value] = (cell.row, cell.col_idx)
In a real-world example you'd probably want to use a defaultdict to be able to handle multiple cells with the same text.
This could be combined with read-only mode for a minimal memory footprint, unless, of course, you want to edit the file.
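The defaultdict idea mentioned above might look like the following. This is a minimal sketch, not the answer's own code: build_index is a hypothetical helper, and it works on plain row tuples of the kind ws.iter_rows(values_only=True) yields, so the sample data stands in for a real worksheet. It assumes iteration starts at cell A1, so the enumerated positions match the sheet coordinates.

```python
from collections import defaultdict

def build_index(rows):
    """Map each cell value to a list of 1-based (row, column) coordinates."""
    index = defaultdict(list)
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            if value is not None:
                index[value].append((r, c))
    return index

# Made-up stand-in for ws.iter_rows(values_only=True)
sample_rows = [
    ("Cell[1,1]", "Cell[1,2]"),
    ("Cell[2,1]", "Cell[1,1]"),  # duplicate text on purpose
]
index = build_index(sample_rows)
print(index["Cell[1,1]"])      # [(1, 1), (2, 2)]
print(index.get("Cell[0,0]"))  # None: the text is not in the sheet
```

The sheet is read once; every later search is a dictionary lookup. Using .get() for lookups avoids adding empty entries to the defaultdict for text that is not present.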

Related

How can I loop through and increment the rows in an excel workbook formula in Python?

This is a continuation of this question: How can I iterate through excel files sheets and insert formula in Python?
I decided to put it in a new thread as it's a separate issue. I'm interested in copying a formula down a column, across the rows, in a number of workbooks. My code is below and the problem is in the for loop.
import os
import openpyxl
in_folder = r'C:\xxx' #Input folder
out_folder = r'C:\yyy' #Output folder
if not os.path.exists(out_folder):
    os.makedirs(out_folder)
dir_list = os.listdir(in_folder)
print(dir_list)
for xlfile in dir_list:
    if xlfile.endswith('.xlsx') or xlfile.endswith('.xls'):
        str_file = xlfile
        work_book = openpyxl.load_workbook(os.path.join(in_folder,str_file))
        work_sheet = work_book['Sheet1']
        for i, cellObj in enumerate(work_sheet['U'], 1): #The cell where the formula is to be inserted and iterated down to the last row
            cellObj.value = '=Q2-T2' #Cells value. This is where I'm going wrong but I'm not sure of the best way to have '=Q3-T3' etc till the last row. For each iteration, Q2 and T2 will be incremented to Q3 and T3 till the last row in the dataset.
        work_book.save(os.path.join(out_folder, xlfile)) #Write the excel sheet with formulae to another folder
How can I increment the rows in the formula as I loop through the active worksheet to the end? More details in the comments next to the code.
Maybe you could just try formatting the string?
...
row_count = 2
for i, cellObj in enumerate(work_sheet['U'], 1):
    cellObj.value = f'=Q{row_count}-T{row_count}'
    row_count += 1  # advance the references for the next row
work_book.save(os.path.join(out_folder, xlfile))  # save once, after the loop
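One way to avoid a separate counter is to derive the formula from the row number itself. The sketch below uses a hypothetical helper, row_formula, and plain integers so that it runs without a workbook; with openpyxl, cellObj.row would supply the row number.

```python
def row_formula(row, template="=Q{row}-T{row}"):
    """Return the formula text for a given 1-based row number."""
    return template.format(row=row)

# Rows 2..5 stand in for the data rows of column U
formulas = {row: row_formula(row) for row in range(2, 6)}
for row, formula in formulas.items():
    print(f"U{row}: {formula}")
```

Keeping the template in one place makes it easy to reuse the same loop for a different formula.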

How to paste values only in Excel using Python and openpyxl

I have an Excel worksheet.
In column J I have some source data which I used to make calculations in column K.
Column K has the values I need, but when I click on a cell the formula shows up.
I only want the values from column K, not the formula.
I read somewhere that I need to set data_only=True, which I have done.
I then pasted data from column K to column L (with the intention of later deleting columns J and K).
I thought that column L would have only the values from K, but if I click on a cell, the formula still shows up.
How do I simply paste values only from one column to another?
import openpyxl
wb = openpyxl.load_workbook('edited4.xlsx', data_only=True)
sheet = wb['Sheet1']
last_row = 100
for i in range(2, last_row):
    cell = "K" + str(i)
    a_cell = "J" + str(i)
    sheet[cell] = '=IF(' + a_cell + '="R","Yes","No")'
rangeselected = []
for i in range (1, 100,1):
    rangeselected.append(sheet.cell(row = i, column = 11).value)
for i in range (1, 1000,1):
    sheet.cell(row=i, column=12).value = rangeselected[i-1]
wb.save('edited4.xlsx')
It's been a while since I've used openpyxl. But:
Openpyxl doesn't run an Excel formula. It reads either the formula string or the result of the last calculation run by Excel*. This means that if a formula is created outside of Excel, and the file has never been opened in Excel, then only the formula will be available. Unless you need to display (for historical purposes, etc.) what the formula is, you should do the calculation in Python, which will be faster and more efficient anyway.
* When I say Excel, I also include any Excel-like spreadsheet that will cache the results of the last run.
Try this (adjust column numbers as desired):
import openpyxl
wb = openpyxl.load_workbook('edited4.xlsx', data_only=True)
sheet = wb['Sheet1']
last_row = 100
data_column = 11
test_column = 12
result_column = 13
for i in range(2, last_row):
    if sheet.cell(row=i, column=test_column).value == "R":
        sheet.cell(row=i, column=result_column).value = "Yes"
    else:
        sheet.cell(row=i, column=result_column).value = "No"
wb.save('edited4.xlsx')
If you have a well-formed data sheet, you could probably shorten this by another step or two by using enumerate() and Worksheet.iter_rows() but I'll leave that to your imagination.
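The enumerate() idea can be sketched roughly as follows. This is an illustration only: yes_no_flags is a hypothetical helper and the list below is made-up data standing in for the values read from the test column (for example, cell.value for each cell in that column).

```python
def yes_no_flags(values, trigger="R", first_row=2):
    """Yield (row_number, "Yes" or "No") for each source value."""
    for row_number, value in enumerate(values, start=first_row):
        yield row_number, "Yes" if value == trigger else "No"

# Made-up source values for rows 2..5
source = ["R", "X", None, "R"]
for row_number, flag in yes_no_flags(source):
    print(row_number, flag)
```

Each (row_number, flag) pair could then be written back with sheet.cell(row=row_number, column=result_column).value = flag.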

Performance evaluating excel cells with python

I am trying to compare 2 columns of an Excel file with a 2D matrix, row by row, in Python. My Excel file contains 20,100 rows and the computing time in PyCharm is more than 1 hour. Is there a more time-efficient way to do these value comparisons?
import openpyxl as xl
from IDM import idm_matrix
# load and create excel file
wb = xl.load_workbook('Auswertung_C33.xlsx')
result_wb = xl.Workbook() #new workbook for results
result_sheet = result_wb.create_sheet('Ergebnisse') #create new sheet in result file
result_wb.remove(result_wb['Sheet'])
sheet = wb['TriCad_Format']
# copy 1st row
first_row = sheet[1:1]
list_first_row =[]
for item in first_row:
    list_first_row.append(item.value)
result_sheet.append(list_first_row)
# Value check
for row in range(2, sheet.max_row + 1):
    row_list = []
    for col in range(1, sheet.max_column + 1):
        cell = sheet.cell(row, col)
        row_list.append(cell.value)
    for matrix in idm_matrix:
        if row_list[7] is None:
            continue
        elif matrix[0] in row_list[7]:
            if row_list[14] is None or matrix[1] != row_list[14]:
                result_sheet.append(row_list)
print("saving file...")
result_wb.save('Auswertung.xlsx') #saves the file in a new wb
print("Done!")
Thanks for your help!
Alex
----- Sample of Data ------
Input:
BEZ | _IDM
Schirmsprinkler-SU5 | EAL
--> If column BEZ contains the string 'Schirmsprinkler' and column _IDM has any value, the row should be copied. If column _IDM is empty, the row is fine and should not be copied. There are many strings in BEZ for which _IDM should be empty, so that's why I am trying to put them all in the df_idm lists. However, it doesn't work with an empty string "".
Update 20th of May 2020:
import openpyxl as xl
from IDM import idm_matrix
import pandas as pd
# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")
# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns=['LongName', 'ShortName'])
# REMOVE ROWS WHICH HAVE NO VALUE IN COLUMN 6
df_excel.dropna(subset=['BEZ'], inplace=True)
# MATCH ON CORRESPONDING COLUMNS
search_list = df_idm['LongName'].unique().tolist()
matches1 = df_excel[(df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) &
                    (~df_excel["_IDM"].isin(df_idm["ShortName"].tolist()))]
matches2 = df_excel[(~df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) & (~pd.isnull(df_excel["_IDM"]))]
# CREATE LIST OF MATCHING DATAFRAMES
sum_of_idm = [matches1, matches2]
# CREATE NEW WORKBOOK
writer = pd.ExcelWriter('Ergebnisse.xlsx')
pd.concat(sum_of_idm).to_excel(writer, sheet_name="Ergebnisse", index=False)
writer.save()
Since you are handling data requiring comparison checks, consider Pandas, the third-party data analytics library of Python for several reasons:
Import and export Excel features that can interface with openpyxl
Ability to interact with many Python data objects (list, dict, tuple, array, etc.)
Vectorized, comparison logic that is more efficient than nested for loops
Use whole, named objects (DataFrame and Series) for bulk, single call, set-based operations
Avoid working with unnamed, numbered rows and cells that impacts readability
Specifically, you can migrate your idm_matrix to a data frame and import the Excel data into another data frame. The columns can then be compared either with a single merge call (for an exact match) or with Series.str.contains (for a partial match), followed by a logical filter.
Note: Without a reproducible example, the code below uses information from the posted code and needs to be tested on actual data. Adjust any Column# from the original Excel worksheet as needed:
import openpyxl as xl
from IDM import idm_matrix
import pandas as pd
# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")
# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns = ['LongName', 'ShortName'])
# EXACT MATCH
# matches = df_excel.merge(df_idm, left_on=['Column6'], right_on=['LongName'])
# PARTIAL MATCH
search_list = df_idm['LongName'].unique().tolist()
matches = df_excel[(~df_excel["Column6"].str.contains("|".join(search_list), regex=True)) &
                   (pd.isnull(df_excel["_IDM"])) &
                   (df_excel["Column6"].isin(df_idm["ShortName"].tolist()))]
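# ADJUST EXISTING WORKBOOK
# NOTE: written for older pandas. Recent releases removed writer.save() and no
# longer allow assigning writer.book; there, pd.ExcelWriter(xl_file,
# engine='openpyxl', mode='a', if_sheet_exists='replace') is the usual way
# to replace a sheet in an existing workbook.
with pd.ExcelWriter(xl_file, engine='openpyxl') as writer:
    writer.book = xl.load_workbook(xl_file)
    try:
        # REMOVE SHEET IF EXISTS
        writer.book.remove(writer.book['Ergebnisse'])
        writer.save()
    except Exception as e:
        print(e)
    finally:
        # ADD NEW SHEET OF RESULTS
        matches.to_excel(writer, sheet_name="Ergebnisse", index=False)
        writer.save()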
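If pandas is not an option, the same single-pass idea can be sketched in plain Python. This is an illustration under assumptions, not the answer's method: find_suspect_rows is a hypothetical helper, the pairs and rows are made-up sample data, and the rule (description matches a long name, but the short name is missing or different) is taken from the loop in the question. Unlike that loop, it checks only the first long name found in each description. The rows are plain tuples, such as those from sheet.iter_rows(min_row=2, values_only=True).

```python
import re

def find_suspect_rows(rows, idm_pairs, bez_idx=7, idm_idx=14):
    """Return rows whose description matches a long name but whose
    short-name column is empty or differs from the expected value."""
    expected = {long_name: short_name for long_name, short_name in idm_pairs}
    if not expected:
        return []
    # One compiled pattern replaces the inner loop over every pair.
    pattern = re.compile("|".join(re.escape(name) for name in expected))
    suspects = []
    for row in rows:
        description = row[bez_idx]
        if description is None:
            continue
        match = pattern.search(str(description))
        if match and row[idm_idx] != expected[match.group(0)]:
            suspects.append(row)
    return suspects

# Made-up sample data: (description, short name)
pairs = [("Pipe", "PI"), ("Valve", "VA")]
rows = [
    ("Pipe DN50", "PI"),   # consistent
    ("Pipe DN80", None),   # short name missing
    ("Valve-3", "XX"),     # wrong short name
    (None, "PI"),          # no description: skipped
]
suspects = find_suspect_rows(rows, pairs, bez_idx=0, idm_idx=1)
print(suspects)  # [('Pipe DN80', None), ('Valve-3', 'XX')]
```

The default indexes 7 and 14 mirror row_list[7] and row_list[14] in the question. Reading whole rows as values also avoids calling sheet.cell() once per cell.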

How can I find the last non-empty row of excel using openpyxl 3.03?

How can I find the number of the last non-empty row of a whole xlsx sheet using Python and openpyxl?
The file can have empty rows between the cells and the empty rows at the end could have had content that has been deleted. Furthermore I don't want to give a specific column, rather check the whole table.
For example the last non-empty row in the picture is row 13.
I know the subject has been extensively discussed but I haven't found an exact solution on the internet.
# Open file with openpyxl
from openpyxl import load_workbook
to_be = load_workbook(FILENAME_xlsx)
s = to_be.active
last_empty_row = len(list(s.rows))
print(last_empty_row)
## Output: 13
s.rows is a generator; converting it to a list gives one tuple of cells per row.
If you are looking for the last non-empty row of a whole xlsx sheet using Python and openpyxl, try this:
import openpyxl
def last_active_row(input_file, sheet_name):
    workbook = openpyxl.load_workbook(input_file)
    wp = workbook[sheet_name]
    # Walk up from the bottom until a row contains at least one value
    for row in range(wp.max_row, 0, -1):
        if any(cell.value is not None for cell in wp[row]):
            return row
    return 0  # the sheet is completely empty
if __name__ == '__main__':
    # input_file and sheet_name must be defined before this call
    print("The last active row is: ", last_active_row(input_file, sheet_name))
This should help.
openpyxl's Worksheet class has the attribute max_row.
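The "check every column, ignore trailing blanks" requirement can also be sketched independently of openpyxl. The helper below, last_non_empty_row, is hypothetical. It takes rows of plain values (for instance from ws.iter_rows(values_only=True)) and, as an assumption, treats both None and the empty string as empty.

```python
def last_non_empty_row(rows):
    """Return the 1-based number of the last row holding any value, or 0."""
    last = 0
    for number, row in enumerate(rows, start=1):
        if any(value not in (None, "") for value in row):
            last = number
    return last

# Made-up sheet content: a gap in the middle and blank rows at the end
sample = [
    ("a", None),
    (None, None),
    (None, "b"),
    (None, None),
    (None, None),
]
print(last_non_empty_row(sample))  # 3
```

Scanning forward works with read-only mode as well, where rows can only be streamed from the top.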

Using generators outside of a loop

Relatively new to Python so please excuse the newbie question, but Google isn't helpful at this time.
I have 100 very large xlsx files from which I need to extract the first row (specifically cell A2). I found this gem of a tool called openpyxl which will iterate through my data files without loading everything in memory. It uses a generator to get the relevant row on each call.
The thing that I can't get is how to initialize a generator outside of a loop. Right now my code is:
from openpyxl import load_workbook
wb = load_workbook(filename = "merged01.xlsx", use_iterators= True)
sheetName = wb.get_sheet_names()
ws = wb.get_sheet_by_name(name = sheetName[0])
row = ws.iter_rows() #row is a generator
for cell in row:
    break
print (cell[1].internal_value) # A2
But there has to be a better way of doing this such as:
...
row = ws.iter_rows() #row is a generator
cell = row.first # line I'm trying to KISS
print (cell[1].internal_value) # A2
cell = next(row)
The next function retrieves the next value from any iterator.
You're looking for next().
cell = next(row)
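A small self-contained illustration of next(), including its optional default for an exhausted iterator. The generator fake_rows and the helper first_item are hypothetical stand-ins; fake_rows plays the role of ws.iter_rows().

```python
def first_item(iterable, default=None):
    """Return the first item of any iterable, or default if it is empty."""
    return next(iter(iterable), default)

def fake_rows():
    # Stand-in for ws.iter_rows(): yields one tuple of values per row
    yield ("A1", "B1")
    yield ("A2", "B2")

rows = fake_rows()
print(first_item(rows))      # ('A1', 'B1')
print(next(rows))            # ('A2', 'B2'): the generator keeps its position
print(first_item(rows, ()))  # (): exhausted, so the default is returned
```

Without a default, next() raises StopIteration when the iterator is empty, which is worth handling if a file might contain no rows.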
