Target: I am trying to split an Excel file into multiple files based on a filter given within the sheet.
Problem: While copying the formula columns, the row numbers inside the formulas are not updated when the rows are split into multiple files.
For example: in the master file the formula for row 11 is "=LEFT(B11, FIND(" ", B11,1))". That row becomes the first data row in the new split file, but the formula still refers to row 11, which produces a #VALUE! error in the new file.
Any ideas on how to resolve this?
I have tried achieving this using pandas and openpyxl without success; the code is below.
To load the file:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filepath)
sheets = wb.sheetnames  # get_sheet_names() is deprecated
sheet = wb[sheets[0]]
# sheet.values yields one tuple per row; DataFrame takes no index=False argument
master_df = pd.DataFrame(sheet.values)
master_df.columns = master_df.iloc[0]
master_df = master_df[1:]
print(master_df)
To split and export the file:
temp_df = master_df[master_df['Filter Column'] == filter_criteria]
# sp.export_file is a project-specific helper that writes the DataFrame out
sp.export_file(temp_df, output_path + "/" + <"output file name">)
from openpyxl.formula.translate import Translator

def update_formula(df: pd.DataFrame, formula_col):
    '''
    Function to update formulas for each Manager
    :param df: DataFrame for one specific manager.
    :param formula_col: mapping of column name -> (column letter, template formula)
    '''
    for _col in formula_col:
        col_alpha = formula_col[_col][0]
        formula = formula_col[_col][1]
        index = 2
        for ind, row in df.iterrows():
            # re-anchor the template formula from row 2 to the current row
            df.at[ind, _col] = Translator(formula, origin=col_alpha + '2').translate_formula(col_alpha + str(index))
            index = index + 1
Here I pass in a DataFrame and a mapping of the columns that contain formulas. I then iterate over the DataFrame and update the formula for each cell in those columns using openpyxl's Translator class.
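For context, a minimal sketch of what Translator does on its own (the C-column origin and destination here are assumptions for illustration):

from openpyxl.formula.translate import Translator

# re-anchor the master formula from C11 to C2; relative references shift with it,
# so the B11 references become B2
shifted = Translator('=LEFT(B11, FIND(" ", B11,1))', origin='C11').translate_formula('C2')
print(shifted)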
This is the best solution I have figured out yet.
Please let me know if there is a better way.
Related
So this is kind of weird, but I'm new to Python and I'm committed to seeing my first project with it through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual, unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file; I want to append to the end of the .xlsx file. The biggest catch is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # without saving, the appended rows never reach the file
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am, however, committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entire program below so I can hopefully explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice actually looks like.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read in the files to a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where headers are located in the DataFrame
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if not isinstance(cell, str):
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if the string matches the search params, then do...
                # (insensitive_compare is a case-insensitive helper defined elsewhere)
                if insensitive_compare(search_loop, cell):
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store the row number where it sits in that dataframe
                        column_list.append(name)  # store the column name where it sits in that dataframe
                    else:
                        continue
                else:
                    continue
    # pad each found column with two extra rows
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans: True where something is there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]]:
            # read down the column of booleans until they turn False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual data into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel like:
The comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concat'd into a dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]]:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
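As an aside, pd.concat also accepts the whole list in one call, which avoids rebuilding the frame on every loop iteration; a small sketch using the same names:

to_xlsx_df = pd.concat(main_df)
to_xlsx_df.to_excel(outfile_path)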
The output to Excel ended up looking something like this:
Hopefully this can help someone else out too.
I am trying to compare 2 columns of an Excel file against a 2D matrix, row by row, with Python. My Excel file contains 20,100 rows and the computing time in PyCharm is more than 1 hour. Is there any way to make these value comparisons more time-efficient?
import openpyxl as xl
from IDM import idm_matrix

# load and create excel files
wb = xl.load_workbook('Auswertung_C33.xlsx')
result_wb = xl.Workbook()  # new workbook for results
result_sheet = result_wb.create_sheet('Ergebnisse')  # create new sheet in result file
result_wb.remove(result_wb['Sheet'])
sheet = wb['TriCad_Format']

# copy 1st row
first_row = sheet[1]
list_first_row = []
for item in first_row:
    list_first_row.append(item.value)
result_sheet.append(list_first_row)

# value check
for row in range(2, sheet.max_row + 1):
    row_list = []
    for col in range(1, sheet.max_column + 1):
        cell = sheet.cell(row, col)
        row_list.append(cell.value)
    for matrix in idm_matrix:
        if row_list[7] is None:
            continue
        elif matrix[0] in row_list[7]:
            if row_list[14] is None or matrix[1] != row_list[14]:
                result_sheet.append(row_list)

print("saving file...")
result_wb.save('Auswertung.xlsx')  # saves the results in a new workbook
print("Done!")
Thanks for your help!
Alex
----- Sample of Data ------
Input:
BEZ | _IDM
Schirmsprinkler-SU5 | EAL
--> If column BEZ contains the string 'Schirmsprinkler' and column _IDM has any value, the row should be copied. If column _IDM is empty, the row is fine and should not be copied. There are many strings in BEZ for which _IDM should be empty; that's why I am trying to put them all in the df_idm lists. However, it doesn't work with an empty string "".
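As an aside, a minimal illustration of that empty-string pitfall (the second sample row is invented): pandas reads blank Excel cells as NaN, not as "", so comparing with == "" never matches them:

import numpy as np
import pandas as pd

df = pd.DataFrame({'BEZ': ['Schirmsprinkler-SU5', 'Ventil-XY'],
                   '_IDM': ['EAL', np.nan]})
print((df['_IDM'] == '').any())  # False: the blank cell arrived as NaN
print(df['_IDM'].isna())         # use isna()/notna() to test for empties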
Update 20th of May 2020:
import pandas as pd
from IDM import idm_matrix

# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")

# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns=['LongName', 'ShortName'])

# REMOVE ROWS WHICH HAVE NO VALUE IN COLUMN BEZ
df_excel.dropna(subset=['BEZ'], inplace=True)

# MATCH ON CORRESPONDING COLUMNS
search_list = df_idm['LongName'].unique().tolist()
matches1 = df_excel[(df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) &
                    (~df_excel["_IDM"].isin(df_idm["ShortName"].tolist()))]
matches2 = df_excel[(~df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) &
                    (~pd.isnull(df_excel["_IDM"]))]

# CREATE LIST OF MATCHING DATAFRAMES
sum_of_idm = [matches1, matches2]

# CREATE NEW WORKBOOK
with pd.ExcelWriter('Ergebnisse.xlsx') as writer:
    pd.concat(sum_of_idm).to_excel(writer, sheet_name="Ergebnisse", index=False)
Since you are handling data requiring comparison checks, consider Pandas, the third-party data analytics library for Python, for several reasons:
Import and export Excel features that can interface with openpyxl
Ability to interact with many Python data objects (list, dict, tuple, array, etc.)
Vectorized comparison logic that is more efficient than nested for loops
Use whole, named objects (DataFrame and Series) for bulk, single-call, set-based operations
Avoid working with unnamed, numbered rows and cells, which impairs readability
Specifically, you can migrate your idm_matrix to a data frame and import the Excel data to a data frame, then compare the columns either with a single merge call (for exact matches) or with Series.str.contains (for partial matches) followed by a logical filter.
Note: without a reproducible example, the code below uses information from the posted code but needs to be tested on actual data. Adjust any Column# from the original Excel worksheet as needed:
import openpyxl as xl
from IDM import idm_matrix
import pandas as pd

# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")

# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns=['LongName', 'ShortName'])

# EXACT MATCH
# matches = df_excel.merge(df_idm, left_on=['Column6'], right_on=['LongName'])

# PARTIAL MATCH
search_list = df_idm['LongName'].unique().tolist()
matches = df_excel[(~df_excel["Column6"].str.contains("|".join(search_list), regex=True)) &
                   (pd.isnull(df_excel["_IDM"])) &
                   (df_excel["Column6"].isin(df_idm["ShortName"].tolist()))]

# ADJUST EXISTING WORKBOOK
with pd.ExcelWriter(xl_file, engine='openpyxl') as writer:
    writer.book = xl.load_workbook(xl_file)
    try:
        # REMOVE SHEET IF EXISTS
        writer.book.remove(writer.book['Ergebnisse'])
        writer.save()
    except Exception as e:
        print(e)
    finally:
        # ADD NEW SHEET OF RESULTS
        matches.to_excel(writer, sheet_name="Ergebnisse", index=False)
        writer.save()
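For what it's worth, on pandas 1.3+ the remove-and-replace dance can be skipped: ExcelWriter can open a workbook in append mode and replace a sheet in place. A hedged sketch, reusing the matches frame from above:

import pandas as pd

with pd.ExcelWriter('Auswertung_C33.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='replace') as writer:
    matches.to_excel(writer, sheet_name="Ergebnisse", index=False)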
I know this is a lot of code and there is a lot to do, but I am really stuck and don't know how to continue now that the program can match identical entries. I am pretty sure you know how a lookup in Excel works; this program does basically the same thing. I tried to comment the important parts and hope you can give me some help on how to continue this project. Thank you very much!
import pandas as pd

# the two excel files with the columns that should be compared
File1 = pd.read_excel("Excel_test.xlsx", usecols=[0], header=None)
File2 = pd.read_excel("Excel_test02.xlsx", usecols=[0], header=None)
# the full excel files
fullFile1 = pd.read_excel("Excel_test.xlsx", header=None)
fullFile2 = pd.read_excel("Excel_test02.xlsx", header=None)

i = 0
writer = pd.ExcelWriter("output.xlsx")

def loadingTime():  # just a loader that shows the percentage of the matching process
    global i
    loading = (i / len(File1)) * 100
    loading = round(loading, 2)
    print(str(loading) + "%/100%")

def matcher():
    global i
    while i < len(File1):  # steps through the compare column, moving on once a match is found in the second file
        for o in range(len(File2)):  # runs through the column in the second file
            # matches the column contents of the two files
            matching = File1.iloc[i].str.lower() == File2.iloc[o].str.lower()
            if matching.bool():
                print("Match")
                """
                the whole rows of the matched columns should be appended to a
                DataFrame, keeping the arrangement of the excel files:
                df.append(File1.iloc[i])
                df.append(File2.iloc[o])
                """
        i += 1

matcher()
df.to_excel(writer, "Sheet")  # df is the combined DataFrame still to be built
writer.save()  # after the two files have been compared, write a file containing both excel contents, correctly arranged
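For what it's worth, a hedged sketch of the same lookup done with a single pandas merge instead of nested loops (file names as above; assuming the key column is the first one and contains strings):

import pandas as pd

full1 = pd.read_excel("Excel_test.xlsx", header=None)
full2 = pd.read_excel("Excel_test02.xlsx", header=None)

# normalise the key columns so the comparison is case-insensitive
full1["key"] = full1[0].str.lower()
full2["key"] = full2[0].str.lower()

# an inner merge keeps only rows whose keys appear in both files,
# placing the matching rows from both files side by side
merged = full1.merge(full2, on="key", suffixes=("_file1", "_file2"))
merged.drop(columns="key").to_excel("output.xlsx", index=False)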
Hello everyone and thank you in advance.
I have a Python script where I open a template Excel file, add data (while preserving the style), and save it again. I would like to be able to remove rows that I did not edit before saving out the new xls file. My template xls file has a footer, so I want to delete the extra rows before the footer.
Here is how I am loading the xls template:
import xlrd
import xlutils.copy

self.inBook = xlrd.open_workbook(file_path, formatting_info=True)
self.outBook = xlutils.copy.copy(self.inBook)
self.outBookCopy = xlutils.copy.copy(self.inBook)
I then write the info to outBook while grabbing the style from outBookCopy and applying it to each row that I modify in outBook.
So how do I delete rows from outBook before writing it? Thanks everyone!
I achieved this using the Pandas package:
import pandas as pd

# Read from Excel
xl = pd.ExcelFile("test.xls")

# Parse the Excel sheet into a DataFrame
dfs = xl.parse(xl.sheet_names[0])

# Update the DataFrame as required
# (here, removing rows with a blank value in the "Name" column;
# pandas loads blank cells as NaN, so filter with notna())
dfs = dfs[dfs['Name'].notna()]

# Update the excel sheet with the updated DataFrame
dfs.to_excel("test.xls", sheet_name='Sheet1', index=False)
xlwt does not provide a simple interface for this, but I've had success with a somewhat similar problem (inserting multiple copies of a row into a copied workbook) by directly changing the worksheet's rows attribute and the row numbers on the row and cell objects.
The rows attribute is a dict, indexed on row number, so iterating a row range takes a little care and you can't slice it.
Given the number of rows you want to delete and the initial row number of the first row you want to keep, something like this might work:
rows_indices_to_move = range(first_kept_row, worksheet.last_used_row + 1)
max_used_row = 0
for row_index in rows_indices_to_move:
    new_row_number = row_index - number_to_delete
    if row_index in worksheet.rows:  # rows is a dict, so test membership directly
        row = worksheet.rows[row_index]
        row._Row__idx = new_row_number
        for cell in row._Row__cells.values():
            if cell:
                cell.rowx = new_row_number
        worksheet.rows[new_row_number] = row
        max_used_row = new_row_number
    else:
        # There's no row in the block we're trying to slide up at this index,
        # but there might be a row already present to clear out.
        if new_row_number in worksheet.rows:
            del worksheet.rows[new_row_number]
# now delete any remaining rows (a dict can't be sliced, so collect the keys first)
for row_index in [r for r in worksheet.rows if r > max_used_row]:
    del worksheet.rows[row_index]
# and update the internal marker for the last remaining row
if max_used_row:
    worksheet.last_used_row = max_used_row
I would expect there to be bugs in that code: it's untested and relies on direct manipulation of the underlying data structures. But it should show the general idea: modify the row and cell objects and adjust the rows dictionary so that the indices are correct.
Do you have merged ranges in the rows you want to delete, or below them? If so, you'll also need to run through the worksheet's merged_ranges attribute and update their rows (a sketch follows). Also, if you have multiple groups of rows to delete, you'll need to adjust this answer; it is specific to the case of deleting one block of rows and shifting everything below it up.
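A hedged sketch of that merged-range fix-up, reusing first_kept_row and number_to_delete from above and assuming xlwt stores each merged range as a (row1, row2, col1, col2) tuple:

new_ranges = []
for r1, r2, c1, c2 in worksheet.merged_ranges:
    # shift any merged range that starts at or below the moved block
    if r1 >= first_kept_row:
        r1 -= number_to_delete
        r2 -= number_to_delete
    new_ranges.append((r1, r2, c1, c2))
worksheet.merged_ranges = new_ranges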
As a side note - I was able to write text to my worksheet and preserve the predefined style thus:
def write_with_style(ws, row, col, value):
    if ws.rows[row]._Row__cells[col]:
        old_xf_idx = ws.rows[row]._Row__cells[col].xf_idx
        ws.write(row, col, value)
        ws.rows[row]._Row__cells[col].xf_idx = old_xf_idx
    else:
        ws.write(row, col, value)
That might let you skip having two copies of your spreadsheet open at once.
For those of us still stuck with xlrd/xlwt/xlutils, here's a filter you could use:
from typing import Iterable

from xlutils.filter import BaseFilter

class RowFilter(BaseFilter):
    rows_to_exclude: "Iterable[int]"
    _next_output_row: int

    def __init__(
        self,
        rows_to_exclude: "Iterable[int]",
    ):
        self.rows_to_exclude = rows_to_exclude
        self._next_output_row = -1

    def _should_include_row(self, rdrowx):
        return rdrowx not in self.rows_to_exclude

    def row(self, rdrowx, wtrowx):
        if self._should_include_row(rdrowx):
            # Proceed with writing out the row to the output file
            self._next_output_row += 1
            self.next.row(
                rdrowx, self._next_output_row,
            )

    # After `row()` has been called, `cell()` is called for each cell of the row
    def cell(self, rdrowx, rdcolx, wtrowx, wtcolx):
        if self._should_include_row(rdrowx):
            self.next.cell(
                rdrowx, rdcolx, self._next_output_row, wtcolx,
            )
Then put it to use with e.g.:
import xlutils.filter
from xlrd import open_workbook
from xlutils.filter import DirectoryWriter, XLRDReader

xlutils.filter.process(
    XLRDReader(open_workbook("input_filename.xls", formatting_info=True), "output_filename.xls"),
    RowFilter([3, 4, 5]),
    DirectoryWriter("output_dir"),
)
I have a really large Excel file and I need to delete about 20,000 rows based on a simple condition, and Excel won't let me delete such a complex range when using a filter. The condition is:
If the first column contains the value X, then I need to be able to delete the entire row.
I'm trying to automate this using Python and xlwt, but am not quite sure where to start. Seeking some code snippets to get me started...
Grateful for any help that's out there!
Don't delete. Just copy what you need (a minimal sketch follows the steps below).
read the original file
open a new file
iterate over the rows of the original file (if the first column of a row does not contain the value X, add the row to the new file)
close both files
rename the new file over the original file
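A minimal sketch of those steps with xlrd/xlwt, since that's the library mentioned; the file names and sheet index are assumptions:

import os
import xlrd
import xlwt

book_in = xlrd.open_workbook("original.xls")
sheet_in = book_in.sheet_by_index(0)

book_out = xlwt.Workbook()
sheet_out = book_out.add_sheet(sheet_in.name)

out_row = 0
for row_index in range(sheet_in.nrows):
    # skip rows whose first column contains the unwanted value
    if sheet_in.cell_value(row_index, 0) == "X":
        continue
    for col_index in range(sheet_in.ncols):
        sheet_out.write(out_row, col_index, sheet_in.cell_value(row_index, col_index))
    out_row += 1

book_out.save("filtered.xls")
os.replace("filtered.xls", "original.xls")  # rename the new file over the original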
I like using COM objects for this kind of fun:
import win32com.client
from win32com.client import constants
f = r"h:\Python\Examples\test.xls"
DELETE_THIS = "X"
exc = win32com.client.gencache.EnsureDispatch("Excel.Application")
exc.Visible = 1
exc.Workbooks.Open(Filename=f)
row = 1
while True:
    exc.Range("B%d" % row).Select()
    data = exc.ActiveCell.FormulaR1C1
    exc.Range("A%d" % row).Select()
    condition = exc.ActiveCell.FormulaR1C1
    if data == '':
        break
    elif condition == DELETE_THIS:
        exc.Rows("%d:%d" % (row, row)).Select()
        exc.Selection.Delete(Shift=constants.xlUp)
    else:
        row += 1
# Before
#
# a
# b
# X c
# d
# e
# X d
# g
#
# After
#
# a
# b
# d
# e
# g
I usually record snippets of Excel macros and glue them together with Python as I dislike Visual Basic :-D.
You can try using the csv reader:
http://docs.python.org/library/csv.html
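If the sheet can be exported to CSV first, a small sketch of that route (file names assumed):

import csv

with open("original.csv", newline="") as src, open("filtered.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0] == "X":
            continue  # drop rows whose first column is X
        writer.writerow(row)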
You can use
sh.Range(sh.Cells(1,1), sh.Cells(20000,1)).EntireRow.Delete()
which will delete rows 1 to 20,000 in an open Excel spreadsheet. So:
if sh.Cells(1,1).Value == 'X':
    sh.Cells(1,1).EntireRow.Delete()
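Extending that idea into a loop: a hedged sketch, assuming sh is an open worksheet reached via win32com; it walks bottom-up so deletions don't shift the rows still to be checked:

last_row = 20000  # assumption: the known extent of the data
for row in range(last_row, 0, -1):
    if sh.Cells(row, 1).Value == 'X':
        sh.Cells(row, 1).EntireRow.Delete()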
If you just need to delete the data (rather than getting rid of the row, which shifts the rows below it up), you can try using my module, PyWorkbooks. You can get the most recent version here:
https://sourceforge.net/projects/pyworkbooks/
There is a pdf tutorial to guide you through how to use it. Happy coding!
I have achieved this using the Pandas package:
import pandas as pd

# Read from Excel
xl = pd.ExcelFile("test.xls")

# Parse the Excel sheet into a DataFrame
dfs = xl.parse(xl.sheet_names[0])

# Update the DataFrame as required
# (here, removing rows with a blank value in the "Name" column;
# pandas loads blank cells as NaN, so filter with notna())
dfs = dfs[dfs['Name'].notna()]

# Update the excel sheet with the updated DataFrame
dfs.to_excel("test.xls", sheet_name='Sheet1', index=False)