Performance of evaluating Excel cells with Python

I am trying to compare two columns of an Excel file with a 2D matrix, row by row, in Python. My Excel file contains 20,100 rows and the run time in PyCharm is more than an hour. Is there any way to make these value comparisons more time-efficient?
import openpyxl as xl
from IDM import idm_matrix

# load source workbook and create result workbook
wb = xl.load_workbook('Auswertung_C33.xlsx')
result_wb = xl.Workbook()  # new workbook for results
result_sheet = result_wb.create_sheet('Ergebnisse')  # create new sheet in result file
result_wb.remove(result_wb['Sheet'])  # drop the default sheet
sheet = wb['TriCad_Format']

# copy 1st row
first_row = sheet[1]  # first row of cells (sheet[1:1] yields row tuples, not cells)
list_first_row = []
for item in first_row:
    list_first_row.append(item.value)
result_sheet.append(list_first_row)

# Value check
for row in range(2, sheet.max_row + 1):
    row_list = []
    for col in range(1, sheet.max_column + 1):
        cell = sheet.cell(row, col)
        row_list.append(cell.value)
    for matrix in idm_matrix:
        if row_list[7] is None:
            continue
        elif matrix[0] in row_list[7]:
            if row_list[14] is None or matrix[1] != row_list[14]:
                result_sheet.append(row_list)

print("saving file...")
result_wb.save('Auswertung.xlsx')  # saves the results in a new workbook
print("Done!")
Thanks for your help!
Alex
----- Sample of Data ------
Input:
BEZ | _IDM
Schirmsprinkler-SU5 | EAL
--> If column BEZ contains the string 'Schirmsprinkler' and column _IDM has any value, the row should be copied. If column _IDM is empty, the row is fine and should not be copied. There are many strings in BEZ for which _IDM should be empty, which is why I am trying to put them all in the df_idm lists. However, it doesn't work with an empty string "".
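(Aside, my reading of the empty-string problem rather than anything stated in the post: in Python every string contains the empty string, so an empty "" entry in the match list matches every row, both with the in operator used in the loop above and with pandas' str.contains used below. A quick sketch:)
import pandas as pd

s = pd.Series(["Schirmsprinkler-SU5", "Ventil", None])
print("" in "Schirmsprinkler-SU5")  # True: every string contains ""
print(s.str.contains(""))           # True for every non-null entry, NaN for None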
Update 20th of May 2020:
import openpyxl as xl
from IDM import idm_matrix
import pandas as pd
# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")
# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns=['LongName', 'ShortName'])
# REMOVE ROWS WHICH HAVE NO VALUE IN COLUMN 6
df_excel.dropna(subset=['BEZ'], inplace=True)
# MATCH ON CORRESPONDING COLUMNS
search_list = df_idm['LongName'].unique().tolist()
matches1 = df_excel[(df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) &
                    (~df_excel["_IDM"].isin(df_idm["ShortName"].tolist()))]
matches2 = df_excel[(~df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) &
                    (~pd.isnull(df_excel["_IDM"]))]
# CREATE LIST OF MATCHING DATAFRAMES
sum_of_idm = [matches1, matches2]
# CREATE NEW WORKBOOK
writer = pd.ExcelWriter('Ergebnisse.xlsx')
pd.concat(sum_of_idm).to_excel(writer, sheet_name="Ergebnisse", index=False)
writer.save()

Since you are handling data requiring comparison checks, consider Pandas, the third-party data-analytics library for Python, for several reasons:
Import and export of Excel files, interfacing with openpyxl
Ability to interact with many Python data objects (list, dict, tuple, array, etc.)
Vectorized comparison logic that is more efficient than nested for loops
Whole, named objects (DataFrame and Series) for bulk, single-call, set-based operations
No need to work with unnamed, numbered rows and cells, which hurts readability
Specifically, you can migrate your idm_matrix to a data frame, import the Excel data into another data frame, and compare columns either with a single merge call (for exact matches) or with Series.str.contains (for partial matches) followed by a logical filter.
Note: Without a reproducible example, the code below uses information from the posted code and needs to be tested on actual data. Adjust any Column# references from the original Excel worksheet as needed:
import openpyxl as xl
from IDM import idm_matrix
import pandas as pd

# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")

# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns=['LongName', 'ShortName'])

# EXACT MATCH
# matches = df_excel.merge(df_idm, left_on=['Column6'], right_on=['LongName'])

# PARTIAL MATCH
search_list = df_idm['LongName'].unique().tolist()
matches = df_excel[(~df_excel["Column6"].str.contains("|".join(search_list), regex=True)) &
                   (pd.isnull(df_excel["_IDM"])) &
                   (df_excel["Column6"].isin(df_idm["ShortName"].tolist()))]

# ADJUST EXISTING WORKBOOK
with pd.ExcelWriter(xl_file, engine='openpyxl') as writer:
    writer.book = xl.load_workbook(xl_file)
    try:
        # REMOVE SHEET IF EXISTS
        writer.book.remove(writer.book['Ergebnisse'])
        writer.save()
    except Exception as e:
        print(e)
    finally:
        # ADD NEW SHEET OF RESULTS
        matches.to_excel(writer, sheet_name="Ergebnisse", index=False)
        writer.save()
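On recent pandas versions the writer.book assignment above may no longer work. A minimal alternative sketch, assuming pandas >= 1.3 and the matches frame from above: open the writer in append mode and let it replace the sheet itself.
import pandas as pd

with pd.ExcelWriter('Auswertung_C33.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='replace') as writer:
    matches.to_excel(writer, sheet_name='Ergebnisse', index=False)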

Related

How to copy tables from a PDF file to an Excel file, except the headers, using Python

I have extracted tables from a PDF file to an Excel (.xlsx) file using Python. Now I want all the data except the headers to appear in the Excel file. What changes should I make to the code? I am attaching the code below.
The code:
import camelot
import PyPDF2
import pandas as pd

# PDF file to extract tables from
file = "C:/Users/mahma_dv2pq9y/Downloads/santander.pdf"

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('C:/Users/mahma_dv2pq9y/OneDrive/santander_agg_mortgage.xlsx', engine='xlsxwriter')

# extract all the tables in the PDF file
tables = camelot.read_pdf(file, pages='all', flavor="stream", encoding="utf-8")

# NOTE: the following lines were not valid Python in the original post; kept as a comment:
# except('Maximum loan to value, Initial rate, Differential to BBR, Product fee,
# Completion deadline, Minimum loan size, Maximum loan size, Early repayment
# charge plus Benefit package, Payable if you repay on or before, product code')

# number of tables extracted
print("Total tables extracted:", tables.n)

# print the first table as Pandas DataFrame
#print(tables[1].df)

# export individually as Excel (.xlsx extension)
#tables[1].to_excel("/mnt/projetcs/pdf_excel/agg.xlsx")

columns = ['Additional Info']
for i in range(1, tables.n):
    #print(tables[i].df)
    #tables[i].df.to_excel(writer, sheet_name='Sheet'+str(i), index=False)
    temp_df = tables[i].df
    #temp_df.rename(columns=temp_df.iloc[0]).drop(temp_df.index[0])
    #temp_df.drop(columns, inplace=True, axis=1, errors='ignore')
    # iterating the columns
    for col in temp_df.columns:
        #print(temp_df.iloc[:, [col]])
        if col > 7:
            print(col)
            #print(temp_df.drop(temp_df.iloc[:, [col]], axis=1, errors='ignore'))
    #print(temp_df)
    col_length = len(temp_df.columns)
    print("count" + str(col_length))
    if col_length > 6:
        print("save")
        temp_df.to_excel(writer, sheet_name='Sheet', index=False)
writer.save()
I have no Santander bank statement to test with, but I am almost sure you will manage to adjust this to your needs:
import camelot
import pandas as pd

# PDF file to extract tables from
file = r"C:/Users/mahma_dv2pq9y/Downloads/santander.pdf"

# extract all the tables in the PDF file
tables = camelot.read_pdf(file, pages='all', flavor="stream", encoding="utf-8")

master_DF = pd.DataFrame()  # empty DataFrame, updated with data below
for i in range(tables.n):
    if i == 0:  # table from the first statement page:
        new_header = tables[i].df.iloc[4]  # row holding the header names in tables[0].df
        tables[i].df = tables[i].df[5:]  # keep only data after the header row (top rows usually hold lots of bank-account info that is not needed)
        tables[i].df.columns = new_header  # rename the empty headers
        master_DF = pd.concat([master_DF, tables[i].df], axis=0, ignore_index=True)  # append the data
    else:
        tables[i].df = tables[i].df[1:]  # drop the top row (bank statements usually have an info/logo row)
        tables[i].df.columns = new_header  # rename the empty headers
        master_DF = pd.concat([master_DF, tables[i].df], axis=0, ignore_index=True)  # append the data
    print(tables[i].df)
    print("")

float_cols = ['Debit', 'Credit', 'Balance']  # columns holding numeric (float) data
for col in float_cols:
    master_DF[col] = master_DF[col].str.replace(",", "")  # strip commas before datatype conversion
master_DF[float_cols] = master_DF[float_cols].apply(pd.to_numeric)  # convert strings to numbers

master_DF.to_excel(r"C:/Users/mahma_dv2pq9y/OneDrive/NEW_santander_agg_mortgage.xlsx", index=False, sheet_name="BS_Statement")
print("DATA Saved")

How to merge multiple .xls files with hyperlinks in Python?

I am trying to merge multiple .xls files that have many columns, but one column with hyperlinks. I try to do this with Python but keep running into unsolvable errors.
Just to be concise: the hyperlinks are hidden under a text section. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).
To improve reproducibility, I have added samples of xls1 and xls2 below.
xls1:
Title | Publication_Number
P_A   | ES2866911 (T3)
P_B   | EP3887362 (A1)
xls2:
Title | Publication_Number
P_C   | AR118706 (A2)
P_D   | ES2867600 (T3)
Desired outcome:
Title | Publication_Number
P_A   | ES2866911 (T3)
P_B   | EP3887362 (A1)
P_C   | AR118706 (A2)
P_D   | ES2867600 (T3)
I am unable to get the .xls files into Python without losing the formatting or the hyperlinks, and I am unable to convert the .xls files to .xlsx. I have no possibility to acquire the .xls files in .xlsx format. Below I briefly summarize what I have tried:
1.) Reading with pandas was my first attempt. Easy to do, but all hyperlinks are lost, and all formatting from the original file is lost as well.
2.) Reading .xls files with openpyxl.load
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
3.) Converting .xls files to .xlsx
from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX(input.file.xls)
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
import pyexcel as p
p.save_book_as(file_name=input_file.xls, dest_file_name=export_file.xlsx)
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration
4.) Even if we are able to read the .xls file with xlrd, for example (meaning we will never be able to save the file as .xlsx), I can't even see the hyperlink:
import xlrd
wb = xlrd.open_workbook(file) # where vis.xls is your test file
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value
'AR118706 (A2)' #Which is the name, not hyperlink
5.) I tried installing older versions of openpyxl (openpyxl==3.0.1) to overcome the type error, to no success. I tried to open the .xls file with openpyxl via the xlrd engine; a similar 'xml.etree.ElementTree.Element' TypeError occurred. I tried many ways to batch-convert .xls files to .xlsx, all with similar errors.
Obviously I could just open each file with Excel and save it as .xlsx, but that defeats the entire purpose, and I can't do it for hundreds of files.
You need to use the xlrd library to read the hyperlinks properly, pandas to merge all the data together, and xlsxwriter to write the data properly.
Assuming all input files have the same format, you can use the code below.
# imports
import os
import xlrd
import xlsxwriter
import pandas as pd

# required functions
def load_excel_to_df(filepath, hyperlink_col):
    book = xlrd.open_workbook(filepath)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map
    data = pd.read_excel(filepath)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)
    required_links = [v.url_or_path for k, v in hyperlink_map.items() if k[1] == hyperlink_col_index]
    data['hyperlinks'] = required_links
    return data

# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'

# read and combine data
required_files = os.listdir(input_data_dir)
combined_data = pd.DataFrame()
for file in required_files:
    curr_data = load_excel_to_df(input_data_dir + os.sep + file, hyperlink_col)
    combined_data = combined_data.append(curr_data, sort=False, ignore_index=True)

cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)

# writing data
writer = pd.ExcelWriter(output_data_dir + os.sep + output_filename, engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False)  # last column contains hyperlinks
workbook = writer.book
worksheet = writer.sheets[list(workbook.sheetnames.keys())[0]]
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)
for i in range(m):
    worksheet.write_url(i + 1, hyperlink_col_index, combined_data.loc[i, cols[-1]],
                        string=combined_data.loc[i, hyperlink_col])
writer.save()
References:
reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html
Without a clear reproducible example, the problem is not clear. Assume I have two files called tmp.xls and tmp2.xls containing dummy data as in the two screenshots below (omitted here).
Then pandas can easily load, concatenate, and convert to .xlsx format without loss of hyperlinks. Here is some demo code and the resulting file:
import pandas as pd
f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')
f3 = pd.concat([f1, f2], ignore_index=True)
f3.to_excel('./f3.xlsx')
Inspired by #Kunal, I managed to write code that avoids using the pandas library. The .xls files are read by xlrd and written to a new Excel file by xlwt. Hyperlinks are maintained, and the output file was saved in .xlsx format:
import os
import xlwt
from xlrd import open_workbook

# read and combine data
directory = "random_directory"
required_files = os.listdir(directory)

#Define new file and sheet to merge the files into
new_file = xlwt.Workbook(encoding='utf-8', style_compression=0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok=True)

#Initialize header row; can be done with any file
old_file = open_workbook(directory + "/" + required_files[0], formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in range(old_sheet.ncols):
    new_sheet.write(0, column, old_sheet.cell(0, column).value)  # create header row

#Add rows from all files present in folder
for file in required_files:
    old_file = open_workbook(directory + "/" + file, formatting_info=True)
    old_sheet = old_file.sheet_by_index(0)  # define old sheet
    hyperlink_map = old_sheet.hyperlink_map  # map of all hyperlinks
    for row in range(1, old_sheet.nrows):  # we need all rows except the header row
        if row - 1 < len(hyperlink_map.items()):  # ensure we do not go out of range on hyperlink_map.items()
            Row_depth = len(new_sheet._Worksheet__rows)  # row depth tells us where to add the new row
            for col in range(old_sheet.ncols):  # for every column, add a cell to the row
                if col == 1:  # exception for column 2, the hyperlinked column
                    click = list(hyperlink_map.items())[row - 1][1].url_or_path  # define URL
                    new_sheet.write(Row_depth, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(click, old_sheet.cell(row, 1).value)))
                else:  # not the hyperlinked column
                    new_sheet.write(Row_depth, col, old_sheet.cell(row, col).value)  # write cell

new_file.save("random_directory/output_file.xlsx")  # note: xlwt writes the legacy .xls format regardless of the extension
I assume the same as daedalus in terms of the Excel files. Instead of pandas, I use openpyxl to read and create a new Excel file.
import openpyxl

wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb1['Sheet1']
wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb2['Sheet1']

csvDict = {}
# Go through the first sheet to find the hyperlinks and keys
# (start at row 2: the header row's cell has no hyperlink).
for row in range(2, ws1.max_row + 1):
    csvDict[ws1.cell(row=row, column=1).value] = [ws1.cell(row=row, column=2).hyperlink.target,
                                                  ws1.cell(row=row, column=2).value]
# Go through the second sheet to find hyperlinks and keys.
for row in range(2, ws2.max_row + 1):
    csvDict[ws2.cell(row=row, column=1).value] = [ws2.cell(row=row, column=2).hyperlink.target,
                                                  ws2.cell(row=row, column=2).value]
Now you have all the data, so you can create a new workbook and save the values from the dict into it via openpyxl.
from openpyxl import Workbook

wb = Workbook(write_only=True)
ws = wb.create_sheet()
for key, (target, text) in csvDict.items():
    ws.append([key, text])  # use ws.append() to add the data from the dict
wb.save('new_big_file.xlsx')
https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode
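Note that the loop above writes only the display text; the hyperlink targets collected in csvDict are dropped. One hedged way to keep them (my addition, not part of the original answer) is to write a HYPERLINK formula per row in a regular, non-write-only workbook:
from openpyxl import Workbook

wb_out = Workbook()
ws_out = wb_out.active
for key, (target, text) in csvDict.items():
    # =HYPERLINK("url", "friendly text") stays clickable when opened in Excel
    ws_out.append([key, '=HYPERLINK("{}", "{}")'.format(target, text)])
wb_out.save('new_big_file.xlsx')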

How to obtain the mean of selected columns from multiple sheets within the same Excel file

I am working with a large Excel file that has 22 sheets, where each sheet has the same column headings but not the same number of rows. I would like to obtain the mean values (excluding zeros) for columns AA to AX across all 22 sheets. The columns have titles, which I use in my code.
Rather than reading each sheet separately, I want to loop through the sheets and get the mean values as output.
With help from answers to other posts, I have this:
import pandas as pd

xls = pd.ExcelFile('myexcelfile.xlsx')
xls.sheet_names
#print(xls.sheet_names)

out_df = pd.DataFrame()
for sheets in xls.sheet_names:
    df = pd.read_excel('myexcelfile.xlsx', sheet_names=None)
    df1 = df[df[:] != 0]
    df2 = df1.loc[:, 'aa':'ax'].mean()
    out_df.append(df2)  # appends rows of one dataframe to another (just like the expected output)
print(out_df)
## out_df will have data from all the sheets
The code works so far, but only for one of the sheets. How do I get it to work for all 22 sheets?
You can use numpy to perform basic math on a pandas.DataFrame or pandas.Series. Take a look at my code below:
import pandas as pd, numpy as np

XL_PATH = r'C:\Users\YourName\PythonProject\Book1.xlsx'
xlFile = pd.ExcelFile(XL_PATH)
xlSheetNames = xlFile.sheet_names

dfList = []  # list to store all DataFrames
for shName in xlSheetNames:
    df = pd.read_excel(XL_PATH, sheet_name=shName)  # read sheet X as DataFrame
    dfList.append(df)  # put DataFrame into a list

for df in dfList:
    print(df)
    dfAverage = np.average(df)  # use numpy to get DataFrame average
    print(dfAverage)
#Try the code below
import pandas as pd, numpy as np, os

XL_PATH = "YOUR EXCEL FULL PATH"
SH_NAMES = None  # will contain the list of Excel sheet names
DF_DICT = {}     # will contain a dictionary of DataFrames

def readExcel():
    if not os.path.isfile(XL_PATH): return FileNotFoundError
    SH_NAMES = pd.ExcelFile(XL_PATH).sheet_names
    # pandas.read_excel() has the argument 'sheet_name';
    # when you pass a list to 'sheet_name',
    # pandas returns a dictionary of DataFrames with the sheet names as keys
    DF_DICT = pd.read_excel(XL_PATH, sheet_name=SH_NAMES)
    return SH_NAMES, DF_DICT

SH_NAMES, DF_DICT = readExcel()

#Now you have DF_DICT, which contains a DataFrame for each sheet in the Excel file
#Next step is to append all rows of data from Sheet1 to SheetX
#This only works if all DataFrames have the same columns
def appendAllSheets():
    dfAp = pd.DataFrame()
    for key in DF_DICT:
        df = DF_DICT[key]
        dfAp = dfAp.append(df)
    return dfAp

#you can now call the function as below:
dfWithAllData = appendAllSheets()

#now you have one DataFrame with all rows combined from Sheet1 to SheetX
#you can fix the data, for example drop all rows which contain 0
dfZero_Removed = dfWithAllData[dfWithAllData['Column_Name'] != 0]
dfNA_Removed = dfWithAllData[~pd.isna(dfWithAllData['Column_Name'])]

#last step: to find the average or run another math operation,
#just let numpy do the job
average_of_all_1 = np.average(dfZero_Removed)
average_of_all_2 = np.average(dfNA_Removed)

#show result
#this will print the average of all rows of data
#from Sheet1 to SheetX of your target Excel file
print(average_of_all_1, average_of_all_2)
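For the original question (means of columns AA to AX, excluding zeros, across all 22 sheets), a more compact sketch; the assumptions here are the file name and that 'AA' and 'AX' are actual column titles:
import pandas as pd
import numpy as np

sheets = pd.read_excel('myexcelfile.xlsx', sheet_name=None)   # dict: one DataFrame per sheet
combined = pd.concat(sheets.values(), ignore_index=True)      # stack all 22 sheets
means = combined.loc[:, 'AA':'AX'].replace(0, np.nan).mean()  # zeros excluded, NaN skipped
print(means)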

Can't save excel file using openpyxl

I'm having an issue with saving an Excel file in openpyxl.
I'm trying to create a processing script that grabs data from one Excel file, dumps it into a dump Excel file and, after some tweaking around with formulas in Excel, gives me all of the processed data in the dump file. My current code is as follows.
from openpyxl import load_workbook
import os
import datetime
from openpyxl.cell import get_column_letter, Cell, column_index_from_string, coordinate_from_string

dump = dumplocation
desktop = desktoplocation
date = datetime.datetime.now().strftime("%Y-%m-%d")
excel = load_workbook(dump + date + ".xlsx", use_iterators=True)
sheet = excel.get_sheet_by_name("Sheet1")

try:
    query = raw_input('How many rows of data is there?\n')
except ValueError:
    print 'Not a number'
#sheetname = raw_input('What is the name of the worksheet in the data?\n')

for filename in os.listdir(desktop):
    if filename.endswith(".xlsx"):
        print filename
        data = load_workbook(filename, use_iterators=True)
        ws = data.get_sheet_by_name(name='17270115')

#copying data from excel to data excel
n = 16
for row in sheet.iter_rows():
    for cell in row:
        for rows in ws.iter_rows():
            for cells in rows:
                n += 1
                if (n >= 17) and (n <= 32):
                    cell.internal_value = cells.internal_value

#adding column between time in UTC and the data
column_index = 1
new_cells = {}
sheet.column_dimensions = {}
for coordinate, cell in sheet._cells.iteritems():
    column_letter, row = coordinate_from_string(coordinate)
    column = column_index_from_string(column_letter)
    # shifting columns
    if column >= column_index:
        column += 1
        column_letter = get_column_letter(column)
        coordinate = '%s%s' % (column_letter, row)
    # it's important to create a new Cell object
    new_cells[coordinate] = Cell(sheet, column_letter, row, cell.value)
sheet.cells = new_cells

#setting columns to be hidden
for coordinate, cell in sheet._cells.iteritems():
    column_letter, row = coordinate_from_string(coordinate)
    column = column_index_from_string(column_letter)
    if (column <= 3) and (column >= 18):
        column.set_column(column, options={'hidden': True})
A lot of my code is messy, I know, since I just started Python two or three weeks ago. I also have a few outstanding issues which I can deal with later on.
It doesn't seem like a lot of people use openpyxl for my purposes.
I tried using the normal Workbook module, but that didn't seem to work because you can't iterate over the cell items (which is required for me to copy and paste relevant data from one Excel file to another).
UPDATE: I realised that openpyxl can only create workbooks but can't edit current ones. So I have decided to change tune and edit the new workbook after I have transferred the data into it. I have resorted to using Workbook to transfer the data:
from openpyxl import Workbook
from openpyxl import worksheet
from openpyxl import load_workbook
import os
from openpyxl.cell import get_column_letter, Cell, column_index_from_string, coordinate_from_string

dump = "c:/users/y.lai/desktop/data/201501.xlsx"
desktop = "c:/users/y.lai/desktop/"
excel = Workbook()
sheet = excel.add_sheet

try:
    query = raw_input('How many rows of data is there?\n')
except ValueError:
    print 'Not a number'
#sheetname = raw_input('What is the name of the worksheet in the data?\n')

for filename in os.listdir(desktop):
    if filename.endswith(".xlsx"):
        print filename
        data = load_workbook(filename, use_iterators=True)
        ws = data.get_sheet_by_name(name='17270115')

#copying data from excel to data excel
n = 16
q = 0
for x in range(6, int(query)):
    for s in range(65, 90):
        for cell in Cell(sheet, chr(s), x):
            for rows in ws.iter_rows():
                for cells in rows:
                    q += 1
                    if q >= 5:
                        n += 1
                        if (n >= 17) and (n <= 32):
                            cell.value = cells.internal_value
But this still doesn't seem to work:
Traceback (most recent call last):
File "xxx\Desktop\xlspostprocessing.py", line 40, in <module>
for cell in Cell(sheet,chr(s),x):
File "xxx\AppData\Local\Continuum\Anaconda\lib\site-packages\openpyxl\cell.py", line 181, in __init__
self._shared_date = SharedDate(base_date=worksheet.parent.excel_base_date)
AttributeError: 'function' object has no attribute 'parent'
I went through the API, but I'm overwhelmed by the code in there, so I couldn't make much sense of it. To me it looks like I have used the Cell module wrongly. I read the definition of Cell and its attributes, hence the chr(s) to produce the 26 letters A-Z.
You can iterate using the standard Workbook mode. use_iterators=True has been renamed read_only=True to emphasise what this mode is for (on-demand reading of parts of a workbook).
Your code as it stands cannot work with this method, because the workbook is read-only and cell.internal_value is always a read-only property.
However, it looks like you're not getting that far, because there is a problem with your Excel files. You might want to submit a bug report with one of the files. The mailing list might also be a better place for discussion.
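A minimal sketch of that read-only iteration (file and sheet names are placeholders, and values_only assumes openpyxl >= 2.6): read the source on demand, collect plain values, and write them into a fresh workbook, since read-only cells cannot be assigned to.
from openpyxl import Workbook, load_workbook

src = load_workbook('source.xlsx', read_only=True)
ws_in = src['Sheet1']

out = Workbook()
ws_out = out.active
for row in ws_in.iter_rows(values_only=True):  # tuples of cell values, not cell objects
    ws_out.append(row)
out.save('copy.xlsx')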
You could try using xlrd and xlwt instead of openpyxl, but you might find exactly what you are looking for already available in xlutils; all three are from python-excel.
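For completeness, a small sketch of the xlutils route (the file name is a placeholder): xlutils.copy turns an xlrd workbook into a writable xlwt copy, which is the usual way to "edit" an existing .xls file.
import xlrd
from xlutils.copy import copy

book = xlrd.open_workbook('report.xls', formatting_info=True)  # read side (xlrd)
wb = copy(book)               # writable xlwt copy of the workbook
ws = wb.get_sheet(0)          # first sheet of the copy
ws.write(0, 0, 'edited')      # overwrite cell A1
wb.save('report_edited.xls')  # xlwt can only write the legacy .xls format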

How to sort Excel sheet using Python

I am using Python 3.4 and xlrd. I want to sort the Excel sheet based on the primary column before processing it. Is there a library to perform this?
There are a couple of ways to do this. The first option is to utilize xlrd, as you have this tagged. The biggest downside to this is that it doesn't natively write to XLSX format.
These examples use an Excel document with a single column headed 'Header Row' (screenshot omitted).
Utilizing xlrd and a few modifications from this answer:
import xlwt
from xlrd import open_workbook

target_column = 0  # This example only has 1 column, and it is 0-indexed

book = open_workbook('test.xlsx')
sheet = book.sheets()[0]
data = [sheet.row_values(i) for i in range(sheet.nrows)]
labels = data[0]  # Don't sort our headers
data = data[1:]   # Data begins on the second row
data.sort(key=lambda x: x[target_column])

bk = xlwt.Workbook()
sheet = bk.add_sheet(sheet.name)
for idx, label in enumerate(labels):
    sheet.write(0, idx, label)
for idx_r, row in enumerate(data):
    for idx_c, value in enumerate(row):
        sheet.write(idx_r + 1, idx_c, value)
bk.save('result.xls')  # Notice this is xls, not xlsx like the original file is
This outputs the sorted workbook (screenshot omitted).
Another option (and one that can produce XLSX output) is to utilize pandas. The code is also shorter:
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort(columns="Header Row")
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, sheet_name='Sheet1', columns=["Header Row"], index=False)
writer.save()
This outputs the sorted file (screenshot omitted).
In the to_excel call, index is set to False so that the pandas DataFrame index isn't included in the Excel document. The rest of the keywords should be self-explanatory.
I just wanted to refresh the answer, as the pandas implementation has changed a bit over time. Here's the code that should work now (pandas 1.1.2).
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort_values(by="Header Row")
...
The sort function is now called sort_values, and the columns argument is replaced by by.
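For completeness, a minimal modern end-to-end version (my sketch of what the elided lines do, based on the earlier pandas answer):
import pandas as pd

df = pd.read_excel("test.xlsx", sheet_name="Sheet1")  # replaces ExcelFile.parse
df = df.sort_values(by="Header Row")                  # df.sort() was removed in pandas 0.20
df.to_excel("output.xlsx", sheet_name="Sheet1", index=False)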
