Python splitting an Excel workbook

I am looking for a way to split an Excel workbook that contains multiple tabs/sheets into multiple workbooks, one for each tab/sheet in the original workbook.
This is what I have worked out so far:
from xlrd import open_workbook
from xlwt import Workbook
rb = open_workbook('c:\\original file.xls', formatting_info=True)
for a in range(5):  # for example, there are only 5 tabs/sheets
    rs = rb.sheet_by_index(a)
    new_book = Workbook()
    new_sheet = new_book.add_sheet('Sheet 1')
    for row in range(rs.nrows):
        for col in range(rs.ncols):
            new_sheet.write(row, col, rs.cell(row, col).value)
    new_book.save("c:\\" + str(a) + ".xls")
This does nothing more than read the sheets one by one and save them one by one. Is there a better or more direct way?

Related

How to merge multiple .xls files with hyperlinks in python?

I am trying to merge multiple .xls files that have many columns, one of which contains hyperlinks. I have tried to do this with Python but keep running into errors I cannot resolve.
To be precise, the hyperlinks are hidden behind the cell text. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).
To improve reproducibility, I have added samples of two files, xls1 and xls2, below.
xls1:
Title    Publication_Number
P_A      ES2866911 (T3)
P_B      EP3887362 (A1)
xls2:
Title    Publication_Number
P_C      AR118706 (A2)
P_D      ES2867600 (T3)
Desired outcome:
Title    Publication_Number
P_A      ES2866911 (T3)
P_B      EP3887362 (A1)
P_C      AR118706 (A2)
P_D      ES2867600 (T3)
I am unable to get the .xls files into Python without losing formatting or hyperlinks. In addition, I am unable to convert the .xls files to .xlsx, and I have no way of obtaining them in .xlsx format. Below I briefly summarize what I have tried:
1.) Reading with pandas was my first attempt. It is easy to do, but all hyperlinks are lost in pandas, and so is all formatting from the original file.
2.) Reading .xls files with openpyxl.load
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
3.) Converting .xls files to .xlsx
from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX('input_file.xls')
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
import pyexcel as p
p.save_book_as(file_name='input_file.xls', dest_file_name='export_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration
4.) Even if we are able to read the .xls file with xlrd, for example (meaning we will never be able to save the file as .xlsx), I can't even see the hyperlink:
import xlrd
wb = xlrd.open_workbook(file)  # file is the path to the test file, e.g. vis.xls
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value
'AR118706 (A2)'  # which is the display text, not the hyperlink
5.) I tried installing an older version, openpyxl==3.0.1, to overcome the TypeError, without success. I tried to open the .xls file with openpyxl using the xlrd engine, and a similar 'xml.etree.ElementTree.Element' TypeError occurred. I tried many ways to batch convert .xls files to .xlsx, all with similar errors.
Obviously I can just open each file with Excel and save it as .xlsx, but this defeats the entire purpose, and I can't do that for hundreds of files.

You need to use the xlrd library to read the hyperlinks properly, pandas to merge all the data together, and xlsxwriter to write the data properly.
Assuming all input files have the same format, you can use the code below.
# imports
import os
import xlrd
import xlsxwriter
import pandas as pd
# required functions
def load_excel_to_df(filepath, hyperlink_col):
    book = xlrd.open_workbook(filepath)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map  # keyed by (row, col)
    data = pd.read_excel(filepath)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)
    required_links = [v.url_or_path for k, v in hyperlink_map.items() if k[1] == hyperlink_col_index]
    data['hyperlinks'] = required_links
    return data
# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'
# read and combine data
required_files = os.listdir(input_data_dir)
combined_data = pd.DataFrame()
for file in required_files:
    curr_data = load_excel_to_df(input_data_dir + os.sep + file, hyperlink_col)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    combined_data = pd.concat([combined_data, curr_data], sort=False, ignore_index=True)
cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)
# writing data
writer = pd.ExcelWriter(output_data_dir + os.sep + output_filename, engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False)  # last column contains hyperlinks
workbook = writer.book
worksheet = writer.sheets[list(workbook.sheetnames.keys())[0]]
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)
for i in range(m):
    worksheet.write_url(i+1, hyperlink_col_index, combined_data.loc[i, cols[-1]], string=combined_data.loc[i, hyperlink_col])
writer.close()  # ExcelWriter.save() was removed in pandas 2.0
References:
reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html
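
The list comprehension in load_excel_to_df assumes that every data row has a hyperlink and that the map yields them in row order. A minimal sketch of a more defensive alignment is shown below, assuming the map is keyed by 0-based (row, col) tuples as in the answer. The helper name links_for_column and the FakeLink stand-in are hypothetical and only model the url_or_path attribute; the example.com URLs are placeholders.

```python
from collections import namedtuple

# Hypothetical stand-in for xlrd's hyperlink objects; only the
# url_or_path attribute used in the answer is modelled here.
FakeLink = namedtuple("FakeLink", ["url_or_path"])


def links_for_column(hyperlink_map, col_index, n_data_rows, header_rows=1):
    """Return one URL (or None) per data row for a single column.

    hyperlink_map is assumed to be keyed by 0-based (row, col) tuples.
    """
    links = [None] * n_data_rows
    for (row, col), link in hyperlink_map.items():
        data_row = row - header_rows  # skip the header row(s)
        if col == col_index and 0 <= data_row < n_data_rows:
            links[data_row] = link.url_or_path
    return links


# Sample map: sheet row 0 is the header, sheet row 2 has no hyperlink.
sample_map = {
    (1, 1): FakeLink("https://example.com/ES2866911"),
    (3, 1): FakeLink("https://example.com/AR118706"),
    (1, 0): FakeLink("https://example.com/other-column"),
}
aligned = links_for_column(sample_map, col_index=1, n_data_rows=3)
```

The result has exactly one entry per DataFrame row, so assigning it to a column cannot fail with a length mismatch when some cells have no link.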

Without a reproducible example, the problem is not clear. Assume I have two files called tmp.xls and tmp2.xls containing dummy data.
Then pandas can easily load, concatenate, and convert them to .xlsx format without loss of hyperlinks. Here is some demo code:
import pandas as pd
f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')
f3 = pd.concat([f1, f2], ignore_index=True)
f3.to_excel('./f3.xlsx')

Inspired by @Kunal, I managed to write code that avoids using pandas. The .xls files are read with xlrd and written to a new Excel file with xlwt. Hyperlinks are maintained, and the output file is saved with an .xlsx extension (note that xlwt itself writes the legacy .xls format):
import os
import xlwt
from xlrd import open_workbook
# read and combine data
directory = "random_directory"
required_files = os.listdir(directory)
# Define new file and sheet to get files into
new_file = xlwt.Workbook(encoding='utf-8', style_compression=0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok=True)
# Initialize header row, can be done with any file
old_file = open_workbook(directory+"/"+required_files[0], formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in range(old_sheet.ncols):
    new_sheet.write(0, column, old_sheet.cell(0, column).value)  # To create header row
# Add rows from all files present in folder
for file in required_files:
    old_file = open_workbook(directory+"/"+file, formatting_info=True)
    old_sheet = old_file.sheet_by_index(0)  # Define old sheet
    hyperlink_map = old_sheet.hyperlink_map  # Create map of all hyperlinks
    for row in range(1, old_sheet.nrows):  # We need all rows except header row
        if row-1 < len(hyperlink_map.items()):  # Ensure we do not go out of range of hyperlink_map.items()
            Row_depth = len(new_sheet._Worksheet__rows)  # We need row depth to know where to add new row
            for col in range(old_sheet.ncols):  # For every column we need to add row cell
                if col == 1:  # Exception for column 2, the hyperlinked column (== instead of "is" for int comparison)
                    click = list(hyperlink_map.items())[row-1][1].url_or_path  # define URL
                    new_sheet.write(Row_depth, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(click, old_sheet.cell(row, 1).value)))
                else:  # If not hyperlinked column
                    new_sheet.write(Row_depth, col, old_sheet.cell(row, col).value)  # Write cell
new_file.save("random_directory/output_file.xlsx")
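
The HYPERLINK formula above is built with plain string formatting, which produces an invalid formula if the URL or the display text contains a double quote. A minimal sketch of a safer builder follows, assuming Excel's convention that a double quote inside a string literal is written as two double quotes. The helper name hyperlink_formula and the sample URL are hypothetical.

```python
def hyperlink_formula(url, label):
    """Build the text of an Excel HYPERLINK formula (no leading '=').

    Double quotes inside Excel string literals are escaped by doubling.
    """
    safe_url = str(url).replace('"', '""')
    safe_label = str(label).replace('"', '""')
    return 'HYPERLINK("{}", "{}")'.format(safe_url, safe_label)


# A label containing quotes, which would break naive formatting.
formula = hyperlink_formula("https://example.com/doc", 'ES2866911 "T3"')
```

In the loop above, the returned string would take the place of the formatted string passed to xlwt.Formula.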

I assume the same as daedalus in terms of the Excel files. Instead of pandas, I use openpyxl to read the files and create a new Excel file.
import openpyxl
wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb1['Sheet1']
wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb2['Sheet1']
hyperlink_dict = {}
# Go through first sheet to find the hyperlinks and keys.
for row in range(1, ws1.max_row + 1):
    link_cell = ws1.cell(row=row, column=2)
    # cell.hyperlink is None for cells without a link (e.g. the header)
    target = link_cell.hyperlink.target if link_cell.hyperlink else None
    hyperlink_dict[ws1.cell(row=row, column=1).value] = [target, link_cell.value]
# Go through second sheet to find hyperlinks and keys.
for row in range(1, ws2.max_row + 1):
    link_cell = ws2.cell(row=row, column=2)
    target = link_cell.hyperlink.target if link_cell.hyperlink else None
    hyperlink_dict[ws2.cell(row=row, column=1).value] = [target, link_cell.value]
Now you have all the data, so you can create a new workbook and save the values from the dict into it via openpyxl.
from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()
for key, (target, text) in hyperlink_dict.items():
    # use ws.append() to add the collected data
    ws.append([key, text, target])
wb.save('new_big_file.xlsx')
https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode

splitting an Excel workbook into multiple excel files

I have an Excel workbook with 29 different sheets. I used the following code to save each sheet as an individual Excel file:
from xlrd import open_workbook
from xlwt import Workbook
rb = open_workbook('c:\\original file.xls', formatting_info=True)
for a in range(5):  # for example, there are only 5 tabs/sheets
    rs = rb.sheet_by_index(a)
    new_book = Workbook()
    new_sheet = new_book.add_sheet('Sheet 1')
    for row in range(rs.nrows):
        for col in range(rs.ncols):
            new_sheet.write(row, col, rs.cell(row, col).value)
    new_book.save("c:\\" + str(a) + ".xls")
I got this code from stackoverflow.com/questions/28873252/python-splitting-an-excel-workbook. It worked well, but is there a way I could save the workbooks by sheet name, so that each file is named after its sheet? I tried replacing
new_book.save("c:\\" + str(a) + ".xls")
with
new_book.save(sheet.names + str(a) + ".xls")
But it didn't work.

If I understand your requirement correctly, you can use pandas with pd.ExcelFile, loop over the sheet names, and write each sheet to a file named after it.
import pandas as pd
xl = pd.ExcelFile('c:\\original file.xls')
for sheet in xl.sheet_names:
    df = pd.read_excel(xl, sheet_name=sheet)
    df.to_excel(f"{sheet}.xls", index=False)
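
For the xlrd/xlwt loop in the question, the attempt failed because sheet.names is not defined there; an xlrd sheet object exposes its name as rs.name. Whichever library is used, a sheet name may contain characters that are not valid in a file name. The sketch below shows one way to derive a safe file name, assuming Windows file-name rules; the helper name filename_for_sheet and the sample sheet names are hypothetical.

```python
import re

# Characters Windows does not allow in file names.
_INVALID_FILENAME_CHARS = re.compile(r'[<>:"/\\|?*]')


def filename_for_sheet(sheet_name, extension=".xls"):
    """Turn a sheet name into a file name, replacing unsafe characters."""
    cleaned = _INVALID_FILENAME_CHARS.sub("_", sheet_name).strip()
    return (cleaned or "sheet") + extension


names = [filename_for_sheet(n) for n in ["Summary", "Q1 <draft>", "a|b"]]

# In the question's loop this could be used as (assuming rs is an xlrd sheet):
#     new_book.save(filename_for_sheet(rs.name))
```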

How to create multiple excel sheets through Python?

I have an optimization problem that runs in a for loop. I want the results of each new iteration to be saved in a different tab in the same workbook.
This is what I'm doing. Instead of giving me multiple tabs in the same workbook, I'm getting multiple workbooks.
from openpyxl import Workbook
wb1 = Workbook()
for i in range(n):
    ws = wb1.active()
    ws.title = str(i)
    #code on formatting sheet, optimization problem
    wb1.save('outfile'+str(i)+'.xlsx')

Every iteration you are grabbing the same worksheet - ws = wb1.active() - and then simply saving your results to a different workbook.
You simply need to create a new sheet on each iteration. Something like this:
from openpyxl import Workbook
wb1 = Workbook()
for i in range(n):
    ws = wb1.create_sheet("run " + str(i))
    #code on formatting sheet, optimization problem
wb1.save('outfile.xlsx')
Notice that the save call is dedented out of the loop, so the file is saved once, after all worksheets have been formatted. It is not necessary to save on each iteration. The saving operation can take time, especially as more tabs are added.

This code creates an Excel workbook containing as many worksheets as there are lines in a text file taken as input. Here I have a text file named 'sample.txt' containing 3 strings, so the code creates 3 worksheets in a workbook named 'reformatted.data.xlsx' (openpyxl writes the .xlsx format).
I have also removed the default worksheet that is created automatically when the workbook object is created.
from openpyxl import Workbook
wb1 = Workbook()
# raw string so the backslashes in the Windows path are kept literally
with open(r'C:\Desktop\Mytestcases\sample.txt') as f:
    lines = f.readlines()
for i in range(len(lines)):
    ws = wb1.create_sheet("worksheet" + str(i))
    ws.cell(row=1, column=1).value = lines[i]
# remove the default sheet created with the workbook
sheet = wb1['Sheet']
wb1.remove(sheet)
wb1.save('reformatted.data.xlsx')

Insert a title at the beginning of the Excel worksheet

row = 5
column = 0
writer = pd.ExcelWriter(file_name, engine='openpyxl')
response = send_request('2017-2018-regular', item).content
df = pd.read_csv(io.StringIO(response.decode('utf-8')))
df.to_excel(writer, sheets, startrow=row, startcol=column, index=False)
I would like to put a simple title at the top of my Excel sheet, given that I am working with pandas and openpyxl. How could I do that? I want the title to be displayed at the top of the sheet (startrow=0, startcol=0). Please show me an example of how to do it.
I know the question "Write dataframe to excel with a title" is related, but I can't use it for the simple reason that the engine is different: I use the openpyxl library, and that answer uses xlsxwriter. What is the equivalent of write_string with pandas and openpyxl?

In openpyxl, rows and columns start at 1 instead of 0, so row=1, column=1 is the top-left cell (what would be (0, 0) with 0-based indexing), and that is where you need to start writing.
Check the following example.
from openpyxl import Workbook
wb = Workbook()
dest_filename = 'empty_book.xlsx'
ws1 = wb.active #first default sheet if you want to create new one use wb.create_sheet(title="xyz")
ws1.title = "Title set example"
for col in range(1, 10):
    ws1.cell(column=col, row=1, value="Title_{0}".format(col))
wb.save(filename = dest_filename)
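
The indexing difference is easy to get wrong when mixing pandas (0-based startrow and startcol) with openpyxl (1-based row and column). The following is a small sketch of the conversion, using only plain Python; the helper names to_one_based and column_letter are hypothetical and are not part of either library.

```python
def to_one_based(row0, col0):
    """Convert 0-based (row, col) indices to openpyxl's 1-based ones."""
    return row0 + 1, col0 + 1


def column_letter(col1):
    """Convert a 1-based column index to its Excel letter (1 -> 'A')."""
    letters = ""
    while col1 > 0:
        col1, remainder = divmod(col1 - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


row, col = to_one_based(0, 0)          # top-left cell of the sheet
top_left = f"{column_letter(col)}{row}"
data_row, _ = to_one_based(5, 0)       # the question's startrow=5
```

With startrow=5 as in the question, the DataFrame header lands on worksheet row 6, so rows 1 to 5 remain free for a title written with ws.cell(row=1, column=1, value=...).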

Can't save excel file using openpyxl

I'm having an issue with saving an Excel file in openpyxl.
I'm trying to create a processing script which would grab data from one Excel file and dump it into a dump Excel file; after some tweaking around with formulas in Excel, I will have all of the processed data in the dump Excel file. My current code is as follows.
from openpyxl import load_workbook
import os
import datetime
from openpyxl.cell import get_column_letter, Cell, column_index_from_string, coordinate_from_string
dump = dumplocation
desktop = desktoplocation
date = datetime.datetime.now().strftime("%Y-%m-%d")
excel = load_workbook(dump+date+ ".xlsx", use_iterators = True)
sheet = excel.get_sheet_by_name("Sheet1")
try:
    query = raw_input('How many rows of data is there?\n')
except ValueError:
    print 'Not a number'
#sheetname = raw_input('What is the name of the worksheet in the data?\n')
for filename in os.listdir(desktop):
    if filename.endswith(".xlsx"):
        print filename
        data = load_workbook(filename, use_iterators = True)
        ws = data.get_sheet_by_name(name = '17270115')
        #copying data from excel to data excel
        n=16
        for row in sheet.iter_rows():
            for cell in row:
                for rows in ws.iter_rows():
                    for cells in row:
                        n+=1
                        if (n>=17) and (n<=32):
                            cell.internal_value = cells.internal_value
#adding column between time in UTC and the data
column_index = 1
new_cells = {}
sheet.column_dimensions = {}
for coordinate, cell in sheet._cells.iteritems():
    column_letter, row = coordinate_from_string(coordinate)
    column = column_index_from_string(column_letter)
    # shifting columns
    if column >= column_index:
        column += 1
    column_letter = get_column_letter(column)
    coordinate = '%s%s' % (column_letter, row)
    # it's important to create new Cell object
    new_cells[coordinate] = Cell(sheet, column_letter, row, cell.value)
sheet.cells = new_cells
#setting columns to be hidden
for coordinate, cell in sheet._cells.iteritems():
    column_letter, row = coordinate_from_string(coordinate)
    column = column_index_from_string(column_letter)
    if (column<=3) and (column>=18):
        column.set_column(column, options={'hidden': True})
A lot of my code is messy, I know, since I just started Python two or three weeks ago. I also have a few outstanding issues which I can deal with later on.
It doesn't seem like a lot of people are using openpyxl for my purposes.
I tried using the normal Workbook class, but that didn't seem to work because you can't iterate over the cell items (which I need in order to copy and paste relevant data from one Excel file to another).
UPDATE: I realised that openpyxl can only create workbooks but can't edit current ones. So I have decided to change tack and edit the new workbook after I have transferred the data into it. I have gone back to using Workbook to transfer the data:
from openpyxl import Workbook
from openpyxl import worksheet
from openpyxl import load_workbook
import os
from openpyxl.cell import get_column_letter, Cell, column_index_from_string, coordinate_from_string
dump = "c:/users/y.lai/desktop/data/201501.xlsx"
desktop = "c:/users/y.lai/desktop/"
excel = Workbook()
sheet = excel.add_sheet
try:
    query = raw_input('How many rows of data is there?\n')
except ValueError:
    print 'Not a number'
#sheetname = raw_input('What is the name of the worksheet in the data?\n')
for filename in os.listdir(desktop):
    if filename.endswith(".xlsx"):
        print filename
        data = load_workbook(filename, use_iterators = True)
        ws = data.get_sheet_by_name(name = '17270115')
        #copying data from excel to data excel
        n=16
        q=0
        for x in range(6,int(query)):
            for s in range(65,90):
                for cell in Cell(sheet,chr(s),x):
                    for rows in ws.iter_rows():
                        for cells in rows:
                            q+=1
                            if q>=5:
                                n+=1
                                if (n>=17) and (n<=32):
                                    cell.value = cells.internal_value
But this still doesn't seem to work:
Traceback (most recent call last):
File "xxx\Desktop\xlspostprocessing.py", line 40, in <module>
for cell in Cell(sheet,chr(s),x):
File "xxx\AppData\Local\Continuum\Anaconda\lib\site-packages\openpyxl\cell.py", line 181, in __init__
self._shared_date = SharedDate(base_date=worksheet.parent.excel_base_date)
AttributeError: 'function' object has no attribute 'parent'
I went through the API, but I'm overwhelmed by the code in there, so I couldn't make much sense of it. To me it looks like I have used the Cell class wrongly. I read the definition of Cell and its attributes, hence the chr(s) to give the 26 letters A-Z.

You can iterate using the standard Workbook mode. use_iterators=True has been renamed read_only=True to emphasise what this mode is used for (on demand reading of parts).
Your code as it stands cannot work with this method as the workbook is read-only and cell.internal_value is always a read only property.
However, it looks like you're not getting that far because there is a problem with your Excel files. You might want to submit a bug with one of the files. Also the mailing list might be a better place for discussion.

You could try using xlrd and xlwt instead of openpyxl, but you might find that exactly what you are looking to do is already available in xlutils. All three are from python-excel.
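
As for the traceback itself, a likely cause is the line sheet = excel.add_sheet in the updated code: without parentheses, the name is bound to the method rather than to a worksheet, so Cell(sheet, ...) later fails when it looks up worksheet.parent. The toy classes below are hypothetical stand-ins, not openpyxl's classes; they only illustrate the difference between referencing a method and calling it.

```python
class ToySheet:
    """Hypothetical stand-in for a worksheet that knows its workbook."""

    def __init__(self, parent):
        self.parent = parent


class ToyWorkbook:
    """Hypothetical stand-in for a workbook; not openpyxl's class."""

    def create_sheet(self):
        return ToySheet(parent=self)


book = ToyWorkbook()
not_a_sheet = book.create_sheet      # missing (): this is the method itself
real_sheet = book.create_sheet()     # calling it returns a sheet object

has_parent_without_call = hasattr(not_a_sheet, "parent")
has_parent_with_call = hasattr(real_sheet, "parent")
```

Only the object returned by the call has a parent attribute, which is what the Cell constructor in the traceback was trying to read.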
