I receive a huge Excel sheet (a normal table with a header and data) on a regular basis, and I need to filter and delete some data and split the table into separate sheets based on some rules. I think I can save some time by using Python for this tedious task, because the filtering, deleting and splitting always follow the same rules, which can be defined logically.
Unfortunately, the sheet and its data are partially color-coded (cells and font), and I need to maintain this formatting in the resulting sheets. Is there a way of doing that with Python? I think I need a pointer in the right direction. I only found workarounds with pandas, but those do not allow me to keep the formatting.
You can take a look at an excellent Python library for Excel called openpyxl.
Here's how you can use it.
First, install it through your command prompt using:
pip install openpyxl
Open an existing file:
import openpyxl
wb_obj = openpyxl.load_workbook(path)  # Open the workbook; path is the location of the .xlsx file
Deleting rows:
from openpyxl import load_workbook
wb = load_workbook(path)
ws = wb.active
ws.delete_rows(7)  # Delete row 7 (row numbers are 1-based)
Inserting rows:
from openpyxl import load_workbook
wb = load_workbook(path)
ws = wb.active
ws.insert_rows(7)  # Insert an empty row before row 7
Here are some tutorials that you can take a look at:
Tutorial 1
Youtube Video
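For the splitting part of the question, here is a minimal sketch of how rows could be distributed into one sheet per key value while carrying the cell styling along. It assumes the rule is "one sheet per distinct value in a key column" and that rows without a key are dropped; both rules, and the helper names copy_row and split_by_column, are hypothetical and only there for illustration. Sheet titles are taken from the key values as-is, so values longer than 31 characters or containing characters Excel forbids in sheet names would need extra handling.

```python
from copy import copy

from openpyxl import Workbook


def copy_row(src_row, dst_ws, dst_row_idx):
    """Copy the values and basic styling of one row into dst_ws."""
    for cell in src_row:
        new_cell = dst_ws.cell(row=dst_row_idx, column=cell.column, value=cell.value)
        if cell.has_style:
            # Style objects must be copied, not shared between cells.
            new_cell.font = copy(cell.font)
            new_cell.fill = copy(cell.fill)
            new_cell.border = copy(cell.border)
            new_cell.alignment = copy(cell.alignment)
            new_cell.number_format = cell.number_format


def split_by_column(wb, source_title, key_col):
    """Create one sheet per distinct value in key_col (1-based), with the header."""
    src = wb[source_title]
    rows = src.iter_rows()
    header = next(rows)
    next_free = {}  # sheet title -> next empty row index in that sheet
    for row in rows:
        key = row[key_col - 1].value
        if key is None:
            continue  # hypothetical rule: drop rows that have no key
        title = str(key)
        if title not in next_free:
            copy_row(header, wb.create_sheet(title), 1)
            next_free[title] = 2
        copy_row(row, wb[title], next_free[title])
        next_free[title] += 1
    return sorted(next_free)
```

For a real file you would call load_workbook(path) instead of building a Workbook, run split_by_column(wb, "Sheet1", 1), and then wb.save() under a new name. Column widths, merged cells and conditional formatting are not copied by this sketch.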
I have an Excel file with a specific format, font, color, etc. I would like to read it using a Python library like pandas (pd.read_excel) and modify just a few cells without affecting the style. Is it possible? Currently, when I read and write using pandas, the style changes, and it seems difficult to recreate the complex style in Python.
Is there a way to load and store the format/style of Excel file when we are reading it, to be applied when it is being saved? I just want to modify the value of few cells.
You can use the openpyxl library like this:
from openpyxl import Workbook, load_workbook
workbook = load_workbook("test.xlsx") # Your Excel file
worksheet = workbook.active # gets first sheet
for row in range(1, 10):
    # Writes a new value PRESERVING cell styles.
    worksheet.cell(row=row, column=1, value=f'NEW VALUE {row}')
workbook.save("test.xlsx")
You can use the set_properties() function of the pandas Styler (df.style.set_properties()).
Further use can be viewed at How to change the font-size of text in dataframe using pandas
Apologies for no coding provided, this is really a generic question.
I'm using Python xlwings library, and trying to copy a sheet from one workbook to another new workbook, then hard-code the sheet in the newly created workbook. Effectively same as "Copy / Paste Values and source formatting".
I wasn't able to find any documentation on this, and thank you in advance for your help!
Edit: someone mentioned that I should include an example. Here it is, but it's kind of hard to show the format of an Excel file. The following code will copy/paste "sht" into a new workbook, but "new_sht" will contain formulas. I'm trying to hard-code all the values while preserving the number format (e.g. thousands separator, percentage sign, etc.).
import xlwings as xw
wb = xw.Book('example1.xlsx')
sht = wb.sheets['sheet1']
new_wb = xw.Book()
new_sht = new_wb.sheets[0]
sht.api.Copy(Before = new_sht.api)
Answering my own question as I just figured out what I wanted to accomplish.
The following code will hard-code the values while preserving the formatting, since it essentially pastes values only into an already formatted area.
new_sht.range('A1:C10').value = new_sht.range('A1:C10').value
What is the best way to read Excel (XLS) files with Python (not CSV files).
Is there a built-in package which is supported by default in Python to do this task?
I highly recommend xlrd for reading .xls files, but it has some limitations (refer to the xlrd GitHub page):
Warning
This library will no longer read anything other than .xls files. For
alternatives that read newer file formats, please see
http://www.python-excel.org/.
The following are also not supported but will safely and reliably be
ignored:
- Charts, Macros, Pictures, any other embedded object, including embedded worksheets.
- VBA modules
- Formulas, but results of formula calculations are extracted.
- Comments
- Hyperlinks
- Autofilters, advanced filters, pivot tables, conditional formatting, data validation
Password-protected files are not supported and cannot be read by this
library.
voyager mentioned the use of COM automation. Having done this myself a few years ago, be warned that doing this is a real PITA. The number of caveats is huge and the documentation is lacking and annoying. I ran into many weird bugs and gotchas, some of which took many hours to figure out.
UPDATE:
For newer .xlsx files, the recommended library for reading and writing appears to be openpyxl (thanks, Ikar Pohorský).
You can use pandas to do this. First install the required libraries (xlrd is the engine for .xls files, openpyxl for .xlsx):
$ pip install pandas xlrd openpyxl
See code below:
import pandas as pd
xls = pd.ExcelFile(r"yourfilename.xls") # use r before absolute file path
sheetX = xls.parse(2)  # 2 is the zero-based sheet index, i.e. the third sheet; use 0 if the file has only one sheet
var1 = sheetX['ColumnName']
print(var1[1]) #1 is the row number...
You can choose any one of them http://www.python-excel.org/
I would recommend the Python xlrd library.
install it using
pip install xlrd
import using
import xlrd
to open a workbook
workbook = xlrd.open_workbook('your_file_name.xls')  # xlrd 2.0+ reads only .xls files
open sheet by name
worksheet = workbook.sheet_by_name('Name of the Sheet')
open sheet by index
worksheet = workbook.sheet_by_index(0)
read cell value
worksheet.cell(0, 0).value
I think Pandas is the best way to go. There is already one answer here with Pandas using ExcelFile function, but it did not work properly for me. From here I found the read_excel function which works just fine:
import pandas as pd
dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name")
print(dfs.head(10))
P.S. You need to have xlrd installed for the read_excel function to work.
Update 21-03-2020: As you may see here, there are issues with the xlrd engine and it is going to be deprecated. openpyxl is the best replacement. So, as described here, the canonical syntax should be:
dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name", engine="openpyxl")
For xlsx I like the solution posted earlier at https://web.archive.org/web/20180216070531/https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python. It uses modules from the standard library only.
def xlsx(fname):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
        if el.tag.endswith('}v'):  # Example: <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # Example: <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            letter = el.attrib['r']  # Example: AZ22
            while letter[-1].isdigit():
                letter = letter[:-1]
            row[letter] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows
Improvements added are fetching content by sheet name, using re to get the column and checking if sharedstrings are used.
def xlsx(fname, sheet):
    import zipfile
    from xml.etree.ElementTree import iterparse
    import re
    z = zipfile.ZipFile(fname)
    strings = []  # stays empty if the workbook has no shared strings
    if 'xl/sharedStrings.xml' in z.namelist():
        # Get shared strings
        strings = [element.text for event, element
                   in iterparse(z.open('xl/sharedStrings.xml'))
                   if element.tag.endswith('}t')]
    # Map sheet names to sheet ids (assumes the file is named sheet<sheetId>.xml)
    sheets = {element.attrib['name']: element.attrib['sheetId']
              for event, element in iterparse(z.open('xl/workbook.xml'))
              if element.tag.endswith('}sheet')}
    rows = []
    row = {}
    value = ''
    if sheet in sheets:
        sheetfile = 'xl/worksheets/sheet' + sheets[sheet] + '.xml'
        #print(sheet,sheetfile)
        for event, element in iterparse(z.open(sheetfile)):
            # get value or index to shared strings
            if element.tag.endswith('}v') or element.tag.endswith('}t'):
                value = element.text
            # If value is a shared string, use value as an index
            if element.tag.endswith('}c'):
                if element.attrib.get('t') == 's':
                    value = strings[int(value)]
                # split the row/col information so that the column letter(s) can be separated
                letter = re.sub(r'\d', '', element.attrib['r'])
                row[letter] = value
                value = ''
            if element.tag.endswith('}row'):
                rows.append(row)
                row = {}
    return rows
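One caveat with building the file name from sheetId: the sheet id and the number in sheetN.xml do not have to match, for example after sheets were deleted or reordered. The workbook records the actual part for each sheet through a relationship id. Below is a rough sketch, still standard library only, of how that lookup could be done. The function name sheet_paths is made up for this example, and the namespaces assume a regular (transitional) .xlsx file.

```python
import posixpath
import zipfile
from xml.etree import ElementTree as ET

# Namespaces used by workbook.xml and its relationships part.
NS_MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_PKG = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheet_paths(z):
    """Map each sheet name to the path of its XML part inside the archive."""
    rels = ET.parse(z.open("xl/_rels/workbook.xml.rels")).getroot()
    targets = {rel.attrib["Id"]: rel.attrib["Target"]
               for rel in rels.iter("{%s}Relationship" % NS_PKG)}
    workbook = ET.parse(z.open("xl/workbook.xml")).getroot()
    paths = {}
    for sheet in workbook.iter("{%s}sheet" % NS_MAIN):
        target = targets[sheet.attrib["{%s}id" % NS_REL]]
        if target.startswith("/"):
            # Absolute target: relative to the archive root.
            paths[sheet.attrib["name"]] = target.lstrip("/")
        else:
            # Relative target: relative to the xl/ folder.
            paths[sheet.attrib["name"]] = posixpath.normpath(
                posixpath.join("xl", target))
    return paths
```

With this, sheet_paths(z)[sheet] could stand in for the hand-built sheetfile string in the function above.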
If you need the old XLS format, the code below handles files in the ANSI 'cp1251' encoding.
import xlrd
file=u'C:/Landau/task/6200.xlsx'
try:
    book = xlrd.open_workbook(file,encoding_override="cp1251")
except:
    book = xlrd.open_workbook(file)
print("The number of worksheets is {0}".format(book.nsheets))
print("Worksheet name(s): {0}".format(book.sheet_names()))
sh = book.sheet_by_index(0)
print("{0} {1} {2}".format(sh.name, sh.nrows, sh.ncols))
print("Cell D30 is {0}".format(sh.cell_value(rowx=29, colx=3)))
for rx in range(sh.nrows):
    print(sh.row(rx))
For older .xls files, you can use xlrd.
You can either use xlrd directly by importing it, like below:
import xlrd
wb = xlrd.open_workbook(file_name)
Or you can use the pandas pd.read_excel() method. xlrd is the default engine for .xls files, but it is safer to specify it explicitly; note that the engine name is passed as a string:
import pandas as pd
pd.read_excel(file_name, engine="xlrd")
Both of them work for the older .xls file format.
In fact, I came across this when I used openpyxl and got the error below:
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
You can use any of the libraries listed here (like Pyxlreader that is based on JExcelApi, or xlwt), plus COM automation to use Excel itself for the reading of the files, but for that you are introducing Office as a dependency of your software, which might not be always an option.
You might also consider running the (non-python) program xls2csv. Feed it an xls file, and you should get back a csv.
Python Excelerator handles this task as well. http://ghantoos.org/2007/10/25/python-pyexcelerator-small-howto/
It's also available in Debian and Ubuntu:
sudo apt-get install python-excelerator
with open(csv_filename) as file:
    data = file.read()
with open(xl_file_name, 'w') as file:
    file.write(data)
You can copy CSV data into a file that Excel will open, as above, using only built-in functionality. Note that this copies the text unchanged, so the result is still CSV rather than a real workbook. CSV can be handled with the built-in csv module's DictReader and DictWriter, which work much like Python dictionaries and make this very easy.
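As a small illustration of the DictReader/DictWriter part, here is a sketch that keeps only the rows matching one value. The function name filter_rows and the sample columns are made up for this example; it works on strings so it can be tried without any files.

```python
import csv
import io


def filter_rows(csv_text, column, wanted):
    """Return CSV text containing only the rows where row[column] == wanted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:  # each row is a dict keyed by the header names
        if row[column] == wanted:
            writer.writerow(row)
    return out.getvalue()


sample = "name,city\nAnn,Oslo\nBob,Rome\n"
print(filter_rows(sample, "city", "Rome"))
```

For real files you would pass open file objects to DictReader and DictWriter instead of StringIO, opening them with newline="" as the csv documentation recommends.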
I am not aware of any built-in packages for Excel, but I came across openpyxl, which is straightforward and simple. See the code snippet below; I hope it helps.
import openpyxl
book = openpyxl.load_workbook(filename)
sheet = book.active
result = sheet['AP2']
print(result.value)
For older Excel files there is the OleFileIO_PL module that can read the OLE structured storage format used.
If the file is really an old .xls, this works for me on python3 just using base open() and pandas:
df = pandas.read_csv(open(f, encoding = 'UTF-8'), sep='\t')
Note that the file I'm using is tab delimited. less or a text editor should be able to read .xls so that you can sniff out the delimiter.
I did not have a lot of luck with xlrd because of – I think – UTF-8 issues.
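If you are not sure what a given ".xls" really is, you can also look at its first bytes instead of opening it in a text editor. This is only a rough sketch: the function name sniff_excel_format and the three labels it returns are my own choice, and a ZIP signature only suggests .xlsx, since any other ZIP archive starts the same way.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                       # .xlsx is a ZIP archive


def sniff_excel_format(path):
    """Guess the real format of a file from its first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Anything else, e.g. tab-delimited text or HTML saved with an .xls extension.
    return "text"
```

A "text" result is the case where read_csv with the right separator is the tool to use, while "xls" points to xlrd and "xlsx" to openpyxl.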
I'm trying to write some dates from one excel spreadsheet to another. Currently, I'm getting a representation in excel that isn't quite what I want such as this: "40299.2501157407"
I can get the date to print out fine to the console, however it doesn't seem to work right writing to the excel spreadsheet -- the data must be a date type in excel, I can't have a text version of it.
Here's the line that reads the date in:
date_ccr = xldate_as_tuple(sheet_ccr.cell(row_ccr_index, 9).value, book_ccr.datemode)
Here's the line that writes the date out:
row.set_cell_date(11, datetime(*date_ccr))
There isn't anything being done to date_ccr in between those two lines other than a few comparisons.
Any ideas?
You can write the floating point number directly to the spreadsheet and set the number format of the cell. Set the format using the num_format_str of an XFStyle object when you write the value.
https://secure.simplistix.co.uk/svn/xlwt/trunk/xlwt/doc/xlwt.html#xlwt.Worksheet.write-method
The following example writes the date 01-05-2010. (Also includes time of 06:00:10, but this is hidden by the format chosen in this example.)
import xlwt
# d can also be a datetime object
d = 40299.2501157407
wb = xlwt.Workbook()
sheet = wb.add_sheet('new')
style = xlwt.XFStyle()
style.num_format_str = 'DD-MM-YYYY'
sheet.write(5, 5, d, style)
wb.save('test_new.xls')
There are examples of number formats (num_formats.py) in the examples folder of the xlwt source code. On my Windows machine: C:\Python26\Lib\site-packages\xlwt\examples
You can read about how Excel stores dates (third section on this page): https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html
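To make the serial number less mysterious, here is a small standard-library sketch of the conversion for the default 1900 date system. The name serial_to_datetime is made up for this example. The epoch of 30 December 1899 compensates for Excel's fictitious 29 February 1900, so the function is only meant for serials of 61 (1 March 1900) and later, and it does not cover workbooks that use the 1904 date system.

```python
from datetime import datetime, timedelta

# Day 0 for the 1900 date system once Excel's fictitious 29 Feb 1900 is absorbed.
EXCEL_EPOCH_1900 = datetime(1899, 12, 30)


def serial_to_datetime(serial):
    """Convert an Excel serial number (1900 date system, serial >= 61) to a datetime."""
    moment = EXCEL_EPOCH_1900 + timedelta(days=serial)
    # Round to the nearest second to hide floating-point noise in the fraction.
    moment += timedelta(microseconds=500_000)
    return moment.replace(microsecond=0)


print(serial_to_datetime(40299.2501157407))  # 2010-05-01 06:00:10
```

The whole-number part counts days and the fractional part is the time of day, which is why the value in the question shows up as 1 May 2010, 06:00:10 once a date format is applied.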