Failing to open an Excel file with Python


I'm on a Debian GNU/Linux computer, working with Python 2.7.9.
As part of my job, I write Python scripts that read inputs in various formats (e.g. Excel, CSV, TXT) and parse the information into more standardized files. It's not my first time opening or working with Excel files.
There's one particular file that is giving me problems: I just can't open it. When I tried with xlrd (version 0.9.3), it gave me the following error:
xlrd.open_workbook('sample.xls')
XLRDError: Unsupported format, or corrupt file: BOF not workbook/worksheet: op=0x0009 vers=0x0002 strm=0x000a build=0 year=0 -> BIFF21
I tried to investigate the matter on my own and found a couple of answers on Stack Overflow, but I still couldn't open the file. The second explanation in this answer may describe the problem, but it doesn't include a workaround: https://stackoverflow.com/a/16518707/4345659
A tool that could convert the file to csv/txt would also solve the problem.
I already tried with:
xlrd
openpyxl
xlsx2csv (the shell tool)
A sample file is available here:
https://ufile.io/r4m6j
As a side note, I can open it with LibreOffice Calc and MS Excel, so I could eventually change it to csv that way. The thing is, I need to do it all with a python script.
Thanks in advance!

It seems like a Microsoft problem. The xls file is very strange; maybe you should contact xlrd support.
But I have a crazy workaround for you: xls2ods. It works for me even though xls2csv doesn't (sic!).
So, install catdoc first:
$ sudo apt-get install catdoc
Then convert your xls file to ods and open the ods file using pyexcel_ods or whatever you prefer. To use pyexcel_ods, install it first with pip install pyexcel_ods.
import subprocess
import sys
from pyexcel_ods import get_data
file_basename = 'sample'
# run the external converter and keep its exit status
returncode = subprocess.call(['xls2ods', '{}.xls'.format(file_basename)])
if returncode > 0:
    # consider using subprocess.Popen if you need more control over stderr
    sys.exit(returncode)
data = get_data('{}.ods'.format(file_basename))
print(data)
I'm getting the following output:
OrderedDict([(u'sample',
[[u'labo',
u'codfarm',
u'farmacia',
u'direccion',
u'localidad',
u'nom_medico',
u'matricula',
u'troquel',
u'producto',
u'cant_total']])])

Here is a kludge I would use:
Assuming you have LibreOffice on Debian, you could either convert all your *.xls files into *.csv using:
import os
os.system("libreoffice --headless --convert-to csv *.xls")
# or use subprocess.call
... and then work consistently with csv.
Or you could convert only the corrupted file(s) when needed using a try/except block:
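import os
import xlrd
from xlrd import XLRDError
try:
    xlrd.open_workbook('sample.xls')
except XLRDError:
    # xlrd cannot read the file, so fall back to a LibreOffice conversion
    os.system("libreoffice --headless --convert-to csv sample.xls")
    # mycsv = open("sample.csv", "r")
    # for line in mycsv.readlines():
    #     ...
    #     ...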
Note: keep LibreOffice closed while running the script.
Alternatively there are other tools out there to do the conversion. Here is one (which I have not tested): https://github.com/dilshod/xlsx2csv
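The LibreOffice fallback above can also be driven through subprocess, so that a failed conversion is noticed instead of being silently ignored. The following is only a minimal Python 3 sketch: the helper names build_convert_command and convert_to_csv are made up for illustration, and it assumes the libreoffice binary is on the PATH and accepts --outdir alongside the flags used above.

```python
import os
import subprocess


def build_convert_command(source, outdir, binary="libreoffice"):
    """Return the argument list for a headless CSV conversion."""
    return [binary, "--headless", "--convert-to", "csv", "--outdir", outdir, source]


def convert_to_csv(source, outdir="."):
    """Convert one spreadsheet and return the path of the CSV it should produce."""
    # check_call raises CalledProcessError on a non-zero exit status
    subprocess.check_call(build_convert_command(source, outdir))
    stem = os.path.splitext(os.path.basename(source))[0]
    target = os.path.join(outdir, stem + ".csv")
    if not os.path.exists(target):
        # a zero exit status alone may not guarantee that a file was written
        raise RuntimeError("no output produced for %s" % source)
    return target
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, which os.system does not handle on its own.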

If you are targeting Windows, have Excel installed, and are familiar with Excel VBA, you will have a quick solution using the comtypes package:
http://pythonhosted.org/comtypes/
You will have direct access to Excel by its COM interfaces.

This code opens an xls file and saves it as a csv file, using the comtypes package:
import comtypes.client as cl
progId = "Excel.Application.15"  # adjust to your Excel version
xl = cl.CreateObject(progId)
wb = xl.Workbooks.Open(r"C:\Users\aUser\Desktop\thermoList.xls")
wb.SaveAs(r"C:\Users\aUser\Desktop\thermoList.csv", FileFormat=6)  # 6 = xlCSV
xl.DisplayAlerts = False  # suppress prompts when quitting
xl.Quit()
I could not test it with "sample.xls", which is corrupt.
You could try it with another file.
You might need to adjust the progId according to your version of Excel.

It's a file format issue. I'm not sure what file type it is, but it's not Excel. I just opened the file, saved it under the name sample2.xls, and compared the types:
How are you creating this file?
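One way to compare file types from Python is to look at the first bytes of each file. The sketch below (Python 3) is an illustration, not a full format detector: the function names are hypothetical, and it only knows the OLE2 and ZIP signatures plus the BOF opcodes of the old raw-stream BIFF versions. The op=0x0009 reported by xlrd in the question suggests the last case.

```python
import struct

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound file, typical for .xls
ZIP_MAGIC = b"PK\x03\x04"                          # zip container, e.g. .xlsx or .ods
# BOF record opcodes of the old BIFF versions stored as a raw record stream
RAW_BIFF_BOF = {0x0009: "BIFF2", 0x0209: "BIFF3", 0x0409: "BIFF4"}


def sniff_spreadsheet(header):
    """Guess the container format from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "OLE2"
    if header.startswith(ZIP_MAGIC):
        return "ZIP"
    if len(header) >= 2:
        # record opcodes are stored as little-endian 16-bit integers
        (opcode,) = struct.unpack("<H", header[:2])
        if opcode in RAW_BIFF_BOF:
            return RAW_BIFF_BOF[opcode]
    return "unknown"


def sniff_file(path):
    """Read the first eight bytes of path and classify them."""
    with open(path, "rb") as handle:
        return sniff_spreadsheet(handle.read(8))
```

Running sniff_file on both sample.xls and the re-saved sample2.xls would show whether the two files really use different containers.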

If you need to get the words as a list of strings:
# Python 2: read the raw bytes, then strip control bytes and other noise
with open("sample.xls", "rb") as text_file:
    lines = text_file.read()
for code in (200, 0, 1, 5, 2, 3, 4, 6, 7, 8, 9, 10, 12, 15, 16, 17, 18, 49):
    lines = lines.replace(chr(code), '')
lines = lines.replace('Arial', '')
for line in lines.split(chr(128)):
    print(line)
the output:

The file you provided is corrupted, so there is no way for other responders to test it and recommend a good solution. The exception you posted confirms that.
As a solution, you can try to debug a few things; please see the steps below:
You mentioned you tried the xlrd library. Check whether your xlrd module is up to date by executing this:
Python 2.7.9
>>> import xlrd
>>> xlrd.__VERSION__
Update to the latest official version if needed.
Try to open any other *.xls file and see if it works with the Python version you're using and the current library.
Check the module documentation; it's pretty good, and it describes some differences in how to use this module on various platforms (Windows vs. Linux): http://xlrd.readthedocs.io/en/latest/dates.html
You can always reach out to the community (there is still a chance that you have hit some weird state or bug); the link is here: https://github.com/python-excel/xlrd/issues
Hope that helps.

I was unable to open your Excel file either. Just as yadayada said, I think the problem lies in the data source. If you really want to figure out the reason, I suggest you ask about the Excel file itself rather than about Python.

This always works for me with any xls or xlsx file:
import unicodecsv
import xlrd
def csv_from_excel(filename_xls, filename_csv):
    # replace 'cp1251' with the encoding of your own file
    wb = xlrd.open_workbook(filename_xls, encoding_override='cp1251')
    sh = wb.sheet_by_index(0)
    your_csv_file = open(filename_csv, 'wb')
    wr = unicodecsv.writer(your_csv_file)
    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    your_csv_file.close()
So I don't work directly with Excel files; I convert them to csv first. Maybe it will help you.

Related

Can't keep or/and load image in Excel using openpyxl after bundling the application with pyinstaller

Using the openpyxl library, I am loading an Excel, adding values in it and I save it in another file.
It is working nicely.
Problem: when I use pyinstaller to bundle the application, the images are no longer loaded or kept in the file when the new file is saved.
PS:
I tried leaving the images in the file; it doesn't work.
I tried loading the images with openpyxl; it doesn't work either.
import openpyxl
wb = openpyxl.load_workbook('input_file.xlsx')
ws = wb['excel_tab_name']
image = openpyxl.drawing.image.Image('my_image.jpg')
image.anchor = 'A1'
ws.add_image(image)
wb.save('output_file.xlsx')
The real problem is that it works when I use only openpyxl, but when I bundle it, no image can be kept or loaded in the file.
I am ready to use another library to handle this problem if needed! :)

Just wrap the call in try/except and look at the error.
If it's "you must install pillow to fetch image objects",
use pip install pillow
And that's it!
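To find out whether Pillow is actually visible to the running program (for example from inside the bundled application), a check along these lines could be logged at startup. This is a small Python 3 sketch; the function name pillow_available is made up for illustration, and it is an assumption that the bundled interpreter resolves modules through the standard import machinery.

```python
import importlib.util


def pillow_available():
    """Return True if the PIL package (installed by Pillow) can be found."""
    return importlib.util.find_spec("PIL") is not None


if __name__ == "__main__":
    # print the result so it shows up in the console of the bundled app
    print("Pillow importable:", pillow_available())
```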

I modified your code like this and it worked. Does this answer your question?
import openpyxl
wb = openpyxl.load_workbook('input_path.xlsx')
ws = wb['excel_tab_name']
image = openpyxl.drawing.image.Image('my_image.jpg')
image.anchor = 'A1'
ws.add_image(image)
wb.save('output_path.xlsx')

Convert XLSX to XLS and Preserve hidden rows

I am using the Python solution from here to convert an XLSX file to XLS; however, it ignores the fact that some rows are hidden. Is there a way to have it copy only the rows that are visible in my source XLSX file?
Here is my code:
import pyexcel as p
p.save_book_as(file_name='Source.xlsx', dest_file_name='Destination.xls')

Short Answer
Please use skip_hidden_row_and_column=True, as in the pyexcel-xlsx test code:
p.save_book_as(file_name='Source.xlsx',
               library='pyexcel-xlsx',  # <--- note 1
               skip_hidden_row_and_column=True,  # <--- note 2
               dest_file_name='Destination.xls')
To get pyexcel-xlsx, you can use pip:
pip install pyexcel-xlsx
Explanation/Long Answer
pyexcel-xls (xlrd) supports hidden rows for the xls format but not for xlsx. That's why note 1 asks pyexcel to use pyexcel-xlsx to read the xlsx file instead.
This flag is described in the pyexcel-xlsx README; True means hidden rows are ignored.
Why library? It is specific to save_as, save_book_as, isave_as and isave_book_as. In these functions, both a reader and a writer are involved. library tells pyexcel which library to use to read a file, whereas dest_library tells pyexcel which library to use to write a file.
These are documented; for example, see the save_as page and search for library.

Python vba extract to get bin of macro

I am trying to add a vba_project to "Sheet1" of a workbook using python.
I am following XLSXWRITER documentation to get the bin of the VBA code from a different sheet which I would want to use in "Sheet1" of my new workbook.
I enter the below code in command prompt but I get the error: "'vba_extract.py' is not recognized as an internal or external command"
$ vba_extract.py Book1.xlsm
Extracted: vbaProject.bin
Can someone give me a step by step on how to extract the macro from old file as bin and then input into sheet1 of new workbook using python?

You have to tell cmd that you're running a Python file.
Try this batch code:
cd C:\path\of\your\folder
python vba_extract.py Book1.xlsm
Edit:
Added the cd command; you have to be in the folder that contains the Python file.

I figured this out today and just wanted to leave it here for future readers. It was unbelievably frustrating to work out. If you are using the pandas library, this is also relevant. Make sure to install xlsxwriter as well.
1. Click the Windows Start button, type 'cmd' and click on it to run the Command Prompt.
2. Once you have it open, you need to locate the vba_extract.py file. For me it was here:
C:\Users\yourusername\AppData\Local\Programs\Python\Python36-32\Scripts\vba_extract.py
3. Now you need the path of the .xlsm file you want to extract from. If you don't have an .xlsm file, make one. Here is an example:
C:\Users\yourusername\Desktop\excelfilename.xlsm
4. Now, back to the Command Prompt. Combine the two items from steps 2 and 3 on one line and hit Enter. This is exactly what you will type:
C:\Users\yourusername\AppData\Local\Programs\Python\Python36-32\Scripts\vba_extract.py C:\Users\yourusername\Desktop\excelfilename.xlsm
If it is successful, it will tell you this:
Extracted: vbaProject.bin
5. I'm not sure about this one. I assume that the .bin file ends up wherever your .xlsm file is. In this example, it ended up on my desktop. It will contain all the macros you created or had in the original .xlsm file.
C:\Users\yourusername\Desktop\vbaProject.bin
Here is an example of it being used in full code:
import pandas as pd
import xlsxwriter
df_new = pd.read_csv('C:\\Users\\yourusername\\Desktop\\CSV1.csv')
writer = pd.ExcelWriter('C:\\Users\\yourusername\\Desktop\\CSV1.xlsx', engine='xlsxwriter')
df_new.to_excel(writer, index=False, sheet_name='File Name', header=False)
pandaswb = writer.book
# the output needs the .xlsm extension so that it can carry the macros
pandaswb.filename = 'C:\\Users\\yourusername\\Desktop\\newmacroexcelfile.xlsm'
pandaswb.add_vba_project(r'C:\Users\yourusername\Desktop\vbaProject.bin')
writer.save()
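Steps 2 to 4 above can also be scripted, so that the location of vba_extract.py does not have to be typed by hand. The following is a rough Python 3 sketch; vba_extract_command is a hypothetical helper, and it assumes that vba_extract.py was installed into the interpreter's Scripts directory, as in the path shown in step 2.

```python
import os
import sys
import sysconfig


def vba_extract_command(xlsm_path, scripts_dir=None):
    """Build a command that runs vba_extract.py with the current interpreter."""
    if scripts_dir is None:
        # directory where pip places console scripts for this interpreter
        scripts_dir = sysconfig.get_path("scripts")
    script = os.path.join(scripts_dir, "vba_extract.py")
    return [sys.executable, script, xlsm_path]
```

The resulting list can be passed to subprocess.check_call. Giving that call an explicit cwd argument is a reasonable precaution if you want to control where vbaProject.bin ends up, although that behaviour is an assumption here rather than something confirmed above.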

Python Convert Excel tabs to CSV files

I've edited the post to reflect the changes recommended.
def Excel2CSV(ExcelFile, Sheetname, CSVFile):
    import xlrd
    import csv
    workbook = xlrd.open_workbook('C:\Users\Programming\consolidateddataviewsyellowed.xlsx')
    worksheet = workbook.sheet_by_name (ARC)
    csvfile = open (ARC.csv,'wb')
    wr = csv.writer (csvfile, quoting = csv.QUOTE_ALL)
    for rownum in xrange (worksheet.nrows):
        wr.writerow(
            list(x.encode('utf-8') if type (x) == type (u'') else x
                 for x in worksheet.row_values (rownum)))
    csvfile.close()
Excel2CSV("C:\Users\username\Desktop\Programming\consolidateddataviewsyellowed.xlsx","ARC","output.csv")
It displays the following error.
Traceback (most recent call last):
File "C:/Programming/ExceltoCSV.py", line 18, in <module>
File "C:/Programming/ExceltoCSV.py", line 2, in Excel2CSV
import xlrd
ImportError: No module named xlrd
Any help would be greatly appreciated.

Response to edited code
No module named xlrd indicates that you have not installed the xlrd library. Bottom line, you need to install the xlrd module. Installing a module is an important skill which beginner python users must learn and it can be a little hairy if you aren't tech savvy. Here's where to get started.
First, check if you have pip (a module used to install other modules for python). If you installed python recently and have up-to-date software, you almost certainly already have pip. If not, see this detailed how-to answer elsewhere on stackoverflow:
How do I install pip on Windows?
Second, use pip to install the xlrd module. The internet already has a trove of tutorials on this subject, so I will not outline this here. Just Google: "how to pip install a module on your OS here"
Hope this helps!
Old Answer
Your code looks good. Here's the test case I ran, using mostly what you wrote. Note that I changed your function so that it uses its arguments rather than hardcoded values; that may be where your trouble is.
def Excel2CSV(ExcelFile, Sheetname, CSVFile):
    import xlrd
    import csv
    workbook = xlrd.open_workbook(ExcelFile)
    worksheet = workbook.sheet_by_name(Sheetname)
    csvfile = open(CSVFile, 'wb')
    wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for rownum in xrange(worksheet.nrows):
        # encode unicode cells as UTF-8; leave numbers and other types untouched
        wr.writerow(
            list(x.encode('utf-8') if type(x) == type(u'') else x
                 for x in worksheet.row_values(rownum)))
    csvfile.close()
# raw strings keep backslashes such as \t from being read as escape sequences
Excel2CSV(r"C:\path\to\XLSXfile.xlsx", "Sheet_Name", r"C:\path\to\CSVfile.csv")
Double check that the arguments you are passing are all correct.
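For readers on Python 3, the writing half of the function above looks slightly different, because the csv module works with text rather than bytes there. The sketch below is an assumption-laden illustration: write_rows is a hypothetical helper that accepts any iterable of rows, so it does not depend on xlrd being installed.

```python
import csv


def write_rows(rows, csv_path):
    """Write an iterable of rows to csv_path, quoting every field."""
    count = 0
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

With an xlrd worksheet, the rows could be supplied as (worksheet.row_values(i) for i in range(worksheet.nrows)); no manual UTF-8 encoding of the cells is needed in that case.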

Python, openpyxl: I get the wrong value when running get_highest_column()

I am practicing with openpyxl and I'm working on an Excel file called 'test.xlsx'. The file only has 3 columns and 7 rows. The .xlsx file was created with LibreOffice.
When I run...
>>> #! python3
>>> import openpyxl
>>> wb = openpyxl.load_workbook('test.xlsx')
>>> sheet = wb.get_sheet_by_name('Sheet1')
>>> sheet.get_highest_column()
1025
The returned value should be 3.
A quick Google search suggested I run:
>>> sheet.calculate_dimension()
and got the return value:
'A1:AMK7'
This should only be 'A1:C7'.
I remember reading that LibreOffice could be part of the problem to this.
However, I can't switch to MSOffice, and I hate OpenOffice.
Is there suggestion on how I could fix this, or work around it?
Thanks!

It sounds like you're using older versions of LibreOffice and openpyxl. LibreOffice used to set a default value of "A1:AMK7" for the dimensions, but version 5 doesn't seem to do that any more. openpyxl used to rely on the dimensions tag when reading files but hasn't done so for a while. Please try using openpyxl 2.3-b2.
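If upgrading is not an option, the used range can also be computed from the cell values themselves instead of trusting the stored dimensions. The following Python 3 sketch uses a hypothetical helper, used_bounds, that works on any iterable of rows and treats None as an empty cell.

```python
def used_bounds(rows):
    """Return (last_row, last_col), 1-based, of the cells that hold a value."""
    last_row = last_col = 0
    for row_index, row in enumerate(rows, start=1):
        for col_index, value in enumerate(row, start=1):
            if value is not None:
                # rows arrive in order, so the latest non-empty row is the last one
                last_row = row_index
                last_col = max(last_col, col_index)
    return last_row, last_col
```

With a recent openpyxl, the rows could come from sheet.iter_rows(values_only=True); that the values_only argument is available in your installed version is an assumption worth checking.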
