Pandas won't read Excel XML 2003 file with `xls` extension

Can anyone please tell me why pandas won't read an Excel XML 2003 file with an xls extension? When I try to read it from my Python script, it throws an error:
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html x'
I know the obvious reason: it is actually an XML file with a fake xls extension. But I can still open it in Excel as a normal spreadsheet, so I assume there is still a way to read it with pandas?
If not, can I convert this Excel XML 2003 file with an xls extension into a "real" xls file, without the XML tags, using a Python script? If so, I can just put that code in front of the pandas code and read the converted xls file.

It should work if you make sure openpyxl is installed and explicitly tell Pandas to use that engine:
df = pd.read_excel("foo.xls", engine="openpyxl")
# ^^^^^^^^^^^^^^^^^
Pandas can use one of four underlying engines when ingesting Excel files. The one it uses for xls files doesn't support the newer formats:
If io is not a buffer or path, this must be set to identify io. Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb". Engine compatibility:
"xlrd" supports old-style Excel files (.xls).
"openpyxl" supports newer Excel file formats.
"odf" supports OpenDocument file formats (.odf, .ods, .odt).
"pyxlsb" supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if path_or_buffer is in xlsb format, pyxlsb will be used.
New in version 1.3.0.
Otherwise openpyxl will be used.
Changed in version 1.3.0.
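Because openpyxl expects a zip-based workbook, whether the engine switch helps depends on what the file really contains. A minimal sketch for checking this from the leading bytes, using only the standard library, might look like the following. The names `sniff_format` and `sniff_file` are invented for this illustration, and the list of recognized signatures is deliberately short:

```python
# Leading bytes ("magic numbers") of the container formats involved.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx, .xlsb and .ods are zip archives
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_format(head: bytes) -> str:
    """Guess the real format of a spreadsheet file from its first bytes."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # Text-based formats may start with a BOM and some whitespace.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    text = head.lstrip().lower()
    if text.startswith((b"<html", b"<!doctype html", b"<table")):
        return "html"
    if text.startswith(b"<?xml"):
        return "xml"
    return "unknown"


def sniff_file(path, size: int = 512) -> str:
    """Read the first `size` bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(size))
```

For the file in the question, whose first bytes are b'<html x', this returns "html", which suggests trying pandas.read_html rather than any read_excel engine. A result of "zip" is the case where engine="openpyxl" is a reasonable thing to try.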

Related

Python - trying to import/open incorrectly formatted .xls file

I'm trying to write some Python code that needs to take data from an .xls file created by another application (outside of my control). I've tried pandas and xlrd, and neither is able to open the file. I get these error messages:
"Excel file format cannot be determined, you must specify an engine manually." using Pandas.
"Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\t'" using xlrd
I think it has to do with the way the file is exported from the program that creates it. When opened directly through Excel, I get the error message "The file format and extension don't match". However, you can ignore this message and the file opens in a usable format and can be edited and all of the expected values are in the right cells etc. Interestingly, when I go to save the file in Excel, the default option that comes up is a webpage.
Currently I have a workaround: I open the file in Excel, save it as a .csv, then read it into Python as a CSV. This does have to be done through Excel, though. If I just change the file extension to .csv, the resulting file is garbage.
However, ideally I would like to avoid the user having to do anything manually. I would greatly appreciate any suggestions on how this might be possible (i.e. can I 'open' the file in Excel and save it through Excel using Python commands?) or on any packages or commands I can use to open or fix badly formatted .xls files.
Cheers!
P.S. I'm pretty new to Python and only have experience in R otherwise so my current knowledge is quite limited, apologies in advance!

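Try this:
from pathlib import Path
import pandas as pd
file_path = Path(filename)  # filename: path to the .xls file
# Open the file in binary mode and pass the file object to pandas
with file_path.open("rb") as file:
    df = pd.read_excel(file, engine='openpyxl')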
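The detail that Excel offers "webpage" as the default save format hints that the export may really be an HTML table. Under that assumption (which the question does not confirm), a minimal sketch using only the standard library can pull the cell text out. `TableRows` and `html_table_rows` are names invented for this illustration, and the parser assumes well-formed markup with explicit closing tags:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row under construction, or None outside <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_rows(markup: str) -> list:
    """Return all table rows found in `markup` as lists of strings."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.rows
```

If the first row holds the column names, the result can be turned into a DataFrame with pd.DataFrame(rows[1:], columns=rows[0]). pandas.read_html does the same job in one call when a parser backend such as lxml is installed.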

Can pandas.read_csv() open more file types than just a .csv file?

I was wondering if I could open a file with an extension other than .csv if it is formatted the same way.
For example, I have a .key file that is used as a cipher and is in CSV format, but I do not know if I can still open it.

The answer is yes: read_csv() can definitely read files other than .csv. For example, you can read a .txt file with pd.read_csv(). But pandas also has built-in methods for reading XLSX, HTML and other formats, so it is better to use the built-in method for the specific type you need.
Source - pandas.read_csv documentation:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
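As a small illustration that the extension itself does not matter, the sketch below writes comma-separated text to a file ending in .key and reads it back with read_csv. The file name and its contents are made up for the example:

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "cipher.key"  # hypothetical file name
    # The content is ordinary comma-separated text; only the extension is unusual.
    key_file.write_text("letter,code\na,17\nb,42\n")
    df = pd.read_csv(key_file)

print(df)
```

If the file uses a different delimiter, the sep parameter of read_csv can be set accordingly.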

python pandas read_excel engine=openpyxl not closing file

I am loading a dataframe into pandas using following:
import pandas as pd
df_factor_histories=pd.read_excel("./eco_factor/eco_factor_test_data_builder.xlsx",
engine='openpyxl', sheet_name=0)
engine='openpyxl' is required for read_excel to support newer Excel file formats (in my case .xlsx rather than just .xls).
The dataframe loads just fine but the file is left open:
import psutil
p = psutil.Process()
print(p.open_files())
OUTPUT
[popenfile(path='C:\\Users\\xx\\.ipython\\profile_default\\history.sqlite', fd=-1),
popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
popenfile(path='C:\\Windows\\System32\\en-US\\kernel32.dll.mui', fd=-1),
popenfile(path='D:\\xxxxx\\data modelling\\eco_factor\\eco_factor_test_data_builder.xlsx', fd=-1)]
A GitHub post suggests the bug is fixed, but it is not fixed for me (running Anaconda/Jupyter).
Relevant versions I am running:
numpy 1.19.2
openpyxl 3.0.5
pandas 1.1.3
Python 3.7.4
I would appreciate some suggestions on how to close the files/best work around this, thanks

I suggest removing engine='openpyxl' from your code. It isn't actually needed: I use pd.read_excel without it and it works just fine, even for .xlsx files.
Removing it lets the default behavior of the engine parameter take over, and pandas will choose the engine itself:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
engine : str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
“xlrd” supports old-style Excel files (.xls).
“openpyxl” supports newer Excel file formats.
“odf” supports OpenDocument file formats (.odf, .ods, .odt).
“pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.

I was facing the same issue.
When I read the Excel file using pandas with engine='openpyxl', the file was not closed. When I then tried to archive or move the Excel file using Python, it gave this error:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
Also, once the file has been read by pandas, it can no longer be edited or otherwise modified in Excel.
The following solution worked for me.
I am using:
python version 3.6.8.
pandas==0.25.1
openpyxl==3.0.7
import io
import pandas as pd
# Copy the file into memory so the on-disk file is closed before pandas parses it
with open('path/to/input_excel_file.xlsx', "rb") as f:
    file_io_obj = io.BytesIO(f.read())
df_input_file = pd.read_excel(file_io_obj, engine='openpyxl', sheet_name=None)

openpyxl cannot read Strict Open XML Spreadsheet format: UserWarning: File contains an invalid specification for Sheet1. This will be removed

A few of my users (all of whom use Mac) have uploaded an Excel file into my application, which then rejected it because the file appeared to be empty. After some debugging, I've determined that the file was saved in Strict Open XML Spreadsheet format, and that openpyxl (2.6.0) doesn't raise an error, but rather prints a warning to stderr.
To reproduce, open a file, add a few rows and save it in Strict Open XML Spreadsheet (*.xlsx) format.
import openpyxl
with open('excel_open_strict.xlsx', 'rb') as f:
    workbook = openpyxl.load_workbook(filename=f)
This will print the following warning, but will not throw any exception:
UserWarning: File contains an invalid specification for Sheet1. This will be removed
Furthermore, the workbook appears to have no sheets:
assert workbook.sheetnames == []
I've now had three Mac users experience this issue. It seems that Mac will sometimes default to this Strict Open XML Spreadsheet format. If this is a normal case, then openpyxl should be able to handle it. Otherwise, it would be great if openpyxl would just throw an exception. As a workaround, it seems I can do the following:
import openpyxl
with open('excel_open_strict.xlsx', 'rb') as f:
    workbook = openpyxl.load_workbook(filename=f)
# An empty sheet list is the symptom of a Strict-format file
if not workbook.sheetnames:
    raise Exception("The Excel was saved in an incorrect format")

I had similar problems with XLSX files created using the R library openxlsx. A sample error message from a simple python program to open the file and retrieve a single value from sheet Crops:
Warning (from warnings module):
File "C:\Python38\lib\site-packages\openpyxl\reader\workbook.py", line 88
warn(msg)
UserWarning: File contains an invalid specification for Crops. This will be removed
My first, very clumsy solution:
Open with Excel
Save the file as *.xls, which triggered a warning about compatibility.
Re-save as *.xlsx
My second solution works if you only need to read the file:
Impose a read-only restriction:
from openpyxl import load_workbook
wb = load_workbook(filename='CAF_LTAR_crops_out_0.3.xlsx', read_only=True)
The broad lesson seems to be that the XLSX file specification is not uniformly (correctly?) implemented across programming languages.

I am working on a Windows PC and had the same problem with openpyxl. I received an Excel template that had been saved as Strict Open XML Spreadsheet (*.xlsx). When I tried to fill out the template, I got a warning like the one below for each worksheet, and when I printed the list of worksheet names it was empty ([]).
UserWarning: File contains an invalid specification for Sheetname. This will be removed
Solution
I saved the file as Excel Workbook (*.xlsx) instead of Strict Open XML Spreadsheet (*.xlsx). After that there were no warnings, the list contained all worksheets, and I could fill out the template with openpyxl.

Reading an .xls file in python (using pandas read_excel)

So I have an .xls file which I am able to open with Excel and also with Notepad (can see the numbers along with some other text) but I cannot read the file using pandas module.
df = pd.read_excel(r'R:\Project\Projects\429 - Buchner Höhe\Analysis Data\scada\20171101.xls', parse_dates=[[0,1,2,3]])
The error which pops up is as follows:
XLRDError: Unsupported format, or corrupt file: Expected BOF record;
found b'\x03\x11\x0b\x02 \x01\x00\x00'
I tried renaming the file to .xlsx using os.rename, it still does not work.

It is quite likely that the file is really a CSV file that was renamed to .xls through the file system, not an actual Excel-format file. This is the error xlrd raises when you try to open a CSV with it.
The indicator that this is the case is that you can open it with Notepad.
