A few of my users (all of whom use Mac) have uploaded an Excel file into my application, which then rejected it because the file appeared to be empty. After some debugging, I've determined that the file was saved in Strict Open XML Spreadsheet format, and that openpyxl (2.6.0) doesn't raise an error, but rather prints a warning to stderr.
To reproduce, open a file, add a few rows and save it in Strict Open XML Spreadsheet (*.xlsx) format.
import openpyxl
with open('excel_open_strict.xlsx', 'rb') as f:
    workbook = openpyxl.load_workbook(filename=f)
This will print the following warning, but will not throw any exception:
UserWarning: File contains an invalid specification for Sheet1. This will be removed
Furthermore, the workbook appears to have no sheets:
assert workbook.get_sheet_names() == []
I've now had three Mac users experience this issue. It seems like Excel on Mac will sometimes default to this Strict Open XML Spreadsheet format. If this is a normal case, then openpyxl should be able to handle it. Otherwise, it would be great if openpyxl would just raise an exception. As a workaround, it seems I can do the following:
import openpyxl
with open('excel_open_strict.xlsx', 'rb') as f:
    workbook = openpyxl.load_workbook(filename=f)
if not workbook.get_sheet_names():
    raise Exception("The Excel was saved in an incorrect format")
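Another way to get an exception instead of a silent warning is to escalate warnings while the loader runs. The following is only a sketch using the standard warnings module; call_with_warnings_as_errors and fake_loader are hypothetical names, and fake_loader merely stands in for openpyxl.load_workbook so the example runs without any workbook file.

```python
import warnings


def call_with_warnings_as_errors(func, *args, **kwargs):
    """Call func and raise instead of warn if it emits a UserWarning."""
    with warnings.catch_warnings():
        # Inside this block every UserWarning is raised as an exception.
        warnings.simplefilter("error", category=UserWarning)
        return func(*args, **kwargs)


def fake_loader(name):
    # Hypothetical stand-in for openpyxl.load_workbook, used only for the demo.
    warnings.warn(
        f"File contains an invalid specification for {name}. This will be removed"
    )
    return []


try:
    call_with_warnings_as_errors(fake_loader, "Sheet1")
    outcome = "loaded"
except UserWarning as exc:
    outcome = f"rejected: {exc}"
print(outcome)
```

In a real application the call would presumably look like call_with_warnings_as_errors(openpyxl.load_workbook, filename=f), with the caveat that any other UserWarning raised during loading would also be turned into an error.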
I had similar problems with XLSX files created using the R library openxlsx. A sample error message from a simple python program to open the file and retrieve a single value from sheet Crops:
Warning (from warnings module):
File "C:\Python38\lib\site-packages\openpyxl\reader\workbook.py", line 88
warn(msg)
UserWarning: File contains an invalid specification for Crops. This will be removed
My first, very clumsy solution:
Open with Excel
Save the file as *.xls, which triggered a warning about compatibility.
Re-save as *.xlsx
My second solution works if you only need to read the file:
Impose a read-only restriction:
wb = load_workbook(filename = 'CAF_LTAR_crops_out_0.3.xlsx', read_only=True)
The broad lesson seems to be that the XLSX file specification is not uniformly (correctly?) implemented across programming languages.
I am working on a Windows PC and I had the same problem with openpyxl. I got an Excel template that was saved as Strict Open XML Spreadsheet (*.xlsx). When I tried to fill out the template, I always got a warning like the one below for each worksheet, and when I printed the array of worksheet names it was empty [].
UserWarning: File contains an invalid specification for Sheetname. This will be removed
Solution
I saved the file as Excel Workbook (*.xlsx) and not as Strict Open XML Spreadsheet (*.xlsx). After that I had no fault message, the array included all Worksheets and I could fill out the template with openpyxl.
I am using the cmis package available in Python to download a document from a FileNet repository, via the getContentStream method available in the package. However, it returns content that begins with 'Pk' and ends in 'PK'. When I googled it, I learned that this is the content of an Excel zip package. Is there a way to save the content into an Excel file, so that I can open the downloaded file in Excel? I am using the code below, but I get the error "a bytes-like object is required, not 'str'". I noticed that the type of result is StringIO.
# export the result
result = testDoc.getContentStream()
outfile = open('sample.xlsx', 'wb')
outfile.write(result.read())
result.close()
outfile.close()
Hi there and welcome to stackoverflow. There are a few bits I noticed about your post.
To answer the error you are getting directly: you opened outfile in binary mode, but result.read() apparently returns a Unicode string, which is why you are getting this error. You can try to encode it before passing it to the outfile.write() function (ex: outfile.write(result.read().encode())).
You can also simply just write Unicode directly by:
result = testDoc.getContentStream()
result_text = result.read()
from zipfile import ZipFile
with ZipFile(filepath, 'w') as zf:
    zf.writestr('filename_that_is_zipped', result_text)
Note: I am not sure what you have in your ContentStream, but an Excel file is made up of XML files zipped up. The minimum file structure you need for an Excel file is as follows:
_rels/.rels contains Excel schemas
docProps/app.xml contains number of sheets and sheet names
docProps/core.xml boilerplate user info and date created
xl/workbook.xml contains sheet names and the rId link to the workbook
xl/worksheets/sheet1.xml (and more sheets in this folder) contains cell data for each sheet
xl/_rels/workbook.xml.rels contains sheet file locations within the zipfile
xl/sharedStrings.xml if you have string-only cell values
[Content_Types].xml applies schemas to file types
I recently went through piecing together an excel file from scratch, if you want to see the code check out https://github.com/PydPiper/pylightxl
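To see which of the parts listed above a given file actually contains, the archive can be opened with the standard zipfile module. This is only a sketch: list_parts is a hypothetical helper, and the demo archive below holds placeholder content rather than a valid workbook.

```python
import io
import zipfile


def list_parts(file_or_path):
    """Return the sorted member names of a zip-based workbook."""
    with zipfile.ZipFile(file_or_path) as archive:
        return sorted(archive.namelist())


# Demo: an in-memory archive with a few of the part names from the list above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("[Content_Types].xml", "_rels/.rels", "xl/workbook.xml"):
        archive.writestr(name, "<placeholder/>")
buffer.seek(0)

parts = list_parts(buffer)
print(parts)
```

For a downloaded file, passing its path (for example list_parts('sample.xlsx')) should show whether the content really is a workbook package or some other kind of zip archive.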
My code:
import xlrd
wb = xlrd.open_workbook("Z:\\Data\\Locates\\3.8 locates.xls")
sh = wb.sheet_by_index(0)
print sh.cell(0,0).value
The error:
Traceback (most recent call last):
File "Z:\Wilson\tradedStockStatus.py", line 18, in <module>
wb = xlrd.open_workbook("Z:\\Data\\Locates\\3.8 locates.xls")
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 429, in open_workbook
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1545, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1539, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record;
found '<table r'"
The file doesn't seem to be corrupted or of a different format.
Anything to help find the source of the issue would be great.
Try to open it as an HTML with pandas:
import pandas as pd
data = pd.read_html('filename.xls')
Or try any other html python parser.
That's not a proper excel file, but an html readable with excel.
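If pandas is not available, the table can also be pulled out with the standard library. The following is a rough sketch built on html.parser; the TableRows class is a hypothetical helper, it ignores nested tables and colspan/rowspan, and the sample markup is made up for the demo.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(
    "<table border='1'><tr><th>Symbol</th><th>Qty</th></tr>"
    "<tr><td>ABC</td><td>100</td></tr></table>"
)
print(parser.rows)
```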
You say:
The file doesn't seem to be corrupted or of a different format.
However as the error message says, the first 8 bytes of the file are '<table r' ... that is definitely not Excel .xls format. Open it with a text editor (e.g. Notepad) that won't take any notice of the (incorrect) .xls extension and see for yourself.
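The same check can be done in code by looking at the first bytes of the file. This is a minimal sketch: sniff_format is a hypothetical helper, the signatures are the well-known OLE2 and zip magic numbers, and the "markup" branch is only a heuristic for HTML or XML saved with an .xls extension.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # container of binary .xls files
ZIP_MAGIC = b"PK\x03\x04"                       # .xlsx files are zip archives


def sniff_format(head):
    """Guess the real format from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.lstrip().startswith(b"<"):
        return "markup"
    return "unknown"


print(sniff_format(b"<table r"))
print(sniff_format(b"PK\x03\x04\x14\x00\x06\x00"))
print(sniff_format(OLE2_MAGIC))
```

In practice the argument would come from something like open(path, 'rb').read(8), which mirrors the 8 bytes that xlrd quotes in its error message.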
This can also happen with some files while they are open in Excel.
I had a similar problem and it was related to the version. In a python terminal check:
>> import xlrd
>> xlrd.__VERSION__
If you have '0.9.0' you can open almost all files. If you have '0.6.0' which was what I found on Ubuntu, you may have problems with newest Excel files. You can download the latest version of xlrd using the Distutils standard.
I found the similar problem when downloading .xls file and opened it using xlrd library. Then I tried out the solution of converting .xls into .xlsx as detailed here: how to convert xls to xlsx
It works like a charm and rather than opening .xls, I am working with .xlsx file now using openpyxl library.
Hope it helps to solve your issue.
I had faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; error and solved it by writing an XML to XLSX converter. The reason is that actually, xlrd does not support XML Spreadsheet (*.xml) i.e. NOT in XLS or XLSX format.
import pandas as pd
from bs4 import BeautifulSoup
def convert_to_xlsx():
    with open('sample.xls') as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
        writer = pd.ExcelWriter('sample.xlsx')
        for sheet in soup.findAll('Worksheet'):
            sheet_as_list = []
            for row in sheet.findAll('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
            pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)
        writer.save()
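The same XML Spreadsheet layout can also be read with only the standard library, which may be handy for checking what the file contains before converting it. This is a sketch under the assumption that the file uses the usual urn:schemas-microsoft-com:office:spreadsheet namespace; read_spreadsheetml and the SAMPLE document are made up for illustration, and cells that use ss:Index to skip columns are not handled.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}


def read_spreadsheetml(xml_text):
    """Return {sheet name: list of rows} for an XML Spreadsheet document."""
    root = ET.fromstring(xml_text)
    sheets = {}
    for sheet in root.iterfind("ss:Worksheet", NS):
        rows = []
        for row in sheet.iterfind("ss:Table/ss:Row", NS):
            # An empty <Cell/> has no <Data> child, so fall back to ''.
            rows.append([
                cell.findtext("ss:Data", default="", namespaces=NS)
                for cell in row.iterfind("ss:Cell", NS)
            ])
        sheets[sheet.get(f"{{{SS}}}Name")] = rows
    return sheets


SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Crops"><Table>
  <Row><Cell><Data ss:Type="String">wheat</Data></Cell><Cell/></Row>
 </Table></Worksheet>
</Workbook>"""

sheets = read_spreadsheetml(SAMPLE)
print(sheets)
```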
In my case, after opening the file with a text editor as @john-machin suggested, I realized the file is not encoded the way an Excel file is supposed to be; it is in CSV format and was merely saved with an Excel extension. What I did was rename the file and its extension and use the read_csv function instead:
import os
import pandas as pd
os.rename('sample_file.xls', 'sample_file.csv')
csv = pd.read_csv("sample_file.csv", error_bad_lines=False)  # newer pandas: on_bad_lines='skip'
It may be an old excel file format. It can be read as html in pandas via
import pandas as pd
df = pd.read_html('file.xls')
Note that this returns a list of DataFrames (check the type and you will see it is a list). https://pandas.pydata.org/pandas-docs/version/0.17.1/io.html#io-read-html
You need to extract them, for instance with df[0]
I ran into this problem too. I opened the file in Excel and saved it in another format, such as Excel 97-2003, and that finally solved the problem.
I had the same issue. Those old files are formatted like a tab-delimited file. I've been able to open my problem files with read_table; ie df = pd.read_table('trouble_maker.xls').
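For files like this, the standard csv module is also enough when pandas is not needed. A minimal sketch, with read_tab_delimited as a hypothetical helper and an in-memory sample in place of a real file:

```python
import csv
import io


def read_tab_delimited(text_stream):
    """Return all rows of a tab-delimited text stream as lists of strings."""
    return list(csv.reader(text_stream, delimiter="\t"))


sample = io.StringIO("Symbol\tQty\nABC\t100\n")
rows = read_tab_delimited(sample)
print(rows)
```

With a file on disk, the stream would typically come from open('trouble_maker.xls', newline=''), which is how the csv documentation recommends opening files for the reader.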
I got this error when I tried to read some XLSX files from a folder while one of the files was open in Excel. I closed the XLSX file and the error did not show up again.
Try this. It worked for me.
import pandas as pd
data = pd.read_csv('filename.xls')
I just downloaded xlrd, created an excel document (excel 2007) for testing and got the same error (message says 'found PK\x03\x04\x14\x00\x06\x00'). Extension is a xlsx. Tried saving it to an older .xls format and error disappears .....
I meet the same problem.
it lies in the .xls file itself - it looks like an Excel file however it isn't. (see if there's a pop up when you plainly open the .xls from Excel)
sjmachin commented on Jan 19, 2013 from https://github.com/python-excel/xlrd/issues/26 helps.
I worked on the same issue and finally solved it. This is the top result for the question, so I am just posting what I did.
Observations -
1 - The file was not actually XLS. I renamed it to .txt and noticed HTML text in the file.
2 - Renamed the file to .html and tried reading it with pd.read_html; this failed.
3 - Added the html, head and body tags, as they were not there in the txt file, and removed the style to ensure that the table displays in a browser locally; this WORKED.
Below is the code; it may help someone.
import pandas as pd
import os
import shutil
import html5lib
import requests
from bs4 import BeautifulSoup
import re
import time
shutil.copy('your.xls','file.html')
shutil.copy('file.html','file.txt')
time.sleep(2)
txt = open('file.txt','r').read()
# Modify the text to ensure the data display in html page, delete style
txt = str(txt).replace('<style> .text { mso-number-format:\#; } </script>','')
# Add head and body if it is not there in HTML text
txt_with_head = '<html><head></head><body>'+txt+'</body></html>'
# Save the file as HTML
html_file = open('output.html','w')
html_file.write(txt_with_head)
html_file.close()  # flush to disk before the file is read back below
# Use beautiful soup to read
url = r"C:\Users\hitesh kumar\PycharmProjects\OEM ML\output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")
my_table = soup.find("table",attrs={'border': '1'})
frame = pd.read_html(str(my_table))[0]
print(frame.head())
frame.to_excel('testoutput.xlsx',sheet_name='sheet1', index=False)
Open in google sheets and then download from sheets as CSV and then reupload to drive. Then you can Open CSV file from python.
Two ways I know of: just download the xls file once again, or, if you are working in Google Colab, load the file once again from your computer and run pd.read_excel("filename.xlsx") once again. It should work.
As they already wrote it is actually html, to see the first table you can use
df= pd.read_html(file)
df[0]
To see how many tables there are you can use
print('Tables found:', len(df))
This worked for me, using encoding="utf-8" from this post:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 100: character maps to <undefined>
def convert_to_xlsx():
    with open('sample.xls', encoding="utf-8") as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
        writer = pd.ExcelWriter('sample.xlsx')
        for sheet in soup.findAll('Worksheet'):
            sheet_as_list = []
            for row in sheet.findAll('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
            pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False,
                                                 header=False)
        writer.save()
melike's answer works for me, except that the last output statement didn't work. So if anyone has the same issue and wants to write the xlsx file to a local location, they can simply modify the last three lines.
import pandas as pd
from bs4 import BeautifulSoup
def convert_to_xlsx():
    with open('sample.xls') as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
        writer = pd.ExcelWriter('sample.xlsx')
        for sheet in soup.findAll('Worksheet'):
            sheet_as_list = []
            for row in sheet.findAll('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
        # Write the rows of the last worksheet parsed to a single sheet
        output_df = pd.DataFrame(sheet_as_list)
        output_df.to_excel(writer, sheet_name='sheet1',index=False, header=False)
        writer.close()
import os
import pandas as pd
# Rename the file if it's not already a .csv file
if not os.path.exists('3.8 locates.csv'):
    os.rename('3.8 locates.xls', '3.8 locates.csv')
# Load the data into a pandas dataframe
df = pd.read_csv("3.8 locates.csv", sep='\t|\n', engine='python')
# Show the first 5 rows of the dataframe
print(df.head())
The code imports the os and pandas modules and then uses them to perform the following operations:
Check if the file '3.8 locates.csv' exists.
If it does not exist, it renames the file '3.8 locates.xls' to '3.8 locates.csv'.
Load the contents of the file '3.8 locates.csv' into a Pandas dataframe using the pd.read_csv method. The sep argument is set to '\t|\n' and the engine argument is set to 'python' to handle the file's separators correctly.
Print the first 5 rows of the dataframe using the df.head() method.
Note: The code may not work as expected if the file is not a valid tab-separated or newline-separated file.
There's nothing wrong with your file. xlrd does not yet support xlsx (Excel 2007+) files, although it's purported to have supported this for some time.
Simplistix github
Two days ago they committed a pre-alpha version to their git which integrates xlsx support. Other forums suggest that you use a DOM parser for xlsx files, since the xlsx file type is just a zip archive containing XML. I have not tried this. There is another package with similar functionality to xlrd, called openpyxl, which you can get from easy_install or pip. I have not tried this either; however, its API is supposed to be similar to xlrd.
I know there should be a proper way to solve it,
but just to save time
I uploaded my xlsx sheet to Google Sheets and then downloaded it again from Google Sheets.
It is working now.
If you don't have time to solve the problem, you can try this.
Sometimes it helps to add ?raw=true at the end of a file path. For example:
wb = xlrd.open_workbook("Z:\\Data\\Locates\\3.8 locates.xls?raw=true")
I'm new to Python and having trouble dealing with Excel manipulation in Python.
So here's my situation: I'm using requests to get a .xls file from a web server. After that I'm using xlrd to read the content, which I first save in an Excel file. I'm only interested in one value of that file, and there are thousands of files I'm retrieving from different URLs.
I want to know how I could handle the contents I get from the request in some other way, rather than creating a new file.
I've included my code below, and I welcome comments on how I could improve it. Also, it doesn't work, since I'm trying to save new content in an already created Excel file (I couldn't figure out how to delete the contents of that file for my code to work, even if it's not efficient).
import requests
import xlrd
d = {}
for year in string_of_years:
    for month in string_of_months:
        dls = " http://.../name_year_month.xls"
        resp = requests.get(dls)
        output = open('temp.xls', 'wb')
        output.write(resp.content)
        output.close()
        workbook = xlrd.open_workbook('temp.xls')
        worksheet = workbook.sheet_by_name(mysheet_name)
        num_rows = worksheet.nrows
        for k in range(num_rows):
            if condition I'm looking for:
                w = {key_year_month: worksheet.cell_value(k, 0)}
                d.update(w)
                break
xlrd.open_workbook can accept a string for the file data instead of the file name. Your code could pass the contents of the XLS, rather than creating a file and passing its name.
Try this:
# UNTESTED
resp = requests.get(dls)
workbook = xlrd.open_workbook(file_contents=resp.content)
Reference: xlrd.open_workbook documentation
Save it, and then delete the file with os on each loop iteration once the work is done.
import os
# Your stuff here
os.remove('temp.xls')  # path to the temp file
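If a file on disk really is needed, the tempfile module can take care of picking a unique name, and a try/finally block makes sure the file is deleted even when processing fails. This is only a sketch: process_via_temp_file and file_size are hypothetical names, and file_size merely stands in for the real xlrd work.

```python
import os
import tempfile


def process_via_temp_file(data, process):
    """Write bytes to a temporary file, call process(path), then delete it."""
    handle, path = tempfile.mkstemp(suffix=".xls")
    try:
        with os.fdopen(handle, "wb") as temp_file:
            temp_file.write(data)
        return process(path)
    finally:
        # Runs whether process() succeeded or raised.
        os.remove(path)


def file_size(path):
    # Hypothetical stand-in for opening the workbook and reading one value.
    return os.path.getsize(path)


print(process_via_temp_file(b"example bytes", file_size))
```

Inside the download loop, the call would presumably be process_via_temp_file(resp.content, some_function_that_uses_xlrd), so no stale temp.xls is left behind between iterations.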