Python Pandas XLRDError when reading .xls files

I'm having a problem reading .xls files with pandas.
Here's the code:
df = pd.read_excel('sample.xls')
And the output is:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xff\xfeD\x00A\x00T\x00'
Is anyone else experiencing the same issue? How can I fix it?

# Make all string literals in this module unicode (only matters on Python 2)
from __future__ import unicode_literals
# Used to save the data as an Excel workbook
# Requires the xlwt library to be installed
from xlwt import Workbook
# Used to open the mislabelled Excel file as text
import io
filename = r'sample.xls'
# Open the file using the 'utf-16' encoding and read all lines
with io.open(filename, "r", encoding="utf-16") as file1:
    data = file1.readlines()
# Create a workbook object
xldoc = Workbook()
# Add a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterate over the lines and write each value to the sheet
for i, row in enumerate(data):
    # Two things are done here:
    # remove the trailing '\n' left by readlines()
    # split the line on '\t' to get the cell values
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)
# Save the workbook as a real Excel file
xldoc.save('1.xls')
Credits to this Medium Article
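
The bytes quoted in the error message start with \xff\xfe, which is a UTF-16 byte order mark, and that is why opening the file as UTF-16 text works. A minimal sketch for checking what a file with an .xls name really contains before choosing a reader could look like this. The function name and the returned labels are made up for illustration, and the result is only a guess based on the first bytes:

```python
def sniff_spreadsheet_format(path):
    """Guess what a file really contains from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls"            # OLE2 compound file, the container of binary .xls
    if head.startswith(b"PK\x03\x04"):
        return "xlsx"           # ZIP archive, the container of .xlsx
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16-text"    # UTF-16 byte order mark: text, often tab-separated
    # Skip a UTF-8 byte order mark and leading whitespace before looking for markup
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    if head.lstrip().startswith(b"<"):
        return "markup"         # XML Spreadsheet or HTML saved with an .xls name
    return "unknown"
```

If this returns "utf-16-text", then pd.read_csv(path, sep="\t", encoding="utf-16") is usually enough to load the data, without writing an intermediate workbook.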

Related

Python - Open file in default program and save with default program extension (or the like)

I'm currently trying to do the following:
Open up an .xml file that's already in spreadsheet format with Excel
Save the .xml file as .xlsx without corrupting the file
Other options that I can take via Python are:
Convert the .xml to .xlsx
Copy specific columns (A1:AC6000) to another Excel workbook
Import an XML file directly in an Excel workbook.
I failed at all of them and can't think of a different way so here I am asking for help. My latest code is here:
# importing openpyxl module
import openpyxl as xl
# opening the source excel file
file = 'C:\\Users\\ddejean\\Desktop\\HESKlogin\\Downloads\\data.xlsx'
wb1 = xl.load_workbook(file)
ws1 = wb1['Sheet1']
# opening the destination excel file
filename1 = 'C:\\Users\\ddejean\\Desktop\\HESKlogin\\Downloads\\updated.xlsx'
wb2 = xl.load_workbook(filename1)
ws2 = wb2['Sheet1']
# calculate total number of rows and
# columns in source excel file
mr = ws1.max_row
mc = ws1.max_column
# copying the cell values from source
# excel file to destination excel file
for i in range(1, mr + 1):
    for j in range(1, mc + 1):
        # reading cell value from source excel file
        c = ws1.cell(row=i, column=j)
        # writing the read value to destination excel file
        ws2.cell(row=i, column=j).value = c.value
# saving the destination excel file
wb2.save(filename1)
I also tried changing the format of the file which ultimately corrupted the file:
A = r"C:\\Users\\ddejean\\Desktop\\HESKlogin\\Downloads\\data.xml"
pre, ext = os.path.splitext(A)
B = os.rename(A, pre + ".xlsx")
I tried importing the file into Excel, which went badly because none of the data in the XML has properly named attributes to tell the fields apart. I also tried calling a macro, but I get an error with every macro on my network, so I dropped that alternative.
Any assistance you can give would be much appreciated! I should also mention that I'm a beginner.

This works for me :)
import os
import win32com.client as win32
hesk = "C:\\Users\\ddejean\\Desktop\\TEST\\hesk.xml"
folder = "C:\\Users\\ddejean\\Desktop\\TEST"
output = "C:\\Users\\ddejean\\Desktop\\TEST\\output.csv"
cd = os.path.dirname(os.path.abspath(folder))
xmlfile = os.path.join(cd, hesk)
csvfile = os.path.join(cd, output)
# Use Excel through COM to save the Excel XML file as CSV
if os.path.exists(csvfile):
    os.remove(csvfile)
try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.OpenXML(xmlfile)
    # 6 is the xlCSV file format constant
    wb.SaveAs(csvfile, 6)
    wb.Close(True)
except Exception as e:
    print(e)
finally:
    # Release the COM references
    wb = None
    excel = None

Python - XLRDError: Unsupported format, or corrupt file: Expected BOF record

I am trying to open an Excel file that was given to me for my project; it is an export from an SAP system. But when I try to open it with pandas I get the following error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'
The following is my code:
import pandas as pd
# To open an excel file
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')

I don't know whether it will work for you, but it worked for me once, so try the following:
from __future__ import unicode_literals
from xlwt import Workbook
import io
filename = r'myexcel.xls'
# Open the file using the 'utf-16' encoding and read all lines
with io.open(filename, "r", encoding="utf-16") as file1:
    data = file1.readlines()
# Create a workbook object
xldoc = Workbook()
# Add a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterate over the lines and write each value to the sheet
for i, row in enumerate(data):
    # Two things are done here:
    # remove the trailing '\n' left by readlines()
    # split the line on '\t' to get the cell values
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)
# Save the workbook as a real Excel file (this overwrites the original)
xldoc.save('myexcel.xls')

I faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; error and solved it by writing an XML to XLSX converter. You can call pd.ExcelFile('myexcel.xlsx') after the conversion. The reason is that pandas uses xlrd for reading .xls files, and xlrd does not support the XML Spreadsheet format (*.xml), which is neither XLS nor XLSX.
import pandas as pd
from bs4 import BeautifulSoup
def convert_to_xlsx():
    # The file is an XML Spreadsheet despite its .xls extension
    with open('sample.xls') as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
    # The context manager saves the workbook on exit
    with pd.ExcelWriter('sample.xlsx') as writer:
        for sheet in soup.find_all('Worksheet'):
            sheet_as_list = []
            for row in sheet.find_all('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.find_all('Cell')])
            pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)
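
If installing BeautifulSoup is not an option, the same Worksheet, Row, Cell and Data walk can be sketched with the standard library alone. This assumes the file is an Excel 2003 XML Spreadsheet using the usual urn:schemas-microsoft-com:office:spreadsheet namespace. The function name is hypothetical, and like the converter above it ignores ss:Index gaps and returns every value as text:

```python
import xml.etree.ElementTree as ET

# Namespace assumed for Excel 2003 XML Spreadsheet files
SS = "urn:schemas-microsoft-com:office:spreadsheet"

def read_xml_spreadsheet(path):
    """Return {sheet name: rows}, where each row is a list of cell strings."""
    root = ET.parse(path).getroot()
    sheets = {}
    for sheet in root.iter("{%s}Worksheet" % SS):
        name = sheet.get("{%s}Name" % SS)
        rows = []
        for row in sheet.iter("{%s}Row" % SS):
            cells = []
            for cell in row.iter("{%s}Cell" % SS):
                data = cell.find("{%s}Data" % SS)
                # An empty or missing Data element becomes an empty string
                cells.append((data.text or "") if data is not None else "")
            rows.append(cells)
        sheets[name] = rows
    return sheets
```

Each list of rows can then be handed to pd.DataFrame(rows) and written out with to_excel, as in the answer above.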

What worked for me was applying this advice:
How to cope with an XLRDError
There you will also find an explanation that applied to my case. It says the problem is a file that was not saved in the correct format. When I opened the xls file, it offered to save it as HTML. I saved it as ".xlsx" instead, and that solved the problem.

how to convert xlsx to tab delimited files

I have quite a lot of xlsx files, and it is a pain to convert them one by one to tab-delimited files.
I would like to know if there is a way to do this with Python. Here is what I found and what I tried, without success.
I found this and tried the solution, but it did not work: Mass Convert .xls and .xlsx to .txt (Tab Delimited) on a Mac
I also tried to do it for a single file to see how it works, but with no success:
#!/usr/bin/python
import xlrd
import csv
def main():
    # I open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # I don't know the name of sheet
    mysheet = myfile.sheet_by_index(0)
    # I open the output csv
    myCsvfile = open('my.csv', 'wb')
    # I write the file into it
    wr = csv.writer(myCsvfile, delimiter="\t")
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
    myCsvfile.close()
if __name__ == '__main__':
    main()

No real need for the main function.
And not sure about your indentation problems, but this is how I would write what you have. (And should work, according to first comment above)
#!/usr/bin/python
import xlrd
import csv
# Note: xlrd 2.0 and later only read .xls, so opening an .xlsx file this way needs xlrd < 2.0
# open the output csv (text mode with newline='' on Python 3)
with open('my.csv', 'w', newline='') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # get a sheet
    mysheet = myfile.sheet_by_index(0)
    # write the rows
    for rownum in range(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))

Why go through so much pain when you can do it in 3 lines:
import pandas as pd
file = pd.read_excel('myfile.xlsx')
# Write to a new file name so the source workbook is not overwritten
file.to_csv('myfile.txt', sep="\t", index=False)
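
Since the question is about many files, the same pandas call can be wrapped in a loop over a folder. This is a minimal sketch, assuming all workbooks sit in one folder, only the first sheet matters, and the output should sit next to the source with a .txt suffix. The function names are hypothetical:

```python
from pathlib import Path

def tsv_path_for(xlsx_path):
    """Return the output path: same folder and stem, with a .txt suffix."""
    return Path(xlsx_path).with_suffix(".txt")

def convert_folder(folder):
    """Convert the first sheet of every .xlsx file in `folder` to tab-delimited text."""
    # Imported here so that tsv_path_for can be used without pandas installed;
    # reading .xlsx through pandas also needs the openpyxl package.
    import pandas as pd
    written = []
    for xlsx_path in sorted(Path(folder).glob("*.xlsx")):
        out_path = tsv_path_for(xlsx_path)
        pd.read_excel(xlsx_path).to_csv(out_path, sep="\t", index=False)
        written.append(out_path)
    return written
```

Calling convert_folder('.') would convert every workbook in the current directory and return the list of files it wrote.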

Converting Excel to CSV python

I am using xlrd to convert my .xls Excel file to a CSV file, but when I try to open the workbook my program crashes with this error message:
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xlrd/book.py", line 1224, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found 'Chrom\tPo'
Chrom\tPo is part of the header of my Excel file, but I don't understand what is wrong with the file or how to fix it.
The program crashes as soon as I try to open the Excel file with xlrd.open_workbook('Excel File').

I would use openpyxl for this.
import openpyxl
wb = openpyxl.load_workbook(file_name)
ws = wb.worksheets[page_number]
table = []
# openpyxl rows and columns are 1-indexed
for row_num in range(1, ws.max_row + 1):
    temp_row = []
    for col_num in range(1, ws.max_column + 1):
        temp_row.append(ws.cell(row=row_num, column=col_num).value)
    table.append(temp_row[:])
This will give you the contents of the sheet as a 2-D list, which you can then write out to a csv or use as you wish.
If you're stuck with xlrd for whatever reason, you may just need to convert your file from xls to xlsx.

Here is an answer from a previous question: How to save an Excel worksheet as CSV from Python (Unix)?
The answer covers both openpyxl and xlrd.
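
The bytes quoted in the error, 'Chrom\tPo', look like the start of a tab-separated header line, which suggests the file is plain text that was saved with an .xls extension. If that assumption holds, no Excel library is needed, and the standard csv module can do the conversion. A minimal sketch, with a hypothetical function name and an assumed UTF-8 source encoding:

```python
import csv

def tsv_to_csv(src_path, dst_path, encoding="utf-8"):
    """Copy tab-separated rows from src_path into a comma-separated dst_path."""
    count = 0
    with open(src_path, newline="", encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            count += 1
    # Number of rows copied, header included
    return count
```

The encoding parameter is there because exports of this kind are sometimes UTF-16 rather than UTF-8.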

convert a tsv file to xls/xlsx using python

I want to convert a file in tsv format to xls/xlsx.
I tried using
os.rename("sample.tsv","sample.xlsx")
But the resulting file is corrupted. Is there another way to do it?

Here is a simple example of converting TSV to XLSX using XlsxWriter and the core csv module:
import csv
from xlsxwriter.workbook import Workbook
# Add some command-line logic to read the file names.
tsv_file = 'sample.tsv'
xlsx_file = 'sample.xlsx'
# Create an XlsxWriter workbook object and add a worksheet.
workbook = Workbook(xlsx_file)
worksheet = workbook.add_worksheet()
# Create a TSV file reader (text mode with newline='' on Python 3).
with open(tsv_file, newline='') as tsv_handle:
    tsv_reader = csv.reader(tsv_handle, delimiter='\t')
    # Read the row data from the TSV file and write it to the XLSX file.
    for row, data in enumerate(tsv_reader):
        worksheet.write_row(row, 0, data)
# Close the XLSX file.
workbook.close()

You need to:
Read the data from the tsv file.
Convert the values to the types you want them to be.
Write them to an Excel file with openpyxl for xlsx or xlwt for xls.

import csv
from xlsxwriter.workbook import Workbook
# Add some command-line logic to read the file names.
tsv_file = 'sample.tsv'
xlsx_file = 'sample.xlsx'
# Create an XlsxWriter workbook object and add a worksheet.
workbook = Workbook(xlsx_file)
worksheet = workbook.add_worksheet()
# Create a TSV file reader.
tsv_reader = csv.reader(open(tsv_file,'rt'),delimiter="\t")
# Read the row data from the TSV file and write it to the XLSX file.
for row, data in enumerate(tsv_reader):
    worksheet.write_row(row, 0, data)
# Close the XLSX file.
workbook.close()
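
The csv module returns every cell as a string, so the snippets above write numbers as text. The "convert" step mentioned earlier could be sketched as a small helper that turns a cell into an int or float when it parses cleanly and leaves it alone otherwise. The name coerce_cell is hypothetical, and this is only one possible conversion policy:

```python
import math

def coerce_cell(text):
    """Return int or float when the text parses cleanly, else the original string."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        number = float(text)
    except ValueError:
        return text
    # Keep 'nan' and 'inf' as text, since they are not ordinary spreadsheet numbers
    return number if math.isfinite(number) else text
```

It would be applied per row, for example worksheet.write_row(row, 0, [coerce_cell(v) for v in data]). Note that zero-padded identifiers such as '007' would become the number 7, so columns like that should be left as strings.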
