I am trying to loop over Excel files with python pandas. First I'm saving them to csv and then I open them again, slice and then saving them again. But I get an error:
"Workbook: size exceeds expected 10752 bytes; corrupt?"
I am relatively new to python.
I could fix this exact error by just opening the problematic Excel file in Excel and simply saving it.
After that I could import the file in Pandas without the error.
In my case I suspect that this error results from a platform inconsistancy in Excel between Windows (where my source files were generated) and Mac OS (where I import the files).
I think you may have a cell that has more than 255 characters in it.
See this article about data and file size limitations: http://kb.tableau.com/articles/knowledgebase/jet-data-file-size-limitations
Consider using openpyxl to open your Excel files instead.
It seems that pandas uses xlrd to read excel files and xlrd raises an error if it feels something is wrong about reading the file... which is what happened to you.
xlrd is no longer maintained as of Jan 2020.
Related
My organization needs me to use pylightxl library to read some bulky excel xlsx files. I have never used this library before and I'm getting a strange error in pycharm. I simply do not understand what it is.
I've tried googling but there isn't much support for pylightxl on the web. Does anyone know how to help?
for completeness of this post it looks like there was a bug in early days of pylightxl for files that were converted from xls to xlsx, however this issue has been resolved with #31 with version 1.52+
For those who might meet the same issue in the future, here's how i solved my issue, or rather a work around of it.
The excel files i had been given were initially in xls format, which pylightxl does not support.
I converted them to xlsx by just clicking "save as" in excel and then tried to read them in pylightxl, of which i was getting the strange error above. Must be something to do with the format
So I ended up saving it in csv instead, of which reading them was successful.
So if anyone meets this error, try different formats for the document you're trying to read
I have been teaching myself Python to automate some of our work processes. So far reading from Excel files (.xls, .xlsx) has gone great.
Currently I have hit a bit of a snag. Although I can output .xlsx files fine, the software system that we have to use for our primary work task can only take .xls files as an input - it cannot handle .xlsx files, and the vendor sees no reason to add .xlsx support at any point in the foreseeable future.
When I try to output a .xls file using either Pandas or OpenPyXl, and open that file in Excel, I get a warning that the file format and extension of the file do not match, which leads me to think that attempting to open this file using our software could lead to some pretty unexpected consequences (because it's actually a .xlsx file, just not named as such)
I've tried to search for how to fix this all on Google, but all I can find are guides for how to convert a .xls file to a .xlsx file (which is almost the opposite of what I need). So I was wondering if anybody could please help me on whether this can be achieved, and if it can, how.
Thank you very much for your time
Under the pandas.DataFrame.to_excel documentation you should notice a parameter called engine, which states:
engine : str, optional
Write engine to use, openpyxl or xlsxwriter. You can also set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and io.excel.xlsm.writer.
What it does not state is that the engine param is automatically picked based on your file extension -- therefore, easy fix:
import pandas as pd
df = pd.DataFrame({"data": [1, 2, 3]})
df.to_excel("file.xls") # Notice desired file extension.
This will automatically use the xlwt engine, so make sure you have it installed via pip install xlwt.
Felipe is right the filename extension will set the engine parameter.
So basically all it's saying is that the old Excel format ".xls" extension is no longer supported in Pandas. So if you specify the output spreadsheet with the ".xlsx" extension the warning message disappears.
I FINALLY have the answer!
I have libreoffice installed and am using the following in the command line on windows:
"C:\Program Files\LibreOffice\program\soffice.exe" --headless --convert-to xlsx test2.xls
Currently trying to use subprocess to automate this.
i got a permission denied error while i tried to open an excel file.
I dont have the ms excel complete version. I mean, im just using the trial version.
Could it be because of that?
my code has just 4 lines
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_excel("E:\\ML")
It's something about how file open function works. I successfully reproduced your problem and find the way.
It's believed that you have a directory named ML in E disk, and maybe there are some excels files (such as *.xls or *.xlsx) in ML(I bet you just started learning machine learning). Now you try to load the excel data into your program, but you give the path E:\\ML, which is a directory instead of a file, so operation is forbidden by system when pandas try to serialize the directory as a file, which is the cause of error "Permission denied".
The method is that you're supposed to use file path like E:\\ML\\your_database_file_name.xls.
I hope it will work for you.
For me, it turns out that it was because I had the same Excel file opened (I kept getting the error while trying to push my work to Github) which was resolved immediately after I closed the MS Excel (the program using the file I wanted to push..)
I hope you find this helpful!
I wrote a tool that extracts data from a large DB and outputs it to an Excel file along with (conditional) formatting to improve readability. For this I use Python with openpyxl on a Linux machine. It works great, but this package is rather slow for writing Excel.
It seems to be a lot quicker to dump the table as (compressed) csv, import that into Excel and apply formatting there using a macro/vba.
To automate the process I'd like to create an empty Excel file pre-loaded with the required VBA to do the formatting; a template. For every data dump, the data is embedded (compressed using deflate) into the Excel file and loaded into the Workbook upon opening the document (or using a "LOAD" button to circumvent macro related security things).
However, just adding some file into the Excel file raises an error when opened:
We found a problem with some content in 'Werkmap1_test_embed.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes.
Clicking Yes opens the file and shows some tracing information as XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>Repair Result to Werkmap1_OLE_Word0.xml</logFileName>
<summary>Errors were detected in file '/Users/joostk/mnt/cluster/Werkmap1_OLE_Word.xlsx'</summary>
<additionalInfo>
<info>Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.</info>
</additionalInfo>
</recoveryLog>
Is it possible to avoid this? How would I embed a file into the Excel ZIP? Do I need to update some file table (which I could not file easily).
When that's done, I'd like to import the data. Can I access files in the Excel ZIP from VBA? I guess not, and I need to extract the data to some temporary path and load it from there.
I have found these helpful answers elsewhere to load ZIP and plain text:
https://stackoverflow.com/a/35781621/4998990
https://stackoverflow.com/a/11267603/4998990
Many thanks for sharing your thoughts!
so my "Answer" here is that this is caused by using Named Ranges, or an underlying table, or an embedded Query/Connection. When you start manipulating this file you will get the error that you are talking about:
There is no harm to the file if you click "yes" and open. Excel will open this in Repaired Mode which will require you to re-save the file.
The way I've worked around this is to re-read the "repaired" file, in python, and save it as another file or replace it. Essentially just do an extra step of re-reading the data into memory, and write it to a new file. The error will go away. As always, test this method before deploying to production to ensure no records are lost. The way I solve it is with two lines of pandas.
import pandas as pd
repair = pd.read_excel('PATH_TO_REPAIR_FILE')
new_file = repair.to_excel('PATH_TO_WHERE_NEW_FILE_GOES')
I tried the following code to be able to read an excel file from my personal computer.
import xlrd
book = xlrd.open_workbook('C:\\Users\eline\Documents\***\***\Python', 'Example 1.xlsx')
But I am getting the error 'Permission denied'. I am using windows and if I look at the properties of the directory and look at the 'Security' tab I have three groups/users and all three have permissions for all the authorities, except for the last option which is called 'special authorities' (as far as I know I do not need this authority to read the excel file in Python).
I have no idea how to fix this error. Furthermore, I do not have the Excel file open on my computer when running the simulation.
I really hope someone can help me to fix this error.
Sometimes, it is because you try to read the Excel file while it is opened. Close the file in Excel and you are good to go.
book = xlrd.open_workbook('C:\\Users\eline\Documents\***\***\Python', 'Example 1.xlsx')
You cannot give path like this to xlrd. path need to be single string.
If you insist you can use os module
import os
book = xlrd.open_workbook(os.path.join('C:\\Users\eline\Documents\***\***\Python', 'Example 1.xlsx'))
[Errno13] permission denied in your case is happening because you want to read folder like a file which is not allowed.
I ran into this situation also while reading an Excel file into a data frame. To me it appears that it is a Python and/or Excel bug which we should probably not hide by using os.path.join even if that solves the problem. My situation involved an excel spreadsheet that links cells to another CSV file. If this excel file is freshly opened and open when I try to read it in python, it fails.
Python reads it correctly if I do an unnecessary save of the open Excel file.