I have a Python script that uses pandas to perform some aggregation on a huge DataFrame and then tries to export the result as an Excel (.xlsx) file.
This is the last step in the process:
print("Exporting to Excel...")
sum_df = sum_df.set_index('products_code')
with pd.ExcelWriter(OUTPUT_FILE, engine='openpyxl') as writer:
sum_df.to_excel(writer, sheet_name="stocks")
print("Done!")
The file exports normally, but whenever I try to upload it to the server, the server rejects it and treats it as a zip file instead of an xlsx file. I found a quick fix: open the file in Microsoft Excel, hit Save, and exit, which seems to resolve the issue. But I don't know the reason for this behavior and am looking for a way to save a valid Excel file directly from the script.
Any ideas?
As discussed in the comments, it seems using xlsxwriter as the engine has solved this issue. E.g.:
print("Exporting to Excel...")
sum_df = sum_df.set_index('products_code')
with pd.ExcelWriter(OUTPUT_FILE, engine='xlsxwriter') as writer:
sum_df.to_excel(writer, sheet_name="stocks")
print("Done!")
It would be good to know what software the server runs, if possible, in case other people encounter this issue. (An .xlsx file is a ZIP container, so a content sniffer that only checks the file signature may well report it as a ZIP archive.)
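If you want the script itself to sanity-check the exported file before uploading, one option (a sketch, not something from the discussion above, reusing the OUTPUT_FILE name from the snippets) is to try reopening it with openpyxl:

import openpyxl

# Reopen the freshly written workbook; a corrupt container would raise an error here.
wb = openpyxl.load_workbook(OUTPUT_FILE)
print(wb.sheetnames)  # should list "stocks" if the export succeeded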
I'm trying to write some Python code which needs to take data from an .xls file created by another application (outside of my control). I've tried using pandas and xlrd, and neither is able to open the file; I get the error messages:
"Excel file format cannot be determined, you must specify an engine manually." using Pandas.
"Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\t'" using xlrd
I think it has to do with the way the file is exported from the program that creates it. When opened directly through Excel, I get the error message "The file format and extension don't match". However, you can ignore this message and the file opens in a usable format and can be edited and all of the expected values are in the right cells etc. Interestingly, when I go to save the file in Excel, the default option that comes up is a webpage.
Currently I have a workaround: I can open the file in Excel, save it as a .csv, then read it into Python as a CSV. This does have to be done through Excel though; if I just change the file extension to .csv, the resulting file is garbage.
However, ideally I would like to avoid the user having to do anything manually. It would be greatly appreciated if anyone has any suggestions of ways this might be possible (i.e. can I 'open' the file in Excel and save it through Excel using Python commands?), or if there are any packages or commands I can use to open/fix badly formatted .xls files.
Cheers!
P.S. I'm pretty new to Python and otherwise only have experience in R, so my current knowledge is quite limited; apologies in advance!
Try this:
from pathlib import Path
import pandas as pd

file_path = Path(filename)
df = pd.read_excel(file_path, engine='openpyxl')
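Given the symptoms described in the question (xlrd finds b'\r\n\t' where it expects a BOF record, and Excel's default save format is a webpage), the file may actually be an HTML table saved with an .xls extension. If that is the case, a sketch along these lines may work (pandas.read_html needs lxml or beautifulsoup4 installed; the path is a placeholder):

import pandas as pd

# read_html parses every <table> in the document and returns a list of DataFrames.
tables = pd.read_html("exported_file.xls")
df = tables[0]
print(df.head())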
I need to open and edit my Excel workbook with openpyxl, store the sheet as a DataFrame, and close the workbook without any changes. Is there any way to kill the Excel process and disable the auto-recovery dialogue which may pop up later?
The reason I'm asking is that my code worked perfectly fine in PyCharm; however, after I packed it into an .exe with PyInstaller, the code stopped working. The error said: "Excel cannot access the file. There are several possible reasons: the file name or path does not exist, the file is being used by another program, or the workbook you are saving has the same name as a currently open workbook."
I assume this is because openpyxl did not really close the Excel file, and I exported it to a different folder with the same file name.
Here is my code:
import openpyxl
import pandas as pd

wb1 = openpyxl.load_workbook(my_path, keep_vba=True)
ws1 = wb1["sheet name"]
# ... making changes ...
ws1_df = pd.DataFrame(ws1.values)
wb1.close()
Many thanks ahead :)
You can do this in the following way:
from win32com.client import Dispatch
# Start excel application
xl = Dispatch('Excel.Application')
# Open existing excel file
book = xl.Workbooks.Open('workbook.xlsx')
# Some arbitrary excel operations ...
# Close excel application without saving file
book.Close(SaveChanges=False)
xl.Quit()
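If Excel still shows prompts, such as the auto-recovery dialogue mentioned in the question, turning off alerts before closing may help. This is a sketch based on the Excel COM object model rather than the answer above, and the path is a placeholder:

from win32com.client import Dispatch

xl = Dispatch('Excel.Application')
xl.DisplayAlerts = False   # suppress Excel's confirmation and recovery prompts
xl.Visible = False         # keep the Excel window hidden

book = xl.Workbooks.Open(r'C:\path\to\workbook.xlsx')
# ... some arbitrary Excel operations ...
book.Close(SaveChanges=False)  # close without saving
xl.Quit()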
In the documentation for pd.ExcelWriter we see the following code snippet:
You can store Excel file in RAM:
import io
df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    df.to_excel(writer)
My question is: how can we access the Excel file back from the buffer? I want the base64-encoded version of the same Excel file without saving it to disk, which is why I am thinking of keeping it in RAM. Can someone please help with this?
Thanks for your time.
Solution: I was able to access the file using buffer.getvalue().
In the snippet you provided, the Excel file has been written to the buffer the same way as if it had been stored on disk.
Therefore you can read it back in a similar way as if you were reading from a file:
pd.read_excel(buffer.getvalue())
More on how BytesIO behaves:
Create an excel file from BytesIO using python
Difference between `open` and `io.BytesIO` in binary streams
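For the base64 part of the question, the bytes returned by buffer.getvalue() can be passed straight to the standard base64 module; a minimal sketch based on the snippet above:

import base64
import io
import pandas as pd

df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])

buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    df.to_excel(writer)

# Base64-encode the in-memory workbook without writing anything to disk.
b64_string = base64.b64encode(buffer.getvalue()).decode("ascii")
print(b64_string[:60])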
I am new to Python and am using openpyxl to edit an .xlsx file. I am having an issue trying to save the original file; it seems that openpyxl keeps making me save the changes as a new .xlsx.
Here is the code I am using; I get the error TypeError: save() takes exactly 2 arguments (1 given)
import openpyxl
from openpyxl import Workbook
wb = openpyxl.load_workbook('book1.xlsx')
sheet = wb.get_sheet_by_name("Sensor Status")
sheet['I3'] = '=countifs(B:B,"*server*",C:C,"=0")'
sheet['I4'] = '=countifs(B:B,"*server*",C:C,">=0")'
wb.save()
You need to pass the file name, like:
wb.save('book1.xlsx')
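In context with the question's code, the corrected ending would look roughly like this (a sketch; it assumes book1.xlsx is not open in Excel at the same time, as the next answer points out):

import openpyxl

wb = openpyxl.load_workbook('book1.xlsx')
sheet = wb['Sensor Status']  # subscripting replaces the deprecated get_sheet_by_name
sheet['I3'] = '=countifs(B:B,"*server*",C:C,"=0")'
sheet['I4'] = '=countifs(B:B,"*server*",C:C,">=0")'

# Passing the path overwrites the original file; use a different path to keep a copy.
wb.save('book1.xlsx')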
Gen Wan's answer is already correct. But assuming you already did that and you still get an error, this might help. I had the same problem, and I figured out that it gave me an error because my file was still open in Microsoft Excel while I was trying to save it with openpyxl. When the same file is open in two programs (in this case Microsoft Excel and openpyxl), I think the privilege to save the file is given to the Microsoft Excel software, which is why it declines the save command from openpyxl. Once I closed Microsoft Excel, I no longer had the error and was able to save the file. I am assuming you had the error because of that too.
I need to output two cleaned and recalculated DataFrames to an Excel file as separate sheets. This code works, but opening the resulting file in Excel produces a "file corrupted" message; it gets repaired and opens fine afterwards, but this is annoying.
The code runs on an Azure Jupyter Notebook with Python 3.6; I download the Excel file and open it in Excel 365 on Windows 10.
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('PR_weatherGDDid.xlsx', engine='xlsxwriter')
# Write each dataframe to a different worksheet.
df.to_excel(writer, sheet_name='Daily', index=False)
doystats.to_excel(writer, sheet_name='stats')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
So the Excel file gets created but has a problem being opened in Excel.
Here is the correct way, using ExcelWriter as a context manager so the file is saved and closed automatically:
>>> with pd.ExcelWriter('PR_weatherGDDid.xlsx') as writer:
...     df.to_excel(writer, sheet_name='Daily')
...     doystats.to_excel(writer, sheet_name='stats')
This is my code, and I can open the Excel file all right:
# Create a Pandas Excel writer (default engine).
writer = pd.ExcelWriter('PR_weatherGDDid.xlsx')
data = [['AMN987','Ok'],['AMN987','Ok'],['AMN987','Error'], ['BBB987','Ok'],['BBB987','Ok'],['CCC','Error']]
df = pd.DataFrame(data, columns=['Serial', 'Status'])
days_to = [['02/08/19',4],['02/08/19',8],['02/08/19',3], ['02/08/19',6],['02/08/19',0],['02/08/19',9]]
doystats = pd.DataFrame(days_to, columns=['Date', 'Day'])
# Write each dataframe to a different worksheet.
df.to_excel(writer, sheet_name='Daily', index=False)
doystats.to_excel(writer, sheet_name='stats')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
writer.close()
As Larisa Golovko noted, this appears to be an issue only with XlsxWriter on Azure Notebooks. It doesn't happen with XlsxWriter, Pandas, or Jupyter in offline environments.
I dug into it a bit more and it looks like there is a zipfile compression error on the .rels files in the xlsx archive. Currently I don't know what is causing that, but it appears to be related to the standard Python zipfile library in that environment. I'll try to put together a simpler test case without XlsxWriter.
A workaround is to use the XlsxWriter in_memory constructor option, which keeps the workbook data in memory instead of using temporary files:
workbook = xlsxwriter.Workbook('hello_world.xlsx', {'in_memory': True})

# Or:
writer = pd.ExcelWriter('pandas_example.xlsx',
                        engine='xlsxwriter',
                        options={'in_memory': True})
The problem with Excel only opening the created file after "repairs" seems to stem from the fact that the file was created in an Azure Jupyter Notebook online. All three code variants (mine and those suggested by @atlas and @sharif) produced a file needing "repairs" in the online environment, but made a normal Excel file when I ran them through a locally installed Jupyter Notebook (Anaconda).