Accessing Excel files directly from RAM using ExcelWriter - Python

In the documentation for pd.ExcelWriter we see the following code snippet:
You can store the Excel file in RAM:
import io
import pandas as pd

df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    df.to_excel(writer)
My question is: how can we access the Excel file back? I want the base64-encoded version of the same file without saving it to disk, which is why I am keeping it in RAM. Can someone please help with this?
Thanks for your time.
Solution: Was able to access the file using buffer.getvalue().

In the snippet you provided, the Excel file has been written to the buffer exactly as it would have been written to disk.
Therefore you can read it back much as you would read it from a file:
pd.read_excel(buffer.getvalue())
(In recent pandas versions, passing raw bytes to read_excel is deprecated; rewind the buffer and pass it directly instead: buffer.seek(0) followed by pd.read_excel(buffer).)
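Since the stated goal was a base64-encoded version of the workbook without touching disk, here is a minimal sketch of that last step, building on the buffer from the snippet above:
import base64
import io

import pandas as pd

df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    df.to_excel(writer)

# buffer.getvalue() returns the complete xlsx file as bytes;
# b64encode turns those bytes into base64 text.
b64_excel = base64.b64encode(buffer.getvalue()).decode("ascii")
print(b64_excel[:40])  # first characters of the encoded workbook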
More on how BytesIO behaves:
Create an excel file from BytesIO using python
Difference between `open` and `io.BytesIO` in binary streams

Related

Pandas exported excel file reads as Zip file format

I have a Python script that uses pandas to perform some aggregation on a huge dataframe and then tries to export it in the Excel "xlsx" file format.
This is the last step in the process:
print("Exporting to Excel...")
sum_df = sum_df.set_index('products_code')
with pd.ExcelWriter(OUTPUT_FILE, engine='openpyxl') as writer:
    sum_df.to_excel(writer, sheet_name="stocks")
print("Done!")
The file exports normally, but whenever I try to upload it to the server, the server rejects it and reads it as a zip file instead of an xlsx file. I found a quick fix, which is to open the file in Microsoft Excel, hit save, and exit; this seems to fix the issue. But I don't know the reason for this behavior, and I am looking for a way to save a valid Excel file directly from the script.
Any ideas?
As discussed in the comments, it seems that using xlsxwriter as the engine solved the issue. E.g.:
print("Exporting to Excel...")
sum_df = sum_df.set_index('products_code')
with pd.ExcelWriter(OUTPUT_FILE, engine='xlsxwriter') as writer:
    sum_df.to_excel(writer, sheet_name="stocks")
print("Done!")
It would be good to know what software was used on the server, if possible, in case other people encounter this issue.
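For context, an xlsx file really is a zip archive of XML parts, which is why generic type-sniffing tools report it as a zip; a valid workbook should additionally contain entries such as [Content_Types].xml and xl/workbook.xml. A small diagnostic sketch (OUTPUT_FILE as in the question) to inspect what the script actually produced:
import zipfile

# A valid xlsx workbook is a zip container with these entries present.
with zipfile.ZipFile(OUTPUT_FILE) as zf:
    names = zf.namelist()
    print("[Content_Types].xml" in names, "xl/workbook.xml" in names)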

Python - trying to import/open incorrectly formatted .xls file

I'm trying to write some Python code which needs to take data from an .xls file created by another application (outside of my control). I've tried using pandas and xlrd, and neither is able to open the file; I get these error messages:
"Excel file format cannot be determined, you must specify an engine manually." using Pandas.
"Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\t'" using xlrd
I think it has to do with the way the file is exported from the program that creates it. When opened directly through Excel, I get the error message "The file format and extension don't match". However, you can ignore this message; the file opens in a usable format, can be edited, and all of the expected values are in the right cells, etc. Interestingly, when I go to save the file in Excel, the default option that comes up is a webpage.
Currently I have a workaround: I can open the file in Excel and save it as a .csv, then read it into Python as a csv. This does have to be done through Excel, though; if I just change the file extension to .csv, the resulting file is garbage.
However, ideally I would like to avoid the user having to do anything manually. I would greatly appreciate any suggestions of ways this might be possible (i.e. can I 'open' the file in Excel and save it through Excel using Python commands?), or any packages or commands I can use to open/fix badly formatted .xls files.
Cheers!
P.S. I'm pretty new to Python and only have experience in R otherwise so my current knowledge is quite limited, apologies in advance!
Try this:
from pathlib import Path
import pandas as pd

file_path = Path(filename)  # filename as defined elsewhere in your script
df = pd.read_excel(file_path, engine='openpyxl')
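Given the xlrd error (Expected BOF record; found b'\r\n\t') and the fact that Excel's default save option is a webpage, the file is most likely an HTML table with a misleading .xls extension. A hedged sketch of reading it that way, assuming the data of interest is in the first table on the page (filename as above; pd.read_html needs lxml or html5lib installed):
import pandas as pd

# read_html parses every <table> element in the file and returns a
# list of DataFrames; badly exported ".xls" files are often HTML.
tables = pd.read_html(filename)
df = tables[0]  # assumption: the first table holds the data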

Python save Excel .xlsx to CSV/XML and also save styling information for conversion back into .xlsx

My Python program converts Excel files (.xlsx) into a CSV file using pandas' read_excel and to_csv functions, and at some point in the future the CSV is converted back into an Excel file. Maintaining the data is fine, but of course all of the formatting and styling is gone. So I could use some help capturing that information so it can be reapplied after converting the CSV back into an Excel file.
import pandas as pd
import xlsxwriter

EXCEL_PATH_FROM = r'C:\absolute\path\to\excel.xlsx'
EXCEL_PATH_TO = r'C:\absolute\path\to\other\excel.xlsx'
CSV_PATH = r'C:\absolute\path\to\csv.csv'

# read excel and convert to csv
def saveData():
    read_excel = pd.read_excel(EXCEL_PATH_FROM)
    print("writing csv...")
    read_excel.to_csv(CSV_PATH, index=None, header=True)

# get csv data and import that data into an excel file
def createFromData():
    csv = pd.read_csv(CSV_PATH)
    # use a context manager; ExcelWriter.save() was removed in pandas 2.0
    with pd.ExcelWriter(EXCEL_PATH_TO, engine='xlsxwriter') as excel:
        csv.to_excel(excel, index=None)
Some ideas I had were to save the Excel file as XML and insert format and style information as attributes, or to create both a CSV and an XML file from the Excel file (one for data and one for styling). One problem I have is figuring out how to access that information in the first place.
Are there currently any packages supporting Python 3 (currently using 3.8) that could help simplify this process? I dug through openpyxl's documentation; it has some stylesheet classes that I don't think are meant to be used directly, and I couldn't figure out how to use them.
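As a starting point, openpyxl exposes per-cell style objects that can be serialized to a sidecar file alongside the CSV. A minimal sketch, assuming only bold/italic, fill color, and number format need to survive the round trip (EXCEL_PATH_FROM as in the question; the styles.json name is made up):
import json
from openpyxl import load_workbook

wb = load_workbook(EXCEL_PATH_FROM)
ws = wb.active
styles = {}
for row in ws.iter_rows():
    for cell in row:
        # record a small subset of style attributes per cell coordinate ("B2", ...)
        styles[cell.coordinate] = {
            "bold": cell.font.bold,
            "italic": cell.font.italic,
            "fill": cell.fill.start_color.rgb,
            "number_format": cell.number_format,
        }
with open("styles.json", "w") as f:
    json.dump(styles, f)
When recreating the workbook from the CSV, the same attributes can then be reapplied cell by cell via openpyxl's Font and PatternFill classes.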

Processing large XLSX file in python

I have a large xlsx Excel file (56 MB, 550k rows) from which I tried to read the first 10 rows. I tried using xlrd, openpyxl, and pyexcel-xlsx, but they always take more than 35 minutes because they load the whole file into memory.
I unzipped the Excel file and found out that the XML which contains the data I need is 800 MB unzipped.
When you load the same file in Excel it takes about 30 seconds, so I'm wondering why it takes so much time in Python.
Use openpyxl's read-only mode to do this.
You'll be able to work with the relevant worksheet instantly.
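A minimal sketch of read-only mode (file and sheet names taken from the other answer below):
from openpyxl import load_workbook

# read_only=True streams the worksheet XML instead of building the
# whole workbook in memory, so opening is nearly instantaneous.
wb = load_workbook('xlfile.xlsx', read_only=True)
ws = wb['Sheet Name']
for row in ws.iter_rows(max_row=10, values_only=True):
    print(row)
wb.close()  # read-only workbooks keep the file handle open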
Here it is, I found a solution: the fastest way I found to read an xlsx sheet. A 56 MB file with over 500k rows and 4 sheets took 6 s to process.
import zipfile
from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
file = zipfile.ZipFile(filename, "r")
# map sheet names to their XML parts inside the archive
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:
            for row in reader:
                print(row)  # each row is a raw line of worksheet XML; do whatever you want with it
Enjoy and happy coding.
The load time you're experiencing is mostly a memory problem: when pandas loads an Excel file, the zipped XML has to be fully parsed, and several intermediate copies of the data are made along the way.
In terms of a solution, I'd suggest, as a workaround:
- load your Excel file on a virtual machine with specialized hardware (here's what AWS has to offer),
- save your file to CSV format for local use.
For even better performance, use an optimized data structure such as Parquet (a sketch follows below).
For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python
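As a sketch of that CSV/Parquet workaround: pay the slow xlsx parse once, cache the result, and reload the cache on later runs (the file name is reused from the answer above; to_parquet assumes pyarrow or fastparquet is installed):
import pandas as pd

# One-time slow step: parse the xlsx and cache it as Parquet.
df = pd.read_excel('xlfile.xlsx')
df.to_parquet('xlfile.parquet')

# Later runs: reload from the cache in a fraction of the time.
df = pd.read_parquet('xlfile.parquet')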

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to read_csv I get another error:
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader, but nothing works with this file.
The csv file seems totally fine; I checked it and see nothing wrong with it.
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are actually zip archives of XML data, which cannot be read as simple text files (like CSV files).
You need to either convert the Excel file to a CSV file (note - if you have multiple sheets, each sheet should be converted to its own csv file), and then read those.
You can use read_excel, or you can use a library like xlrd which is designed to read Excel's binary (xls) format; see Reading/parsing Excel (xls) files with Python for more information on that.
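A quick way to confirm what kind of file you actually have is to look at its first bytes; this diagnostic sketch relies on the well-known zip and OLE2 signatures:
with open("Data_Matches_tekha.xlsx", "rb") as f:
    head = f.read(8)
# xlsx files are zip archives and start with the zip signature "PK";
# legacy .xls files use the OLE2 compound-file signature.
if head.startswith(b"PK\x03\x04"):
    print("zip container: most likely xlsx")
elif head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
    print("OLE2 container: most likely legacy xls")
else:
    print("neither: possibly plain text such as CSV")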
Use read_excel instead of read_csv if it is an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
I have encountered the same error when I used to_csv to write some data and then read it back in another script. I found an easy solution that bypasses pandas' read function: the pickle module.
It is part of Python's standard library, so there is nothing to install.
Then you can write your data (first) with the code below:
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally import your data in another script using:
import pickle
with open(path, 'rb') as input:
    data = pickle.load(input)
Note that if you want to read your saved data with a different Python version than the one you saved it with, you can specify that in the writing step by passing protocol=x, where x is a pickle protocol version (2 or 3) compatible with the version you intend to use for reading.
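For instance, a minimal sketch of pinning the protocol so that Python 2 can still read the file (path and variable names as above):
import pickle

# protocol=2 is the highest pickle protocol that Python 2 can read.
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output, protocol=2)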
I hope this can be of use.
