This question already has an answer here:
Pandas to_excel- How to make it faster
(1 answer)
Closed 2 years ago.
I am successfully writing dataframes to Excel using df.to_excel(). Unfortunately, this is slow and consumes gobs of memory. The larger the dataset, the more memory it consumes, until (with the largest datasets I need to deal with) the server starves for resources.
I found that using the df.to_csv() method instead offers the convenient chunksize=nnnn argument. This is far faster and consumes almost no extra memory. Wonderful! I'll just write initially to .csv, then convert the .csv to .xlsx in another process. I tried doing it with openpyxl, but every solution I found iterated through the csv data one row at a time, appending to a Workbook sheet, e.g.
with open(temporary_filepath, 'r') as f:
for row in csv.reader(f):
ws.append(row)
wb.save()
This works, but when I watch my resource monitor, consumes just as much memory and is just as slow (I now assume the original df.to_excel() was doing the same thing internally). So this approach didn't get me out of the woods after all.
I had assumed I could do something that would be the equivalent of File | Save As, but in Python, e.g. read the whole csv into an openpyxl Workbook and save it to a file all in one go, without iterating, but either this is not possible or I can't find the documentation on it.
Given a very large Pandas dataframe and a requirement to output .xlsx (not .csv), what is the best approach for low memory consumption? Can it be done efficiently with Pandas or Openpyxl, or is there a better tool for the job?
Update: Looks like pyexcel has as a Save As method that might do the trick. Would prefer not to add yet another spreadsheet lib to the stack if possible, but will do if there is no equivalent in pandas or openpyxl. Has anyone used that successfully?
Probably you could use the library pyexcelerate - https://github.com/kz26/PyExcelerate. They have posted the benchmarks on their github repo
from pyexcelerate import Workbook
values = [df.columns] + list(df.values)
wb = Workbook()
wb.new_sheet('data_sheet_name', data=values)
wb.save('data.xlsx')
The pyexcelerate response is exactly what I asked about, so I accepted that answer, but just wanted to post an update that we found an alternate solution that's possibly even easier. Sharing here in case it's useful.
Pandas now prefers xlsxwriter over openpyxl. If it's installed, and you do not specify the engine, xlsxwriter will be used by default (or of course you can specify it explicitly). In my experiments, xlsxwriter was 4x more memory efficient than openpyxl at the task of writing to Excel. This not an infinitely scalable solution - it's still conceivable that one could receive a dataset so large that it still overwhelms memory even with this optimization - but it's extremely easy: Just pip install xlsxwriter and you get a 4x bump in memory use when calling df.to_excel(), with no code changes (in my case).
Related
Context: I am attempting to automate a report that is rather complicated (not conceptually, just in the sheer volume of things to keep track of). The method I settled on after a lot of investigation was to;
Create a template xlsx file which has a couple summary pages containing formulas pointing at other (raw data) sheets within the file.
Pull data from SQL Server and insert into template file, overwriting the raw data sheet with relevant data.
Publish report (Most likely this will just be moving xlsx file to a new directory).
Obviously, I have spent a lot of time looking at other peoples solutions to this issue (as this topic has been discussed a lot). The issue I have found however is that (at least in my search) none of the methods purposed have worked for me, my belief is that the previously correct responses are no longer relevant in current versions of pandas etc. Rather than linking to the dozens of articles attempting to answer this question, I will explain the issues I have had with various solutions.
Using openpyxl instead of xlsxwriter - This resulted in "BadZipFile: File is not a zip file" response. Which as I understand pertains to the pandas version, or rather the fix (mode='a') does not work due to the pandas version (I believe anything beyond 1.2 has this issue).
Helper Function This does not work however, also throws the BadZipFile error.
Below is a heavily redacted version of the code which should give all the required detail.
#Imports
import os
import pyodbc
import numpy as np
import shutil
import pandas as pd
import datetime
from datetime import date
from openpyxl import load_workbook
# Set database connection variables.
cnxn = pyodbc.connect(*Credentials*)
cursor = cnxn.cursor()
df = pd.read_sql_query(script, cnxn)
df.to_excel(writer, sheet_name = 'Some Sheet',index=False)
writer.close()
Long story short here, I am finding it very frustrating that what should be very very simple is turning into a multiple day long exercise. Please if anyone has experience with this and could offer some insight I would be very grateful.
Finally, I have to admit that I am quite new to using python, though I have not found the transition too difficult until today. Most of the issues I have been having are easily solvable (for me), with the exception of this issue. If there is something I have somehow completely missed, put me on the track and I will not be a bother.
Okay, so I found that I was infact incorrect (big surprise). That is, my statement that the helper function does not work. It does work, the ZipFile issue was most likely caused by some form of protection on the workbook. Funny thing is, I was able to get it working with a new workbook, but when I changed the name of the new workbook it again started throwing the ZipFile error. After awhile of creating new files and trying different things I eventually got it to work.
Two things I would note about the helper function;
It is not particularly efficient. At least not in the way I have set it up. I replaced all instances of 'to_excel' with 'append_df_to_excel' from the helper function. Doing this resulted in run time going from about 1-2 minutes to well over 10. I will do some more testing and see why this might be (I will post back if I find something intersting), but just something to watch for if using larger datasets.
Not an issue as such, but for me to get this to work as expected, I had to alter the function slightly. Specifically, in order to use the truncate feature in my situation, I needed to move the 'truncate' section to be above the 'firstrow' section. In my situation it made more sense to do that, rather than to specify the start row prior to truncating the sheet.
Hope this helps anyone running into the same issue.
Lesson learned, as always the information is out there, its just a matter of actually paying close attention and trying things out rather than copy paste and scratching your head when things aren't working.
I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet function uses as the engine for parquet files. Unfortunately, it seems that while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to repeat running the code since this will cause another freeze - I don't know the verbatim error message).
Is there a good way to somehow write some part of the parquet file to memory without this occurring? I know that parquet files are columnar and it may not be possible to store only a part of the records to memory, but I'd like to potentially split it up if there is a workaround or perhaps see if I am doing anything wrong while trying to read this in.
I do have a relatively weak computer in terms of specs, with only 6 GB memory and i3. The CPU is 2.2 GHz with Turbo Boost available.
Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like google colab) to load the parquet file and then save it as hdf. Once you have it, you can use it in chunks.
You can use Dask instead of pandas. It it is built on pandas, so has similar API that you will likely be familiar with, and is meant for larger data.
https://examples.dask.org/dataframes/01-data-access.html
Its possible to read parquet data in
batches
read certain row groups or iterate over row groups
read only certain columns
This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
In case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('example.parquet')
for i in parquet_file.iter_batches(batch_size=1000):
print("RecordBatch")
print(i.to_pandas())
Above example simply reads 1000 records at a time. You can further limit this to certain row groups or even certain columns like below.
for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0,2,3]):
I have a pretty complex excel file that includes pivot tables and sizes about 70 MB, and what I need is to edit one single cell with a script in Python. I'm trying openpyxl.
The problem is that it runs out of memory with no more than opening the file. Do you see any way around?
You can try pandas.read_excel. It may be better optimized for your purpose (reading one cell from one sheet).
I have a ridiculously simple python script that uses the arcpy module. I turned it into a script tool in arcmap and am running it that way. It works just fine, I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and currently it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" represents the second number after the comma in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. Seems you should be able to more directly read your *.dbf file with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "WITH"/"AS" keywords (PEP 343, Raymond Hettinger has a good video on youtube, too) or iterators in general (see PEPs 234, 255), which only load a record at a time.
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competing between i/o requests. iPython is an add-on for python that has a pretty easy to use, high-level pacakge, "parallel", if you want an easy place to start. Lots of pretty good videos on youtube from PyCon 2012. There's a 3 hour one where the parallel stuff starts at 2:13:00 or so.
We are using OpenPyxl to export MySQL content to Microsoft Excel in XSLX format
https://bitbucket.org/ericgazoni/openpyxl/overview
However, the amount of data we are dealing with is big. We are running to out of memory situation. Tables may contain up to 400 columns in 50000+ rows. Even the files are big, they are not that big that Microsoft Excel or OpenOffice should have problems with them.
We are assuming our issues mainly stem from the fact that Python keeps XML DOM structure in memory in not efficient enough manner.
EDIT: Eric, the author of OpenPyxl, pointed out that there is an option to make OpenPyxl write with fixed memory usage. However, this didn't solve our problem completely, as we still have issues with raw speed and something else taking up too much memory in Python.
Now we are looking for more efficient ways to create Excel files. With Python preferably, but if we cannot find a good solution we might want to look other programming languages as well.
Options, not in any specific order, include
1) Using OpenOffice and PyUno and hope their memory structures are more efficient than with OpenPyxl and the TCP/IP call bridge is efficient enough
2) Openpyxl uses xml.etree. Would Python lxml (libxml2 native extension) be more efficient wit XML memory structures and is it possible to replace xml.etree directly with lxml drop-in e.g. with monkey-patching? (later the changes could be contributed back to Openpyxl if there is a clear benefit)
3) Export from MySQL to CSV and then post-process CSV files directly to XSLX using Python and file iteration
4) Use other programming languages and libraries (Java)
Pointers:
http://dev.lethain.com/handling-very-large-csv-and-xml-files-in-python/
http://enginoz.wordpress.com/2010/03/31/writing-xlsx-with-java/
If you're going to use Java, you will want to use Apache POI, but likely not the regular UserModel as you're wanting to keep your memory footprint down.
Instead, take a look at BigGridDemo, which shows you how to write a very large xlsx file using POI, with most of the work not happening in memory.
You might also find that the technique used in the BigGridDemo could equally be used in Python?
Have you tried to look at the optimized writer for openpyxl ? It's a recent feature (2 months old), but it's quite robust (used in production in several corporate projects) and can handle almost indefinite amount of data with steady memory consumption (around 7Mb)
http://packages.python.org/openpyxl/optimized.html#optimized-writer