We are using OpenPyxl to export MySQL content to Microsoft Excel in XLSX format
https://bitbucket.org/ericgazoni/openpyxl/overview
However, the amount of data we are dealing with is big. We are running into out-of-memory situations. Tables may contain up to 400 columns in 50,000+ rows. Even though the files are big, they are not so big that Microsoft Excel or OpenOffice should have problems with them.
We assume our issues mainly stem from the fact that Python keeps the XML DOM structure in memory in an insufficiently efficient manner.
EDIT: Eric, the author of OpenPyxl, pointed out that there is an option to make OpenPyxl write with fixed memory usage. However, this didn't solve our problem completely, as we still have issues with raw speed and with something else taking up too much memory in Python.
Now we are looking for more efficient ways to create Excel files, preferably with Python, but if we cannot find a good solution we might want to look at other programming languages as well.
Options, in no particular order, include:
1) Using OpenOffice and PyUno, and hoping that their memory structures are more efficient than OpenPyxl's and that the TCP/IP call bridge is efficient enough
2) OpenPyxl uses xml.etree. Would Python lxml (a native libxml2 extension) be more efficient with XML memory structures, and is it possible to replace xml.etree directly with lxml as a drop-in, e.g. with monkey-patching? (Later the changes could be contributed back to OpenPyxl if there is a clear benefit.)
3) Export from MySQL to CSV and then post-process the CSV files directly to XLSX using Python and file iteration
4) Use other programming languages and libraries (Java)
Pointers:
http://dev.lethain.com/handling-very-large-csv-and-xml-files-in-python/
http://enginoz.wordpress.com/2010/03/31/writing-xlsx-with-java/
If you're going to use Java, you will want to use Apache POI, but likely not the regular UserModel, as you want to keep your memory footprint down.
Instead, take a look at BigGridDemo, which shows you how to write a very large xlsx file using POI, with most of the work not happening in memory.
You might also find that the technique used in the BigGridDemo could equally be used in Python?
Have you tried to look at the optimized writer for openpyxl? It's a recent feature (2 months old), but it's quite robust (used in production in several corporate projects) and can handle an almost indefinite amount of data with steady memory consumption (around 7 MB)
http://packages.python.org/openpyxl/optimized.html#optimized-writer
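For reference, a minimal sketch of the streaming writer (the MySQL row iterable here is hypothetical, and the flag name has varied between openpyxl versions):

import pyxl
from openpyxl import Workbook

# Write-only mode streams rows straight to disk with roughly constant memory
# (older openpyxl releases exposed this as Workbook(optimized_write=True))
wb = Workbook(write_only=True)
ws = wb.create_sheet()

for row in rows_from_mysql:   # hypothetical iterable of row tuples from the database
    ws.append(row)

wb.save('big_export.xlsx')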
I am successfully writing dataframes to Excel using df.to_excel(). Unfortunately, this is slow and consumes gobs of memory. The larger the dataset, the more memory it consumes, until (with the largest datasets I need to deal with) the server starves for resources.
I found that using the df.to_csv() method instead offers the convenient chunksize=nnnn argument. This is far faster and consumes almost no extra memory. Wonderful! I'll just write initially to .csv, then convert the .csv to .xlsx in another process. I tried doing it with openpyxl, but every solution I found iterated through the csv data one row at a time, appending to a Workbook sheet, e.g.
import csv
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
with open(temporary_filepath, 'r') as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save(output_filepath)
This works, but when I watch my resource monitor, it consumes just as much memory and is just as slow (I now assume the original df.to_excel() was doing the same thing internally). So this approach didn't get me out of the woods after all.
I had assumed I could do something that would be the equivalent of File | Save As, but in Python, e.g. read the whole csv into an openpyxl Workbook and save it to a file all in one go, without iterating, but either this is not possible or I can't find the documentation on it.
Given a very large Pandas dataframe and a requirement to output .xlsx (not .csv), what is the best approach for low memory consumption? Can it be done efficiently with Pandas or Openpyxl, or is there a better tool for the job?
Update: Looks like pyexcel has a Save As method that might do the trick. Would prefer not to add yet another spreadsheet lib to the stack if possible, but will do if there is no equivalent in pandas or openpyxl. Has anyone used that successfully?
You could probably use the pyexcelerate library - https://github.com/kz26/PyExcelerate. They have posted benchmarks on their GitHub repo:
from pyexcelerate import Workbook
values = [df.columns] + list(df.values)
wb = Workbook()
wb.new_sheet('data_sheet_name', data=values)
wb.save('data.xlsx')
The pyexcelerate response is exactly what I asked about, so I accepted that answer, but just wanted to post an update that we found an alternate solution that's possibly even easier. Sharing here in case it's useful.
Pandas now prefers xlsxwriter over openpyxl. If it's installed and you do not specify the engine, xlsxwriter will be used by default (or of course you can specify it explicitly). In my experiments, xlsxwriter was 4x more memory efficient than openpyxl at the task of writing to Excel. This is not an infinitely scalable solution - it's still conceivable that one could receive a dataset so large that it overwhelms memory even with this optimization - but it's extremely easy: just pip install xlsxwriter and you get a roughly 4x improvement in memory use when calling df.to_excel(), with no code changes (in my case).
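For reference, a minimal sketch of selecting the engine explicitly (the DataFrame here is just a placeholder):

import pandas as pd

# Placeholder DataFrame; in practice this would be the large dataset
df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})

# Explicitly select the xlsxwriter engine (it must be pip-installed)
df.to_excel("output.xlsx", engine="xlsxwriter", index=False)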
I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet function uses as the engine for parquet files. Unfortunately, it seems that while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to repeat running the code since this will cause another freeze - I don't know the verbatim error message).
Is there a good way to read only part of the parquet file into memory without this occurring? I know that parquet files are columnar and it may not be possible to store only a part of the records in memory, but I'd like to potentially split it up if there is a workaround, or perhaps see if I am doing anything wrong while trying to read this in.
I do have a relatively weak computer in terms of specs, with only 6 GB of memory and an i3 CPU at 2.2 GHz with Turbo Boost available.
Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like Google Colab) to load the parquet file and then save it as HDF. Once you have it, you can use it in chunks.
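As a minimal sketch of the column-selection idea (file and column names here are hypothetical):

import pandas as pd

# Only the listed columns are read into memory; names are hypothetical
df = pd.read_parquet("example.parquet", columns=["user_id", "user_address"])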
You can use Dask instead of pandas. It is built on top of pandas, so it has a similar API that you will likely be familiar with, and it is meant for larger data.
https://examples.dask.org/dataframes/01-data-access.html
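A minimal sketch of the Dask approach (file path and column name are hypothetical):

import dask.dataframe as dd

# Dask reads the Parquet file lazily, partition by partition
ddf = dd.read_parquet("example.parquet")

# Work is only executed on .compute(), keeping memory use bounded
print(ddf["user_address"].count().compute())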
It's possible to read parquet data in batches, to read certain row groups (or iterate over row groups), and to read only certain columns.
This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
In the case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
for i in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(i.to_pandas())
The above example simply reads 1000 records at a time. You can further limit this to certain row groups or even certain columns, as below:
for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0,2,3]):
I know that xlwings uses Windows COM and such, and based on this: https://support.microsoft.com/en-us/help/196776/office-automation-using-visual-c (8th question):
A common cause of speed problems with Automation is with repetitive reading and writing of data. This is typical for Excel Automation clients.
And that's exactly what I am doing: tons of reading and writing. Later on I can see that EXCEL.exe is taking up 50% of my CPU usage and my Python script has kind of stopped (it just halts there, but python.exe still shows up in Task Manager).
Now, is there any way to work around this? I ask because, continuing the quote above, Microsoft says:
However, most people aren't aware that this data can usually be written or read all at once using SAFEARRAY.
So I guess there's a way to work like this in Python using xlwings?
Please note that there are things I can't do with other libraries, like "getting the values of the cells which are visible to the user; all I get is formulas". So I guess xlwings is the way to go. Thanks.
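As a rough sketch of the bulk-transfer idea with xlwings (workbook name and range are hypothetical), reading and writing whole ranges in single calls rather than cell by cell:

import xlwings as xw

wb = xw.Book('example.xlsx')        # hypothetical workbook
sht = wb.sheets[0]

# One COM call reads the whole contiguous block as a list of lists
data = sht.range('A1').expand().value

# ... process data in plain Python ...

# One COM call writes the whole block back
sht.range('A1').value = data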
I have a rather complex database which I deliver in CSV format to my client. The logic to arrive at that database is an intricate mix of Python processing and SQL joins done in sqlite3.
There are ~15 source datasets ranging from a few hundreds records to as many as several million (but fairly short) records.
Instead of having a mix of Python / sqlite3 logic, for clarity, maintainability and several other reasons I would love to move ALL logic to an efficient set of Python scripts and circumvent sqlite3 altogether.
I understand that the answer and the path to go would be Pandas, but could you please advise if this is the right track for a rather large database like the one described above?
I have been using Pandas with datasets > 20 GB in size (on a Mac with 8 GB RAM).
My main problem has been that there is a known bug in Python that makes it impossible to write files larger than 2 GB on OSX. However, using HDF5 circumvents that.
I found the tips in this and this article enough to make everything run without problem. The main lesson is to check the memory usage of your data frame and cast the types of the columns to the smallest possible data type.
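As a rough illustration of that lesson (the DataFrame here is a stand-in for the real data):

import pandas as pd

# Stand-in DataFrame; in practice this would be the loaded dataset
df = pd.DataFrame({"id": range(1_000_000), "flag": [0, 1] * 500_000})

# Check per-column memory usage
print(df.memory_usage(deep=True))

# Downcast columns to the smallest types that still fit the data
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["flag"] = df["flag"].astype("int8")

print(df.memory_usage(deep=True))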
I am running a simulation that saves some result in a file from which I take the data I need and save it in
result.dat
like so:
SAN.statisticsCollector.Network Response Time
Min: 0.210169
Max: 8781.55
average: 346.966666667
I do all this using Python, and it was easy to convert result.dat into an Excel file using xlwt. The problem is that creating charts with xlwt is not possible. I then came across JPype, but installation on my Ubuntu 12.04 machine was a headache. I'm probably just being lazy, but still - is there any other way, not necessarily Python-related, to convert result.dat into an Excel file with charts?
Thanks
P.S. The file I want to create is a spreadsheet, not Microsoft's Excel!
There is now a new possibility: http://pythonhosted.org/openpyxl/charts.html
and http://xlsxwriter.readthedocs.org/chart.html
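For example, a minimal XlsxWriter sketch (the output file name and data values are placeholders loosely based on the result.dat sample above):

import xlsxwriter

workbook = xlsxwriter.Workbook('results_chart.xlsx')   # placeholder file name
worksheet = workbook.add_worksheet()

# Placeholder values taken from the result.dat sample above
worksheet.write_column('A1', [0.210169, 346.966666667, 8781.55])

# Build a column chart over the written range and place it on the sheet
chart = workbook.add_chart({'type': 'column'})
chart.add_series({'values': '=Sheet1!$A$1:$A$3'})
worksheet.insert_chart('C1', chart)

workbook.close()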
The main problem is that currently there's no Python library that implements MS Excel chart creation and, obviously, they will not appear soon due to the lack of good chart format documentation (as the python-excel.org folks said) and its huge complexity.
There are two other options though:
Another option is to use third-party tools (like JPype, which you've mentioned), combining them with Python scripts. As far as I know, apart from the Java smartXML library there are no libraries capable of creating Excel charts (of course, there are ones for .NET, e.g. expertXLS), and I'm not sure they will run on Mono + IronPython, though you can try.
The third option is the Win32 COM API, e.g. as described in this SO post, which is not really an option for you due to your operating system.