Using Python's pickle in Sage results in high memory usage

Using Python's pickle in Sage results in high memory usage - python

I am using the Python based Sage Mathematics software to create a very long list of vectors. The list contains roughly 100,000,000 elements and sys.getsizeof() tells me that it is of size a little less than 1GB.
This list I pickle into a file (which already takes a long time -- but fair enough). Only when I unpickle this list it gets annoying. The RAM usage increases from 1.15GB to 4.3GB, and I am wondering what's going on?
How can I find out in Sage what all the memory is used for? And do you have any ideas how to optimize this by maybe applying Python tricks?
This is a reply to the comment of kcrisman.
The exact code I cannot post since it would be too long. But here is a simple example where the phenomena can be observed. I am working on Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux.
Start Sage and execute:
import pickle
L = [vector([1,2,3]) for k in range(1000000)]
f = open("mylist", 'w')
pickle.dump(L, f)
On my system the list is 8697472 bytes big, and the file I pickled into has roughly 130MB. Now close Sage and watch your memory (with htop, for example). Then execute the following lines:
import pickle
f = open("mylist", 'r')
pickle.load(f)
Without sage my Linux system uses 1035MB of memory, when Sage is running the usage increases to 1131MB. After I unpickled the file it uses 2535MB which I find odd.

It's probably better to not use python's pickle module directly. cPickle is already a bit better, but a lot of pickling in sage assumes protocol 2, which (c)Pickle doesn't default to. You can use sage's own wrappers of pickle. If I do your example with
sage: open("mylist",'w').write(dumps(L))
and then load it in a fresh session via
sage: L = loads(open("mylist",'r').read())
I observe no problems.
Note that the above interface is not the best one to pickle/unpickle in sage to a file. You'd be better off using save/load. I just did it that way to stay as close as possible to your example.

Related

Fast zip decryption in python

I've a program which process a zip file using zipfile. It works with an iterator, since the uncompressed file is bigger than 2GB and it can become a memory problem.
with zipfile.Zipfile(BytesIO(my_file)) as myzip:
for file_inside in myzip.namelist():
with myzip.open(file_inside) as file:
# Process here
# for loop ....
Then I noticed that this process was being extremely slow to process my file. And I can understand that it may take some time, but at least it should use my machine resources: lets say the python process should use at 100% the core where it lives.
Since it doesn't, I started researching the possible root causes. I'm not an expert in compression matters, so first considered basic things:
Resources seem not to be the problem, there's plenty RAM available even if my coding approach wouldn't use it.
CPU is not in high level usage, not even for one core.
The file being open is just about 80MB when compressed, so disk reading should not be a slowing issue either.
This made me to think that the bottleneck could be in the most invisible parameters: RAM bandwidth. However I have no idea how could I measure this.
Then on the software side, I found on the zipfile docs:
Decryption is extremely slow as it is implemented in native Python rather than C.
I guess that if it's using native Python, it's not even using OpenGL acceleration so another point for slowliness. I'm also curious about how this method works, again because of the low CPU usage.
So my question is of course, how could I work in a similar way (not having the full uncompress file in RAM), but uncompressing in a faster way in Python? Is there another library or maybe another approach to overcome this slowliness?

There is this lib for python to handle zipping files without memory hassle.
Quoted from the docs:
Buzon - ZipFly
ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without memory inflation.
Never used but can help.

I've done some research and found the following:
You could "pip install czipfile", more information at https://pypi.org/project/czipfile/
Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/
Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?

It's quite stupid that Python doesn't implement zip decryption in pure c.
So I make it in cython, which is 17 times faster.
Just get the dezip.pyx and setup.py from this gist.
https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec
And install cython and build a cython library
pip3 install cython
python3 setup.py build_ext --inplace
Then run the original script with two more lines.
import zipfile
# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)
z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')

Python sharedmem access from Lua

I have to repeatedly move a large array from Lua to Python. Currently, I run the Lua code as a subprocess from Python and read the array from its stdout. This is much slower than I'd like, and the bottleneck seems to be almost entirely the Python p.stdout.read([byte size of array]) calls, as running the Lua code in isolation is much faster.
From what I have read, the only way to improve over pipes is to use shared memory, but this is (almost) always discussed in regards to multiprocessing between different Python processes instead of between Python and a subprocess.
Is there a reasonable way to share memory between Python and Lua? Related answers have suggested using direct calls to shm_open but I'd rather use prebuilt modules/packages if they exist.

Before going down the path of looking into shared memory I would suggest doing some profiling experiments to identify exactly where the time is being spent.
If your experiments prove you're spending too much time serializing/deserializing data between the processes then using shared memory along with a format designed to avoid that cost like Cap'n Proto could be a good solution.
A quick search turned up these two libraries:
lua-capnproto -
Lua-capnp is a pure lua implementation of capnproto based on luajit.
pycapnp - This is a python wrapping of the C++ implementation of the Cap’n Proto library.
But definitely do the profiling first.
Also is there a reason lupa wouldn't work for you?.

Here's the solution I found using Torch from Lua and NumPy from Python. The Lua code is run from Python using lupa.
In main.lua:
require 'torch'
data_array = torch.FloatTensor(256, 256)
function write_data()
return tonumber(torch.data(data_array:contiguous(), true))
end
From Python:
import ctypes
import lupa
import numpy as np
data_shape = (256, 256)
lua = lupa.LuaRuntime()
with open('main.lua') as f: lua.execute(f.read())
data_array = np.ctypeslib.as_array(ctypes.cast(ctypes.c_void_p(lua.globals().write_data()), ctypes.POINTER(ctypes.c_float)), shape=data_shape)
data_array is constructed to point at the storage of the Torch tensor.

Huge memory usage of Python's json module?

When I load the file into json, pythons memory usage spikes to about 1.8GB and I can't seem to get that memory to be released. I put together a test case that's very simple:
with open("test_file.json", 'r') as f:
j = json.load(f)
I'm sorry that I can't provide a sample json file, my test file has a lot of sensitive information, but for context, I'm dealing with a file in the order of 240MB. After running the above 2 lines I have the previously mentioned 1.8GB of memory in use. If I then do del j memory usage doesn't drop at all. If I follow that with a gc.collect() it still doesn't drop. I even tried unloading the json module and running another gc.collect.
I'm trying to run some memory profiling but heapy has been churning 100% CPU for about an hour now and has yet to produce any output.
Does anyone have any ideas? I've also tried the above using cjson rather than the packaged json module. cjson used about 30% less memory but otherwise displayed exactly the same issues.
I'm running Python 2.7.2 on Ubuntu server 11.10.
I'm happy to load up any memory profiler and see if it does better then heapy and provide any diagnostics you might think are necessary. I'm hunting around for a large test json file that I can provide for anyone else to give it a go.

I think these two links address some interesting points about this not necessarily being a json issue, but rather just a "large object" issue and how memory works with python vs the operating system
See Why doesn't Python release the memory when I delete a large object? for why memory released from python is not necessarily reflected by the operating system:
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
About running large object processes in a subprocess to let the OS deal with cleaning up:
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.

Efficient writing to a Compact Flash in python

I'm writing a gui to do perform a glorified 'dd'.
I could just subprocess to 'dd' but I thought I might as well use python's open()/read()/write() if I can as it'll let me display progress much more easily.
Prompted by this link here I have:
input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
while True:
buffer = input.read(1024)
if buffer:
output.write(buffer)
else:
break
input.close()
output.close()
...however it is horribly slow. Or at least far slower than dd. (around 4-5x slower)
I had a play and noticed altering the number of bytes 'buffered' had a huge affect on the speed of completion. Raising it to 2048 for example seems to half the time taken. Perhaps going OT for SO here but I guess the flash has an optimum number of bytes to be written at once? Could anyone suggest how I discover this?
The image & card are 1Gb so I would very much like to return to the ~5 minutes dd took if possible. I appreciate that in all likelihood I won't match it.
Rather than trial and error, would anyone be able to suggest a way to optimise the above code and reasoning as to why it works? Especially what value for input.read() for example?
One restriction: python 2.4.3 on linux (centos5) (please don't hurt me)

Speed depending on buffer size is unrelated to the specific characteristics of compact flash, but inherent to all I/O with (relatively) slow devices, even to all kinds of system calls. You should make the buffer size as large as possible without exhausting memory - 2MiB should be enough for a Flash drive.
You should use the time and strace utilities to determine why your program is slower. If time shows a large user/real (large meaning greater than 0.1), you can optimize your Python interpreter - cpython 2.4 is pretty slow, and you're creating new objects all the time instead of writing into a preallocated buffer. If there is a significant difference in sys timings, analyze the syscalls made by both programs (with strace) and try to emit the ones dd does.
Also note that you must call fsync (or execute the sync program) afterwards to measure the real time it took writing the file to disk (or open the output file with O_DIRECT). Otherwise, the operating system will let your program exit and just keep all the written data in buffers that are then continually written out to the actual disk. To test that you're doing it right, remove the disk immediately after your program is finished. Note that the speed difference can be staggering. This effect is less noticeable if your disk(CF card) is way larger than the available physical memory.

So with a little help, I've removed the 'buffer' bit completely and added an os.fsync().
import os
input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
output.write(input.read())
input.close()
output.close()
outputfile.flush()
os.fsync(outputfile.fileno())

cProfile taking a lot of memory

I am attempting to profile my project in python, but I am running out of memory.
My project itself is fairly memory intensive, but even half-size runs are dieing with "MemoryError" when run under cProfile.
Doing smaller runs is not a good option, because we suspect that the run time is scaling super-linearly, and we are trying to discover which functions are dominating during large runs.
Why is cProfile taking so much memory? Can I make it take less? Is this normal?

Updated: Since cProfile is built into current versions of Python (the _lsprof extension) it should be using the main allocator. If this doesn't work for you, Python 2.7.1 has a --with-valgrind compiler option which causes it to switch to using malloc() at runtime. This is nice since it avoids having to use a suppressions file. You can build a version just for profiling, and then run your Python app under valgrind to look at all allocations made by the profiler as well as any C extensions which use custom allocation schemes.
(Rest of original answer follows):
Maybe try to see where the allocations are going. If you have a place in your code where you can periodically dump out the memory usage, you can use guppy to view the allocations:
import lxml.html
from guppy import hpy
hp = hpy()
trees = {}
for i in range(10):
# do something
trees[i] = lxml.html.fromstring("<html>")
print hp.heap()
# examine allocations for specific objects you suspect
print hp.iso(*trees.values())

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.