I have a big pickle file containing hundreds of trained R models in Python: these are statistical models built with the library rpy2.
I have a class that loads the pickle file every time one of its methods is called (this method is called several times in a loop).
It happens that the memory required to load the pickle file content (around 100 MB) is never freed, even when there is no reference pointing to the loaded content. I correctly open and close the input file. I have also tried to reload the pickle module (and even rpy) at every iteration. Nothing changes. It seems that just the fact of loading the content permanently locks some memory.
I can reproduce the issue, and this is now an open issue in the rpy2 issue tracker: https://bitbucket.org/rpy2/rpy2/issues/321/memory-leak-when-unpickling-r-objects
Edit: The issue is resolved and the fix is included in rpy2-2.7.5 (just released).
If you follow this advice, please do so tentatively, because I am not 100% sure of this solution, but I wanted to try to help you if I could.
In CPython, garbage collection is based primarily on reference counting: Python tracks how many references point to each object and frees the object as soon as that count drops to zero.
On top of that, Python runs a generational cyclic garbage collector on a schedule, because reference counting alone cannot reclaim groups of objects that reference each other in a cycle, and scanning for such cycles constantly would slow programs down.
In the case of your program, even though you no longer point to certain objects, Python might not have run a cycle collection yet, so you can trigger one manually using:
import gc      # standard-library interface to the garbage collector

gc.enable()    # make sure automatic (cyclic) collection is switched on
gc.collect()   # force a full collection right now
If you would like to read more, here is the link to the Python garbage collection documentation. I hope this helps, Marco!
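To connect this to the original question, here is a minimal, untested sketch of how the manual collection might look inside the loop; load_models(), tasks and process() are made-up names standing in for the actual code:

import gc
import pickle

def load_models(path):
    # hypothetical helper that unpickles the models on each iteration
    with open(path, 'rb') as f:
        return pickle.load(f)

for task in tasks:                # 'tasks' stands in for whatever drives the caller's loop
    models = load_models('models.pkl')
    process(task, models)         # hypothetical per-iteration work
    del models                    # drop the last reference to the unpickled content...
    gc.collect()                  # ...and ask Python to run a collection right away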
Summary:
Python process is not managing memory as expected, resulting in the process getting killed.
Details:
I'm making an app in Python that manages huge image data (hundreds of 32-bit 3000x3000 px images). I'm trying to manage the data in the most storage-efficient and memory-efficient way by following OOP principles, saving the data in optimized formats, loading the data in minimal batches and keeping almost all variables out of the "main" scope.
However, I'm facing a problem that I'm unable to diagnose. After running a method, the memory usage skyrockets from 40% to 80%. This method opens multiple stacks of images in napari, so it is expected to use that much memory (nevertheless, I should optimize it).
The issue arises when exiting this method, as the memory is not freed. This means that running the method twice, or performing any other intense work afterwards, fills up the memory and makes the program crash. The method runs outside the "main" scope. I've printed the local and global variables from the "main" scope before and after running this method:
Before the issue:
After the issue:
I already tried:
Running gc.collect() from the main scope and making sure from the debugger that no napari-related object exists after the execution of the method.
Maybe there is some variable not shown by locals().items() or globals().items(), or maybe I simply don't understand how Python allocates memory at all. This is my first time dealing with memory management and garbage collection in Python, so any information will be highly appreciated.
Edit:
I've been playing with objgraph to track the memory leak, and I found that the garbage collector is not removing napari-related objects upon closing napari. This means that I should move this question to napari's issues page on GitHub. However, it would be highly appreciated if someone knew of a way of cleaning up all module-related objects, so I could just dump all the napari leftover trash. The alternative for the moment is just closing and re-running the script, but this is far from desired.
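For reference, a tentative sketch of the kind of objgraph checks described above; 'QtViewer' is only an example class name, not necessarily the one that actually leaks, and drawing the back-reference graph requires graphviz to be installed:

import gc
import objgraph

gc.collect()
objgraph.show_growth(limit=10)               # which object types keep growing between runs
leftovers = objgraph.by_type('QtViewer')     # instances that survived closing napari
if leftovers:
    # draw a reference graph explaining why the first leftover is still reachable
    objgraph.show_backrefs(leftovers[:1], max_depth=3, filename='backrefs.png')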
I imported some packages in a Python program, but a lot of them are only used at the beginning of the program. Are those packages kept in memory until the whole program finishes running? Can I delete them after they have been used, to release memory?
I don't think it is possible - but even if it is, the memory gained by erasing a loaded module is usually minimal.
Running del module in the code of some file will not remove it from memory: all loaded modules are available from the global sys.modules dictionary. If you delete the module from sys.modules, you can gain some space, but I think Python will still hold references to at least the classes that were defined in the module (even if you have no instances of, or other references to, those classes in your code).
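For illustration, a rough sketch of what such an attempt looks like, using the standard-library json module as a stand-in for whatever you want to unload:

import gc
import sys
import json                       # stand-in for the module you want to unload

# ... use json during start-up ...

del json                          # drop your own name for it
sys.modules.pop('json', None)     # drop the interpreter-wide reference
gc.collect()                      # reclaims whatever is truly unreachable (usually very little)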
All in all, most modules would get you a few kilobytes back. If you plan to "unload" modules that have a multi-megabyte impact on memory (think of pandas/matplotlib/SQLAlchemy), you will be even less likely to succeed, as the complexity of these code bases means they have a lot of internal cross-references, and you'd need to delete everything they import internally from sys.modules.
If the setup is based on a multi-megabyte framework like these, and you are running on a very limited system (a Raspberry-class mini PC with a few MB of RAM, for example), you could try to perform the initialisation in a subprocess, and kill that process when you are done - that would give you some guarantee that the memory is freed up. But limited hardware aside, you should simply not bother about it.
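A hedged sketch of that subprocess idea: do the heavy start-up work in a child process so its memory goes back to the OS when the child exits. heavy_startup, data.csv and result.json are made-up names used only for illustration:

import json
import multiprocessing as mp

def heavy_startup(path):
    import pandas as pd                        # the heavy import lives only in the child
    summary = {'rows': int(pd.read_csv(path).shape[0])}
    with open('result.json', 'w') as f:        # hand the small result back via a file
        json.dump(summary, f)

if __name__ == '__main__':
    p = mp.Process(target=heavy_startup, args=('data.csv',))
    p.start()
    p.join()                                   # the child exits here and its memory is freed
    with open('result.json') as f:
        summary = json.load(f)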
At the start of my code, I load in a huge (33GB) pickled object. This object is essentially a huge graph with many interconnected nodes.
Periodically, I run gc.collect(). When I have the huge object loaded in, this takes 100 seconds. When I change my code to not load in the huge object, gc.collect() takes 0.5 seconds. I assume that this is caused by Python checking through every sub-object of this object for reference cycles every time I call gc.collect().
I know that neither the huge object, nor any of the objects it references when it is loaded in at the start, will ever need to be garbage collected. How do I tell Python this, so that I can avoid the 100-second gc time?
In Python 3.7 you might be able to hack something together using gc.freeze: https://docs.python.org/3/library/gc.html#gc.freeze
import gc

allocate_a_lot()       # placeholder for loading the huge pickled graph
gc.freeze()            # move all objects allocated so far to the permanent generation; they will not be scanned again
allocate_some_more()
gc.collect()           # collect only the non-frozen objects
gc.unfreeze()          # return to normal behaviour
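Applied to the question's pickle load, a tentative version (Python 3.7+, with 'huge_graph.pkl' as a placeholder filename) might look like this:

import gc
import pickle

gc.disable()                       # avoid collections while the 33 GB graph is being unpickled
with open('huge_graph.pkl', 'rb') as f:
    graph = pickle.load(f)
gc.freeze()                        # everything allocated so far goes to the permanent generation
gc.enable()

# later, periodic collections only scan objects created after the freeze
gc.collect()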
That said, I think Python does not offer the tools for what you want. In general, garbage-collected languages do not want you to do manual memory management.
I'm working with fairly large DataFrames and text files (thousands of docs) that I am opening up in my IPython notebook. I'm noticing that after a while, my computer becomes really slow. Is there a way to take inventory of my Python program to find out what's slowing down my computer?
You have a few options. First, you can use third-party tools like heapy or PySizer to evaluate your memory usage at different points in your program. This (now closed) SO question discusses them a little bit. Additionally, there is a third option simply called memory_profiler, hosted here on GitHub, and according to this blog there are some special shortcuts in IPython for memory_profiler.
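For instance, a rough sketch of the memory_profiler shortcuts in an IPython/Jupyter session; build_frame() is a made-up function used only for illustration:

%load_ext memory_profiler

import pandas as pd

def build_frame(n):
    return pd.DataFrame({'x': range(n)})

# report the peak memory used while running a single statement
%memit build_frame(1000000)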
Once you have identified the data structures that are consuming the most memory, there are a few options:
Refactor to take advantage of garbage collection
Examine the flow of data through your program and see if there are any places where large data structures are kept around when they don't need to be. If you have a large data structure that you do some processing on, put that processing in a function and return the processed result, so the original memory hog can go out of scope and be destroyed.
A comment suggested using the del statement. Although the commenter is correct that it will free memory, it really should indicate to you that your program isn't structured correctly. Python has good garbage collection, and if you find yourself manually messing with memory freeing, you should probably put that section of code in a function or method instead, and let the garbage collector do its thing.
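A small, hypothetical example of that kind of refactoring; summarise(), 'big_file.csv' and the 'value' column are made up:

import pandas as pd

def summarise(path):
    frame = pd.read_csv(path)           # the large DataFrame only exists inside this function
    return frame['value'].describe()    # only the small summary escapes

stats = summarise('big_file.csv')       # after this call the big DataFrame can be collected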
Temporary Files
If you really need access to large data structures (almost) simultaneously, consider writing one or several of them to temporary files while not needed. You can use the JSON or Pickle libraries to write stuff out in sophisticated formats, or simply pprint your data to a file and read it back in later.
I know that seems like some kind of manual hard disk thrashing, but it gives you great control over exactly when the writes to and reads from the hard disk occur. Also, in this case only your files are bouncing on and off the disk. When you use up your memory and swapping starts occurring, everything gets bounced around - data files, program instructions, memory page tables, etc... Everything grinds to a halt instead of just your program running a little more slowly.
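A rough sketch of that spill-to-disk pattern, assuming a large object named big_frame already exists in memory:

import pickle
import tempfile

with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as tmp:
    pickle.dump(big_frame, tmp)          # park the large object on disk
    spill_path = tmp.name

del big_frame                            # let the in-memory copy be reclaimed

# ... later, when it is needed again ...
with open(spill_path, 'rb') as f:
    big_frame = pickle.load(f)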
Buy More Memory
Yes, this is an option. But like the del statement, it can usually be avoided by more careful data abstraction and should be a last resort, reserved for special cases.
IPython is a wonderful tool, but sometimes it tends to slow things down.
If you have large print output statements, lots of graphics, or your code has grown too big, the autosave takes forever to snapshot your notebooks. Try autosaving less frequently with:
%autosave 300
Or disabling it entirely:
%autosave 0
Here's the situation: I have a massive object that needs to be loaded into memory. So big that if it is loaded in twice it will go beyond the available memory on my machine (and no, I can't upgrade the memory). I also can't divide it up into any smaller pieces. For simplicity's sake, let's just say the object is 600 MB and I only have 1 GB of RAM.
I need to use this object from a web app, which is running in multiple processes, and I don't control how they're spawned (a third-party load balancer does that), so I can't rely on just creating the object in some master thread/process and then spawning off children. This also eliminates the possibility of using something like POSH, because that relies on its own custom fork call. I also can't use something like a SQLite memory database, mmap or the posix_ipc, sysv_ipc, and shm modules, because those act as a file in memory, and this data has to be an object for me to use it. Using one of those, I would have to read it as a file and then turn it into an object in each individual process and BAM, segmentation fault from going over the machine's memory limit because I just tried to load in a second copy.
There must be some way to store a Python object in memory (and not as a file/string/serialized/pickled object) and have it be accessible from any process. I just don't know what it is. I've looked all over StackOverflow and Google and can't find the answer to this, so I'm hoping somebody can help me out.
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
Look for "shared memory" or "server process" in those docs. After re-reading your post, the server process approach sounds closer to what you want.
http://en.wikipedia.org/wiki/Shared_memory
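A minimal sketch of the server-process idea with multiprocessing.managers, where one dedicated process holds the object and the web workers talk to it through a proxy. The class name, address and authkey are placeholders, and note that only method calls (not plain attribute access) are forwarded through the proxy:

from multiprocessing.managers import BaseManager

class BigObjectManager(BaseManager):
    pass

# in the dedicated server process (started once, before the web workers)
def serve(big_object):
    BigObjectManager.register('get_object', callable=lambda: big_object)
    manager = BigObjectManager(address=('127.0.0.1', 50000), authkey=b'secret')
    manager.get_server().serve_forever()

# in each web worker process
def get_shared_object():
    BigObjectManager.register('get_object')
    manager = BigObjectManager(address=('127.0.0.1', 50000), authkey=b'secret')
    manager.connect()
    return manager.get_object()          # a proxy; calls are executed in the server process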
There must be some way to store a Python object in memory (and not as a file/string/serialized/pickled object) and have it be accessible from any process.
That isn't the way it works. Python object reference counting and an object's internal pointers do not make sense across multiple processes.
If the data doesn't have to be an actual Python object, you can try working on the raw data stored in mmap() or in a database or somesuch.
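For example, a hedged sketch of sharing raw bytes through mmap, assuming the data has already been written to a flat file named 'big_data.bin':

import mmap

with open('big_data.bin', 'rb') as f:
    # length 0 maps the whole file; ACCESS_READ keeps it read-only, so every
    # process mapping the same file shares one copy of the pages in memory
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = buf[:16]              # slices behave like bytes
    buf.close()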
I would implement this as a C module that gets imported into each Python script. Then the interface to this large object would be implemented in C, or some combination of C and Python.