I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:
1. read an input file
2. process the file and create a list of triangles, represented by their vertices
3. output the vertices in the OFF format: a list of vertices followed by a list of triangles. The triangles are represented by indices into the list of vertices
The requirement of OFF that I print out the complete list of vertices before I print out the triangles means that I have to hold the list of triangles in memory before I write the output to the file. In the meantime I'm getting memory errors because of the sizes of the lists.
What is the best way to tell Python that I no longer need some of the data, and it can be freed?
According to the official Python documentation, you can explicitly invoke the garbage collector to release unreferenced memory with gc.collect(). Example:
import gc
gc.collect()
You should do that after marking what you want to discard using del:
del my_array
del my_object
gc.collect()
Unfortunately (depending on your version and release of Python) some types of objects use "free lists" which are a neat local optimization but may cause memory fragmentation, specifically by making more and more memory "earmarked" for only objects of a certain type and thereby unavailable to the "general fund".
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
In your use case, it seems that the best way for the subprocesses to accumulate some results and yet ensure those results are available to the main process is to use semi-temporary files (by semi-temporary I mean, NOT the kind of files that automatically go away when closed, just ordinary files that you explicitly delete when you're all done with them).
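A minimal sketch of that pattern, assuming a hypothetical build_triangles() that does the memory-hungry work and writes its results to an ordinary scratch file (the names and the "processing" are placeholders, not your actual code):

import multiprocessing
import os

def build_triangles(input_path, scratch_path):
    # hypothetical memory-hungry work: build the big lists here and write
    # them to an ordinary file instead of returning them to the parent
    with open(input_path) as src, open(scratch_path, "w") as dst:
        for line in src:
            dst.write(line)  # placeholder for the real processing

if __name__ == "__main__":
    scratch = "triangles.scratch"
    worker = multiprocessing.Process(target=build_triangles,
                                     args=("input.dat", scratch))
    worker.start()
    worker.join()   # when the subprocess exits, the OS reclaims ALL its memory
    # ... read the scratch file here and write the final output ...
    os.remove(scratch)   # delete the semi-temporary file when you're all done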
The del statement might be of use, but IIRC it isn't guaranteed to free the memory. The docs are here ... and an explanation of why the memory isn't released is here.
I have heard of people on Linux and Unix-type systems forking a Python process to do some work, getting the results, and then killing it.
This article has notes on the Python garbage collector, but I think the lack of memory control is the downside of managed memory.
Python is garbage-collected, so if you reduce the size of your list, it will reclaim memory. You can also use the "del" statement to get rid of a variable completely:
biglist = [blah,blah,blah]
#...
del biglist
(del can be your friend, as it marks objects as deletable when there are no other references to them. However, the CPython interpreter often keeps this memory around for later use, so your operating system might not see the "freed" memory.)
Maybe you would not run into any memory problem in the first place by using a more compact structure for your data.
For example, lists of numbers are much less memory-efficient than the format used by the standard array module or the third-party numpy module. You would save memory by putting your vertices in a NumPy 3xN array and your triangles in an N-element array.
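Here is a rough sketch of what that could look like (the counts and dtypes are just assumptions about your data):

import numpy as np

n_vertices = 10 ** 6
n_triangles = 2 * 10 ** 6

# 3 coordinates per vertex in one compact block (float32 would halve this again)
vertices = np.zeros((3, n_vertices), dtype=np.float64)

# 3 vertex indices per triangle; 32-bit ints are plenty for a few million vertices
triangles = np.zeros((3, n_triangles), dtype=np.int32)

print(vertices.nbytes + triangles.nbytes, "bytes in total")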
You can't explicitly free memory. What you need to do is to make sure you don't keep references to objects. They will then be garbage collected, freeing the memory.
In your case, since you need large lists, you typically have to reorganize the code, for instance by using generators/iterators instead. That way you don't need to have the large lists in memory at all.
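A sketch of the idea, with purely hypothetical parsing (one vertex per line):

def read_vertices(path):
    # yields one vertex at a time instead of building a full list in memory
    with open(path) as f:
        for line in f:
            x, y, z = map(float, line.split())   # placeholder parsing
            yield (x, y, z)

# the consumer streams vertices straight to the output file,
# keeping only one of them in memory at any moment
with open("vertices.out", "w") as out:
    for vertex in read_vertices("input.dat"):
        out.write("%f %f %f\n" % vertex)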
I had a similar problem in reading a graph from a file. The processing included the computation of a 200,000 x 200,000 float matrix (one line at a time) that did not fit into memory. Trying to free the memory between computations using gc.collect() fixed the memory-related aspect of the problem, but it resulted in performance issues: I don't know why, but even though the amount of used memory remained constant, each new call to gc.collect() took a little more time than the previous one. So quite quickly the garbage collecting took most of the computation time.
To fix both the memory and performance issues I switched to a multithreading trick I read once somewhere (I'm sorry, I cannot find the related post anymore). Before, I was reading each line of the file in a big for loop, processing it, and running gc.collect() every once in a while to free memory space. Now I call a function that reads and processes a chunk of the file in a new thread. Once the thread ends, the memory is automatically freed without the strange performance issue.
Practically it works like this:
from dask import delayed  # this module wraps the multithreading

def f(storage, index, chunk_size):  # the processing function
    # read the chunk of size chunk_size starting at index in the file
    # process it using data in storage if needed
    # append data needed for further computations to storage
    return storage

partial_result = delayed([])  # put the constructor for your data structure into delayed()
# I personally use "delayed(nx.Graph())" since I am creating a networkx Graph

chunk_size = 100  # ideally as big as possible while still letting the computations fit in memory

for index in range(0, len(file), chunk_size):
    # this tells dask that we will want to apply f to the parameters partial_result, index, chunk_size
    partial_result = delayed(f)(partial_result, index, chunk_size)
    # no computations are done yet!
    # dask will spawn a thread to run f(partial_result, index, chunk_size) once we call partial_result.compute()
    # passing the previous "partial_result" variable as a parameter ensures a chunk is only processed
    # after the previous one is done
    # it also lets you use the results of processing the previous chunks of the file if needed

# this launches all the computations
result = partial_result.compute()
# one thread is spawned for each "delayed", one at a time, to compute its result
# dask then closes the thread, which solves the memory-freeing issue
# the strange performance issue with gc.collect() is also avoided
Others have posted some ways that you might be able to "coax" the Python interpreter into freeing the memory (or otherwise avoid having memory problems). Chances are you should try their ideas out first. However, I feel it important to give you a direct answer to your question.
There isn't really any way to directly tell Python to free memory. The fact of the matter is that if you want that low a level of control, you're going to have to write an extension in C or C++.
That said, there are some tools to help with this:
Cython
SWIG
Boost.Python
As other answers already say, Python can refrain from releasing memory to the OS even if it's no longer in use by Python code (so gc.collect() doesn't free anything), especially in a long-running program. Anyway, if you're on Linux you can try to release memory by invoking the libc function malloc_trim directly (man page).
Something like:
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)  # ask glibc to return any free heap memory to the OS
If you don't care about vertex reuse, you could have two output files--one for vertices and one for triangles. Then append the triangle file to the vertex file when you are done.
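A rough sketch of that approach, with a hypothetical generate_triangles() standing in for your processing step and duplicated vertices accepted:

n_vertices = 0
n_triangles = 0

with open("vertices.tmp", "w") as vf, open("triangles.tmp", "w") as tf:
    for triangle in generate_triangles():   # hypothetical: yields 3 (x, y, z) tuples per triangle
        for vertex in triangle:
            vf.write("%f %f %f\n" % vertex)
        tf.write("3 %d %d %d\n" % (n_vertices, n_vertices + 1, n_vertices + 2))
        n_vertices += 3
        n_triangles += 1

# stitch the two files together into the final OFF output
with open("mesh.off", "w") as out:
    out.write("OFF\n%d %d 0\n" % (n_vertices, n_triangles))
    for tmp in ("vertices.tmp", "triangles.tmp"):
        with open(tmp) as f:
            for line in f:
                out.write(line)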
I have a big pickle file containing hundreds of trained R models in Python: these are stats models built with the library rpy2.
I have a class that loads the pickle file every time one of its methods is called (this method is called several times in a loop).
It happens that the memory required to load the pickle file content (around 100 MB) is never freed, even when there is no reference pointing to the loaded content. I correctly open and close the input file. I have also tried to reload the pickle module (and even rpy) at every iteration. Nothing changes. It seems that just the act of loading the content permanently locks some memory.
I can reproduce the issue, and this is now an open issue in the rpy2 issue tracker: https://bitbucket.org/rpy2/rpy2/issues/321/memory-leak-when-unpickling-r-objects
edit: The issue is resolved and the fix is included in rpy2-2.7.5 (just released).
If you follow this advice, please do so tentatively because I am not 100% sure of this solution, but I wanted to try to help you if I could.
CPython's memory management is primarily based on reference counting: Python tracks how many references point to each object and frees the object as soon as that count drops to zero.
On top of that, Python runs a cyclic garbage collector on a schedule rather than immediately, to clean up reference cycles that reference counting alone cannot free; checking for cycles constantly would slow programs down (especially when it isn't needed).
In the case of your program, even though you no longer point to certain objects, Python might not have come around to freeing them from memory yet (for example if they are part of a reference cycle), so you can trigger a collection manually using:
gc.enable()   # make sure automatic garbage collection is turned on
gc.collect()  # force a full collection right now
If you would like to read more, here is the link to the Python garbage collection documentation. I hope this helps, Marco!
In a Python script, how can I get the memory usage of all variables in memory?
There are a few questions on here about getting the size or memory of a specified object, which is good, but I'm trying to find the variables using the most memory (since the code might be running on a machine with a memory limit and will throw an error if the memory used becomes too high), so I'd like to somehow profile the current state of ALL variables to see which are causing problems by being too big.
Perhaps something inside a loop through the values of locals(), but I'm not sure if there is a performance concern there that another method might avoid.
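Something like this is what I have in mind, though I'm aware sys.getsizeof only counts the object itself, not whatever it references, so it may be too shallow:

import sys

def biggest_variables(namespace, n=10):
    # sort the variables in the given namespace by their shallow size in bytes
    sizes = sorted(((sys.getsizeof(value), name) for name, value in namespace.items()),
                   reverse=True)
    return sizes[:n]

# called with globals() or locals() from the scope you care about
for size, name in biggest_variables(globals()):
    print(name, size, "bytes")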
I'm working with fairly large dataframes and text files (thousands of docs) that I am opening up in my IPython notebook. I'm noticing that after a while, my computer becomes really slow. Is there a way to take inventory of my Python program to find out what's slowing down my computer?
You have a few options. First, you can use third party tools like heapy or PySizer to evaluate your memory usage at different points in your program. This (now closed) SO question discusses them a little bit. Additionally, there is a third option simply called 'memory_profiler' hosted here on GitHub, and according to this blog there are some special shortcuts in IPython for memory_profiler.
Once you have identified the data structures that are consuming the most memory, there are a few options:
Refactor to take advantage of garbage collection
Examine the flow of data through your program and see if there are any places where large data structures are kept around when they don't need to be. If you have a large data structure that you do some processing on, put that processing in a function and return the processed result so the original memory hog can go out of scope and be destroyed (see the sketch below).
A comment suggested using the del statement. Although the commenter is correct that it will free memory, it really should indicate to you that your program isn't structured correctly. Python has good garbage collection, and if you find yourself manually messing with memory freeing, you should probably put that section of code in a function or method instead, and let the garbage collector do its thing.
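A minimal sketch of that refactoring, where load_rows() and summarize() are hypothetical placeholders for your own processing:

# load_rows() and summarize() are hypothetical placeholders, not real library calls
def summarize_file(path):
    rows = load_rows(path)       # the large, temporary data structure
    return summarize(rows)       # only the small result leaves the function;
                                 # "rows" goes out of scope here and can be reclaimed

summary = summarize_file("data.txt")
# no del, no gc.collect(): the garbage collector does its thing on its own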
Temporary Files
If you really need access to large data structures (almost) simultaneously, consider writing one or several of them to temporary files while they are not needed. You can use the json or pickle modules to write things out in structured formats, or simply pprint your data to a file and read it back in later.
I know that seems like some kind of manual hard disk thrashing, but it gives you great control over exactly when the writes to and reads from the hard disk occur. Also, in this case only your files are bouncing on and off the disk. When you use up your memory and swapping starts occurring, everything gets bounced around - data files, program instructions, memory page tables, etc... Everything grinds to a halt instead of just your program running a little more slowly.
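A small sketch of what that could look like with pickle and a temporary file (big_table is just a stand-in for one of your large structures):

import os
import pickle
import tempfile

big_table = [[i, i * i] for i in range(10 ** 6)]    # stand-in for a real memory hog

# park it on disk and drop the in-memory copy
fd, path = tempfile.mkstemp(suffix=".pkl")
with os.fdopen(fd, "wb") as f:
    pickle.dump(big_table, f)
del big_table

# ... do the other memory-hungry work here ...

# load it back only when it is needed again, then clean up
with open(path, "rb") as f:
    big_table = pickle.load(f)
os.remove(path)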
Buy More Memory
Yes, this is an option. But like the del statement, it can usually be avoided by more careful data abstraction and should be a last resort, reserved for special cases.
IPython is a wonderful tool, but sometimes it tends to slow things down.
If you have large print output statements, lots of graphics, or your code has grown too big, autosave takes forever to snapshot your notebooks. Try autosaving less frequently with:
%autosave 300
Or disabling it entirely:
%autosave 0
Does Python ctypes have a known memory leak? I am working on a Python script with code like the snippet below, using ctypes, that for some reason is causing a memory leak. The "while True" in this example is just to test for the leak caused by calling the function. It is being run on Windows with Python 2.5.4:
import ctypes

def hi():
    class c1(ctypes.Structure):
        _fields_ = [('f1', ctypes.c_uint8)]
    class c2(ctypes.Structure):
        _fields_ = [('g1', c1 * 2)]

while True:
    test = hi()
The leak can be observed using Process Explorer: as it keeps looping, Python keeps taking up more and more memory. It seems to require having two Structure subclasses where one of the classes has a "multiple" of the other one (using the * operator), but I'm not sure if the condition is more basic than that. Even if del test is added in the loop, it still leaks memory.
Any ideas on what might be causing this?
Edit: Because someone suggested it might not have garbage-collected yet, here is an edited version that does garbage-collect but still appears to leak memory:
import gc
import ctypes

def hi():
    class c1(ctypes.Structure):
        _fields_ = [('f1', ctypes.c_uint8)]
    class c2(ctypes.Structure):
        _fields_ = [('g1', c1 * 2)]

while True:
    test = hi()
    test2 = gc.collect()
That's not a memory leak, that just means the garbage collector hasn't run yet. And even if the garbage collector does run, odds are good that there's some kind of memory pooling going on.
Process Explorer isn't a good debugging tool, especially for memory.
The script in and of itself doesn't leak. Running it with gc.set_debug(gc.DEBUG_LEAK) shows that the created structure types are collectable, and gc.garbage remains empty in every loop iteration, so there are no uncollectable objects. Running the script under the time command on a Linux system doesn't show a steady increase in memory consumption either.