I am working with very large numpy/scipy arrays that take up a huge chunk of memory. Suppose my code looks something like the following:
def do_something(a):
    a = a / a.sum()  # new memory is allocated
    # I don't need the original a any longer; how do I delete it?
    # do a lot more stuff

# a = super large numpy array
do_something(a)
print(a)  # still the same as originally (the caller's reference is unchanged)
So I am calling a function with a huge numpy array. The function then processes the array in some way or another, but the original object is still kept in memory. Is there any way to free that memory from inside the function? Deleting the reference does not work.
What you want cannot be done; Python will only free the memory when all references to the array object are gone, and you cannot delete the a reference in the calling namespace from the function.
Instead, break up your problem into smaller steps. Do your calculations on a in one function, delete a in the caller, then call another function to do the rest of the work.
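A minimal sketch of that split (the array contents and sizes are made up for illustration):

import numpy as np

def normalize(a):
    # returns a new array; the caller decides when to drop the original
    return a / a.sum()

def do_more_stuff(normalized):
    pass  # work only with the normalized array from here on

a = np.ones((1000, 1000))  # hypothetical large array
normalized = normalize(a)
del a  # drop the caller's reference so the original buffer can be freed
do_more_stuff(normalized)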
Python uses a simple GC scheme: basically reference counting (there is a generational cycle collector too, but it is not relevant here). Every new reference to an object increments a counter, and every reference that goes away decrements it. The memory is deallocated only after the counter reaches zero, so as long as you hold a reference to an object, it stays in memory.
In your case the caller of do_something still has a reference to the object; if you want that variable to go away, you can reduce its scope.
If you suspect memory leaks, you can set the DEBUG_LEAK flag and inspect the output; more info here: https://docs.python.org/2/library/gc.html
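For instance, a minimal sketch of that workflow:

import gc

gc.set_debug(gc.DEBUG_LEAK)  # report objects the collector cannot free
# ... run the code you suspect of leaking ...
gc.collect()                 # debug output is written to stderr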
Related
I am writing a python class like this:
class MyImageProcessor:
    def __init__(self, image, metadata):
        self.image = image
        self.metadata = metadata
Both image and metadata are objects of a class written by a colleague. Now I need to make sure there is no waste of memory. I am thinking of defining a quit() method like this:
def quit(self):
    self.image = None
    self.metadata = None
    import gc
    gc.collect()
and suggest users call quit() systematically. I would like to know whether this is the right way. In particular, do the instructions in quit() above guarantee that unused memory is collected?
Alternatively, I could rename quit() to the built-in __exit__() and suggest users use the "with" syntax. But my question is more about whether the instructions in quit() indeed fulfill the garbage-collection work one would need in this situation.
Thank you for your help.
In Python every object has a built-in reference count, and the variables (names) you create are only pointers to objects. There are mutable and immutable objects (for example, if you change the value of an integer, the name is re-pointed to a different integer object, while changing a list element does not rebind the list's name).
The reference count basically counts how many variables use that data, and it is incremented/decremented automatically. The garbage collector destroys objects with zero references (actually not always; it takes extra steps to save time). You should check out this article.
Similarly to object constructors (__init__()), which are called on object creation, you can define destructors (__del__()), which are executed on object deletion (usually when the reference count drops to 0). According to this article, in Python they are not needed as much as in C++, because Python has a garbage collector that handles memory management automatically. You can check out those examples too.
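A small sketch of a destructor in action (the class and attribute names are made up):

class Resource:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # runs when the reference count drops to zero; timing is not
        # guaranteed for objects caught in reference cycles
        print("releasing", self.name)

r = Resource("image buffer")
del r  # last reference gone; in CPython __del__ runs right here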
Hope it helps :)
No need for quit() (assuming you're using the C-based Python, i.e. CPython).
Python uses two methods of garbage collection, as alluded to in the other answers.
First, there's reference counting. Essentially, each time you add a reference to an object the count gets incremented, and each time you remove a reference (e.g., it goes out of scope) it gets decremented.
From https://devguide.python.org/garbage_collector/:
When an object’s reference count becomes zero, the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn, if this decrement makes their reference count become zero, and so on.
You can get information about the current reference count for an object using sys.getrefcount(x), but really, why bother.
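For what it's worth, here is a quick peek (getrefcount itself adds one temporary reference through its argument):

import sys

x = []
print(sys.getrefcount(x))  # at least 2: the name x plus the argument reference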
The second way is through garbage collection (gc). [Reference counting is a type of garbage collection, but Python specifically calls this second method "garbage collection", so we'll also use this terminology.] This is intended to find those places where the reference count is not zero, but the object is no longer accessible ("reference cycles"). For example:
class MyObj:
    pass

x = MyObj()
x.self = x
Here, x refers to itself, so the actual reference count for x is more than 1. You can call del x, but that merely removes the name from your scope: the object lives on because "someone" still has a reference to it.
gc, and specifically gc.collect() goes through objects looking for cycles like this and, when it finds an unreachable cycle (such as your x post deletion), it will deallocate the whole lot.
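A runnable sketch of such a cycle being reclaimed:

import gc

class MyObj:
    pass

x = MyObj()
x.self = x           # self-reference: the count can never drop to zero on its own
del x                # the name is gone, but the cycle keeps the object alive
print(gc.collect())  # the cycle detector frees it and returns the number
                     # of unreachable objects it found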
Back to your question: you don't need to have a quit() method, because as soon as your MyImageProcessor object goes out of scope, it will decrement the reference counters for image and metadata. If that puts them at zero, they're deallocated. If it doesn't, well, someone else is using them.
Setting them to None first merely decrements the reference counts right then; when MyImageProcessor goes out of scope, it won't decrement those reference counts again, because it no longer holds the image or metadata objects. So you're just doing explicitly what Python already does for you for free: no more, no less.
You didn't create a cycle, so your calling gc.collect() is unlikely to change anything.
Check out https://devguide.python.org/garbage_collector/ if you are interested in more of the gritty details.
Not sure if it makes sense, but by my logic you could call gc.get_count() before and after gc.collect() to see whether something has been removed.
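For example (a quick sketch; the three numbers are per-generation counters, so exact values vary between runs):

import gc

print(gc.get_count())  # (count0, count1, count2) before collecting
gc.collect()
print(gc.get_count())  # typically lower after a full collection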
Related: what are the count0, count1 and count2 values returned by the Python gc.get_count()?
I am coding in Python trying to decide whether I should return a numpy array (the result of a diff on some other array) or return numpy.where(diff)[0], which is a smaller array but requires that little extra work to create. Let's call the method where this happens methodB.
I call methodB from methodA. The rub is that I won't necessarily always need the where() result in methodA, but I might. So is it worth doing this work inside methodB, or should I pass back the (much larger memory-wise) diff itself and then only process it further in methodA if needed? That would be the more efficient choice assuming methodA just gets a reference to the result.
So, are function results ever not copied when they are passed back to the code that called the function?
I believe that when methodB finishes, all the memory in its frame will be reclaimed by the system, so methodA has to actually copy anything returned by methodB in to its own frame in order to be able to use it. I would call this "return by value". Is this correct?
Yes, you are correct. In Python, arguments are always passed by value, and return values are always returned by value. However, the value being returned (or passed) is a reference to a potentially shared, potentially mutable object.
There are some types for which the value being returned or passed may be the actual object itself, e.g. this is the case for integers, but the difference between the two can only be observed for mutable objects which integers aren't, and de-referencing an object reference is completely transparent, so you will never notice the difference. To simplify your mental model, you may just assume that arguments and return values are always passed by value (this is true anyhow), and that the value being passed is always a reference (this is not always true, but you cannot tell the difference, you can treat it as a simple performance optimization).
Note that passing / returning a reference by value is in no way similar to (and certainly not the same thing as) passing / returning by reference. In particular, it does not allow you to mutate the name binding in the caller / callee, as pass-by-reference would.
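A short illustration of that distinction (the function names are made up):

def mutate(lst):
    lst.append(4)  # mutating the shared object is visible to the caller

def rebind(lst):
    lst = [0]      # rebinding only changes the local name

data = [1, 2, 3]
mutate(data)
print(data)        # [1, 2, 3, 4]
rebind(data)
print(data)        # still [1, 2, 3, 4]: the caller's binding is untouched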
This particular flavor of pass-by-value, where the value is typically a reference, is the same in e.g. ECMAScript, Ruby, Smalltalk, and Java, and is sometimes called "call by object sharing" (coined by Barbara Liskov, I believe), "call by sharing", "call by object", and specifically within the Python community "call by assignment" (thanks to @timgeb) or "call by name-binding" (thanks to @Terry Jan Reedy) (not to be confused with call by name, which is again a different thing).
Assignment never copies data. If you have a function foo that returns a value, then an assignment like result = foo(arg) never copies any data. (You could, of course, have copy-operations in the function's body.) Likewise, return x does not copy the object x.
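You can verify that with an identity check (foo here is a stand-in for any function):

def foo(arg):
    return arg          # hands back the very same object

data = [1, 2, 3]
result = foo(data)
print(result is data)   # True: neither the call nor the return copied anything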
Your question lacks a specific example, so I can't go into more detail.
edit: You should probably watch the excellent Facts and Myths about Python names and values talk.
So roughly your code is:

def methodA(arr):
    x = methodB(arr)
    ....

def methodB(arr):
    diff = somefn(arr)
    # return diff or
    # return np.where(diff)[0]
arr is a (large) array; a reference to it is passed to methodA and methodB. No copies are made.
diff is a similarly sized array that is generated in methodB. If it is returned, it will be referenced in the methodA namespace by x. No copy is made in returning it.
If the where array is returned, diff disappears when methodB returns. Assuming it doesn't share a data buffer with some other array (such as arr), all the memory that it occupied is recovered.
But as long as memory isn't tight, returning diff instead of the where result won't be more expensive. Nothing is copied during the return.
A numpy array consists of a small object wrapper with attributes like shape and dtype. It also has a pointer to a potentially large data buffer. Where possible numpy tries to share buffers, but it readily makes new ndarray objects. Thus there's an important distinction between a view and a copy.
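A small demonstration of the view/copy distinction:

import numpy as np

a = np.arange(10)
v = a[2:5]             # a view: a new ndarray wrapper over a's data buffer
c = a[2:5].copy()      # a copy: its own buffer
print(v.base is a)     # True: v shares a's memory
print(c.base is None)  # True: c owns its data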
I see what I missed now: Objects are created on the heap, but function frames are on the stack. So when methodB finishes, its frame will be reclaimed, but that object will still exist on the heap, and methodA can access it with a simple reference.
Was just wondering this. So sometimes programmers will insert an input() into a block of code without assigning its value to anything, for the purpose of making the program wait for an input before continuing. Usually when it runs you're expected to just hit enter without typing anything to move forward, but what if you do type something? What happens to that string if it's not assigned to any variable? Is there any way to read its value after the fact?
TL;DR: If you don't immediately assign the return value of input(), it's lost.
I can't imagine how or why you would want to retrieve it afterwards.
If you call any callable (all callables have return values; the default is None) and do not save its return value, there's no way to get it again. You have one chance to capture the return value, and if you miss it, it's gone.
The return value gets created inside the callable, of course: the code that makes it gets run, and some memory is allocated to hold the value. Inside the callable, there's a variable name referencing the value (except if you're directly returning something, like return "unicorns".upper(); in that case there's of course no name).
But after the callable returns, what happens? The return value is still there and can be assigned to a variable name in the calling context. All names that referenced the value inside the callable are gone though. Now if you don't assign the value to a name in your call statement, there are no more names referencing it.
What does that mean? It gets put on the garbage collector's hit list and will be nuked from your memory on the next garbage collection cycle. Of course the GC implementation may differ between Python interpreters, but the standard CPython implementation uses reference counting.
So to sum it up: if you don't assign the return value to a name in your call statement, it's gone for your program; it will be destroyed, and the memory it claims will be freed up any time afterwards, as soon as the GC handles it in the background.
Now of course a callable might do other stuff with the value before it finally returns it and exits. There are a few possible ways it could preserve a value (see the sketch after this list):
Write it to an existing, global variable
Write it through any output method, e.g. store it in a file
If it's an instance method of an object, it can also write it to the object's instance variables.
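For instance, a wrapper along those lines (remembering_input and _last_result are hypothetical names, not anything built in):

_last_result = None  # module-level global keeping the most recent value alive

def remembering_input(prompt=""):
    global _last_result
    _last_result = input(prompt)  # stash a reference before returning
    return _last_result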
But what for? Unless there is some benefit to storing the last return value(s), why should it be implemented to hog memory unnecessarily?
There are a few cases where caching the return values makes sense, i.e. for functions with determinable return values (meaning the same input always results in the same output) that are often called with the same arguments and take long to calculate.
But for the input function? It's probably the least determinable function in existence; even if you call random.random() you can be more sure of the result than when you ask for user input. Caching makes absolutely no sense here.
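For the deterministic, expensive case, the standard library already offers caching; a minimal sketch:

import functools

@functools.lru_cache(maxsize=None)
def slow_square(n):
    # deterministic and (pretend) expensive: a good caching candidate
    return n * n

slow_square(4)  # computed once
slow_square(4)  # served from the cache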
The value is discarded. You can't get it back. It's the same as if you just had a line like 2 + 2 or random.random() by itself; the result is gone.
I have a python module that calls a DLL written in C to encode XML strings. Once the function returns the encoded string, it fails to de-allocate the memory that was allocated during this step. Concretely:
import ctypes

encodeMyString = ctypes.create_string_buffer(4096)
CallEncodingFuncInDLL(encodeMyString, InputXML)
I have looked at this, this, and this, and have also tried calling gc.collect(), but perhaps since the object has been allocated in an external DLL, Python's gc doesn't have any record of it and fails to remove it. But since the code keeps calling the encoding function, it keeps on allocating memory, and eventually the Python process crashes. Is there a way to profile this memory usage?
Since you haven't given any information about the DLL, this will necessarily be pretty vague, but…
Python can't track memory allocated by something external that it doesn't know about. How could it? That memory could be part of the DLL's constant segment, or allocated with mmap or VirtualAlloc, or part of a larger object, or the DLL could just be expecting it to be alive for its own use.
Any DLL that has a function that allocates and returns a new object has to have a function that deallocates that object. For example, if CallEncodingFuncInDLL returns a new object that you're responsible for, there will be a function like DestroyEncodedThingInDLL that takes such an object and deallocates it.
So, when do you call this function?
Let's step back and make this more concrete. Let's say the function is plain old strdup, so the function you call to free up the memory is free. (No, I have no idea why you'd ever want to call strdup from Python, but it's about the simplest possible example, so let's pretend it's not useless.) You have two choices for when to call free.
The first option is to call strdup, immediately convert the returned value to a native Python object and free it, and not have to worry about it after that:
# assumes libc.strdup.restype has been set to a c_char_p subclass, so the
# result is not auto-converted to bytes and .value still works
newbuf = libc.strdup(mybuf)
s = newbuf.value
libc.free(newbuf)
# now use s, which is just a Python bytes object, so it's GC-able
Or, better, wrap this up so it's automatic by using an errcheck hook (a plain restype callable only receives the result as a C int, which can truncate a 64-bit pointer):
libc.strdup.restype = ctypes.c_void_p  # keep the full pointer width

def convert_and_free_char_p(result, func, args):
    # errcheck hook: copy the C string into Python bytes, then free the original
    try:
        return ctypes.string_at(result)
    finally:
        libc.free(ctypes.c_void_p(result))

libc.strdup.errcheck = convert_and_free_char_p
s = libc.strdup(mybuf)
# now use s
But some objects can't be converted to a native Python object so easily—or they can be, but it's not very useful to do so, because you need to keep passing them back into the DLL. In that case, you can't clean it up until you're done with it.
The best way to do this is to wrap that opaque value up in a class that releases it on close or __exit__ or __del__ or whatever seems appropriate. One nice way to do this is with @contextmanager:
import contextlib

@contextlib.contextmanager
def freeing(value):
    try:
        yield value
    finally:
        libc.free(value)
So:
newbuf = libc.strdup(mybuf)
with freeing(newbuf):
    do_stuff(newbuf)
    do_more_stuff(newbuf)
# automatically freed before you get here
# (or even if you don't get here, because of an exception/return/etc.)
Or:
@contextlib.contextmanager
def strduping(buf):
    value = libc.strdup(buf)
    try:
        yield value
    finally:
        libc.free(value)
And now:
with strduping(mybuf) as newbuf:
    do_stuff(newbuf)
    do_more_stuff(newbuf)
# again, automatically freed here
If I create a list in a python function and return it to the caller, how does garbage collection work on that list? Do I have to do anything to keep a memory leak from occurring?
For example:
#!/usr/bin/python
import random

class Example:
    def f1(self):
        list = []
        len = random.randint(0, 30)
        for i in range(0, len):
            list.append(random.randint(0, 65536))
        return list
random.seed(None)

e = Example()
while 1:
    l = e.f1()
Will this cause a memory leak? Does the list in f1() have an appropriate reference count at all times? Does the caller of f1() have to do anything to keep a memory leak from occurring? Should the caller use del on the list or something?
There's no memory leak here. The list assigned to l is the very same list that was generated in the function; Python returns a reference to the object itself, not a copy.
Python keeps track of the references to that list: on each iteration of the while loop, a new list is created and assigned to l. This causes the previous one to no longer have any references, so it will be deleted.
In Python it's all automatic; most of the time you don't have to worry about garbage collection.
In this case, the function creates a list and returns it; then you assign it to l.
Once you rebind a name that already holds a value, the first value is simply thrown away: with no references left, there is no memory leak, and you don't have to use del.
It's cool, isn't it? :)