Python Garbage Collection: Memory no longer needed not released to OS? - python

I have written an application with flask and uses celery for a long running task. While load testing I noticed that the celery tasks are not releasing memory even after completing the task. So I googled and found this group discussion..
https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw
In that discussion it says, thats how python works.
Also the article at https://hbfs.wordpress.com/2013/01/08/python-memory-management-part-ii/ says
"But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase."
And I use Linux. So I wrote the below script to verify it.
import gc
def memory_usage_psutil():
# return the memory usage in MB
import resource
print 'Memory usage: %s (MB)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000.0)
def fileopen(fname):
memory_usage_psutil()# 10 MB
f = open(fname)
memory_usage_psutil()# 10 MB
content = f.read()
memory_usage_psutil()# 14 MB
def fun(fname):
memory_usage_psutil() # 10 MB
fileopen(fname)
gc.collect()
memory_usage_psutil() # 14 MB
import sys
from time import sleep
if __name__ == '__main__':
fun(sys.argv[1])
for _ in range(60):
gc.collect()
memory_usage_psutil()#14 MB ...
sleep(1)
The input was a 4MB file. Even after returning from the 'fileopen' function the 4MB memory was not released. I checked htop output while the loop was running, the resident memory stays at 14MB. So unless the process is stopped the memory stays with it.
So if the celery worker is not killed after its task is finished it is going to keep the memory for itself. I know I can use max_tasks_per_child config value to kill the process and spawn a new one. Is there any other way to return the memory to OS from a python process?.

I think your measurement method and interpretation is a bit off. You are using ru_maxrss of resource.getrusage, which is the "high watermark" of the process. See this discussion for details on what that means. In short, it is the peak RAM usage of your process, but not necessarily current. Parts of the process could be swapped out etc.
It also can mean that the process has freed that 4MiB, but the OS has not reclaimed the memory, because it's faster for the process to allocate new 4MiB if it has the memory mapped already. To make it even more complicated programs can and do use "free lists", lists of blocks of memory that are not in active use, but are not freed. This is also a common trick to make future allocations faster.
I wrote a short script to demonstrate the difference between virtual memory usage and max RSS:
import numpy as np
import psutil
import resource
def print_mem():
print("----------")
print("ru_maxrss: {:.2f}MiB".format(
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))
print("virtual_memory.used: {:.2f}MiB".format(
psutil.virtual_memory().used / 1024 ** 2))
print_mem()
print("allocating large array (80e6,)...")
a = np.random.random(int(80e6))
print_mem()
print("del a")
del a
print_mem()
print("read testdata.bin (~400MiB)")
with open('testdata.bin', 'rb') as f:
data = f.read()
print_mem()
print("del data")
del data
print_mem()
The results are:
----------
ru_maxrss: 22.89MiB
virtual_memory.used: 8125.66MiB
allocating large array (80e6,)...
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8731.85MiB
del a
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8121.66MiB
read testdata.bin (~400MiB)
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8513.11MiB
del data
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8123.22MiB
It is clear how the ru_maxrss remembers the maximum RSS, but the current usage has dropped in the end.
Note on psutil.virtual_memory().used:
used: memory used, calculated differently depending on the platform and designed for informational purposes only.

Related

Python parallel processing fills memory fast

I'm bruteforcing a 8-digit pin on a ELF executable (it's for a CTF) and I'm using asynchronous parallel processing. The code is very fast but it fills the memory even faster.
It takes about 10% of the total iterations to fill 8gbs of ram, and I have no idea what's causing it. Any help?
from pwn import *
import multiprocessing as mp
from tqdm import tqdm
def check_pin(pin):
program = process('elf_exe')
program.recvn(36)
program.sendline(str(pin))
program.recvline()
program.recvline()
res = program.recvline()
program.close()
if 'Access denied.' in str(res):
return null, null
else:
return res, pin
def process_result(res, pin):
if(res != null):
print(pin)
if __name__ == '__main__':
print(f'Starting bruteforce on {mp.cpu_count()} cores :)\n')
pool = mp.Pool(mp.cpu_count())
min = 10000000
max = 99999999
for pin in tqdm(range(min, max)):
pool.apply_async(check_pin, args=(pin), callback=process_result)
pool.close()
pool.join()
Multiprocessing pools create several processes. Calls to apply_async create a task that is added to a shared data structure (eg. queue). The data structure is read by processes thanks to inter-process communication (IPC). The thing is apply_async return a synchronization object that you do not use and so there is not synchronizations. Items appended in the data structure take some memory space (at least 32*3=96 bytes due to 3 CPython objects being allocated) and the data structure grow in memory to hold the 89_999_999 items hence at least 8 GiB of RAM. The process are not fast enough to execute the work. What tqdm print is totally is completely misleading: it just print the processing of the number of task submitted, not the one executed that is only a tiny fraction. Almost all the work is done when tqdm print 100% and the submission loop is done. I actually doubt the "code is very fast" since it appears to run 90 millions process while running a process is known to be an expensive operation.
To speed up this code and avoid a big memory usage, you need to aggregate the work in bigger tasks. You can for example and a range of pin variable to be computed and add a loop in check_pin. A reasonable range size is for example 1000. Additionally, you need to accumulate the AsyncResult objects returned by apply_async in a list and perform periodic synchronizations when the list becomes too big so that processes does not have too much work and so the shared data structure can remain small. Here is a simple untested example:
lst = []
for rng in allRanges:
lst.append(pool.apply_async(check_pin, args=(rng), callback=process_result))
if len(lst) > 100:
# Naive synchronization
for i in lst:
i.wait()
lst = []

Python application memory profiling not accurate

I am profiling Python 3 application and trying to understand where the memory is allocated.
I tried multiple libraries:
pympler
tracemalloc
Pymler code:
all_objects = muppy.get_objects()
summary.summarize(all_objects)
Tracemalloc code:
current_snapshot = tracemalloc.take_snapshot()
total_size = 0
for stat in current_snapshot.statistics("filename"):
total_size += stat.size
print(total_size)
Both of these tools report over 10 times less memory than Python application actually consumes. The application can consume over 500 MB and I get under 50 MB on the reports.
I am checking memory usage, both, manually and using the function below:
mem_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# resource.getrusage provides memory usage in bytes on OSX but in kilobytes for other platforms
if sys.platform != constants.PLATFORM_MAC:
mem_usage *= 1024
return mem_usage
I call gc.collect() prior generating reports in the code.
I used gc library to see if the application is leaking any memory but do not see anything significant.
Can someone explain why can I see such a drastic difference?

Avoid out of memory error for multiprocessing Pool

How do I avoid "out of memory" exception when a lot of sub processes are launched using multiprocessing.Pool?
First of all, my program loads 5GB file to a object. Next, parallel processing runs, where each process read that 5GB object.
Because my machine has more than 30 cores, I want to use full of my cores. However, when launching 30 sub processes, out of memory exception occurs.
Probably, each process has the copy of the large instance (5GB). The total memory is 5GB * 30 core = 150GB. That's why out of memory error occurs.
I believe there is a workaround to avoid this memory error because each process just read that object. If each process share memory of the huge object, only 5GB memory is enough for my multi processing.
Please let me know a workaround of this memory error.
import cPickle
from multiprocessing import Pool
from multiprocessing import Process
import multiprocessing
from functools import partial
with open("huge_data_5GB.pickle", "rb") as f
huge_instance = cPickle(f)
def run_process(i, huge_instance):
return huge_instance.get_element(i)
partial_process = partial(run_process, huge_instance=huge_instance)
p = Pool(30) # my machine has more than 30 cores
result = p.map(partial_process, range(10000))

Why does python thread consume so much memory?

Why does python thread consumes so much memory?
I measured that spawning one thread consumes 8 megs of memory, almost as big as a whole new python process!
OS: Ubuntu 10.10
Edit: due to popular demand I'll give some extraneous examples, here it is:
from os import getpid
from time import sleep
from threading import Thread
def nap():
print 'sleeping child'
sleep(999999999)
print getpid()
child_thread = Thread(target=nap)
sleep(999999999)
On my box, pmap pid will give 9424K
Now, let's run the child thread:
from os import getpid
from time import sleep
from threading import Thread
def nap():
print 'sleeping child'
sleep(999999999)
print getpid()
child_thread = Thread(target=nap)
child_thread.start() # <--- ADDED THIS LINE
sleep(999999999)
Now pmap pid will give 17620K
So, the cost for the extra thread is 17620K - 9424K = 8196K
ie. 87% of running a whole new separate process!
Now isn't that just, wrong?
This is not Python-specific, and has to do with the separate stack that gets allocated by the OS for every thread. The default maximum stack size on your OS happens to be 8MB.
Note that the 8MB is simply a chunk of address space that gets set aside, with very little memory committed to it initially. Additional memory gets committed to the stack when required, up to the 8MB limit.
The limit can be tweaked using ulimit -s, but in this instance I see no reason to do this.
As an aside, pmap shows address space usage. It isn't a good way to gauge memory usage. The two concepts are quite distinct, if related.

Python 2.6 GC appears to cleanup objects, but memory is not released

I have a program written in python 2.6 that creates a large number of short lived instances (it is a classic producer-consumer problem). I noticed that the memory usage as reported by top and pmap seems to increase when these instances are created and never goes back down. I was concerned that some python module I was using might be leaking memory so I carefully isolated the problem in my code. I then proceeded to reproduce it in as short as example as possible. I came up with this:
class LeaksMemory(list):
timesDelCalled = 0
def __del__(self):
LeaksMemory.timesDelCalled +=1
def leakSomeMemory():
l = []
for i in range(0,500000):
ml = LeaksMemory()
ml.append(float(i))
ml.append(float(i*2))
ml.append(float(i*3))
l.append(ml)
import gc
import os
leakSomeMemory()
print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(gc.collect()) +" objects collected")
print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(os.getpid()) + " : check memory usage with pmap or top")
If you run this with something like 'python2.6 -i memoryleak.py' it will halt and you can use pmap -x PID to check the memory usage. I added the del method so I could verify that GC was occuring. It is not there in my actual program and does not appear to make any functional difference. Each call to leakSomeMemory() increases the amount of memory consumed by this program. I fear I am making some simple error and that references are getting kept by accident, but cannot identify it.
Python will release the objects, but it will not release the memory back to the operating system immediately. Instead, it will re-use the same segments for future allocations within the same interpreter.
Here's a blog post about the issue: http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
UPDATE: I tested this myself with Python 2.6.4 and didn't notice persistent increases in memory usage. Some invocations of leakSomeMemory() caused the memory footprint of the Python process to increase, and some made it decrease again. So it all depends on how the allocator is re-using the memory.
According to Alex Martelli:
"The only really reliable way to
ensure that a large but temporary use
of memory DOES return all resources to
the system when it's done, is to have
that use happen in a subprocess, which
does the memory-hungry work then
terminates."
So, in your situation it sounds like it would make sense to use the multiprocessing module to run the short-lived functions in separate processes to ensure the return of resources when the process finishes.
import multiprocessing as mp
def NOT_leakSomeMemory():
# do stuff
return result
if __name__=='__main__':
pool = mp.Pool()
results=pool.map(NOT_leakSomeMemory, range(500000))
For more ideas on how to set things up using multiprocessing, see Doug Hellman's tutorial:

Categories

Resources