Python application memory profiling not accurate

I am profiling a Python 3 application and trying to understand where the memory is allocated.
I tried multiple libraries:
pympler
tracemalloc
Pympler code:
from pympler import muppy, summary

all_objects = muppy.get_objects()
summary.summarize(all_objects)
Tracemalloc code:
import tracemalloc

current_snapshot = tracemalloc.take_snapshot()
total_size = 0
for stat in current_snapshot.statistics("filename"):
    total_size += stat.size
print(total_size)
Both of these tools report over 10 times less memory than the Python application actually consumes. The application can consume over 500 MB, yet I get under 50 MB in the reports.
I am checking memory usage both manually and using the function below:
mem_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# resource.getrusage reports memory usage in bytes on OSX but in kilobytes on other platforms
if sys.platform != constants.PLATFORM_MAC:
    mem_usage *= 1024
return mem_usage
I call gc.collect() prior to generating the reports.
I used the gc library to see whether the application is leaking any memory, but I do not see anything significant.
Can someone explain why I see such a drastic difference?
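For reference, a minimal side-by-side check of the two numbers (my own sketch, not part of the original question, assuming psutil is installed) looks like the snippet below; tracemalloc only counts blocks allocated through Python's memory allocator after tracemalloc.start(), while the resident set size also includes the interpreter itself, C-extension allocations, and allocator overhead, which is one common source of such gaps:

import tracemalloc
import psutil

tracemalloc.start()

# stand-in for the application's real allocations
payload = [bytes(1000) for _ in range(100000)]

snapshot = tracemalloc.take_snapshot()
traced = sum(stat.size for stat in snapshot.statistics("filename"))
rss = psutil.Process().memory_info().rss  # current resident set size

# traced covers only Python-level allocations made after start();
# rss is what the OS actually holds for the whole process.
print("tracemalloc total: %.1f MB, process RSS: %.1f MB" % (traced / 1e6, rss / 1e6))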

Related

Python concurrent futures multiprocessing Pool does not scale with the number of processors

I have written a simple function that iteratively builds a list to demonstrate this behavior, and I pass that function to concurrent.futures.ProcessPoolExecutor. The actual function isn't important, as this seems to happen for a wide variety of functions I've tested. As I increase the number of processors, it takes longer to run the underlying function. At only 10 processors the total execution time per processor increases by 2.5 times! For this function it continues to increase at a rate of about 15% per processor up to the capacity limits of my machine. I have a Windows machine with 48 processors, and my total CPU and memory usage doesn't exceed 25% for this test. I have nothing else running. Is there some blocking lurking somewhere?
from datetime import datetime
import concurrent.futures

def process(num_jobs=1, **kwargs):
    from functools import partial
    iterobj = range(num_jobs)
    args = []
    func = globals()['test_multi']
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_jobs) as ex:
        ## using map
        result = ex.map(partial(func, *args, **kwargs), iterobj)
    return result

def test_multi(*args, **kwargs):
    starttime = datetime.utcnow()
    iternum = args[-1]
    test = []
    for i in range(200000):
        test = test + [i]
    return iternum, (datetime.utcnow() - starttime)

if __name__ == '__main__':
    max_processors = 10
    for i in range(max_processors):
        starttime = datetime.utcnow()
        result = process(i + 1)
        finishtime = datetime.utcnow() - starttime
        if i == 0:
            chng = 0
            total = 0
            firsttime = finishtime
        else:
            chng = finishtime/lasttime*100 - 100
            total = finishtime/firsttime*100 - 100
        lasttime = finishtime
        print(f'Multi took {finishtime} for {i+1} processes changed by {round(chng,2)}%, total change {round(total,2)}%')
This gives the following results on my machine:
Multi took 0:00:52.433927 for 1 processes changed by 0%, total change 0%
Multi took 0:00:52.597822 for 2 processes changed by 0.31%, total change 0.31%
Multi took 0:01:13.158140 for 3 processes changed by 39.09%, total change 39.52%
Multi took 0:01:26.666043 for 4 processes changed by 18.46%, total change 65.29%
Multi took 0:01:43.412213 for 5 processes changed by 19.32%, total change 97.22%
Multi took 0:01:41.687714 for 6 processes changed by -1.67%, total change 93.93%
Multi took 0:01:38.316035 for 7 processes changed by -3.32%, total change 87.5%
Multi took 0:01:51.106467 for 8 processes changed by 13.01%, total change 111.9%
Multi took 0:02:15.046646 for 9 processes changed by 21.55%, total change 157.56%
Multi took 0:02:13.467514 for 10 processes changed by -1.17%, total change 154.54%
The increases are not linear and vary from test to test, but they always end up significantly increasing the time to run the function. Given the ample free resources on this machine and the very simple function, I would have expected the total time to remain fairly constant, or perhaps to increase slightly with the overhead of spawning new processes, not to increase dramatically for pure computation.
Yes there is, and it's called memory bandwidth.
While the memory controller is good at pipelining read/write requests to improve throughput for parallel programs, if too many processes are reading from and writing to your RAM sticks at the same time you are going to see a slowdown, because the RAM pipeline is being bombarded by all of them at once.
Other applications running at the same time may not be using RAM as heavily, because each core has caches (L1, L2 and a shared L3) that keep applications running without contending for RAM bandwidth; only applications that do heavy memory operations contend for it, and your application is clearly contending with itself for RAM bandwidth.
This is one of the hard limits on parallel programs, and the solution is obviously to make the application more "cache friendly" so that it reaches out to RAM less often.
"Cache friendly" applications are more easily written in C/C++ than in Python, but they are entirely doable in Python, since CPUs have a few MB of cache, which can hold an entire working set in many cases.

excessive memory usage in multiprocessing pool

I have code that uses a Python multiprocessing pool, but it is using excessive memory. I have tested both pool.map and pool.imap_unordered; however, both use the same amount of memory. Below is a simplified version of my code.
import random, time, multiprocessing

def func(arg):
    y = arg**arg  # Don't look into here because my original function is
                  # much more complicated and I can't change anything here.
    print y

p = multiprocessing.Pool(24)
count = 0
start = time.time()
for res in p.imap_unordered(func, range(48000), chunksize=2):
#for res in p.map(func, range(48000), chunksize=2):
    print "[%5.2f] res:%s count:%s" % (time.time()-start, res, count)
    count += 1
The function saves output in the files and doesn't have any return statement. The above code took:
p.map: CPU Utilized: 03:18:31, Job Wall-clock time: 00:08:17, Memory Utilized: 162.92 MB
p.imap_unordered: CPU Utilized: 04:00:13, Job Wall-clock time: 00:10:02, Memory Utilized: 161.16 MB
I have 128 GB of memory in total, and my original code stops because it exceeds the memory limit. Both map and imap_unordered show the same problem. I was expecting imap_unordered to take much less memory. What should I modify to have less memory consumption without changing func (the original one)?
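One mitigation that is commonly suggested for this kind of setup (my own sketch, not from the question) is to recycle pool workers with the maxtasksperchild argument, so that whatever a worker accumulates is returned to the OS when it exits and is replaced:

import multiprocessing

def func(arg):
    return arg  # stands in for the real, unchangeable function

if __name__ == "__main__":
    # maxtasksperchild=100 restarts each worker after 100 tasks, which
    # caps how much memory any single long-lived worker can hold on to.
    p = multiprocessing.Pool(24, maxtasksperchild=100)
    for res in p.imap_unordered(func, range(48000), chunksize=2):
        pass
    p.close()
    p.join()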

Python fork: 'Cannot allocate memory' if process consumes more than 50% avail. memory

I encountered a memory allocation problem when forking processes in Python. I know the issue has already been discussed in some other posts here; however, I couldn't find a good solution in any of them.
Here is a sample script illustrating the problem:
import os
import psutil
import subprocess

pid = os.getpid()
this_proc = psutil.Process(pid)
MAX_MEM = int(psutil.virtual_memory().free*1E-9)  # in GB

def consume_memory(size):
    """ Size in GB """
    memory_consumer = []
    while get_mem_usage() < size:
        memory_consumer.append(" "*1000000)  # Adding ~1MB
    return(memory_consumer)

def get_mem_usage():
    return(this_proc.memory_info()[0]/2.**30)

def get_free_mem():
    return(psutil.virtual_memory().free/2.**30)

if __name__ == "__main__":
    for i in range(1, MAX_MEM):
        consumer = consume_memory(i)
        mem_usage = get_mem_usage()
        print("\n## Memory usage %d/%d GB (%2d%%) ##" % (int(mem_usage),
              MAX_MEM, int(mem_usage*100/MAX_MEM)))
        try:
            subprocess.call(['echo', '[OK] Fork worked.'])
        except OSError as e:
            print("[ERROR] Fork failed. Got OSError.")
            print(e)
        del consumer
The script was tested with Python 2.7 and 3.6 on Arch Linux and uses psutil to keep track of memory usage. It gradually increases the memory usage of the Python process and tries to fork a process using subprocess.call(). Forking fails if more than 50% of the available memory is consumed by the parent process.
## Memory usage 1/19 GB ( 5%) ##
[OK] Fork worked.
## Memory usage 2/19 GB (10%) ##
[OK] Fork worked.
## Memory usage 3/19 GB (15%) ##
[OK] Fork worked.
[...]
## Memory usage 9/19 GB (47%) ##
[OK] Fork worked.
## Memory usage 10/19 GB (52%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
## Memory usage 11/19 GB (57%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
## Memory usage 12/19 GB (63%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
## Memory usage 13/19 GB (68%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
[...]
Note that I had no Swap activated when running this test.
There seem to be two options to solve this problem:
Using a Swap of at least twice the size of physical memory.
Changing overcommit_memory setting: echo 1 > /proc/sys/vm/overcommit_memory
I tried the latter on my desktop machine and the above script finished without errors.
However, on the computing cluster I'm working on I can't use either of these options.
Also, forking the required processes in advance, before consuming the memory, is unfortunately not an option.
Does anybody have another suggestion on how to solve this problem?
Thank you!
Best
Leonhard
The problem you are facing is not really Python related and also not something you could really do much to change with Python alone. Starting a forking process (executor) up front as suggested by mbrig in the comments really seems to be the best and cleanest option for this scenario.
Python or not, you are dealing with how Linux (or a similar system) creates new processes. Your parent process first calls fork(2), which creates a new child process as a copy of itself. It does not actually copy itself elsewhere at that time (it uses copy-on-write); nonetheless, it checks whether sufficient space is available and, if not, fails, setting errno to 12: ENOMEM -> the OSError exception you're seeing.
Yes, allowing the virtual memory subsystem to overcommit memory can stop this error from popping up, and if you exec a new program in the child (which would also end up being smaller), it does not have to cause any immediate failures. But it sounds like possibly kicking the problem further down the road.
Growing memory (adding swap) pushes the limit: as long as twice your running process still fits into available memory, the fork can succeed. With the follow-up exec, the swap would not even need to be used.
There seems to be one more option, but it looks... dirty. There is another syscall, vfork(), which creates a new process that initially shares memory with its parent, whose execution is suspended at that point. This newly created child process can only assign the variable returned by vfork, and can only call _exit or exec. As such, it is not exposed through any Python interface, and if you try (I did) loading it directly into Python using ctypes it segfaults (I presume because Python would still do something other than just those three actions mentioned after vfork and before I could exec something else in the child).
That said, you can delegate the whole vfork and exec to a shared object you load in. As a very rough proof of concept, I did just that:
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
char run(char * const arg[]) {
    pid_t child;
    int wstatus;
    char ret_val = -1;

    child = vfork();
    if (child < 0) {
        printf("run: Failed to fork: %i\n", errno);
    } else if (child == 0) {
        printf("arg: %s\n", arg[0]);
        execv(arg[0], arg);
        _exit(-1);
    } else {
        child = waitpid(child, &wstatus, 0);
        if (WIFEXITED(wstatus))
            ret_val = WEXITSTATUS(wstatus);
    }
    return ret_val;
}
And I've modified your sample code in the following way (bulk of the change is in and around replacement of subprocess.call):
import ctypes
import os
import psutil

pid = os.getpid()
this_proc = psutil.Process(pid)
MAX_MEM = int(psutil.virtual_memory().free*1E-9)  # in GB

def consume_memory(size):
    """ Size in GB """
    memory_consumer = []
    while get_mem_usage() < size:
        memory_consumer.append(" "*1000000)  # Adding ~1MB
    return(memory_consumer)

def get_mem_usage():
    return(this_proc.memory_info()[0]/2.**30)

def get_free_mem():
    return(psutil.virtual_memory().free/2.**30)

if __name__ == "__main__":
    forker = ctypes.CDLL("forker.so", use_errno=True)
    for i in range(1, MAX_MEM):
        consumer = consume_memory(i)
        mem_usage = get_mem_usage()
        print("\n## Memory usage %d/%d GB (%2d%%) ##" % (int(mem_usage),
              MAX_MEM, int(mem_usage*100/MAX_MEM)))
        try:
            cmd = [b"/bin/echo", b"[OK] Fork worked."]
            c_cmd = (ctypes.c_char_p * (len(cmd) + 1))()
            c_cmd[:] = cmd + [None]
            ret = forker.run(c_cmd)
            errno = ctypes.get_errno()
            if errno:
                raise OSError(errno, os.strerror(errno))
        except OSError as e:
            print("[ERROR] Fork failed. Got OSError.")
            print(e)
        del consumer
With that, I could still fork at 3/4 of available memory reported filled.
In theory it could all be written "properly" and wrapped nicely to integrate well with Python code, but while that seems to be one additional option, I'd still go back to the executor process.
I've only briefly scanned through the concurrent.futures.process module, but once it spawns a worker process, it does not seem to tear it down until it's done, so perhaps abusing an existing ProcessPoolExecutor would be a quick and cheap option. I've added these near the top of the script (the main part):
import concurrent.futures

def nop():
    pass

executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)
executor.submit(nop)  # start a worker process in the pool
And then submit the subprocess.call to it:
proc = executor.submit(subprocess.call, ['echo', '[OK] Fork worked.'])
proc.result() # can also collect the return value

Avoid out of memory error for multiprocessing Pool

How do I avoid an "out of memory" exception when a lot of sub-processes are launched using multiprocessing.Pool?
First of all, my program loads a 5 GB file into an object. Next, parallel processing runs, where each process reads that 5 GB object.
Because my machine has more than 30 cores, I want to use all of my cores. However, when launching 30 sub-processes, an out of memory exception occurs.
Probably, each process has a copy of the large instance (5 GB). The total memory is 5 GB * 30 cores = 150 GB. That's why the out of memory error occurs.
I believe there is a workaround to avoid this memory error, because each process just reads the object. If each process could share the memory of the huge object, only 5 GB of memory would be enough for my multiprocessing.
Please let me know a workaround for this memory error.
import cPickle
from multiprocessing import Pool
from multiprocessing import Process
import multiprocessing
from functools import partial

with open("huge_data_5GB.pickle", "rb") as f:
    huge_instance = cPickle.load(f)

def run_process(i, huge_instance):
    return huge_instance.get_element(i)

partial_process = partial(run_process, huge_instance=huge_instance)

p = Pool(30)  # my machine has more than 30 cores
result = p.map(partial_process, range(10000))
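One workaround (my own sketch, not part of the question, assuming Linux and the default fork start method) is to keep the huge object in a module-level global that the forked workers inherit copy-on-write, instead of passing it through partial, which forces it to be pickled and copied for every task. Note that reference counting still touches pages, so memory can creep up gradually, but nothing close to a full 5 GB copy per worker:

import pickle
from multiprocessing import Pool

huge_instance = None  # set in the parent before the Pool is created

def run_process(i):
    # Reads the module-level object inherited from the parent; with the
    # default 'fork' start method on Linux this memory is shared
    # copy-on-write rather than pickled and copied per task.
    return huge_instance.get_element(i)

if __name__ == "__main__":
    with open("huge_data_5GB.pickle", "rb") as f:
        huge_instance = pickle.load(f)
    p = Pool(30)  # workers forked here inherit huge_instance
    result = p.map(run_process, range(10000))
    p.close()
    p.join()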

Python Garbage Collection: Memory no longer needed not released to OS?

I have written an application with Flask that uses Celery for a long-running task. While load testing I noticed that the Celery tasks are not releasing memory even after completing the task. So I googled and found this group discussion:
https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw
In that discussion it says that's how Python works.
Also the article at https://hbfs.wordpress.com/2013/01/08/python-memory-management-part-ii/ says
"But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase."
And I use Linux. So I wrote the below script to verify it.
import gc

def memory_usage_psutil():
    # return the memory usage in MB
    import resource
    print 'Memory usage: %s (MB)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000.0)

def fileopen(fname):
    memory_usage_psutil()  # 10 MB
    f = open(fname)
    memory_usage_psutil()  # 10 MB
    content = f.read()
    memory_usage_psutil()  # 14 MB

def fun(fname):
    memory_usage_psutil()  # 10 MB
    fileopen(fname)
    gc.collect()
    memory_usage_psutil()  # 14 MB

import sys
from time import sleep

if __name__ == '__main__':
    fun(sys.argv[1])
    for _ in range(60):
        gc.collect()
        memory_usage_psutil()  # 14 MB ...
        sleep(1)
The input was a 4 MB file. Even after returning from the fileopen function, the 4 MB of memory was not released. I checked the htop output while the loop was running; the resident memory stays at 14 MB. So unless the process is stopped, the memory stays with it.
So if the Celery worker is not killed after its task is finished, it is going to keep the memory for itself. I know I can use the max_tasks_per_child config value to kill the process and spawn a new one. Is there any other way to return the memory to the OS from a Python process?
I think your measurement method and interpretation are a bit off. You are using ru_maxrss from resource.getrusage, which is the "high watermark" of the process. See this discussion for details on what that means. In short, it is the peak RAM usage of your process, but not necessarily the current usage. Parts of the process could be swapped out, etc.
It can also mean that the process has freed that 4 MiB, but the OS has not reclaimed the memory, because it's faster for the process to allocate a new 4 MiB if it already has the memory mapped. To make it even more complicated, programs can and do use "free lists", lists of blocks of memory that are not in active use but are not freed. This is also a common trick to make future allocations faster.
I wrote a short script to demonstrate the difference between virtual memory usage and max RSS:
import numpy as np
import psutil
import resource

def print_mem():
    print("----------")
    print("ru_maxrss: {:.2f}MiB".format(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))
    print("virtual_memory.used: {:.2f}MiB".format(
        psutil.virtual_memory().used / 1024 ** 2))

print_mem()
print("allocating large array (80e6,)...")
a = np.random.random(int(80e6))
print_mem()
print("del a")
del a
print_mem()
print("read testdata.bin (~400MiB)")
with open('testdata.bin', 'rb') as f:
    data = f.read()
print_mem()
print("del data")
del data
print_mem()
The results are:
----------
ru_maxrss: 22.89MiB
virtual_memory.used: 8125.66MiB
allocating large array (80e6,)...
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8731.85MiB
del a
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8121.66MiB
read testdata.bin (~400MiB)
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8513.11MiB
del data
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8123.22MiB
It is clear how ru_maxrss remembers the maximum RSS, while the current usage has dropped by the end.
Note on psutil.virtual_memory().used:
used: memory used, calculated differently depending on the platform and designed for informational purposes only.
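As an additional (hedged) illustration that is not in the original answer: the current resident set size of the process itself can be read with psutil.Process().memory_info().rss, which avoids both the high-watermark behavior of ru_maxrss and the system-wide noise of virtual_memory().used:

import psutil
import resource  # Unix-only; ru_maxrss is reported in kilobytes on Linux

proc = psutil.Process()  # the current process

def print_current_vs_peak():
    rss_mib = proc.memory_info().rss / 1024 ** 2  # current RSS
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024  # peak RSS
    print("current rss: {:.2f} MiB, peak rss: {:.2f} MiB".format(rss_mib, peak_mib))

data = bytearray(200 * 1024 * 1024)  # hold ~200 MiB
print_current_vs_peak()
del data
print_current_vs_peak()  # current rss typically drops here, the peak does not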
