Why does a Python thread consume so much memory?
I measured that spawning one thread consumes 8 MB of memory, almost as much as a whole new Python process!
OS: Ubuntu 10.10
Edit: due to popular demand I'll give an example; here it is:
from os import getpid
from time import sleep
from threading import Thread

def nap():
    print 'sleeping child'
    sleep(999999999)

print getpid()
child_thread = Thread(target=nap)
sleep(999999999)
On my box, pmap <pid> gives 9424K.
Now, let's run the child thread:
from os import getpid
from time import sleep
from threading import Thread

def nap():
    print 'sleeping child'
    sleep(999999999)

print getpid()
child_thread = Thread(target=nap)
child_thread.start()  # <--- ADDED THIS LINE
sleep(999999999)
Now pmap <pid> gives 17620K.
So, the cost of the extra thread is 17620K - 9424K = 8196K,
i.e. 87% of the cost of running a whole new separate process!
Now isn't that just, wrong?
This is not Python-specific, and has to do with the separate stack that gets allocated by the OS for every thread. The default maximum stack size on your OS happens to be 8MB.
Note that the 8MB is simply a chunk of address space that gets set aside, with very little memory committed to it initially. Additional memory gets committed to the stack when required, up to the 8MB limit.
The limit can be tweaked using ulimit -s, but in this instance I see no reason to do this.
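If you do want smaller stacks, the size can also be requested from within Python via threading.stack_size() before the thread is started. A minimal sketch (the 512 KiB figure is an arbitrary choice; the value must be at least 32 KiB, and some platforms round it up further):

import threading

# Request a 512 KiB stack for threads created after this call
# (must be at least 32 KiB; some platforms impose extra granularity).
threading.stack_size(512 * 1024)

t = threading.Thread(target=lambda: None)
t.start()
t.join()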
As an aside, pmap shows address space usage. It isn't a good way to gauge memory usage. The two concepts are quite distinct, if related.
Related
I am observing memory usage that I cannot explain. Below I provide a stripped-down version of my actual code that still exhibits this behavior. The code is intended to accomplish the following:
Read a text file in chunks of 1000 lines. Each line is a sentence. Split these 1000 sentences into 4 generators. Pass these generators to a process pool and run feature extraction in parallel, 250 sentences per worker.
In my actual code I accumulate features and labels from all sentences of the entire file.
Now here comes the weird thing: memory gets allocated but not freed again, even when not accumulating these values! And it has something to do with the process pool, I think. The total amount of memory taken depends on how many features are extracted for any given word. I simulate this here with range(100). Have a look:
from sys import argv
from itertools import chain, islice
from multiprocessing import Pool
from math import ceil

# dummified feature extraction function;
# the length of the range determines how much memory is used up in total,
# even though the objects are never stored
def features_from_sentence(sentence):
    return [{'some feature': 'some value'} for i in range(100)], ['some label' for i in range(100)]
# split iterable into a generator of generators of length `size`
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))
def features_from_sentence_meta(l):
    return list(map(features_from_sentence, l))
def make_X_and_Y_sets(sentences, i):
    print(f'start: {i}')
    pool = Pool()
    # split sentences into a generator of 4 generators
    sentence_chunks = chunks(sentences, ceil(50000/4))
    # results is a list containing the lists of pairs of X and Y of all chunks
    results = map(lambda x: x[0], pool.map(features_from_sentence_meta, sentence_chunks))
    X, Y = zip(*results)
    print(f'end: {i}')
    return X, Y
# reads file in chunks of `lines_per_chunk` lines
def line_chunks(textfile, lines_per_chunk=1000):
    chunk = []
    i = 0
    with open(textfile, 'r') as textfile:
        for line in textfile:
            if not line.split(): continue
            i += 1
            chunk.append(line.strip())
            if i == lines_per_chunk:
                yield chunk
                i = 0
                chunk = []
        yield chunk
textfile = argv[1]

for i, line_chunk in enumerate(line_chunks(textfile)):
    # stop processing file after 10 chunks to demonstrate
    # that memory stays occupied (check your system monitor)
    if i == 10:
        while True:
            pass
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
The file I am using to debug this has 50000 nonempty lines, which is why I use the hardcoded 50000 in one place. If you want to use the same file, here is a link for your convenience:
https://www.dropbox.com/s/v7nxb7vrrjim349/de_wiki_50000_lines?dl=0
Now when you run this script and open your system monitor, you will observe that memory gets used up and the usage keeps growing until the 10th chunk, where I artificially go into an endless loop to demonstrate that the memory stays in use, even though I never store anything.
Can you explain to me why this happens? I seem to be missing something about how multiprocessing pools are supposed to be used.
First, let's clear up some misunderstandings—although, as it turns out, this wasn't actually the right avenue to explore in the first place.
When you allocate memory in Python, of course it has to go get that memory from the OS.
When you release memory, however, it rarely gets returned to the OS, until you finally exit. Instead, it goes into a "free list"—or, actually, multiple levels of free lists for different purposes. This means that the next time you need memory, Python already has it lying around, and can find it immediately, without needing to talk to the OS to allocate more. This usually makes memory-intensive programs much faster.
But this also means that—especially on modern 64-bit operating systems—trying to understand whether you really do have any memory pressure issues by looking at your Activity Monitor/Task Manager/etc. is next to useless.
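For a rough illustration of the gap between interpreter-level and OS-level numbers, here's a sketch using sys.getallocatedblocks() (CPython 3.4+); the allocation itself is just a stand-in:

import sys

before = sys.getallocatedblocks()
junk = [object() for _ in range(1000000)]
during = sys.getallocatedblocks()
del junk
after = sys.getallocatedblocks()

# 'after' drops back near 'before', but the RSS shown by your
# Activity Monitor/Task Manager may stay elevated anyway.
print(before, during, after)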
The tracemalloc module in the standard library provides low-level tools to see what actually is going on with your memory usage. At a higher level, you can use something like memory_profiler, which (if you enable tracemalloc support—this is important) can put that information together with OS-level information from sources like psutil to figure out where things are going.
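For example, a minimal tracemalloc session might look like this (the allocation is a stand-in):

import tracemalloc

tracemalloc.start()

data = [{'k': i} for i in range(100000)]  # stand-in allocation

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:5]:
    print(stat)  # file, line number, and bytes allocated there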
However, if you aren't seeing any actual problems—your system isn't going into swap hell, you aren't getting any MemoryError exceptions, your performance isn't hitting some weird cliff where it scales linearly up to N and then suddenly goes all to hell at N+1, etc.—you usually don't need to bother with any of this in the first place.
If you do discover a problem, then, fortunately, you're already half-way to solving it. As I mentioned at the top, most memory that you allocated doesn't get returned to the OS until you finally exit. But if all of your memory usage is happening in child processes, and those child processes have no state, you can make them exit and restart whenever you want.
Of course there's a performance cost to doing so—process teardown and startup time, and page maps and caches that have to start over, and asking the OS to allocate the memory again, and so on. And there's also a complexity cost—you can't just run a pool and let it do its thing; you have to get involved in its thing and make it recycle processes for you.
The multiprocessing.Pool class has limited builtin support for this: pass maxtasksperchild=N when constructing the pool, and each worker will exit and be replaced with a fresh one after completing N tasks.
If you need more control, you can build your own pool. If you want to get fancy, you can look at the source to multiprocessing and do what it does. Or you can build a trivial pool out of a list of Process objects and a pair of Queues. Or you can just directly use Process objects without the abstraction of a pool.
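A minimal sketch of such a trivial pool, with a sentinel to shut the workers down (all names here are illustrative, not from the question):

from multiprocessing import Process, Queue

def square(x):  # stand-in for the real memory-hungry work
    return x * x

def pool_worker(task_q, result_q):
    # pull tasks until the None sentinel arrives, then exit;
    # when a process exits, all of its memory goes back to the OS
    for arg in iter(task_q.get, None):
        result_q.put(square(arg))

if __name__ == '__main__':
    task_q, result_q = Queue(), Queue()
    workers = [Process(target=pool_worker, args=(task_q, result_q))
               for _ in range(4)]
    for w in workers:
        w.start()
    for x in range(100):
        task_q.put(x)
    for _ in workers:
        task_q.put(None)  # one sentinel per worker
    results = [result_q.get() for _ in range(100)]
    for w in workers:
        w.join()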
Another reason you can have memory problems is that your individual processes are fine, but you just have too many of them.
And, in fact, that seems to be the case here.
You create a Pool of 4 workers in this function:
def make_X_and_Y_sets(sentences, i):
    print(f'start: {i}')
    pool = Pool()
    # ...
… and you call this function for every chunk:
for i, line_chunk in enumerate(line_chunks(textfile)):
    # ...
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
So, you end up with 4 new processes for every chunk. Even if each one has pretty low memory usage, having hundreds of them at once is going to add up.
Not to mention that you're probably severely hurting your time performance by having hundreds of processes competing over 4 cores, so you waste time in context switching and OS scheduling instead of doing real work.
As you pointed out in a comment, the fix for this is trivial: just make a single global pool instead of a new one for each call.
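A sketch of that fix, reusing the names from the question; the pool is created once at module level and shared by every call (on spawn platforms this, too, belongs under the __main__ guard discussed below):

pool = Pool()  # created once, reused for every chunk

def make_X_and_Y_sets(sentences, i):
    print(f'start: {i}')
    sentence_chunks = chunks(sentences, ceil(50000/4))
    results = map(lambda x: x[0],
                  pool.map(features_from_sentence_meta, sentence_chunks))
    X, Y = zip(*results)
    print(f'end: {i}')
    return X, Y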
Sorry for getting all Columbo here, but… just one more thing… This code runs at the top level of your module:
for i, line_chunk in enumerate(line_chunks(textfile)):
    # ...
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
… and that's the code that tries to spin up the pool and all the child tasks. But each child process in that pool needs to import this module, which means they're all going to end up running the same code, and spinning up another pool and a whole extra set of child tasks.
You're presumably running this on Linux or macOS, where the default start method is fork, which means multiprocessing can avoid this import, so you don't have a problem. But with the other start methods, this code would basically be a forkbomb that eats up all of your system resources. And that includes spawn, which is the default start method on Windows. So, if there's ever any chance anyone might run this code on Windows, you should put all of that top-level code in an if __name__ == '__main__': guard.
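Concretely, the driver loop from the question would move under the guard:

if __name__ == '__main__':
    textfile = argv[1]
    for i, line_chunk in enumerate(line_chunks(textfile)):
        if i == 10:
            while True:
                pass
        X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)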
This is my first question on Stack Overflow, so please bear with me. I am looking to calculate the variance for group ratings (long numpy arrays). Running the program without parallel processing works fine, but given that each process can run independently and there are 32 groups, I am looking to make use of multiprocessing to speed things up. This works OK for small numbers of groups (< 10), but after that the program will often just seemingly stop running, with no error messages, at an unspecified number of groups (usually between 20 and 30), although less frequently it will run all the way through. The arrays are quite large (21451 x 11462 user-item ratings), so I am wondering if the problem is caused by not having enough memory, although no error messages are printed.
import numpy as np
from functools import partial
import multiprocessing

def variance_parallel(extra_matrices, group_num):
    # do some variance calculation
    # print confirmation that we have entered function, and group number
    return single_group_var

def variance(extra_matrices, num_groups):
    variance_partial = partial(variance_parallel, extra_matrices)
    for g in list(range(num_groups)):
        group_var = pool.map(variance_partial, range(g))
    return group_var

num_cores = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(processes=num_cores)
variance(extra_matrices, num_groups)
Running the above code shows the program progressively building the number of groups it is checking variance on ([0],[0,1],[0,1,2],...) before eventually printing nothing.
Thanks in advance for any help and apologies if my formatting / question is a bit off!
Multiple processes do not share data
Data sent to processes needs to be copied
Since the arrays are large, the issue is very likely to do with said copying of large arrays to the processes. Furthermore, in Python's multiprocessing, sending data to processes is done by serialisation, which is (a) CPU intensive and (b) takes extra memory in and of itself.
In short, multiprocessing is not a good fit for your use case. Since numpy is a native code extension (where the GIL does not apply) and is thread safe, it is best to use threading instead of multiprocessing. With threading, the worker threads can share data via their parent process's address space, which does away with having to copy.
That should stop the program from running out of memory.
To collect results from the threads conveniently, the shared data and each thread's output can be bound to an object, as in a Python class.
Something like the below (untested, as the code sample is incomplete):
import numpy as np
from functools import partial
from threading import Thread
from multiprocessing import cpu_count

class Variance(Thread):
    def __init__(self, extra_matrices, group_num):
        Thread.__init__(self)
        self.extra_matrices = extra_matrices
        self.group_num = group_num
        self.output = None

    def run(self):
        # do some variance calculation
        # print confirmation that we have entered function, and group number
        self.output = single_group_var

num_cores = cpu_count() - 1
results = []
# process the groups in batches of num_cores threads
for batch_start in range(0, num_groups, num_cores):
    workers = [Variance(extra_matrices, g)
               for g in range(batch_start, min(batch_start + num_cores, num_groups))]
    # Start threads
    for worker in workers:
        worker.start()
    # Wait for completion
    for worker in workers:
        worker.join()
    results.extend([w.output for w in workers])
print(results)
I have written an application with Flask that uses Celery for a long-running task. While load testing I noticed that the Celery tasks are not releasing memory even after completing the task. So I googled and found this group discussion:
https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw
In that discussion it says that's how Python works.
Also the article at https://hbfs.wordpress.com/2013/01/08/python-memory-management-part-ii/ says
"But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase."
And I use Linux. So I wrote the below script to verify it.
import gc

def memory_usage_psutil():
    # return the memory usage in MB
    import resource
    print 'Memory usage: %s (MB)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000.0)

def fileopen(fname):
    memory_usage_psutil()  # 10 MB
    f = open(fname)
    memory_usage_psutil()  # 10 MB
    content = f.read()
    memory_usage_psutil()  # 14 MB

def fun(fname):
    memory_usage_psutil()  # 10 MB
    fileopen(fname)
    gc.collect()
    memory_usage_psutil()  # 14 MB

import sys
from time import sleep

if __name__ == '__main__':
    fun(sys.argv[1])
    for _ in range(60):
        gc.collect()
        memory_usage_psutil()  # 14 MB ...
        sleep(1)
The input was a 4MB file. Even after returning from the fileopen function, the 4MB of memory was not released. I checked the htop output while the loop was running; the resident memory stayed at 14MB. So unless the process is stopped, the memory stays with it.
So if the celery worker is not killed after its task is finished, it is going to keep the memory for itself. I know I can use the max_tasks_per_child config value to kill the process and spawn a new one. Is there any other way to return the memory to the OS from a Python process?
I think your measurement method and interpretation are a bit off. You are using ru_maxrss of resource.getrusage, which is the "high watermark" of the process. See this discussion for details on what that means. In short, it is the peak RAM usage of your process, but not necessarily the current usage. Parts of the process could be swapped out, etc.
It can also mean that the process has freed that 4MiB, but the OS has not reclaimed the memory, because it's faster for the process to allocate a new 4MiB if it already has the memory mapped. To make it even more complicated, programs can and do use "free lists", lists of blocks of memory that are not in active use but are not freed. This is also a common trick to make future allocations faster.
I wrote a short script to demonstrate the difference between virtual memory usage and max RSS:
import numpy as np
import psutil
import resource

def print_mem():
    print("----------")
    print("ru_maxrss: {:.2f}MiB".format(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))
    print("virtual_memory.used: {:.2f}MiB".format(
        psutil.virtual_memory().used / 1024 ** 2))

print_mem()

print("allocating large array (80e6,)...")
a = np.random.random(int(80e6))
print_mem()

print("del a")
del a
print_mem()

print("read testdata.bin (~400MiB)")
with open('testdata.bin', 'rb') as f:
    data = f.read()
print_mem()

print("del data")
del data
print_mem()
The results are:
----------
ru_maxrss: 22.89MiB
virtual_memory.used: 8125.66MiB
allocating large array (80e6,)...
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8731.85MiB
del a
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8121.66MiB
read testdata.bin (~400MiB)
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8513.11MiB
del data
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8123.22MiB
It is clear how ru_maxrss remembers the maximum RSS, while the current usage has dropped by the end.
Note on psutil.virtual_memory().used:
used: memory used, calculated differently depending on the platform and designed for informational purposes only.
Here's the program:
#!/usr/bin/python

import multiprocessing

def dummy_func(r):
    pass

def worker():
    pass

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    for index in range(0, 100000):
        pool.apply_async(worker, callback=dummy_func)

    # clean up
    pool.close()
    pool.join()
I found that memory usage (both VIRT and RES) kept growing until close()/join(); is there any solution to get rid of this? I tried maxtasksperchild with 2.7, but it didn't help.
I have a more complicated program that calls apply_async() ~6M times, and at around the 1.5M mark I already had 6G+ RES; to avoid all other factors, I simplified the program to the above version.
EDIT:
It turned out this version works better; thanks for everyone's input:
#!/usr/bin/python

import multiprocessing

ready_list = []

def dummy_func(index):
    global ready_list
    ready_list.append(index)

def worker(index):
    return index

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    result = {}
    for index in range(0, 1000000):
        result[index] = pool.apply_async(worker, (index,), callback=dummy_func)
        for ready in ready_list:
            result[ready].wait()
            del result[ready]
        ready_list = []
    # clean up
    pool.close()
    pool.join()
I didn't put any lock there, as I believe the main process is single threaded (the callback is more or less an event-driven thing, per the docs I read).
I changed v1's index range to 1,000,000 (the same as v2) and ran some tests. It's weird to me that v2 is even ~10% faster than v1 (33s vs 37s); maybe v1 was doing too many internal list maintenance jobs. v2 is definitely the winner on memory usage: it never went over 300M (VIRT) and 50M (RES), while v1 used to hit 370M/120M, with the best being 330M/85M. All numbers are from just 3-4 test runs, for reference only.
I had memory issues recently: I was calling a multiprocessing function multiple times, so it kept spawning processes and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
    from multiprocessing import Pool
    from contextlib import closing
    with closing(Pool(15)) as p:
        res = p.imap_unordered(simple_matching, ahugearray, 100)
        # materialise the results while the pool is still open;
        # imap_unordered returns a lazy iterator
        return list(res)
Simply create the pool within your loop and close it at the end of the loop with pool.close().
Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0,100000):
    pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink, before you can even see its memory usage in top. Change the list to a bigger one to see the difference. But note that map_async will first convert the iterable you pass to it to a list in order to calculate its length, if it doesn't have a __len__ method. If you have an iterator of a huge number of elements, you can use itertools.islice to process them in smaller chunks.
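A sketch of that chunking approach (self-contained; the worker and sizes are placeholders):

from itertools import islice
from multiprocessing import Pool

def worker(index):
    return index

def chunked(iterable, chunk_size):
    # yield successive lists of at most chunk_size items
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    pool = Pool(processes=16)
    # pool.map blocks per chunk, so the task queue (and its memory) stays bounded
    for chunk in chunked(range(10000000), 10000):
        pool.map(worker, chunk)
    pool.close()
    pool.join()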
I had a memory problem in a real-life program with much more data and finally found the culprit was apply_async.
P.S. With respect to memory usage, your two examples have no obvious difference.
I have a very large 3D point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out-of-memory errors. After some research and testing, I determined that I was filling the queue of tasks to be processed much more quickly than the subprocesses could empty it. I'm sure that by chunking, or using map_async, or something similar I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count % 10000 == 0:
    sys.stdout.write('.')
    if len(pool._cache) > 1e6:
        print "waiting for cache to clear..."
        last.wait()  # Where last is assigned the latest ApplyResult
So every 10k insertions into the pool I check whether there are more than 1 million operations queued (about 1G of memory used in the main process). When the queue is full I just wait for the last inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW, the _cache member is documented in the multiprocessing module's pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache
You can limit the number of tasks per child process:
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool (see the multiprocessing docs).
I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.
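A sketch of that semaphore pattern, for Python 3 (the bound of 1000 in-flight tasks is an arbitrary assumption; the callbacks run in the pool's result-handler thread in the main process, so releasing a threading semaphore there is safe):

import multiprocessing
import threading

MAX_IN_FLIGHT = 1000  # assumed bound; tune to your memory budget
in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def release_slot(_result):
    in_flight.release()  # one slot freed per finished task

def worker(x):
    return x * x  # stand-in for the real work

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    for x in range(10 ** 6):
        in_flight.acquire()  # blocks once too many tasks are queued
        pool.apply_async(worker, (x,),
                         callback=release_slot,
                         error_callback=release_slot)
    pool.close()
    pool.join()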
I have a program written in Python 2.6 that creates a large number of short-lived instances (it is a classic producer-consumer problem). I noticed that the memory usage as reported by top and pmap seems to increase when these instances are created and never goes back down. I was concerned that some Python module I was using might be leaking memory, so I carefully isolated the problem in my code. I then proceeded to reproduce it in as short an example as possible. I came up with this:
class LeaksMemory(list):
    timesDelCalled = 0

    def __del__(self):
        LeaksMemory.timesDelCalled += 1

def leakSomeMemory():
    l = []
    for i in range(0, 500000):
        ml = LeaksMemory()
        ml.append(float(i))
        ml.append(float(i * 2))
        ml.append(float(i * 3))
        l.append(ml)

import gc
import os

leakSomeMemory()

print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(gc.collect()) + " objects collected")
print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(os.getpid()) + " : check memory usage with pmap or top")
If you run this with something like python2.6 -i memoryleak.py it will halt, and you can use pmap -x PID to check the memory usage. I added the __del__ method so I could verify that GC was occurring. It is not there in my actual program and does not appear to make any functional difference. Each call to leakSomeMemory() increases the amount of memory consumed by this program. I fear I am making some simple error and that references are getting kept by accident, but I cannot identify it.
Python will release the objects, but it will not release the memory back to the operating system immediately. Instead, it will re-use the same segments for future allocations within the same interpreter.
Here's a blog post about the issue: http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
UPDATE: I tested this myself with Python 2.6.4 and didn't notice persistent increases in memory usage. Some invocations of leakSomeMemory() caused the memory footprint of the Python process to increase, and some made it decrease again. So it all depends on how the allocator is re-using the memory.
According to Alex Martelli:
"The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates."
So, in your situation it sounds like it would make sense to use the multiprocessing module to run the short-lived functions in separate processes to ensure the return of resources when the process finishes.
import multiprocessing as mp

def NOT_leakSomeMemory(i):
    # do stuff with i
    return result

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(NOT_leakSomeMemory, range(500000))
For more ideas on how to set things up using multiprocessing, see Doug Hellmann's multiprocessing tutorial.