I have developed a complex program in which a thread pool executes tasks scheduled by many objects, and I have memory leaks.
So far I have detected, using guppy, that the number of created objects is growing steadily, but they are never destroyed. How can I find out which objects are not being destroyed/collected?
Here is an excerpt of my code:
# Memory profiling
from guppy import hpy
import gc

class ThreadPool:
    ...
    # self.h is presumably an hpy() instance created once at startup
    # Every 1 sec run:
    gc.collect()  # yes, this is paranoid...
    print str(self.h.heap()).split('\n')[0]
And the result is:
Partition of a set of 110304 objects. Total size = 15475848 bytes.
Partition of a set of 110318 objects. Total size = 15479920 bytes.
Partition of a set of 110320 objects. Total size = 15480808 bytes.
Partition of a set of 110328 objects. Total size = 15481408 bytes.
...
What were the last objects created? Is there some introspection code that can help?
Thank you!
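One way to narrow it down with guppy itself (a sketch, not a definitive answer; setrelheap and byrcs are standard heapy calls): set a baseline first, so that later heap() calls only show objects created after that point, then group the survivors by type and by referrers:

from guppy import hpy

h = hpy()
h.setrelheap()  # baseline: heap() now only counts objects created after this call

# ... let the thread pool run for a while ...

leftover = h.heap()    # objects allocated since the baseline that are still alive
print(leftover)        # partitioned by type; the steadily growing row is your leak
print(leftover.byrcs)  # regroup by referrers to see what is keeping them alive

The row that keeps growing between samples tells you the type of the leaked objects, and byrcs usually points at who still holds references to them.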
I would like to create, in Python, a process that runs constantly in parallel while the main execution of my code is running. It should provide a way around Python's sequential execution, which prevents me from doing asynchronous work.
So I would like a function RunningFunc to run while my main code is doing some other operation.
I tried to use the threading module. However, the computation is not actually parallel, and since RunningFunc is a highly intensive computation it slows down my main code heavily.
I also tried the multiprocessing module, and I guess the answer involves a first process doing the computation while the main process accesses the data computed over time, via shared memory or a multiprocessing.Manager(). But I didn't figure out a way to do that.
For example, RunningFunc increments the Compteur variable.
def RunningFunc(x):
    boolean = True
    Compteur = 0
    while boolean:
        Compteur += 1
Meanwhile, my main code runs some computations, and I sometimes (not necessarily on every other_bool iteration) read the Compteur variable of RunningFunc.
other_bool = True
Value = 0
while other_bool:
    ## MAKING SOME COMPUTATION
    Value = Compteur  # read the Compteur variable that is constantly incremented
    ## MAKING SOME COMPUTATION
There are many ways to do processing in child processes. Which is best depends on questions such as the size of the data to be shared versus the time spent in the calculation. Following is an example much like your simple increment of a variable, but fleshed out to a slightly larger list of integers to highlight some of the issues you'll bump into.
A multiprocessing.Manager is a convenient way to share data among processes, but it's not particularly fast because it needs to synchronize data among its processes. If the data you want to share is fairly modest and doesn't change that often, it's a good choice. But I will just focus on shared memory here.
Most Python objects cannot be created in shared memory. Things like the object header, reference count or the memory heap are not shareable. Some objects, notably numpy arrays, can be shared, but that is a different answer.
What you can do is serialize and write/read to shared memory. This could be done with any serialization mechanism, but converting to fundamental types via struct is a good way to do it.
That means that you have to write your code to save its data periodically. You also need to worry about synchronization if you are saving anything bigger than a single CPU-level word to memory. The parent could read while the child is writing, giving you inconsistent data.
The following example shows one way to handle shared memory:
import multiprocessing as mp
import multiprocessing.shared_memory
import time
import struct

data_format = struct.Struct("3Q")  # will share 3 long long ints

def main():
    # lock keeps shared memory readers from getting intermediate data
    shared_lock = mp.Lock()
    shared = mp.shared_memory.SharedMemory(create=True, size=8*3)
    buf = shared.buf
    try:
        print(shared)
        child = mp.Process(target=running_func, args=(shared.name, shared_lock))
        child.start()
        try:
            print("read for 20 seconds")
            for i in range(20):
                with shared_lock:
                    my_list = data_format.unpack_from(buf, 0)
                print(my_list)
                time.sleep(1)
        finally:
            child.terminate()
            child.join()
    finally:
        shared.close()
        shared.unlink()

def running_func(shared_memory_name, lock):
    shared = mp.shared_memory.SharedMemory(name=shared_memory_name)
    buf = shared.buf
    try:
        my_list = [1, 2, 3]
        while True:
            my_list = [val + 1 for val in my_list]
            with lock:
                data_format.pack_into(buf, 0, *my_list)
    finally:
        shared.close()

if __name__ == "__main__":
    main()
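As a side note: for the single counter in the question, full shared-memory serialization may be overkill. The standard multiprocessing.Value wrapper shares one C-level value together with its own lock. A minimal sketch of that simpler route (my own addition, not part of the example above):

import multiprocessing as mp
import time

def running_func(compteur):
    # increment forever; get_lock() makes the read-modify-write atomic
    while True:
        with compteur.get_lock():
            compteur.value += 1

if __name__ == "__main__":
    compteur = mp.Value('q', 0)  # shared 64-bit signed integer
    child = mp.Process(target=running_func, args=(compteur,))
    child.start()
    for _ in range(5):
        time.sleep(1)
        print(compteur.value)  # the parent reads the live counter
    child.terminate()
    child.join()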
I'm bruteforcing an 8-digit PIN on an ELF executable (it's for a CTF) and I'm using asynchronous parallel processing. The code is very fast, but it fills the memory even faster.
It takes about 10% of the total iterations to fill 8 GB of RAM, and I have no idea what's causing it. Any help?
from pwn import *
import multiprocessing as mp
from tqdm import tqdm

def check_pin(pin):
    program = process('elf_exe')
    program.recvn(36)
    program.sendline(str(pin).encode())
    program.recvline()
    program.recvline()
    res = program.recvline()
    program.close()
    if 'Access denied.' in str(res):
        return None, None
    else:
        return res, pin

def process_result(result):
    # apply_async passes the callback a single value: the (res, pin) tuple
    res, pin = result
    if res is not None:
        print(pin)

if __name__ == '__main__':
    print(f'Starting bruteforce on {mp.cpu_count()} cores :)\n')
    pool = mp.Pool(mp.cpu_count())
    min_pin = 10000000
    max_pin = 99999999
    for pin in tqdm(range(min_pin, max_pin)):
        pool.apply_async(check_pin, args=(pin,), callback=process_result)
    pool.close()
    pool.join()
Multiprocessing pools create several processes. Calls to apply_async create a task that is added to a shared data structure (e.g. a queue). The data structure is read by the processes through inter-process communication (IPC). The problem is that apply_async returns a synchronization object that you do not use, so there is no synchronization. Items appended to the data structure take some memory space (at least 32*3 = 96 bytes, due to 3 CPython objects being allocated), and the data structure grows in memory to hold the 89,999,999 items, hence at least 8 GiB of RAM. The processes are not fast enough to keep up with the submitted work. What tqdm prints is completely misleading: it only tracks the number of tasks submitted, not the number executed, which is a tiny fraction of that. Almost all the work is still left to do when tqdm prints 100% and the submission loop finishes. I actually doubt that the "code is very fast", since it appears to spawn 90 million processes, and spawning a process is known to be an expensive operation.
To speed up this code and avoid a big memory usage, you need to aggregate the work into bigger tasks. You can, for example, pass a range of pins to check and add a loop inside check_pin. A reasonable range size is, for example, 1000. Additionally, you need to accumulate the AsyncResult objects returned by apply_async in a list and perform a periodic synchronization when the list becomes too big, so that the workers do not accumulate too much pending work and the shared data structure remains small. Here is a simple untested example:
lst = []
for rng in allRanges:
    lst.append(pool.apply_async(check_pin, args=(rng,), callback=process_result))
    if len(lst) > 100:
        # Naive synchronization: wait for the pending batch to drain
        for i in lst:
            i.wait()
        lst = []
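To make that concrete, here is one hedged sketch of what the range-based task and allRanges could look like (check_pin_range and RANGE_SIZE are illustrative names, not from the original post; the loop above would then submit check_pin_range instead of check_pin):

RANGE_SIZE = 1000  # pins per task, as suggested above

def check_pin_range(rng):
    # rng is a (start, stop) tuple; check every pin of the range in one task
    for pin in range(*rng):
        res, found_pin = check_pin(pin)  # the single-pin check from the question
        if res is not None:
            return res, found_pin
    return None, None

allRanges = [(start, min(start + RANGE_SIZE, 100000000))
             for start in range(10000000, 100000000, RANGE_SIZE)]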
I want to measure how much memory a Django queryset uses.
For example, I tried the simple approach below.
import os
import psutil

process = psutil.Process(os.getpid())
s = process.memory_info().rss  # in bytes
for i in queryset:
    pass
e = process.memory_info().rss  # in bytes
print('queryset memory: %s' % (e - s))
Since iterating over a queryset makes Django hit the database and cache the results, I try to measure the queryset's memory usage by taking the Python process's memory usage before and after the iteration.
I wonder whether this approach is right, or whether there is some better way to measure what I want.
The goal of this measurement is to predict whether there would be any issue when fetching a massive query result and, if there is, at roughly how many rows the issue starts.
I know that if I want to avoid caching the queryset results, I can use iterator().
However, iterator() also takes a chunk_size parameter to reduce the number of database hits, and memory usage will differ depending on chunk_size.
import os
import psutil

process = psutil.Process(os.getpid())
s = process.memory_info().rss  # in bytes
for i in queryset.iterator(chunk_size=10000):
    pass
e = process.memory_info().rss  # in bytes
print('queryset memory: %s' % (e - s))
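For comparison, RSS deltas are noisy (the allocator may keep freed memory mapped, and other threads can allocate in between). The standard library's tracemalloc module measures Python-level allocations directly; a minimal sketch, assuming the same queryset variable as above:

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
for i in queryset:
    pass
after = tracemalloc.take_snapshot()
# sum the per-line allocation differences made while iterating
stats = after.compare_to(before, 'lineno')
print('queryset allocations: %d bytes' % sum(s.size_diff for s in stats))
tracemalloc.stop()

Note that tracemalloc only sees Python allocations, not memory held at the C level by the database driver, so the two measurements can legitimately differ.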
I am trying to parallelize my code to build a similarity matrix using the multiprocessing module in Python. It works fine when I use a small np.ndarray with 10 x 15 elements, but when I scale the np.ndarray to 3613 x 7040 elements, the system runs out of memory.
Below is my code.
import multiprocessing
from multiprocessing import Pool

## Importing jaccard_similarity_score
from sklearn.metrics import jaccard_similarity_score

# Function for finding the similarity between two np arrays
def similarityMetric(a, b):
    return jaccard_similarity_score(a, b)

## The functions below are used for parallelizing the script

# Auxiliary function to make it work
def product_helper1(args):
    return similarityMetric(*args)

def parallel_product1(list_a, list_b):
    # spawn given number of processes
    p = Pool(8)
    # set each matching pair into a tuple
    job_args = getArguments(list_a, list_b)
    # map to pool
    results = p.map(product_helper1, job_args)
    p.close()
    p.join()
    return results

## getArguments builds the combined list of argument tuples
def getArguments(list_a, list_b):
    arguments = []
    for i in list_a:
        for j in list_b:
            item = (i, j)
            arguments.append(item)
    return arguments
Now when I run the code below, the system runs out of memory and hangs. I am passing two numpy.ndarrays, testMatrix1 and testMatrix2, which are of size (3613, 7040).
resultantMatrix = parallel_product1(testMatrix1,testMatrix2)
I am new to using this module in Python and trying to understand where I am going wrong. Any help is appreciated.
Odds are, the problem is just combinatoric explosion. You're trying to realize all the pairs in the main process up front rather than generating them lazily, so you're holding a huge amount of memory. Assuming the ndarrays contain double values, which become Python float, the memory usage of the list returned by getArguments is roughly the cost of a tuple and two floats per pair, or about:
3613 * 7040 * (sys.getsizeof((0., 0.)) + sys.getsizeof(0.) * 2)
On my 64-bit Linux system, that means ~2.65 GB of RAM on Py3, or ~2.85 GB on Py2, before the workers even do anything.
If you can process the data in a streaming fashion using a generator, so arguments are produced lazily and discarded when no longer needed, you could probably reduce memory usage dramatically:
import itertools

def parallel_product1(list_a, list_b):
    # spawn given number of processes
    p = Pool(8)
    # itertools.product returns a generator that lazily produces the argument tuples
    job_args = itertools.product(list_a, list_b)
    # map to pool
    results = p.map(product_helper1, job_args)
    p.close()
    p.join()
    return results
This still requires all the results to fit in memory; if product_helper1 returns floats, the expected memory usage for the result list on a 64-bit machine would still be around 0.75 GB or so, which is pretty large. If you can process the results in a streaming fashion instead, iterating the results of p.imap, or even better p.imap_unordered (the latter returns results as they are computed, not in the order the generator produced the arguments), and writing them to disk or otherwise ensuring they're released from memory quickly, would save a lot of memory. The following just prints them out, but writing them to a file in some reingestable format would also be reasonable.
def parallel_product1(list_a, list_b):
    # spawn given number of processes
    p = Pool(8)
    # itertools.product lazily produces the argument tuples
    job_args = itertools.product(list_a, list_b)
    # consume results as they complete instead of collecting them in a list
    for result in p.imap_unordered(product_helper1, job_args):
        print(result)
    p.close()
    p.join()
The map method sends all data to the workers via inter-process communication. As currently done, this consumes a huge amount of resources, because you're sending full copies of the matrix rows for every single pair.
What I would suggest is to modify getArguments to make a list of tuples of indices into the matrices that need to be combined. That's only two numbers that have to be sent to the worker process, instead of two rows of a matrix! Each worker then knows which rows in the matrices to use.
Load the two matrices and store them in global variables before calling map. This way every worker has access to them. And as long as they're not modified in the workers, the OS's virtual memory manager will not copy identical memory pages, keeping memory usage down.
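A minimal sketch of that index-based approach, assuming fork-based multiprocessing on Linux (the names matrix_a, matrix_b, similarity_by_index and parallel_product_by_index are illustrative, not from the original code):

import itertools
from multiprocessing import Pool
from sklearn.metrics import jaccard_similarity_score

# set before the pool forks; forked workers inherit these without copying
# (copy-on-write), as long as they only read them
matrix_a = None
matrix_b = None

def similarity_by_index(idx_pair):
    i, j = idx_pair
    # only the two small integers crossed the process boundary
    return jaccard_similarity_score(matrix_a[i], matrix_b[j])

def parallel_product_by_index(a, b):
    global matrix_a, matrix_b
    matrix_a, matrix_b = a, b
    p = Pool(8)
    idx_pairs = itertools.product(range(len(a)), range(len(b)))
    for result in p.imap_unordered(similarity_by_index, idx_pairs, chunksize=1000):
        print(result)  # or stream to disk
    p.close()
    p.join()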
I have written an application with Flask that uses Celery for a long-running task. While load testing, I noticed that the Celery tasks are not releasing memory even after completing the task. So I googled and found this group discussion:
https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw
In that discussion it says that this is simply how Python works.
Also, the article at https://hbfs.wordpress.com/2013/01/08/python-memory-management-part-ii/ says
"But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase."
And I use Linux, so I wrote the script below to verify it.
import gc

def memory_usage_psutil():
    # return the memory usage in MB
    import resource
    print 'Memory usage: %s (MB)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000.0)

def fileopen(fname):
    memory_usage_psutil()  # 10 MB
    f = open(fname)
    memory_usage_psutil()  # 10 MB
    content = f.read()
    memory_usage_psutil()  # 14 MB

def fun(fname):
    memory_usage_psutil()  # 10 MB
    fileopen(fname)
    gc.collect()
    memory_usage_psutil()  # 14 MB

import sys
from time import sleep

if __name__ == '__main__':
    fun(sys.argv[1])
    for _ in range(60):
        gc.collect()
        memory_usage_psutil()  # 14 MB ...
        sleep(1)
The input was a 4 MB file. Even after returning from the fileopen function, the 4 MB of memory was not released. I checked the htop output while the loop was running; the resident memory stayed at 14 MB. So unless the process is stopped, the memory stays with it.
So if the Celery worker is not killed after its task is finished, it is going to keep the memory for itself. I know I can use the max_tasks_per_child config value to kill the process and spawn a new one. Is there any other way to return the memory to the OS from a Python process?
I think your measurement method and interpretation are a bit off. You are using ru_maxrss of resource.getrusage, which is the "high watermark" of the process. See this discussion for details on what that means. In short, it is the peak RAM usage of your process, but not necessarily the current usage. Parts of the process could be swapped out, etc.
It can also mean that the process has freed that 4 MiB, but the OS has not reclaimed the memory, because it's faster for the process to allocate a new 4 MiB if it has the memory mapped already. To make it even more complicated, programs can and do use "free lists", lists of blocks of memory that are not in active use but are not freed. This is also a common trick to make future allocations faster.
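As an aside on your actual question of handing memory back: on glibc-based Linux you can explicitly ask the allocator to return free heap pages to the OS via malloc_trim. A hedged sketch; whether it helps depends on heap fragmentation, and it is a hint rather than a guarantee:

import ctypes

# glibc-only: ask malloc to release free heap pages back to the OS
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)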
I wrote a short script to demonstrate the difference between virtual memory usage and max RSS:
import numpy as np
import psutil
import resource

def print_mem():
    print("----------")
    print("ru_maxrss: {:.2f}MiB".format(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))
    print("virtual_memory.used: {:.2f}MiB".format(
        psutil.virtual_memory().used / 1024 ** 2))

print_mem()
print("allocating large array (80e6,)...")
a = np.random.random(int(80e6))
print_mem()

print("del a")
del a
print_mem()

print("read testdata.bin (~400MiB)")
with open('testdata.bin', 'rb') as f:
    data = f.read()
print_mem()

print("del data")
del data
print_mem()
The results are:
----------
ru_maxrss: 22.89MiB
virtual_memory.used: 8125.66MiB
allocating large array (80e6,)...
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8731.85MiB
del a
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8121.66MiB
read testdata.bin (~400MiB)
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8513.11MiB
del data
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8123.22MiB
It is clear how ru_maxrss remembers the peak RSS, while the current usage has dropped by the end.
Note on psutil.virtual_memory().used:
used: memory used, calculated differently depending on the platform and designed for informational purposes only.