Python: How to determine which file a dict is from?

I'm trying to debug a memory leak. It's a script that runs as a daemon, and has dependencies in about 10K lines of code in 30 different files. After a while you see the memory usage start creeping up.
I used heapy to determine that it's a dict that's growing, but how can I find out which file that dict lives in?
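A rough sketch of one way to chase this down (the function name below is made up, not a guaranteed recipe): ask the garbage collector who refers to the largest live dicts. A module-level dict is referenced by its module object, which carries __file__, so the referrer often points straight at the file:

import gc
import sys

def biggest_dict_owners(top_n=5):
    # Rank live dicts by their own size (container only, not contents).
    dicts = [o for o in gc.get_objects() if isinstance(o, dict)]
    dicts.sort(key=sys.getsizeof, reverse=True)
    for d in dicts[:top_n]:
        for referrer in gc.get_referrers(d):
            # A module refers to its own __dict__, so a module-level dict
            # leads back to the file that defines it.
            if hasattr(referrer, '__file__'):
                print(sys.getsizeof(d), referrer.__file__)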

Related

Multiprocessing memory leak, processes that stay around forever

I'm trying to solve a multiprocessing memory leak and am trying to fully understand where the problem is. My architecture looks like the following: a main process that delegates tasks to a few sub-processes. Right now there are only 3 sub-processes. I'm using Queues to send data to these sub-processes and it's working just fine except for the memory leak.
It seems most issues people have with memory leaks involve forgetting to join/exit/terminate their processes after completion. My case is a bit different: I want these processes to stay around forever, for the entire duration of the application. So the main process will launch these 3 sub-processes, and they will never die until the entire app dies.
Do I still need to join them for any reason?
Is this a bad idea to keep processes around forever? Should I consider killing them and re-launching them at some point despite me not wanting to do that?
Should I not be using multiprocessing.Process for this use case?
I'm making a lot of API calls and generating a lot of dictionaries and arrays of data within my sub-processes. I'm assuming my memory leak comes from not properly cleaning that up. Maybe my problem is entirely there and not related to the way I'm using multiprocessing.Process?
from multiprocessing import Process

# This is how I'm creating my 3 sub-processes
procs = []
for name in names:
    proc = Process(target=print_func, args=(name,))
    procs.append(proc)
    proc.start()

# Then I want my sub-processes to live forever for the remainder of the
# application's life, but memory leaks until I run out of memory.
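For reference, a rough sketch of the long-lived-worker layout being described (worker_loop and the task handling are illustrative, not the real code):

import multiprocessing as mp

def worker_loop(task_queue):
    # Lives for the whole application: block on the queue and handle tasks forever.
    while True:
        task = task_queue.get()
        if task is None:  # optional sentinel for a clean shutdown
            break
        # ... make API calls, build dicts/lists, send results back ...

if __name__ == "__main__":
    task_queue = mp.Queue()
    workers = [mp.Process(target=worker_loop, args=(task_queue,), daemon=True)
               for _ in range(3)]
    for w in workers:
        w.start()
    # join() only matters at shutdown, if you want to wait for the workers;
    # keeping them alive for the app's lifetime is fine in itself.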
Update 1:
I'm seeing this memory growth/leaking on macOS 10.15.5 as well as Ubuntu 16.04, and it behaves the same way on both OSes. I've tried Python 3.6 and Python 3.8 and have seen the same results.
I never had this leak before going multiprocess, which is why I thought it was related to multiprocessing: when I ran my code in one single process there was no leaking, but once I went multiprocess running the same code, memory started leaking/bloating.
The data that's actually bloating is lists of data (floats and strings). I confirmed this using the Python package pympler, which is a memory profiler.
The biggest thing that changed since my multiprocess feature was added is that my data is gathered in the subprocesses and then sent to the main process using Pyzmq. So I'm wondering if there are new pointers hanging around somehow, preventing Python from garbage collecting and fully releasing these lists of floats and strings.
I do have a feature that every ~30 seconds clears "old" data that I no longer need (since my data is time-sensitive). I'm currently investigating this to see if it's working as expected.
Update 2:
I've improved the way I'm deleting old dicts and lists. It seems to have helped, but the problem still persists. The Python package pympler shows that I'm no longer leaking memory, which is great, yet when I run it on macOS, Activity Monitor shows a consistent increase in memory usage, and when I run it on Ubuntu, the free -m command also shows consistent memory bloating.
Here's what my memory looks like shortly after running the script:
ubuntu:~/Folder/$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7610        3920        2901           0         788        3438
Swap:             0           0           0
After running for a while, memory bloats according to free -m:
ubuntu:~/Folder/$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7610        7385         130           0          93          40
Swap:             0           0           0
ubuntu:~/Folder/$
It eventually crashes from using too much memory.
To test where the leak comes from, I've turned off the feature where my subprocesses send data to the main process via Pyzmq. The subprocesses still make API calls and collect data, they just don't do anything with it. The memory leak completely goes away when I do this, so clearly the process of sending data from my subprocesses and then handling it in my main process is where the leak is happening. I'll continue to debug.
Update 3 POSSIBLY SOLVED:
I may have resolved the issue; I'm still testing more thoroughly. I did some extra memory cleanup on my dicts and lists that contained data, and I also gave my EC2 instances ~20 GB of memory. My app's memory usage timeline looks like this:
Runtime after 1 minute: ~4 GB
Runtime after 2 minutes: ~5 GB
Runtime after 3 minutes: ~6 GB
Runtime after 5 minutes: ~7 GB
Runtime after 10 minutes: ~9 GB
Runtime after 6 hours: ~9 GB
Runtime after 10 hours: ~9 GB
What's odd is that slow increase. Based on how my code works, I don't understand how memory usage slowly grows from minute 2 to minute 10; it should hit its maximum by around minute 2 or 3. Also, previously, when I was running ALL of this logic in one single process, my memory usage was pretty low. I don't recall exactly what it was, but it was much, much lower than 9 GB.
I've done some reading on Pyzmq, and it appears to use a ton of memory. I think the massive increase in memory usage comes from Pyzmq, since I'm using it to send a massive amount of data between processes. I've read that Pyzmq is incredibly slow to release memory from large data messages. So it's very possible that my memory leak was not really a leak at all; I was just using far more memory because of Pyzmq and multiprocessing sending data around. I could confirm this by running my code from before my recent changes on a machine with ~20 GB of memory.
Update 4 SOLVED:
My previous theory checked out: there was never a memory leak to begin with. Using Pyzmq with massive amounts of data dramatically increases memory usage, to the point where I had to roughly 6x the memory on my EC2 instance. So Pyzmq seems to either use a ton of memory, be very slow at releasing it, or both. Regardless, this has been resolved.
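As a rough illustration of the kind of change that keeps the per-message footprint down (the endpoint, chunk size, and names here are made up, not the actual code), results can be shipped in many modest frames instead of one giant message:

import zmq

CHUNK = 10_000  # illustrative chunk size

def send_results(rows, endpoint="tcp://127.0.0.1:5555"):
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(endpoint)
    # Many small frames keep the peak allocation per message modest;
    # one huge serialized blob forces both sides to hold it all at once.
    for start in range(0, len(rows), CHUNK):
        sock.send_json(rows[start:start + CHUNK])
    sock.send_json(None)  # sentinel marking the end of this batch
    sock.close()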
Given that you are on Linux, I'd suggest using https://github.com/vmware/chap to understand why the processes are growing.
To do that, first use ps to figure out the process IDs of your processes (the main process and the child processes), then run "gcore <pid>" for each process to gather a live core. Gather cores again for each process after they have grown a bit.
For each core, you can open it in chap and use the following commands:
redirect on
describe used
The result will be files named like the original cores, followed by ".describe_used".
You can compare them to see which allocations are new.
Once you have identified some interesting new allocations for a process, try using "describe incoming" repeatedly from the chap prompt until you have seen how those allocations are used.

Preventing multiple loading of large objects in uWSGI workers?

I have one very large custom data structure (similar to a trie, though that's not important to the question) that I'm using to access and serve data. I'm moving my application to uWSGI for production use now, and I definitely don't want this reloaded per worker. Can I share it among worker processes somehow? I just load the structure once and then reload it once a minute through apscheduler. Nothing any of the workers do modifies the data structure in any way. Is there another, better solution to this type of problem? Loading the same thing per worker is hugely wasteful.
Depending on the kind of data structure it is, you could try using a memory-mapped file. The standard-library mmap module wraps the relevant system calls.
The file's structure would need to reflect the data structure you are using. For example, if you need a trie, you could store all of the strings in a sorted list and do a binary search for the prefix to see which strings have that prefix.
As you access pages in the file, they will be loaded into memory via the OS's disk read cache. Subsequent requests for the same page will be fast. Because the disk cache can be shared between processes, all of your UWSGI workers will benefit from the speed of accessing cached pages.
I tried this on Linux by forcing it to scan a big file in two separate processes. Create a large file called 'big', then run the following in two separate Python processes:
import mmap

with open('big', 'rb') as fp:
    m = mmap.mmap(fp.fileno(), 0, flags=mmap.MAP_PRIVATE)
    for i in range(len(m)):
        if m[i] == ord('a'):  # Make sure 'a' doesn't occur in the file!
            break
You will notice that the resident memory of the two processes grows as they scan the file; however, so does their shared memory usage. For example, if 'big' is a 1 GB file, both processes will appear to be using about 1 GB of memory, but the overall memory load on the system will be increased by only 1 GB, not 2 GB.
Obviously there are some limitations to this approach, chiefly that the data structure you are looking to share must be easy to represent in a binary format. Also, Python needs to copy bytes from the file into memory whenever you access them. This can cause aggressive garbage collection if you frequently read through the entire file in small pieces, or undermine the shared-memory benefit of the memory map if you read large pieces.
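To make the sorted-list idea above concrete, here is a minimal sketch assuming fixed-width records (the 32-byte width and class name are illustrative) that binary-searches a memory-mapped file for a prefix:

import mmap
from bisect import bisect_left

RECORD = 32  # hypothetical fixed record width, padded with b'\0'

class SortedRecords:
    """Binary search over a memory-mapped file of fixed-width, sorted records."""

    def __init__(self, path):
        self._f = open(path, 'rb')
        self._m = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self._n = len(self._m) // RECORD

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        return self._m[i * RECORD:(i + 1) * RECORD].rstrip(b'\0')

    def has_prefix(self, prefix):
        # bisect only needs __len__/__getitem__, so the search touches
        # just O(log n) pages of the mapped file.
        i = bisect_left(self, prefix)
        return i < self._n and self[i].startswith(prefix)

Each uWSGI worker would build its own small index object, while the bulk of the data stays in the OS page cache shared by all of them.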

Python: Writing multiprocess shared list in background

I believe I am about to ask a definite newbie question, but here goes:
I've written a Python script that does SNMP queries. The SNMP query function uses a global list for its output.
def get_snmp(community_string, mac_ip):
    global output_list
    # ... the SNMP get happens here, producing output_string ...
    output_list.append(output_string)
The get_snmp queriers are launched using the following code:
pool.starmap_async(get_snmp, zip(itertools.repeat(COMMUNITY_STRING), input_list))
pool.close()
pool.join()

if output_file_name is not None:
    csv_writer(output_list, output_file_name)
This setup works fine: all of the get_snmp processes write their output to the shared list output_list, and then the csv_writer function is called and that list is dumped to disk.
The main issue with this program is that on a large run the memory usage can become quite high as the list is being built. I would like to write the results to the text file in the background to keep memory usage down, but I'm not sure how to do it. I went with the global list to eliminate file-locking issues.
I think that your main problem with increasing memory usage is that you don't remove contents from that list when writing them to file.
Maybe you should do del output_list[:] after writing it to file.
Have each of the workers write their output to a Queue, then have another worker (or the main thread) read from the Queue and write to a file. That way you don't have to store everything in memory.
Don't write directly to the file from the workers; otherwise you can have issues with multiple processes trying to write to the same file at the same time, which will just give you a headache until you fix it anyway.
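A minimal sketch of that queue-based layout, assuming one worker process per query (the body of get_snmp and the CSV row format are placeholders):

import csv
import multiprocessing as mp

def get_snmp(community_string, mac_ip, results_queue):
    # ... do the SNMP get here ...
    output_string = f"{mac_ip},<result>"  # placeholder for the real result
    results_queue.put(output_string)

def run(community_string, input_list, output_file_name):
    results_queue = mp.Queue()
    procs = [mp.Process(target=get_snmp,
                        args=(community_string, mac_ip, results_queue))
             for mac_ip in input_list]
    for p in procs:
        p.start()

    # Only this process touches the file, so there are no locking issues,
    # and rows hit the disk as they arrive instead of piling up in a list.
    with open(output_file_name, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(len(input_list)):
            writer.writerow([results_queue.get()])

    for p in procs:
        p.join()

With a very large input_list you would cap the number of worker processes (for example, a fixed set of workers pulling targets from a second queue), but the write side stays the same.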

Optimize Memory Usage in Python: del obj or gc.collect()?

I have a Python script to analyze user behavior from log files.
This script reads several large files (about 50 GB each) using file.readlines(), then analyzes them line by line and saves the results in a dict of Python objects; after all lines are analyzed, the dict is written to disk.
As I have a server with 64 cores and 96 GB of memory, I start 10 processes of this script, each of which handles part of the data. Besides, in order to save time spent on I/O operations, I use file.readlines(MAX_READ_LIMIT) instead of file.readline() and set MAX_READ_LIMIT = 1 GB.
After running this script on the server while using the top command to watch resource usage, I find that although each process of my script occupies only about 3.5 GB of memory (40 GB in total), there is only 380 MB left on the server (and there is no other significant memory-consuming app running on the server at the same time).
So I was wondering: where is the memory? There should be about 96 - 40 = 36 GB of memory left.
Please tell me if I've made any mistakes in the above observations.
One hypothesis is that unused memory is NOT returned to the memory pool immediately, so I was wondering how to release unused memory explicitly and immediately.
I learned from the Python documentation that there are two complementary methods of managing memory in Python: garbage collection and reference counting. According to the Python docs:
Since the collector supplements the reference counting already used in
Python, you can disable the collector if you are sure your program
does not create reference cycles.
So, which one should I use for my case, del obj or gc.collect() ?
using file.readlines(), then analyze data line by line
This is a bad design. readlines reads the entire file and returns a Python list of strings. If you only need to process the data line-by-line, then iterate through the file without using readlines:
with open(filename) as f:
    for line in f:
        ...  # process line
This will massively reduce the amount of memory your program requires.
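If the reason for readlines(MAX_READ_LIMIT) was to cut down on I/O overhead, a bigger read buffer gives you large physical reads while the loop stays line-by-line. A sketch (the 16 MiB figure is arbitrary and process() is a hypothetical per-line handler):

# buffering is in bytes; the file object still yields one line at a time,
# so memory stays bounded by the buffer plus the current line.
with open(filename, buffering=16 * 1024 * 1024) as f:
    for line in f:
        process(line)  # hypothetical per-line handler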

Python Daemon Process Memory Management

I'm currently writing a Python daemon process that monitors a log file in realtime and updates entries in a PostgreSQL database based on the results. The process only cares about a unique key that appears in the log file and the most recent value it has seen for that key.
I'm using a polling approach and process a new batch every 10 seconds. In order to reduce the overall set of data and avoid extraneous updates to the database, I only store the key and the most recent value in a dict. Depending on how much activity there has been in the last 10 seconds, this dict can vary from 10 to 1000 unique entries. Then the dict gets "processed" and those results are sent to the database.
My main concern revolves around memory management and the dict over time (days, weeks, etc.). Since this is a daemon process that's constantly running, memory usage bloats with the size of the dict but never shrinks appropriately. I've tried resetting the dict using a standard dereference and the dict.clear() method after processing a batch, but noticed no change in memory usage (FreeBSD/top). It seems that forcing a gc.collect() does recover some memory, but usually only around 50%.
Do you guys have any advice on how I should proceed? Is there something more I could be doing in my process? Feel free to chime in if you see a different road around the issue :)
When you clear() the dict or del the objects referenced by the dict, the contained objects are still around in memory. If they aren't referenced anywhere, they can be garbage-collected, as you have seen, but garbage collection isn't triggered automatically by a del or clear().
I found this similar question for you: https://stackoverflow.com/questions/996437/memory-management-and-python-how-much-do-you-need-to-know. In short, if you aren't running low on memory, you really don't need to worry a lot about this. FreeBSD itself does a good job of handling virtual memory, so even if your Python program holds a huge number of stale objects, your machine probably won't be swapping to disk.
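A sketch of what that looks like in the loop described above (read_new_log_entries and flush_to_database are stand-ins for the real code): building the batch dict fresh each cycle means nothing from the previous batch stays reachable, and an occasional gc.collect() only helps with reference cycles, while the process's resident size may still not shrink because CPython keeps freed memory around for reuse.

import gc
import time

def poll_forever(read_new_log_entries, flush_to_database, interval=10):
    while True:
        latest = {}  # key -> most recent value seen in this batch
        for key, value in read_new_log_entries():
            latest[key] = value
        if latest:
            flush_to_database(latest)
        del latest      # the batch dict and its contents become unreachable here
        gc.collect()    # only needed if the values participate in reference cycles
        time.sleep(interval)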
