Some observations on Heroku that don't completely mesh with my mental model.
My understanding is that CPython never releases memory back to the OS once it has obtained it. So we should never observe a decrease in the resident memory of CPython processes. And this is in fact what I observe from occasionally profiling my Django application on Heroku; sometimes the resident memory increases, but it never decreases.
However, sometimes Heroku alerts me that my worker dyno is using >100% of its memory quota. This generally happens when a long-running, response-data-heavy HTTPS request that I make to an external service (using the requests library) fails due to a server-side timeout. In this case, memory usage spikes way past 100% of quota, then gradually drops back below 100%, at which point the alarm ceases.
My question is: how is this memory released back to the OS? AFAIK it can't be CPython releasing it. My guess is that the incoming bytes from the long-running TCP connection are being buffered by the OS, which has the power to de-allocate. It's murky to me when exactly "ownership" of TCP bytes is transferred to my Django app. I'm certainly not explicitly reading lines from the input stream; I delegate all of that to requests.
Apparently, at one time, CPython did NOT ever release memory back to the OS. Then a patch was introduced in Python 2.5 that allowed memory to be released under certain circumstances, detailed here. Therefore it's no longer true to say that Python doesn't release memory; it's just that Python doesn't often release memory, because it doesn't handle memory fragmentation very well.
At a high level, Python keeps track of its memory in 256 KB chunks called arenas. Object pools are held within these arenas. Python is smart enough to free arenas back to the OS when they're empty, but it still doesn't handle fragmentation across the arenas very well.
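If you're curious, CPython exposes a diagnostic that dumps the allocator's arena and pool statistics (CPython-specific, available since 3.3; the output format is an implementation detail):

import sys

# Prints pymalloc arena/pool usage to stderr; useful for watching how many
# arenas are in use before and after a large read.
sys._debugmallocstats()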
In my particular circumstance, I was reading large HTTP responses. If you dig down the call chain starting with HTTPAdapter.send() in the requests library, you'll eventually find that the socket code in the Python standard library makes a system call to receive from its socket in chunks of 8192 bytes (the default buffer size). This is the point at which the OS copies bytes from the kernel into the process, where CPython turns them into string objects of size 8K and shoves them into an arena. Note that StringIO, the Python-land buffer object for sockets, simply keeps a list of these 8K strings rather than mushing them together into one super-string object.
Since 8K fits precisely 32 times into 256K, I think what is happening is that the received bytes are nicely filling up entire arenas without much fragmentation. These arenas can then be freed to the OS when the 8K strings filling them are deleted.
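For illustration, the buffering pattern described above looks roughly like this (a sketch of the idea, not the actual requests/stdlib code):

import socket

CHUNK = 8192  # the default read size mentioned above

def read_exactly(sock, length):
    chunks = []                                # a list of separate ~8K strings
    while length > 0:
        data = sock.recv(min(CHUNK, length))   # one kernel -> process copy per chunk
        if not data:
            raise EOFError("connection closed early")
        chunks.append(data)
        length -= len(data)
    return b"".join(chunks)                    # joined into one big string only at the end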
I think I understand why the memory is released gradually (asynchronous garbage collection?), but I still don't understand why it takes so long to release after a connection error. If the memory release always took this long, I should be seeing these memory usage errors all the time, because my Python memory usage should spike whenever one of these calls is made. I've checked my logs, and I can sometimes see these violations last for minutes. That seems like an insanely long interval for a memory release.
Edit: I have a solid theory on the issue now. This error is being reported to me by a logging system that keeps a reference to the last traceback. The traceback maintains references to all the variables in the frames of the traceback, including the StringIO buffer, which in turn holds references to all the 8K strings read from the socket. See the note under sys.exc_clear(): "This function is only needed in only a few obscure situations. These include logging and error handling systems that report information on the last or current exception."
Therefore, in exception cases, the 8K-string refcounts don't drop to zero and immediately empty their arenas as they would on the happy path; we have to wait for background garbage collection to detect the reference cycles.
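A contrived sketch of that failure mode (the names are hypothetical; the point is that a saved traceback keeps every frame's locals alive, including the big buffers):

import sys

last_error = None  # stand-in for a logging system that stores the last traceback

def fetch():
    # ~80 MB of fake "response chunks" held in a local variable
    buffers = [bytearray(8192) for _ in range(10000)]
    raise IOError("simulated server-side timeout")

try:
    fetch()
except IOError:
    last_error = sys.exc_info()  # traceback -> frame -> locals -> buffers

# The chunks stay resident here even though fetch() has returned, because the
# saved traceback still references its frame. Dropping that reference lets
# refcounting reclaim them immediately:
last_error = None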
The GC delay is compounded by the fact that when this exception occurs, lots of objects are allocated over the ~5 minutes before the timeout, which I'm guessing is plenty of time for many of the 8K strings to make it into the 2nd generation. With the default GC thresholds of (700, 10, 10), it would take roughly 700*10 allocations for string objects to make it into the 2nd generation. That comes out to 7000*8192 ~= 57 MB, which means that all the strings received before the last 57 MB of the bytestream make it into the 2nd gen, maybe even the 3rd gen if 570 MB is streamed (but that seems high).
Intervals on the order of minutes still seem awfully long for a collection of the 2nd generation, but I guess it's possible. Recall that GC isn't triggered by allocations alone; the actual trigger condition is (allocations - deallocations) > threshold.
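For reference, the relevant knobs can be inspected (and a full collection forced) directly from the gc module:

import gc

print(gc.get_threshold())  # defaults to (700, 10, 10)
print(gc.get_count())      # per-generation counts; gen 0 is allocations minus deallocations
gc.collect(2)              # force a collection of the oldest generation right now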
TL;DR Large responses fill up socket buffers that fill up arenas without much fragmentation, allowing Python to actually release their memory back to the OS. In unexceptional cases, this memory will be released immediately upon exit of whatever context referenced the buffers, because the ref count on the buffers will drop to zero, triggering an immediate reclamation. In exceptional cases, as long as the traceback is alive, the buffers will still be referenced, therefore we will have to wait for garbage collection to reclaim them. If the exception occurred in the middle of a connection and a lot of data was already transmitted, then by the time of the exception many buffers will have been classified as members of an elder generation, and we will have to wait even longer for garbage collection to reclaim them.
CPython will release memory, but it's a bit murky.
CPython allocates chunks of memory at a time; let's call them fields.
When you instantiate an object, CPython will use blocks of memory from an existing field if possible, meaning there are enough contiguous blocks in it for said object.
If there aren't enough contiguous blocks, it'll allocate a new field.
Here's where it gets murky.
A field is only freed when it contains zero objects, and while there's garbage collection in CPython, there's no "trash compactor". So if you have a couple of objects spread across a few fields, and each field is only 70% full, CPython won't move those objects together and free some fields.
It seems pretty reasonable that the large data chunk you're pulling from the HTTP call is getting allocated to "new" fields; then something goes sideways, the objects' reference counts go to zero, garbage collection runs, and those fields are returned to the OS.
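A minimal way to see this fragmentation effect for yourself (assuming Linux, since it reads VmRSS from /proc; exact numbers vary with Python version and allocator):

import gc

def rss_kb():
    # Current resident set size, Linux-only.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

print("baseline     :", rss_kb(), "kB")
data = [bytearray(100) for _ in range(1000000)]   # lots of small objects
print("after alloc  :", rss_kb(), "kB")

survivors = data[::2]   # keep every other object, so every "field" stays partly full
del data
gc.collect()
print("half deleted :", rss_kb(), "kB")   # barely drops: fragmented fields can't be freed

del survivors
gc.collect()
print("all deleted  :", rss_kb(), "kB")   # now-empty fields can go back to the OS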
Related
My memory usage on a Django DRF API project increases over time, and RAM fills up after 50+ API calls.
So far I have tried:
loading all models and class variables upfront
using a memory profiler and cleaning up the code where possible to reduce variable usage
adding garbage collection calls: gc.disable() at the beginning and gc.enable() at the end of the code
calling glibc's malloc_trim() through ctypes at the end of the code (see the sketch below), etc.
setting Gunicorn's max-requests limit (this results in extra model loading / response time at that moment)
Any suggestions on how to free up memory at the end of each request?
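For context, the malloc_trim-via-ctypes attempt mentioned above usually looks something like this (glibc/Linux only; a sketch of the idea, not a guaranteed fix, since it only releases heap pages that glibc itself is able to trim):

import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)   # ask glibc to hand free heap pages back to the OS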
Due to the way that the CPython interpreter manages memory, it very rarely actually frees any allocated memory. Generally, CPython processes will just keep growing in memory usage.
Since you are using Gunicorn, you can set the max_requests setting, which will regularly restart your workers and alleviate some "memory leak" issues.
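A sketch of what that looks like in a Gunicorn config file, using Gunicorn's documented max_requests settings (the numbers are placeholders to tune for your workload):

# gunicorn.conf.py
max_requests = 500          # recycle a worker after it has served 500 requests
max_requests_jitter = 50    # stagger restarts so workers don't all recycle at once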
I'm trying to solve a multiprocessing memory leak and am trying to fully understand where the problem is. My architecture is the following: a main process that delegates tasks to a few sub-processes. Right now there are only 3 sub-processes. I'm using Queues to send data to these sub-processes, and it's working just fine except for the memory leak.
It seems most issues people have with memory leaks involve forgetting to join/exit/terminate their processes after completion. My case is a bit different. I want these processes to stay around forever, for the entire duration of the application. So the main process will launch these 3 sub-processes, and they will never die until the entire app dies.
Do I still need to join them for any reason?
Is this a bad idea to keep processes around forever? Should I consider killing them and re-launching them at some point despite me not wanting to do that?
Should I not be using multiprocessing.Process for this use case?
I'm making a lot of API calls and generating a lot of dictionaries and arrays of data within my sub processes. I'm assuming my memory leak comes from not properly cleaning that up. Maybe my problem is entirely there and not related to the way I'm using multiprocessing.Process?
from multiprocessing import Process

# This is how I'm creating my 3 sub-processes
procs = []
for name in names:
    proc = Process(target=print_func, args=(name,))
    procs.append(proc)
    proc.start()

# Then I want my sub-processes to live forever for the remainder of the application's life
# But memory leaks until I run out of memory
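For reference, a minimal sketch of the long-lived-worker pattern described above (worker() and the task payloads are hypothetical stand-ins for my own code):

from multiprocessing import Process, Queue

def worker(task_queue):
    # Lives for the whole application; pulls work off the queue forever.
    while True:
        task = task_queue.get()
        if task is None:        # optional shutdown sentinel
            break
        # ... make API calls, build dicts/lists, hand results back ...

if __name__ == "__main__":
    queue = Queue()
    workers = [Process(target=worker, args=(queue,), daemon=True) for _ in range(3)]
    for w in workers:
        w.start()
    # join() only matters if you want to block until they finish; with
    # daemon=True they are killed automatically when the main process exits.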
Update 1:
I'm seeing this memory growth/leaking on macOS 10.15.5 as well as Ubuntu 16.04. It behaves the same way on both OSes. I've tried Python 3.6 and Python 3.8 and have seen the same results.
I never had this leak before going multiprocess, which is why I was thinking it was related to multiprocessing. When I ran my code in one single process there was no leaking; once I went multiprocess running the same code, memory started leaking/bloating.
The data that's actually bloating consists of lists of data (floats & strings). I confirmed this using the Python package pympler, which is a memory profiler.
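Concretely, that pympler check was basically its summary tracker, something like this sketch (the checkpoints go wherever they make sense in the loop):

from pympler import tracker

tr = tracker.SummaryTracker()
# ... run one batch of API calls / queue handling ...
tr.print_diff()   # prints the object types whose count/size grew since the last checkpoint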
The biggest thing that changed since my multiprocess feature was added is that my data is gathered in the subprocesses and then sent to the main process using Pyzmq. So I'm wondering if there are new pointers hanging around somehow, preventing Python from garbage collecting and fully releasing these lists of floats and strings.
I do have a feature that every ~30 seconds clears "old" data that I no longer need (since my data is time-sensitive). I'm currently investigating this to see if it's working as expected.
Update 2:
I've improved the way I'm deleting old dicts and lists. It seems to have helped, but the problem still persists. The Python package pympler is showing that I'm no longer leaking memory, which is great. Yet when I run it on macOS, Activity Monitor shows a consistent increase in memory usage, and when I run it on Ubuntu, the free -m command also shows consistent memory bloating.
Here's what my memory looks like shortly after running the script:
ubuntu:~/Folder/$ free -m
total used free shared buff/cache available
Mem: 7610 3920 2901 0 788 3438
Swap: 0 0 0
After running for a while, memory bloats according to free -m:
ubuntu:~/Folder/$ free -m
total used free shared buff/cache available
Mem: 7610 7385 130 0 93 40
Swap: 0 0 0
ubuntu:~/Folder/$
It eventually crashes from using too much memory.
To test where the leak comes from, I've turned off the feature where my subprocesses send data to my main process via Pyzmq. So the subprocesses are still making API calls and collecting data, just not doing anything with it. The memory leak completely goes away when I do this. So clearly the process of sending data from my subprocesses and then handling that data on my main process is where the leak is happening. I'll continue to debug.
Update 3 POSSIBLY SOLVED:
I may have resolved the issue. Still testing more thoroughly. I did some extra memory cleanup on my dicts and lists that contained data. I also gave my EC2 instances ~20 GB of memory. My app's memory usage timeline looks like this:
Runtime after 1 minute: ~4 GB
Runtime after 2 minutes: ~5 GB
Runtime after 3 minutes: ~6 GB
Runtime after 5 minutes: ~7 GB
Runtime after 10 minutes: ~9 GB
Runtime after 6 hours: ~9 GB
Runtime after 10 hours: ~9 GB
What's odd is that slow increment. Based on how my code works, I don't understand how it slowly increases memory usage from minute 2 to minute 10. It should be using max memory by around minute 2 or 3. Also, previously when I was running ALL of this logic on one single process, my memory usage was pretty low. I don't recall exactly what it was, but it was much much lower than 9 GB.
I've done some reading on Pyzmq, and it appears to use a ton of memory. I think the massive increase in memory usage comes from Pyzmq, since I'm using it to send a massive amount of data between processes. I've read that Pyzmq is incredibly slow to release memory from large data messages. So it's very possible that my memory leak was not really a memory leak; I was just using far more memory because of Pyzmq and multiprocessing sending data around. I could confirm this by running my code from before my recent changes on a machine with ~20 GB of memory.
Update 4 SOLVED:
My previous theory checked out. There was never a memory leak to begin with. Using Pyzmq with massive amounts of data dramatically increases memory usage, to the point where I had to roughly 6x the memory on my EC2 instance. So Pyzmq seems to either use a ton of memory, or be very slow at releasing memory, or both. Regardless, this has been resolved.
Given that you are on Linux, I'd suggest using https://github.com/vmware/chap to understand why the processes are growing.
To do that, first use ps to figure out the process IDs for each of your processes (the main and the child processes), then use "gcore <pid>" for each process to gather a live core. Gather cores again for each process after they have grown a bit.
For each core, you can open it in chap and use the following commands:
redirect on
describe used
The result will be files named like the original cores, followed by ".describe_used".
You can compare them to see which allocations are new.
Once you have identified some interesting new allocations for a process, try using "describe incoming" repeatedly from the chap prompt until you have seen how those allocations are used.
I'd expect the memory usage of my App Engine instances (Python) to be relatively flat after an initial startup period. Each request to my app is short-lived, and it seems all memory used by a single request should be released shortly afterwards.
This is not the case in practice however. Below is a snapshot of instance memory usage provided by the console. My app has relatively low traffic so I generally have only one instance running. Over the two-day period in the graph, the memory usage trend is constantly increasing. (The two blips are where two instances were briefly running.)
I regularly get memory exceeded errors so I'd like to prevent this continuous increase of memory usage.
At the time of the snapshot:
Memcache is using less than 1MB
Task queues are empty
Traffic is low (0.2 count/second)
I'd expect the instance memory usage to fall in these circumstances, but it isn't happening.
Because I'm using Python with its automatic garbage collection, I don't see how I could have caused this.
Is this expected app engine behavior and is there anything I can do to fix it?
I found another answer that explains part of what is going on here. I'll give a summary based on that answer:
When using NDB, entities are stored in a context cache, and the context cache is part of your memory usage.
From the documentation, one would expect that memory to be released upon the completion of an HTTP request.
In practice, the memory is not released upon the completion of the HTTP request. Apparently, context caches are reused, and the cache is cleared before its next use, which can take a long time to happen.
For my situation, I am adding _use_cache=False to most entities to prevent them from being stored in the context cache. Because of the way my app works, I don't need the context caches for these entities, and this reduces memory usage.
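Roughly what that looks like with NDB's class-level cache policy flags (MyEntity is a placeholder model name):

from google.appengine.ext import ndb

class MyEntity(ndb.Model):
    _use_cache = False        # don't keep these entities in the per-request in-context cache
    # _use_memcache = False   # optionally skip memcache as well
    name = ndb.StringProperty()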
The above is only a partial solution however!
Even with caching turned off for most of my entities, my memory usage is still constantly increasing! Below is a snapshot over a 2.5-day period in which the memory continuously increases from 36 MB to 81 MB. This is over the 4th of July weekend with low traffic.
I'm currently writing a Python daemon process that monitors a log file in realtime and updates entries in a Postgresql database based on their results. The process only cares about a unique key that appears in the log file and the most recent value it's seen from that key.
I'm using a polling approach and process a new batch every 10 seconds. In order to reduce the overall set of data and avoid extraneous updates to the database, I'm only storing the key and its most recent value in a dict. Depending on how much activity there has been in the last 10 seconds, this dict can vary from 10 to 1000 unique entries. Then the dict gets "processed" and those results are sent to the database.
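In outline, the loop looks something like the following sketch (tail_new_lines, parse_line and flush_to_db are hypothetical placeholders for my actual code):

import time

def run_poller():
    while True:
        batch = {}                        # key -> most recent value only
        for line in tail_new_lines():     # whatever yields the new log lines
            key, value = parse_line(line)
            batch[key] = value            # later lines overwrite earlier ones
        if batch:
            flush_to_db(batch)            # push the 10-1000 rows to the database
        time.sleep(10)                    # poll every 10 seconds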
My main concern revolves around memory management and the dict over time (days, weeks, etc.). Since this is a daemon process that's constantly running, memory usage bloats based on the size of the dict but never shrinks appropriately. I've tried resetting the dict with a standard dereference, and with the dict.clear() method after processing a batch, but noticed no change in memory usage (FreeBSD/top). It seems that forcing a gc.collect() does recover some memory, but usually only around 50%.
Do you guys have any advice on how I should proceed? Is there something more I could be doing in my process? Feel free to chime in if you see a different road around the issue :)
When you clear() the dict or del the objects referenced by the dict, the contained objects are still around in memory. If they aren't referenced anywhere, they can be garbage-collected, as you have seen, but garbage-collection isn't run explicitly on a del or clear().
I found this similar question for you: https://stackoverflow.com/questions/996437/memory-management-and-python-how-much-do-you-need-to-know. In short, if you aren't running low on memory, you really don't need to worry a lot about this. FreeBSD itself does a good job handling virtual memory, so even if you have a huge amount of stale objects in your Python program, your machine probably won't be swapping to the disk.
I have a Twisted server under load. When the server is under load, memory usage increases, and it is never reclaimed (even when there are no more clients). Next time it goes into high load, memory usage increases again. Here's a snapshot of the situation at that point:
RSS memory is 400 MB (it should be around 200 MB with the usual maximum number of clients).
gc.garbage is empty, so there are no uncollectable objects.
Using objgraph.py shows no obvious candidates for leaks (no notable difference between a normal, healthy process and a leaking process).
Using pympler shows a few tens of MB (only) used by Python objects (mostly dict, list, str and other native containers).
Valgrind with leak-check=full enabled doesn't show any major leaks (only a couple of MB "definitely lost"), so C extensions are not the culprit. The total memory also doesn't add up with the 400 MB+ shown by top:
==23072== HEAP SUMMARY:
==23072== in use at exit: 65,650,760 bytes in 463,153 blocks
==23072== total heap usage: 124,269,475 allocs, 123,806,322 frees, 32,660,215,602 bytes allocated
The only explanation I can find is that some objects are not tracked by the garbage collector, so that they are not shown by objgraph and pympler, yet use an enormous amount of RAM.
What other tools or solutions do I have? Would compiling the Python interpreter in debug mode help, by using sys.getobjects?
If the code is only leaking under load (did you verify this?), I'd have a look at all the spots where messages are buffered. Does the memory usage of the process itself increase, or does the memory use of the system increase? If it's the latter, your server might simply be too slow to keep up with the incoming messages, and the OS buffers fill up.
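To answer the first question, one simple approach is to log the process's own RSS periodically while under load (a sketch that assumes the third-party psutil package):

import os
import psutil

proc = psutil.Process(os.getpid())
rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
print("RSS: %.1f MB" % rss_mb)   # log this every few seconds and compare against system memory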