TLDR: Adding a gc.collect() to my script has fixed the memory leak. How did this happen?
Long version: I was having a memory leak in my Flask server after making a change in how the database is updated. Before the change, the server processes would have a resident set size of 28 kB. After applying the change, it would grow to 250 MB over two days.
I've done some tests with the heap, but I didn't get any clue as to where the dangling references might be. So I just added a gc.collect() after the database commit (which happens every 15 seconds).
This somehow mysteriously solved it: the server has been running for an hour now and stays below 29.5 kB (before the fix it would be much higher by now). I'm not sure why this change solved it, since Python has automatic GC and I'm just forcing an immediate collection. Is using gc.collect() a viable fix for the leak (i.e. no side effects)?
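For reference, a minimal sketch of the pattern being described; the commit loop and names here are illustrative, only the placement of gc.collect() right after the commit is the point:

import gc
import time

def periodic_commit(session):
    # Hypothetical background loop: commit every 15 seconds, then force a collection.
    while True:
        session.commit()  # placeholder for the real database commit
        gc.collect()      # immediately collect unreachable reference cycles
        time.sleep(15)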
Related
I wrote a Python Dash app and made it available within my organization using OpenShift. I'm not really knowledgeable about OpenShift, but it seems to be running correctly, including when multiple users are involved.
My problem is with memory management. Each time a user initiates a new session, the memory used by the Dash app increases by ~200 MB according to OpenShift. When the user closes the browser tab, the consumed memory does not go down (not even after weeks). Essentially, the amount of memory the Dash app consumes keeps growing.
I am probably missing something, but how do I get Dash to clear memory after the user closes the browser tab, or after some time passes since the last action? The dcc.Store objects in my code have storage_type='memory'. But from what I understand, dcc.Store keeps all the stored data on the client side in the browser, so this should not increase memory on the server.
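For reference, a store declared with client-side memory storage looks roughly like this (the id is a placeholder; in older Dash versions the import is import dash_core_components as dcc):

from dash import dcc

store = dcc.Store(id="my-store", storage_type="memory")  # data lives in the browser tab, not on the server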
I deployed my app with
app.run_server(debug=True, dev_tools_hot_reload=False, port=8080, host="0.0.0.0")
in case this matters.
Any help would be really appreciated! Right now I keep manually restarting the app to clear the memory but this is not practical at all. Thank you!
How much memory are you allocating to the container? Also, does the memory continually go up? Or once it reaches a certain level does it plateau? Are you tracking any GC behavior in Python?
I'm not an expert on Python memory management, and I know nothing about Dash, but Python does manage its own memory heap and has a garbage collector. So it is completely normal for a Python process to never hand memory back to the OS; Python is essentially reserving that memory for potential future use. Once it needs memory, it will garbage-collect the unreferenced objects and reuse the space.
As long as you aren't running out of memory or seeing undesirable GC behavior, the best thing to do is just set reasonable memory requests/limits for the container and let the Python GC manage the memory it has been allocated.
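If you do want to see what the collector is actually doing before tuning container limits, a quick standard-library sketch:

import gc

gc.set_debug(gc.DEBUG_STATS)  # log a summary line for every collection pass
print(gc.get_count())         # objects tracked per generation since the last collection
print(gc.get_stats())         # per-generation totals: collections, collected, uncollectable
print(gc.collect())           # number of unreachable objects found by a forced full pass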
I also faced a similar issue with Dash on one of our company's legacy apps. Unfortunately, I cannot share the code.
I tried calling gc.collect() after each callback; it is not recommended, and it didn't help.
My problem was that the script didn't properly initialize Dash.
I added the following at the start of the script:
import dash

app = dash.Dash(__name__)
server = app.server  # expose the underlying Flask server for WSGI servers such as gunicorn
This mostly solved the memory issues.
app.run_server should only be used in a development environment (there is a warning when you run it). For production you need something like gunicorn; it has a learning curve, but not a steep one. If you insist on using app.run_server, set debug=False; it will reduce memory consumption and increase speed.
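As a rough sketch of that production setup (the module name app.py and the worker count are assumptions, not from the original post):

# app.py
import dash

app = dash.Dash(__name__)
server = app.server  # WSGI entry point that gunicorn will serve

if __name__ == "__main__":
    # Development only; in production run something like:
    #   gunicorn app:server --bind 0.0.0.0:8080 --workers 2
    app.run_server(debug=False, port=8080, host="0.0.0.0")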
I have a script which sometimes runs successfully, providing the desired output, but when rerun moments later it fails with the following error:
numpy.core._exceptions.MemoryError: Unable to allocate 70.8 MiB for an array with shape (4643100, 2) and data type float64
I realise this question has been answered several times (like here), but so far none of the solutions have worked for me. I was wondering if anyone has any idea how it's possible that the script sometimes runs fine and then moments later raises this error?
I have lowered my computer's RAM usage, increased the virtual memory, and rebooted my laptop, none of which seemed to help (Windows 10, 8.0 GB RAM, Python 3.9.2 32-bit).
PS: Unfortunately not possible to share the script/create dummy.
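For context, the 70.8 MiB in the error is simply the size of the requested array itself, which is easy to verify:

rows, cols, itemsize = 4_643_100, 2, 8   # shape and float64 size taken from the error message
print(rows * cols * itemsize)            # 74289600 bytes
print(rows * cols * itemsize / 2**20)    # ~70.8 MiB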
Python is a garbage-collected language, and garbage collection is non-deterministic. This means that peak memory usage may differ each time the program runs: on one run its peak usage stays below the available memory, while on the next run its peak usage is enough to consume everything available. On top of that, the available memory on the host is not constant; it fluctuates with whatever the other running processes are using, which is another reason the program may raise a MemoryError on one run and terminate without error on another.
Side note: increase virtual memory only as a last resort. It isn't RAM; it's disk being used like memory, and it is much slower.
Recently I started having some problems with Django (3.1) tests, which I finally tracked down to some kind of memory leak.
I normally run my suite (roughly 4000 tests at the moment) with --parallel=4, which results in a high memory watermark of roughly 3 GB (starting from 500 MB or so).
For auditing purposes, though, I occasionally run it with --parallel=1 - when I do this, the memory usage keeps increasing, ending up over the VM's allocated 6 GB.
I spent some time looking at the data and it became clear that the culprit is, somehow, WebTest - more specifically, its response.html and response.forms: each call during a test case might allocate a few MB (two or three, generally) which don't get released at the end of the test method and, more importantly, not even at the end of the TestCase.
I've tried everything I could think of: gc.collect() with gc.DEBUG_LEAK shows me a whole lot of collectable items but frees no memory at all; using delattr() on various TestCase and TestResponse attributes resulted in no change at all; and so on.
I'm quite literally at my wits' end, so any pointer to solve this (beside editing the thousand or so tests which use WebTest responses, which is really not feasible) would be very much appreciated.
(Please note that I also tried guppy, tracemalloc and memory_profiler, but none of them gave me any kind of actionable information.)
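For anyone attempting the same hunt, this is a minimal sketch of the kind of per-test snapshot diff tracemalloc supports (the "run one test" placeholder is where the suspect code would go):

import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
# ... run a single test or other suspicious block of code here ...
after = tracemalloc.take_snapshot()

# Show the ten source lines whose allocations grew the most between the snapshots
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)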
Update
I found that one of our EC2 testing instances isn't affected by the problem, so I spent some more time trying to figure this out.
Initially, I tried to find the "sensible" potential causes - for instance, the cached template loader, which was enabled on my local VM and disabled on the EC2 instance - without success.
Then I went all in: I replicated the EC2 virtualenv (with pip freeze) and the settings (copying the dotenv), and checked out the same commit where the tests were running normally on the EC2.
Et voilà! THE MEMORY LEAK IS STILL THERE!
Now, I'm officially giving up and will use --parallel=2 for future tests until some absolute guru can point me in the right direction.
Second update
And now the memory leak is there even with --parallel=2. I guess that's somehow better, since it looks increasingly like it's a system problem rather than an application problem. Doesn't solve it but at least I know it's not my fault.
Third update
Thanks to Tim Boddy's reply to this question I tried using chap to figure out what's making memory grow. Unfortunately I can't "read" the results properly, but it looks like some non-Python library is actually causing the problem.
So, this is what I've seen analyzing the core after a few minutes running the tests that I know cause the leak:
chap> summarize writable
49 ranges take 0x1e0aa000 bytes for use: unknown
1188 ranges take 0x12900000 bytes for use: python arena
1 ranges take 0x4d1c000 bytes for use: libc malloc main arena pages
7 ranges take 0x3021000 bytes for use: stack
139 ranges take 0x476000 bytes for use: used by module
1384 writable ranges use 0x38b5d000 (951,439,360) bytes.
chap> count used
3144197 allocations use 0x14191ac8 (337,189,576) bytes.
The interesting point is that the non-leaking EC2 instance shows pretty much the same values as the ones I get from count used - which would suggest that those "unknown" ranges are the actual memory hogs.
This is also supported by the output of summarize used (showing first few lines):
Unrecognized allocations have 886033 instances taking 0x8b9ea38(146,401,848) bytes.
Unrecognized allocations of size 0x130 have 148679 instances taking 0x2b1ac50(45,198,416) bytes.
Unrecognized allocations of size 0x40 have 312166 instances taking 0x130d980(19,978,624) bytes.
Unrecognized allocations of size 0xb0 have 73886 instances taking 0xc66ca0(13,003,936) bytes.
Unrecognized allocations of size 0x8a8 have 3584 instances taking 0x793000(7,942,144) bytes.
Unrecognized allocations of size 0x30 have 149149 instances taking 0x6d3d70(7,159,152) bytes.
Unrecognized allocations of size 0x248 have 10137 instances taking 0x5a5508(5,920,008) bytes.
Unrecognized allocations of size 0x500018 have 1 instances taking 0x500018(5,242,904) bytes.
Unrecognized allocations of size 0x50 have 44213 instances taking 0x35f890(3,537,040) bytes.
Unrecognized allocations of size 0x458 have 2969 instances taking 0x326098(3,301,528) bytes.
Unrecognized allocations of size 0x205968 have 1 instances taking 0x205968(2,120,040) bytes.
The size of those single-instance allocations is very similar to the kind of deltas I see if I add calls to resource.getrusage(resource.RUSAGE_SELF).ru_maxrss in my test runner when starting/stopping tests - but they're not recognized as Python allocations, hence my feeling that the leak is outside Python.
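For completeness, this is roughly the kind of measurement being referred to (the wrapper below is a sketch, not the actual test-runner hook):

import resource

def peak_rss_kb():
    # Peak resident set size of the current process (kilobytes on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
# ... run a single test here ...
after = peak_rss_kb()
print(f"peak RSS grew by {after - before} kB")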
First of all, a huge apology: I was mistaken in thinking WebTest was the cause of this, and the reason was indeed in my own code, rather than libraries or anything else.
The real cause was a mixin class where I, unthinkingly, added a dict as a class attribute, like
class MyMixin:
    errors = dict()
Since this mixin is used in a few forms, and the tests generate a fair amount of form errors (which are added to the dict), this ended up hogging memory: a class attribute is shared by every instance and lives as long as the class itself, so the dict just kept growing for the duration of the whole test run.
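A minimal sketch of the fix is simply to make it an instance attribute, so each form gets its own dict that is released together with the instance:

class MyMixin:
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # keep cooperating with the other base classes
        self.errors = {}                   # per-instance dict instead of a shared class attribute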
While this is not very interesting in itself, there are a few takeaways that may be helpful to future explorers who stumble across the same kind of problem. They might all be obvious to everybody except me and a single other developer - in which case, hello other developer.
The reason why the same commit had different behaviors on the EC2 machine and my own VM is that the branch in the remote machine hadn't been merged yet, so the commit that introduced the leak wasn't there poisoning the environment.
The takeaway here is: make sure the code you're testing is the same, not just the commit.
Low-level memory analysis might help in some cases but it's not a skill you pick up in half a day: I spent a long time trying to make sense of allocations and objects and whatever without getting any closer to the solution.
This kind of mistake can be incredibly costly - if I had a few hundred fewer tests, I wouldn't have ended up with an OOM error, and I probably wouldn't have noticed the problem at all. Until it was in production, that is.
That could be fixed with some kind of linter/static analysis too, if there were one which flags this kind of construction as potentially harmful. Unfortunately, there isn't one (that I could find).
git bisect is your friend, as long as you can find a commit that actually works.
I have huge array objects that are pickled with the Python pickle module.
I am trying to unpickle them and read out the data in a for loop.
Every time I am done reading and assessing, I delete all references to those objects.
After deletion, I even call gc.collect() along with time.sleep() to see if the heap memory shrinks.
The heap memory doesn't shrink, which suggests that the data is still referenced somewhere within the pickle loading. After 15 data files (I have 250+ files to process, 1.6 GB each) I hit a MemoryError.
I have seen many other questions here, pointing out a memory leak issue which was supposedly solved.
I don't understand what is exactly happening in my case.
Python's memory management does not return freed memory to the OS while the process is still running.
Running each iteration of the for loop in a subprocess that calls the script solved the issue for me.
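A minimal sketch of that approach, assuming a hypothetical process_file.py that handles a single data file; when each child process exits, all of its memory is returned to the OS:

import subprocess
import sys
from pathlib import Path

for path in sorted(Path("data").glob("*.pkl")):  # hypothetical location of the pickled files
    # Each file is handled by a fresh interpreter, so its memory is fully released on exit
    subprocess.run([sys.executable, "process_file.py", str(path)], check=True)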
Thanks for the feedback.
I'm trying to identify a memory leak in a Python program I'm working on. I'm currently running Python 2.7.4 on 64-bit Mac OS. I installed heapy to hunt down the problem.
The program involves creating, storing, and reading a large database using the shelve module. I am not using the writeback option, which I know can create memory problems.
Heapy shows that, during program execution, memory usage is roughly constant. Yet Activity Monitor shows rapidly increasing memory: within 15 minutes, the process has consumed all my system memory (16 GB) and I start seeing page-outs. Any idea why heapy isn't tracking this properly?
Take a look at this fine article. You are, most likely, not seeing memory leaks but memory fragmentation. The best workaround I have found is to identify what the output of your large working set operation actually is, load the large dataset in a new process, calculate the output, and then return that output to the original process.
This answer has some great insight and an example, as well. I don't see anything in your question that seems like it would preclude the use of PyPy.
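As an illustration of the workaround described above (all names are placeholders, and this is a Python 3 sketch): the heavy work runs in a child process, and only its small result comes back to the parent, so the fragmented heap dies with the worker.

import shelve
from concurrent.futures import ProcessPoolExecutor

def summarize(path):
    # Hypothetical heavy step: open the shelf, walk the large working set,
    # and return only the small result the parent actually needs.
    with shelve.open(path) as db:
        return sum(len(str(value)) for value in db.values())

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        result = pool.submit(summarize, "mydata").result()
    print(result)  # only the small output survives in the parent process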