Spark+Python set GC memory threshold

I'm trying to run a Python worker (a PySpark app) that uses too much memory, and YARN is killing my app for exceeding its memory limits (I'm trying to lower memory usage so that I can spawn more workers).
I come from Java/Scala, so in my head Python's GC works similarly to the JVM's...
Is there a way to tell Python how much "available memory" it has? I mean, Java runs a GC when the heap is almost full. I want to do the same in Python, so that YARN doesn't kill my application for using too much memory when that memory is garbage (I'm on Python3.3 and there are memory references # my machine).
I've seen resource hard and soft limits, but no documentation says whether a GC is triggered by them. AFAIK nothing triggers a GC based on memory usage. Does anyone know a way to do so?
Thanks,

CPython (I assume this is the one you use) is significantly different from Java. Its main garbage collection method is reference counting. Unless you deal with circular references (IMHO they are not common in normal PySpark workflows), you won't need full GC sweeps at all (data-related objects should be collected once the data is spilled / pickled).
Spark is also known to kill idle Python workers, even if you enable the reuse option, so quite often a worker never runs a GC at all.
You can control CPython's garbage collection behavior using the set_threshold method:
gc.set_threshold(threshold0[, threshold1[, threshold2]])
or trigger a GC sweep manually with collect:
gc.collect(generation=2)
but in my experience most GC problems in PySpark come from the JVM part, not from Python.
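As a minimal sketch of the two calls mentioned above: the helper names (tune_gc, collect_cycles, make_cycle, Node) and the threshold values are made up for illustration and are not taken from the question. Note that the thresholds count container-object allocations, not bytes, so this does not give you a JVM-style "collect when the heap is nearly full" trigger.

```python
import gc


def tune_gc(threshold0=10_000, threshold1=10, threshold2=10):
    """Change the cyclic GC thresholds and return the previous settings.

    The values are illustrative assumptions. threshold0 is a count of
    container allocations minus deallocations, not an amount of memory.
    """
    previous = gc.get_threshold()
    gc.set_threshold(threshold0, threshold1, threshold2)
    return previous


def collect_cycles():
    """Force a full collection; return the number of unreachable objects found."""
    return gc.collect(2)


class Node:
    """Tiny container type used only to build a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycle():
    """Create two objects that reference each other, then drop them.

    Reference counting alone cannot free them; only the cyclic GC can.
    """
    a, b = Node(), Node()
    a.other, b.other = b, a
```

In a PySpark job such helpers would have to run inside the Python worker (for example at the start of a mapPartitions function), since settings made in the driver process do not carry over to the workers.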

Related

PySpark. Correlation between spark.yarn.executor.memoryOverhead and spark.executor.pyspark.memory

I ran into an issue when serving an ML model using PySpark 2.4.0 and MLflow.
Executor fails with the following exception:
org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. Memory leaked: (2048) Allocator(stdin reader for ./my-job-impl-condaenv.tar.gz/bin/python) 0/2048/8194/9223372036854775807 (res/actual/peak/limit)
From the articles about PySpark, I understood the following:
Spark runs at least one Python process for each core of each executor;
the spark.executor.memory parameter configures only the JVM memory limits and doesn't affect the Python process;
the Python worker process consumes memory from the executor overhead, configured using spark.yarn.executor.memoryOverhead;
since Spark 2.4.0 we can reserve memory for the Python worker explicitly using spark.executor.pyspark.memory, which lets us plan memory at a finer granularity and stop overcommitting memory via spark.yarn.executor.memoryOverhead.
Here is an explanation of spark.executor.pyspark.memory from official docs:
The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit Python's memory use and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. When PySpark is run in YARN or Kubernetes, this memory is added to executor resource requests.
At first, I just increased the amount of memory using spark.yarn.executor.memoryOverhead, and the error finally went away.
Then I decided to improve things and specify the amount of memory for the Python worker using spark.executor.pyspark.memory, which caused the same error.
So it seems that I didn't properly understand what exactly spark.executor.pyspark.memory configures and how it relates to spark.yarn.executor.memoryOverhead.
I don't have much experience with PySpark, so I hope you can help me understand how memory allocation works in PySpark. Thanks!
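To make the relationship between the settings concrete, here is a rough arithmetic sketch of how the per-executor resource request is composed, based on the documentation quoted above ("this memory is added to executor resource requests"). The function names are hypothetical, and the default overhead rule used here (10% of executor memory, with a 384 MiB minimum) is the documented default, assumed not to be overridden. This only illustrates the sizing; it says nothing about the cause of the exception above.

```python
MIN_OVERHEAD_MIB = 384   # documented minimum for the executor memory overhead
OVERHEAD_FACTOR = 0.10   # documented default fraction of executor memory


def default_overhead_mib(executor_memory_mib):
    """Default overhead when spark.yarn.executor.memoryOverhead is not set."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FACTOR))


def container_request_mib(executor_memory_mib, overhead_mib=None, pyspark_memory_mib=0):
    """Approximate memory requested per executor on YARN or Kubernetes.

    executor_memory_mib  -> spark.executor.memory (JVM heap)
    overhead_mib         -> spark.yarn.executor.memoryOverhead (non-JVM memory)
    pyspark_memory_mib   -> spark.executor.pyspark.memory (0 means "not set")
    """
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib + pyspark_memory_mib
```

With a 4096 MiB executor and 1024 MiB of PySpark memory, the sketch adds the 1024 MiB on top of heap and overhead rather than carving it out of the overhead, which is the distinction the question is about.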

How can I make use of swap space/virtual RAM in Jupyter lab/notebook?

I am running processes in Jupyter (lab) in a JupyterHub-created container running on Kubernetes.
The processes are too RAM-intensive, to the extent that the pod sometimes gets evicted due to an OOM.
Without modifications to my code/algorithms etc., how can I in the general case tell Jupyter(Lab) to use swap space/virtual memory when a predefined RAM limit is reached?
PS This question has no answer mentioning swap space - Jupyter Lab freezes the computer when out of RAM - how to prevent it?

You can't actively control swap space.
In Kubernetes specifically, the closest option is to not supply a memory limit for the pod.
That would at least keep it from being killed for OOM (out of memory). However, I doubt it would work: the whole node would run out of RAM, start swapping, become extremely slow, and at some point be declared dead by the Kubernetes master. That in turn would cause the pod to be rescheduled somewhere else and start all over again.
A more scalable approach for you might be to use out-of-core algorithms, that can operate on disk directly (so just attach a PV/PVC to your pod), but that depends on the algorithm or process you're using.

Multiprocessing memory leak, processes that stay around forever

I'm trying to solve a multiprocessing memory leak and want to fully understand where the problem is. My architecture looks like this: a main process delegates tasks to a few sub-processes. Right now there are only 3 sub-processes. I'm using Queues to send data to these sub-processes, and it's working just fine except for the memory leak.
It seems most issues people have with memory leaks involve forgetting to join/exit/terminate their processes after completion. My case is a bit different. I want these processes to stay around for the entire duration of the application. So the main process will launch these 3 sub-processes, and they will never die until the entire app dies.
Do I still need to join them for any reason?
Is this a bad idea to keep processes around forever? Should I consider killing them and re-launching them at some point despite me not wanting to do that?
Should I not be using multiprocessing.Process for this use case?
I'm making a lot of API calls and generating a lot of dictionaries and arrays of data within my sub processes. I'm assuming my memory leak comes from not properly cleaning that up. Maybe my problem is entirely there and not related to the way I'm using multiprocessing.Process?
from multiprocessing import Process

# This is how I'm creating my 3 sub processes
procs = []
for name in names:
    proc = Process(target=print_func, args=(name,))
    procs.append(proc)
    proc.start()
# Then I want my sub-processes to live forever for the remainder of the application's life
# But memory leaks until I run out of memory
Update 1:
I'm seeing this memory growth/leaking on macOS 10.15.5 as well as Ubuntu 16.04. It behaves the same way on both OSes. I've tried Python 3.6 and Python 3.8 and have seen the same results.
I never had this leak before going multiprocess. So that's why I was thinking this was related to multiprocessing. When I ran my code in one single process -> no leaking. Once I went multiprocess running the same code -> leaking/bloating memory.
The data that's actually bloating are lists of data (floats & strings). I confirmed this using the Python package pympler, which is a memory profiler.
The biggest thing that changed when I added multiprocessing is that my data is now gathered in the subprocesses and then sent to the main process using Pyzmq. So I'm wondering if there are new pointers hanging around that somehow prevent Python from garbage collecting and fully releasing these lists of floats and strings.
I do have a feature that every ~30 seconds clears "old" data that I no longer need (since my data is time-sensitive). I'm currently investigating this to see if it's working as expected.
Update 2:
I've improved the way I'm deleting old dicts and lists. It seems to have helped but the problem still persists. The python package pympler is showing that I'm no longer leaking memory which is great. When I run it on mac, my activity monitor is showing a consistent increase of memory usage. When I run it on Ubuntu, the free -m command is also showing consistent memory bloating.
Here's what my memory looks like shortly after running the script:
ubuntu:~/Folder/$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7610        3920        2901           0         788        3438
Swap:             0           0           0
After running for a while, memory bloats according to free -m:
ubuntu:~/Folder/$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7610        7385         130           0          93          40
Swap:             0           0           0
ubuntu:~/Folder/$
It eventually crashes from using too much memory.
To test where the leak comes from, I've turned off the feature where my subprocesses send data to my main process via Pyzmq. So the subprocesses are still making API calls and collecting data, just not doing anything with it. The memory leak completely goes away when I do this. So clearly the leak happens in sending data from my subprocesses and then handling it in my main process. I'll continue to debug.
Update 3 POSSIBLY SOLVED:
I may have resolved the issue. Still testing more thoroughly. I did some extra memory cleanup on my dicts and lists that contained data. I also gave my EC2 instances ~20 GB of memory. My app's memory usage timeline looks like this:
Runtime after 1 minute: ~4 GB
Runtime after 2 minutes: ~5 GB
Runtime after 3 minutes: ~6 GB
Runtime after 5 minutes: ~7 GB
Runtime after 10 minutes: ~9 GB
Runtime after 6 hours: ~9 GB
Runtime after 10 hours: ~9 GB
What's odd is that slow increase. Based on how my code works, I don't understand why memory usage keeps growing slowly from minute 2 to minute 10. It should reach its maximum by around minute 2 or 3. Also, previously when I was running ALL of this logic in one single process, my memory usage was pretty low. I don't recall exactly what it was, but it was much, much lower than 9 GB.
I've done some reading on Pyzmq and it appears to use a ton of memory. I think the massive increase in memory usage comes from Pyzmq, since I'm using it to send a massive amount of data between processes. I've read that Pyzmq is incredibly slow to release memory from large data messages. So it's very possible that my memory leak was not really a memory leak; I was just using far more memory because of Pyzmq and multiprocessing sending data around. I could confirm this by running my code from before my recent changes on a machine with ~20 GB of memory.
Update 4 SOLVED:
My previous theory checked out. There was never a memory leak to begin with. Using Pyzmq with massive amounts of data dramatically increases memory usage, to the point where I had to ~6x the memory on my EC2 instance. So Pyzmq seems to either use a ton of memory or be very slow at releasing memory or both. Regardless, this has been resolved.
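The mismatch described in Update 2 (pympler reports no leak, yet free -m and Activity Monitor keep growing) can be checked from inside a process by comparing Python-level allocations with the resident set size the OS reports. The following is only a sketch using the standard library; the function names are made up, it is Unix-only because of the resource module, and ru_maxrss is the peak RSS so far, not the current value.

```python
import resource
import sys
import tracemalloc


def peak_rss_mib():
    """Peak resident set size of this process in MiB, as seen by the OS."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes, macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def traced_python_mib():
    """Memory currently held by Python allocations, in MiB.

    Returns 0.0 unless tracemalloc.start() was called earlier.
    """
    current, _peak = tracemalloc.get_traced_memory()
    return current / (1024 * 1024)


def memory_report():
    """Side-by-side view of Python-level memory and process-level memory."""
    return {
        "python_heap_mib": traced_python_mib(),
        "peak_rss_mib": peak_rss_mib(),
    }
```

If python_heap_mib stays flat while the RSS keeps climbing, the growth is likely happening outside Python objects (for example in native buffers of a messaging library), which is consistent with the conclusion reached in Update 4.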

Given that you are on Linux, I'd suggest using https://github.com/vmware/chap to understand why the processes are growing.
To do that, first use ps to figure out the process IDs of your processes (the main and the child processes), then use "gcore <pid>" for each process to gather a live core. Gather cores again for each process after they have grown a bit.
For each core, you can open it in chap and use the following commands:
redirect on
describe used
The result will be files named like the original cores, followed by ".describe_used".
You can compare them to see which allocations are new.
Once you have identified some interesting new allocations for a process, try using "describe incoming" repeatedly from the chap prompt until you have seen how those allocations are used.

Memory leak in Python Twisted: where is it?

I have a Twisted server under load. When the server is under load, memory usage increases, and it is never reclaimed (even when there are no more clients). Next time it goes into high load, memory usage increases again. Here's a snapshot of the situation at that point:
RSS memory is 400 MB (should be 200MB with usual max number of clients).
gc.garbage is empty, so there are no uncollectable objects.
Using objgraph.py shows no obvious candidates for leaks (no notable difference between a normal, healthy process and a leaking process).
Using pympler shows a few tens of MB (only) used by Python objects (mostly dict, list, str and other native containers).
Valgrind with --leak-check=full enabled doesn't show any major leaks (only a couple of MB 'definitely lost'), so C extensions are not the culprit. The total memory also doesn't add up to the 400MB+ shown by top:
==23072== HEAP SUMMARY:
==23072==     in use at exit: 65,650,760 bytes in 463,153 blocks
==23072==   total heap usage: 124,269,475 allocs, 123,806,322 frees, 32,660,215,602 bytes allocated
The only explanation I can find is that some objects are not tracked by the garbage collector, so that they are not shown by objgraph and pympler, yet use an enormous amount of RAM.
What other tools or solutions do I have? Would compiling the Python interpreter in debug mode help, by using sys.getobjects?
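On the point about objects that the garbage collector does not track: CPython's cyclic GC only tracks objects that can hold references to other objects, so atomic values such as ints and strings never appear in gc.get_objects(). A small sketch of how to check this with the standard gc module (the helper names are made up for illustration):

```python
import gc


def is_gc_tracked(obj):
    """True if the cyclic garbage collector tracks this object."""
    return gc.is_tracked(obj)


def count_tracked_by_type(limit=10):
    """Return the most common types among GC-tracked objects.

    Untracked objects (ints, strs, bytes, ...) are not included here,
    even though they can still account for a lot of memory.
    """
    counts = {}
    for obj in gc.get_objects():
        name = type(obj).__name__
        counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:limit]
```

For example, is_gc_tracked([]) is True while is_gc_tracked("some string") and is_gc_tracked(12345) are False. Tools built on gc.get_objects() only see untracked objects indirectly, through the containers that reference them.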

If the code only leaks under load (did you verify this?), I'd have a look at all the spots where messages are buffered. Does the memory usage of the process itself increase? Or does the memory use of the system increase? In the latter case, your server might simply be too slow to keep up with the incoming messages, so the OS buffers fill up.

How long do zipimported module imports remain cached in memory when using appengine / python and is there a way to keep them in memory?

I've recently uploaded an app that uses django appengine patch and currently have a cron job that runs every two minutes. On each invocation of the worker URL it consumes quite a bit of resources:
/worker_url 200 7633ms 34275cpu_ms 28116api_ms
That is because on each invocation it does a cold zipimport of all the libraries (Django etc.).
How long do the imported modules stay in memory?
Is there a way to keep these modules in memory, so that subsequent calls don't incur the import overhead even if they fall outside the timeframe in which the modules normally stay in memory?

App Engine keeps everything in memory according to normal Python semantics as long as it's serving one or more requests in the same process on the same node; if and when it needs those resources, the process goes away (so nothing it used to have stays in memory), and new processes may be started (on the same node or on different nodes) at any time to serve requests (whether other processes serving other requests are still running, or not). This is much like the FastCGI model: you're assured of normal semantics within a single request, but apart from that anything between 0 and N (no upper limit) different nodes may be running your code, each serving sequentially anything between 0 and K (no upper limit) different requests.
There is nothing you can do to "stay in memory" (for zipimported modules or anything else).
For completeness let me mention memcache, which is an explicit hint/request to the App Engine runtime to keep something in a special form of memory, a distributed hash table that's shared among all processes running your code -- that's hard though not impossible to use for imported modules (you'd need pretty sophisticated import hooks), and I recommend against the effort needed to develop such hooks, because even in the presence of such explicit hints the App Engine runtime can still choose at any time to evict anything you've stashed away in the cache.
Rather -- I'm not sure why a cron job in particular would need all of Django, nor why you're zipimporting it rather than just using the 1.0.2 that now comes with App Engine, per the docs -- care to elaborate? This might be a useful thing for you to optimize.
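The "normal Python semantics" mentioned above can be illustrated outside App Engine: within one process, a module imported from a zip archive is cached in sys.modules, so only the first import pays the cost of reading the archive. The sketch below is plain standard-library Python, not App Engine code; the module name demo_zipped_mod, the archive name libs.zip, and the helper names are all made up for the example.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_zip_with_module(directory, module_name="demo_zipped_mod"):
    """Create a zip archive containing one tiny module; return the archive path."""
    archive_path = os.path.join(directory, "libs.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr(module_name + ".py", "VALUE = 42\n")
    return archive_path


def import_twice_from_zip(module_name="demo_zipped_mod"):
    """Import a module from a zip twice within the same process.

    Returns (same_object, was_cached, value). The second import is served
    from sys.modules instead of re-reading the archive.
    """
    with tempfile.TemporaryDirectory() as directory:
        archive_path = build_zip_with_module(directory, module_name)
        sys.path.insert(0, archive_path)
        try:
            first = importlib.import_module(module_name)   # cold: reads the zip
            was_cached = module_name in sys.modules
            second = importlib.import_module(module_name)  # warm: from sys.modules
            return first is second, was_cached, first.VALUE
        finally:
            # Clean up so the example leaves no trace in the interpreter.
            sys.path.remove(archive_path)
            sys.modules.pop(module_name, None)
```

The cache lives only as long as the process does, which is why a request served by a freshly started process pays the cold import cost again.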
