Python multiprocessing using Pool fails on AWS Ubuntu

I have a simple string-matching script that tests just fine for multiprocessing with up to 8 Pool workers on my local Mac with 4 cores. However, the same script on an AWS c1.xlarge with 8 cores generally kills all but 2 workers, the CPU only works at 25%, and after a few rounds it stops with a MemoryError.
I'm not too familiar with server configuration, so I'm wondering if there are any settings to tweak.
The pool implementation looks as follows, but doesn't seem to be the issue as it works locally. There are several thousand targets per worker, and it doesn't run past the first five or so. Happy to share more of the code if necessary.
# Python 2 code: runMatch, getTargets and numProcesses are defined elsewhere in the script
from multiprocessing import Pool
import itertools
pool = Pool(processes=numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses  # integer division under Python 2
# each worker receives (batch size, start offset); .get() blocks until every batch finishes
pool.map_async(runMatch,
               itertools.izip(itertools.repeat(targetsPerBatch),
                              xrange(0, totalTargets, targetsPerBatch))).get(99999999)
pool.close()
pool.join()

The MemoryError means you're running out of system-wide virtual memory. How much virtual memory you have is an abstract thing, based on the actual physical RAM plus swapfile size plus stuff that's paged into memory from other files and stuff that isn't paged anywhere because the OS is being clever and so on.
According to your comments, each process averages 0.75GB of real memory, and 4GB of virtual memory. So, your total VM usage is 32GB.
One common reason for this is that each process might peak at 4GB, but spend almost all of its time using a lot less than that. Python rarely releases memory to the OS; it'll just get paged out.
Anyway, 6GB of real memory is no problem on an 8GB Mac or a 7GB c1.xlarge instance.
And 32GB of VM is no problem on a Mac. A typical OS X system has virtually unlimited VM size—if you actually try to use all of it, it'll start creating more swap space automatically, paging like mad, and slowing your system to a crawl and/or running out of disk space, but that isn't going to affect you in this case.
But 32GB of VM is likely to be a problem on Linux. A typical Linux system has fixed-size swap, and doesn't let you push the VM beyond what it can handle. (It has a different trick that avoids creating probably-unnecessary pages in the first place… but once you've created the pages, you have to have room for them.) I'm not sure how much swap an xlarge comes configured with, but the swapon tool will tell you how much swap you've got (and how much you're using).
Anyway, the easy solution is to create and enable an extra 32GB swapfile on your xlarge.
However, a better solution would be to reduce your VM use. Often each subprocess is doing a whole lot of setup work that creates intermediate data that's never needed again; you can use multiprocessing to push that setup into different processes that quit as soon as they're done, freeing up the VM. Or maybe you can find a way to do the processing more lazily, to avoid needing all that intermediate data in the first place.
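As a rough illustration of that idea, the sketch below farms the memory-hungry setup out to a single throwaway worker that returns only the small result and then exits, so its peak VM goes back to the OS. The names load_everything, summarize and build_index are hypothetical placeholders, not anything from the question's code, and maxtasksperchild=1 is just one way to guarantee the worker dies after its single task.
from multiprocessing import Pool

def build_index(source):
    big_intermediate = load_everything(source)   # hypothetical memory-hungry setup
    return summarize(big_intermediate)           # keep only the small derived result

def prepare(source):
    p = Pool(processes=1, maxtasksperchild=1)    # one throwaway worker
    try:
        # the child builds the index, sends back the small summary, and exits,
        # so the virtual memory it peaked at is returned to the OS
        return p.apply(build_index, (source,))
    finally:
        p.close()
        p.join()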

Related

How to prevent python script from freezing in ubuntu 16.04

I work for a digital marketing agency that has multiple clients. In one of the projects, I have a very resource-intensive Python script (which fetches data for Facebook ads) that has to be run for all those clients (say 500+ of them) on an Ubuntu 16.04 server.
Originally the script took around 2 minutes to complete for one client, with 300 MB RES and 1000 MB VM (as per htop). So I optimized it with ThreadPoolExecutor (max_workers=10) so that the script can run for roughly 4 clients concurrently.
Then I found out that sometimes the script froze during a run (basically it was in a "comatose state"). I debugged and profiled it and found that it's not the script causing the issue, it's the system.
Then I batched the script: if there are 20 input clients, I run 5 instances (4*5=20) of the script. Sometimes this went fine, but sometimes the last instance froze.
Then I found out that RAM (2G) was being overused, so I increased swap space from 0 to 1G. That did the trick. But if a few clients are heavy on memory, the same thing happens.
I have attached a screenshot of the latest run where, after running the 8 instances, the last 2 froze. They can be left for days for that matter.
I am thinking of increasing the server RAM from 2G to 4G, but I'm not sure that's a permanent solution. Has anyone faced a similar issue?
You need to fix the RAM consumption of your script.
If your script allocates more memory than your system can provide, it gets memory errors; when that happens inside thread pools or similar constructs, the threads may never return under some circumstances.
You can fix this by using async functions with timeouts and implementing automatic restart handlers, in case a process does not yield the expected result.
The best way to do that is heavily dependent on the script and will probably require altering already-written code.
The issue is definitely with your script and not with the OS.
The fastest workaround would be to increase the system memory or to reduce the number of threads.
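As a rough, non-authoritative sketch of that timeout-plus-restart pattern with concurrent.futures (fetch_client_data and client_ids are hypothetical stand-ins for the real script's work):
# Hedged sketch of the timeout + restart idea with concurrent.futures.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def process_all(client_ids, timeout=120):
    with ThreadPoolExecutor(max_workers=4) as executor:       # fewer workers = lower peak RAM
        futures = {executor.submit(fetch_client_data, cid): cid for cid in client_ids}
        results, needs_restart = {}, []
        for future, cid in futures.items():
            try:
                results[cid] = future.result(timeout=timeout)  # don't wait forever on a hung job
            except TimeoutError:
                needs_restart.append(cid)                      # hung thread: flag for a rerun
            except Exception:
                needs_restart.append(cid)                      # failed fetch: flag for a rerun
        return results, needs_restart
# A timed-out thread is not killed, so the reliable "restart handler" is to re-run
# the flagged clients in a fresh process once this one exits.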
If just adding 1GB of swap area "almost" did the trick then definitely increasing the physical memory is a good way to go. Btw remember that swapping means you're using disk storage, whose speed is measured in millisecs, while RAM speed is measured in nanosecs - so avoiding swap guarantees a performance boost.
And then, reboot your system every now and then. Although Linux is far better than Windows in this respect, memory leaks do occur in Linux too, and a reboot every few months will surely help.
As Gornoka stated, you need to reduce the memory consumption of the script. As an added detail, this can also be done by removing declared variables within the script once they have been used, with the del keyword.
This can also be done by ensuring that, if the script processes massive files, it does so line by line, saving the output as it finishes each line.
I have had this happen, and it is usually an indicator of working with too much data at once in RAM. It is always better to work on the data in parts whenever possible, and if that is not possible, get more RAM.
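A minimal sketch combining both suggestions; load_raw, build_lookup and transform are hypothetical placeholders for whatever the real script does:
def process_large_file(config_path, data_path, out_path):
    raw = load_raw(config_path)         # big intermediate, only needed for setup
    lookup = build_lookup(raw)          # keep just the small derived structure
    del raw                             # release the reference so it can be reclaimed
    with open(data_path) as infile, open(out_path, "w") as outfile:
        for line in infile:             # one line in memory at a time
            outfile.write(transform(line, lookup))  # write out as each line finishes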

The different performance of Python multiprocessing.Pool on MacOS and Linux systems

I'm a beginner in Python. I used multiprocessing.Pool in my project to improve performance.
Here's a snippet of the code where I use multiprocessing.Pool.
I build the pools when my resident server starts, and use the Pool.apply_async method every time the server gets a request:
import multiprocessing as mp
from multiprocessing import Pool

# build the pools when the resident server starts
mp.set_start_method('forkserver')
self._driver_pool = Pool(processes=10)
self._executor_pool = Pool(processes=30)

# use a pool every time the server gets a request
driver = driver_class(driver_context, init_table, self._manager, **kwargs_dict)
future = self._driver_pool.apply_async(driver.run)
I tested the code on my computer, whose operating system is macOS, and then I deployed the code on a Linux machine.
I found that when I run the code on macOS, the Pool.apply_async call takes roughly 10 ms, but the same code on Linux takes about 2 s.
I don't understand why there is such a big difference in performance. Is there something wrong with the way I use multiprocessing.Pool?
After some tests, I have a conjecture.
The current phenomenon is that when the size of the Pool is set to 30, the first 30 requests are slow, but after that the cost per task drops significantly.
On macOS, I compared performance with and without pyc files, and I found that the cost rose after I deleted the pyc files.
I suspect there are several possible reasons for the performance difference:
When using the 'forkserver' start method, a new process loads all of its resources, including imported modules, which means it will look for pyc files; if they don't exist, it compiles the Python files to pyc files and then loads them.
The processes in a Pool are reused and never released, which means that once a process has loaded the pyc files into memory, it never has to load them again.
The Mac has an SSD, so a process on the Mac that loads pyc files gets better performance than a process on a machine without an SSD.
Now the question I'm running into is whether there are ways to pre-load resources for processes started with the 'forkserver' method, to get better performance.
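For what it's worth, the standard library does provide a hook along these lines: multiprocessing.set_forkserver_preload() asks the fork server to import a list of modules once, so that workers forked from it inherit the already-imported (and already-compiled) modules. A minimal sketch, with my_heavy_module as a hypothetical module name:
import multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('forkserver')
    # the fork server imports these once; every worker forked from it starts
    # with the modules (and their compiled bytecode) already loaded
    mp.set_forkserver_preload(['my_heavy_module'])
    pool = mp.Pool(processes=30)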

How can I make use of swap space/virtual RAM in Jupyter lab/notebook?

I am running processes in Jupyter (lab) in a JupyterHub-created container running on Kubernetes.
The processes are too RAM-intensive, to the extent that the pod sometimes gets evicted due to an OOM.
Without modifications to my code/algorithms etc., how can I in the general case tell Jupyter(Lab) to use swap space/virtual memory when a predefined RAM limit is reached?
PS This question has no answer mentioning swap space - Jupyter Lab freezes the computer when out of RAM - how to prevent it?
You can't actively control swap space.
In Kubernetes specifically, you just don't supply a memory limit for the Kubernetes pod.
That would at least stop it being killed because of OOM (out of memory). However, I doubt it would work well, because it will make the whole node run out of RAM, start swapping, and become extremely slow, and thus at some point it will be declared dead by the Kubernetes master, which in turn will cause the Pod to be rescheduled somewhere else and start all over again.
A more scalable approach for you might be to use out-of-core algorithms, that can operate on disk directly (so just attach a PV/PVC to your pod), but that depends on the algorithm or process you're using.
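As a hedged illustration of what "out-of-core" can look like in Python (not tied to any particular library the notebook uses), numpy.memmap keeps the array on disk and only pages in the slices you touch, so a reduction over a huge array stays within a small RAM budget:
import numpy as np

# 'big_array.dat' and the shape are made-up placeholders for a pre-existing on-disk array
data = np.memmap('big_array.dat', dtype='float64', mode='r', shape=(100_000_000,))
chunk = 1_000_000
total = 0.0
for start in range(0, data.shape[0], chunk):
    total += float(data[start:start + chunk].sum())  # only this slice is paged into RAM
print(total)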

Multiprocessing memory leak, processes that stay around forever

I'm trying to solve a multiprocessing memory leak and am trying to fully understand where the problem is. My architecture looks like the following: a main process that delegates tasks to a few sub-processes. Right now there are only 3 sub-processes. I'm using Queues to send data to these sub-processes, and it's working just fine except for the memory leak.
It seems most of the memory-leak issues people run into involve forgetting to join/exit/terminate their processes after completion. My case is a bit different. I want these processes to stay around forever, for the entire duration of the application. So the main process will launch these 3 sub-processes, and they will never die until the entire app dies.
Do I still need to join them for any reason?
Is it a bad idea to keep processes around forever? Should I consider killing them and re-launching them at some point, despite not wanting to do that?
Should I not be using multiprocessing.Process for this use case?
I'm making a lot of API calls and generating a lot of dictionaries and arrays of data within my sub-processes. I'm assuming my memory leak comes from not properly cleaning that up. Maybe my problem is entirely there and not related to the way I'm using multiprocessing.Process?
from multiprocessing import Process

# This is how I'm creating my 3 sub-processes
procs = []
for name in names:
    proc = Process(target=print_func, args=(name,))
    procs.append(proc)
    proc.start()

# Then I want my sub-processes to live forever for the remainder of the application's life,
# but memory leaks until I run out of memory
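For comparison only, here is a minimal sketch of the long-lived-worker-plus-Queue pattern described above, under invented names (worker, task_queue and handle are illustrative, not the poster's code); keeping the processes around forever is fine, and join() only matters at shutdown:
from multiprocessing import Process, Queue

def worker(task_queue):
    while True:
        task = task_queue.get()    # blocks until the main process sends work
        if task is None:           # sentinel, only sent at shutdown
            break
        handle(task)               # hypothetical per-task work; nothing retained between loops

if __name__ == '__main__':
    task_queue = Queue()
    procs = [Process(target=worker, args=(task_queue,), daemon=True) for _ in range(3)]
    for p in procs:
        p.start()
    # ... feed task_queue for the life of the app; join() is only needed if you
    # eventually send the sentinels and want a clean shutdown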
Update 1:
I'm seeing this memory growth/leaking on macOS 10.15.5 as well as Ubuntu 16.04. It behaves the same way in both OSes. I've tried Python 3.6 and Python 3.8 and have seen the same results.
I never had this leak before going multiprocess, which is why I think it's related to multiprocessing. When I ran my code in one single process -> no leaking. Once I went multiprocess, running the same code -> leaking/bloating memory.
The data that's actually bloating is lists of data (floats & strings). I confirmed this using the Python package pympler, which is a memory profiler.
The biggest thing that changed since my multiprocess feature was added is that my data is gathered in the subprocesses and then sent to the main process using Pyzmq. So I'm wondering if there are new pointers hanging around somehow, preventing Python from garbage collecting and fully releasing these lists of floats and strings.
I do have a feature that every ~30 seconds clears "old" data that I no longer need (since my data is time-sensitive). I'm currently investigating this to see if it's working as expected.
Update 2:
I've improved the way I'm deleting old dicts and lists. It seems to have helped, but the problem still persists. The Python package pympler shows that I'm no longer leaking memory, which is great. But when I run it on the Mac, Activity Monitor shows a consistent increase in memory usage, and when I run it on Ubuntu, the free -m command also shows consistent memory bloating.
Here's what my memory looks like shortly after running the script:
ubuntu:~/Folder/$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7610        3920        2901           0         788        3438
Swap:             0           0           0
After running for a while, memory bloats according to free -m:
ubuntu:~/Folder/$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7610        7385         130           0          93          40
Swap:             0           0           0
ubuntu:~/Folder/$
It eventually crashes from using too much memory.
To test where the leak comes from, I've turned off my feature where my subprocess send data to my main processes via Pyzmq. So the subprocesses are still making API calls and collecting data, just not doing anything with it. The memory leak completely goes away when I do this. So clearly the process of sending data from my subprocesses and then handling the data on my main process is where the leak is happening. I'll continue to debug.
Update 3 POSSIBLY SOLVED:
I may have resolved the issue. Still testing more thoroughly. I did some extra memory clean-up on my dicts and lists that contained data. I also gave my EC2 instances ~20 GB of memory. My app's memory usage timeline looks like this:
Runtime after 1 minute: ~4 GB
Runtime after 2 minutes: ~5 GB
Runtime after 3 minutes: ~6 GB
Runtime after 5 minutes: ~7 GB
Runtime after 10 minutes: ~9 GB
Runtime after 6 hours: ~9 GB
Runtime after 10 hours: ~9 GB
What's odd is that slow increase. Based on how my code works, I don't understand how memory usage can slowly grow from minute 2 to minute 10; it should be at its maximum by around minute 2 or 3. Also, previously, when I was running ALL of this logic in one single process, my memory usage was pretty low. I don't recall exactly what it was, but it was much, much lower than 9 GB.
I've done some reading on Pyzmq, and it appears to use a ton of memory. I think the massive memory usage increase comes from Pyzmq, since I'm using it to send a massive amount of data between processes. I've read that Pyzmq is incredibly slow to release memory from large data messages. So it's very possible that my memory leak was not really a memory leak; it was just me using way, way more memory due to Pyzmq and multiprocessing sending data around. I could confirm this by running my code from before my recent changes on a machine with ~20 GB of memory.
Update 4 SOLVED:
My previous theory checked out. There was never a memory leak to begin with. The usage of Pyzmq with massive amounts of data dramatically increases memory usage, to the point where I had to roughly 6x the memory on my EC2 instance. So Pyzmq seems to either use a ton of memory, or be very slow at releasing memory, or both. Regardless, this has been resolved.
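If the large Pyzmq messages themselves are the cost, one hedged mitigation (not what the poster did; they added RAM instead) is to stream the data in smaller batches over a PUSH/PULL pair so neither side ever holds the whole payload at once. The endpoint, batch size and handle() below are invented for illustration:
import zmq

def producer(records, endpoint="tcp://127.0.0.1:5555", batch=1000):
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(endpoint)
    for i in range(0, len(records), batch):
        sock.send_pyobj(records[i:i + batch])   # many small messages, small buffers
    sock.send_pyobj(None)                       # sentinel: end of stream

def consumer(endpoint="tcp://127.0.0.1:5555"):
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind(endpoint)
    while True:
        piece = sock.recv_pyobj()
        if piece is None:
            break
        handle(piece)                           # hypothetical per-batch processing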
Given that you are on Linux, I'd suggest using https://github.com/vmware/chap to understand why the processes are growing.
To do that, first use ps to figure out the process IDs of each of your processes (the main process and the child processes), then use "gcore <pid>" for each process to gather a live core. Gather cores again for each process after they have grown a bit.
For each core, you can open it in chap and use the following commands:
redirect on
describe used
The result will be files named like the original cores, followed by ".describe_used".
You can compare them to see which allocations are new.
Once you have identified some interesting new allocations for a process, try using "describe incoming" repeatedly from the chap prompt until you have seen how those allocations are used.

Comparing two different processes based on the real, user and sys times

I have been through other answers on SO about real, user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times reported by two different processes achieving the same task.
I have a Python program and a NodeJS program, https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs, and both use libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall clock time, as per the other answers) is lower for NodeJS, which as per my understanding means that the entire process, from input to output, finishes much more quickly with NodeJS. But the user and sys times are very high compared to Python. Also, using the htop utility, I see that the NodeJS process has a CPU usage of about 360% during the entire run, maxing out the 4 cores. Python, on the other hand, has a CPU usage that ranges from 250% down to 120% during the run.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations, resize will thread well, with the CPU load increasing as the square of the reduction, the rotate is too simple to have much effect on runtime, and the crop is instant. Although the jpegsave is single-threaded (of course) vips runs this in a separate background thread from a write-behind buffer, so you effectively get it for free.
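For readers who want to see the shape of such a pipeline, here is a hedged pyvips sketch; the file names are placeholders and the linked repo may differ in its exact calls and parameters:
import pyvips

image = pyvips.Image.new_from_file('input.jpg', access='random')  # whole JPEG decoded up front
image = image.resize(0.5)           # threads well; cost grows with the size of the reduction
image = image.rot('d90')            # too simple to affect runtime much
image = image.crop(0, 0, 100, 100)  # essentially instant
image.write_to_file('output.jpg')   # jpegsave runs from a write-behind buffer in the background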
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.2 / 2.9, or about a 6.2x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has its own profiler, which can give more insight into the runtime behaviour. It can show you graphs of the various worker threads: how long they spend in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is using more user-space processor time and, in turn, completing the task more quickly. As stated by Luke Exton, the CPU is spending more time on "code you wrote/might look at".
The higher sys time means there is more context switching happening, which makes sense given your htop utilisation numbers. This means the scheduler (kernel process) is jumping between operating-system actions and user-space actions; this is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of implementation is a long one, and has many caveats. I would assume from the python vs Node numbers that the Python threads are longer, and in turn doing more processing inline. Another thing to note is the GIL in python. Essentially python is a single threaded application, and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (using real threads).
The Node appears to be written to be correctly threaded and to split many tasks out. The advantages of the highly threaded application will have a tipping point where you will spend MORE time trying to find a free cpu for a new thread, than actually doing the work. When this happens your python implementation might start being faster again.
The higher user+sys time means that the process had more running threads and, as you noticed from the 360% figure, used almost all the available CPU resources of your 4 cores. That means the NodeJS process is already limited by the available CPU resources and unable to process more requests. Also, any other CPU-intensive processes that you eventually run on that machine will hit your NodeJS process. The Python process, on the other hand, doesn't take all the available CPU resources and could probably scale with the number of requests.
These times are not reliable in and of themselves; they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening at the same time on that machine and can fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. Syscalls, e.g Open a file from the OS.)
Specifically, a small real time means it actually finished faster. Does that mean it did the job better? For sure, no. There could simply have been less happening on the machine at the same time, for instance.
In terms of scale, these numbers are a little irrelevant, and it depends on the architecture/bottlenecks. For instance, in terms of scale and specifically, cloud compute, it's about efficiently allocating resources and the relevant IO for each, generally (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine bottlenecks and specifics to be sure. It could for instance overwhelm your network link and then you are constrained there, before you hit compute limits. Or you might be constrained by how quickly you can write to the disk.
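To make those definitions concrete, here is a small self-contained Python illustration (nothing to do with the original benchmark): with four worker processes the combined user CPU time can exceed the wall-clock real time, which is exactly the pattern seen in the Node numbers.
import os, time
from multiprocessing import Pool

def burn(n):
    return sum(i * i for i in range(n))   # pure CPU work

if __name__ == '__main__':
    wall_start = time.perf_counter()
    pool = Pool(4)
    pool.map(burn, [2_000_000] * 8)
    pool.close()
    pool.join()                            # reap workers so their CPU time is counted below
    wall = time.perf_counter() - wall_start
    t = os.times()                         # children_* fields cover the finished workers
    print("real ~%.2fs  user ~%.2fs  sys ~%.2fs" % (
        wall, t.user + t.children_user, t.system + t.children_system))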
One potentially important aspect of this, which no one has mentioned, is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many threads as you have CPU cores. Use vips_concurrency_set() to change this.
This explains the thing that initially surprised me the most: NodeJS should (to my understanding) be pretty much single-threaded, just like Python with its GIL, it being all about asynchronous processing and all.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(that said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library)
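One hedged way to test that hypothesis from the Python side is to pin libvips to a single worker thread via the VIPS_CONCURRENCY environment variable (the same knob used in the timings above) before the library is loaded, then compare timings; the file names here are placeholders:
import os
os.environ['VIPS_CONCURRENCY'] = '1'   # must be set before pyvips/libvips is loaded

import pyvips
image = pyvips.Image.new_from_file('input.jpg', access='sequential')
image.resize(0.5).write_to_file('output.jpg')   # time this with and without the cap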
