Flink Slots/Parallelism vs Max CPU capabilities - python

I'm trying to understand the logic behind Flink's slots and parallelism configurations in the .yaml configuration file.
The official Flink documentation states that for each CPU core you should allocate one slot and increase the parallelism level by one accordingly.
But I suppose this is just a recommendation. If, for example, I have a powerful CPU (e.g. the newest i7 with a high clock speed), that is different from having an old CPU with a low clock speed. So running many more slots and a higher parallelism than my system's number of cores isn't irrational.
But is there any other way than just testing different configurations, to check my system's max capabilities with flink?
Just for the record, I'm using Flink's Batch Python API.

It is recommended to assign each slot at least one CPU core because each operator is executed by at least 1 thread. Given that you don't execute blocking calls in your operator and the bandwidth is high enough to feed the operators constantly with new data, 1 slot per CPU core should keep your CPU busy.
If, on the other hand, your operators issue blocking calls (e.g. communicating with an external DB), it can sometimes make sense to configure more slots than you have cores.
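For reference, both knobs live in conf/flink-conf.yaml. A minimal sketch for a 4-core machine (the values are only an illustrative starting point, not a recommendation beyond the rule of thumb above):

taskmanager.numberOfTaskSlots: 4
parallelism.default: 4

The first key sets how many slots each TaskManager offers (typically one per core); the second sets the parallelism used when a job does not specify its own.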

There are several interesting points in your question.
First, the slots in Flink are the processing capacity that each TaskManager brings to the cluster; they limit both the number of applications that can be executed on it and the number of operators that can run at the same time. As a rule of thumb, a machine should not offer more processing slots than the CPU units present in it. Of course, this holds if all the tasks that run on it are CPU-intensive and do little I/O. If your application has operators that block heavily on I/O, there is no problem in configuring more slots than the CPU cores available in your TaskManager, as @Till_Rohrmann said.
On the other hand, the default parallelism is the number of CPU cores available to your application in the Flink cluster, although you can override it manually with a parameter when you run your application or set it in your code. Note that a Flink cluster can run multiple applications simultaneously, and it is usually not desirable for a single application to block the entire cluster (unless that is the goal), so the default parallelism is usually lower than the number of slots available in your cluster (the sum of all slots contributed by your TaskManagers).
However, an application with parallelism 4 tentatively means that if it contains a stream such as input().Map().Reduce().Sink(), there should be 4 instances of each operator, so the total number of cores used by the application is greater than 4. But this is something the Flink developers should explain ;)
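As a small illustration of setting the parallelism per job rather than relying on the default, the CLI accepts a -p flag (shown generically here; the exact submission command for the Batch Python API may differ):

$ ./bin/flink run -p 4 <your-job>

With -p 4, each operator of that job tentatively gets 4 parallel instances, provided enough free slots are available in the cluster.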

Related

Can a serverless architecture support high memory needs?

The challenge is to run a set of data processing and data science scripts that consume more memory than expected.
Here are my requirements:
Running 10-15 Python 3.5 scripts via Cron Scheduler
These 10-15 different scripts each take somewhere between 10 seconds and 20 minutes to complete
They run at different hours of the day; some of them run every 10 minutes while some run once a day
Each script logs what it has done so that I can take a look at it later if something goes wrong
Some of the scripts send e-mails to me and to my teammates
None of the scripts has an HTTP/web server component; they all run on Cron schedules and are not user-facing
All the scripts' code comes from my GitHub repository; when a script wakes up, it first does a git pull origin master and then starts executing. That means pushing to master causes all scripts to run the latest version.
Here is what I currently have:
Currently I am using 3 Digital Ocean servers (droplets) for these scripts
Some of the scripts require a huge amount of memory (I get segmentation fault in droplets with less than 4GB of memory)
I am willing to introduce a new script that might require even more memory (the new script currently faults in a 4GB droplet)
The setup of the droplets is relatively easy (thanks to Python venv) but not to the point of executing a single command to spin up a new droplet and set it up
Having a fully dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive.
What would be a more efficient way to handle this?
What would be a more efficient way to handle this?
I'll answer in three parts:
Options to reduce memory consumption
Minimalistic architecture for serverless computing
How to get there
(I) Reducing Memory Consumption
Some handle large loads of data
If you find the scripts use more memory than you expect, the only way to reduce the memory requirements is to
understand which parts of the scripts drive memory consumption
refactor the scripts to use less memory
Typical issues that drive memory consumption are:
using the wrong data structure - e.g. if you have numerical data, it is usually better to load the data into a numpy array as opposed to a plain Python list. If you create a lot of objects of custom classes, it can help to use __slots__
loading too much data into memory at once - e.g. if the processing can be split into several parts independent of each other, it may be more efficient to only load as much data as one part needs, then use a loop to process all the parts.
holding object references that are no longer needed - e.g. in the course of processing you create objects to represent or process some part of the data. If the script keeps a reference to such an object, it won't get released until the end of the program. One way around this is to use weak references, another is to use del to dereference objects explicitly. Sometimes it also helps to call the garbage collector.
using an offline algorithm when there is an online version (for machine learning) - e.g. some of scikit-learn's algorithms provide a version for incremental learning, such as LinearRegression => SGDRegressor or LogisticRegression => SGDClassifier (a short sketch combining the last two ideas follows this list)
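Here is that sketch: chunked loading combined with incremental learning. The file and column names are made up for illustration, and pandas/scikit-learn are assumed to be installed:

import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()

# read_csv with chunksize yields DataFrames of at most 100000 rows,
# so only one chunk has to fit into memory at any time
for chunk in pd.read_csv("measurements.csv", chunksize=100000):
    X = chunk[["feature_a", "feature_b"]].values
    y = chunk["target"].values
    model.partial_fit(X, y)  # incremental fit instead of one big fit() over all the data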
some do minor data science tasks
Some algorithms do require large amounts of memory. If using an online algorithm for incremental learning is not an option, the next best strategy is to use a service that only charges for the actual computation time/memory usage. That's what is typically referred to as serverless computing - you don't need to manage the servers (droplets) yourself.
The good news is that in principle the provider you use, Digital Ocean, provides a model that only charges for resources actually used. However this is not really serverless: it is still your task to create, start, stop and delete the droplets to actually benefit. Unless this process is fully automated, the fun factor is a bit low ;-)
(II) Minimalistic Architecture for Serverless Computing
a fully dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive
Since your scripts run only occasionally / on a schedule, your droplet does not need to run or even exist all the time. So you could set this up in the following way:
Create a scheduling droplet. This can be of a small size. Its only purpose is to run a scheduler and to create a new droplet when a script is due, then submit the task for execution on this new worker droplet. The worker droplet can be of the specific size needed to accommodate the script, i.e. every script can have a droplet of whatever size it requires.
Create a generic worker. This is the program that runs upon creation of a new droplet by the scheduler. It receives as input the URL of the git repository where the actual script to be run is stored, and a location to store results. It then checks out the code from the repository, runs the script and stores the results.
Once the script has finished, the scheduler deletes the worker droplet.
With this approach there are still fully dedicated droplets for each script, but they only cost money while the script runs.
(III) How to get there
One option is to build an architecture as described above, which would essentially be an implementation of a minimalistic architecture for serverless computing. There are several Python libraries to interact with the Digital Ocean API. You could also use libcloud as a generic multi-provider cloud API to make it easy(ier) to switch providers later on.
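As a rough sketch of what the scheduler side could look like with the python-digitalocean client (the token handling, slugs, repository URL and completion check below are all illustrative assumptions, not part of your setup):

import time
import digitalocean

TOKEN = "your-api-token"  # hypothetical; in practice read it from a secret store

# cloud-init script executed on first boot: fetch the code and run the worker
USER_DATA = """#!/bin/bash
git clone https://github.com/you/your-scripts.git /opt/job
python3 /opt/job/run_worker.py
"""

def run_job_on_worker(size_slug="s-4vcpu-8gb"):
    worker = digitalocean.Droplet(
        token=TOKEN,
        name="worker-heavy-script",
        region="nyc3",               # region and image slugs are placeholders
        image="ubuntu-20-04-x64",
        size_slug=size_slug,         # pick a size that fits the script's memory needs
        user_data=USER_DATA,
    )
    worker.create()                  # the droplet only exists (and costs money) from here on
    time.sleep(20 * 60)              # placeholder: replace with a real completion check
    worker.destroy()                 # tear the worker down once the script is done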
Perhaps the better alternative before building it yourself is to evaluate one of the existing open-source serverless options. An extensive curated list is provided by the good fellows at awesome-serverless. Note that at the time of writing, many of the open-source projects are still in their early stages; the more mature options are commercial.
As always with engineering decisions, there is a trade-off between the time/cost required to build or host yourself vs. the cost of using a readily available commercial service. Ultimately that's a decision only you can take.

Limit CPU usage of a Python Bottle webserver

I'm using Python + Bottle as a webserver.
As I use the production server for many other websites, I don't want Python + Bottle to eat 70% of the CPU for example.
How is it possible to limit the CPU usage of a Python Bottle webserver?
I was thinking about using resource.setrlimit, but is this a good way to do it?
With which syntax should we use resource.setrlimit to set the limit to 20% of the CPU for example?
Step 1
You should ask yourself whether resource optimization is really necessary. If you're certain that the specific application consumes more resources than it could or should, then go to step 2.
Step 2
When your application consumes too many resources, the first thing you should do is try to identify the bottlenecks in it and see if they can be optimized away. Python has various tools that can help you (code profilers, PyPy, etc.). If there's nothing you can do in that regard, go to step 3.
Step 3
If you absolutely must limit process resources, keep in mind that:
the OS has very sophisticated scheduling mechanisms that do their best to ensure each running process gets a fair share of CPU time. Even when the processor is overloaded, things will still run fine (up to some point).
if you deliberately limit the CPU time of one of your processes, it may respond slowly or not at all, for example because requests run into network timeouts.
My answer to this question is: reduce the static priority of your server if you think it may starve other services, but then your server may itself suffer from starvation when the processors are overloaded. Using the nice utility would be my choice, as it won't litter your code.
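A minimal sketch of that approach (the niceness value and script name are arbitrary):

$ nice -n 10 python bottle_server.py

or, from inside the server process itself (Unix only):

import os
os.nice(10)  # raise the niceness, i.e. lower the scheduling priority, of this process

With a lower priority the server still uses idle CPU freely, but yields to the other services when the machine is busy.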

Comparing two different processes based on the real, user and sys times

I have been through other answers on SO about real, user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times reported by two different processes achieving the same task.
I have a Python program and a NodeJS program, https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs, and both use libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall clock time, as per other answers) is lower for NodeJS, which as I understand it means that the entire process, from input to output, finishes much quicker with NodeJS. But the user and sys times are very high compared to Python. Also, using the htop utility, I see that the NodeJS process has a CPU usage of about 360% during the entire run, maxing out the 4 cores. Python, on the other hand, has a CPU usage ranging from 250% to 120% during the run.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations, resize will thread well, with the CPU load increasing as the square of the reduction, the rotate is too simple to have much effect on runtime, and the crop is instant. Although the jpegsave is single-threaded (of course) vips runs this in a separate background thread from a write-behind buffer, so you effectively get it for free.
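For context, a pipeline of that shape looks roughly like this in pyvips (not the OP's code, just a sketch with made-up file names and sizes):

import pyvips

image = pyvips.Image.new_from_file("in.jpg")  # random access: the whole JPEG is decoded up front
image = image.resize(0.5)                     # threads well; load grows with the reduction
image = image.rot("d90")                      # too simple to affect runtime much
image = image.crop(0, 0, image.width // 2, image.height // 2)  # effectively instant
image.write_to_file("out.jpg")                # jpegsave, written from a background thread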
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.1 / 2.97, or a 6.1x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has its own profiler which can help give more insight into the runtime behaviour. It can show you graphs of the various worker threads, how long they spend in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is using more user-space processor time and in turn completes the task quicker. As stated by Luke Exton, the CPU is spending more time on "Code you wrote/might look at".
The higher sys time means there is more context switching happening, which makes sense given your htop utilisation numbers. This means the scheduler (a kernel process) is jumping between operating-system actions and user-space actions. This is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of implementation is a long one and has many caveats. I would assume from the Python vs Node numbers that the Python threads are longer-running and in turn doing more processing inline. Another thing to note is the GIL in Python: essentially, Python bytecode executes on only one thread at a time, and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (using real threads).
The Node implementation appears to be written to be correctly threaded and to split many tasks out. The advantages of a highly threaded application have a tipping point where you spend MORE time trying to find a free CPU for a new thread than actually doing the work. When this happens, your Python implementation might start being faster again.
The higher user+sys time means that the process had more running threads and, as you noticed from the 360% figure, used almost all the available CPU resources of your 4 cores. That means the NodeJS process is already limited by the available CPU resources and unable to process more requests. Also, any other CPU-intensive processes that you eventually run on that machine will hurt your NodeJS process. On the other hand, the Python process does not take all the available CPU resources and could probably scale with the number of requests.
So these times are not reliable in and of themselves, they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening at the same time on that machine and could fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. Syscalls, e.g Open a file from the OS.)
Specifically, a small real time means it actually finished faster. Does that mean it did it better? Not necessarily; there could, for instance, have been less happening on the machine at the same time.
In terms of scale, these numbers are a little irrelevant, and it depends on the architecture/bottlenecks. For instance, in terms of scale and specifically, cloud compute, it's about efficiently allocating resources and the relevant IO for each, generally (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine bottlenecks and specifics to be sure. It could for instance overwhelm your network link and then you are constrained there, before you hit compute limits. Or you might be constrained by how quickly you can write to the disk.
One potentially important aspect of this, which no one has mentioned, is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many threads as you have CPU cores. Use vips_concurrency_set() to change this.
This explains the thing that initially surprised me the most: NodeJS should (to my understanding) be pretty much single-threaded, just like Python with its GIL, given that it is all about asynchronous processing.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(that said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library)
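One cheap way to investigate, following the VIPS_CONCURRENCY experiment shown earlier: pin the threadpool size from Python before the binding is loaded and compare the two runtimes on equal footing (a sketch, assuming the pyvips binding):

import os
os.environ["VIPS_CONCURRENCY"] = "1"  # set before the binding is imported so libvips picks it up
import pyvips                          # vips threadpool size is now 1 (the write-behind thread still exists)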

Python commands to check hardware capabilities

I use MATLAB, and for the purposes of multiprocessing applications I know there are several commands in MATLAB that let you see the hardware capabilities of the machine you are operating on, such as getenv('NUMBER_OF_PROCESSORS'), which returns the number of processors, and others which can return the maximum number of computational threads.
Is there anything like this available in Python? The reason I ask is that I have a Python program with a component that is executed in parallel, and when the program has been deployed on machines with less computing power it has tended to crash or freeze those weaker machines.
So I am looking for a method to check the capabilities of the machine programmatically and then scale down (or up) the number of workers (parallel operations) in order to fit the capabilities of that machine.
On Unix, the relevant files under /proc and /dev contain information about the system; you can read such files programmatically:
>>> with open("/proc/cpuinfo") as file:
data = file.read().split("\n")
cores = data[12].split()[2]
print(cores)
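A more portable route, which also works on Windows, is to ask the standard library directly; psutil (third-party) adds the memory figure, which matters when deciding how many workers a weaker machine can handle:

import multiprocessing

cores = multiprocessing.cpu_count()  # logical CPUs; os.cpu_count() is equivalent on Python 3.4+
workers = max(1, cores - 1)          # leave one core free for the rest of the system
pool = multiprocessing.Pool(processes=workers)

# optional, third-party:
# import psutil
# available_gb = psutil.virtual_memory().available / 2**30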

DOS Batch multicore affinity not working

I have a batch file that launches a few executables (.exe) and Python scripts (.py) to process some data.
With start /affinity X mybatch.bat
it will work as it should only if X equals 0, 2, 4 or 8 (the individual cores)
But if I use a multicore X like 15 or F or 0xF (meaning, in my opinion, all 4 cores), it will still run only on the first core.
Does it have to do with the fact that the batch calls .exe files whose affinity maybe cannot be controlled this way?
OS:Windows 7 64bit
This is more of an answer to a question that arose in comments, but I hope it might help. I have to add it as an answer only because it grew too large for the comment limits:
There seems to be a misconception about two things here: what "processor affinity" actually means, and how the Windows scheduler actually works.
What SetProcessAffinityMask(...) means is "which processors this process (i.e. all the threads within the process) can run on,"
whereas
SetThreadAffinityMask(...) is distinctly thread-specific.
The Windows scheduler (at the most base level) makes absolutely no distinction between threads and processes - a "process" is simply a container that contains one or more threads. IOW (and over-simplified) - there is no such thing as a process to the scheduler, "threads" are schedulable things: processes have nothing to do with this ("processes" are more life-cycle-management issues about open handles, resources, etc.)
If you have a single-threaded process, it does not matter much what you set the "process" affinity mask to: that one thread will be scheduled by the scheduler (for whatever masked processors) according to 1) which processor it was last bound to - ideal case, less overhead, 2) whichever processor is next available for a given runnable thread of the same priority (more complicated than this, but the general idea), and 3) possibly transient issues about priority inversion, waitable objects, kernel APC events, etc.
So to answer your question (much more long-windedly than expected):
"But if I will use a multicore X like 15 or F or 0xF (meaning in my opinion all 4 cores) it will still run only on the first core"
What I said earlier about the scheduler attempting to use the most-recently-used processor is important here: if you have a (or an essentially) single-threaded process, the scheduling algorithm goes for the most-optimistic approach: previously-bound CPU for the switchback (likely cheaper for CPU/main memory cache, prior branch-prediction eval, etc). This explains why you'll see an app (regardless of process-level affinity) with only one (again, caveats apply here) thread seemingly "stuck" to one CPU/core.
So:
What you are effectively doing with the "/affinity X" switch is:
1) constraining the scheduler to schedule your threads only on a subset of CPU cores (i.e. not all of them),
2) limiting them to a subset of what the scheduler kernel considers "available for the next runnable-thread switch-to", and
3) if the apps are not multithreaded (and capable of taking advantage of that), "more cores" does not really help anything - you might just be bouncing that single thread of execution around to different cores (although the scheduler tries to minimize this, as described above).
That is why your threads are "sticky" - you are telling the scheduler to make them so.
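Since the batch also launches .py scripts, one workaround sketch (not from this thread) is to set the mask from inside those Python processes with the third-party psutil package, which wraps the process affinity mask on Windows:

import psutil

p = psutil.Process()          # the current process, i.e. one of the .py steps in the batch
print(p.cpu_affinity())       # e.g. [0] if the parent constrained it
p.cpu_affinity([0, 1, 2, 3])  # allow all four cores again

Keep in mind, as explained above, that a single-threaded script will still occupy only one core at a time regardless of the mask.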
AFAIK /AFFINITY takes a hex value as its parameter. Here's more information on this topic: Set affinity with start /AFFINITY command on Windows 7
EDIT: 0xF should be the correct parameter for you, as you want to use all 4 CPUs, which requires the bit mask 00001111, i.e. F in hex.
