I am using a PBS-based cluster and running IPython parallel over a set of nodes, each with either 24 or 32 cores and memory ranging from 24G to 72G; this heterogeneity is due to our cluster having history to it. In addition, I have jobs that I am sending to the IPython cluster that have varying resource requirements (cores and memory). I am looking for a way to submit jobs to the ipython cluster that know about their resource requirements and those of the available engines. I imagine there is a way to deal with this situation gracefully using IPython functionality, but I have not found it. Any suggestions as to how to proceed?
In addition to graph dependencies, which you indicate that you already get, IPython tasks can have functional dependencies. These can be arbitrary functions, like tasks themselves. A functional dependency runs before the real task, and if it returns False or raises a special parallel.UnmetDependency exception, the task will not be run on that engine, and will be retried somewhere else.
So to use this, you need a function that checks whatever metric you need. For instance, let's say we only want to run a task on your nodes with a minimum amount of memory. Here is a function that checks the total memory on the system (in bytes):
def minimum_mem(limit):
    import sys
    if sys.platform == 'darwin':  # or BSD in general?
        from subprocess import check_output
        mem = int(check_output(['sysctl', '-n', 'hw.memsize']))
    else:  # linux
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal"):
                    mem = 1024 * int(line.split()[1])
                    break
    return mem >= limit

kB = 1024.
MB = 1024 * kB
GB = 1024 * MB
so minimum_mem(4 * GB) will return True iff you have at least 4GB of memory on your system. If you want to check available memory instead of total memory, you can use the MemFree and Inactive values in /proc/meminfo to determine what is not already in use.
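For instance, a minimal sketch of such an available-memory check on Linux (minimum_free_mem is a hypothetical helper, not part of the original answer):
def minimum_free_mem(limit):
    # treat MemFree plus Inactive (both reported in kB) as "not already in use"
    free = 0
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:") or line.startswith("Inactive:"):
                free += 1024 * int(line.split()[1])
    return free >= limit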
Now you can submit tasks only to engines with sufficient RAM by applying the @parallel.depend decorator:
@parallel.depend(minimum_mem, 8 * GB)
def big_mem_task(n):
    import os, socket
    return "big", socket.gethostname(), os.getpid(), n
amr = view.map(big_mem_task, range(10))
Similarly, you can apply restrictions based on the number of CPUs (multiprocessing.cpu_count is a useful function there).
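For instance, a minimal sketch along the same lines (minimum_cpus is a hypothetical helper, not from the original answer):
def minimum_cpus(n):
    from multiprocessing import cpu_count
    return cpu_count() >= n

@parallel.depend(minimum_cpus, 16)
def many_core_task(n):
    import os, socket
    return "many cores", socket.gethostname(), os.getpid(), n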
Here is a notebook that uses these to restrict assignment of some dumb tasks.
Typically, the model is to run one IPython engine per core (not per node), but if you have specific multicore tasks, then you may want to use a smaller number (e.g. N/2 or N/4). If your tasks are really big, then you may actually want to restrict it to one engine per node. If you are running more engines per node, then you will want to be a bit careful about running high-resource tasks together. As I have written them, these checks do not take into account other tasks on the same node, so if a node has 16 GB of RAM and you have two tasks that each need 10 GB, you will need to be more careful about how you track available resources.
I'm trying to understand the logic behind Flink's slot and parallelism configuration in the .yaml config file.
The official Flink documentation states that for each core in your CPU, you should allocate 1 slot and increase the parallelism level by one accordingly.
But I suppose that this is just a recommendation. If, for example, I have a powerful CPU (e.g. the newest i7 with a high clock rate), that's different from having an old CPU with a limited clock rate. So running many more slots and a higher parallelism than my system's number of CPU cores isn't irrational.
But is there any other way than just testing different configurations to check my system's maximum capabilities with Flink?
Just for the record, I'm using Flink's Batch Python API.
It is recommended to assign each slot at least one CPU core because each operator is executed by at least 1 thread. Given that you don't execute blocking calls in your operator and the bandwidth is high enough to feed the operators constantly with new data, 1 slot per CPU core should keep your CPU busy.
If on the other hand, your operators issue blocking calls (e.g. communicating with an external DB), it sometimes might make sense to configure more slots than you have cores.
There are several interesting points in your question.
First, the slots in Flink are the processing capacity that each TaskManager brings to the cluster; they limit both the number of applications that can be executed on it and the number of operators that can run at the same time. As a rough guideline, a machine should not offer more slots than the CPU units present in it. Of course, this holds if all the tasks that run on it are CPU-intensive with little IO. If your application has operators that block heavily on IO, there is no problem in configuring more slots than the CPU cores available in your TaskManager, as @Till_Rohrmann said.
On the other hand, the default parallelism is the number of CPU cores available to your application in the Flink cluster, although you can override it manually as a parameter when you submit your application, or set it in your code. Note that a Flink cluster can run multiple applications simultaneously, and it is usually not desirable for a single application to block the entire cluster (unless that is the goal), so the default parallelism is usually less than the number of slots available in your cluster (the sum of all slots contributed by your TaskManagers).
However, an application with parallelism 4 means, roughly, that if it contains a stream such as input().Map().Reduce().Sink(), there should be 4 instances of each operator, so the total number of cores used by the application is greater than 4. But this is something that the developers of Flink should explain ;)
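For reference, the two settings discussed above live in flink-conf.yaml; a minimal sketch with illustrative values (not a recommendation):
# flink-conf.yaml
taskmanager.numberOfTaskSlots: 4   # slots offered by each TaskManager, usually <= its CPU cores
parallelism.default: 4             # parallelism used when a job does not specify its own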
I have been through other answers on SO about real, user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times reported by two different processes achieving the same task.
I have a Python program and a NodeJS program, https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs. Both use libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall clock time, as per the other answers) is lower for NodeJS, which as per my understanding means that the entire process, from input to output, finishes much quicker on NodeJS. But the user and sys times are very high compared to Python. Also, using the htop utility, I see that the NodeJS process has a CPU usage of about 360% during the entire run, maxing out the 4 cores. Python, on the other hand, has a CPU usage ranging from 250% to 120% during the entire run.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations, resize will thread well, with the CPU load increasing as the square of the reduction, the rotate is too simple to have much effect on runtime, and the crop is instant. Although the jpegsave is single-threaded (of course) vips runs this in a separate background thread from a write-behind buffer, so you effectively get it for free.
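For context, a rough pyvips sketch of that pipeline (file names and sizes are made up, and the linked repo may use a different binding or call order):
import pyvips

# random access mode is the default, so libjpeg decodes the whole image up front
image = pyvips.Image.new_from_file("in.jpg")

image = image.resize(0.5)            # threads well; cost grows with the reduction
image = image.rot("d90")             # cheap compared to the rest of the pipeline
image = image.crop(0, 0, 512, 512)   # effectively instant
image.write_to_file("out.jpg")       # jpegsave runs behind a write-behind buffer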
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.1 / 2.9, or about a 6.2x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has its own profiler which can help give more insight into the runtime behaviour. It can show you graphs of the various worker threads, how long they are spending in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is using more user-space processor time, and in turn completes the task quicker. As stated by Luke Exton, the CPU is spending more time on "Code you wrote/might look at".
The higher sys time means there is more context switching happening, which makes sense given your htop utilisation numbers. This means the scheduler (kernel process) is jumping between operating system actions and user space actions. This is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of implementation is a long one, and has many caveats. I would assume from the Python vs Node numbers that the Python threads are longer, and in turn doing more processing inline. Another thing to note is the GIL in Python. Essentially, Python is a single-threaded application, and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (it uses real threads).
The Node implementation appears to be correctly threaded, splitting many tasks out. The advantages of a highly threaded application have a tipping point where you spend MORE time trying to find a free CPU for a new thread than actually doing the work. When this happens, your Python implementation might start being faster again.
The higher user+sys time means that the process had more running threads, and as you noticed from the 360% figure, it used almost all the available CPU resources of your 4 cores. That means the NodeJS process is already limited by the available CPU resources and unable to process more requests. Also, any other CPU-intensive processes that you might run on that machine will hurt your NodeJS process. On the other hand, the Python process does not take all the available CPU resources and probably could scale with the number of requests.
So these times are not reliable in and of themselves; they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening at the same time on that machine, and can fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. syscalls, e.g. opening a file via the OS)
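As an aside, you can read the same counters for the current process from Python itself; a minimal sketch using only the standard library:
import os

# os.times() reports CPU time consumed by the current process:
# .user is time spent in user space, .system is time spent in the kernel
t = os.times()
print("user: %.2fs  sys: %.2fs" % (t.user, t.system))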
Specifically, a small real time means the process actually finished faster. Does that mean it did the job better? Not necessarily; there could simply have been less happening on the machine at the same time, for instance.
In terms of scale, these numbers are a little irrelevant on their own; it depends on the architecture and the bottlenecks. For instance, at scale, and specifically in cloud compute, it's about efficiently allocating resources and the relevant IO for each (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine the bottlenecks and the specifics to be sure. You could, for instance, overwhelm your network link and then be constrained there before you hit compute limits. Or you might be constrained by how quickly you can write to the disk.
One potentially important aspect of this which no one has mentioned is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many threads as you have CPU cores. Use vips_concurrency_set() to change this.
This explains the thing that initially surprised me the most: NodeJS should (to my understanding) be pretty much single-threaded, just like Python with its GIL, it being all about asynchronous processing and all.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(that said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library)
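If you want to rule that out, one simple check is to pin the libvips thread pool on both sides via the environment variable used above; a sketch for the Python side (assuming the pyvips binding; the Node process would be started with the same variable exported):
import os

# must be set before libvips is initialised by the binding
os.environ["VIPS_CONCURRENCY"] = "1"

import pyvips  # the binding now uses a single vips worker thread, so the two runs compare fairly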
So I wanted to use the Python driver to batch-insert documents into Couchbase. However, if the batch exceeds a few thousand documents, the Python kernel gets restarted (I'm using a notebook).
To reproduce this please try:
from couchbase.bucket import Bucket
cb = Bucket('couchbase://localhost/default')
keys_per_doc = 50
doc_count = 10000
docs = [dict(
    [
        ('very_long_feature_{}'.format(i), float(i) if i % 2 == 0 else i)
        for i in xrange(keys_per_doc)
    ] + [('id', id_)]
) for id_ in xrange(doc_count)]

def sframe_to_cb(sf, key_column, key_prefix):
    pipe = cb.pipeline()
    with pipe:
        for r in sf:
            cb.upsert(key_prefix + str(r[key_column]), r)
    return 0
p = sframe_to_cb(docs, 'id', 'test_')
The fun thing is that all docs get inserted and I suppose the interpreter dies when gathering the results on the pipeline.exit method.
I don't get any error message and the notebook console just says that it has restarted the notebook.
I'm curious what is causing this behavior and if there is a way to fix it.
Obviously I can do mini-batches (up to 3000 docs in my case) but this makes it much slower if they are processed sequentially.
I cannot use multiprocessing because I run the inserts inside of celery.
I cannot use multiple celery tasks because the serialisation of batches is too expensive and could kill our redis instance.
So the questions:
What is causing the crash with large batches and is there a way to fix it?
Assuming that nothing can go wrong with the upserts, can I make the pipeline discard results?
Is there a different way to achieve high throughput from a single process?
Additional info as requested in comments:
VMware Fusion on a Mac running an Ubuntu 14.04 LTS VM
The guest ubuntu has 4GB RAM, 12GB swap on SSD, 2 cores (4 threads)
The impression that doing mini-batches is slower comes from watching the bucket statistics (a large batch peaks at 10K TPS, smaller ones get ca. 2K TPS)
There is a large speed up if I use multiprocessing and these batches are distributed across multiple CPUs (20-30K TPS) however I cannot do this in production because of celery limitations (I cannot use a ProcessPoolExecutor inside a celery task)
I cannot really tell when exactly does the crash happen (I'm not sure if this is relevant)
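For what it's worth, here is a minimal sketch of the mini-batch workaround mentioned above, chunking the upserts so that no single pipeline has to buffer thousands of results (batch_size is a made-up knob; cb and docs are the same objects as in the repro):
def upsert_in_batches(cb, docs, key_column, key_prefix, batch_size=1000):
    # each chunk gets its own pipeline, so only batch_size results are
    # accumulated before the context manager exits and flushes them
    for start in xrange(0, len(docs), batch_size):
        with cb.pipeline():
            for r in docs[start:start + batch_size]:
                cb.upsert(key_prefix + str(r[key_column]), r)

upsert_in_batches(cb, docs, 'id', 'test_')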
I am working in Python 3.4, performing a naive search against partitioned data in memory, and am attempting to fork processes to take advantage of all available processing power. I say naive, because I am certain there are other additional things that can be done to improve performance, but those potentials are out of scope for the question at hand.
The system I am testing on is a Windows 7 x64 environment.
What I would like to achieve is a relatively even, simultaneous distribution across cpu_count() - 1 cores (reading suggests that distributing across all cores rather than n-1 cores does not show any additional improvement, due to baseline OS processes). So 75% pegged CPU usage on a 4-core machine.
What I am seeing (using the Windows Task Manager 'Performance' tab and the 'Processes' tab) is that I never achieve greater than 25% system-dedicated CPU utilization, and that the process view shows computation occurring one core at a time, switching every few seconds between the forked processes.
I haven't instrumented the code for timing, but I am pretty sure that my subjective observations are correct in that I am not gaining the performance increase I expected (3x on an i5 3320m).
I haven't tested on Linux.
Based on the code presented:
- How can I achieve 75% CPU utilization?
#pseudo code
def search_method(search_term, partition):
    <perform fuzzy search>
    return results

partitions = [<list of lists>]
search_terms = [<list of search terms>]

#real code
import multiprocessing as mp

pool = mp.Pool(processes=mp.cpu_count() - 1)

for search_term in search_terms:
    results = []
    results = [pool.apply(search_method, args=(search_term, partitions[x])) for x in range(len(partitions))]
You're actually not doing anything concurrently here, because you're using pool.apply, which will block until the task you pass to it is complete. So, for every item in partitions, you're running search_method in some process inside of pool, waiting for it to complete, and then moving on to the next item. That perfectly coincides with what you're seeing in the Windows process manager. You want pool.apply_async instead:
for search_term in search_terms:
    results = []
    results = [pool.apply_async(search_method, args=(search_term, partitions[x])) for x in range(len(partitions))]
    # Get the actual results from the AsyncResult objects returned.
    results = [r.get() for r in results]
Or better yet, use pool.map (along with functools.partial to enable passing multiple arguments to our worker function):
from functools import partial

...

for search_term in search_terms:
    func = partial(search_method, search_term)
    results = pool.map(func, partitions)
I have a simple string matching script that tests just fine for multiprocessing with up to 8 Pool workers on my local mac with 4 cores. However, the same script on an AWS c1.xlarge with 8 cores generally kills all but 2 workers, the CPU only works at 25%, and after a few rounds stops with MemoryError.
I'm not too familiar with server configuration, so I'm wondering if there are any settings to tweak?
The pool implementation looks as follows, but doesn't seem to be the issue as it works locally. There would be several thousand targets per worker, and it doesn't run past the first five or so. Happy to share more of the code if necessary.
from multiprocessing import Pool
import itertools

pool = Pool(processes=numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
pool.map_async(runMatch, itertools.izip(itertools.repeat(targetsPerBatch), xrange(0, totalTargets, targetsPerBatch))).get(99999999)
pool.close()
pool.join()
The MemoryError means you're running out of system-wide virtual memory. How much virtual memory you have is an abstract thing, based on the actual physical RAM plus swapfile size plus stuff that's paged into memory from other files and stuff that isn't paged anywhere because the OS is being clever and so on.
According to your comments, each process averages 0.75GB of real memory, and 4GB of virtual memory. So, your total VM usage is 32GB.
One common reason for this is that each process might peak at 4GB, but spend almost all of its time using a lot less than that. Python rarely releases memory to the OS; it'll just get paged out.
Anyway, 6GB of real memory is no problem on an 8GB Mac or a 7GB c1.xlarge instance.
And 32GB of VM is no problem on a Mac. A typical OS X system has virtually unlimited VM size—if you actually try to use all of it, it'll start creating more swap space automatically, paging like mad, and slowing your system to a crawl and/or running out of disk space, but that isn't going to affect you in this case.
But 32GB of VM is likely to be a problem on linux. A typical linux system has fixed-size swap, and doesn't let you push the VM beyond what it can handle. (It has a different trick that avoids creating probably-unnecessary pages in the first place… but once you've created the pages, you have to have room for them.) I'm not sure what an xlarge comes configured for, but the swapon tool will tell you how much swap you've got (and how much you're using).
Anyway, the easy solution is to create and enable an extra 32GB swapfile on your xlarge.
However, a better solution would be to reduce your VM use. Often each subprocess is doing a whole lot of setup work that creates intermediate data that's never needed again; you can use multiprocessing to push that setup into different processes that quit as soon as they're done, freeing up the VM. Or maybe you can find a way to do the processing more lazily, to avoid needing all that intermediate data in the first place.
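As a rough illustration of the "processes that quit as soon as they're done" idea, multiprocessing.Pool has a maxtasksperchild option that recycles a worker after each task, so its memory goes back to the OS (a sketch; build_and_match is a hypothetical stand-in for your per-batch work, not code from the question):
from multiprocessing import Pool

def build_and_match(batch_start):
    # build whatever large intermediate data the matching needs, use it,
    # and return only the small result; the worker process then exits,
    # so the intermediate memory is released back to the OS
    intermediate = list(range(batch_start, batch_start + 1000000))
    return len(intermediate)

if __name__ == '__main__':
    # maxtasksperchild=1: each worker handles one task and is then replaced
    pool = Pool(processes=8, maxtasksperchild=1)
    results = pool.map(build_and_match, range(0, 8000000, 1000000))
    pool.close()
    pool.join()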