I have been working with dask and I have a question related to clients when running a large script with high computational requirements:
from dask.distributed import Client

client = Client(n_workers=NUM_PARALLEL)
...
more code
...
client.shutdown()
I have seen some people shutting down the client in the middle of the process and then initializing it again; is that good for speed?
On the other hand, the workers are running out of memory. Do you know if it is good practice to compute() the dask dataframe several times along the way instead of computing it only once at the end, which may be beyond the performance capacity of the PC?
I have seen some people shutting down the client in the middle of the process and then initializing it again; is that good for speed?
IIUC, there is no effect on speed; if anything, there is a slight slowdown from the time it takes to spin up the scheduler/cluster again. The only slight advantage is if you are sharing resources: shutting down the cluster frees them up for other users.
On the other hand, the workers are running out of memory. Do you know if it is good practice to compute() the dask dataframe several times along the way instead of computing it only once at the end, which may be beyond the performance capacity of the PC?
This really depends on the DAG. There can be an advantage to computing at an intermediate step if it reduces the number of partitions/tasks, especially if some of the intermediate results would otherwise be recomputed multiple times.
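As a rough sketch only (the file pattern, column names, and aggregation are made-up placeholders, not anything from your script), eagerly computing a small intermediate result and continuing in plain pandas is one way to keep worker memory in check:

import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=4)            # hypothetical worker count

ddf = dd.read_csv("data/*.csv")         # hypothetical input files

# Eagerly compute a reduction: the groupby collapses many partitions into
# one small pandas object, so later steps no longer carry the whole task
# graph (or the full data) in worker memory.
daily = ddf.groupby("day")["value"].sum().compute()

# The rest of the pipeline can run in ordinary pandas on the small result.
smoothed = daily.rolling(7).mean()

client.shutdown()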
Related
I'm running on a pretty basic quad-core machine where multiprocessing.cpu_count() = 8 with something like:
from itertools import repeat
from multiprocessing import Pool

def expensive_function(list_of_values, some_param, another_param):
    do_some_python_pillow_tasks()
    do_some_ffmpeg_tasks()

if __name__ == '__main__':
    values = [
        ['a', 'b', 'c'],
        ['x', 'y', 'z'],
        # ...
        # there can be MANY items in this list, let's say 1000
    ]
    pool = Pool(processes=len(values))
    pool.starmap(
        expensive_function,
        zip(values, repeat('yada yada yada'), repeat('hello world')),
    )
    pool.close()
None of the 1,000 tasks will interfere with each other; in theory they can all be run at the same time.
Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Ultimately I want all (potentially 1000) tasks to complete as fast as possible. This may be a stupid question, but can you utilize the GPU to help speed up processing?
Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Pool creates multiple CPython processes, and processes is the number of workers to create. Creating about 1000 processes is really not a good idea, since spawning a process is expensive. I advise you to leave the parameter at its default (or to check whether using 4 processes is better in your case).
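For example, a sketch of a pool sized to the machine rather than to the task count (the function body and values here are trivial stand-ins for the Pillow/ffmpeg work in the question):

from itertools import repeat
from multiprocessing import Pool

def expensive_function(list_of_values, some_param, another_param):
    # stand-in for the Pillow/ffmpeg work in the question
    return len(list_of_values)

if __name__ == '__main__':
    values = [['a', 'b', 'c'], ['x', 'y', 'z']] * 500   # ~1000 placeholder items

    # Pool() with no argument defaults to os.cpu_count() workers (8 here);
    # the ~1000 tasks are queued and handed to workers as they become free.
    with Pool() as pool:
        results = pool.starmap(
            expensive_function,
            zip(values, repeat('yada yada yada'), repeat('hello world')),
        )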
This may be a stupid question, but can you utilize the GPU to help speed up processing?
No, you cannot use it transparently. You need to rewrite your code to use it, and this is generally pretty hard. However, ffmpeg may already be using it. If so, running these tasks in parallel will certainly not be much faster (it can actually even be slower), since the GPU is a shared resource and the multiple processes will compete for it (GPU tasks are always massively parallel in practice).
Q : " ... am I using multiprocessing to the best of it's ability ?"
A :Well, that actually does not matter here at all.Congratulations!You happened to enjoy a such seldom use-case, where the so called embarrasingly parallel process-orchestration may save most of otherwise present problems.
Incidentally, this is nothing new: exactly the same reasoning was used successfully by Peter Jackson's VFX team for the frame-by-frame video rendering, post-processing, and final laser deposition of each frame onto colour-film stock in his "Lord of the Rings" computing power-plant setup in New Zealand. Except that his factory was full of Silicon Graphics workstations (no Python reported to have been there), the workflow-orchestration principle was the same.
Python multithreading is irrelevant here, as it keeps all threads standing in a queue, each waiting its turn to acquire the one-and-only central Python GIL lock, so using it is an anti-pattern if you wish to gain processing speed here.
Python multiprocessing, used this way, is expensive here, even for as small a number as 4 or 8 worker processes (all the more so for the ~1000 suggested above), because it:

first spends non-negligible [TIME]- and [SPACE]-domain costs on spawning each new, independent Python interpreter process, copied full-scale, i.e. with all of its internal state and all of its data structures. Expect RAM/swap thrashing whenever your host's physical memory gets over-saturated with that many copies of the same things: the O/S virtual-memory manager starts orchestrating swap-ins and swap-outs concurrently with your "useful" work, and data that cannot stay in RAM is suddenly not N x 100 [ns] away from the CPU but Q x 10,000,000 [ns] away on disk. Yes, you read that correctly: many orders of magnitude slower just to re-read your "own" data that was accidentally swapped out, while the CPU is also less available for your processing because it has to perform all of the added swap I/O. Nasty, isn't it? Yet that is not all that hurts you...

next (and repeated for each of the 1,000 cases) you pay another awful penalty, CPU-wise + MEM-I/O-wise + O/S-IPC-wise, for moving the data (parameters) from the "main" Python interpreter process to the "spawned" Python interpreter process: data serialisation (at CPU + MEM-I/O add-on costs) + data moving (O/S IPC-service add-on costs; yes, data size matters again) + data deserialisation (again at CPU + MEM-I/O add-on costs), all just to make the data (parameters) somehow appear "inside" the other Python interpreter, whose GIL lock will not compete with your central one (which is fine, yet at this awfully gigantic sum of add-on costs? Not so nice-looking once we understand the details, is it?).
What can be done instead?
a) split the list (values) of independent items, as posted above, into say 4 parts (a quad-core CPU, 2 hardware threads each), and
b) let the embarrassingly parallel (independent) problem get solved in a pure-[SERIAL] fashion by 4 Python processes, each launched fully independently on its respective quarter of the list (values), as sketched below.
There will be zero add-on cost for doing so, there will be zero add-on SER/DES penalty for the 1000+ tasks' data distribution and results' collection, and there will be a reasonably distributed workload across the CPU cores (thermal throttling will appear for all of them as the CPU-core temperatures grow, so nothing but sufficient CPU cooling can save us there anyway).
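A minimal sketch of that split-and-launch idea (worker.py is a hypothetical script, not provided here, that re-reads the full work list itself and processes only its own slice, so nothing is serialised between processes):

# hypothetical driver script illustrating the "4 independent processes" idea
import subprocess
import sys

N_PARTS = 4   # quad-core machine: one pure-serial worker per core

# Each worker.py (hypothetical) re-reads the list of work items and handles
# only every N_PARTS-th item starting at its own offset, so no parameters
# have to be serialised and shipped between processes.
procs = [
    subprocess.Popen([sys.executable, "worker.py", str(i), str(N_PARTS)])
    for i in range(N_PARTS)
]
for p in procs:
    p.wait()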
One may also test whether the PIL.Image processing could get faster using OpenCV with numpy.ndarray() vectorised-processing tricks, yet that is another level of detail in boosting performance, once the gigantic overhead costs described above have been avoided.
Short of using a magic wand, there is no other magic possible with the Python interpreter here.
I'm running Python scripts that do batch data processing on fairly large AWS instances (48 or 96 vCPUs). multiprocessing.Pool() works nicely: the workers have minimal communication with the main process (they take a file path and return True/False). I/O and memory don't seem to be limiting.
I've had variable performance where sometimes the best speed comes from pool size = number of vCPU, sometimes number of vCPU/2, and sometimes vCPU*some multiple around 2-4. These are for different kinds of jobs, on different instances, so it would be hard to benchmark all of them.
Is there a rule of thumb for what size pool to use?
P.S. multiprocessing.cpu_count() returns a number that seems to be equal to the number of vCPU. If that is consistent, I'd like to pick some reasonable multiple of cpu_count and just leave it at that.
The reason for those numbers:
number of vCPU: It is reasonable, we use all the cores.
number of vCPU/2: It is also reasonable, as sometimes we have twice as many logical cores as physical cores. But logical cores won't actually speed your program up much, so we just use vCPU/2.
vCPU*some multiple around 2-4: It is reasonable for some IO-intensive tasks. For these kinds of tasks, the process is not occupying the core all the time, so we can schedule some other tasks during IO operations.
So now let's analyze your situation. I guess you are running on a server, which might be a VPS. In this case, there is no difference between logical cores and physical cores, because a vCPU is just an abstract computation resource provided by the VPS provider; you cannot really touch the underlying physical cores.
If your main process is not computation-intensive, or let's say it is just a simple controller, then you don't need to allocate a whole core for it, which means you don't need to subtract one.
Based on your situation, I would suggest the number of vCPUs. But you still need to decide based on the actual situation you encounter. The critical rule is:
Maximize resource usage (use as many cores as you can); minimize resource competition (too many processes will compete for the resources, which will slow the whole program down).
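A hedged starting point that encodes those rules of thumb (the oversubscription factor of 3 for I/O-bound work is just the 2-4x rule above, not a measured value, and the mapped task is a trivial stand-in):

import os
from multiprocessing import Pool

def pick_pool_size(io_bound=False):
    # Rule of thumb only: all vCPUs for CPU-bound work; oversubscribe by
    # roughly 3x when tasks spend much of their time waiting on I/O.
    n = os.cpu_count() or 1
    return n * 3 if io_bound else n

if __name__ == "__main__":
    with Pool(processes=pick_pool_size()) as pool:
        results = pool.map(abs, range(-8, 0))   # stand-in for the real per-file task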
There are many rules of thumb that you may follow, depending on the task, as you already figured out:
Number of physical cores
Number of logical cores
Number of physical or logical cores minus one (supposedly reserving one core for the logic and control)
To avoid counting logical cores instead of physical ones, I suggest using the psutil library:
import psutil
psutil.cpu_count(logical=False)
As for what to use in the end, for numerically intensive applications I tend to go with the number of physical cores. Bear in mind that some BLAS implementations use multithreading by default, which can badly hurt the scalability of data-parallel pipelines. Set MKL_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 (depending on your BLAS backend) as environment variables whenever doing batch processing, and you should get quasi-linear speedups w.r.t. the number of physical cores.
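For instance, a sketch of that setup (the numerical workload is a placeholder; the environment variables are the standard MKL/OpenBLAS ones and must be set before NumPy is imported):

import os

# Pin each process's BLAS to one thread so the pool, not the BLAS library,
# owns the parallelism; this must happen before NumPy is imported.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np
import psutil
from multiprocessing import Pool

def task(_):
    a = np.random.rand(200, 200)            # placeholder numerical workload
    return float(np.linalg.norm(a @ a))

if __name__ == "__main__":
    n_workers = psutil.cpu_count(logical=False)   # physical cores only
    with Pool(processes=n_workers) as pool:
        results = pool.map(task, range(n_workers * 4))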
I have a 32 cores and 64 threads CPU for executing a scientific computation task. How many processes should I create?
Note that my program is computationally intensive, involving lots of matrix computations based on NumPy. Currently I use the default Python process pool to execute this task, which creates 64 processes. Will it perform better or worse than 32 processes?
I'm not really sure that Python is suited for computationally intensive multi-threading scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for IO-bound tasks. I'm not sure whether this applies to NumPy, since the heavy part, if I recall correctly, is written in C++.
If you're looking for alternatives, you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers, you could get some performance boost, as sketched below.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply to NumPy. I still think that this is not really the best tool you can use, since effective multi-threaded programming is quite hard to get right and there are some other nuances that you can read about in the link.
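For reference, a minimal local-mode sketch (assuming pyspark is installed; the parallelism level and the squared-number workload are placeholders, not tuned values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[8]").appName("matrix-batch").getOrCreate()

# distribute 64 placeholder tasks across 8 local worker threads
rdd = spark.sparkContext.parallelize(range(64), numSlices=8)
print(rdd.map(lambda x: x * x).sum())

spark.stop()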
It's impossible to give you a definitive answer, as it will depend on your exact problem and code, but potentially also on your hardware.
Basically, the procedure for multiprocessing is to split the work into X parts, distribute them to the processes, let each process work, and then merge the results.
Now you need to know whether you can effectively split the work into 64 parts while keeping each part at roughly the same amount of work (if one part takes 90% of the time and you can't split it, it's useless to have more than 2 processes, as you will always be waiting for the first one).
If you can do that, and splitting and merging the work/results does not take too long (remember that this is supplementary work, so it takes extra time), then it can be worthwhile to use more processes.
It is also possible that you can speed up your code by using fewer processes if you spend too much time splitting/merging the work/results (sometimes the speed-up obtained by using more processes is negative).
Also remember that on some architectures the memory cache is shared among cores, which can badly affect the performance of multiprocessing.
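Given all that, the most reliable approach is usually to time a representative slice of the workload at a few pool sizes. A rough sketch (the matrix size and task count are placeholders, not tuned to your job):

import time
import numpy as np
from multiprocessing import Pool

def task(_):
    a = np.random.rand(500, 500)        # placeholder for the real matrix work
    return (a @ a).sum()

if __name__ == "__main__":
    for n in (16, 32, 48, 64):
        start = time.perf_counter()
        with Pool(processes=n) as pool:
            pool.map(task, range(256))
        print(n, "workers:", round(time.perf_counter() - start, 2), "s")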
I have an application that has 3 main functionalities which are running sequentially at the moment:
1) Loading data into memory and performing preprocessing on it.
2) Performing some computations on the data on the GPU with Theano.
3) Monitoring the state of the computations on the GPU and printing it to the screen.
These 3 functionalities are embarrassingly parallelizable with multi-threading. But in Python I perform all three sequentially, partly because in the past I had some bad luck with Python multi-threading and GIL issues.
In this case, I don't necessarily need to utilize the full capabilities of the multiple CPUs at hand. All I want is to load and preprocess the data while the computations on the GPU are running, and to monitor the state of those computations at the same time. Currently the most time-consuming computations happen in 2), so I'm bounded by the time spent in 2). Now my questions are:
* Can Python parallelize these 3 operations without creating new bottlenecks, e.g. due to GIL issues?
* Should I use multiprocessing instead of multithreading?
In a nutshell, how should I parallelize these three operations (if I should at all) in Python?
It has been some time since I last wrote multi-threaded code for the CPU (especially in Python); any guidance will be appreciated.
Edit: Typos.
The GIL is a bit of a nuisance sometimes...
A lot of it is going to revolve around how you can use the GPU. Does the API you're using allow you to set it running, then go off and do something else, occasionally polling to see whether the GPU has finished? Or maybe it can raise an event, call a callback, or something like that?
I'm sensing from your question that the answer is no... in which case I suspect your only choice (given that you're using Python) is multiprocessing. If the answer is yes, then you can start off the GPU, get on with some preprocessing and plotting in the meantime, and then check whether the GPU has finished.
I don't know much about Python or how it does multiprocessing, but I suspect that it involves serialisation and copying of data sent between processes. If the quantity of data you're processing is large (I suggest getting worried at the hundreds-of-megabytes mark, though that's just a hunch), then you may wish to consider how much time is lost serialising and copying that data. If you don't like the answers to that analysis, then you're probably out of luck as far as using Python is concerned.
You say that the most time-consuming part is the GPU processing? Presumably the other two parts are reasonably lengthy, otherwise there would be little point trying to parallelise them. For example, if the GPU were 95% of the runtime, then saving 5% by parallelising the rest would hardly seem worth it.
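If the answer is indeed no and multiprocessing is the route, a minimal sketch of the shape it could take (the three functions are trivial stand-ins, not the actual loading/Theano/monitoring code):

import time
from multiprocessing import Process, Queue

def preprocess(out_q):
    # stand-in for loading data and preprocessing it on the CPU
    for i in range(5):
        time.sleep(0.1)
        out_q.put(f"batch {i} ready")

def monitor(in_q):
    # stand-in for printing computation state to the screen
    while True:
        msg = in_q.get()
        if msg is None:
            break
        print(msg)

if __name__ == "__main__":
    q = Queue()
    loader = Process(target=preprocess, args=(q,))
    printer = Process(target=monitor, args=(q,))
    loader.start()
    printer.start()

    # the main process stays free for the GPU computation (step 2)
    time.sleep(1)   # placeholder for the Theano GPU work

    loader.join()
    q.put(None)     # tell the monitor to stop
    printer.join()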
I am currently working on a project which involves performing a lot of statistical calculations on many relatively small datasets. Some of these calculations are as simple as computing a moving average, while others involve slightly more work, like Spearman's Rho or Kendell's Tau calculations.
The datasets are essentially a series of arrays packed into a dictionary whose keys relate to a document id in MongoDB that provides further information about the subset. Each array in the dictionary has no more than 100 values. The dictionaries, however, may grow without bound; in reality, around 150 values are added to the dictionary each year.
I can use MongoDB's mapReduce to perform all of the necessary calculations. Alternatively, I can use Celery and RabbitMQ on a distributed system and perform the same calculations in Python.
My question is this: which avenue is most recommended or best-practice?
Here is some additional information:
I have not benchmarked anything yet, as I am just starting the process of building the scripts to compute the metrics for each dataset.
Using a Celery/RabbitMQ distributed queue will likely increase the number of queries made against the Mongo database.
I do not envision the memory usage of either method being a concern, unless the number of simultaneous tasks is very large. The majority of the tasks themselves are merely taking an item within a dataset, loading it, doing a calculation, and then releasing it. So even if the amount of data in a dataset is very large, not all of it will be loaded into memory at one time. Thus, the limiting factor, in my mind, comes down to the speed at which mapreduce or a queued system can perform the calculations. Additionally, it is dependent upon the number of concurrent tasks.
Thanks for your help!
It's impossible to say for certain without benchmarking, but my intuition leans toward doing the calculations in Python rather than in mapReduce. My main concern is that MongoDB's mapReduce is single-threaded: one MongoDB process can only run one JavaScript function at a time. It can, however, serve thousands of queries simultaneously, so you can take advantage of that concurrency by querying MongoDB from multiple Python processes.
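A rough sketch of that multi-process querying idea (the database and collection names, field layout, and the per-document statistic are all hypothetical; assumes pymongo is available):

from multiprocessing import Pool
from statistics import mean

from pymongo import MongoClient

def process_doc(doc_id):
    # each worker opens its own connection, since MongoClient instances
    # should not be shared across forked processes
    client = MongoClient()
    doc = client.mydb.datasets.find_one({"_id": doc_id})    # hypothetical collection
    values = doc.get("values", [])                          # hypothetical field
    return doc_id, mean(values) if values else None         # placeholder statistic

if __name__ == "__main__":
    client = MongoClient()
    ids = [d["_id"] for d in client.mydb.datasets.find({}, {"_id": 1})]
    with Pool() as pool:
        results = pool.map(process_doc, ids)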