I'm new to using Dask but have experienced painfully slow performance when attempting to re-write native sklearn functions in Dask. I've simplified the use case as much as possible in the hope of getting some help.
Using standard sklearn/numpy/pandas etc I have the following:
df = pd.read_csv(location, index_col=False) # A ~75MB CSV
# Build feature list and dependent variables, code irrelevant
from sklearn import linear_model
model = linear_model.Lasso(alpha=0.1, normalize=False, max_iter=100, tol=Tol)
model.fit(features.values, dependent)
print(model.coef_)
print(model.intercept_)
This takes a few seconds to compute. I then have the following in Dask:
# Read in CSV and prepare params like before but using dask arrays/dataframes instead
with joblib.parallel_backend('dask'):
    from dask_glm.estimators import LinearRegression
    # Coerce data
    X = self.features.to_dask_array(lengths=True)
    y = self.dependents
    # Build regression
    lr = LinearRegression(fit_intercept=True, solver='admm', tol=self.tolerance, regularizer='l1', max_iter=100, lamduh=0.1)
    lr.fit(X, y)
    print(lr.coef_)
    print(lr.intercept_)
This takes ages to compute (about 30 minutes). I only have one Dask worker in my development cluster, but it has 16 GB of RAM and unbounded CPU.
Has anyone any idea why this is so slow?
Hopefully my code omissions aren't significant!
NB: This is the simplest use case (before people ask why even use Dask) - it was a proof-of-concept exercise to check that things would function as expected.
A quote from the documentation you may want to consider:
For large arguments that are used by multiple tasks, it may be more efficient to pre-scatter the data to every worker, rather than serializing it once for every task. This can be done using the scatter keyword argument, which takes an iterable of objects to send to each worker.
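As a rough illustration of that scatter keyword (a hedged sketch, not your exact code: the data and estimator are stand-ins, and it assumes a dask.distributed scheduler can be started locally):
import joblib
import numpy as np
from dask.distributed import Client
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

client = Client()                               # assumes a local cluster is fine for the test
X = np.random.rand(10000, 50)                   # stand-ins for your features/dependent
y = np.random.rand(10000)

search = GridSearchCV(Lasso(max_iter=100), {'alpha': [0.01, 0.1, 1.0]}, n_jobs=-1)

# Pre-scatter the large arrays so each worker receives them once,
# rather than serializing them again for every joblib task.
with joblib.parallel_backend('dask', scatter=[X, y]):
    search.fit(X, y)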
But in general, Dask has a lot of diagnostics available to you, especially the scheduler's dashboard, to help figure out what your workers are doing and how time is being spent - you would do well to investigate it. Other system-wide factors are also very important, as with any computation: how close are you coming to your memory capacity, for instance?
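For example, if you are using dask.distributed, the dashboard address is available straight from the client (a minimal sketch; the scheduler address is a placeholder):
from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')   # placeholder address; Client() starts a local cluster instead
print(client.dashboard_link)                      # open this URL to watch task progress, memory use and transfers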
In general, though, Dask is not magic, and when the data fits comfortably into memory anyway, there will certainly be cases where Dask adds significant overhead. Read the documentation carefully on the intended use of the method you are considering - is it supposed to speed things up, or merely allow you to process more data than would normally fit on your system?
Related
I'm struggling to understand when and when not to use compute() in Dask dataframes. I usually write my code by adding/removing compute() until the code works, but that's extremely error-prone. How should I use compute() in Dask? Does it differ in Dask Distributed?
The core idea of delayed computations is to delay the actual calculation until the final target is known (a minimal sketch follows the list below). This allows:
increased speed of coding (e.g. as a data scientist, I don't need to wait for every transformation step to complete before designing the workflow),
distribution of work across multiple workers,
overcoming resource constraints of my client, e.g. if I am using a laptop with limited memory, I can run heavy computations on dask workers that are in the cloud or another machine with more resources,
better efficiency if the final target requires only some tasks to be done (e.g. if the final calculation requires only a subset of the dataframe, then dask will load only the relevant columns/partitions).
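A minimal sketch of that last point, assuming a directory of CSV files and made-up column names ('month', 'sales'):
import dask.dataframe as dd

# Nothing is read or computed yet; dask only records the task graph.
df = dd.read_csv('data/*.csv')
monthly = df.groupby('month')['sales'].mean()

# Only now is the graph executed, and only the columns/partitions
# needed for this result are actually loaded.
result = monthly.compute()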
Some of the alternatives to calling .compute are:
.visualize(): this helps visualize the task graph. The DAG can become hairy when there are lots of tasks, so this is useful to run on smaller subsets of the data (e.g. only loading two/three partitions of the dataframe)
using client.submit: this launches computations right away, providing you with a future, an object that refers to the result of a task being computed. This gives the advantage of scaling work across multiple workers, but it can be a bit more resource-intensive (since dask doesn't know the full workflow, it might run computations that are not needed to achieve the final target).
With regard to distributed, I don't think there is a difference except for where the result ends up: dask.compute will put the result on the local machine, while client.compute will keep the result on a remote worker.
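A small sketch of that difference (the file pattern and column name are placeholders):
import dask.dataframe as dd
from dask.distributed import Client

client = Client()                      # local cluster for illustration
df = dd.read_csv('data/*.csv')
total = df['amount'].sum()

local_result = total.compute()         # graph runs, final value is pulled to the local machine
future = client.compute(total)         # graph runs on the cluster, result stays on a worker
print(future.result())                 # fetched only when you explicitly ask for it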
Each partition in a Dask DataFrame is a Pandas DataFrame.
compute() combines all the partitions (Pandas DataFrames) into a single Pandas DataFrame.
Dask is fast because it can perform computations on partitions in parallel. Pandas can be slower because it is limited to working on the data as one piece (effectively a single partition).
You should avoid calling compute() whenever possible. It's better to have the data spread across multiple partitions, so it can be processed in parallel.
In rare cases you can compute to Pandas (e.g. when doing a large to small join or after a huge filtering operation), but it's best to learn how to use the Dask API to run computations in parallel.
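A short example of what that looks like in practice (toy data; a simple filter standing in for the "huge filtering operation"):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(100), 'y': range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)   # each partition is a pandas DataFrame

filtered = ddf[ddf.x > 90]                 # runs per-partition, in parallel where possible

small_pdf = filtered.compute()             # concatenates the surviving partitions into one pandas DataFrame
print(type(small_pdf))                     # <class 'pandas.core.frame.DataFrame'>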
I find that multiprocessing.Pool() doesn't behave as expected in my case below. Could anyone explain why it behaves this way, and how to improve the performance if possible? Here is a simplified version of the code:
import numpy as np
import multiprocessing
from itertools import repeat
def group_data_by_runID(args):
    data, runID = args
    return data[data[:,0].astype(int)==runID,:]
%%time
DATA = np.array([[0,1],[0,2],[0,3],[0,4],[1,5],[1,6],[1,7],[1,8],[2,9],[2,10],[2,11],[2,12]])
runIDs = [0,1,2]*10000000
pool = multiprocessing.Pool(40)
list(pool.map(group_data_by_runID, zip(repeat(DATA), runIDs)))
As you can see in the code above, I intended to use 40 cores (there are 56 cores and far more than enough memory available on this system), yet it took 1min 31s. Then I used:
list(map(group_data_by_runID, zip(repeat(DATA), runIDs)))
It took 2min 33s. So using 40 cores gives less than a twofold speedup, which is very weird to me. I also notice that even when I request 40 cores, it sometimes doesn't actually run on all 40 of them, as can be seen in htop.
Where did I go wrong? And how can I improve the speed? Please note that the actual data is much larger.
Maybe there are still many people like me who are confused by the performance of multiprocessing in Python: sometimes you achieve a performance gain, and sometimes you even get worse performance. So I decided to answer this question myself, based on my own experience with multiprocessing.
There may be an overhead to using multiprocessing if your input data is large, because the data has to be copied and sent across the wire to the different processes, as juanpa commented above. This overhead can be very significant. However, we can still get a large performance gain by chopping the input data into chunks and letting each process handle one chunk, as sketched below.
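A rough sketch of that chunking idea applied to the question's setup (same toy DATA; the number of workers and chunks is arbitrary). Each task now processes a whole slice of runIDs, and DATA is a module-level global so it is not re-sent with every call:
import numpy as np
import multiprocessing

DATA = np.array([[0,1],[0,2],[0,3],[0,4],[1,5],[1,6],[1,7],[1,8],[2,9],[2,10],[2,11],[2,12]])

def group_chunk(id_chunk):
    # One task per chunk: the per-call overhead is amortised over many IDs.
    return [DATA[DATA[:,0].astype(int) == rid, :] for rid in id_chunk]

if __name__ == '__main__':
    runIDs = [0, 1, 2] * 10000000
    n_workers = 8
    chunks = np.array_split(np.array(runIDs), n_workers * 4)
    with multiprocessing.Pool(n_workers) as pool:
        results = pool.map(group_chunk, chunks)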
Another scenario where a significant performance gain can be achieved is when there is little or no input data to send, such as when reading data from tens or hundreds of files.
Although multiprocessing can boost the speed, most of the effort should still go into the algorithm itself, which ultimately determines the efficiency of the code.
I'm trying to analyze text, but my Mac's RAM is only 8 GB, and the RidgeRegressor just stops after a while with Killed: 9. I reckon this is because it would need more memory.
Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?
You will need to do it manually.
There are probably two different core-problems here:
A: holding your training-data
B: training the regressor
For A, you can try numpy's memmap which abstracts swapping away.
As an alternative, consider preparing your data to HDF5 or some DB. For HDF5, you can use h5py or pytables, both allowing numpy-like usage.
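A minimal memmap sketch for A (the file name, dtype and shape are placeholders):
import numpy as np

# Disk-backed array; the OS pages pieces in and out as needed.
X = np.memmap('features.dat', dtype='float32', mode='w+', shape=(1_000_000, 100))

# Fill it block by block so the full matrix is never resident in RAM.
for start in range(0, X.shape[0], 10_000):
    X[start:start + 10_000] = np.random.rand(10_000, 100).astype('float32')
X.flush()

# Later: reopen read-only and slice it like a normal ndarray.
X_ro = np.memmap('features.dat', dtype='float32', mode='r', shape=(1_000_000, 100))
batch = X_ro[:10_000]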
For B: it's a good idea to use some out-of-core ready algorithm. In scikit-learn those are the ones supporting partial_fit.
Keep in mind, that this training-process decomposes into at least two new elements:
Memory efficiency
Swapping is slow; you don't want to use something which holds N^2 aux-memory during learning
Efficient convergence
Those algorithms in the link above should be okay for both.
SGDRegressor can be parameterized to resemble RidgeRegression.
Also: you might need to call partial_fit manually, obeying the rules of the algorithm (often some kind of random ordering is needed for the convergence proofs). The problem with abstracting away swapping is: if your regressor does a permutation in each epoch without knowing how costly that is, you might be in trouble!
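A hedged sketch of that out-of-core route for B: SGDRegressor with an l2 penalty as a stand-in for Ridge, fed shuffled mini-batches via partial_fit. It reopens the memmapped features from the sketch above plus a hypothetical target file; the batch size and hyperparameters are made up. Note that only the indices are permuted, which is cheap - the data on disk is never reordered:
import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.memmap('features.dat', dtype='float32', mode='r', shape=(1_000_000, 100))
y = np.memmap('targets.dat', dtype='float32', mode='r', shape=(1_000_000,))   # hypothetical target file

model = SGDRegressor(penalty='l2', alpha=1e-4)

n_epochs, batch_size = 5, 10_000
for epoch in range(n_epochs):
    order = np.random.permutation(X.shape[0])       # shuffle indices each epoch, as the convergence arguments assume
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        model.partial_fit(X[idx], y[idx])           # only one mini-batch is pulled into RAM at a time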
Because the problem itself is quite hard, there are some special libraries built for this, while sklearn needs some more manual work as explained. One of the most extreme ones (a lot of crazy tricks) might be vowpal_wabbit (where IO is often the bottleneck!). Of course there are other popular libs like pyspark, serving a slightly different purpose (distributed computing).
I would like some help understanding exactly what I have done/ why my code isn't running as I would expect.
I have started to use joblib to try and speed up my code by running a (large) loop in parallel.
I am using it like so:
import numpy as np
from joblib import Parallel, delayed
def frame(indeces, image_pad, m):
    XY_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1]:indeces[1]+m, indeces[2]])
    XZ_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1], indeces[2]:indeces[2]+m])
    YZ_Patches = np.float32(image_pad[indeces[0], indeces[1]:indeces[1]+m, indeces[2]:indeces[2]+m])
    return XY_Patches, XZ_Patches, YZ_Patches

def Patch_triplanar_para(image_path, patch_size):
    Image, Label, indeces = Sampling(image_path)
    n = (patch_size - 1)/2
    m = patch_size
    image_pad = np.pad(Image, pad_width=n, mode='constant', constant_values=0)
    A = Parallel(n_jobs=1)(delayed(frame)(i, image_pad, m) for i in indeces)
    A = np.array(A)
    Label = np.float32(Label.reshape(len(Label), 1))
    R, T, Y = np.hsplit(A, 3)
    return R, T, Y, Label
I have been experimenting with "n_jobs", expecting that increasing it will speed up my function. However, as I increase n_jobs, things slow down quite significantly. Running this code without "Parallel" is slower than running it with n_jobs=1, but as soon as I increase the number of jobs beyond 1 it slows down again.
Why is this the case? I understood that the more jobs I run, the faster the script would be. Am I using this wrong?
Thanks!
Maybe your problem is caused by image_pad being a large array. In your code, you are using the default multiprocessing backend of joblib. This backend creates a pool of workers, each of which is a Python process. The input data to the function is then copied n_jobs times and broadcast to each worker in the pool, which can lead to serious overhead. Quoting from joblib's docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be reallocated n_jobs times by the workers.
As this problem can often occur in scientific computing with numpy based datastructures, joblib.Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.
Note: The following only applies with the default "multiprocessing" backend. If your code can release the GIL, then using backend="threading" is even more efficient.
So if this is your case, you should switch to the threading backend, if you are able to release the global interpreter lock when calling frame, or switch to the shared memory approach of joblib.
The docs say that joblib provides an automated memmap conversion that could be useful.
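Concretely, the two options look roughly like this (a sketch reusing frame, image_pad, m and indeces from the question; the size threshold and job count are arbitrary):
from joblib import Parallel, delayed

# Option 1: let joblib dump large inputs to disk and memory-map them into the
# workers instead of pickling image_pad once per task.
A = Parallel(n_jobs=4, max_nbytes='50M')(delayed(frame)(i, image_pad, m) for i in indeces)

# Option 2: if the work inside frame releases the GIL, threads share image_pad
# directly and nothing is copied at all.
A = Parallel(n_jobs=4, backend='threading')(delayed(frame)(i, image_pad, m) for i in indeces)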
It's quite possible that the problem you are running up against is fundamental to the nature of the Python interpreter.
If you read "https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en", you can see from a professional who specialises in optimising and parallelising Python code that iterating through large loops is an inherently slow operation for a Python thread to perform. Therefore, spawning more processes that loop through arrays is only going to slow things down.
However - there are things that can be done.
The Cython and Numba compilers are both designed to optimise code that is written in a C/C++-like style (i.e. your case) - in particular, Numba's @vectorize decorator allows scalar functions to be applied elementwise across large arrays in parallel (target='parallel').
I don't understand your code well enough to give an implementation for it, but try this! Used in the right way, these compilers have brought me speed increases of 3,000,000% for parallel processes in the past!
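To show the general shape of what that looks like (a toy elementwise kernel unrelated to the question's code; names are made up):
import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)], target='parallel')
def rel_diff(a, b):
    # compiled to machine code once, then applied elementwise across all cores
    return (a - b) / (a + b)

x = np.random.rand(10_000_000) + 1.0    # +1 keeps the denominator away from zero
y = np.random.rand(10_000_000) + 1.0
out = rel_diff(x, y)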
I am currently working on a project which involves performing a lot of statistical calculations on many relatively small datasets. Some of these calculations are as simple as computing a moving average, while others involve slightly more work, like Spearman's Rho or Kendall's Tau calculations.
The datasets are essentially a series of arrays packed into a dictionary, whose keys relate to a document id in MongoDB that provides further information about the subset. Each array in the dictionary has no more than 100 values. The dictionaries may grow without bound, but in reality only around 150 values are added to each dictionary per year.
I can use mapreduce to perform all of the necessary calculations. Alternatively, I can use Celery and RabbitMQ on a distributed system and perform the same calculations in Python.
My question is this: which avenue is most recommended or best-practice?
Here is some additional information:
I have not benchmarked anything yet, as I am just starting the process of building the scripts to compute the metrics for each dataset.
Using a celery/rabbitmq distributed queue will likely increase the number of queries made against the Mongo database.
I do not envision the memory usage of either method being a concern, unless the number of simultaneous tasks is very large. The majority of the tasks themselves are merely taking an item within a dataset, loading it, doing a calculation, and then releasing it. So even if the amount of data in a dataset is very large, not all of it will be loaded into memory at one time. Thus, the limiting factor, in my mind, comes down to the speed at which mapreduce or a queued system can perform the calculations. Additionally, it is dependent upon the number of concurrent tasks.
Thanks for your help!
It's impossible to say for certain without benchmarking, but my intuition leans toward doing the calculations in Python rather than mapreduce. My main concern is that MongoDB's mapreduce is single-threaded: one MongoDB process can only run one JavaScript function at a time. It can, however, serve thousands of queries simultaneously, so you can take advantage of that concurrency by querying MongoDB from multiple Python processes.
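A hedged sketch of what that could look like (the database, collection and field names are placeholders, and the mean stands in for your Spearman/Kendall calculations; assumes pymongo and a running MongoDB):
import multiprocessing
import numpy as np
from pymongo import MongoClient

def compute_stat(doc_id):
    # Each worker process opens its own connection; a MongoClient should not be shared across forks.
    client = MongoClient('mongodb://localhost:27017')
    doc = client.mydb.datasets.find_one({'_id': doc_id})
    values = np.asarray(doc['values'], dtype=float)
    return doc_id, values.mean()

if __name__ == '__main__':
    ids = []   # fill with the document ids whose metrics you want to compute
    with multiprocessing.Pool(8) as pool:
        results = pool.map(compute_stat, ids)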