I'm struggling to understand when and when not to use compute() in Dask dataframes. I usually write my code by adding/removing compute() until the code works, but that's extremely error-prone. How should I use compute() in Dask? Does it differ in Dask Distributed?
The core idea of delayed computations is to delay the actual calculation until the final target is known. This allows:
increased speed of coding (e.g. as a data scientist, I don't need to wait for every transformation step to complete before designing the workflow),
distribution of work across multiple workers,
overcoming resource constraints of my client, e.g. if I am using a laptop with limited memory, I can run heavy computations on dask workers that are in the cloud or another machine with more resources,
better efficiency if the final target requires only some of the tasks to be done (e.g. if the final calculation needs only a subset of the dataframe, then Dask will load only the relevant columns/partitions); a minimal sketch of this lazy pattern follows this list.
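For concreteness, here is a minimal sketch of that lazy pattern (the file name and column names are made up; nothing runs until .compute() is called):
import dask.dataframe as dd

ddf = dd.read_csv("transactions-*.csv")                 # lazy: only builds the task graph
positive = ddf[ddf["amount"] > 0]                       # lazy filter
totals = positive.groupby("category")["amount"].sum()   # still lazy

result = totals.compute()                               # only now is any work scheduled and run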
Some of the alternatives to calling .compute are:
.visualize(): this helps visualize the task graph. The DAG can become hairy when there are lots of tasks, so this is useful to run on smaller subsets of the data (e.g. only loading two/three partitions of the dataframe)
using client.submit: this launches computations right away, giving you a future, an object that refers to the result of a task being computed. This keeps the advantage of scaling work across multiple workers, but it can be a bit more resource intensive (since Dask doesn't know the full workflow, it might run computations that are not needed to achieve the final target).
With regard to distributed, I don't think there is a difference except for where the result ends up: dask.compute will put the result on the local machine, while client.compute will keep the result on a remote worker and return a future.
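A short, self-contained sketch of those variants (the tiny dataframe is only for illustration, and Client() with no arguments is assumed to start a local cluster):
import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

client = Client()                                    # local cluster by default

ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
total = ddf["x"].sum()                               # lazy

(local_value,) = dask.compute(total)                 # result gathered to the local machine
remote_future = client.compute(total)                # result stays on a worker; returns a future
eager_future = client.submit(sum, [1, 2, 3])         # starts running right away

print(local_value, remote_future.result(), eager_future.result())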
Each partition in a Dask DataFrame is a Pandas DataFrame.
compute() combines all the partitions (Pandas DataFrames) into a single Pandas DataFrame.
Dask is fast because it can perform computations on partitions in parallel. Pandas can be slower because it works on everything as a single partition, in a single thread.
You should avoid calling compute() whenever possible. It is better to keep the data spread across multiple partitions so it can be processed in parallel.
In rare cases it makes sense to compute down to Pandas (e.g. after a large-to-small join or a huge filtering operation), but it's generally best to learn how to use the Dask API to run computations in parallel.
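A hedged sketch of that rare case (file names, column names, and the filter condition are placeholders): the heavy filter runs in parallel across partitions, and only the small result is pulled into Pandas:
import dask.dataframe as dd

big = dd.read_csv("events-*.csv")              # many partitions, stays in Dask
errors = big[big["status"] == "error"]         # heavy filtering done in parallel
errors_pdf = errors.compute()                  # now small enough for a single Pandas DataFrame

errors_pdf.to_csv("errors.csv", index=False)   # continue with plain Pandas from here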
Related
I have been working with Dask and I have a question about the client when processing a large script with high computational requirements:
client = Client(n_workers=NUM_PARALLEL)
# ... more code ...
client.shutdown()
I have seen some people shut down the client in the middle of the process and then initialize it again; is this good for speed?
On the other hand, the workers are running out of memory. Do you know if it's good practice to compute() the Dask dataframe several times along the way, instead of computing it only once at the end, which may be beyond the capacity of the PC?
I have seen some people shut down the client in the middle of the process and then initialize it again; is this good for speed?
IIUC, there is no effect on speed; if anything, there may be a slight slowdown from the time it takes to spin up the scheduler/cluster again. The only small advantage is that, if you are sharing resources, shutting down the cluster frees them up.
On the other hand, the workers are running out of memory. Do you know if it's good practice to compute() the Dask dataframe several times along the way, instead of computing it only once at the end, which may be beyond the capacity of the PC?
This really depends on the DAG: there can be an advantage to computing at an intermediate step if it reduces the number of partitions/tasks, especially if some intermediate results would otherwise be computed multiple times.
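One way this can look in practice, as a rough sketch (file and column names are made up, and persist() is just one option for materializing an intermediate result on the workers):
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

ddf = dd.read_csv("data-*.csv")
cleaned = ddf.dropna()

# If several downstream results reuse `cleaned`, materializing it once can be
# cheaper than letting each .compute() rebuild that branch of the DAG.
cleaned = cleaned.persist()

mean_value = cleaned["value"].mean().compute()
max_value = cleaned["value"].max().compute()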
I'm new to using Dask but have experienced painfully slow performance when attempting to re-write native sklearn functions in Dask. I've simplified the use-case as much as possible in hope of getting some help.
Using standard sklearn/numpy/pandas etc I have the following:
df = pd.read_csv(location, index_col=False) # A ~75MB CSV
# Build feature list and dependent variables, code irrelevant
from sklearn import linear_model
model = linear_model.Lasso(alpha=0.1, normalize=False, max_iter=100, tol=Tol)
model.fit(features.values, dependent)
print(model.coef_)
print(model.intercept_)
This takes a few seconds to compute. I then have the following in Dask:
# Read in CSV and prepare params like before but using dask arrays/dataframes instead
with joblib.parallel_backend('dask'):
    from dask_glm.estimators import LinearRegression

    # Coerce data
    X = self.features.to_dask_array(lengths=True)
    y = self.dependents

    # Build regression
    lr = LinearRegression(fit_intercept=True, solver='admm', tol=self.tolerance,
                          regularizer='l1', max_iter=100, lamduh=0.1)
    lr.fit(X, y)
    print(lr.coef_)
    print(lr.intercept_)
Which takes ages to compute (about 30 minutes). I only have one Dask worker in my development cluster, but it has 16 GB of RAM and no CPU limit.
Has anyone any idea why this is so slow?
Hopefully my code omissions aren't significant!
NB: This is the simplest use-case before people ask why even use Dask - this was used as a proof of concept exercise to check that things would function as expected.
A quote from the documentation you may want to consider:
For large arguments that are used by multiple tasks, it may be more efficient to pre-scatter the data to every worker, rather than serializing it once for every task. This can be done using the scatter keyword argument, which takes an iterable of objects to send to each worker.
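For concreteness, a hedged sketch of the same pre-scattering idea using the distributed client directly; the array, the summarize function, and the split into four tasks are all made-up placeholders:
from dask.distributed import Client
import numpy as np

def summarize(data, part):
    # Stand-in for a real per-task computation on the shared array
    return data[part::4].mean(axis=0)

if __name__ == "__main__":
    client = Client()
    big_data = np.random.random((1_000_000, 10))     # stand-in for a large, reused argument

    # Send the array to every worker once, rather than serializing it once per task
    [data_future] = client.scatter([big_data], broadcast=True)

    futures = [client.submit(summarize, data_future, part) for part in range(4)]
    print(client.gather(futures))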
But in general, Dask has a lot of diagnostics available to you, especially the scheduler's dashboard, to help figure out what your workers are doing and how time is being spent - you would do well to investigate it. Other system-wide factors are also very important, as with any computation: how close are you coming to your memory capacity, for instance?
In general, though, Dask is not magic, and when the data fits comfortably into memory anyway there will certainly be cases where Dask adds significant overhead. Read the documentation carefully on the intended use of the method you are considering: is it supposed to speed things up, or merely to allow you to process more data than would normally fit on your system?
I am working on an assignment in a class that I now realize may be a little out of my reach (this is the first semester I have done any programming).
The stipulation is that I use parallel programming with MPI.
I have to read in a CSV file of up to a terabyte of tick data (every microsecond) that may be locally out of order, run a process on the data to identify noise, and output a cleaned data file.
I have written a serial program using Pandas that takes the data, determines significant outliers, and writes them to a dataset labeled noise, then creates the final dataset by subtracting the noise from the original based on the index (time).
I have no idea where to start parallelizing the program. I understand that, because my computations are all local, I should import from the CSV in parallel and run the noise-identification process on each piece.
I believe the best way to do this (and I may be completely wrong) is to scatter the data, run the computation, and gather the results using HDF5, but I do not know how to implement this.
I do not want someone to write the entire code for me, but maybe a specific example of importing from a CSV in parallel and regathering the data, or a better approach to the problem.
If you can boil down your program to a function to run against a list of rows, then yes, a simple multiprocessing approach would be easy and effective. For instance:
from multiprocessing import Pool

def clean_tickData(row):
    # <your cleaning code for a single row goes here>
    return row

pool = Pool()                                   # defaults to one process per CPU core
cleaned = pool.map(clean_tickData, csv_rows)    # csv_rows: the rows read from your CSV
pool.close()
pool.join()
map from Pool runs in parallel. You can control how many parallel processes are used, but the default, set with an empty Pool() call, starts as many processes as you have CPU cores. So, if you can reduce your clean-up work to a function that can be run over the rows of your CSV, using pool.map would be an easy and fast implementation.
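Since the file may be up to a terabyte, a hedged variant of the same idea works on chunks rather than single rows; the file name, column name, and outlier rule below are only placeholders:
from multiprocessing import Pool
import pandas as pd

def clean_chunk(chunk):
    # Example rule: drop rows more than 3 standard deviations from the chunk mean
    price = chunk["price"]
    keep = (price - price.mean()).abs() <= 3 * price.std()
    return chunk[keep]

if __name__ == "__main__":
    chunks = pd.read_csv("ticks.csv", chunksize=1_000_000)   # stream the file in pieces
    with Pool() as pool:
        cleaned_parts = pool.map(clean_chunk, chunks)        # clean chunks in parallel
    cleaned = pd.concat(cleaned_parts)                       # regather the results
    cleaned.to_csv("ticks_clean.csv", index=False)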
I have a directory containing n h5 files, each of which has m image stacks to filter. For each image, I will run the filtering (Gaussian and Laplacian) using Dask parallel arrays in order to speed up the processing (ref: Dask). I will use the Dask arrays through the apply_parallel() function in scikit-image.
I will run the processing on a small server with 20 CPUs.
I would like advice on which parallel strategy makes more sense to use:
1) Sequential processing of the h5 files, with all the CPUs used for Dask processing.
2) Parallel processing of the h5 files with x cores, using the remaining 20 - x for Dask processing.
3) Distribute the resources: process the h5 files in parallel, the images within each h5 file in parallel, and use whatever is left for Dask.
Thanks for the help!
It is always best to parallelize in the simplest way possible. If you have several files and just want to run the same computation on each of them, then parallelizing across the files is almost certainly the simplest approach. If this saturates your computational resources then you can stop there without diving into more sophisticated methods.
If this is indeed your situation, then you can parallelize with dask, make, concurrent.futures, or any of a variety of other tools.
If there are other concerns, like trying to parallelize the operation itself or making sure you don't run out of memory, then you are forced into more sophisticated systems like dask, but this may not be the case.
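As a rough sketch of the file-level approach (the directory layout, the "stack" dataset name, and the filter settings are all assumptions), a plain process pool over the files keeps things simple:
import glob
import h5py
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from skimage.filters import gaussian

def filter_file(path):
    with h5py.File(path, "r") as f:
        stack = f["stack"][...]                          # assumed dataset name
    smoothed = np.stack([gaussian(img, sigma=2) for img in stack])
    out = path.replace(".h5", "_filtered.npy")
    np.save(out, smoothed)
    return out

if __name__ == "__main__":
    paths = sorted(glob.glob("data/*.h5"))
    with ProcessPoolExecutor(max_workers=20) as pool:    # one file per process
        for out in pool.map(filter_file, paths):
            print("wrote", out)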
Use make for parallelization.
With make -j20 you can tell make to run 20 processes in parallel.
By using multiple processes, you avoid the cost of the "global interpreter lock". For independent tasks, it is more efficient to use multiple independent processes (benchmark if in doubt). Make is great for processing whole folders where you need to apply the same command to each file - it is traditionally used for compiling source code, but it can be used to run arbitrary commands.
I am currently working on a project which involves performing a lot of statistical calculations on many relatively small datasets. Some of these calculations are as simple as computing a moving average, while others involve slightly more work, like Spearman's Rho or Kendall's Tau.
The datasets are essentially a series of arrays packed into a dictionary, whose keys relate to a document id in MongoDB that provides further information about the subset. Each array in the dictionary has no more than 100 values. The dictionaries, however, may grow without bound; in reality, around 150 values are added to each dictionary per year.
I can use MongoDB's map-reduce to perform all of the necessary calculations. Alternatively, I can use Celery and RabbitMQ on a distributed system and perform the same calculations in Python.
My question is this: which avenue is most recommended or best-practice?
Here is some additional information:
I have not benchmarked anything yet, as I am just starting the process of building the scripts to compute the metrics for each dataset.
Using a Celery/RabbitMQ distributed queue will likely increase the number of queries made against the MongoDB database.
I do not envision the memory usage of either method being a concern, unless the number of simultaneous tasks is very large. The majority of the tasks themselves are merely taking an item within a dataset, loading it, doing a calculation, and then releasing it. So even if the amount of data in a dataset is very large, not all of it will be loaded into memory at one time. Thus, the limiting factor, in my mind, comes down to the speed at which mapreduce or a queued system can perform the calculations. Additionally, it is dependent upon the number of concurrent tasks.
Thanks for your help!
It's impossible to say for certain without benchmarking, but my intuition leans toward doing the calculations in Python rather than in map-reduce. My main concern is that MongoDB's map-reduce is effectively single-threaded: one MongoDB process can only run one JavaScript function at a time. It can, however, serve thousands of queries simultaneously, so you can take advantage of that concurrency by querying MongoDB from multiple Python processes.
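A hedged sketch of that multi-process pattern (the database, collection, and field names are placeholders, and the mean here stands in for the real Spearman/Kendall calculations):
from multiprocessing import Pool
from pymongo import MongoClient

def compute_metrics(doc_id):
    # Connect inside the worker so each process issues its own queries
    client = MongoClient()
    doc = client.mydb.datasets.find_one({"_id": doc_id})
    values = doc["values"]                               # placeholder array field
    mean = sum(values) / len(values) if values else float("nan")
    return doc_id, mean

if __name__ == "__main__":
    ids = [d["_id"] for d in MongoClient().mydb.datasets.find({}, {"_id": 1})]
    with Pool() as pool:
        results = dict(pool.map(compute_metrics, ids))   # fan out across CPU cores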