How does LocalCluster() affect the number of tasks?

Do the calculations (like the Dask method dd.merge) need to be done inside or outside the LocalCluster context? Do final calculations (like .compute) need to be done inside or outside of it?
My main question is: how does LocalCluster() affect the number of tasks?
My colleague and I noticed that placing dd.merge outside of the LocalCluster() context reduced the number of tasks significantly (roughly 10x). What is the reason for that?
Pseudo example.
Many tasks:
dd.read_parquet(somewhere, index=False)
with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    dd.merge(smth)
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )
Few tasks:
dd.read_parquet(somewhere, index=False)
dd.merge(smth)
with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )

The performance difference is due to the difference in the schedulers being used.
According to the Dask docs:
The Dask collections each have a default scheduler
dask.dataframe uses the threaded scheduler by default
The default scheduler is what is used when there is not another scheduler registered.
Additionally, according to the dask distributed docs:
When we create a Client object it registers itself as the default Dask scheduler. All .compute() methods will automatically start using the distributed system.
So when operating within the context manager for the cluster, computations implicitly use that scheduler.
A couple of additional notes:
It may be that the default threaded scheduler is using more threads than the local cluster you are defining. It is also possible that a significant part of the performance difference comes from the overhead of inter-process communication, which the threaded scheduler does not incur. More information about the schedulers is available in the Dask scheduling documentation.
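For illustration, here is a minimal sketch (not taken from the question's code) showing that the same .compute() call can target either scheduler; the with Client(...) block only changes which scheduler is registered as the default:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import LocalCluster, Client

# Hypothetical small frame standing in for the real data
df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

# Outside any Client context, the default threaded scheduler runs this
total_threaded = df.x.sum().compute(scheduler="threads")

# Inside the context manager the distributed scheduler is the registered
# default, so the same .compute() call goes to the cluster workers instead
with LocalCluster(n_workers=2, threads_per_worker=1) as cluster, Client(cluster) as client:
    total_distributed = df.x.sum().compute()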

Related

Running two Tensorflow trainings in parallel using joblib and dask

I have the following code that runs two TensorFlow trainings in parallel using Dask workers implemented in Docker containers.
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
To that end, I do the following:
I use joblib.delayed to spawn the two processes.
Within each process I run with joblib.parallel_backend('dask'): to execute the fit/training logic. Each training process triggers N dask workers.
The problem is that I don't know whether the entire process is thread-safe; are there any concurrency elements that I'm missing?
# First, submit the function twice using joblib delay
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)

# Second, submit each training process
def train(sub_task):
    global client
    if client is None:
        print('connecting')
        client = Client()

    data = some_data_to_train

    # Third, process the training itself with N workers
    with joblib.parallel_backend('dask'):
        X = data[columns]
        y = data[label]
        niceties = dict(verbose=False)
        model = KerasClassifier(build_fn=build_layers,
                                loss=tf.keras.losses.MeanSquaredError(), **niceties)
        model.fit(X, y, epochs=500, verbose=0)
This is pure speculation, but one potential concurrency issue is the if client is None: check, where two processes could race to create a Client.
If this is resolved (e.g. by explicitly creating a client in advance), then the Dask scheduler will rely on the time of submission to prioritize tasks (unless a priority is explicitly assigned) and on the structure of the task graph (DAG); there are further details available in the docs.
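As an aside, a minimal sketch of explicit prioritization (the fit_model name is a placeholder; priority= is the keyword accepted by Client.submit):

from dask.distributed import Client

client = Client()  # created once, in advance, so there is no race to create it

# Hypothetical fit_model function; higher priority values are scheduled first,
# otherwise submission time and graph structure drive the ordering
urgent = client.submit(fit_model, 123, priority=10)
background = client.submit(fit_model, 456, priority=-10)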
The question, as given, could easily be marked as "unclear" for SO. A couple of notes:
global client : makes the client object available outside of the function. But the function is run in another process, so creating the client there does not affect the other process.
if client is None : this is a NameError; your code doesn't actually run as written
client = Client() : you make a new cluster in each subprocess, each one assuming the total resources are available and thereby oversubscribing those resources.
dask knows whether any client has been created in the current process, but that doesn't help you here
You must ask yourself: why are you creating processes for the two fits at all? Why not just let Dask figure out its parallelism, which is what it's meant for?
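For instance, a rough sketch of that approach (assuming a train function that does not create its own client) would submit both fits to one client and let the scheduler interleave them:

from dask.distributed import Client

client = Client()  # one client/cluster for the whole program

# Submit both fits as ordinary tasks; Dask decides where and when they run
futures = [client.submit(train, sub_task) for sub_task in [123, 456]]
results = client.gather(futures)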
--
-EDIT-
To answer the form of the question asked in a comment:
My question is whether using the same client variable in these two parallel processes creates a problem.
No, the two client variables are unrelated to one-another. You may see a warning message about not being able to bind to a default port, which you can safely ignore. However, please don't make it global as this is unnecessary and makes what you are doing less clear.
--
I think I should answer the question as phrased in your comment, which I advise adding to the main question:
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
You have the following options:
create a client with a specific known address within your program or beforehand, then connect to it
create a default client Client() and get its address (e.g., client._scheduler_identity['address']) and connect to that
write a scheduler information file with client.write_scheduler_file and use that
You will connect in the function with
client = Client(address)
or
client = Client(scheduler_file=the_file_you_wrote)
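A rough sketch of the scheduler-file variant (the file name and the train signature are placeholders):

from dask.distributed import Client, LocalCluster

# In the main program: one cluster, with its connection info written to a file
cluster = LocalCluster(n_workers=8)
client = Client(cluster)
client.write_scheduler_file("scheduler.json")

def train(sub_task):
    # In each spawned process: connect to the existing scheduler instead of
    # creating a new cluster
    client = Client(scheduler_file="scheduler.json")
    ...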

dask: What does memory_limit control?

In dask's LocalCluster, there is a parameter memory_limit. I can't find in the documentation (https://distributed.dask.org/en/latest/worker.html#memory-management) details about whether the limit is per worker, per thread, or for the whole cluster. This is probably at least in part because I have trouble following how keywords are passed around.
For example, in this code:
cluster = LocalCluster(n_workers=2,
                       threads_per_worker=4,
                       memory_target_fraction=0.95,
                       memory_limit='32GB')
will that be 32 GB for each worker? For both workers together? Or for each thread?
My question is motivated partly by running a LocalCluster with n_workers=1 and memory_limit=32GB, but it gets killed by the Slurm Out-Of-Memory killer for using too much memory.
The memory_limit keyword argument to LocalCluster sets the limit per worker.
Related documentation: https://github.com/dask/distributed/blob/7bf884b941363242c3884b598205c75373287190/distributed/deploy/local.py#L76-L78
Note that if the given memory_limit is greater than the available memory, each worker's limit will instead be set to the total available memory. This behavior hasn't been documented yet, but a relevant issue is here: https://github.com/dask/dask/issues/8224
[Screenshot of the cluster widget produced by the code below, showing the limit applied per worker.]
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=2,
                       threads_per_worker=4,
                       memory_target_fraction=0.95,
                       memory_limit='8GB')
client = Client(cluster)
client
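If the widget isn't handy, a rough way to check the per-worker limit programmatically (key names as I recall them from the scheduler info dict, so treat this as an assumption):

# Each entry should report roughly 8 GB in bytes, i.e. 2 workers x 8 GB = 16 GB total
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info["memory_limit"])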

relation between regular Dask and dask.distributed

I don't understand the relation between regular Dask and dask.distributed.
With dask.distributed, e.g. using the Futures interface, I have to explicitly create a client, which is backed by a local or remote cluster, and then submit to it using client.submit().
With regular Dask, e.g. using the Delayed interface, I just use delayed() on my functions.
How does delayed (or compute) determine where my computation takes place? There must be some global state behind it – but how would I access it? If I understand correctly, delayed uses a dask.distributed client if it exists. Does it use something like
client = None
try:
    client = Client.current()
except ValueError:
    pass

if client is not None:
    # use client
else:
    # use default scheduler
If so, why not use the same logic for submit?
client = None
try:
    client = Client.current()
except ValueError:
    pass

if client is not None:
    # use client
else:
    # fail because futures don't work on the default scheduler
And finally, delayed objects and future objects appear very similar. Why can the first use both a dask.distributed client and the default scheduler, while futures need dask.distributed?
Yes, there is some global state that assigns a current client
https://github.com/dask/distributed/blob/f3f4bffea0640c01fc54f49c3219cf5807d14c66/distributed/client.py#L93
If you call the compute method on a delayed object you'll end up using the current client
Dask delayed is just syntactic sugar that builds up a computation graph. When you call compute, the graph ends up being dispatched via the distributed client.
A future refers to a remote result on a cluster that may not be computed yet. A delayed object, by contrast, hasn't been submitted to the cluster.
from dask import delayed

@delayed
def func(x):
    return x

a = func(1)
In this case, a is a delayed object. That task hasn't been queued on the cluster at all
future = client.compute(a, sync=False)
You get a future after the task has been submitted to the cluster.
Dask has multiple backends. If you don't specify one, everything runs on a local cluster with as many processes as you have CPU cores. When defining a cluster (local, Kubernetes, HPC, Spark) you can specify exactly what you want. However, there is no difference in what the client sees, only in where and how it is executed.
All futures are executed on your backend as soon as you send them; you eventually have to wait for the result to come back, but in the meantime you can do other work on the client. When a future is finished, you can fetch its result with .result(). I haven't worked with the futures API as much, but it should behave like Python's concurrent.futures, which is probably also why you have to start a client beforehand: Dask wants to mirror that API as closely as possible.
More information is available in the Dask documentation.
The delayed, dataframe, and array APIs only send the computation to the backend after you call .compute(). You then have to wait for the result to return and can't do anything else in between.
More information is available in the Dask documentation.
A future cannot be used on a local machine (without a local cluster), since it triggers computation right away, so any further calculations in the same code would be blocked. delayed allows you to postpone computation until the DAG is formed, so delayed can be run on a single machine with or without a cluster.
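As a rough illustration of the difference (a minimal sketch, not taken from the answers above):

from dask import delayed
from dask.distributed import Client

def inc(x):
    return x + 1

# Delayed: builds a task graph; nothing runs until .compute() is called,
# and it can use the default local schedulers without any client
lazy = delayed(inc)(1)
print(lazy.compute(scheduler="threads"))

# Futures: require a distributed Client; work starts as soon as submit() returns
client = Client(processes=False)
future = client.submit(inc, 2)
print(future.result())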

dask, joblib, ipyparallel and other schedulers for embarrassingly parallel problems

This is a more general question about how to run "embarrassingly parallel" problems with Python "schedulers" in a science environment.
I have a code that is a Python/Cython/C hybrid (for this example I'm using github.com/tardis-sn/tardis, but I have more such problems for other codes) that is internally OpenMP-parallelized. It provides a single function that takes a parameter dictionary and evaluates to an object within a few hundred seconds running on ~8 cores (result = fun(paramset, calibdata), where paramset is a dict, result is an object (basically a collection of pandas and numpy arrays), and calibdata is a pre-loaded pandas dataframe/object). It logs using the standard Python logging module.
I would like a python framework that can easily evaluate ~10-100k parameter sets using fun on a SLURM/TORQUE/... cluster environment.
Ideally, this framework would automatically spawn workers (given availability with a few cores each) and distribute the parameter sets between the workers (different parameter sets take different amount of time). It would be nice to see the state (in_queue, running, finished, failed) for each of the parameter-sets as well as logs (if it failed or finished or is running).
It would be nice if it keeps track of what is finished and what needs to be done, so that I can restart it if my scheduler task fails. It would be nice if this seamlessly integrates into Jupyter notebooks and runs locally for testing.
I have tried Dask, but it does not seem to queue the tasks; it runs them all at once with client.map(fun, [list of parameter sets]). Maybe there are better tools, or maybe this is a very niche problem. It's also unclear to me what the difference between dask, joblib and ipyparallel is (having quickly tried all three of them at various stages).
Happy to give additional info if things are not clear.
UPDATE: so dask seems to provide some functionality of what I require - but dealing with an OpenMP parallelized code in addition to dask is not straightforward - see issue https://github.com/dask/dask-jobqueue/issues/181
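For reference, a minimal sketch of the Dask/dask-jobqueue approach on SLURM (the resource numbers and the fun/paramsets/calibdata names are placeholders, and this does not address the OpenMP interaction in the issue linked above):

from dask.distributed import Client, as_completed
from dask_jobqueue import SLURMCluster

# Each SLURM job provides one worker with 8 cores
cluster = SLURMCluster(cores=8, processes=1, memory="16GB", walltime="04:00:00")
cluster.scale(jobs=10)          # request 10 such jobs
client = Client(cluster)

# Tasks are queued on the scheduler and handed to workers as they free up
futures = client.map(fun, paramsets, calibdata=calibdata)
for future in as_completed(futures):
    result = future.result()    # handle/save each result as it finishes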

Spark fair scheduling not working

I'm investigating the suitability of using Spark as the back end for my REST API. A problem with that seems to be Spark's FIFO scheduling approach: if a large task is under execution, no small task can finish until that heavy task has finished. According to https://spark.apache.org/docs/latest/job-scheduling.html a fair scheduler should fix this. However, I cannot see it changing anything. Am I configuring the scheduler incorrectly?
scheduler.xml:
<?xml version="1.0"?>
<allocations>
  <pool name="test">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>10</minShare>
  </pool>
</allocations>
My code:
$ pyspark --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/home/hadoop/scheduler.xml
>>> import threading
>>> sc.setLocalProperty("spark.scheduler.pool", "test")
>>> def heavy_spark_job():
...     # Do some heavy work
...     pass
...
>>> def smaller_spark_job():
...     # Do something simple
...     pass
...
>>> threading.Thread(target=heavy_spark_job).start()
>>> smaller_spark_job()
The smaller spark job can only start when the first task of the heavy spark job doesn't need all of the available CPU cores.
You just need to set different pools for your tasks:
By default, each pool gets an equal share of the cluster (also equal in share to each job in the default pool), but inside each pool, jobs run in FIFO order. For example, if you create one pool per user, this means that each user will get an equal share of the cluster, and that each user’s queries will run in order instead of later queries taking resources from that user’s earlier ones.
https://spark.apache.org/docs/latest/job-scheduling.html#default-behavior-of-pools
Also, in PySpark a child thread can't inherit the parent thread's local properties, so you have to set the pool inside the thread's target function.
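A minimal sketch of that fix (the pool names and job bodies are placeholders; pools not listed in the allocation file fall back to default settings but still receive a fair share):

import threading

def heavy_spark_job():
    # Set the pool inside the thread, since local properties are not inherited
    sc.setLocalProperty("spark.scheduler.pool", "heavy")
    # ... heavy work ...

def smaller_spark_job():
    sc.setLocalProperty("spark.scheduler.pool", "small")
    # ... light work ...

threading.Thread(target=heavy_spark_job).start()
threading.Thread(target=smaller_spark_job).start()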
I believe you have implemented fair scheduling correctly. There is a problem with the threading: you are not running smaller_spark_job as a thread, so it is waiting for heavy_spark_job to complete first. It should be like the below.
threading.Thread(target=heavy_spark_job).start()
threading.Thread(target=smaller_spark_job).start()
Also, my post can help you verify FAIR scheduling from the Spark UI.
