In Dask's LocalCluster, there is a parameter memory_limit, but I can't find details in the documentation (https://distributed.dask.org/en/latest/worker.html#memory-management) about whether the limit is per worker, per thread, or for the whole cluster. This is probably at least in part because I have trouble following how keyword arguments are passed around.
For example, in this code:
cluster = LocalCluster(n_workers=2,
                       threads_per_worker=4,
                       memory_target_fraction=0.95,
                       memory_limit='32GB')
will that be 32 GB for each worker? For both workers together? Or for each thread?
My question is motivated partly by running a LocalCluster with n_workers=1 and memory_limit='32GB' that nevertheless gets killed by the Slurm out-of-memory killer for using too much memory.
The memory_limit keyword argument to LocalCluster sets the limit per worker.
Related documentation: https://github.com/dask/distributed/blob/7bf884b941363242c3884b598205c75373287190/distributed/deploy/local.py#L76-L78
Note that if the given memory_limit is greater than the available memory, each worker's limit will instead be set to the total available memory. This behavior hasn't been documented yet, but a relevant issue is here: https://github.com/dask/dask/issues/8224
Example of a cluster created with the following code:
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=2,
                       threads_per_worker=4,
                       memory_target_fraction=0.95,
                       memory_limit='8GB')
client = Client(cluster)
client
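To confirm the limit is applied per worker rather than shared, one can inspect what each worker reports; a small sketch assuming the client created above (memory_limit is reported in bytes):
for addr, info in client.scheduler_info()["workers"].items():
    # Each of the two workers should report roughly 8 GB here (or the capped
    # value if the machine has less total memory available).
    print(addr, info["memory_limit"])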
I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
The problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time potentially using multiple threads and processes.
However my understanding does not leave any room for the term job and is thus certainly wrong. Moreover most configurations of the cluster (cores, memory, processes) are done on a per job basis according to the docs.
So my question is: what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs; I just don't understand and could use another explanation.)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of instructions on what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask workers (there might be advantages to this due to latency, permissions, resource availability, etc., but this would need to be determined from the particular cluster configuration).
From Dask's perspective what matters is the number of workers, but from dask-jobqueue's perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,
    processes=1,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores each and 50 GB per worker (note that the memory in the config is the total per job, so each worker gets 500 GB / 10 = 50 GB):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,
    processes=10,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
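Equivalently, scaling can be expressed in workers instead of jobs; since each job hosts 10 worker processes here, asking for 100 workers should translate into the same 10 SLURM jobs (a sketch using the cluster object defined above):
cluster.scale(100)        # request 100 workers; dask-jobqueue submits 100 / 10 = 10 jobs
# cluster.scale(jobs=10)  # the same request, expressed in jobs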
I have the following code that runs two TensorFlow trainings in parallel, using Dask workers running in Docker containers.
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
To that end, I do the following:
I use joblib.delayed to spawn the two processes.
Within each process I run with joblib.parallel_backend('dask'): to execute the fit/training logic. Each training process triggers N dask workers.
The problem is that I don't know if the entire process is thread-safe; are there any concurrency issues that I'm missing?
# First, submit the function twice using joblib.delayed
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)

# Second, submit each training process
def train(sub_task):
    global client
    if client is None:
        print('connecting')
        client = Client()

    data = some_data_to_train

    # Third, process the training itself with N workers
    with joblib.parallel_backend('dask'):
        X = data[columns]
        y = data[label]
        niceties = dict(verbose=False)
        model = KerasClassifier(build_fn=build_layers,
                                loss=tf.keras.losses.MeanSquaredError(), **niceties)
        model.fit(X, y, epochs=500, verbose=0)
This is pure speculation, but one potential concurrency issue is the 'if client is None' check, where two processes could race to create a Client.
If this is resolved (e.g. by explicitly creating a client in advance), then the Dask scheduler will prioritize tasks by time of submission (unless a priority is explicitly assigned) and by the structure of the graph (DAG); there are further details available in the docs.
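For instance, if one fit should run before the other, a priority can be attached at submission time; this is a small sketch reusing the train function and sub-task ids from the question:
# Higher priority values are scheduled earlier when workers are busy;
# otherwise submission order acts as the tie-breaker.
future_a = client.submit(train, 123, priority=10)
future_b = client.submit(train, 456, priority=0)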
The question, as given, could easily be marked as "unclear" for SO. A couple of notes:
global client : makes the client object available outside of the function. But the function is run in another process; you do not affect the other process when you create the client there.
if client is None : this is a name error, since client is never defined in that process, so your code doesn't actually run as written
client = Client() : you make a new cluster in each subprocess, each assuming the total resources available, oversubscribing those resources.
dask knows whether any client has been created in the current process, but that doesn't help you here
You must ask yourself: why are you creating processes for the two fits at all? Why not just let Dask figure out its parallelism, which is what it's meant for?
--
-EDIT-
To answer the form of the question asked in a comment:
My question is whether using the same client variable in these two parallel processes creates a problem.
No, the two client variables are unrelated to one-another. You may see a warning message about not being able to bind to a default port, which you can safely ignore. However, please don't make it global as this is unnecessary and makes what you are doing less clear.
--
I think I must answer the question as phrased in your comment, which I advise you to add to the main question:
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
You have the following options:
create a client with a specific known address within your program or beforehand, then connect to it
create a default client Client() and get its address (e.g., client._scheduler_identity['address']) and connect to that
write a scheduler information file with client.write_scheduler_file and use that
You will connect in the function with
client = Client(address)
or
client = Client(scheduler_file=the_file_you_wrote)
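As a minimal sketch of the scheduler-file option (the file name is arbitrary and the training body is elided):
from dask.distributed import LocalCluster, Client

# Parent process: start one cluster and record how to reach its scheduler.
cluster = LocalCluster(n_workers=4)
main_client = Client(cluster)
main_client.write_scheduler_file("scheduler.json")

# Each training process connects to that scheduler instead of creating
# its own cluster.
def train(sub_task):
    client = Client(scheduler_file="scheduler.json")
    ...  # fit the model here, as in the question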
Do the calculations (like the Dask method dd.merge) need to be done inside or outside the LocalCluster context manager? Do final calculations (like .compute) need to be done inside or outside of it?
My main question is - how does LocalCluster() affect the number of tasks?
My colleague and I noticed that placing dd.merge outside of LocalCluster() reduced the number of tasks significantly (roughly 10x). What is the reason for that?
Pseudo-example:
many tasks:
dd.read_parquet(somewhere, index=False)

with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    dd.merge(smth)
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )
few tasks:
dd.read_parquet(somewhere, index=False)
dd.merge(smth)

with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )
The performance difference is due to the difference in the schedulers being used.
According to the Dask docs:
The dask collections each have a default scheduler
dask.dataframe use the threaded scheduler by default
The default scheduler is what is used when there is not another scheduler registered.
Additionally, according to the dask distributed docs:
When we create a Client object it registers itself as the default Dask scheduler. All .compute() methods will automatically start using the distributed system.
So when operating within the context manager for the cluster, computations implicitly use that scheduler.
A couple of additional notes:
It may be the case that the default scheduler is using more threads than the local cluster you are defining.
It is also possible that a significant difference in performance is due to the overhead of inter-process communication that is not incurred by the threaded scheduler. More information about the schedulers is available here.
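To make the scheduler choice explicit instead of relying on whatever happens to be registered, it can also be named at compute time. A rough sketch with hypothetical paths and dataframes:
import dask.dataframe as dd
from dask.distributed import LocalCluster, Client

left = dd.read_parquet("left_path", index=False)    # hypothetical inputs
right = dd.read_parquet("right_path", index=False)

# Without a registered Client, dask.dataframe falls back to the threaded
# scheduler; it can also be requested explicitly:
merged_threads = left.merge(right).compute(scheduler="threads")

# With a Client registered, .compute() uses the distributed scheduler:
with LocalCluster(n_workers=8, threads_per_worker=1) as cluster, Client(cluster) as client:
    merged_distributed = left.merge(right).compute()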
I'm optimizing a TPOT pipeline with dask on my local machine. I expect this to run for 48 hrs or even more.
I started a client with a few cores so I can continue using my machine while it's running in the background.
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=6, memory_limit="14GB")
I'm wondering if I could add workers/threads to the client so it has increased computing capacity when I'm sleeping or not using my computer.
Yes, you might consider using Dask's Adaptive scaling: https://docs.dask.org/en/latest/setup/adaptive.html
client.cluster.adapt(minimum=0, maximum=10)
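If you would rather control it by hand (for example right before stepping away), the cluster behind the client can also be scaled explicitly; this assumes the client from the question, whose underlying LocalCluster is reachable as client.cluster:
client.cluster.scale(4)  # grow to 4 workers while the machine is idle
# ... later ...
client.cluster.scale(1)  # shrink back to 1 worker when you need the machine again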
I am trying to process some files using a python function and would like to parallelize the task on a PBS cluster using dask. On the cluster I can only launch one job but have access to 10 nodes with 24 cores each.
So my dask PBSCluster looks like:
import dask
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=240,
                     memory="1GB",
                     project='X',
                     queue='normal',
                     local_directory='$TMPDIR',
                     walltime='12:00:00',
                     resource_spec='select=10:ncpus=24:mem=1GB',
                     )
cluster.scale(1)  # one worker
from dask.distributed import Client
client = Client(cluster)
client
Afterwards, the cluster in Dask shows 1 worker with 240 cores (not sure if that makes sense).
When I run
result = compute(*foo, scheduler='distributed')
and access the allocated nodes, only one of them is actually running the computation. I am not sure if I am using the right PBS configuration.
cluster = PBSCluster(cores=240,
memory="1GB",
The values you give to the Dask Jobqueue constructors are the values for a single job for a single node. So here you are asking for a node with 240 cores, which probably doesn't make sense today.
If you can only launch one job then dask-jobqueue's model probably won't work for you. I recommend looking at dask-mpi as an alternative.
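For reference, the dask-mpi pattern looks roughly like this (a sketch assuming dask-mpi is installed and the script is launched across the allocation with mpirun or mpiexec; the script and function names are hypothetical):
# Launched with, e.g.: mpirun -np 240 python train_script.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()       # rank 0 runs the scheduler, most other ranks become workers
client = Client()  # connects to the scheduler started by initialize()

# submit work as usual, e.g. futures = client.map(process_file, file_list)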