I don't understand the relation between regular Dask and dask.distributed.
With dask.distributed, e.g. using the Futures interface, I have to explicitly create a client, which is backed by a local or remote cluster, and then submit to it using client.submit().
With regular Dask, e.g. using the Delayed interface, I just use delayed() on my functions.
How does delayed (or compute) determine where my computation takes place? There must be some global state behind it – but how would I access it? If I understand correctly, delayed uses a dask.distributed client if it exists. Does it use something like
client = None
try:
client = Client.current()
except ValueError:
pass
if client is not None:
# use client
else:
# use default scheduler
If so, why not use the same logic for submit?
client = None
try:
client = Client.current()
except ValueError:
pass
if client is not None:
# use client
else:
# fail because futures don't work on the default scheduler
And finally, delayed objects and future objects appear very similar. Why can the first use both a dask.distributed client and the default scheduler, while futures need dask.distributed?
Yes, there is some global state that assigns a current client
https://github.com/dask/distributed/blob/f3f4bffea0640c01fc54f49c3219cf5807d14c66/distributed/client.py#L93
If you call the compute method on a delayed object you'll end up using the current client
Dask delayed is just syntatic sugar that builds up a computation graph. When you call compute, the graph ends up being dispatched via the distributed client.
A future refers to a remote result on a cluster that may not be computed yet. The delayed object hasn't been submitted to the cluster
#delayed
def func(x):
return x
a = func(1)
In this case, a is a delayed object. That task hasn't been queued on the cluster at all
future = client.compute(a, sync=False)
You get a future after the task has been submitted to the cluster.
Dask has multiple backends. If you don't specify one everything runs on a local cluster with as many processes as you have cores in your CPU. When defining a cluster (local, Kubernetes, HPC, Spark) you can specify what you want exactly. However there is no difference on what the client sees only were and how it is executed.
All futures are executed on your backend as you send them, but you have to wait for the result to return. In the meantime you can do other stuff on the client. When it's finished, you can fetch the result with .result. I haven't worked with the futures API as much, but it should work like Python concurrent futures. This is also probably why you have to start a client beforehand. Dask wants to mirror the API as close as possible.
More information here.
The delayed, dataframe or array API only sends calculation to the backend, after you called .compute(). You then have to wait for the result to return and can't do anything in between.
More information here.
future cannot be used on a local machine (without a local cluster), since it triggers computation right away, so any further calculations in the same code will be blocked. delayed allows you to postpone computation until DAG is formed. So delayed can be run on a single machine with or without a cluster.
Related
I have the following code that runs two TensorFlow trainings in parallel using Dask workers implemented in Docker containers.
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
To that end, I do the following:
I use joblib.delayed to spawn the two processes.
Within each process I run with joblib.parallel_backend('dask'): to execute the fit/training logic. Each training process triggers N dask workers.
The problem is that I don't know if the entire process is thread safe, are there any concurrency elements that I'm missing?
# First, submit the function twice using joblib delay
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)
# Second, submit each training process
def train(sub_task):
global client
if client is None:
print('connecting')
client = Client()
data = some_data_to_train
# Third, process the training itself with N workers
with joblib.parallel_backend('dask'):
X = data[columns]
y = data[label]
niceties = dict(verbose=False)
model = KerasClassifier(build_fn=build_layers,
loss=tf.keras.losses.MeanSquaredError(), **niceties)
model.fit(X, y, epochs=500, verbose = 0)
This is pure speculation, but one potential concurrency issue is due to if client is None: part, where two processes could race to create a Client.
If this is resolved (e.g. by explicitly creating a client in advance), then dask scheduler will rely on time of submission to prioritize task (unless priority is clearly assigned) and also the graph (DAG) structure, there are further details available in docs.
The question, as given, could easily be marked as "unclear" for SO. A couple of notes:
global client : makes the client object available outside of the fucntion. But the function is run from another process, you do not affect the other process when making the client
if client is None : this is a name error, your code doesn't actually run as written
client = Client() : you make a new cluster in each subprocess, each assuming the total resources available, oversubscribing those resources.
dask knows whether any client has been created in the current process, but that doesn't help you here
You must ask yourself: why are you creating processes for the two fits at all? Why not just let Dask figure out its parallelism, which is what it's meant for.
--
-EDIT-
to answer the form of the question asked in a comment.
My question is whether using the same client variable in these two parallel processes creates a problem.
No, the two client variables are unrelated to one-another. You may see a warning message about not being able to bind to a default port, which you can safely ignore. However, please don't make it global as this is unnecessary and makes what you are doing less clear.
--
I think I must answer the question as phrased in your comment, which I advise to add to the main question
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
You have the following options:
create a client with a specific known address within your program or beforehand, then connect to it
create a default client Client() and get its address (e.g., client._scheduler_identity['address']) and connect to that
write a scheduler information file with client.write_scheduler_file and use that
You will connect in the function with
client = Client(address)
or
client = Client(scheduler_file=the_file_you_wrote)
I'm building a FastAPI application that has an endpoint to trigger a Dask computation. The API endpoint sends this call to the Dask scheduler and just returns the key of the Future.
trigger
x = client.submit
(
function_name,
arg1,
arg2
)
return x.key
I have two other endpoints to retrieve the status and result of the task, which take the key as input.
status
status = Future(key=key, client=client).status
return status
result
result = Future(key=key, client=client).result()
return result
Of course, this way, I lose the reference to the future after trigger returns, in which case Dask doesn't compute this anymore. So even if the key is given to the client, we get the status as pending forever.
What I'm doing now is storing references to the Future object as a python dictionary in the application, which works. But ideally, I would like my API application to be stateless. What would be store these Futures outside this application? Are there good caching libraries in Python that can store Python objects (with references)?
Try using Flask + Celery to handle the background Dask computation. Below are few links for reference:
https://flask.palletsprojects.com/en/1.1.x/patterns/celery/
https://blog.miguelgrinberg.com/post/using-celery-with-flask
https://medium.com/#frassetto.stefano/flask-celery-howto-d106958a15fe
Do the calculations (like dask method dd.merge) need to be done inside or outside the LocalCluster? Do final calculations (like .compute) need to be done inside or outside the LocalCluster?
My main question is - how does LocalCluster() affect the number of tasks?
I and my colleague noticed that placing dd.merge outside of LocalCLuster() downgraded the number of tasks significantly (like 10x or smth like that). What is the reason for that?
pseudo example
many tasks:
dd.read_parquet(somewhere, index=False)
with LocalCluster(
n_workers=8,
processes=True,
threads_per_worker=1,
memory_limit="10GB",
ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
dd.merge(smth)
smth..to_parquet(
somewhere, engine="fastparquet", compression="snappy"
)
few tasks:
dd.read_parquet(somewhere, index=False)
dd.merge(smth)
with LocalCluster(
n_workers=8,
processes=True,
threads_per_worker=1,
memory_limit="10GB",
ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
smth..to_parquet(
somewhere, engine="fastparquet", compression="snappy"
)
The performance difference is due to the difference in the schedulers being used.
According the the dask docs:
The dask collections each have a default scheduler
dask.dataframe use the threaded scheduler by default
The default scheduler is what is used when there is not another scheduler registered.
Additionally, according to the dask distributed docs:
When we create a Client object it registers itself as the default Dask scheduler. All .compute() methods will automatically start using the distributed system.
So when operating within the context manager for the cluster, computations implicitly use that scheduler.
A couple of additional notes:
It may be the case that the default scheduler is using more threads than the local cluster you are defining. It is also possible that a significant difference in performance is due to the overhead of inter-process communication that is not incurred by the threaded scheduler. More information about the schedulers is available here.
I have a Django application that uses large data structures in-memory (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the python web process takes more than 30s to start, it will be stopped as it's considered a timeout error. Because of the problem aforementioned, I've used a daemon process(worker in Heroku) to handle the construction of the data structures and Redis to handle the message passing between processes.
When the worker finishes(approx 1 minute), it stores the data structures(50Mb or so) in Redis.
And now comes the crux of the matter...Django follows the request/response paradigm and it's synchronised. This implies a Django view should exist to handle the callback from the worker announcing it's done. Even if I use something fancier like a pub/sub from Redis, I'm still forced to evaluate the queue populated by a publisher in a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()
# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()
pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)
#require_http_methods(['POST'])
def store_and_fetch(request):
user_data = json.load(request.body.decode('utf8'))
message = pub_sub.get_message()
if message:
command = message['data'] if 'data' in message else ''
if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
# this takes the data from redis and updates data_structures
data_handler.update(data_structures)
return HttpResponse(compute_response(user_data, data_structures))
Update: After working for multiple months with this, I can now say it's definitely better(and wiser) NOT to fiddle with Django's request/response cycle. There are things like Django RQ Scheduler, or Celery that can do async tasks just fine. If you want to update the main web process after some repeatable job completed, it's simpler to use something like python requests package, sending a POST to the web process from the worker that did the scheduled job. In this way we don't circumvent Django's mechanisms, and more importantly, it's simpler to do overall.
Regarding the Heroku constraints I mentioned at the beginning of the post. At the moment I wrote this question I was quite a newbie with heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. Thus, at the end of the release phase, we simply need to notify the web process, in the manner I've described above and use some distributed memory buffer (even Redis will work just fine).
I have been testing how to use dask (cluster with 20 cores) and I am surprised by the speed that I get on calling a len function vs slicing through loc.
import dask.dataframe as dd
from dask.distributed import Client
client = Client('192.168.1.220:8786')
log = pd.read_csv('800000test', sep='\t')
logd = dd.from_pandas(log,npartitions=20)
#This is the code than runs slowly
#(2.9 seconds whilst I would expect no more than a few hundred millisencods)
print(len(logd))
#Instead this code is actually running almost 20 times faster than pandas
logd.loc[:'Host'].count().compute()
Any ideas why this could be happening? It isn't important for me that len runs fast, but I feel that by not understanding this behaviour there is something I am not grasping about the library.
All of the green boxes correspond to "from_pandas" whilst in this article by Matthew Rocklin http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes the call graph looks better (len_chunk is called which is significantly faster and the calls don't seem to be locked by and wait for one worker to finish his task before starting the other)
Good question, this gets at a few points about when data is moving up to the cluster and back down to the client (your python session). Lets look at a few stages of your compuation
Load data with Pandas
This is a Pandas dataframe in your python session, so it's obviously still in your local process.
log = pd.read_csv('800000test', sep='\t') # on client
Convert to a lazy Dask.dataframe
This breaks up your Pandas dataframe into twenty Pandas dataframes, however these are still on the client. Dask dataframes don't eagerly send data up to the cluster.
logd = dd.from_pandas(log,npartitions=20) # still on client
Compute len
Calling len actually causes computation here (normally you would use df.some_aggregation().compute(). So now Dask kicks in. First it moves your data out to the cluster (slow) then it calls len on all of the 20 partitions (fast), it aggregates those (fast) and then moves the result down to your client so that it can print.
print(len(logd)) # costly roundtrip client -> cluster -> client
Analysis
So the problem here is that our dask.dataframe still had all of its data in the local python session.
It would have been much faster to use, say, the local threaded scheduler rather than the distributed scheduler. This should compute in milliseconds
with dask.set_options(get=dask.threaded.get): # no cluster, just local threads
print(len(logd)) # stays on client
But presumably you want to know how to scale out to larger datasets, so lets do this the right way.
Load your data on the workers
Instead of loading with Pandas on your client/local session, let the Dask workers load bits of the csv file. This way no client-worker communication is necessary.
# log = pd.read_csv('800000test', sep='\t') # on client
log = dd.read_csv('800000test', sep='\t') # on cluster workers
However, unlike pd.read_csv, dd.read_csv is lazy, so this should return almost immediately. We can force Dask to actually do the computation with the persist method
log = client.persist(log) # triggers computation asynchronously
Now the cluster kicks into action and loads your data directly in the workers. This is relatively fast. Note that this method returns immediately while work happens in the background. If you want to wait until it finishes, call wait.
from dask.distributed import wait
wait(log) # blocks until read is done
If you're testing with a small dataset and want to get more partitions, try changing the blocksize.
log = dd.read_csv(..., blocksize=1000000) # 1 MB blocks
Regardless, operations on log should now be fast
len(log) # fast
Edit
In response to a question on this blogpost here are the assumptions that we're making about where the file lives.
Generally when you provide a filename to dd.read_csv it assumes that that file is visible from all of the workers. This is true if you are using a network file system, or a global store like S3 or HDFS. If you are using a network file system then you will want to either use absolute paths (like /path/to/myfile.*.csv) or else ensure that your workers and client have the same working directory.
If this is not the case, and your data is only on your client machine, then you will have to load and scatter it out.
Simple but sub-optimal
The simple way is just to do what you did originally, but persist your dask.dataframe
log = pd.read_csv('800000test', sep='\t') # on client
logd = dd.from_pandas(log,npartitions=20) # still on client
logd = client.persist(logd) # moves to workers
This is fine, but results in slightly less-than-ideal communication.
Complex but optimal
Instead, you might scatter your data out to the cluster explicitly
[future] = client.scatter([log])
This gets into more complex API though, so I'll just point you to docs
http://distributed.readthedocs.io/en/latest/manage-computation.html
http://distributed.readthedocs.io/en/latest/memory.html
http://dask.pydata.org/en/latest/delayed-collections.html