Store references to Dask Futures in a Flask/FastAPI Application - python

I'm building a FastAPI application that has an endpoint to trigger a Dask computation. The API endpoint sends this call to the Dask scheduler and just returns the key of the Future.
trigger:
x = client.submit(function_name, arg1, arg2)
return x.key
I have two other endpoints to retrieve the status and result of the task, which take the key as input.
status:
status = Future(key=key, client=client).status
return status
result:
result = Future(key=key, client=client).result()
return result
Of course, this way I lose the reference to the Future after trigger returns, at which point Dask no longer computes it. So even though the key is returned to the client, the status stays pending forever.
What I'm doing now is storing references to the Future objects in a Python dictionary in the application, which works. But ideally, I would like my API application to be stateless. What would be a good way to store these Futures outside the application? Are there good caching libraries in Python that can store Python objects (with references)?
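For reference, here is a minimal sketch of the in-process dictionary approach I'm using now (the scheduler address, function and arguments are placeholders):

from dask.distributed import Client
from fastapi import FastAPI

app = FastAPI()
client = Client('tcp://localhost:8786')   # placeholder scheduler address
futures = {}                              # module-level state, keyed by future key

def my_function(a, b):                    # placeholder for the real computation
    return a + b

@app.post('/trigger')
def trigger():
    fut = client.submit(my_function, 1, 2)
    futures[fut.key] = fut                # keep a reference so Dask keeps the task alive
    return {'key': fut.key}

@app.get('/status/{key}')
def status(key: str):
    return {'status': futures[key].status}

@app.get('/result/{key}')
def result(key: str):
    return {'result': futures[key].result()}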

Try using Flask + Celery to handle the background Dask computation. Below are a few links for reference:
https://flask.palletsprojects.com/en/1.1.x/patterns/celery/
https://blog.miguelgrinberg.com/post/using-celery-with-flask
https://medium.com/@frassetto.stefano/flask-celery-howto-d106958a15fe
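For a rough idea of what that looks like, here is a minimal Flask + Celery sketch (the broker URL and task body are assumptions, not taken from the links above):

from celery import Celery
from celery.result import AsyncResult
from flask import Flask

app = Flask(__name__)
celery_app = Celery(__name__, broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/0')

@celery_app.task
def run_computation(arg1, arg2):
    # the Dask (or other long-running) computation would run here, inside the worker
    return arg1 + arg2

@app.route('/trigger')
def trigger():
    task = run_computation.delay(1, 2)
    return {'task_id': task.id}

@app.route('/result/<task_id>')
def result(task_id):
    res = AsyncResult(task_id, app=celery_app)
    return {'status': res.status, 'result': res.result if res.ready() else None}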

Related

relation between regular Dask and dask.distributed

I don't understand the relation between regular Dask and dask.distributed.
With dask.distributed, e.g. using the Futures interface, I have to explicitly create a client, which is backed by a local or remote cluster, and then submit to it using client.submit().
With regular Dask, e.g. using the Delayed interface, I just use delayed() on my functions.
How does delayed (or compute) determine where my computation takes place? There must be some global state behind it – but how would I access it? If I understand correctly, delayed uses a dask.distributed client if it exists. Does it use something like
client = None
try:
    client = Client.current()
except ValueError:
    pass

if client is not None:
    ...  # use client
else:
    ...  # use default scheduler
If so, why not use the same logic for submit?
client = None
try:
    client = Client.current()
except ValueError:
    pass

if client is not None:
    ...  # use client
else:
    ...  # fail because futures don't work on the default scheduler
And finally, delayed objects and future objects appear very similar. Why can the first use both a dask.distributed client and the default scheduler, while futures need dask.distributed?
Yes, there is some global state that assigns a current client
https://github.com/dask/distributed/blob/f3f4bffea0640c01fc54f49c3219cf5807d14c66/distributed/client.py#L93
If you call the compute method on a delayed object, you'll end up using the current client.
Dask delayed is just syntactic sugar that builds up a computation graph. When you call compute, the graph is dispatched via the distributed client.
A future refers to a remote result on a cluster that may not be computed yet. A delayed object, by contrast, hasn't been submitted to the cluster:
@delayed
def func(x):
    return x

a = func(1)
In this case, a is a delayed object. That task hasn't been queued on the cluster at all.
future = client.compute(a, sync=False)
You get a future after the task has been submitted to the cluster.
Dask has multiple backends. If you don't specify one, everything runs on a local cluster with as many processes as you have cores in your CPU. When defining a cluster (local, Kubernetes, HPC, Spark) you can specify exactly what you want. However, there is no difference in what the client sees, only where and how the work is executed.
All futures are executed on your backend as you send them, but you have to wait for the result to return. In the meantime you can do other stuff on the client. When a future is finished, you can fetch the result with .result(). I haven't worked with the futures API as much, but it should behave like Python's concurrent.futures. This is also probably why you have to start a client beforehand: Dask wants to mirror that API as closely as possible.
More information here.
The delayed, dataframe and array APIs only send the computation to the backend after you call .compute(). You then have to wait for the result to return and can't do anything in between.
More information here.
A future cannot be used on a local machine (without a local cluster), since it triggers computation right away, so any further calculations in the same code will be blocked. delayed allows you to postpone computation until the DAG is formed. So delayed can be run on a single machine with or without a cluster.
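To make the contrast concrete, a minimal sketch (assuming dask and dask.distributed are installed; inc is just a placeholder function):

import dask
from dask.distributed import Client

def inc(x):
    return x + 1

# delayed: builds a task graph and runs on the default local scheduler at compute()
lazy = dask.delayed(inc)(1)
print(lazy.compute())        # 2

# futures: need a client backed by a cluster (here a LocalCluster started implicitly)
client = Client()
future = client.submit(inc, 1)
print(future.result())       # 2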

Serialising a ray future and returning the results later

I've wrapped Ray in a web API (using ray start --head and uvicorn with ray.init). Now I'm trying to:
Submit a job to Ray (via the API) and serialise the future and return to the user
Later, let the user hit an API to see if the task is finished and return the results
The kicker is that, since this is multi-threaded, I can't guarantee that the next call will come from the same thread. Here is what I naively thought would work:
id = my_function.remote()
id_hex = id.hex()
Then in another request/invocation:
id = ray._raylet.ObjectID(binascii.unhexlify(id_hex))
ray.get(id)
Now this never gets finished (it times out) even though I know the future is finished and that the code works if I run it in the same thread as the original request.
I'm guessing this has to do with using another initialisation of Ray.
Is there any way to force Ray to "refresh" a future's result from Redis?
Reconstructing an ObjectID directly in this way can cause unexpected behavior because of Ray's reference counting / optimization mechanisms. One recommendation is to use a "detached actor". You can create a detached actor and delegate the calls to it. Detached actors survive for the lifetime of the Ray cluster (unless you kill them), so you don't need to worry about the problems you mentioned. It can make the program a bit slower as it requires two hops, but I guess this overhead wouldn't matter for your workload (a client submission model).
https://docs.ray.io/en/latest/advanced.html?highlight=detached#detached-actors
@ray.remote
class TaskInvocator:
    def __init__(self):
        self.futures = {}

    def your_function(self):
        object_id = real_function.remote()  # real_function is the actual remote workload
        self.futures[object_id.hex()] = object_id
        return object_id.hex()

    def get_result(self, hex_object_id):
        return ray.get(self.futures[hex_object_id])

TaskInvocator.remote(detached=True, name='invocator-1')
task_invocator = ray.util.get_actor('invocator-1')
hex_id = ray.get(task_invocator.your_function.remote())
# blah blah...
result = ray.get(task_invocator.get_result.remote(hex_id))
Serializing a ray future to a string out-of-band and then deserializing it is not supported. The reason for this is because these futures are more than just IDs, they have a lot of state associated with them in various system components.
One thing you could do to support this type of API is have an actor that manages the lifetime of these tasks. When you start the task, you pass its ObjectID to the actor. Then, when a user hits the endpoint to check if it's finished, it pings the actor which looks up the corresponding ObjectID and calls ray.wait() on it.
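A rough sketch of that pattern, assuming a recent Ray version (ObjectRef-based) and illustrative names such as TaskManager and slow_task:

import ray

ray.init(address="auto")  # assumes a cluster started with `ray start --head`

@ray.remote
def slow_task(x):
    return x * 2

@ray.remote
class TaskManager:
    def __init__(self):
        self.tasks = {}  # hex id -> ObjectRef, kept alive by the actor

    def submit(self, x):
        ref = slow_task.remote(x)
        self.tasks[ref.hex()] = ref
        return ref.hex()

    def is_finished(self, hex_id):
        ready, _ = ray.wait([self.tasks[hex_id]], timeout=0)
        return len(ready) == 1

    def result(self, hex_id):
        return ray.get(self.tasks[hex_id])

manager = TaskManager.remote()
hex_id = ray.get(manager.submit.remote(21))
print(ray.get(manager.is_finished.remote(hex_id)))  # True or False
print(ray.get(manager.result.remote(hex_id)))       # 42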

How can I asynchronously receive processed messages in Celery?

I'm writing a data processing pipeline using Celery because this speeds things up considerably.
Consider the following pseudo-code:
from celery.result import ResultSet
from some_celery_app import processing_task  # of type @app.task

def crunch_data():
    results = ResultSet([])
    for document in mongo.find():  # around 100K - 1M documents
        job = processing_task.delay(document)
        results.add(job)
    return results.get()

collected_data = crunch_data()
# Do some stuff with this collected data
I successfully spawn four workers with concurrency enabled and when I run this script, the data is processed accordingly and I can do whatever I want.
I'm using RabbitMQ as message broker and rpc as backend.
What I see when I open the RabbitMQ management UI:
First, all the documents are processed
then, and only then, are the documents retrieved by the collective results.get() call.
My question: Is there a way to do the processing and subsequent retrieval simultaneously? In my case, as all documents are atomic entities that do not rely on each other, there seems to be no need to wait for the job to be processed completely.
You could try the callback parameter of ResultSet.get(), e.g. results.get(callback=cbResult), and process each result in the callback as it arrives:
def cbResult(task_id, value):
    print(value)

results.get(callback=cbResult)

How to circumvent Django's req/resp cycle when updating it's internal state

I have a Django application that uses large in-memory data structures (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the Python web process takes more than 30s to start, it is stopped because it's considered a timeout error. Because of the aforementioned problem, I've used a daemon process (a worker in Heroku) to handle the construction of the data structures and Redis to handle the message passing between processes.
When the worker finishes (approx. 1 minute), it stores the data structures (50 MB or so) in Redis.
And now comes the crux of the matter... Django follows the request/response paradigm and is synchronous. This implies a Django view would have to exist just to handle the callback from the worker announcing it's done. Even if I use something fancier like Redis pub/sub, I'm still forced to evaluate the queue populated by a publisher inside a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py

...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()

# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()
pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)

@require_http_methods(['POST'])
def store_and_fetch(request):
    user_data = json.loads(request.body.decode('utf8'))
    message = pub_sub.get_message()
    if message:
        command = message['data'] if 'data' in message else ''
        if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
            # this takes the data from redis and updates data_structures
            data_handler.update(data_structures)
    return HttpResponse(compute_response(user_data, data_structures))
Update: After working with this for several months, I can now say it's definitely better (and wiser) NOT to fiddle with Django's request/response cycle. There are things like Django RQ Scheduler or Celery that can handle async tasks just fine. If you want to update the main web process after some repeatable job completes, it's simpler to use something like the Python requests package and send a POST to the web process from the worker that ran the scheduled job. That way we don't circumvent Django's mechanisms, and more importantly, it's simpler to do overall.
Regarding the Heroku constraints I mentioned at the beginning of the post: at the time I wrote this question I was quite a newbie with Heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. Thus, at the end of the release phase, we simply need to notify the web process in the manner described above and use some distributed memory buffer (even Redis will work just fine).
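For illustration, a minimal sketch of that notification, where /data-ready/ is a hypothetical endpoint exposed by the Django web process:

# runs in the worker (or at the end of the release phase), after the
# data structures have been stored in Redis
import requests

def notify_web_process():
    requests.post(
        'https://my-app.herokuapp.com/data-ready/',  # hypothetical URL
        json={'status': 'finished', 'key': 'data_structures'},
        timeout=10,
    )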

Is reading a global collections.deque from within a Flask request safe?

I have a Flask application that is supposed to display the result of a long-running function to the user on a specified route. The result changes about every hour or so. To avoid making the user wait for it, I want the result cached somewhere in the application and re-computed at specific intervals in the background (e.g. every hour), so that no user request ever has to wait for the long-running computation.
The idea I came up with to solve this is as follows; however, I am not completely sure whether this is really "safe" to do in a production environment with a multi-threaded or even multi-processed webserver such as waitress, eventlet, gunicorn or the like.
To re-compute the result in the background, I use a BackgroundScheduler from the APScheduler library.
The result is then left-appended to a collections.deque object, which is registered as a module-wide variable (since there is no better way to store application-wide globals in a Flask application, as far as I know?!). Since the maximum size of the deque is set to 2, old results pop out on the right side of the deque as new ones come in.
A Flask view now returns deque[0] to the requester, which should always be the newest result. I went with deque over Queue since the latter has no built-in way to read the first item without removing it.
Thus, it is guaranteed that no user ever has to wait for the result, because the old one only disappears from the "cache" at the very moment the new one comes in.
See below for a minimal example of this. When running the script and hitting http://localhost:5000, one can see the caching in action: "Job finished at" should never lag "Current time" by more than 10 seconds plus a very short re-computation time, yet a request should never have to wait the time.sleep(5) seconds from the job function before it returns.
Is this a valid implementation for the given requirement that will also work in a production-ready WSGI server setting or should this be accomplished differently?
from flask import Flask
from apscheduler.schedulers.background import BackgroundScheduler
import time
import datetime
from collections import deque

# a global deque that is filled by APScheduler and read by a Flask view
deque = deque(maxlen=2)

# a function filling the deque that is executed in regular intervals by APScheduler
def some_long_running_job():
    print('complicated long running job started...')
    time.sleep(5)
    job_finished_at = datetime.datetime.now()
    deque.appendleft(job_finished_at)

# a function setting up the scheduler
def start_scheduler():
    scheduler = BackgroundScheduler()
    scheduler.add_job(some_long_running_job,
                      trigger='interval',
                      seconds=10,
                      next_run_time=datetime.datetime.utcnow(),
                      id='1',
                      name='Some Job name')
    scheduler.start()

# a flask application
app = Flask(__name__)

# a flask route returning an item from the global deque
@app.route('/')
def display_job_result():
    current_time = datetime.datetime.now()
    job_finished_at = deque[0]
    return '''
    Current time is: {0} <br>
    Job finished at: {1}
    '''.format(current_time, job_finished_at)

# start the scheduler and flask server
if __name__ == '__main__':
    start_scheduler()
    app.run()
Thread-safety is not enough if you run multiple processes:
Even though collections.deque is thread-safe:
Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction.
Source: https://docs.python.org/3/library/collections.html#collections.deque
Depending on your configuration, your webserver might run multiple workers in multiple processes, so each of those processes has their own instance of the object.
Even with one worker, thread-safety might not be enough:
You might have selected an asynchronous worker type. The asynchronous worker won't know when it's safe to yield and your code would have to be protected against scenarios like this:
Worker for request 1 reads value a and yields
Worker for request 2 also reads value a, writes a + 1 and yields
Worker for request 1 writes value a + 1, even though it should be a + 1 + 1
Possible solutions:
Use something outside of the Flask app to store the data. This can be a database, in this case preferably an in-memory database like Redis. Or if your worker type is compatible with the multiprocessing module, you can try to use multiprocessing.managers.BaseManager to provide your Python object to all worker processes.
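As a minimal sketch of the Redis variant (assuming a local Redis instance and a hypothetical key latest_result), the scheduled job writes to Redis and every worker process reads the same value:

import datetime
import redis
from flask import Flask

app = Flask(__name__)
r = redis.Redis(host='localhost', port=6379)  # assumed local Redis instance

def some_long_running_job():
    # executed by the scheduler; the value is visible to all worker processes
    r.set('latest_result', datetime.datetime.now().isoformat())

@app.route('/')
def display_job_result():
    value = r.get('latest_result')
    return 'Job finished at: {}'.format(value.decode() if value else 'not computed yet')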
Are global variables thread safe in flask? How do I share data between requests?
How can I provide shared state to my Flask app with multiple workers without depending on additional software?
Store large data or a service connection per Flask session
