Flask / Python Queue for multiple (parallel) users - python

I need help with a Python Queue for multiple (parallel) users.
I am building a web app, based on Flask, that processes a relatively large amount of data. The processing works fine in parallel threads, but I need percentage information about the progress (for a progress bar).
I built the percentage reporting on a Python Queue and it works great, but of course, when two (or more) users start processing, the queues get mixed up.
Is it possible to use parallel queues, or something like a separate queue per user?
I tried using:
queues = {}
queues['user.id'] = Queue()
but I don't think that is best practice for this situation.
Thanks for any answers.
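For illustration only, a minimal sketch of the per-user-queue idea described above (names such as heavy_processing are made up; it assumes each user can be identified by a user_id from the URL or session):
# Minimal sketch of one progress Queue per user (hypothetical names).
from queue import Queue
from threading import Thread
from flask import Flask, jsonify

app = Flask(__name__)
progress_queues = {}  # one Queue per user id

def heavy_processing(user_id, items):
    q = progress_queues[user_id]
    for i, item in enumerate(items, start=1):
        ...  # process one item
        q.put(int(i * 100 / len(items)))  # report percentage so far

@app.route('/start/<user_id>')
def start(user_id):
    progress_queues[user_id] = Queue()
    Thread(target=heavy_processing, args=(user_id, list(range(50)))).start()
    return 'started'

@app.route('/progress/<user_id>')
def progress(user_id):
    q = progress_queues.get(user_id)
    percent = None
    while q is not None and not q.empty():
        percent = q.get()  # drain to the most recent value
    return jsonify(percent=percent)
Keyed this way, each user's progress values stay separate; the open question from the post is whether a plain dict like this is robust enough once multiple worker processes are involved.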

Related

Best way to create a queue for handling requests to a REST API created via Django

I have the following scenario:
Back end => a geospatial database and the Open Data Cube tool
API => users can define parameters (xmin, xmax, ymin, ymax) to make GET requests
Process => on each request, analytics are calculated and satellite image pixel values are returned to the user
My question is the following: as the process is quite heavy (it can reserve many GB of RAM), how is it possible to handle multiple requests at the same time? Is there a queue in which I can save the requests and serve each one sequentially?
Language/frameworks => Python 3.8 and Django
Thanks in advance
Celery + Rabbitmq/Redis is probably what you need.
In this configuration, your heavy processes become "tasks". When called with .delay(), they go into the queue and are no longer handled by your main process.
You might want to check the tutorial:
https://docs.celeryproject.org/en/stable/django/first-steps-with-django.html
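To make the idea concrete, here is a rough sketch of such a task (hypothetical names; it assumes a Celery app configured for the Django project as in the linked tutorial, with RabbitMQ or Redis as the broker):
# tasks.py -- sketch only; compute_analytics and its body are placeholders.
from celery import shared_task

@shared_task
def compute_analytics(xmin, xmax, ymin, ymax):
    # the heavy geospatial work runs here, in a worker process,
    # not in the web process that handled the GET request
    return {"xmin": xmin, "xmax": xmax, "ymin": ymin, "ymax": ymax}

# In the Django view, enqueue instead of blocking:
#   result = compute_analytics.delay(xmin, xmax, ymin, ymax)
#   return JsonResponse({"task_id": result.id})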
There are many asynchronous message queueing technologies that allow you to do this, lots of which have Python APIs too.
You probably want to use request-response messaging, to correlate the requests you get with the replies you want to send.
A message queueing technology will allow you to take the requests, store them on a queue, and have your server handle them when it's ready. Storing requests on a queue means that they won't get lost. This also allows your application to scale - as more requests come in, they can be dealt with by multiple application instances and still return only one result!
The answer above recommends Celery, which is a great choice for this kind of project. Depending on your requirements, you can also use pymqi (https://dsuch.github.io/pymqi/examples.html) or, if you need more heavy-duty technologies, ZeroMQ (see ZeroMQ - Multiple Publishers and Listener for an example of a request-response pattern).
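As a rough illustration of the request-response idea with ZeroMQ (a sketch using pyzmq; the endpoint and message format are made up, not taken from the answers above):
# Sketch of a ZeroMQ request-response worker (pyzmq); placeholders throughout.
import zmq

context = zmq.Context()
socket = context.socket(zmq.REP)          # reply side of a REQ/REP pair
socket.bind("tcp://*:5555")

while True:
    request = socket.recv_json()          # e.g. {"xmin": ..., "xmax": ...}
    result = {"status": "done", "echo": request}  # stand-in for real work
    socket.send_json(result)              # reply is correlated with the request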

Daemon background tasks on flask (uwsgi) application

Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context and also my first thoughts on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, in which users perform actions that should trigger specific tasks from the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but, since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots"; each one will take over 2k documents to be classified, and each document will be uploaded at an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I'm trying to use this uWSGI attach-daemon approach.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores when each worker needs a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little bit of time learning Celery for saving having to write a bunch of coordination code.
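A rough sketch of that layout (names, URLs and the model path are assumptions; it presumes a configured broker/backend and a pickled scikit-learn model on disk):
# tasks.py -- sketch only; broker URLs, model path and predict() are placeholders.
import pickle
from celery import Celery

app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

_model = None  # loaded once per worker process, not once per request

def get_model():
    global _model
    if _model is None:                      # lazy-load the expensive object
        with open('model.pkl', 'rb') as f:
            _model = pickle.load(f)
    return _model

@app.task
def predict(features):
    return get_model().predict([features]).tolist()

# In the Flask endpoint:
#   task = predict.delay(features)
#   result = task.get(timeout=30)
With N workers running, the model is loaded at most once per worker process and then stays resident between requests.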

How to avoid pickling Celery task?

My scenario is as follows: I have a large machine learning model, which is computed by a bunch of workers. In essence, workers compute their own part of the model and then exchange the results in order to maintain a globally consistent model state.
So, every Celery task computes its own part of the job.
But this means that tasks aren't stateless, and here is my trouble: if I say some_task.delay(123, 456), in reality I'm NOT sending just two integers here!
I'm sending the whole state of the task, which is pickled somewhere inside Celery. This state is typically about 200 MB :-((
I know that it's possible to select a decent serializer in Celery, but my question is how NOT to pickle just ANY data that happens to be in the task.
How do I pickle the arguments of the task only?
Here is an excerpt from celery/app/task.py:
def __reduce__(self):
    # - tasks are pickled into the name of the task only, and the reciever
    # - simply grabs it from the local registry.
    # - in later versions the module of the task is also included,
    # - and the receiving side tries to import that module so that
    # - it will work even if the task has not been registered.
    mod = type(self).__module__
    mod = mod if mod and mod in sys.modules else None
    return (_unpickle_task_v2, (self.name, mod), None)
I simply don't want this to happen.
Is there a simple way around it, or am I just forced to build my own Celery (which is ugly to imagine)?
Don't use the celery results backend for this. Use a separate data store.
While you could just use Task.ignore_result, this would mean that you lose the ability to track the task's status etc.
The best solution would be to use one storage engine (e.g. Redis) for your results backend.
You should set up a separate storage engine (a separate instance of Redis, or maybe something like MongoDB, depending on your needs) to store the actual data.
In this way you can still see the status of your tasks but the large data sets do not affect the operation of celery.
Switching to the JSON serializer may reduce the serialization overhead, depending on the format of the data you generate. However, it can't solve the underlying problem of putting too much data through the results backend.
The results backend can handle relatively small amounts of data; once you go over a certain limit, you start to interfere with its primary job, which is communicating task status.
I would suggest updating your tasks so that they return a lightweight data structure containing useful metadata (to e.g. facilitate co-ordination between tasks), and storing the "real" data in a dedicated storage solution.
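A sketch of that split (hypothetical names; it assumes a second Redis instance, separate from the one used as the results backend):
# Sketch: keep the results backend light, push the heavy payload elsewhere.
# The broker/backend URLs and the second Redis instance are assumptions.
import pickle
import redis
from celery import Celery

app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')
data_store = redis.Redis(host='localhost', port=6380)   # separate data store

@app.task
def compute_partial_model(part_id):
    partial = {"weights": [0.0] * 10000}     # stand-in for the ~200 MB state
    key = "model:part:%s" % part_id
    data_store.set(key, pickle.dumps(partial))
    # only small metadata travels through the results backend
    return {"part_id": part_id, "key": key, "status": "ok"}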
You have to set ignore_result on your task, as the docs say:
Task.ignore_result
Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.
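In code, that is just (a sketch, assuming an existing Celery app object named app):
@app.task(ignore_result=True)
def exchange_state(chunk):
    ...  # do the work; nothing is written to the results backend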
This is a little off-topic, but still.
As I understand it, this is what's happening: you have several processes doing heavy calculations in parallel with inter-process communication. So, instead of Celery, which isn't a good fit in your case, you could:
use zmq for inter-process communication, to send only the necessary data (see the sketch after this list),
use supervisor for managing and running the processes (numprocs in particular will help with running multiple identical workers).
While this will not require writing your own Celery, some code will still need to be written.
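For the zmq part, a rough sketch of passing only the needed data between processes (PUSH/PULL here; the endpoint and message contents are placeholders, not from the answer above):
# Sketch of zmq PUSH/PULL between a producer and worker processes.
import zmq

def producer():
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://127.0.0.1:5557")
    for chunk_id in range(100):
        push.send_json({"chunk_id": chunk_id})   # send only what is needed

def worker():
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.connect("tcp://127.0.0.1:5557")
    while True:
        msg = pull.recv_json()
        # ... heavy calculation on msg["chunk_id"] ...
supervisor's numprocs would then start several copies of the worker process.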

Suggestions on distributing python data/code over worker nodes?

I'm starting to venture into distributed code and am having trouble figuring out which solution fits my needs based on all the stuff out there. Basically I have a Python list of data that I need to process with a single function. This function has a few nested for loops but doesn't take too long (about a minute) for each item on the list. My problem is that the list is very large (3,000+ items). I'm looking at multiprocessing, but I think I want to experiment with processing it across multiple servers (because ideally, if the data gets larger, I want to have the option of adding more servers during the job to make it run quicker).
I'm basically looking for something I can distribute this data list through (and, while not strictly needed, it would be nice if I could distribute my code base through it as well).
So my question is, what package can I use to achieve this? My database is HBase, so I already have Hadoop running (I've never used Hadoop directly, though; I'm just using it for the database). I looked at Celery and Twisted as well, but I'm confused about which one fits my needs.
Any suggestions?
I would highly recommend celery. You can define a task that operates on a single item of your list:
from celery.task import task

@task
def process(i):
    # do something with i
    i += 1
    # return a result
    return i
You can easily parallelize a list like this:
results = []
todo = [1, 2, 3, 4, 5]
for arg in todo:
    res = process.apply_async(args=(arg,))
    results.append(res)

all_results = [res.get() for res in results]
This is easily scalable by just adding more celery workers.
Check out RabbitMQ. Python bindings are available through pika. Start with a simple work queue and run a few RPC calls.
It may seem troublesome to experiment with distributed computing in Python using an external engine like RabbitMQ (there's a small learning curve for installing and configuring it), but you may find it even more useful later.
... and Celery can work hand-in-hand with RabbitMQ; check out Robert Pogorzelski's tutorial, Simple distributed tasks with Celery and RabbitMQ.
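A minimal sketch of publishing work to such a queue with pika (the queue name and message body are made up):
# Sketch: publish a job to a RabbitMQ work queue with pika.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='work_queue', durable=True)
channel.basic_publish(
    exchange='',
    routing_key='work_queue',
    body='{"item_id": 42}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
A consumer process on each server would then read from work_queue and run the per-item function.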

Simple non-network concurrency with Twisted

I have a problem with using Twisted for simple concurrency in Python. The problem is that I don't know how to do it, and all the online resources are about Twisted's networking abilities. So I am turning to the SO gurus for some guidance.
Python 2.5 is used.
A simplified version of my problem runs as follows:
1. A bunch of scientific data
2. A function that munches on the data and creates output
3. ??? <- here enters concurrency; it takes chunks of data from 1 and feeds them to 2
4. Output from 3 is joined and stored
My guess is that the Twisted reactor can do job number three. But how?
Thanks a lot for any help and suggestions.
Update 1:
Simple example code. I have no idea how the reactor deals with processes, so I have given it imaginary functions:
datum = 'abcdefg'

def dataServer(data):
    for char in data:
        yield char

def dataWorker(char):
    return ord(char)

# the reactor API below is imaginary, as noted above
r = reactor()
NUMBER_OF_PROCESSES_AV = 4
serv = dataServer(datum)
id = 0
result = array(len(datum))

while r.working():
    if NUMBER_OF_PROCESSES_AV > 0:
        r.addTask(dataWorker(serv.next()), id)
        NUMBER_OF_PROCESSES_AV -= 1
        id += 1
    for pr, id in r.finishedProcesses():
        result[id] = pr
As Jean-Paul said, Twisted is great for coordinating multiple processes. However, if you don't specifically need Twisted and simply need a distributed processing pool, there are possibly better-suited tools out there.
One I can think of which hasn't been mentioned is Celery. Celery is a distributed task queue: you set up a queue of tasks backed by a DB, Redis or RabbitMQ (you can choose from a number of free software options) and write a number of compute tasks. These can be arbitrary scientific-computing-type tasks. Tasks can spawn subtasks (implementing the "joining" step you mention above). You then start as many workers as you need and compute away.
I'm a heavy user of Twisted and Celery, so in any case, both options are good.
To actually compute things concurrently, you'll probably need to employ multiple Python processes. A single Python process can interleave calculations, but it won't execute them in parallel (with a few exceptions).
Twisted is a good way to coordinate these multiple processes and collect their results. One library oriented towards solving this task is Ampoule. You can find more information about Ampoule on its Launchpad page: https://launchpad.net/ampoule.
Do you need Twisted at all?
From your description of the problem I'd say that multiprocessing would fit the bill. Create a number of Process objects that are given a reference to a single Queue instance. Get them to start their work and put their results on the Queue. Just use blocking get()s to read the results.
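A sketch of that approach with the standard library (munch and the sample data are placeholders standing in for the question's function and dataset):
# Sketch: worker Processes sharing a single results Queue.
from multiprocessing import Process, Queue

def munch(chunk, results):
    results.put((chunk, sum(ord(c) for c in chunk)))   # stand-in computation

if __name__ == '__main__':
    data = ['abc', 'defg', 'hij', 'klmno']
    results = Queue()
    procs = [Process(target=munch, args=(chunk, results)) for chunk in data]
    for p in procs:
        p.start()
    joined = [results.get() for _ in data]   # blocking get()s collect results
    for p in procs:
        p.join()
    print(joined)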
It seems to me that you are misunderstanding the fundamentals of how Twisted operates. I recommend you give the Twisted Intro a shot by Dave Peticolas. It has been a great help to me, and I've been using Twisted for years!
HINT: Everything in Twisted relies on the reactor!
(source: krondo.com)
