I'm using send_task to send remote calls from a web server to a server that is running Celery. I'm trying to break down the tasks into smaller chunks and I know Celery has the chunks function.
However, I couldn't find any example of how this could be used in a remote call. The docs only show how it can be done when the Celery code resides in the same repository. Is this even possible in the first place? Any tips would help. Thanks!
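To make the question concrete: the only workaround I can think of is splitting the arguments by hand and issuing one send_task call per chunk, along these lines, where the task name, broker URL, and chunk size are placeholders:

from celery import Celery

# Broker URL and task name below are placeholders for this sketch.
app = Celery(broker='amqp://guest@celery-server//')

def send_in_chunks(items, chunk_size=100):
    # Split the argument list by hand and fire one remote call per chunk,
    # since the worker-side task is only known here by its registered name
    # and chunks() needs the actual task signature.
    for i in range(0, len(items), chunk_size):
        app.send_task('tasks.process_items', args=[items[i:i + chunk_size]])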
I have built a web server in Python using the Flask framework and psycopg2, and I have some questions about concurrent processing as it relates to databases and the server itself. I am using gunicorn to start my app with:
web: gunicorn app:app
From my understanding, a web server such as this processes requests one at a time. So, if someone makes a GET or POST request, the server must finish responding to that request before it can move on to another one. If this is the case, why would I need more than one connection/cursor object? For example, if someone makes a POST request that requires me to update something in the db, my server can't process other requests until I return from that POST endpoint anyway, so that single connection object isn't bottlenecking anything, is it?
Ultimately, I am trying to allow my server to process a large number of requests simultaneously. In order to do so, I think I would first have to run multiple instances of my server, and THEN the connection pool comes into play, right? To run multiple instances of my server (apologies if any terminology is used incorrectly here), I would do one of these things:
One way would be to use multiple threads; if the machine my application is deployed on in the cloud has multiple CPU cores, then it can do this(?). However, I have read that Python does not support "true multithreading", meaning a multithreaded program is not actually running in parallel, it's just switching back and forth between those threads really quickly, so would this really be any different from my current setup?
The second way would be to use multiple gunicorn workers, or multiple dynos. I think this route is the solution here, but I don't understand the theory behind setting it up at all. If I spawn additional gunicorn workers, what is happening behind the scenes? Would this still all run on my Heroku application instance? Does the number of cores I have access to on Heroku affect this in any way? Also, regardless of which way I pick, what would I be looking to change in the app.py code, or would the change be solely inside the Procfile?
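For context, the kind of Procfile change I'm imagining is just adding flags to the gunicorn command; the worker and thread counts here are only placeholders:

web: gunicorn app:app --workers 4 --threads 2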
Assuming I manage to set up multithreading or gunicorn workers, how would this affect the connection pool setup, and what should I do in regard to the connection pool? If anyone familiar with this can provide some theory, explanations, or resources, I would greatly appreciate it. Thanks all!
From my experience with Python, here's what I've learned...
If you are using multiple threads or async code, then you need to use a connection pool or an async connection (see the sketch after these notes).
If you have multiple processes and your code is strictly synchronous with no threads, then a pool is not necessary. You can reuse a single connection in each process, since connections are not shared between processes.
Threads don't usually speed up execution in Python, since Python will only ever run one thread at a time. They can still help, though, when threads spend their time blocking on I/O.
For web servers the true bottleneck is usually I/O, meaning connecting to the db, reading a file, or whatever. Multiple processes, each running async code, give the greatest performance. Starlette is an async version of Flask... kinda, and is usually much faster when set up properly with async libraries.
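If you do end up with threads (for example gunicorn --threads, or gevent), a psycopg2 pool looks roughly like this; it is only a sketch, and the DSN and pool sizes are placeholders:

from psycopg2 import pool

# DSN and pool sizes below are placeholders for this sketch.
db_pool = pool.ThreadedConnectionPool(
    minconn=1,
    maxconn=10,
    dsn='dbname=mydb user=myuser password=secret host=localhost',
)

def fetch_one(query, params=None):
    # Borrow a connection for the duration of the query, then hand it back
    # so other threads can reuse it instead of opening their own.
    conn = db_pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchone()
    finally:
        db_pool.putconn(conn)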
I'm building a distributed application with Django. Server A needs to execute a task locally with thousands of arguments, asynchronously, and then use the task results on server B as the arguments of a different task, also executed asynchronously. What is the best way to achieve that?
Considerations
There is no need to get the result of the remote task back in server A;
Server A is located in different areas than server B;
Performance is important.
One idea I wanted to explore is to POST the result of the local task to server B and execute the remote task asynchronously on server B with apply_async(). But thousands of API calls that all need to be authenticated will consume a lot of CPU, I assume.
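To make that idea concrete, the endpoint on server B would look roughly like this; it is only a sketch, the view and task names are made up, and authentication is omitted:

# Server B -- view and task names are placeholders; authentication omitted.
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

from .tasks import process_result  # the Celery task living on server B

@csrf_exempt
def receive_result(request):
    payload = json.loads(request.body)
    # Queue the follow-up task asynchronously and return immediately,
    # so server A is not kept waiting on the HTTP call.
    async_result = process_result.apply_async(args=[payload['result']])
    return JsonResponse({'task_id': async_result.id}, status=202)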
Would love to hear your thoughts,
Thanks
While working on an asynchronous task queue for a web server (built with Python and Flask), I was looking for a way to get the server to actually perform some work once a task update comes in. There is a function that can be used on the client side (celery.app.task.get), and one to send updates on the worker side (celery.app.task.update_state).
But this requires a result backend to be configured. That is not a problem per se, but I came across celery events (https://docs.celeryproject.org/en/stable/userguide/monitoring.html#real-time-processing).
This apparently allows the result backend to be omitted. On the worker side, it requires using the celery.app.task.send_event function.
I do not need to send the result of a task to the client (it is a file on a shared volume), or store it in a database, but I would like to receive progress updates (percentages) for the tasks. Is using the event system a good alternative to update_state()?
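In case it helps frame the question, this is roughly what I have in mind; the event type name and progress payload are made up, and I am assuming the worker is started with events enabled (the -E option):

from celery import Celery

app = Celery(broker='amqp://guest@localhost//')  # placeholder broker URL

# Worker side: publish a custom event instead of writing to a result backend.
@app.task(bind=True)
def long_job(self, items):
    total = len(items)
    for i, item in enumerate(items, start=1):
        # ... the real per-item work would happen here ...
        self.send_event('task-progress', progress=int(100 * i / total))

# Client side: consume the events in real time, no result backend needed.
def on_progress(event):
    print(event['uuid'], event.get('progress'))

def monitor():
    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            'task-progress': on_progress,
            '*': lambda event: None,  # ignore everything else
        })
        recv.capture(limit=None, timeout=None, wakeup=True)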
Imagine that I've written a Celery task and put the code on the server; however, when I want to send the task to the server, I need to reuse the code written before.
So my question is: are there any methods to separate the code between server and client?
Try a web server like Flask that forwards requests to the Celery workers, or a server that reads from a queue (SQS, AMQP, ...) and does the same.
No matter which solution you choose, you end up with two services: the Celery worker itself and the "server" that calls the Celery tasks. They both share the same code but are launched with different command lines.
Alternatively, if the task code is small enough, you could just include the git repository in your code and call it from there.
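If the client never needs the task code itself, it can also trigger tasks purely by their registered name with send_task; a minimal sketch, where the broker URL and task name are placeholders:

from celery import Celery

# The client only needs the broker URL and the task's registered name;
# both values below are placeholders.
client = Celery(broker='amqp://guest@celery-server//')

result = client.send_task('myproject.tasks.generate_report',
                          args=[2024], kwargs={'format': 'pdf'})
print(result.id)  # the task itself runs on the server that has the code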
I am working on a Django web app that has functions (say, for example, sync_files()) that take a long time to return. When I use gevent, my app does not block while sync_files() runs, and other clients can connect and interact with the web app just fine.
My goal is to keep the web app responsive to other clients and not block. I do not expect a zillion users to connect (perhaps 20 connections max), and I am not trying to build the next Twitter. My app runs on a VPS, so I need something lightweight.
So in the case described above, is it redundant to use Celery when I am using gevent? Is there a specific advantage to using Celery? I would prefer not to use it, since it is yet another service that will be running on my machine.
Edit: I found out that Celery can run its worker pool on gevent. I think I am now a little more unsure about the relationship between gevent and Celery.
In short, you do need Celery.
Even if you use gevent and have concurrency, the problem becomes the request timeout. Let's say your task takes 10 minutes to run, while a typical request timeout is around a minute. If you trigger the task directly within a view, the server will start processing it, but after a minute the client (browser) will probably drop the connection, since it will think the server is offline. As a result, your data can become corrupt, because you cannot be sure what will happen when the connection closes. Celery solves this by triggering a background process that handles the task independently of the view. The user gets the view response right away, and at the same time the server starts processing the task. That is the correct pattern for handling any scenario that requires a lot of processing.
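A minimal sketch of that pattern, assuming sync_files from the question above is registered as a Celery task (the view name is made up):

# views.py -- the view name is made up; sync_files is assumed to be a Celery task.
from django.http import JsonResponse

from .tasks import sync_files  # e.g. defined with @shared_task or @app.task

def start_sync(request):
    # Hand the slow work to the Celery worker and respond right away,
    # so the request never runs into the client's timeout.
    async_result = sync_files.delay()
    return JsonResponse({'task_id': async_result.id}, status=202)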