There's a lot of info out there, and honestly it's a bit too much to digest; I'm a bit lost.
My web app has to do some very resource-intensive tasks. The standard setup right now: the app on one server, static/media on another for hosting. What I would like to do is set up Celery so I can call task.delay for these resource-intensive tasks.
I'd like to dedicate the resources of entire separate servers to these resource intensive tasks.
Here's the question: how do I set up Celery so that the .delay calls from the app on my main server (where the app is hosted) are sent to these other servers?
Note: These functions will be kicking data back to the database / affecting models, so data integrity is important here. So, how does the data (assuming the above is possible...) get sent back to the database from the separate servers while preserving integrity?
Is this possible, and if so, where do I begin? Information overload.
If not, what should I be doing / what am I doing wrong?
The whole point of Celery is to work in exactly this way, i.e. as a distributed task queue. You can spin up workers on as many machines as you like, and the broker (e.g. RabbitMQ) will distribute the tasks among them as necessary.
I'm not sure what you're asking about data integrity, though. Data doesn't get "sent back" to the database; the workers connect directly to the database in exactly the same way as the rest of your Django code.
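A minimal sketch of that setup, assuming a RabbitMQ broker reachable from every machine (the broker URL and heavy_task are placeholders, not your actual code):

```python
# tasks.py -- shared by the web app and the worker machines.
# The broker URL and heavy_task are placeholder assumptions.
from celery import Celery

app = Celery("myapp", broker="amqp://user:password@broker-host:5672//")

@app.task
def heavy_task(record_id):
    # Resource-intensive work goes here; the worker writes its results
    # through the Django ORM just like any other Django code.
    ...
```

The web server only ever calls heavy_task.delay(record_id); on each dedicated machine you start a worker with `celery -A tasks worker`, and the broker hands queued tasks to whichever worker is free.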
I am currently planning a Django project which consists of two parts: the normal Django application, and an additional application which uses MQTT to read sensors. For better HTTP response times, I planned to receive the MQTT publish messages in an external process or thread and write them into the database used by Django. The sensor data in this database is then used whenever an HTTP request is made.
Do you guys have any better architectural solutions to my problem?
Best regards
Django background tasks could be a simple way of doing what you need.
It uses your existing database to run tasks in the background and is very easy to set up. It also supports scheduling repeated tasks.
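Assuming that refers to the django-background-tasks package, a minimal sketch looks like this (SensorReading is a hypothetical model):

```python
# A hedged sketch assuming the django-background-tasks package
# (pip install django-background-tasks); SensorReading is hypothetical.
from background_task import background

@background(schedule=5)  # run roughly 5 seconds after being queued
def store_sensor_reading(sensor_id, value):
    SensorReading.objects.create(sensor_id=sensor_id, value=value)
```

Calling store_sensor_reading(3, 21.5) only queues a row in the database; a separate `python manage.py process_tasks` process picks the task up and runs it, so your HTTP responses stay fast.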
I have a machine learning application which uses Flask to expose an API (for production this is not a good idea, but even if I use Django in the future, the idea of the question shouldn't change).
The main problem is how to serve multiple requests to my app. A few months back, Celery was added to get around this problem. The number of Celery workers spawned is equal to the number of cores on the machine. For very few users this looked fine, and it has been in production for some time.
When the number of concurrent users increased, it became evident that we should do performance testing on it. It turns out the app can handle 20 users on a 30 GB, 8-core machine, without authentication and without any front-end, which does not look like a good number.
I didn't know there were things like application servers, web servers, and model servers. When googling this problem, Gunicorn came up as a good application server for Python applications.
Should I use Gunicorn or any other application server along with Celery, and why?
If I remove Celery and only use Gunicorn with the application, can I achieve concurrency? I have read somewhere that Celery is not good for machine learning applications.
What are the purposes of Gunicorn and Celery? How can we get the best out of both?
Note: The main goal is to maximize concurrency. In production, authentication will be added, and a front-end application might come into play as well.
There is no shame in flask. If in fact you just need a web API wrapper, flask is probably a much better choice than django (simply because django is huge and you'd be using only a fraction of its capability).
However, your concurrency problems are apparently stemming from the fact that you are doing some heavy-duty processing for each request. There is simply no way around that; if you require a certain amount of computational resources per request, you can't magic those up. From here on, it's a juggling act.
If you want a guaranteed response immediately, you need to have as many workers as potential simultaneous requests. This may involve load balancing over multiple servers, if you can't scrounge up enough resources on one server. (cue gunicorn, a web application server, responsible for accepting connections and then distributing them to multiple application processes.)
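Since a gunicorn config file is itself Python, here is a minimal sketch (the worker count is a common rule of thumb, not a hard rule):

```python
# gunicorn.conf.py -- a minimal sketch; start with:
#   gunicorn -c gunicorn.conf.py app:app
import multiprocessing

bind = "0.0.0.0:8000"
workers = multiprocessing.cpu_count() * 2 + 1  # common rule of thumb
timeout = 120  # allow slow ML requests to finish before the worker is killed
```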
If you are okay with not getting an immediate response, you can let stuff queue up. (cue celery, a task queue, which worker processes can use to retrieve the next thing to be done, and deposit results). This works best if you don't need a response in the same request-response cycle; e.g. you submit a job from client, and they only get an acknowledgement that the job has been received; you would need a second request to ask about the status of the job, and possibly the results of the job if it is finished.
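A hedged sketch of that submit-then-poll pattern with Flask and Celery (predict, run_model, and the Redis URLs are assumptions, not your actual code):

```python
# A minimal sketch of the submit-then-poll pattern; predict, run_model,
# and the Redis URLs are placeholder assumptions.
from celery import Celery
from flask import Flask, jsonify, request

celery_app = Celery("ml", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")
flask_app = Flask(__name__)

@celery_app.task
def predict(features):
    return run_model(features)  # hypothetical heavy computation

@flask_app.route("/jobs", methods=["POST"])
def submit():
    job = predict.delay(request.get_json())
    return jsonify({"job_id": job.id}), 202  # acknowledgement only

@flask_app.route("/jobs/<job_id>")
def status(job_id):
    result = celery_app.AsyncResult(job_id)
    payload = {"state": result.state}
    if result.ready():
        payload["result"] = result.get()
    return jsonify(payload)
```

The client gets a job_id back immediately and asks /jobs/&lt;job_id&gt; later; no server process sits idle waiting for the model to finish.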
Alternatively, instead of Flask you could use websockets or Tornado to push the response out to the client when it is available (as opposed to the user polling for results, or waiting on a live HTTP connection and tying up a server process).
I'm creating a Django web app which features potentially very long running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard would it be to move the calculation engine to a different server later? It might be useful to have the calculation engine on its own server.
So my question is: is the approach above a good one, or what other options should I look at?
I think Celery is a good approach. I'm not sure whether you need Redis/RabbitMQ as a broker or could just use MySQL; it depends on your tasks. Celery workers can be run on different servers, since Celery supports distributed queues.
Another approach is to implement your own queue engine in Python, with the database as a broker and cron for job execution. But that would be a dirty way to do it, with a lot of pain and bugs.
So I think that Celery is the nicer way to do it.
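For the progress updates mentioned in the question, a Celery task can record its own state. A hedged sketch (run_simulation and do_one_step are hypothetical names, and the Redis URLs are placeholders):

```python
# A minimal sketch of progress reporting; run_simulation and do_one_step
# are hypothetical, and the Redis URLs are placeholders.
from celery import Celery

app = Celery("sims", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task(bind=True)
def run_simulation(self, params):
    total = 100
    for step in range(total):
        do_one_step(params, step)  # one slice of the simulation model
        self.update_state(state="PROGRESS",
                          meta={"percent": 100 * (step + 1) // total})
    return {"status": "complete"}
```

An AJAX view can poll run_simulation.AsyncResult(task_id) and read .state and .info to update the page; because results live in the result backend, they are still there if the user closes the browser and comes back the next day.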
If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.
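A minimal django-rq sketch, assuming a Redis add-on is provisioned and long_calculation is a placeholder for your simulation code:

```python
# A hedged django-rq sketch; long_calculation is a hypothetical function.
import django_rq
from django.http import JsonResponse

def start_calculation(request):
    queue = django_rq.get_queue("default")
    job = queue.enqueue(long_calculation, request.user.id)
    return JsonResponse({"job_id": job.id})
```

The jobs themselves are executed by a worker dyno running `python manage.py rqworker default`.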
I'm working on a distributed system where one process is controlling a piece of hardware, and I want it to run as a service. My app is Django + Twisted based, so Twisted maintains the main loop and I access the database (SQLite) through Django, the entry point being a Django management command.
On the other hand, for the user interface, I am writing a web application in the same Django project on the same database (also using Crossbar as the WebSocket and WAMP server). This is a second Django process accessing the same database.
I'm looking for some validation here. Is anything fundamentally wrong to this approach? I'm particularly scared of issues with database (two different processes accessing it via Django ORM).
Consider that Django, like all WSGI-based web servers, almost always has multiple processes accessing the database. Because a single WSGI process can handle only one connection at a time, it's normal for servers to run multiple processes in parallel when they get any significant amount of traffic.
That doesn't mean there's no cause for concern. You have to treat the database as if data might change between any two calls to it. Familiarize yourself with how Django uses transactions (the default is autocommit mode, not atomic requests), and …
and oh, you said SQLite. Yeah, SQLite is probably not the best database to use when you need to write to it from multiple processes. I can imagine that might work for a single-user interface to a piece of hardware, but if you run into any problems when adding the webapp, you'll want to trade up to a database server like PostgreSQL.
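To make the transaction advice concrete, here is a hedged Django sketch (DeviceState is a hypothetical model; note that select_for_update needs a backend with row locks, such as PostgreSQL):

```python
# A minimal sketch of guarding a read-modify-write across processes;
# DeviceState is a hypothetical model. select_for_update has no effect
# on SQLite, which is one more reason to move to PostgreSQL.
from django.db import transaction

def record_reading(device_id, value):
    with transaction.atomic():
        device = DeviceState.objects.select_for_update().get(pk=device_id)
        device.last_value = value
        device.save()  # committed (or rolled back) together with the lock
```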
No there is nothing inherently wrong with that approach. We currently use a similar approach for a lot of our work.
I am developing a web app with Flask, Python, SQLAlchemy, and PostgreSQL.
My question here is regarding concurrency handling in this app.
How I wrote the app:
I'll take the example of adding a user to the database. I post the form and a view is called. I process all the form data and then call add_user(*args), which uses SQLAlchemy to insert the user into the database and returns on successful execution, and then I return the response from the view.
What I assumed:
OK, now I assumed that my web server (which I have not decided on yet) will spawn a thread or a process if two users are trying to sign up at the same time, and will handle all the concurrency requirements.
Do I need to write threaded code here? By threaded code I mean that I acquire a lock before writing and release it after the write.
I am pretty new to web development and multithreading/multiprocessing programming, and would like some guidance on how to write a web app which can handle concurrency well.
Is writing concurrency handling from the start the right approach, or should that concern only come up once a large number of concurrent users are using the web app? Even if it should be done later, I would like some pointers about it.
Basically, I have no idea about the concurrency part of web app development. If you can point me to resources where I can learn more about it, that would be really helpful.
Flask will execute each request in a separate thread or even in separate processes. The number of threads and processes to spawn is determined by the WSGI server (for example, Apache with mod_wsgi).
If you use SQLAlchemy ScopedSessions, the session is perfectly thread-safe. You must not share ORM-controlled objects across threads (but in the large majority of cases, you won't let your objects live longer than a request anyway so this is usually not a concern).
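A minimal sketch of that setup (the engine URL is a placeholder):

```python
# A hedged sketch of a thread-local scoped session in Flask;
# the engine URL is a placeholder.
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

app = Flask(__name__)
engine = create_engine("postgresql://user:password@localhost/mydb")
Session = scoped_session(sessionmaker(bind=engine))  # one session per thread

@app.teardown_appcontext
def remove_session(exc=None):
    Session.remove()  # hand the session back so the next request starts clean
```

Each request thread transparently gets its own session from Session, and removing it at teardown keeps ORM objects from leaking across requests.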
In other words, as long as you don't intend to share state between requests other than through the database or cookies, you don't need to worry about concurrency issues. You don't need to create a lock for writing to the database.
If you create your own long-lived objects within your application, which you most likely don't need to do, and if those objects communicate or share state with request handling code, then you must take appropriate precautions to avoid synchronization issues (race conditions, deadlocks, use of libraries that are not thread-safe, etc.)