I have the following code (simplified version)
for data in my_data_array:
res = api_request(data)
#write result to db
These request may take some time and there are a lot of them. How can I make each iteration of loop asynchronous and send progress with percentage of completed requests to the front-end with Django.
If I have to use Tornado or Celery, please give me the links with information how to integrate Django with them.
You will need Celery (or other async task queue). To integrate it with Django, see http://celery.readthedocs.org/en/latest/django/first-steps-with-django.html. I recommend to use Celery with Redis, because Redis is often used as a cache, so you don't need to install another backend for Celery (mostly RabbitMQ).
To get the progress bar, count total number of tasks (len(my_data_array)), store the value in cache (e.g. key total_count) and add the second key (e.g. complete_count) with zero value. In every task that completes, increase the complete_count value.
Last step is to query the status. It is just a simple view that loads these two values from cache and returns to the user (html/json).
Related
For this question, I'm particularly struggling with how to structure this:
User accesses website
User clicks button
Value x in database increments
My issue is that multiple people could potentially be on the website at the same time and click the button - I want to make sure each user is able to click the button, and update the value and read the incremented value too, but I don't know how to circumvent any synchronisation/concurrency issues.
I'm using flask to run my website backend, and I'm thinking of using MongoDB or Redis to store my single value that needs to be updated.
Please comment if there is any lack of clarity in my question, but this is a problem I've really been struggling with how to solve.
Thanks :)
redis, I think you can use redis hincrby command, or create a distributed lock to make sure there is only one writer at the same time and only the lock holding writer can make the update in your flask framework. Make sure you release the lock after certain period of time or after the writer done using the lock.
mysql, you can start a transaction, and make the update and commit the change to make sure the data is right
To solve this problem I would suggest you follow a micro service architecture.
A service called worker would handle the flask route that's called when the user clicks on the link/button on the website. It would generate a message to be sent to another service called queue manager that maintains a queue of increment/decrement messages from the worker service.
There can be multiple worker service instances running concurrently but the queue manager is a singleton service that takes the messages from each service and adds them to the queue. If the queue manager is busy the worker service will either timeout and retry or return a failure message to the user. If the queue is full a response is sent back to the worker to retry n number of times, and you can count down that n.
A third service called storage manager is run every time the queue is not empty, this service sends the messages to the storage solution (whatever mongo, redis, good ol' sql) and it will ensure the increment/decrement messages are handled in the order they were received in the queue. You could also include a time stamp from the worker service in the message if you wanted to use that to sort the queue.
Generally whatever hosting environment for flask will use gunicorn as the production web server and support multiple concurrent worker instances to handle the http requests, and this would naturally be your worker service.
How you build and coordinate the queue manager and storage manager is down to implementation preference, for instance you could use something like Google Cloud pub/sub system to send messages between different deployed services but that's just off the top of my head. There's a load of different ways to do it, and you're in the best position to decide that.
Without knowing more details about what you're trying to achieve and what's the requirements for concurrent traffic I can't go into greater detail, but that's roughly how I've approached this type of problem in the past. If you need to handle more concurrent users at the website, you can pick a hosting solution with more concurrent workers. If you need the queue to be longer, you can pick a host with more memory, or else write the queue to an intermediate storage. This will slow it down but will make recovering from a crash easier.
You also need to consider handling when messages fail between different services, how to recover from a service crashing or the queue filling up.
EDIT: Been thinking about this over the weekend and a much simpler solution is to just create a new record in a table directly from the flask route that handles user clicks. Then to get your total you just get a count from this table. Your bottlenecks are going to be how many concurrent workers your flask hosting environment supports and how many concurrent connections your storage supports. Both of these can be solved by throwing more resources at them.
I have two servers, A primary server that provide REST API to accept data from user and maintain a product details list. This server is also responsible to share product list (a subset of product data) with secondary server as soon as product is updated/created.
also note that secondary url depends on product details, not a fix server.
Primary server written in Django. I have used django model db signal as product update, create and delete event.
Now problem is that I don’t want to bock my primary server REST call until it populates detail to secondary server. I need some scheduler stuff to do that, i.e. create a task to populate data in background without blocking my current thread.
I found python asyncio module comes with a function 'run_in_executor', and its working till now, But I don’t have a knowledge of the side effect over django run in wsgi server, can anyone explain ? or any other alternate ?
I found django channel, but it need extra stuff like run worker thread separately, redis cache.
You should use Django Celery for running Tasks asynchronously or in the background.
Celery is a task queue with batteries included. It’s easy to use so that you can get started without learning the full complexities of the problem it solves.
You can get more information on celery from http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#first-steps
I have a Django application that uses large data structures in-memory (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the python web process takes more than 30s to start, it will be stopped as it's considered a timeout error. Because of the problem aforementioned, I've used a daemon process(worker in Heroku) to handle the construction of the data structures and Redis to handle the message passing between processes.
When the worker finishes(approx 1 minute), it stores the data structures(50Mb or so) in Redis.
And now comes the crux of the matter...Django follows the request/response paradigm and it's synchronised. This implies a Django view should exist to handle the callback from the worker announcing it's done. Even if I use something fancier like a pub/sub from Redis, I'm still forced to evaluate the queue populated by a publisher in a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()
# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()
pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)
#require_http_methods(['POST'])
def store_and_fetch(request):
user_data = json.load(request.body.decode('utf8'))
message = pub_sub.get_message()
if message:
command = message['data'] if 'data' in message else ''
if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
# this takes the data from redis and updates data_structures
data_handler.update(data_structures)
return HttpResponse(compute_response(user_data, data_structures))
Update: After working for multiple months with this, I can now say it's definitely better(and wiser) NOT to fiddle with Django's request/response cycle. There are things like Django RQ Scheduler, or Celery that can do async tasks just fine. If you want to update the main web process after some repeatable job completed, it's simpler to use something like python requests package, sending a POST to the web process from the worker that did the scheduled job. In this way we don't circumvent Django's mechanisms, and more importantly, it's simpler to do overall.
Regarding the Heroku constraints I mentioned at the beginning of the post. At the moment I wrote this question I was quite a newbie with heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. Thus, at the end of the release phase, we simply need to notify the web process, in the manner I've described above and use some distributed memory buffer (even Redis will work just fine).
I have an web app in which I am trying to use celery to load background tasks from a database. I am currently loading the database upon request, but would like to load the tasks on an hourly interval and have them work in the background. I am using flask and am coding in python.I have redis running as well.
So far using celery I have gotten the worker to process the task and the beat to send the tasks to the worker on an interval. But I want to retrieve the results[a dataframe or query] from the worker and if the result is not ready then it should load the previous result of the worker.
Any ideas on how to do this?
Edit
I am retrieving the results from a database using sqlalchemy and I am rendering the results in a webpage. I have my homepage which has all the various links which all lead to different graphs which I want to be loaded in the background so the user does not have to wait long loading times.
The Celery Task is being executed by a Worker, and it's Result is being stored in the Celery Backend.
If I get you correctly, then I think you got few options:
Ignore the result of the graph-loading-task, store what ever you need, as a side effect of the task, in your database. When needed, query for the most recent result in that database. If the DB is Redis, you may find ZADD and ZRANGE suitable. This way you'll get the new if available, or the previous if not.
You can look for the result of a task if you provide it's id. You can do this when you want to find out the status, something like (where celery is the Celery app): result = celery.AsyncResult(<the task id>)
Use callback to update farther when new result is ready.
Let a background thread wait for the AsyncResult, or native_join, which is supported with Redis, and update accordingly (not recommended)
I personally used option #1 in similar cases (using MongoDB) and found it to be very maintainable and flexible. But possibly, due the nature of your UI, option #3 will more suitable for you needs.
I have an "analytics dashboard" screen that is visible to my django web applications users that takes a really long time to calculate. It's one of these screens that goes through every single transaction in the database for a user and gives them metrics on it.
I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed, it's giving averages on transactions.)
The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?
What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.
Offline Processing
There are two widely-used approaches to work which needs to happen outside the request-response cycle:
Cron jobs (often made easier via a custom management command)
Celery
In relative terms, cron is simpler to setup, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to setup.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bits with cron jobs is getting your code to run in the context of your django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to setup a proper django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.
Another aspect to consider with cronjobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cronjob like this every hour. You can roll your own locking mechanism to prevent this, just be aware of this—what starts out as a short cronjob might not stay that way when your data grows, or when your RDBMS misbehaves, etc etc.
In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regards to who is actually using the system. This is where celery can help.
Celery
…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
Voilá, no need for an AMQP broker.
Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:
User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
If yes, use that.
If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale. optionally provide some indication that the chart is stale, or do some ajax fanciness to ping django every so often and ask if the refreshed chart is ready.
You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic—you define the task once and call it from both places for added DRYness.
Caching
The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend, I've lately started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.
Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:
import pickle
from django.core.cache import cache
from django.shortcuts import render
from mytasks import calculate_stuff
from celery.task import task
#task
def calculate_stuff(user_id):
# ... do your work to update the averages ...
# now pull the latest series
averages = TransactionAverage.objects.filter(user=user_id, ...)
# cache the pickled result for ten minutes
cache.set("averages_%s" % user_id, pickle.dumps(averages), 60*10)
def myview(request, user_id):
ctx = {}
cached = cache.get("averages_%s" % user_id, None)
if cached:
averages = pickle.loads(cached) # use the cached queryset
else:
# fetch the latest available data for now, same as in the task
averages = TransactionAverage.objects.filter(user=user_id, ...)
# fire off the celery task to update the information in the background
calculate_stuff.delay(user_id) # doesn't happen in-process.
ctx['stale_chart'] = True # display a warning, if you like
ctx['averages'] = averages
# ... do your other work ...
render(request, 'my_template.html', ctx)
Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
Simplest and IMO correct solution for such scenarios is to pre-calculate everything as things are updated, so that when user sees dashboard you calculate nothing but just display already calculated values.
There can be various ways to do that, but generic concept is to trigger a calculate function in background when something on which calculation depends changes.
For triggering such calculation in background I usually use celery, so suppose user adds a item foo in view view_foo, we call a celery task update_foo_count which will be run in background and will update foo count, alternatively you can have a celery timer which will update count say every 10 minutes by checking if re-calculation need to be done, recalculate flag can be set at various places where user updates data.
You need to have a look at Django’s cache framework.
If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.