For this question, I'm particularly struggling with how to structure this:
User accesses website
User clicks button
Value x in database increments
My issue is that multiple people could potentially be on the website at the same time and click the button - I want to make sure each user is able to click the button, and update the value and read the incremented value too, but I don't know how to circumvent any synchronisation/concurrency issues.
I'm using flask to run my website backend, and I'm thinking of using MongoDB or Redis to store my single value that needs to be updated.
Please comment if there is any lack of clarity in my question, but this is a problem I've really been struggling with how to solve.
Thanks :)
Redis: I think you can use an atomic increment command such as INCR (or HINCRBY if the value is a field in a hash), or create a distributed lock so that there is only one writer at a time and only the writer holding the lock can make the update from your Flask app. Make sure the lock is released once the writer is done with it, or automatically after a certain period of time.
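A minimal sketch of the atomic-increment idea, assuming the redis-py client, whose `incr(name, amount=1)` method issues Redis's INCR/INCRBY command. The `record_click` helper, the `clicks` key and the `InMemoryCounter` class are hypothetical; the class is only a stand-in so the snippet runs without a Redis server.

```python
import threading


class InMemoryCounter:
    """Hypothetical stand-in exposing the same incr() shape as redis-py."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def incr(self, name, amount=1):
        # Redis executes INCR atomically on the server; the lock mimics that here.
        with self._lock:
            self._values[name] = self._values.get(name, 0) + amount
            return self._values[name]


def record_click(client, key="clicks"):
    """Increment the shared counter and return the new value in one step."""
    return client.incr(key)


def simulate_clicks(client, users=8, clicks_per_user=250):
    """Fire concurrent clicks from several threads and return the final count."""

    def user():
        for _ in range(clicks_per_user):
            record_click(client)

    threads = [threading.Thread(target=user) for _ in range(users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return client.incr("clicks", 0)  # incrementing by 0 just reads the value
```

With redis-py installed and a server running, `record_click(redis.Redis())` should behave the same way. Because the increment returns the new value, each user reads the value produced by their own click and no separate read is needed.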
MySQL: you can start a transaction, make the update, and commit the change to make sure the data stays consistent.
To solve this problem I would suggest you follow a micro service architecture.
A service called worker would handle the flask route that's called when the user clicks on the link/button on the website. It would generate a message to be sent to another service called queue manager that maintains a queue of increment/decrement messages from the worker service.
There can be multiple worker service instances running concurrently but the queue manager is a singleton service that takes the messages from each service and adds them to the queue. If the queue manager is busy the worker service will either timeout and retry or return a failure message to the user. If the queue is full a response is sent back to the worker to retry n number of times, and you can count down that n.
A third service called storage manager is run every time the queue is not empty, this service sends the messages to the storage solution (whatever mongo, redis, good ol' sql) and it will ensure the increment/decrement messages are handled in the order they were received in the queue. You could also include a time stamp from the worker service in the message if you wanted to use that to sort the queue.
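The worker / queue manager / storage manager split described above can be sketched in a single process, assuming threads stand in for the separate services and a dict stands in for the storage. All names here are hypothetical; a real deployment would use a message broker between services.

```python
import queue
import threading

STOP = object()  # sentinel telling the storage manager to shut down


def worker_send(q, delta, retries=3, timeout=0.1):
    """Worker service: enqueue a message, retrying while the queue is full."""
    for _ in range(retries):
        try:
            q.put(delta, timeout=timeout)
            return True
        except queue.Full:
            continue
    return False  # the caller would report the failure to the user


def storage_manager(q, store):
    """Single consumer: applies messages in the order they were queued."""
    while True:
        delta = q.get()
        if delta is STOP:
            break
        store["x"] += delta


def run_demo(messages):
    """Send every message from its own thread and return the stored value."""
    q = queue.Queue(maxsize=100)
    store = {"x": 0}
    consumer = threading.Thread(target=storage_manager, args=(q, store))
    consumer.start()
    senders = [
        threading.Thread(target=worker_send, args=(q, m)) for m in messages
    ]
    for s in senders:
        s.start()
    for s in senders:
        s.join()
    q.put(STOP)  # queued after all messages, so nothing is skipped
    consumer.join()
    return store["x"]
```

Only the storage manager ever touches the stored value, so there is no concurrent write to protect, which is the point of funnelling everything through one queue.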
Generally whatever hosting environment for flask will use gunicorn as the production web server and support multiple concurrent worker instances to handle the http requests, and this would naturally be your worker service.
How you build and coordinate the queue manager and storage manager is down to implementation preference, for instance you could use something like Google Cloud pub/sub system to send messages between different deployed services but that's just off the top of my head. There's a load of different ways to do it, and you're in the best position to decide that.
Without knowing more details about what you're trying to achieve and what's the requirements for concurrent traffic I can't go into greater detail, but that's roughly how I've approached this type of problem in the past. If you need to handle more concurrent users at the website, you can pick a hosting solution with more concurrent workers. If you need the queue to be longer, you can pick a host with more memory, or else write the queue to an intermediate storage. This will slow it down but will make recovering from a crash easier.
You also need to consider handling when messages fail between different services, how to recover from a service crashing or the queue filling up.
EDIT: Been thinking about this over the weekend and a much simpler solution is to just create a new record in a table directly from the flask route that handles user clicks. Then to get your total you just get a count from this table. Your bottlenecks are going to be how many concurrent workers your flask hosting environment supports and how many concurrent connections your storage supports. Both of these can be solved by throwing more resources at them.
Related
I'm building a service with Celery to process data from a large number of users. I want each user to be able to use this pipeline in a perfectly-fair round-robin fashion. It's infeasible for me to manually create a separate permanent queue for each user, so I rather do it dynamically.
In the perfect scenario, when a user wants to start work, Celery would either create a new queue or re-use the queue previously created for that user (idempotent). Then, I know Celery already has round-robin mechanisms built in, so processing is "fair". After the user leaves, the queue can be auto-deleted through a TTL so as not to take up space in my messaging broker. This queue will be re-created when the user comes back and needs to do more work.
My current broker is RabbitMQ/AmazonMQ but I'm flexible and I can change this to Redis or SQS.
I'm looking to see if there's already an out-of-box solution for this. Otherwise, a manual solution could be to maintain a database table of active queues and use Beat to check activeness and TTL.
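To make the manual idea concrete, here is an in-memory sketch of per-user queues that are created on demand, served round-robin, and dropped after an idle TTL. It is not Celery or broker code; the `FairScheduler` class and its methods are hypothetical, and a real version would keep this state in the database table and run the expiry from Beat.

```python
import time
from collections import OrderedDict, deque


class FairScheduler:
    """Per-user queues created on demand, served round-robin, dropped when idle."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._queues = OrderedDict()  # user_id -> deque of pending jobs
        self._last_seen = {}          # user_id -> time of the last submit
        self._ttl = ttl_seconds
        self._clock = clock

    def submit(self, user_id, job):
        # Idempotent: re-uses the user's queue if it already exists.
        self._queues.setdefault(user_id, deque()).append(job)
        self._last_seen[user_id] = self._clock()

    def next_job(self):
        """Return (user_id, job) for the next user in rotation, or None."""
        for user_id in list(self._queues):
            pending = self._queues[user_id]
            if pending:
                self._queues.move_to_end(user_id)  # this user goes to the back
                return user_id, pending.popleft()
        return None

    def expire_idle(self):
        """Delete empty queues whose owner has been inactive longer than the TTL."""
        now = self._clock()
        for user_id in list(self._queues):
            idle = now - self._last_seen[user_id]
            if not self._queues[user_id] and idle > self._ttl:
                del self._queues[user_id]
                del self._last_seen[user_id]

    def active_users(self):
        return list(self._queues)
```

A user who submits many jobs at once only gets one job per rotation, so a later user's first job is served right after the heavy user's first job rather than after all of them.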
I stumbled upon these other posts but they haven't been super helpful as the proposed solutions are not significantly better than the manual queue table.
RabbitMq Dynamically Add/Remove Queues/Consumers
Auto Delete SQS queue
I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as a backend, PostgreSQL as my DB, GraphQL as my API layer and React as my frontend framework. The website is hosted on Heroku. I wrote a Python script that logs in to my Gmail account, parses a few emails based on pre-defined conditions, and stores the parsed data in a Google Sheet. Now I want the script to be part of my website, where the user specifies what exactly needs to be parsed (i.e. filters) and the parsed data is then displayed in a table so the accuracy of the parsing can be reviewed.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' in a task model. Once a new task entry is stored, a Django signal will trigger the script. I'm not sure yet if a signal can run custom Python functions, but from what I've read so far, it seems doable.
Use Celery to run this task asynchronously. But I'm not sure if asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend. I might be wrong here, though. I'm also not sure if I need Redis to store the task details or if I can do that in PostgreSQL.
What is the best practice in implementing this feature? The task can be anything, not necessarily parsing emails; it can also be importing data from excel. Any task that is user generated rather than scheduled or repeated task.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learning from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. To do this, you just pass the parameters through the request; the server parses them, runs the script, and returns the result.
No setting up of a message queue, task scheduler, or whatever needed.
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while, some you don't even know if it's going to finish
Script is no longer dependent on the reliability of the network (imagine running an expensive task, then your internet connection skips or is just plain intermittent; you won't be able to do anything)
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis; a messaging queue like RabbitMQ; or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, like you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by triggering the script manually.
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Keep it simple: unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is not reliable in practice), there's really no reason to be fancy.
The more blocking the task is, the longer it takes, or the more it depends on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
For your email import script, I would most likely push it to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from gmail given the task parameters) and save the results to the database
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, such as crontab starting a new run of the script while a previous run is still going, but I won't go into detail without knowing more about the specifics of the task.
This isn't specifically related to programming, more-so infrastructure, but of all the exchange sites StackOverflow seems to be most knowledgeable in departments of RESTful APIs.
I have a single endpoint configured for handling events that could take in up to 1k events within a 3 minute window. I am noticing a lot of events "missed", but I'm not sure that I'm willing to blame over-utilization right away without fully understanding.
The listening endpoint is /users/events?user=2345345 where 2345345 is the user id. From here we perform necessary actions on that particular user, but what if during this the next user, 2895467 performs an action which results in a new event being sent to /users/events?user=2895467 before the first could be processed. What happens?
I intend to alleviate the concern by using celery to signal tasks which would greatly reduce this, but is it fair to assume that events could be missed while this single endpoint remains synchronous?
Real-life behavior depends on the approach used for "deployment".
For example, if you are using uwsgi with a single unthreaded worker behind nginx, then requests will be processed "sequentially": if a second request arrives before the first is processed, the second will be "queued" (added to the backlog).
How long it can be queued and how many requests may be in the queue depends on the configuration of nginx (listen backlog), the configuration of uwsgi (concurrency, listen backlog) and even on the configuration of the OS kernel (search for net.core.somaxconn, net.core.netdev_max_backlog). When the queue becomes "full", new "concurrent" connections will be dropped instead of being added to the queue.
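As a toy model of that behaviour (not of nginx, uwsgi or the kernel themselves), a bounded queue shows how a burst that exceeds the backlog loses requests while one worker is busy. The function name and the sizes are made up for illustration.

```python
import queue


def simulate_burst(num_requests, backlog_size):
    """Model a burst arriving at one busy worker with a bounded backlog."""
    backlog = queue.Queue(maxsize=backlog_size)
    dropped = []
    for request_id in range(num_requests):
        try:
            backlog.put_nowait(request_id)  # the request waits for the worker
        except queue.Full:
            dropped.append(request_id)      # backlog full: the request is lost

    # Once the worker is free, it drains the backlog sequentially.
    processed = []
    while not backlog.empty():
        processed.append(backlog.get_nowait())
    return processed, dropped
```

In a real deployment the worker drains the backlog while new requests arrive, so whether anything is dropped depends on the arrival rate relative to the processing time, not just on the total number of events.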
I have two servers. A primary server provides a REST API to accept data from users and maintains a product details list. This server is also responsible for sharing the product list (a subset of the product data) with a secondary server as soon as a product is updated or created.
Also note that the secondary URL depends on the product details; it is not a fixed server.
The primary server is written in Django. I have used Django model DB signals as the product update, create and delete events.
The problem is that I don't want to block my primary server's REST call until it has pushed the details to the secondary server. I need some kind of scheduler for that, i.e. create a task that populates the data in the background without blocking my current thread.
I found that Python's asyncio module comes with a function 'run_in_executor', and it has been working so far, but I don't know the side effects of using it with Django running in a WSGI server. Can anyone explain, or suggest an alternative?
I found Django Channels, but it needs extra pieces, such as running a worker thread separately and a Redis cache.
You should use Django Celery for running Tasks asynchronously or in the background.
Celery is a task queue with batteries included. It’s easy to use so that you can get started without learning the full complexities of the problem it solves.
You can get more information on celery from http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#first-steps
We have a Windows based Celery/RabbitMQ server that executes long-running python tasks out-of-process for our web application.
What this does, for example, is take a CSV file and process each line. For every line it books one or more records in our database.
This seems to work fine, I can see the records being booked by the worker processes. However, when I check the rabbitMQ server with the management plugin (the web based management tool) I see the Queued messages increasing, and not coming back down.
Under connections I see 116 connections, about 10-15 per virtual host, all "running" but when I click through, most of them have 'idle' as State.
I'm also wondering why these connections are still open, and if there is something I need to change to make them close themselves:
Under 'Queues' I can see more than 6200 items with state 'idle', and not decreasing.
So concretely I'm asking if these are normal statistics or if I should worry about the Queues increasing but not coming back down and the persistent connections that don't seem to close...
Other than the rather concise help inside the management tool, I can't seem to find any information about what these stats mean and if they are good or bad.
I'd also like to know why the messages are still visible in the queues, and why they are not removed, as the tasks seem to be completed just fine.
Any help is appreciated.
Answering my own question;
Celery sends a result message back to the calling code for every task. This message is sent back via the same AMQP queue.
This is why the tasks were working, but the queue kept filling up. We were not handling these results, or even interested in them.
I added ignore_result=True to the celery task, so the task does not send result messages back into the queue. This was the main solution to the problem.
Furthermore, the configuration option CELERY_SEND_EVENTS=False was added to speed up Celery. If set to True, this option has Celery send events for external monitoring tools.
On top of that, CELERY_TASK_RESULT_EXPIRES=3600 now makes sure that even if results are sent back, they expire after one hour if not picked up/acknowledged.
Finally CELERY_RESULT_PERSISTENT was set to False, this configures celery to not store these result messages on disk. They will vanish when the server crashes, which is fine in our case, as we don't use them.
So in short; if you don't need feedback in your app about if and when the tasks are finished, use ignore_result=True on the celery task, so that no messages are sent back.
If you do need that information, make sure you pick up and handle the results, so that the queue stops filling up.
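Collected in one place, the settings from this answer would look something like the fragment below. It uses the old-style uppercase names quoted above; newer Celery releases use lowercase setting names instead, so check the configuration reference for your version. `CELERY_IGNORE_RESULT` is not part of the answer; it is included as the global counterpart of `ignore_result=True` on an individual task.

```python
# celeryconfig.py: old-style (uppercase) setting names, as used in this answer.

# Assumption: global counterpart of ignore_result=True on a single task.
CELERY_IGNORE_RESULT = True

# Do not send events for external monitoring tools.
CELERY_SEND_EVENTS = False

# If results are sent anyway, expire them after one hour (in seconds).
CELERY_TASK_RESULT_EXPIRES = 3600

# Do not persist result messages to disk; they are lost if the server crashes.
CELERY_RESULT_PERSISTENT = False
```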
If you don't need the reliability then you can make your queues transient.
http://celery.readthedocs.org/en/latest/userguide/optimizing.html#optimizing-transient-queues
CELERY_DEFAULT_DELIVERY_MODE = 'transient'