Synchronous Task "Queues" with AWS Lambda - python

I have a deployed & functioning Django application that uses celery workers to execute background tasks. We are looking at the possibility of switching to AWS Lambda for running our background tasks instead to help save on costs, but are unsure how we would structure our project.
Current Structure
The app currently consists of an Elastic Beanstalk app with an EC2 instance running the web server and Celery Beat (for some regularly-scheduled tasks), and a separate EC2 Celery worker instance that executes the tasks. In the app's current configuration, I am using Celery with Amazon's SQS to trigger functions executed by the worker instance.
Key to our application is that we may receive several items to be queued up at once. Multiple queue items can execute at the same time, but only one item from each user can execute concurrently. We have accomplished this by using named queues for each user and configuring Celery to only run one task from any given queue at a time.
Questions / Desired Structure
We would like to transition to using AWS Lambda functions to execute our background tasks, because we have significant gaps in application usage (we typically get tasks in large groups) and this can save on costs. Our biggest question is whether there is a way to "categorize" Lambda invocations such that we can have multiple functions executing at once, but only one executing from each category. Lambda seems to have features for managing concurrency, but only per-function and not with any equivalent to the multiple queues that we are using currently.

It sounds like you want to leverage AWS SQS FIFO queues.
You can set the MessageGroupId to each user's ID; that way, different users' messages can be processed in parallel.
Per the documentation:
Receiving messages
You can't request to receive messages with a specific message group ID.
(That is, you won't be able to filter by user ID.)
When receiving messages from a FIFO queue with multiple message group IDs, Amazon SQS first attempts to return as many messages with the same message group ID as possible. This allows other consumers to process messages with a different message group ID. When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.
This explains that SQS FIFO processes messages by messageGroupId: until you process and delete the in-flight message for a given messageGroupId (the Lambda trigger deletes it automatically when the function succeeds), SQS won't hand out the other messages in that group.
Messages with a different messageGroupId, on the other hand, can still be processed.
Messages that belong to the same message group are always processed one by one, in a strict order relative to the message group (however, messages that belong to different message groups might be processed out of order).
I recommend you run a mini load test to simulate the behavior you expect and see if it fits your needs!
You can set up a simple SQS FIFO queue that triggers a Lambda with the maximum concurrency allowed (1000 for standard accounts).
Then push messages using a script (see the sketch below) with different messageGroupIds and in random order, and watch how they are processed.
This won't cost much since it's serverless, and it will save you a lot of time and trouble in the future (probably money too).
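A minimal sketch of such a load-test producer, assuming boto3 and a hypothetical FIFO queue URL (the user IDs and message bodies are made up too):

```python
# Push interleaved messages for several users onto a FIFO queue, in
# random order, to observe per-group ordering on the consumer side.
import json
import random
import uuid

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tasks.fifo"  # hypothetical

user_ids = [f"user-{i}" for i in range(5)]
messages = [(random.choice(user_ids), n) for n in range(100)]
random.shuffle(messages)

for user_id, n in messages:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user": user_id, "task": n}),
        MessageGroupId=user_id,  # one group per user: serial within, parallel across
        MessageDeduplicationId=str(uuid.uuid4()),  # or enable content-based dedup
    )
```

If the FIFO semantics match the docs quoted above, you should never see two messages from the same group in flight at the same time.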

Related

Update single database value on a website with many users

For this question, I'm particularly struggling with how to structure this:
User accesses website
User clicks button
Value x in database increments
My issue is that multiple people could potentially be on the website at the same time and click the button. I want to make sure each user is able to click the button, update the value, and read the incremented value too, but I don't know how to avoid synchronisation/concurrency issues.
I'm using flask to run my website backend, and I'm thinking of using MongoDB or Redis to store my single value that needs to be updated.
Please comment if there is any lack of clarity in my question, but this is a problem I've really been struggling with how to solve.
Thanks :)
Redis: you can use the HINCRBY command, or create a distributed lock to make sure there is only one writer at a time, with only the lock-holding writer making the update from your Flask app (see the sketch below). Make sure you release the lock after a certain period of time, or as soon as the writer is done using it.
MySQL: you can start a transaction, make the update, and commit the change to make sure the data stays consistent.
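As a rough sketch of the Redis option, assuming the redis-py package (the hash and field names here are made up):

```python
# Flask route that increments a shared counter atomically in Redis.
import redis
from flask import Flask

app = Flask(__name__)
r = redis.Redis()

@app.route("/click", methods=["POST"])
def click():
    # HINCRBY executes atomically inside the Redis server, so concurrent
    # clicks cannot lose updates; it also returns the new value.
    new_value = r.hincrby("counters", "x", 1)
    return {"x": new_value}
```

Because the increment happens server-side in a single command, no distributed lock is needed for this particular operation.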
To solve this problem I would suggest you follow a microservice architecture.
A service called worker would handle the Flask route that's called when the user clicks the link/button on the website. It would generate a message to be sent to another service, called queue manager, that maintains a queue of increment/decrement messages from the worker service.
There can be multiple worker service instances running concurrently, but the queue manager is a singleton service that takes the messages from each worker and adds them to the queue. If the queue manager is busy, the worker service will either time out and retry or return a failure message to the user. If the queue is full, a response is sent back to the worker to retry up to n times, counting n down with each attempt.
A third service called storage manager runs whenever the queue is not empty. This service sends the messages to the storage solution (whether Mongo, Redis, or good ol' SQL) and ensures the increment/decrement messages are handled in the order they were received in the queue. You could also include a timestamp from the worker service in the message if you wanted to use it to sort the queue.
Generally, whatever hosting environment you use for Flask will run Gunicorn as the production web server and support multiple concurrent worker processes to handle HTTP requests; these would naturally be your worker service.
How you build and coordinate the queue manager and storage manager is down to implementation preference. For instance, you could use something like Google Cloud's Pub/Sub system to send messages between the deployed services, but that's just off the top of my head; there are loads of different ways to do it, and you're in the best position to decide.
Without knowing more about what you're trying to achieve and what the requirements are for concurrent traffic, I can't go into greater detail, but that's roughly how I've approached this type of problem in the past. If you need to handle more concurrent users on the website, pick a hosting solution with more concurrent workers. If you need the queue to be longer, pick a host with more memory, or else write the queue to intermediate storage; this will slow things down but will make recovering from a crash easier.
You also need to consider how to handle messages failing between services, and how to recover from a service crashing or the queue filling up.
EDIT: I've been thinking about this over the weekend, and a much simpler solution is to just create a new record in a table directly from the Flask route that handles user clicks; to get your total, you just take a count from this table (see the sketch below). Your bottlenecks are going to be how many concurrent workers your Flask hosting environment supports and how many concurrent connections your storage supports. Both can be solved by throwing more resources at them.
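A bare-bones sketch of that row-per-click idea, using Flask with SQLite purely for illustration (the table name and file path are made up; any RDBMS works the same way):

```python
# Each click inserts a row; the total is simply a COUNT(*).
import sqlite3
from flask import Flask

app = Flask(__name__)
DB_PATH = "clicks.db"  # hypothetical

def get_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clicks "
        "(id INTEGER PRIMARY KEY, ts TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

@app.route("/click", methods=["POST"])
def click():
    with get_db() as conn:  # the context manager commits on success
        conn.execute("INSERT INTO clicks DEFAULT VALUES")
    return {"ok": True}

@app.route("/count")
def count():
    with get_db() as conn:
        (total,) = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()
    return {"x": total}
```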

Can I use a celery task's send_event instead of update_state for state updates?

While working on an asynchronous task queue for a webserver (built with Python and Flask), I was looking for a way to get the server to actually perform some work once a task update comes in. There is a function that can be used on the client side to fetch a task's result (celery.app.task.get), and one to send updates from the worker side (celery.app.task.update_state).
But this requires a result backend to be configured. That is not a problem per se, but I came across celery events (https://docs.celeryproject.org/en/stable/userguide/monitoring.html#real-time-processing).
These apparently allow you to omit the result backend. On the worker side, this requires using the celery.app.task.send_event function.
I do not need to send the result of a task to the client (it is a file on a shared volume) or store it in a database, but I would like to receive progress updates (percentage) for the tasks. Is using the event system a good alternative to update_state()?
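For reference, a minimal sketch of the event-based approach, assuming a bound task emitting a hypothetical custom "task-progress" event type (the broker URL is made up, and the worker may need events enabled, e.g. celery worker -E):

```python
# Worker emits a custom "task-progress" event; a separate process receives
# it via app.events.Receiver, with no result backend configured.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # hypothetical broker

@app.task(bind=True)
def long_job(self):
    for pct in range(0, 101, 10):
        # ... do a slice of the real work here ...
        self.send_event("task-progress", progress=pct)  # extra fields ride along

def monitor():
    def on_progress(event):
        # task events carry the task id as "uuid", plus any custom fields
        print(f"task {event['uuid']}: {event['progress']}%")

    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={"task-progress": on_progress})
        recv.capture(limit=None, timeout=None, wakeup=True)
```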

Best Way to Handle user triggered task (like import data) in Django

I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as the backend, PostgreSQL as the DB, GraphQL as the API layer, and React as the frontend framework. The website is hosted on Heroku. I wrote a Python script that logs me in to my Gmail account and parses a few emails based on pre-defined conditions, then stores the parsed data in a Google Sheet. Now I want the script to be part of my website, where the user will specify what exactly needs to be parsed (i.e. filters) and then the parsed data will be displayed in a table to review the accuracy of the parsing task.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' in a task model. Once a new task entry is stored, a Django Signal will trigger the script. I'm not sure yet if Signals can run custom Python functions, but from what I've read so far, it seems doable.
Use Celery to run this task asynchronously. But I'm not sure if asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend. I might be wrong, though. I'm also not sure if I need Redis to store the task details, or whether I can do that in PostgreSQL.
What is the best practice for implementing this feature? The task can be anything, not necessarily parsing emails; it could also be importing data from Excel. Any task that is user-generated rather than scheduled or repeated.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learn from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. All you do is pass the parameters through the request; the server parses them, runs the script, and returns the result (see the sketch below).
No message queue, task scheduler, or anything else needs to be set up.
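A minimal synchronous sketch as a plain Django view (parse_emails is a hypothetical stand-in for your script's entry point):

```python
from django.http import JsonResponse

def parse_emails(filters):
    # placeholder for the actual Gmail-parsing script
    return {"filters": filters, "rows": []}

def run_import(request):
    filters = request.GET.get("filters", "")  # parameters come straight from the request
    result = parse_emails(filters)            # the response blocks until this finishes
    return JsonResponse(result)
```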
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while; with some, you don't even know if they are going to finish.
The script is no longer dependent on the reliability of the network (imagine running an expensive task and then your internet connection skips or is just plain intermittent; you wouldn't be able to do anything).
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis, a messaging queue like RabbitMQ, or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, like you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like cron; or even by you manually triggering the script. A bare-bones version is sketched below.
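A bare-bones sketch of the pattern, using a Redis list as the queue (assumes the redis-py package; the queue name and task shape are made up):

```python
# Producer pushes JSON-encoded tasks; consumer blocks until one arrives.
import json
import redis

r = redis.Redis()

def produce(task_params: dict) -> None:
    # Producer: enqueue a task
    r.lpush("tasks", json.dumps(task_params))

def execute(task: dict) -> None:
    print("executing", task)  # placeholder for the real script

def consume_forever() -> None:
    # Consumer: BRPOP blocks until a task is available
    while True:
        _, raw = r.brpop("tasks")
        execute(json.loads(raw))
```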
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Just keep it simple: unless you have a compelling reason to get fancy (e.g. the server is being bogged down, or the internet connection is unreliable in practice), there's really no reason to.
The more blocking the task is, the longer it takes, or the more dependent it is on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
For your email import script, I would most likely push it to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from Gmail given the task parameters) and saves the results to the database (sketched after this list)
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, like cron running the script multiple times concurrently and such, but I won't go into detail without knowing more about the specifics of the task.
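A minimal sketch of that executor, assuming a hypothetical Django Task model with status, params, and result fields; the status flip guards (naively) against overlapping cron runs:

```python
# Run every few minutes via cron, e.g. as a Django management command.
from myapp.models import Task  # hypothetical app and model

def import_emails(params):
    return "parsed!"  # placeholder for the real Gmail import

def run_pending_tasks():
    for task in Task.objects.filter(status="pending"):
        # atomically claim the task; skip it if another run got there first
        claimed = Task.objects.filter(pk=task.pk, status="pending").update(status="running")
        if not claimed:
            continue
        try:
            task.result = import_emails(task.params)
            task.status = "done"
        except Exception as exc:
            task.status = "failed"
            task.result = str(exc)
        task.save()
```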

Polling and consuming messages from Amazon SQS as a service

I am using SQS as a 'bridge' in my system: it receives tasks from GAE, and they are processed on EC2. Currently I am able to add tasks to this queue from GAE, but I'm having some difficulty figuring out how to consume these tasks on EC2. So my questions are:
What are the recommended ways of building a task consumer (Python-based) on EC2 that would keep an eye on SQS and assign new inbound jobs to workers?
Does AWS have this type of SQS-monitoring product? If not, is Celery's periodic task feature a good candidate?
The task consumer is just a (long) poller. The boto Python library has SQS support. AWS provides the SQS service, but they don't make the consumers. I'm not familiar with Celery.
Standard practice is to poll for a message (which marks it as 'invisible'), perform the action at the consumer end, and delete the message once the action completes. If the action fails because one of your compute nodes disappeared, the message becomes visible again after a period of time and will be picked up by a future poll. If you need something smarter, you may want to implement an external ESB or experiment with AWS SWF. (A minimal poll loop is sketched after the links below.)
http://boto.readthedocs.org/en/latest/ref/sqs.html
http://aws.amazon.com/swf/
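Here is that poll-process-delete loop, sketched with boto3 (the successor to the boto library linked above); the queue URL and handler are hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tasks"  # hypothetical

def process(body):
    print("processing", body)  # placeholder for handing the job to a worker

while True:
    # receiving marks the message invisible for the visibility timeout
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # delete only after success; on failure the message reappears
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```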
You can use AWS Elastic Beanstalk to consume the tasks in the queue:
AWS Beanstalk with SQS
If you don't want to break your code down to run within Beanstalk, you can write some code for Beanstalk to pull an item off the queue and then send it to your EC2 server, essentially making Beanstalk hand out the messages/tasks in the queue (a sketch of the receiving endpoint is below). This would remove the need for your EC2 server to constantly poll the queue.
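For what it's worth, in a Beanstalk worker environment the local daemon POSTs each queue message to an HTTP path you configure and deletes the message when it gets a 200 back. A sketch of such an endpoint (the path and handling logic are hypothetical):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/tasks", methods=["POST"])
def handle_task():
    body = request.get_data(as_text=True)  # the SQS message body
    print("received task:", body)          # placeholder for the real work
    return "", 200  # a 200 tells the worker daemon to delete the message
```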

How does Amazon's SQS notify one of my "worker" servers whenever there is something in the queue?

I'm following this tutorial: http://boto.s3.amazonaws.com/sqs_tut.html
When there's something in the queue, how do I assign one of my 20 workers to process it?
I'm using Python.
Unfortunately, SQS lacks some of the semantics we've often come to expect in queues. There's no notification or any sort of blocking "get" call.
Amazon's related SNS/Simple Notification Service may be useful to you in this effort. When you've added work to the queue, you can send out a notification to subscribed workers.
See also:
http://aws.amazon.com/sns/
Best practices for using Amazon SQS - Polling the queue
This is (now) possible with long polling on an SQS queue; a boto3 sketch follows the quoted docs below.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryReceiveMessage.html
Long poll support (integer from 1 to 20) - the duration (in seconds) that the ReceiveMessage action call will wait until a message is in the queue to include in the response, as opposed to returning an empty response if a message is not yet available.
If you do not specify WaitTimeSeconds in the request, the queue attribute ReceiveMessageWaitTimeSeconds is used to determine how long to wait.
Type: Integer from 0 to 20 (seconds)
Default: The ReceiveMessageWaitTimeSeconds of the queue.
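A sketch of both ways to long-poll with boto3 (the queue URL is hypothetical):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

# per-request: wait up to 20 seconds for a message before returning
resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)

# per-queue default: used whenever WaitTimeSeconds is omitted from a request
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
)
```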
To further point out a problem with SQS: you must poll for new messages, and there is no guarantee that any particular poll will return a message that exists in the queue (this is due to the redundancy of their architecture). This means you need to consider the possibility that your poll didn't return a message that existed (which, for me, meant increasing the polling rate).
All in all I found too many limitations with SQS (as I have with some other AWS tools, such as SimpleDB). But that's just my opinion.
Actually, if you don't require low latency, you can try this:
Create a CloudWatch alarm on your queue, e.g. on messages visible or messages received > 0.
As the alarm action, send a message to an SNS topic, which can then push the message to your workers via an HTTP/S endpoint.
Normally this kind of approach is used for autoscaling. A sketch of the alarm setup is below.
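Wiring that up with boto3 could look like this (the alarm name, queue name, and topic ARN are all hypothetical):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="queue-has-messages",  # hypothetical
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],  # hypothetical queue
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # on alarm, publish to an SNS topic that pushes to the workers
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:workers"],  # hypothetical
)
```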
There is now a JMS wrapper for SQS from Amazon that will let you create listeners that are automatically triggered when a new message is available.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/jmsclient.html#jmsclient-gsg
