Polling and consuming messages from Amazon SQS as a service - python

I am using SQS as a 'bridge' in my system: it receives tasks from GAE, and they get processed on EC2. Currently I am able to add tasks to this queue from GAE, but I am having some difficulty working out how to consume these tasks on EC2. So my questions are:
What are the recommended ways of building a task consumer (Python based) on EC2 that would keep an eye on SQS and assign new inbound jobs to workers?
Does AWS have this type of SQS monitoring product? If not, are Celery's periodic tasks a good candidate?

The task consumer is just a (long) poller. The boto Python library has SQS support. AWS provides the SQS service, but they don't make the consumers. I'm not familiar with Celery.
Standard practice is to poll for a message (which marks it as 'invisible'), then perform the action at the consumer end. When the action completes, delete the message. If the action fails because one of your compute nodes disappeared, the message becomes visible again after a period of time and will be picked up in a future poll. If you need something smarter, you may want to implement an external ESB or experiment with AWS SWF.
http://boto.readthedocs.org/en/latest/ref/sqs.html
http://aws.amazon.com/swf/
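
A minimal sketch of that poll/act/delete loop, using boto3 (the successor to the boto library linked above); the queue URL and handle_task function are placeholders for your own setup:

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-task-queue"  # placeholder

    def handle_task(body):
        print("processing:", body)  # placeholder: hand the job to a worker here

    while True:
        # Long poll: wait up to 20 seconds for messages instead of busy-polling.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            handle_task(msg["Body"])
            # Delete only after the work succeeds; if the consumer dies first,
            # the message becomes visible again after the visibility timeout.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])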

You can use the AWS Elastic Beanstalk worker tier to consume the tasks in the queue:
AWS Beanstalk with SQS
If you don't want to break your code down to run within Beanstalk, you can write some code for Beanstalk to pull an item off the queue and then send it to your EC2 server, essentially making Beanstalk hand out the messages/tasks in the queue. This would remove the need for your EC2 server to constantly poll the queue.
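
For reference, a Beanstalk worker environment runs a daemon that pulls from the queue and POSTs each message body to an HTTP endpoint in your application. A minimal handler might look like the sketch below; Flask and the /tasks path are assumptions here, the path being whatever you configure in the worker settings:

    from flask import Flask, request

    app = Flask(__name__)

    def process(payload):
        print("processing:", payload)  # placeholder for the real task logic

    @app.route("/tasks", methods=["POST"])
    def handle_sqs_message():
        # The worker daemon delivers the raw SQS message body as the request body.
        process(request.get_data(as_text=True))
        # A 200 response tells the daemon to delete the message from the queue;
        # any other status leaves it in the queue to be retried.
        return "", 200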

Synchronous Task "Queues" with AWS Lambda

I have a deployed & functioning Django application that uses celery workers to execute background tasks. We are looking at the possibility of switching to AWS Lambda for running our background tasks instead to help save on costs, but are unsure how we would structure our project.
Current Structure
The app currently consists of an Elastic Beanstalk app with an EC2 instance running the web server and Celery Beat (for some regularly-scheduled tasks), and a separate EC2 Celery worker instance that executes the tasks. In the app's current configuration, I am using Celery with Amazon's SQS to trigger functions executed by the worker instance.
Key to our application is that we may receive several items to be queued up at once. Multiple queue items can execute at the same time, but only one item from each user can execute concurrently. We have accomplished this by using named queues for each user and configuring Celery to only run one task from any given queue at a time.
Questions / Desired Structure
We would like to transition to using AWS lambda functions to execute our background tasks, because we have significant gaps in application usage (we typically get tasks in large groups) and this can save on costs. Our biggest question is whether there is a way to "categorize" lambda invocations such that we can have multiple functions executing at once, but only one executing from each category. Lambda seems to have features for managing concurrency, but only per-function and not with any equivalent to the multiple queues that we are using currently.
It feels like you may want to leverage AWS SQS FIFO queues.
You can set the MessageGroupId to your users' IDs; that way you will be able to process different users' messages in parallel.
Per the documentation:
Receiving messages
You can't request to receive messages with a specific message group ID.
You won't be able to filter by user ID.
When receiving messages from a FIFO queue with multiple message group IDs, Amazon SQS first attempts to return as many messages with the same message group ID as possible. This allows other consumers to process messages with a different message group ID. When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.
This explains that SQS FIFO hands out messages by messageGroupId: until you have processed and deleted the messages with a given messageGroupId from the queue (deletion is done automatically by the Lambda trigger when the Lambda succeeds), it won't return the other messages of that group.
On the other hand, it can process messages with different messageGroupIds in parallel.
Messages that belong to the same message group are always processed one by one, in a strict order relative to the message group (however, messages that belong to different message groups might be processed out of order).
I recommend you run a mini load test with messages in the queue to simulate the behavior you expect and see if it fits your needs!
You can set up a simple SQS FIFO queue that triggers a Lambda with the maximum concurrency allowed (1000 for standard accounts).
Then push messages using a script, with different messageGroupIds and in random order, and see how they are processed.
This won't cost much since it's serverless, and it will save you a lot of time and issues in the future (probably money too).
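
A rough sketch of such a load-test script with boto3; the queue URL, number of users, and message count are made up for illustration:

    import json
    import random
    import uuid
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tasks.fifo"  # placeholder

    user_ids = [f"user-{i}" for i in range(10)]
    messages = [(random.choice(user_ids), n) for n in range(200)]
    random.shuffle(messages)  # random ordering across users

    for user_id, n in messages:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"user": user_id, "task": n}),
            # Messages sharing a MessageGroupId are handed out one at a time, in order;
            # different groups can be processed in parallel.
            MessageGroupId=user_id,
            # Required unless content-based deduplication is enabled on the queue.
            MessageDeduplicationId=str(uuid.uuid4()),
        )

You can then watch the Lambda invocations (or its logs) to confirm that only one message per group is in flight at a time.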

How to manage long running tasks via website

I have a Django website where I can register event listeners and monitoring tasks on certain websites, see info about these tasks, edit them, delete them, etc. These tasks are long running, so I launch them as tasks in an asyncio event loop. I want them to be independent of the Django website, so I run these tasks in an event loop alongside a Sanic web server and control them with API calls from the Django server. I don't know why, but I still feel that this solution is pretty scuffed, so is there a better way to do it? I was thinking about using Kubernetes, but these tasks aren't resource heavy and are simple, so I don't think it's worth launching a new pod for each.
Thanks for help.
Ideally, it is always a good idea to launch a new pod for a new event or job.
You can use a Job or CronJob in Kubernetes so the pods are automatically cleaned up when the work is done.
It's always better to keep separate, small microservices rather than running the whole monolithic application inside one container.
On the management side, starting a new pod per job is also easier to manage, and it is cost-efficient if you scale your cluster up and down according to resource requirements.
You can also use a message broker with a listener that subscribes to a channel on the broker and performs the async task or handles the event when one arrives; the listener runs as a separate pod (see the sketch below).
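
As a minimal sketch of that broker/listener idea, assuming Redis as the broker, a channel named "tasks", and a Redis service reachable as "redis" (all arbitrary choices here), the listener below could run as its own long-lived pod:

    import json
    import redis  # redis-py

    r = redis.Redis(host="redis", port=6379)
    pubsub = r.pubsub()
    pubsub.subscribe("tasks")

    def run_task(spec):
        # Placeholder for the actual monitoring/event-listening work.
        print("running task:", spec)

    # Block and handle messages as they are published to the channel.
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        run_task(json.loads(message["data"]))

The Django app would then publish a JSON task description to the "tasks" channel instead of calling a Sanic API.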

Run parallel Python code on multiple AWS instances

I have a Python algorithm that can be parallelized fairly easily.
I don't have the resources locally to run the whole thing in an acceptable time frame.
For each work unit, I would like to be able to:
Launch an AWS instance (EC2?)
Send input data to the instance
Run the Python code with the data as input
Return the result and aggregate it when all instances are done
What is the best way to do this?
Is AWS Lambda used for this purpose? Can this be done only with Boto3?
I am completely lost here.
Thank you
A common architecture for running tasks in parallel is:
Put inputs into an Amazon SQS queue
Run workers on multiple Amazon EC2 instances that:
Retrieve a message from the SQS queue
Process the data
Write results to Amazon S3
Delete the message from the SQS queue (to signify that the job is complete)
You can then retrieve all the results from Amazon S3. Depending on their format, you could even use Amazon Athena to run SQL queries against all the output files simultaneously.
You could even run multiple workers on the same instance if each worker is single-threaded and there is spare CPU available.
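
A minimal sketch of such a worker with boto3; the queue URL, bucket name, and process_data function are placeholders:

    import uuid
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder
    BUCKET = "my-results-bucket"  # placeholder

    def process_data(body):
        # Placeholder for the actual parallelizable algorithm.
        return body.upper()

    while True:
        # Long poll so idle workers don't hammer the SQS API.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            result = process_data(msg["Body"])
            # Write the result to S3 under a unique key.
            s3.put_object(Bucket=BUCKET,
                          Key=f"results/{uuid.uuid4()}.txt",
                          Body=result)
            # Deleting the message marks this work unit as complete.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])

You would run this same script on as many EC2 instances (or as many processes per instance) as you need to finish within your time budget.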

Send task to specific celery worker

I'm building a web application (Using Python/Django) that is hosted on two machines connected to a load balancer.
I have a central storage server, and I have a central Redis server, single celery beat, and two celery workers on each hosting machine.
I receive files from an API endpoint (on either of the hosting machines) and then schedule a task to copy them to the storage server.
The problem is that the task is scheduled using:
task.delay(args)
and then any worker can receive it, while the received files exist on only one of the two machines and have to be copied from it.
I tried to find a unique ID for the worker that I could assign the task to, but didn't find any help in the docs.
Any solution to this? Keep in mind that the number of hosting machines can scale to more than two.
The best solution is to put the task onto a named queue and have each worker look for jobs from their specific queue. So if you have Machine A and Machine B you could have Queue A, Queue B and Queue Shared. Machine A would watch for jobs on Queue A and Queue Shared while Machine B looked for jobs on Queue B and Queue Shared.
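
A rough sketch of that setup with Celery; the broker URL and queue naming scheme (here, the machine's hostname) are placeholders:

    import socket
    from celery import Celery

    app = Celery("proj", broker="redis://redis-host:6379/0")  # placeholder broker URL

    @app.task
    def copy_to_storage(path):
        ...  # copy the local file to the central storage server

    def schedule_copy(path):
        # Route the task to this machine's own queue (named after its hostname),
        # so only the worker running on the machine that holds the file picks it up.
        copy_to_storage.apply_async(args=[path], queue=socket.gethostname())

Each worker would then be started with something like celery -A proj worker -Q <hostname>,shared (substituting that machine's hostname) so it consumes only its own queue plus the shared one.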
The best way to do this is to have a dedicated queue for each worker.
When I was learning Celery I did exactly this, and after a few years I completely abandoned the approach, as it creates more problems than it actually solves.
Instead, I would recommend the following: any resource that you may need to share among tasks should be on a shared filesystem (NFS), or in some sort of in-memory caching service like Redis, KeyDB or memcached. We use a combination of S3 and Redis, depending on the type of resource.
Sure, if you do not really care about scalability, the queue-per-worker approach will work fine.
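
A minimal sketch of the shared-resource approach this answer recommends, assuming S3 as the shared store (the bucket name, broker URL, and task body are placeholders). Because the file is uploaded before the task is queued, any worker on any machine can handle it:

    import boto3
    from celery import Celery

    app = Celery("proj", broker="redis://redis-host:6379/0")  # placeholder broker URL
    s3 = boto3.client("s3")
    BUCKET = "shared-uploads"  # placeholder bucket name

    def handle_upload(file_bytes, filename):
        # Called from the API view on whichever machine received the file.
        key = f"incoming/{filename}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=file_bytes)
        # The file now lives in shared storage, so any worker can run the task.
        copy_to_storage.delay(key)

    @app.task
    def copy_to_storage(key):
        data = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        ...  # move the data onto the central storage server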

speech to text processing - python

I have a project in which the user will send an audio file from android/web to the server.
I need to perform speech-to-text processing on the server and return some files back to the user on Android/web. However, the server side is to be done using Python.
Please guide me as to how this could be done.
Alongside your web application, you can have a queue of tasks that need to be run and worker process(es) to run and track those tasks. This is a popular pattern when web requests need to either start tasks in the background, check in on tasks, or get the result of a task. An introduction to this pattern can be found in the Task Queues section of the Full Stack Python open book. Celery and RQ are two popular projects that supply task queue management and can plug into an existing Python web application, such as one built with Django or Flask.
Once you have task management, you'll have to decide how to keep the user up to date on the status of a task. If you're stuck with having to use RPC-style web service calls only, then you can have clients (e.g. Android or browser) poll for the status by making a call to a web service you've created that checks on the task via your task queue manager's API.
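
For illustration, a minimal sketch of that polling approach with Celery and Flask; the broker/backend URLs, route names, and the transcription logic are placeholders, not tied to any particular speech-to-text library:

    from celery import Celery
    from celery.result import AsyncResult
    from flask import Flask, jsonify, request

    celery_app = Celery("stt", broker="redis://localhost:6379/0",
                        backend="redis://localhost:6379/1")
    app = Flask(__name__)

    @celery_app.task
    def transcribe_task(audio_path):
        # Placeholder: call your actual speech-to-text engine here.
        return "transcribed text for " + audio_path

    @app.route("/transcriptions", methods=["POST"])
    def start_transcription():
        # Persist the uploaded audio somewhere the worker can also read it.
        audio_path = "/tmp/upload.wav"
        request.files["audio"].save(audio_path)
        task = transcribe_task.delay(audio_path)
        return jsonify({"task_id": task.id}), 202

    @app.route("/transcriptions/<task_id>", methods=["GET"])
    def check_transcription(task_id):
        # The Android/web client polls this endpoint until the job is done.
        result = AsyncResult(task_id, app=celery_app)
        if result.ready():
            return jsonify({"status": "done", "text": result.get()})
        return jsonify({"status": "pending"})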
If you want the user to be informed faster or want to reduce wasteful overhead from constant polling, consider supplying a websocket instead. Through a websocket connection, clients could subscribe to notifications of events such as the completion of a speech-to-text job. The Autobahn|Python library provides server code for implementing websockets as well as support for a protocol on top called WAMP that can be used to communicate subscriptions and messages or call upon services. If you need to stick with Django, consider something like django-websocket-redis instead.
