I have a Python algorithm that can be parallelized fairly easily.
I don't have the resources locally to run the whole thing in an acceptable time frame.
For each work unit, I would like to be able to:
Launch an AWS instance (EC2?)
Send input data to the instance
Run the Python code with the data as input
Return the result and aggregate it when all instances are done
What is the best way to do this?
Is AWS Lambda used for this purpose? Can this be done using only Boto3?
I am completely lost here.
Thank you
A common architecture for running tasks in parallel is:
Put inputs into an Amazon SQS queue
Run workers on multiple Amazon EC2 instances that:
Retrieve a message from the SQS queue
Process the data
Write results to Amazon S3
Delete the message from the SQS queue (to signify that the job is complete)
You can then retrieve all the results from Amazon S3. Depending on their format, you could even use Amazon Athena to run SQL queries against all the output files simultaneously.
You could even run multiple workers on the same instance if each worker is single-threaded and there is spare CPU available.
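A minimal worker sketch along these lines, using Boto3. The queue URL, bucket name, and the `process` step are all placeholders; the real computation goes where the doubling is:

```python
import json

def process(payload):
    # Placeholder for the real per-unit computation; here we just
    # double a number so the flow is easy to follow.
    return {"result": payload["value"] * 2}

def run_worker(queue_url, bucket):
    import boto3  # AWS SDK for Python; imported lazily
    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling to cut empty responses
        )
        for msg in resp.get("Messages", []):
            output = process(json.loads(msg["Body"]))
            s3.put_object(
                Bucket=bucket,
                Key="results/%s.json" % msg["MessageId"],
                Body=json.dumps(output),
            )
            # Delete only after the result is safely in S3, so a
            # crashed worker's message reappears for another worker.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```

Because the message is only deleted after the result lands in S3, a worker that dies mid-task simply lets the message's visibility timeout expire and another worker picks it up.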
Related
I have a deployed & functioning Django application that uses celery workers to execute background tasks. We are looking at the possibility of switching to AWS Lambda for running our background tasks instead to help save on costs, but are unsure how we would structure our project.
Current Structure
The app currently consists of an Elastic Beanstalk app with an EC2 instance running the web server and Celery Beat (for some regularly-scheduled tasks), and a separate EC2 Celery worker instance that executes the tasks. In the app's current configuration, I am using Celery with Amazon's SQS to trigger functions executed by the worker instance.
Key to our application is that we may receive several items to be queued up at once. Multiple queue items can execute at the same time, but only one item from each user can execute concurrently. We have accomplished this by using named queues for each user and configuring Celery to only run one task from any given queue at a time.
Questions / Desired Structure
We would like to transition to using AWS lambda functions to execute our background tasks, because we have significant gaps in application usage (we typically get tasks in large groups) and this can save on costs. Our biggest question is whether there is a way to "categorize" lambda invocations such that we can have multiple functions executing at once, but only one executing from each category. Lambda seems to have features for managing concurrency, but only per-function and not with any equivalent to the multiple queues that we are using currently.
It sounds like you may want to leverage AWS SQS FIFO queues.
You can set the MessageGroupId to each user's ID; that way you will be able to process different users' messages in parallel.
Per the documentation:
Receiving messages
You can't request to receive messages with a specific message group ID.
So you won't be able to filter by user ID.
When receiving messages from a FIFO queue with multiple message group IDs, Amazon SQS first attempts to return as many messages with the same message group ID as possible. This allows other consumers to process messages with a different message group ID. When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.
This explains that SQS FIFO processes messages by MessageGroupId: until you process and delete the messages with a given MessageGroupId from the queue (the AWS Lambda trigger does this automatically when the function succeeds), it won't deliver any more messages from that group.
On the other hand, it can still process other messages with a different MessageGroupId.
Messages that belong to the same message group are always processed one by one, in a strict order relative to the message group (however, messages that belong to different message groups might be processed out of order).
I recommend you run a mini load test with messages in the queue to simulate the behavior you expect and see if it fits your needs!
You can do a simple SQS FIFO that triggers a lambda with maximum concurrency allowed (1000 for standard accounts).
Then push messages using a script, with different MessageGroupIds and in random order, and see how they are processed.
This won't cost much since it's serverless, and it will save you a lot of time and trouble in the future (probably money too).
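A sketch of such a load-test script, assuming a FIFO queue already exists (the queue URL and user IDs are placeholders). It builds a shuffled batch with one message group per user, then sends it with Boto3:

```python
import json
import random
import uuid

def make_entries(user_ids, per_user=3):
    """Build shuffled send_message_batch entries, one message group
    per user, in random order to mimic real traffic."""
    entries = []
    for uid in user_ids:
        for i in range(per_user):
            entries.append({
                "Id": uuid.uuid4().hex,
                "MessageBody": json.dumps({"user": uid, "task": i}),
                "MessageGroupId": uid,  # serializes work per user
                "MessageDeduplicationId": uuid.uuid4().hex,
            })
    random.shuffle(entries)
    return entries

def push(queue_url, user_ids):
    import boto3  # AWS SDK; only needed when actually sending
    sqs = boto3.client("sqs")
    entries = make_entries(user_ids)
    for i in range(0, len(entries), 10):  # batch API caps at 10
        sqs.send_message_batch(QueueUrl=queue_url,
                               Entries=entries[i:i + 10])
```

Watching the consuming Lambda's logs after `push(...)` should show messages from different groups interleaving while each group stays strictly ordered.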
I have a Python script that needs to be scheduled to run once a day. It will take around 4-6GB of memory (due to a large number of dataframe operations). I will be using AWS and I would like to know the best practice for handling such a task. Is it a good idea to put it in a container like Docker before deployment?
Since your data needs to fit in RAM, I'd recommend using a memory-optimized EC2 instance with a CloudWatch event.
To minimize cost, however, you don't want this EC2 instance running the whole day, so you can have a couple of Lambda functions sitting between CloudWatch and the EC2 instance to:
start the EC2 instance once the daily trigger runs, and
stop the EC2 instance with a trigger from your Python code that runs once it's finished
If that doesn't make much sense let me know and I'll try and elaborate with a diagram.
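A sketch of the two Lambda handlers, assuming the instance's ID is supplied via an environment variable (`INSTANCE_ID` is a hypothetical name):

```python
import os

def _ec2():
    import boto3  # AWS SDK; imported lazily
    return boto3.client("ec2")

def start_handler(event, context):
    # Triggered by the daily CloudWatch Events rule.
    _ec2().start_instances(InstanceIds=[os.environ["INSTANCE_ID"]])

def stop_handler(event, context):
    # Invoked by the script on the instance once it has finished,
    # e.g. via a final boto3 lambda.invoke call.
    _ec2().stop_instances(InstanceIds=[os.environ["INSTANCE_ID"]])
```

Each handler would be deployed as its own Lambda function, with an IAM role allowing `ec2:StartInstances` / `ec2:StopInstances` on that instance.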
(Ubuntu 12.04.) I envision some sort of queue to hold thousands of tasks, and 100 EC2 instances plowing through it in parallel, where each instance handles one task from the queue at a time.
Also, each EC2 instance should use the image I provide, which will have the binaries and software installed on it.
Essentially what I am trying to do is run 100 processes (a Python function using packages that depend on binaries installed on that image) in parallel on Amazon's EC2 for an hour or less, shut them all off, and repeat the process whenever needed.
Is this doable? I am using Python Boto to do this.
This is doable. You should look into using SQS. Jobs are placed on a queue and the worker instances pop jobs off the queue and perform the appropriate work. As a job is completed, the worker deletes the job from the queue so no job is run more than once.
You can configure your instances using user-data at boot time or you can bake AMIs with all of your software pre-installed. I recommend Packer for baking AMIs as it works really well and is very scriptable so your AMIs can be rebuilt consistently as things need to be changed.
For turning on and off lots of instances, look into using AutoScaling. Simply set the group's desired capacity to the number of worker instances you want running and it will take care of the rest.
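A minimal sketch of that scale-up/scale-down step with Boto3 (the group name is a placeholder; the Auto Scaling group and its launch configuration are assumed to exist already):

```python
def set_worker_count(group_name, count):
    """Scale the worker fleet by adjusting the Auto Scaling group's
    desired capacity; AutoScaling launches or terminates instances
    to match."""
    import boto3  # AWS SDK; imported lazily
    asg = boto3.client("autoscaling")
    asg.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=count,
        HonorCooldown=False,  # apply the change immediately
    )

# Usage sketch: spin up the fleet, and later shut it all down.
# set_worker_count("image-workers", 100)
# set_worker_count("image-workers", 0)
```

Setting the desired capacity back to 0 when the queue is drained is what turns "100 instances for an hour" into an on-demand, repeatable operation.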
This sounds like it might be easier to do with EMR.
You mentioned in comments that you are doing computer vision. You can make your job Hadoop-friendly by preparing a file where each line is a base64 encoding of an image file.
You can prepare a simple bootstrap script to make sure each node of the cluster has your software installed. Hadoop streaming will allow you to use your image processing code as is for the job (instead of rewriting in java).
When your job is over, the cluster instances will be shut down. You can also specify that your output be streamed directly to an S3 bucket; it's all baked in. EMR is also cheap: 100 m1.medium EC2 instances running for an hour will only cost you around 2 dollars according to the most recent pricing: http://aws.amazon.com/elasticmapreduce/pricing/
I occasionally have really high-CPU intensive tasks. They are launched into a separate high-intensity queue, that is consumed by a really large machine (lots of CPUs, lots of RAM). However, this machine only has to run about one hour per day.
I would like automate deployment of this image on AWS, to be triggered by outstanding messages in the high-intensity queue, and then safely stopped once it is not busy. Something along the lines of:
Some agent (presumably my own software running on my monitor server) checks the queue size and determines there are x > x_threshold new jobs to be done (e.g. I want to trigger if there are 5 outstanding "big" jobs)
A specific AWS instance is started, registers itself with the broker (RabbitMQ) and consumes the jobs
Once the worker has been idle for some t > t_idle (say, longer than 10 minutes), the machine is shut down.
Are there any tools that can I use for this, to ease the automation process, or am I going to have to bootstrap everything myself?
You can publish a custom metric to AWS CloudWatch, then set up an autoscaling trigger and scaling policy based on your custom metric. Autoscaling can start the instance for you and will kill it based on your policy. You'll have to include the appropriate user data in the launch configuration to bootstrap your host. Just like user data for any EC2 instance, it could be a bash script, an Ansible playbook, or whatever your configuration management tool of choice is.
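A sketch of the metric-publishing side with Boto3 (the namespace and metric name are hypothetical; you'd run this periodically, e.g. from cron on the monitor server):

```python
def queue_depth(sqs, queue_url):
    """Read the approximate number of visible messages on a queue.
    The client is passed in so this helper is easy to test."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def publish_depth(queue_url):
    import boto3  # AWS SDK; imported lazily
    sqs = boto3.client("sqs")
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="HighIntensityQueue",  # hypothetical namespace
        MetricData=[{
            "MetricName": "QueueDepth",
            "Value": queue_depth(sqs, queue_url),
            "Unit": "Count",
        }],
    )
```

The scaling policy would then alarm on `QueueDepth >= 5` to set the group's desired capacity to 1, and on `QueueDepth == 0` (sustained for your idle window) to set it back to 0.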
Maybe overkill for your scenario, but as a starting point you may want to check out AWS OpsWorks.
http://aws.amazon.com/opsworks/
http://aws.amazon.com/opsworks/faqs/
If that is indeed a bit higher-level than you need, you could use AWS CloudFormation, which is perhaps a bit 'closer to the metal' for what you want.
http://aws.amazon.com/cloudformation/
I am using SQS as a 'bridge' in my system: it receives tasks from GAE, and they are processed on EC2. Currently, I am able to add tasks to this queue from GAE, but I'm having some difficulty figuring out how to consume them on EC2. So my questions are:
What are the recommended ways of building a (Python-based) task consumer on EC2 that keeps an eye on SQS and assigns new inbound jobs to workers?
Does AWS has this type of SQS monitoring product? If not, is Celery's period task a good candidate?
The task consumer is just a (long) poller. The boto Python library has SQS support. AWS provides the SQS service, but they don't make the consumers. I'm not familiar with Celery.
Standard practice is to poll for a message (which marks it as 'invisible'), then perform the action at the consumer end. When the action completes, delete the message. If the action fails because one of your compute nodes disappeared, the message will become visible again after a period of time and will be picked up in a future poll. If you need something smarter, you may want to implement an external ESB or experiment with AWS SWF.
http://boto.readthedocs.org/en/latest/ref/sqs.html
http://aws.amazon.com/swf/
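A minimal sketch of that poll/act/delete cycle. The SQS client is passed in so the function is testable; `action` stands in for your job handler:

```python
def consume_one(sqs, queue_url, action):
    """Long-poll for one message, run the action, and delete the
    message only on success. If the action raises, the message stays
    invisible until its visibility timeout expires, then reappears
    for a future poll."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long poll instead of busy-waiting
    )
    for msg in resp.get("Messages", []):
        action(msg["Body"])  # a crash here leaves the message queued
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
        return True
    return False
```

In production you would loop over `consume_one` with `sqs = boto3.client("sqs")`, one loop per worker process.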
You can use the AWS Elastic Beanstalk service to consume the tasks in the queue.
AWS Beanstalk with SQS
If you don't want to break down your code to run within Beanstalk, you can write some code for Beanstalk to pull an item off the queue and then send it to your EC2 server, essentially making Beanstalk hand out the messages/tasks in the queue. This would remove the need for your EC2 server to constantly poll the queue.