AWS EC2: How to process queue with 100 parallel ec2 instances? - python

(ubuntu 12.04). I envision some sort of Queue to put thousands of tasks, and have the ec2 isntances plow through it in parallel (100 ec2 instances) where each instance handles one task from the queue.
Also, each ec2 instance to use the image I provide, which will have the binaries and software installed on it for use.
Essentially what I am trying to do is, run 100 processing (a python function using packages that depend on binaries installed on that image) in parallel on Amazon's EC2 for an hour or less, shut them all off, and repeat this process whenever it is needed.
Is this doable? I am using Python Boto to do this.

This is doable. You should look into using SQS. Jobs are placed on a queue and the worker instances pop jobs off the queue and perform the appropriate work. As a job is completed, the worker deletes the job from the queue so no job is run more than once.
You can configure your instances using user-data at boot time or you can bake AMIs with all of your software pre-installed. I recommend Packer for baking AMIs as it works really well and is very scriptable so your AMIs can be rebuilt consistently as things need to be changed.
For turning on and off lots of instances, look into using AutoScaling. Simply set the group's desired capacity to the number of worker instances you want running and it will take care of the rest.

This sounds like it might be easier to with EMR.
You mentioned in comments you are doing computer vision. You can make your job hadoop friendly by preparing a file where each line a base64 encoding of the image file.
You can prepare a simple bootstrap script to make sure each node of the cluster has your software installed. Hadoop streaming will allow you to use your image processing code as is for the job (instead of rewriting in java).
When your job is over, the cluster instances will be shut down. You can also specify your output be streamed directly to an S3 bucket, its all baked in. EMR is also cheap, 100 m1.medium EC2 instances running for an hour will only cost you around 2 dollars according to the most recent pricing: http://aws.amazon.com/elasticmapreduce/pricing/

Related

Is there a way to execute distributed code in Python on AWS?

I created a scraper in Python that is navigating a website. It pulls many links and then It has to visit every link pull the data and parse and store the result.
Is there an easy way to run that script distributed in the cloud (like AWS)?
Ideally, I would like something like this (probably is more difficult, but just to give an idea)
run_in_the_cloud --number-of-instances 5 scraper.py
after the process is done, the instances are killed, so it does not cost more money.
I remember I was doing something similar with hadoop and java with mapreduce long time ago.
If you can put your scraper in a docker image it's relatively trivial to run and scale dockerized applications using AWS ECS Fargate. Just create a task definition and point it at your container registry, then submit runTask requests for however many instances you want. AWS Batch is another tool you could use to trivially parallelize container instances too.

Run parallel Python code on multiple AWS instances

I have a Python algorithm that can be parallelized fairly easily.
I don't have the resources locally to run the whole thing in an acceptable time frame.
For each work unit, I would like to be able to:
Launch an AWS instance (EC2?)
Send input data to the instance
Run the Python code with the data as input
Return the result and aggregate it when all instances are done
What is the best way to do this?
Is AWS Lambda used for this purpose? Can this be done only with Boto3?
I am completely lost here.
Thank you
A common architecture for running tasks in parallel is:
Put inputs into an Amazon SQS queue
Run workers on multiple Amazon EC2 instances that:
Retrieve a message from the SQS queue
Process the data
Write results to Amazon S3
Delete the message from the SQS queue (to signify that the job is complete)
You can then retrieve all the results from Amazon S3. Depending on their format, you could even use Amazon Athena to run SQL queries against all the output files simultaneously.
You could even run multiple workers on the same instance if each worker is single-threaded and there is spare CPU available.

Scheduled Python Script to run on AWS that has high memory use

I have this python script that needs to be scheduled to run once a day. It will take around 4-6GB of memory (due to large amount of dataframe operations). I will be using AWS and I would like to what is the best practice to handle such task. Is it a good idea to put it in a container like docker before deployment?
Since your memory needs to be on ram then I'd recommend using a memory optimized ec2 instance with a CloudWatch event.
To minimize the cost, however, you don't want to have this EC2 running the whole day so what you can do is have a couple of lambda functions sitting between CloudWatch and the EC2 to :
start the ec2 instance once the daily trigger runs and
stop the ec2 instance with a trigger from your Python code that runs once it's finished
If that doesn't make much sense let me know and I'll try and elaborate with a diagram.

Autoscale Python Celery with Amazon EC2

I have a Celery Task-Manager to crunch some numbers for company analytics.
The Task-Manager and workers are hosted on an Amazon EC2 Linux Server.
I need to set up the system such if we send too many tasks to celery Amazon automatically sets up a new EC2 instance to run more workers and balances the load across these workers.
The services that I'm aware exist are the Amazon Autoscale and Amazon Load balancing services which seem like exactly what I want to use however, I'm not sure what the best way to configure the Celery is.
I think that I ought to have a celery "master" which is collecting all the tasks and a number of celery workers which execute them. As the number of tasks increases I want to add more workers. The way the autoscale works (by taking an AMI of the celery server) I think that I'm currently cloning the Master as well as the workers which seems like not what I want to do.
How do I organise this to achieve my end goal which is flexible autoscaling task management using Celery to manage the tasks and Amazon Web Service to host the computing.
As much detail as possible in any answers (or links to tutorials!) would be greatly appreciated as most tutorials or advice seems to assume large quantities of knowledge which I don't currently have!
You do not need a master-worker architecture to get this to work. If I understand your question correctly, you want to be able to scale based on queue size. I would say it will be easier if you have the following steps
Setup elasticache/sqs for the broker (since you're in aws)
For custom scaling - A periodic task which checks queue sizes using something like this OR add amazon autoscaling to just add/remove machines when CPU usage is high (assuming that that is a good enough indication of load). Also, start workers with --autoscale so that the CPU usage gets reflected correctly.

Managing workers on AWS

I occasionally have really high-CPU intensive tasks. They are launched into a separate high-intensity queue, that is consumed by a really large machine (lots of CPUs, lots of RAM). However, this machine only has to run about one hour per day.
I would like automate deployment of this image on AWS, to be triggered by outstanding messages in the high-intensity queue, and then safely stopped once it is not busy. Something along the lines of:
Some agent (presumably my own software running on my monitor server) checks the queue size, determines there are x > x_threshold new jobs to be done (e.g. I want to trigger if there are 5 outstanding "big" jobs")
A specific AWS instance is started, registers itself with the broker (RabbitMQ) and consumes the jobs
Once the worker has been idle for some t > t_idle (say, longer than 10 minutes), the machine is shut down.
Are there any tools that can I use for this, to ease the automation process, or am I going to have to bootstrap everything myself?
You can public a custom metric to AWS CloudWatch, then set up an autoscale trigger and scaling policy based on your custom metrics. Autoscale can start the instance for you and will kill it based on your policy. You'll have to include the appropriate user data in the launch configuration to bootstrap your host. Just like userdata for any EC2 instance, it could be a bash script or ansible playbook or whatever your config management tool of choice is.
Maybe overkill for your scenario, but as a starting point you may want to check out AWS OpsWorks.
http://aws.amazon.com/opsworks/
http://aws.amazon.com/opsworks/faqs/
if that is indeed a bit higher level than you need, you could use aws cloudformation - perhaps a bit 'closer to the metal' for what you want.
http://aws.amazon.com/cloudformation/

Categories

Resources