EC2 run_instances: Many instances, slightly different startup scripts? - python

I'm doing an embarrassingly parallel operation on Amazon Web Services, in which I'm spinning up a large number of EC2 instances that all have slightly different scripts to run on startup. Currently, I'm starting up each instance individually within a for loop like so (I'm using the Python boto package to talk to AWS):
for parameters in parameter_list:
    # Create this instance's startup script
    user_data = startup_script % parameters
    # Run this instance
    reservation = ec2.run_instances(ami,
                                    key_name=key_name,
                                    security_groups=[group_name],
                                    instance_type=instance_type,
                                    user_data=user_data)
However, this takes too long. ec2.run_instances allows one to start many instances at once, using the max_count keyword. I would like to create many instances simultaneously, passing each one its own unique startup script (user_data). Is there any way to do this? One cannot just pass a list of scripts as user_data.
One option would be to pass the same startup script, but have the script reference another piece of data associated with that instance. EC2's tag system could work, but I don't know of a way to assign tags in a similarly parallel fashion. Is there any kind of instance-specific data I can assign to a set of instances in parallel?
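For illustration, a rough sketch of how that tag-based variant might look with boto: launch the whole batch in one run_instances call with a single generic start-up script, then tag each instance with its own parameters afterwards. The generic_startup_script variable and the "param" tag key are hypothetical, and each create_tags call is a separate, fast API request rather than a truly parallel one:

# Sketch only: launch N identical instances, then attach per-instance data as tags.
n = len(parameter_list)
reservation = ec2.run_instances(ami,
                                min_count=n,
                                max_count=n,
                                key_name=key_name,
                                security_groups=[group_name],
                                instance_type=instance_type,
                                user_data=generic_startup_script)  # same script for all

# Tag each instance with its own parameters; the generic start-up script can
# then look up its tag (e.g. via describe_tags, assuming the instance has
# credentials to call the EC2 API).
for instance, parameters in zip(reservation.instances, parameter_list):
    ec2.create_tags([instance.id], {"param": str(parameters)})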

AFAIK, there is no simple solution. How about using Simple Queue Service (SQS)?
Add the start-up scripts (aka user_data) to SQS
Write the user_data so that each instance:
    reads a start-up script from SQS and runs it
If your script is larger than 256 KB, you cannot add it to SQS directly. In that case, try this procedure:
Add the start-up scripts to S3
Add the S3 URL of each script to SQS
Write the user_data so that each instance:
    reads a URL from SQS
    downloads the script from S3
    runs it
Sorry, it's rather complicated. Hope this helps.
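A minimal sketch of that idea using boto3 (the queue name is a placeholder, and parameter_list / startup_script are taken from the question; this is not the answer's exact code):

import subprocess
import boto3

sqs = boto3.resource("sqs", region_name="us-east-1")
queue = sqs.create_queue(QueueName="instance-startup-scripts")  # hypothetical name

# Master side: enqueue one start-up script per instance.
for parameters in parameter_list:
    queue.send_message(MessageBody=startup_script % parameters)

# Instance side (run by a shared user_data bootstrap): pull one script,
# delete it so no other instance picks it up, then execute it.
messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20)
if messages:
    script = messages[0].body
    messages[0].delete()
    with open("/tmp/startup.sh", "w") as f:
        f.write(script)
    subprocess.run(["bash", "/tmp/startup.sh"], check=True)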

Simple. Fork just before you initialize each node.
newPid = os.fork()
if newPid == 0:
    is_master = False
    # Create the instance
    ...blah blah blah...
else:
    logging.info('Launched host %s ...' % hostname)
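A slightly fuller sketch of that approach, reusing the variables from the question (note that in practice each child probably wants its own EC2 connection rather than sharing the parent's):

import logging
import os

children = []
for parameters in parameter_list:
    pid = os.fork()
    if pid == 0:
        # Child process: launch one instance with its own user_data, then exit.
        user_data = startup_script % parameters
        ec2.run_instances(ami,
                          key_name=key_name,
                          security_groups=[group_name],
                          instance_type=instance_type,
                          user_data=user_data)
        os._exit(0)
    else:
        # Parent process: remember the child and move on to the next launch.
        children.append(pid)
        logging.info('Forked child %d for parameters %s', pid, parameters)

# Wait for all children so the parent does not exit before the launches finish.
for pid in children:
    os.waitpid(pid, 0)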

Related

Run two different instances of the same script with different configuration

I have a simple script that is responsible for fetching data from an external API; let's call it connector.py.
That script takes some params as input, does its job, writes the result to a file, and returns the output.
I want to implement a scheduler that creates and manages two instances of that script, each with its own input (different settings), and makes them run at configured intervals, with the following constraints:
Input: pass the connector's parameters from the settings to the sub-process via the stdin channel (not as process args)
Output: pass the connector's output from the sub-process to the service via the stdout channel
I have to implement the constant loop cycle myself (not use a scheduler, for example)
What mechanism should I use in order to achieve that goal: processes, threads, or sub-processes?
I'm mainly struggling to understand how to deal with the stdin/stdout issue for the different connector instances.
Any advice would be appreciated.
You have two possibilities for scheduling the tasks.
Make your script a factory that keeps running until something stops it. That way you can choose either threads or processes (subprocess uses processes). Here is a little description of threads and processes (if I used this method, I would use sub-processes):
What is the difference between a process and a thread?
https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
However, I don't see the utility of using threads or subprocesses in your case, because you're telling us that you will make them run at configured intervals. You can just integrate the connectors into your program and run them separately.
For task scheduling you can also use cronjobs. They allow the execution of commands depending on the date, repetition, user, etc. Here are some details on how to set up a cronjob:
https://phoenixnap.com/kb/set-up-cron-job-linux
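For the stdin/stdout part specifically, a minimal sketch with subprocess might look like this (connector.py, the JSON-encoded settings, and the 60-second interval are assumptions for illustration, not part of the answer):

import json
import subprocess
import time

# Two hypothetical connector configurations; the real parameters would come
# from your settings.
settings = [
    {"name": "connector-a", "endpoint": "https://api.example.com/a"},
    {"name": "connector-b", "endpoint": "https://api.example.com/b"},
]

while True:  # the hand-rolled loop required by the constraints
    for conf in settings:
        proc = subprocess.Popen(
            ["python", "connector.py"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        # Pass the parameters over stdin, read the result from stdout.
        out, _ = proc.communicate(input=json.dumps(conf))
        print(conf["name"], "returned:", out.strip())
    time.sleep(60)  # configured interval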

Trying to create the same empty file using AWS s3.put_object from more than one process in parallel using python

Suppose two processes (it can be any number) are trying to run the same block of code in parallel. To prevent that, I am trying to create an empty file in an S3 bucket: if that file exists, any other process that wants to use the block of code has to wait until the first process is done with it. After it finishes, the first process deletes the empty file, which means the second process can now use the block of code by creating the empty file and holding the lock.
import boto3

s3 = boto3.client('s3')

def create_obj(bucket, file):
    s3.put_object(Bucket=bucket, Key=file)
    return "file created"

job1 = create_obj(bucket="s3bucketname", file='xyz/empty_file.txt')
job2 = create_obj(bucket="s3bucketname", file='xyz/empty_file.txt')
Suppose job1 and job2 call create_obj in parallel to create empty_file.txt, which means they hit the s3.put_object line at the same time; one of the jobs then has to wait. Any number of jobs may call the create_obj function in parallel, and we need to make sure those jobs execute properly as explained above.
Please help me with this.
My understanding is that you're trying to implement a distributed locking mechanism on top of S3.
Since the updates to the S3 consistency model at re:Invent 2020 this could be possible, but you could also use a service like DynamoDB for it, which makes building such a lock easier.
I recommend you check out this blog post on the AWS blog, which describes the process. Since link-only answers are discouraged here, I'll try to summarize the idea, but you should really read the full article.
You do a conditional PutItem call for a lock item in the table. The condition is that no item with that key exists, or that the existing one has expired. The new item contains:
The name of the lock (partition key)
How long the lock is supposed to be valid
A timestamp of when the lock was created
Some identifier that identifies your system (the locking entity)
If that put succeeds, you know you have acquired the lock; if it fails, you know the lock is already held and you can retry later.
You then perform your work.
In the end, you remove the lock item.
There's a Python implementation of this as well: python-dynamodb-lock on PyPI.
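A minimal boto3 sketch of that conditional-put idea (the "locks" table, its "lock_name" partition key, and the attribute names are assumptions for illustration; the python-dynamodb-lock package wraps this more robustly):

import time
import uuid
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("locks")  # hypothetical table with partition key "lock_name"

def acquire_lock(lock_name, ttl_seconds=60):
    now = int(time.time())
    try:
        table.put_item(
            Item={
                "lock_name": lock_name,
                "owner": str(uuid.uuid4()),       # identifies the locking entity
                "created_at": now,                # when the lock was created
                "expires_at": now + ttl_seconds,  # how long the lock is valid
            },
            # Succeeds only if no lock item exists or the existing one has expired.
            ConditionExpression="attribute_not_exists(lock_name) OR expires_at < :now",
            ExpressionAttributeValues={":now": now},
        )
        return True   # lock acquired, do your work
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already locked, retry later
        raise

def release_lock(lock_name):
    # Remove the lock item when the work is done.
    table.delete_item(Key={"lock_name": lock_name})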

Ansible script to spin up a failed instance in AWS EC2

I have a use case where I have an EC2 instance with Fedora Linux and some applications running. When this instance fails, I have to spin up a new instance with the same OS and install the applications. I am trying to do this in Ansible (and Python); I'm a complete novice and have no idea how to do it.
For my applications, I have a variable (a structure) that tells me how many of each type of server I need in each of three subnets. The environment creation playbook loops through that structure and builds however many are needed to fill the requirements. So if I need five (5) and only have three (3), it builds two (2). I use exact_count for that in the ec2 module.
So if one fails, I can delete that instance and re-run my create playbook, which will also re-write all the configuration files the other servers use to communicate with each other. For instance, if I delete a JBoss server and create a new one to replace it, the load balancer has to know about it.
Good practice here would be to have a base image that covers what you need, use that as a feeder for an AMI, and then plug it into an Auto Scaling group. As part of the Auto Scaling group, you can use user-data to load specific updates etc. onto the instance at boot time.
An Auto Scaling group with min 1 / max 1 will do exactly what you want, if you can configure it the above way.
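A rough boto3 sketch of that pattern (the AMI ID, instance type, subnet, and resource names are placeholders, and this illustrates the Auto Scaling setup in Python rather than the Ansible modules the question asks about):

import base64
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# Boot-time customisation applied on top of the baked base AMI.
user_data = """#!/bin/bash
dnf -y update
# ...install/refresh application-specific bits here...
"""

ec2.create_launch_template(
    LaunchTemplateName="fedora-app-template",           # placeholder name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",             # your baked AMI
        "InstanceType": "t3.medium",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)

# min 1 / max 1: the group replaces the instance automatically if it fails.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="fedora-app-asg",              # placeholder name
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    LaunchTemplate={"LaunchTemplateName": "fedora-app-template", "Version": "$Latest"},
    VPCZoneIdentifier="subnet-0123456789abcdef0",       # your subnet(s)
)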

Use python to shut down instance script runs on

I am running machine learning scripts that take a long time to finish. I want to run them on AWS on a faster processor and stop the instance when it finishes.
Can boto be used within the running script to stop its own instance? Is there a simpler way?
If your EC2 instance is running Linux, you can simply issue a halt or shutdown command from within the instance to stop it. This allows you to shut down your EC2 instance without requiring IAM permissions.
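For example, a minimal sketch of ending the script that way (this assumes the script can run shutdown, e.g. via passwordless sudo, and that the default instance-initiated shutdown behaviour of "stop" applies):

import subprocess

# ...long-running machine learning work...

# Halt the OS from within the instance; with an EBS-backed instance and the
# default shutdown behaviour this stops (not terminates) the instance.
subprocess.run(["sudo", "shutdown", "-h", "now"], check=True)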
See Creating a Connection on how to create a connection. Never tried this one before, so use caution. Also make sure the instance is EBS-backed; an instance store-backed instance cannot be stopped, only terminated.
import boto.ec2
import boto.utils
conn = boto.ec2.connect_to_region("us-east-1") # or your region
# Get the current instance's id
my_id = boto.utils.get_instance_metadata()['instance-id']
conn.stop_instances(instance_ids=[my_id])

BOTO distribute scraping tasks among AWS

I have 200,000 URLs that I need to scrape from a website. This website has a very strict scraping policy and will block you if the scraping frequency goes above 10 per minute. So I need to control my pace, and I am thinking about starting a few AWS instances (say 3) to run in parallel.
In this way, the estimated time to collect all the data will be:
200,000 URLs / (10 URLs/min) = 20,000 min (one instance only)
20,000 min / 3 instances ≈ 6,700 min ≈ 4.6 days (three instances)
which is a reasonable amount of time to get my work done.
However, I am thinking about building a framework using boto, so that I have a block of code and a queue of input (a list of URLs, in this case). Meanwhile, I also don't want to do any damage to their website, so I only want to scrape during the night and on weekends. So I am thinking all of this should be controlled from one box.
And the code should look similar to this:
class worker(job, queue):
    url = queue.pop()
    aws = new AWSInstance()
    result = aws.scrape(url)
    return result

worker1 = new worker()
worker2 = new worker()
worker3 = new worker()

worker1.start()
worker2.start()
worker3.start()
The code above is totally pseudo and my idea is to pass the work to AWS.
Question:
(1) How do I use boto to pass variables/arguments to another AWS instance, start a script to work on those variables, and... use boto to retrieve the result back to the master box?
(2) What is the best way to schedule a job only for a specific time period inside Python code?
Say only work from 6:00 pm to 6:00 am every day... I don't think the Linux crontab will fit my need in this situation.
Sorry if my question is more verbally descriptive and philosophical. Even if you can just offer a hint or throw out a package/library name that meets my need, I will be greatly appreciative!
Question: (1) How to use boto to pass the variable/argument to another AWS instance and start a script to work on those variables
Use a shared datasource, such as DynamoDB, or a messaging framework such as SQS.
and .. use boto to retrieve the result back to the master box.
Again, a shared datasource or messaging.
(2) What is the best way to schedule a job only on a specific time period inside Python code? Say only work from 6:00 pm to 6:00 am every day... I don't think the Linux crontab will fit my need in this situation.
I think crontab fits well here.
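A small sketch combining both suggestions: the master box pre-fills an SQS queue with URLs, and each worker instance polls the queue but only inside the allowed time window (the queue name, the 18:00-06:00 window, and the scrape() stub are placeholders):

import datetime
import time
import boto3

sqs = boto3.resource("sqs", region_name="us-east-1")
queue = sqs.get_queue_by_name(QueueName="urls-to-scrape")  # filled by the master box

def scrape(url):
    # Placeholder for your actual scraping logic.
    return "scraped " + url

def in_work_window(now=None):
    # Only work between 18:00 and 06:00 local time.
    now = now or datetime.datetime.now()
    return now.hour >= 18 or now.hour < 6

while True:
    if not in_work_window():
        time.sleep(300)  # outside the window: check again in five minutes
        continue
    messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in messages:
        result = scrape(msg.body)
        print(result)  # in real use, push this to DynamoDB or a results queue
        msg.delete()
    time.sleep(6)  # ~10 requests per minute to respect the rate limit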
