Python 2.7.12
boto3==1.3.1
How can I add a step to a running EMR cluster and have the cluster terminated after the step is complete, regardless of whether it fails or succeeds?
Create the cluster
response = client.run_job_flow(
    Name=name,
    LogUri='s3://mybucket/emr/',
    ReleaseLabel='emr-5.9.0',
    Instances={
        'MasterInstanceType': instance_type,
        'SlaveInstanceType': instance_type,
        'InstanceCount': instance_count,
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'KeyPair',
        'EmrManagedSlaveSecurityGroup': 'sg-1234',
        'EmrManagedMasterSecurityGroup': 'sg-1234',
        'Ec2SubnetId': 'subnet-1q234',
    },
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hadoop'}
    ],
    BootstrapActions=[
        {
            'Name': 'Install Python packages',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/code/spark/bootstrap_spark_cluster.sh'
            }
        }
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    Configurations=[
        {
            'Classification': 'spark',
            'Properties': {
                'maximizeResourceAllocation': 'true'
            }
        },
    ],
)
Add a step
response = client.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': 'Run Step',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',
                    '--py-files',
                    's3://mybucket/code/spark/spark_udfs.py',
                    's3://mybucket/code/spark/{}'.format(spark_script),
                    '--some-arg'
                ],
                'Jar': 'command-runner.jar'
            }
        }
    ]
)
This successfully adds a step and runs it. However, when the step completes, I would like the cluster to auto-terminate, as the AWS CLI supports: http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
In your case (creating the cluster using boto3) you can add the flags 'TerminationProtected': False, 'AutoTerminate': True to your cluster creation. That way, the cluster will shut down after your step finishes running.
Another solution is to add a second step that kills the cluster immediately after the step you want to run. Basically, you need to run this command as a step:
aws emr terminate-clusters --cluster-ids your_cluster_id
The tricky part is to retrieve the cluster_id.
You can find some solutions here: Does an EMR master node know it's cluster id?
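If you take the kill-step route, here is a minimal sketch of what that final step's script could look like, assuming it runs on the master node (where EMR writes the cluster id to /mnt/var/lib/info/job-flow.json) and the EC2 instance role is allowed to call elasticmapreduce:TerminateJobFlows:

# terminate_self.py - submitted as the last step of the job flow
import json

import boto3

# /mnt/var/lib/info/job-flow.json exists on EMR master nodes and
# carries the cluster id under the 'jobFlowId' key
with open('/mnt/var/lib/info/job-flow.json') as f:
    cluster_id = json.load(f)['jobFlowId']

boto3.client('emr').terminate_job_flows(JobFlowIds=[cluster_id])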
The 'AutoTerminate': True parameter suggested above did not work for me. However, it worked when I changed the 'KeepJobFlowAliveWhenNoSteps' parameter from True to False. Your code should then look like the following:
response = client.run_job_flow(
    Name=name,
    LogUri='s3://mybucket/emr/',
    ReleaseLabel='emr-5.9.0',
    Instances={
        'MasterInstanceType': instance_type,
        'SlaveInstanceType': instance_type,
        'InstanceCount': instance_count,
        'KeepJobFlowAliveWhenNoSteps': False,
        'Ec2KeyName': 'KeyPair',
        'EmrManagedSlaveSecurityGroup': 'sg-1234',
        'EmrManagedMasterSecurityGroup': 'sg-1234',
        'Ec2SubnetId': 'subnet-1q234',
    },
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hadoop'}
    ],
    BootstrapActions=[
        {
            'Name': 'Install Python packages',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/code/spark/bootstrap_spark_cluster.sh'
            }
        }
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    Configurations=[
        {
            'Classification': 'spark',
            'Properties': {
                'maximizeResourceAllocation': 'true'
            }
        },
    ],
)
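If the calling code should block until the auto-terminating cluster is actually gone, boto3 also provides a cluster_terminated waiter. A minimal sketch, reusing the client and response from above:

# Wait for the cluster to reach the TERMINATED state.
cluster_id = response['JobFlowId']
waiter = client.get_waiter('cluster_terminated')
waiter.wait(ClusterId=cluster_id)  # polls DescribeCluster under the hood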
You can create a short-lived cluster that automatically terminates after all steps have been run by specifying 'KeepJobFlowAliveWhenNoSteps': False in the Instances param. I've added a complete example to GitHub that shows how to do this.
Here's some of the code from the demo:
def run_job_flow(
        name, log_uri, keep_alive, applications, job_flow_role, service_role,
        security_groups, steps, emr_client):
    try:
        response = emr_client.run_job_flow(
            Name=name,
            LogUri=log_uri,
            ReleaseLabel='emr-5.30.1',
            Instances={
                'MasterInstanceType': 'm5.xlarge',
                'SlaveInstanceType': 'm5.xlarge',
                'InstanceCount': 3,
                'KeepJobFlowAliveWhenNoSteps': keep_alive,
                'EmrManagedMasterSecurityGroup': security_groups['manager'].id,
                'EmrManagedSlaveSecurityGroup': security_groups['worker'].id,
            },
            Steps=[{
                'Name': step['name'],
                'ActionOnFailure': 'CONTINUE',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': ['spark-submit', '--deploy-mode', 'cluster',
                             step['script_uri'], *step['script_args']]
                }
            } for step in steps],
            Applications=[{
                'Name': app
            } for app in applications],
            JobFlowRole=job_flow_role.name,
            ServiceRole=service_role.name,
            EbsRootVolumeSize=10,
            VisibleToAllUsers=True
        )
        cluster_id = response['JobFlowId']
        logger.info("Created cluster %s.", cluster_id)
    except ClientError:
        logger.exception("Couldn't create cluster.")
        raise
    else:
        return cluster_id
And here's some code that calls this function with some real params:
output_prefix = 'pi-calc-output'
pi_step = {
    'name': 'estimate-pi-step',
    'script_uri': f's3://{bucket_name}/{script_key}',
    'script_args':
        ['--partitions', '3', '--output_uri',
         f's3://{bucket_name}/{output_prefix}']
}
cluster_id = emr_basics.run_job_flow(
    f'{prefix}-cluster', f's3://{bucket_name}/logs',
    False, ['Hadoop', 'Hive', 'Spark'], job_flow_role, service_role,
    security_groups, [pi_step], emr_client)
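If you want to watch the demo's steps finish before the cluster shuts itself down, one option is boto3's step_complete waiter. A sketch, assuming emr_client and the cluster_id returned above:

# Block until every submitted step has finished.
steps = emr_client.list_steps(ClusterId=cluster_id)['Steps']
waiter = emr_client.get_waiter('step_complete')
for step in steps:
    waiter.wait(ClusterId=cluster_id, StepId=step['Id'])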
Related
I am setting up an entire GCP architecture with Deployment Manager, using the Python template structure.
I have tried to execute the script below:
{
    'name': 'dataproccluster',
    'type': 'dataproc.py',
    'subnetwork': 'default',
    'properties': {
        'zone': ZONE_NORTH,
        'region': REGION_NORTH,
        'serviceAccountEmail': 'X#appspot.gserviceaccount.com',
        'softwareConfig': {
            'imageVersion': '1.4-debian9',
            'properties': {
                'dataproc:dataproc.conscrypt.provider.enable': 'False'
            }
        },
        'master': {
            'numInstances': 1,
            'machineType': 'n1-standard-1',
            'diskSizeGb': 50,
            'diskType': 'pd-standard',
            'numLocalSsds': 0
        },
        'worker': {
            'numInstances': 2,
            'machineType': 'n1-standard-1',
            'diskType': 'pd-standard',
            'diskSizeGb': 50,
            'numLocalSsds': 0
        },
        'initializationActions': [{
            'executableFile': 'gs://dataproc-initialization-actions/python/pip-install.sh'
        }],
        'metadata': {
            'PIP_PACKAGES': 'requests_toolbelt==0.9.1 google-auth==1.6.31'
        },
        'labels': {
            'environment': 'dev',
            'data_type': 'X'
        }
    }
}
Which results in the following error:
Initialization action failed. Failed action 'gs://dataproc-initialization-actions/python/pip-install.sh',\
I would like to determine whether this is an error on my side or some sort of API problem. I found Google tickets on this topic covering CLI deployment, but they were marked as solved. I found nothing on the Deployment Manager side.
If it is an error on my side, what am I doing wrong?
I want to execute a spark-submit job on an AWS EMR cluster based on a file upload event on S3. I am using an AWS Lambda function to capture the event, but I have no idea how to submit a spark-submit job to the EMR cluster from the Lambda function.
Most of the answers I found talked about adding a step to the EMR cluster, but I do not know whether a step added that way can fire "spark-submit --with args".
You can; I did the same thing last week!
Using boto3 for Python (other languages definitely have a similar solution), you can either start a cluster with the defined step, or attach a step to an already-running cluster.
Defining the cluster with the step
import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")
    response = conn.run_job_flow(
        Name='ClusterName',
        ServiceRole='EMR_DefaultRole',
        JobFlowRole='EMR_EC2_DefaultRole',
        VisibleToAllUsers=True,
        LogUri='s3n://some-log-uri/elasticmapreduce/',
        ReleaseLabel='emr-5.8.0',
        Instances={
            'InstanceGroups': [
                {
                    'Name': 'Master nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 1,
                },
                {
                    'Name': 'Slave nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 2,
                }
            ],
            'Ec2KeyName': 'key-name',
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False
        },
        Applications=[{
            'Name': 'Spark'
        }],
        Configurations=[{
            "Classification": "spark-env",
            "Properties": {},
            "Configurations": [{
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python35",
                    "PYSPARK_DRIVER_PYTHON": "python35"
                }
            }]
        }],
        BootstrapActions=[{
            'Name': 'Install',
            'ScriptBootstrapAction': {
                'Path': 's3://path/to/bootstrap.script'
            }
        }],
        Steps=[{
            'Name': 'StepName',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': [
                    "/usr/bin/spark-submit", "--deploy-mode", "cluster",
                    's3://path/to/code.file', '-i', 'input_arg',
                    '-o', 'output_arg'
                ]
            }
        }],
    )
    # run_job_flow returns a dict; the new cluster's id is under 'JobFlowId'
    return "Started cluster {}".format(response['JobFlowId'])
Attaching a step to an already running cluster
As per here
import sys
import time

import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")
    # chooses the first cluster which is Running or Waiting
    # possibly can also choose by name or already have the cluster id
    clusters = conn.list_clusters()
    # choose the correct cluster
    clusters = [c["Id"] for c in clusters["Clusters"]
                if c["Status"]["State"] in ["RUNNING", "WAITING"]]
    if not clusters:
        sys.stderr.write("No valid clusters\n")
        sys.exit(1)
    # take the first relevant cluster
    cluster_id = clusters[0]
    # code location on your emr master node
    CODE_DIR = "/home/hadoop/code/"
    # spark configuration example (spark-submit takes --conf key=value pairs)
    step_args = ["/usr/bin/spark-submit", "--conf", "your-configuration",
                 CODE_DIR + "your_file.py", '--your-parameters', 'parameters']
    step = {"Name": "what_you_do-" + time.strftime("%Y%m%d-%H:%M"),
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': step_args
            }}
    action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return "Added step: %s" % (action)
Here is AWS Lambda function Python code if you want to execute a Spark jar; it submits the job through the Livy /batches endpoint on the master node rather than invoking spark-submit directly:
from botocore.vendored import requests
import json

def lambda_handler(event, context):
    headers = {"content-type": "application/json"}
    url = 'http://ip-address.ec2.internal:8998/batches'
    payload = {
        # 'file' is the main application jar; dependency jars go in 'jars'
        'file': 's3://Bucket/Orchestration/SparkCode.jar',
        'jars': ['s3://Bucket/Orchestration/RedshiftJDBC41.jar',
                 's3://Bucket/Orchestration/mysql-connector-java-8.0.12.jar'],
        'className': 'Main Class Name',
        'args': [event.get('rootPath')]
    }
    res = requests.post(url, data=json.dumps(payload), headers=headers, verify=False)
    json_data = json.loads(res.text)
    return json_data.get('id')
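Livy runs the batch asynchronously, so the handler only gets back an id. A hedged sketch of polling that batch until it finishes (GET /batches/{id}/state is part of the Livy REST API; the hostname is the same placeholder as above):

import time

import requests

def wait_for_batch(batch_id, livy_url='http://ip-address.ec2.internal:8998'):
    # Poll the Livy batch state until it reaches a terminal state.
    while True:
        res = requests.get('{}/batches/{}/state'.format(livy_url, batch_id))
        state = res.json()['state']
        if state in ('success', 'dead', 'killed'):
            return state
        time.sleep(10)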
I'm trying to create a Premium SSD disk to attach to a VM in Azure, but can't seem to figure out how to specify that correctly; I keep ending up with a Standard HDD.
azure_client.compute_client.disks.create_or_update("my_resource_group", 'deleteme-' + str(disk_num), {
    "location": "westus",
    "disk_size_gb": 256,
    'creation_data': {
        'create_option': 'empty',
        'sku': {
            'name': 'Premium_LRS'  # <=== What I want
        }
    },
    'tags': {
        "fake": "tags"
    }
}).result().as_dict()
{
    'id': '/subscriptions/5efe2633-26ac-4638-9f1f-6e24e494d9b4/resourceGroups/my_resource_group/providers/Microsoft.Compute/disks/deleteme-26',
    'provisioning_state': 'Succeeded',
    'name': 'deleteme-26',
    'type': 'Microsoft.Compute/disks',
    'time_created': '2019-02-05T00:37:41.907815Z',
    'tags': {
        'fake': 'tags'
    },
    'creation_data': {
        'create_option': 'Empty'
    },
    'sku': {
        'tier': 'Standard',
        'name': 'Standard_LRS'  # <== What I actually get
    },
    'location': 'westus',
    'disk_size_gb': 256
}
I'm open to connecting the disk directly to the host at creation, but can't figure out the API for tagging the disk that way.
I've also tried specifying 'tier': 'Premium' in the sku description, but no change. Here's the documentation I've found:
A bit embarrassing, but maybe someone else will do this in the future... I put the SKU in the wrong sub-dictionary. Azure doesn't yell at you if you put random stuff it doesn't understand in the creation_data section.
azure_client.compute_client.disks.create_or_update("my_resource_group", 'deleteme-' + str(disk_num), {
    "location": "westus",
    "disk_size_gb": 256,
    'creation_data': {
        'create_option': 'empty'
    },
    'sku': {
        'name': 'Premium_LRS'  # <=== Moved out of creation_data dict
    },
    'tags': {
        "fake": "tags"
    }
}).result().as_dict()
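To confirm the fix took effect, you can read the disk back. A small sketch, assuming the same azure_client and disk name as above:

# Fetch the disk and check that the SKU actually applied.
disk = azure_client.compute_client.disks.get('my_resource_group', 'deleteme-26')
print(disk.sku.name)  # now 'Premium_LRS' instead of 'Standard_LRS'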
I'm trying to deploy a Custom Instance Template using gcloud deployment-manager, but I keep getting this error:
ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1507833758152-55b5de788f540-e3be8bf6-a792d98e]: errors:
- code: RESOURCE_ERROR
location: /deployments/my-project/resources/worker-template
message: '{"ResourceType":"compute.v1.instanceTemplate","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"errors":[{"domain":"global","message":"Invalid
value for field ''resource.properties'': ''''. Instance Templates must provide
instance properties.","reason":"invalid"}],"message":"Invalid value for field
''resource.properties'': ''''. Instance Templates must provide instance properties.","statusMessage":"Bad
Request","requestPath":"https://www.googleapis.com/compute/v1/projects/my-project/global/instanceTemplates","httpMethod":"POST"}}'
My Python generate_config function is this:
def generate_config(context):
    resources = [{
        'type': 'compute.v1.instanceTemplate',
        'name': 'worker-template',
        'properties': {
            'zone': context.properties['zone'],
            'description': 'Worker Template',
            'machineType': context.properties['machineType'],
            'disks': [{
                'deviceName': 'boot',
                'type': 'PERSISTENT',
                'boot': True,
                'autoDelete': True,
                'initializeParams': {
                    'sourceImage': '/'.join([
                        context.properties['compute_base_url'],
                        'projects', context.properties['os_project'],
                        'global/images/family', context.properties['os_project_family']
                    ])
                }
            }],
            'networkInterfaces': [{
                'network': '$(ref.' + context.properties['network'] + '.selfLink)',
                'accessConfigs': [{
                    'name': 'External NAT',
                    'type': 'ONE_TO_ONE_NAT'
                }]
            }]
        }
    }]
    return {'resources': resources}
Properties is not empty, so the error message doesn't make much sense. Any ideas?
Thx!
After reading this example, I just found that the correct structure for compute.v1.instanceTemplate is:
...
'type': 'compute.v1.instanceTemplate',
'name': 'worker-template',
'properties': {
    'project': 'my-project',
    'properties': {
        'zone': context.properties['zone'],
        ...
    }
}
...
The structure follows this doc.
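Putting the fix back into the original template, here is a sketch of the corrected generate_config. It is hedged: it mirrors the question's fields, context.env['project'] is Deployment Manager's built-in way to get the project id, and the zone field is dropped since instance templates are global resources:

def generate_config(context):
    resources = [{
        'type': 'compute.v1.instanceTemplate',
        'name': 'worker-template',
        'properties': {
            # Outer level: parameters for the instanceTemplates.insert call.
            'project': context.env['project'],
            # Inner level: the instance definition the API requires.
            'properties': {
                'description': 'Worker Template',
                'machineType': context.properties['machineType'],
                'disks': [{
                    'deviceName': 'boot',
                    'type': 'PERSISTENT',
                    'boot': True,
                    'autoDelete': True,
                    'initializeParams': {
                        'sourceImage': '/'.join([
                            context.properties['compute_base_url'],
                            'projects', context.properties['os_project'],
                            'global/images/family',
                            context.properties['os_project_family']
                        ])
                    }
                }],
                'networkInterfaces': [{
                    'network': '$(ref.' + context.properties['network'] + '.selfLink)',
                    'accessConfigs': [{
                        'name': 'External NAT',
                        'type': 'ONE_TO_ONE_NAT'
                    }]
                }]
            }
        }
    }]
    return {'resources': resources}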
When writing .ebextensions .config files, Amazon allows long- and short-form entries; for example, these two configurations are identical:
Long form:
"option_settings": [
{
'Namespace': 'aws:rds:dbinstance',
'OptionName': 'DBEngine',
'Value': 'postgres'
},
{
'Namespace': 'aws:rds:dbinstance',
'OptionName': 'DBInstanceClass',
'Value': 'db.t2.micro'
}
]
Short form:
"option_settings": {
    "aws:rds:dbinstance": {
        "DBEngine": "postgres",
        "DBInstanceClass": "db.t2.micro"
    }
}
However, all of the configurations I've seen use only the long form with boto3:
response = eb_client.create_environment(
    ... trimmed ...
    OptionSettings=[
        {
            'Namespace': 'aws:rds:dbinstance',
            'OptionName': 'DBEngineVersion',
            'Value': '5.6'
        },
    ... trimmed ...
)
Is it possible to use a dictionary with short-form entries with boto3?
Bonus: If not, why not?
Trial and error suggests no, you cannot use the short-form config type.
However, if you are of that sort of persuasion you can do this:
def short_to_long(_in):
    out = []
    for namespace, key_vals in _in.items():
        for optname, value in key_vals.items():
            out.append(
                {
                    'Namespace': namespace,
                    'OptionName': optname,
                    'Value': value
                }
            )
    return out
Then elsewhere:
response = eb_client.create_environment(
    OptionSettings=short_to_long({
        "aws:rds:dbinstance": {
            "DBDeletionPolicy": "Delete",  # or snapshot
            "DBEngine": "postgres",
            "DBInstanceClass": "db.t2.micro"
        },
    })
)
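For reference, a quick check of what the helper produces; the dicts come out in the long form the API expects (ordering within the list follows dict iteration order):

settings = short_to_long({
    "aws:rds:dbinstance": {
        "DBEngine": "postgres",
        "DBInstanceClass": "db.t2.micro"
    }
})
# settings is now:
# [{'Namespace': 'aws:rds:dbinstance', 'OptionName': 'DBEngine', 'Value': 'postgres'},
#  {'Namespace': 'aws:rds:dbinstance', 'OptionName': 'DBInstanceClass', 'Value': 'db.t2.micro'}]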