I am starting to use AWS SageMaker for the development of my machine learning model, and I'm trying to build a Lambda function to process the responses of a SageMaker labeling job. I have already created my own Lambda function, but when I try to read the event contents I can see that the event dict is completely empty, so I'm not getting any data to read.
I have already given enough permissions to the role of the Lambda function, including:
- AmazonS3FullAccess
- AmazonSageMakerFullAccess
- AWSLambdaBasicExecutionRole
I've tried using this code for the post-annotation Lambda (adapted for Python 3.6):
https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates-step2-demo1.html#sms-custom-templates-step2-demo1-post-annotation
As well as this one from this Git repository:
https://github.com/aws-samples/aws-sagemaker-ground-truth-recipe/blob/master/aws_sagemaker_ground_truth_sample_lambda/annotation_consolidation_lambda.py
But neither of them seemed to work.
For creating the labeling job I'm using boto3's functions for SageMaker:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_labeling_job
This is the code I have for creating the labeling job:
def create_labeling_job(client, bucket_name, labeling_job_name, manifest_uri, output_path):
    print("Creating labeling job with name: %s" % (labeling_job_name))
    response = client.create_labeling_job(
        LabelingJobName=labeling_job_name,
        LabelAttributeName='annotations',
        InputConfig={
            'DataSource': {
                'S3DataSource': {
                    'ManifestS3Uri': manifest_uri
                }
            },
            'DataAttributes': {
                'ContentClassifiers': [
                    'FreeOfAdultContent',
                ]
            }
        },
        OutputConfig={
            'S3OutputPath': output_path
        },
        RoleArn='arn:aws:myrolearn',
        LabelCategoryConfigS3Uri='s3://' + bucket_name + '/config.json',
        StoppingConditions={
            'MaxPercentageOfInputDatasetLabeled': 100,
        },
        LabelingJobAlgorithmsConfig={
            'LabelingJobAlgorithmSpecificationArn': 'arn:image-classification'
        },
        HumanTaskConfig={
            'WorkteamArn': 'arn:my-private-workforce-arn',
            'UiConfig': {
                'UiTemplateS3Uri': 's3://' + bucket_name + '/templatefile'
            },
            'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox',
            'TaskTitle': 'Title',
            'TaskDescription': 'Description',
            'NumberOfHumanWorkersPerDataObject': 1,
            'TaskTimeLimitInSeconds': 600,
            'AnnotationConsolidationConfig': {
                'AnnotationConsolidationLambdaArn': 'arn:aws:my-custom-post-annotation-lambda'
            }
        }
    )
    return response
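For reference, this function would be invoked along these lines (the client setup is standard boto3; the bucket, job name, and S3 URIs below are placeholders):

import boto3

client = boto3.client('sagemaker')

response = create_labeling_job(
    client,
    bucket_name='my-bucket',                       # placeholder
    labeling_job_name='my-labeling-job',           # placeholder
    manifest_uri='s3://my-bucket/input.manifest',  # placeholder
    output_path='s3://my-bucket/output/'           # placeholder
)
print(response['LabelingJobArn'])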
And this is the code I have for the Lambda function:
print("Received event: " + json.dumps(event, indent=2))
print("event: %s"%(event))
print("context: %s"%(context))
print("event headers: %s"%(event["headers"]))
parsed_url = urlparse(event['payload']['s3Uri']);
print("parsed_url: ",parsed_url)
labeling_job_arn = event["labelingJobArn"]
label_attribute_name = event["labelAttributeName"]
label_categories = None
if "label_categories" in event:
label_categories = event["labelCategories"]
print(" Label Categories are : " + label_categories)
payload = event["payload"]
role_arn = event["roleArn"]
output_config = None # Output s3 location. You can choose to write your annotation to this location
if "outputConfig" in event:
output_config = event["outputConfig"]
# If you specified a KMS key in your labeling job, you can use the key to write
# consolidated_output to s3 location specified in outputConfig.
kms_key_id = None
if "kmsKeyId" in event:
kms_key_id = event["kmsKeyId"]
# Create s3 client object
s3_client = S3Client(role_arn, kms_key_id)
# Perform consolidation
return do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client)
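(Aside: Ground Truth expects the post-annotation Lambda to return a JSON list in a specific shape, which is relevant to the failure shown further down. A rough sketch of a minimal do_consolidation based on my reading of the Ground Truth docs, with the iteration over the payload left as a placeholder:)

def do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client):
    # One entry per dataset object; the consolidated label is keyed by the
    # labeling job's label attribute name.
    consolidated = []
    for dataset_object_id in ["0"]:  # placeholder: iterate the annotations read from payload
        consolidated.append({
            "datasetObjectId": dataset_object_id,
            "consolidatedAnnotation": {
                "content": {
                    label_attribute_name: {"result": "..."}  # your consolidated label here
                }
            }
        })
    return consolidated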
I've tried debugging the event object with:
print("Received event: " + json.dumps(event, indent=2))
But it just prints an empty dictionary: Received event: {}
I expect the output to be something like:
# Content of an example event:
{
    "version": "2018-10-16",
    "labelingJobArn": <labelingJobArn>,
    "labelCategories": [<string>],  # If you created the labeling job using the AWS console, labelCategories will be null
    "labelAttributeName": <string>,
    "roleArn": "string",
    "payload": {
        "s3Uri": <string>
    },
    "outputConfig": "s3://<consolidated_output configured for labeling job>"
}
Lastly, when I try to get the labeling job ARN with:
labeling_job_arn = event["labelingJobArn"]
I just get a KeyError (which makes sense because the dictionary is empty).
I am doing the same, but in the Labeled objects section I am getting a failed result, and inside my output objects I am getting the following error from the post-annotation Lambda function:
"annotation-case0-test3-metadata": {
"retry-count": 1,
"failure-reason": "ClientError: The JSON output from the AnnotationConsolidationLambda function could not be read. Check the output of the Lambda function and try your request again.",
"human-annotated": "true"
}
}
I found the problem: I needed to add the ARN of the role used by my Lambda function as a Trusted Entity on the role used for the SageMaker labeling job.
I just went to Roles > MySagemakerExecutionRole > Trust Relationships and added:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::xxxxxxxxx:role/My-Lambda-Role",
          ...
        ],
        "Service": [
          "lambda.amazonaws.com",
          "sagemaker.amazonaws.com",
          ...
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
This made it work for me.
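If you prefer to script that change instead of using the console, a minimal sketch with boto3 (the role name and Lambda role ARN are placeholders for your own):

import json
import boto3

iam = boto3.client('iam')

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": ["arn:aws:iam::xxxxxxxxx:role/My-Lambda-Role"],  # placeholder
                "Service": ["lambda.amazonaws.com", "sagemaker.amazonaws.com"]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

# Overwrites the role's trust relationship with the document above
iam.update_assume_role_policy(
    RoleName='MySagemakerExecutionRole',  # placeholder
    PolicyDocument=json.dumps(trust_policy)
)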
Initially this was a Python script on EC2, but now I want it to become an AWS Lambda function, generated from Terraform. Since AWS Lambda needs a lambda_handler function instead of "__main__",
I wonder what to put in my tf code for the handler arg.
The Python script (it gets zipped by Terraform, then loaded up to AWS Lambda):
#!/usr/bin/env python3
import boto3
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

queue = boto3.resource(
    'sqs', region_name='us-east-1').get_queue_by_name(QueueName="erjan")
table = boto3.resource('dynamodb', region_name='us-east-1').Table('Votes')

def process_message(message):
    try:
        payload = message.message_attributes
        voter = payload['voter']['StringValue']
        vote = payload['vote']['StringValue']
        logging.info("Voter: %s, Vote: %s", voter, vote)
        update_count(vote)
        message.delete()
    except Exception as e:
        print('-----EXCEPTION-----')

def update_count(vote):
    logging.info('update count....')
    cur_count = 0
    if vote == 'b':
        response = table.get_item(Key={'voter': 'count'})
        item = response['Item']
        item['b'] += 1
        table.put_item(Item=item)
    elif vote == 'a':
        table.update_item(
            Key={'voter': 'count'},
            UpdateExpression="ADD a :incr",
            ExpressionAttributeValues={':incr': 1})

if __name__ == "__main__":
    logging.info('--------inside main-------')
    while True:
        try:
            messages = queue.receive_messages(MessageAttributeNames=['vote', 'voter'])
        except KeyboardInterrupt:
            logging.info("Stopping...")
            break
        except:
            logging.error(sys.exc_info()[0])
            continue
        for message in messages:
            process_message(message)
The tf code:
resource "aws_iam_role" "vote_processor_lambda_iam_role" {
name = "vote_processor_lambda_iam_role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "lambda.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_policy" "vote_processor_dynamodb_policy" {
name = "vote_processor_dynamodb_policy"
policy = <<EOF
{
"some json"
}
}
resource "aws_iam_role_policy_attachment" "attach_vote_processor_dynamodb_policy_to_iam_role" {
role = aws_iam_role.vote_processor_lambda_iam_role.name
policy_arn = aws_iam_policy.vote_processor_dynamodb_policy.arn
}
resource "aws_iam_policy" "vote_processor_sqs_policy" {
name = "vote_processor_sqs_policy"
policy = <<EOF
{
"some json"
}
EOF
}
resource "aws_iam_role_policy_attachment" "vote_processor_sqs_access_policy" {
role = aws_iam_role.vote_processor_lambda_iam_role.name
policy_arn = aws_iam_policy.vote_processor_sqs_policy.arn
}
data "archive_file" "vote_processor_zip_code" {
type = "zip"
source_file = "${path.module}/vote_processor.py"
output_path = "${path.module}/vote_processor.zip"
}
#what to put in handler arg?
resource "aws_lambda_function" "vote_processor_lambda_backend" {
filename = "${path.module}/vote_processor.zip"
function_name = "vote_processor"
role = aws_iam_role.vote_processor_lambda_iam_role.arn
handler = "result.lambda_handler" #should this be __main__?
runtime = "python3.9"
}
Should I rename the Python script's main function to "lambda_handler", or change the handler value in the tf code instead?
"__ main __" is not necessary. The following syntax defines lambda functions.
Lambda function handler in Python - AWS Lambda
You can use the following general syntax when creating a function handler in Python:
def handler_name(event, context):
...
return some_value
The handler to set in the tf code is determined as follows.
Lambda function handler in Python - AWS Lambda
The Lambda function handler name specified at the time that you create a Lambda function is derived from:
The name of the file in which the Lambda handler function is located.
The name of the Python handler function.
A function handler can be any name; however, the default name in the Lambda console is lambda_function.lambda_handler. This function handler name reflects the function name (lambda_handler) and the file where the handler code is stored (lambda_function.py).
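Applied to this question: the zipped file is vote_processor.py, so renaming the script's entry point to lambda_handler gives handler = "vote_processor.lambda_handler" in the Terraform resource (not "result.lambda_handler"). A rough sketch of the adapted script follows; note that an infinite polling loop doesn't fit Lambda's execution model, and an SQS event source mapping is the more idiomatic trigger, so treat this as an illustration:

# vote_processor.py -- sketch: the __main__ polling loop becomes a bounded handler
import boto3

queue = boto3.resource('sqs', region_name='us-east-1').get_queue_by_name(QueueName="erjan")

def lambda_handler(event, context):
    # One poll per invocation instead of `while True` under __main__
    messages = queue.receive_messages(MessageAttributeNames=['vote', 'voter'])
    for message in messages:
        process_message(message)  # the existing helper from the script, unchanged
    return {"processed": len(messages)}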
I'm trying to create a DynamoDB table using a CloudFormation stack, however I keep receiving the 'CREATE_FAILED' error in the AWS console and I'm not sure where I'm going wrong.
My code that calls create_stack:
cf = boto3.client('cloudformation')
stack_name = 'teststack'

with open('dynamoDBTemplate.json') as json_file:
    template = json.load(json_file)
    template = str(template)
    try:
        response = cf.create_stack(
            StackName=stack_name,
            TemplateBody=template,
            TimeoutInMinutes=123,
            ResourceTypes=[
                'AWS::DynamoDB::Table',
            ],
            OnFailure='DO_NOTHING',
            EnableTerminationProtection=True
        )
        print(response)
    except ClientError as e:
        print(e)
And here is my JSON file:
{
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "myDynamoDBTable": {
            "Type": "AWS::DynamoDB::Table",
            "Properties": {
                "AttributeDefinitions": [
                    {
                        "AttributeName": "Filename",
                        "AttributeType": "S"
                    },
                    {
                        "AttributeName": "Positive Score",
                        "AttributeType": "S"
                    },
                    {
                        "AttributeName": "Negative Score",
                        "AttributeType": "S"
                    },
                    {
                        "AttributeName": "Mixed Score",
                        "AttributeType": "S"
                    }
                ],
                "KeySchema": [
                    {
                        "AttributeName": "Filename",
                        "KeyType": "HASH"
                    }
                ],
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": "5",
                    "WriteCapacityUnits": "5"
                },
                "TableName": "testtable"
            }
        }
    }
}
My console prints the created stack but there is no clear indication in the console as to why it keeps failing.
Take a look at the Events tab for your stack. It will show you the detailed actions and explain which step first failed. Specifically it will tell you:
One or more parameter values were invalid: Number of attributes in KeySchema does not exactly match number of attributes defined in AttributeDefinitions (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: 12345; Proxy: null)
The problem is that you have provided definitions for all of your table attributes. You should only provide the key attributes.
Per the AttributeDefinitions documentation:
[AttributeDefinitions is] A list of attributes that describe the key schema for the table and indexes.
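For this template, that means trimming AttributeDefinitions down to the single key attribute; the score attributes don't need to be declared up front and can simply be set on items when you write them:

"AttributeDefinitions": [
    {
        "AttributeName": "Filename",
        "AttributeType": "S"
    }
],
"KeySchema": [
    {
        "AttributeName": "Filename",
        "KeyType": "HASH"
    }
]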
I want to execute a spark-submit job on an AWS EMR cluster based on a file upload event on S3. I am using an AWS Lambda function to capture the event, but I have no idea how to submit a spark-submit job to the EMR cluster from the Lambda function.
Most of the answers I found talked about adding a step to the EMR cluster, but I do not know whether an added step can fire "spark-submit --with args".
You can; I had to do the same thing last week!
Using boto3 for Python (other languages definitely have a similar solution), you can either start a cluster with the step defined, or attach a step to an already-running cluster.
Defining the cluster with the step
import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")
    cluster_id = conn.run_job_flow(
        Name='ClusterName',
        ServiceRole='EMR_DefaultRole',
        JobFlowRole='EMR_EC2_DefaultRole',
        VisibleToAllUsers=True,
        LogUri='s3n://some-log-uri/elasticmapreduce/',
        ReleaseLabel='emr-5.8.0',
        Instances={
            'InstanceGroups': [
                {
                    'Name': 'Master nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 1,
                },
                {
                    'Name': 'Slave nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 2,
                }
            ],
            'Ec2KeyName': 'key-name',
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False
        },
        Applications=[{
            'Name': 'Spark'
        }],
        Configurations=[{
            "Classification": "spark-env",
            "Properties": {},
            "Configurations": [{
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python35",
                    "PYSPARK_DRIVER_PYTHON": "python35"
                }
            }]
        }],
        BootstrapActions=[{
            'Name': 'Install',
            'ScriptBootstrapAction': {
                'Path': 's3://path/to/bootstrap.script'
            }
        }],
        Steps=[{
            'Name': 'StepName',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': [
                    "/usr/bin/spark-submit", "--deploy-mode", "cluster",
                    's3://path/to/code.file', '-i', 'input_arg',
                    '-o', 'output_arg'
                ]
            }
        }],
    )
    return "Started cluster {}".format(cluster_id)
Attaching a step to an already running cluster
As per here
import sys
import time
import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")
    # chooses the first cluster which is Running or Waiting
    # possibly can also choose by name or already have the cluster id
    clusters = conn.list_clusters()
    # choose the correct cluster
    clusters = [c["Id"] for c in clusters["Clusters"]
                if c["Status"]["State"] in ["RUNNING", "WAITING"]]
    if not clusters:
        sys.stderr.write("No valid clusters\n")
        sys.exit()
    # take the first relevant cluster
    cluster_id = clusters[0]
    # code location on your emr master node
    CODE_DIR = "/home/hadoop/code/"
    # spark configuration example
    step_args = ["/usr/bin/spark-submit", "--spark-conf", "your-configuration",
                 CODE_DIR + "your_file.py", '--your-parameters', 'parameters']
    step = {"Name": "what_you_do-" + time.strftime("%Y%m%d-%H:%M"),
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': step_args
            }}
    action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return "Added step: %s" % (action)
AWS Lambda function Python code if you want to execute a Spark jar using a spark-submit command (this posts the batch to what looks like an Apache Livy endpoint, which listens on port 8998):
from botocore.vendored import requests
import json

def lambda_handler(event, context):
    headers = {"content-type": "application/json"}
    url = 'http://ip-address.ec2.internal:8998/batches'
    payload = {
        # 'file' takes the single application jar; dependency jars go in 'jars'
        'file': 's3://Bucket/Orchestration/SparkCode.jar',
        'jars': ['s3://Bucket/Orchestration/RedshiftJDBC41.jar',
                 's3://Bucket/Orchestration/mysql-connector-java-8.0.12.jar'],
        'className': 'Main Class Name',
        'args': [event.get('rootPath')]
    }
    res = requests.post(url, data=json.dumps(payload), headers=headers, verify=False)
    json_data = json.loads(res.text)
    return json_data.get('id')
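Assuming that endpoint is Apache Livy (port 8998 and the /batches path are Livy defaults), you can poll the returned batch id to track the job; a minimal sketch, with the hostname a placeholder as above:

import json
from botocore.vendored import requests  # same vendored client as above

def get_batch_state(batch_id):
    # GET /batches/{id} returns the batch session, whose 'state' moves through
    # values like 'starting' and 'running' before ending in 'success' or 'dead'.
    url = 'http://ip-address.ec2.internal:8998/batches/%s' % batch_id
    res = requests.get(url, headers={"content-type": "application/json"})
    return json.loads(res.text).get('state')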
I have an existing worksheet with an existing NamedRange for it and I would like to call the batch_update method of the API to protect that range from being edited by anyone other than the user that makes the batch_update call.
I have seen an example on how to add protected ranges via a new range definition, but not from an existing NamedRange.
I know I need to send an addProtectedRange request. Can I define the request body with a Sheetname!NamedRange notation?
this_range = worksheet_name + "!" + nrange

batch_update_spreadsheet_request_body = {
    'requests': [
        {
            "addProtectedRange": {
                "protectedRange": {
                    "range": {
                        "name": this_range,
                    },
                    "description": "Protecting xyz",
                    "warningOnly": False
                }
            }
        }
    ],
}
EDIT: Given @Tanaike's feedback, I adapted the call to something like:
body = {
    "requests": [
        {
            "addProtectedRange": {
                "protectedRange": {
                    "namedRangeId": namedRangeId,
                    "description": "Protecting via gsheets_manager",
                    "warningOnly": False,
                    "requestingUserCanEdit": False
                }
            }
        }
    ]
}
res2 = service.spreadsheets().batchUpdate(spreadsheetId=ssId, body=body).execute()
print(res2)
But although it lists the new protections, it still lists 5 different users (all of them) as editors. If I try to manually edit the protection added by my gsheets_manager script, it complains that the range is invalid.
Interestingly, it seems to ignore the requestingUserCanEdit flag, according to the returned message:
{u'spreadsheetId': u'NNNNNNNNNNNNNNNNNNNNNNNNNNNN', u'replies': [{u'addProtectedRange': {u'protectedRange': {u'requestingUserCanEdit': True, u'description': u'Protecting via gsheets_manager', u'namedRangeId': u'1793914032', u'editors': {}, u'protectedRangeId': 2012740267, u'range': {u'endColumnIndex': 1, u'sheetId': 1196959832, u'startColumnIndex': 0}}}}]}
Any ideas?
How about using namedRangeId for your situation? The flow of the sample script is as follows.
1. Retrieve the namedRangeId using spreadsheets().get of the Sheets API.
2. Set the protected range using that namedRangeId via spreadsheets().batchUpdate of the Sheets API.
Sample script:
nrange = "### name ###"
ssId = "### spreadsheetId ###"
res1 = service.spreadsheets().get(spreadsheetId=ssId, fields="namedRanges").execute()
namedRangeId = ""
for e in res1['namedRanges']:
if e['name'] == nrange:
namedRangeId = e['namedRangeId']
break
body = {
"requests": [
{
"addProtectedRange": {
"protectedRange": {
"namedRangeId": namedRangeId,
"description": "Protecting xyz",
"warningOnly": False
}
}
}
]
}
res2 = service.spreadsheets().batchUpdate(spreadsheetId=ssId, body=body).execute()
print(res2)
Note:
- This script supposes that Sheets API can be used for your environment.
- This is a simple sample script. So please modify it to your situation.
References:
- ProtectedRange
- Named and Protected Ranges
If this was not what you want, I'm sorry.
Edit:
In my above answer, I modified your script using your settings. If you want to protect the named range, please modify body as follows.
Modified body
body = {
    "requests": [
        {
            "addProtectedRange": {
                "protectedRange": {
                    "namedRangeId": namedRangeId,
                    "description": "Protecting xyz",
                    "warningOnly": False,
                    "editors": {"users": ["### your email address ###"]}  # Added
                }
            }
        }
    ]
}
With this, the named range can be modified only by you. I'm using such settings and I can confirm that they work in my environment. But if this doesn't work in your situation, I'm sorry.
To follow up on this question:
Filter CloudWatch Logs to extract Instance ID
I think it leaves the question incomplete because it does not say how to access the event object with Python.
My goal is to:
- read which instance triggered the event by changing to the running state
- get a tag value associated with that instance
- start all other instances that have the same tag
The Cloudwatch trigger event is:
{
"source": [
"aws.ec2"
],
"detail-type": [
"EC2 Instance State-change Notification"
],
"detail": {
"state": [
"running"
]
}
}
I can see examples like this:
def lambda_handler(event, context):
    # here I want to get the instance tag value
    # and set the tag filter based on the instance that
    # triggered the event
    filters = [{
        'Name': 'tag:StartGroup',
        'Values': ['startgroup1']
    },
    {
        'Name': 'instance-state-name',
        'Values': ['running']
    }]
    instances = ec2.instances.filter(Filters=filters)
I can see the event object but I don't see how to drill down into the tag of the instance that had its state changed to running.
Please, what is the object attribute through which I can get a tag from the triggered instance?
I suspect it is something like:
myTag = event.details.instance-id.tags["startgroup1"]
The event data passed to Lambda contains the Instance ID.
You then need to call describe_tags() to retrieve the instance's tags.
import boto3

client = boto3.client('ec2')
client.describe_tags(Filters=[
    {
        'Name': 'resource-id',
        'Values': [
            event['detail']['instance-id']
        ]
    }
])
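describe_tags returns a dict whose 'Tags' key holds a list of {'Key': ..., 'Value': ...} entries, so you can flatten it for easy lookup; a small sketch (variable names are mine):

response = client.describe_tags(Filters=[
    {'Name': 'resource-id', 'Values': [event['detail']['instance-id']]}
])
# Flatten the [{'Key': ..., 'Value': ...}, ...] list into a plain dict
tags = {t['Key']: t['Value'] for t in response['Tags']}
start_group = tags.get('StartGroup')  # the tag used in the question's filter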
In the detail section of the event you will get the instance ID. Using the instance ID and the AWS SDK, you can query the tags. The following is a sample event:
{
    "version": "0",
    "id": "ee376907-2647-4179-9203-343cfb3017a4",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "123456789012",
    "time": "2015-11-11T21:30:34Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"
    ],
    "detail": {
        "instance-id": "i-abcd1111",
        "state": "running"
    }
}
This is what I came up with...
Please let me know how it can be done better. Thanks for the help.
# StartMeUp_Instances_byOne
#
# This lambda script is triggered by a CloudWatch Event, startGroupByInstance.
# Every evening a separate lambda script is launched on a schedule to stop
# all non-essential instances.
#
# This script will turn on all instances with a LaunchGroup tag that matches
# a single instance which has been changed to the running state.
#
# To start all instances in a LaunchGroup,
# start one of the instances in the LaunchGroup and wait about 5 minutes.
#
# Costs to run: approx. $0.02/month
# https://s3.amazonaws.com/lambda-tools/pricing-calculator.html
# 150 executions per month * 128 MB Memory * 60000 ms Execution Time
#
# Problems: talk to chrisj
# ======================================
# test system
# this is what the event object looks like (see below)
# it is configured in the test event object with a specific instance-id
# change that to test a different instance-id with a different LaunchGroup
# { "version": "0",
#   "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#   "detail-type": "EC2 Instance State-change Notification",
#   "source": "aws.ec2",
#   "account": "999999999999999",
#   "time": "2015-11-11T21:30:34Z",
#   "region": "us-east-1",
#   "resources": [
#     "arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"
#   ],
#   "detail": {
#     "instance-id": "i-0aad9474",  # <---------- chg this
#     "state": "running"
#   }
# }
# ======================================

import boto3
import logging
import json

ec2 = boto3.resource('ec2')

def get_instance_LaunchGroup(iid):
    # When given an instance ID as str e.g. 'i-1234567',
    # return the instance LaunchGroup.
    ec2 = boto3.resource('ec2')
    ec2instance = ec2.Instance(iid)
    thisTag = ''
    for tags in ec2instance.tags:
        if tags["Key"] == 'LaunchGroup':
            thisTag = tags["Value"]
    return thisTag

# this is the entry point for the cloudwatch trigger
def lambda_handler(event, context):
    # get the instance id that triggered the event
    thisInstanceID = event['detail']['instance-id']
    print("instance-id: " + thisInstanceID)
    # get the LaunchGroup tag value of thisInstanceID
    thisLaunchGroup = get_instance_LaunchGroup(thisInstanceID)
    print("LaunchGroup: " + thisLaunchGroup)
    if thisLaunchGroup == '':
        print("No LaunchGroup associated with this InstanceID - ending lambda function")
        return
    # set the filters
    filters = [{
        'Name': 'tag:LaunchGroup',
        'Values': [thisLaunchGroup]
    },
    {
        'Name': 'instance-state-name',
        'Values': ['stopped']
    }]
    # get the instances based on the filter, thisLaunchGroup and stopped
    instances = ec2.instances.filter(Filters=filters)
    # get the stopped instance IDs
    stoppedInstances = [instance.id for instance in instances]
    # make sure there are some instances not already started
    if len(stoppedInstances) > 0:
        startingUp = ec2.instances.filter(InstanceIds=stoppedInstances).start()
    print("Finished launching all instances for tag: " + thisLaunchGroup)
So, here's how I got the tags in my Python code for my Lambda function.
import boto3

# instanceId comes from the triggering event, as shown in the answers above
ec2 = boto3.resource('ec2')
instance = ec2.Instance(instanceId)

# get image_id from instance-id
imageId = instance.image_id
print(imageId)

for tags in instance.tags:
    if tags["Key"] == 'Name':
        newName = tags["Value"] + ".mydomain.com"
        print(newName)
So I used instance.tags, checked each tag for a "Key" matching my Name tag, and pulled out the "Value" to build the FQDN (Fully Qualified Domain Name).