Empty dictionary on AnnotationConsolidation Lambda event for AWS SageMaker - Python

I am starting to use AWS SageMaker for the development of my machine learning model and I'm trying to build a Lambda function to process the responses of a SageMaker labeling job. I have already created my own Lambda function, but when I try to read the event contents I can see that the event dict is completely empty, so I'm not getting any data to read.
I have already given the Lambda function's role more than enough permissions, including:
- AmazonS3FullAccess
- AmazonSageMakerFullAccess
- AWSLambdaBasicExecutionRole
I've tried using the post-annotation Lambda code from the documentation (adapted for Python 3.6):
https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates-step2-demo1.html#sms-custom-templates-step2-demo1-post-annotation
as well as the one in this Git repository:
https://github.com/aws-samples/aws-sagemaker-ground-truth-recipe/blob/master/aws_sagemaker_ground_truth_sample_lambda/annotation_consolidation_lambda.py
but neither of them seemed to work.
To create the labeling job I'm using boto3's SageMaker client:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_labeling_job
This is the code I have for creating the labeling job:
def create_labeling_job(client, bucket_name, labeling_job_name, manifest_uri, output_path):
    print("Creating labeling job with name: %s" % (labeling_job_name))
    response = client.create_labeling_job(
        LabelingJobName=labeling_job_name,
        LabelAttributeName='annotations',
        InputConfig={
            'DataSource': {
                'S3DataSource': {
                    'ManifestS3Uri': manifest_uri
                }
            },
            'DataAttributes': {
                'ContentClassifiers': [
                    'FreeOfAdultContent',
                ]
            }
        },
        OutputConfig={
            'S3OutputPath': output_path
        },
        RoleArn='arn:aws:myrolearn',
        LabelCategoryConfigS3Uri='s3://' + bucket_name + '/config.json',
        StoppingConditions={
            'MaxPercentageOfInputDatasetLabeled': 100,
        },
        LabelingJobAlgorithmsConfig={
            'LabelingJobAlgorithmSpecificationArn': 'arn:image-classification'
        },
        HumanTaskConfig={
            'WorkteamArn': 'arn:my-private-workforce-arn',
            'UiConfig': {
                'UiTemplateS3Uri': 's3://' + bucket_name + '/templatefile'
            },
            'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox',
            'TaskTitle': 'Title',
            'TaskDescription': 'Description',
            'NumberOfHumanWorkersPerDataObject': 1,
            'TaskTimeLimitInSeconds': 600,
            'AnnotationConsolidationConfig': {
                'AnnotationConsolidationLambdaArn': 'arn:aws:my-custom-post-annotation-lambda'
            }
        }
    )
    return response
And this is the one I have for the Lambda function:
print("Received event: " + json.dumps(event, indent=2))
print("event: %s"%(event))
print("context: %s"%(context))
print("event headers: %s"%(event["headers"]))
parsed_url = urlparse(event['payload']['s3Uri']);
print("parsed_url: ",parsed_url)
labeling_job_arn = event["labelingJobArn"]
label_attribute_name = event["labelAttributeName"]
label_categories = None
if "label_categories" in event:
label_categories = event["labelCategories"]
print(" Label Categories are : " + label_categories)
payload = event["payload"]
role_arn = event["roleArn"]
output_config = None # Output s3 location. You can choose to write your annotation to this location
if "outputConfig" in event:
output_config = event["outputConfig"]
# If you specified a KMS key in your labeling job, you can use the key to write
# consolidated_output to s3 location specified in outputConfig.
kms_key_id = None
if "kmsKeyId" in event:
kms_key_id = event["kmsKeyId"]
# Create s3 client object
s3_client = S3Client(role_arn, kms_key_id)
# Perform consolidation
return do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client)
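The do_consolidation helper is defined in the sample repository linked above and is not shown here. As a rough, simplified sketch of what it has to do (reading the payload with plain boto3 instead of the sample's S3Client helper, and using the output shape documented for post-annotation Lambdas; double-check the field names against the current docs):

import json
from urllib.parse import urlparse

import boto3

def do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client):
    # Simplified: fetch the annotation payload with plain boto3 rather than the
    # sample repository's S3Client helper (which assumes the SageMaker role).
    parsed = urlparse(payload["s3Uri"])
    s3 = boto3.client("s3")
    raw = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))["Body"].read()
    dataset_objects = json.loads(raw)

    consolidated = []
    for dataset_object in dataset_objects:
        # Naive consolidation: just keep the first worker's answer.
        first_answer = dataset_object["annotations"][0]["annotationData"]["content"]
        consolidated.append({
            "datasetObjectId": dataset_object["datasetObjectId"],
            "consolidatedAnnotation": {
                "content": {
                    label_attribute_name: {
                        "workerAnswer": json.loads(first_answer)
                    }
                }
            }
        })
    # Ground Truth expects a list of this shape back as the Lambda's JSON output;
    # an output that does not match it leads to the "JSON output ... could not
    # be read" failure mentioned further down.
    return consolidated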
I've tried debugging the event object with:
print("Received event: " + json.dumps(event, indent=2))
But it just prints an empty dictionary: Received event: {}
I expect the output to be something like:
# Content of an example event:
{
    "version": "2018-10-16",
    "labelingJobArn": <labelingJobArn>,
    "labelCategories": [<string>],  # If you created the labeling job using the AWS console, labelCategories will be null
    "labelAttributeName": <string>,
    "roleArn": "string",
    "payload": {
        "s3Uri": <string>
    },
    "outputConfig": "s3://<consolidated_output configured for labeling job>"
}
Lastly, when I try to get the labeling job ARN with:
labeling_job_arn = event["labelingJobArn"]
I just get a KeyError (which makes sense because the dictionary is empty).

I am doing the same, but in the labeled objects section I get a failed result, and inside my output objects I get the following error from the post-annotation Lambda function:
"annotation-case0-test3-metadata": {
"retry-count": 1,
"failure-reason": "ClientError: The JSON output from the AnnotationConsolidationLambda function could not be read. Check the output of the Lambda function and try your request again.",
"human-annotated": "true"
}
}

I found the problem: I needed to add the ARN of the role used by my Lambda function as a trusted entity on the role used for the SageMaker labeling job.
I just went to Roles > MySagemakerExecutionRole > Trust Relationships and added:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::xxxxxxxxx:role/My-Lambda-Role",
                    ...
                ],
                "Service": [
                    "lambda.amazonaws.com",
                    "sagemaker.amazonaws.com",
                    ...
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
This made it work for me.
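If you would rather apply the same trust-relationship change programmatically instead of through the console, a sketch with boto3's IAM client (the role names and account ID are placeholders taken from the policy above) could look like this:

import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::xxxxxxxxx:role/My-Lambda-Role",  # placeholder Lambda role ARN
                "Service": ["lambda.amazonaws.com", "sagemaker.amazonaws.com"]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

# Overwrites the trust (assume-role) policy on the SageMaker execution role.
iam.update_assume_role_policy(
    RoleName="MySagemakerExecutionRole",
    PolicyDocument=json.dumps(trust_policy),
)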

Related

How to rename a Python script to be used in Lambda - lambda_handler vs main() function name?

Initially this was a Python script running on EC2, but now I want it to become an AWS Lambda function generated from Terraform. AWS Lambda needs a lambda_handler function rather than "__main__".
I wonder what to put in my tf code for the handler argument.
The Python script (it gets zipped by Terraform and then uploaded to AWS Lambda):
#!/usr/bin/env python3
import boto3
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

queue = boto3.resource(
    'sqs', region_name='us-east-1').get_queue_by_name(QueueName="erjan")
table = boto3.resource('dynamodb', region_name='us-east-1').Table('Votes')

def process_message(message):
    try:
        payload = message.message_attributes
        voter = payload['voter']['StringValue']
        vote = payload['vote']['StringValue']
        logging.info("Voter: %s, Vote: %s", voter, vote)
        update_count(vote)
        message.delete()
    except Exception as e:
        print('-----EXCEPTION-----')

def update_count(vote):
    logging.info('update count....')
    cur_count = 0
    if vote == 'b':
        response = table.get_item(Key={'voter': 'count'})
        item = response['Item']
        item['b'] += 1
        table.put_item(Item=item)
    elif vote == 'a':
        table.update_item(
            Key={'voter': 'count'},
            UpdateExpression="ADD a :incr",
            ExpressionAttributeValues={':incr': 1})

if __name__ == "__main__":
    logging.info('--------inside main-------')
    while True:
        try:
            messages = queue.receive_messages(MessageAttributeNames=['vote', 'voter'])
        except KeyboardInterrupt:
            logging.info("Stopping...")
            break
        except:
            logging.error(sys.exc_info()[0])
            continue
        for message in messages:
            process_message(message)
The tf code:
resource "aws_iam_role" "vote_processor_lambda_iam_role" {
name = "vote_processor_lambda_iam_role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "lambda.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_policy" "vote_processor_dynamodb_policy" {
name = "vote_processor_dynamodb_policy"
policy = <<EOF
{
"some json"
}
}
resource "aws_iam_role_policy_attachment" "attach_vote_processor_dynamodb_policy_to_iam_role" {
role = aws_iam_role.vote_processor_lambda_iam_role.name
policy_arn = aws_iam_policy.vote_processor_dynamodb_policy.arn
}
resource "aws_iam_policy" "vote_processor_sqs_policy" {
name = "vote_processor_sqs_policy"
policy = <<EOF
{
"some json"
}
EOF
}
resource "aws_iam_role_policy_attachment" "vote_processor_sqs_access_policy" {
role = aws_iam_role.vote_processor_lambda_iam_role.name
policy_arn = aws_iam_policy.vote_processor_sqs_policy.arn
}
data "archive_file" "vote_processor_zip_code" {
type = "zip"
source_file = "${path.module}/vote_processor.py"
output_path = "${path.module}/vote_processor.zip"
}
#what to put in handler arg?
resource "aws_lambda_function" "vote_processor_lambda_backend" {
filename = "${path.module}/vote_processor.zip"
function_name = "vote_processor"
role = aws_iam_role.vote_processor_lambda_iam_role.arn
handler = "result.lambda_handler" #should this be __main__?
runtime = "python3.9"
}
Should I rename the Python script's main function to "lambda_handler"? Or change something in the tf code instead?
"__ main __" is not necessary. The following syntax defines lambda functions.
Lambda function handler in Python - AWS Lambda
You can use the following general syntax when creating a function handler in Python:
def handler_name(event, context):
    ...
    return some_value
The handler value to set in the tf code is derived as follows.
Lambda function handler in Python - AWS Lambda
The Lambda function handler name specified at the time that you create a Lambda function is derived from:
The name of the file in which the Lambda handler function is located.
The name of the Python handler function.
A function handler can be any name; however, the default name in the Lambda console is lambda_function.lambda_handler. This function handler name reflects the function name (lambda_handler) and the file where the handler code is stored (lambda_function.py).
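Applied to this question, and assuming the handler stays in the vote_processor.py file that the archive_file data source zips, one way to sketch the change is to replace the "if __name__ == '__main__':" polling loop with a handler function:

# vote_processor.py
def lambda_handler(event, context):
    logging.info('--------inside lambda_handler-------')
    # A Lambda invocation should not run a "while True" loop; process whatever
    # is currently visible on the queue and return.
    messages = queue.receive_messages(MessageAttributeNames=['vote', 'voter'])
    for message in messages:
        process_message(message)
    return {"processed": len(messages)}

With the file named vote_processor.py (the source_file of the archive_file data source), the handler argument in the aws_lambda_function resource then becomes "vote_processor.lambda_handler" (file name followed by function name), not result.lambda_handler or __main__.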

AWS CloudFormation Stack Creation Keeps Failing

I'm trying to create a DynamoDB table using a CloudFormation stack, however I keep receiving the 'CREATE_FAILED' error in the AWS console and I'm not sure where I'm going wrong.
My method to create_stack:
cf = boto3.client('cloudformation')
stack_name = 'teststack'

with open('dynamoDBTemplate.json') as json_file:
    template = json.load(json_file)
    template = str(template)
    try:
        response = cf.create_stack(
            StackName=stack_name,
            TemplateBody=template,
            TimeoutInMinutes=123,
            ResourceTypes=[
                'AWS::DynamoDB::Table',
            ],
            OnFailure='DO_NOTHING',
            EnableTerminationProtection=True
        )
        print(response)
    except ClientError as e:
        print(e)
And here is my JSON file:
{
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "myDynamoDBTable": {
            "Type": "AWS::DynamoDB::Table",
            "Properties": {
                "AttributeDefinitions": [
                    {
                        "AttributeName": "Filename",
                        "AttributeType": "S"
                    },
                    {
                        "AttributeName": "Positive Score",
                        "AttributeType": "S"
                    },
                    {
                        "AttributeName": "Negative Score",
                        "AttributeType": "S"
                    },
                    {
                        "AttributeName": "Mixed Score",
                        "AttributeType": "S"
                    }
                ],
                "KeySchema": [
                    {
                        "AttributeName": "Filename",
                        "KeyType": "HASH"
                    }
                ],
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": "5",
                    "WriteCapacityUnits": "5"
                },
                "TableName": "testtable"
            }
        }
    }
}
My console prints the created stack but there is no clear indication in the console as to why it keeps failing.
Take a look at the Events tab for your stack. It will show you the detailed actions and explain which step first failed. Specifically it will tell you:
One or more parameter values were invalid: Number of attributes in KeySchema does not exactly match number of attributes defined in AttributeDefinitions (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: 12345; Proxy: null)
The problem is that you have provided definitions for all of your table attributes. You should only provide the key attributes.
Per the AttributeDefinitions documentation:
[AttributeDefinitions is] A list of attributes that describe the key schema for the table and indexes.
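For example, a minimal sketch (reusing the question's template file and resource names) that keeps only the key attribute before calling create_stack:

import json
import boto3

cf = boto3.client('cloudformation')

with open('dynamoDBTemplate.json') as json_file:
    template = json.load(json_file)

props = template['Resources']['myDynamoDBTable']['Properties']
key_names = {k['AttributeName'] for k in props['KeySchema']}  # {'Filename'}
# Only attributes used in the key schema (or an index) may appear here;
# the other columns can still be written to items without being declared.
props['AttributeDefinitions'] = [
    a for a in props['AttributeDefinitions'] if a['AttributeName'] in key_names
]

response = cf.create_stack(
    StackName='teststack',
    TemplateBody=json.dumps(template),  # serialize back to JSON
)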

Trigger python script on EMR from Lambda on S3 Object arrival with Object details [duplicate]

I want to execute a spark-submit job on an AWS EMR cluster based on a file-upload event on S3. I am using an AWS Lambda function to capture the event, but I have no idea how to submit a spark-submit job to the EMR cluster from the Lambda function.
Most of the answers that I found talked about adding a step to the EMR cluster, but I do not know whether the added step can fire "spark submit --with args".
You can; I had to do the same thing last week!
Using boto3 for Python (other languages definitely have a similar solution), you can either start a cluster with the step already defined, or attach a step to an already-running cluster.
Defining the cluster with the step
import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")
    cluster_id = conn.run_job_flow(
        Name='ClusterName',
        ServiceRole='EMR_DefaultRole',
        JobFlowRole='EMR_EC2_DefaultRole',
        VisibleToAllUsers=True,
        LogUri='s3n://some-log-uri/elasticmapreduce/',
        ReleaseLabel='emr-5.8.0',
        Instances={
            'InstanceGroups': [
                {
                    'Name': 'Master nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 1,
                },
                {
                    'Name': 'Slave nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 2,
                }
            ],
            'Ec2KeyName': 'key-name',
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False
        },
        Applications=[{
            'Name': 'Spark'
        }],
        Configurations=[{
            "Classification": "spark-env",
            "Properties": {},
            "Configurations": [{
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python35",
                    "PYSPARK_DRIVER_PYTHON": "python35"
                }
            }]
        }],
        BootstrapActions=[{
            'Name': 'Install',
            'ScriptBootstrapAction': {
                'Path': 's3://path/to/bootstrap.script'
            }
        }],
        Steps=[{
            'Name': 'StepName',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': [
                    "/usr/bin/spark-submit", "--deploy-mode", "cluster",
                    's3://path/to/code.file', '-i', 'input_arg',
                    '-o', 'output_arg'
                ]
            }
        }],
    )
    return "Started cluster {}".format(cluster_id)
Attaching a step to an already running cluster
As per here
import sys
import time
import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")
    # chooses the first cluster which is Running or Waiting
    # possibly can also choose by name or already have the cluster id
    clusters = conn.list_clusters()
    # choose the correct cluster
    clusters = [c["Id"] for c in clusters["Clusters"]
                if c["Status"]["State"] in ["RUNNING", "WAITING"]]
    if not clusters:
        sys.stderr.write("No valid clusters\n")
        sys.exit()
    # take the first relevant cluster
    cluster_id = clusters[0]
    # code location on your emr master node
    CODE_DIR = "/home/hadoop/code/"
    # spark configuration example
    step_args = ["/usr/bin/spark-submit", "--spark-conf", "your-configuration",
                 CODE_DIR + "your_file.py", '--your-parameters', 'parameters']
    step = {
        "Name": "what_you_do-" + time.strftime("%Y%m%d-%H:%M"),
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
            'Args': step_args
        }
    }
    action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return "Added step: %s" % (action)
AWS Lambda function Python code if you want to execute a Spark jar by POSTing it to the Livy batches endpoint (port 8998) on the EMR master node:
from botocore.vendored import requests
import json

def lambda_handler(event, context):
    headers = {"content-type": "application/json"}
    url = 'http://ip-address.ec2.internal:8998/batches'
    payload = {
        # Livy expects a single main jar in 'file'; dependency jars go in 'jars'.
        'file': 's3://Bucket/Orchestration/SparkCode.jar',
        'jars': ['s3://Bucket/Orchestration/RedshiftJDBC41.jar',
                 's3://Bucket/Orchestration/mysql-connector-java-8.0.12.jar'],
        'className': 'Main Class Name',
        'args': [event.get('rootPath')]
    }
    res = requests.post(url, data=json.dumps(payload), headers=headers, verify=False)
    json_data = json.loads(res.text)
    return json_data.get('id')
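As a follow-up, the id returned by Livy can be polled to check whether the batch finished; a small sketch against the same endpoint (the URL is the same placeholder as above):

import json
from botocore.vendored import requests

def get_batch_state(batch_id):
    # GET /batches/{batchId} on the Livy endpoint returns the batch details,
    # including its state (e.g. "starting", "running", "success", "dead").
    url = 'http://ip-address.ec2.internal:8998/batches/%s' % batch_id
    res = requests.get(url, headers={"content-type": "application/json"}, verify=False)
    return json.loads(res.text).get('state')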

Add a Protected Range to an existing NamedRange

I have an existing worksheet with an existing NamedRange for it and I would like to call the batch_update method of the API to protect that range from being edited by anyone other than the user that makes the batch_update call.
I have seen an example on how to add protected ranges via a new range definition, but not from an existing NamedRange.
I know I need to send an addProtectedRange request. Can I define the request body with a Sheetname!NamedRange notation?
this_range = worksheet_name + "!" + nrange
batch_update_spreadsheet_request_body = {
    'requests': [
        {
            "addProtectedRange": {
                "protectedRange": {
                    "range": {
                        "name": this_range,
                    },
                    "description": "Protecting xyz",
                    "warningOnly": False
                }
            }
        }
    ],
}
EDIT: Given @Tanaike's feedback, I adapted the call to something like:
body = {
    "requests": [
        {
            "addProtectedRange": {
                "protectedRange": {
                    "namedRangeId": namedRangeId,
                    "description": "Protecting via gsheets_manager",
                    "warningOnly": False,
                    "requestingUserCanEdit": False
                }
            }
        }
    ]
}
res2 = service.spreadsheets().batchUpdate(spreadsheetId=ssId, body=body).execute()
print(res2)
But although it lists the new protections, it still lists 5 different users (all of them) as editors. If I try to manually edit the protection added by my gsheets_manager script, it complains that the range is invalid.
Interestingly, it seems to ignore the requestingUserCanEdit flag, according to the returned message:
{u'spreadsheetId': u'NNNNNNNNNNNNNNNNNNNNNNNNNNNN', u'replies': [{u'addProtectedRange': {u'protectedRange': {u'requestingUserCanEdit': True, u'description': u'Protecting via gsheets_manager', u'namedRangeId': u'1793914032', u'editors': {}, u'protectedRangeId': 2012740267, u'range': {u'endColumnIndex': 1, u'sheetId': 1196959832, u'startColumnIndex': 0}}}}]}
Any ideas?
How about using namedRangeId for your situation? The flow of the sample script is as follows:
- Retrieve the namedRangeId using spreadsheets().get of the Sheets API.
- Set a protected range using the namedRangeId with spreadsheets().batchUpdate of the Sheets API.
Sample script:
nrange = "### name ###"
ssId = "### spreadsheetId ###"
res1 = service.spreadsheets().get(spreadsheetId=ssId, fields="namedRanges").execute()
namedRangeId = ""
for e in res1['namedRanges']:
if e['name'] == nrange:
namedRangeId = e['namedRangeId']
break
body = {
"requests": [
{
"addProtectedRange": {
"protectedRange": {
"namedRangeId": namedRangeId,
"description": "Protecting xyz",
"warningOnly": False
}
}
}
]
}
res2 = service.spreadsheets().batchUpdate(spreadsheetId=ssId, body=body).execute()
print(res2)
Note:
This script assumes that the Sheets API can be used in your environment.
This is a simple sample script, so please modify it for your situation.
References:
ProtectedRange
Named and Protected Ranges
If this was not what you want, I'm sorry.
Edit:
In my above answer, I modified your script using your settings. If you want to protect the named range, please modify body as follows.
Modified body
body = {
    "requests": [
        {
            "addProtectedRange": {
                "protectedRange": {
                    "namedRangeId": namedRangeId,
                    "description": "Protecting xyz",
                    "warningOnly": False,
                    "editors": {"users": ["### your email address ###"]},  # Added
                }
            }
        }
    ]
}
With this, the named range can be modified only by you. I use such settings and I have confirmed that they work in my environment. But if this doesn't work in your situation, I'm sorry.

How to access the event object with python in AWS Lambda?

To follow up on this question:
Filter CloudWatch Logs to extract Instance ID
I think it leaves the question incomplete because it does not say how to access the event object with Python.
My goal is to:
- read the instance that was triggered by a change in running state
- get a tag value associated with the instance
- start all other instances that have the same tag
The CloudWatch trigger event is:
{
    "source": [
        "aws.ec2"
    ],
    "detail-type": [
        "EC2 Instance State-change Notification"
    ],
    "detail": {
        "state": [
            "running"
        ]
    }
}
I can see examples like this:
def lambda_handler(event, context):
    # here I want to get the instance tag value
    # and set the tag filter based on the instance that
    # triggered the event
    filters = [
        {
            'Name': 'tag:StartGroup',
            'Values': ['startgroup1']
        },
        {
            'Name': 'instance-state-name',
            'Values': ['running']
        }
    ]
    instances = ec2.instances.filter(Filters=filters)
I can see the event object, but I don't see how to drill down into the tag of the instance that had its state changed to running.
Please, what is the object attribute through which I can get a tag from the triggered instance?
I suspect it is something like:
myTag = event.details.instance-id.tags["startgroup1"]
The event data passed to Lambda contains the Instance ID.
You then need to call describe_tags() to retrieve the tags for that instance.
import boto3

client = boto3.client('ec2')
client.describe_tags(Filters=[
    {
        'Name': 'resource-id',
        'Values': [
            event['detail']['instance-id']
        ]
    }
])
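The response of describe_tags() is a list of tag descriptions rather than a flat mapping. Continuing from the call above, a small sketch (the StartGroup key is just the example tag used elsewhere in this question) to turn it into a dict and read one value:

response = client.describe_tags(Filters=[
    {'Name': 'resource-id', 'Values': [event['detail']['instance-id']]}
])
# Each entry looks like {'Key': ..., 'Value': ..., 'ResourceId': ..., 'ResourceType': ...}
tags = {t['Key']: t['Value'] for t in response['Tags']}
start_group = tags.get('StartGroup')  # e.g. 'startgroup1'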
In the detail section of the event you will get the instance ID. Using the instance ID and the AWS SDK you can query the tags. The following is a sample event:
{
    "version": "0",
    "id": "ee376907-2647-4179-9203-343cfb3017a4",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "123456789012",
    "time": "2015-11-11T21:30:34Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"
    ],
    "detail": {
        "instance-id": "i-abcd1111",
        "state": "running"
    }
}
This is what I came up with...
Please let me know how it can be done better. Thanks for the help.
# StartMeUp_Instances_byOne
#
# This lambda script is triggered by a CloudWatch Event, startGroupByInstance.
# Every evening a separate lambda script is launched on a schedule to stop
# all non-essential instances.
#
# This script will turn on all instances with a LaunchGroup tag that matches
# a single instance which has been changed to the running state.
#
# To start all instances in a LaunchGroup,
# start one of the instances in the LaunchGroup and wait about 5 minutes.
#
# Costs to run: approx. $0.02/month
# https://s3.amazonaws.com/lambda-tools/pricing-calculator.html
# 150 executions per month * 128 MB Memory * 60000 ms Execution Time
#
# Problems: talk to chrisj
# ======================================
# test system
# this is what the event object looks like (see below)
# it is configured in the test event object with a specific instance-id
# change that to test a different instance-id with a different LaunchGroup
# { "version": "0",
#   "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#   "detail-type": "EC2 Instance State-change Notification",
#   "source": "aws.ec2",
#   "account": "999999999999999",
#   "time": "2015-11-11T21:30:34Z",
#   "region": "us-east-1",
#   "resources": [
#     "arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"
#   ],
#   "detail": {
#     "instance-id": "i-0aad9474",  # <---------- chg this
#     "state": "running"
#   }
# }
# ======================================
import boto3
import logging
import json

ec2 = boto3.resource('ec2')

def get_instance_LaunchGroup(iid):
    # When given an instance ID as str e.g. 'i-1234567',
    # return the instance LaunchGroup.
    ec2 = boto3.resource('ec2')
    ec2instance = ec2.Instance(iid)
    thisTag = ''
    for tags in ec2instance.tags:
        if tags["Key"] == 'LaunchGroup':
            thisTag = tags["Value"]
    return thisTag

# this is the entry point for the cloudwatch trigger
def lambda_handler(event, context):
    # get the instance id that triggered the event
    thisInstanceID = event['detail']['instance-id']
    print("instance-id: " + thisInstanceID)
    # get the LaunchGroup tag value of the thisInstanceID
    thisLaunchGroup = get_instance_LaunchGroup(thisInstanceID)
    print("LaunchGroup: " + thisLaunchGroup)
    if thisLaunchGroup == '':
        print("No LaunchGroup associated with this InstanceID - ending lambda function")
        return
    # set the filters
    filters = [
        {
            'Name': 'tag:LaunchGroup',
            'Values': [thisLaunchGroup]
        },
        {
            'Name': 'instance-state-name',
            'Values': ['stopped']
        }
    ]
    # get the instances based on the filter, thisLaunchGroup and stopped
    instances = ec2.instances.filter(Filters=filters)
    # get the stopped instance IDs
    stoppedInstances = [instance.id for instance in instances]
    # make sure there are some instances not already started
    if len(stoppedInstances) > 0:
        startingUp = ec2.instances.filter(InstanceIds=stoppedInstances).start()
    print("Finished launching all instances for tag: " + thisLaunchGroup)
So, here's how I got the tags in my Python code for my Lambda function.
import boto3

ec2 = boto3.resource('ec2')
instance = ec2.Instance(instanceId)  # instanceId comes from the event, as shown above
# get image_id from instance-id
imageId = instance.image_id
print(imageId)
for tags in instance.tags:
    if tags["Key"] == 'Name':
        newName = tags["Value"] + ".mydomain.com"
        print(newName)
So I use instance.tags, check for the "Key" matching my Name tag, and pull out the "Value" to create the FQDN (Fully Qualified Domain Name).
