I have AWS WAF CDK that is working with rules, and now I'm trying to add a rule in WAF with multiple statements, but I'm getting this error:
Resource handler returned message: "Error reason: You have used none or multiple values for a field that requires exactly one value., field: STATEMENT, parameter: Statement (Service: Wafv2, Status Code: 400, Request ID: 6a36bfe2-543c-458a-9571-e929142f5df1, Extended Request ID: null)" (RequestToken: b751ae12-bb60-bb75-86c0-346926687ea4, HandlerErrorCode: InvalidRequest)
My Code:
{
'name': 'ruleName',
'priority': 3,
'statement': {
'orStatement': {
'statements': [
{
'iPSetReferenceStatement': {
'arn': 'arn:myARN'
}
},
{
'iPSetReferenceStatement': {
'arn': 'arn:myARN'
}
}
]
}
},
'action': {
'allow': {}
},
'visibilityConfig': {
'sampledRequestsEnabled': True,
'cloudWatchMetricsEnabled': True,
'metricName': 'ruleName'
}
},
There are two things going on there:
Firstly, your capitalization is off. iPSetReferenceStatement cannot be parsed and creates an empty statement reference. The correct key is ipSetReferenceStatement.
However, as mentioned here, there is a jsii implementation bug causing some issues with the IPSetReferenceStatementProperty. This causes it not to be parsed properly resulting in a jsii error when synthesizing.
You can fix it by using the workaround mentioned in the post.
Add to your file containing the construct:
import jsii
from aws_cdk import aws_wafv2 as wafv2 # just for clarity, you might already have this imported
#jsii.implements(wafv2.CfnRuleGroup.IPSetReferenceStatementProperty)
class IPSetReferenceStatement:
#property
def arn(self):
return self._arn
#arn.setter
def arn(self, value):
self._arn = value
Then define your ip reference statement as follows:
ip_set_ref_stmnt = IPSetReferenceStatement()
ip_set_ref_stmnt.arn = "arn:aws:..."
ip_set_ref_stmnt_2 = IPSetReferenceStatement()
ip_set_ref_stmnt_2.arn = "arn:aws:..."
Then in the rules section of the webacl, you can use it as follows:
...
rules=[
{
'name': 'ruleName',
'priority': 3,
'statement': {
'orStatement': {
'statements': [
wafv2.CfnWebACL.StatementProperty(
ip_set_reference_statement=ip_set_ref_stmnt
),
wafv2.CfnWebACL.StatementProperty(
ip_set_reference_statement=ip_set_ref_stmnt_2
),
]
}
},
'action': {
'allow': {}
},
'visibilityConfig': {
'sampledRequestsEnabled': True,
'cloudWatchMetricsEnabled': True,
'metricName': 'ruleName'
}
}
]
...
This should synthesize your stack as expected.
When I use the watch() function on my collection, I am passing a aggregation to filter what comes through. I was able to get operationType to work correctly, but I also only want to include documents in which the city field is equal to Vancouver. The current syntax I am using does not work:
change_stream = client.mydb.mycollection.watch([
{
'$match': {
'operationType': { '$in': ['replace', 'insert'] },
'fullDocument': {'city': {'$eq': 'Vancouver'} }
}
}
])
And for reference, this is the what the dictionary that I'm aggregating looks like:
{'_id': {'_data': '825F...E0004'},
'clusterTime': Timestamp(1595565179, 2),
'documentKey': {'_id': ObjectId('70fc7871...')},
'fullDocument': {'_id': ObjectId('70fc7871...'),
'city': 'Vancouver',
'ns': {'coll': 'notification', 'db': 'pipeline'},
'operationType': 'replace'}
I found I just have to use a dot to access the nested dictionary:
change_stream = client.mydb.mycollection.watch([
{
'$match': {
'operationType': { '$in': ['replace', 'insert'] },
'fullDocument.city': 'Vancouver' }
}
}
])
I need to get a value inside an url (/some/url/value as a Sub Resource) usable as a parameter in an aggregation $match :
event/mac/11:22:33:44:55:66 --> {value:'11:22:33:44:55:66'}
and then:
{"$match":{"MAC":"$value"}},
here is a non-working example :
event = {
'url': 'event/mac/<regex("([\w:]+)"):value>',
'datasource': {
'source':"event",
'aggregation': {
'pipeline': [
{"$match": {"MAC":"$value"}},
{"$group": {"_id":"$MAC", "total": {"$sum": "$count"}}},
]
}
}
}
this example is working correctly with :
event/mac/blablabla?aggregate={"$value":"aa:11:bb:22:cc:33"}
any suggestion ?
The real quick and easy way would be to
path = "event/mac/11:22:33:44:55:66"
value = path.replace("event/mac/", "")
# or
value = path.split("/")[-1]
Python 2.7.12
boto3==1.3.1
How can I add a step to a running EMR cluster and have the cluster terminated after the step is complete, regardless of it fails or succeeds?
Create the cluster
response = client.run_job_flow(
Name=name,
LogUri='s3://mybucket/emr/',
ReleaseLabel='emr-5.9.0',
Instances={
'MasterInstanceType': instance_type,
'SlaveInstanceType': instance_type,
'InstanceCount': instance_count,
'KeepJobFlowAliveWhenNoSteps': True,
'Ec2KeyName': 'KeyPair',
'EmrManagedSlaveSecurityGroup': 'sg-1234',
'EmrManagedMasterSecurityGroup': 'sg-1234',
'Ec2SubnetId': 'subnet-1q234',
},
Applications=[
{'Name': 'Spark'},
{'Name': 'Hadoop'}
],
BootstrapActions=[
{
'Name': 'Install Python packages',
'ScriptBootstrapAction': {
'Path': 's3://mybucket/code/spark/bootstrap_spark_cluster.sh'
}
}
],
VisibleToAllUsers=True,
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole',
Configurations=[
{
'Classification': 'spark',
'Properties': {
'maximizeResourceAllocation': 'true'
}
},
],
)
Add a step
response = client.add_job_flow_steps(
JobFlowId=cluster_id,
Steps=[
{
'Name': 'Run Step',
'ActionOnFailure': 'TERMINATE_CLUSTER',
'HadoopJarStep': {
'Args': [
'spark-submit',
'--deploy-mode', 'cluster',
'--py-files',
's3://mybucket/code/spark/spark_udfs.py',
's3://mybucket/code/spark/{}'.format(spark_script),
'--some-arg'
],
'Jar': 'command-runner.jar'
}
}
]
)
This successfully adds a step and runs, however, when the step completes successfully, I would like the cluster to auto-terminate as noted in the AWS CLI: http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
In your case (creating the cluster using boto3) you can add these flags
'TerminationProtected': False, 'AutoTerminate': True, to your cluster creation. In this way after your step finished to run the cluster will be shut-down.
Another solution is to add another step to kill the cluster immediately after the step that you want to run. So basically you need to run this command as step
aws emr terminate-clusters --cluster-ids your_cluster_id
The tricky part is to retrive the cluster_id.
Here you can find some solution: Does an EMR master node know it's cluster id?
The 'AutoTerminate': True parameter as suggested did not work for me. However, it worked when I set the parameter 'KeepJobFlowAliveWhenNoSteps' from True to False. Your Code should look then as the following:
response = client.run_job_flow(
Name=name,
LogUri='s3://mybucket/emr/',
ReleaseLabel='emr-5.9.0',
Instances={
'MasterInstanceType': instance_type,
'SlaveInstanceType': instance_type,
'InstanceCount': instance_count,
'KeepJobFlowAliveWhenNoSteps': False,
'Ec2KeyName': 'KeyPair',
'EmrManagedSlaveSecurityGroup': 'sg-1234',
'EmrManagedMasterSecurityGroup': 'sg-1234',
'Ec2SubnetId': 'subnet-1q234',
},
Applications=[
{'Name': 'Spark'},
{'Name': 'Hadoop'}
],
BootstrapActions=[
{
'Name': 'Install Python packages',
'ScriptBootstrapAction': {
'Path': 's3://mybucket/code/spark/bootstrap_spark_cluster.sh'
}
}
],
VisibleToAllUsers=True,
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole',
Configurations=[
{
'Classification': 'spark',
'Properties': {
'maximizeResourceAllocation': 'true'
}
},
],
)
You can create a short-lived cluster that automatically terminates after all steps have been run by specifying 'KeepJobFlowAliveWhenNoSteps': False in the Instances param. I've added a complete example to GitHub that shows how to do this.
Here's some of the code from the demo:
def run_job_flow(
name, log_uri, keep_alive, applications, job_flow_role, service_role,
security_groups, steps, emr_client):
try:
response = emr_client.run_job_flow(
Name=name,
LogUri=log_uri,
ReleaseLabel='emr-5.30.1',
Instances={
'MasterInstanceType': 'm5.xlarge',
'SlaveInstanceType': 'm5.xlarge',
'InstanceCount': 3,
'KeepJobFlowAliveWhenNoSteps': keep_alive,
'EmrManagedMasterSecurityGroup': security_groups['manager'].id,
'EmrManagedSlaveSecurityGroup': security_groups['worker'].id,
},
Steps=[{
'Name': step['name'],
'ActionOnFailure': 'CONTINUE',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': ['spark-submit', '--deploy-mode', 'cluster',
step['script_uri'], *step['script_args']]
}
} for step in steps],
Applications=[{
'Name': app
} for app in applications],
JobFlowRole=job_flow_role.name,
ServiceRole=service_role.name,
EbsRootVolumeSize=10,
VisibleToAllUsers=True
)
cluster_id = response['JobFlowId']
logger.info("Created cluster %s.", cluster_id)
except ClientError:
logger.exception("Couldn't create cluster.")
raise
else:
return cluster_id
And here's some code that calls this function with some real params:
output_prefix = 'pi-calc-output'
pi_step = {
'name': 'estimate-pi-step',
'script_uri': f's3://{bucket_name}/{script_key}',
'script_args':
['--partitions', '3', '--output_uri',
f's3://{bucket_name}/{output_prefix}']
}
cluster_id = emr_basics.run_job_flow(
f'{prefix}-cluster', f's3://{bucket_name}/logs',
False, ['Hadoop', 'Hive', 'Spark'], job_flow_role, service_role,
security_groups, [pi_step], emr_client)
I'm using Tornado_JSON which is based on jsonschema and there is a problem with my schema definition. I tried fixing it in an online schema validator and the problem seems to lie in "additionalItems": True. True with capital T works for python and leads to an error in the online validator (Schema is invalid JSON.). With true the online validator is happy and the example json validates against the schema, but my python script doesn't start anymore (NameError: name 'true' is not defined). Can this be resolved somehow?
#schema.validate(
"""input_schema={
'type': 'object',
'properties': {
'DB': {
'type': 'number'
},
'values': {
'type': 'array',
'items': [
{
'type': 'array',
'items': [
{
'type': 'string'
},
{
'type': [
'number',
'string',
'boolean',
'null'
]
}
]
}
],
'additionalItems': true
}
}
},
input_example={
'DB': 22,
'values': [['INT', 44],['REAL', 33.33],['CHAR', 'b']]
}"""
)
I changed it according to your comments ( external file with json.loads() ). Perfect. Thank you.
Put the schema in a triple-quoted string or an external file, then parse it with json.loads(). Use the lower-case spelling.
The error stems from trying to put a builtin Python datatype into a JSON schema. The latter is a template syntax that is used to check type consistency and should not hold actual data. Instead, under input_schema you'll want to define "additionalItems" to be of { "type": "boolean" } and then add it to the test JSON in your input_example with a boolean after for testing purposes.
Also, I'm not too familiar with Tornado_JSON but it looks like you aren't complying with the schema definition language by placing "additionalItems" inside of the "values" property. Bring that up one level.
More specifically, I think what you're trying to do should look like:
"values": {
...value schema definition...
}
"additionalItems": {
"type": "boolean"
}
And the input examples would become:
input_example={
"DB": 22,
"values": [['INT', 44],['REAL', 33.33],['CHAR', 'b']],
"additionalItems": true
}