I have been developing some custom resources for CloudFormation (CF) backed by Lambda, using the custom resource helper. It started off OK, but then the CF stack took ages to create or delete. When I checked the CloudWatch logs I noticed an error after running the create or delete functions in my Lambda.
[7cfecd7b-69df-4408-ab12-a764a8bf674e][2021-02-07 12:41:12,535][ERROR] send(..) failed executing requests.put(..):
Formatting field not found in record: 'requestid'
I noticed some others had this issue but no resolution. I have used the generic code from the link below; my custom code works and completes, but the failure seems to happen when the response is sent back to CF. Looking through crhelper.py, the only reference I can find to 'requestid' is this:
logfmt = '[%(requestid)s][%(asctime)s][%(levelname)s] %(message)s \n'
mainlogger.handlers[0].setFormatter(logging.Formatter(logfmt))
return logging.LoggerAdapter(mainlogger, {'requestid': event['RequestId']})
Reference
To understand the error you're having, we would need a reproducible code example of what you're doing. Keep in mind that whenever a custom resource operation fails, the stack can take ages to finish, as you have noticed.
That said, there is a good alternative to the original custom resource helper you're using; in my experience it works very well, keeps the code much simpler (thanks to a good level of abstraction), and follows best practices. It is the custom resource helper framework, as explained in this AWS blog.
You can find more details about the implementation on github here.
Basically, after downloading the needed dependencies and loading them into your Lambda (this depends on how you handle custom Lambda dependencies), you can manage your custom resource operations like this:
from crhelper import CfnResource
import logging

logger = logging.getLogger(__name__)

# Initialise the helper
helper = CfnResource()

try:
    ## put here your initial code for every operation
    pass
except Exception as e:
    helper.init_failure(e)

@helper.create
def create(event, context):
    logger.info("Got Create")
    print('Here we are creating some cool stuff')

@helper.update
def update(event, context):
    logger.info("Got Update")
    print('Here you update the things you want')

@helper.delete
def delete(event, context):
    logger.info("Got Delete")
    print('Here you handle the delete operation')

# This will call the defined function according to the
# CloudFormation operation that is running (create, update or delete)
def handler(event, context):
    helper(event, context)
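If your template also needs values back from the custom resource (for example via Fn::GetAtt), crhelper lets you populate helper.Data and return a physical resource ID from the create handler. A minimal sketch of that, with an illustrative attribute key and resource ID of my own choosing:

@helper.create
def create(event, context):
    logger.info("Got Create")
    # Anything placed in helper.Data is sent back to CloudFormation and
    # becomes available via Fn::GetAtt (the key name here is just an example)
    helper.Data.update({"MyAttribute": "some-value"})
    # The return value is used as the PhysicalResourceId of the custom resource
    return "my-custom-resource-id"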
Related
I'm currently discovering Prefect and I'm trying to deploy it to schedule workflows. I struggle a bit to understand how to access some data, though. Here is my problem: I create a deployment and run it via the Python API, and I need the ID of the flow run it creates (to cancel it, should other things happen outside of the flow).
When I run it without any scheduling I can access the data I need (the flow run UUID), but I do want the scheduling part. It may be because the run_deployment function is asynchronous, but as I am nowhere near being a Python expert I don't know for sure (that, and the fact that my code never exits after calling the main() function).
Here is what my code looks like:
from prefect import flow, task
from prefect.deployments import Deployment, run_deployment
from datetime import datetime, date, time, timezone
import dateutil.parser

# Import the flow:
from script import my_flow

# Configure the deployment:
deployment_name = "my_deployment"

# Create the deployment for the flow:
deployment = Deployment.build_from_flow(
    flow = my_flow,
    name = deployment_name,
    version = 1,
    work_queue_name = "my_queue",
)
deployment.apply()

def main():
    # Schedule a flow run based on the deployment:
    response = run_deployment(
        name = "my_flow/" + deployment_name,
        parameters = {my_param},
        scheduled_time = dateutil.parser.isoparse(scheduledDate),
        flow_run_name = "my_run",
    )
    print(response)

if __name__ == "__main__":
    main()
    exit()
I searched a bit and saw in that post that it is possible to print the flow run ID as it is executed, but in my case I need it before the execution.
Is there any way to get that data (using the Python API)? Or to set the flow run ID myself? (I've already thoroughly checked the docs; I'm quite sure the latter is not possible.)
Thanks a lot for your time!
Gauthier
As of 2.7.12 - released the same day you posted your question - you can create names for flows programmatically. Does that get you what you need?
Both tasks and flows now expose a mechanism for customizing the names of runs! This new keyword argument (flow_run_name for flows, task_run_name for tasks) accepts a string that will be used to create a run name for each run of the function. The most basic usage is as follows:
from datetime import datetime
from prefect import flow, task

@task(task_run_name="custom-static-name")
def my_task(name):
    print(f"hi {name}")

@flow(flow_run_name="custom-but-fixed-name")
def my_flow(name: str, date: datetime):
    return my_task(name)

my_flow()
This is great, but doesn’t help distinguish between multiple runs of the same task or flow. In order to make these names dynamic, you can template them using the parameter names of the task or flow function, using all of the basic rules of Python string formatting as follows:
from datetime import datetime
from prefect import flow, task

@task(task_run_name="{name}")
def my_task(name):
    print(f"hi {name}")

@flow(flow_run_name="{name}-on-{date:%A}")
def my_flow(name: str, date: datetime):
    return my_task(name)

my_flow()
See the docs or https://github.com/PrefectHQ/prefect/pull/8378 for more details.
run_deployment returns a flow run object - which you named response in your code.
If you want to get the ID before the flow run is actually executed, you just have to set timeout=0, so that run_deployment will return immediately after submission.
You only have to do:
flow_run = run_deployment(
    name = "my_flow/" + deployment_name,
    parameters = {my_param},
    scheduled_time = dateutil.parser.isoparse(scheduledDate),
    flow_run_name = "my_run",
    timeout=0
)
print(flow_run.id)
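Once you have flow_run.id you can use it later to cancel the scheduled run. Here is a minimal sketch of one way to do that, assuming Prefect 2.x; the state-setting call is my own suggestion rather than part of the answer above, so double-check it against your Prefect version (recent 2.x releases also provide a prefect flow-run cancel <id> CLI command):

import asyncio
from prefect import get_client
from prefect.states import Cancelled

async def cancel_run(flow_run_id):
    async with get_client() as client:
        # Move the (still scheduled) run into a Cancelled state
        await client.set_flow_run_state(flow_run_id, state=Cancelled())

asyncio.run(cancel_run(flow_run.id))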
I'm using PySpark to extract files, do basic transformations, and load the data into Hive. I use a for loop to find the extract files and load them into Hive. We have around 60 tables, and looping over each file and loading it takes time, so I'm using ThreadPoolExecutor to run the loads in parallel. Here is the sample code prototype.
from concurrent import futures

def func(args):
    df = extract(args)
    tbl, status = load(df)
    return tbl, status

def extract(args):
    ### find the extract files and read them into a DataFrame ###
    return df

def load(df):
    ### load the DataFrame into Hive ###
    status[tbl] = 'Completed'
    return tbl, status

status = {}
listA = ['ABC', 'BCD', 'DEF']
prcs = []
with futures.ThreadPoolExecutor() as executor:
    for i in listA:
        prcs.append(executor.submit(func, i))
    for tsk in futures.as_completed(prcs):
        tbl, status = tsk.result()
        print(tbl)
        print(status)
It works well. I'm redirecting the spark-submit log to a file, but when using ThreadPoolExecutor the logs are jumbled together and I can't debug anything. Is there a better way to group the logs by thread? Here each thread corresponds to a table. I'm new to Python. Kindly help.
As described here.
Spark uses log4j for logging. You can configure it by adding a
log4j.properties file in the conf directory. One way to start is to
copy the existing log4j.properties.template located there.
So you can configure it via log4j.properties, or programmatically as described in How to configure the log level of a specific logger using log4j in pyspark?. That post is about the log level, but the same concept applies to configuring the appender.
As for what to put in the logs to make them more meaningful, you need some correlation ID to tie related log lines together. The thread ID/name may or may not be enough; if it isn't, see the note about MDC below. You can add your own custom MDC (see: Spark application and logging MDC (Mapped Diagnostic Context)).
By default, Spark adds 1 record to the MDC (Mapped Diagnostic Context): mdc.taskName, which shows something like task 1.0 in stage 0.0. You can add %X{mdc.taskName} to your patternLayout in order to print it in the logs. Moreover, you can use spark.sparkContext.setLocalProperty(s"mdc.$name", "value") to add user specific data into MDC. The key in MDC will be the string of “mdc.$name”.
If taskName isn't sufficient, you can create your own correlation ID, add it to the MDC with setLocalProperty(), and reference it in your patternLayout.
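A minimal sketch of that idea applied to your ThreadPoolExecutor case; the MDC key mdc.tableName and the pattern line are my own choices (not anything Spark defines), spark is assumed to be your SparkSession, and the mdc.* local-property support requires a Spark version that implements it (see the quote above):

def func(args):
    # setLocalProperty is per-thread, so each worker thread (i.e. each table)
    # tags its own Spark log lines with the table name
    spark.sparkContext.setLocalProperty("mdc.tableName", str(args))
    df = extract(args)
    tbl, status = load(df)
    return tbl, status

# Then add %X{mdc.tableName} to the patternLayout in log4j.properties, e.g.:
# log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %X{mdc.tableName} %c{1}: %m%n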
I am trying to collect application-specific Prometheus metrics in Django for functions that are called by django-background-tasks.
In my application models.py file, I am first adding a custom metric with:
my_task_metric = Summary("my_task_metric", "My task metric")
Then, I am adding this to my function to capture the timestamp at which this function was last run successfully:
@background()
def my_function():
    # my function code here
    # collecting the metric
    my_task_metric.observe((datetime.now().replace(tzinfo=timezone.utc) - datetime(1970, 1, 1).replace(tzinfo=timezone.utc)).total_seconds())
When I bring up Django, the metric is created and accessible in /metrics. However, after this function is run, the value for sum is 0 as if the metric is not observed. Am I missing something?
Or is there a better way to monitor django-background-tasks with Prometheus? I have tried using the model of django-background-tasks but I found it a bit cumbersome.
I ended up creating a decorator leveraging the Prometheus Pushgateway feature:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_metric_to_prometheus(function):
    registry = CollectorRegistry()
    Gauge(f'{function.__name__}_last_successful_run',
          f'Last time {function.__name__} successfully finished',
          registry=registry).set_to_current_time()
    push_to_gateway('bolero.club:9091', job='batchA', registry=registry)
    return function
and then on my function (the order of the decorators is important):

@background()
@push_metric_to_prometheus
def my_function():
    # my function code here
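One thing to note: written this way, the decorator body runs once at import/decoration time rather than each time the task executes. If the intent is to push the gauge every time the task actually finishes, a wrapping variant like this sketch (same gateway host and job name as above, kept purely for illustration) would do that instead:

import functools
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_metric_to_prometheus(function):
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        result = function(*args, **kwargs)  # run the task first; only push on success
        registry = CollectorRegistry()
        Gauge(f'{function.__name__}_last_successful_run',
              f'Last time {function.__name__} successfully finished',
              registry=registry).set_to_current_time()
        push_to_gateway('bolero.club:9091', job='batchA', registry=registry)
        return result
    return wrapper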
I want to test my Lambda locally using boto3, moto and pytest. This Lambda uses Chalice. When I call the function I try to pass in a fake event to make it run, but it is still missing the context object.
If someone knows how to test it in the cleanest way possible, that would be great.
I tried adding objects to my S3 bucket and retrieving events from it.
I tried simulating fake events.
@app.on_s3_event(bucket=s.BUCKET_NAME, events=['s3:ObjectCreated:*'], prefix=s.PREFIX_PREPROCESSED, suffix=s.SUFFIX)
def handle_pre_processed_event(event):
    """
    Lambda for preprocessed data
    :param event:
    :return:
    """
    # Retrieve the JSON that was added to the S3 bucket
    json_s3 = get_json_file_s3(event.bucket, event.key)
    # Send all the records to DynamoDB
    insert_records(json_s3)
    # Change the path of the file by copying it and deleting the original
    change_path_file(event.key, s.PREFIX_PREPROCESSED)
Here is the lambda I want to test. Thanks for your responses.
If someone gets the same problem: it's because Chalice uses a wrapper around the handler. Pass your S3 notification and a context object when you invoke the handler.
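A minimal sketch of what that can look like in a pytest test. The module name app, the bucket/key values and the FakeContext class are made up for illustration; the raw event follows the standard S3 notification shape that the Chalice wrapper converts for you:

from app import handle_pre_processed_event

class FakeContext:
    # Only the attributes your code actually touches need to exist
    function_name = "handle_pre_processed_event"
    memory_limit_in_mb = 128
    invoked_function_arn = "arn:aws:lambda:eu-west-1:123456789012:function:test"
    aws_request_id = "fake-request-id"

def test_handle_pre_processed_event():
    fake_event = {
        "Records": [{
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-test-bucket"},
                "object": {"key": "preprocessed/my-file.json"},
            },
        }]
    }
    # The Chalice wrapper accepts the raw Lambda event plus a context object
    handle_pre_processed_event(fake_event, FakeContext())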
I'm trying to implement a custom AWS Lambda layer in order to use it with my functions.
It should be a simple layer that gets a parameter from SSM and initializes puresec's function_shield to protect my services.
The code looks more or less like this:
import os
import boto3
import function_shield as shield

STAGE = os.environ['stage']
REGION = os.environ['region']
PARAMETERS_PREFIX = os.environ['parametersPrefix']

class ParameterNotFoundException(Exception):
    pass

session = boto3.session.Session(region_name=REGION)
ssm = session.client('ssm')

# function_shield config
parameter_path = f"/{PARAMETERS_PREFIX}/{STAGE}/functionShieldToken"

try:
    shield_token = ssm.get_parameter(
        Name=parameter_path,
        WithDecryption=True,
    )['Parameter']['Value']
except Exception:
    raise ParameterNotFoundException(f'Parameter {parameter_path} not found.')

policy = {
    "outbound_connectivity": "block",
    "read_write_tmp": "block",
    "create_child_process": "block",
    "read_handler": "block"
}

def configure(p):
    """
    update function_shield policy
    :param p: policy dict
    :return: null
    """
    policy.update(p)
    shield.configure({"policy": policy, "disable_analytics": True, "token": shield_token})

configure(policy)
I want to be able to attach this layer to my functions so they are protected at runtime.
I'm using the Serverless Framework, and it seems my layer was deployed just fine, as was my example function. The AWS console also shows that the layer was attached to my function.
I named my layer 'shield' and tried to import it by that name in my test function:
import os
import shield

def test(event, context):
    # this should be reusable for easy tweaking whenever I need to give
    # more or less permissions to my lambda code
    shield.configure(policy)
    os.system('ls')
    return {
        'rep': 'ok'
    }
Ideally, I should get an error in CloudWatch telling me that function_shield has prevented a child process from running; instead, I receive an error telling me that there is no 'shield' module in my runtime.
What am I missing?
I couldn't find any custom code examples being used for layers apart from numpy, scipy, binaries, etc.
I'm sorry for my stupidity...
Thanks for your kindness!
You also need to name the file in your layer shield.py so that it's importable in Python. Note it does not matter how the layer itself is named. That's a configuration in the AWS world and has no effect on the Python world.
What does have an effect is the structure of the layer archive. You need to place the files you want to import into a python directory, zip it and use the resulting archive as a layer (I'm presuming the Serverless Framework is doing this for you).
In the Lambda execution environment, the layer archive gets extracted into /opt, but it's only /opt/python that's declared in the PYTHONPATH. Hence the need for the "wrapper" python directory.
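In other words, the layer archive should look roughly like this (the zip file name is arbitrary; it's shield.py inside the python directory that makes import shield work):

layer.zip
└── python/
    └── shield.py    # extracted to /opt/python/shield.py, which is on PYTHONPATH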
Have a look here; I have described all the necessary steps to set up and call custom Lambda layer functions from a Lambda.
https://medium.com/@nimesh.kumar031/how-to-set-up-layers-python-in-aws-lambda-functions-1355519c11ed?source=friends_link&sk=af4994c28b33fb5ba7a27a83c35702e3