cloudwatch.CloudwatchHandler('AWS_KEY_ID','AWS_SECRET_KEY','AWS_REGION','AWS_LOG_GROUP','AWS_LOG_STREAM')
I am new to AWS CloudWatch and I am trying to use the lightweight cloudwatch handler in my Python project. I have all the values required for .CloudwatchHandler() except AWS_LOG_STREAM. I don't understand what AWS_LOG_STREAM is or where I can find that value in the AWS console. I googled it and found "A log stream is a sequence of log events that share the same source," but what does "same source" mean, and what is the value for AWS_LOG_STREAM?
I need support; thank you in advance.
As Mohit said, the log stream is a subdivision of the log group, usually used to identify the original execution source (the time and ID of the container, Lambda, or process is common).
In the latest version you can skip naming the log stream, which gives it a timestamp-based log stream name:
handler = cloudwatch.CloudwatchHandler(log_group = 'my_log_group')
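For context, a minimal sketch of wiring the handler into the standard logging module (the log group name is a placeholder, the import path is assumed from the package README, and credentials are assumed to come from the default AWS credential chain):
import logging
from cloudwatch import cloudwatch  # import path assumed from the package README

logger = logging.getLogger("my_app")
logger.setLevel(logging.INFO)

# Only the log group is named here, so the stream gets a timestamp-based name
# and credentials are resolved from the environment/default chain.
handler = cloudwatch.CloudwatchHandler(log_group='my_log_group')
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("Hello from CloudWatch")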
Disclaimer: I am a contributor to the cloudwatch package
AWS_LOG_STREAM is basically the log group's events divided based on execution time. By specifying a stream you get logs for a specific time window rather than everything since inception.
Example: in the case of AWS Lambda, you can check its current log stream with:
LOG_GROUP=log-group
aws logs get-log-events --log-group-name $LOG_GROUP \
  --log-stream-name "$(aws logs describe-log-streams --log-group-name $LOG_GROUP --max-items 1 --order-by LastEventTime --descending --query 'logStreams[].logStreamName' --output text | head -n 1)" \
  --query 'events[].message' --output text
Alternatively, in Python, you can use boto3 to fetch the existing log streams and then call the cloudwatch handler with the respective stream name:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.describe_log_streams
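For example, a minimal sketch using boto3's describe_log_streams to grab the most recently active stream (region and log group names are placeholders):
import boto3

logs = boto3.client("logs", region_name="us-east-1")  # placeholder region

# Most recently active stream in the group
response = logs.describe_log_streams(
    logGroupName="my_log_group",   # placeholder
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)
latest_stream = response["logStreams"][0]["logStreamName"]
# latest_stream can then be passed as the AWS_LOG_STREAM argument shown above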
I am trying to put a subscription on a CloudWatch log group from a Lambda function that is scanning for Lambdas with the right tag. When calling put_subscription_filter, an error is thrown:
"An error occurred (InvalidParameterException) when calling the PutSubscriptionFilter
operation: Could not execute the lambda function. Make sure you have given CloudWatch Logs
permission to execute your function."
The docs for put_subscription_filter state that the iam:PassRole permission is needed. I have granted this. I have made sure it is not a permission issue for the Lambda function by giving it full admin rights.
Reading the error, it indicates that it is CloudWatch Logs that needs permission to execute a function; my guess is that they mean the subscription's destination function. I have tried a lot of different things here but still no cigar.
Setting a subscription filter in the console is straightforward, and no policy is modified or created as far as I can see.
Does anyone have experience with this or any input?
You need to add a Lambda invoke permission so that CloudWatch Logs can invoke the Lambda function when logs are available.
Using the AWS CLI is the simplest way:
aws lambda add-permission \
--function-name "helloworld" \
--statement-id "helloworld" \
--principal "logs.region.amazonaws.com" \
--action "lambda:InvokeFunction" \
--source-arn "arn:aws:logs:region:123456789123:log-group:TestLambda:*"
Using the console:
1. Go to Lambda Function
2. Configuration -> Permissions tab
3. Scroll down and Click Add permissions
4. Choose "AWS service"
5. Principal - the CloudWatch Logs service principal (logs.<region>.amazonaws.com); use the log group ARN as the Source ARN
6. Action - Lambda:InvokeFunction
7. Statement Id - policy statement name, anything meaningful
8. Save
Once done through the CLI or console, try creating the CloudWatch subscription to that Lambda.
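Since the question is about doing this from a Lambda function in Python, here is a boto3 sketch of the same two steps (region, account ID, names and ARNs are placeholders taken from the CLI example above):
import boto3

region = "us-east-1"            # placeholders
account_id = "123456789123"
log_group = "TestLambda"
function_name = "helloworld"

lambda_client = boto3.client("lambda")
logs_client = boto3.client("logs")

# 1. Allow CloudWatch Logs to invoke the destination function
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId="helloworld-logs-invoke",
    Action="lambda:InvokeFunction",
    Principal=f"logs.{region}.amazonaws.com",
    SourceArn=f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*",
)

# 2. Create the subscription filter pointing at that function
logs_client.put_subscription_filter(
    logGroupName=log_group,
    filterName="to-helloworld",
    filterPattern="",  # empty pattern forwards all events
    destinationArn=f"arn:aws:lambda:{region}:{account_id}:function:{function_name}",
)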
I am running EMR clusters kicked off with Airflow and I need some way of passing error messages back to Airflow. Airflow runs in Python, so I need this to be done in Python.
Currently the error logs are in the "Log URI" section under configuration details. Accessing this might be one way to do it, but any way to access the error logs from EMR with Python would be much appreciated.
You can access the EMR logs in S3 with boto3, for example.
The S3 paths would be:
stderr: s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/stderr.gz
stdout: s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/stdout.gz
controller: s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/controller.gz
syslog: s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/syslog.gz
Cluster ID and Step ID can be passed to your different tasks via XCom from the task(s) that create the cluster/steps.
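As a rough sketch (bucket, cluster ID and step ID are placeholders), fetching and decompressing one of these files with boto3 could look like:
import gzip
import boto3

s3 = boto3.client("s3")

bucket = "my-emr-log-bucket"       # placeholders
cluster_id = "j-XXXXXXXXXXXXX"
step_id = "s-XXXXXXXXXXXXX"

key = f"logs/{cluster_id}/steps/{step_id}/stderr.gz"

# Note: EMR can take a few minutes to ship step logs to S3
obj = s3.get_object(Bucket=bucket, Key=key)
stderr_text = gzip.decompress(obj["Body"].read()).decode("utf-8")
print(stderr_text)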
Warning for Spark (might be applicable to other types of steps):
This works if you submit your steps in client mode; if you are using cluster mode, you would need to change the URL to fetch the application logs of the driver instead.
Our Python Dataflow pipeline works locally but not when deployed using the Dataflow managed service on Google Cloud Platform. It doesn't show signs of being connected to the Pub/Sub subscription. We have tried subscribing to both the subscription and the topic, but neither of them worked. The messages accumulate in the Pub/Sub subscription and the Dataflow pipeline doesn't show signs of being called at all. We have double-checked that the project is the same.
Any directions on this would be very much appreciated.
Here is the code to connect to a pull subscription:
with beam.Pipeline(options=options) as p:
    something = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/cloudflow"
    )
Here are the options used:
options = PipelineOptions()
file_processing_options = PipelineOptions().view_as(FileProcessingOptions)
if options.view_as(GoogleCloudOptions).project is None:
    print(sys.argv[0] + ": error: argument --project is required")
    sys.exit(1)
options.view_as(SetupOptions).save_main_session = True
options.view_as(StandardOptions).streaming = True
The PubSub subscription has this configuration:
Delivery type: Pull
Subscription expiration: Subscription expires in 31 days if there is no activity.
Acknowledgement deadline: 57 Seconds
Subscription filter: —
Message retention duration: 7 Days
Retained acknowledged messages: No
Dead lettering: Disabled
Retry policy : Retry immediately
Very late answer, but it may still help someone else. I had the same problem and solved it like this:
Thanks to user Paramnesia1, who wrote this answer, I figured out that I was not seeing all the logs in Logs Explorer. Some default job_name query filters were preventing me from that. I am quoting & clarifying the steps to follow to be able to see all logs:
1. Open the Logs tab in the Dataflow Job UI, section Job Logs
2. Click the "View in Logs Explorer" button
3. In the new Logs Explorer screen, in your Query window, remove all the existing "logName" filters, keeping only resource.type and resource.labels.job_id
Now you will be able to see all the logs and investigate your error further. In my case, I was getting some 'Syncing Pod' errors, which were due to importing the wrong data file in my setup.py.
I think for pulling from a subscription we need to pass the with_attributes parameter as True.
with_attributes – True - output elements will be PubsubMessage objects. False -
output elements will be of type bytes (message data only).
Found a similar one here:
When using Beam IO ReadFromPubSub module, can you pull messages with attributes in Python? It's unclear if its supported
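For reference, a minimal sketch of the read with with_attributes enabled (using the subscription path from the question; the rest of the pipeline is omitted):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    messages = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/cloudflow",
        with_attributes=True,  # elements are PubsubMessage objects instead of bytes
    )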
We have our data and our network configured in the northamerica-northeast1 region.
We want to run a Dataflow job to process our input file and load it into a BigQuery table. Our storage and BigQuery are also configured in the same region, northamerica-northeast1.
However, when we run the job we get the following error:
"The workflow could not be created, since it was sent to an invalid or unreleased region. Please resubmit with a valid region."
We are passing the following arguments to our data flow job:
--region northamerica-northeast1 --zone northamerica-northeast1-a
Now, as per this doc:
https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
Dataflow does not have a regional endpoint in northamerica-northeast1.
However, we can override the zone.
Any assistance on how we can do this would be appreciated. How can we run the job in northamerica-northeast1?
You can look at this table: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#commonscenarios. For the scenario you have mentioned, the setup below has to be done:
I need worker processing to occur in a specific region that does not have a regional endpoint.
Specify both --region and --zone.
Use --region to specify the supported regional endpoint that is closest to the zone where the worker processing must occur. Use --zone to specify a zone within the desired region where worker processing must occur.
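As a sketch, assuming us-central1 stands in for "the supported regional endpoint closest to the target zone" (pick whichever endpoint actually is closest), the options could be passed like this:
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",               # placeholder
    "--region=us-central1",               # a supported regional endpoint (example only)
    "--zone=northamerica-northeast1-a",   # workers run in this zone
])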
Problem: my use case is that I want to receive messages from Google Cloud Pub/Sub one message at a time, using the Python API. All the current examples mention using the async/callback option for pulling messages from a Pub/Sub subscription. The problem with that approach is that I need to keep the thread alive.
Is it possible to just receive one message and close the connection, i.e. is there a feature where I can set a parameter (something like max_messages) to 1 so that once it receives one message the thread terminates?
The documentation here doesn't list anything for Python synchronous pull, which seems to have a num_of_messages option for other languages like Java.
See the following example in this link:
from google.cloud import pubsub_v1
client = pubsub_v1.SubscriberClient()
subscription = client.subscription_path('[PROJECT]', '[SUBSCRIPTION]')
max_messages = 1
response = client.pull(subscription, max_messages)
print(response)
I've tried it myself, and using that you get one message at a time.
If you get an error, try updating the pubsub library to the latest version:
pip install --upgrade google-cloud-pubsub
In the docs here you have more info about the pull method used in the code snippet:
The Pull method relies on a request/response model:
The application sends a request for messages. The server replies with
zero or more messages and closes the connection.
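Note that in more recent versions of google-cloud-pubsub (2.x) the call signature changed to a request dict; a sketch with placeholder project and subscription names:
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")  # placeholders

# Synchronously pull at most one message
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 1})

for received in response.received_messages:
    print(received.message.data)
    # Acknowledge so the message is not redelivered
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": [received.ack_id]}
    )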
As per the official documentation here:
...you can achieve exactly once processing of Pub/Sub message streams,
as PubsubIO de-duplicates messages based on custom message identifiers
or identifiers assigned by Pub/Sub.
So you should be able to use record IDs, i.e. identifiers for your messages, to allow for exactly-once processing across the boundary between Dataflow and other systems. To use record IDs, you invoke idLabel when constructing PubsubIO.Read or PubsubIO.Write transforms, passing a string value of your choice. In Java this would be:
public PubsubIO.Read.Bound<T> idLabel(String idLabel)
This returns a transform that's like this one but that reads unique message IDs from the given message attribute.
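In the Python SDK the analogous option is the id_label argument to ReadFromPubSub (as far as I know it is honoured only by the Dataflow runner in streaming mode); a sketch with placeholder topic and attribute names:
import apache_beam as beam

messages = (
    p  # an existing streaming Pipeline object
    | "ReadWithIds" >> beam.io.ReadFromPubSub(
        topic="projects/PROJECT_ID/topics/my-topic",  # placeholder
        id_label="my_unique_id_attribute",            # attribute carrying the record ID
    )
)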