I am running EMR clusters kicked off with Airflow and I need some way of passing error messages back to Airflow. Airflow runs in Python, so I need this to be done in Python.
Currently the error logs are in the "Log URI" section under the configuration details. Accessing these might be one way to do it, but any way to access the error logs from EMR with Python would be much appreciated.
You can access the EMR logs in S3 with boto3, for example.
The S3 paths would be:
stderr : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/stderr.gz
stdout : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/stdout.gz
controller : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/controller.gz
syslog : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/syslog.gz
The cluster ID and step ID can be passed to your other tasks via XCom from the task(s) that create the cluster/steps.
Warning for Spark (might be applicable to other types of steps):
This works if you submit your steps in client mode; if you are using cluster mode, you would need to change the path to fetch the application logs of the driver instead.
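For example, here is a minimal sketch of fetching a step's stderr log with boto3 (the bucket name, cluster ID and step ID below are placeholders; in Airflow they would typically come from XCom):

import gzip
import io

import boto3

# Placeholders: in Airflow these would usually be pulled from XCom
log_bucket = "my-emr-log-bucket"
cluster_id = "j-XXXXXXXXXXXXX"
step_id = "s-XXXXXXXXXXXXX"

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=log_bucket, Key=f"logs/{cluster_id}/steps/{step_id}/stderr.gz")

# Step logs are gzip-compressed, so decompress before decoding
stderr_text = gzip.GzipFile(fileobj=io.BytesIO(obj["Body"].read())).read().decode("utf-8")
print(stderr_text)

Note that EMR pushes step logs to S3 periodically, so the object may not appear immediately after the step finishes.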
I've been following this tutorial which lets me connect to Databricks from Python and then run delta table queries. However, I've stumbled upon a problem. When I run it for the FIRST time, I get the following error:
Container container-name in account storage-account.blob.core.windows.net not found, and we can't create it using anonymous credentials, and no credentials found for them in the configuration.
When I go back to my Databricks cluster and run this code snippet
from pyspark import SparkContext

spark_context = SparkContext.getOrCreate()

if StorageAccountName is not None and StorageAccountAccessKey is not None:
    print('Configuring the spark context...')
    spark_context._jsc.hadoopConfiguration().set(
        f"fs.azure.account.key.{StorageAccountName}.blob.core.windows.net",
        StorageAccountAccessKey)
(where StorageAccountName and StorageAccountAccessKey are known) and then run my Python app once again, it runs successfully without throwing the previous error. I'd like to ask: is there a way to run this code snippet from my Python app and at the same time have it reflected on my Databricks cluster?
You just need to add these configuration options to the cluster itself, as described in the docs. You need to set the following Spark property, the same one you set in your code:
fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
For security, it's better to put the access key into a secret scope and reference it from the Spark configuration (see the docs).
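For example, the cluster's Spark config entry might look like this when referencing a secret (the scope and key names here are placeholders, assuming the secret-reference syntax from the Databricks docs):

fs.azure.account.key.<storage-account-name>.blob.core.windows.net {{secrets/<scope-name>/<secret-key-name>}}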
cloudwatch.CloudwatchHandler('AWS_KEY_ID','AWS_SECRET_KEY','AWS_REGION','AWS_LOG_GROUP','AWS_LOG_STREAM')
I am new to AWS CloudWatch and I am trying to use the lightweight cloudwatch handler in my Python project. I have all the values required for .CloudwatchHandler() except AWS_LOG_STREAM. I don't understand what AWS_LOG_STREAM is or where I can find that value in the AWS console. I googled it and found "A log stream is a sequence of log events that share the same source," but what does "same source" mean? And what value should I use for AWS_LOG_STREAM?
Any help would be appreciated; thank you in advance.
As Mohit said, the log stream is a subdivision of the log group, usually used to identify the original execution source (the time and ID of the container, Lambda, or process is common).
In the latest version you can skip naming the log stream, which will give it a timestamped log stream name:
handler = cloudwatch.CloudwatchHandler(log_group = 'my_log_group')
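For example, a minimal sketch of attaching the handler to a standard Python logger (this assumes the package is imported as from cloudwatch import cloudwatch, and that AWS credentials come from the environment or an instance role):

import logging

from cloudwatch import cloudwatch

# No log stream name given, so the handler generates a timestamped one
handler = cloudwatch.CloudwatchHandler(log_group='my_log_group')

logger = logging.getLogger('my_app')
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info('Hello from CloudWatch')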
Disclaimer: I am a contributor to the cloudwatch package
AWS_LOG_STREAM is basically the log group's events divided up by execution time. By specifying a stream, you get logs for a specific time window rather than everything since inception.
Example: in the case of AWS Lambda, you can check its current log stream with:
LOG_GROUP=log-group
aws logs get-log-events --log-group-name $LOG_GROUP --log-stream-name $(aws logs describe-log-streams --log-group-name $LOG_GROUP --max-items 1 --order-by LastEventTime --descending --query 'logStreams[].logStreamName' --output text | head -n 1) --query 'events[].message' --output text
Otherwise, in Python, you can use boto3 to fetch the existing log streams and then call the cloudwatch handler with the respective stream name:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.describe_log_streams
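For example, a minimal sketch with boto3 that fetches the most recently active stream in a group (the group name and region are placeholders):

import boto3

logs = boto3.client('logs', region_name='us-east-1')  # placeholder region

# Most recently active log stream in the group
response = logs.describe_log_streams(
    logGroupName='my_log_group',  # placeholder group name
    orderBy='LastEventTime',
    descending=True,
    limit=1,
)
latest_stream = response['logStreams'][0]['logStreamName']
print(latest_stream)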
We have our data and our network configured in the northamerica-northeast1 region.
We want to run a Dataflow job to process our input file and load it into a BigQuery table. Our storage and BigQuery are also configured in the same region, northamerica-northeast1.
However, when we run the job, we get the following error:
The workflow could not be created, since it was sent to an invalid or unreleased region. Please resubmit with a valid region.
We are passing the following arguments to our data flow job:
--region northamerica-northeast1 --zone northamerica-northeast1-a
Now, as per the doc below:
https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
Dataflow does not have a regional endpoint in northamerica-northeast1.
However, we can override the zone.
Any assistance on how we can do this would be appreciated.
How can we run the job in northamerica-northeast1?
You can look at this table: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#commonscenarios. For the scenario you have mentioned, the following setup has to be done:
I need worker processing to occur in a specific region that does not have a regional endpoint.
Specify both --region and --zone.
Use --region to specify the supported regional endpoint that is closest to the zone where the worker processing must occur. Use --zone to specify a zone within the desired region where worker processing must occur.
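For example, a minimal sketch of the pipeline options for this scenario, assuming a Python pipeline and that us-central1 is the supported regional endpoint you pick (the project and bucket names are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                 # placeholder
    temp_location='gs://my-bucket/temp',  # placeholder
    region='us-central1',                 # a supported Dataflow regional endpoint
    zone='northamerica-northeast1-a',     # workers actually run in this zone
)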
Requirement: Delete DMS Task, DMS Endpoints and Replication Instance.
Use: Boto3 Python script in Lambda
My Approach:
1. Delete the database migration task first, as the endpoint and replication instance can't be deleted before it.
2. Delete Endpoints
3. Delete Replication Instance
Issue: when I run these 3 delete commands, I get the following error:
"errorMessage": "An error occurred (InvalidResourceStateFault) when calling the DeleteEndpoint operation: Endpoint arn:aws:dms:us-east-1:XXXXXXXXXXXXXX:endpoint:XXXXXXXXXXXXXXXXXXXXXX is part of one or more ReplicationTasks."
I know that the database migration task takes some time to delete, so until then the endpoint is still associated with the task and can't be deleted.
There is an AWS CLI wait command to check whether the task has been deleted: replication-task-deleted.
I could run this in a shell and wait (sleep) until I get the final status, and then execute the delete-endpoint script.
However, there is no equivalent command in the Boto3 DMS docs.
Is there any other Boto3 command I can use to check the status and make my Python script sleep until then?
Please let me know if I can approach the issue in a different way.
You need to use waiters. In your case, it is Waiter.ReplicationTaskDeleted, as sketched below.
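For example, a minimal sketch of the delete sequence with the waiter in between (the ARNs are placeholders):

import boto3

dms = boto3.client('dms')

# Placeholders for the ARNs of your task, endpoints and replication instance
task_arn = 'arn:aws:dms:...'
source_endpoint_arn = 'arn:aws:dms:...'
target_endpoint_arn = 'arn:aws:dms:...'
replication_instance_arn = 'arn:aws:dms:...'

# Delete the task, then block until it is really gone
dms.delete_replication_task(ReplicationTaskArn=task_arn)
waiter = dms.get_waiter('replication_task_deleted')
waiter.wait(Filters=[{'Name': 'replication-task-arn', 'Values': [task_arn]}])

# Now the endpoints and the replication instance can be deleted
dms.delete_endpoint(EndpointArn=source_endpoint_arn)
dms.delete_endpoint(EndpointArn=target_endpoint_arn)
dms.delete_replication_instance(ReplicationInstanceArn=replication_instance_arn)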
Problem: given N instances launched as part of a VMSS, I would like my application code on each Azure instance to discover the IP addresses of the other peer instances. How do I do this?
The overall intent is to cluster the instances, so as to provide active-passive HA or keep the configuration in sync.
It seems like there is some support for REST-API-based querying: https://learn.microsoft.com/en-us/rest/api/virtualmachinescalesets/
I would like to know of any other way to do it, i.e. either a Python SDK or the instance metadata URL, etc.
The REST API you mentioned has a Python SDK, the "azure-mgmt-compute" client:
https://learn.microsoft.com/python/api/azure.mgmt.compute.compute.computemanagementclient
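If the goal is specifically the peer IP addresses, here is a sketch using the companion azure-mgmt-network client (together with azure-identity for credentials; the names below are placeholders and this assumes a recent version of the SDK):

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# Placeholders for your subscription, resource group and scale set names
subscription_id = '<subscription-id>'
resource_group = '<resource-group>'
vmss_name = '<scale-set-name>'

client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

# List every NIC attached to the scale set and print each instance's private IP
nics = client.network_interfaces.list_virtual_machine_scale_set_network_interfaces(
    resource_group, vmss_name
)
for nic in nics:
    for ip_config in nic.ip_configurations:
        print(ip_config.private_ip_address)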
One way to do this would be to use instance metadata. Right now instance metadata only shows information about the VM it's running on, e.g.
curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute?api-version=2017-03-01"
{"compute":
{"location":"westcentralus","name":"imdsvmss_0","offer":"UbuntuServer","osType":"Linux","platformFaultDomain":"0","platformUpdateDomain":"0",
"publisher":"Canonical","sku":"16.04-LTS","version":"16.04.201703300","vmId":"e850e4fa-0fcf-423b-9aed-6095228c0bfc","vmSize":"Standard_D1_V2"},
"network":{"interface":[{"ipv4":{"ipaddress":[{"ipaddress":"10.0.0.4","publicip":"52.161.25.104"}],"subnet":[{"address":"10.0.0.0","dnsservers":[],"prefix":"24"}]},
"ipv6":{"ipaddress":[]},"mac":"000D3AF8BECE"}]}}
You could do something like have each VM send the info to a listener on VM#0, or to an external service, or you could combine this with Azure Files and have each VM output to a common share. There's an Azure template proof of concept here which outputs information from each VM to an Azure File share: https://github.com/Azure/azure-quickstart-templates/tree/master/201-vmss-azure-files-linux - every VM has a mountpoint which contains info written by every VM.
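And here is a minimal Python sketch of the "each VM publishes its own address" idea: query the metadata service from inside the instance and extract the private IP (field names follow the 2017-03-01 sample output above), which the VM could then write to whatever shared location you choose:

import json
import urllib.request

# The Instance Metadata Service is only reachable from inside the VM
request = urllib.request.Request(
    'http://169.254.169.254/metadata/instance?api-version=2017-03-01&format=json',
    headers={'Metadata': 'true'},
)
with urllib.request.urlopen(request) as response:
    metadata = json.loads(response.read().decode())

# Field names as in the sample output above
private_ip = metadata['network']['interface'][0]['ipv4']['ipaddress'][0]['ipaddress']
print(private_ip)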