Unable to run Dataflow job due to region + zone limitation - python

We have our data and our network configured in northamerica-northeast region.
We want to run the data flow job to process our input file and load in BigQuery table. Our storage and BigQuery is also configured in same region northamerica-northeast1.
However, when we run the job, we get the following error:
"The workflow could not be created, since it was sent to an invalid or unreleased region. Please resubmit with a valid region."
We are passing the following arguments to our data flow job:
--region northamerica-northeast1 --zone northamerica-northeast1-a
As per the documentation below, Dataflow does not have a regional endpoint in northamerica-northeast1, but the worker zone can be overridden:
https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
Any assistance on how to do this would be appreciated. How can we run the job in northamerica-northeast1?

You can look at this table: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#commonscenarios. For the scenario you have described ("I need worker processing to occur in a specific region that does not have a regional endpoint"), the setup is:
Specify both --region and --zone.
Use --region to specify the supported regional endpoint that is closest to the zone where the worker processing must occur. Use --zone to specify a zone within the desired region where worker processing must occur.
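For example, a minimal sketch of the pipeline options, assuming us-central1 is the closest supported regional endpoint and that the project ID and temp bucket shown here are placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: my-project and gs://my-bucket/temp are assumptions; us-central1
# stands in for whichever supported regional endpoint is closest to your zone.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",               # regional endpoint used by the Dataflow service
    zone="northamerica-northeast1-a",   # zone where the workers actually run (--worker_zone on newer SDKs)
    temp_location="gs://my-bucket/temp",
)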

Related

How to access Amazon EMR error message with Python

I am running EMR clusters kicked off from Airflow and I need some way of passing error messages back to Airflow. Airflow runs in Python, so I need this to be done in Python.
Currently the error logs are in the "Log URI" section under the cluster's configuration details. Accessing these might be one way to do it, but any way to access the error logs from EMR with Python would be much appreciated.
You can access the EMR logs in S3 with boto3 for example.
The S3 path would be:
stderr : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/stderr.gz
stdout : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/stdout.gz
controller : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/controller.gz
syslog : s3://<EMR_LOG_BUCKET_DEFINED_IN_EMR_CONFIGURATION>/logs/<CLUSTER_ID>/steps/<STEP_ID>/syslog.gz
Cluster ID and Step ID can be passed to your different tasks via XCOM from the task(s) that creates the cluster/steps.
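For example, a minimal sketch of fetching a step's stderr with boto3, where the bucket name and the cluster/step IDs below are placeholders:

import gzip
import boto3

# Placeholders: use the log bucket from your EMR configuration and the
# cluster/step IDs passed via XCom.
LOG_BUCKET = "my-emr-log-bucket"
CLUSTER_ID = "j-ABC123DEF456"
STEP_ID = "s-XYZ789ABC012"

s3 = boto3.client("s3")
key = f"logs/{CLUSTER_ID}/steps/{STEP_ID}/stderr.gz"

# Download the gzipped log object and decode it to text.
obj = s3.get_object(Bucket=LOG_BUCKET, Key=key)
stderr_text = gzip.decompress(obj["Body"].read()).decode("utf-8")
print(stderr_text)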
Warning for Spark (might be applicable to other types of steps):
This works if you submit your steps in client mode; if you are using cluster mode, you would need to change the path to fetch the application logs of the driver instead.

What is "AWS_LOG_STREAM" in amazon-CloudWatch?

cloudwatch.CloudwatchHandler('AWS_KEY_ID','AWS_SECRET_KEY','AWS_REGION','AWS_LOG_GROUP','AWS_LOG_STREAM')
I am new to AWS CloudWatch and I am trying to use the lightweight cloudwatch handler in my Python project. I have all the values required for .CloudwatchHandler() except AWS_LOG_STREAM. I don't understand what AWS_LOG_STREAM is or where I can find that value in the AWS console. I found "A log stream is a sequence of log events that share the same source," but what does "same source" mean? And what is the value for AWS_LOG_STREAM?
I need support and thank you in advance.
As Mohit said, the log stream is a subdivision of the log group, usually used to identify the original execution source (the time and ID of the container, Lambda, or process is common).
In the latest version you can skip naming the log stream, which will give it a timestamped log stream name:
handler = cloudwatch.CloudwatchHandler(log_group = 'my_log_group')
Disclaimer: I am a contributor to the cloudwatch package
AWS_LOG_STREAM is basically the log group's events divided up by execution time. By specifying a stream you get logs for a specific time duration rather than since inception.
Example: in the case of AWS Lambda, you can check its current log stream with:
LOG_GROUP=log-group
aws logs get-log-events --log-group-name $LOG_GROUP \
  --log-stream-name "$(aws logs describe-log-streams --log-group-name $LOG_GROUP \
    --max-items 1 --order-by LastEventTime --descending \
    --query 'logStreams[].logStreamName' --output text | head -n 1)" \
  --query 'events[].message' --output text
Alternatively, in Python you can use boto3 to fetch the existing log streams and then call the cloudwatch handler with the respective stream name:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.describe_log_streams
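For instance, a small sketch using describe_log_streams, where the region and log group name are placeholders:

import boto3

# Placeholders: adjust the region and log group name to your setup.
logs = boto3.client("logs", region_name="us-east-1")

# Fetch the most recently active stream in the log group.
response = logs.describe_log_streams(
    logGroupName="my_log_group",
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)
latest_stream = response["logStreams"][0]["logStreamName"]
print(latest_stream)  # pass this as the log stream name to CloudwatchHandler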

Dataflow not reading PubSub messages when running in Dataflow Managed Service

Our Python Dataflow pipeline works locally but not when deployed using the Dataflow managed service on Google Cloud Platform. It doesn't show signs that it is connected to the PubSub subscription. We have tried subscribing to both the subscription and the topic; neither of them worked. The messages accumulate in the PubSub subscription and the Dataflow pipeline doesn't show signs of being called or anything. We have double-checked that the project is the same.
Any directions on this would be very much appreciated.
Here is the code to connect to a pull subscription:
with beam.Pipeline(options=options) as p:
    something = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/cloudflow"
    )
Here are the options used:
options = PipelineOptions()
file_processing_options = PipelineOptions().view_as(FileProcessingOptions)
if options.view_as(GoogleCloudOptions).project is None:
    print(sys.argv[0] + ": error: argument --project is required")
    sys.exit(1)
options.view_as(SetupOptions).save_main_session = True
options.view_as(StandardOptions).streaming = True
The PubSub subscription has this configuration:
Delivery type: Pull
Subscription expiration: Subscription expires in 31 days if there is no activity.
Acknowledgement deadline: 57 Seconds
Subscription filter: —
Message retention duration: 7 Days
Retained acknowledged messages: No
Dead lettering: Disabled
Retry policy : Retry immediately
Very late answer, but it may still help someone else. I had the same problem and solved it like this:
Thanks to user Paramnesia1 who wrote this answer, I figured out that I was not seeing all the logs in Logs Explorer; some default job_name query filters were hiding them. I am quoting and clarifying the steps to follow to be able to see all logs:
Open the Logs tab in the Dataflow Job UI, section Job Logs
Click the "View in Logs Explorer" button
In the new Logs Explorer screen, in your Query window, remove all the existing "logName" filters, keep only resource.type and resource.labels.job_id
Now you will be able to see all the logs and investigate your error further. In my case, I was getting some 'Syncing Pod' errors, which were due to importing the wrong data file in my setup.py.
I think that for pulling from a subscription we need to pass the with_attributes parameter as True.
with_attributes – True: output elements will be PubsubMessage objects. False: output elements will be of type bytes (message data only).
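A minimal sketch of that change, keeping the PROJECT_ID and subscription name from the question as placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    messages = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/cloudflow",
        with_attributes=True,  # elements are PubsubMessage objects with .data and .attributes
    )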
Found a similar one here:
When using Beam IO ReadFromPubSub module, can you pull messages with attributes in Python? It's unclear if its supported

Discovering peer instances in Azure Virtual Machine Scale Set

Problem: Given N instances launched as part of a VMSS, I would like my application code on each Azure instance to discover the IP addresses of the other peer instances. How do I do this?
The overall intent is to cluster the instances so as to provide active-passive HA or keep the configuration in sync.
It seems like there is some support for REST API based querying: https://learn.microsoft.com/en-us/rest/api/virtualmachinescalesets/
I would like to know any other way to do it, i.e. either the Python SDK or the instance metadata URL, etc.
The REST API you mentioned has a Python SDK, the "azure-mgmt-compute" client:
https://learn.microsoft.com/python/api/azure.mgmt.compute.compute.computemanagementclient
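For example, a sketch with the current azure-mgmt-compute and azure-identity packages (the subscription ID, resource group, and scale set name are placeholders); the per-instance IP addresses can then be looked up through the corresponding network interfaces:

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholders: fill in your own subscription, resource group and VMSS name.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-resource-group"
VMSS_NAME = "my-vmss"

compute_client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Enumerate the instances that are currently part of the scale set.
for vm in compute_client.virtual_machine_scale_set_vms.list(RESOURCE_GROUP, VMSS_NAME):
    print(vm.instance_id, vm.name)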
One way to do this would be to use instance metadata. Right now instance metadata only shows information about the VM it's running on, e.g.
curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute?api-version=2017-03-01"
{"compute":
{"location":"westcentralus","name":"imdsvmss_0","offer":"UbuntuServer","osType":"Linux","platformFaultDomain":"0","platformUpdateDomain":"0",
"publisher":"Canonical","sku":"16.04-LTS","version":"16.04.201703300","vmId":"e850e4fa-0fcf-423b-9aed-6095228c0bfc","vmSize":"Standard_D1_V2"},
"network":{"interface":[{"ipv4":{"ipaddress":[{"ipaddress":"10.0.0.4","publicip":"52.161.25.104"}],"subnet":[{"address":"10.0.0.0","dnsservers":[],"prefix":"24"}]},
"ipv6":{"ipaddress":[]},"mac":"000D3AF8BECE"}]}}
You could do something like have each VM send this info to a listener on VM #0 or to an external service, or you could combine this with Azure Files and have each VM write to a common share. There's an Azure template proof of concept which outputs information from each VM to an Azure File share: https://github.com/Azure/azure-quickstart-templates/tree/master/201-vmss-azure-files-linux - every VM has a mountpoint which contains info written by every VM.

Using EC2 with auto scaling group for a batch image-processing application on AWS

I am new to AWS and am trying to port a python-based image processing application to the cloud. Our application scenario is similar to the Batch Processing scenario described here
[media.amazonwebservices.com/architecturecenter/AWS_ac_ra_batch_03.pdf]
Specifically, the steps involved are:
1. Receive a large number of images (>1000) and one CSV file containing image metadata
2. Parse the CSV file and create a database (using DynamoDB).
3. Push images to the cloud (using S3), and push messages of the form (bucketname, keyname) to an input queue (using SQS).
4. "Pop" messages from the input queue
5. Fetch the appropriate image data from S3, and metadata from DynamoDB.
6. Do the processing
7. Update the corresponding entry for that image in DynamoDB
8. Save results to S3
9. Save a message in the output queue (SQS) which feeds the next part of the pipeline.
Steps 4-9 would involve the use of EC2 instances.
From the boto documentation and tutorials online, I have understood how to incorporate S3, SQS, and DynamoDB into the pipeline. However, I am unclear on how exactly to proceed with the EC2 part. I tried looking at some example implementations online, but couldn't figure out what the EC2 machine should do to make our batch image-processing application work. The two approaches I have come across are:
Use a BOOTSTRAP_SCRIPT with an infinite loop that constantly polls the input queue and processes messages if available. This is what I think is being done in the Django-PDF example on the AWS blog:
http://aws.amazon.com/articles/Python/3998
Use boto.services to take care of all the details of reading messages, retrieving and storing files in S3, writing messages, etc. This is what is used in the Monster Muck Mash-Up example:
http://aws.amazon.com/articles/Python/691
Which of the above methods is preferred for batch-processing applications, or is there a better way? Also, for each of the above, how do I incorporate an Auto Scaling group to manage the EC2 machines based on the load in the input queue?
Any help in this regard would be really appreciated. Thank you.
You should write an application (using Python and boto, for example) that will do the SQS polling and interact with S3 and DynamoDB.
This application must be installed at boot time on the EC2 instance. Several options are available (CloudFormation, Chef, cloud-init and user data, or a custom AMI), but I would suggest you start with user data as described here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
You must also ensure your instances have the proper privileges to talk to S3, SQS, and DynamoDB. Create IAM permissions for this, then attach the permissions to a role and the role to your instances. The detailed procedure is available in the doc at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
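As a rough sketch of such a worker loop with boto3 (the queue URL, region, and message format are assumptions based on the steps above, not a definitive implementation):

import boto3

# Placeholders: the instance's IAM role provides credentials; replace the
# queue URL and region with your own.
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/input-queue"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling keeps idle API calls low
    )
    for msg in resp.get("Messages", []):
        # Assumed message format "bucketname,keyname" from step 3 above.
        bucketname, keyname = msg["Body"].split(",")
        # ... fetch the image from S3, process it, update DynamoDB, save results ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])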
