Our Python Dataflow pipeline works locally but not when deployed using the Dataflow managed service on Google Cloud Platform. It shows no sign of being connected to the Pub/Sub subscription. We have tried subscribing to both the subscription and the topic, and neither worked. The messages accumulate in the Pub/Sub subscription, and the Dataflow pipeline shows no sign of being called at all. We have double-checked that the project is the same.
Any directions on this would be very much appreciated.
Here is the code to connect to a pull subscription:
with beam.Pipeline(options=options) as p:
    something = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/cloudflow"
    )
Here are the options used:
options = PipelineOptions()
file_processing_options = PipelineOptions().view_as(FileProcessingOptions)
if options.view_as(GoogleCloudOptions).project is None:
    print(sys.argv[0] + ": error: argument --project is required")
    sys.exit(1)
options.view_as(SetupOptions).save_main_session = True
options.view_as(StandardOptions).streaming = True
The PubSub subscription has this configuration:
Delivery type: Pull
Subscription expiration: Subscription expires in 31 days if there is no activity.
Acknowledgement deadline: 57 Seconds
Subscription filter: —
Message retention duration: 7 Days
Retained acknowledged messages: No
Dead lettering: Disabled
Retry policy: Retry immediately
Very late answer, but it may still help someone else. I had the same problem and solved it like this:
Thanks to user Paramnesia1 who wrote this answer, I figured out that I was not seeing all the logs in Logs Explorer; some default job_name query filters were hiding them. I am quoting and clarifying the steps to follow to be able to see all the logs:
Open the Logs tab in the Dataflow Job UI, section Job Logs
Click the "View in Logs Explorer" button
In the new Logs Explorer screen, in your query window, remove all the existing "logName" filters and keep only resource.type and resource.labels.job_id
Now you will be able to see all the logs and investigate your error further. In my case, I was getting some 'Syncing Pod' errors, which were due to importing the wrong data file in my setup.py.
I think that for pulling from a subscription we need to pass the with_attributes parameter as True.
with_attributes – True - output elements will be PubsubMessage objects. False -
output elements will be of type bytes (message data only).
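For reference, a minimal sketch of the read with attributes enabled, using the same placeholder subscription as above:

import apache_beam as beam

with beam.Pipeline(options=options) as p:
    messages = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/cloudflow",
        # Each output element is now a PubsubMessage (data + attributes)
        # instead of raw bytes.
        with_attributes=True,
    )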
I found a similar question here:
When using the Beam IO ReadFromPubSub module, can you pull messages with attributes in Python? It's unclear if it's supported
Related
My requirement is this: I have an Azure Function with a Service Bus topic trigger written in Python. The Service Bus has one topic with multiple subscriptions within it.
I have to add a SQL filter to a subscription so that a message I send only goes to that subscription if the filter condition is satisfied, and then triggers the Function App.
How do I add the filter option in Python code? I found multiple references for C#, but I need it for Python.
public async Task SendMessage(MyPayload payload)
{
    string messagePayload = JsonSerializer.Serialize(payload);
    ServiceBusMessage message = new ServiceBusMessage(messagePayload);
    message.ApplicationProperties.Add("goals", payload.Goals);
    try
As a sample I have added the C# code where application properties are added in the Function App code, so whichever subscription satisfies the condition goals = payload.Goals will receive the message.
I want to know how we can add the application properties in Python Azure Function App code for a Service Bus topic trigger.
Using the Python client SDK for Azure Service Bus, you can apply a SqlRuleFilter and SqlRuleAction before you start processing your messages.
The pseudocode will look like this:
servicebus_mgmt_client.create_rule(topicname,sub_name,filtername, filter, action)
send_mesgs_to_topic() #set filter in your message
receive_mesgs() #received mesg will have properties
See the detailed examples here on GitHub.
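A rough sketch of how that might look with the azure-servicebus SDK (the connection string, topic, subscription, and rule names below are placeholders, and whether you delete the $Default rule depends on your setup):

from azure.servicebus import ServiceBusClient, ServiceBusMessage
from azure.servicebus.management import ServiceBusAdministrationClient, SqlRuleFilter

CONN_STR = "<your-servicebus-connection-string>"   # placeholder
TOPIC, SUBSCRIPTION = "mytopic", "mysub"           # placeholders

# 1) Add a SQL filter rule to the subscription.
mgmt = ServiceBusAdministrationClient.from_connection_string(CONN_STR)
mgmt.create_rule(TOPIC, SUBSCRIPTION, "goals-filter", filter=SqlRuleFilter("goals = 5"))
# The built-in $Default rule matches everything; it is usually removed
# so that only filtered messages reach the subscription.
mgmt.delete_rule(TOPIC, SUBSCRIPTION, "$Default")

# 2) Send a message whose application properties satisfy the filter;
#    only matching subscriptions will receive it and trigger the function.
with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_topic_sender(topic_name=TOPIC) as sender:
        sender.send_messages(
            ServiceBusMessage("payload", application_properties={"goals": 5})
        )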
cloudwatch.CloudwatchHandler('AWS_KEY_ID','AWS_SECRET_KEY','AWS_REGION','AWS_LOG_GROUP','AWS_LOG_STREAM')
I am new to AWS CloudWatch and I am trying to use the cloudwatch lightweight handler in my Python project. I have all the values required for .CloudwatchHandler() except AWS_LOG_STREAM. I don't understand what AWS_LOG_STREAM is or where I can find its value in the AWS console. I googled it and found "A log stream is a sequence of log events that share the same source," but what does "same source" mean? And what value should I use for AWS_LOG_STREAM?
I need some guidance, and thank you in advance.
As Mohit said, the log stream is a subdivision of the log group, usually used to identify the original execution source (the time and ID of the container, Lambda, or process is common).
In the latest version you can skip naming the log stream which will give it a timestamp log stream name:
handler = cloudwatch.CloudwatchHandler(log_group = 'my_log_group')
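For context, a minimal usage sketch, assuming credentials are picked up from the usual AWS environment and using a placeholder group name (the handler plugs into the standard logging module):

import logging
import cloudwatch

logger = logging.getLogger("my_logger")
# 'my_log_group' is a placeholder; the stream name is omitted so the
# package picks a timestamped one, as described above.
handler = cloudwatch.CloudwatchHandler(log_group="my_log_group")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info("Hello CloudWatch")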
Disclaimer: I am a contributor to the cloudwatch package
AWS_LOG_STREAM is basically the log group's events divided up by execution; by specifying a stream you get the logs for a specific execution rather than everything since inception.
For example, in the case of AWS Lambda, you can check the function's current log stream with:
LOG_GROUP=log-group
aws logs get-log-events --log-group-name $LOG_GROUP --log-stream-name $(aws logs describe-log-streams --log-group-name $LOG_GROUP --max-items 1 --order-by LastEventTime --descending --query logStreams[].logStreamName --output text | head -n 1) --query events[].message --output text
Otherwise, in Python, you can use boto3 to fetch the existing log streams and then call the cloudwatch handler with the respective stream name:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.describe_log_streams
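A hedged sketch of that approach, reusing the positional arguments from the snippet above (all the AWS_* values are placeholders):

import boto3
import cloudwatch

logs = boto3.client("logs", region_name="AWS_REGION")
# Grab the most recently active stream in the group.
resp = logs.describe_log_streams(
    logGroupName="AWS_LOG_GROUP",
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)
latest_stream = resp["logStreams"][0]["logStreamName"]

# Pass it as the AWS_LOG_STREAM argument of the handler.
handler = cloudwatch.CloudwatchHandler(
    "AWS_KEY_ID", "AWS_SECRET_KEY", "AWS_REGION", "AWS_LOG_GROUP", latest_stream
)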
This is more of a design question on what to use within Google Cloud's infrastructure to obtain the results from a Python script.
Take the following scenario: we have over 60 projects and one central project for Stackdriver logging and the like.
It is from this central project I want to run a Python script (using Cloud Scheduler which then triggers the Cloud Function) to obtain a list of disks that haven't had their snapshot taken in the past 24 hours, those that aren't assigned to a snapshot schedule, and the snapshot schedules that have names that do not match our naming convention. I have the script already prepared, and it works very well from my workstation (producing a list of dictionaries of the desired results per project).
However, my question is: where should I send the results to? And how could I then have an email sent out to the appropriate people to action it?
I've played about with sending the object attributes to Pub/Sub within the central project, but this requires me to manually pull the messages, and I can't see any way of scheduling the pull request. I also don't see an option of sending out an email from Pub/Sub to an email address, and so the only option seems to be to create an email Cloud Function which is then triggered whenever one of the Subscriptions receives a new message from the first Cloud Function containing the original script.
I suppose I could simply set this up on one of our Windows VM instances and convert the script to PowerShell, but I was rather hoping to keep it out of a VM if at all possible.
Has anyone done this before? And if so, what did you use to get the desired results?
I think you can use the SendGrid API to send emails from your Cloud Function. It's very easy to set up, it has a free plan which includes 12,000 emails per month, and it has an API for Python :D.
You can sign up using the Google Cloud Marketplace by selecting the free plan.
Then create an API key for your code here. If you only need to send mail, I suggest selecting the Restricted Access option and, for Mail Send, giving Full Access or whatever level you think will work for you.
Here's a code snippet for you:
import logging

from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail, Email
from python_http_client.exceptions import HTTPError


def send_mail(request):
    log = logging.getLogger(__name__)

    SENDGRID_API_KEY = 'SG.blahblahblah'
    sg = SendGridAPIClient(SENDGRID_API_KEY)

    """
    Maybe here goes the code you use to check what you need
    """

    APP_NAME = "Testing"

    html_content = f"""
    Here goes your mail body in HTML format
    """

    message = Mail(
        to_emails="dest@a.domain.com",
        from_email=Email('sender@your.domain.com', "Your name or your app name"),
        subject="Warning!!!!",
        html_content=html_content
    )

    try:
        response = sg.send(message)
        log.info(f"email.status_code={response.status_code}")
        return 'Your mail was sent!'
    except HTTPError as e:
        log.error(e)
And don't forget to add the sendgrid lib to your requirements.txt file:
# Function dependencies, for example:
# package>=version
sendgrid
Hope this can help you.
Problem: my use case is that I want to receive messages from Google Cloud Pub/Sub one message at a time using the Python API. All the current examples mention using the async/callback option for pulling messages from a Pub/Sub subscription. The problem with that approach is that I need to keep the thread alive.
Is it possible to just receive one message and close the connection, i.e. is there a feature where I can set a parameter (something like max_messages) to 1 so that once it receives one message the thread terminates?
The documentation here doesn't list anything for Python synchronous pull, which seems to have a num_of_messages option in other languages like Java.
See the following example in this link:
from google.cloud import pubsub_v1
client = pubsub_v1.SubscriberClient()
subscription = client.subscription_path('[PROJECT]', '[SUBSCRIPTION]')
max_messages = 1
response = client.pull(subscription, max_messages)
print(response)
I've tried it myself, and using that you get one message at a time.
If you get an error, try updating the pubsub library to the latest version:
pip install --upgrade google-cloud-pubsub
The docs here have more info about the pull method used in the code snippet:
The Pull method relies on a request/response model:
The application sends a request for messages. The server replies with
zero or more messages and closes the connection.
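One caveat in case the upgrade changes behaviour for you: in google-cloud-pubsub 2.x the synchronous pull call takes a request dict/object instead of positional arguments. A hedged sketch of the equivalent call (project and subscription IDs are placeholders):

from google.cloud import pubsub_v1

client = pubsub_v1.SubscriberClient()
subscription = client.subscription_path("PROJECT_ID", "SUBSCRIPTION_ID")

# Pull at most one message, then acknowledge it so it is not redelivered.
response = client.pull(request={"subscription": subscription, "max_messages": 1})
for received in response.received_messages:
    print(received.message.data)
    client.acknowledge(
        request={"subscription": subscription, "ack_ids": [received.ack_id]}
    )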
As per the official documentation here:
...you can achieve exactly once processing of Pub/Sub message streams,
as PubsubIO de-duplicates messages based on custom message identifiers
or identifiers assigned by Pub/Sub.
So you should be able to use record IDs, i.e. identifiers for your messages, to allow for exactly-once processing across the boundary between Dataflow and other systems. To use record IDs, you invoke idLabel when constructing PubsubIO.Read or PubsubIO.Write transforms, passing a string value of your choice. In Java this would be:
public PubsubIO.Read.Bound<T> idLabel(String idLabel)
This returns a transform that's like this one but that reads unique message IDs from the given message attribute.
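In the Beam Python SDK the analogous option is the id_label argument on ReadFromPubSub. A hedged sketch (the subscription and attribute name are placeholders; as far as I know this de-duplication is only honoured on the Dataflow runner, not the DirectRunner):

import apache_beam as beam

with beam.Pipeline(options=options) as p:
    messages = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/SUBSCRIPTION_ID",
        # De-duplicate on a custom attribute carrying your unique record ID.
        id_label="record_id",
    )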
I'm writing a bit of Python using the Google Cloud API to translate some text.
I have set up billing on my account and it says it's active (with some credit added for the free trial). I created an application_default_credentials.json file with -
gcloud auth application-default login
Which asked me to log in to my account (I logged into the same account I set billing up on).
I then used -
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/theo/.config/gcloud/application_default_credentials.json"
at the start of my python script. For the coding I followed these samples here - https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/translate/cloud-client
Yesterday the API wouldn't work, and I would receive "daily limit exceeded" even though I had not used it yet. Eventually I gave up and decided to sleep on it.
I tried again today and it was working, without my having to do anything. Great, I thought, it must just have taken a while to update my billing information.
But I've since translated a few things, maybe 10,000 characters, and I'm already receiving the same error message.
I did create a "Project" in the Cloud Console and have an API key from there. I'm not entirely sure how to use it, because the documentation I linked above just uses the JSON credentials file. From what I've read online, using the JSON file is now recommended over using a key.
Any ideas about what I need to do?
Thanks.
Solved it by creating a service account key at https://console.cloud.google.com/apis/credentials/serviceaccountkey instead of the credentials created with the gcloud auth command.
After I referenced the JSON file generated from that page, it started working.
More info here - https://cloud.google.com/docs/authentication/getting-started
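For anyone following along, a small sketch of the end result, assuming the translate_v2 client used by the linked samples (the key path is a placeholder):

import os
from google.cloud import translate_v2 as translate

# Point the client libraries at the service account key downloaded from the
# "Create service account key" page (path is a placeholder).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

client = translate.Client()
result = client.translate("Bonjour le monde", target_language="en")
print(result["translatedText"])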