How to disable Airflow DAGs with AWS Lambda - python

I need to disable Airflow DAGs from AWS Lambda or some other way. Can I use Python code to do this? Thank you in advance.

You can pause/unpause a DAG with the Airflow REST API.
The relevant endpoint is "Update a DAG":
PATCH {airflow_host}/api/v1/dags/{dag_id}
with the request body:
{
  "is_paused": true
}
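Since you specifically want to do this from AWS Lambda, here is a minimal sketch of a handler that calls that endpoint with the requests library (requests is not in the Lambda runtime by default, so bundle it in the deployment package or a layer; the host, credentials, and dag_id below are placeholders you would supply yourself, for example via environment variables):

import os
import requests  # not in the Lambda runtime by default; bundle it or use a layer

def lambda_handler(event, context):
    # Placeholder configuration; in practice read it from environment variables or Secrets Manager.
    airflow_host = os.environ["AIRFLOW_HOST"]  # e.g. "https://my-airflow.example.com"
    auth = (os.environ["AIRFLOW_USER"], os.environ["AIRFLOW_PASSWORD"])
    dag_id = event.get("dag_id", "my_dag")

    # PATCH /api/v1/dags/{dag_id} with {"is_paused": true} pauses (disables) the DAG.
    resp = requests.patch(
        f"{airflow_host}/api/v1/dags/{dag_id}",
        json={"is_paused": True},
        auth=auth,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()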
You also have the official Airflow Python client that you can use to interact with the API. Example:
import airflow_client.client as client
from airflow_client.client.api import dag_api
from airflow_client.client.model.dag import DAG
from pprint import pprint

# Configure the host and HTTP basic authentication in a single Configuration object
configuration = client.Configuration(
    host="http://localhost/api/v1",
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
)

with client.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = dag_api.DAGApi(api_client)
    dag_id = "dag_id_example"  # str | The DAG ID.
    dag = DAG(
        is_paused=True,
    )
    try:
        # Update a DAG (pause it)
        api_response = api_instance.patch_dag(dag_id, dag)
        pprint(api_response)
    except client.ApiException as e:
        print("Exception when calling DAGApi->patch_dag: %s\n" % e)
You can see the full example in the client doc.

Related

ParamValidationError: Parameter validation failed: Bucket name must match the regex

I'm trying to run a Glue job by calling it from a Lambda function. The Glue job itself runs perfectly fine, but when I trigger it from the Lambda function, I get the error below:
[ERROR] ParamValidationError: Parameter validation failed: Bucket name must match the regex \"^[a-zA-Z0-9.\\-_]{1,255}$\" or be an ARN matching the regex \"^arn:(aws).*:(s3|s3-object-lambda):[a-z\\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\\-]{1,63}$\""
There is no issue with my bucket name, since I can perform other actions with it, and the Glue job works fine when run standalone.
Any help would be appreciated.
Thanks in advance.
Maybe you are including the s3:// protocol when specifying the bucket name; it is not required.
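For illustration, a minimal sketch of what boto3 expects (the bucket name and the call are hypothetical):

import boto3

s3 = boto3.client("s3")

# Passing a URI where a bare bucket name is expected triggers the
# "Bucket name must match the regex ..." validation error:
# s3.list_objects_v2(Bucket="s3://my-bucket")  # ParamValidationError

# The Bucket parameter takes only the bucket name:
s3.list_objects_v2(Bucket="my-bucket")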
I was able to solve it by making a few changes.
My initial code was:
import json
import os
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

import boto3
client = boto3.client('glue')

glueJobName = "MyTestJob"

def lambda_handler(event, context):
    logger.info('## INITIATED BY EVENT: ')
    logger.info(event['detail'])
    response = client.start_job_run(JobName=glueJobName)
    logger.info('## STARTED GLUE JOB: ' + glueJobName)
    logger.info('## GLUE JOB RUN ID: ' + response['JobRunId'])
    return response
Once I removed the logging part (code below), it worked without any error:
from __future__ import print_function
import boto3
import urllib

print('Loading function')
glue = boto3.client('glue')

def lambda_handler(event, context):
    gluejobname = "MyTestJob"
    runId = glue.start_job_run(JobName=gluejobname)
    status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
    print("Job Status : ", status['JobRun']['JobRunState'])
What could be the issue here?
Thanks

Passing variables to HTTPS-triggered cloud function

I'm using Cloud Composer, and I have a DAG that has one task that calls an HTTPS-triggered cloud function that sends out an email (due to restrictions on the project I'm working on, I had to do it this way).
The simplest form of this works: I can trigger the Cloud Function, and the emails are sent successfully. However, I want to pass some variables I define in the DAG to the Cloud Function, and this is where something fails. I was using the usual way of passing parameters in the request URL.
This was the way I was defining the DAG:
# --------------------------------------------------------------------------------
# Import Libraries
# --------------------------------------------------------------------------------
import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator,BigQueryExecuteQueryOperator
from airflow.providers.google.common.utils import id_token_credentials as id_token_credential_utils
import google.auth.transport.requests
from google.auth.transport.requests import AuthorizedSession
# --------------------------------------------------------------------------------
# Set variables
# --------------------------------------------------------------------------------
(...)
report_name_url = "report_name_url"
end_user = "end_user#email.com"
# --------------------------------------------------------------------------------
# Functions
# --------------------------------------------------------------------------------
def invoke_cloud_function():
    url = "https://<trigger_url>?report_name_url={}&end_user={}".format(report_name_url, end_user)  # I'M ADDING THE STRINGS TO THE URL AFTER THE ?, TO PASS WHAT I WANT AS ARGUMENTS TO THE CLOUD FUNCTION
    request = google.auth.transport.requests.Request()  # this is a request for obtaining the credentials
    id_token_credentials = id_token_credential_utils.get_default_id_token_credentials(url, request=request)  # If your cloud function url has query parameters, remove them before passing to the audience
    resp = AuthorizedSession(id_token_credentials).request("GET", url=url)  # the authorized session object is used to access the Cloud Function
    print(resp.status_code)  # should return 200
    print(resp.content)  # the body of the HTTP response
# --------------------------------------------------------------------------------
# Define DAG
# --------------------------------------------------------------------------------
with DAG(
    dag_id,
    schedule_interval='0 13 05 * *',  # DAG Cron scheduler
    default_args=default_args) as dag:

    (...)

    send_email = PythonOperator(
        task_id="send_email",
        python_callable=invoke_cloud_function
    )

    start >> run_stored_procedure >> composer_logging >> send_email >> end
This is what I have as far as the DAG goes. From the perspective of the cloud function, I have the following:
def send_email(request):
    import ssl
    from email.message import EmailMessage
    import smtplib
    import os

    report_name_url = request.args.get('report_name_url')
    report_name = report_name_url.replace("_", " ")
    end_user = request.args.get('end_user')

    (...)

    context = ssl.create_default_context()
    with smtplib.SMTP_SSL('smtp.gmail.com', 465, context=context) as smtp:
        smtp.login(sender_email, password)
        smtp.sendmail(sender_email, receiver_email, em.as_string())
Can someone point me toward a solution for my use-case?
Thank you very much.
Edit for added context:
I'm getting the following information from the logs:
"(...)Unauthorized</h1>\n<h2>Your client does not have permission to the requested URL <code>...</code>.</h2>\n<h2></h2>\n</body></html>\n'"
This is odd because I think this was the error I was getting BEFORE giving permissions to my service account to invoke the cloud function.
Now, that permission is in place.
The only problem I foresee is that I'm not exactly calling the original URL that triggers the cloud function, since I'm adding parameters. Could this be the problem?
EDIT:
After a lot of digging around, I found a way to do this. First and foremost, I had to switch from GET to POST. This way, I was able to pass the URL that indeed is meant to trigger the Cloud Function.
The final solution came down to this:
This was the function in the DAG:
def invoke_cloud_function_success():
    url = "<trigger url>"  # the url is also the target audience.
    request = google.auth.transport.requests.Request()  # this is a request for obtaining the credentials
    id_token_credentials = id_token_credential_utils.get_default_id_token_credentials(url, request=request)  # If your cloud function url has query parameters, remove them before passing to the audience
    headers = {"Content-Type": "application/json"}
    body = {"report_name": report_name, "end_user": end_user, "datastudio_link": datastudio_link}
    resp = AuthorizedSession(id_token_credentials).post(url=url, json=body, headers=headers)  # the authorized session object is used to access the Cloud Function
    print(resp.status_code)  # should return 200
    print(resp.content)  # the body of the HTTP response
In the final Cloud Function I had to put:
request_json = request.get_json()
report_name = list(request_json['report_name'])
datastudio_link = list(request_json['datastudio_link'])
end_user = list(request_json['end_user'])
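A note on why the original GET approach likely failed, and how it could be kept if preferred: the comment next to get_default_id_token_credentials already hints that the ID-token audience must be the bare trigger URL, without query parameters. A minimal, untested sketch (reusing the names from the DAG above) that strips the query string before deriving the audience while still sending the full URL on the request:

from urllib.parse import urlencode, urlsplit, urlunsplit

full_url = "https://<trigger_url>?" + urlencode(
    {"report_name_url": report_name_url, "end_user": end_user}
)

# The audience must be the trigger URL with no query string.
parts = urlsplit(full_url)
audience = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

request = google.auth.transport.requests.Request()
id_token_credentials = id_token_credential_utils.get_default_id_token_credentials(
    audience, request=request
)
resp = AuthorizedSession(id_token_credentials).request("GET", url=full_url)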

How to retrieve a list of all Airflow DAGs in Python?

There is the CLI function airflow dags list which lists all DAGs in the current environment. Is there a similar function in the Python API?
Yes. The Airflow REST API has a List DAGs endpoint:
GET /dags
There is also an official Python client for the API, so you can use it from Python easily.
Here is an example of how to set it up for get_dags:
import airflow_client.client as client
from airflow_client.client.api import dag_api
from pprint import pprint

# Configure the host and HTTP basic authentication
configuration = client.Configuration(
    host="http://localhost/api/v1",
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
)

with client.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = dag_api.DAGApi(api_client)
    limit = 100
    offset = 0
    order_by = "order_by_example"
    tags = [
        "tags_example",
    ]
    only_active = True
    try:
        # List DAGs
        api_response = api_instance.get_dags(limit=limit, offset=offset, order_by=order_by, tags=tags, only_active=only_active)
        pprint(api_response)
    except client.ApiException as e:
        print("Exception when calling DAGApi->get_dags: %s\n" % e)

Passing variables through Cloud Functions to a container using KubernetesPodOperator on Cloud Composer

I am trying to get the event and context variable data from background functions run on Google Cloud Functions and pass the values through to a container running the KubernetesPodOperator on Cloud Composer / Airflow.
The first section of code is my Cloud Function, which triggers a DAG called gcs_to_pubsub_topic_dag. What I would like to pass over and access is the data in the JSON, specifically the "conf": event data.
#!/usr/bin/env python
# coding: utf-8
from google.auth.transport.requests import Request
from google.oauth2 import id_token
import requests

IAM_SCOPE = 'https://www.googleapis.com/auth/iam'
OAUTH_TOKEN_URI = 'https://www.googleapis.com/oauth2/v4/token'

def trigger_dag(event, context=None):
    client_id = '###############.apps.googleusercontent.com'
    webserver_id = '###############'
    # The name of the DAG you wish to trigger
    dag_name = 'gcs_to_pubsub_topic_dag'
    webserver_url = (
        'https://'
        + webserver_id
        + '.appspot.com/api/experimental/dags/'
        + dag_name
        + '/dag_runs'
    )
    print(f' This is my webserver url: {webserver_url}')
    # Make a POST request to IAP which then triggers the DAG
    make_iap_request(
        webserver_url, client_id, method='POST', json={"conf": event, "replace_microseconds": 'false'})

def make_iap_request(url, client_id, method='GET', **kwargs):
    if 'timeout' not in kwargs:
        kwargs['timeout'] = 90
    google_open_id_connect_token = id_token.fetch_id_token(Request(), client_id)
    resp = requests.request(
        method, url,
        headers={'Authorization': 'Bearer {}'.format(
            google_open_id_connect_token)}, **kwargs)
    if resp.status_code == 403:
        raise Exception('Service account does not have permission to '
                        'access the IAP-protected application.')
    elif resp.status_code != 200:
        raise Exception(
            'Bad response from application: {!r} / {!r} / {!r}'.format(
                resp.status_code, resp.headers, resp.text))
    else:
        return resp.text

def main(event, context=None):
    """
    Call the main function, sets the order in which to run functions.
    """
    trigger_dag(event, context=None)
    return 'Script has run without errors !!'

if __name__ == "__main__":
    main()
The dag that is triggered runs this KubernetesPodOperator code:
kubernetes_pod_operator.KubernetesPodOperator(
    # The ID specified for the task.
    task_id=TASK_ID,
    # Name of task you want to run, used to generate Pod ID.
    name=TASK_ID,
    # Entrypoint of the container, if not specified the Docker container's
    # entrypoint is used. The cmds parameter is templated.
    cmds=['python3', 'execution_file.py'],
    # The namespace to run within Kubernetes, default namespace is `default`.
    namespace=KUBERNETES_NAMESPACE,
    # location of the docker image on google container repository
    image=f'eu.gcr.io/{GCP_PROJECT_ID}/{CONTAINER_ID}:{IMAGE_VERSION}',
    # Always pulls the image before running it.
    image_pull_policy='Always',
    # The env_vars template field allows you to access variables defined in the Airflow UI.
    env_vars={'GCP_PROJECT_ID': GCP_PROJECT_ID, 'DAG_CONF': '{{ dag_run.conf }}'},
    dag=dag)
And then, finally, I want DAG_CONF to print within the execution_file.py script inside the container image:
#!/usr/bin/env python
# coding: utf-8
from gcs_unzip_function import main as gcs_unzip_function
from gcs_to_pubsub_topic import main as gcs_to_pubsub_topic
from os import listdir, getenv
GCP_PROJECT_ID = getenv('GCP_PROJECT_ID')
DAG_CONF = getenv('DAG_CONF')
print('Test run')
print(GCP_PROJECT_ID)
print (f'This is my dag conf {DAG_CONF}')
print(type(DAG_CONF))
At the moment the code triggers the dag and returns:
Test run
GCP_PROJECT_ID (this is set in the airflow environment variables)
This is my dag conf None
<class 'NoneType'>
whereas I would like DAG_CONF to come through.
I have a workaround for accessing data about the object that triggered the DAG inside the container run with the KubernetesPodOperator.
The POST request code stays the same, but I want to highlight that you can pass anything, as long as it goes into the conf element of the dictionary.
make_iap_request(
    webserver_url, client_id, method='POST', json={"conf": event,
                                                   "replace_microseconds": 'false'})
The DAG code requires you to create a custom operator class that accesses the dag_run and its .conf element, then passes the JSON we sent in the POST request on as an argument (based on an article I read while doing this part):
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

class CustomKubernetesPodOperator(KubernetesPodOperator):

    def execute(self, context):
        json = str(context['dag_run'].conf)
        arguments = [f'--json={json}']
        self.arguments.extend(arguments)
        super().execute(context)
CustomKubernetesPodOperator(
    # The ID specified for the task.
    task_id=TASK_ID,
    # Name of task you want to run, used to generate Pod ID.
    name=TASK_ID,
    # Entrypoint of the container, if not specified the Docker container's
    # entrypoint is used. The cmds parameter is templated.
    cmds=['python3', 'execution_file.py'],
    # The namespace to run within Kubernetes, default namespace is `default`.
    namespace=KUBERNETES_NAMESPACE,
    # location of the docker image on google container repository
    image=f'eu.gcr.io/{GCP_PROJECT_ID}/{CONTAINER_ID}:{IMAGE_VERSION}',
    # Always pulls the image before running it.
    image_pull_policy='Always',
    # The env_vars field allows you to access variables defined in the Airflow UI.
    env_vars={'GCP_PROJECT_ID': GCP_PROJECT_ID},
    dag=dag)
The code running in the container uses argparse to receive the argument as a string, then uses ast.literal_eval to turn it back into a dictionary that can be accessed in the code:
import ast
import argparse
from os import listdir, getenv

def main(object_metadata_dict):
    """
    Call the main function, sets the order in which to run functions.
    """
    print(f'This is my metadata as a dictionary {object_metadata_dict}')
    print(f'This is my bucket {object_metadata_dict["bucket"]}')
    print(f'This is my file name {object_metadata_dict["name"]}')
    return 'Script has run without errors !!'

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Staging to live load process.')
    parser.add_argument("--json", type=str, dest="json", required=False, default='all',
                        help="List of metadata for the triggered object, derived "
                             "from the Cloud Function background function.")
    args = parser.parse_args()
    json = args.json
    object_metadata_dict = ast.literal_eval(json)
    main(object_metadata_dict)
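As a side note on the design: since the conf is serialized with str() and parsed back with ast.literal_eval, an alternative sketch (purely illustrative, with a hypothetical conf payload) would be to keep the payload as real JSON end to end with json.dumps / json.loads:

import json

# Hypothetical conf payload, mirroring what the Cloud Function sends as "conf".
conf = {"bucket": "my-bucket", "name": "path/to/file.csv"}

# Operator side: serialize to JSON instead of str().
argument = f'--json={json.dumps(conf)}'

# Container side: parse with json.loads instead of ast.literal_eval.
payload = argument.split("=", 1)[1]
object_metadata_dict = json.loads(payload)
print(object_metadata_dict["bucket"], object_metadata_dict["name"])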

Use iot_v1 in a GCP Cloud Function

I'm attempting to write a GCP Cloud Function in Python that calls the API for creating an IoT device. The initial challenge seems to be getting the appropriate module (specifically iot_v1) loaded within Cloud Functions so that it can make the call.
Example Python code from Google is located at https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/iot/api-client/manager/manager.py. The specific call desired is shown in "create_es_device". Trying to repurpose that into a Cloud Function (code below) errors out with "ImportError: cannot import name 'iot_v1' from 'google.cloud' (unknown location)"
Any thoughts?
import base64
import logging
import json
import datetime
from google.auth import compute_engine
from apiclient import discovery
from google.cloud import iot_v1

def handle_notification(event, context):
    # Triggered from a message on a Cloud Pub/Sub topic.
    # Args:
    #   event (dict): Event payload.
    #   context (google.cloud.functions.Context): Metadata for the event.
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    logging.info('New device registration info: {}'.format(pubsub_message))
    certData = json.loads(pubsub_message)['certs']
    deviceID = certData['device-id']
    certKey = certData['certificate']
    projectID = certData['project-id']
    cloudRegion = certData['cloud-region']
    registryID = certData['registry-id']
    newDevice = create_device(projectID, cloudRegion, registryID, deviceID, certKey)
    logging.info('New device: {}'.format(newDevice))

def create_device(project_id, cloud_region, registry_id, device_id, public_key):
    # from https://cloud.google.com/iot/docs/how-tos/devices#api_1
    client = iot_v1.DeviceManagerClient()
    parent = client.registry_path(project_id, cloud_region, registry_id)
    # Note: You can have multiple credentials associated with a device.
    device_template = {
        # 'id': device_id,
        'id': 'testing_device',
        'credentials': [{
            'public_key': {
                'format': 'ES256_PEM',
                'key': public_key
            }
        }]
    }
    return client.create_device(parent, device_template)
You need to have the google-cloud-iot package listed in your requirements.txt file.
See https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/iot/api-client/manager/requirements.txt
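For example, a minimal requirements.txt for this function might look like the following (unpinned here; pin a version as appropriate):

# requirements.txt, deployed alongside main.py
google-cloud-iot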
