How to connect airflow[docker] to GCP? - python

First I connected Airflow's PostgresOperator and created a dataset.
Now I want to set up a Postgres to GCS operator connection,
but I always get: invalid dsn: invalid connection option "extra__google_cloud_platform__key_path"
How do I fix this?
My Airflow connection is set up as in the screenshot (not included). My DAG:
postgres_to_gcs_task = PostgresToGCSOperator(
    task_id='postgres_to_gcs_task',
    postgres_conn_id='postgres_to_gcs',
    sql='SELECT * FROM game;',
    bucket='lol-airflow-bucket',
    filename='lol-airflow-bucket/my_game/1',
    export_format='csv',
    dag=dag,
)

I think there is an issue with the JSON key or the key file path.
Can you confirm that the path and the JSON key are correct?
This page from the official documentation explains how to create a connection in Airflow:
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/connections/gcp.html
You can also check this interesting post on the topic:
https://junjiejiang94.medium.com/get-started-with-airflow-google-cloud-platform-docker-a21c46e0f797
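For reference, here is a minimal sketch of the operator with the Postgres and GCP credentials kept in two separate connections (the connection IDs below are placeholders; the GCP connection is the one that holds the service-account key, configured as described in the documentation above):

from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator

postgres_to_gcs_task = PostgresToGCSOperator(
    task_id='postgres_to_gcs_task',
    postgres_conn_id='postgres_default',   # plain Postgres connection: host, schema, login, password only
    gcp_conn_id='google_cloud_default',    # GCP connection holding the service-account key
    sql='SELECT * FROM game;',
    bucket='lol-airflow-bucket',
    filename='my_game/1',                  # object name inside the bucket
    export_format='csv',
    dag=dag,
)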
Another direction, if it helps:
If you want to use a managed service on Google Cloud with Airflow, you can create a Cloud Composer cluster (V2 is better).
In this case there is no need to manage the GCP connection yourself; you only need to give a service account with the expected permissions to the Cloud Composer configuration when creating the cluster.

Related

How to get a DB password from Azure Key Vault using Python? I am running this Python file on a Google Dataproc cluster

My SQL Server DB password is saved in Azure Key Vault with DATAREF ID as an identifier. I need that password to create a Spark dataframe from a table in SQL Server. I am running this .py file on a Google Dataproc cluster. How can I get that password using Python?
Since you are accessing an Azure service from a non-Azure service, you will need a service principal. You can use a certificate or a secret. See THIS link for the different methods. You will need to give the service principal proper access, and this will depend on whether you are using RBAC or access policies for your key vault.
So the steps you need to follow are:
Create a key vault and create a secret.
Create a service principal or application registration. Store the client ID, client secret and tenant ID.
Give the service principal proper access to the key vault (if you are using access policies) or to the specific secret (if you are using the RBAC model).
The Python link for the code is HERE.
The code that will work for you is below:
from azure.identity import ClientSecretCredential
from azure.keyvault.secrets import SecretClient

# Service principal credentials (replace the placeholders)
tenantid = "<your_tenant_id>"
clientsecret = "<your_client_secret>"
clientid = "<your_client_id>"

# Authenticate against the key vault and read the secret
my_credentials = ClientSecretCredential(tenant_id=tenantid, client_id=clientid, client_secret=clientsecret)
secret_client = SecretClient(vault_url="https://<your_keyvault_name>.vault.azure.net/", credential=my_credentials)
secret = secret_client.get_secret("<your_secret_name>")

print(secret.name)
print(secret.value)
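Since the goal is to build a Spark dataframe from the SQL Server table, a possible follow-up is to pass the retrieved secret to a JDBC read. This is only a sketch: the server, database, table and user names are placeholders, and it assumes the SQL Server JDBC driver is available on the Dataproc cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-sqlserver").getOrCreate()

# Use the password retrieved from Key Vault for the JDBC connection
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<your_server>:1433;databaseName=<your_db>")
    .option("dbtable", "<your_table>")
    .option("user", "<your_db_user>")
    .option("password", secret.value)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
df.show()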

How can I see the service account that the python bigquery client uses?

To create a default bigquery client I use:
from google.cloud import bigquery
client = bigquery.Client()
This uses the (default) credentials available in the environment.
But how can I then see which (default) service account is used?
While you can interrogate the credentials directly (be it json keys, metadata server, etc), I have occasionally found it valuable to simply query bigquery using the SESSION_USER() function.
Something quick like this should suffice:
client = bigquery.Client()
query_job = client.query("SELECT SESSION_USER() as whoami")
results = query_job.result()
for row in results:
    print("i am {}".format(row.whoami))
This led me in the right direction:
Google BigQuery Python Client using the wrong credentials
To see the service-account used you can do:
client._credentials.service_account_email
However:
This statement above works when you run it in a Jupyter notebook (in Vertex AI), but when you run it in a Cloud Function with print(client._credentials.service_account_email) it just logs 'default' to Cloud Logging. But the default service account for a Cloud Function should be: <project_id>@appspot.gserviceaccount.com.
This will also give you the wrong answer:
client.get_service_account_email()
The call to client.get_service_account_email() does not return the credential's service account email address. Instead, it returns the BigQuery service account email address used for KMS encryption/decryption.
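An alternative that avoids reaching into the private _credentials attribute is to inspect the application default credentials directly. This is only a sketch; note that metadata-based credentials (Compute Engine, Cloud Functions) may also report just 'default' here until they are refreshed.

import google.auth

credentials, project_id = google.auth.default()
# Service-account and metadata-server credentials expose this attribute;
# other credential types (e.g. user credentials) do not.
print(getattr(credentials, "service_account_email", "<not a service account>"))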
Following John Hanley's comment (when running on Compute Engine), you can query the metadata service to get the service account email:
https://cloud.google.com/compute/docs/metadata/default-metadata-values
So you can either use linux:
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email" -H "Metadata-Flavor: Google"
Or python:
import requests

headers = {'Metadata-Flavor': 'Google'}
response = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email",
    headers=headers
)
print(response.text)
The default in the URL is an alias for the actual service account in use.

unable to locate credentials aws hooks

I am trying to create a simple DAG using airflow.hooks.S3Hook that orchestrates two tasks: the first prints a simple string with bash, the next uploads a CSV file to an AWS S3 bucket.
so I get this error:
unable to locate credentials
It's a credential error related to aws_access_key_id and aws_secret_access_key.
I know that I can solve it using boto3, but I need to do it with airflow.hooks.
from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

default_arguments = {'owner': 'airflow', 'start_date': days_ago(1)}

def upload_file_to_s3_bucket(filename, key, bucket_name, region_name):
    hook = S3Hook(aws_conn_id='aws_default')
    hook.create_bucket(bucket_name, region_name)
    hook.load_file(filename, key, bucket_name)

with DAG('upload_to_aws',
         schedule_interval='@daily',
         catchup=False,
         default_args=default_arguments
         ) as dag:
    bash_task = BashOperator(task_id='bash_task',
                             bash_command='echo $TODAY',
                             env={'TODAY': '2020-11-16'})
    python_task = PythonOperator(task_id='py_task',
                                 python_callable=upload_file_to_s3_bucket,
                                 op_kwargs={'filename': '*******.csv',
                                            'key': 'my_s3_reasult.csv',
                                            'bucket_name': 'tutobucket',
                                            'region_name': 'us-east-1'})
    bash_task >> python_task
You specified an aws_conn_id within the S3Hook. This connection needs to be configured, for example via the UI, see Managing Connections:
Airflow needs to know how to connect to your environment. Information such as hostname, port, login and passwords to other systems and services is handled in the Admin->Connections section of the UI. The pipeline code you will author will reference the ‘conn_id’ of the Connection objects.
There is also a dedicated description for an AWS connection:
Configuring the Connection
Login (optional) - Specify the AWS access key ID.
Password (optional) - Specify the AWS secret access key.
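As an alternative to the UI, the connection can also be created programmatically. A minimal sketch using Airflow's Connection model (run it once, e.g. from a one-off setup script; the values are placeholders):

from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id='aws_default',
    conn_type='aws',
    login='<aws_access_key_id>',         # AWS access key ID
    password='<aws_secret_access_key>',  # AWS secret access key
)

session = settings.Session()
session.add(conn)
session.commit()

Connections can also be supplied through environment variables such as AIRFLOW_CONN_AWS_DEFAULT; see the Managing Connections page referenced above.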
I solved it; after digging a bit I was able to make it work.
I specified the following fields:
Conn Id: aws_default
Conn Type: Amazon Web Services
Extra:
{"aws_access_key_id":"aws_access_key_id",
"secret_access_key": "aws_secret_access_key"}

Python client for accessing kubernetes cluster on GKE

I am struggling to programmatically access a kubernetes cluster running on Google Cloud. I have set up a service account and pointed GOOGLE_APPLICATION_CREDENTIALS to a corresponding credentials file. I managed to get the cluster and credentials as follows:
import google.auth
import google.auth.transport.requests
from google.cloud.container_v1 import ClusterManagerClient
from kubernetes import client

credentials, project = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
credentials.refresh(google.auth.transport.requests.Request())

cluster_manager = ClusterManagerClient(credentials=credentials)
cluster = cluster_manager.get_cluster(project, 'us-west1-b', 'clic-cluster')
So far so good. But then I want to start using the kubernetes client:
config = client.Configuration()
config.host = f'https://{cluster.endpoint}:443'
config.verify_ssl = False
config.api_key = {"authorization": "Bearer " + credentials.token}
config.username = credentials._service_account_email
client.Configuration.set_default(config)
kub = client.CoreV1Api()
print(kub.list_pod_for_all_namespaces(watch=False))
And I get an error message like this:
pods is forbidden: User "12341234123451234567" cannot list resource "pods" in API group "" at the cluster scope: Required "container.pods.list" permission.
I found this website describing the container.pods.list, but I don't know where I should add it, or how it relates to the API scopes described here.
As per the error:
pods is forbidden: User "12341234123451234567" cannot list resource
"pods" in API group "" at the cluster scope: Required
"container.pods.list" permission.
it seems evident that the user credentials you are trying to use do not have permission to list pods.
The full list of permissions is given at https://cloud.google.com/kubernetes-engine/docs/how-to/iam, and there are different roles that can come into play here:
If you are able to get the cluster, that call is covered by several roles, such as Kubernetes Engine Cluster Admin, Kubernetes Engine Cluster Viewer, Kubernetes Engine Developer and Kubernetes Engine Viewer.
Whereas if you want to list pods (kub.list_pod_for_all_namespaces(watch=False)), you might need Kubernetes Engine Viewer access.
You should be able to add multiple roles.

Calling a Google cloud function within Airflow DAG

I have a Google Cloud Function that is working, and I am trying to call it from an Airflow DAG.
What I have tried so far is to use the SimpleHttpOperator:
MY_TASK_NAME = SimpleHttpOperator(
    task_id="MY_TASK_NAME",
    method='POST',
    http_conn_id='http_default',
    endpoint='https://us-central1-myprojectname.cloudfunctions.net/MyFunctionName',
    data=({"schema": schema, "table": table}),
    headers={"Content-Type": "application/json"},
    xcom_push=False
)
but digging into the logs, it says it cannot find the resource:
{base_task_runner.py:98} INFO - Subtask: The requested URL /https://us-central1-myprojectname.cloudfunctions.net/MyFunctionName was not found on this server. That’s all we know.
I also noticed that it actually posts to https://www.google.com/ plus the URL I gave:
Sending 'POST' to url: https://www.google.com/https://us-central1-myprojectname.cloudfunctions.net/MyFunctionName
What is the proper way to call the function?
Thanks
This is because you are using the http_conn_id='http_default'.
If you look at the http_default connection in the Airflow UI, its Host field is set to http://www.google.com/.
Either create a new Connection with HTTP Connection type or modify the http_default connection and change the host to https://us-central1-myprojectname.cloudfunctions.net/
Then update the endpoint field in your task to:
MY_TASK_NAME = SimpleHttpOperator(
    task_id="MY_TASK_NAME",
    method='POST',
    http_conn_id='http_default',
    endpoint='MyFunctionName',
    data=({"schema": schema, "table": table}),
    headers={"Content-Type": "application/json"},
    xcom_push=False
)
Edit: Added / at the end of URLs
As @kaxil noted, you first need to change the HTTP connection. You then need to send the correct authentication when invoking the Cloud Function. The link below has a step-by-step guide for doing this by subclassing SimpleHttpOperator:
https://medium.com/google-cloud/calling-cloud-composer-to-cloud-functions-and-back-again-securely-8e65d783acce
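If the function requires authentication, the essential step is to send a Google-signed ID token whose audience is the function URL. A rough sketch of doing this from a plain PythonOperator callable instead of subclassing SimpleHttpOperator (it assumes the environment's service account is allowed to invoke the function; the payload values are placeholders):

import requests
import google.auth.transport.requests
from google.oauth2 import id_token

FUNCTION_URL = 'https://us-central1-myprojectname.cloudfunctions.net/MyFunctionName'

def call_cloud_function(**context):
    # Mint an ID token with the function URL as the audience
    auth_request = google.auth.transport.requests.Request()
    token = id_token.fetch_id_token(auth_request, FUNCTION_URL)

    # Call the function with the token in the Authorization header
    response = requests.post(
        FUNCTION_URL,
        json={"schema": "my_schema", "table": "my_table"},
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    return response.text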
As a side note, Google should make this process much clearer. It is perfectly reasonable to want to trigger a Google Cloud Function (GCF) from Google Cloud Composer. The documentation on sending an HTTP trigger to a GCF covers Cloud Scheduler, Cloud Tasks, Cloud Pub/Sub and a host of others, but not Cloud Composer.
