Setting up S3 for logs in Airflow (Python)
I am using docker-compose to set up a scalable Airflow cluster. I based my approach on this Dockerfile: https://hub.docker.com/r/puckel/docker-airflow/
My problem is getting the logs set up to write to and read from S3. When a DAG has completed, I get an error like this:
*** Log file isn't local.
*** Fetching here: http://ea43d4d49f35:8793/log/xxxxxxx/2017-06-26T11:00:00
*** Failed to fetch log file from worker.
*** Reading remote logs...
Could not read logs from s3://buckets/xxxxxxx/airflow/logs/xxxxxxx/2017-06-26T11:00:00
I set up a new section in the airflow.cfg file like this
[MyS3Conn]
aws_access_key_id = xxxxxxx
aws_secret_access_key = xxxxxxx
aws_default_region = xxxxxxx
I then specified the S3 path in the remote logs section of airflow.cfg:
remote_base_log_folder = s3://buckets/xxxx/airflow/logs
remote_log_conn_id = MyS3Conn
Did I set this up properly and is there a bug? Is there a recipe for success here that I am missing?
-- Update
I tried exporting in URI and JSON formats and neither seemed to work. I then exported aws_access_key_id and aws_secret_access_key, and Airflow started picking them up. Now I get this error in the worker logs:
6/30/2017 6:05:59 PM INFO:root:Using connection to: s3
6/30/2017 6:06:00 PM ERROR:root:Could not read logs from s3://buckets/xxxxxx/airflow/logs/xxxxx/2017-06-30T23:45:00
6/30/2017 6:06:00 PM ERROR:root:Could not write logs to s3://buckets/xxxxxx/airflow/logs/xxxxx/2017-06-30T23:45:00
6/30/2017 6:06:00 PM Logging into: /usr/local/airflow/logs/xxxxx/2017-06-30T23:45:00
-- Update
I found this link as well
https://www.mail-archive.com/dev@airflow.incubator.apache.org/msg00462.html
I then shelled into one of my worker machines (separate from the webserver and scheduler) and ran this bit of code in Python:
import airflow
s3 = airflow.hooks.S3Hook('s3_conn')
s3.load_string('test', airflow.conf.get('core', 'remote_base_log_folder'))
I received this error:
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
I tried exporting several different types of AIRFLOW_CONN_ environment variables, as explained in the connections section here https://airflow.incubator.apache.org/concepts.html and by other answers to this question:
s3://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@S3
{"aws_account_id":"<xxxxx>","role_arn":"arn:aws:iam::<xxxx>:role/<xxxxx>"}
{"aws_access_key_id":"<xxxxx>","aws_secret_access_key":"<xxxxx>"}
I have also exported AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with no success.
These credentials are stored in a database, so once I add them in the UI they should be picked up by the workers, but for some reason the workers are not able to write/read logs.
UPDATE: Airflow 1.10 makes logging a lot easier.
For S3 logging, set up the connection hook as per the above answer, and then simply add the following to airflow.cfg:
[core]
# Airflow can store logs remotely in AWS S3. Users must supply a remote
# location URL (starting with 's3://...') and an Airflow connection
# id that provides access to the storage location.
remote_base_log_folder = s3://my-bucket/path/to/logs
remote_log_conn_id = MyS3Conn
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False
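Once the connection exists, a quick way to confirm Airflow can actually reach the bucket is to exercise the hook directly from a worker. This is just a sanity-check sketch; it assumes the Airflow 1.10 import path, the MyS3Conn connection id used above, and a placeholder bucket name (adjust both):
from airflow.hooks.S3_hook import S3Hook

hook = S3Hook(aws_conn_id='MyS3Conn')
# a 403 here means the credentials lack write access; an ImportError means airflow[s3]/boto3 is missing
hook.load_string('remote logging test', key='airflow-remote-log-test.txt',
                 bucket_name='my-bucket', replace=True)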
For GCS logging:
Install the gcp_api package first, like so: pip install apache-airflow[gcp_api].
Set up the connection hook as per the above answer
Add the following to airflow.cfg
[core]
# Airflow can store logs remotely in Google Cloud Storage. Users must supply a remote
# location URL (starting with 'gs://...') and an Airflow connection
# id that provides access to the storage location.
remote_logging = True
remote_base_log_folder = gs://my-bucket/path/to/logs
remote_log_conn_id = MyGCSConn
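Similarly for GCS, it can help to confirm the connection works before relying on remote logging. A rough check, assuming the contrib hook path in Airflow 1.10 and the MyGCSConn id above (bucket and object names are placeholders):
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='MyGCSConn')
# False just means the object isn't there yet; an auth error points at the connection setup
print(hook.exists('my-bucket', 'path/to/logs/sanity-check'))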
NOTE: As of Airflow 1.9 remote logging has been significantly altered. If you are using 1.9, read on.
Reference here
Complete Instructions:
Create a directory to store configs and place it where it can be found on PYTHONPATH. One example is $AIRFLOW_HOME/config.
Create empty files called $AIRFLOW_HOME/config/log_config.py and
$AIRFLOW_HOME/config/__init__.py
Copy the contents of airflow/config_templates/airflow_local_settings.py into the log_config.py file that was just created in the step above.
Customize the following portions of the template:
# Add this variable to the top of the file. Note the trailing slash.
S3_LOG_FOLDER = 's3://<bucket where logs should be persisted>/'
Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG:
LOGGING_CONFIG = ...
Add an S3TaskHandler to the 'handlers' block of the LOGGING_CONFIG variable:
's3.task': {
'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
'formatter': 'airflow.task',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
's3_log_folder': S3_LOG_FOLDER,
'filename_template': FILENAME_TEMPLATE,
},
Update the airflow.task and airflow.task_runner blocks to use 's3.task' instead of 'file.task':
'loggers': {
'airflow.task': {
'handlers': ['s3.task'],
...
},
'airflow.task_runner': {
'handlers': ['s3.task'],
...
},
'airflow': {
'handlers': ['console'],
...
},
}
Make sure an S3 connection hook has been defined in Airflow, as per the above answer. The hook should have read and write access to the S3 bucket defined above in S3_LOG_FOLDER.
Update $AIRFLOW_HOME/airflow.cfg to contain:
task_log_reader = s3.task
logging_config_class = log_config.LOGGING_CONFIG
remote_log_conn_id = <name of the s3 platform hook>
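Before restarting, a quick import check can catch PYTHONPATH or naming problems early. A minimal sketch, assuming the $AIRFLOW_HOME/config layout described above:
# run from the same environment/PYTHONPATH as the scheduler and workers
from log_config import LOGGING_CONFIG

assert 's3.task' in LOGGING_CONFIG['handlers'], 'handler block missing'
assert 's3.task' in LOGGING_CONFIG['loggers']['airflow.task']['handlers'], 'airflow.task is not using s3.task'
print('custom logging config loads and s3.task is wired in')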
Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.
Verify that logs are showing up for newly executed tasks in the bucket you’ve defined (see the boto3 listing sketch after the sample log output below).
Verify that the S3 storage viewer is working in the UI. Pull up a newly executed task, and verify that you see something like:
*** Reading remote log from s3://<bucket where logs should be persisted>/example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log.
[2017-10-03 21:57:50,056] {cli.py:377} INFO - Running on host chrisr-00532
[2017-10-03 21:57:50,093] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py']
[2017-10-03 21:57:51,264] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,263] {__init__.py:45} INFO - Using executor SequentialExecutor
[2017-10-03 21:57:51,306] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,306] {models.py:186} INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py
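To check the bucket from outside Airflow (as mentioned in the verification step above), a plain boto3 listing works. This is a sketch; it assumes your AWS credentials are available in the environment and that you replace the bucket placeholder:
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='<bucket where logs should be persisted>',
                          Prefix='example_bash_operator/')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])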
Airflow 2.4.2
Follow the steps above, but paste this into log_config.py:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
from airflow import configuration as conf
from copy import deepcopy
S3_LOG_FOLDER = 's3://your/s3/log/folder'
LOG_LEVEL = conf.get('logging', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('logging', 'log_format')
BASE_LOG_FOLDER = conf.get('logging', 'BASE_LOG_FOLDER')
PROCESSOR_LOG_FOLDER = conf.get('scheduler', 'child_process_log_directory')
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'
LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
# Attach formatters to loggers (airflow.task, airflow.processor)
LOGGING_CONFIG['formatters']['airflow.task'] = { 'format': LOG_FORMAT }
LOGGING_CONFIG['formatters']['airflow.processor'] = { 'format': LOG_FORMAT }
# Add an S3 task handler
LOGGING_CONFIG['handlers']['s3.task'] = {
'class': 'airflow.providers.amazon.aws.log.s3_task_handler.S3TaskHandler',
'formatter': 'airflow.task',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
's3_log_folder': S3_LOG_FOLDER,
'filename_template': FILENAME_TEMPLATE
}
# Specify handler for airflow.task
LOGGING_CONFIG['loggers']['airflow.task']['handlers'] = ['task', 's3.task']
You need to set up the S3 connection through the Airflow UI. To do this, go to the Admin -> Connections tab in the Airflow UI and create a new row for your S3 connection.
An example configuration would be:
Conn Id: my_conn_S3
Conn Type: S3
Extra: {"aws_access_key_id":"your_aws_key_id", "aws_secret_access_key": "your_aws_secret_key"}
(Updated as of Airflow 1.10.2)
Here's a solution if you don't use the admin UI.
My Airflow doesn't run on a persistent server ... (It gets launched afresh every day in a Docker container, on Heroku.) I know I'm missing out on a lot of great features, but in my minimal setup, I never touch the admin UI or the cfg file. Instead, I have to set Airflow-specific environment variables in a bash script, which override the .cfg file.
apache-airflow[s3]
First of all, you need the s3 subpackage installed to write your Airflow logs to S3. (boto3 works fine for the Python jobs within your DAGs, but the S3Hook depends on the s3 subpackage.)
One more side note: conda install doesn't handle this yet, so I have to do pip install apache-airflow[s3].
Environment variables
In a bash script, I set these core variables. Starting from these instructions but using the naming convention AIRFLOW__{SECTION}__{KEY} for environment variables, I do:
export AIRFLOW__CORE__REMOTE_LOGGING=True
export AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://bucket/key
export AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_uri
export AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
S3 connection ID
The s3_uri above is a connection ID that I made up. In Airflow, it corresponds to another environment variable, AIRFLOW_CONN_S3_URI. The value of that is your S3 path, which has to be in URI form. That's
s3://access_key:secret_key@bucket/key
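One gotcha with the URI form: if the secret key contains characters such as '/' or '+', it needs to be URL-encoded before it goes into the URI. A small sketch (the key values below are placeholders):
from urllib.parse import quote_plus

access_key = 'AKIAXXXXXXXXXXXXXXXX'
secret_key = 'abc/def+ghi'  # placeholder secret containing characters that must be escaped
print('s3://{}:{}@bucket/key'.format(access_key, quote_plus(secret_key)))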
Store this however you handle other sensitive environment variables.
With this configuration, Airflow will be able to write your logs to S3. They will follow the path of s3://bucket/key/dag/task_id/timestamp/1.log.
Appendix on upgrading from Airflow 1.8 to Airflow 1.10
I recently upgraded my production pipeline from Airflow 1.8 to 1.9, and then 1.10. Good news is that the changes are pretty tiny; the rest of the work was just figuring out nuances with the package installations (unrelated to the original question about S3 logs).
(1) First of all, I needed to upgrade to Python 3.6 with Airflow 1.9.
(2) The package name changed from airflow to apache-airflow with 1.9. You also might run into this in your pip install.
(3) The package psutil has to be in a specific version range for Airflow. You might encounter this when you're doing pip install apache-airflow.
(4) python3-dev headers are needed with Airflow 1.9+.
(5) Here are the substantive changes: export AIRFLOW__CORE__REMOTE_LOGGING=True is now required.
(6) The logs have a slightly different path in S3, which I updated in the answer: s3://bucket/key/dag/task_id/timestamp/1.log.
But that's it! The logs did not work in 1.9, so I recommend just going straight to 1.10, now that it's available.
To complement Arne's answer with the recent Airflow updates: you do not need to set task_log_reader to a value other than the default, task.
If you follow the default logging template airflow/config_templates/airflow_local_settings.py, you can see that since this commit (note that the handler's name changed to 's3': {'task'... instead of s3.task), the value of the remote folder (REMOTE_BASE_LOG_FOLDER) is what swaps in the right handler:
REMOTE_LOGGING = conf.get('core', 'remote_logging')

if REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('s3://'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['s3'])
elif REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('gs://'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['gcs'])
elif REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('wasb'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['wasb'])
elif REMOTE_LOGGING and ELASTICSEARCH_HOST:
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['elasticsearch'])
More details on how to log to / read from S3: https://github.com/apache/incubator-airflow/blob/master/docs/howto/write-logs.rst#writing-logs-to-amazon-s3
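If you want to confirm which handler actually ended up active, you can inspect the airflow.task logger at runtime. A sketch; importing airflow applies the logging configuration, so this should be run in the same environment as your workers:
import logging

import airflow  # importing airflow applies LOGGING_CONFIG, including the remote-handler swap shown above

print([type(h).__name__ for h in logging.getLogger('airflow.task').handlers])
# expect something like ['S3TaskHandler'] when remote_base_log_folder starts with s3://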
Phew! My motivation for keeping on nipping the Airflow bugs in the bud was to confront this as a bunch of Python files XD. Here's my experience with apache-airflow==1.9.0.
First of all, there's simply no need to try
airflow connections .......... --conn_extra etc., etc.
Just set your airflow.cfg as:
remote_logging = True
remote_base_log_folder = s3://dev-s3-main-ew2-dmg-immutable-potns/logs/airflow-logs/
encrypt_s3_logs = False
# Logging level
logging_level = INFO
fab_logging_level = WARN
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class = log_config.LOGGING_CONFIG
remote_log_conn_id = s3://<ACCESS-KEY>:<SECRET-ID>@<MY-S3-BUCKET>/<MY>/<SUB>/<FOLDER>/
Keep the $AIRFLOW_HOME/config/__init__.py and $AIRFLOW_HOME/config/log_config.py files as above.
The problem in my case was a missing "boto3" package, which I tracked down by editing:
vi /usr/lib/python3.6/site-packages/airflow/utils/log/s3_task_handler.py
then adding import traceback
and, in the block containing:
Could not create an S3Hook with connection id "%s". '
'Please make sure that airflow[s3] is installed and '
'the S3 connection exists.
adding a traceback.print_exc(). And well, it started cribbing about the missing boto3!
Installed it, and life was beautiful again!
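For reference, the debugging trick described above boils down to printing the exception that would otherwise be swallowed. A standalone sketch of the same idea (the connection id is a placeholder):
import traceback

try:
    from airflow.hooks.S3_hook import S3Hook
    hook = S3Hook('MyS3Conn')
except Exception:
    # surfaces the real cause (in my case an ImportError for boto3) instead of the generic log message
    traceback.print_exc()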
Just a side note to anyone following the very useful instructions in the above answer:
If you stumble upon this issue: "ModuleNotFoundError: No module named
'airflow.utils.log.logging_mixin.RedirectStdHandler'" as referenced here (which happens when using Airflow 1.9), the fix is simple: use this base template instead: https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/config_templates/airflow_local_settings.py (and follow all other instructions in the above answer).
The current template incubator-airflow/airflow/config_templates/airflow_local_settings.py in the master branch contains a reference to the class "airflow.utils.log.s3_task_handler.S3TaskHandler", which is not present in the apache-airflow==1.9.0 Python package.
Hope this helps!
I have it working with Airflow 1.10 in Kubernetes.
I have the following environment variables set:
AIRFLOW_CONN_LOGS_S3=s3://id:secret_uri_encoded@S3
AIRFLOW__CORE__REMOTE_LOGGING=True
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://xxxx/logs
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=logs_s3
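A quick way to verify that the env-var connection was picked up (note the conn id is the lower-cased suffix of AIRFLOW_CONN_LOGS_S3). A rough check using the Airflow 1.10 API:
from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection('logs_s3')
print(conn.conn_type, conn.host, conn.login)  # avoid printing the password/secret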
For Airflow 2.3.4, using Docker, I also faced issues with logging to S3.
Initially I faced some permission errors (although my IAM role was set up fine); then, after changing the config a bit, I was able to write the files in the correct location but could not read them (it kept falling back to the local log).
Anyway, after many efforts, debugging, trial and error attempts, here is what worked for me:
Define a connection for s3 (assuming your region is also eu-west-1):
Either via the UI, in which case you need to set:
Connection Id: my-conn (or whatever name you prefer),
Connection Type: Amazon Web Services (this is one change I made; S3 did not work for me),
Extra: {"region_name": "eu-west-1", "endpoint_url": "https://s3.eu-west-1.amazonaws.com"}
Or via the CLI:
airflow connections add my-conn --conn-type aws --conn-extra '{"region_name": "eu-west-1", "endpoint_url": "https://s3.eu-west-1.amazonaws.com"}'
As for the Airflow config, I set these in all processes:
...
export AIRFLOW__LOGGING__REMOTE_LOGGING=True
export AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=s3://my-bucket/path/to/log/folder
export AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=my-conn
...
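A quick way to confirm each process actually picked these up is to read them back through Airflow's config API. A minimal sketch:
from airflow.configuration import conf

print(conf.getboolean('logging', 'remote_logging'))
print(conf.get('logging', 'remote_base_log_folder'))
print(conf.get('logging', 'remote_log_conn_id'))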
After deploying, I was still getting errors like Falling back to local log..., but eventually the file was loaded and displayed (after a few refreshes).
It seems to work OK now though :)