Securely storing environment variables in GAE with app.yaml - python

I need to store API keys and other sensitive information in app.yaml as environment variables for deployment on GAE. The issue with this is that if I push app.yaml to GitHub, this information becomes public (not good). I don't want to store the info in a datastore as it does not suit the project. Rather, I'd like to swap out the values from a file that is listed in .gitignore on each deployment of the app.
Here is my app.yaml file:
application: myapp
version: 3
runtime: python27
api_version: 1
threadsafe: true

libraries:
- name: webapp2
  version: latest
- name: jinja2
  version: latest

handlers:
- url: /static
  static_dir: static

- url: /.*
  script: main.application
  login: required
  secure: always
  # auth_fail_action: unauthorized

env_variables:
  CLIENT_ID: ${CLIENT_ID}
  CLIENT_SECRET: ${CLIENT_SECRET}
  ORG: ${ORG}
  ACCESS_TOKEN: ${ACCESS_TOKEN}
  SESSION_SECRET: ${SESSION_SECRET}
Any ideas?

This solution is simple but may not suit all teams.
First, put the environment variables in an env_variables.yaml, e.g.,
env_variables:
  SECRET: 'my_secret'
Then, include this env_variables.yaml in the app.yaml
includes:
- env_variables.yaml
Finally, add the env_variables.yaml to .gitignore, so that the secret variables won't exist in the repository.
In this case, the env_variables.yaml needs to be shared among the deployment managers.
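As a side note, values declared under env_variables (whether inline in app.yaml or pulled in via includes) show up as ordinary process environment variables at runtime, so the application reads them with os.environ. A minimal sketch, reusing the SECRET name from the example above:
import os

# env_variables from app.yaml / env_variables.yaml are exposed to the runtime
# as ordinary environment variables.
SECRET = os.environ.get('SECRET')  # 'my_secret' in the example above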

If it's sensitive data, you should not store it in source code as it will be checked into source control. The wrong people (inside or outside your organization) may find it there. Also, your development environment probably uses different config values from your production environment. If these values are stored in code, you will have to run different code in development and production, which is messy and bad practice.
In my projects, I put config data in the datastore using this class:
from google.appengine.ext import ndb

class Settings(ndb.Model):
    name = ndb.StringProperty()
    value = ndb.StringProperty()

    @staticmethod
    def get(name):
        NOT_SET_VALUE = "NOT SET"
        retval = Settings.query(Settings.name == name).get()
        if not retval:
            retval = Settings()
            retval.name = name
            retval.value = NOT_SET_VALUE
            retval.put()
        if retval.value == NOT_SET_VALUE:
            raise Exception(('Setting %s not found in the database. A placeholder ' +
                'record has been created. Go to the Developers Console for your app ' +
                'in App Engine, look up the Settings record with name=%s and enter ' +
                'its value in that record\'s value field.') % (name, name))
        return retval.value
Your application would do this to get a value:
API_KEY = Settings.get('API_KEY')
If there is a value for that key in the datastore, you will get it. If there isn't, a placeholder record will be created and an exception will be thrown. The exception will remind you to go to the Developers Console and update the placeholder record.
I find this takes the guessing out of setting config values. If you are unsure of what config values to set, just run the code and it will tell you!
The code above uses the ndb library which uses memcache and the datastore under the hood, so it's fast.
Update:
jelder asked how to find the Datastore values in the App Engine console and set them. Here is how:
Go to https://console.cloud.google.com/datastore/
Select your project at the top of the page if it's not already selected.
In the Kind dropdown box, select Settings.
If you ran the code above, your keys will show up. They will all have the value NOT SET. Click each one and set its value.
Hope this helps!

This didn't exist when you posted, but for anyone else who stumbles in here, Google now offers a service called Secret Manager.
It's a simple REST service (with SDKs wrapping it, of course) to store your secrets in a secure location on Google Cloud Platform. This is a better approach than the Datastore, as it requires extra steps to see the stored secrets and has a finer-grained permission model: you can secure individual secrets differently for different aspects of your project, if you need to.
It offers versioning, so you can handle password changes with relative ease, as well as a robust query and management layer enabling you to discover and create secrets at runtime, if necessary.
Python SDK
Example usage:
from google.cloud import secretmanager_v1beta1 as secretmanager
secret_id = 'my_secret_key'
project_id = 'my_project'
version = 1 # use the management tools to determine version at runtime
client = secretmanager.SecretManagerServiceClient()
secret_path = client.secret_version_path(project_id, secret_id, version)
response = client.access_secret_version(secret_path)
password_string = response.payload.data.decode('UTF-8')
# use password_string -- set up database connection, call third party service, whatever
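One more note on version handling: the service also understands the literal alias "latest" in place of a version number, so if you are happy always reading the newest enabled version, a variation of the example above (same client, same assumed project and secret names) avoids hard-coding version = 1. This is a sketch, assuming the alias is available on the API version you are using:
from google.cloud import secretmanager_v1beta1 as secretmanager

client = secretmanager.SecretManagerServiceClient()
# "latest" is an alias the service resolves to the most recent enabled version
secret_path = client.secret_version_path('my_project', 'my_secret_key', 'latest')
response = client.access_secret_version(secret_path)
password_string = response.payload.data.decode('UTF-8')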

My approach is to store client secrets only within the App Engine app itself. The client secrets are neither in source control nor on any local computers. This has the benefit that any App Engine collaborator can deploy code changes without having to worry about the client secrets.
I store client secrets directly in Datastore and use Memcache for improved latency when accessing the secrets. The Datastore entities only need to be created once and will persist across future deploys. Of course, the App Engine console can be used to update these entities at any time.
There are two options to perform the one-time entity creation:
Use the App Engine Remote API interactive shell to create the entities.
Create an Admin only handler that will initialize the entities with dummy values. Manually invoke this admin handler, then use the App Engine console to update the entities with the production client secrets.
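For illustration, here is a rough sketch of the second option: an admin-only handler that seeds placeholder entities, plus a Memcache-fronted getter. The ClientSecret model, the /admin/init_secrets route, and the secret names are made up for the sketch, not part of the original answer:
from google.appengine.api import memcache
from google.appengine.ext import ndb
import webapp2

class ClientSecret(ndb.Model):  # hypothetical kind name
    value = ndb.StringProperty()

def get_secret(name):
    # Check Memcache first for latency, then fall back to Datastore.
    cached = memcache.get('secret:' + name)
    if cached is not None:
        return cached
    entity = ClientSecret.get_by_id(name)
    if entity is None or entity.value == 'REPLACE_ME':
        raise KeyError('Client secret %s has not been set yet' % name)
    memcache.set('secret:' + name, entity.value)
    return entity.value

class InitSecretsHandler(webapp2.RequestHandler):
    # One-time seeding handler; protect it with login: admin in app.yaml.
    def get(self):
        for name in ('CLIENT_ID', 'CLIENT_SECRET'):  # example names
            if ClientSecret.get_by_id(name) is None:
                ClientSecret(id=name, value='REPLACE_ME').put()
        self.response.write('Placeholders created; fill them in via the Datastore console.')

app = webapp2.WSGIApplication([('/admin/init_secrets', InitSecretsHandler)])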

The best way to do it is to store the keys in a client_secrets.json file and exclude it from being uploaded to git by listing it in your .gitignore file. If you have different keys for different environments, you can use the app_identity API to determine what the app id is and load the appropriate file.
There is a fairly comprehensive example here -> https://developers.google.com/api-client-library/python/guide/aaa_client_secrets.
Here's some example code:
# declare your app ids as globals ...
APPID_LIVE = 'awesomeapp'
APPID_DEV = 'awesomeapp-dev'
APPID_PILOT = 'awesomeapp-pilot'

# create a dictionary mapping the app_ids to the filepaths ...
client_secrets_map = {APPID_LIVE: 'client_secrets_live.json',
                      APPID_DEV: 'client_secrets_dev.json',
                      APPID_PILOT: 'client_secrets_pilot.json'}

# get the filename based on the current app_id ...
client_secrets_filename = client_secrets_map.get(
    app_identity.get_application_id(),
    APPID_DEV  # fall back to dev
)

# use the filename to construct the flow ...
flow = flow_from_clientsecrets(filename=client_secrets_filename,
                               scope=scope,
                               redirect_uri=redirect_uri)

# or, you could load up the json file manually if you need more control ...
f = open(client_secrets_filename, 'r')
client_secrets = json.loads(f.read())
f.close()

This solution relies on the deprecated appcfg.py
You can use the -E command line option of appcfg.py to set up the environment variables when you deploy your app to GAE (appcfg.py update)
$ appcfg.py
...
-E NAME:VALUE, --env_variable=NAME:VALUE
Set an environment variable, potentially overriding an
env_variable value from app.yaml file (flag may be
repeated to set multiple variables).
...

Most answers are outdated. Using Google Cloud Datastore is actually a bit different now. https://cloud.google.com/python/getting-started/using-cloud-datastore
Here's an example:
from google.cloud import datastore
client = datastore.Client()
datastore_entity = client.get(client.key('settings', 'TWITTER_APP_KEY'))
connection_string_prod = datastore_entity.get('value')
This assumes the entity name is 'TWITTER_APP_KEY', the kind is 'settings', and 'value' is a property of the TWITTER_APP_KEY entity.

With GitHub Actions instead of Google Cloud triggers (Cloud Build triggers aren't able to find their own app.yaml and manage the environment variables by themselves).
Here is how to do it:
My environment:
App Engine standard (not flex),
a Node.js Express application,
a PostgreSQL Cloud SQL instance
First, the setup:
1. Create a new Google Cloud project (or select an existing project).
2. Initialize your App Engine app with your project.
3. Create a Google Cloud service account or select an existing one.
4. Add the following Cloud IAM roles to your service account:
App Engine Admin - allows for the creation of new App Engine apps
Service Account User - required to deploy to App Engine as the service account
Storage Admin - allows upload of source code
Cloud Build Editor - allows building of source code
5. Download a JSON service account key for the service account.
6. Add the following secrets to your repository's secrets:
GCP_PROJECT: the Google Cloud project ID
GCP_SA_KEY: the downloaded service account key
The app.yaml
runtime: nodejs14
env: standard

env_variables:
  SESSION_SECRET: $SESSION_SECRET

beta_settings:
  cloud_sql_instances: SQL_INSTANCE
Then the GitHub Action:
name: Build and Deploy to GKE
on: push
env:
  PROJECT_ID: ${{ secrets.GKE_PROJECT }}
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
jobs:
  setup-build-publish-deploy:
    name: Setup, Build, Publish, and Deploy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: '12'
      - run: npm install
      - uses: actions/checkout@v1
      - uses: ikuanyshbekov/app-yaml-env-compiler@v1.0
        env:
          SESSION_SECRET: ${{ secrets.SESSION_SECRET }}
      - shell: bash
        run: |
          sed -i 's/SQL_INSTANCE/'${{ secrets.DATABASE_URL }}'/g' app.yaml
      - uses: actions-hub/gcloud@master
        env:
          PROJECT_ID: ${{ secrets.GKE_PROJECT }}
          APPLICATION_CREDENTIALS: ${{ secrets.GCLOUD_AUTH }}
          CLOUDSDK_CORE_DISABLE_PROMPTS: 1
        with:
          args: app deploy app.yaml
To add secrets for the GitHub Action, go to Settings/Secrets in your repository.
Note that I could have handled all the substitution with the bash script alone, so I would not depend on the GitHub project "ikuanyshbekov/app-yaml-env-compiler@v1.0".
It's a shame that GAE doesn't offer an easier way to handle environment variables for the app.yaml. I didn't want to use KMS since I also need to update the beta_settings/Cloud SQL instance; I really needed to substitute everything in the app.yaml.
This way I can make a specific action for the right environment and manage the secrets.

It sounds like there are a few approaches you could take. We have a similar issue and do the following (adapted to your use case):
Create a file that stores any dynamic app.yaml values and place it on a secure server in your build environment. If you are really paranoid, you can asymmetrically encrypt the values. You can even keep this in a private repo if you need version control/dynamic pulling, or just use a shell script to copy it/pull it from the appropriate place.
Pull from git during the deployment script
After the git pull, modify the app.yaml by reading and writing it in pure python using a yaml library
The easiest way to do this is to use a continuous integration server such as Hudson, Bamboo, or Jenkins. Simply add some plug-in, script step, or workflow that does all the above items I mentioned. You can pass in environment variables that are configured in Bamboo itself for example.
In summary, just push in the values during your build process in an environment you only have access to. If you aren't already automating your builds, you should be.
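As a rough illustration of the "modify the app.yaml in pure python using a yaml library" step above, the build script could look something like this with PyYAML; the key names and the idea of reading the secrets from the build machine's environment are assumptions made for the sketch, not something prescribed by App Engine:
import os
import yaml  # PyYAML

def inject_env_variables(app_yaml_path='app.yaml', keys=('CLIENT_ID', 'CLIENT_SECRET', 'SESSION_SECRET')):
    # Read app.yaml, fill env_variables from the build environment, and write it back.
    with open(app_yaml_path) as f:
        config = yaml.safe_load(f)
    env_vars = config.setdefault('env_variables', {})
    for key in keys:
        if key in os.environ:
            env_vars[key] = os.environ[key]
    with open(app_yaml_path, 'w') as f:
        yaml.safe_dump(config, f, default_flow_style=False)

if __name__ == '__main__':
    inject_env_variables()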
Another option is what you said: put it in the database. If your reason for not doing that is that things are too slow, simply push the values into memcache as a second-layer cache, and pin the values to the instances as a first-layer cache. If the values can change and you need to update the instances without rebooting them, just keep a hash you can check to know when they change, or trigger it somehow when something you do changes the values. That should be it.

Just wanted to note how I solved this problem in JavaScript/Node.js. For local development I used the 'dotenv' npm package, which loads environment variables from a .env file into process.env. When I started using GAE I learned that environment variables need to be set in an 'app.yaml' file. Well, I didn't want to use 'dotenv' for local development and 'app.yaml' for GAE (and duplicate my environment variables between the two files), so I wrote a little script that loads app.yaml environment variables into process.env for local development. Hope this helps someone:
yaml_env.js:
(function () {
  const yaml = require('js-yaml');
  const fs = require('fs');
  const isObject = require('lodash.isobject');

  var doc = yaml.safeLoad(
    fs.readFileSync('app.yaml', 'utf8'),
    { json: true }
  );

  // The .env file will take precedence over the settings in the app.yaml file,
  // which allows me to override stuff in app.yaml (the database connection string (DATABASE_URL), for example).
  // This is optional of course. If you don't use dotenv then remove this line:
  require('dotenv/config');

  if (isObject(doc) && isObject(doc.env_variables)) {
    Object.keys(doc.env_variables).forEach(function (key) {
      // Don't set the environment variable from the yaml file value if it's already set
      process.env[key] = process.env[key] || doc.env_variables[key];
    });
  }
})();
Now include this file as early as possible in your code, and you're done:
require('../yaml_env')

You should encrypt the variables with Google KMS and embed them in your source code. (https://cloud.google.com/kms/)
echo -n the-twitter-app-key | gcloud kms encrypt \
> --project my-project \
> --location us-central1 \
> --keyring THEKEYRING \
> --key THECRYPTOKEY \
> --plaintext-file - \
> --ciphertext-file - \
> | base64
Put the scrambled (encrypted and base64-encoded) value into your environment variable (in the yaml file).
Some Pythonish code to get you started on decrypting:
import base64
import os

from google.cloud import kms_v1

kms_client = kms_v1.KeyManagementServiceClient()
name = kms_client.crypto_key_path_path("project", "global", "THEKEYRING", "THECRYPTOKEY")
twitter_app_key = kms_client.decrypt(name, base64.b64decode(os.environ.get("TWITTER_APP_KEY"))).plaintext

@Jason F's answer based on using Google Datastore is close, but the code is a bit outdated based on the sample usage in the library docs. Here's the snippet that worked for me:
from google.cloud import datastore
client = datastore.Client('<your project id>')
key = client.key('<kind e.g settings>', '<entity name>') # note: entity name not property
# get by key for this entity
result = client.get(key)
print(result)  # prints all the properties (a dict); index a specific value like result['MY_SECRET_KEY']
Partly inspired by this Medium post

Extending Martin's answer
from google.appengine.ext import ndb

class Settings(ndb.Model):
    """
    Get sensitive data settings from Datastore.

    key:String -> value:String
    key:String -> Exception

    Thanks to: Martin Omander @ Stackoverflow
    https://stackoverflow.com/a/35261091/1463812
    """
    name = ndb.StringProperty()
    value = ndb.StringProperty()

    @staticmethod
    def get(name):
        retval = Settings.query(Settings.name == name).get()
        if not retval:
            raise Exception(('Setting %s not found in the database. A placeholder ' +
                'record has been created. Go to the Developers Console for your app ' +
                'in App Engine, look up the Settings record with name=%s and enter ' +
                'its value in that record\'s value field.') % (name, name))
        return retval.value

    @staticmethod
    def set(name, value):
        exists = Settings.query(Settings.name == name).get()
        if not exists:
            s = Settings(name=name, value=value)
            s.put()
        else:
            exists.value = value
            exists.put()
        return True

There is a PyPI package called gae_env that allows you to save App Engine environment variables in Cloud Datastore. Under the hood, it also uses Memcache, so it's fast.
Usage:
import gae_env
API_KEY = gae_env.get('API_KEY')
If there is a value for that key in the datastore, it will be returned.
If there isn't, a placeholder record __NOT_SET__ will be created and a ValueNotSetError will be thrown. The exception will remind you to go to the Developers Console and update the placeholder record.
Similar to Martin's answer, here is how to update the value for the key in Datastore:
Go to Datastore Section in the developers console
Select your project at the top of the page if it's not already selected.
In the Kind dropdown box, select GaeEnvSettings.
Keys for which an exception was raised will have value __NOT_SET__.
Go to the package's GitHub page for more info on usage/configuration

My solution is to replace the secrets in the app.yaml file via a GitHub Action and GitHub secrets.
app.yaml (App Engine)
env_variables:
  SECRET_ONE: $SECRET_ONE
  ANOTHER_SECRET: $ANOTHER_SECRET
workflow.yaml (Github)
steps:
  - uses: actions/checkout@v2
  - uses: 73h/gae-app-yaml-replace-env-variables@v0.1
    env:
      SECRET_ONE: ${{ secrets.SECRET_ONE }}
      ANOTHER_SECRET: ${{ secrets.ANOTHER_SECRET }}
Here you can find the Github action.
https://github.com/73h/gae-app-yaml-replace-env-variables
When developing locally, I write the secrets to an .env file.
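If your app is in Python like the original question, the local side of this could be as simple as the following sketch with the python-dotenv package (an assumption; the answer itself doesn't prescribe a language), reusing the variable names from the app.yaml above:
import os
from dotenv import load_dotenv  # pip install python-dotenv

# Locally, pull the secrets from .env; on App Engine they come from app.yaml.
load_dotenv()
SECRET_ONE = os.environ.get('SECRET_ONE')
ANOTHER_SECRET = os.environ.get('ANOTHER_SECRET')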

Azure AKS/Container App can't access Key vault using managed identity

I have a docker container python app deployed on a kubernetes cluster on Azure (I also tried on a container app). I'm trying to connect this app to Azure key vault to fetch some secrets. I created a managed identity and assigned it to both but the python app always fails to find the managed identity to even attempt connecting to the key vault.
The managed identity's role assignments:
Key Vault Contributor -> on the key vault
Managed Identity Operator -> on the managed identity
Azure Kubernetes Service Contributor Role, Azure Kubernetes Service Cluster User Role, Managed Identity Operator -> on the resource group that includes the cluster
Also, in the key vault's access policies, I added the managed identity and gave it all key, secret, and certificate permissions (for now).
Python code:
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient

credential = ManagedIdentityCredential()
vault_client = SecretClient(vault_url=key_vault_uri, credential=credential)
retrieved_secret = vault_client.get_secret(secret_name)
I keep getting the error:
azure.core.exceptions.ClientAuthenticationError: Unexpected content type "text/plain; charset=utf-8"
Content: no azure identity found for request clientID
So at some point I attempted to add the managed identity clientID in the cluster secrets and load it from there and still got the same error:
Python code:
# (assumes: from kubernetes import client as kube_client, config as kube_config; import json, base64)
def get_kube_secret(self, secret_name):
    kube_config.load_incluster_config()
    v1_secrets = kube_client.CoreV1Api()
    string_secret = str(v1_secrets.read_namespaced_secret(secret_name, "redacted_namespace_name").data).replace("'", "\"")
    json_secret = json.loads(string_secret)
    return json_secret

def decode_base64_string(self, encoded_string):
    decoded_secret = base64.b64decode(encoded_string.strip())
    decoded_secret = decoded_secret.decode('UTF-8')
    return decoded_secret

managed_identity_client_id_secret = self.get_kube_secret('managed-identity-credential')['clientId']
managed_identity_client_id = self.decode_base64_string(managed_identity_client_id_secret)
Update:
I also attempted to use the secret store CSI driver, but I have a feeling I'm missing a step there. Should the python code be updated to be able to use the secret store CSI driver?
# This is a SecretProviderClass using user-assigned identity to access the key vault
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-kvname-user-msi
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"          # Set to true for using managed identity
    userAssignedIdentityID: "$CLIENT_ID"  # Set the clientID of the user-assigned managed identity to use
    vmmanagedidentityclientid: "$CLIENT_ID"
    keyvaultName: "$KEYVAULT_NAME"        # Set to the name of your key vault
    cloudName: ""                         # [OPTIONAL for Azure] if not provided, the Azure environment defaults to AzurePublicCloud
    objects: ""
    tenantId: "$AZURE_TENANT_ID"
Deployment Yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  namespace: redacted_namespace
  labels:
    app: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: redacted_image
          ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443
          imagePullPolicy: Always
          resources:
            # You must specify requests for CPU to autoscale
            # based on CPU utilization
            requests:
              cpu: "250m"
          env:
            - name: test-secrets
              valueFrom:
                secretKeyRef:
                  name: test-secrets
                  key: test-secrets
          volumeMounts:
            - name: test-secrets
              mountPath: "/mnt/secrets-store"
              readOnly: true
      volumes:
        - name: test-secrets
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "azure-kvname-user-msi"
      dnsPolicy: ClusterFirst
Update 16/01/2023
I followed the steps in the answers and the linked docs to the letter, even contacted Azure support and followed it step by step with them on the phone and the result is still the following error:
"failed to process mount request" err="failed to get objectType:secret, objectName:MongoUsername, objectVersion:: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://<RedactedVaultName>.vault.azure.net/secrets/<RedactedSecretName>/?api-version=2016-10-01: StatusCode=400 -- Original Error: adal: Refresh request failed. Status Code = '400'. Response body: {\"error\":\"invalid_request\",\"error_description\":\"Identity not found\"} Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=<RedactedClientId>&resource=https%3A%2F%2Fvault.azure.net"
Using the Secrets Store CSI Driver, you can configure the SecretProviderClass to use a workload identity by setting the clientID in the SecretProviderClass. You'll need to use the client ID of your user assigned managed identity and change the usePodIdentity and useVMManagedIdentity setting to false.
With this approach, you don't need to add any additional code in your app to retrieve the secrets. Instead, you can mount a secrets store (using CSI driver) as a volume mount in your pod and have secrets loaded as environment variables which is documented here.
This doc will walk you through setting it up on Azure, but at a high-level here is what you need to do:
Register the EnableWorkloadIdentityPreview feature using Azure CLI
Create an AKS cluster using Azure CLI with the azure-keyvault-secrets-provider add-on enabled and the --enable-oidc-issuer and --enable-workload-identity flags set
Create an Azure Key Vault and set your secrets
Create an Azure user-assigned managed identity and set an access policy on the key vault for the managed identity's client ID
Connect to the AKS cluster and create a Kubernetes ServiceAccount with annotations and labels that enable this for Azure workload identity
Create an Azure identity federated credential for the managed identity using the AKS cluster's OIDC issuer URL and Kubernetes ServiceAccount as the subject
Create a Kubernetes SecretProviderClass using clientID to use workload identity and adding a secretObjects block to enable syncing objects as environment variables using Kubernetes secret store.
Create a Kubernetes Deployment with a label to use workload identity, the serviceAccountName set to the service account you created above, volume using CSI and the secret provider class you created above, volumeMount, and finally environment variables in your container using valueFrom and secretKeyRef syntax to mount from your secret object store.
Hope that helps.
What you are referring to is called pod identity (recently deprecated in favor of workload identity).
If the cluster is configured with managed identity, you can use workload identity.
However, for AKS I suggest configuring the secret store CSI driver to fetch secrets from KV and have them as k8s secrets. To use managed identity for secret provider, refer to this doc.
Then you can configure your pods to read those secrets.
I finally figured it out. I contacted Microsoft support and it seems the AKS preview is a bit buggy, "go figure". They recommended reverting to a stable version of the CLI and using a user-assigned identity.
I did just that, but this time, instead of creating my own identity and assigning it to both the vault and the cluster (which seems to confuse it), I used the identity the cluster automatically generates for the nodes.
Maybe not the neatest solution, but it's the only one that worked for me without any issues.
Finally, some notes missing from the Azure docs:
Since the CSI driver mounts the secrets as files in the target folder, you still need to read those files yourself to load them as env variables.
For example in python:
import os

def load_secrets():
    directory = '/path/to/mounted/secrets/folder'
    if not os.path.isdir(directory):
        return
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        # checking if it is a file
        if os.path.isfile(file_path):
            with open(file_path, 'r') as file:
                file_value = file.read()
                os.environ.setdefault(filename, file_value)

How to make Azure App Service for containers "use" .env variables?

I've deployed a Python web app using Azure App Service from a Docker container in Container Registries. In my app I'm using dotenv to load secrets, and locally I'm running docker run --env-file=.env my-container to pass the .env variables, but I can't really figure out how to do it when deployed to Azure.
I'm using dotenv in the following way:
import os
from dotenv import load_dotenv
load_dotenv()
SERVER = os.getenv("SERVER_NAME")
DATABASE = os.getenv("DB_NAME")
USERNAME = os.getenv("USERNAME")
PASSWORD = os.getenv("PASSWORD")
PORT = os.getenv("PORT", default=1433)
DRIVER = os.getenv("DRIVER")
How can I have my container fetch the .env variables?
I've added the secrets to Azure Key Vault, but I'm not sure how to pass these to the container.
In terms of passing variables to a Dockerfile, check my answer here.
Please add the argument after FROM:
FROM alpine
ARG serverName
RUN echo $serverName
and then run it like this
- task: Docker@2
  inputs:
    containerRegistry: 'devopsmanual-acr'
    command: 'build'
    Dockerfile: 'stackoverflow/85-docker/DOCKERFILE'
    arguments: '--build-arg a_version=$(SERVER_NAME)'
In terms of fetching values from KeyVault you can use Azure Key Vault task
# Azure Key Vault
# Download Azure Key Vault secrets
- task: AzureKeyVault@1
  inputs:
    azureSubscription:
    keyVaultName:
    secretsFilter: '*'
    runAsPreJob: false # Azure DevOps Services only
Be aware that by default, variables created by this task are marked as secret, so they are not mapped to environment variables.
You can still try your approach, but first you need to map them.
- powershell: |
    Write-Host "Using an input-macro works: $(mySecret)"
    Write-Host "Using the env var directly does not work: $env:MYSECRET"
    Write-Host "Using a global secret var mapped in the pipeline does not work either: $env:GLOBAL_MYSECRET"
    Write-Host "Using a global non-secret var mapped in the pipeline works: $env:GLOBAL_MY_MAPPED_ENV_VAR"
    Write-Host "Using the mapped env var for this task works and is recommended: $env:MY_MAPPED_ENV_VAR"
  env:
    MY_MAPPED_ENV_VAR: $(mySecret) # the recommended way to map to an env variable
To pick up secrets from Key Vault and use them as env vars in your app, use Key Vault references as described here:
https://learn.microsoft.com/en-us/azure/app-service/app-service-key-vault-references
Then just add the reference to your App Settings. For example:
@Microsoft.KeyVault(SecretUri=https://myvault.vault.azure.net/secrets/mysecret/)
That's it. No need to amend your dotenv code to do anything special, as App Settings are already injected as environment variables by App Service into your application.
Don't forget to add your App Service instance (managed identity) to the Key Vault's access policy, otherwise none of this works -
https://learn.microsoft.com/en-us/azure/app-service/app-service-key-vault-references#granting-your-app-access-to-key-vault

setting up s3 for logs in airflow

I am using docker-compose to set up a scalable airflow cluster. I based my approach off of this Dockerfile https://hub.docker.com/r/puckel/docker-airflow/
My problem is getting the logs set up to write to/read from S3. When a DAG has completed, I get an error like this:
*** Log file isn't local.
*** Fetching here: http://ea43d4d49f35:8793/log/xxxxxxx/2017-06-26T11:00:00
*** Failed to fetch log file from worker.
*** Reading remote logs...
Could not read logs from s3://buckets/xxxxxxx/airflow/logs/xxxxxxx/2017-06-26T11:00:00
I set up a new section in the airflow.cfg file like this
[MyS3Conn]
aws_access_key_id = xxxxxxx
aws_secret_access_key = xxxxxxx
aws_default_region = xxxxxxx
And then specified the s3 path in the remote logs section in airflow.cfg
remote_base_log_folder = s3://buckets/xxxx/airflow/logs
remote_log_conn_id = MyS3Conn
Did I set this up properly and there is a bug? Is there a recipe for success here that I am missing?
-- Update
I tried exporting in URI and JSON formats and neither seemed to work. I then exported the aws_access_key_id and aws_secret_access_key and airflow started picking them up. Now I get this error in the worker logs:
6/30/2017 6:05:59 PM INFO:root:Using connection to: s3
6/30/2017 6:06:00 PM ERROR:root:Could not read logs from s3://buckets/xxxxxx/airflow/logs/xxxxx/2017-06-30T23:45:00
6/30/2017 6:06:00 PM ERROR:root:Could not write logs to s3://buckets/xxxxxx/airflow/logs/xxxxx/2017-06-30T23:45:00
6/30/2017 6:06:00 PM Logging into: /usr/local/airflow/logs/xxxxx/2017-06-30T23:45:00
-- Update
I found this link as well
https://www.mail-archive.com/dev@airflow.incubator.apache.org/msg00462.html
I then shelled into one of my worker machines (separate from the webserver and scheduler) and ran this bit of code in python
import airflow
s3 = airflow.hooks.S3Hook('s3_conn')
s3.load_string('test', airflow.conf.get('core', 'remote_base_log_folder'))
I receive this error.
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
I tried exporting several different types of AIRFLOW_CONN_ envs as explained here in the connections section https://airflow.incubator.apache.org/concepts.html and by other answers to this question.
s3://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@S3
{"aws_account_id":"<xxxxx>","role_arn":"arn:aws:iam::<xxxx>:role/<xxxxx>"}
{"aws_access_key_id":"<xxxxx>","aws_secret_access_key":"<xxxxx>"}
I have also exported AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with no success.
These credentials are being stored in a database so once I add them in the UI they should be picked up by the workers but they are not able to write/read logs for some reason.
UPDATE Airflow 1.10 makes logging a lot easier.
For s3 logging, set up the connection hook as per the above answer
and then simply add the following to airflow.cfg
[core]
# Airflow can store logs remotely in AWS S3. Users must supply a remote
# location URL (starting with either 's3://...') and an Airflow connection
# id that provides access to the storage location.
remote_base_log_folder = s3://my-bucket/path/to/logs
remote_log_conn_id = MyS3Conn
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False
For gcs logging,
Install the gcp_api package first, like so: pip install apache-airflow[gcp_api].
Set up the connection hook as per the above answer
Add the following to airflow.cfg
[core]
# Airflow can store logs remotely in AWS S3. Users must supply a remote
# location URL (starting with either 's3://...') and an Airflow connection
# id that provides access to the storage location.
remote_logging = True
remote_base_log_folder = gs://my-bucket/path/to/logs
remote_log_conn_id = MyGCSConn
NOTE: As of Airflow 1.9 remote logging has been significantly altered. If you are using 1.9, read on.
Reference here
Complete Instructions:
Create a directory to store configs and place this so that it can be found in PYTHONPATH. One example is $AIRFLOW_HOME/config
Create empty files called $AIRFLOW_HOME/config/log_config.py and
$AIRFLOW_HOME/config/__init__.py
Copy the contents of airflow/config_templates/airflow_local_settings.py into the log_config.py file that was just created in the step above.
Customize the following portions of the template:
#Add this variable to the top of the file. Note the trailing slash.
S3_LOG_FOLDER = 's3://<bucket where logs should be persisted>/'
Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG
LOGGING_CONFIG = ...
Add a S3TaskHandler to the 'handlers' block of the LOGGING_CONFIG variable
's3.task': {
    'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
    'formatter': 'airflow.task',
    'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
    's3_log_folder': S3_LOG_FOLDER,
    'filename_template': FILENAME_TEMPLATE,
},
Update the airflow.task and airflow.task_runner blocks to use 's3.task' instead of 'file.task'.
'loggers': {
    'airflow.task': {
        'handlers': ['s3.task'],
        ...
    },
    'airflow.task_runner': {
        'handlers': ['s3.task'],
        ...
    },
    'airflow': {
        'handlers': ['console'],
        ...
    },
}
Make sure a s3 connection hook has been defined in Airflow, as per the above answer. The hook should have read and write access to the s3 bucket defined above in S3_LOG_FOLDER.
Update $AIRFLOW_HOME/airflow.cfg to contain:
task_log_reader = s3.task
logging_config_class = log_config.LOGGING_CONFIG
remote_log_conn_id = <name of the s3 platform hook>
Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.
Verify that logs are showing up for newly executed tasks in the bucket you’ve defined.
Verify that the s3 storage viewer is working in the UI. Pull up a newly executed task, and verify that you see something like:
*** Reading remote log from gs://<bucket where logs should be persisted>/example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log.
[2017-10-03 21:57:50,056] {cli.py:377} INFO - Running on host chrisr-00532
[2017-10-03 21:57:50,093] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py']
[2017-10-03 21:57:51,264] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,263] {__init__.py:45} INFO - Using executor SequentialExecutor
[2017-10-03 21:57:51,306] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,306] {models.py:186} INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py
Airflow 2.4.2
Follow the steps above but paste this into log_config.py
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
from airflow import configuration as conf
from copy import deepcopy
S3_LOG_FOLDER = 's3://your/s3/log/folder'
LOG_LEVEL = conf.get('logging', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('logging', 'log_format')
BASE_LOG_FOLDER = conf.get('logging', 'BASE_LOG_FOLDER')
PROCESSOR_LOG_FOLDER = conf.get('scheduler', 'child_process_log_directory')
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'
LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
# Attach formatters to loggers (airflow.task, airflow.processor)
LOGGING_CONFIG['formatters']['airflow.task'] = { 'format': LOG_FORMAT }
LOGGING_CONFIG['formatters']['airflow.processor'] = { 'format': LOG_FORMAT }
# Add an S3 task handler
LOGGING_CONFIG['handlers']['s3.task'] = {
    'class': 'airflow.providers.amazon.aws.log.s3_task_handler.S3TaskHandler',
    'formatter': 'airflow.task',
    'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
    's3_log_folder': S3_LOG_FOLDER,
    'filename_template': FILENAME_TEMPLATE
}
# Specify handler for airflow.task
LOGGING_CONFIG['loggers']['airflow.task']['handlers'] = ['task', 's3.task']
You need to set up the S3 connection through Airflow UI. For this, you need to go to the Admin -> Connections tab on airflow UI and create a new row for your S3 connection.
An example configuration would be:
Conn Id: my_conn_S3
Conn Type: S3
Extra: {"aws_access_key_id":"your_aws_key_id", "aws_secret_access_key": "your_aws_secret_key"}
(Updated as of Airflow 1.10.2)
Here's a solution if you don't use the admin UI.
My Airflow doesn't run on a persistent server ... (It gets launched afresh every day in a Docker container, on Heroku.) I know I'm missing out on a lot of great features, but in my minimal setup, I never touch the admin UI or the cfg file. Instead, I have to set Airflow-specific environment variables in a bash script, which overrides the .cfg file.
apache-airflow[s3]
First of all, you need the s3 subpackage installed to write your Airflow logs to S3. (boto3 works fine for the Python jobs within your DAGs, but the S3Hook depends on the s3 subpackage.)
One more side note: conda install doesn't handle this yet, so I have to do pip install apache-airflow[s3].
Environment variables
In a bash script, I set these core variables. Starting from these instructions but using the naming convention AIRFLOW__{SECTION}__{KEY} for environment variables, I do:
export AIRFLOW__CORE__REMOTE_LOGGING=True
export AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://bucket/key
export AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_uri
export AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
S3 connection ID
The s3_uri above is a connection ID that I made up. In Airflow, it corresponds to another environment variable, AIRFLOW_CONN_S3_URI. The value of that is your S3 path, which has to be in URI form. That's
s3://access_key:secret_key@bucket/key
Store this however you handle other sensitive environment variables.
With this configuration, Airflow will be able to write your logs to S3. They will follow the path of s3://bucket/key/dag/task_id/timestamp/1.log.
Appendix on upgrading from Airflow 1.8 to Airflow 1.10
I recently upgraded my production pipeline from Airflow 1.8 to 1.9, and then 1.10. Good news is that the changes are pretty tiny; the rest of the work was just figuring out nuances with the package installations (unrelated to the original question about S3 logs).
(1) First of all, I needed to upgrade to Python 3.6 with Airflow 1.9.
(2) The package name changed from airflow to apache-airflow with 1.9. You also might run into this in your pip install.
(3) The package psutil has to be in a specific version range for Airflow. You might encounter this when you're doing pip install apache-airflow.
(4) python3-dev headers are needed with Airflow 1.9+.
(5) Here are the substantive changes: export AIRFLOW__CORE__REMOTE_LOGGING=True is now required. And
(6) The logs have a slightly different path in S3, which I updated in the answer: s3://bucket/key/dag/task_id/timestamp/1.log.
But that's it! The logs did not work in 1.9, so I recommend just going straight to 1.10, now that it's available.
To complete Arne's answer with the recent Airflow updates: you do not need to set task_log_reader to another value than the default one, task.
If you follow the default logging template airflow/config_templates/airflow_local_settings.py, you can see since this commit (note the handler's name changed to 's3': {'task'...} instead of s3.task) that the value of the remote folder (REMOTE_BASE_LOG_FOLDER) will replace the handler with the right one:
REMOTE_LOGGING = conf.get('core', 'remote_logging')

if REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('s3://'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['s3'])
elif REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('gs://'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['gcs'])
elif REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('wasb'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['wasb'])
elif REMOTE_LOGGING and ELASTICSEARCH_HOST:
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['elasticsearch'])
More details on how to log to/read from S3 : https://github.com/apache/incubator-airflow/blob/master/docs/howto/write-logs.rst#writing-logs-to-amazon-s3
Phew! My motivation for nipping the Airflow bugs in the bud is to confront this as a bunch of Python files XD. Here's my experience on this with apache-airflow==1.9.0.
First of all, there's simply no need to try
airflow connections .......... --conn_extra etc., etc.
Just set your airflow.cfg as:
remote_logging = True
remote_base_log_folder = s3://dev-s3-main-ew2-dmg-immutable-potns/logs/airflow-logs/
encrypt_s3_logs = False
# Logging level
logging_level = INFO
fab_logging_level = WARN
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class = log_config.LOGGING_CONFIG
remote_log_conn_id = s3://<ACCESS-KEY>:<SECRET-ID>@<MY-S3-BUCKET>/<MY>/<SUB>/<FOLDER>/
Keep the $AIRFLOW_HOME/config/__init__.py and $AIRFLOW_HOME/config/log_config.py files as above.
The problem for me was a missing "boto3" package, which I could get to by:
vi /usr/lib/python3.6/site-packages/airflow/utils/log/s3_task_handler.py
then >> import traceback
and in the line containing:
Could not create an S3Hook with connection id "%s". '
'Please make sure that airflow[s3] is installed and '
'the S3 connection exists.
doing a traceback.print_exc() and well it started cribbing about missing boto3!
Installed it and Life was beautiful back again!
Just a side note to anyone following the very useful instructions in the above answer:
If you stumble upon this issue: "ModuleNotFoundError: No module named 'airflow.utils.log.logging_mixin.RedirectStdHandler'" as referenced here (which happens when using airflow 1.9), the fix is simple: use this base template instead: https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/config_templates/airflow_local_settings.py (and follow all other instructions in the above answer)
The current template incubator-airflow/airflow/config_templates/airflow_local_settings.py present in master branch contains a reference to the class "airflow.utils.log.s3_task_handler.S3TaskHandler", which is not present in apache-airflow==1.9.0 python package.
Hope this helps!
I have it working with Airflow 1.10 in Kubernetes.
I have the following environment variables set:
AIRFLOW_CONN_LOGS_S3=s3://id:secret_uri_encoded@S3
AIRFLOW__CORE__REMOTE_LOGGING=True
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://xxxx/logs
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=logs_s3
For airflow 2.3.4, using Docker, I also faced issues with logging to s3.
Initially I faced some permission errors (although my IAM Role was set fine), then after changing the config a bit I was able to write the files in the correct location, but could not read (falling back to local log).
Anyway, after many efforts, debugging, trial and error attempts, here is what worked for me:
Define a connection for s3 (assuming your region is also eu-west-1):
Either via the UI, in which case you need to set:
Connection Id: my-conn (or whatever name you prefer),
Connection Type: Amazon Web Services (this is one change that I did, s3 did not work for me),
Extra: {"region_name": "eu-west-1", "endpoint_url": "https://s3.eu-west-1.amazonaws.com"}
Or via the CLI:
airflow connections add my-conn --conn-type aws --conn-extra '{"region_name": "eu-west-1", "endpoint_url": "https://s3.eu-west-1.amazonaws.com"}'
As for airflow config, I have set those in all processes:
...
export AIRFLOW__LOGGING__REMOTE_LOGGING=True
export AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=s3://my-bucket/path/to/log/folder
export AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=my-conn
...
After deploying, I was still getting errors like Falling back to local log..., but eventually the file was loaded and displayed (after a few refreshes).
It seems to work OK now though :)

How to access a remote datastore when running dev_appserver.py?

I'm attempting to run a localhost web server that has remote api access to a remote datastore using the remote_api_stub method ConfigureRemoteApiForOAuth.
I have been using the following Google doc for reference but find it rather sparse:
https://cloud.google.com/appengine/docs/python/tools/remoteapi
I believe I'm missing the authentication bit, but can't find a concrete resource to guide me. What would be the easiest way, given the follow code example, to access a remote datastore while running dev_appserver.py?
import webapp2
from google.appengine.ext import ndb
from google.appengine.ext.remote_api import remote_api_stub

class Topic(ndb.Model):
    created_by = ndb.StringProperty()
    subject = ndb.StringProperty()

    @classmethod
    def query_by_creator(cls, creator):
        return cls.query(Topic.created_by == creator)

class MainPage(webapp2.RequestHandler):
    def get(self):
        remote_api_stub.ConfigureRemoteApiForOAuth(
            '#####.appspot.com',
            '/_ah/remote_api'
        )
        topics = Topic.query_by_creator('bill')
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('<html><body>')
        self.response.out.write('<h1>TOPIC SUBJECTS:<h1>')
        for topic in topics.fetch(10):
            self.response.out.write('<h3>' + topic.subject + '<h3>')
        self.response.out.write('</body></html>')

app = webapp2.WSGIApplication([
    ('/', MainPage)
], debug=True)
This gets asked a lot, simply because you can't use App Engine's libraries outside of the SDK. However, there is also an easier way to do it from within the App Engine SDK as well.
I would use gcloud for this. Here's how to set it up:
If you want to interact with google cloud storage services inside or outside of the App Engine environment, you may use Gcloud (https://googlecloudplatform.github.io/gcloud-python/stable/) to do so.
You need a service account on your application as well as download the JSON credentials file. You do this on the app engine console under the authentication tab. Create it, and then download it. Call it client_secret.json or something.
With those, once you install the proper packages for gcloud with pip, you'll be able to make queries as well as write data.
Here is an example of authenticating yourself to use the library:
import os

from gcloud import datastore
# the location of the JSON file on your local machine
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/location/client_secret.json"
# project ID from the Developers Console
projectID = "THE_ID_OF_YOUR_PROJECT"
os.environ["GCLOUD_TESTS_PROJECT_ID"] = projectID
os.environ["GCLOUD_TESTS_DATASET_ID"] = projectID
client = datastore.Client(dataset_id=projectID)
Once that's done, you can make queries like this:
query = client.query(kind='Model').fetch()
It's actually super easy. Any who, that's how I would do that! Cheers.

Google App Engine app deployment

I'm trying to deploy a basic HelloWorld app on Google Engine following their tutorial for Python. I created a helloworld.py file and an app.yaml file and copied the code from their online tutorial. I signed up for Google App Engine and have tried everything but keep getting the following error:
2015-08-19 19:22:08,006 ERROR appcfg.py:2438 An error occurred processing file '':
HTTP Error 403: Forbidden Unexpected HTTP status 403. Aborting.
Error 403: --- begin server output ---
You do not have permission to modify this app (app_id=u'helloworld').
--- end server output ---
If deploy fails you might need to 'rollback' manually.
The "Make Symlinks..." menu option can help with command-line work.
*** appcfg.py has finished with exit code 1 ***
Here is the code from helloworld.py:
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, Udacity!')

app = webapp2.WSGIApplication([
    ('/', MainPage),
], debug=True)
And here is my code from the app.yaml file:
application: helloworld
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: helloworld.app
Is there a problem with permissions i.e. Google App's or my laptop's settings? I have tried everything that's out there on stackoverflow so any new suggestions will be much appreciated!!
In my case, I got refused because appcfg saved my OAuth2 token in the file ~/.appcfg_oauth2_tokens, which happened to belong to another App Engine account. Simply remove the file and try again. It should work. This is on Mac; I am not sure about Windows, though.
Ok, there is a MUCH easier way to do this now.
If you are getting the message "You do not have permission to modify this app"
but your id is correct within app.yaml, do the following:
Bring up the Google App Engine Launcher on your desktop
Click on the Control tab on the top left --> "Clear Deployment Credentials"
That's it!!!
The application name in app.yaml is sort of like a domain name. Once someone has reserved it, no one else can use it. You need to go here, and then select "Create a project..." from the dropdown at the top of the screen. In the popup, it will suggest a project id, or you can select your own. Many project id's are taken so you will need to specify something unusual to get something that is not taken.
You then put this project id in your app.yaml in the application line. You should then be able to upload your project.
Make sure you have created a project in GAE whose project ID is exactly the same as the one configured in your app.yaml.
workflow:
Create a project in GAE and set a name
At the same window you can edit the project ID
copy the project ID and paste it when you create a new application in GAE launcher (or edit your app.yaml to set the value of application)
deploy it and type {project-id}.appspot.com in your browser. Good luck!
The App ID ('application:') mentioned in app.yaml is 'helloworld', which seems to be the default for the hello world app. Create a new app using https://console.cloud.google.com/home/dashboard and use the new app ID in app.yaml.
Please have a look at Application for gae does not deploy, which answers a similar question for the Linux platform, where we need to delete ~/.appcfg_oauth2_tokens* to resolve the permission error.
