Python Dataflow Template, making Runtime Parameters globally accessible

The aim of the pipeline is to use runtime variables to open a CSV file and to name a BigQuery table.
All I need is to access these variables globally or within a ParDo, for example by passing them into the function.
I have tried creating a dummy string and then running a FlatMap to read the runtime parameters and make them global, but it returns nothing.
For example:
class CustomPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--path',
            type=str,
            help='csv storage path')
        parser.add_value_provider_argument(
            '--table_name',
            type=str,
            help='Table Id')
def run():
    def rewrite_values(element):
        """ Rewrite default env values"""
        # global project_id
        # global campaign_id
        # global organization_id
        # global language
        # global file_path
        try:
            logging.info("File Path with str(): {}".format(str(custom_options.path)))
            logging.info("----------------------------")
            logging.info("element: {}".format(element))
            project_id = str(cloud_options.project)
            file_path = custom_options.path.get()
            table_name = custom_options.table_name.get()
            logging.info("project: {}".format(project_id))
            logging.info("File path: {}".format(file_path))
            logging.info("language: {}".format(table_name))
            logging.info("----------------------------")
        except Exception as e:
            logging.info("Error format----------------------------")
            raise KeyError(e)
        return file_path
    pipeline_options = PipelineOptions()
    cloud_options = pipeline_options.view_as(GoogleCloudOptions)
    custom_options = pipeline_options.view_as(CustomPipelineOptions)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    # Beginning of the pipeline
    p = beam.Pipeline(options=pipeline_options)
    init_data = (p
                 | beam.Create(["Start"])
                 | beam.FlatMap(rewrite_values))
    # pipeline processing, running pipeline etc.
I can access the project variable without a problem, but everything else comes back blank.
If I make the custom_options variable global, or pass a specific custom option into a function, such as | 'Read data' >> beam.ParDo(ReadGcsBlobs(path_file=custom_options.path)), I only get something like RuntimeValueProvider(option: path, type: str, default_value: None).
If I use | 'Read data' >> beam.ParDo(ReadGcsBlobs(path_file=custom_options.path.get())), the variable is an empty string.
So in essence, I just need to access these variables globally. Is that possible?
Finally, to clarify: I do not want to use the ReadFromText method. I can use the runtime variable there, but incorporating the runtime options into the dict created from the CSV file would be too costly, as I am working with huge CSV files.

For me it worked by declaring cloud_options and custom_options as global:
import argparse, logging
import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
class CustomPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--path',
            type=str,
            help='csv storage path')
        parser.add_value_provider_argument(
            '--table_name',
            type=str,
            help='Table Id')
def rewrite_values(element):
    """ Rewrite default env values"""
    # global project_id
    # global campaign_id
    # global organization_id
    # global language
    # global file_path
    try:
        logging.info("File Path with str(): {}".format(str(custom_options.path.get())))
        logging.info("----------------------------")
        logging.info("element: {}".format(element))
        project_id = str(cloud_options.project)
        file_path = custom_options.path.get()
        table_name = custom_options.table_name.get()
        logging.info("project: {}".format(project_id))
        logging.info("File path: {}".format(file_path))
        logging.info("language: {}".format(table_name))
        logging.info("----------------------------")
    except Exception as e:
        logging.info("Error format----------------------------")
        raise KeyError(e)
    return file_path
def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    global cloud_options
    global custom_options
    pipeline_options = PipelineOptions(pipeline_args)
    cloud_options = pipeline_options.view_as(GoogleCloudOptions)
    custom_options = pipeline_options.view_as(CustomPipelineOptions)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    # Beginning of the pipeline
    p = beam.Pipeline(options=pipeline_options)
    init_data = (p
                 | beam.Create(["Start"])
                 | beam.FlatMap(rewrite_values))
    result = p.run()
    # result.wait_until_finish
if __name__ == '__main__':
    run()
After setting the PROJECT and BUCKET variables I staged the template with:
python script.py \
--runner DataflowRunner \
--project $PROJECT \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--template_location gs://$BUCKET/templates/global_options
Then I executed it, providing the path and table_name options:
gcloud dataflow jobs run global_options \
--gcs-location gs://$BUCKET/templates/global_options \
--parameters path=test_path,table_name=test_table
The runtime parameters then seem to be logged correctly inside the FlatMap.
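The difference between the attempts in the question and this answer comes down to when `.get()` is called. The sketch below models that with a hypothetical stand-in class, `DeferredValue`. It is not Beam's `RuntimeValueProvider`, and the names `bind` and `read_path` are assumptions made purely for illustration. It only shows why a value read while the pipeline is being built is unavailable, while the same read inside a per-element function succeeds once the job has received its parameters.

```python
class DeferredValue:
    """Hypothetical stand-in for a runtime parameter (not a Beam class)."""

    _runtime_options = None  # filled in only when the job actually starts

    def __init__(self, option_name):
        self.option_name = option_name

    @classmethod
    def bind(cls, options):
        # Simulates the runner handing over the parameters at job start.
        cls._runtime_options = dict(options)

    def get(self):
        if DeferredValue._runtime_options is None:
            raise RuntimeError(
                f"{self.option_name}.get() called before runtime")
        return DeferredValue._runtime_options[self.option_name]


def read_path(element, path_provider):
    # Resolve the value per element, i.e. at execution time.
    return f"{element}:{path_provider.get()}"


path = DeferredValue("path")

# Template build time: the value does not exist yet.
try:
    path.get()
    resolved_at_build = True
except RuntimeError:
    resolved_at_build = False

# Job start: parameters arrive, and the per-element call can read them.
DeferredValue.bind({"path": "test_path"})
result = read_path("Start", path)
```

In the real pipeline the equivalent is to pass `custom_options.path` itself (not `custom_options.path.get()`) into the transform and call `.get()` inside the function that processes each element.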

Related

pysmb from linux to Windows, Unable to connect to shared device

Trying to connect to an smb share via pysmb and getting error...
smb.smb_structs.OperationFailure: Failed to list on \\\\H021BSBD20\\shared_folder: Unable to connect to shared device
The code I am using looks like...
from smb.SMBConnection import SMBConnection
import json
import pprint
import warnings
pp = pprint.PrettyPrinter(indent=4)
PROJECT_HOME = "/path/to/my/project/"
# load configs
CONF = json.load(open(f"{PROJECT_HOME}/configs/configs.json"))
pp.pprint(CONF)
# list all files in storage smb dir
#https://pysmb.readthedocs.io/en/latest/api/smb_SMBConnection.html#smb.SMBConnection.SMBConnection.listPath
IS_DIRECT_TCP = False
CNXN_PORT = 139 if not IS_DIRECT_TCP else 445
LOCAL_IP = "172.18.4.69"
REMOTE_NAME = "H021BSBD20" # exact name shown as Device Name in System Settings
SERVICE_NAME = "\\\\H021BSBD20\\shared_folder"
REMOTE_IP = "172.18.7.102"
try:
    conn = SMBConnection(CONF['smb_creds']['username'], CONF['smb_creds']['password'],
                         my_name=LOCAL_IP, remote_name=REMOTE_NAME,
                         use_ntlm_v2=True,
                         is_direct_tcp=IS_DIRECT_TCP)
    conn.connect(REMOTE_IP, CNXN_PORT)
except Exception:
    warnings.warn("\n\nFailed to initially connect, attempting again with param use_ntlm_v2=False\n\n")
    conn = SMBConnection(CONF['smb_creds']['username'], CONF['smb_creds']['password'],
                         my_name=LOCAL_IP, remote_name=REMOTE_NAME,
                         use_ntlm_v2=False,
                         is_direct_tcp=IS_DIRECT_TCP)
    conn.connect(REMOTE_IP, CNXN_PORT)
files = conn.listPath(f'{SERVICE_NAME}', '\\')
pp.pprint(files)
Using smbclient on my machine, I can successfully connect to the share with:
[root@airflowetl etl]# smbclient -U my_user \\\\H021BSBD20\\shared_folder
The number of backslashes in the Python code is chosen to produce the same string that works with smbclient (I have tried fewer backslashes in the code and that has not helped).
Note that the user I am using to access the shared folder, both in the Python code and with smbclient, is not able to log on to the machine that hosts the share (they are only allowed to access that particular shared folder, as shown above).
Does anyone know what could be happening here? Are there any other debugging steps I could try?

After asking in the Issues section of the GitHub repo (https://github.com/miketeo/pysmb/issues/169), I was able to fix the problem. It was caused by the argument I was passing as the service_name parameter of conn.listPath().
Looking more closely at the docs for that function (https://pysmb.readthedocs.io/en/latest/api/smb_SMBConnection.html), I saw:
service_name (string/unicode) – the name of the shared folder for the path
Originally I was only looking at the function signature, which said service_name, so I assumed it would take the same value as the smbclient command-line tool, where I had been entering \\\\devicename\\sharename. The docstring shows that pysmb wants just the share name as the service_name.
So rather than
files = conn.listPath("\\\\H021BSBD20\\shared_folder", '\\')
I do
files = conn.listPath("shared_folder", '\\')
The full refactored snippet is shown below, just for reference.
import argparse
import json
import os
import pprint
import socket
import sys
import traceback
import warnings
from smb.SMBConnection import SMBConnection
def parseArguments():
    # Create argument parser
    parser = argparse.ArgumentParser()
    # Positional mandatory arguments
    parser.add_argument("project_home", help="project home path", type=str)
    parser.add_argument("device_name", help="device (eg. NetBIOS) name in configs of share to process", type=str)
    # Optional arguments
    # parser.add_argument("-dfd", "--data_file_dir",
    #                     help="path to data files dir to be pushed to sink, else source columns based on form_type",
    #                     type=str, default=None)
    # Parse arguments
    args = parser.parse_args()
    return args
args = parseArguments()
for a in args.__dict__:
    print(str(a) + ": " + str(args.__dict__[a]))
pp = pprint.PrettyPrinter(indent=4)
PROJECT_HOME = args.project_home
REMOTE_NAME = args.device_name
# load configs
CONF = json.load(open(f"{PROJECT_HOME}/configs/configs.json"))
CREDS = json.load(open(f"{PROJECT_HOME}/configs/creds.json"))
pp.pprint(CONF)
SMB_CONFS = next(info for info in CONF["smb_server_configs"] if info["device_name"] == args.device_name)
print("\nUsing details for device:")
pp.pprint(SMB_CONFS)
# list all files in storage smb dir
#https://pysmb.readthedocs.io/en/latest/api/smb_SMBConnection.html#smb.SMBConnection.SMBConnection.listPath
IS_DIRECT_TCP = False
CNXN_PORT = 139 if IS_DIRECT_TCP is False else 445
LOCAL_IP = socket.gethostname() #"172.18.4.69"
REMOTE_NAME = SMB_CONFS["device_name"]
SHARE_FOLDER = SMB_CONFS["share_folder"]
REMOTE_IP = socket.gethostbyname(REMOTE_NAME) # "172.18.7.102"
print(LOCAL_IP)
print(REMOTE_NAME)
try:
    conn = SMBConnection(CREDS['smb_creds']['username'], CREDS['smb_creds']['password'],
                         my_name=LOCAL_IP, remote_name=REMOTE_NAME,
                         use_ntlm_v2=False,
                         is_direct_tcp=IS_DIRECT_TCP)
    conn.connect(REMOTE_IP, CNXN_PORT)
except Exception:
    traceback.print_exc()
    warnings.warn("\n\nFailed to initially connect, attempting again with param use_ntlm_v2=True\n\n")
    conn = SMBConnection(CREDS['smb_creds']['username'], CREDS['smb_creds']['password'],
                         my_name=LOCAL_IP, remote_name=REMOTE_NAME,
                         use_ntlm_v2=True,
                         is_direct_tcp=IS_DIRECT_TCP)
    conn.connect(REMOTE_IP, CNXN_PORT)
# service_name is just the share name, not the full \\device\share path
files = conn.listPath(SHARE_FOLDER, '\\')
if len(files) > 0:
    print("Found listed files")
    for f in files:
        print(f.filename)
else:
    print("No files to list, this likely indicates a problem. Exiting...")
    sys.exit(255)
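If a configuration already stores the share in the `\\device\share` form that smbclient accepts, a small helper can split it into the pieces pysmb expects. This is only a sketch: `split_unc` is a hypothetical helper, not part of pysmb, and it assumes a well-formed UNC path with backslash separators.

```python
def split_unc(unc_path):
    """Split r'\\\\host\\share\\sub\\dir' into (host, share, path_in_share)."""
    if not unc_path.startswith("\\\\"):
        raise ValueError("expected a UNC path starting with two backslashes")
    # Drop the leading "\\" and split into host, share, and the remainder.
    parts = unc_path[2:].split("\\", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("UNC path must contain both a host and a share")
    host, share = parts[0], parts[1]
    # Anything after the share is a path inside it; default to the root.
    sub_path = "\\" + parts[2] if len(parts) == 3 and parts[2] else "\\"
    return host, share, sub_path


host, share, sub_path = split_unc("\\\\H021BSBD20\\shared_folder")
```

With these values, `share` is what would be passed as `service_name` and `sub_path` as the path argument of `conn.listPath`, while `host` corresponds to the `remote_name` used when creating the connection.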

Passing in 'date' as a runtime argument in Google Dataflow Template

I'm currently trying to generate a Google Dataflow custom template, that will call an API when run, and write the results to a BigQuery table.
However the issue I'm encountering is that the API requires a date parameter 'YYYY-MM-DD' to be passed in for it to work.
Unfortunately, it seems that when constructing a template, Dataflow requires a ValueProvider (as described here) for any variable that depends on when the job is run (e.g. today's date). Otherwise the template keeps using the date that was generated when it was originally created (i.e. with dt.date.today() etc.; h/t to this post).
With the code I've got, is there any way to generate the template so that it uses today's date as an argument at runtime, rather than using the same static date indefinitely or, as is currently the case, failing to convert to a template at all?
from __future__ import print_function, absolute_import
import argparse
import logging
import sys
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.metrics.metric import Metrics
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions
from apache_beam.options.value_provider import ValueProvider
import datetime as dt
from datetime import timedelta, date
import time
import re
logging.getLogger().setLevel(logging.INFO)
class GetAPI():
    def __init__(self, data={}, date=None):
        self.num_api_errors = Metrics.counter(self.__class__, 'num_api_errors')
        self.data = data
        self.date = date
    def get_job(self):
        import requests
        endpoint = f'https://www.rankranger.com/api/v2/?rank_stats&key={self.data.api_key}&date={self.date}'\
                   f'&campaign_id={self.data.campaign}&se_id={self.data.se}&domain={self.data.domain}&output=json'
        logging.info("Endpoint: {}".format(str(endpoint)))
        try:
            res = requests.get(endpoint)
            if res.status_code == 200:
                # logging.info("Reponse: {}".format(str(res.text)))
                json_data = res.json()
                ## Store the API response
                if 'result' in json_data:
                    response = json_data.get('result')
                    return response
        except Exception as e:
            self.num_api_errors.inc()
            logging.error(f'Exception: {e}')
            logging.error('Extract error on "%s"', 'Rank API')
def format_dates(api):
    api['date'] = dt.datetime.strptime(api['date'], "%m/%d/%Y").strftime("%Y-%m-%d")
    return api
# Class to pass in date generated at runtime to template
class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        ## Special runtime argument e.g. date
        parser.add_value_provider_argument('--date',
                                           type=str,
                                           default=(dt.date.today()).strftime("%Y-%m-%d"),
                                           help='Run date in YYYY-MM-DD format.')
def run(argv=None):
    """
    Main entry point; defines the static arguments to be passed in.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--api_key',
                        type=str,
                        default=API_KEY,
                        help='API key for Rank API.')
    parser.add_argument('--campaign',
                        type=str,
                        default=CAMPAIGN,
                        help='Campaign ID for Rank API')
    parser.add_argument('--se',
                        type=str,
                        default=SE,
                        help='Search Engine ID for Rank API')
    parser.add_argument('--domain',
                        type=str,
                        default=DOMAIN,
                        help='Domain for Rank API')
    parser.add_argument('--dataset',
                        type=str,
                        default=DATASET,
                        help='BigQuery Dataset to write tables to. Must already exist.')
    parser.add_argument('--table_name',
                        type=str,
                        default=TABLE_NAME,
                        help='The BigQuery table name. Should not already exist.')
    parser.add_argument('--project',
                        type=str,
                        default=PROJECT,
                        help='Your GCS project.')
    parser.add_argument('--runner',
                        type=str,
                        default="DataflowRunner",
                        help='Type of DataFlow runner.')
    args, pipeline_args = parser.parse_known_args(argv)
    # Create and set your PipelineOptions.
    options = PipelineOptions(pipeline_args)
    user_options = options.view_as(UserOptions)
    pipeline = beam.Pipeline(options=options)
    # Gets data from Rank Ranger API
    api = (
        pipeline
        | 'create' >> beam.Create(GetAPI(data=args, date=user_options.date).get_job())
        | 'format dates' >> beam.Map(format_dates)
    )
    # Write to bigquery based on specified schema
    BQ = (api | "WriteToBigQuery" >> beam.io.WriteToBigQuery(args.table_name, args.dataset, SCHEMA))
    pipeline.run()
if __name__ == '__main__':
    run()
As you can see from the error message, rather than passing in a neatly formatted 'YYYY-MM-DD' parameter, it's instead passing in the full ValueProvider object which is stopping the API call from working and returning the NoneType error.
(Apache) C:\Users\user.name\Documents\Alchemy\Dataflow\production_pipeline\templates>python main.py --runner DataflowRunner --project <PROJECT> --staging_location gs://<STORAGE-BUCKET>/staging --temp_location gs://<STORAGE-BUCKET>/temp --template_location gs://<STORAGE-BUCKET>/template/<TEMPLATE> --region europe-west2
INFO:root:Endpoint: https://www.rankranger.com/api/v2/?rank_stats&key=<API_KEY>&date=RuntimeValueProvider(option: date, type: str, default_value: '2020-08-25')&campaign_id=<CAMPAIGN>&se_id=<SE>&domain=<DOMAIN>&output=json
Traceback (most recent call last):
File "main.py", line 267, in <module>
run()
File "main.py", line 257, in run
| 'format dates' >> beam.Map(format_dates)
File "C:\Users\user.name\Anaconda3\envs\Apache\lib\site-packages\apache_beam\transforms\core.py", line 2590, in __init__
self.values = tuple(values)
TypeError: 'NoneType' object is not iterable
Any help would be hugely appreciated!

You are correct in your diagnosis. You should consider migrating to Flex Templates, which solve this and other issues and provide much more flexibility.
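The underlying problem is that `dt.date.today()` in the `default=` argument is evaluated once, when the template is built. Independently of the Flex Templates suggestion, one way to reason about a fix is to treat the date as optional and resolve it only when the work actually runs. The sketch below is stdlib-only and is not taken from the answer: `resolve_run_date` is a hypothetical helper, and in a pipeline it would have to be called inside a transform (for example in a `Map` or `DoFn`) rather than while the graph is being constructed.

```python
import datetime as dt


def resolve_run_date(raw_value=None, today=None):
    """Return a YYYY-MM-DD string, falling back to the current date.

    raw_value: the date supplied by the caller, or None/"" when omitted.
    today: injectable "current date" so the function is easy to test.
    """
    if raw_value:
        # Validate the format early; raises ValueError on bad input.
        parsed = dt.datetime.strptime(raw_value, "%Y-%m-%d")
        return parsed.strftime("%Y-%m-%d")
    current = today if today is not None else dt.date.today()
    return current.strftime("%Y-%m-%d")


explicit = resolve_run_date("2020-08-25")
fallback = resolve_run_date(None, today=dt.date(2020, 8, 26))
```

Because the fallback is computed at call time, a job launched on a later day picks up that day's date instead of the date on which the template was staged.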

How to pass arguments to scoring file when deploying a Model in AzureML

I am deploying a trained model to an ACI endpoint on Azure Machine Learning, using the Python SDK.
I have created my score.py file, but I would like that file to be called with an argument (just like a training file) that I can interpret using argparse.
However, I can't find a way to pass arguments.
This is the code I have to create the InferenceConfig environment, and it obviously does not work. Should I fall back on using the extra Dockerfile steps or something similar?
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig
env = Environment('my_hosted_environment')
env.python.conda_dependencies = CondaDependencies.create(
conda_packages=['scikit-learn'],
pip_packages=['azureml-defaults'])
scoring_script = 'score.py --model_name ' + model_name
inference_config = InferenceConfig(entry_script=scoring_script, environment=env)
Adding the score.py for reference on how I'd love to use the arguments in that script:
#removed imports
import argparse
def init():
    global model
    parser = argparse.ArgumentParser(description="Load sklearn model")
    parser.add_argument('--model_name', dest="model_name", required=True)
    args, _ = parser.parse_known_args()
    model_path = Model.get_model_path(model_name=args.model_name)
    model = joblib.load(model_path)
def run(raw_data):
    try:
        data = json.loads(raw_data)['data']
        data = np.array(data)
        result = model.predict(data)
        return result.tolist()
    except Exception as e:
        result = str(e)
        return result
Interested to hear your thoughts

This question is a year old. Providing a solution to help those who may still be looking for an answer. My answer to a similar question is here. You may pass native python datatype variables into the inference config and access them as environment variables within the scoring script.
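A minimal sketch of the scoring-script side of that approach, using only the standard library. The variable name `MODEL_NAME` and the helper `model_name_from_env` are assumptions for illustration. How the variable gets set on the deployment depends on your SDK version and is not shown here.

```python
import os


def model_name_from_env(var_name="MODEL_NAME", environ=None):
    """Read a required setting from the environment, failing loudly if absent."""
    environ = os.environ if environ is None else environ
    value = environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return value


# Example with an explicit mapping instead of the real process environment.
name = model_name_from_env(environ={"MODEL_NAME": "my_model"})
```

Inside `init()` the returned name would take the place of `args.model_name`, so the script no longer needs argparse at all.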

I tackled this problem differently. I could not find a proper, easy-to-follow way to pass arguments to score.py when it is consumed by InferenceConfig. Instead, I followed these four steps:
1. Create score_template.py and define the variables that should be assigned.
2. Read the content of score_template.py and modify it by replacing the variables with the desired values.
3. Write the modified content into score.py.
4. Finally, pass score.py to InferenceConfig.
STEP 1 in score_template.py:
import json
from azureml.core.model import Model
import os
import joblib
import pandas as pd
import numpy as np
def init():
    global model
    #model = joblib.load('recommender.pkl')
    model_name="#MODEL_NAME#"
    model_saved_file='#MODEL_SAVED_FILE#'
    try:
        model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), model_saved_file)
        model = joblib.load(model_path)
    except:
        model_path = Model.get_model_path(model_name)
        model = joblib.load(model_path)
def run(raw_data):
    try:
        #data=pd.json_normalize(data)
        #data=np.array(data['data'])
        data = json.loads(raw_data)["data"]
        data = np.array(data)
        result = model.predict(data)
        # you can return any datatype as long as it is JSON-serializable
        return {"result": result.tolist()}
    except Exception as e:
        error = str(e)
        #error= data
        return error
STEP 2-4 in deploy_model.py:
#--Modify Entry Script/Pass Model Name--
entry_script="score.py"
entry_script_temp="score_template.py"
# Read in the entry script template
print("Prepare Entry Script")
with open(entry_script_temp, 'r') as file :
entry_script_contents = file.read()
# Replace the target string
entry_script_contents = entry_script_contents.replace('#MODEL_NAME#', model_name)
entry_script_contents = entry_script_contents.replace('#MODEL_SAVED_FILE#', model_file_name)
# Write the file to entry script
with open(entry_script, 'w') as file:
file.write(entry_script_contents)
#--Define configs for the deployment---
print("Get Environment")
env = Environment.get(workspace=ws, name=env_name)
env.inferencing_stack_version = "latest"
print("Inference Configuration")
inference_config = InferenceConfig(entry_script=entry_script, environment=env, source_directory=base_path)
aci_config = AciWebservice.deploy_configuration(cpu_cores = int(cpu_cores), memory_gb = int(memory_gb),location=location)
#--Deploy the service---
print("Deploy Model")
print("model version:", model_artifact.version)
service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model_artifact],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)
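The substitution in steps 2 and 3 can be tried in isolation, without any Azure dependency. The sketch below is a variation on the `str.replace` calls above, not the answer's exact code: `render_entry_script` is a hypothetical helper that fills `#NAME#` placeholders and raises if a placeholder has no value, so a forgotten variable is caught before deployment.

```python
import re

# Matches placeholders such as #MODEL_NAME# (upper-case letters and underscores).
PLACEHOLDER = re.compile(r"#([A-Z_]+)#")


def render_entry_script(template_text, values):
    """Replace every #KEY# placeholder with values[KEY]."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder #{key}#")
        return values[key]
    return PLACEHOLDER.sub(substitute, template_text)


template = 'model_name="#MODEL_NAME#"\nmodel_saved_file=\'#MODEL_SAVED_FILE#\'\n'
rendered = render_entry_script(
    template,
    {"MODEL_NAME": "recommender", "MODEL_SAVED_FILE": "recommender.pkl"},
)
```

Ordinary comments such as `#model = joblib.load(...)` are left alone because the pattern only matches an upper-case name enclosed by two `#` characters.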

How to deploy using environments is shown in the model-register-and-deploy.ipynb notebook. The InferenceConfig class accepts source_directory and entry_script parameters, where source_directory is the path to the folder containing all the files (score.py and any additional files) needed to create the image.
The multi-model-register-and-deploy.ipynb notebook has code snippets showing how to create an InferenceConfig with source_directory and entry_script.
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig, Model
from azureml.core.environment import Environment
myenv = Environment.from_conda_specification(name="myenv", file_path="myenv.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)
# ws, model and aciconfig are assumed to have been created earlier
service = Model.deploy(workspace=ws,
                       name='sklearn-mnist-svc',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aciconfig)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)

Attribute error while creating custom template using python in Google Cloud DataFlow

I am facing an issue while creating a custom template for Cloud Dataflow.
It is simple code that takes data from an input bucket and loads it into BigQuery.
We want to load many tables, so we are trying to create a custom template.
Once this works, the next step would be to pass the dataset as a parameter as well.
Error message:
AttributeError: 'StaticValueProvider' object has no attribute 'datasetId'
Code
class ContactUploadOptions(PipelineOptions):
    """
    Runtime Parameters given during template execution
    path and organization parameters are necessary for execution of pipeline
    campaign is optional for committing to bigquery
    """
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--input',
            type=str,
            help='Path of the file to read from'
        )
        parser.add_value_provider_argument(
            '--output',
            type=str,
            help='Output BQ table for the pipeline')
def run(argv=None):
    """The main function which creates the pipeline and runs it."""
    global PROJECT
    from google.cloud import bigquery
    # Retrieve project Id and append to PROJECT from GoogleCloudOptions
    # Initialize runtime parameters as object
    contact_options = PipelineOptions().view_as(ContactUploadOptions)
    PROJECT = PipelineOptions().view_as(GoogleCloudOptions).project
    client = bigquery.Client(project=PROJECT)
    dataset = client.dataset('pharma')
    data_ingestion = DataIngestion()
    pipeline_options = PipelineOptions()
    # Save main session state so pickled functions and classes
    # defined in __main__ can be unpickled
    pipeline_options.view_as(SetupOptions).save_main_session = True
    # Parse arguments from command line.
    #data_ingestion = DataIngestion()
    # Instantiate pipeline
    options = PipelineOptions()
    p = beam.Pipeline(options=options)
    (p
     | 'Read from a File' >> beam.io.ReadFromText(contact_options.input, skip_header_lines=0)
     | 'String To BigQuery Row' >> beam.Map(lambda s: data_ingestion.parse_method(s))
     | 'Write to BigQuery' >> beam.io.Write(
         beam.io.BigQuerySink(
             contact_options.output,
             schema='assetid:INTEGER,assetname:STRING,prodcd:INTEGER',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
    )
My command is as below :
python3 -m pharm_template --runner DataflowRunner --project jupiter-120 --staging_location gs://input-cdc/temp/staging --temp_location gs://input-cdc/temp/ --template_location gs://code-cdc/temp/templates/jupiter_pipeline_template
What I tried:
I tried passing --input and --output.
I also tried --experiment=use_beam_bq_sink, but to no avail.
I also tried passing datasetId:
datasetId = StaticValueProvider(str, 'pharma')
but no luck.
If anyone has created a template that loads into BigQuery, I can take a cue from it and fix this issue.

Parse config files, environment, and command-line arguments, to get a single collection of options

Python's standard library has modules for configuration file parsing (configparser), environment variable reading (os.environ), and command-line argument parsing (argparse). I want to write a program that does all those, and also:
Has a cascade of option values:
default option values, overridden by
config file options, overridden by
environment variables, overridden by
command-line options.
Allows one or more configuration file locations specified on the command line with e.g. --config-file foo.conf, and reads that (either instead of, or additional to, the usual configuration file). This must still obey the above cascade.
Allows option definitions in a single place to determine the parsing behaviour for configuration files and the command line.
Unifies the parsed options into a single collection of option values for the rest of the program to access without caring where they came from.
Everything I need is apparently in the Python standard library, but they don't work together smoothly.
How can I achieve this with minimum deviation from the Python standard library?

UPDATE: I finally got around to putting this on PyPI. Install the latest version via:
pip install configargparser
Full help and instructions are here.
Original post
Here's a little something that I hacked together. Feel free to suggest improvements or report bugs in the comments:
import argparse
import configparser
import os
def _identity(x):
    return x
_SENTINEL = object()
class AddConfigFile(argparse.Action):
    def __call__(self,parser,namespace,values,option_string=None):
        # I can never remember if `values` is a list all the time or if it
        # can be a scalar string; this takes care of both.
        if isinstance(values,str):
            parser.config_files.append(values)
        else:
            parser.config_files.extend(values)
class ArgumentConfigEnvParser(argparse.ArgumentParser):
    def __init__(self,*args,**kwargs):
        """
        Added 2 new keyword arguments to the ArgumentParser constructor:
        config --> List of filenames to parse for config goodness
        default_section --> name of the default section in the config file
        """
        self.config_files = kwargs.pop('config',[]) #Must be a list
        self.default_section = kwargs.pop('default_section','MAIN')
        self._action_defaults = {}
        argparse.ArgumentParser.__init__(self,*args,**kwargs)
    def add_argument(self,*args,**kwargs):
        """
        Works like `ArgumentParser.add_argument`, except that we've added an action:
        config: add a config file to the parser
        This also adds the ability to specify which section of the config file to pull the
        data from, via the `section` keyword. This relies on the (undocumented) fact that
        `ArgumentParser.add_argument` actually returns the `Action` object that it creates.
        We need this to reliably get `dest` (although we could probably write a simple
        function to do this for us).
        """
        if 'action' in kwargs and kwargs['action'] == 'config':
            kwargs['action'] = AddConfigFile
            kwargs['default'] = argparse.SUPPRESS
        # argparse won't know what to do with the section, so
        # we'll pop it out and add it back in later.
        #
        # We also have to prevent argparse from doing any type conversion,
        # which is done explicitly in parse_known_args.
        #
        # This way, we can reliably check whether argparse has replaced the default.
        #
        section = kwargs.pop('section', self.default_section)
        type = kwargs.pop('type', _identity)
        default = kwargs.pop('default', _SENTINEL)
        if default is not argparse.SUPPRESS:
            kwargs.update(default=_SENTINEL)
        else:
            kwargs.update(default=argparse.SUPPRESS)
        action = argparse.ArgumentParser.add_argument(self,*args,**kwargs)
        kwargs.update(section=section, type=type, default=default)
        self._action_defaults[action.dest] = (args,kwargs)
        return action
    def parse_known_args(self,args=None, namespace=None):
        # `parse_args` calls `parse_known_args`, so we should be okay with this...
        ns, argv = argparse.ArgumentParser.parse_known_args(self, args=args, namespace=namespace)
        # Python 3: the module is `configparser` and SafeConfigParser is now ConfigParser
        config_parser = configparser.ConfigParser()
        config_files = [os.path.expanduser(os.path.expandvars(x)) for x in self.config_files]
        config_parser.read(config_files)
        for dest,(args,init_dict) in self._action_defaults.items():
            type_converter = init_dict['type']
            default = init_dict['default']
            obj = default
            if getattr(ns,dest,_SENTINEL) is not _SENTINEL: # found on command line
                obj = getattr(ns,dest)
            else: # not found on commandline
                try: # get from config file
                    obj = config_parser.get(init_dict['section'],dest)
                except (configparser.NoSectionError, configparser.NoOptionError): # Nope, not in config file
                    try: # get from environment
                        obj = os.environ[dest.upper()]
                    except KeyError:
                        pass
            if obj is _SENTINEL:
                setattr(ns,dest,None)
            elif obj is argparse.SUPPRESS:
                pass
            else:
                setattr(ns,dest,type_converter(obj))
        return ns, argv
if __name__ == '__main__':
    fake_config = """
[MAIN]
foo:bar
bar:1
"""
    with open('_config.file','w') as fout:
        fout.write(fake_config)
    parser = ArgumentConfigEnvParser()
    parser.add_argument('--config-file', action='config', help="location of config file")
    parser.add_argument('--foo', type=str, action='store', default="grape", help="don't know what foo does ...")
    parser.add_argument('--bar', type=int, default=7, action='store', help="This is an integer (I hope)")
    parser.add_argument('--baz', type=float, action='store', help="This is an float(I hope)")
    parser.add_argument('--qux', type=int, default='6', action='store', help="this is another int")
    ns = parser.parse_args([])
    parser_defaults = {'foo':"grape",'bar':7,'baz':None,'qux':6}
    config_defaults = {'foo':'bar','bar':1}
    env_defaults = {"baz":3.14159}
    # This should be the defaults we gave the parser
    print(ns)
    assert ns.__dict__ == parser_defaults
    # This should be the defaults we gave the parser + config defaults
    d = parser_defaults.copy()
    d.update(config_defaults)
    ns = parser.parse_args(['--config-file','_config.file'])
    print(ns)
    assert ns.__dict__ == d
    os.environ['BAZ'] = "3.14159"
    # This should be the parser defaults + config defaults + env_defaults
    d = parser_defaults.copy()
    d.update(config_defaults)
    d.update(env_defaults)
    ns = parser.parse_args(['--config-file','_config.file'])
    print(ns)
    assert ns.__dict__ == d
    # This should be the parser defaults + config defaults + env_defaults + commandline
    commandline = {'foo':'3','qux':4}
    d = parser_defaults.copy()
    d.update(config_defaults)
    d.update(env_defaults)
    d.update(commandline)
    ns = parser.parse_args(['--config-file','_config.file','--foo=3','--qux=4'])
    print(ns)
    assert ns.__dict__ == d
    os.remove('_config.file')
TODO
This implementation is still incomplete. Here's a partial TODO list:
(easy) Interaction with parser defaults
(easy) If type conversion doesn't work, check against how argparse handles error messages
Conform to documented behavior
(easy) Write a function that figures out dest from args in add_argument, instead of relying on the Action object
(trivial) Write a parse_args function which uses parse_known_args. (e.g. copy parse_args from the cpython implementation to guarantee it calls parse_known_args.)
Less Easy Stuff…
I haven't tried any of this yet. It's unlikely—but still possible!—that it could just work…
(hard?) Mutual Exclusion
(hard?) Argument Groups (If implemented, these groups should get a section in the config file.)
(hard?) Sub Commands (Sub-commands should also get a section in the config file.)

The argparse module makes this not nuts, as long as you're happy with a config file that looks like a command line. (I think this is an advantage, because users will only have to learn one syntax.) Setting fromfile_prefix_chars to, for example, @, makes it so that
my_prog --foo=bar
is equivalent to
my_prog @baz.conf
if baz.conf contains
--foo
bar
You can even have your code look for foo.conf automatically by modifying argv:
if os.path.exists('foo.conf'):
    argv = ['@foo.conf'] + argv
args = argparser.parse_args(argv)
The format of these configuration files can be changed by subclassing ArgumentParser and overriding its convert_arg_line_to_args method.
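As a rough illustration of the convert_arg_line_to_args hook mentioned above, here is a minimal sketch for Python 3. The ConfigFileParser class, the "key = value" file format, the --foo and --count options, and the parse_with_file helper are all made up for this example, and it assumes @ as the prefix character.

```python
import argparse
import os
import tempfile


class ConfigFileParser(argparse.ArgumentParser):
    """Hypothetical subclass: accepts 'key = value' lines in @files."""

    def convert_arg_line_to_args(self, arg_line):
        line = arg_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith('#'):
            return []
        key, sep, value = line.partition('=')
        if not sep:
            # No '=' on the line: treat it as a bare flag.
            return ['--' + key.strip()]
        return ['--' + key.strip(), value.strip()]


def parse_with_file(file_text, cli_args):
    """Write file_text to a temporary file and parse it ahead of cli_args."""
    parser = ConfigFileParser(fromfile_prefix_chars='@')
    parser.add_argument('--foo', default='default-foo')
    parser.add_argument('--count', type=int, default=0)
    with tempfile.NamedTemporaryFile('w', suffix='.conf', delete=False) as handle:
        handle.write(file_text)
        path = handle.name
    try:
        # Arguments read from the file come first, so later
        # command-line arguments override them.
        return parser.parse_args(['@' + path] + cli_args)
    finally:
        os.remove(path)
```

Because the file's arguments are placed before the real command line, anything the user types still wins over what the file says.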

I haven't tried it myself, but there is a library, ConfigArgParse, which states that it does most of the things you want:
A drop-in replacement for argparse that allows options to also be set via config files and/or environment variables.

There's a library that does exactly this, called configglue.
configglue is a library that glues together python's
optparse.OptionParser and ConfigParser.ConfigParser, so that you don't
have to repeat yourself when you want to export the same options to a
configuration file and a commandline interface.
It also supports environment variables.
There's also another library called ConfigArgParse which is
A drop-in replacement for argparse that allows options to also be set
via config files and/or environment variables.
You might also be interested in the PyCon talk about configuration by Łukasz Langa, "Let Them Configure!"

It seems the standard library doesn't address this, leaving each programmer to cobble configparser and argparse and os.environ all together in clunky ways.

To hit all those requirements, I would recommend writing your own library that uses both [opt|arg]parse and configparser for the underlying functionality.
Given the first two and the last requirement, I'd say you want:
Step one: Do a command line parser pass that only looks for the --config-file option.
Step two: Parse the config file.
Step three: set up a second command line parser pass using the output of the config file pass as the defaults.
The third requirement likely means you have to design your own option definition system to expose all the functionality of optparse and configparser that you care about, and write some plumbing to do conversions in between.
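The three steps above can be sketched roughly as follows, for Python 3 with argparse and configparser. This is only an illustration under assumptions: the [options] section name, the --foo and --bar options, and the config_text parameter (a convenience for trying the function without a file on disk) are all invented for the example.

```python
import argparse
import configparser


def parse_options(argv, config_text=None):
    """Parse argv, using values from a config file as defaults.

    config_text is a hypothetical hook for this sketch: when given, it is
    parsed instead of the file named by --config-file.
    """
    # Step one: a minimal parser that only knows about --config-file.
    pre_parser = argparse.ArgumentParser(add_help=False)
    pre_parser.add_argument('--config-file')
    pre_args, remaining = pre_parser.parse_known_args(argv)

    # Step two: parse the config file, if there is one.
    config = configparser.ConfigParser()
    if config_text is not None:
        config.read_string(config_text)
    elif pre_args.config_file:
        config.read(pre_args.config_file)
    file_defaults = {}
    if config.has_section('options'):
        file_defaults = dict(config.items('options'))

    # Step three: the full parser, with the config values as defaults.
    parser = argparse.ArgumentParser(parents=[pre_parser])
    parser.add_argument('--foo', default='grape')
    parser.add_argument('--bar', type=int, default=7)
    # String defaults are run through each argument's type= converter,
    # so 'bar = 1' in the file ends up as the integer 1.
    parser.set_defaults(**file_defaults)
    return parser.parse_args(remaining)
```

Note that set_defaults also accepts keys that no argument defines, so unknown entries in the file simply show up as extra attributes on the result; a stricter version would validate them first.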

The Python standard library does not provide this, as far as I know. I solved this for myself by writing code to use optparse and ConfigParser to parse the command line and config files, and provide an abstraction layer on top of them. However, you would need this as a separate dependency, which from your earlier comment seems to be unpalatable.
If you want to look at the code I wrote, it's at http://liw.fi/cliapp/. It's integrated into my "command line application framework" library, since that's a large part of what the framework needs to do.

I tried something like this recently, using "optparse".
I set it up as a subclass of OptionParser, with a '--Store' and a '--Check' command.
The code below should pretty much have you covered. You just need to define your own 'load' and 'store' methods which accept/return dictionaries and you're pretty much set.
import optparse
import sys
class SmartParse(optparse.OptionParser):
    def __init__(self, defaults, *args, **kwargs):
        self.smartDefaults = defaults
        optparse.OptionParser.__init__(self, *args, **kwargs)
        fileGroup = optparse.OptionGroup(self, 'handle stored defaults')
        fileGroup.add_option(
            '-S', '--Store',
            dest='Action',
            action='store_const', const='Store',
            help='store command line settings'
        )
        fileGroup.add_option(
            '-C', '--Check',
            dest='Action',
            action='store_const', const='Check',
            help='check stored settings'
        )
        self.add_option_group(fileGroup)
    def parse_args(self, *args, **kwargs):
        (options, arguments) = optparse.OptionParser.parse_args(self, *args, **kwargs)
        action = options.__dict__.pop('Action')
        if action == 'Check':
            # --Check must not be combined with any other option
            assert all(
                value is None
                for (key, value) in options.__dict__.items()
            )
            print('defaults:', self.smartDefaults)
            print('config:', self.load())
            sys.exit()
        elif action == 'Store':
            self.store(options.__dict__)
            sys.exit()
        else:
            config = self.load()
            # Keep only the options actually given on the command line
            commandline = dict(
                (key, val)
                for (key, val) in options.__dict__.items()
                if val is not None
            )
            # Precedence, lowest to highest: defaults, stored config, command line
            result = {}
            result.update(self.smartDefaults)
            result.update(config)
            result.update(commandline)
            return result, arguments
    def load(self):
        # Override this: return the stored settings as a dictionary
        return {}
    def store(self, optionDict):
        # Override this: persist the given dictionary of settings
        print('Storing:', optionDict)

Here's a module I hacked together that reads command-line arguments, environment settings, ini files, and keyring values as well. It's also available in a gist.
"""
Configuration Parser
Configurable parser that will parse config files, environment variables,
keyring, and command-line arguments.
Example test.ini file:
    [defaults]
    gini=10
    [app]
    xini = 50
Example test.arg file:
    --xfarg=30
Example test.py file:
    import os
    import sys
    import config
    def main(argv):
        '''Test.'''
        options = [
            config.Option("xpos",
                          help="positional argument",
                          nargs='?',
                          default="all",
                          env="APP_XPOS"),
            config.Option("--xarg",
                          help="optional argument",
                          default=1,
                          type=int,
                          env="APP_XARG"),
            config.Option("--xenv",
                          help="environment argument",
                          default=1,
                          type=int,
                          env="APP_XENV"),
            config.Option("--xfarg",
                          help="@file argument",
                          default=1,
                          type=int,
                          env="APP_XFARG"),
            config.Option("--xini",
                          help="ini argument",
                          default=1,
                          type=int,
                          ini_section="app",
                          env="APP_XINI"),
            config.Option("--gini",
                          help="global ini argument",
                          default=1,
                          type=int,
                          env="APP_GINI"),
            config.Option("--karg",
                          help="secret keyring arg",
                          default=-1,
                          type=int),
        ]
        ini_file_paths = [
            '/etc/default/app.ini',
            os.path.join(os.path.dirname(os.path.abspath(__file__)),
                         'test.ini')
        ]
        # default usage
        conf = config.Config(prog='app', options=options,
                             ini_paths=ini_file_paths)
        conf.parse()
        print(conf)
        # advanced usage
        cli_args = conf.parse_cli(argv=argv)
        env = conf.parse_env()
        secrets = conf.parse_keyring(namespace="app")
        ini = conf.parse_ini(ini_file_paths)
        sources = {}
        if ini:
            for key, value in ini.items():
                conf[key] = value
                sources[key] = "ini-file"
        if secrets:
            for key, value in secrets.items():
                conf[key] = value
                sources[key] = "keyring"
        if env:
            for key, value in env.items():
                conf[key] = value
                sources[key] = "environment"
        if cli_args:
            for key, value in cli_args.items():
                conf[key] = value
                sources[key] = "command-line"
        print('\n'.join(['%s:\t%s' % (k, v) for k, v in sources.items()]))
    if __name__ == "__main__":
        if config.keyring:
            config.keyring.set_password("app", "karg", "13")
        main(sys.argv)
Example results:
    $ APP_XENV=10 python test.py api --xarg=2 @test.arg
    <Config xpos=api, gini=1, xenv=10, xini=50, karg=13, xarg=2, xfarg=30>
    xpos:   command-line
    xenv:   environment
    xini:   ini-file
    karg:   keyring
    xarg:   command-line
    xfarg:  command-line
"""
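import argparse
import copy
import os
import sys
try:
    import configparser as ConfigParser  # Python 3 name of the module
except ImportError:
    import ConfigParser  # Python 2
try:
    import keyring
except ImportError:
    keyring = None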
class Option(object):
    """Holds a configuration option and the names and locations for it.
    Instantiate options using the same arguments as you would for an
    add_argument call in argparse. However, you have two additional kwargs
    available:
        env: the name of the environment variable to use for this option
        ini_section: the ini file section to look this value up from
    """
    def __init__(self, *args, **kwargs):
        self.args = args or []
        self.kwargs = kwargs or {}
    def add_argument(self, parser, **override_kwargs):
        """Add an option to an argparse parser."""
        kwargs = copy.copy(self.kwargs)
        # env and ini_section are extensions that argparse does not accept
        kwargs.pop('env', None)
        kwargs.pop('ini_section', None)
        kwargs.update(override_kwargs)
        parser.add_argument(*self.args, **kwargs)
    @property
    def type(self):
        """The type of the option.
        Should be a callable to parse options.
        """
        return self.kwargs.get("type", str)
    @property
    def name(self):
        """The name of the option as determined from the args."""
        for arg in self.args:
            if arg.startswith("--"):
                return arg[2:].replace("-", "_")
            elif arg.startswith("-"):
                # Short flags such as -x are skipped in favour of a long name
                continue
            else:
                return arg.replace("-", "_")
    @property
    def default(self):
        """The default for the option."""
        return self.kwargs.get("default")
class Config(object):
    """Parses configuration sources."""
    def __init__(self, options=None, ini_paths=None, **parser_kwargs):
        """Initialize with list of options.
        :param ini_paths: optional paths to ini files to look up values from
        :param parser_kwargs: kwargs used to init argparse parsers.
        """
        self._parser_kwargs = parser_kwargs or {}
        self._ini_paths = ini_paths or []
        self._options = copy.copy(options) or []
        self._values = {option.name: option.default
                        for option in self._options}
        self._parser = argparse.ArgumentParser(**parser_kwargs)
        self.pass_thru_args = []
    @property
    def prog(self):
        """Program name."""
        return self._parser.prog
    def __getitem__(self, key):
        return self._values[key]
    def __setitem__(self, key, value):
        self._values[key] = value
    def __delitem__(self, key):
        del self._values[key]
    def __contains__(self, key):
        return key in self._values
    def __iter__(self):
        return iter(self._values)
    def __len__(self):
        return len(self._values)
    def get(self, key, *args):
        """
        Return the value for key if it exists otherwise the default.
        """
        return self._values.get(key, *args)
    def __getattr__(self, attr):
        if attr in self._values:
            return self._values[attr]
        else:
            raise AttributeError("'config' object has no attribute '%s'"
                                 % attr)
    def build_parser(self, options, **override_kwargs):
        """Build an argparse parser for the given options."""
        kwargs = copy.copy(self._parser_kwargs)
        kwargs.update(override_kwargs)
        if 'fromfile_prefix_chars' not in kwargs:
            kwargs['fromfile_prefix_chars'] = '@'
        parser = argparse.ArgumentParser(**kwargs)
        if options:
            for option in options:
                option.add_argument(parser)
        return parser
    def parse_cli(self, argv=None):
        """Parse command-line arguments into values."""
        if not argv:
            argv = sys.argv
        options = []
        for option in self._options:
            # Suppress defaults so only options actually supplied are returned
            temp = Option(*option.args, **option.kwargs)
            temp.kwargs['default'] = argparse.SUPPRESS
            options.append(temp)
        parser = self.build_parser(options=options)
        parsed, extras = parser.parse_known_args(argv[1:])
        if extras:
            valid, pass_thru = self.parse_passthru_args(argv[1:])
            parsed, extras = parser.parse_known_args(valid)
            if extras:
                raise AttributeError("Unrecognized arguments: %s" %
                                     ', '.join(extras))
            self.pass_thru_args = pass_thru + extras
        return vars(parsed)
    def parse_env(self):
        """Read option values from their environment variables."""
        results = {}
        for option in self._options:
            env_var = option.kwargs.get('env')
            if env_var and env_var in os.environ:
                value = os.environ[env_var]
                results[option.name] = option.type(value)
        return results
    def get_defaults(self):
        """Use argparse to determine and return dict of defaults."""
        parser = self.build_parser(options=self._options)
        parsed, _ = parser.parse_known_args([])
        return vars(parsed)
    def parse_ini(self, paths=None):
        """Parse config files and return configuration options.
        Expects array of files that are in ini format.
        :param paths: list of paths to files to parse (uses ConfigParser logic).
                      If not supplied, uses the ini_paths value supplied on
                      initialization.
        """
        results = {}
        # ConfigParser.ConfigParser exists under both the Python 2 and 3 names
        config = ConfigParser.ConfigParser()
        config.read(paths or self._ini_paths)
        for option in self._options:
            ini_section = option.kwargs.get('ini_section')
            if ini_section:
                try:
                    value = config.get(ini_section, option.name)
                    results[option.name] = option.type(value)
                except ConfigParser.NoSectionError:
                    pass
        return results
    def parse_keyring(self, namespace=None):
        """Look up option values in the keyring, if keyring is installed."""
        results = {}
        if not keyring:
            return results
        if not namespace:
            namespace = self.prog
        for option in self._options:
            secret = keyring.get_password(namespace, option.name)
            if secret:
                results[option.name] = option.type(secret)
        return results
    def parse(self, argv=None):
        """Parse all sources and merge them, later sources overriding earlier."""
        defaults = self.get_defaults()
        args = self.parse_cli(argv=argv)
        env = self.parse_env()
        secrets = self.parse_keyring()
        ini = self.parse_ini()
        # Precedence, lowest to highest: defaults, ini, keyring, env, CLI
        results = defaults
        results.update(ini)
        results.update(secrets)
        results.update(env)
        results.update(args)
        self._values = results
        return self
    @staticmethod
    def parse_passthru_args(argv):
        """Handles arguments to be passed thru to a subprocess using '--'.
        :returns: tuple of two lists; args and pass-thru-args
        """
        if '--' in argv:
            dashdash = argv.index("--")
            if dashdash == 0:
                return argv[1:], []
            elif dashdash > 0:
                return argv[0:dashdash], argv[dashdash + 1:]
        return argv, []
    def __repr__(self):
        return "<Config %s>" % ', '.join([
            '%s=%s' % (k, v) for k, v in self._values.items()])
def comma_separated_strings(value):
    """Handles comma-separated arguments passed in command-line."""
    # A list comprehension returns a list on both Python 2 and 3
    return [str(item) for item in value.split(",")]
def comma_separated_pairs(value):
    """Handles comma-separated key/values passed in command-line."""
    pairs = value.split(",")
    results = {}
    for pair in pairs:
        key, pair_value = pair.split('=')
        results[key] = pair_value
    return results

You can use ChainMap for this. Take a look at the example I provided in my answer to the SO question "Which is the best way to allow configuration options be overridden at the command line in Python?".
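For readers who have not used it, here is a small sketch of the ChainMap idea for Python 3. It is not the code from the linked answer; the merged_options function, the APP_FOO and APP_BAR variable names, and the --foo and --bar options are assumptions made up for this example.

```python
import argparse
from collections import ChainMap


def merged_options(argv, environ, file_values):
    """Layer command line over environment over file values over defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--foo')
    parser.add_argument('--bar', type=int)
    # Keep only options actually given, so unset ones fall through
    # to the lower-priority mappings.
    cli = {key: value
           for key, value in vars(parser.parse_args(argv)).items()
           if value is not None}

    # Hypothetical environment variable names for this sketch.
    env = {}
    if 'APP_FOO' in environ:
        env['foo'] = environ['APP_FOO']
    if 'APP_BAR' in environ:
        env['bar'] = int(environ['APP_BAR'])

    defaults = {'foo': 'grape', 'bar': 7}
    # ChainMap searches its mappings left to right and returns the
    # first match, so the leftmost mapping has the highest priority.
    return ChainMap(cli, env, file_values, defaults)
```

In a real program you would pass sys.argv[1:] and os.environ, plus a dictionary read from your config file. The ChainMap keeps the layers separate, so you can still inspect which source a value came from through its maps attribute.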

I built the library confect precisely to meet most of these needs.
It can load configuration files multiple times, through given file paths or module names.
It loads configurations from environment variables with a given prefix.
It can attach command-line options to click commands
(sorry, it's not argparse, but click is better and much more advanced. confect might support argparse in a future release).
Most importantly, confect loads Python configuration files, not JSON/YAML/TOML/INI. Just like an IPython profile file or a Django settings file, a Python configuration file is flexible and easier to maintain.
For more information, please check the README.rst in the project repository. Note that it only supports Python 3.6 and up.
Examples
Attaching command line options
import click
from proj_X.core import conf
@click.command()
@conf.click_options
def cli():
    click.echo(f'cache_expire = {conf.api.cache_expire}')
if __name__ == '__main__':
    cli()
It automatically creates a comprehensive help message with all properties and default values declared.
$ python -m proj_X.cli --help
Usage: cli.py [OPTIONS]
Options:
  --api-cache_expire INTEGER  [default: 86400]
  --api-cache_prefix TEXT     [default: proj_X_cache]
  --api-url_base_path TEXT    [default: api/v2/]
  --db-db_name TEXT           [default: proj_x]
  --db-username TEXT          [default: proj_x_admin]
  --db-password TEXT          [default: your_password]
  --db-host TEXT              [default: 127.0.0.1]
  --help                      Show this message and exit.
Loading environment variables
It only takes one line to load environment variables:
conf.load_envvars('proj_X')
