How to design a program with many configuration options? - python

Let's say I have a program that has a large number of configuration options. The user can specify them in a config file. My program can parse this config file, but how should it internally store and pass around the options?
In my case, the software is used to perform a scientific simulation. There are about 200 options most of which have sane defaults. Typically the user only has to specify a dozen or so. The difficulty I face is how to design my internal code. Many of the objects that need to be constructed depend on many configuration options. For example an object might need several paths (for where data will be stored), some options that need to be passed to algorithms that the object will call, and some options that are used directly by the object itself.
This leads to objects needing a very large number of arguments to be constructed. Additionally, as my codebase is under very active development, it is a big pain to go through the call stack and pass along a new configuration option all the way down to where it is needed.
One way to prevent that pain is to have a global configuration object that can be freely used anywhere in the code. I don't particularly like this approach as it leads to functions and classes that don't take any (or only one) argument and it isn't obvious to the reader what data the function/class deals with. It also prevents code reuse as all of the code depends on a giant config object.
Can anyone give me some advice about how a program like this should be structured?
Here is an example of what I mean for the configuration option passing style:
class A:
    def __init__(self, opt_a, opt_b, ..., opt_z):
        self.opt_a = opt_a
        self.opt_b = opt_b
        ...
        self.opt_z = opt_z
    def foo(self, arg):
        algo(arg, self.opt_a, self.opt_e)
Here is an example of the global config style:
class A:
    def __init__(self, config):
        self.config = config
    def foo(self, arg):
        algo(arg, self.config)
The examples are in Python but my question stands for any similar programming language.

matplotlib is a large package with many configuration options. It uses rcParams to manage all the default parameters; rcParams stores them in a dict-like object.
Every function takes its options as keyword arguments and falls back to rcParams when an option is not given.
For example:
def f(x, y, opt_a=None, opt_b=None):
    if opt_a is None:
        opt_a = rcParams['group1.opt_a']
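A minimal, self-contained sketch of the same fallback pattern, in case it helps. The DEFAULTS dict, the option names, and the solve function are made up for illustration; they only stand in for rcParams and are not part of matplotlib.

```python
# Hypothetical defaults table standing in for matplotlib's rcParams.
DEFAULTS = {
    "solver.tolerance": 1e-6,
    "solver.max_iter": 100,
}


def solve(x, tolerance=None, max_iter=None):
    """Return the settings actually used, falling back to DEFAULTS."""
    if tolerance is None:
        tolerance = DEFAULTS["solver.tolerance"]
    if max_iter is None:
        max_iter = DEFAULTS["solver.max_iter"]
    return {"x": x, "tolerance": tolerance, "max_iter": max_iter}


# The caller only names the options it wants to change.
default_run = solve(1.0)
custom_run = solve(1.0, max_iter=5)
```

The signature still documents which options the function depends on, while callers that are happy with the defaults pass nothing extra.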

A few design patterns will help
Prototype
Factory and Abstract Factory
Use these two patterns with configuration objects. Each method will then take a configuration object and use what it needs. Also consider applying a logical grouping to config parameters and think about ways to reduce the number of inputs.
Pseudocode:
// Consider we can run three different kinds of Simulations. sim1, sim2, sim3
ConfigFactory configFactory = new ConfigFactory("/path/to/option/file");
....
Simulation1 sim1;
Simulation2 sim2;
Simulation3 sim3;
sim1.run( configFactory.ConfigForSim1() );
sim2.run( configFactory.ConfigForSim2() );
sim3.run( configFactory.ConfigForSim3() );
Inside of each factory method it might create a configuration from a prototype object (that has all of the "sane" defaults) and the option file becomes just the things that are different from default. This would be paired with clear documentation on what these defaults are and when a person (or other program) might want to change them.
** Edit: **
Also consider that each config returned by the factory is a subset of the overall config.
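The idea above could be sketched in Python roughly as follows. All names here (PROTOTYPE, Sim1Config, ConfigFactory, config_for_sim1) and the option values are hypothetical, and the overrides are passed in as a dict instead of being read from an option file, to keep the example self-contained.

```python
import copy
from dataclasses import dataclass

# Prototype holding every option with its "sane" default (illustrative values).
PROTOTYPE = {"output_dir": "out", "tolerance": 1e-6, "max_iter": 100, "seed": 0}


@dataclass(frozen=True)
class Sim1Config:
    """The subset of options that the first simulation needs."""
    output_dir: str
    tolerance: float


class ConfigFactory:
    def __init__(self, overrides):
        # Reject misspelled option names early instead of silently ignoring them.
        unknown = set(overrides) - set(PROTOTYPE)
        if unknown:
            raise KeyError(f"unknown options: {sorted(unknown)}")
        # Start from a copy of the prototype, then apply the user's changes.
        self._options = copy.deepcopy(PROTOTYPE)
        self._options.update(overrides)

    def config_for_sim1(self):
        return Sim1Config(
            output_dir=self._options["output_dir"],
            tolerance=self._options["tolerance"],
        )


factory = ConfigFactory({"tolerance": 1e-3})
sim1_config = factory.config_for_sim1()
```

Each simulation then receives a small, typed object instead of the full set of options, which keeps its dependencies visible.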

Pass around either the config parsing class, or write a class that wraps it and intelligently pulls out the requested options.
Python's standard library configparser exposes the sections and options of an INI style configuration file using the mapping protocol, and so you can retrieve your options directly from that as though it were a dictionary.
import configparser
myconf = configparser.ConfigParser()
myconf.read('myconf.ini')
what_to_do = myconf['section']['option']
If you explicitly want to provide the options using the attribute notation, create a class that overrides __getattr__:
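class MyConf:
    # Maps each option name to the section that holds it.
    _SECTIONS = {'what_to_do': 'section'}
    def __init__(self, path):
        self._parser = configparser.ConfigParser()
        self._parser.read(path)
    def __getattr__(self, option):
        # Only called when normal attribute lookup fails.
        try:
            section = self._SECTIONS[option]
            return self._parser[section][option]
        except KeyError:
            raise AttributeError(option)
myconf = MyConf('myconf.ini')
what_to_do = myconf.what_to_do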

Have a module load the params to its namespace, then import it and use wherever you want.
Also see related question here

Related

Is it appropriate to use a class for the purpose of organizing functions that share inputs?

To provide a bit of context, I am building a risk model that pulls data from various different sources. Initially I wrote the model as a single function that, when executed, read in the different data sources as pandas.DataFrame objects and used those objects when necessary. As the model grew in complexity, it quickly became unreadable and I found myself copying and pasting blocks of code often.
To clean up the code I decided to make a class that, when initialized, reads, cleans and parses the data. Initialization takes about a minute to run and builds my model in its entirety.
The class also has some additional functionality. There is a generate_email method that sends an email with details about high risk factors and another method append_history that point-in-times the risk model and saves it so I can run time comparisons.
The thing about these two additional methods is that I cannot imagine a scenario where I would call them without first re-calibrating my risk model. So I have considered calling them in __init__() like my other methods. I haven't, only because I am trying to justify having a class in the first place.
I am consulting this community because my project structure feels clunky and awkward. I am inclined to believe that I should not be using a class at all. Is it frowned upon to create classes merely for the purpose of organization? Also, is it bad practice to call instance methods (that take upwards of a minute to run) within __init__()?
Ultimately, I am looking for reassurance or a better code structure. Any help would be greatly appreciated.
Here is some pseudo code showing my project structure:
import pandas as pd

class RiskModel:
    def __init__(self, data_path_a, data_path_b):
        self.data_path_a = data_path_a
        self.data_path_b = data_path_b
        self.historical_data = None
        self.raw_data = None
        self.lookup_table = None
        self._read_in_data()
        self.risk_breakdown = None
        self._generate_risk_breakdown()
        self.risk_summary = None
        self._generate_risk_summary()
    def _read_in_data(self):
        # read in a .csv
        self.historical_data = pd.read_csv(self.data_path_a)
        # read an excel file containing many sheets into an ordered dictionary
        self.raw_data = pd.read_excel(self.data_path_b, sheet_name=None)
        # store a specific sheet from the excel file that is used by most of
        # my class's methods
        self.lookup_table = self.raw_data["Lookup"]
    def _generate_risk_breakdown(self):
        '''
        A function that creates a DataFrame from self.historical_data,
        self.raw_data, and self.lookup_table and stores it in
        self.risk_breakdown
        '''
        self.risk_breakdown = some_dataframe
    def _generate_risk_summary(self):
        '''
        A function that creates a DataFrame from self.lookup_table and
        self.risk_breakdown and stores it in self.risk_summary
        '''
        self.risk_summary = some_dataframe
    def generate_email(self, recipient):
        '''
        A function that sends an email with details about high risk factors
        '''
if __name__ == "__main__":
    risk_model = RiskModel(data_path_a, data_path_b)
    risk_model.generate_email("recipient@generic.com")
In my opinion it is a good way to organize your project, especially since you mentioned the high rate of re-usability of parts of the code.
One thing though: I wouldn't call the _read_in_data, _generate_risk_breakdown and _generate_risk_summary methods inside __init__, but would instead let the user call these methods after initializing the RiskModel class instance.
This way the user would be able to read in data from a different path or only to generate the risk breakdown or summary, without reading in the data once again.
Something like this:
my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)
my_risk_model.generate_risk_breakdown(parameters)
my_risk_model.generate_risk_summary(other_parameters)
If there is a risk of the user calling these methods in an order that breaks the logical chain, you could raise an exception if generate_risk_breakdown or generate_risk_summary is called before read_in_data. Of course, you could also move only the generate... methods out, leaving the data import inside __init__.
To argue further for moving the generate... methods out of __init__, consider a scenario where you would like to generate multiple risk summaries with different parameters. It would make sense not to create the RiskModel and read the same data every time, but instead to change the input to the generate_risk_summary method:
my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)
for parameter in [50, 60, 80]:
    my_risk_model.generate_risk_summary(parameter)
    my_risk_model.generate_email('test@gmail.com')
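Here is a small runnable sketch of the "raise an exception when called out of order" idea. It is only an illustration: the data is a plain list of numbers instead of DataFrames, and the threshold parameter and the filtering logic are invented for the example.

```python
class RiskModel:
    def __init__(self):
        # Nothing is loaded yet; the user decides when to read the data.
        self.raw_data = None
        self.risk_summary = None

    def read_in_data(self, rows):
        # Stand-in for reading files: keep a copy of the given numbers.
        self.raw_data = list(rows)

    def generate_risk_summary(self, threshold):
        # Guard against calling the steps in the wrong order.
        if self.raw_data is None:
            raise RuntimeError(
                "call read_in_data() before generate_risk_summary()")
        # Hypothetical summary: keep the values at or above the threshold.
        self.risk_summary = [r for r in self.raw_data if r >= threshold]
        return self.risk_summary


model = RiskModel()
model.read_in_data([40, 55, 90])
# The data is read once and reused for every parameter.
summaries = {t: model.generate_risk_summary(t) for t in (50, 60, 80)}
```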

Pythonic way to parse command line output into a container object

Please read this whole question before answering, as it's not what you think... I'm looking at creating python object wrappers that represent hardware devices on a system (trimmed example below).
class TPM(object):
    @property
    def attr1(self):
        """
        Protects value from being accidentally modified after
        constructor is called.
        """
        return self._attr1
    def __init__(self, attr1, ...):
        self._attr1 = attr1
        ...
    @classmethod
    def scan(cls):
        """Calls Popen, parses to dict, and passes **dict to constructor"""
Most of the constructor inputs come from running command line tools via subprocess.Popen and then parsing the output to fill in object attributes. I've come up with a few ways to handle these, but I'm unsatisfied with what I've put together thus far and am trying to find a better solution. Here are the common catches that I've found. (Quick note: tool versions are tightly controlled, so parsed outputs don't change unexpectedly.)
Many tools produce variant outputs, sometimes including fields and sometimes not. This means that if you assemble a dict to be wrapped in a container object, the constructor is more or less forced to take **kwargs and not really have defined fields. I don't like this because it makes static analysis via pylint, etc less than useful. I'd prefer a defined interface so that sphinx documentation is clearer and errors can be more reliably detected.
In lieu of **kwargs, I've also tried setting default args to None for many of the fields, with what ends up as pretty ugly results. One thing I dislike strongly about this option is that optional fields don't always come at the end of the command line tool output. This makes it a little mind-bending to look at the constructor and match it up to tool output.
I'd greatly prefer to avoid constructing a dictionary in the first place, but using setattr to create attributes will make pylint unable to detect the _attr1, etc... and create warnings. Any ideas here are welcome...
Basically, I am looking for the proper Pythonic way to do this. To summarize, my requirements are the following:
Command line tool output parsed into a container object.
Container object protects attributes via properties post-construction.
Varying number of inputs to constructor, with working static analysis and error detection for missing required fields during runtime.
Is there a good way of doing this (hopefully without a ton of boilerplate code) in Python? If so, what is it?
EDIT:
Per some of the clarification requests, we can take a look at the tpm_version command. Here's the output for my laptop, but for this TPM it doesn't include every possible attribute. Sometimes, the command will return extra attributes that I also want to capture. This makes parsing to known attribute names on a container object fairly difficult.
TPM 1.2 Version Info:
Chip Version: 1.2.4.40
Spec Level: 2
Errata Revision: 3
TPM Vendor ID: IFX
Vendor Specific data: 04280077 0074706d 3631ffff ff
TPM Version: 01010000
Manufacturer Info: 49465800
Example code (ignore lack of sanity checks, please. trimmed for brevity):
from subprocess import Popen, PIPE

class TPM(object):
    def __init__(self, chip_version, spec_level, errata_revision,
                 tpm_vendor_id, vendor_specific_data, tpm_version,
                 manufacturer_info):
        self._chip_version = chip_version
        ...
    @classmethod
    def scan(cls):
        # Capture stdout as text so it can be parsed line by line.
        tpm_proc = Popen("/usr/sbin/tpm_version", stdout=PIPE,
                         universal_newlines=True)
        stdout, _ = tpm_proc.communicate()
        tpm_dict = dict()
        for line in stdout.splitlines():
            if "Version Info:" in line:
                pass
            else:
                # Split on the first colon only, in case a value contains one.
                split_line = line.split(":", 1)
                attribute_name = (
                    split_line[0].strip().replace(' ', '_').lower())
                tpm_dict[attribute_name] = split_line[1].strip()
        return cls(**tpm_dict)
The problem here is that this tool (or a different one whose source I may not be able to review to learn every possible field) could add extra fields. My parser would still work, but my object would not capture those fields. That's what I'm really trying to solve in an elegant way.
I've been working on a more solid answer to this the last few months, as I basically work on hardware support libraries and have finally come up with a satisfactory (though pretty verbose) answer.
Parse the tool outputs, whatever they look like, into object structures that match up to how the tool views the device. These can have very generic dict structures, but should be broken out as much as possible.
Create another container class on top of that which uses attributes to access items in the tool-container-objects. This enforces an API and can return sane errors across multiple versions of the tool, and across differing tool outputs!
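The two layers described above might look something like this. It is a sketch under assumptions: parse_tool_output, TPMInfo, the choice of required fields, and the shortened sample text are all invented for illustration, and the sample is passed in as a string instead of being read from Popen.

```python
def parse_tool_output(text):
    """Layer 1: parse 'Key: value' lines into a plain dict, keeping every field."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue  # skip blank lines and headers like "TPM 1.2 Version Info:"
        fields[name.strip().replace(" ", "_").lower()] = value.strip()
    return fields


class TPMInfo:
    """Layer 2: a fixed, read-only attribute API over the parsed fields."""
    REQUIRED = ("chip_version", "spec_level")  # assumed to always be present
    KNOWN = REQUIRED + ("errata_revision",)

    def __init__(self, fields):
        missing = [name for name in self.REQUIRED if name not in fields]
        if missing:
            raise ValueError("missing required fields: " + ", ".join(missing))
        self._fields = dict(fields)

    @property
    def chip_version(self):
        return self._fields["chip_version"]

    @property
    def spec_level(self):
        return self._fields["spec_level"]

    @property
    def errata_revision(self):
        # Optional field: None when the tool did not report it.
        return self._fields.get("errata_revision")

    @property
    def extra_fields(self):
        # Anything the tool reported that this API does not name explicitly.
        return {k: v for k, v in self._fields.items() if k not in self.KNOWN}


SAMPLE = """TPM 1.2 Version Info:
  Chip Version: 1.2.4.40
  Spec Level: 2
  TPM Vendor ID: IFX
"""
info = TPMInfo(parse_tool_output(SAMPLE))
```

Because the attributes are ordinary properties, static analysis and documentation tools can see them, while unexpected fields are still captured in extra_fields instead of being lost.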

Running multiple functions in Python

I had a program that read in a text file and took out the necessary variables for serialization into turtle format and storing in an RDF graph. The code I had was crude and I was advised to separate it into functions. As I am new to Python, I had no idea how to do this. Below are some of the functions of the program.
I am getting confused as to when parameters should be passed into the functions and when they should be initialized with self. Here are some of my functions. If I could get an explanation as to what I am doing wrong that would be great.
#!/usr/bin/env python
from rdflib import URIRef, Graph
from StringIO import StringIO
import subprocess as sub
class Wordnet():
    def __init__(self, graph):
        self.graph = Graph()
    def process_file(self, file):
        file = open("new_2.txt", "r")
        return file
    def line_for_loop(self, file):
        for line in file:
            self.split_pointer_part()
            self.split_word_part()
            self.split_gloss_part()
            self.process_lex_filenum()
            self.process_synset_offset()
            +more functions............
        self.print_graph()
    def split_pointer_part(self, before_at, after_at, line):
        before_at, after_at = line.split('@', 1)
        return before_at, after_at
    def get_num_words(self, word_part, num_words):
        """ 1 as default, may want 0 as an invalid case """
        """ do if else statements on l3 variable """
        if word_part[3] == '0a':
            num_words = 10
        else:
            num_words = int(word_part[3])
        return num_words
    def get_pointers_list(self, pointers, after_at, num_pointers, pointerList):
        pointers = after_at.split()[0:0 +4 * num_pointers:4]
        pointerList = iter(pointers)
        return pointerList
    ............code to create triples for graph...............
    def print_graph(self):
        print graph.serialize(format='nt')
def main():
    wordnet = Wordnet()
    my_file = wordnet.process_file()
    wordnet.line_for_loop(my_file)
if __name__ == "__main__":
    main()
Your question is mainly a question about what object oriented programming is. I will try to explain quickly, but I recommend reading a proper tutorial on it like
http://www.voidspace.org.uk/python/articles/OOP.shtml
http://net.tutsplus.com/tutorials/python-tutorials/python-from-scratch-object-oriented-programming/
and/or http://www.tutorialspoint.com/python/python_classes_objects.htm
When you create a class and instantiate it (with mywordnet=Wordnet(somegraph)), you can reuse the mywordnet instance many times. Each variable you set on self. in Wordnet is stored in that instance. So for instance self.graph is always available if you call any method of mywordnet. If you didn't store it in self.graph, you would need to specify it as a parameter in each method (function) that requires it, which would be tedious if all of these method calls require the same graph anyway.
So to look at it another way: everything you set with self. can be seen as a sort of configuration for that specific instance of Wordnet. It influences the Wordnet behaviour. You could for instance have two Wordnet instances, each instantiated with a different graph, but all other functionality the same. That way you can choose which graph to print to, depending on which Wordnet instance you use, but everything else stays the same.
I hope this helps you out a little.
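To make that concrete, here is a tiny sketch. A plain list stands in for the rdflib Graph so the example runs on its own, and add_line is a made-up method used only to show state being shared between methods of one instance.

```python
class Wordnet:
    def __init__(self, graph):
        # Stored once on the instance; every method can reach it via self.
        self.graph = graph

    def add_line(self, line):
        # No graph parameter needed: the method uses the instance's own graph.
        self.graph.append(line.strip())

    def print_graph(self):
        return "\n".join(self.graph)


# Two instances with the same behaviour but separate state.
first = Wordnet([])
second = Wordnet([])
first.add_line("  cat  \n")
first.add_line("dog\n")
```

Values that belong to the object as a whole go on self, while values that only matter for one call (such as line here) stay ordinary parameters.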
First, I suggest you figure out the basic functional decomposition on its own - don't worry about writing a class at all.
For example,
def split_pointer_part(self, before_at, after_at, line):
    before_at, after_at = line.split('@', 1)
    return before_at, after_at
doesn't touch any instance variables (it never refers to self), so it can just be a standalone function.
It also exhibits a peculiarity I see in your other code: you pass two arguments (before_at, after_at) but never use their values. If the caller doesn't already know what they are, why pass them in?
So, a free function should probably look like:
def split_pointer_part(line):
    """get tuple (before @, after @)"""
    return line.split('@', 1)
If you want to put this function in your class scope (so it doesn't pollute the top-level namespace, or just because it's a logical grouping), you still don't need to pass self if it isn't used. You can make it a static method:
@staticmethod
def split_pointer_part(line):
    """get tuple (before @, after @)"""
    return line.split('@', 1)
One thing that would be very helpful for you is a good visual debugger. There's a nice free one for Python called Winpdb. There are also excellent debuggers in the commercial products IntelliJ IDEA/PyCharm, Komodo IDE, WingIDE, and Visual Studio (with the Python Tools add-in). Probably a few others too.
I highly recommend setting up one of these debuggers and running your code under it. It will let you step through your code line by line and see what happens with all your variables and objects.
You may find people who tell you that real programmers don't need or shouldn't use debuggers. Don't listen to them: a good debugger is one of the very best tools to help you learn a new language or to get familiar with a piece of code.

How do I programmatically add new functions to current scope in Python?

In Python it is easy to create new functions programmatically. How would I assign this to programmatically determined names in the current scope?
This is what I'd like to do (in non-working code):
obj_types = ('cat', 'dog', 'donkey', 'camel')
for obj_type in obj_types:
    'create_'+obj_type = lambda id: id
In the above example, the assignment of lambda into a to-be-determined function name obviously does not work. In the real code, the function itself would be created by a function factory.
The background is laziness and do-not-repeat-yourself: I've got a dozen and more object types for which I'd assign a generated function. So the code currently looks like:
create_cat = make_creator('cat')
# ...
create_camel = make_creator('camel')
The functions create_cat etc are used hardcoded in a parser.
If I would create classes as a new type programmatically, types.new_class() as seen in the docs seems to be the solution.
Is it my best bet to (mis)use this approach?
One way to accomplish what you are trying to do (but not create functions with dynamic names) is to store the lambdas in a dict using the name as the key. Instead of calling create_cat() you would call create['cat'](). That would dovetail nicely with not hardcoding names in the parser logic as well.
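A short sketch of the dict approach. The body of make_creator is not shown in the question, so the version here is a hypothetical stand-in that just returns a tuple.

```python
def make_creator(obj_type):
    """Hypothetical function factory: returns a creator bound to obj_type."""
    def creator(obj_id):
        return (obj_type, obj_id)
    return creator


obj_types = ("cat", "dog", "donkey", "camel")

# One dict entry per object type, built in a single expression.
create = {name: make_creator(name) for name in obj_types}

cat = create["cat"](7)
```

Note that each creator is built by calling make_creator, so it captures its own obj_type; writing a bare lambda that refers to the loop variable directly would make every entry see the last value.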
Vaughn Cato points out that one could just assign into locals()[object_type] = factory(object_type). However the Python docs prohibit this: "Note: The contents of this dictionary should not be modified; changes may not affect the values of local and free variables used by the interpreter"
D. Shawley points out that it would be wiser to use a dict() object whose entries would hold the functions. Access would be simple by using create['cat']() in the parser. While this is compelling, I do not like the syntax overhead of the brackets and ticks required.
J.F. Sebastian points to classes. And this is what I ended up with:
# Omitting code of these classes for clarity
class Entity:
    def __init__(self, file_name, line_number):
        # Store location, good for debug, messages, and general indexing
        self.file_name = file_name
        self.line_number = line_number
# The following classes are the real objects to be generated by a parser
# Their constructors must consume whatever data is provided by the tokens
# as well as calling super() to forward the file_name,line_number info.
class Cat(Entity): pass
class Camel(Entity): pass
class Parser:
    def parse_file(self, fn):
        # ...
        # Function factory to wrap object constructor calls
        def create_factory(obj_type):
            def creator(text, line_number, token):
                try:
                    return obj_type(*token,
                                    file_name=fn, line_number=line_number)
                except Exception as e:
                    # For debug of constructor during development
                    print(e)
            return creator
        # Helper class, serving as a 'dictionary' of obj construction functions
        class create: pass
        for obj_type in (Cat, Camel):
            setattr(create,
                    obj_type.__name__.lower(),
                    create_factory(obj_type))
        # Parsing code now can use (again simplified for clarity):
        expression = Keyword('cat').setParseAction(create.cat)
This is helper code for deploying a pyparsing parser. D. Shawley is correct in that the dict would actually make it easier to generate the parser grammar dynamically.

How could you swap out a particular database implementation in python?

Suppose I have a separate class for my db calls, and I create another implementation of the db layer, but with a different data store.
Is there a way for me to completely swap out the implementation without having to change a lot of code?
I am just starting the project, so I can design things properly to achieve this from the get-go.
Note: I will use this pattern for other parts of the site also, not just the db layer, so it's not really specific to the db layer only.
As long as two modules implement exactly the same interface (classes with the same names, methods, and other attributes, functions with the same names and signatures, ...) you can pick one or the other at the time your application is starting up, for example on the basis of some configuration file, and import the chosen one under a fixed name. All the rest of your application can then use that fixed name and, net of the startup code, be blissfully unaware of any shenanigans that may have been done at the start.
For example, consider a simplified case:
# english.py
def greet(): return 'Hello!'
# italian.py
def greet(): return 'Ciao!'
# french.py
def greet(): return 'Salut!'
# config.py
langname = 'italian'
# startit.py
import config
import sys
lang = __import__(config.langname)
sys.modules['lang'] = lang
Now, all the rest of the application can just import lang, and it will be getting under that name the italian module, so, when calling lang.greet(), it will get the string 'Ciao!'.
Of course, in real life you'll have multiple modules, each with multiple functions, classes, and whatnot, but the general principles stay very similar. Just take special care about modules with qualified names (such as foo.bar), i.e., modules which must reside in a package (in this case, foo). For those, you can't just use __import__'s return value, but must use a slightly more roundabout approach, such as:
import sys
def importanyasname(actualname, fakename):
    __import__(actualname)
    sys.modules[fakename] = sys.modules[actualname]
that is, ignore __import__'s return value, and reach right for the value that's left (with the actual name as the key) in the sys.modules dictionary -- that is the module object you seek, and that you can set back into sys.modules with the "fake name" by which all the rest of the application will be able to blissfully import it any time.
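As a variation on the helper above, importlib.import_module returns the requested module itself even for dotted names, so the sys.modules lookup is not needed. The sketch below uses standard library modules as stand-ins for the swappable implementations, and the fake names serializer and pathtools are made up for the example.

```python
import importlib
import sys


def import_any_as_name(actual_name, fake_name):
    """Import actual_name and register it in sys.modules as fake_name."""
    # import_module returns the leaf module, also for names like "os.path".
    module = importlib.import_module(actual_name)
    sys.modules[fake_name] = module
    return module


# At startup, pick the implementations (here: stdlib stand-ins).
import_any_as_name("json", "serializer")
import_any_as_name("os.path", "pathtools")

# Anywhere else in the application, the fixed names can be imported normally.
import serializer
import pathtools

encoded = serializer.dumps([1, 2])
```

This works because the import statement consults sys.modules first, so the fake name resolves to whichever module the startup code registered.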
