Using one dictionary vs. many to store program configurations - python

I am writing a Python program with many (approx. 30-40) parameters, all of which have default values, and all of which should be adjustable at run time by the user. The way I set it up, these parameters are grouped into 4 dictionaries, corresponding to 4 different modules of the program. However, I have encountered a few cases of a single parameter being required by more than one of these modules, leading me to consider just unifying the dictionaries into one big config dictionary, or perhaps even one config object, passed to each module.
My questions are
Would this have any effect on run time? I suspect not, but want to be sure.
Is this considered good practice? Is there some other solution to the problem I have described?

Probably no effect on runtime. Larger dictionaries can take slightly longer to look keys up in, but in your case we are talking about 40 items; that's nothing.
We use a single settings file in which we initialize globals by calling a method that reads the config from the environment, a file, or a Python file (as globals). The method that reads the config can be given the desired type and a default value. Others use YAML or TOML to represent configuration and, I'm guessing, then store it in a globally accessible object. If your settings can be changed at runtime, you have to protect this object in terms of thread-safety (if you have threads, of course).
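For illustration, here is a minimal sketch of that single-settings-module idea; the function name, environment variable names, and defaults below are made up for the example, not taken from anyone's actual setup:
import os

def setting(name, default, cast=str):
    # Read one value from the environment, fall back to the default,
    # and coerce it to the desired type.
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

# settings.py -- a single module imported by every part of the program
TIMEOUT = setting("MYAPP_TIMEOUT", 30, int)
CACHE_DIR = setting("MYAPP_CACHE_DIR", "/tmp/myapp")
VERBOSE = setting("MYAPP_VERBOSE", False, lambda v: v.lower() in ("1", "true", "yes"))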

Related

Equivalent of `Microsoft.Extensions.Configuration` in Python

I really like how Microsoft.Extensions.Configuration works in .NET to manage and merge configuration. Now I'm starting a Python project and I would like to know if there's any package that gives comparable features there. What I especially like is the ability to add a series of JSON files that override one another, and being able to specify some as optional.
On top of that, it gives the ability to override some values from environment variables using some environment variable name convention. That would be the cherry on top, but I can live without it.
Example: I have a default JSON config file (it would be called appsettings.json in .NET), and for deployments, I add another JSON file (would be called appsettings.Deployment.json). When the app starts, it looks for the first file (it's mandatory) and the second file (optionally) and combines the two by overriding values from the first file with values from the second file where it applies. The whole thing is deserialized to some object (I think I can easily handle that in Python).
If this is simply the wrong way of thinking about it in Python and there's some better way of doing mergeable configuration, I would also be glad to learn about that.
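There may be a package that does all of this out of the box (Dynaconf, for instance, advertises layered settings files with environment overrides), but the merging behaviour itself is small enough to sketch by hand. Everything below is illustrative: the MYAPP__ prefix, the double-underscore separator, and the file names are invented conventions rather than any standard.
import json
import os

def load_config(*paths, env_prefix="MYAPP__"):
    config = {}

    def merge(base, override):
        # Recursively overlay one dict onto another, later files winning.
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                merge(base[key], value)
            else:
                base[key] = value

    for i, path in enumerate(paths):
        try:
            with open(path) as f:
                merge(config, json.load(f))
        except FileNotFoundError:
            if i == 0:  # treat the first file as mandatory, like appsettings.json
                raise

    # Environment overrides: MYAPP__Logging__Level=DEBUG -> config["Logging"]["Level"]
    for name, value in os.environ.items():
        if name.startswith(env_prefix):
            *parents, leaf = name[len(env_prefix):].split("__")
            node = config
            for key in parents:
                node = node.setdefault(key, {})
            node[leaf] = value

    return config

settings = load_config("appsettings.json", "appsettings.Deployment.json")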

How can I save dynamically generated modules and reimport them from files?

I have an application that dynamically generates a lot of Python modules with class factories, to eliminate redundant boilerplate that makes the code hard to debug across similar implementations. It works well, except that the dynamic generation of the classes across the modules (hundreds of them) takes more time at load than simply importing from a file would. So I would like to find a way to save the modules to files after generation (unless reset) and then load from those files, to cut down on bootstrap time for the platform.
Does anyone know how I can save/export auto-generated Python modules to a file for re-import later? I already know that pickling or exporting as a JSON object won't work, because the classes make use of thread locks and other dynamic state variables, and because a class must be defined before it can be pickled. I need to save the actual class definitions, not instances. The classes are defined with the type() function.
If you have ideas or knowledge of how to do this, I would really appreciate your input.
You’re basically asking how to write a compiler whose input is a module object and whose output is a .pyc file. (One plausible strategy is of course to generate a .py and then byte-compile that in the usual fashion; the following could even be adapted to do so.) It’s fairly easy to do this for simple cases: the .pyc format is very simple (but note the comments there), and the marshal module does all of the heavy lifting for it. One point of warning that might be obvious: if you’ve already evaluated, say, os.getcwd() when you generate the code, that’s not at all the same as evaluating it when loading it in a new process.
The “only” other task is constructing the code objects for the module and each class: this requires concatenating a large number of boring values from the dis module, and will fail if any object encountered is non-trivial. These might be global/static variables/constants or default argument values: if you can alter your generator to produce modules directly, you can probably wrap all of these (along with anything else you want to defer) in function calls by compiling something like
my_global=(lambda: open(os.devnull,'w'))()
so that you actually emit the function and then a call to it. If you can’t so alter it, you’ll have to have rules to recognize values that need to be constructed in this fashion so that you can replace them with such calls.
Another detail that may be important is closures: if your generator uses local functions/classes, you’ll need to create the cell objects, perhaps via “fake” closures of your own:
def cell(x): return (lambda: x).__closure__[0]
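To make the "generate a .py and then byte-compile it" route concrete, here is a rough sketch; the class names and attribute values are placeholders, and it only works when everything in the class body can be written down as a literal (anything non-trivial needs the deferred-call trick above):
import py_compile

# Invented example data standing in for whatever the class factories produce.
generated = {"TypeA": {"ValOk": 1}, "TypeB": {"ValOk": 1, "ValBusy": 2}}

lines = ["# auto-generated; regenerate instead of editing"]
for cls_name, attrs in generated.items():
    lines.append("class %s(object):" % cls_name)
    for attr, value in attrs.items():
        lines.append("    %s = %r" % (attr, value))
    lines.append("")

with open("generated_types.py", "w") as f:
    f.write("\n".join(lines) + "\n")

py_compile.compile("generated_types.py")  # writes the cached bytecode for you
# On later runs, `import generated_types` loads the bytecode instead of
# rebuilding hundreds of classes with type() at startup.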

In Python, how do I tie an on-disk JSON file to an in-process dictionary?

In Perl there is the idea of the tie operator, where writing to or modifying a variable can run arbitrary code (such as updating some underlying Berkeley DB file). I'm quite sure there is a similar concept of overloading in Python too.
I'm interested to know what the most idiomatic way is to basically consider a local JSON file as the canonical source of needed hierarchical information throughout the running of a python script, so that changes in a local dictionary are automatically reflected in the JSON file. I'll leave it to the OS to optimise writes and cache (I don't mind if the file is basically updated dozens of times throughout the running of the script), but ultimately this is just about a kilobyte of metadata that I'd like to keep around. It's not necessary to address concurrent access to this. I'd just like to be able to access a hierarchical structure (like nested dictionary) within the python process and have reads (and writes to) that structure automatically result in reads from (and changes to) a local JSON file.
Well, since Python itself has no signals and slots, I guess you can instead make your own dictionary class by inheriting from the Python dict. The class behaves exactly like a dict, except that every method that can change the dict's values also dumps your JSON.
You could also use something like PyQt4's QAbstractItemModel, which has signals. When its dataChanged signal is emitted, do your dumping; that way the dump happens in only one place, which is nice.
I know these two are sort of stupid ways, probably, yeah. :) If anyone knows better, go ahead and tell!
This is a development of aspect_mkn8rd's answer taking Gerrat's comments into account, but it is too long for a true comment.
You will need two special container classes emulating a list and a dictionary. In both, you add a pointer to a top-level object and override the following methods:
__setitem__(self, key, value)
__delitem__(self, key)
__reversed__(self)
All these methods are called on modification and should cause the top-level object to be written to disk.
In addition, __setitem__(self, key, value) should check whether value is a list and wrap it into a special list object, or whether it is a dictionary and wrap it into a special dictionary object. In both cases, the method should set the top-level object on the new container. If the value is neither of these but defines __setitem__ itself, the method should raise an exception saying the object is not supported. Of course, you should then modify the method to take such a new class into account.
Of course, there is a good deal of code to write and test, but it should work; a rough sketch follows, with the rest left to the reader as an exercise :-)
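Here is that rough sketch of the two containers, assuming a plain JSON file as the backing store and covering only item assignment and deletion (update(), append(), and friends are left out, as above); the file name is illustrative:
import json

def _wrap(value, root):
    # Replace nested dicts/lists with tracking containers that know the root.
    if isinstance(value, dict):
        return PersistedDict(root, value)
    if isinstance(value, list):
        return PersistedList(root, value)
    return value

class PersistedDict(dict):
    def __init__(self, root=None, data=None):
        super().__init__()
        self._root = self if root is None else root
        for k, v in (data or {}).items():
            super().__setitem__(k, _wrap(v, self._root))

    def __setitem__(self, key, value):
        super().__setitem__(key, _wrap(value, self._root))
        self._root._save()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._root._save()

    def _save(self):
        with open("metadata.json", "w") as f:  # path is illustrative
            json.dump(self, f, indent=2)

class PersistedList(list):
    def __init__(self, root, data=None):
        super().__init__(_wrap(v, root) for v in (data or []))
        self._root = root

    def __setitem__(self, index, value):
        super().__setitem__(index, _wrap(value, self._root))
        self._root._save()

    def __delitem__(self, index):
        super().__delitem__(index)
        self._root._save()

cfg = PersistedDict(data={"servers": [{"host": "a"}]})
cfg["servers"][0]["host"] = "b"  # rewrites metadata.json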
If concurrency is not required, maybe consider writing two functions to read and write the data to a shelve file? Or is the idea to have the dictionary "aware" of changes so it updates the file without that kind of thing?

How does eclipse's pydev do code completion?

Does anyone know how pydev determines what to use for code completion? I'm trying to define a set of classes specifically to enable code completion. I've tried using __new__ to set __dict__ and also __slots__, but neither seems to get listed in pydev autocomplete.
I've got a set of enums I want to list in autocomplete, but I'd like to set them in a generator, not hardcode them all for each class.
So rather than
class TypeA(object):
    ValOk = 1
    ValSomethingSpecificToThisClassWentWrong = 4

    def __call__(self):
        return 42
I'd like to do something like
def TYPE_GEN(name, val, enums={}):
    def call(self):
        return val
    dct = {}
    dct["__call__"] = call
    dct['__slots__'] = enums.keys()
    for k, v in enums.items():
        dct[k] = v
    return type(name, (), dct)
TypeA = TYPE_GEN("TypeA",42,{"ValOk":1,"ValSomethingSpecificToThisClassWentWrong":4})
What can I do to help the processing out?
edit:
The comments seem to be questioning what I am doing. Again, a big part of what I'm after is code completion. I'm using a Python binding to a protocol to talk to various microcontrollers. Each parameter I can change (there are hundreds) conceptually has a name, but over the protocol I need to use its ID, which is effectively random. Many of the parameters accept values that are conceptually named but are again represented by integers. Thus the enums.
I'm trying to autogenerate a Python module for the library, so the group can specify what they want to change using the names instead of the error-prone numbers. The __call__ method will return the ID of the parameter; the enums are the allowable values for the parameter.
Yes, I can generate the verbose version of each class. One line for each type seemed clearer to me, since the point is autocomplete, not viewing these classes.
OK, as pointed out, your code is too dynamic for this... PyDev will only analyze your own code statically (i.e.: code that lives inside your project).
Still, there are some alternatives there:
Option 1:
You can force PyDev to analyze code that's in your library (i.e.: in site-packages) dynamically, in which case it could get that information dynamically through a shell.
To do that, you'd have to create a module in site-packages and in your interpreter configuration you'd need to add it to the 'forced builtins'. See: http://pydev.org/manual_101_interpreter.html for details on that.
Option 2:
Another option would be putting it into your predefined completions (but in this case it also needs to be in the interpreter configuration, not in your code -- and you'd have to make the completions explicit there anyways). See the link above for how to do this too.
Option 3:
Generate the actual code. I believe Cog (http://nedbatchelder.com/code/cog/) is the best alternative for this, as you can write Python code to output the contents of the file and later change that code and rerun Cog to update what's needed. If you want proper completions without having to treat your code as if it were a library in PyDev, I believe this would be the best alternative, and you'd be able to grasp better what you have, as your structure would be explicit there.
Note that cog also works if you're in other languages such as Java/C++, etc. So, it's something I'd recommend adding to your tool set regardless of this particular issue.
Fully general code completion for Python isn't actually possible in an "offline" editor (as opposed to in an interactive Python shell).
The reason is that Python is too dynamic; basically anything can change at any time. If I type TypeA.Val and ask for completions, the system has to know what object TypeA is bound to, what its class is, and what the attributes of both are. All 3 of those facts can change (and do; TypeA starts undefined and is only bound to an object at some specific point during program execution).
So the system would have to know at what point in the program run you want the completions from. And even if there were some unambiguous way of specifying that, there's no general way to know what the state of everything in the program is at that point without actually running it to that point, which you probably don't want your editor to do!
So what pydev does instead is guess, when it's pretty obvious. If you have a class block in a module foo defining class Bar, then it's a safe bet that the name Bar imported from foo is going to refer to that class. And so you know something about what names are accessible under Bar., or on an object created by obj = Bar(). Sure, the program could be rebinding foo.Bar (or altering its set of attributes) at runtime, or could be run in an environment where import foo is hitting some other file. But that sort of thing happens rarely, and the completions are useful in the common case.
What that means though is that you basically lose completions whenever you use "too much" of Python's dynamic language flexibility. Defining a class by calling a function is one of those cases. It's not ready to guess that TypeA has names ValOk and ValSomethingSpecificToThisClassWentWrong; after all, there's presumably lots of other objects that result from calls to TYPE_GEN, but they all have different names.
So if your main goal is to have completions, I think you'll have to make it easy for pydev and write these classes out in full. Of course, you could use similar code to generate the python files (textually) if you wanted. It looks like there's actually more "syntactic overhead" in defining these with dictionaries than as a class, though; you're writing "a": b, per item rather than a = b. Unless you can generate these more systematically or parse existing definition files or something, I think I'd find the static class definition easier to read and write than the dictionary driving TYPE_GEN.
The simpler your code, the more likely completion is to work. Would it be reasonable to have this as a separate tool that generates Python code files containing the class definitions like you have above? This would essentially be the best of both worlds. You could even put the name/value pairs in a JSON or INI file or what have you, eliminating the clutter of the method calls among the name/value pairs. The only downside is needing to run the tool to regenerate the code files when the codes change, but at least that's an automated, simple process.
Personally, I would just go with making things more verbose and writing out the classes manually, but that's just my opinion.
On a side note, I don't see much benefit in making the classes callable vs. just having an id class variable. Both require knowing what to type: TypeA() vs TypeA.id. If you want to prevent instantiation, I think throwing an exception in __init__ would be a bit more clear about your intentions.
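To make the generator-tool suggestion above concrete, here is a rough sketch that reads name/ID/enum definitions from a JSON file and writes out the verbose classes; the input format and file names are invented for illustration:
import json

TEMPLATE = """class {name}(object):
{enums}
    _id = {id}

    def __call__(self):
        return self._id

"""

def generate(defs_path, out_path):
    # defs_path might hold e.g. {"TypeA": {"id": 42, "enums": {"ValOk": 1}}}
    with open(defs_path) as f:
        defs = json.load(f)
    with open(out_path, "w") as out:
        out.write("# auto-generated; regenerate instead of editing\n\n")
        for name, spec in defs.items():
            enums = "\n".join(
                "    {0} = {1!r}".format(k, v) for k, v in spec["enums"].items()
            )
            out.write(TEMPLATE.format(name=name, enums=enums, id=spec["id"]))

generate("parameters.json", "parameters.py")  # then let PyDev analyze parameters.py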

How to avoid computation every time a python module is reloaded

I have a Python module that makes use of a huge dictionary global variable. Currently I put the computation code in the top section, and every first import or reload of the module takes more than one minute, which is totally unacceptable. How can I save the computation result somewhere so that the next import/reload doesn't have to compute it? I tried cPickle, but loading the dictionary variable from a file (1.3M) takes approximately the same time as the computation.
To give more information about my problem,
FD = FreqDist(word for word in brown.words()) # this line of code takes 1 min
Just to clarify: the code in the body of a module is not executed every time the module is imported - it is run only once, after which future imports find the already created module, rather than recreating it. Take a look at sys.modules to see the list of cached modules.
However, if your problem is the time it takes for the first import after the program is run, you'll probably need to use some method other than a Python dict. Probably best would be to use an on-disk form, for instance a sqlite database or one of the dbm modules.
For a minimal change in your interface, the shelve module may be your best option - it puts a pretty transparent interface over the dbm modules that makes them act like an arbitrary Python dict, allowing any picklable value to be stored. Here's an example:
# Create dict with a million items:
import shelve
d = shelve.open('path/to/my_persistant_dict')
d.update(('key%d' % x, x) for x in xrange(1000000))
d.close()
Then in the next process, use it. There should be no large delay, since lookups against the on-disk form are performed only for the key requested, so everything doesn't have to get loaded into memory:
>>> d = shelve.open('path/to/my_persistant_dict')
>>> print d['key99999']
99999
It's a bit slower than a real dict, and it will still take a long time to load if you do something that requires all the keys (eg. try to print it), but may solve your problem.
Calculate your global var on the first use.
class Proxy:
    @property
    def global_name(self):
        # calculate your global var here, enable cache if needed
        ...

_proxy_object = Proxy()
GLOBAL_NAME = _proxy_object.global_name
Or better yet, access the necessary data via a special data object.
class Data:
    GLOBAL_NAME = property(...)

data = Data()
Example:
from some_module import data
print(data.GLOBAL_NAME)
See Django settings.
I assume you've pasted the dict literal into the source, and that's what's taking a minute? I don't know how to get around that, but you could probably avoid instantiating this dict upon import... You could lazily instantiate it the first time it's actually used.
You could try using the marshal module instead of pickle/cPickle; it could be faster. This module is used by Python to store values in a binary format. Note especially the following paragraph, to see if marshal fits your needs:
Not all Python object types are supported; in general, only objects whose value is independent from a particular invocation of Python can be written and read by this module. The following types are supported: None, integers, long integers, floating point numbers, strings, Unicode objects, tuples, lists, sets, dictionaries, and code objects, where it should be understood that tuples, lists and dictionaries are only supported as long as the values contained therein are themselves supported; and recursive lists and dictionaries should not be written (they will cause infinite loops).
Just to be on the safe side, before unmarshalling the dict, make sure that the Python version that unmarshals the dict is the same as the one that did the marshal, since there are no guarantees for backwards compatibility.
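A minimal sketch of that, assuming the frequency data can be flattened into a plain dict of marshal-supported types (marshal knows nothing about FreqDist or other custom classes); the file name is just an example:
import marshal

def save_counts(fd, path="freqdist.marshal"):
    # Store the computed counts in marshal's binary format.
    with open(path, "wb") as f:
        marshal.dump(dict(fd), f)

def load_counts(path="freqdist.marshal"):
    # Read the counts back; must be the same Python version that wrote them.
    with open(path, "rb") as f:
        return marshal.load(f)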
If the 'shelve' solution turns out to be too slow or fiddly, there are other possibilities:
shove
Durus
ZopeDB
pyTables
shelve gets really slow with large data sets. I've been using redis quite successfully, and wrote a FreqDist wrapper around it. It's very fast, and can be accessed concurrently.
You can use shelve to store your data on disk instead of loading the whole data set into memory. Startup time will be very fast, but the trade-off will be slower access time.
Shelve will pickle the dict values too, but it does the (un)pickling not at startup for all the items, only at access time for each item itself.
A couple of things that will help speed up imports:
You might try running Python with the -OO flag. This will do some optimizations that can reduce the import time of modules.
Is there any reason why you couldn't break the dictionary up into smaller dictionaries in separate modules that can be loaded more quickly?
As a last resort, you could do the calculations asynchronously so that they won't delay your program until it needs the results. Or maybe even put the dictionary in a separate process and pass data back and forth using IPC if you want to take advantage of multi-core architectures.
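As a rough sketch of the asynchronous idea (the computation and names here are placeholders, not the original code):
import threading

_result = {}
_ready = threading.Event()

def _compute():
    # Stand-in for the expensive dictionary build.
    _result.update((str(i), i) for i in range(10 ** 6))
    _ready.set()

threading.Thread(target=_compute, daemon=True).start()

def get_dict():
    _ready.wait()  # blocks only if the computation hasn't finished yet
    return _result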
With that said, I agree that you shouldn't be experiencing any delay in importing modules after the first time you import it. Here are a couple of other general thoughts:
Are you importing the module within a function? If so, this can lead to performance problems since it has to check and see if the module is loaded every time it hits the import statement.
Is your program multi-threaded? I have seen occasions where executing code upon module import in a multi-threaded app can cause some wonkiness and application instability (most notably with the cgitb module).
If this is a global variable, be aware that global variable lookup times can be significantly longer than local variable lookup times. In this case, you can achieve a significant performance improvement by binding the dictionary to a local variable if you're using it multiple times in the same context.
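A small illustration of that local-binding point (FD below is just a stand-in for the big module-level dictionary):
FD = {}  # stand-in for the expensive module-level dictionary

def total_count(words):
    fd = FD  # one global lookup here, then every access in the loop is local
    return sum(fd.get(word, 0) for word in words)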
With that said, it's a tad bit difficult to give you any specific advice without a little bit more context. More specifically, where are you importing it? And what are the computations?
Factor the computationally intensive part into a separate module. Then at least on reload, you won't have to wait.
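For example, assuming the FreqDist code from the question (the module names here are invented):
# heavy_data.py -- executed once per process; reloading your main module
# no longer re-runs this computation
from nltk.corpus import brown
from nltk.probability import FreqDist

FD = FreqDist(word for word in brown.words())

# main_module.py
from heavy_data import FD  # cheap after the first import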
Try dumping the data structure using pickle protocol 2. The call to try would be cPickle.dump(FD, f, protocol=2), with f a file object opened in binary mode. From the docstring for cPickle.Pickler:
Protocol 0 is the only protocol that can be written to a file opened in text mode and read back successfully. When using a protocol higher than 0, make sure the file is opened in binary mode, both when pickling and unpickling.
I'm going through this same issue...
shelve, databases, etc... are all too slow for this type of problem. You'll need to take the hit once and insert the data into an in-memory key/value store like Redis. It will just live there in memory (warning: it could use up a good amount of memory, so you may want a dedicated box). You'll never have to reload it, and you'll just be doing in-memory key lookups:
from redis import Redis  # assumes the redis-py package and a running Redis server

r = Redis()
r.set(key, word)
word = r.get(key)
Expanding on the delayed-calculation idea, why not turn the dict into a class that supplies (and caches) elements as necessary?
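A minimal sketch of such a class, with the per-element computation left as a placeholder:
class CachedCounts:
    def __init__(self, compute_one):
        self._compute_one = compute_one  # callable that produces one entry on demand
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            self._cache[key] = self._compute_one(key)
        return self._cache[key]

counts = CachedCounts(lambda word: expensive_lookup(word))  # expensive_lookup is hypothetical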
You might also use psyco to speed up overall execution...
Or you could just use a database to store the values? Check out SQLObject, which makes it very easy to store things in a database.
There's another pretty obvious solution for this problem. When code is reloaded the original scope is still available.
So... doing something like this will make sure this code is executed only once.
try:
    FD
except NameError:
    FD = FreqDist(word for word in brown.words())
