I find myself in need of working with functions and objects that take a large number of variables.
For a specific case, consider a function from a separate module which takes N different variables and then passes them on to newly instantiated objects:
def Function(variables):
    # do something with some of the variables
    object1 = someobject(some_of_the_variables)
    object2 = anotherobject(some_of_the_variables)  # not necessarily the same ones as in object1
While I can just pass a long list of variables, from time to time I find myself making changes to one function, which requires making changes in other functions it might call or objects it might create. Sometimes the list of variables changes a little.
Is there a nice elegant way to pass a large group of variables and maintain flexibility?
I tried using kwargs in the following way:
def Function(**kwargs):
    # rest of the function
and calling Function(**somedict), where somedict is a dictionary whose keys and values are all the variables I need to pass to Function (and maybe some more). But I get an error about undefined global variables.
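For reference, a minimal sketch of how **kwargs behaves (the parameter names below are placeholders): the passed values end up in the kwargs dict and are not injected as local or global variables, which is the usual source of an "undefined name" error like the one described:
def function(**kwargs):
    # kwargs is a plain dict; 'temp' is NOT available as a bare name here
    temp = kwargs['temp']              # explicit lookup works
    tstep = kwargs.get('tstep', 1e-3)  # lookup with a default value
    return temp * tstep

somedict = {'temp': 300, 'tstep': 1e-3, 'unused': 42}
function(**somedict)  # extra keys simply sit unused in kwargs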
Edit1:
I will post the piece of code later, since I am not at home or in the lab now. Until then I will try to explain the situation better.
I have a molecular dynamics simulation which takes a few dozen parameters. A few of the parameters (like the temperature, for example) need to be iterated over. To make good use of the quad-core processor I run the different iterations in parallel. So the code starts with a loop over the different iterations, and at each pass sends the parameters of that iteration to a pool of workers (using the multiprocessing module). It goes something like:
P = mp.Pool(number_of_workers)  # if I remember this line correctly
for iteration in iterations:
    # assign values to the parameters of this iteration
    P.apply_async(run, (list_of_parameters,), callback=some_post_processing)
P.close()
P.join()
The function run takes the list of parameters and generates the simulation objects; each object takes some of the parameters as its attributes.
Edit2:
Here is a version of the problematic function. **kwargs contains all the parameters needed by 'sim', 'lattice' and 'adatom'.
def run(**kwargs):
    """'run' runs a single simulation process.
    j is the index number of the simulation run.
    The code generates an independent random seed for the initial conditions."""
    scipy.random.seed()
    sim = MDF.Simulation(tstep, temp, time, writeout, boundaryxy, boundaryz, relax, insert, lat, savetemp)
    lattice = MDF.Lattice(tstep, temp, time, writeout, boundaryxy, boundaryz, relax, insert, lat, kb, ks, kbs, a, p, q, massL, randinit, initvel, parangle, scaletemp, savetemp, freeze)
    adatom1 = MDF.Adatom(tstep, temp, time, writeout, boundaryxy, boundaryz, relax, insert, lat, ra, massa, amorse, bmorse, r0, z0, name, lattice, samplerate, savetemp, adatomrelax)

    bad = 1
    print('Starting simulation run number %g\nrun' % (j + 1))
    while bad == 1:
        # If the simulation did not complete successfully, run it again.
        bad = sim.timeloop(lattice, adatom1, j)
    print('Starting post processing')

    # Return the temperature and the adatom's trajectory and velocity
    List = [j, lattice.temp, adatom1.traj, adatom1.velocity, lattice.Toptemp,
            lattice.Bottomtemp, lattice.middletemp, lattice.latticetop]
    return List
The cleanest solution is not using that many parameters in a function at all.
You could use setter methods or properties to set each variable separately, storing them as class members to be used by the functions inside that class.
Those functions fill the private variables, and getter methods can be used to retrieve those variables.
An alternative is to use structures (or classes without methods) and in this way create a named group of variables.
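As an illustrative sketch of such a named group (the field names are placeholders, not the actual simulation parameters), using collections.namedtuple so the whole group can be passed as one argument:
from collections import namedtuple

# A "structure": a named group of related variables (field names are hypothetical)
SimParams = namedtuple('SimParams', ['tstep', 'temp', 'time', 'writeout'])

def run(params):
    # Functions now take one object instead of a long list of variables
    print('Running at T = %g with dt = %g' % (params.temp, params.tstep))

run(SimParams(tstep=1e-3, temp=300, time=10.0, writeout=100))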
Putting *args and/or **kwargs as the last items in your function definition’s argument list allows that function to accept an arbitrary number of anonymous and/or keyword arguments.
You would use *args when you're not sure how many arguments might be passed to your function.
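A minimal sketch of what that looks like:
def report(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments
    for value in args:
        print('positional:', value)
    for name, value in kwargs.items():
        print('keyword: %s = %r' % (name, value))

report(1, 2, temp=300, tstep=1e-3)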
You also can group parameters in tuples to make a structure.
This is not as elegant as using structures or get/set methods, but it can usually be applied easily in existing applications without too much rework.
Of course only related parameters should be grouped in a tuple.
E.g. you could have a function called as
value = function_call((car_model, car_type), age, (owner.name, owner.address, owner.telephone))
This does not reduce the number of parameters but adds a bit more structure.
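A sketch of the receiving side under that convention (the function body and field layout are invented for illustration):
def function_call(car, age, owner):
    # Each tuple groups related values; unpack them by position
    car_model, car_type = car
    owner_name, owner_address, owner_telephone = owner
    return '%s (%s), owner %s, age %d' % (car_model, car_type, owner_name, age)

value = function_call(('Model S', 'sedan'), 7, ('Alice', '1 Main St', '555-0100'))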
I have a library that has functions like the following:
class ResultObject:
    def __init__(self):
        self.object_string = []
        ...

    def append(self, other_result_object):
        self.object_string.append(other_result_object)
        ...
def bottom_func_A(option_1, option_2, param_0=def_arg_0, param_1=def_arg_1, ..., param_14=def_arg_14):
    """
    This is a process whose output genuinely depends on 15 input parameters enumerated param_n.
    """
    ...
    return result_object_A

def bottom_func_B(option_1, option_2, param_0=def_arg_0, param_1=def_arg_1, ..., param_14=def_arg_14):
    """
    This is a different process whose output genuinely depends on 15 input parameters
    enumerated param_n. Parameter values and defaults are not shared with bottom_func_A.
    """
    ...
    return result_object_B
def wrapper_A(wrapper_option):
    result_1 = bottom_func_A(True, True)
    if wrapper_option:
        result_2 = bottom_func_A(True, False)
    # Do something with result_1 and result_2 if present...
    ...
    return wrapper_object_A

def wrapper_B(wrapper_option):
    result_1 = bottom_func_B(True, True)
    if wrapper_option:
        result_2 = bottom_func_B(True, False)
    # Do something with result_1 and result_2 if present...
    ...
    return wrapper_object_B

def do_both_A_B(wrapper_option):
    result_A = wrapper_A(wrapper_option)
    result_B = wrapper_B(wrapper_option)
    # Do something with result_A and result_B
    ...
    return wrapper_A_B_object
In the library the bottom_func_X functions do not always have similar signatures; sometimes they have more or fewer "option"-like params and more or fewer "numerical parameter" parameters like param_N. Their bodies are also generic. Similarly, I've shown wrapper_A and wrapper_B with similar structures, but the "wrapper" functions need not have similar structures to each other. Also, there may be a wrapper (like do_both_A_B) that involves bottom functions A, R, G, etc. That is, wrappers aren't 1:1 with bottom_funcs. The only structural consistency is that ALL of these functions (bottom funcs and wrappers at all layers) return instances of the ResultObject class. These result objects get "strung together" into long strings where each node in the string was probably generated by some bottom_func with a set of parameters. Note also that there are many dozens of bottom_func-type functions and of wrappers and super-wrappers that string together many combinations of bottom_funcs. That is, these many-parameter functions are not one-off things; there are many of them in the library.
This is fine and can be made to work. The problem is this: it is very often the case that I want parameters from the bottom functions to be exposed in the signatures of the wrappers at various levels. I see five ways to do this.
(1) Just include additional parameters for inner functions at each layer.
(1a) Only do this on an as-needed basis. If I or someone has a use case for exposing param_4 from bottom_func_A in wrapper_A then add it. The downside of this is you have a hodge-podge of exposed and unexposed parameters so you may find yourself often needing to modify this library rather than just call things from it.
(1b) Do this carte-blanche. Expose all parameters at all layers. Obviously this is a mess because do_both_A_B would need to expose 30 parameters, 15 for each of its bottom layer functions. Most of the code in the library would rapidly become function signatures.
Option (1a) is what I'd say has been de-facto the existing solution in the library.
(2) Don't call bottom_funcs within wrappers; rather, accept results from bottom_funcs in the wrappers. That is, bottom_func_A produces a result_object_A, and this is all wrapper_A requires. So wrapper_A could have a parameter called result_object_A or something. Maybe it could default to None, in which case wrapper_A will indeed call bottom_func_A using the default arguments of bottom_func_A. But if for some reason a caller doesn't want the default result_object_A, they can call bottom_func_A themselves with their desired custom parameters.
(3) Bundle the param_N parameters into some sort of dataclass bundle that can be passed more fluidly (see the sketch after option (5) below). This param bundle would then be exposed at all layers, and a caller could modify it from the default. The downside is that a class or instance of this data bundle would need to be defined for each bottom_func. Each of the bottom_funcs would also now have the job of unbundling the data bundle within its body. The data bundle and the bottom_func would need to cooperate, so if a bottom_func is modified to include some new feature (that requires additional params), the dev would need to make sure to modify the corresponding data bundle. I think in practice this would have a similar effect on the codebase as option (2). That is, the default data bundle would be used if callers of the wrappers don't choose to modify the bundle, but if they like, they can do the work of generating a custom bundle and passing that through. In effect this isn't much different from calling bottom_func_A with the caller's desired parameters and passing the result through.
(4) Similar to option (3), but the "bundle" could just be a kwargs dictionary that gets passed under an appropriate name through the different layers. That is, do_both_A_B could have something like bottom_func_A_kwargs as an input parameter that exposes the ability to modify one or both calls to bottom_func_A inside it.
(5) The nested parameters could be turned into some sort of global variables whose values can be modified in wrappers or by callers. These could be Python global variables, or entries in some sort of database module/file, or an actual database. This is a pretty nasty idea for a few reasons, but in practice this sort of solution has arisen in this codebase as well. The major downside that comes to mind first is that a caller might modify a param value and forget to set it back to its default, but that's not to say there aren't other big ones.
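A minimal sketch of option (3) with three stand-in parameters instead of fifteen (every name here is hypothetical); option (4) would look the same except that the bundle is a plain dict passed on as **kwargs:
from dataclasses import dataclass, replace

@dataclass
class BottomFuncAParams:
    param_0: float = 1.0
    param_1: float = 2.0
    param_2: float = 3.0

def bottom_func_A(option_1, option_2, params=None):
    # The bottom function unbundles its own parameter bundle
    params = params if params is not None else BottomFuncAParams()
    return (params.param_0 + params.param_1) * params.param_2 if option_1 else 0.0

def wrapper_A(wrapper_option, a_params=None):
    # The bundle is exposed at this layer without repeating all 15 parameters
    return bottom_func_A(True, wrapper_option, params=a_params)

# A caller overrides only the parameters it cares about
custom = replace(BottomFuncAParams(), param_1=10.0)
print(wrapper_A(True, a_params=custom))  # (1.0 + 10.0) * 3.0 = 33.0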
Which of these patterns (or other patterns I haven't listed) most aptly solves the parameter explosion problem that arises from wanting to expose the ability to modify the lowest level parameters at higher layers? Perhaps to some degree the issue is that I want a very high degree of flexibility in what this code can do so I just have to pay the price of having tons of code dedicated to handling parameters?
I've read through the documentation, but I don't understand what is meant by:
The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.
I'm using it to iterate over the list I want to operate on (allImages) as follows:
def joblib_loop():
    return Parallel(n_jobs=8)(delayed(getHog)(i) for i in allImages)
This returns my HOG features, like I want (and with the speed gain using all my 8 cores), but I'm just not sure what it is actually doing.
My Python knowledge is alright at best, and it's very possible that I'm missing something basic. Any pointers in the right direction would be most appreciated.
Perhaps things become clearer if we look at what would happen if instead we simply wrote
Parallel(n_jobs=8)(getHog(i) for i in allImages)
which, in this context, could be expressed more naturally as:
Create a Parallel instance with n_jobs=8
evaluate getHog(i) for every image in allImages, in the calling process
pass the resulting values to the Parallel instance
What's the problem? By the time those values reach the Parallel object, every getHog(i) call has already returned - so there's nothing left for Parallel to execute! All the work was already done sequentially, in the main process.
What we actually want is to tell Python what functions we want to call with what arguments, without actually calling them - in other words, we want to delay the execution.
This is what delayed conveniently allows us to do, with clear syntax. If we want to tell Python that we'd like to call foo(2, g=3) sometime later, we can simply write delayed(foo)(2, g=3). What is returned is the tuple (foo, [2], {'g': 3}), containing:
a reference to the function we want to call, e.g. foo
all positional arguments (short: "args") passed without a keyword, e.g. 2
all keyword arguments (short: "kwargs"), e.g. g=3
So, by writing Parallel(n_jobs=8)(delayed(getHog)(i) for i in allImages), instead of the above sequence, now the following happens:
A Parallel instance with n_jobs=8 gets created
The list
[delayed(getHog)(i) for i in allImages]
gets created, evaluating to
[(getHog, [img1], {}), (getHog, [img2], {}), ... ]
That list is passed to the Parallel instance
The Parallel instance creates 8 workers (processes or threads, depending on the backend) and distributes the tuples from the list to them
Finally, each of those workers executes the tuples it receives, i.e. it calls the first element with the second and third elements unpacked as arguments, tup[0](*tup[1], **tup[2]), turning the tuple back into the call we actually intended to do: getHog(img2).
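A quick way to see this for yourself, as a minimal sketch with a trivial stand-in for getHog:
from joblib import Parallel, delayed

def square(x):
    return x * x

# delayed(square)(3) does not call square; it just packages the call
print(delayed(square)(3))   # roughly: (square, (3,), {})

# Parallel later unpacks and executes those packaged calls in its workers
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
print(results)              # [0, 1, 4, 9, 16]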
We need a loop to test a list of different model configurations. This is the main function that drives the grid search and calls the score_model() function for each model configuration. We can dramatically speed up the grid search by evaluating the model configurations in parallel. One way to do that is to use the Joblib library. We can define a Parallel object with the number of cores to use, setting it to the number of cores detected in your hardware:
# define the executor
executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
Then create a list of tasks to execute in parallel: one call to the score_model() function for each model configuration we have. Suppose
def score_model(data, n_test, cfg):
    ...
# define the list of tasks
tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
Finally, we can use the Parallel object to execute the list of tasks in parallel:
scores = executor(tasks)
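Putting these pieces together, a self-contained sketch might look like this (score_model, data, n_test and cfg_list are simplified placeholders for whatever the real grid search evaluates):
from multiprocessing import cpu_count
from joblib import Parallel, delayed

def score_model(data, n_test, cfg):
    # Placeholder scoring: the real function would fit and evaluate a model
    return sum(data[:-n_test]) * cfg

if __name__ == '__main__':
    data = list(range(100))
    n_test = 10
    cfg_list = [0.1, 0.5, 1.0, 2.0]

    executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
    tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
    scores = executor(tasks)
    print(scores)  # one score per configuration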
So what you want to be able to do is pile up a set of function calls and their arguments in such a way that you can pass them out efficiently to a scheduler/executor. delayed is a wrapper that takes a function and its args and packages them into an object that can be put in a list and popped off as needed. Dask has the same thing, which it uses in part to feed its graph scheduler.
From reference https://wiki.python.org/moin/ParallelProcessing
The Parallel object creates a multiprocessing pool that forks the Python interpreter in multiple processes to execute each of the items of the list. The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.
Another thing I would like to suggest: instead of explicitly defining the number of cores, we can generalise like this:
import multiprocessing
num_core = multiprocessing.cpu_count()
I was thinking about parts of my class APIs, and one thing that came up was the following: should I use a tuple/list of equal attributes, or should I use several attributes? For example, let's say I've got a Controller class which reads several thermometers.
class Controller(object):
    def __init__(self):
        self.temperature1 = Thermometer()
        self.temperature2 = Thermometer()
        self.temperature3 = Thermometer()
        self.temperature4 = Thermometer()
vs.
class Controller(object):
    def __init__(self):
        self.temperature = tuple(Thermometer() for _ in range(4))
Is there a best practice when I should use which style?
(Let's assume the number of Thermometers will not be changed, otherwise choosing the second style with a list would be obvious.)
A tuple or list, 100%. variable1, variable2, etc. is a really common anti-pattern.
Think about how you'll use the code later - it's likely you'll want to do similar things to these items. In a data structure you can loop over them to perform operations; with numbered variable names you have to do it manually. Not only that, but a data structure makes it easier to add more values, makes your code more generic and therefore more reusable, and means you can add new values mid-execution easily.
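For instance, a minimal sketch reusing the Controller example (the read() method on Thermometer is assumed for illustration):
class Thermometer(object):
    def read(self):
        return 21.5  # stub reading; a real sensor driver would go here

class Controller(object):
    def __init__(self):
        self.temperatures = tuple(Thermometer() for _ in range(4))

    def read_all(self):
        # one loop replaces four copy-pasted statements
        return [t.read() for t in self.temperatures]

print(Controller().read_all())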
Why make the assumption the number will not be changed? More often than not, assumptions like that end up being wrong. Regardless, you can already see that the second example exemplifies the do not repeat yourself idiom that is central to clear, efficient code.
Even if you had more relevant names, e.g. cpu_temperature and hdd_temperature, I would say that if you ever see yourself performing the same operations on them, you want a data structure, not lots of variables. In this case, a dictionary:
temperatures = {
    "cpu": ...,
    "hdd": ...,
    ...
}
The main thing is that by storing the data in a data structure, you are giving the software information about the grouping. If you just use separate variable names, you are only telling the programmer(s) - and if the names are merely numbered, you are not even really telling the programmers what they are.
Another option is to store them as a dictionary:
{1: temp1, 2: temp2}
The most important thing in deciding how to store data is relaying the data's meaning: if these items are essentially the same information in a slightly different context, then they should be grouped (in terms of data type) to relay that - i.e. they should be stored as either a tuple or a dictionary.
Note: if you use a tuple and then later insert more data, e.g. a temp0 at the beginning, then there could be backwards-compatibility issues wherever you've grabbed individual variables. (With a dictionary, temp[1] will always return temp1.)
Out of curiosity: is it more desirable to explicitly pass functions to other functions, or to let a function call other functions from within? Is this a case of "explicit is better than implicit"?
For example (the following is only to illustrate what I mean):
def foo(x, y):
    return 1 if x > y else 0

partialfun = functools.partial(foo, 1)

def bar(xs, ys):
    return partialfun(sum(map(operator.mul, xs, ys)))

>>> bar([1, 2, 3], [4, 5, 6])
--or--
def foo(x, y):
    return 1 if x > y else 0

partialfun = functools.partial(foo, 1)

def bar(fn, xs, ys):
    return fn(sum(map(operator.mul, xs, ys)))

>>> bar(partialfun, [1, 2, 3], [4, 5, 6])
There's not really any difference between functions and anything else in this situation. You pass something as an argument if it's a parameter that might vary over different invocations of the function. If the function you are calling (bar in your example) is always calling the same other function, there's no reason to pass that as an argument. If you need to parameterize it so that you can use many different functions (i.e., bar might need to call many functions besides partialfun, and needs to know which one to call), then you need to pass it as an argument.
Generally, yes, but as always, it depends. What you are illustrating here is known as dependency injection. Generally, it is a good idea, as it allows separation of variability from the logic of a given function. This means, for example, that it will be extremely easy for you to test such code.
# To test the process performed in bar(), we can "inject" a function
# which simply returns its argument
def dummy(x):
    return x

def bar(fn, xs, ys):
    return fn(sum(map(operator.mul, xs, ys)))

>>> assert bar(dummy, [1, 2, 3], [4, 5, 6]) == 32
It depends very much on the context.
Basically, if the function is an argument to bar, then it's the responsibility of the caller to know how to implement that function. bar doesn't have to care. But consequently, bar's documentation has to describe what kind of function it needs.
Often this is very appropriate. The obvious example is the map builtin function. map implements the logic of applying a function to each item in a list, and giving back a list of results. map itself neither knows nor cares about what the items are, or what the function is doing to them. map's documentation has to describe that it needs a function of one argument, and each caller of map has to know how to implement or find a suitable function. But this arrangement is great; it allows you to pass a list of your custom objects, and a function which operates specifically on those objects, and map can go away and do its generic thing.
But often this arrangement is inappropriate. A function gives a name to a high level operation and hides the internal implementation details, so you can think of the operation as a unit. Allowing part of its operation to be passed in from outside as a function parameter exposes that it works in a way that uses that function's interface.
A more concrete (though somewhat contrived) example may help. Let's say I've implemented data types representing Person and Job, and I'm writing a function name_and_title for formatting someone's full name and job title into a string, for client code to insert into email signatures or on letterhead or whatever. It's obviously going to take a Person and a Job. It could potentially take a function parameter to let the caller decide how to format the person's name: something like lambda firstname, lastname: lastname + ', ' + firstname. But to do this is to expose that I'm representing people's names with a separate first name and last name. If I want to change to supporting a middle name, then either name_and_title won't be able to include the middle name, or I have to change the type of the function it accepts. When I realise that some people have 4 or more names and decide to change to storing a list of names, then I definitely have to change the type of function name_and_title accepts.
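A rough sketch of that example (the Person/Job fields and the formatter signature are invented for illustration):
class Person(object):
    def __init__(self, firstname, lastname):
        self.firstname = firstname
        self.lastname = lastname

class Job(object):
    def __init__(self, title):
        self.title = title

def name_and_title(person, job, format_name=None):
    # Accepting format_name exposes the first-name/last-name representation;
    # changing how names are stored would break every custom formatter.
    if format_name is None:
        format_name = lambda first, last: first + ' ' + last
    return format_name(person.firstname, person.lastname) + ', ' + job.title

print(name_and_title(Person('Ada', 'Lovelace'), Job('Analyst')))
print(name_and_title(Person('Ada', 'Lovelace'), Job('Analyst'),
                     lambda first, last: last + ', ' + first))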
So for your bar example, we can't say which is better, because it's an abstract example with no meaning. It depends on whether the call to partialfun is an implementation detail of whatever bar is supposed to be doing, or whether the call to partialfun is something that the caller knows about (and might want to do something else). If it's "part of" bar, then it shouldn't be a parameter. If it's "part of" the caller, then it should be a parameter.
It's worth noting that bar could have a huge number of function parameters. You call sum, map, and operator.mul, which could all be parameterised to make bar more flexible:
def bar(fn, xs, ys, g, h, i):
    return fn(g(h(i, xs, ys)))
And the way in which g is called on the output of h could be abstracted too:
def bar(fn, xs, ys, g, h, i, j):
    return fn(j(g, h(i, xs, ys)))
And we can keep going on and on, until bar doesn't do anything at all, and everything is controlled by the functions passed in, and the caller might as well have just directly done what they want done rather than writing 100 functions to do it and passing those to bar to execute the functions.
So there really isn't a definite answer one way or the other that applies all the time. It depends on the particular code you're writing.
Say I have a method to create a dictionary from the given parameters:
def newDict(a, b, c, d):
    # in reality this method is a bit more complex; I've shortened it for the sake of simplicity
    return {"x": a,
            "y": b,
            "z": c,
            "t": d}
And I have another method that calls newDict method each time it is executed. Therefore, at the end, when I look at my cProfiler I see something like this:
17874 calls (17868 primitive) 0.076 CPU seconds
and of course, my newDict method is called 1785 times. Now, my question is whether I can memorize the newDict method so that I reduce the time spent in those calls? (Just to make sure: the variables change in almost every call, though I'm not sure whether that has an effect on memorizing the function.)
Sub-question: I believe that 17k calls are too many and the code is not efficient. But by looking at the stats, can you also please state whether this is a normal result, or whether I have too many calls and the code is slow?
You mean memoize not memorize.
If the values are almost always different, memoizing won't help, it will slow things down.
Without seeing your full code, and knowing what it's supposed to do, how can we know if 17k calls is a lot or a little?
If by memorizing you mean memoizing, use functools.lru_cache. It's a function decorator.
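For example, a minimal sketch (lru_cache only pays off when the same arguments recur, and the arguments must be hashable; newDict_key here is a hypothetical variant that returns an immutable tuple, so the cached value cannot be mutated by one caller and surprise another):
from functools import lru_cache

@lru_cache(maxsize=None)
def newDict_key(a, b, c, d):
    # Returns a tuple rather than a dict, because cached results are shared
    return (("x", a), ("y", b), ("z", c), ("t", d))

newDict_key(1, 2, 3, 4)          # computed
newDict_key(1, 2, 3, 4)          # served from the cache
print(newDict_key.cache_info())  # hits=1, misses=1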
The purpose of memoizing is to save a result of an operation that was expensive to perform so that it can be provided a second, third, etc., time without having to repeat the operation and repeatedly incur the expense.
Memoizing is normally applied to a function that (a) performs an expensive operation, (b) always produces the same result given the same arguments, and (c) has no side effects on the program state.
Memoizing is typically implemented within such a function by 'saving' the result along with the values of the arguments that produced that result. This is a special form of the general concept of a cache. Each time the function is called, the function checks its memo cache to see if it has already determined the result that is appropriate for the current values of the arguments. If the cache contains the result, it can be returned without the need to recompute it.
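A hand-rolled sketch of that mechanism (functools.lru_cache does essentially this, plus an eviction policy):
def memoize(func):
    cache = {}

    def wrapper(*args):
        # Check the memo cache, keyed by the argument values
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def slow_square(x):
    print('computing %d**2' % x)  # only printed on a cache miss
    return x * x

slow_square(4)  # computed
slow_square(4)  # returned from the cache, no print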
Your function appears to be intended to create a new dict each time it is called. There does not appear to be a sensible way to memoize this function: you always want a new dict returned to the caller so that its use of the dict it receives does not interfere with some other call to the function.
The only way I can visualize using memoizing would be if (1) the computation of one or more of the values placed into the result is expensive (in which case I would probably define a function that computes the value and memoize that function), or (2) the newDict function is intended to return the same collection of values given a particular set of argument values. In the latter case I would not use a dict but would instead use a non-modifiable object (e.g., a class that behaves like a dict but with protections against modifying its contents).
Regarding your subquestion, the questions you need to ask are (1) is the number of times newDict is being called appropriate and (2) can the execution time of each execution of newDict be reduced. These are two separate and independent questions that need to be individually addressed as appropriate.
BTW your function definition has a typo in it -- the return should not have a 'd' between the return keyword and the open brace.