Question: What are the pros and cons of writing an __init__ that takes a collection directly as an argument, rather than unpacking its contents?
Context: I'm writing a class to process data from several fields in a database table. I iterate through some large (~100 million rows) query result, passing one row at a time to a class that performs the processing. Each row is retrieved from the database as a tuple (or optionally, as a dictionary).
Discussion: Assume I'm interested in exactly three fields, but what gets passed into my class depends on the query, and the query is written by the user. The most basic approach might be one of the following:
class Direct:
    def __init__(self, names):
        self.names = names

class Simple:
    def __init__(self, names):
        self.name1 = names[0]
        self.name2 = names[1]
        self.name3 = names[2]

class Unpack:
    def __init__(self, names):
        self.name1, self.name2, self.name3 = names
Here are some examples of rows that might be passed to a new instance:
good = ('Simon', 'Marie', 'Kent') # Exactly what we want
bad1 = ('Simon', 'Marie', 'Kent', '10 Main St') # Extra field(s) behind
bad2 = ('15', 'Simon', 'Marie', 'Kent') # Extra field(s) in front
bad3 = ('Simon', 'Marie') # Forgot a field
When faced with the above, Direct always runs (at least to this point) but is very likely to be buggy (GIGO). It takes one argument and assigns it exactly as given, so this could be a tuple or list of any size, None, a function reference, etc. This is the most quick-and-dirty way I can think of to initialize the object, but I feel like the class should complain immediately when I give it data it's clearly not designed to handle.
Simple handles bad1 correctly, is buggy when given bad2, and throws an error when given bad3. It's convenient to be able to effectively truncate the inputs from bad1 but not worth the bugs that would come from bad2. This one feels naive and inconsistent.
Unpack seems like the safest approach, because it throws an error in all three "bad" cases. The last thing we want to do is silently fill our database with bad information, right? It takes the tuple directly, but allows me to identify its contents as distinct attributes instead of forcing me to keep referring to indices, and complains if the tuple is the wrong size.
On the other hand, why pass a collection at all? Since I know I always want three fields, I can define __init__ to explicitly accept three arguments, and unpack the collection using the *-operator as I pass it to the new object:
class Explicit:
    def __init__(self, name1, name2, name3):
        self.name1 = name1
        self.name2 = name2
        self.name3 = name3

names = ('Guy', 'Rose', 'Deb')
e = Explicit(*names)
The only differences I see are that the __init__ definition is a bit more verbose and we raise TypeError instead of ValueError when the tuple is the wrong size. Philosophically, it seems to make sense that if we are taking some group of data (a row of a query) and examining its parts (three fields), we should pass a group of data (the tuple) but store its parts (the three attributes). So Unpack would be better.
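To make the exception difference concrete, here is a minimal sketch that feeds wrong-sized rows to both classes. The helper error_name is hypothetical and exists only for this illustration.

```python
class Unpack:
    def __init__(self, names):
        self.name1, self.name2, self.name3 = names

class Explicit:
    def __init__(self, name1, name2, name3):
        self.name1 = name1
        self.name2 = name2
        self.name3 = name3

def error_name(factory, row):
    """Hypothetical helper: name of the exception raised for this row, or None."""
    try:
        factory(row)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None

bad_rows = [
    ('Simon', 'Marie', 'Kent', '10 Main St'),  # one field too many
    ('Simon', 'Marie'),                        # one field too few
]
for row in bad_rows:
    # Unpack fails during tuple unpacking; Explicit fails at call time.
    print(error_name(Unpack, row), error_name(lambda r: Explicit(*r), row))
```

Both rows are rejected either way; only the exception type differs, which matters if the calling code catches one of them specifically.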
If I wanted to accept an indeterminate number of fields, rather than always three, I still have the choice to pass the tuple directly or use arbitrary argument lists (*args, **kwargs) and *-operator unpacking. So I'm left wondering, is this a completely neutral style decision?
This question is probably best answered by trying out the different approaches and seeing what makes the most sense to you and is the most easily understood by others reading your code.
Now that I have the benefit of more experience, I'd ask myself, how do I plan to access these values?
When I access any one of the values in this collection, am I likely to be using most or all of the values in that same subroutine or section of code? If so, the "Direct" approach is a good choice; it's the most compact and it lets me think about the collection as a collection until the point that I absolutely need to pay attention to what's inside.
On the other hand, if I'm using some values here and some values there, I don't want to have to constantly remember which index to access, or add verbosity in the form of dictionary keys, when I could just refer directly to the values using separately named attributes. I would probably avoid the "Direct" approach in this case so that I only have to think about the fact that there's a collection when the class is first initialized.
Each of the remaining approaches involves splitting the collection up into different attributes, and I think the clear winner here is the "Explicit" approach. The "Simple" and "Unpack" approaches share a hidden dependency on the order of the collection, without offering any real advantage.
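One way the "Explicit" approach can sidestep the ordering dependency is with rows fetched as dictionaries, which the question mentions as an option. This is only a sketch, and it assumes the column names happen to match the parameter names name1, name2 and name3.

```python
class Explicit:
    def __init__(self, name1, name2, name3):
        self.name1 = name1
        self.name2 = name2
        self.name3 = name3

# Assumed row shape: a dict keyed by column name, in arbitrary order.
row = {'name3': 'Deb', 'name1': 'Guy', 'name2': 'Rose'}
e = Explicit(**row)  # keyword unpacking matches by name, not by position
print(e.name1, e.name2, e.name3)

# An unexpected column is rejected instead of being silently accepted.
try:
    Explicit(**dict(row, address='10 Main St'))
except TypeError as exc:
    print('rejected:', exc)
```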
Related
I have a REST API and want to write a wrapper around it in Python for others to use. It's a search API, and all parameters are combined with AND.
Example:
api/search/v1/search_parameters[words]=cat cute fluffy&search_parameters[author_id]=12345&search_parameters[filter][orientation]=horizontal
What's the most Pythonic way to write a function that takes all these arguments and requires at least one search_parameters key and value?
My wrapper function would look something like the one below, but I'm lost as to how the user can supply multiple search parameters for this search API call:
def search(self):
    url = BASE_URL + search_param_url
    response = self.session.get(url)
    return response.json()
In the end, users should be able to just call something like api.search()
Disclaimer: questions about the most Pythonic (best/prettiest) way can attract unnecessary discussion (and create a distraction), yielding inconclusive results. My personal recommendation, rather than reusing a recommendation from a particular part of the community, would be above all: be consistent across your code and in how you design your interfaces. Think of those who will use them (including yourself 12 months down the road). Also, "the best" solution is usually a function of the intended purpose and not necessarily a universal constant (even though there might be more or less recommendable ways). That said:
If I understand correctly, your parameters are key=value pairs (and you will expand them into the URL as search_parameters[key]=value), even though the filter and orientation in your example throw me off. If that's not true, please describe a bit more and I can revisit my suggestion. For that, a dictionary seems like a good choice. To get one, your method could be either:
def search(self, search_kwargs):
    ...
And you expect your user to pass a dict of parameters (args_dict = {'string': 'xxx', ...}; c.search(args_dict)). Or:
def search(self, **kwargs):
    ...
And you expect your user to pass key/value pairs as keyword arguments of the method (c.search(string='xxx')). I would probably favor the former option. A dict is flexible when you prepare the parameters (and yes, you could also pass a dict in the latter case, but that kind of defeats the purpose of keyword argument expansion; always choose the simpler option that achieves the same goal).
In any case, you can just take the dict (my_args stands for either one of the two above). Check you have at least one of the required keys:
if not ('string' in my_args or 'value' in my_args):
    raise SearchParamsError("Require 'string' or 'value'.")
Perform any other sanity checks. Prepare params to be appended to the URL:
url_params = '&'.join('{}={}'.format(k, my_args[k]) for k in my_args)
That's the trivial stuff. But depending on your needs and usage, you may actually introduce, for example, a SearchRequest class whose constructor could take an initial set of parameters similar to the method described above, but with further method(s) that let you manipulate the search (add more parameters) before executing it. Each parameter addition could already be subject to a validity check. You could make the instance callable to execute the search itself (corresponding method) or pass it to a search method that takes a prepared request as its argument.
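A rough sketch of what such a SearchRequest class might look like follows. The method names add and query_string and the validation rule are assumptions made for illustration, and the sketch only builds the query string: it neither URL-encodes values nor sends a request.

```python
class SearchParamsError(ValueError):
    """Raised when a search parameter is missing or malformed."""

class SearchRequest:
    def __init__(self, **params):
        self._params = {}
        for key, value in params.items():
            self.add(key, value)

    def add(self, key, value):
        # Each addition is validated on the spot (hypothetical rule).
        if not key or value is None:
            raise SearchParamsError('Invalid parameter: {!r}={!r}'.format(key, value))
        self._params[key] = value
        return self  # returning self allows chained calls

    def query_string(self):
        if not self._params:
            raise SearchParamsError('At least one search parameter is required.')
        return '&'.join('search_parameters[{}]={}'.format(k, v)
                        for k, v in self._params.items())

request = SearchRequest(words='cat cute fluffy').add('author_id', 12345)
print(request.query_string())
```

A search method could then accept the prepared request and append request.query_string() to the base URL.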
Updated based on a bit more insight from the comments.
If your API actually uses (arbitrarily) nested mapping objects, dictionary is still a good structure to hold your parameters. I'd pick one of the two options.
You can use nested dictionaries, which might afford you flexibility in describing the request and could more accurately reflect how your REST API understands its data: the way you form your request is more similar to how the REST API describes it. However, using the keyword arguments mentioned above is no longer an option (or not without extra work similar to the next option and some more translation). And the structure of the data might make using it less convenient (especially in simple cases). E.g.:
my_dict = {'string': 'foo bar',
           'author_id': 12345,
           'filter': {'orientation': 'horizontal',
                      'other': 'baz'},
           'other': {'more': {'nested': 1,
                              'also': 2},
                     'less': 'flat'}}

def par_dict_format(in_dict, *, _pfx='search_parameters'):
    ret = []
    for key, value in in_dict.items():
        if isinstance(value, dict):
            # Recurse, extending the prefix with this key.
            ret.append(par_dict_format(value, _pfx='{}[{}]'.format(_pfx, key)))
        else:
            ret.append('{}[{}]={}'.format(_pfx, key, value))
    return '&'.join(ret)
Or you can opt for a structure of flat key/value pairs introducing notation using reasonable and non-conflicting separator for individual elements. Depending on the separator used, you could even get keyword arguments back into play (not with the . in my example though). One of the downsides is, you effectively create a new/parallel interface and notation. E.g.:
my_dict = {'string': 'foo bar',
           'author_id': 12345,
           'filter.orientation': 'horizontal',
           'filter.other': 'baz',
           'other.more.nested': 1,
           'other.more.also': 2,
           'other.less': 'flat'}

def par_dict_format(in_dict):
    ret = []
    for key, value in in_dict.items():
        # 'a.b.c' becomes '[a][b][c]'
        key_str = ''.join('[{}]'.format(p) for p in key.split('.'))
        ret.append('{}={}'.format(key_str, value))
    return '&'.join('search_parameters{}'.format(i) for i in ret)
My take on these two: if I mostly construct the query programmatically (for instance, having different methods to launch different queries), I'd lean toward nested dictionaries. If the expected usage is geared more towards people writing queries directly, calling the search method, or perhaps even exposing it through a CLI, the latter (flat) structure could be easier to use and write.
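As a quick check that the two notations describe the same request, the sketch below runs both formatters on equivalent inputs. The functions are renamed nested_format and flat_format here (hypothetical names) only so that both fit in one block.

```python
def nested_format(in_dict, *, _pfx='search_parameters'):
    ret = []
    for key, value in in_dict.items():
        if isinstance(value, dict):
            ret.append(nested_format(value, _pfx='{}[{}]'.format(_pfx, key)))
        else:
            ret.append('{}[{}]={}'.format(_pfx, key, value))
    return '&'.join(ret)

def flat_format(in_dict):
    ret = []
    for key, value in in_dict.items():
        key_str = ''.join('[{}]'.format(p) for p in key.split('.'))
        ret.append('search_parameters{}={}'.format(key_str, value))
    return '&'.join(ret)

nested = {'string': 'foo bar', 'filter': {'orientation': 'horizontal'}}
flat = {'string': 'foo bar', 'filter.orientation': 'horizontal'}

print(nested_format(nested))
print(nested_format(nested) == flat_format(flat))
```

Neither function URL-encodes keys or values, so a real wrapper would still need that step before sending the request.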
For legibility purposes, I would like to have a custom class that behaves exactly like a dict (but carries a meaningful type instead of the more general dict type):
class Derivatives(dict):
    "Dictionary that represents the derivatives."
Now, is there a way of building new objects of this class in a way that does not involve copies? The naive usage
derivs = Derivatives({var: 1}) # var is a Python object
in fact creates a copy of the dictionary passed as an argument, which I would like to avoid, for efficiency reasons.
I tried to bypass the copy but then the class of the dict cannot be changed, in CPython:
class Derivatives(dict):
    def __new__(cls, init_dict):
        init_dict.__class__ = cls  # Fails with __class__ assignment: only for heap types
        return init_dict
I would like to have both the ability to give an explicit class name to the dictionaries that the program manipulates and an efficient way of building such dictionaries (instead of being forced to copy a Python dict). Is this doable efficiently in Python?
PS: The use case is maybe 100,000 creations of single-key Derivatives, where the key is a variable (not a string, so no keyword initialization). This is actually not slow, so "efficiency reasons" here means more something like "elegance": there is ideally no need to waste time doing a copy when the copy is not needed. So, in this particular case the question is more about the elegance/clarity that Python can bring here than about running speed.
By inheriting from dict you are given three possibilities for constructor arguments (barring the {} literal):
class dict(**kwarg)
class dict(mapping, **kwarg)
class dict(iterable, **kwarg)
This means that, in order to instantiate your instance you must do one of the following:
Pass the variables as keywords D(x=1) which are then packed into an intermediate dictionary anyway.
Create a plain dictionary and pass it as a mapping.
Pass an iterable of (key,value) pairs.
So in all three of these cases you will need to create intermediate objects to satisfy the dict constructor.
The third option, for a single pair, would look like D(((var,1),)), which I strongly recommend against for readability's sake.
So if you want your class to inherit from a dictionary, using Derivatives({var: 1}) is your most efficient and most readable option.
As a personal note, if you will have thousands of single-pair dictionaries, I'm not sure a dict is the best setup in the first place; you may want to reconsider the basis of your class.
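A small sketch of the constructor forms discussed above, assuming a hypothetical Variable class as a stand-in for the question's variable objects. It suggests that the mapping and iterable forms build equal dictionaries, that the result is a new object rather than the argument itself, and that the key object is shared, not copied.

```python
class Derivatives(dict):
    "Dictionary that represents the derivatives."

class Variable:
    """Hypothetical stand-in for the question's variable objects."""

var = Variable()
source = {var: 1}

from_mapping = Derivatives(source)    # mapping form
from_pairs = Derivatives([(var, 1)])  # iterable-of-pairs form

print(from_mapping == from_pairs)       # same contents
print(from_mapping is source)           # a new container was built
print(next(iter(from_mapping)) is var)  # but the key object is shared
```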
TL;DR: There's no general-purpose way to do it unless you do it in C.
Long answer:
The dict class is implemented in C. Thus, there is no way to access its internal properties, and most importantly its internal hash table, unless you use C.
In C, you could simply copy the pointer representing the hash table into your object without having to iterate over the dict (key, value) pairs and insert them into your object. (Of course, it's a bit more complicated than this. Note that I omit memory management details).
Longer answer:
I'm not sure why you are concerned about efficiency.
Python passes arguments as references. It rarely ever copies unless you explicitly tell it to.
I read in the comments that you can't use named parameters, as the keys are actual Python objects. That leads me to understand that you're worried about copying the dict keys (and maybe values). However, even the dictionary keys are not copied; they are passed by reference! Consider this code:
class Test:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return self.x

t = Test(1, 2)
print(t.y)  # prints 2
d = {t: 1}
print(d[t])  # prints 1
keys = list(d.keys())
keys[0].y = 10
print(t.y)  # prints 10! No copying was made when inserting object into dictionary.
Thus, the only remaining area of concern is iterating through the dict and inserting the values in your Derivatives class. This is unavoidable, unless you can somehow set the internal hash table of your class to the dict's internal hash table. There is no way to do this in pure python, as the dict class is implemented in C (as mentioned above).
Note that others have suggested using generators. This seems like a good idea too, say if you were reading the derivatives from a file or generating them with a simple formula. It would avoid creating the dict object in the first place. However, there will be no noticeable improvement in efficiency if the generators are just wrappers around lists (or any other data structure that can contain an arbitrary set of values).
Your best bet is to stick with your original method. Generators are great, but they can't efficiently represent an arbitrary set of values (which might be the case in your scenario). It's also not worth it to do it in C.
EDIT: It might be worth it to do it in C, after all!
I'm not too big on the details of the Python C API, but consider defining a class in C, for example DerivativesBase (deriving from dict). All you do is define an __init__ function in C for DerivativesBase that takes a dict as a parameter and copies the hash table pointer from the dict into your DerivativesBase object. Then, in Python, your Derivatives class derives from DerivativesBase and implements the bulk of the functionality.
I need a container that can collect a number of objects and provides some reporting functionality on the container's elements. Essentially, I'd like to be able to do:
magiclistobject = MagicList()
magiclistobject.report() ### generates all my needed info about the list content
So I thought of subclassing the normal list and adding a report() method. That way, I get to use all the built-in list functionality.
class SubClassedList(list):
    def __init__(self):
        list.__init__(self)

    def report(self):  # forgive the silly example
        if 999 in self:
            print("999 Alert!")
Instead, I could also create my own class that has a magiclist attribute but I would then have to create new methods for appending, extending, etc., if I want to get to the list using:
magiclistobject.append() # instead of magiclistobject.list.append()
I would need something like this (which seems redundant):
class MagicList():
    def __init__(self):
        self.list = []

    def append(self, element):
        self.list.append(element)

    def extend(self, element):
        self.list.extend(element)

    # more list functionality as needed...

    def report(self):
        if 999 in self.list:
            print("999 Alert!")
I thought that subclassing the list would be a no-brainer. But this post here makes it sound like a no-no. Why?
One reason why extending list might be bad is that it ties your 'MagicReport' object too closely to the list. For example, a Python list supports the following methods:
append
count
extend
index
insert
pop
remove
reverse
sort
It also contains a whole host of other operations (adding, comparisons using < and >, slicing, etc).
Are all of those operations things that your 'MagicReport' object actually wants to support? For example, the following is legal Python:
b = [1, 2]
b *= 3
print(b)  # [1, 2, 1, 2, 1, 2]
This is a pretty contrived example, but if you inherit from 'list', your 'MagicReport' object will do exactly the same thing if somebody inadvertently does something like this.
As another example, what if you try slicing your MagicReport object?
m = MagicReport()
# Add stuff to m
slice = m[2:3]
print(type(slice))
You'd probably expect the slice to be another MagicReport object, but it's actually a list. You'd need to override __getslice__ (Python 2; in Python 3, slices go through __getitem__) in order to avoid surprising behavior, which is a bit of a pain.
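For Python 3, where slicing goes through __getitem__, the override might look like the sketch below. The report method here is only a placeholder, and other operations such as multiplication would need the same treatment.

```python
class MagicReport(list):
    def __getitem__(self, index):
        result = super().__getitem__(index)
        if isinstance(index, slice):
            # Re-wrap slices so they keep the subclass type.
            return type(self)(result)
        return result

    def report(self):
        # Placeholder report: is the alert value present?
        return 999 in self

m = MagicReport([1, 999, 3, 4])
print(type(m[1:3]).__name__)  # the override preserves the subclass
print(m[1:3].report())
print(type(m * 2).__name__)   # still a plain list: __mul__ is not overridden
```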
It also makes it harder for you to change the implementation of your MagicReport object. If you end up needing to do more sophisticated analysis, it often helps to be able to change the underlying data structure into something more suited for the problem.
If you subclass list, you could get around this problem by just providing new append, extend, etc methods so that you don't change the interface, but you won't have any clear way of determining which of the list methods are actually being used unless you read through the entire codebase. However, if you use composition and just have a list as a field and create methods for the operations you support, you know exactly what needs to be changed.
I actually ran into a scenario very similar to yours at work recently. I had an object which contained a collection of 'things' which I first internally represented as a list. As the requirements of the project changed, I ended up changing the object to internally use a dict, a custom collections object, then finally an OrderedDict in rapid succession. At least in my experience, composition makes it much easier to change how something is implemented as opposed to inheritance.
That being said, I think extending list might be ok in scenarios where your 'MagicReport' object is legitimately a list in all but name. If you do want to use MagicReport as a list in every single way, and don't plan on changing its implementation, then it just might be more convenient to subclass list and just be done with it.
Though in that case, it might be better to just use a list and write a 'report' function -- I can't imagine you needing to report the contents of the list more than once, and creating a custom object with a custom method just for that purpose might be overkill (though this obviously depends on what exactly you're trying to do)
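The "plain list plus a function" alternative could be as small as the following sketch. Returning the alert messages instead of printing them is a choice made here for illustration, not something from the question.

```python
def report(items):
    """Return alert messages for the contents of any list-like container."""
    alerts = []
    if 999 in items:  # same silly check as in the question
        alerts.append("999 Alert!")
    return alerts

readings = [1, 2, 999]
print(report(readings))
print(report([1, 2, 3]))
```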
As a general rule, whenever you ask yourself "should I inherit or have a member of that type", choose not to inherit. This rule of thumb is known as "favour composition over inheritance".
The reason why this is so is: composition is appropriate where you want to use features of another class; inheritance is appropriate if other code needs to use the features of the other class with the class you are creating.
I was thinking about parts of my class APIs and one thing that came up was the following:
Should I use a tuple/list of equal attributes or should I use several attributes? For example, let's say I've got a Controller class which reads several thermometers.
class Controller(object):
    def __init__(self):
        self.temperature1 = Thermometer()
        self.temperature2 = Thermometer()
        self.temperature3 = Thermometer()
        self.temperature4 = Thermometer()
vs.
class Controller(object):
    def __init__(self):
        self.temperature = tuple(Thermometer() for _ in range(4))
Is there a best practice when I should use which style?
(Let's assume the number of Thermometers will not be changed, otherwise choosing the second style with a list would be obvious.)
A tuple or list, 100%. variable1, variable2, etc... is a really common anti-pattern.
Think about how you will code later: it's likely you'll want to do similar things to these items. With a data structure, you can loop over them to perform operations; with numbered variable names, you'll have to do it manually. Not only that, but it makes it easier to add more values, it makes your code more generic and therefore more reusable, and it means you can add new values mid-execution easily.
Why make the assumption the number will not be changed? More often than not, assumptions like that end up being wrong. Regardless, you can already see that the second example exemplifies the do not repeat yourself idiom that is central to clear, efficient code.
Even if you had more relevant names eg: cpu_temperature, hdd_temperature, I would say that if you ever see yourself performing the same operations on them, you want a data structure, not lots of variables. In this case, a dictionary:
temperatures = {
"cpu": ...,
"hdd": ...,
...
}
The main thing is that by storing the data in a data structure, you are giving the software the information about the grouping you are providing. If you just give them the variable names, you are only telling the programmer(s) - and if they are numbered, then you are not even really telling the programmer(s) what they are.
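To illustrate what the grouping buys you, here is a sketch with a hypothetical Thermometer class whose read() method returns a fixed value. With the sensors in a dictionary, one loop or one built-in call covers all of them.

```python
class Thermometer:
    """Hypothetical sensor; read() returns a fixed value for illustration."""
    def __init__(self, value):
        self._value = value

    def read(self):
        return self._value

temperatures = {
    "cpu": Thermometer(61.5),
    "hdd": Thermometer(38.0),
}

# One comprehension reads every sensor, however many there are.
readings = {name: sensor.read() for name, sensor in temperatures.items()}
hottest = max(readings, key=readings.get)
print(readings)
print(hottest)
```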
Another option is to store them as a dictionary:
{1: temp1, 2: temp2}
The most important thing in deciding how to store data is relaying the data's meaning. If these items are essentially the same information in a slightly different context, then they should be grouped (in terms of data type) to relay that, i.e. they should be stored as either a tuple or a dictionary.
Note: if you use a tuple and then later insert more data, e.g. a temp0 at the beginning, then there could be backwards-compatibility issues where you've grabbed individual variables. (With a dictionary, temp[1] will always return temp1.)
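The index shift described in the note can be seen in a few lines. The string values stand in for the thermometer objects.

```python
temps_tuple = ('temp1', 'temp2')
temps_dict = {1: 'temp1', 2: 'temp2'}

first_before = temps_tuple[0]  # code written against the old layout

# Later, a new sensor is inserted at the front.
temps_tuple = ('temp0',) + temps_tuple
temps_dict[0] = 'temp0'

print(first_before, temps_tuple[0])  # the same index now means something else
print(temps_dict[1])                 # the key still maps to temp1
```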
I have some functions in my code that accept either an object or an iterable of objects as input. I was taught to use meaningful names for everything, but I am not sure how to comply here. What should I call a parameter that can be a single object or an iterable of objects? I have come up with two ideas, but I don't like either of them:
FooOrManyFoos - This expresses what goes on, but I could imagine that someone not used to it could have trouble understanding what it means right away
param - Some generic name. This makes clear that it can be several things, but explains nothing about what the parameter is used for.
Normally I call iterables of objects just the plural of what I would call a single object. I know this might seem a little bit compulsive, but Python is supposed to be (among other things) about readability.
I have some functions in my code that accept either an object or an iterable of objects as input.
This is a very exceptional and often very bad thing to do. It's trivially avoidable.
i.e., pass [foo] instead of foo when calling this function.
The only time you can justify doing this is when (1) you have an installed base of software that expects one form (iterable or singleton) and (2) you have to expand it to support the other use case. So. You only do this when expanding an existing function that has an existing code base.
If this is new development, Do Not Do This.
I have come up with two ideas, but I don't like either of them:
[Only two?]
FooOrManyFoos - This expresses what goes on, but I could imagine that someone not used to it could have trouble understanding what it means right away
What? Are you saying you provide NO other documentation, and no other training? No support? No advice? Who is the "someone not used to it"? Talk to them. Don't assume or imagine things about them.
Also, don't use Leading Upper Case Names.
param - Some generic name. This makes clear that it can be several things, but does explain nothing about what the parameter is used for.
Terrible. Never. Do. This.
I looked in the Python library for examples. Most of the functions that do this have simple descriptions.
http://docs.python.org/library/functions.html#isinstance
isinstance(object, classinfo)
They call it "classinfo" and it can be a class or a tuple of classes.
You could do that, too.
You must consider the common use case and the exceptions. Follow the 80/20 rule.
80% of the time, you can replace this with an iterable and not have this problem.
In the remaining 20% of the cases, you have an installed base of software built around an assumption (either iterable or single item) and you need to add the other case. Don't change the name, just change the documentation. If it used to say "foo" it still says "foo" but you make it accept an iterable of "foo's" without making any change to the parameters. If it used to say "foo_list" or "foo_iter", then it still says "foo_list" or "foo_iter" but it will quietly tolerate a singleton without breaking.
80% of the code is the legacy ("foo" or "foo_list")
20% of the code is the new feature ("foo" can be an iterable or "foo_list" can be a single object.)
I guess I'm a little late to the party, but I'm surprised that nobody suggested a decorator.
def withmany(f):
    def many(many_foos):
        for foo in many_foos:
            yield f(foo)
    f.many = many
    return f

@withmany
def process_foo(foo):
    return foo + 1

foo, foos = 1, [1, 2, 3]  # example inputs
processed_foo = process_foo(foo)
for processed_foo in process_foo.many(foos):
    print(processed_foo)
I saw a similar pattern in one of Alex Martelli's posts but I don't remember the link off hand.
It sounds like you're agonizing over the ugliness of code like:
def ProcessWidget(widget_thing):
    # Infer if we have a singleton instance and make it a
    # length 1 list for consistency
    if isinstance(widget_thing, WidgetType):
        widget_thing = [widget_thing]
    for widget in widget_thing:
        #...
My suggestion is to avoid overloading your interface to handle two distinct cases. I tend to write code that favors re-use and clear naming of methods over clever dynamic use of parameters:
def ProcessOneWidget(widget):
    #...

def ProcessManyWidgets(widgets):
    for widget in widgets:
        ProcessOneWidget(widget)
Often, I start with this simple pattern, but then have the opportunity to optimize the "Many" case when there are efficiencies to gain that offset the additional code complexity and partial duplication of functionality. If this convention seems overly verbose, one can opt for names like "ProcessWidget" and "ProcessWidgets", though the difference between the two is a single easily missed character.
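A runnable sketch of the one/many split follows. It assumes, purely for illustration, that a widget is a string and that processing means upper-casing it, and it uses snake_case names instead of the names above.

```python
def process_one_widget(widget):
    # Placeholder processing step (assumption): upper-case the widget name.
    return widget.upper()

def process_many_widgets(widgets):
    # The "many" case simply reuses the "one" case.
    return [process_one_widget(widget) for widget in widgets]

print(process_one_widget('gear'))
print(process_many_widgets(['gear', 'cog']))
```

If the "many" case later gains a faster batch implementation, only process_many_widgets changes and callers are unaffected.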
You can use *args magic (varargs) to make your params always be iterable.
Pass a single item or multiple known items as normal function args like func(arg1, arg2, ...) and pass iterable arguments with an asterisk before, like func(*args)
Example:
# magic *args function
def foo(*args):
    print(args)

# many ways to call it
foo(1)
foo(1, 2, 3)
args1 = (1, 2, 3)
args2 = [1, 2, 3]
args3 = iter((1, 2, 3))
foo(*args1)
foo(*args2)
foo(*args3)
Can you name your parameter in a very high-level way? people who read the code are more interested in knowing what the parameter represents ("clients") than what their type is ("list_of_tuples"); the type can be defined in the function documentation string, which is a good thing since it might change, in the future (the type is sometimes an implementation detail).
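As an illustration of that naming advice, the hypothetical function below names its parameter after what it represents and leaves the type to the docstring.

```python
def greet(clients):
    """Build a greeting for each client.

    clients: iterable of (name, email) tuples. The concrete container
    type is an implementation detail and may change.
    """
    return ['Hello {} <{}>'.format(name, email) for name, email in clients]

print(greet([('Ann', 'ann@example.com')]))
```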
I would do 1 thing,
def myFunc(manyFoos):
    if not isinstance(manyFoos, (list, tuple)):
        manyFoos = [manyFoos]
    #do stuff here
so then you don't need to worry anymore about its name.
In a function, you should aim for one action: accept the same parameter type and return the same type every time.
Instead of filling the functions with ifs you could have 2 functions.
Since you don't care exactly what kind of iterable you get, you could try to get an iterator for the parameter using iter(). If iter() raises a TypeError exception, the parameter is not iterable, so you then create a list or tuple of the one item, which is iterable and Bob's your uncle.
def doIt(foos):
    try:
        iter(foos)
    except TypeError:
        foos = [foos]
    for foo in foos:
        pass  # do something here
The only problem with this approach is if foo is a string. A string is iterable, so passing in a single string rather than a list of strings will result in iterating over the characters in a string. If this is a concern, you could add an if test for it. At this point it's getting wordy for boilerplate code, so I'd break it out into its own function.
def iterfy(iterable):
    if isinstance(iterable, basestring):
        iterable = [iterable]
    try:
        iter(iterable)
    except TypeError:
        iterable = [iterable]
    return iterable

def doIt(foos):
    for foo in iterfy(foos):
        pass  # do something
Unlike some of those answering, I like doing this, since it eliminates one thing the caller could get wrong when using your API. "Be conservative in what you generate but liberal in what you accept."
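The iterfy helper above uses basestring, which exists only in Python 2. A Python 3 variant might look like this sketch; treating bytes as a single item as well is an assumption added here.

```python
def iterfy(value):
    # Strings (and, by assumption, bytes) count as single items,
    # even though they are technically iterable.
    if isinstance(value, (str, bytes)):
        return [value]
    try:
        iter(value)
    except TypeError:
        return [value]  # not iterable: wrap the single item
    return value

print(iterfy('abc'))
print(iterfy(5))
print(iterfy([1, 2]))
```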
To answer your original question, i.e. what you should name the parameter, I would still go with "foos" even though you will accept a single item, since your intent is to accept a list. If it's not iterable, that is technically a mistake, albeit one you will correct for the caller since processing just the one item is probably what they want. Also, if the caller thinks they must pass in an iterable even of one item, well, that will of course work fine and requires very little syntax, so why worry about correcting their misapprehension?
I would go with a name explaining that the parameter can be an instance or a list of instances. Say one_or_more_Foo_objects. I find it better than the bland param.
I'm working on a fairly big project now and we're passing maps around and just calling our parameter map. The map contents vary depending on the function that's being called. This probably isn't the best situation, but we reuse a lot of the same code on the maps, so copying and pasting is easier.
I would say that instead of naming it for what it is, you should name it for what it's used for. Also, be careful: you can't use in on something that is not iterable.