Custom class which is a dict, but initialized without dict copy? - python

For legibility purposes, I would like to have a custom class that behaves exactly like a dict (but carries a meaningful type instead of the more general dict type):
class Derivatives(dict):
"Dictionary that represents the derivatives."
Now, is there a way of building new objects of this class in a way that does not involve copies? The naive usage
derivs = Derivatives({var: 1}) # var is a Python object
in fact creates a copy of the dictionary passed as an argument, which I would like to avoid, for efficiency reasons.
I tried to bypass the copy but then the class of the dict cannot be changed, in CPython:
class Derivatives(dict):
def __new__(cls, init_dict):
init_dict.__class__ = cls # Fails with __class__ assignment: only for heap types
return init_dict
I would like to have both the ability to give an explicit class name to the dictionaries that the program manipulates and an efficient way of building such dictionaries (instead of being forced to copy a Python dict). Is this doable efficiently in Python?
PS: The use case is maybe 100,000 creations of single-key Derivatives, where the key is a variable (not a string, so no keyword initialization). This is actually not slow, so "efficiency reasons" here means more something like "elegance": there is ideally no need to waste time doing a copy when the copy is not needed. So, in this particular case the question is more about the elegance/clarity that Python can bring here than about running speed.

By inheriting from dict you are given three possibilities for constructor arguments: (baring the {} literal)
class dict(**kwarg)
class dict(mapping, **kwarg)
class dict(iterable, **kwarg)
This means that, in order to instantiate your instance you must do one of the following:
Pass the variables as keywords D(x=1) which are then packed into an intermediate dictionary anyway.
Create a plain dictionary and pass it as a mapping.
Pass an iterable of (key,value) pairs.
So in all three of these cases you will need to create intermediate objects to satisfy the dict constructor.
The third option for a single pair it would look like D(((var,1),)) which I highly recommend against for readability sake.
So if you want your class to inherit from a dictionary, using Derivatives({var: 1}) is your most efficient and most readable option.
As a personal note if you will have thousands of single pair dictionaries I'm not sure how the dict setup is the best in the first place, you may just reconsider the basis of your class.

TL;DR: There's not general-purpose way to do it unless you do it in C.
Long answer:
The dict class is implemented in C. Thus, there is no way to access it's internal properties - and most importantly, it's internal hash table, unless you use C.
In C, you could simply copy the pointer representing the hash table into your object without having to iterate over the dict (key, value) pairs and insert them into your object. (Of course, it's a bit more complicated than this. Note that I omit memory management details).
Longer answer:
I'm not sure why you are concerned about efficiency.
Python passes arguments as references. It rarely every copies unless you explicitly tell it to.
I read in the comments that you can't use named parameters, as the keys are actual Python objects. That leaves me to understand that you're worried about copying the dict keys (and maybe values). However, even the dictionary keys are not copied, and passed by reference! Consider this code:
class Test:
def __init__(self, x, y):
self.x = x
self.y = y
def __hash__(self):
return self.x
t = Test(1, 2)
print(t.y) # prints 2
d = {t: 1}
print(d[t]) # prints 1
keys = list(d.keys())
keys[0].y = 10
print(t.y) # prints 10! No copying was made when inserting object into dictionary.
Thus, the only remaining area of concern is iterating through the dict and inserting the values in your Derivatives class. This is unavoidable, unless you can somehow set the internal hash table of your class to the dict's internal hash table. There is no way to do this in pure python, as the dict class is implemented in C (as mentioned above).
Note that others have suggested using generators. This seems like a good idea too - say if you were reading the derivatives from a file or if you were generating them with a simple formula. It would avoid creating the dict object in the first place. However, there will be no noticable improvements in efficiency if the generators are just wrappers around lists (or any other data structure that can contain an arbritary set of values).
Your best bet is do stick with your original method. Generators are great, but they can't efficiently represent an arbritary set of values (which might be the case in your scenario). It's also not worth it to do it in C.
EDIT: It might be worth it to do it in C, after all!
I'm not too big on the details of the Python C API, but consider defining a class in C, for example,DerivativesBase (deriving from dict). All you do is define an __init__ function in C for DerivativesBase that takes a dict as a parameter and copies the hash table pointer from the dict into your DerivativesBase object. Then, in python, your Derivatives class derives from DerivativesBase and implements the bulk of the functionality.

Related

Sampling dictionaries in Python 3.x

In Python 3, dict_values, dict_keys and dict_items do not support indexing
my_dict = {'a': 0', 'b': 1', 'c': 2}
All of the queries below fail:
my_dict.keys()[1]
my_dict.values()[1]
my_dict.items()[1]
for that reason.
Sometimes I just want to get a random sample of what's in my dictionary. I know I can convert them their output to lists. Do they have any other getter methods that do not require creating another data structure? (I would also imagine that converting them to a list would create a copy, which may not work well for huge dictionaries).
Sometimes I just want to get a random sample of what's in my dictionary. I know I can convert them their output to lists. Do they have any other getter methods that do not require creating another data structure? (I would also imagine that converting them to a list would create a copy, which may not work well for huge dictionaries).
The key types are explained under Dictionary view objects, and also guaranteed to be subclasses of collections.abc.KeysView and friends. Basically, this means you can only count on them having __contains__, __iter__, and __len__.
They don't directly support indexing because their ordering can be invalidated.* But practically, in any implementation of Python, they're only actually invalidated if you mutate the dictionary. Which means you can safely do things like this:
next(itertools.islice(my_dict.keys(), i, None))
Basically, the same way you'd index a set, or any other non-iterator iterable.
* The actual rules as to what behavior is documented have changed a few times. The current version actually says "They provide a dynamic view on the dictionary’s entries, which means that when the dictionary changes, the view reflects these changes," which implies the practical rule can now be relied on. But even if you're using an older version that, e.g., explicitly only guarantees consistency between adjacent calls to keys, values, items, and related functions, unless you're worried about someone writing a new implementation of Python 2.6 or 3.1 or something, there's no reason to worry about that.
Of course you probably want to wrap that up in a function that's more readable. In fact, I'd do it in two steps. First, use the nth function from the itertools recipes:
def nth(iterable, n, default=None):
return next(itertools.islice(iterable, n, None), default)
Then wrap up the key indexing:
def getkey(mapping, index, default=None):
return nth(mapping.keys(), index, default)
What if you want a random sample? Well, dictionary views are Sized, as are dictionaries themselves, so you can always use randrange:
def choosekey(mapping):
return getkey(mapping, random.randrange(len(mapping)))
If you just want a key, value or item, use next() and iter():
next(iter(my_dict))
next(iter(my_dict.values()))
next(iter(my_dict.items()))

Should I subclass list or create class with list as attribute?

I need a container that can collect a number of objects and provides some reporting functionality on the container's elements. Essentially, I'd like to be able to do:
magiclistobject = MagicList()
magiclistobject.report() ### generates all my needed info about the list content
So I thought of subclassing the normal list and adding a report() method. That way, I get to use all the built-in list functionality.
class SubClassedList(list):
def __init__(self):
list.__init__(self)
def report(self): # forgive the silly example
if 999 in self:
print "999 Alert!"
Instead, I could also create my own class that has a magiclist attribute but I would then have to create new methods for appending, extending, etc., if I want to get to the list using:
magiclistobject.append() # instead of magiclistobject.list.append()
I would need something like this (which seems redundant):
class MagicList():
def __init__(self):
self.list = []
def append(self,element):
self.list.append(element)
def extend(self,element):
self.list.extend(element)
# more list functionality as needed...
def report(self):
if 999 in self.list:
print "999 Alert!"
I thought that subclassing the list would be a no-brainer. But this post here makes it sounds like a no-no. Why?
One reason why extending list might be bad is since it ties together your 'MagicReport' object too closely to the list. For example, a Python list supports the following methods:
append
count
extend
index
insert
pop
remove
reverse
sort
It also contains a whole host of other operations (adding, comparisons using < and >, slicing, etc).
Are all of those operations things that your 'MagicReport' object actually wants to support? For example, the following is legal Python:
b = [1, 2]
b *= 3
print b # [1, 2, 1, 2, 1, 2]
This is a pretty contrived example, but if you inherit from 'list', your 'MagicReport' object will do exactly the same thing if somebody inadvertently does something like this.
As another example, what if you try slicing your MagicReport object?
m = MagicReport()
# Add stuff to m
slice = m[2:3]
print type(slice)
You'd probably expect the slice to be another MagicReport object, but it's actually a list. You'd need to override __getslice__ in order to avoid surprising behavior, which is a bit of a pain.
It also makes it harder for you to change the implementation of your MagicReport object. If you end up needing to do more sophisticated analysis, it often helps to be able to change the underlying data structure into something more suited for the problem.
If you subclass list, you could get around this problem by just providing new append, extend, etc methods so that you don't change the interface, but you won't have any clear way of determining which of the list methods are actually being used unless you read through the entire codebase. However, if you use composition and just have a list as a field and create methods for the operations you support, you know exactly what needs to be changed.
I actually ran into a scenario very similar to your at work recently. I had an object which contained a collection of 'things' which I first internally represented as a list. As the requirements of the project changed, I ended up changing the object to internally use a dict, a custom collections object, then finally an OrderedDict in rapid succession. At least in my experience, composition makes it much easier to change how something is implemented as opposed to inheritance.
That being said, I think extending list might be ok in scenarios where your 'MagicReport' object is legitimately a list in all but name. If you do want to use MagicReport as a list in every single way, and don't plan on changing its implementation, then it just might be more convenient to subclass list and just be done with it.
Though in that case, it might be better to just use a list and write a 'report' function -- I can't imagine you needing to report the contents of the list more than once, and creating a custom object with a custom method just for that purpose might be overkill (though this obviously depends on what exactly you're trying to do)
As a general rule, whenever you ask yourself "should I inherit or have a member of that type", choose not to inherit. This rule of thumb is known as "favour composition over inheritance".
The reason why this is so is: composition is appropriate where you want to use features of another class; inheritance is appropriate if other code needs to use the features of the other class with the class you are creating.

Referencing Python list elements without specifying indexes

How can I store values in a list without specifying index numbers?
For example
outcomeHornFive=5
someList = []
someList.append(outComeHornFive)
instead of doing this,
someList[0] # to reference horn five outcome
how can i do something like this? The reason is there are many items that I need to reference within the list and I just think it's really inconvenient to keep track of which index is what.
someList.hornFive
You can use another data structure if you'd like to reference things by attribute access (or otherwise via a name).
You can put them in a dict, or create a class, or do something else. It depends what kind of other interaction you want to have with that object.
(P.S., we call those lists, not arrays).
Instead of using a list you can use a dictionary.
See data types in the python documentation.
A dictionary allows you to lookup a value using a key:
my_dict["HornFive"] = 20
You cannot and you shouldn't. If you could do that, how would you refer to the list itself? And you will need to refer to the list itself.
The reason is there are many items that i need to reference within the list and I just think it's really inconvenient to keep track of which index is what.
You'll need to do something of that ilk anyway, no matter how you organize your data. If you had separate variables, you'd need to know which variable stores what. If you had your way with this, you'd still need to know that a bare someList refers to "horn five" and not to, say, "horn six".
One advantage of lists and dicts is that you can factor out this knowledge and write generic code. A dictionary, or even a custom class (if there is a finite number of semantically distinct attributes, and you'd never have to use it as a collection), may help with the readability by giving it an actual name instead of a numeric index.
referenced from http://parand.com/say/index.php/2008/10/13/access-python-dictionary-keys-as-properties/
Say you want to access the values if your dictionary via the dot notation instead of the dictionary syntax. That is, you have:
d = {'name':'Joe', 'mood':'grumpy'}
And you want to get at “name” and “mood” via
d.name
d.mood
instead of the usual
d['name']
d['mood']
Why would you want to do this? Maybe you’re fond of the Javascript Way. Or you find it more aesthetic. In my case I need to have the same piece of code deal with items that are either instances of Django models or plain dictionaries, so I need to provide a uniform way of getting at the attributes.
Turns out it’s pretty simple:
class DictObj(object):
def __init__(self, d):
self.d = d
def __getattr__(self, m):
return self.d.get(m, None)
d = DictObj(d)
d.name
# prints Joe
d.mood
# prints grumpy

python: unintentionally modifying parameters passed into a function

A few times I accidentally modified the input to a function. Since Python has no constant references, I'm wondering what coding techniques might help me avoid making this mistake too often?
Example:
class Table:
def __init__(self, fields, raw_data):
# fields is a dictionary with field names as keys, and their types as value
# sometimes, we want to delete some of the elements
for field_name, data_type in fields.items():
if some_condition(field_name, raw_data):
del fields[field_name]
# ...
# in another module
# fields is already initialized here to some dictionary
table1 = Table(fields, raw_data1) # fields is corrupted by Table's __init__
table2 = Table(fields, raw_data2)
Of course the fix is to make a copy of the parameter before I change it:
def __init__(self, fields, raw_data):
fields = copy.copy(fields)
# but copy.copy is safer and more generally applicable than .copy
# ...
But it's so easy to forget.
I'm half thinking to make a copy of each argument at the beginning of every function, unless the argument potentially refers to a large data set which may be expensive to copy or unless the argument is intended to be modified. That would nearly eliminate the problem, but it would result in a significant amount of useless code at the start of each function. In addition, it would essentially override Python's approach of passing parameters by reference, which presumably was done for a reason.
First general rule: don't modify containers: create new ones.
So don't modify your incoming dictionary, create a new dictionary with a subset of the keys.
self.fields = dict( key, value for key, value in fields.items()
if accept_key(key, data) )
Such methods are typically slightly more efficient then going through and deleting the bad elements anyways. More generally, its often easier to avoid modifying objects and instead create new ones.
Second general rule: don't modify containers after passing them off.
You can't generally assume that containers to which you have passed data have made their own copies. As result, don't try to modify the containers you've given them. Any modifications should be done before handing off the data. Once you've passed the container to somebody else you are no longer the sole master of it.
Third general rule: don't modify containers you didn't create.
If you get passed some sort of container, you do not know who else might be using the container. So don't modify it. Either use the unmodified version or invoke rule1, creating a new container with the desired changes.
Fourth general rule: (stolen from Ethan Furman)
Some functions are supposed to modify the list. That is their job. If this is the case make that apparent in the function name (such as the list methods append and extend).
Putting it all together:
A piece of code should only modify a container when it is the only piece of code with access to that container.
Making copies of parameters 'just-in-case' is a bad idea: you end up paying for it in lousy performance; or you have to keep track of the sizes of your arguments instead.
Better to get a good understanding of objects and names and how Python deals with them. A good start being this article.
The importart point being that
def modi_list(alist):
alist.append(4)
some_list = [1, 2, 3]
modi_list(some_list)
print(some_list)
has exactly the same affect as
some_list = [1, 2, 3]
same_list = some_list
same_list.append(4)
print(some_list)
because in the function call no copying of arguments is taking place, no creating pointers is taking place... what is taking place is Python saying alist = some_list and then executing the code in the function modi_list(). In other words, Python is binding (or assigning) another name to the same object.
Finally, when you do have a function that is going to modify its arguments, and you don't want those changes visible outside the function, you can usually just do a shallow copy:
def dont_modi_list(alist):
alist = alist[:] # make a shallow copy
alist.append(4)
Now some_list and alist are two different list objects that happen to contain the same objects -- so if you are just messing with the list object (inserting, deleting, rearranging) then you are fine, buf if you are going to go even deeper and cause modifications to the objects in the list then you will need to do a deepcopy(). But it's up to you to keep track of such things and code appropriately.
You can use a metaclass as follows:
import copy, new
class MakeACopyOfConstructorArguments(type):
def __new__(cls, name, bases, dct):
rv = type.__new__(cls, name, bases, dct)
old_init = dct.get("__init__")
if old_init is not None:
cls.__old_init = old_init
def new_init(self, *a, **kw):
a = copy.deepcopy(a)
kw = copy.deepcopy(kw)
cls.__old_init(self, *a, **kw)
rv.__init__ = new.instancemethod(new_init, rv, cls)
return rv
class Test(object):
__metaclass__ = MakeACopyOfConstructorArguments
def __init__(self, li):
li[0]=3
print li
li = range(3)
print li
t = Test(li)
print li
There is a best practice for this, in Python, and is called unit testing.
The main point here is that dynamic languages allows for much rapid development even with full unit tests; and unit tests are a much tougher safety net than static typing.
As writes Martin Fowler:
The general argument for static types is that it catches bugs that are
otherwise hard to find. But I discovered that in the presence of
SelfTestingCode, most bugs that static types would have were found
just as easily by the tests.

I need to free up RAM by storing a Python dictionary on the hard drive, not in RAM. Is it possible?

In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!
Maybe you should be using a database, but check out the shelve module
If shelve isn't powerful enough for you, there is always the industrial strength ZODB
shelve, as #gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally exist in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try e.g. using an int or whatever as the key, so it's not hard t remember that you do need a workaround either).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
...much later...
y = d['foo']
# is y "mutated" or not now?
the answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact exactly the same object as x) or a shelf (in which case y will be a distinct object from x, and in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).
Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.
"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook -- docs here.

Categories

Resources