I built an application with Enthought Traits, and it is using too much memory. I think the problem is caused by trait notifications:
There seems to be a fundamental difference in memory usage between events caught by @on_trait_change and events caught by using the special naming convention (e.g. _foo_changed()). I made a little example with two classes Foo and FooDecorator, which I assumed would show exactly the same behaviour. But they don't!
from traits.api import *

class Foo(HasTraits):
    a = List(Int)

    def _a_changed(self):
        pass

    def _a_items_changed(self):
        pass

class FooDecorator(HasTraits):
    a = List(Int)

    @on_trait_change('a[]')
    def bar(self):
        pass

if __name__ == '__main__':
    n = 100000
    c = FooDecorator
    a = [c() for i in range(n)]
When running this script with c = Foo, the Windows Task Manager shows a memory usage of about 70 MB for the whole Python process, which stays constant for increasing n. For c = FooDecorator, the Python process uses 450 MB, increasing for higher n.
Can you please explain this behaviour to me?
EDIT: Maybe I should rephrase: Why would anyone choose FooDecorator over Foo?
EDIT 2: I just uninstalled python(x,y) 2.7.9 and installed the newest version of Canopy with traits 4.5.0. Now the 450MB became 750MB.
EDIT 3: Compiled traits-4.6.0.dev0-py2.7-win-amd64 myself. The outcome is the same as in EDIT 2. So despite all plausibility https://github.com/enthought/traits/pull/248/files does not seem to be the cause.
I believe you are seeing the effect of a memory leak that has been fixed recently:
https://github.com/enthought/traits/pull/248/files
As for why one would use the decorator, in this particular instance the two versions are practically equivalent.
In general, the decorator is more flexible: you can give a list of traits to listen to, and you can use the extended name notation, as described here:
http://docs.enthought.com/traits/traits_user_manual/notification.html#semantics
For example, in this case:
class Bar(HasTraits):
    b = Str

class FooDecorator(HasTraits):
    a = List(Bar)

    @on_trait_change('a.b')
    def bar(self):
        print 'change'
the bar notifier is going to be called for changes to the trait a, its items, and for the change of the trait b in each of the Bar items. Extended names can be quite powerful.
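For instance, a hedged sketch of when the notifier fires, assuming the Bar and FooDecorator classes defined just above:

f = FooDecorator()
f.a = [Bar()]       # reassigning the trait 'a' fires bar()
f.a.append(Bar())   # changing the items of 'a' fires bar()
f.a[0].b = 'red'    # changing 'b' on an item of 'a' fires bar()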
What's going on here is that Traits has two distinct ways of handling notifications: static notifiers and dynamic notifiers.
Static notifiers (such as those created by the specially-named _*_changed() methods) are fairly light-weight: each trait on an instance has a list of notifiers on it, which are basically the functions or methods with a lightweight wrapper.
Dynamic notifiers (such as those created with on_trait_change() and the extended trait name conventions like a[]) are significantly more powerful and flexible, but as a result they are much more heavy-weight. In particular, in addition to the wrapper object they create, they also create a parsed representation of the extended trait name and a handler object, some of which are in turn HasTraits subclass instances.
As a result, even for a simple expression like a[] there will be a fair number of new Python objects created, and these objects have to be created for every on_trait_change listener on every instance separately to properly handle corner-cases like instance traits. The relevant code is here: https://github.com/enthought/traits/blob/master/traits/has_traits.py#L2330
Based on the reported numbers, the majority of the difference in memory usage that you are seeing comes from the creation of this dynamic listener infrastructure for each instance and each on_trait_change decorator.
It's worth noting that there is a short-circuit for on_trait_change in the case where you are using a simple trait name, in which case it generates a static trait notifier instead of a dynamic notifier. So if you were to instead write something like:
class FooSimpleDecorator(HasTraits):
    a = List(Int)

    @on_trait_change('a')
    def a_updated(self):
        pass

    @on_trait_change('a_items')
    def a_items_updated(self):
        pass
you should see similar memory performance to the specially-named methods.
To answer the rephrased question about "why use on_trait_change", in FooDecorator you can write one method instead of two if your response to a change of either the list or any items in the list is the same. This makes code significantly easier to debug and maintain, and if you aren't creating thousands of these objects then the extra memory usage is negligible.
This becomes even more of a factor when you consider more sophisticated extended trait name patterns, where the dynamic listeners automatically handle changes which would otherwise require significant manual (and error-prone) code for hooking up and removing listeners from intermediate objects and traits. The power and simplicity of this approach usually outweighs the concerns about memory usage.
Related
I am running some numerical simulations, in which my main function must receive lots and lots of arguments - I'm talking 10 to 30 arguments depending on the simulation to run.
What are some best practices to handle cases like this? Dividing the code into, say, 10 functions with 3 arguments each doesn't sound very feasible in my case.
What I do is create an instance of a class (with no methods), store the inputs as attributes of that instance, then pass the instance - so the function receives only one input.
I like this because the code looks clean, easy to read, and because I find it easy to define and run alternative scenarios.
I dislike it because accessing class attributes within a function is slower than accessing a local variable (see: How / why to optimise code by copying class attributes to local variables?) and because it is not an efficient use of memory - too much data stored multiple times unnecessarily.
Any thoughts or recommendations?
myinput=MyInput()
myinput.input_sql_table = that_sql_table
myinput.input_file = that_input_file
myinput.param1 = param1
myinput.param2 = param2
myoutput = calc(myinput)
Alternative scenarios:
import collections
import copy

inputs = collections.OrderedDict()
scenarios = collections.OrderedDict()

inputs['base scenario'] = copy.deepcopy(myinput)
inputs['param2 = 100'] = copy.deepcopy(myinput)
inputs['param2 = 100'].param2 = 100

# loop through all the inputs and store the outputs in the ordered dictionary scenarios
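The loop described in the last comment might look like this (a sketch, assuming the calc function from the snippet above):

for name, inp in inputs.items():
    scenarios[name] = calc(inp)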
I don't think this is really a StackOverflow question, more of a Software Engineering question. For example check out this question.
As far as whether or not this is a good design pattern, this is an excellent way to handle a large number of arguments. You mentioned that this isn't very efficient in terms of memory or speed, but I think you're making an improper micro-optimization.
As far as memory is concerned, the overhead of running the Python interpreter is going to dwarf the couple of extra bytes used by instantiating your class.
Unless you have run a profiler and determined that accessing members of that options class is slowing you down, I wouldn't worry about it. This is especially the case because you're using Python. If speed is a real concern, you should be using something else.
You may not be aware of this, but most of the large scale number crunching libraries for Python aren't actually written in Python, they're just wrappers around C/C++ libraries that are much faster.
I recommend reading this article; it is well established that "Premature optimization is the root of all evil".
You could pass in a dictionary like so:
def some_func_or_class(kwarg1: int = -1, kwarg2: int = 0, kwargN: str = ''):
    print(kwarg1, kwarg2, kwargN)

all_the_kwargs = {'kwarg1': 0, 'kwarg2': 1, 'kwargN': 'xyz'}
some_func_or_class(**all_the_kwargs)
Or you could use several named tuples, as referenced here: Type hints in namedtuple
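For illustration, a minimal sketch (Python 3.6+, with hypothetical field names) of grouping the simulation parameters into a typed NamedTuple:

from typing import NamedTuple

class SimulationParams(NamedTuple):
    input_sql_table: str
    input_file: str
    param1: float = 1.0
    param2: float = 2.0

def calc(params: SimulationParams):
    return params.param1 + params.param2

base = SimulationParams('my_table', 'input.csv')
scenario = base._replace(param2=100)   # cheap way to derive an alternative scenario
print(calc(base), calc(scenario))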
Also note that, depending on which version of Python you are using, there may be a limit to the number of arguments you can pass into a function call.
Or you could use just a dictionary:
def some_func(a_dictionary):
    return a_dictionary.get('argXYZ', None)  # defaults to None if argXYZ doesn't exist
As per my understanding, __repr__ is used to produce a developer/interpreter friendly representation of the object and should, if possible, be valid Python code which, when passed to eval(), recreates an identical object.
From Python Docs:
object.__repr__(self)
Called by the repr() built-in function and by string conversions (reverse quotes) to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an “informal” string representation of instances of that class is required.
Link: https://docs.python.org/2/reference/datamodel.html#object.__repr__
E.g:
class tie(object):
    def __init__(self, color):
        self.color = color

t = tie('green')
repr(t)  # prints '<tie object at 0x10fdc4c10>'
# can the default implementation be improved to tie(color='green'),
# based on the parameters passed to the __init__ function?
What challenges would be there to change that implementation apart from backward compatibility/ existing behavior?
Having a default repr based on how the object was created would bring inefficiencies and confusion
Size
the __init__ arguments would have to be stored via copy somewhere in the object - making the object bloated
not all objects simply copy those values into themselves
for example:
class GreedyMan:
    def __init__(self, coins):
        self.most_valuable_coin = max(coins)
you would have to save the whole coins collection here
objects are mutable
a Color instance that is initialized with 0xff00ff can change over its lifespan into another color
class Color:
    def __init__(self, color):
        self.color = color

    def dilute(self, factor):
        self.color = self.color * factor
dilute() can change the state of the instance, so you would no longer have a color of 0xff00ff but some other one. If something then threw an exception like "I don't accept non-red colors - provided Color('red')", the programmer would have some debugging to do until he noticed that someone used dilute() to get some weird shade.
so why not print the whole state - all of the object's attributes?
the result could be huge or even infinite
graphs can have nodes that contain other nodes, and even cycles
class ChainPart:
    def __init__(self, parent):
        self.parent = parent
        self.children = []

    def add_child(self, child):
        self.children.append(child)

a = ChainPart(None)
b = ChainPart(a)
a.add_child(b)  # now a and b reference each other
the chain part b, when printed out with all of its contents recursively, would have to print a, which would in turn print part b ... and so on ad infinitum
so the most obvious way to solve these issues is to keep the default repr simple and allow the programmer to override it with a custom __repr__ method in the object's class.
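For example, a short sketch of such a custom __repr__ for the Color class above, reporting the current state rather than the original constructor call:

class Color:
    def __init__(self, color):
        self.color = color

    def dilute(self, factor):
        self.color = self.color * factor

    def __repr__(self):
        return 'Color(%s)' % hex(self.color)

c = Color(0xff00ff)
c.dilute(0)
print(repr(c))   # Color(0x0) -- reflects the current state, not the original argument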
That's essentially what pickle is trying to do. The idea is that since an object in memory is a graph, if you revisit that graph you can reconstruct the object.
Pickle generates a series of instructions to reconstruct data, but, in principle, it could generate Python code directly.
While I'm going to go over the problems with it below, pickling is used very heavily for distributed applications where systems need to transmit data from one process to another.
And __repr__ methods are very handy in debugging when you want to be able to reconstruct an object by copying and pasting into a prompt.
Let's look at where pickling doesn't work:
Some objects actually represent external resources! A network socket, for instance, identifies the kernel resources managing the actual network connection. Even if you could recreate those, there wouldn't be another machine listening.
Some objects have real world consequences. An object might represent a password or other secret that should not leave the process's memory e.g. by being printed to logs.
To determine which objects to construct first you must perform a topological sort, and while fast algorithms exist, they aren't free. Especially, things can get hairy if another thread modifies the graph while this is happening.
If you save this data and try to reload it on another version, it will fail. If you design new versions to accept the old data, development slows to a crawl as your code becomes more contorted to deal with backwards compatibility.
It is very tightly coupled to your objects and Python's object model, so you wind up reimplementing it.
It has access to anything in Python, so if someone passes in shutil.rmtree('/') the machine will faithfully execute it.
Most of these aren't show-stoppers, but fixing them requires tradeoffs. The main reason why the built-in __repr__ does so little is to ensure that simple things stay simple.
And modules like attrs, which inspired the dataclasses module in Python 3.7, provide a canned __repr__ that largely does what you're asking for and answers most of the use cases that I think you're imagining.
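For example, a minimal sketch (Python 3.7+) of the canned __repr__ that dataclasses generates:

from dataclasses import dataclass

@dataclass
class Tie:
    color: str

t = Tie('green')
print(repr(t))   # Tie(color='green')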
I'm working on a language that addresses some of these issues, and I summarized some of the different existing approaches to dealing with the issues I mention above.
I have an application which relies heavily on a Context instance that serves as the access point to the context in which a given calculation is performed.
If I want to provide access to the Context instance, I can:
rely on global
pass the Context as a parameter to all the functions that require it
I would rather not use global variables, and passing the Context instance to all the functions is cumbersome and verbose.
How would you "hide, but make accessible" the calculation Context?
For example, imagine that Context simply computes the state (position and velocity) of planets according to different data.
class Context(object):
    def state(self, planet, epoch):
        """base class --- suppose `state` is meant
        to return a tuple of vectors."""
        raise NotImplementedError("provide an implementation!")

class DE405Context(Context):
    """Concrete context using DE405 planetary ephemeris"""
    def state(self, planet, epoch):
        """suppose that de405 reader exists and can provide
        the required (position, velocity) tuple."""
        return de405reader(planet, epoch)

def angular_momentum(planet, epoch, context):
    """suppose we care about the angular momentum of the planet,
    and that `cross` exists"""
    r, v = context.state(planet, epoch)
    return cross(r, v)

# a second alternative: a "Calculator" class that contains the context
class Calculator(object):
    def __init__(self, context):
        self._ctx = context

    def angular_momentum(self, planet, epoch):
        r, v = self._ctx.state(planet, epoch)
        return cross(r, v)

# use as follows:
my_context = DE405Context()
now = now()  # assume this function returns an epoch

# first case:
print angular_momentum("Saturn", now, my_context)

# second case:
calculator = Calculator(my_context)
print calculator.angular_momentum("Saturn", now)
Of course, I could add all the operations directly into "Context", but it does not feel right.
In real life, the Context not only computes positions of planets! It computes many more things, and it serves as the access point to a lot of data.
So, to make my question more succinct: how do you deal with objects which need to be accessed by many classes?
I am currently exploring Python's context managers, but without much luck. I also thought about dynamically adding a "context" attribute to the functions directly (functions are objects, so they can have arbitrary attributes), i.e.:
def angular_momentum(planet, epoch):
    r, v = angular_momentum.ctx.state(planet, epoch)
    return cross(r, v)

# somewhere before calling anything...
import angular_momentum
angular_momentum.ctx = my_context
edit
Something that would be great, is to create a "calculation context" with a with statement, for example:
with my_context:
    h = angular_momentum("Earth", now)
Of course, I can already do that if I simply write:
with my_context as ctx:
    h = angular_momentum("Earth", now, ctx)  # first implementation above
Maybe a variation of this with the Strategy pattern?
You generally don't want to "hide" anything in Python. You may want to signal human readers that they should treat it as "private", but this really just means "you should be able to understand my API even if you ignore this object", not "you can't access this".
The idiomatic way to do that in Python is to prefix it with an underscore—and, if your module might ever be used with from foo import *, add an explicit __all__ global that lists all the public exports. Again, neither of these will actually prevent anyone from seeing your variable, or even accessing it from outside after import foo.
See PEP 8 on Global Variable Names for more details.
Some style guides suggest special prefixes, all-caps-names, or other special distinguishing marks for globals, but PEP 8 specifically says that the conventions are the same, except for the __all__ and/or leading underscore.
Meanwhile, the behavior you want is clearly that of a global variable—a single object that everyone implicitly shares and references. Trying to disguise it as anything other than what it is will do you no good, except possibly for passing a lint check or a code review that you shouldn't have passed. All of the problems with global variables come from being a single object that everyone implicitly shares and references, not from being directly in the globals() dictionary or anything like that, so any decent fake global is just as bad as a real global. If that truly is the behavior you want, make it a global variable.
Putting it together:
# do not include _context here
__all__ = ['Context', 'DE405Context', 'Calculator', …
_context = Context()
Also, of course, you may want to call it something like _global_context or even _private_global_context, instead of just _context.
But keep in mind that globals are still members of a module, not of the entire universe, so even a public context will still be scoped as foo.context when client code does an import foo. And this may be exactly what you want. If you want a way for client scripts to import your module and then control its behavior, maybe foo.context = foo.Context(…) is exactly the right way. Of course this won't work in multithreaded (or gevent/coroutine/etc.) code, and it's inappropriate in various other cases, but if that's not an issue, in some cases, this is fine.
Since you brought up multithreading in your comments: In the simple style of multithreading where you have long-running jobs, the global style actually works perfectly fine, with a trivial change—replace the global Context with a global threading.local instance that contains a Context. Even in the style where you have small jobs handled by a thread pool, it's not much more complicated. You attach a context to each job, and then when a worker pulls a job off the queue, it sets the thread-local context to that job's context.
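A rough sketch of that thread-local variant, reusing the Context classes from the question (set_context/get_context are names I made up for illustration):

import threading

_local = threading.local()

def set_context(ctx):
    _local.context = ctx

def get_context():
    return getattr(_local, 'context', None)

# each worker thread (or each job handler) installs its own context:
# set_context(DE405Context())
# r, v = get_context().state("Saturn", now)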
However, I'm not sure multithreading is going to be a good fit for your app anyway. Multithreading is great in Python when your tasks occasionally have to block for IO and you want to be able to do that without stopping other tasks—but, thanks to the GIL, it's nearly useless for parallelizing CPU work, and it sounds like that's what you're looking for. Multiprocessing (whether via the multiprocessing module or otherwise) may be more of what you're after. And with separate processes, keeping separate contexts is even simpler. (Or, you can write thread-based code and switch it to multiprocessing, leaving the threading.local variables as-is and only changing the way you spawn new tasks, and everything still works just fine.)
It may make sense to provide a "context" in the context manager sense, as an external version of the standard library's decimal module did, so someone can write:
with foo.Context(…):
    # do stuff under custom context
# back to default context
However, nobody could really think of a good use case for that (especially since, at least in the naive implementation, it doesn't actually solve the threading/etc. problem), so it wasn't added to the standard library, and you may not need it either.
If you want to do this, it's pretty trivial. If you're using a private global, just add this to your Context class:
def __enter__(self):
    global _context
    self._stashedcontext = _context
    _context = self

def __exit__(self, *args):
    global _context
    _context = self._stashedcontext
And it should be obvious how to adjust this to public, thread-local, etc. alternatives.
Another alternative is to make everything a member of the Context object. The top-level module functions then just delegate to the global context, which has a reasonable default value. This is exactly how the standard library random module works—you can create a random.Random() and call randrange on it, or you can just call random.randrange(), which calls the same thing on a global default random.Random() object.
If creating a Context is too heavy to do at import time, especially if it might not get used (because nobody might ever call the global functions), you can use the singleton pattern to create it on first access. But that's rarely necessary. And when it's not, the code is trivial. For example, the source to random, starting at line 881, does this:
_inst = Random()
seed = _inst.seed
random = _inst.random
uniform = _inst.uniform
…
And that's all there is to it.
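Applied to the Context example, the same delegation pattern might look like this sketch (the module layout is my invention; DE405Context and cross are assumed from the question):

# planets.py (hypothetical module): module-level functions delegate
# to a default Context instance, just like the random module does
_inst = DE405Context()
state = _inst.state

def angular_momentum(planet, epoch):
    r, v = state(planet, epoch)
    return cross(r, v)

# client code then just calls planets.angular_momentum("Saturn", now)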
And finally, as you suggested, you could make everything a member of a different Calculator object which owns a Context object. This is the traditional OOP solution; overusing it tends to make Python feel like Java, but using it when it's appropriate is not a bad thing.
You might consider using a proxy object, here's a library that helps in creating object proxies:
http://pypi.python.org/pypi/ProxyTypes
Flask uses object proxies for its "current_app", "request" and other variables; all it takes to reference them is:
from flask import request
You could create a proxy object that is a reference to your real context, and use thread locals to manage the instances (if that would work for you).
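A minimal sketch of that idea (not the actual ProxyTypes API, just the mechanism): attribute access on the proxy is forwarded to whatever Context the current thread has installed.

import threading

_local = threading.local()

class ContextProxy(object):
    """Forwards attribute access to the Context installed for the current thread."""
    def __getattr__(self, name):
        return getattr(_local.context, name)

current_context = ContextProxy()

# per-thread setup:
# _local.context = DE405Context()
# r, v = current_context.state("Saturn", now)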
It seems there are different styles of output that the __repr__ function can return.
I have a class InfoObj that stores a number of things, some of which I don't particularly want users of the class to set by themselves. I recognize nothing is protected in Python and they could just dive in and set it anyway, but it seems that accepting it in __init__ makes it more likely someone might see it and assume it's fine to just pass it in.
(Example: Booleans that get set by a validation function when it determines that the object has been fully populated, and values that get calculated from other values when enough information is stored to do so... e.g. A = B + C, so once A and B are set then C is calculated and the object is marked Valid=True.)
So, given all that, which is the best way to design the output of __repr__?
bob = InfoObj(Name="Bob")
# Populate bob.
# Output type A:
bob.__repr__()
'<InfoObj object at 0x1b91ca42>'
# Output type B:
bob.__repr__()
'InfoObj(Name="Bob",Pants=True,A=7,B=5,C=2,Valid=True)'
# Output type C:
bob.__repr__()
'InfoObj.NewInfoObj(Name="Bob",Pants=True,A=7,B=5,C=2,Valid=True)'
... the point of type C would be to not happily accept all the stuff I'd have marked 'private' in C++ as arguments to the constructor, and to make teammates using the class set it up through the interface functions, even if it's more work for them. In that case I would define a constructor that does not take certain things in, and a separate function, slightly harder to notice, for the purposes of __repr__.
If it makes any difference, I am planning to store these python objects in a database using their __repr__ output and retrieve them using eval(), at least unless I come up with a better way. The consequence of a teammate creating a full object manually instead of going through the proper interface functions is just that one type of info retrieval might be unstable until someone figures out what he did.
The __repr__ method is designed to produce the most useful output for the developer, not the enduser, so only you can really answer this question. However, I'd typically go with option B. Option A isn't very useful, and option C is needlessly verbose -- you don't know how your module is imported anyway. Others may prefer option C.
However, if you want to store Python objects in a database, use pickle.
>>> import pickle
>>> bob = InfoObj(Name="Bob")
>>> pickle.dumps(bob)
b'...some bytestring representation of Bob...'
>>> pickle.loads(pickle.dumps(bob))
Bob(...)
If you're using older Python (pre-3.x), then note that cPickle is faster, but pickle is more extensible. Pickle will work on some of your classes without any configuration, but for more complicated objects you might want to write custom picklers.
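If you do end up needing a custom pickler, here is a hedged sketch (with hypothetical InfoObj fields) using __getstate__/__setstate__ so that derived state is recomputed on load:

import pickle

class InfoObj(object):
    def __init__(self, Name):
        self.Name = Name
        self.Valid = False      # hypothetical derived flag

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('Valid', None)    # don't persist derived state
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.Valid = False          # force re-validation after loading

bob = InfoObj(Name="Bob")
restored = pickle.loads(pickle.dumps(bob))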
A generic function is dispatched based on the type of all its arguments. The programmer defines several implementations of a function. The correct one is chosen at call time based on the types of its arguments. This is useful for object adaptation among other things. Python has a few generic functions including len().
These packages tend to allow code that looks like this:
@when(int)
def dumbexample(a):
    return a * 2

@when(list)
def dumbexample(a):
    return [("%s" % i) for i in a]

dumbexample(1)          # calls first implementation
dumbexample([1, 2, 3])  # calls second implementation
A less dumb example I've been thinking about lately would be a web component that requires a User. Instead of requiring a particular web framework, the integrator would just need to write something like:
class WebComponentUserAdapter(object):
    def __init__(self, guest):
        self.guest = guest

    def canDoSomething(self):
        return self.guest.member_of("something_group")

@when(my.webframework.User)
def componentNeedsAUser(user):
    return WebComponentUserAdapter(user)
Python has a few generic functions implementations. Why would I chose one over the others? How is that implementation being used in applications?
I'm familiar with Zope's zope.component.queryAdapter(object, ISomething). The programmer registers a callable adapter that takes a particular class of object as its argument and returns something compatible with the interface. It's a useful way to allow plugins. Unlike monkey patching, it works even if an object needs to adapt to multiple interfaces with the same method names.
I'd recommend the PEAK-Rules library by P. Eby. By the same author is the RuleDispatch package, the predecessor of PEAK-Rules; RuleDispatch is deprecated, though, and no longer maintained IIRC.
PEAK-Rules has a lot of nice features, one being that it is (well, not easily, but) extensible. Besides "classic" dispatch on types only, it features dispatch on arbitrary expressions as "guardians".
The len() function is not a true generic function (at least in the sense of the packages mentioned above, and also in the sense this term is used in languages like Common Lisp, Dylan or Cecil), as it is simply convenient syntax for a call to a specially named (but otherwise regular) method:
len(s) == s.__len__()
Also note that this is single dispatch only; that is, the actual receiver (s in the code above) determines the method implementation called. And even a hypothetical
def call_special(receiver, *args, **keys):
    return receiver.__call_special__(*args, **keys)
is still a single-dispatch function, as only the receiver is used when the method to be called is resolved. The remaining arguments are simply passed on, but they don't affect the method selection.
This is different from multiple dispatch, where there is no dedicated receiver, and all arguments are used in order to find the actual method implementation to call. This is what actually makes the whole thing worthwhile. If it were only some odd kind of syntactic sugar, nobody would bother with using it, IMHO.
from peak.rules import abstract, when

@abstract
def serialize_object(object, target):
    pass

@when(serialize_object, (MyStuff, BinaryStream))
def serialize_object(object, target):
    target.writeUInt32(object.identifier)
    target.writeString(object.payload)

@when(serialize_object, (MyStuff, XMLStream))
def serialize_object(object, target):
    target.openElement("my-stuff")
    target.writeAttribute("id", str(object.identifier))
    target.writeText(object.payload)
    target.closeElement()
In this example, a call like
serialize_object(MyStuff(10, "hello world"), XMLStream())
considers both arguments in order to decide which method must actually be called.
For a nice usage scenario of generic functions in Python I'd recommend reading the refactored code of the peak.security which gives a very elegant solution to access permission checking using generic functions (using RuleDispatch).
You can use a construction like this:
def my_func(*args, **kwargs):
    pass
In this case args will be a tuple of any positional arguments, and kwargs will be a dictionary of the named ones. From here you can detect their types and act as appropriate.
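A sketch of what that manual dispatch might look like for the dumbexample above:

def dumbexample(*args, **kwargs):
    a = args[0]
    if isinstance(a, int):
        return a * 2
    if isinstance(a, list):
        return [("%s" % i) for i in a]
    raise TypeError("unsupported argument type: %r" % type(a))

print(dumbexample(1))          # 2
print(dumbexample([1, 2, 3]))  # ['1', '2', '3']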
I'm unable to see the point in these "generic" functions. It sounds like simple polymorphism.
Your "generic" features can be implemented like this without resorting to any run-time type identification.
class intWithDumbExample(int):
    def dumbexample(self):
        return self * 2

class listWithDumbExample(list):
    def dumbexample(self):
        return [("%s" % i) for i in self]

def dumbexample(a):
    return a.dumbexample()