Hashing a python function to regenerate output when the function is modified

Hashing a python function to regenerate output when the function is modified - python

I have a python function that has a deterministic result. It takes a long time to run and generates a large output:
def time_consuming_function():
# lots_of_computing_time to come up with the_result
return the_result
I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.
Is this possible or ridiculous?
Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.

If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.
import inspect
import functools
import json
def memoize_zeroadic_function_to_disk(memo_filename):
def decorator(f):
try:
with open(memo_filename, 'r') as fp:
cache = json.load(fp)
except IOError:
# file doesn't exist yet
cache = {}
source = inspect.getsource(f)
#functools.wraps(f)
def wrapper():
if source not in cache:
cache[source] = f()
with open(memo_filename, 'w') as fp:
json.dump(cache, fp)
return cache[source]
return wrapper
return decorator
#memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
# lots_of_computing_time to come up with the_result
return the_result

Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:
def time_consuming_method():
# your existing method here
# Is the cached data older than this file?
if (not os.path.exists(data_file_name)
or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
data = time_consuming_method()
save_data(data_file_name, data)
else:
data = load_data(data_file_name)
# redefine method
def time_consuming_method():
return data
While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.

The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your checkin routine that deletes your serialized table? Or if this is not research data and is in production, make it part of your release process that if the revision number of your file (put this function in it's own file) has changed, your release script deletes the serialzed lookup table.

So, here is a really neat trick using decorators:
def memoize(f):
cache={};
def result(*args):
if args not in cache:
cache[args]=f(*args);
return cache[args];
return result;
With the above, you can then use:
#memoize
def myfunc(x,y,z):
# Some really long running computation
When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "#memoize" again, or explicitly write:
myfunc = memoize(new_definition_for_myfunc);
Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:
import os;
import os.path;
import cPickle;
class MemoizedFunction(object):
def __init__(self,f):
self.function=f;
self.filename=str(hash(f))+".cache";
self.cache={};
if os.path.exists(self.filename):
with open(filename,'rb') as file:
self.cache=cPickle.load(file);
def __call__(self,*args):
if args not in self.cache:
self.cache[args]=self.function(*args);
return self.cache[args];
def __del__(self):
with open(self.filename,'wb') as file:
cPickle.dump(self.cache,file,cPickle.HIGHEST_PROTOCOL);
def memoize(f):
return MemoizedFunction(f);

What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.
A (overly simplified) example:
def memoized(f):
cache={}
def memo(*args):
if args in cache:
return cache[args]
else:
ret=f(*args)
cache[args]=ret
return ret
return memo
#memoized
def time_consuming_method():
# lots_of_computing_time to come up with the_result
return the_result
Edit:
From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some of of persistent storage for the cache (e.g. something as simple as using Pickle or a simple text file, or maybe using a full blown database, or anything in between). The choice of which method to use depends on what the OP needs. Several other answers already give some solutions to this, so I'm not going to repeat that here.

Related

Python decorate function preserving part of signature

I have written the following decorator:
def partializable(fn):
def arg_partializer(*fixable_parameters):
def partialized_fn(dynamic_arg):
return fn(dynamic_arg, *fixable_parameters)
return partialized_fn
return arg_partializer
The purpose of this decorator is to break the function call into two calls. If I decorate the following:
#partializable
def my_fn(dyn, fix1, fix2):
return dyn + fix1 + fix2
I then can do:
core_accepting_dynamic_argument = my_fn(my_fix_1, my_fix_2)
final_result = core_accepting_dynamic_argument(my_dyn)
My problem is that the now decorated my_fn exhibits the following signature: my_fn(*fixable_parameters)
I want it to be: my_fn(fix1, fix2)
How can I accomplish this? I probably have to use wraps or the decorator module, but I need to preserve only part of the original signature and I don't know if that's possible.

Taking inspiration from https://stackoverflow.com/a/33112180/9204395, it's possible to accomplish this by manually altering the signature of arg_partializer, since only the signature of fn is known in the relevant scope and can be handled with inspect.
from inspect import signature
def partializable(fn):
def arg_partializer(*fixable_parameters):
def partialized_fn(dynamic_arg):
return fn(dynamic_arg, *fixable_parameters)
return partialized_fn
# Override signature
sig = signature(fn)
sig = sig.replace(parameters=tuple(sig.parameters.values())[1:])
arg_partializer.__signature__ = sig
return arg_partializer
This is not particularly elegant, but as I think about the problem I'm starting to suspect that this (or a conceptual equivalent) is the only possible way to pull this stunt. Feel free to contradict me.

cleaning up nested function calls

I have written several functions that run sequentially, each one taking as its input the output of the previous function so in order to run it, I have to run this line of code
make_list(cleanup(get_text(get_page(URL))))
and I just find that ugly and inefficient, is there a better way to do sequential function calls?

Really, this is the same as any case where you want to refactor commonly-used complex expressions or statements: just turn the expression or statement into a function. The fact that your expression happens to be a composition of function calls doesn't make any difference (but see below).
So, the obvious thing to do is to write a wrapper function that composes the functions together in one place, so everywhere else you can make a simple call to the wrapper:
def get_page_list(url):
return make_list(cleanup(get_text(get_page(url))))
things = get_page_list(url)
stuff = get_page_list(another_url)
spam = get_page_list(eggs)
If you don't always call the exact same chain of functions, you can always factor out into the pieces that you frequently call. For example:
def get_clean_text(page):
return cleanup(get_text(page))
def get_clean_page(url):
return get_clean_text(get_page(url))
This refactoring also opens the door to making the code a bit more verbose but a lot easier to debug, since it only appears once instead of multiple times:
def get_page_list(url):
page = get_page(url)
text = get_text(page)
cleantext = cleanup(text)
return make_list(cleantext)
If you find yourself needing to do exactly this kind of refactoring of composed functions very often, you can always write a helper that generates the refactored functions. For example:
def compose1(*funcs):
#wraps(funcs[0])
def composed(arg):
for func in reversed(funcs):
arg = func(arg)
return arg
return composed
get_page_list = compose1(make_list, cleanup, get_text, get_page)
If you want a more complicated compose function (that, e.g., allows passing multiple args/return values around), it can get a bit complicated to design, so you might want to look around on PyPI and ActiveState for the various existing implementations.

You could try something like this. I always like separating train wrecks(the book "Clean Code" calls those nested functions train wrecks). This is easier to read and debug. Remember you probably spend twice as long reading your code than writing it so make it easier to read. You will thank yourself later.
url = get_page(URL)
url_text = get_text(url)
make_list(cleanup(url_text))
# you can also encapsulate that into its own function
def build_page_list_from_url(url):
url = get_page(URL)
url_text = get_text(url)
return make_list(cleanup(url_text))

Options:
Refactor: implement this series of function calls as one, aptly-named method.
Look into decorators. They're syntactic sugar for 'chaining' functions in this way. E.g. implement cleanup and make_list as a decorators, then decorate get_text with them.
Compose the functions. See code in this answer.

You could shorten constructs like that with something like the following:
class ChainCalls(object):
def __init__(self, *funcs):
self.funcs = funcs
def __call__(self, *args, **kwargs):
result = self.funcs[-1](*args, **kwargs)
for func in self.funcs[-2::-1]:
result = func(result)
return result
def make_list(arg): return 'make_list(%s)' % arg
def cleanup(arg): return 'cleanup(%s)' % arg
def get_text(arg): return 'get_text(%s)' % arg
def get_page(arg): return 'get_page(%r)' % arg
mychain = ChainCalls(make_list, cleanup, get_text, get_page)
print( mychain('http://is.gd') )
Output:
make_list(cleanup(get_text(get_page('http://is.gd'))))

Replacing parts of the function code on-the-fly

Here I came up with the solution to the other question asked by me on how to remove all costly calling to debug output function scattered over the function code (slowdown was 25 times with using empty function lambda *p: None).
The solution is to edit function code dynamically and prepend all function calls with comment sign #.
from __future__ import print_function
DEBUG = False
def dprint(*args,**kwargs):
'''Debug print'''
print(*args,**kwargs)
def debug(on=False,string='dprint'):
'''Decorator to comment all the lines of the function code starting with string'''
def helper(f):
if not on:
import inspect
source = inspect.getsource(f)
source = source.replace(string, '#'+string) #Beware! Swithces off the whole line after dprint statement
with open('temp_f.py','w') as file:
file.write(source)
from temp_f import f as f_new
return f_new
else:
return f #return f intact
return helper
def f():
dprint('f() started')
print('Important output')
dprint('f() ended')
f = debug(DEBUG,'dprint')(f) #If decorator #debug(True) is used above f(), inspect.getsource somehow includes #debug(True) inside the code.
f()
The problems I see now are these:
# commets all line to the end; but there may be other statements separated by ;. This may be addressed by deleting all pprint calls in f, not commenting, still it may be not that trivial, as there may be nested parantheses.
temp_f.py is created, and then new f code is loaded from it. There should be a better way to do this without writing to hard drive. I found this recipe, but haven't managed to make it work.
if decorator is applied with special syntax used #debug, then inspect.getsource includes the line with decorator to the function code. This line can be manually removed from string, but it may lead to bugs if there are more than one decorator applied to f. I solved it with resorting to old-style decorator application f=decorator(f).
What other problems do you see here?
How can all these problems be solved?
What are upsides and downsides of this approach?
What can be improved here?
Is there any better way to do what I try to achieve with this code?
I think it's a very interesting and contentious technique to preprocess function code before compilation to byte-code. Strange though that nobody got interested in it. I think the code I gave may have a lot of shaky points.

A decorator can return either a wrapper, or the decorated function unaltered. Use it to create a better debugger:
from functools import wraps
def debug(enabled=False):
if not enabled:
return lambda x: x # Noop, returns decorated function unaltered
def debug_decorator(f):
#wraps(f)
def print_start(*args, **kw):
print('{0}() started'.format(f.__name__))
try:
return f(*args, **kw)
finally:
print('{0}() completed'.format(f.__name__))
return print_start
return debug_decorator
The debug function is a decorator factory, when called it produces a decorator function. If debugging is disabled, it simply returns a lambda that returns it argument unchanged, a no-op decorator. When debugging is enabled, it returns a debugging decorator that prints when a decorated function has started and prints again when it returns.
The returned decorator is then applied to the decorated function.
Usage:
DEBUG = True
#debug(DEBUG)
def my_function_to_be_tested():
print('Hello world!')
To reiterate: when DEBUG is set to false, the my_function_to_be_tested remains unaltered, so runtime performance is not affected at all.

Here is the solution I came up with after composing answers from another questions asked by me here on StackOverflow.
This solution don't comment anything and just deletes standalone dprint statements. It uses ast module and works with Abstract Syntax Tree, it lets us avoid parsing source code. This idea was written in the comment here.
Writing to temp_f.py is replaced with execution f in necessary environment. This solution was offered here.
Also, the last solution addresses the problem of decorator recursive application. It's solved by using _blocked global variable.
This code solves the problem asked to be solved in the question. But still, it's suggested not to be used in real projects:
You are correct, you should never resort to this, there are so many
ways it can go wrong. First, Python is not a language designed for
source-level transformations, and it's hard to write it a transformer
such as comment_1 without gratuitously breaking valid code. Second,
this hack would break in all kinds of circumstances - for example,
when defining methods, when defining nested functions, when used in
Cython, when inspect.getsource fails for whatever reason. Python is
dynamic enough that you really don't need this kind of hack to
customize its behavior.
from __future__ import print_function
DEBUG = False
def dprint(*args,**kwargs):
'''Debug print'''
print(*args,**kwargs)
_blocked = False
def nodebug(name='dprint'):
'''Decorator to remove all functions with name 'name' being a separate expressions'''
def helper(f):
global _blocked
if _blocked:
return f
import inspect, ast, sys
source = inspect.getsource(f)
a = ast.parse(source) #get ast tree of f
class Transformer(ast.NodeTransformer):
'''Will delete all expressions containing 'name' functions at the top level'''
def visit_Expr(self, node): #visit all expressions
try:
if node.value.func.id == name: #if expression consists of function with name a
return None #delete it
except(ValueError):
pass
return node #return node unchanged
transformer = Transformer()
a_new = transformer.visit(a)
f_new_compiled = compile(a_new,'<string>','exec')
env = sys.modules[f.__module__].__dict__
_blocked = True
try:
exec(f_new_compiled,env)
finally:
_blocked = False
return env[f.__name__]
return helper
#nodebug('dprint')
def f():
dprint('f() started')
print('Important output')
dprint('f() ended')
print('Important output2')
f()
Other relevant links:
Switching off debug prints

Caching disk operations

I might need to do multiple reads over a big code-base, and with different tools.
I then thought that is a real waste to read on disk so many times while the text won't change, so I wrote the following.
class Module(object):
def __init__(self, module_path):
self.module_path = module_path
self._text = None
self._ast = None
#property
def text(self):
if not self._text:
self._text = open(self.module_path).read()
return self._text
#property
def ast(self):
s = self.text # which is actually discarded
if not self._ast:
self._ast = parse(self.text)
return self._ast
class ContentDirectory(object):
def __init__(self):
self.content = {}
def __getitem__(self, module_path):
if module_path not in self.content:
self.content[module_path] = Module(module_path)
return self.content[module_path]
But now it comes the problem, because I would like to avoid changing the rest of the code, while being able to use this new trick.
The only way I see would be to patch the "open" builtin function everywhere it might be used, for example.
from myotherlib import __builtins__ as other_builtins
other_builtins.open = my_dummy_open # which uses this cache
But it does not really seem like a wise idea.
Should I just give up and only try if the performance is really too bad maybe?

you can use mmap module: http://docs.python.org/library/mmap.html

Replacing the system open() call is potentially a bad thing. It requires that everything which uses open() uses it as you expect.
Why do you want to avoid changing the code?
Yes, measure the performance and see if it's worthwhile. For example, put in your above code and see how much faster things are. If it's only 1% faster then there's no reason to do anything. If it's significantly faster, then see what's using the open() and change that code if you can.
BTW, something like an LRU cache (part of functools in Python 3.2) would also be helpful for your task.

I'm not sure if the functionality offered by this library is of any use in your scenario, but thought to mention nevertheless the existence of the linecache library. From the linked docs:
The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.
...of course this doesn't come even close to your problem of implementing a solution in an elegant and transparent way...

Is there a high-level profiling module for Python?

I want to profile my Python code. I am well-aware of cProfile, and I use it, but it's too low-level. (For example, there isn't even a straightforward way to catch the return value from the function you're profiling.)
One of the things I would like to do: I want to take a function in my program and set it to be profiled on the fly while running the program.
For example, let's say I have a function heavy_func in my program. I want to start the program and have the heavy_func function not profile itself. But sometime during the runtime of my program, I want to change heavy_func to profile itself while it's running. (If you're wondering how I can manipulate stuff while the program is running: I can do it either from the debug probe or from the shell that's integrated into my GUI app.)
Is there a module already written which does stuff like this? I can write it myself but I just wanted to ask before so I won't be reinventing the wheel.

It may be a little mind-bending, but this technique should help you find the "bottlenecks", it that's what you want to do.
You're pretty sure of what routine you want to focus on.
If that's the routine you need to focus on, it will prove you right.
If the real problem(s) are somewhere else, it will show you where they are.
If you want a tedious list of reasons why, look here.

I wrote my own module for it. I called it cute_profile. Here is the code. Here are the tests.
Here is the blog post explaining how to use it.
It's part of GarlicSim, so if you want to use it you can install garlicsim and do from garlicsim.general_misc import cute_profile.
If you want to use it on Python 3 code, just install the Python 3 fork of garlicsim.
Here's an outdated excerpt from the code:
import functools
from garlicsim.general_misc import decorator_tools
from . import base_profile
def profile_ready(condition=None, off_after=True, sort=2):
'''
Decorator for setting a function to be ready for profiling.
For example:
#profile_ready()
def f(x, y):
do_something_long_and_complicated()
The advantages of this over regular `cProfile` are:
1. It doesn't interfere with the function's return value.
2. You can set the function to be profiled *when* you want, on the fly.
How can you set the function to be profiled? There are a few ways:
You can set `f.profiling_on=True` for the function to be profiled on the
next call. It will only be profiled once, unless you set
`f.off_after=False`, and then it will be profiled every time until you set
`f.profiling_on=False`.
You can also set `f.condition`. You set it to a condition function taking
as arguments the decorated function and any arguments (positional and
keyword) that were given to the decorated function. If the condition
function returns `True`, profiling will be on for this function call,
`f.condition` will be reset to `None` afterwards, and profiling will be
turned off afterwards as well. (Unless, again, `f.off_after` is set to
`False`.)
`sort` is an `int` specifying which column the results will be sorted by.
'''
def decorator(function):
def inner(function_, *args, **kwargs):
if decorated_function.condition is not None:
if decorated_function.condition is True or \
decorated_function.condition(
decorated_function.original_function,
*args,
**kwargs
):
decorated_function.profiling_on = True
if decorated_function.profiling_on:
if decorated_function.off_after:
decorated_function.profiling_on = False
decorated_function.condition = None
# This line puts it in locals, weird:
decorated_function.original_function
base_profile.runctx(
'result = '
'decorated_function.original_function(*args, **kwargs)',
globals(), locals(), sort=decorated_function.sort
)
return locals()['result']
else: # decorated_function.profiling_on is False
return decorated_function.original_function(*args, **kwargs)
decorated_function = decorator_tools.decorator(inner, function)
decorated_function.original_function = function
decorated_function.profiling_on = None
decorated_function.condition = condition
decorated_function.off_after = off_after
decorated_function.sort = sort
return decorated_function
return decorator

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Hashing a python function to regenerate output when the function is modified - python

Related

Python decorate function preserving part of signature

cleaning up nested function calls

Replacing parts of the function code on-the-fly

Caching disk operations

Is there a high-level profiling module for Python?

Categories

Resources