Chain memoizer in Python

I already have a memoizer that works quite well. It uses pickle.dumps to serialize the inputs and creates an MD5 hash as the key. The function results are quite large, so they are stored as pickle files whose file names are the MD5 hashes. When I call two memoized functions one after another, the memoizer loads the output of the first function and passes it to the second. The second function serializes it, creates the MD5, and then loads its own output. Here is a very simple example:
@memoize
def f(x):
    ...
    return y

@memoize
def g(x):
    ...
    return y

y1 = f(x1)
y2 = g(y1)
y1 is loaded from disk when f is evaluated and then serialized again when g is evaluated. Is it possible to somehow bypass this step and pass the key of y1 (i.e. its MD5 hash) to g instead? If g already has this key, it loads y2 from disk. If it doesn't, it "requests" the full y1 for the evaluation of g.
EDIT:
import cPickle as pickle
import inspect
import hashlib

class memoize(object):
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        file_name = self._get_key(*args, **kwargs)
        try:
            # binary mode, since the cache files hold protocol-2 pickles
            with open(file_name, "rb") as f:
                out = pickle.load(f)
        except (IOError, EOFError):
            # cache miss (or unreadable cache file): compute and store
            out = self.func(*args, **kwargs)
            with open(file_name, "wb") as f:
                pickle.dump(out, f, 2)
        return out

    def _arg_hash(self, *args, **kwargs):
        _str = pickle.dumps(args, 2) + pickle.dumps(kwargs, 2)
        return hashlib.md5(_str).hexdigest()

    def _src_hash(self):
        _src = inspect.getsource(self.func)
        return hashlib.md5(_src).hexdigest()

    def _get_key(self, *args, **kwargs):
        arg = self._arg_hash(*args, **kwargs)
        src = self._src_hash()
        return src + '_' + arg + '.pkl'

I think you could do it in an automatic fashion, but I generally think it's better to be explicit about "lazy" evaluation. So I'll present a way that adds an additional argument to your memoized functions: lazy. But instead of files, pickle and MD5, I'll simplify the helpers a bit:
# I use a dictionary as storage instead of files
storage = {}

# No md5, just hash
def calculate_md5(obj):
    print('calculating md5 of', obj)
    return hash(obj)

# Create a dictionary entry instead of pickling the data to a file
def create_file(md5, data):
    print('creating file for md5', md5)
    storage[md5] = data

# Load a dictionary entry instead of unpickling a file
def load_file(md5):
    print('loading file with md5 of', md5)
    return storage[md5]
I use a custom class as intermediate object:
class MemoizedObject(object):
    def __init__(self, md5):
        self.md5 = md5

    def get_real_data(self):
        print('load...')
        return load_file(self.md5)

    def __repr__(self):
        return '{self.__class__.__name__}(md5={self.md5})'.format(self=self)
And finally I show the changed Memoize assuming your functions take only one argument:
class Memoize(object):
    def __init__(self, func):
        self.func = func
        # The md5-to-md5 storage is needed to find the result file
        # or the result md5 for lazy evaluation.
        self.md5_to_md5_storage = {}

    def __call__(self, x, lazy=False):
        # If the argument is a memoized object there is no need to
        # calculate the hash, we can just look it up.
        if isinstance(x, MemoizedObject):
            key = x.md5
        else:
            key = calculate_md5(x)
        if lazy and key in self.md5_to_md5_storage:
            # The key is present in the md5-to-md5 storage, so we can
            # stay lazy and return a placeholder.
            return MemoizedObject(self.md5_to_md5_storage[key])
        elif not lazy and key in self.md5_to_md5_storage:
            # Not lazy, but we already know the result.
            result = load_file(self.md5_to_md5_storage[key])
        else:
            # Unknown argument: "request" the full data if we only got
            # a placeholder, then evaluate the function.
            if isinstance(x, MemoizedObject):
                x = x.get_real_data()
            result = self.func(x)
            result_md5 = calculate_md5(result)
            create_file(result_md5, result)
            self.md5_to_md5_storage[key] = result_md5
        return result
Now if you call your functions and specify lazy at the correct positions you can avoid loading (unpickling) your file:
@Memoize
def f(x):
    return x + 1

@Memoize
def g(x):
    return x + 2
Normal (first) run:
>>> x1 = 10
>>> y1 = f(x1)
calculating md5 of 10
calculating md5 of 11
creating file for md5 11
>>> y2 = g(y1)
calculating md5 of 11
calculating md5 of 13
creating file for md5 13
Second run, without lazy:
>>> x1 = 10
>>> y1 = f(x1)
calculating md5 of 10
loading file with md5 of 11
>>> y2 = g(y1)
calculating md5 of 11
loading file with md5 of 13
Second run, with lazy=True:
>>> x1 = 10
>>> y1 = f(x1, lazy=True)
calculating md5 of 10
>>> y2 = g(y1)
loading file with md5 of 13
The last option only calculates the "md5" of the first argument and loads the file of the end-result. That should be exactly what you wanted.
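One gap if you move this back to the disk-based setup from the question: md5_to_md5_storage only lives in memory, so the lazy path is cold on every interpreter run. A minimal sketch of persisting the argument-hash-to-result-hash map (my addition, not part of the original answer; MAP_FILE is an illustrative name):
import os
import pickle

MAP_FILE = 'md5_to_md5.pkl'  # illustrative path for the persisted map

def load_md5_map():
    # Restore the argument-hash -> result-hash map, or start empty.
    if os.path.exists(MAP_FILE):
        with open(MAP_FILE, 'rb') as f:
            return pickle.load(f)
    return {}

def save_md5_map(md5_map):
    # Rewrite the whole map; fine for maps of this size.
    with open(MAP_FILE, 'wb') as f:
        pickle.dump(md5_map, f, 2)
Memoize.__init__ would then seed self.md5_to_md5_storage = load_md5_map(), and __call__ would call save_md5_map(self.md5_to_md5_storage) after adding a new entry.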

Related

Emulating functions with an internal state

I have a working solution for what I am trying to achieve, but I am looking for a simpler way to do it.
I have a class that encapsulates a function, and a user can pass a function (lambda expression) to it. Those functions always take one input-data argument and an arbitrary number of user-defined custom arguments:
c.set_func(lambda x, offset, mul: mul*(x**2 + offset), offset=3, mul=1)
The user can then call a class method that runs the function with a predefined input and the currently set custom arguments. The user can also change the custom arguments by just changing attributes of the class.
Here is my code:
from functools import partial

class C:
    def __init__(self):
        self.data = 10  # just an example
        self.func = None
        self.arg_keys = []

    def set_func(self, func, **kwargs):
        self.func = func
        for key, value in kwargs.iteritems():
            # add arguments to __dict__
            self.__dict__[key] = value
            self.arg_keys.append(key)
        # store a list of the argument names
        self.arg_keys = list(set(self.arg_keys))

    def run_function(self):
        # get all arguments from __dict__ that are in the stored argument-list
        argdict = {key: self.__dict__[key] for key in self.arg_keys}
        f = partial(self.func, **argdict)
        return f(self.data)

if __name__ == '__main__':
    # Here is a test run:
    c = C()
    c.set_func(lambda x, offset, mul: mul*(x**2 + offset), offset=3, mul=1)
    print c.run_function()
    # -> 103
    c.offset = 5
    print c.run_function()
    # -> 105
    c.mul = -1
    print c.run_function()
    # -> -105
The important parts are:
- the user can initially set the function with any number of arguments
- the values of those arguments are stored until changed
Is there any builtin or otherwise simpler solution to this?
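For what it's worth, here is one possible simplification (a sketch of my own, not from the original thread): keep the custom arguments in a single dict instead of spreading them over __dict__, which removes the arg_keys bookkeeping and the partial call:
class C:
    def __init__(self):
        self.data = 10  # just an example
        self.func = None
        self.args = {}  # all custom arguments live here

    def set_func(self, func, **kwargs):
        self.func = func
        self.args.update(kwargs)

    def run_function(self):
        # merge the stored custom arguments into the call
        return self.func(self.data, **self.args)
The trade-off is that arguments are updated via c.args['offset'] = 5 rather than c.offset = 5.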

Reduce indentation of nested functions in Python code generation

My goal is to generate functions dynamically and then save them in a file. For example, in my current attempt, on calling create_file:
import io

def create_file():
    nested_func = make_nested_func()
    write_to_file([nested_func, a_root_func], '/tmp/code.py')

def a_root_func(x):
    pass

def make_nested_func():
    def a_nested_func(b, k):
        return b, k
    return a_nested_func

def write_to_file(code_list, path):
    import inspect
    code_str_list = [inspect.getsource(c) for c in code_list]
    with open(path, 'w') as ofh:
        for c in code_str_list:
            fh = io.StringIO(c)
            ofh.writelines(fh.readlines())
            ofh.write('\n')

create_file()
The output I want is ('/tmp/code.py'):
def a_nested_func(b, k):
    return b, k

def a_root_func(x):
    pass

The output I get is ('/tmp/code.py'):
    def a_nested_func(b, k):
        return b, k

def a_root_func(x):
    pass
a_nested_func is indented. How can I reduce the indentation? I can use lstrip etc., but I wonder if there is a better way.
Use the textwrap.dedent() function to remove the common leading whitespace:
import inspect
from textwrap import dedent

def write_to_file(code_list, path):
    code_str_list = [inspect.getsource(c) for c in code_list]
    with open(path, 'w') as ofh:
        for c in code_str_list:
            dedented = dedent(c)
            ofh.write(dedented + '\n')
Note that there is no need for a StringIO(string).readlines() dance here.
There's a function for this in the built-in textwrap module: textwrap.dedent.
import textwrap

s = """
    abc
    def
"""

s2 = """
abc
def
"""

assert textwrap.dedent(s) == s2

Generate function with arguments filled in when creating it?

My goal is to generate functions dynamically and then save them in a file. For example, in my current attempt, on calling create_file:
import io

def create_file(a_value):
    a_func = make_concrete_func(a_value)
    write_to_file([a_func], '/tmp/code.py')

def make_concrete_func(a_value):
    def concrete_func(b, k):
        return b + k + a_value
    return concrete_func

def write_to_file(code_list, path):
    import inspect
    code_str_list = [inspect.getsource(c) for c in code_list]
    with open(path, 'w') as ofh:
        for c in code_str_list:
            fh = io.StringIO(c)
            ofh.writelines(fh.readlines())
            ofh.write('\n')

create_file('my_value')
The output I want is (file '/tmp/code.py'):
def concrete_func(b, k):
    return b + k + 'my_value'

The output I get is (file '/tmp/code.py'):
def concrete_func(b, k):
    return b + k + a_value
UPDATE: My solution uses inspect.getsource, which returns a string. I wonder if I have limited your options, as most solutions below suggest a string replacement. The solution need not use inspect.getsource. You could write it any way that gets the desired output.
UPDATE 2: The reason I am doing this is that I want to generate a file for Amazon Lambda. Amazon Lambda takes a Python file and its virtual environment and will execute it for you (relieving you from worrying about scalability and fault tolerance). You have to tell Lambda which file and which function to call, and Lambda will execute it for you.
A function definition doesn't look up its free variables (variables that are not defined in the function itself) at the time of definition. I.e. concrete_func here:
def make_concrete_func(a_value):
    def concrete_func(b, k):
        return b + k + a_value
    return concrete_func
doesn't look up a_value when it is defined; instead it contains code to load a_value from its closure (simplified: from the enclosing function) at runtime.
You can see this by disassembling the returned function:
f = make_concrete_func(42)
import dis
print dis.dis(f)
  3           0 LOAD_FAST                0 (b)
              3 LOAD_FAST                1 (k)
              6 BINARY_ADD
              7 LOAD_DEREF               0 (a_value)
             10 BINARY_ADD
             11 RETURN_VALUE
None
You can maybe do what you want by editing the bytecode... it's been done before (http://bytecodehacks.sourceforge.net/bch-docs/bch/module-bytecodehacks.macro.html ...shudder).
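As an aside (my addition, not part of this answer): if you only needed the value bound at definition time, rather than substituted into the generated source, a default argument would capture it eagerly:
def make_concrete_func(a_value):
    # The default is evaluated once, when concrete_func is defined.
    def concrete_func(b, k, a_value=a_value):
        return b + k + a_value
    return concrete_func
That doesn't help with the file-generation goal, though: inspect.getsource would still show the name a_value rather than its value.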
Use getsource to convert the function to a string, and replace the variable names with simple string manipulation.
from inspect import getsource

def write_func(fn, path, **kwargs):
    fn_as_string = getsource(fn)
    for var in kwargs:
        fn_as_string = fn_as_string.replace(var, kwargs[var])
    with open(path, 'a') as fp:  # append to file
        fp.write('\n' + fn_as_string)

def base_func(b, k):
    return b + k + VALUE

# add quotes to string literals
write_func(base_func, '/tmp/code.py', VALUE="'my value'")
# you should replace the function name if you write multiple functions to the file
write_func(base_func, '/tmp/code.py', base_func='another_func', VALUE='5')
Output is as expected in /tmp/code.py:
def base_func(b, k):
    return b + k + 'my value'

def another_func(b, k):
    return b + k + 5
Try this. Note that I have added another parameter to write_to_file:
def write_to_file(code_list, path, a_value):
    code_str_list = [inspect.getsource(c) for c in code_list]
    with open(path, 'w') as ofh:
        for c in code_str_list:
            c = c.replace('a_value', "'" + a_value + "'")
            fh = io.StringIO(c)
            ofh.writelines(fh.readlines())
            ofh.write('\n')
If the file doesn't have to be human-readable and you trust it won't be manipulated by attackers, combining functools.partial and pickle might be the most Pythonic approach. However, it comes with a disadvantage: it doesn't work with locally defined functions, because pickle serializes functions by reference (their module-level name) rather than by value.
I might just ask my own question about this.
import functools
import pickle

def write_file_not_working():
    def concrete_func_not_working(b, k, a_value):
        return b + k + a_value
    with open('data.pickle', 'wb') as f:
        data = functools.partial(concrete_func_not_working, a_value='my_value')
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

def use_file_not_working():
    with open('data.pickle', 'rb') as f:
        resurrected_data = pickle.load(f)
    print(resurrected_data('hi', 'there'))

def top_level_concrete_func(b, k, a_value):
    return a_value + b + k

def write_file_working():
    with open('working.pickle', 'wb') as f:
        data = functools.partial(top_level_concrete_func, a_value='my_working_value')
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

def use_file_working():
    with open('working.pickle', 'rb') as f:
        resurrected_data = pickle.load(f)
    print(resurrected_data('hi', 'there'))

if __name__ == "__main__":
    write_file_working()
    use_file_working()
    write_file_not_working()
    use_file_not_working()
@Ben made me realize that I didn't need to use a string-based approach for code generation and that I could use serialization. Instead of the limited pickle library, I used dill, which overcomes the limitation mentioned by Ben.
So I finally did something like this:
import dill

def create_file(a_value, path):
    a_func = make_concrete_func(a_value)
    dill.dump(a_func, open(path, "wb"))
    return path

def make_concrete_func(a_value):
    def concrete_func(b, k):
        return b + k + a_value
    return concrete_func

if __name__ == '__main__':
    path = '/tmp/code.dill'
    create_file('Ben', path)
    a_func = dill.load(open(path, "rb"))
    print(a_func('Thank ', 'You '))
If the functions you want to create all follow a determinate pattern, I would create a template for it and use that to mass-produce the functions:
>>> def test(*values):
...     template = """
... def {name}(b,k):
...     return b + k + {value}
... """
...     for i, v in enumerate(values):
...         print(template.format(name="func{}".format(i), value=repr(v)))
...
>>> test("my_value", 42, [1])

def func0(b,k):
    return b + k + 'my_value'

def func1(b,k):
    return b + k + 42

def func2(b,k):
    return b + k + [1]

>>>

"Sub-classes" and self in Python

Note: I see that I need to more clearly work out what it is that I want each property/descriptor/class/method to do before I ask how to do it! I don't think my question can be answered at this time. Thanks all for helping me out.
Thanks to icktoofay and BrenBarn, I'm starting to understand descriptors and properties, but now I have a slightly harder question to ask:
I see now how these work:
class Blub(object):
    def __get__(self, instance, owner):
        print('Blub gets ' + instance._blub)
        return instance._blub

    def __set__(self, instance, value):
        print('Blub becomes ' + value)
        instance._blub = value

class Quish(object):
    blub = Blub()

    def __init__(self, value):
        self.blub = value
And I see how a = Quish('one') works (it produces "Blub becomes one"). But take a gander at this code:
import os
import glob

class Index(object):
    def __init__(self, dir=os.getcwd()):
        self.name = dir  # index name is directory of indexes
        # index is the list of indexes
        self.index = glob.glob(os.path.join(self.name, 'BatchStarted*'))
        # which is the pointer to the index (index[which] == BatchStarted_12312013_115959.txt)
        self.which = 0
        # self.file = self.File(self.index[self.which])

    def get(self):
        return self.index[self.which]

    def next(self):
        self.which += 1
        if self.which < len(self.index):
            return self.get()
        else:
            # loop back to the first
            self.which = 0
            return None

    def back(self):
        if self.which > 0:
            self.which -= 1
        return self.get()

class File(object):
    def __init__(self, file):
        # if the file exists, we'll use it.
        if os.path.isfile(file):
            self.name = file
        # otherwise, our name is none and we return.
        else:
            self.name = None
            return None
        # 'file' attribute is the actual file object
        self.file = open(self.name, 'r')
        self.line = Lines(self.file)

class Lines(object):
    # pass through the actual file object (not filename)
    def __init__(self, file):
        self.file = file
        # line is the list of this file's lines
        self.line = self.file.readlines()
        self.which = 0
        self.extension = Extension(self.line[self.which])

    def __get__(self):
        return self.line[self.which]

    def __set__(self, value):
        self.which = value

    def next(self):
        self.which += 1
        return self.__get__()

    def back(self):
        self.which -= 1
        return self.__get__()

class Extension(object):
    def __init__(self, lineStr):
        # check to make sure a string is passed
        if lineStr:
            self.lineStr = lineStr
            self.line = self.lineStr.split('|')
            self.pathStr = self.line[0]
            self.path = self.pathStr.split('\\')
            self.fileStr = self.path[-1]
            self.file = self.fileStr.split('.')
        else:
            self.lineStr = None

    def __get__(self):
        self.line = self.lineStr.split('|')
        self.pathStr = self.line[0]
        self.path = self.pathStr.split('\\')
        self.fileStr = self.path[-1]
        self.file = self.fileStr.split('.')
        return self.file[-1]

    def __set__(self, ext):
        self.file[-1] = ext
        self.fileStr = '.'.join(self.file)
        self.path[-1] = self.fileStr
        self.pathStr = '\\'.join(self.path)
        self.line[0] = self.pathStr
        self.lineStr = '|'.join(self.line)
Firstly, there may be some typos in here because I've been working on it and leaving it half-arsed. That's not my point. My point is that in icktoofay's example, nothing gets passed to Blub(). Is there any way to do what I'm doing here, that is, set some "self" attributes and, after doing some processing, pass the result to the next class? Would this be better suited to a property?
I would like to have it so that:
>>> i = Index() # i contains list of index files
>>> f = File(i.get()) # f is now one of those files
>>> f.line
'\\\\server\\share\\folder\\file0.txt|Name|Sean|Date|10-20-2000|Type|1'
>>> f.line.extension
'txt'
>>> f.line.extension = 'rtf'
>>> f.line
'\\\\server\\share\\folder\\file0.rtf|Name|Sean|Date|10-20-2000|Type|1'
You can do that, but the issue there is less about properties/descriptors and more about creating classes that give the behavior you want.
So, when you do f.line, that is some object. When you do f.line.extension, that is doing (f.line).extension; that is, it first evaluates f.line and then gets the extension attribute of whatever f.line is.
The important thing here is that f.line cannot know whether you are later going to try to access its extension. So you can't have f.line do one thing for "plain" f.line and another thing for f.line.extension. The f.line part has to be the same in both, and the extension part can't change that.
The solution for what you seem to want to do is to make f.line return some kind of object that in some way looks or works like a string, but also allows setting attributes and updating itself accordingly. Exactly how you do this depends on how much you need f.line to behave like a string and how much you need it to do other stuff. Basically you need f.line to be a "gatekeeper" object that handles some operations by acting like a string (e.g., you apparently want it to display as a string), and handles other operations in custom ways (e.g., you apparently want to be able to set an extension attribute on it and have that update its contents).
Here's a simplistic example:
class Line(object):
    def __init__(self, txt):
        self.base, self.extension = txt.split('.')

    def __str__(self):
        return self.base + "." + self.extension
Now you can do:
>>> line = Line('file.txt')
>>> print line
file.txt
>>> line.extension
'txt'
>>> line.extension = 'foo'
>>> print line
file.foo
However, notice that I did print line, not just line. By writing a __str__ method, I defined the behavior that happens when you do print line. But if you evaluate it "raw" without printing it, you'll see it's not really a string:
>>> line
<__main__.Line object at 0x000000000233D278>
You could override this behavior as well (by defining __repr__), but do you want to? That depends on how you want to use line. The point is that you need to decide what you want your line to do in what situations, and then craft a class that does that.
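If you do want it, a minimal sketch (my addition) is:
def __repr__(self):
    # Make the bare expression display like a string literal, e.g. 'file.txt'.
    # Note this only hides the fact that Line is not a real str.
    return repr(str(self))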

How to watch for a variable change in python without dunder setattr or pdb

There is a large Python project where one attribute of one class just has the wrong value in some place.
It should be sqlalchemy.orm.attributes.InstrumentedAttribute, but when I run tests it is a constant value, let's say a string.
Is there some way to run a Python program in debug mode, and run some check (whether the variable changed type) after each line of code automatically?
P.S. I know how to log changes to an attribute of a class instance with the help of inspect and the property decorator. Possibly I can use this method with metaclasses here...
But sometimes I need a more general and powerful solution...
Thank you.
P.P.S. I need something like this: https://stackoverflow.com/a/7669165/816449, but maybe with more explanation of what is going on in that code.
Well, here is a sort of slow approach. It can be modified to watch for a local variable change (just by name). Here is how it works: we call sys.settrace and analyse the value of obj.attr at each step. The tricky part is that we receive 'line' events (that some line was executed) before the line is executed. So, when we notice that obj.attr has changed, we are already on the next line and we can't get the previous line's frame (because frames aren't copied for each line, they are modified in place). So on each line event I save traceback.format_stack to watcher.prev_st, and if on the next call of trace_command the value has changed, we print the saved stack trace to a file. Saving the traceback on each line is quite an expensive operation, so you'd have to set the include keyword to a list of your project's directories (or just the root of your project) in order not to watch how other libraries are doing their stuff and waste CPU.
watcher.py
import traceback

class Watcher(object):
    def __init__(self, obj=None, attr=None, log_file='log.txt', include=[], enabled=False):
        """
        Debugger that watches for changes in object attributes
        obj - object to be watched
        attr - string, name of attribute
        log_file - string, where to write output
        include - list of strings, debug files only in these directories.
            Set it to the path of your project, otherwise it will take a long
            time to run on big libraries' import and usage.
        """
        self.log_file = log_file
        with open(self.log_file, 'wb'):
            pass  # truncate the log file
        self.prev_st = None
        self.include = [incl.replace('\\', '/') for incl in include]
        if obj:
            self.value = getattr(obj, attr)
        self.obj = obj
        self.attr = attr
        self.enabled = enabled  # Important: must be the last line of __init__.

    def __call__(self, *args, **kwargs):
        kwargs['enabled'] = True
        self.__init__(*args, **kwargs)

    def check_condition(self):
        tmp = getattr(self.obj, self.attr)
        result = tmp != self.value
        self.value = tmp
        return result

    def trace_command(self, frame, event, arg):
        if event != 'line' or not self.enabled:
            return self.trace_command
        if self.check_condition():
            if self.prev_st:
                with open(self.log_file, 'ab') as f:
                    print >>f, "Value of", self.obj, ".", self.attr, "changed!"
                    print >>f, "###### Line:"
                    print >>f, ''.join(self.prev_st)
        if self.include:
            fname = frame.f_code.co_filename.replace('\\', '/')
            to_include = False
            for incl in self.include:
                if fname.startswith(incl):
                    to_include = True
                    break
            if not to_include:
                return self.trace_command
        self.prev_st = traceback.format_stack(frame)
        return self.trace_command

import sys
watcher = Watcher()
sys.settrace(watcher.trace_command)
testwatcher.py
from watcher import watcher
import numpy as np
import urllib2

class X(object):
    def __init__(self, foo):
        self.foo = foo

class Y(object):
    def __init__(self, x):
        self.xoo = x

    def boom(self):
        self.xoo.foo = "xoo foo!"

def main():
    x = X(50)
    watcher(x, 'foo', log_file='log.txt', include=['C:/Users/j/PycharmProjects/hello'])
    x.foo = 500
    x.goo = 300
    y = Y(x)
    y.boom()
    arr = np.arange(0, 100, 0.1)
    arr = arr**2
    for i in xrange(3):
        print 'a'
        x.foo = i
    for i in xrange(1):
        i = i + 1

main()
There's a very simple way to do this: use watchpoints.
Basically you only need to do
from watchpoints import watch
watch(your_object.attr)
That's it. Whenever the attribute is changed, it will print out the line that changed it and how it's changed. Super easy to use.
It also has more advanced features, for example, you can call pdb when the variable is changed, or use your own callback functions instead of print it to stdout.
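For instance, the callback form looks roughly like this (a sketch; I'm assuming the callback keyword from the watchpoints documentation, so double-check the exact callback signature there):
from watchpoints import watch

def on_change(*args):
    # *args keeps this sketch independent of the exact arguments
    # watchpoints passes to callbacks (frame / element / exec info).
    print('attribute changed!')

watch(your_object.attr, callback=on_change)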
A simpler way to watch for an object's attribute change (which can also be a module-level variable or anything accessible with getattr) is to leverage the hunter library, a flexible code-tracing toolkit. To detect state changes we need a predicate, which can look like the following:
import traceback

class MutationWatcher:
    def __init__(self, target, attrs):
        self.target = target
        self.state = {k: getattr(target, k) for k in attrs}

    def __call__(self, event):
        result = False
        for k, v in self.state.items():
            current_value = getattr(self.target, k)
            if v != current_value:
                result = True
                self.state[k] = current_value
                print('Value of attribute {} has changed from {!r} to {!r}'.format(
                    k, v, current_value))
        if result:
            traceback.print_stack(event.frame)
        return result
Then, given this sample code:
class TargetThatChangesWeirdly:
    attr_name = 1

def some_nested_function_that_does_the_nasty_mutation(obj):
    obj.attr_name = 2

def some_public_api(obj):
    some_nested_function_that_does_the_nasty_mutation(obj)
We can instrument it with hunter like:
# or any other entry point that calls the public API of interest
if __name__ == '__main__':
    obj = TargetThatChangesWeirdly()
    import hunter
    watcher = MutationWatcher(obj, ['attr_name'])
    hunter.trace(watcher, stdlib=False, action=hunter.CodePrinter)
    some_public_api(obj)
Running the module produces:
Value of attribute attr_name has changed from 1 to 2
  File "test.py", line 44, in <module>
    some_public_api(obj)
  File "test.py", line 10, in some_public_api
    some_nested_function_that_does_the_nasty_mutation(obj)
  File "test.py", line 6, in some_nested_function_that_does_the_nasty_mutation
    obj.attr_name = 2
test.py:6     return    obj.attr_name = 2
              ...       return value: None
You can also use other actions that hunter supports. For instance, Debugger which breaks into pdb (debugger on an attribute change).
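Swapping the action is a one-line change; a sketch (assuming hunter exports Debugger at the top level, as it does CodePrinter):
hunter.trace(watcher, stdlib=False, action=hunter.Debugger)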
Try using __setattr__ to override the function that is called when an attribute assignment is attempted. Documentation for __setattr__
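A minimal sketch of that approach (my illustration, not from the original answer):
class Watched(object):
    def __setattr__(self, name, value):
        # Log (or break) on every assignment before storing the value.
        print('setting {} = {!r}'.format(name, value))
        super(Watched, self).__setattr__(name, value)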
You can use the Python debugger module pdb (part of the standard library).
To use it, just import pdb at the top of your source file:
import pdb
and then set a trace wherever you want to start inspecting the code:
pdb.set_trace()
You can then step through the code with n, and investigate the current state by running Python commands.
def __setattr__(self, name, value):
    if name == "xxx":
        # util.output_stack is the poster's own helper for logging the current stack
        util.output_stack('xxxxx')
    super(XXX, self).__setattr__(name, value)
This sample code helped me.
