Structuring Python Code for Data Analysis

I wrote code for a data analysis project, but it's becoming unwieldy and I'd like to find a better way of structuring it so I can share it with others.
For the sake of brevity, I have something like the following:
def process_raw_text(txt_file):
    # do stuff
    return token_text

def tag_text(token_text):
    # do stuff
    return tagged

def bio_tag(tagged):
    # do stuff
    return bio_tagged

def restructure(bio_tagged):
    # do stuff
    return restructured

print(restructured)
Basically I'd like the program to run through all of the functions sequentially and print the output.
In looking into ways to structure this, I read up on classes like the following:
class Calculator():
    def add(x, y):
        return x + y

    def subtract(x, y):
        return x - y
This seems useful when structuring a project to allow individual functions to be called separately, such as the add function with Calculator.add(x,y), but I'm not sure it's what I want.
Is there something I should be looking into for a sequential run of functions (that are meant to structure the data flow and provide readability)? Ideally, I'd like all functions to be within "something" I could call once, that would in turn run everything within it.

Chain together the output from each function as the input to the next:
def main():
    print(restructure(bio_tag(tag_text(process_raw_text(txt_file)))))

if __name__ == '__main__':
    main()
@SvenMarnach makes a nice suggestion. A more general solution is to realise that this idea of repeatedly using the output as the input for the next function in a sequence is exactly what the reduce function does (functools.reduce in Python 3; the old apply built-in is gone, so a small lambda does the job). We want to start with some input txt_file:
from functools import reduce

def main():
    pipeline = [process_raw_text, tag_text, bio_tag, restructure]
    print(reduce(lambda data, func: func(data), pipeline, txt_file))

There's nothing preventing you from creating a class (or set of classes) that represents the pipeline you want to manage, with methods that call the functions you need in sequence.
class DataAnalyzer():
    # ...
    def your_method(self, **kwargs):
        # call sequentially, or use the 'magic' proposed by others,
        # but internally to your class and not visible to clients
        pass
The functions themselves could remain private within the module, since they seem to be implementation details.
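For example, a minimal sketch along these lines, reusing the pipeline functions from the question (the class and method names, and 'input.txt', are just illustrative):

class DataAnalyzer:
    """Wraps the pipeline so clients only ever call run()."""

    def run(self, txt_file):
        # call the module-private functions sequentially
        token_text = process_raw_text(txt_file)
        tagged = tag_text(token_text)
        bio_tagged = bio_tag(tagged)
        return restructure(bio_tagged)

print(DataAnalyzer().run('input.txt'))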

You can implement a simple dynamic pipeline using just modules and functions.
my_module.py
# identifiers can't start with a digit, so use a sortable prefix
def step_01_process_raw_text(txt_file):
    # do stuff
    return token_text

def step_02_tag_text(token_text):
    # do stuff
    return tagged
my_runner.py
import re

import my_module

if __name__ == '__main__':
    funcs = sorted(name for name in vars(my_module) if re.match(r'step_\d+_', name))
    data = initial_data
    for name in funcs:
        data = getattr(my_module, name)(data)

Related

Making functions given as parameters really visible

A function I'm writing takes a list of functions as a parameter. How can I bind the functions from the list to their names in the outer function's namespace? To clarify, this should be executable:
def fun1(x):
    return x + 1

def fun2(x):
    return x - 1

def outer(funcs, execstr):
    # ???
    exec(execstr)

outer([fun1, fun2], "print(fun1(x) + fun2(x))")
I think I could find a hacky workaround by executing the code of funcs in the outer function, but it seems very wasteful. Is there a known way of doing this better?
Added:
It's an exercise in program synthesis: I can synthesise a string with a definition of a Python function F. I want to evaluate the function in the context of some wrapper function whose symbol table fills in all the missing details in the code of F. The code below works fine, because fun1 and fun2 are in outer's scope.
def outer():
    def fun1(x):
        return x + 1

    def fun2(x):
        return x - 1

    exec("print(fun1(2) + fun2(3))")

outer()
So, I guess that I want to add entries like 'fun1': ... to the locals() dict of outer.
Edit after reading more on exec:
def fun1(x):
    return x + 1

def fun2(x):
    return x - 1

def outer(f1, f2):
    lf = {}
    lf['fun1'] = f1
    lf['fun2'] = f2
    exec("print(fun1(2) + fun2(3))", lf)

outer(fun1, fun2)
Okay, I learned a bit about how it works. This is solved for me.
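For reference, the same idea generalizes back to the original outer(funcs, execstr) signature by building the namespace from each function's __name__ (a minimal sketch of the approach the edit above arrived at):

def outer(funcs, execstr):
    # map each function's name to the function itself for exec's namespace
    namespace = {f.__name__: f for f in funcs}
    exec(execstr, namespace)

outer([fun1, fun2], "print(fun1(2) + fun2(3))")  # prints 5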
I think I don't fully understand the whole problem and what you intend with this, but I thought it would be fun to try something similar.
Answering (what I believe is) your question, I think it is not possible to pass functions as parameters and "dynamically" invoke them this way (as far as I know in Python). But you can dynamically create a program which invokes those functions.
First, create the "main" script containing the functions and variables you will use (I called it "scratch.py"). Inside that script, write a function which will write another script, according to the parameters passed to it:
x = 2

def fun1(x):
    return x + 1

def fun2(x):
    return x - 1

def outer(execstr):
    f = open('test.py', 'w')
    f.write("from scratch import *\n")
    f.write("def exec_code():\n")
    f.write("\t" + execstr + "\n")
    f.write("exec_code()\n")
    f.close()

def main():
    outer("print(fun1(x) + fun2(x))")
    import test

if __name__ == "__main__":
    main()
This script will generate a second "dynamic" script (called test.py), which imports the "main" script (scratch) so it can use its functions and variables, and which is then imported (and thereby executed) by the previous script right after being created. It will look like this:
from scratch import *

def exec_code():
    print(fun1(x) + fun2(x))

exec_code()
Output:
4
... Useless, but funny!

Randomly select functions

I am writing a test script which contains different functions for different tests. I would like to be able to randomly select a test to run. I have already achieved this with the following function...
test_options = ("AOI", "RMODE")
def random_test(test_options, control_obj):
ran_test_opt = choice(test_options)
if ran_test_opt.upper() == "AOI":
logging.debug("Random AOI Test selected")
random_aoi()
elif ran_test_opt.upper() == "RMODE":
logging.debug("Random Read Mode Test selected")
random_read_mode(control_obj)
However, I want to be able to add further test functions without having to modify the random test selection function. All I would like to do is add the test function to the script. Additionally, I would like a way of selecting which tests are included in the random selection. This is what the variable test_options does. How would I go about changing my random_test function to achieve this?
EDIT: I got around the fact that all the tests might need different arguments by including them all in a test class. All the arguments are passed into __init__, and the test functions refer to them using "self." when they need a specific variable...
class Test(object):
    """A class that contains and keeps track of the tests and the different modes"""
    def __init__(self, parser, control_obj):
        self.parser = parser
        self.control_obj = control_obj

    def random_test(self):
        test_options = []
        for name in self.parser.options('Test_Selection'):
            if self.parser.getboolean('Test_Selection', name):
                test_options.append(name.lower())
        ran_test_opt = choice(test_options)
        ran_test_func = getattr(self, ran_test_opt)
        ran_test_func()

    #### TESTS ####
    def random_aoi(self):
        logging.info("Random AOI Test")
        self.control_obj.random_readout_size()

    def random_read_mode(self):
        logging.info("Random Readout Mode Test")
        self.control_obj.random_read_mode()
You can create a list of functions in Python which you can call:
test_options = (random_aoi, random_read_mode)

def random_test(test_options, control_obj):
    ran_test_opt = choice(test_options)
    ran_test_opt(control_obj)  # call the randomly selected function
You have to make each function take the same arguments this way, so you can call them all the same way.
If you need to have some human readable names for the functions, you can store them in a dictionary, together with the function. I expect that you pass the control_obj to every function.
EDIT: This seems to be identical to the answer by @Ikke, but uses a dictionary instead of a list of functions.
>>> test_options = {'AOI': random_aoi, 'RMODE': random_read_mode}
>>> def random_test(test_options, control_obj):
... ran_test_opt = test_options[choice(test_options.keys())]
... ran_test_opt(control_obj)
Or you could pick a test directly from test_options.values(). The 'extra' call to list() is needed because in Python 3.x, dict.values() returns a view rather than a list.
>>> ran_test_opt = choice(list(test_options.values()))
I'm gonna go out on a limb here and suggest that you actually use a real unittest framework for this. Python provides one -- conveniently named unittest. To randomly run a test, it would work like this:
import unittest

class Tests(unittest.TestCase):
    def test_something(self):
        ...

    def test_something_else(self):
        ...

if __name__ == '__main__':
    tests = list(unittest.defaultTestLoader.loadTestsFromTestCase(Tests))
    import random
    # repeat as many times as you would like...
    test = random.choice(tests)
    print(test)
    test()
To run all the tests instead, you would use unittest.main() -- I suppose you could toggle which happens via a simple commandline switch.
This has the huge advantage that you don't need to keep an up-to-date list of tests separate from the tests themselves. If you want to add a new test, just add one and unittest will find it (as long as the method name starts with test). It will also tell you information about which test runs and which one fails, etc.
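A minimal sketch of such a switch, assuming a hypothetical --random flag (stripped from sys.argv so unittest doesn't see it):

import random
import sys
import unittest

if __name__ == '__main__':
    if '--random' in sys.argv:
        sys.argv.remove('--random')
        tests = list(unittest.defaultTestLoader.loadTestsFromTestCase(Tests))
        test = random.choice(tests)
        print(test)
        test()
    else:
        unittest.main()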
If you wrapped your test selecting function and the tests themselves in a class, you could do the following:
from random import choice

class Test(object):
    """ Randomly selects a test from the methods with 'random' in the name """
    def random_aoi(self):
        print('aoi')

    def random_read_mode(self):
        print('read_mode')

    def random_text(self):
        print('test')

    # add as many tests as needed here

    # important that this function doesn't have 'random' in the name (in my code anyway)
    def run_test(self):  # you can add control_obj to the args
        methods = [m for m in dir(self) if callable(getattr(self, m)) and 'random' in m]
        test = choice(methods)
        getattr(self, test)()  # you would add control_obj between these parens also

app = Test()
app.run_test()
This makes it easy to add tests without the need to change any other code.
See the Python documentation on getattr for more details.
In addition to the other options, look at functools.partial. It allows you to create closures for functions:
import random
from functools import partial

def test_function_a(x, y):
    return x + y

def other_function(b, c=5, d=4):
    return (b * c) ** d

def my_test_functions(some_input):
    funclist = (partial(test_function_a, x=some_input, y=4),
                partial(other_function, b=some_input, d=9))
    return funclist

random.choice(my_test_functions(some_input))()  # some_input defined elsewhere
This lets you normalize the argument list for each test function.

cleaning up nested function calls

I have written several functions that run sequentially, each one taking as its input the output of the previous function, so in order to run them I have to run this line of code:
make_list(cleanup(get_text(get_page(URL))))
and I just find that ugly and inefficient. Is there a better way to write sequential function calls?
Really, this is the same as any case where you want to refactor commonly-used complex expressions or statements: just turn the expression or statement into a function. The fact that your expression happens to be a composition of function calls doesn't make any difference (but see below).
So, the obvious thing to do is to write a wrapper function that composes the functions together in one place, so everywhere else you can make a simple call to the wrapper:
def get_page_list(url):
    return make_list(cleanup(get_text(get_page(url))))

things = get_page_list(url)
stuff = get_page_list(another_url)
spam = get_page_list(eggs)
If you don't always call the exact same chain of functions, you can factor out the pieces that you frequently call. For example:
def get_clean_text(page):
    return cleanup(get_text(page))

def get_clean_page(url):
    return get_clean_text(get_page(url))
This refactoring also opens the door to making the code a bit more verbose but a lot easier to debug, since the chain appears in only one place instead of being repeated everywhere:
def get_page_list(url):
    page = get_page(url)
    text = get_text(page)
    cleantext = cleanup(text)
    return make_list(cleantext)
If you find yourself needing to do exactly this kind of refactoring of composed functions very often, you can always write a helper that generates the refactored functions. For example:
from functools import wraps

def compose1(*funcs):
    @wraps(funcs[0])
    def composed(arg):
        for func in reversed(funcs):
            arg = func(arg)
        return arg
    return composed

get_page_list = compose1(make_list, cleanup, get_text, get_page)
If you want a more complicated compose function (that, e.g., allows passing multiple args/return values around), it can get a bit complicated to design, so you might want to look around on PyPI and ActiveState for the various existing implementations.
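As a rough illustration of one such design, here is a sketch of a compose that threads multiple values by treating tuple returns as the argument list for the next stage (with the caveat that a stage may legitimately want to return a tuple as a single value):

def compose_multi(*funcs):
    def composed(*args):
        for func in reversed(funcs):
            result = func(*args)
            # a tuple return becomes the argument list of the next stage
            args = result if isinstance(result, tuple) else (result,)
        return args[0] if len(args) == 1 else args
    return composed

def split(s):
    return s.upper(), s.lower()

def join(upper, lower):
    return upper + '/' + lower

print(compose_multi(join, split)('Hi'))  # prints HI/hi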
You could try something like this. I always like breaking up train wrecks (the book "Clean Code" calls nested calls like these train wrecks). This is easier to read and debug. Remember, you probably spend twice as long reading your code as writing it, so make it easier to read. You will thank yourself later.
page = get_page(URL)
page_text = get_text(page)
make_list(cleanup(page_text))

# you can also encapsulate that into its own function
def build_page_list_from_url(url):
    page = get_page(url)
    page_text = get_text(page)
    return make_list(cleanup(page_text))
Options:

- Refactor: implement this series of function calls as one, aptly-named method.
- Look into decorators. They're syntactic sugar for 'chaining' functions in this way. E.g. implement cleanup and make_list as decorators, then decorate get_text with them; a sketch follows this list.
- Compose the functions. See the compose1 code in the answer above.
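For the decorator option, it could look something like this sketch (pipe_through is a hypothetical helper written for illustration, not a standard decorator):

def pipe_through(after):
    """Decorator factory: feed the decorated function's result to 'after'."""
    def decorate(func):
        def wrapper(*args, **kwargs):
            return after(func(*args, **kwargs))
        return wrapper
    return decorate

@pipe_through(make_list)   # applied second
@pipe_through(cleanup)     # applied first
def get_clean_list(page):
    return get_text(page)

# get_clean_list(page) == make_list(cleanup(get_text(page)))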
You could shorten constructs like that with something like the following:
class ChainCalls(object):
    def __init__(self, *funcs):
        self.funcs = funcs

    def __call__(self, *args, **kwargs):
        result = self.funcs[-1](*args, **kwargs)
        for func in self.funcs[-2::-1]:
            result = func(result)
        return result

def make_list(arg): return 'make_list(%s)' % arg
def cleanup(arg): return 'cleanup(%s)' % arg
def get_text(arg): return 'get_text(%s)' % arg
def get_page(arg): return 'get_page(%r)' % arg

mychain = ChainCalls(make_list, cleanup, get_text, get_page)

print(mychain('http://is.gd'))
Output:
make_list(cleanup(get_text(get_page('http://is.gd'))))

Is this a good approach to execute a list of operations on a data structure in Python?

I have a dictionary of data, the key is the file name and the value is another dictionary of its attribute values. Now I'd like to pass this data structure to various functions, each of which runs some test on the attribute and returns True/False.
One approach would be to call each function one by one explicitly from the main code. However I can do something like this:
# MYmodule.py
class Mymodule:
    def MYfunc1(self):
        ...

    def MYfunc2(self):
        ...

# main.py
import Mymodule
...
# fill the data structure
...
# Now call all the functions in Mymodule one by one
for funcs in dir(Mymodule):
    if funcs[:2] == 'MY':
        result = Mymodule.__dict__.get(funcs)(dataStructure)
The advantage of this approach is that the implementation of the main script needn't change when I add more logic/tests to MYmodule.
Is this a good way to solve the problem at hand? Are there better alternatives to this solution?
I'd say a better and much more Pythonic approach would be to define a decorator to indicate which functions you want to use:
class MyFunc(object):
    funcs = []
    def __init__(self, func):
        self.funcs.append(func)

@MyFunc
def foo():
    return 5

@MyFunc
def bar():
    return 10

def quux():
    # Not decorated, so will not be in MyFunc
    return 20

for func in MyFunc.funcs:
    print(func())
Output:
5
10
Essentially you're performing the same logic: taking only the functions that were defined in a particular manner and applying them to a specific set of data.
Sridhar, the method you proposed is very similar to the one used in the unittest module.
For example, this is how unittest.TestLoader finds the names of all the test methods to run (lifted from /usr/lib/python2.6/unittest.py):
def getTestCaseNames(self, testCaseClass):
    """Return a sorted sequence of method names found within testCaseClass
    """
    def isTestMethod(attrname, testCaseClass=testCaseClass, prefix=self.testMethodPrefix):
        return attrname.startswith(prefix) and hasattr(getattr(testCaseClass, attrname), '__call__')
    testFnNames = filter(isTestMethod, dir(testCaseClass))
    if self.sortTestMethodsUsing:
        testFnNames.sort(key=_CmpToKey(self.sortTestMethodsUsing))
    return testFnNames
Just like your proposal, unittest uses dir to list all the attributes of testCaseClass, and filters the list for those whose names start with prefix (which is set elsewhere to 'test').
I suggest a few minor changes:
If you place the functions in MYmodule.py, then (of course) the import statement must be
import MYmodule
Use getattr instead of .__dict__.get. Not only is it shorter, but it continues to work if you subclass Mymodule. That might not be your intention at this point, but using getattr is probably a good default habit anyway.
for funcs in dir(MYmodule.Mymodule):
    if funcs.startswith('MY'):
        result = getattr(MYmodule.Mymodule, funcs)(dataStructure)

Hashing a python function to regenerate output when the function is modified

I have a python function that has a deterministic result. It takes a long time to run and generates a large output:
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might call functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.
Is this possible or ridiculous?
Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.
If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.
import inspect
import functools
import json

def memoize_zeroadic_function_to_disk(memo_filename):
    def decorator(f):
        try:
            with open(memo_filename, 'r') as fp:
                cache = json.load(fp)
        except IOError:
            # file doesn't exist yet
            cache = {}

        source = inspect.getsource(f)

        @functools.wraps(f)
        def wrapper():
            if source not in cache:
                cache[source] = f()
                with open(memo_filename, 'w') as fp:
                    json.dump(cache, fp)
            return cache[source]
        return wrapper
    return decorator

@memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:
def time_consuming_method():
    # your existing method here

# Is the cached data older than this file?
if (not os.path.exists(data_file_name)
        or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
    data = time_consuming_method()
    save_data(data_file_name, data)
else:
    data = load_data(data_file_name)

# redefine method
def time_consuming_method():
    return data
While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.
The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some Python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your checkin routine that deletes your serialized table? Or, if this is not research data and is in production, make it part of your release process: if the revision number of your file (put this function in its own file) has changed, your release script deletes the serialized lookup table.
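A rough sketch of such a hook, using a source hash rather than a revision number (the file names and the idea of a sidecar hash file are assumptions for illustration, not part of the question):

import hashlib
import inspect
import os

def drop_stale_cache(func, cache_path='lookup_table.pickle',
                     hash_path='lookup_table.srchash'):
    # hash the function's current source and compare with the stored hash
    digest = hashlib.sha256(inspect.getsource(func).encode()).hexdigest()
    stored = open(hash_path).read() if os.path.exists(hash_path) else None
    if digest != stored:
        if os.path.exists(cache_path):
            os.remove(cache_path)  # force regeneration on the next run
        with open(hash_path, 'w') as fp:
            fp.write(digest)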
So, here is a really neat trick using decorators:
def memoize(f):
    cache = {}
    def result(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    return result
With the above, you can then use:

@memoize
def myfunc(x, y, z):
    # Some really long running computation

When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "@memoize" again, or explicitly write:

myfunc = memoize(new_definition_for_myfunc)
Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:
import os
import os.path
import pickle

class MemoizedFunction(object):
    def __init__(self, f):
        self.function = f
        # note: hash(f) is not stable across interpreter runs, so a
        # source-based hash would be more robust here
        self.filename = str(hash(f)) + ".cache"
        self.cache = {}
        if os.path.exists(self.filename):
            with open(self.filename, 'rb') as file:
                self.cache = pickle.load(file)

    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.function(*args)
        return self.cache[args]

    def __del__(self):
        with open(self.filename, 'wb') as file:
            pickle.dump(self.cache, file, pickle.HIGHEST_PROTOCOL)

def memoize(f):
    return MemoizedFunction(f)
What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.
An (overly simplified) example:
def memoized(f):
    cache = {}
    def memo(*args):
        if args in cache:
            return cache[args]
        else:
            ret = f(*args)
            cache[args] = ret
            return ret
    return memo

@memoized
def time_consuming_method():
    # lots_of_computing_time to come up with the_result
    return the_result
Edit:
From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some sort of persistent storage for the cache (e.g. something as simple as Pickle or a plain text file, or maybe a full-blown database, or anything in between). The choice of method depends on what the OP needs. Several other answers already give some solutions to this, so I'm not going to repeat that here.
