I had a program that read in a text file and extracted the necessary variables for serialization into Turtle format and storage in an RDF graph. The code I had was crude and I was advised to separate it into functions. As I am new to Python, I had no idea how to do this. Below are some of the functions of the program.
I am getting confused as to when parameters should be passed into the functions and when they should be initialized with self. Here are some of my functions. If I could get an explanation as to what I am doing wrong, that would be great.
#!/usr/bin/env python
from rdflib import URIRef, Graph
from StringIO import StringIO
import subprocess as sub
class Wordnet():
    def __init__(self, graph):
        self.graph = Graph()

    def process_file(self, file):
        file = open("new_2.txt", "r")
        return file

    def line_for_loop(self, file):
        for line in file:
            self.split_pointer_part()
            self.split_word_part()
            self.split_gloss_part()
            self.process_lex_filenum()
            self.process_synset_offset()
            +more functions............
        self.print_graph()

    def split_pointer_part(self, before_at, after_at, line):
        before_at, after_at = line.split('#', 1)
        return before_at, after_at

    def get_num_words(self, word_part, num_words):
        """ 1 as default, may want 0 as an invalid case """
        """ do if else statements on l3 variable """
        if word_part[3] == '0a':
            num_words = 10
        else:
            num_words = int(word_part[3])
        return num_words

    def get_pointers_list(self, pointers, after_at, num_pointers, pointerList):
        pointers = after_at.split()[0:0 + 4 * num_pointers:4]
        pointerList = iter(pointers)
        return pointerList

    ............code to create triples for graph...............

    def print_graph(self):
        print graph.serialize(format='nt')

def main():
    wordnet = Wordnet()
    my_file = wordnet.process_file()
    wordnet.line_for_loop(my_file)

if __name__ == "__main__":
    main()
Your question is mainly about what object-oriented programming is. I will try to explain quickly, but I recommend reading a proper tutorial on it, like
http://www.voidspace.org.uk/python/articles/OOP.shtml
http://net.tutsplus.com/tutorials/python-tutorials/python-from-scratch-object-oriented-programming/
and/or http://www.tutorialspoint.com/python/python_classes_objects.htm
When you create a class and instantiate it (with mywordnet = WordNet(somegraph)), you can reuse the mywordnet instance many times. Each variable you set on self in WordNet is stored in that instance. So, for instance, self.graph is always available if you call any method of mywordnet. If you didn't store it in self.graph, you would need to specify it as a parameter in each method (function) that requires it, which would be tedious if all of these method calls require the same graph anyway.
So to look at it another way: everything you set with self. can be seen as a sort of configuration for that specific instance of Wordnet. It influences the Wordnet behaviour. You could for instance have two Wordnet instances, each instantiated with a different graph, but all other functionality the same. That way you can choose which graph to print to, depending on which Wordnet instance you use, but everything else stays the same.
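For instance, a minimal sketch of that idea (the two Graph() calls are just placeholders for whatever graphs you actually want to work with):

class Wordnet:
    def __init__(self, graph):
        self.graph = graph          # keep the graph that was passed in

    def print_graph(self):
        print(self.graph.serialize(format='nt'))

wordnet_a = Wordnet(Graph())        # one instance with its own graph
wordnet_b = Wordnet(Graph())        # a second, independent instance
wordnet_a.print_graph()             # prints wordnet_a's graph, never wordnet_b's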
I hope this helps you out a little.
First, I suggest you figure out the basic functional decomposition on its own - don't worry about writing a class at all.
For example,
def split_pointer_part(self, before_at, after_at, line):
    before_at, after_at = line.split('#', 1)
    return before_at, after_at
doesn't touch any instance variables (it never refers to self), so it can just be a standalone function.
It also exhibits a peculiarity I see in your other code: you pass two arguments (before_at, after_at) but never use their values. If the caller doesn't already know what they are, why pass them in?
So, a free function should probably look like:
def split_pointer_part(line):
    """get tuple (before #, after #)"""
    return line.split('#', 1)
If you want to put this function in your class scope (so it doesn't pollute the top-level namespace, or just because it's a logical grouping), you still don't need to pass self if it isn't used. You can make it a static method:
@staticmethod
def split_pointer_part(line):
    """get tuple (before #, after #)"""
    return line.split('#', 1)
One thing that would be very helpful for you is a good visual debugger. There's a nice free one for Python called Winpdb. There are also excellent debuggers in the commercial products IntelliJ IDEA/PyCharm, Komodo IDE, WingIDE, and Visual Studio (with the Python Tools add-in). Probably a few others too.
I highly recommend setting up one of these debuggers and running your code under it. It will let you step through your code line by line and see what happens with all your variables and objects.
You may find people who tell you that real programmers don't need or shouldn't use debuggers. Don't listen to them: a good debugger is one of the very best tools to help you learn a new language or to get familiar with a piece of code.
I want to test my code that is based on an API created by someone else, but I'm not sure how I should do this.
I have created a function to save the JSON into a file so I don't need to send requests each time I run the tests, but I don't know how to make it work when the original (check) function takes an input argument (problem_report) that is an instance of some class provided by the API, and that instance has a problem_report.get_correction(corr_link) method. I just wonder if this is a sign of badly written code on my part, because I can't write a test for it, or whether I should rewrite this function in my test file as I showed at the end of the code provided below.
# I want to test this function
def check(problem_report):
    corrections = {}
    for corr_link, corr_id in problem_report.links.items():
        if re.findall(pattern='detailCorrection', string=corr_link):
            correction = problem_report.get_correction(corr_link)
            corrections.update({corr_id: correction})
    return corrections

# function to load the json from a file; normally it is downloaded by the API from some page.
def load_pr(pr_id):
    print('loading')
    with open('{}{}_view_pr.json'.format(saved_prs_path, pr_id)) as view_pr:
        view_pr = json.load(view_pr)
        ...
    pr_info = {'view_pr': view_pr, ...}
    return pr_info

# create an instance of class MyPR which takes the json in __init__
@pytest.fixture
def setup_pr():
    print('setup')
    pr = load_pr('123')
    my_pr = MyPR(pr['view_pr'])
    return my_pr

# test function
def test_check(setup_pr):
    pr = setup_pr
    checked_pr = pr.check(setup_rft[1]['problem_report_pr'])
    assert checker_pr

# rewritten check function in the test file
@mock.patch('problem_report.get_correction', side_effect=get_corr)
def test_check(problem_report):
    corrections = {}
    for corr_link, corr_id in problem_report.links.items():
        if re.findall(pattern='detailCorrection', string=corr_link):
            correction = problem_report.get_correction(corr_link)
            corrections.update({corr_id: correction})
    return corrections
I'm not sure if I provided enough code and explanation to understand the problem, but I hope so. I wish you could tell me whether it is normal that some functions are just hard to test, and whether it is good practice to rewrite them separately so I can mock functions inside the tested function. I was also thinking that I could write a new class with similar functionality, but the API is very large and it would be a very long process.
I understand your question as follows: you have a function check that you consider hard to test because of its dependency on problem_report. To make it more testable you have copied the code into the test file. You will test the copied code because you can modify it to be easier to test. And you want to know if this approach makes sense.
The answer is no, this does not make sense. You are not testing the real function, but completely different code. Well, the code may not start out completely different, but in a short time the copy and the original will diverge, and it will be a maintenance nightmare to ensure that the copy always resembles the original. Improving code for testability is a different story: you can make changes to the check function to improve its testability. But then exactly the same resulting function should be used both in the test and in the production code.
How to better test the function check, then? First, are you sure that the original problem_report objects really cannot sensibly be used in your tests? (Here are some criteria that help you decide: What to mock for python test cases?) Now, let's assume that you come to the conclusion you cannot sensibly use the original problem_report.
In that case, the interface here is simple enough to define a mocked problem_report. Keep in mind that Python uses duck typing, so you only have to create a class that has a links member with an items() method. Plus, your mocked problem_report class needs a get_correction() method. Beyond that, your mock does not have to produce types that are similar to the types used by problem_report. The items() method can simply return a list of lists, like [["a",2],["xxxxdetailCorrectionxxxx",4]]. The same argument holds for get_correction, which could for example simply return a value derived from its input, like the negative of the id stored for that link.
For the above example (items() returning [["a",2],["xxxxdetailCorrectionxxxx",4]] and get_correction returning the negative of the id stored for that link) the expected result would be {4: -4}. There is no need to simulate real correction objects. And you can create your mocked versions of problem_report without needing to read data from files - the mocks can be set up completely from within the unit-testing code.
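A minimal sketch of such a hand-rolled fake (the class and test names are made up, and it assumes check is importable from the module under test):

class FakeProblemReport:
    # Only the attributes and methods that check() actually touches.
    def __init__(self):
        self.links = {"a": 2, "xxxxdetailCorrectionxxxx": 4}

    def get_correction(self, corr_link):
        # A trivially derived value: the negative of the id stored for this link.
        return -self.links[corr_link]

def test_check():
    assert check(FakeProblemReport()) == {4: -4}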
Try patching the problem_report symbol in the module. You should put your tests in a separate class.
@mock.patch('some.module.path.problem_report')
def test_check(problem_report):
    problem_report.side_effect = get_corr
    corrections = {}
    for corr_link, corr_id in problem_report.links.items():
        if re.findall(pattern='detailCorrection', string=corr_link):
            correction = problem_report.get_correction(corr_link)
            corrections.update({corr_id: correction})
    return corrections
In Python it is easy to create new functions programmatically. How would I assign such a function to a programmatically determined name in the current scope?
This is what I'd like to do (in non-working code):
obj_types = ('cat', 'dog', 'donkey', 'camel')
for obj_type in obj_types:
    'create_' + obj_type = lambda id: id
In the above example, the assignment of the lambda to a to-be-determined function name obviously does not work. In the real code, the function itself would be created by a function factory.
The background is laziness and don't-repeat-yourself: I've got a dozen or more object types, to each of which I'd assign a generated function. So the code currently looks like:
create_cat = make_creator('cat')
# ...
create_camel = make_creator('camel')
The functions create_cat etc. are hardcoded in a parser.
If I were creating classes as new types programmatically, types.new_class() as seen in the docs would seem to be the solution.
Is my best bet to (mis)use this approach?
One way to accomplish what you are trying to do (but not create functions with dynamic names) is to store the lambdas in a dict, using the name as the key. Instead of calling create_cat() you would call create['cat'](). That would dovetail nicely with not hardcoding names in the parser logic as well.
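A rough sketch of that idea, reusing the make_creator factory from the question (the id argument is just illustrative):

obj_types = ('cat', 'dog', 'donkey', 'camel')

# Build the lookup table once; the type names are the keys.
create = {obj_type: make_creator(obj_type) for obj_type in obj_types}

# The parser then picks the right creator by name:
new_cat = create['cat'](42)     # instead of create_cat(42)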
Vaughn Cato points out that one could simply assign into locals(): locals()[object_type] = factory(object_type). However, the Python docs prohibit this: "Note: The contents of this dictionary should not be modified; changes may not affect the values of local and free variables used by the interpreter"
D. Shawley points out that it would be wiser to use a dict whose entries hold the functions. Access in the parser would then be as simple as create['cat'](). While this is compelling, I do not like the syntactic overhead of the brackets and quotes required.
J.F. Sebastian points to classes. And this is what I ended up with:
# Omitting code of these classes for clarity
class Entity:
    def __init__(self, file_name, line_number):
        # Store location, good for debug, messages, and general indexing
        ...

# The following classes are the real objects to be generated by a parser.
# Their constructors must consume whatever data is provided by the tokens,
# as well as calling super() to forward the file_name, line_number info.
class Cat(Entity): pass
class Camel(Entity): pass

class Parser:
    def parse_file(self, fn):
        # ...

        # Function factory to wrap object constructor calls
        def create_factory(obj_type):
            def creator(text, line_number, token):
                try:
                    return obj_type(*token,
                                    file_name=fn, line_number=line_number)
                except Exception as e:
                    # For debug of constructor during development
                    print(e)
            return creator

        # Helper class, serving as a 'dictionary' of obj construction functions
        class create: pass

        for obj_type in (Cat, Camel):
            setattr(create,
                    obj_type.__name__.lower(),
                    create_factory(obj_type))

        # Parsing code now can use (again simplified for clarity):
        expression = Keyword('cat').setParseAction(create.cat)
This is helper code for deploying a pyparsing parser. D. Shawley is correct in that the dict would actually more easily allow to dynamically generate the parser grammar.
I might need to do multiple reads over a big code-base, with different tools.
I then thought it would be a real waste to read from disk so many times when the text won't change, so I wrote the following.
class Module(object):
    def __init__(self, module_path):
        self.module_path = module_path
        self._text = None
        self._ast = None

    @property
    def text(self):
        if not self._text:
            self._text = open(self.module_path).read()
        return self._text

    @property
    def ast(self):
        s = self.text  # which is actually discarded
        if not self._ast:
            self._ast = parse(self.text)
        return self._ast

class ContentDirectory(object):
    def __init__(self):
        self.content = {}

    def __getitem__(self, module_path):
        if module_path not in self.content:
            self.content[module_path] = Module(module_path)
        return self.content[module_path]
But now comes the problem: I would like to avoid changing the rest of the code while still being able to use this new trick.
The only way I see would be to patch the open builtin function everywhere it might be used, for example:
from myotherlib import __builtins__ as other_builtins
other_builtins.open = my_dummy_open # which uses this cache
But it does not really seem like a wise idea.
Should I just give up, and only try this if the performance really turns out to be too bad?
You can use the mmap module: http://docs.python.org/library/mmap.html
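For example, a minimal read-only mapping might look like this (the file name is just a placeholder):

import mmap

with open("some_module.py", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_line = mm.readline()   # read a single line without loading the whole file
    text = mm[:]                 # or slice out the full contents as bytes
    mm.close()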
Replacing the system open() call is potentially a bad thing. It requires that everything which uses open() uses it as you expect.
Why do you want to avoid changing the code?
Yes, measure the performance and see if it's worthwhile. For example, put in your above code and see how much faster things are. If it's only 1% faster then there's no reason to do anything. If it's significantly faster, then see what's using the open() and change that code if you can.
BTW, something like an LRU cache (part of functools in Python 3.2) would also be helpful for your task.
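For illustration, the decorator form of that idea (Python 3.2+; the function name and path are placeholders) could look like:

from functools import lru_cache

@lru_cache(maxsize=128)
def read_module(module_path):
    # Repeated calls with the same path are served from the cache, not the disk.
    with open(module_path) as f:
        return f.read()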
I'm not sure whether the functionality offered by this library is of any use in your scenario, but I thought I'd mention the existence of the linecache module nevertheless. From the linked docs:
The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.
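Basic usage looks roughly like this (the file name and line number are arbitrary):

import linecache

# The first call reads and caches the whole file; later calls come from memory.
line = linecache.getline('some_module.py', 4)
linecache.clearcache()   # drop the cache if memory becomes a concern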
...of course this doesn't come close to solving your problem in an elegant and transparent way...
Let's say I have a program that has a large number of configuration options. The user can specify them in a config file. My program can parse this config file, but how should it internally store and pass around the options?
In my case, the software is used to perform a scientific simulation. There are about 200 options most of which have sane defaults. Typically the user only has to specify a dozen or so. The difficulty I face is how to design my internal code. Many of the objects that need to be constructed depend on many configuration options. For example an object might need several paths (for where data will be stored), some options that need to be passed to algorithms that the object will call, and some options that are used directly by the object itself.
This leads to objects needing a very large number of arguments to be constructed. Additionally, as my codebase is under very active development, it is a big pain to go through the call stack and pass along a new configuration option all the way down to where it is needed.
One way to prevent that pain is to have a global configuration object that can be freely used anywhere in the code. I don't particularly like this approach as it leads to functions and classes that don't take any (or only one) argument and it isn't obvious to the reader what data the function/class deals with. It also prevents code reuse as all of the code depends on a giant config object.
Can anyone give me some advice about how a program like this should be structured?
Here is an example of what I mean for the configuration option passing style:
class A:
    def __init__(self, opt_a, opt_b, ..., opt_z):
        self.opt_a = opt_a
        self.opt_b = opt_b
        ...
        self.opt_z = opt_z

    def foo(self, arg):
        algo(arg, self.opt_a, self.opt_e)
Here is an example of the global config style:
class A:
    def __init__(self, config):
        self.config = config

    def foo(self, arg):
        algo(arg, self.config)
The examples are in Python but my question stands for any similar programming language.
matplotlib is a large package with many configuration options. It uses an rcParams module to manage all the default parameters. rcParams saves all the default parameters in a dict.
Every function then gets its options from keyword arguments, for example:
def f(x, y, opt_a=None, opt_b=None):
    if opt_a is None: opt_a = rcParams['group1.opt_a']
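Spelled out a little more fully, the same pattern might look like this (a sketch, not matplotlib's actual code; the option names are invented):

rcParams = {
    'group1.opt_a': 1.0,
    'group1.opt_b': 'fast',
}

def f(x, y, opt_a=None, opt_b=None):
    # Fall back to the shared defaults only when the caller did not pass a value.
    if opt_a is None:
        opt_a = rcParams['group1.opt_a']
    if opt_b is None:
        opt_b = rcParams['group1.opt_b']
    return (x + y) * opt_a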
A few design patterns will help
Prototype
Factory and Abstract Factory
Use these two patterns with configuration objects. Each method will then take a configuration object and use what it needs. Also consider applying a logical grouping to config parameters and think about ways to reduce the number of inputs.
pseudo code
// Consider we can run three different kinds of Simulations. sim1, sim2, sim3
ConfigFactory configFactory = new ConfigFactory("/path/to/option/file");
....
Simulation1 sim1;
Simulation2 sim2;
Simulation3 sim3;
sim1.run( configFactory.ConfigForSim1() );
sim2.run( configFactory.ConfigForSim2() );
sim3.run( configFactory.ConfigForSim3() );
Inside of each factory method it might create a configuration from a prototype object (that has all of the "sane" defaults) and the option file becomes just the things that are different from default. This would be paired with clear documentation on what these defaults are and when a person (or other program) might want to change them.
Edit:
Also consider that each config returned by the factory is a subset of the overall config.
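A rough Python sketch of that arrangement (all names and options are invented for illustration):

import copy

class ConfigFactory:
    # Prototype holding every sane default; the option file only lists overrides.
    _defaults = {'output_dir': '/tmp', 'tolerance': 1e-6, 'max_iter': 100}

    def __init__(self, overrides):
        # 'overrides' would come from parsing the option file.
        self._overrides = overrides

    def _build(self, keys):
        config = {k: copy.deepcopy(self._defaults[k]) for k in keys}
        config.update((k, v) for k, v in self._overrides.items() if k in keys)
        return config

    def config_for_sim1(self):
        # Each simulation only ever sees the subset of options it needs.
        return self._build(['output_dir', 'tolerance'])

    def config_for_sim2(self):
        return self._build(['output_dir', 'max_iter'])

factory = ConfigFactory({'tolerance': 1e-9})
sim1_config = factory.config_for_sim1()   # {'output_dir': '/tmp', 'tolerance': 1e-09}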
Pass around either the config parsing class, or write a class that wraps it and intelligently pulls out the requested options.
Python's standard library configparser exposes the sections and options of an INI style configuration file using the mapping protocol, and so you can retrieve your options directly from that as though it were a dictionary.
myconf = configparser.ConfigParser()
myconf.read('myconf.ini')
what_to_do = myconf['section']['option']
If you explicitly want to provide the options using the attribute notation, create a class that overrides __getattr__:
class MyConf:
    def __init__(self, path):
        self._parser = configparser.ConfigParser()
        self._parser.read(path)

    def __getattr__(self, option):
        return self._parser[{'what_to_do': 'section'}[option]][option]

myconf = MyConf('myconf.ini')
what_to_do = myconf.what_to_do
Have a module load the params into its namespace, then import it and use it wherever you want.
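For example (a sketch; the module, section, and option names are placeholders):

# settings.py -- parsed once, at first import
import configparser

_parser = configparser.ConfigParser()
_parser.read('myconf.ini')

# Promote the options to module-level names.
output_dir = _parser.get('paths', 'output_dir', fallback='/tmp')
tolerance = _parser.getfloat('solver', 'tolerance', fallback=1e-6)

# Any other module then simply does:
#   import settings
#   run_solver(tol=settings.tolerance)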
Also see related question here
I have a python function that has a deterministic result. It takes a long time to run and generates a large output:
def time_consuming_function():
# lots_of_computing_time to come up with the_result
return the_result
I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.
Is this possible or ridiculous?
Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.
If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.
import inspect
import functools
import json

def memoize_zeroadic_function_to_disk(memo_filename):
    def decorator(f):
        try:
            with open(memo_filename, 'r') as fp:
                cache = json.load(fp)
        except IOError:
            # file doesn't exist yet
            cache = {}

        source = inspect.getsource(f)

        @functools.wraps(f)
        def wrapper():
            if source not in cache:
                cache[source] = f()
                with open(memo_filename, 'w') as fp:
                    json.dump(cache, fp)
            return cache[source]
        return wrapper
    return decorator

@memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:
import os

def time_consuming_method():
    # your existing method here
    ...

# Is the cached data older than this file?
if (not os.path.exists(data_file_name)
        or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
    data = time_consuming_method()
    save_data(data_file_name, data)
else:
    data = load_data(data_file_name)

# redefine method
def time_consuming_method():
    return data
While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.
The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some Python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your check-in routine that deletes your serialized table? Or, if this is not research data and is in production, make it part of your release process: if the revision number of your file (put this function in its own file) has changed, the release script deletes the serialized lookup table.
So, here is a really neat trick using decorators:
def memoize(f):
    cache = {}
    def result(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    return result
With the above, you can then use:
@memoize
def myfunc(x, y, z):
    # Some really long running computation
When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "@memoize" again, or explicitly write:
myfunc = memoize(new_definition_for_myfunc)
Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:
import os
import os.path
import cPickle

class MemoizedFunction(object):
    def __init__(self, f):
        self.function = f
        self.filename = str(hash(f)) + ".cache"
        self.cache = {}
        if os.path.exists(self.filename):
            with open(self.filename, 'rb') as file:
                self.cache = cPickle.load(file)

    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.function(*args)
        return self.cache[args]

    def __del__(self):
        with open(self.filename, 'wb') as file:
            cPickle.dump(self.cache, file, cPickle.HIGHEST_PROTOCOL)

def memoize(f):
    return MemoizedFunction(f)
What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.
An (overly simplified) example:
def memoized(f):
    cache = {}
    def memo(*args):
        if args in cache:
            return cache[args]
        else:
            ret = f(*args)
            cache[args] = ret
            return ret
    return memo

@memoized
def time_consuming_method():
    # lots_of_computing_time to come up with the_result
    return the_result
Edit:
From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some sort of persistent storage for the cache (e.g. something as simple as Pickle or a plain text file, or maybe a full-blown database, or anything in between). The choice of method depends on what the OP needs. Several other answers already give some solutions, so I'm not going to repeat that here.