Python define variable as "load file at first use" - python

Python beginner here. I currently have some code that looks like
a=some_file_reading_function('filea')
b=some_file_reading_function('fileb')
# ...
if some_condition(a):
do_complicated_stuff(b)
else:
# nothing that involves b
What itches me is that the load of 'fileb' may not be necessary, and it has some performance penalty. Ideally, I would load it only if b is actually required later on. OTOH, b might be used multiple times, so if it is used once, it should load the file once for all. I do not know how to achieve that.
In the above pseudocode, one could trivially bring the loading of 'fileb' inside the conditional loop, but in reality there are more than two files and the conditional branching is quite complex. Also the code is still under heavy development and the conditional branching may change.
I looked a bit at either iterators or defining a class, but (probably due to my inexperience) could not make either work. The key problem I met was to load the file zero times if unneeded, and only once if needed. I found nothing on searching, because "how to load a file by chunks" pollutes the results for "file lazy loading" and similar queries.
If needed: Python 3.5 on Win7, and some_file_reading_function returns 1D- numpy.ndarray 's.

class LazyFile():
def __init__(self, file):
self.file = file
self._data = None
#property # so you can use .data instead of .data()
def data(self):
if self._data is None: # if not loaded
self._data = some_file_reading_function(self.file) #load it
return self._data
a = LazyFile('filea')
b = LazyFile('fileb')
if some_condition(a.data):
do_complicated_stuff(b.data)
else:
# other stuff

Actually, just found a workaround with classes. Try/except inspired by How to know if an object has an attribute in Python. A bit ugly, but does the job:
class Filecontents:
def __init__(self,filepath):
self.fp = filepath
def eval(self):
try:
self.val
except AttributeError:
self.val = some_file_reading_function(self.fp)
print(self.txt)
finally:
return self.val
def some_file_reading_function(fp):
# For demonstration purposes: say that you are loading something
print('Loading '+fp)
# Return value
return 0
a=Filecontents('somefile')
print('Not used yet')
print('Use #1: value={}'.format(a.eval()))
print('Use #2: value={}'.format(a.eval()))
Not sure that is the "best" (prettiest, most Pythonic) solution though.

Related

searching python object hierarchy

I'm using an unstable python library that's undergoing changes to the API in its various git submodules. Every few weeks, some member or method's location will get changed or renamed. For example, the expression
vk.rinkront[0].asyncamera.printout()
that worked a few weeks back, now has the asyncamera member located in
vk.toplink.enabledata.tx[0].asyncamera[0].print()
I manage to figure out the method's new location by greping, git diffing, ipython autocomplete, and generally bumbling my way around. This process is painful, because the repository has been quite abstracted and the member names observable from the python shell do not necessarily appear in the git submodule code.
Is there a python routine that performs a graph traversal of the object hierarchy while checking for keywords (e.g. print ) ?
(if no, I'll hack some bfs/dfs that checks the child node type before pushing/appending the __dict__, array, dir(). etc... contents onto a queue/stack. But some one must have come across this problem before and come up with a more elegant solution.)
-------EDITS----------
to sum up, is there an existing library such that the following code would work?
import unstable_library
import object_inspecter.search
vk = unstable_library.create() #initializing vk
object_inspecter.search(vk,"print") #searching for lost method
Oy, I feel for you... You probably should just lock down on a version, use it until the API becomes stable, and then upgrade later. That's why complex projects focus on dependency version control.
There is a way, but it's not really standard. The dir() built-in function returns a list of string names of all attributes and methods of a class. With a little bit a wrangling, you could write a script that recursively digs down.
The ultimate issue, though, is that you're going to run into infinite recursive loops when you try to investigate class members. You'll need to add in smarts to either recognize the pattern or limit the depth of your searching.
you can use fairly simple graph traversal to do this
def find_thing(ob,pattern,seen=set(),name=None,curDepth=0,maxDepth=99,isDict=True):
if(ob is None):
return []
if id(ob) in seen:
return []
seen.add(id(ob))
name = str(name or getattr(ob,"__name__",str(ob)))
base_case = check_base_cases(ob,name,pattern)
if base_case is not None:
return base_case
if(curDepth>=maxDepth):
return []
return recursive_step(ob,pattern,name=name,curDepth=curDepth,isDict=isDict)
now just define your two steps (base_case and recursive_step)... something like
def check_base_cases(ob,name,pattern):
if isinstance(ob,str):
if re.match(pattern,ob):
return [ob]
else:
return []
if isinstance(ob,(int,float,long)):
if ob == pattern or str(ob) == pattern:
return [ob]
else:
return []
if isinstance(ob,types.FunctionType):
if re.match(pattern,name):
return [name]
else:
return []
def recursive_step(ob,pattern,name,curDepth,isDict=True):
matches = []
if isinstance(ob,(list,tuple)):
for i,sub_ob in enumerate(ob):
matches.extend(find_thing(sub_ob,pattern,name='%s[%s]'%(name,i),curDepth=curDepth+1))
return matches
if isinstance(ob,dict):
for key,item in ob.items():
if re.match(pattern,str(key)):
matches.append('%s.%s'%(name,key) if not isDict else '%s["%s"]'%(name,key))
else:
matches.extend(find_thing(item,pattern,
name='%s["%s"]'%(name,key) if isDict else '%s.%s'%(name,key),
curDepth=curDepth+1))
return matches
else:
data = dict([(x,getattr(ob,x)) for x in dir(ob) if not x.startswith("_")])
return find_thing(data,pattern,name=name,curDepth=curDepth+1,isDict=False)
finally you can test it like so
print(find_thing(vk,".*print.*"))
I used the following Example
class vk(object):
class enabledata:
class camera:
class object2:
def printX(*args):
pass
asynccamera = [object2]
tx = [camera]
toplink = {'enabledata':enabledata}
running the following
print( find_thing(vk,'.*print.*') )
#['vk.toplink["enabledata"].camera.asynccamera[0].printX']

Running multiple functions in Python

I had a program that read in a text file and took out the necessary variables for serialization into turtle format and storing in an RDF graph. The code I had was crude and I was advised to separate it into functions. As I am new to Python, I had no idea how to do this. Below is some of the functions of the program.
I am getting confused as to when parameters should be passed into the functions and when they should be initialized with self. Here are some of my functions. If I could get an explanation as to what I am doing wrong that would be great.
#!/usr/bin/env python
from rdflib import URIRef, Graph
from StringIO import StringIO
import subprocess as sub
class Wordnet():
def __init__(self, graph):
self.graph = Graph()
def process_file(self, file):
file = open("new_2.txt", "r")
return file
def line_for_loop(self, file):
for line in file:
self.split_pointer_part()
self.split_word_part()
self.split_gloss_part()
self.process_lex_filenum()
self.process_synset_offset()
+more functions............
self.print_graph()
def split_pointer_part(self, before_at, after_at, line):
before_at, after_at = line.split('#', 1)
return before_at, after_at
def get_num_words(self, word_part, num_words):
""" 1 as default, may want 0 as an invalid case """
""" do if else statements on l3 variable """
if word_part[3] == '0a':
num_words = 10
else:
num_words = int(word_part[3])
return num_words
def get_pointers_list(self, pointers, after_at, num_pointers, pointerList):
pointers = after_at.split()[0:0 +4 * num_pointers:4]
pointerList = iter(pointers)
return pointerList
............code to create triples for graph...............
def print_graph(self):
print graph.serialize(format='nt')
def main():
wordnet = Wordnet()
my_file = wordnet.process_file()
wordnet.line_for_loop(my_file)
if __name__ == "__main__":
main()
You question is mainly a question about what object oriented programming is. I will try to explain quickly, but I recommend reading a proper tutorial on it like
http://www.voidspace.org.uk/python/articles/OOP.shtml
http://net.tutsplus.com/tutorials/python-tutorials/python-from-scratch-object-oriented-programming/
and/or http://www.tutorialspoint.com/python/python_classes_objects.htm
When you create a class and instantiate it (with mywordnet=WordNet(somegraph)), you can resue the mywordnet instance many times. Each variable you set on self. in WordNet, is stored in that instance. So for instance self.graph is always available if you call any method of mywordnet. If you wouldn't store it in self.graph, you would need to specify it as a parameter in each method (function) that requires it. Which would be tedious if all of these method calls require the same graph anyway.
So to look at it another way: everything you set with self. can be seen as a sort of configuration for that specific instance of Wordnet. It influences the Wordnet behaviour. You could for instance have two Wordnet instances, each instantiated with a different graph, but all other functionality the same. That way you can choose which graph to print to, depending on which Wordnet instance you use, but everything else stays the same.
I hope this helps you out a little.
First, I suggest you figure out the basic functional decomposition on its own - don't worry about writing a class at all.
For example,
def split_pointer_part(self, before_at, after_at, line):
before_at, after_at = line.split('#', 1)
return before_at, after_at
doesn't touch any instance variables (it never refers to self), so it can just be a standalone function.
It also exhibits a peculiarity I see in your other code: you pass two arguments (before_at, after_at) but never use their values. If the caller doesn't already know what they are, why pass them in?
So, a free function should probably look like:
def split_pointer_part(line):
"""get tuple (before #, after #)"""
return line.split('#', 1)
If you want to put this function in your class scope (so it doesn't pollute the top-level namespace, or just because it's a logical grouping), you still don't need to pass self if it isn't used. You can make it a static method:
#staticmethod
def split_pointer_part(line):
"""get tuple (before #, after #)"""
return line.split('#', 1)
One thing that would be very helpful for you is a good visual debugger. There's a nice free one for Python called Winpdb. There are also excellent debuggers in the commercial products IntelliJ IDEA/PyCharm, Komodo IDE, WingIDE, and Visual Studio (with the Python Tools add-in). Probably a few others too.
I highly recommend setting up one of these debuggers and running your code under it. It will let you step through your code line by line and see what happens with all your variables and objects.
You may find people who tell you that real programmers don't need or shouldn't use debuggers. Don't listen to them: a good debugger is one of the very best tools to help you learn a new language or to get familiar with a piece of code.

Caching disk operations

I might need to do multiple reads over a big code-base, and with different tools.
I then thought that is a real waste to read on disk so many times while the text won't change, so I wrote the following.
class Module(object):
def __init__(self, module_path):
self.module_path = module_path
self._text = None
self._ast = None
#property
def text(self):
if not self._text:
self._text = open(self.module_path).read()
return self._text
#property
def ast(self):
s = self.text # which is actually discarded
if not self._ast:
self._ast = parse(self.text)
return self._ast
class ContentDirectory(object):
def __init__(self):
self.content = {}
def __getitem__(self, module_path):
if module_path not in self.content:
self.content[module_path] = Module(module_path)
return self.content[module_path]
But now it comes the problem, because I would like to avoid changing the rest of the code, while being able to use this new trick.
The only way I see would be to patch the "open" builtin function everywhere it might be used, for example.
from myotherlib import __builtins__ as other_builtins
other_builtins.open = my_dummy_open # which uses this cache
But it does not really seem like a wise idea.
Should I just give up and only try if the performance is really too bad maybe?
you can use mmap module: http://docs.python.org/library/mmap.html
Replacing the system open() call is potentially a bad thing. It requires that everything which uses open() uses it as you expect.
Why do you want to avoid changing the code?
Yes, measure the performance and see if it's worthwhile. For example, put in your above code and see how much faster things are. If it's only 1% faster then there's no reason to do anything. If it's significantly faster, then see what's using the open() and change that code if you can.
BTW, something like an LRU cache (part of functools in Python 3.2) would also be helpful for your task.
I'm not sure if the functionality offered by this library is of any use in your scenario, but thought to mention nevertheless the existence of the linecache library. From the linked docs:
The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.
...of course this doesn't come even close to your problem of implementing a solution in an elegant and transparent way...

Is this an acceptable pythonic idiom?

I have a class that assists in importing a special type of file, and a 'factory' class that allows me to do these in batch. The factory class uses a generator so the client can iterate through the importers.
My question is, did I use the iterator correctly? Is this an acceptable idiom? I've just started using Python.
class FileParser:
""" uses an open filehandle to do stuff """
class BatchImporter:
def __init__(self, files):
self.files=files
def parsers(self):
for file in self.files:
try:
fh = open(file, "rb")
parser = FileParser(fh)
yield parser
finally:
fh.close()
def verifyfiles(
def cleanup(
---
importer = BatchImporter(filelist)
for p in BatchImporter.parsers():
p.method1()
...
You could make one thing a little simpler: Instead of try...finally, use a with block:
with open(file, "rb") as fh:
yield FileParser(fh)
This will close the file for you automatically as soon as the with block is left.
It's absolutely fine to have a method that's a generator, as you do. I would recommend making all your classes new-style (if you're on Python 2, either set __metaclass__ = type at the start of your module, or add (object) to all your base-less class statements), because legacy classes are "evil";-); and, for clarity and conciseness, I would also recomment coding the generator differently...:
def parsers(self):
for afile in self.files:
with open(afile, "rb") as fh:
yield FileParser(fh)
but neither of these bits of advice condemns in any way the use of generator methods!-)
Note the use of afile in lieu of file: the latter is a built-in identifier, and as a general rule it's better to get used to not "hide" built-in identifiers with your own (it doesn't bite you here, but it will in many nasty ways in the future unless you get into the right habit!-).
The design is fine if you ask me, though using finally the way you use it isn't exactly idiomatic. Use catch and maybe re-raise the exception (using the raise keyword alone, otherwise you mess the stacktrace up), and for bonus points, don't catch: but catch Exception: (otherwise, you catch SystemExit and KeyboardInterrupt).
Or simply use the with-statement as shown by Tim Pietzcker.
In general, it isn't safe to close the file after you yield a parser object that will try to read it. Consider this code:
parsers = list(BatchImporter.parsers())
for p in parsers:
# the file object that p holds will already be closed!
If you're not writing a long-running daemon process, most of the time you don't need to worry about closing files -- they will all get closed when your program exits, or when the file objects are garbage-collected. (And if you use CPython, that will happen as soon as all references to them are lost, since CPython uses reference counting.)
Nevertheless, taking care to free resources is a good habit to acquire, so I would probably write the FileParser class this way:
class FileParser:
def __init__(self, file_or_filename, closing=False):
if hasattr(file_or_filename, 'read'):
self.f = file_or_filename
self._need_to_close = closing
else:
self.f = open(file_or_filename, 'rb')
self._need_to_close = True
def close(self):
if self._need_to_close:
self.f.close()
self._need_to_close = False
and then BatchImporter.parsers would become
def parsers(self):
for file in self.files:
yield FileParser(file)
or, if you love functional programming
def parsers(self):
return itertools.imap(FileParser, self.files)
An aside: if you're new to Python, I recommend you take a look at the Python style guide (also known as PEP 8). Two-space indents look weird.

Hashing a python function to regenerate output when the function is modified

I have a python function that has a deterministic result. It takes a long time to run and generates a large output:
def time_consuming_function():
# lots_of_computing_time to come up with the_result
return the_result
I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.
Is this possible or ridiculous?
Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.
If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.
import inspect
import functools
import json
def memoize_zeroadic_function_to_disk(memo_filename):
def decorator(f):
try:
with open(memo_filename, 'r') as fp:
cache = json.load(fp)
except IOError:
# file doesn't exist yet
cache = {}
source = inspect.getsource(f)
#functools.wraps(f)
def wrapper():
if source not in cache:
cache[source] = f()
with open(memo_filename, 'w') as fp:
json.dump(cache, fp)
return cache[source]
return wrapper
return decorator
#memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
# lots_of_computing_time to come up with the_result
return the_result
Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:
def time_consuming_method():
# your existing method here
# Is the cached data older than this file?
if (not os.path.exists(data_file_name)
or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
data = time_consuming_method()
save_data(data_file_name, data)
else:
data = load_data(data_file_name)
# redefine method
def time_consuming_method():
return data
While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.
The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your checkin routine that deletes your serialized table? Or if this is not research data and is in production, make it part of your release process that if the revision number of your file (put this function in it's own file) has changed, your release script deletes the serialzed lookup table.
So, here is a really neat trick using decorators:
def memoize(f):
cache={};
def result(*args):
if args not in cache:
cache[args]=f(*args);
return cache[args];
return result;
With the above, you can then use:
#memoize
def myfunc(x,y,z):
# Some really long running computation
When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "#memoize" again, or explicitly write:
myfunc = memoize(new_definition_for_myfunc);
Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:
import os;
import os.path;
import cPickle;
class MemoizedFunction(object):
def __init__(self,f):
self.function=f;
self.filename=str(hash(f))+".cache";
self.cache={};
if os.path.exists(self.filename):
with open(filename,'rb') as file:
self.cache=cPickle.load(file);
def __call__(self,*args):
if args not in self.cache:
self.cache[args]=self.function(*args);
return self.cache[args];
def __del__(self):
with open(self.filename,'wb') as file:
cPickle.dump(self.cache,file,cPickle.HIGHEST_PROTOCOL);
def memoize(f):
return MemoizedFunction(f);
What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.
A (overly simplified) example:
def memoized(f):
cache={}
def memo(*args):
if args in cache:
return cache[args]
else:
ret=f(*args)
cache[args]=ret
return ret
return memo
#memoized
def time_consuming_method():
# lots_of_computing_time to come up with the_result
return the_result
Edit:
From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some of of persistent storage for the cache (e.g. something as simple as using Pickle or a simple text file, or maybe using a full blown database, or anything in between). The choice of which method to use depends on what the OP needs. Several other answers already give some solutions to this, so I'm not going to repeat that here.

Categories

Resources