Caching disk operations in Python

I might need to do multiple reads over a big code base, with different tools.
It then seemed like a real waste to hit the disk so many times while the text won't change, so I wrote the following.
from ast import parse  # assuming the standard-library AST parser is what's meant

class Module(object):
    def __init__(self, module_path):
        self.module_path = module_path
        self._text = None
        self._ast = None

    @property
    def text(self):
        if not self._text:
            self._text = open(self.module_path).read()
        return self._text

    @property
    def ast(self):
        s = self.text  # which is actually discarded
        if not self._ast:
            self._ast = parse(self.text)
        return self._ast

class ContentDirectory(object):
    def __init__(self):
        self.content = {}

    def __getitem__(self, module_path):
        if module_path not in self.content:
            self.content[module_path] = Module(module_path)
        return self.content[module_path]
But now comes the problem: I would like to avoid changing the rest of the code while still being able to use this new trick.
The only way I see would be to patch the "open" builtin function everywhere it might be used, for example:
from myotherlib import __builtins__ as other_builtins
other_builtins.open = my_dummy_open # which uses this cache
But that does not really seem like a wise idea.
Should I just give up, and only try this if the performance really turns out to be too bad?

You can use the mmap module: http://docs.python.org/library/mmap.html
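For illustration, a minimal sketch of what that could look like (the path is a placeholder, and this assumes you just want the whole file as text):

import mmap

# Hypothetical example: map the file into memory once, so repeated scans
# over its contents avoid extra read() calls. "some_module.py" is a
# placeholder path.
with open("some_module.py", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        text = mm[:].decode("utf-8")  # the whole file as one string
    finally:
        mm.close()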

Replacing the system open() call is potentially a bad thing. It requires that everything which uses open() uses it as you expect.
Why do you want to avoid changing the code?
Yes, measure the performance and see if it's worthwhile. For example, plug in your code above and see how much faster things get. If it's only 1% faster, there's no reason to do anything. If it's significantly faster, then see what's using open() and change that code if you can.
BTW, something like an LRU cache (part of functools in Python 3.2) would also be helpful for your task.
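For illustration, a minimal sketch of that idea (read_module is a hypothetical helper, not part of the question's code; requires Python 3.2+):

import functools

# Hypothetical sketch: cache file contents keyed by path, so repeated reads
# of the same module hit memory instead of disk.
@functools.lru_cache(maxsize=None)
def read_module(module_path):
    with open(module_path) as f:
        return f.read()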

I'm not sure whether the functionality offered by this library is of any use in your scenario, but I thought I'd mention the existence of the linecache module nevertheless. From the linked docs:
The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.
...of course this doesn't even come close to solving your problem of implementing the caching in an elegant and transparent way...
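For reference, a minimal usage sketch ("some_module.py" is a placeholder path):

import linecache

# linecache keeps an internal per-file cache, so repeated lookups in the
# same file do not re-read it from disk.
first_line = linecache.getline("some_module.py", 1)
print(first_line)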

Related

Python define variable as "load file at first use"

Python beginner here. I currently have some code that looks like
a = some_file_reading_function('filea')
b = some_file_reading_function('fileb')
# ...
if some_condition(a):
    do_complicated_stuff(b)
else:
    pass  # nothing that involves b
What bothers me is that loading 'fileb' may not be necessary, and it carries some performance penalty. Ideally, I would load it only if b is actually required later on. On the other hand, b might be used multiple times, so if it is used at all, the file should be loaded once and for all. I do not know how to achieve that.
In the above pseudocode, one could trivially bring the loading of 'fileb' inside the conditional branch, but in reality there are more than two files and the conditional branching is quite complex. Also, the code is still under heavy development and the conditional branching may change.
I looked a bit at iterators and at defining a class, but (probably due to my inexperience) could not make either work. The key problem I ran into was loading the file zero times if it is unneeded, and only once if it is needed. I found nothing on searching, because "how to load a file by chunks" pollutes the results for "file lazy loading" and similar queries.
If it matters: Python 3.5 on Win7, and some_file_reading_function returns 1-D numpy.ndarray objects.
class LazyFile():
    def __init__(self, file):
        self.file = file
        self._data = None

    @property  # so you can use .data instead of .data()
    def data(self):
        if self._data is None:  # if not loaded
            self._data = some_file_reading_function(self.file)  # load it
        return self._data

a = LazyFile('filea')
b = LazyFile('fileb')
if some_condition(a.data):
    do_complicated_stuff(b.data)
else:
    pass  # other stuff
Actually, I just found a workaround with classes. The try/except is inspired by How to know if an object has an attribute in Python. A bit ugly, but it does the job:
class Filecontents:
    def __init__(self, filepath):
        self.fp = filepath

    def eval(self):
        try:
            self.val
        except AttributeError:
            self.val = some_file_reading_function(self.fp)
        finally:
            return self.val

def some_file_reading_function(fp):
    # For demonstration purposes: say that you are loading something
    print('Loading ' + fp)
    # Return value
    return 0

a = Filecontents('somefile')
print('Not used yet')
print('Use #1: value={}'.format(a.eval()))
print('Use #2: value={}'.format(a.eval()))
Not sure that is the "best" (prettiest, most Pythonic) solution though.

Running multiple functions in Python

I had a program that read in a text file and took out the necessary variables for serialization into Turtle format and storage in an RDF graph. The code I had was crude and I was advised to separate it into functions. As I am new to Python, I had no idea how to do this. Below are some of the functions of the program.
I am getting confused as to when parameters should be passed into the functions and when they should be initialized with self. Here are some of my functions. If I could get an explanation as to what I am doing wrong, that would be great.
#!/usr/bin/env python
from rdflib import URIRef, Graph
from StringIO import StringIO
import subprocess as sub

class Wordnet():
    def __init__(self, graph):
        self.graph = Graph()

    def process_file(self, file):
        file = open("new_2.txt", "r")
        return file

    def line_for_loop(self, file):
        for line in file:
            self.split_pointer_part()
            self.split_word_part()
            self.split_gloss_part()
            self.process_lex_filenum()
            self.process_synset_offset()
            # +more functions............
        self.print_graph()

    def split_pointer_part(self, before_at, after_at, line):
        before_at, after_at = line.split('#', 1)
        return before_at, after_at

    def get_num_words(self, word_part, num_words):
        """ 1 as default, may want 0 as an invalid case """
        """ do if else statements on l3 variable """
        if word_part[3] == '0a':
            num_words = 10
        else:
            num_words = int(word_part[3])
        return num_words

    def get_pointers_list(self, pointers, after_at, num_pointers, pointerList):
        pointers = after_at.split()[0:0 + 4 * num_pointers:4]
        pointerList = iter(pointers)
        return pointerList

    # ............code to create triples for graph...............

    def print_graph(self):
        print graph.serialize(format='nt')

def main():
    wordnet = Wordnet()
    my_file = wordnet.process_file()
    wordnet.line_for_loop(my_file)

if __name__ == "__main__":
    main()
Your question is mainly a question about what object-oriented programming is. I will try to explain quickly, but I recommend reading a proper tutorial on it, like:
http://www.voidspace.org.uk/python/articles/OOP.shtml
http://net.tutsplus.com/tutorials/python-tutorials/python-from-scratch-object-oriented-programming/
and/or http://www.tutorialspoint.com/python/python_classes_objects.htm
When you create a class and instantiate it (with mywordnet = Wordnet(somegraph)), you can reuse the mywordnet instance many times. Each variable you set on self in Wordnet is stored in that instance. So, for instance, self.graph is always available if you call any method of mywordnet. If you didn't store it in self.graph, you would need to specify it as a parameter in each method (function) that requires it, which would be tedious if all of those method calls require the same graph anyway.
So to look at it another way: everything you set with self. can be seen as a sort of configuration for that specific instance of Wordnet. It influences the Wordnet behaviour. You could for instance have two Wordnet instances, each instantiated with a different graph, but all other functionality the same. That way you can choose which graph to print to, depending on which Wordnet instance you use, but everything else stays the same.
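A minimal sketch of that idea (not the asker's code; it assumes __init__ actually stores the graph argument instead of discarding it):

from rdflib import Graph

class Wordnet(object):
    def __init__(self, graph):
        self.graph = graph  # stored once, reused by every method

    def serialize_graph(self):
        return self.graph.serialize(format='nt')

wordnet_a = Wordnet(Graph())  # two instances, each with its own graph
wordnet_b = Wordnet(Graph())
wordnet_a.serialize_graph()   # operates on wordnet_a's graph only
wordnet_b.serialize_graph()   # operates on wordnet_b's graph only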
I hope this helps you out a little.
First, I suggest you figure out the basic functional decomposition on its own - don't worry about writing a class at all.
For example,
def split_pointer_part(self, before_at, after_at, line):
    before_at, after_at = line.split('#', 1)
    return before_at, after_at
doesn't touch any instance variables (it never refers to self), so it can just be a standalone function.
It also exhibits a peculiarity I see in your other code: you pass two arguments (before_at, after_at) but never use their values. If the caller doesn't already know what they are, why pass them in?
So, a free function should probably look like:
def split_pointer_part(line):
    """get tuple (before #, after #)"""
    return line.split('#', 1)
If you want to put this function in your class scope (so it doesn't pollute the top-level namespace, or just because it's a logical grouping), you still don't need to pass self if it isn't used. You can make it a static method:
@staticmethod
def split_pointer_part(line):
    """get tuple (before #, after #)"""
    return line.split('#', 1)
One thing that would be very helpful for you is a good visual debugger. There's a nice free one for Python called Winpdb. There are also excellent debuggers in the commercial products IntelliJ IDEA/PyCharm, Komodo IDE, WingIDE, and Visual Studio (with the Python Tools add-in). Probably a few others too.
I highly recommend setting up one of these debuggers and running your code under it. It will let you step through your code line by line and see what happens with all your variables and objects.
You may find people who tell you that real programmers don't need or shouldn't use debuggers. Don't listen to them: a good debugger is one of the very best tools to help you learn a new language or to get familiar with a piece of code.

Is this an acceptable pythonic idiom?

I have a class that assists in importing a special type of file, and a 'factory' class that allows me to do these in batch. The factory class uses a generator so the client can iterate through the importers.
My question is, did I use the iterator correctly? Is this an acceptable idiom? I've just started using Python.
class FileParser:
    """ uses an open filehandle to do stuff """

class BatchImporter:
    def __init__(self, files):
        self.files = files

    def parsers(self):
        for file in self.files:
            try:
                fh = open(file, "rb")
                parser = FileParser(fh)
                yield parser
            finally:
                fh.close()

    def verifyfiles(
    def cleanup(

---

importer = BatchImporter(filelist)
for p in importer.parsers():
    p.method1()
    ...
You could make one thing a little simpler: Instead of try...finally, use a with block:
with open(file, "rb") as fh:
    yield FileParser(fh)
This will close the file for you automatically as soon as the with block is left.
It's absolutely fine to have a method that's a generator, as you do. I would recommend making all your classes new-style (if you're on Python 2, either set __metaclass__ = type at the start of your module, or add (object) to all your base-less class statements), because legacy classes are "evil";-). For clarity and conciseness, I would also recommend coding the generator differently...:
def parsers(self):
    for afile in self.files:
        with open(afile, "rb") as fh:
            yield FileParser(fh)
but neither of these bits of advice condemns in any way the use of generator methods!-)
Note the use of afile in lieu of file: the latter is a built-in identifier, and as a general rule it's better to get used to not "hide" built-in identifiers with your own (it doesn't bite you here, but it will in many nasty ways in the future unless you get into the right habit!-).
The design is fine if you ask me, though using finally the way you use it isn't exactly idiomatic. Use except and maybe re-raise the exception (using the raise keyword alone, otherwise you mess up the stack trace), and for bonus points, don't use a bare except: but except Exception: (otherwise you also catch SystemExit and KeyboardInterrupt).
Or simply use the with-statement as shown by Tim Pietzcker.
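For illustration, a minimal sketch of the except-and-re-raise pattern described above (not the asker's code; FileParser is the class from the question):

def parse_one(path):
    fh = open(path, "rb")
    try:
        return FileParser(fh)
    except Exception:    # not a bare except: let SystemExit etc. through
        fh.close()       # clean up only on failure; the parser owns fh otherwise
        raise            # bare raise preserves the original traceback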
In general, it isn't safe to close the file after you yield a parser object that will try to read it. Consider this code:
parsers = list(importer.parsers())
for p in parsers:
    # the file object that p holds will already be closed!
    ...
If you're not writing a long-running daemon process, most of the time you don't need to worry about closing files -- they will all get closed when your program exits, or when the file objects are garbage-collected. (And if you use CPython, that will happen as soon as all references to them are lost, since CPython uses reference counting.)
Nevertheless, taking care to free resources is a good habit to acquire, so I would probably write the FileParser class this way:
class FileParser:
    def __init__(self, file_or_filename, closing=False):
        if hasattr(file_or_filename, 'read'):
            self.f = file_or_filename
            self._need_to_close = closing
        else:
            self.f = open(file_or_filename, 'rb')
            self._need_to_close = True

    def close(self):
        if self._need_to_close:
            self.f.close()
            self._need_to_close = False
and then BatchImporter.parsers would become
def parsers(self):
    for file in self.files:
        yield FileParser(file)
or, if you love functional programming
def parsers(self):
    return itertools.imap(FileParser, self.files)
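For illustration, a short usage sketch of the rewritten FileParser ("data.bin" is a placeholder path):

# Pass a filename: the parser opens and owns the file.
p1 = FileParser("data.bin")
p1.close()                 # actually closes the file it opened

# Pass an already-open handle: the caller keeps ownership by default.
fh = open("data.bin", "rb")
p2 = FileParser(fh)
p2.close()                 # no-op, since closing defaults to False
fh.close()                 # the caller closes it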
An aside: if you're new to Python, I recommend you take a look at the Python style guide (also known as PEP 8). Two-space indents look weird.

Hashing a python function to regenerate output when the function is modified

I have a python function that has a deterministic result. It takes a long time to run and generates a large output:
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.
Is this possible or ridiculous?
Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.
If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.
import inspect
import functools
import json

def memoize_zeroadic_function_to_disk(memo_filename):
    def decorator(f):
        try:
            with open(memo_filename, 'r') as fp:
                cache = json.load(fp)
        except IOError:
            # file doesn't exist yet
            cache = {}

        source = inspect.getsource(f)

        @functools.wraps(f)
        def wrapper():
            if source not in cache:
                cache[source] = f()
                with open(memo_filename, 'w') as fp:
                    json.dump(cache, fp)
            return cache[source]
        return wrapper
    return decorator
@memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:
def time_consuming_method():
    # your existing method here
    ...

# Is the cached data older than this file?
if (not os.path.exists(data_file_name)
        or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
    data = time_consuming_method()
    save_data(data_file_name, data)
else:
    data = load_data(data_file_name)

# redefine method
def time_consuming_method():
    return data
While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.
The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some Python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your check-in routine that deletes your serialized table? Or, if this is not research data and is in production, make it part of your release process: if the revision number of your file (put this function in its own file) has changed, your release script deletes the serialized lookup table.
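For illustration, a minimal sketch of that invalidation idea without a version-control hook (the file names and the cached_call helper are hypothetical; it hashes the function's source with hashlib and recomputes when it changes):

import hashlib
import inspect
import os
import pickle

CACHE_FILE = "result.pickle"  # placeholder paths
HASH_FILE = "result.hash"

def cached_call(func):
    # hash the current source of the function
    current = hashlib.sha1(inspect.getsource(func).encode("utf-8")).hexdigest()
    if os.path.exists(CACHE_FILE) and os.path.exists(HASH_FILE):
        with open(HASH_FILE) as fp:
            if fp.read() == current:          # source unchanged: reuse result
                with open(CACHE_FILE, "rb") as cf:
                    return pickle.load(cf)
    result = func()                           # first run, or source changed
    with open(CACHE_FILE, "wb") as cf:
        pickle.dump(result, cf)
    with open(HASH_FILE, "w") as fp:
        fp.write(current)
    return result

Calling cached_call(time_consuming_function) then recomputes only when the function's source text differs from what was hashed last time.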
So, here is a really neat trick using decorators:
def memoize(f):
    cache = {}
    def result(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    return result
With the above, you can then use:
@memoize
def myfunc(x, y, z):
    # Some really long running computation
    ...
When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "@memoize" again, or explicitly write:
myfunc = memoize(new_definition_for_myfunc)
Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:
import os
import os.path
import cPickle

class MemoizedFunction(object):
    def __init__(self, f):
        self.function = f
        self.filename = str(hash(f)) + ".cache"
        self.cache = {}
        if os.path.exists(self.filename):
            with open(self.filename, 'rb') as file:
                self.cache = cPickle.load(file)

    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.function(*args)
        return self.cache[args]

    def __del__(self):
        with open(self.filename, 'wb') as file:
            cPickle.dump(self.cache, file, cPickle.HIGHEST_PROTOCOL)

def memoize(f):
    return MemoizedFunction(f)
What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.
An (overly simplified) example:
def memoized(f):
    cache = {}
    def memo(*args):
        if args in cache:
            return cache[args]
        else:
            ret = f(*args)
            cache[args] = ret
            return ret
    return memo

@memoized
def time_consuming_method():
    # lots_of_computing_time to come up with the_result
    return the_result
Edit:
From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some form of persistent storage for the cache (e.g. something as simple as Pickle or a plain text file, or maybe a full-blown database, or anything in between). The choice of which method to use depends on what the OP needs. Several other answers already give some solutions to this, so I'm not going to repeat that here.

A simple freeze behavior decorator

I'm trying to write a freeze decorator for Python.
The idea is as follows:
(In response to the two comments)
I might be wrong, but I think there are two main uses of test cases.
One is test-driven development: ideally, developers write the test cases before writing the code. It usually helps define the architecture, because this discipline forces you to define the real interfaces before development. One may even consider that, in some cases, the person who dispatches work between developers writes the test cases and uses them to illustrate efficiently the specification he has in mind. I don't have any experience with using test cases like that.
The second is the idea that any project of a decent size, with several programmers, suffers from broken code. Something that used to work may get broken by a change that looked like an innocent refactoring. Though good architecture and loose coupling between components may help fight this phenomenon, you will sleep better at night if you have written some test cases to make sure that nothing breaks your program's behavior.
HOWEVER, nobody can deny the overhead of writing test cases. In the first case, one may argue that the test cases actually guide development and are therefore not to be considered an overhead. Frankly speaking, I'm a pretty young programmer and, if I were you, my word on this subject would not be really valuable... Anyway, I think that most companies/projects do not work like that, and that unit tests are mainly used in the second case...
In other words, rather than ensuring that the program is working correctly, they aim at checking that it will work the same in the future.
This need can be met without the cost of writing tests, by using this freeze decorator.
Let's say you have a function
def pow(n, k):
    if k == 0: return 1
    else: return n * pow(n, k - 1)
It is perfectly nice, and you want to rewrite it as an optimized version.
It is part of a big project. You want it to give back the same result for a few values.
Rather than going through the pain of test cases, one could use some
kind of freeze decorator.
Something such that, the first time the decorator is run, it runs the function with the defined args (below, 0 and 7) and saves the result in a map (f --> args --> result).
@freeze(2, 0)
@freeze(1, 3)
@freeze(3, 5)
@freeze(0, 0)
def pow(n, k):
    if k == 0: return 1
    else: return n * pow(n, k - 1)
The next time the program is executed, the decorator will load this map and check that the result of this function for these args has not changed.
I already wrote the decorator quickly (see below), but hit a few problems about which I need your advice...
from __future__ import with_statement
from collections import defaultdict
from types import GeneratorType
import cPickle

def __id_from_function(f):
    return ".".join([f.__module__, f.__name__])

def generator_firsts(g, N=100):
    try:
        if N == 0:
            return []
        else:
            return [g.next()] + generator_firsts(g, N - 1)
    except StopIteration:
        return []

def __post_process(v):
    specialized_postprocess = [
        (GeneratorType, generator_firsts),
        (Exception, str),
    ]
    try:
        val_mro = v.__class__.mro()
        for (ancestor, specialized) in specialized_postprocess:
            if ancestor in val_mro:
                return specialized(v)
        raise ""
    except:
        print "Cannot accept this as a value"
        return None

def __eval_function(f):
    def aux(args, kargs):
        try:
            return (True, __post_process(f(*args, **kargs)))
        except Exception, e:
            return (False, __post_process(e))
    return aux

def __compare_behavior(f, past_records):
    for (args, kargs, result) in past_records:
        assert __eval_function(f)(args, kargs) == result

def __record_behavior(f, past_records, args, kargs):
    registered_args = [(a, k) for (a, k, r) in past_records]
    if (args, kargs) not in registered_args:
        res = __eval_function(f)(args, kargs)
        past_records.append((args, kargs, res))

def __open_frz():
    try:
        with open(".frz", "r") as __open_frz:
            return cPickle.load(__open_frz)
    except:
        return defaultdict(list)

def __save_frz(past_records):
    with open(".frz", "w") as __open_frz:
        return cPickle.dump(past_records, __open_frz)

def freeze_behavior(*args, **kvargs):
    def freeze_decorator(f):
        past_records = __open_frz()
        f_id = __id_from_function(f)
        f_past_records = past_records[f_id]
        __compare_behavior(f, f_past_records)
        __record_behavior(f, f_past_records, args, kvargs)
        __save_frz(past_records)
        return f
    return freeze_decorator
Dumping and comparing results is not trivial for all types. Right now I'm thinking about using a function (I call it postprocess here) to solve this problem.
Basically, instead of storing res I store postprocess(res), and I compare postprocess(res1) == postprocess(res2) instead of comparing res1 and res2.
It is important to let the user overload the predefined postprocess function.
My first question is: do you know a way to check whether an object is dumpable or not?
Defining a key for the decorated function is a pain. In the snippet above I am using the function's module and its name. Can you think of a smarter way to do that?
The snippet above kind of works, but it opens and closes the file both when testing and when recording. This is just a rough prototype... but do you know a nice way to open the file, process the decorator for all functions, and then close the file?
I intend to add some functionality to this. For instance, the possibility to define an iterable to browse a set of arguments, to record arguments from real use, etc.
What would you expect from such a decorator?
In general, would you use such a feature, knowing its limitations... especially when trying to use it with OOP?
"In general, would you use such a feature, knowing its limitation...?"
Frankly speaking -- never.
There are no circumstances under which I would "freeze" results of a function in this way.
The use case appears to be based on two wrong ideas: (1) that unit testing is either hard or complex or expensive; and (2) it could be simpler to write the code, "freeze" the results and somehow use the frozen results for refactoring. This isn't helpful. Indeed, the very real possibility of freezing wrong answers makes this a bad idea.
First, on "consistency vs. correctness". This is easier to preserve with a simple mapping than with a complex set of decorators.
Do this instead of writing a freeze decorator.
print "frozen_f=", dict( (i,f(i)) for i in range(100) )
The dictionary object that's created will work perfectly as a frozen result set. No decorator. No complexity to speak of.
Second, on "unit testing".
The point of a unit test is not to "freeze" some random results. The point of a unit test is to compare real results with results developed another way (a simpler, more obvious, possibly poorly-performing way). Usually unit tests compare against hand-developed results. Other times unit tests use obvious but horribly slow algorithms to produce a few key results.
The point of having test data around is not that it's a "frozen" result. The point of having test data is that it is an independent result. Done differently -- sometimes by different people -- that confirms that the function works.
Sorry. This appears to me to be a bad idea; it looks like it subverts the intent of unit testing.
"HOWEVER, Nobody can deny the overhead of writting test cases"
Actually, many folks would deny the "overhead". It isn't "overhead" in the sense of wasted time and effort. For some of us, unittests are essential. Without them, the code may work, but only by accident. With them, we have ample evidence that it actually works; and the specific cases for which it works.
Are you looking to implement invariants or postconditions?
You should specify the expected result explicitly; this will remove most of your problems.
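For illustration, a minimal sketch of specifying the results explicitly, using the pow from the question (plain asserts; a unittest.TestCase would work just as well):

def test_pow():
    # expected values written down by hand, so a behavior change is caught
    # deliberately rather than frozen by accident
    assert pow(2, 0) == 1
    assert pow(1, 3) == 1
    assert pow(3, 5) == 243
    assert pow(0, 0) == 1

test_pow()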

Categories

Resources