Constructing efficient functions in Python

My programming is almost all self-taught, so I apologise in advance if some of my terminology is off in this question. Also, I am going to use a simple example to help illustrate my question, but please note that the example itself is not important, it's just a way to hopefully make my question clearer.
Imagine that I have some poorly formatted text with a lot of extra white space that I want to clean up. So I create a function that will replace any group of white space characters that contains a newline character with a single newline character, and any other group of white space characters with a single space. The function might look like this:
import re

def white_space_cleaner(text):
    new_line_finder = re.compile(r"\s*\n\s*")
    white_space_finder = re.compile(r"\s\s+")
    text = new_line_finder.sub("\n", text)
    text = white_space_finder.sub(" ", text)
    return text
That works just fine; the problem is that now every time I call the function it has to compile the regular expressions. To make it run faster I can rewrite it like this:
new_line_finder = re.compile(r"\s*\n\s*")
white_space_finder = re.compile(r"\s\s+")

def white_space_cleaner(text):
    text = new_line_finder.sub("\n", text)
    text = white_space_finder.sub(" ", text)
    return text
Now the regular expressions are only compiled once and the function runs faster. Using timeit on both functions I find that the first function takes 27.3 µs per loop and the second takes 25.5 µs per loop. A small speed-up, but one that could be significant if the function is called millions of times or uses hundreds of patterns instead of two. Of course, the downside of the second function is that it pollutes the global namespace and makes the code less readable. Is there some "Pythonic" way to include an object, like a compiled regular expression, in a function without having it be recompiled every time the function is called?

Keep a list of tuples (regular expressions and the replacement text) to apply; there doesn't seem to be a compelling need to name each one individually.
finders = [
    (re.compile(r"\s*\n\s*"), "\n"),
    (re.compile(r"\s\s+"), " ")
]

def white_space_cleaner(text):
    for finder, repl in finders:
        text = finder.sub(repl, text)
    return text
You might also incorporate functools.partial:
from functools import partial

replacers = {
    r"\s*\n\s*": "\n",
    r"\s\s+": " "
}
# Ugly boiler-plate, but the only thing you might need to modify
# is the dict above as your needs change.
replacers = [partial(re.compile(regex).sub, repl) for regex, repl in replacers.items()]

def white_space_cleaner(text):
    for replacer in replacers:
        text = replacer(text)
    return text

Another way to do it is to group the common functionality in a class:
import re

class ReUtils(object):
    new_line_finder = re.compile(r"\s*\n\s*")
    white_space_finder = re.compile(r"\s\s+")

    @classmethod
    def white_space_cleaner(cls, text):
        text = cls.new_line_finder.sub("\n", text)
        text = cls.white_space_finder.sub(" ", text)
        return text

if __name__ == '__main__':
    print(ReUtils.white_space_cleaner("the text"))
It's already grouped in a module, but depending on the rest of the code a class can also be suitable.

You could put the regular expression compilation into the function parameters, like this:
def white_space_cleaner(text, new_line_finder=re.compile(r"\s*\n\s*"),
                        white_space_finder=re.compile(r"\s\s+")):
    text = new_line_finder.sub("\n", text)
    text = white_space_finder.sub(" ", text)
    return text
Since default argument values are evaluated only once, when the def statement is executed, the patterns are compiled a single time and they don't live in the module namespace. The defaults also give you the flexibility to replace them from calling code if you really need to. The downside is that some people might consider it to pollute the function signature.
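As a rough sketch of that flexibility (the call and the replacement pattern here are made up for illustration), a caller could override one of the defaults:

import re

# Hypothetical override from calling code: collapse only runs of spaces and
# tabs, leaving newlines untouched, by passing a different compiled pattern.
cleaned = white_space_cleaner("messy   text\n\n  here",
                              white_space_finder=re.compile(r"[ \t]+"))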
I wanted to try timing this but I couldn't figure out how to use timeit properly. You should see similar results to the global version.
Markus's comment on your post is correct, though; sometimes it's fine to put variables at module level. If you don't want them to be easily visible to other modules, consider prepending the names with an underscore; this marks them as module-private, and if you do from module import * it won't import names starting with an underscore (you can still get them if you ask for them by name, though).
Always remember: the answer to "what's the best way to do this in Python?" is almost always "whatever makes the code most readable". Python was created, first and foremost, to be easy to read, so do what you think is the most readable thing.

In this particular case I think it doesn't matter. Check:
Is it worth using Python's re.compile?
As you can see in the answer, and in the source code:
https://github.com/python/cpython/blob/master/Lib/re.py#L281
The implementation of the re module keeps an internal cache of compiled patterns, so the small speed-up you see is probably just from avoiding the cache lookup.
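As a rough illustration of that point (this snippet is not from the linked answer), a version that never calls re.compile still only compiles each pattern once, because re.sub consults that cache:

import re

def white_space_cleaner(text):
    # The pattern strings are looked up in re's internal cache, so they are
    # compiled only once even though re.compile is never called explicitly.
    text = re.sub(r"\s*\n\s*", "\n", text)
    return re.sub(r"\s\s+", " ", text)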
That said, sometimes keeping state attached to the function itself is genuinely useful, for example building an internal cache that stays namespaced to the function:
def heavy_processing(arg):
    return arg + 2

def myfunc(arg1):
    # Assign attribute to function if first call
    if not hasattr(myfunc, 'cache'):
        myfunc.cache = {}
    # Perform lookup in internal cache
    if arg1 in myfunc.cache:
        return myfunc.cache[arg1]
    # Very heavy and expensive processing with arg1
    result = heavy_processing(arg1)
    myfunc.cache[arg1] = result
    return result
And this is executed like this:
>>> myfunc.cache
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'cache'
>>> myfunc(10)
12
>>> myfunc.cache
{10: 12}

You can use a static function attribute to hold the compiled re. This example does something similar, keeping a translation table in one function attribute.
def static_var(varname, value):
    def decorate(func):
        setattr(func, varname, value)
        return func
    return decorate

@static_var("complements", str.maketrans('acgtACGT', 'tgcaTGCA'))
def rc(seq):
    return seq.translate(rc.complements)[::-1]
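Applied to the question's regexes it might look like this (a sketch; the attribute names are made up for illustration):

import re

@static_var("new_line_finder", re.compile(r"\s*\n\s*"))
@static_var("white_space_finder", re.compile(r"\s\s+"))
def white_space_cleaner(text):
    # The compiled patterns live on the function object itself rather than
    # in the module namespace.
    text = white_space_cleaner.new_line_finder.sub("\n", text)
    return white_space_cleaner.white_space_finder.sub(" ", text)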


Using f-strings to format class attribute [duplicate]

I am using template strings to generate some files, and I love the conciseness of the new f-strings for this purpose, reducing my previous template code from something like this:
template_a = "The current name is {name}"
names = ["foo", "bar"]
for name in names:
    print(template_a.format(**locals()))
Now I can do this, directly replacing variables:
names = ["foo", "bar"]
for name in names:
    print(f"The current name is {name}")
However, sometimes it makes sense to have the template defined elsewhere — higher up in the code, or imported from a file or something. This means the template is a static string with formatting tags in it. Something would have to happen to the string to tell the interpreter to interpret the string as a new f-string, but I don't know if there is such a thing.
Is there any way to bring in a string and have it interpreted as an f-string to avoid using the .format(**locals()) call?
Ideally I want to be able to code like this... (where magic_fstring_function is where the part I don't understand comes in):
template_a = f"The current name is {name}"
# OR [Ideal2] template_a = magic_fstring_function(open('template.txt').read())
names = ["foo", "bar"]
for name in names:
print (template_a)
...with this desired output (without reading the file twice):
The current name is foo
The current name is bar
...but the actual output I get is:
The current name is {name}
The current name is {name}
Here's a complete "Ideal 2".
It's not an f-string—it doesn't even use f-strings—but it does as requested. Syntax exactly as specified. No security headaches since we are not using eval().
It uses a little class and implements __str__ which is automatically called by print. To escape the limited scope of the class we use the inspect module to hop one frame up and see the variables the caller has access to.
import inspect

class magic_fstring_function:
    def __init__(self, payload):
        self.payload = payload
    def __str__(self):
        vars = inspect.currentframe().f_back.f_globals.copy()
        vars.update(inspect.currentframe().f_back.f_locals)
        return self.payload.format(**vars)
template = "The current name is {name}"
template_a = magic_fstring_function(template)

# use it inside a function to demonstrate it gets the scoping right
def new_scope():
    names = ["foo", "bar"]
    for name in names:
        print(template_a)

new_scope()
# The current name is foo
# The current name is bar
A concise way to have a string evaluated as an f-string (with its full capabilities) is using the following function:
def fstr(template):
    return eval(f"f'{template}'")
Then you can do:
template_a = "The current name is {name}"
names = ["foo", "bar"]
for name in names:
    print(fstr(template_a))
# The current name is foo
# The current name is bar
And, in contrast to many other proposed solutions, you can also do:
template_b = "The current name is {name.upper() * 2}"
for name in names:
    print(fstr(template_b))
# The current name is FOOFOO
# The current name is BARBAR
This means the template is a static string with formatting tags in it
Yes, that's exactly why we have literals with replacement fields and .format, so we can replace the fields whenever we like by calling format on it.
Something would have to happen to the string to tell the interpreter to interpret the string as a new f-string
That's the prefix f/F. You could wrap it in a function and postpone the evaluation until call time, but of course that incurs extra overhead:
def template_a():
    return f"The current name is {name}"

names = ["foo", "bar"]
for name in names:
    print(template_a())
Which prints out:
The current name is foo
The current name is bar
but feels wrong and is limited by the fact that you can only peek at the global namespace in your replacements. Trying to use it in a situation which requires local names will fail miserably unless they are passed to the function as arguments (which totally defeats the point).
Is there any way to bring in a string and have it interpreted as an f-string to avoid using the .format(**locals()) call?
Other than a function (limitations included), nope, so might as well stick with .format.
How about:
>>> s = 'Hi, {foo}!'
>>> s
'Hi, {foo}!'
>>> s.format(foo='Bar')
'Hi, Bar!'
Using .format is not a correct answer to this question. Python f-strings are very different from str.format() templates ... they can contain code or other expensive operations - hence the need for deferral.
Here's an example of a deferred logger. This uses the normal preamble of logging.getLogger, but then adds new functions that interpret the f-string only if the log level is correct.
import inspect
import logging

log = logging.getLogger(__name__)

def __deferred_flog(log, fstr, level, *args):
    if log.isEnabledFor(level):
        frame = inspect.currentframe().f_back.f_back
        try:
            fstr = 'f"' + fstr + '"'
            log.log(level, eval(fstr, frame.f_globals, frame.f_locals))
        finally:
            del frame

log.fdebug = lambda fstr, *args: __deferred_flog(log, fstr, logging.DEBUG, *args)
log.finfo = lambda fstr, *args: __deferred_flog(log, fstr, logging.INFO, *args)
This has the advantage of being able to do things like: log.fdebug("{obj.dump()}") .... without dumping the object unless debugging is enabled.
IMHO: This should have been the default operation of f-strings, however now it's too late. F-string evaluation can have massive and unintended side-effects, and having that happen in a deferred manner will change program execution.
In order to make f-strings properly deferred, python would need some way of explicitly switching behavior. Maybe use the letter 'g'? ;)
It has been pointed out that deferred logging shouldn't crash if there's a bug in the string converter. The above solution can do this as well, change the finally: to except:, and stick a log.exception in there.
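A sketch of that variant (assuming the same lambda wrappers as above; the error message text is made up):

import inspect
import logging

def __deferred_flog(log, fstr, level, *args):
    if log.isEnabledFor(level):
        frame = inspect.currentframe().f_back.f_back
        try:
            log.log(level, eval('f"' + fstr + '"', frame.f_globals, frame.f_locals))
        except Exception:
            # A broken format string is logged instead of crashing the caller.
            log.exception("deferred f-string failed: %r", fstr)
        finally:
            del frame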
An f-string is simply a more concise way of creating a formatted string, replacing .format(**names) with f. If you don't want a string to be immediately evaluated in such a manner, don't make it an f-string. Save it as an ordinary string literal, and then call format on it later when you want to perform the interpolation, as you have been doing.
Of course, there is an alternative with eval.
template.txt:
f'The current name is {name}'
Code:
>>> template_a = open('template.txt').read()
>>> names = 'foo', 'bar'
>>> for name in names:
...     print(eval(template_a))
...
The current name is foo
The current name is bar
But then all you've managed to do is replace str.format with eval, which is surely not worth it. Just keep using regular strings with a format call.
What you want appears to be under consideration as a Python enhancement.
Meanwhile — from the linked discussion — the following seems like it would be a reasonable workaround that doesn't require using eval():
class FL:
    def __init__(self, func):
        self.func = func
    def __str__(self):
        return self.func()

template_a = FL(lambda: f"The current name, number is {name!r}, {number+1}")
names = "foo", "bar"
numbers = 40, 41
for name, number in zip(names, numbers):
    print(template_a)
Output:
The current name, number is 'foo', 41
The current name, number is 'bar', 42
Inspired by the answer by kadee, the following can be used to define a deferred f-string class:
class FStr:
    def __init__(self, s):
        self._s = s
    def __repr__(self):
        return eval(f"f'{self._s}'")

...

template_a = FStr('The current name is {name}')
names = ["foo", "bar"]
for name in names:
    print(template_a)
That is exactly what the question asked for.
Or maybe do not use f-strings, just format:
fun = "The current name is {name}".format
names = ["foo", "bar"]
for name in names:
    print(fun(name=name))
Or, in a version without names:
fun = "The current name is {}".format
names = ["foo", "bar"]
for name in names:
    print(fun(name))
Most of these answers will get you something that behaves sort of like f-strings some of the time, but they will all go wrong in some cases.
There is a package on PyPI, f-yeah, that does all this, only costing you two extra characters! (Full disclosure: I am the author.)
from fyeah import f
print(f("""'{'"all" the quotes'}'"""))
There are a lot of differences between f-strings and format calls; here is a probably incomplete list:
f-strings allow for arbitrary eval of python code
f-strings cannot contain a backslash in the expression (since formatted strings don't have an expression, so I suppose you could say this isn't a difference, but it does differ from what a raw eval() can do)
dict lookups in formatted strings must not be quoted. dict lookups in f-strings can be quoted, and so non-string keys can also be looked up
f-strings have a debug format that format() does not: f"The argument is {spam=}"
f-string expressions cannot be empty
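To make a couple of those differences concrete (this is plain CPython behaviour, not the f-yeah package; the debug format needs Python 3.8+):

d = {"key": 1}
spam = 42

print("{d[key]}".format(d=d))  # str.format(): the dict key is written unquoted
print(f"{d['key']}")           # f-string: the key is quoted, so non-string keys work too
print(f"{spam=}")              # debug format; prints "spam=42", no str.format() equivalent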
The suggestions to use eval will get you full f-string format support, but they don't work on all string types.
def f_template(the_string):
    return eval(f"f'{the_string}'")

print(f_template('some "quoted" string'))
print(f_template("some 'quoted' string"))

some "quoted" string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in f_template
  File "<string>", line 1
    f'some 'quoted' string'
            ^
SyntaxError: invalid syntax
This example will also get variable scoping wrong in some cases.
To do that, I prefer to use an f-string inside a lambda function, like:
s = lambda x: f'this is your string template to embed {x} in it.'
n = ['a', 'b', 'c']
for i in n:
    print(s(i))
There's a lot of talk about using str.format(), but as noted it doesn't allow most of the expressions that are allowed in f-strings, such as arithmetic or slices. Using eval() obviously also has its downsides.
I'd recommend looking into a templating language such as Jinja. For my use-case it works quite well. See the example below, where I have overridden the variable delimiter syntax with a single curly brace to match the f-string syntax. I didn't fully review the differences between f-strings and Jinja invoked like this.
from jinja2 import Environment, BaseLoader
a, b, c = 1, 2, "345"
templ = "{a or b}{c[1:]}"
env = Environment(loader=BaseLoader, variable_start_string="{", variable_end_string="}")
env.from_string(templ).render(**locals())
results in
'145'
A suggestion that uses f-strings: do your evaluation at the logical level where the templating is occurring and pass it around as a generator. You can unwind it at whatever point you choose, using f-strings.
In [46]: names = (i for i in ('The CIO, Reed', 'The homeless guy, Arnot', 'The security guard Spencer'))
In [47]: po = (f'Strangely, {next(names)} has a nice {i}' for i in (" nice house", " fast car", " big boat"))
In [48]: while True:
    ...:     try:
    ...:         print(next(po))
    ...:     except StopIteration:
    ...:         break
    ...:
Strangely, The CIO, Reed has a nice nice house
Strangely, The homeless guy, Arnot has a nice fast car
Strangely, The security guard Spencer has a nice big boat
You could use a .format-style replacement and explicitly name the replaced variable:
template_a = "The current name is {name}"
names = ["foo", "bar"]
for name in names:
    print(template_a.format(name=name))
Output
The current name is foo
The current name is bar

Is it possible to "hack" Python's print function?

Note: This question is for informational purposes only. I am interested to see how deep into Python's internals it is possible to go with this.
Not very long ago, a discussion began inside a certain question regarding whether the strings passed to print statements could be modified after/during the call to print has been made. For example, consider the function:
def print_something():
    print('This cat was scared.')
Now, when print is run, the output to the terminal should display:
This dog was scared.
Notice the word "cat" has been replaced by the word "dog". Something somewhere somehow was able to modify those internal buffers to change what was printed. Assume this is done without the original code author's explicit permission (hence, hacking/hijacking).
This comment from the wise @abarnert, in particular, got me thinking:
There are a couple of ways to do that, but they're all very ugly, and should never be done. The least ugly way is to probably replace the code object inside the function with one with a different co_consts list. Next is probably reaching into the C API to access the str's internal buffer. [...]
So, it looks like this is actually possible.
Here's my naive way of approaching this problem:
>>> import inspect
>>> exec(inspect.getsource(print_something).replace('cat', 'dog'))
>>> print_something()
This dog was scared.
Of course, exec is bad, but that doesn't really answer the question, because it does not actually modify anything during or after the call to print.
How would it be done as @abarnert has explained it?
First, there's actually a much less hacky way. All we want to do is change what print prints, right?
_print = print

def print(*args, **kw):
    args = (arg.replace('cat', 'dog') if isinstance(arg, str) else arg
            for arg in args)
    _print(*args, **kw)
Or, similarly, you can monkeypatch sys.stdout instead of print.
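A rough sketch of that sys.stdout variant (the class name here is made up):

import sys

class CatToDog:
    def write(self, s):
        # Forward everything to the real stdout, swapping the word on the way.
        sys.__stdout__.write(s.replace('cat', 'dog'))
    def flush(self):
        sys.__stdout__.flush()

sys.stdout = CatToDog()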
Also, nothing wrong with the exec … getsource … idea. Well, of course there's plenty wrong with it, but less than what follows here…
But if you do want to modify the function object's code constants, we can do that.
If you really want to play around with code objects for real, you should use a library like bytecode (when it's finished) or byteplay (until then, or for older Python versions) instead of doing it manually. Even for something this trivial, the CodeType initializer is a pain; if you actually need to do stuff like fixing up lnotab, only a lunatic would do that manually.
Also, it goes without saying that not all Python implementations use CPython-style code objects. This code will work in CPython 3.7, and probably all versions back to at least 2.2 with a few minor changes (and not the code-hacking stuff, but things like generator expressions), but it won't work with any version of IronPython.
import types

def print_function():
    print("This cat was scared.")

def main():
    # A function object is a wrapper around a code object, with
    # a bit of extra stuff like default values and closure cells.
    # See inspect module docs for more details.
    co = print_function.__code__
    # A code object is a wrapper around a string of bytecode, with a
    # whole bunch of extra stuff, including a list of constants used
    # by that bytecode. Again see inspect module docs. Anyway, inside
    # the bytecode for print_function (which you can read by typing
    # dis.dis(print_function) in your REPL), there's going to be an
    # instruction like LOAD_CONST 1 to load the string literal onto
    # the stack to pass to the print function, and that works by just
    # reading co.co_consts[1]. So, that's what we want to change.
    consts = tuple(c.replace("cat", "dog") if isinstance(c, str) else c
                   for c in co.co_consts)
    # Unfortunately, code objects are immutable, so we have to create
    # a new one, copying over everything except for co_consts, which
    # we'll replace. And the initializer has a zillion parameters.
    # Try help(types.CodeType) at the REPL to see the whole list.
    co = types.CodeType(
        co.co_argcount, co.co_kwonlyargcount, co.co_nlocals,
        co.co_stacksize, co.co_flags, co.co_code,
        consts, co.co_names, co.co_varnames, co.co_filename,
        co.co_name, co.co_firstlineno, co.co_lnotab,
        co.co_freevars, co.co_cellvars)
    print_function.__code__ = co
    print_function()

main()
What could go wrong with hacking up code objects? Mostly just segfaults, RuntimeErrors that eat up the whole stack, more normal RuntimeErrors that can be handled, or garbage values that will probably just raise a TypeError or AttributeError when you try to use them. For examples, try creating a code object with just a RETURN_VALUE with nothing on the stack (bytecode b'S\0' for 3.6+, b'S' before), or with an empty tuple for co_consts when there's a LOAD_CONST 0 in the bytecode, or with varnames decremented by 1 so the highest LOAD_FAST actually loads a freevar/cellvar cell. For some real fun, if you get the lnotab wrong enough, your code will only segfault when run in the debugger.
Using bytecode or byteplay won't protect you from all of those problems, but they do have some basic sanity checks, and nice helpers that let you do things like insert a chunk of code and let it worry about updating all offsets and labels so you can't get it wrong, and so on. (Plus, they keep you from having to type in that ridiculous 6-line constructor, and having to debug the silly typos that come from doing so.)
Now on to #2.
I mentioned that code objects are immutable. And of course the consts are a tuple, so we can't change that directly. And the thing in the const tuple is a string, which we also can't change directly. That's why I had to build a new string to build a new tuple to build a new code object.
But what if you could change a string directly?
Well, deep enough under the covers, everything is just a pointer to some C data, right? If you're using CPython, there's a C API to access the objects, and you can use ctypes to access that API from within Python itself, which is such a terrible idea that they put a pythonapi right there in the stdlib's ctypes module. :) The most important trick you need to know is that id(x) is the actual pointer to x in memory (as an int).
Unfortunately, the C API for strings won't let us safely get at the internal storage of an already-frozen string. So screw safely, let's just read the header files and find that storage ourselves.
If you're using CPython 3.4 - 3.7 (it's different for older versions, and who knows for the future), a string literal from a module that's made of pure ASCII is going to be stored using the compact ASCII format, which means the struct ends early and the buffer of ASCII bytes follows immediately in memory. This will break (as in probably segfault) if you put a non-ASCII character in the string, or certain kinds of non-literal strings, but you can read up on the other 4 ways to access the buffer for different kinds of strings.
To make things slightly easier, I'm using the superhackyinternals project off my GitHub. (It's intentionally not pip-installable because you really shouldn't be using this except to experiment with your local build of the interpreter and the like.)
import ctypes
import internals  # https://github.com/abarnert/superhackyinternals/blob/master/internals.py

def print_function():
    print("This cat was scared.")

def main():
    for c in print_function.__code__.co_consts:
        if isinstance(c, str):
            idx = c.find('cat')
            if idx != -1:
                # Too much to explain here; just guess and learn to
                # love the segfaults...
                p = internals.PyUnicodeObject.from_address(id(c))
                assert p.compact and p.ascii
                addr = id(c) + internals.PyUnicodeObject.utf8_length.offset
                buf = (ctypes.c_int8 * 3).from_address(addr + idx)
                buf[:3] = b'dog'
    print_function()

main()
If you want to play with this stuff, int is a whole lot simpler under the covers than str. And it's a lot easier to guess what you can break by changing the value of 2 to 1, right? Actually, forget imagining, let's just do it (using the types from superhackyinternals again):
>>> n = 2
>>> pn = PyLongObject.from_address(id(n))
>>> pn.ob_digit[0]
2
>>> pn.ob_digit[0] = 1
>>> 2
1
>>> n * 3
3
>>> i = 10
>>> while i < 40:
...     i *= 2
...     print(i)
10
10
10
… pretend that code box has an infinite-length scrollbar.
I tried the same thing in IPython, and the first time I tried to evaluate 2 at the prompt, it went into some kind of uninterruptable infinite loop. Presumably it's using the number 2 for something in its REPL loop, while the stock interpreter isn't?
Monkey-patch print
print is a builtin function so it will use the print function defined in the builtins module (or __builtin__ in Python 2). So whenever you want to modify or change the behavior of a builtin function you can simply reassign the name in that module.
This process is called monkey-patching.
# Store the real print function in another variable, otherwise
# it will be inaccessible after being modified.
_print = print

# Actual implementation of the new print
def custom_print(*args, **options):
    _print('custom print called')
    _print(*args, **options)

# Change the print function globally
import builtins
builtins.print = custom_print
After that every print call will go through custom_print, even if the print is in an external module.
However you don't really want to print additional text, you want to change the text that is printed. One way to go about that is to replace it in the string that would be printed:
_print = print

def custom_print(*args, **options):
    # Get the desired separator or the default whitespace
    sep = options.pop('sep', ' ')
    # Create the final string
    printed_string = sep.join(map(str, args))
    # Modify the final string
    printed_string = printed_string.replace('cat', 'dog')
    # Call the default print function
    _print(printed_string, **options)

import builtins
builtins.print = custom_print
And indeed if you run:
>>> def print_something():
...     print('This cat was scared.')
>>> print_something()
This dog was scared.
Or if you write that to a file:
test_file.py
def print_something():
    print('This cat was scared.')

print_something()
and import it:
>>> import test_file
This dog was scared.
>>> test_file.print_something()
This dog was scared.
So it really works as intended.
However, in case you only temporarily want to monkey-patch print you could wrap this in a context-manager:
import builtins

class ChangePrint(object):
    def __init__(self):
        self.old_print = print

    def __enter__(self):
        def custom_print(*args, **options):
            # Get the desired separator or the default whitespace
            sep = options.pop('sep', ' ')
            # Create the final string
            printed_string = sep.join(map(str, args))
            # Modify the final string
            printed_string = printed_string.replace('cat', 'dog')
            # Call the default print function
            self.old_print(printed_string, **options)
        builtins.print = custom_print

    def __exit__(self, *args, **kwargs):
        builtins.print = self.old_print
So when you run that, what is printed depends on the context:
>>> with ChangePrint() as x:
...     test_file.print_something()
...
This dog was scared.
>>> test_file.print_something()
This cat was scared.
So that's how you could "hack" print by monkey-patching.
Modify the target instead of the print
If you look at the signature of print you'll notice a file argument, which is sys.stdout by default. Note that this is a dynamic default (print really looks up sys.stdout every time it is called), not like normal default arguments in Python. So if you change sys.stdout, print will actually print to the different target. Even more conveniently, Python also provides a redirect_stdout function (from Python 3.4 on, but it's easy to create an equivalent function for earlier Python versions).
The downside is that it won't work for print statements that don't print to sys.stdout and that creating your own stdout isn't really straightforward.
import sys

class CustomStdout(object):
    def __init__(self, *args, **kwargs):
        self.current_stdout = sys.stdout

    def write(self, string):
        self.current_stdout.write(string.replace('cat', 'dog'))
However this also works:
>>> import contextlib
>>> with contextlib.redirect_stdout(CustomStdout()):
...     test_file.print_something()
...
This dog was scared.
>>> test_file.print_something()
This cat was scared.
Summary
Some of these points have already been mentioned by @abarnert, but I wanted to explore these options in more detail, especially how to modify print across modules (using builtins/__builtin__) and how to make the change only temporary (using context managers).
A simple way to capture all output from a print function and then process it, is to change the output stream to something else, e.g. a file.
I'll use PHP naming conventions (ob_start, ob_get_contents, ...).
from functools import partial

output_buffer = None
print_orig = print

def ob_start(fname="print.txt"):
    global print
    global output_buffer
    output_buffer = open(fname, 'w')
    print = partial(print_orig, file=output_buffer)

def ob_end():
    global print
    global output_buffer
    output_buffer.close()
    print = print_orig

def ob_get_contents(fname="print.txt"):
    return open(fname, 'r').read()
Usage:
print("Hi John")
ob_start()
print("Hi John")
ob_end()
print(ob_get_contents().replace("Hi", "Bye"))
Would print
Hi John
Bye John
Let's combine this with frame introspection!
import sys

_print = print

def print(*args, **kw):
    frame = sys._getframe(1)
    _print(frame.f_code.co_name)
    _print(*args, **kw)

def greetly(name, greeting="Hi"):
    print(f"{greeting}, {name}!")

class Greeter:
    def __init__(self, greeting="Hi"):
        self.greeting = greeting

    def greet(self, name):
        print(f"{self.greeting}, {name}!")
You'll find this trick prefaces every greeting with the calling function or method. This might be very useful for logging or debugging; especially as it lets you "hijack" print statements in third party code.
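A quick demonstration, assuming the definitions above:

greetly("Ada")                 # prints "greetly", then "Hi, Ada!"
Greeter("Hello").greet("Bob")  # prints "greet", then "Hello, Bob!"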

Save any output a function gives, into a variable

Being new at programming in general, and new with Python in particular, I'm having some beginner's troubles.
I'm trying out a function from NLTK called generate:
string.generate()
It returns what seems like a string. However, if I write:
stringvariable = string.generate()
or
stringvariable = str(string.generate())
… the stringvariable is always empty.
So I guess I'm missing something here. Can the text output generated, that I see on the screen, be something other than a string? And if so, is there any way for me to grab that output and put it into a variable?
Briefly put, how do I get what comes out of string.generate() into stringvariable, if not as described above?
You can rewrite generate. The only disadvantage is that NLTK's implementation can change, and your copy might not be updated to reflect those changes:
from nltk.util import tokenwrap

def generate_no_stdout(self, length=100):
    if '_trigram_model' not in self.__dict__:
        estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
        self._trigram_model = NgramModel(3, self, estimator=estimator)
    text = self._trigram_model.generate(length)
    return tokenwrap(text)
then "a.generate()" becomes "generate_no_stdout(a)"
generate() prints its output rather than returning a string, so you need to capture it.
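One way to capture it (a sketch, assuming Python 3.4+ and that text stands for your nltk.Text object) is to temporarily redirect stdout into a StringIO buffer:

import io
from contextlib import redirect_stdout

buf = io.StringIO()
with redirect_stdout(buf):
    text.generate()          # prints into the buffer instead of the screen
stringvariable = buf.getvalue()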

How to call a function based on list entry?

I'm currently working on an experiment where I'm implementing an interpreter for an old in-game scripting language. It's a Forth-based language, so I figure it would be fairly easy to just have the instructions (once verified and sanitized) put into a big list.
Once I've got the code in a list, I am trying to iterate through the entire program in a for loop that processes the instructions one at a time. Certain items, like strings, can be pushed onto a variable that holds the current stack, which is easy enough. But where I'm stuck is making commands happen.
I have a big list of functions that are valid and I'd like it to where if any instruction matches them, it calls the associated function.
So, for example, if I had:
"Hello, world!" notify
...the code would check for notify in a list and then execute the notify function. The bottom line is: How do I translate a string into a function name?
You could keep a dictionary of functions the code can call, and then do a look up when you need to:
def notify(s):
print(s)
d = {"notify": notify}
d["notify"]("Hello, world!")
You can do it through locals, which returns a dictionary with the current local symbol table:
locals()["notify"]()
or through globals, which returns a dictionary with the symbol table of globals:
globals()["notify"]()
You can give arguments too e.g.:
locals()["notify"]("Hello, world!")
or
globals()["notify"]("Hello, world!")
If you have a dict called commands that maps names to functions, you can do it like this:
def my_notify_function():
    print(stack.pop())

commands = {'notify': my_notify_function, ...}

for item in program:
    if item in commands:
        commands[item]()
    else:
        stack.push(item)
Something like:
import re
class LangLib(object):
pattern = re.compile(r'"(.*?)" (.*)')
def run_line(self, line):
arg, command = re.match(LangLib.pattern, line).groups()
return getattr(self, command)(arg)
def notify(self, arg):
print arg
Then your engine code would be:
parser = LangLib()
for line in program_lines:
    parser.run_line(line)
Create a dictionary of function names and some tags. I have tried it several times before; it works really well.

Scope, using functions in current module

I know this must be a trivial question, but I've tried many different ways, and searched quite a bit for a solution, but how do I create and reference subfunctions in the current module?
For example, I am writing a program to parse through a text file, and for each of the 300 different names in it, I want to assign it to a category.
There are 300 of these, and I have a list of these structured to create a dict, so of the form lookup[key]=value (bonus question; any more efficient or sensible way to do this than a massive dict?).
I would like to keep all of this in the same module, but with the functions (dict initialisation, etc.) at the end of the file, so I don't have to scroll down 300 lines to see the code, i.e. laid out as in the example below.
When I run it as below, I get the error 'initlookups is not defined'. When I structure it so that it is initialisation, then function definition, then function use, there is no problem.
I'm sure there must be an obvious way to initialise the functions and associated dict without keeping the code inline, but have tried quite a few so far without success. I can put it in an external module and import this, but would prefer not to for simplicity.
What should I be doing in terms of module structure? Is there any better way than using a dict to store this lookup table (it is 300 unique text keys mapping onto approx 10 categories)?
Thanks,
Brendan
import .....  # (initialisation code, etc.)

initLookups()  # Should create the dict - How should this be referenced?
print getlookup(KEY)  # How should this be referenced?

def initLookups():
    global lookup
    lookup = {}
    lookup["A"] = "AA"
    lookup["B"] = "BB"
    # (etc etc etc....)

def getlookup(name):
    if name in lookup.keys():
        getlookup = lookup[name]
    else:
        getlookup = ""
    return getlookup
A function needs to be defined before it can be called. If you want to have the code that needs to be executed at the top of the file, just define a main function and call it from the bottom:
import sys

def main(args):
    pass

# All your other function definitions here

if __name__ == '__main__':
    exit(main(sys.argv[1:]))
This way, whatever you reference in main will have been parsed and is hence known already. The reason for testing __name__ is that in this way the main method will only be run when the script is executed directly, not when it is imported by another file.
Side note: a dict with 300 keys is by no means massive, but you may want to either move the code that fills the dict to a separate module, or (perhaps more fancy) store the key/value pairs in a format like JSON and load it when the program starts.
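A minimal sketch of the JSON idea (the file name and its contents are made up for illustration):

import json

# lookups.json would hold an object such as {"A": "AA", "B": "BB", ...}
with open("lookups.json") as f:
    lookup = json.load(f)

print(lookup.get("A", ""))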
Here's a more Pythonic way to do this. There aren't a lot of choices, BTW.
A function must be defined before it can be used. Period.
However, you don't have to strictly order all functions for the compiler's benefit. You merely have to put your execution of the functions last.
import  # (initialisation code, etc.)

def initLookups():  # Definitions must come before actual use
    lookup = {}
    lookup["A"] = "AA"
    lookup["B"] = "BB"
    # (etc etc etc....)
    return lookup

# Any functions initLookups uses can be defined here,
# as long as they're findable in the same module.

if __name__ == "__main__":  # Use comes last
    lookup = initLookups()
    print lookup.get("Key", "")
Note that you don't need the getlookup function; it's a built-in feature of a dict, named get.
Also, "initialisation code" is suspicious. An import should not "do" anything. It should define functions and classes, but not actually provide any executable code. In the long run, executable code that is processed by an import can become a maintenance nightmare.
The most notable exception is a module-level Singleton object that gets created by default. Even then, be sure that the mystery object which makes a module work is clearly identified in the documentation.
If your lookup dict is unchanging, the simplest way is to just make it a module scope variable. ie:
lookup = {
    'A' : 'AA',
    'B' : 'BB',
    ...
}
If you may need to make changes, and later re-initialise it, you can do this in an initialisation function:
def initLookups():
global lookup
lookup = {
'A' : 'AA',
'B' : 'BB',
...
}
(Alternatively, lookup.update({'A':'AA', ...}) to change the dict in-place, affecting all callers with access to the old binding.)
However, if you've got these lookups in some standard format, it may be simpler simply to load it from a file and create the dictionary from that.
You can arrange your functions as you wish. The only rule about ordering is that the accessed variables must exist at the time the function is called - it's fine if the function has references to variables in the body that don't exist yet, so long as nothing actually tries to use that function. ie:
def foo():
    print greeting, "World"  # Note that greeting is not yet defined when foo() is created

greeting = "Hello"
foo()  # Prints "Hello World"
But:
def foo():
    print greeting, "World"

foo()  # Gives an error - greeting not yet defined.
greeting = "Hello"
One further thing to note: your getlookup function is very inefficient. Using "if name in lookup.keys()" is actually getting a list of the keys from the dict, and then iterating over this list to find the item. This loses all the performance benefit the dict gives. Instead, "if name in lookup" would avoid this, or even better, use the fact that .get can be given a default to return if the key is not in the dictionary:
def getlookup(name):
    return lookup.get(name, "")
I think that keeping the names in a flat text file, and loading them at runtime, would be a good alternative. I try to stick to the lowest level of complexity possible with my data, starting with plain text and working up to an RDBMS (I lifted this idea from The Pragmatic Programmer).
Dictionaries are very efficient in python. It's essentially what the whole language is built on. 300 items is well within the bounds of sane dict usage.
names.txt:
A = AAA
B = BBB
C = CCC
getname.py:
import sys

FILENAME = "names.txt"

def main(key):
    pairs = (line.split("=") for line in open(FILENAME))
    names = dict((x.strip(), y.strip()) for x, y in pairs)
    return names.get(key, "Not found")

if __name__ == "__main__":
    print main(sys.argv[-1])
If you really want to keep it all in one module for some reason, you could just stick a string at the top of the module. I think that a big swath of text is less distracting than a huge mess of dict initialization code (and easier to edit later):
import sys

LINES = """
A = AAA
B = BBB
C = CCC
D = DDD
E = EEE""".strip().splitlines()

PAIRS = (line.split("=") for line in LINES)
NAMES = dict((x.strip(), y.strip()) for x, y in PAIRS)

def main(key):
    return NAMES.get(key, "Not found")

if __name__ == "__main__":
    print main(sys.argv[-1])
