How to break from a Python generator with open file handles

I'm writing a Python generator which looks like "cat". My specific use case is for a "grep like" operation. I want it to be able to break out of the generator if a condition is met:
    summary = {}
    for fn in cat("filelist.dat"):
        for line in cat(fn):
            if line.startswith("FOO"):
                summary[fn] = line
                break
So when break happens, I need the cat() generator to finish and close the file handle to fn.
I have to read 100k files with 30 GB of total data, and the FOO keyword happens in the header region, so it is important in this case that the cat() function stops reading the file ASAP.
There are other ways I can solve this problem, but I'm still interested to know how to get an early exit from a generator which has open file handles. Perhaps Python cleans them up right away and closes them when the generator is garbage collected?
Thanks,
Ian

Generators have a close method that raises GeneratorExit at the yield statement. If you specifically catch this exception, you can run some tear-down code:
    import contextlib
    with contextlib.closing(cat(fn)):
        ...
and then in cat:
    try:
        ...
    except GeneratorExit:
        # close the file
If you'd like a simpler way to do this (without using the arcane close method on generators), just make cat take a file-like object instead of a string to open, and handle the file IO yourself:
    for filename in filenames:
        with open(filename) as theFile:
            for line in cat(theFile):
                ...
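However, in CPython you usually don't need to worry about any of this: when the last reference to the generator goes away, garbage collection calls its close method, which lets a pending finally clause or with block release the file. Other implementations (PyPy, for example) may delay that cleanup. Still,
    explicit is better than implicit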
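The explicit route can be sketched in Python 3 as follows. This is only an illustration under stated assumptions: `cat` here is a plain line-yielding generator, and `first_match` and the `opened` list (used only to observe the file handle afterwards) are hypothetical names that do not appear in the question.

```python
import contextlib
import os
import tempfile

opened = []  # hypothetical: remembers file objects so the demo can inspect them


def cat(path):
    """Yield lines from path; the with block closes the file when the
    generator is exhausted or closed early."""
    with open(path) as handle:
        opened.append(handle)
        for line in handle:
            yield line


def first_match(path, prefix):
    """Return the first line starting with prefix, or None."""
    # closing() calls cat(...).close() on exit, which raises GeneratorExit
    # at the paused yield and so unwinds the with block inside cat().
    with contextlib.closing(cat(path)) as lines:
        for line in lines:
            if line.startswith(prefix):
                return line
    return None


# Small demo on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("header\nFOO found\nrest of file\n")
demo_path = tmp.name

result = first_match(demo_path, "FOO")
handle_closed = opened[-1].closed
os.remove(demo_path)
print(result.strip(), handle_closed)
```

Only the lines up to the match are read, and the handle is closed as soon as `first_match` returns rather than whenever the garbage collector gets to it.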

By implementing the context protocol and the iterator protocol in the same object, you can write pretty sweet code like this:
    with cat("/etc/passwd") as lines:
        for line in lines:
            if "mail" in line:
                print line.strip()
                break
This is a sample implementation, tested with Python 2.5 on a Linux box. It reads the lines of /etc/passwd until it finds the one for user mail, and then stops:
    from __future__ import with_statement

    class cat(object):
        def __init__(self, fname):
            self.fname = fname

        def __enter__(self):
            print "[Opening file %s]" % (self.fname,)
            self.file_obj = open(self.fname, "rt")
            return self

        def __exit__(self, *exc_info):
            print "[Closing file %s]" % (self.fname,)
            self.file_obj.close()

        def __iter__(self):
            return self

        def next(self):
            line = self.file_obj.next().strip()
            print "[Read: %s]" % (line,)
            return line

    def main():
        with cat("/etc/passwd") as lines:
            for line in lines:
                if "mail" in line:
                    print line.strip()
                    break

    if __name__ == "__main__":
        import sys
        sys.exit(main())
Or even simpler:
    with open("/etc/passwd", "rt") as f:
        for line in f:
            if "mail" in line:
                break
File objects implement the iterator protocol (see http://docs.python.org/library/stdtypes.html#file-objects)
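For readers on Python 3, the same idea might look like the sketch below. The main differences are that the iterator method is named `__next__` and `print` is a function. The class name `Cat` and the temporary demo file are assumptions made for illustration, not part of the answer above.

```python
import os
import tempfile


class Cat:
    """Context manager and iterator in one object (Python 3 sketch)."""

    def __init__(self, fname):
        self.fname = fname
        self.file_obj = None

    def __enter__(self):
        self.file_obj = open(self.fname, "rt")
        return self

    def __exit__(self, *exc_info):
        self.file_obj.close()
        return False  # do not swallow exceptions raised inside the block

    def __iter__(self):
        return self

    def __next__(self):
        # next() on the file raises StopIteration at end of file,
        # which also ends iteration over this object.
        return next(self.file_obj).strip()


# Demo on a temporary file standing in for /etc/passwd.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("root:x:0\nmail:x:8\naudio:x:29\n")

reader = Cat(tmp.name)
seen = []
with reader as lines:
    for line in lines:
        seen.append(line)
        if "mail" in line:
            break
os.remove(tmp.name)
print(seen, reader.file_obj.closed)
```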

Please also consider this example:
    def itertest():
        try:
            for i in xrange(1000):
                print i
                yield i
        finally:
            print 'finally'

    x = itertest()
    for i in x:
        if i > 2:
            break
    print 'del x'
    del x
    print 'exit'
Output:
    0
    1
    2
    3
    del x
    finally
    exit
It shows that finally is run after the iterator is cleaned up. I think __del__(self) is calling self.close(), see also here: https://docs.python.org/2.7/reference/expressions.html#generator.close
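The same effect can be triggered deterministically by calling `close()` yourself instead of waiting for the generator to be collected. A minimal Python 3 sketch, where the `events` list is a hypothetical stand-in for the print statements so the order can be checked:

```python
events = []  # hypothetical log used instead of print


def itertest():
    try:
        for i in range(1000):
            yield i
    finally:
        # Runs when the generator is exhausted, closed, or collected.
        events.append("finally")


gen = itertest()
first = next(gen)          # advance to the first yield
events.append("before close")
gen.close()                # raises GeneratorExit at the paused yield
events.append("after close")
print(events)
```

Here the finally clause runs during the `close()` call itself, not at some later point chosen by the interpreter.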

There seems to be another possibility using try..finally (tested on Python 2.7.6):
    def gen():
        i = 0
        try:
            while True:
                print 'yield %i' % i
                yield i
                i += 1
            print 'will never get here'
        finally:
            print 'done'

    for i in gen():
        if i > 1:
            print 'break'
            break
        print i
Gives me the following printout:
    yield 0
    0
    yield 1
    1
    yield 2
    break
    done

Related

Getting a multiline expression from a function

I'll try to be clear; please ask any questions you need.
I'm working on mezcla, just trying to make things a little bit more pythonic.
To be specific, in debug there's a function called assertion, which takes an expression, evaluates it, and gives an error message without raising an exception.
It works by passing the function an expression, like
    from debug import assertion

    def func():
        return 2+2==5

    def probe(expr):
        print(assertion(expr))

    probe(
        func()
    )
    ## OR alone like
    assertion(
        2+2
        ==5)
And it should take the expression itself and print it. I'm looking for a way to get and evaluate a multiline expression, just like icecream, for example, can do this way:
    In [2]: ic(2+2
       ...:    ==5
       ...:    )
    ic| 2+2
        ==5: False
I tried the lengthy code below. It reads the expression from the IPython history, or it uses inspect to get the file and line and then iterates line by line through the script looking for the closing parenthesis.
    import inspect

    def read_line(filename, line_number):
        """Returns contents of FILENAME at LINE_NUMBER"""
        # ex: "debugging" in read_line(os.path.join(os.getcwd(), "debug.py"), 3)
        try:
            with open(filename) as file:
                line_contents = file.readlines()[line_number - 1]
        except OSError:
            line_contents = "<stdin>"
        except:
            line_contents = "???"
        return line_contents

    def multi_assert(assertion, line_number=1):
        """Handles multiline assertions until 10 lines"""
        line = ""
        counter = 0
        # Bounded loop (just to prevent an infinite loop)
        while counter < 10:
            counter += 1
            # Reads line, appends it and compares number of ()
            line += read_line(assertion, line_number)
            if line.count("(") > line.count(")"):
                line_number += 1
            else:
                return line

    def assertion(expression, message=None):
        """Issue warning if EXPRESSION doesn't hold, along with optional MESSAGE
        Note: This is a "soft assertion" that doesn't raise an exception (n.b., provided the test doesn't do so)"""
        if (not expression):
            # Get source information for failed assertion
            (_frame, filename, line_number, _function, _context, _index) = inspect.stack()[1]
            # Read statement in file and extract assertion expression
            # Calls to multi_assert to handle multiline
            statement = multi_assert(str(filename), line_number + 1)
            # If statement is from stdin, tries to get assert from ipython history
            if statement == "<stdin>" and _context is not None:
                try:
                    ip = get_ipython()
                    statement = str(ip.history_manager.get_tail(1, raw=True, include_latest=True))
                except:
                    statement = str(_context).replace(")\\n']", "")
            return statement
It works, but it's too heavy, hacky and not especially elegant, so I'm looking for any other way to return the given assertion even if it is multiline. Any kind of suggestion will be accepted and appreciated. Thanks
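One possible direction, not taken from the post itself: on Python 3.8 or later, `ast` nodes carry `end_lineno`, and `ast.get_source_segment` returns the exact text of a node, so the parenthesis counting could be replaced by a lookup in the parsed tree. The sketch below assumes the source text is available as a string (for example read from the file reported by `inspect`); `call_source_at` is a hypothetical helper name.

```python
import ast


def call_source_at(source, line_number):
    """Return the text of the outermost call expression spanning
    line_number (1-based) in source, or None if there is none."""
    tree = ast.parse(source)
    best = None  # (line span, node) of the widest call seen so far
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and node.lineno <= line_number <= node.end_lineno:
            span = node.end_lineno - node.lineno
            # ast.walk visits parents before children, so with a strict
            # comparison an enclosing call wins over a nested one.
            if best is None or span > best[0]:
                best = (span, node)
    if best is None:
        return None
    return ast.get_source_segment(source, best[1])


SAMPLE = "x = 1\nassertion(\n    2+2\n    ==5)\ny = 2\n"
found = call_source_at(SAMPLE, 3)
print(found)
```

This does not cover interactive sessions, where no source file exists; the IPython history fallback from the question would still be needed there.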

Prevent 'try' to catch an exception and pass to the next line in python

I have a python function that runs other functions.
    def main():
        func1(a, b)
        func2(*args, **kwargs)
        func3()
Now I want to apply exception handling to the main function. If any of the functions inside main raises an exception, main should not stop but continue executing the next line. In other words, I want the functionality below:
    def main():
        try:
            func1()
        except:
            pass
        try:
            func2()
        except:
            pass
        try:
            func3()
        except:
            pass
So is there any way to loop through each statement inside the main function and apply exception handling to each line?
    for line in main_function:
        try:
            line
        except:
            pass
I just don't want to write the exception handling inside the main function.
Note : How to prevent try catching every possible line in python? this question comes close to solving this problem, but I can't figure out how to loop through lines in a function.
If you have any other way to do this other than looping, that would help too.

What you want is an option that exists in some languages, where an exception handler can choose to resume at the next statement. This used to lead to poor code and AFAIK has never been implemented in Python. The rationale is that you must explicitly say how you want to process an exception and where you want to continue.
In your case, assuming that you have a function called main that only calls other functions and is generated automatically, my advice would be to post-process it between its generation and its execution. The inspect module even allows doing this at run time:
    import inspect
    import re

    def filter_exc(func):
        src = inspect.getsource(func)
        lines = src.split('\n')
        out = lines[0] + "\n"
        for line in lines[1:]:
            m = re.match(r'(\s*)(.*)', line)
            lead, text = m.groups()
            # ignore comments and empty lines
            if not (text.startswith('#') or text.strip() == ""):
                out += lead + "try:\n"
                out += lead + "    " + text + "\n"
                out += lead + "except:\n" + lead + "    pass\n"
        return out
You can then use the evil exec (the input is only the source of your own function):
    exec(filter_exc(main))   # replaces main with the filtered version
    main()                   # will ignore exceptions
After your comment, you want a more robust solution that can cope with multi line statements and comments. In that case, you need to actually parse the source and modify the parsed tree. ast module to the rescue:
    import ast

    class ExceptFilter(ast.NodeTransformer):
        def visit_Expr(self, node):
            self.generic_visit(node)
            if isinstance(node.value, ast.Call):  # filter all function calls
                # print(node.value.func.id)
                # use a dummy try block
                n = ast.parse("""try:
        f()
    except:
        pass""").body[0]
                n.body[0] = node  # make the try call the real function
                return n          # and use it
            return node  # keep other nodes unchanged
With that example code:
    def func1():
        print('foo')

    def func2():
        raise Exception("Test")

    def func3(x):
        print("f3", x)

    def main():
        func1()
        # this is a comment
        a = 1
        if a == 1:  # this is a multi line statement
            func2()
        func3("bar")
we get:
    >>> node = ast.parse(inspect.getsource(main))
    >>> exec(compile(ExceptFilter().visit(node), "", mode="exec"))
    >>> main()
    foo
    f3 bar
In that case, the unparsed node(*) is written as:
    def main():
        try:
            func1()
        except:
            pass
        a = 1
        if (a == 1):
            try:
                func2()
            except:
                pass
        try:
            func3('bar')
        except:
            pass
Alternatively it is also possible to wrap every top level expression:
    >>> node = ast.parse(inspect.getsource(main))
    >>> for i in range(len(node.body[0].body)):   # process top level expressions
    ...     n = ast.parse("""try:
    ...     f()
    ... except:
    ...     pass""").body[0]
    ...     n.body[0] = node.body[0].body[i]
    ...     node.body[0].body[i] = n
    ...
    >>> exec(compile(node, "", mode="exec"))
    >>> main()
    foo
    f3 bar
Here the unparsed tree reads:
    def main():
        try:
            func1()
        except:
            pass
        try:
            a = 1
        except:
            pass
        try:
            if (a == 1):
                func2()
        except:
            pass
        try:
            func3('bar')
        except:
            pass
BEWARE: there is an interesting corner case if you use exec(compile(... in a function. By default exec(code) is exec(code, globals(), locals()). At top level, the local and global dictionaries are the same dictionary, so the top level function is correctly replaced. But if you do the same in a function, you only create a local function with the same name, which can only be called from within that function as locals()['main']() and goes out of scope when the function returns. So you must either alter the global function by explicitly passing the global dictionary:
    exec(compile(ExceptFilter().visit(node), "", mode="exec"), globals(), globals())
or return the modified function without altering the original one:
    def myfun():
        # print(main)
        node = ast.parse(inspect.getsource(main))
        exec(compile(ExceptFilter().visit(node), "", mode="exec"))
        # print(main, locals()['main'], globals()['main'])
        return locals()['main']

    >>> m2 = myfun()
    >>> m2()
    foo
    f3 bar
(*) Python 3.6 contains an unparser in Tools/parser, but a simpler to use version exists on PyPI
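On Python 3.9 and later the standard library ships `ast.unparse`, so no separate unparser is needed to inspect the result. The sketch below is a variation on the transformer above rather than the answer's exact code: it builds the `ast.Try` node directly instead of parsing a template, catches `Exception` instead of using a bare except, and works from a source string so that it is self-contained. The names `SOURCE`, `calls`, `raise_error`, and `namespace` are illustrative assumptions.

```python
import ast


class ExceptFilter(ast.NodeTransformer):
    """Wrap every expression statement that is a call in try/except."""

    def visit_Expr(self, node):
        self.generic_visit(node)
        if not isinstance(node.value, ast.Call):
            return node  # keep other expression statements unchanged
        handler = ast.ExceptHandler(
            type=ast.Name(id="Exception", ctx=ast.Load()),
            name=None,
            body=[ast.Pass()],
        )
        return ast.Try(body=[node], handlers=[handler], orelse=[], finalbody=[])


SOURCE = '''
def main():
    calls.append("func1")
    raise_error()
    calls.append("func3")
'''


def raise_error():
    raise ValueError("Test")


tree = ExceptFilter().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)  # new nodes need line/column information

calls = []
namespace = {"calls": calls, "raise_error": raise_error}
exec(compile(tree, "<filtered>", "exec"), namespace)
namespace["main"]()  # the ValueError is swallowed, the third call still runs

if hasattr(ast, "unparse"):  # available from Python 3.9
    print(ast.unparse(tree))
print(calls)
```

Passing an explicit `namespace` dictionary to `exec` also sidesteps the locals/globals corner case described above.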

You could use a callback, like this:
    def main(list_of_funcs):
        for func in list_of_funcs:
            try:
                func()
            except Exception as e:
                print(e)

    if __name__ == "__main__":
        main([func1, func2, func3])
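If the functions take arguments, as `func1(a, b)` does in the question, one way to keep this loop is to bind the arguments first with `functools.partial`. The following Python 3 sketch also collects the exceptions instead of printing them; `run_all`, `output`, and the three demo functions are hypothetical names chosen for illustration.

```python
import functools


def run_all(funcs):
    """Call each function in turn; collect exceptions instead of stopping."""
    errors = []
    for func in funcs:
        try:
            func()
        except Exception as exc:
            # partial objects have no __name__, so fall back to repr()
            errors.append((getattr(func, "__name__", repr(func)), exc))
    return errors


output = []


def func1(a, b):
    output.append(a + b)


def func2():
    raise RuntimeError("boom")


def func3():
    output.append("done")


errors = run_all([functools.partial(func1, 1, 2), func2, func3])
print(output, errors)
```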

Cleanup of iterators that have not been fully exhausted

My main usage of generators is processing of rows of CSV files stored on a remote server. It allows me to have consistent interfaces of linearly processing the data stored in them.
Now, I am using paramiko in order to access an SFTP server that stores the files - and paramiko has an outstanding issue of not properly closing connections if you did not close the file itself.
I've got a simple interface for accessing a single file on the SFTP server (this is obviously pseudocode; I am omitting the connection error handling code and so on).
    def sftp_read_file(filename):
        with paramiko.open(filename) as file_obj:
            for item in csv.reader(file_obj):
                yield item

    def csv_append_column(iter_obj, col_name, col_val):
        # header
        yield next(iter_obj) + (col_name, )
        for item in iter_obj:
            yield item + (col_val, )
Let's say I would like to test a set of transformations done to the file by running the script for a limited number of rows:
    def main():
        for i, item in enumerate(csv_append_column(sftp_read_file('sftp://...'), 'A', 'B')):
            print(item)
            if i > 0 and i % 100 == 0:
                break
The script will exit, but the interpreter will never terminate without SIGINT. What are my possible solutions?

This isn't the most elegant solution, but maybe we could build off @tadhg-mcdonald-jensen's suggestion by wrapping the generator in an object:
    class Stoppable(object):
        def __init__(self, fn):
            self.generator = fn

        def __enter__(self):
            return self.generator

        def __exit__(self, type_, value, traceback):
            self.generator.close()
And then use it like this:
    def main():
        with Stoppable(sftp_read_file('sftp://...')) as reader:
            for i, item in enumerate(csv_append_column(reader, 'A', 'B')):
                print(item)
                if i > 0 and i % 100 == 0:
                    break
Alternatively, we can just wrap the generator itself if we aren't using the generator methodology for streaming:
    def stopit(fn):
        rg = [x for x in fn]
        for x in rg:
            yield x
Now we can call it like:
    def main():
        for i, item in enumerate(csv_append_column(stopit(sftp_read_file('...')), 'A', 'B')):
            print(item)
            if i > 0 and i % 100 == 0:
                break
This will make sure the with block exits and paramiko closes the sftp connection but comes at the expense of reading all of the lines into memory at once.
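A streaming variant that avoids loading everything into memory can be sketched with `contextlib.closing`, which calls `close()` on the source generator when its block ends. Since no SFTP server is available in an example, `FakeRemoteFile` below is a hypothetical stand-in for the remote file handle, and `read_rows` and `append_column` are simplified versions of the functions from the question.

```python
import contextlib


class FakeRemoteFile:
    """Hypothetical stand-in for a remote file that must be closed."""

    def __init__(self, rows):
        self.rows = rows
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.closed = True

    def __iter__(self):
        return iter(self.rows)


def read_rows(remote):
    # The with block exits when this generator is exhausted or closed.
    with remote as file_obj:
        for row in file_obj:
            yield row


def append_column(rows, col_name, col_val):
    rows = iter(rows)
    yield next(rows) + (col_name,)   # header
    for row in rows:
        yield row + (col_val,)


remote = FakeRemoteFile([("id",), (1,), (2,), (3,)])
taken = []
with contextlib.closing(read_rows(remote)) as source:
    for i, row in enumerate(append_column(source, "A", "B")):
        taken.append(row)
        if i >= 1:
            break
print(taken, remote.closed)
```

Only the innermost generator, the one that owns the resource, needs to be closed explicitly; the wrapping generators hold no resources of their own.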

Scope of nested function declarations

I have the following code:
    def function_reader(path):
        line_no = 0
        with open(path, "r") as myfile:
            def readline():
                line_no +=1
                return myfile.readline()
Python keeps returning:
    UnboundLocalError: local variable 'line_no' referenced before assignment
when executing line_no +=1.
I understand that the problem is that nested function declarations have weird scoping in python (though I do not understand why it was programmed this way). I'm mostly wondering if there is a simple way to help python resolve the reference, since I really like the functionality this would provide.

Unfortunately, there is not a way to do this in Python 2.x. Nested functions can only read names in the enclosing function, not reassign them.
One workaround would be to make line_no a list and then alter its single item:
    def function_reader(path):
        line_no = [0]
        with open(path, "r") as myfile:
            def readline():
                line_no[0] += 1
                return myfile.readline()
You would then access the line number via line_no[0]. Below is a demonstration:
    >>> def outer():
    ...     data = [0]
    ...     def inner():
    ...         data[0] += 1
    ...     inner()
    ...     return data[0]
    ...
    >>> outer()
    1
    >>>
This solution works because we are not reassigning the name line_no, only mutating the object that it references.
Note that in Python 3.x, this problem would be easily solved using the nonlocal statement:
    def function_reader(path):
        line_no = 0
        with open(path, "r") as myfile:
            def readline():
                nonlocal line_no
                line_no += 1
                return myfile.readline()
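A small runnable Python 3 illustration of the `nonlocal` approach follows. It uses an in-memory `io.StringIO` instead of a real file, and `make_reader` and `current_line` are hypothetical names introduced for the example.

```python
import io


def make_reader(stream):
    line_no = 0

    def readline():
        nonlocal line_no          # rebind the variable of the enclosing scope
        line = stream.readline()
        if line:                  # count only lines that were actually read
            line_no += 1
        return line

    def current_line():
        return line_no

    return readline, current_line


readline, current_line = make_reader(io.StringIO("a\nb\n"))
first = readline()
second = readline()
print(first.strip(), second.strip(), current_line())
```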

It's hard to say what you're trying to achieve here by using closures. But the problem is that with this approach you'll end up either with a ValueError: I/O operation on closed file when you return readline from the outer function, or with just the first line if you return readline() from the outer function.
If all you wanted to do is call readline() repeatedly or loop over the file and also remember the current line number, then better use a class:
    class FileReader(object):
        def __init__(self, path):
            self.line_no = 0
            self.file = open(path)

        def __enter__(self):
            return self

        def __iter__(self):
            return self

        def next(self):
            line = next(self.file)
            self.line_no += 1
            return line

        def readline(self):
            return next(self)

        def __exit__(self, *args):
            self.file.close()
Usage:
    with FileReader('file.txt') as f:
        print next(f)
        print next(f)
        print f.readline()
        print f.line_no  # prints 3
        for _ in xrange(3):
            print f.readline()
        print f.line_no  # prints 6
        for line in f:
            print line
            break
        print f.line_no  # prints 7

The more Pythonic way to get the next line and keep track of the line number is with the enumerate builtin:
    with open(path, "r") as myfile:
        for no, line in enumerate(myfile, start=1):
            pass  # process line
This will work in all current Python versions.

How to use traceit to report function input variables in stack trace

I've been using the following code to trace the execution of my programs:
    import sys
    import linecache
    import random

    def traceit(frame, event, arg):
        if event == "line":
            lineno = frame.f_lineno
            filename = frame.f_globals["__file__"]
            if filename == "<stdin>":
                filename = "traceit.py"
            if (filename.endswith(".pyc") or
                filename.endswith(".pyo")):
                filename = filename[:-1]
            name = frame.f_globals["__name__"]
            line = linecache.getline(filename, lineno)
            print "%s:%s:%s: %s" % (name, lineno,frame.f_code.co_name , line.rstrip())
        return traceit

    def main():
        print "In main"
        for i in range(5):
            print i, random.randrange(0, 10)
        print "Done."

    sys.settrace(traceit)
    main()
Using this code, or something like it, is it possible to report the values of certain function arguments? In other words, the above code tells me "which" functions were called, and I would like to know "what" the corresponding values of the input variables are for those function calls.
Thanks in advance.

The traceit function that you posted can be used to print information as each line of code is executed. If all you need is the function name and arguments when certain functions are called, I would suggest using this trace decorator instead:
    import functools

    def trace(f):
        '''This decorator shows how the function was called'''
        @functools.wraps(f)
        def wrapper(*arg,**kw):
            arg_str=','.join(['%r'%a for a in arg]+['%s=%s'%(key,kw[key]) for key in kw])
            print "%s(%s)" % (f.__name__, arg_str)
            return f(*arg, **kw)
        return wrapper
You could use it as follows:
    @trace
    def foo(*args,**kws):
        pass

    foo(1)
    # foo(1)
    foo(y=1)
    # foo(y=1)
    foo(1,2,3)
    # foo(1,2,3)
Edit: Here is an example using trace and traceit in conjunction:
Below, trace is used in 2 different ways. The normal way is to decorate functions you define:
    @trace
    def foo(i):
        ...
But you can also "monkey-patch" any function whether you defined it or not like this:
    random.randrange=trace(random.randrange)
So here's the example:
    import sys
    import linecache
    import random
    import functools

    def trace(f):
        '''This decorator shows how the function was called'''
        @functools.wraps(f)
        def wrapper(*arg,**kw):
            arg_str=','.join(['%r'%a for a in arg]+['%s=%s'%(key,kw[key]) for key in kw])
            print "%s(%s)" % (f.__name__, arg_str)
            return f(*arg, **kw)
        return wrapper

    def traceit(frame, event, arg):
        if event == "line":
            lineno = frame.f_lineno
            filename = frame.f_globals["__file__"]
            if filename == "<stdin>":
                filename = "traceit.py"
            if (filename.endswith(".pyc") or
                filename.endswith(".pyo")):
                filename = filename[:-1]
            name = frame.f_globals["__name__"]
            line = linecache.getline(filename, lineno)
            print "%s:%s:%s: %s" % (name, lineno,frame.f_code.co_name , line.rstrip())
        return traceit

    random.randrange=trace(random.randrange)

    @trace
    def foo(i):
        print i, random.randrange(0, 10)

    def main():
        print "In main"
        for i in range(5):
            foo(i)
        print "Done."

    sys.settrace(traceit)
    main()
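For Python 3, where `print` is a function, the decorator might be sketched as below. It is a straightforward port rather than the answer's exact code: keyword values are shown with `%r` as well, and the traced function `add` is a hypothetical example.

```python
import functools


def trace(func):
    """Print the call, with its arguments, before running the function."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        parts = [repr(a) for a in args]
        parts += ["%s=%r" % (key, value) for key, value in kwargs.items()]
        print("%s(%s)" % (func.__name__, ", ".join(parts)))
        return func(*args, **kwargs)
    return wrapper


@trace
def add(a, b=0):
    return a + b


total = add(1, b=2)
```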

frame.f_locals will give you the values of the local variables, and I guess you could keep track of the last frame you've seen and if frame.f_back is not the lastframe dump frame.f_locals.
I'd predict though that you're pretty quickly going to be snowed under with too much data doing this.
Here's your code modified to do this:
    import sys
    import linecache
    import random

    class Tracer(object):
        def __init__(self):
            self.lastframe = None

        def traceit(self, frame, event, arg):
            if event == "line":
                lineno = frame.f_lineno
                filename = frame.f_globals["__file__"]
                if filename == "<stdin>":
                    filename = "traceit.py"
                if (filename.endswith(".pyc") or
                    filename.endswith(".pyo")):
                    filename = filename[:-1]
                name = frame.f_globals["__name__"]
                line = linecache.getline(filename, lineno)
                if frame.f_back is self.lastframe:
                    print "%s:%s:%s: %s" % (name, lineno,frame.f_code.co_name , line.rstrip())
                else:
                    print "%s:%s:%s(%s)" % (name, lineno,frame.f_code.co_name , str.join(', ', ("%s=%r" % item for item in frame.f_locals.iteritems())))
                    print "%s:%s:%s: %s" % (name, lineno,frame.f_code.co_name , line.rstrip())
                    #print frame.f_locals
                self.lastframe = frame.f_back
            return self.traceit

    def main():
        print "In main"
        for i in range(5):
            print i, random.randrange(0, 10)
        print "Done."

    sys.settrace(Tracer().traceit)
    main()
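A lighter alternative, sketched here for Python 3, is to react only to the "call" event and read the argument values from `frame.f_locals`, using `co_varnames` and `co_argcount` to pick out the positional parameters. The helper `make_call_tracer`, the `call_log` list, and the demo function `scale` are hypothetical; keyword-only parameters and `*args`/`**kwargs` are not covered by `co_argcount` and would need extra handling.

```python
import sys


def make_call_tracer(log, function_names):
    """Build a trace function that records arguments of selected functions."""
    def tracer(frame, event, arg):
        code = frame.f_code
        if event == "call" and code.co_name in function_names:
            # The first co_argcount entries of co_varnames are the
            # positional parameters of the function being called.
            arg_names = code.co_varnames[:code.co_argcount]
            log.append((code.co_name, {n: frame.f_locals[n] for n in arg_names}))
        return None  # no per-line tracing inside the called frame
    return tracer


def scale(value, factor):
    return value * factor


call_log = []
previous = sys.gettrace()
sys.settrace(make_call_tracer(call_log, {"scale"}))
try:
    total = scale(3, factor=4)
finally:
    sys.settrace(previous)  # restore whatever tracer was active before
print(call_log)
```

Restricting the tracer to a set of function names keeps the output manageable, which addresses the concern about being flooded with data.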

web.py had a method called "upvars" that did something similar, taking in the variables from the calling frame. Note the comment:
    def upvars(level=2):
        """Guido van Rossum sez: don't use this function."""
        return dictadd(
            sys._getframe(level).f_globals,
            sys._getframe(level).f_locals)

What is much more useful to me in tracing than dumping ALL the state of variables at the time of execution is to do an eval of each and every line of code, i.e.:
    for modname in modnames:                  | for init in ., init, encoding
                                              |
        if not modname or '.' in modname:     |     if not init or '.' in init
            continue                          |         continue
                                              |
        try:                                  |
That is, the real line of code is on the left and the same line with the variable values substituted in is on the right. I've implemented this in Perl and it is a LIFESAVER there. I'm in the process of trying to implement this in Python, but I'm not as familiar with the language so it will take a bit of time.
Anyway, if anybody has ideas on how to implement this, I'd love to hear them. As far as I can tell, it comes down to this function
    interpolate_frame(frame, string)
where frame is the frame passed to the trace function, and string is the line of code to be interpolated with the variables in the current frame. Then, the above code becomes:
    print "%s:%s:%s: %s|%s" % (name, lineno,frame.f_code.co_name,
        padded(line.rstrip(),10),padded(interpolate_frame(frame, line.rstrip()),100))
I'm going to try to hack this up, but again, if anybody has ideas on this I'm welcome to hear them.
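One rough way to approach such an interpolation function, offered only as a sketch: replace every identifier in the line that is a key in a variables mapping with the `repr` of its value. The function below takes the mapping directly so that it can be tried without a trace function; inside a tracer one would presumably pass `frame.f_locals`. The names `interpolate` and `IDENTIFIER` are hypothetical. A known limitation is that the regular expression does not understand Python syntax, so names inside string literals or after a dot are substituted too.

```python
import re

# Matches Python-style identifiers; keywords are left alone simply because
# they never appear as keys in the variables mapping.
IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def interpolate(line, variables):
    """Return line with known variable names replaced by repr(value)."""
    def substitute(match):
        name = match.group(0)
        if name in variables:
            return repr(variables[name])
        return name
    return IDENTIFIER.sub(substitute, line)


example = interpolate("if not modname or '.' in modname:", {"modname": "init"})
print(example)
```

A more robust version could tokenize the line with the standard `tokenize` module and substitute only NAME tokens.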
