Python generators: Errors visible only after commenting

I was trying the following Python code to simulate the 'tail' command of *nix systems.
import sys
def tail(f):
    print 'in tail with ',f
    f.seek(0,2)
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

if(len(sys.argv) >= 2):
    print 'calling tail'
    tail(open(sys.argv[1],'r'))
else:
    print 'Give file path.\n'
I made an error (I missed importing the time module). However, what's odd is that no error was thrown and the program quit silently.
Output (before commenting):
$ python tail.py /var/log/dmesg
calling tail
However, if I comment out the lines following the one that uses the time module, the error does get thrown.
import sys
def tail(f):
    print 'in tail with ',f
    f.seek(0,2)
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.1)
            # continue
            # yield line

if(len(sys.argv) >= 2):
    print 'calling tail'
    tail(open(sys.argv[1],'r'))
else:
    print 'Give file path.\n'
Output (after commenting):
$ python tail.py /var/log/dmesg
calling tail
in tail with <open file '/var/log/dmesg', mode 'r' at 0x7fc8fcf1e5d0>
Traceback (most recent call last):
  File "tail.py", line 14, in <module>
    tail(open(sys.argv[1],'r'))
  File "tail.py", line 8, in tail
    time.sleep(0.1)
NameError: global name 'time' is not defined
Can anyone please explain why the error was not thrown in the first case (before commenting)? Shouldn't an error be thrown as soon as the interpreter reaches that line?
Corrected program:
import sys
import time
def tail(f):
    print 'in tail with ',f
    f.seek(0,2)
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

if(len(sys.argv) >= 2):
    print 'calling tail'
    t = tail(open(sys.argv[1],'r'))
    for i in t:
        print i
else:
    print 'Give file path.\n'
Output:
$ python tail.py hello.txt
calling tail
in tail with <open file 'hello.txt', mode 'r' at 0x7fac576b95d0>
hello there 1
hello there 2
hello there 3
Thanks for the responses.

Short Answer
The first one instantiates a generator (without assigning it to a variable); the second one is an ordinary function call.
Long Answer
This happens because of Python's dynamic nature: once your function contains a yield statement, it is a generator function, and this line -
tail(open(sys.argv[1],'r'))
instantiates a generator rather than executing the function body. You'll only get that error once you assign the instance to a variable and call the generator's next method, which actually fires it up, i.e. -
t = tail(open(sys.argv[1],'r')) # t is a generator here
t.next()
In the other case, where you removed the yield statement, the function behaves as a normal function again, which means tail(open(sys.argv[1],'r')) is now a plain function call, and hence it threw the error.
What I meant by dynamic is that Python doesn't check for these kinds of errors until it actually executes that statement, which in the first case it never did.

With yield in the function, it is a generator. Generator functions only execute their contained code when the next value is requested. Simply calling a generator function merely creates that generator object. If you do so without doing anything with that object, such as looping through it, nothing will happen.
Removing the yield makes the function evaluate eagerly, so its code is actually executed.
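A minimal sketch of that difference (the function name gen and the deliberately undefined name are illustrative, not from the original program):

```python
def gen():
    undefined_name  # NameError waiting here, but only when the body runs
    yield 1

g = gen()  # no error: this merely creates the generator object
try:
    next(g)  # the body executes now, so the NameError finally surfaces
except NameError as e:
    print(type(e).__name__)  # → NameError
```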
If you actually iterated over the generator, it would produce an error if/when readline() produced an empty line. Since such an empty line can only occur at the end of a file (what look like blank lines actually contain a single linefeed character), putting it in a loop doesn't make sense anyway. Instead of this:
while True:
    line = f.readline()
    if not line:
        time.sleep(0.1)
        continue
    yield line
Use this:
for line in f:
    yield line
And instead of this:
if(len(sys.argv) >= 2):
    print 'calling tail'
    tail(open(sys.argv[1],'r'))
You should actually execute the generator's contents, with something like this:
if(len(sys.argv) >= 2):
    print 'calling tail'
    for line in tail(open(sys.argv[1],'r')):
        print line

Related

concurrent.futures.ThreadPoolExecutor doesn't print errors

I am trying to use the concurrent.futures.ThreadPoolExecutor module to run a class method in parallel; a simplified version of my code is pretty much the following:
import concurrent.futures
import threading
import time

class TestClass:
    def __init__(self, secondsToSleepFor):
        self.secondsToSleepFor = secondsToSleepFor

    def testMethodToExecInParallel(self):
        print("ThreadName: " + threading.currentThread().getName())
        print(threading.currentThread().getName() + " is sleeping for " + str(self.secondsToSleepFor) + " seconds")
        time.sleep(self.secondsToSleepFor)
        print(threading.currentThread().getName() + " has finished!!")

with concurrent.futures.ThreadPoolExecutor(max_workers = 2) as executor:
    futuresList = []
    print("before try")
    try:
        testClass = TestClass(3)
        future = executor.submit(testClass.testMethodToExecInParallel)
        futuresList.append(future)
    except Exception as exc:
        print('Exception generated: %s' % exc)
If I execute this code, it seems to behave as intended.
But if I make a mistake, like specifying a wrong number of parameters in "testMethodToExecInParallel":
def testMethodToExecInParallel(self, secondsToSleepFor):
and then still submitting the function as:
future = executor.submit(testClass.testMethodToExecInParallel)
or trying to concatenate a string object with an integer object (without using str()) inside a print statement in the "testMethodToExecInParallel" method:
def testMethodToExecInParallel(self):
    print("ThreadName: " + threading.currentThread().getName())
    print("self.secondsToSleepFor: " + self.secondsToSleepFor)  # <-- should report an error here
the program doesn't report any error; it just prints "before try" and ends execution...
It's easy to see that this makes the program nearly undebuggable... Could someone explain to me why this behaviour happens?
(For the first kind of mistake:) shouldn't concurrent.futures.ThreadPoolExecutor check that the submitted function can be called with the given arguments and, if not, throw some sort of "noSuchFunction" exception?
Maybe there is some sort of problem with submitting class methods to a ThreadPoolExecutor instead of simple standalone functions, so this behaviour could be expected?
Or maybe the error is thrown inside the thread and for some reason I can't read it?
-- EDIT --
Akshay.N's suggestion of calling future.result() after submitting functions to the ThreadPoolExecutor makes the program behave as expected: it runs fine if the code is correct, and prints the error if something in the code is wrong.
I think users must be warned about this very surprising behaviour of ThreadPoolExecutor:
if you only submit functions to ThreadPoolExecutor WITHOUT THEN CALLING future.result():
- if the code is correct, the program goes on and behaves as expected
- if something in the code is wrong, the program seems not to run the submitted function at all: it doesn't report the errors in the code
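The behaviour can be sketched with only the standard library (the function name boom is mine, not from the original code): an exception raised inside a submitted callable is stored on the Future object and only surfaces when result() or exception() is called.

```python
import concurrent.futures

def boom():
    raise ValueError("raised inside the worker")

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(boom)
    # Nothing is printed here: the exception is captured by the Future.
    err = future.exception()  # waits for completion, returns the stored exception

print(type(err).__name__)  # → ValueError
```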
As far as my knowledge goes, which is "not so far", you have to call future.result() after executor.submit(testClass.testMethodToExecInParallel) in order to have any exception raised inside the worker re-raised in your code.
I have tried what you said and it gives me the error; below is the code:
>>> import concurrent.futures as cf
>>> executor = cf.ThreadPoolExecutor(1)
>>> def a(x,y):
...     print(x+y)
...
>>> future = executor.submit(a, 2, 35, 45)
>>> future.result()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\concurrent\futures\_base.py", line 425, in result
    return self.__get_result()
  File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
TypeError: a() takes 2 positional arguments but 3 were given
Let me know if it still doesn't work

Better debugging: expected a character buffer object

I am trying to pass a string section to the Python function below it.
I am uncertain why I am seeing this error. My understanding is that the function is not getting a string where one is expected. I have tried casting, but that is not working either. How can I solve this, or get more debug info?
section = str('[log]')
some_var = 'filename ='
edit_ini('./bench_config.ini', section, some_var, 'logs/ops_log_1')
The function causing the error
def edit_ini(filename, section, some_var, value):
    section = False
    flist = open(filename, 'r').readlines()
    f = open(filename+'test', 'w')
    for line in flist:
        line = str(line)
        print line
        if line.startswith(section):
            section = True
        if( section == True ):
            if( line.startswith(some_var) ):
                modified = "%s = $s", variable, value
                print >> f, modified
                section = False
        else:
            print >> f, line
    f.close()
However I see the error:
Traceback (most recent call last):
  File "bench.py", line 89, in <module>
    edit_ini('./config.ini', section, some_var, 'logs/log_1')
  File "bench.py", line 68, in edit_ini
    if line.startswith(section):
TypeError: expected a character buffer object
You overwrite the passed-in section with section = False. The error occurs because you cannot call line.startswith(False): startswith expects a string argument.
A good way to debug Python is to use pdb. It would have helped you find your problem here. You should read the docs, but here's a quick example of how to use pdb.
import pdb
# your code ...
pdb.set_trace() # put this right before line.startswith(section)
Then when you run your code, you will execute up to right before the failure. Then you can print section in pdb, and see that it is False, and then try to figure out why it is False.
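For completeness, a sketch of a fixed version, assuming a separately named flag (in_section is my own name) so the passed-in section string is never overwritten; the output file naming follows the original code:

```python
def edit_ini(filename, section, some_var, value):
    in_section = False  # renamed flag: 'section' keeps its string value
    with open(filename, 'r') as src:
        lines = src.readlines()
    with open(filename + 'test', 'w') as f:
        for line in lines:
            if line.startswith(section):
                in_section = True
            if in_section and line.startswith(some_var):
                # write the replacement value instead of the original line
                f.write("%s %s\n" % (some_var, value))
                in_section = False
            else:
                f.write(line)
```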

Failures with Python multiprocessing.Pool when maxtasksperchild is set

I am using Python 2.7.8 on Linux and am seeing a consistent failure in a program that uses multiprocessing.Pool(). When I set maxtasksperchild to None, then all is well, when testing across a variety of values for processes. But if I set maxtasksperchild=n (n>=1), then I invariably end with an uncaught exception. Here is the main block:
if __name__ == "__main__":
options = parse_cmdline()
subproc = Sub_process(options)
lock = multiprocessing.Lock()
[...]
pool = multiprocessing.Pool(processes=options.processes,
maxtasksperchild=options.maxtasksperchild)
imap_it = pool.imap(recluster_block, subproc.input_block_generator())
#import pdb; pdb.set_trace()
for count, result in enumerate(imap_it):
print "Count = {}".format(count)
if result is None or len(result) == 0:
# presumably error was reported
continue
(interval, block_id, num_hpcs, num_final, retlist) = result
for c in retlist:
subproc.output_cluster(c, lock)
print "About to close_outfile."
subproc.close_outfile()
print "About to close pool."
pool.close()
print "About to join pool."
pool.join()
For debugging I have added a print statement showing the number of times through the loop. Here are a couple runs:
$ $prog --processes=2 --maxtasksperchild=2
Count = 0
Count = 1
Count = 2
Traceback (most recent call last):
  File "[...]reclustering.py", line 821, in <module>
    for count, result in enumerate(imap_it):
  File "[...]/lib/python2.7/multiprocessing/pool.py", line 659, in next
    raise value
TypeError: 'int' object is not callable
$ $prog --processes=2 --maxtasksperchild=1
Count = 0
Count = 1
Traceback (most recent call last):
[same message as above]
If I do not set maxtasksperchild, the program runs to completion successfully. Also, if I uncomment the "import pdb; pdb.set_trace()" line and enter the debugger, then the problem does not appear (Heisenbug). So, am I doing something wrong in the code here? Are there conditions on the code that generates the input (subproc.input_block_generator) or the code that processes it (recluster_block), that are known to cause issues like this? Thanks!
maxtasksperchild causes multiprocessing to respawn child processes. The idea is to get rid of any cruft that is building up. The problem is, you can get new cruft from the parent. When the child respawns, it gets the current state of the parent process, which is different from the original spawn. You are doing your work in the script's global namespace, so you are changing the environment the child will see quite a bit. Specifically, you use a variable called 'count' that masks an earlier 'from itertools import count' statement.
To fix this:
- use namespaces (itertools.count, as you said in the comment) to reduce name collisions
- do your work in a function so that local variables aren't propagated to the child
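The shadowing can be reproduced in isolation; this sketch (the loop body is illustrative) triggers the same 'int' object is not callable error:

```python
from itertools import count

# A loop variable named 'count' rebinds the imported name to an int.
for count, result in enumerate(['a', 'b']):
    pass

try:
    counter = count()  # 'count' is now the int 1, so calling it fails
except TypeError as e:
    print(e)  # → 'int' object is not callable
```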

raw_input() and sys.stdin misbehaves on CTRL-C

I am trying to detect a KeyboardInterrupt exception when CTRL-C is pressed during a raw_input() prompt. Normally the following code works just fine to detect the command:
try:
    input = raw_input("Input something: ")
except KeyboardInterrupt:
    do_something()
The problem comes when trying to intercept the input from sys.stdin. After adding some code in between raw_input() and sys.stdin, the CTRL-C command now results in two exceptions: EOFError followed by KeyboardInterrupt a line or two later. This is the code used to test:
import sys
import traceback

class StdinReplacement(object):
    def __init__(self):
        self.stdin = sys.stdin
        sys.stdin = self

    def readline(self):
        input = self.stdin.readline()
        # here something could be done with input before returning it
        return input

if __name__ == '__main__':
    rep = StdinReplacement()
    while True:
        info = None
        try:
            try:
                input = raw_input("Input something: ")
                print input
            except:
                info = sys.exc_info()
                print info
        except:
            print '\n'
            print "0:", traceback.print_traceback(*info)
            print "1:", traceback.print_exception(*sys.exc_info())
Which results in the following being printed out:
0:Traceback (most recent call last):
  File "stdin_issues.py", line 19, in <module>
    input = raw_input("Input something: ")
EOFError: EOF when reading a line
None
1:Traceback (most recent call last):
  File "stdin_issues.py", line 23, in <module>
    print info
KeyboardInterrupt
Am I missing something obvious? Maybe I'm intercepting the input in a bad way?
Found this fairly old page which seems like the same issue. No solution though:
https://mail.python.org/pipermail/python-list/2009-October/555375.html
Some environment details:
Python 2.7.3 (64-bit),
Windows 7 SP1 (64-bit)
------------------------------------------------------------------------
EDIT:
An update to the readline method of StdinReplacement fixed the issue.
def readline(self):
    input = self.stdin.readline()
    # here something could be done with input before returning it
    if len(input) > 0:
        return input
    else:
        return '\n'
It seems like the problem is that your readline method returns an empty line, thus signaling the end of file:
import sys

class Test(object):
    def readline(self):
        return ''

sys.stdin = Test()
raw_input('')  # EOFError!
However modifying it such that it doesn't return an empty line makes the code work:
import sys

class Test(object):
    def readline(self):
        return '\n'  # or 'a', or whatever

sys.stdin = Test()
raw_input('')  # no error
The readline method should return an empty string only when the file is finished. The EOFError means exactly this: raw_input expected the file to contain a line, but the file ended.
This is due to the call to PyFile_GetLine found at the end of the implementation of raw_input:
return PyFile_GetLine(fin, -1);
According to the documentation of PyFile_GetLine(PyObject *p, int n):
If n is less than 0, however, one line is read regardless of length,
but EOFError is raised if the end of the file is reached immediately.
Since it passes -1 as n, the EOFError is raised when EOF is found (i.e. when you return an empty string from readline).
As far as I can tell the behaviour you see is only possible if you insert the input and also create an interrupt. Pressing only Ctrl+C doesn't generate any EOFError (at least on linux).

Looping in Python and keeping current line after sub routine

I've been trying to nut out an issue when looping in Python 3.
When returning from the subroutine, the "line" variable has not advanced.
How do I get the script to return the latest readline from the subroutine?
Code below:
def getData(line):
    #print(line)
    #while line in sTSDP_data:
    while "/service/content/test" not in line:
        line = sTSDP_data.readline()

import os, sys

sFileTSDP = "d:/ess/redo/Test.log"
sTSDP_data = open(sFileTSDP, "r")
for line in sTSDP_data:
    if "MOBITV" in line:
        getData(line)  # call sub routine
    print(line)
I'm stepping through a large file, and on a certain string I need to call a subroutine to process the next 5 (or 100) lines of data. When the subroutine completes and returns to the main program, it should continue on from the last readline in the subroutine, not the last readline in the main program.
Daan's answer did the trick.
How about using a return statement?
def getData(line):
    #print(line)
    #while line in sTSDP_data:
    while "/service/content/test" not in line:
        line = sTSDP_data.readline()
    return line

import os, sys

sFileTSDP = "d:/ess/redo/Test.log"
sTSDP_data = open(sFileTSDP, "r")
for line in sTSDP_data:
    if "MOBITV" in line:
        line = getData(line)  # call sub routine
    print(line)
Beware the scope of your variables. The 'line' in your getData function is not the same as the 'line' in your loop.
Well, assignment does not work by reference. This means that if you rebind a variable in one function, it won't change the binding in another function (unless you use specific mechanisms like global and nonlocal). Note that this is rebinding, not modification: if you modify a list, all references to that list see the change.
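The distinction can be sketched like this (the names are illustrative):

```python
def rebind(x):
    x = 99          # rebinds the local name only; the caller is unaffected

def mutate(lst):
    lst.append(99)  # mutates the shared list object; the caller sees it

n = 1
rebind(n)
items = [1]
mutate(items)
print(n, items)  # → 1 [1, 99]
```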
Simply place return line at the end of getData(line)
def getData(line):
    #print(line)
    #while line in sTSDP_data:
    while "/service/content/test" not in line:
        line = sTSDP_data.readline()
    return line
