The child process exception is not visible when BrokenProcessPool is raised - Python

I have the following code:
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Awaitable

PROCESS_POOL_EXECUTOR = ProcessPoolExecutor(max_workers=2)

def run_in_process(blocking_task, *args) -> Awaitable:
    event_loop = asyncio.get_event_loop()
    return event_loop.run_in_executor(PROCESS_POOL_EXECUTOR, blocking_task, *args)
It had been working fine until I added an additional volume and mounted it on an EC2 instance. Since I did that, it raises the following exception:
File "/proc/self/fd/3/repo/utils/asyncio.py", line 63, in run_in_process
return event_loop.run_in_executor(PROCESS_POOL_EXECUTOR, blocking_task, *args)
File "/conda/lib/python3.8/asyncio/base_events.py", line 783, in run_in_executor
executor.submit(func, *args), loop=self)
File "/conda/lib/python3.8/concurrent/futures/process.py", line 629, in submit
raise BrokenProcessPool(self._broken)
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
There is nothing else in the log. If I understand correctly, this means that a worker raised some exception and that is why the child process terminated, but I don't see that child process's exception, so I have no idea what is going wrong.
It is probably related to the additional volume and the mount, because the code works without the new volume; I just don't know what exactly goes wrong.
I also tried running the code in IPython, and it worked just fine there.
I understand that this is a bad, non-reproducible question, but maybe someone has seen this before and has some idea.
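One way to narrow this down (a debugging sketch, not part of the original code) is to wrap the submitted callable so that any uncaught Python exception gets logged from inside the child before the process dies, since the parent often only sees the generic BrokenProcessPool. The run_logged helper below is an assumption on my part; it must live at module level so it can be pickled:

import logging
import traceback

def run_logged(func, *args, **kwargs):
    # Runs inside the worker process: log any uncaught exception before
    # re-raising, because the parent may never see the original traceback.
    try:
        return func(*args, **kwargs)
    except BaseException:
        logging.error("Worker failed:\n%s", traceback.format_exc())
        raise

# Hypothetical usage with the helper from the question:
# await run_in_process(run_logged, blocking_task, arg1, arg2)

If the pool still breaks without any logged traceback, the child was most likely killed by a signal (for example by the kernel's out-of-memory killer) rather than by a Python exception.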

Related

Logging inside Threads When Logger is Already Configured

EDIT: A repo with all the code is available (branch "daemon"); the question concerns the code in the linked file.
My main program configures logging like this (options have been simplified):
logging.basicConfig(level='DEBUG', filename="/some/directory/cstash.log")
Part of my application starts a daemon, for which I use the daemon package:
with daemon.DaemonContext(
    pidfile=daemon.pidfile.PIDLockFile(self.pid_file),
    stderr=self.log_file,
    stdout=self.log_file
):
    self.watch_files()
where self.log_file is a file I've opened for writing.
When I start the application, I get:
--- Logging error ---
Traceback (most recent call last):
File "/Users/afraz/.pyenv/versions/3.7.2/lib/python3.7/logging/__init__.py", line 1038, in emit
self.flush()
File "/Users/afraz/.pyenv/versions/3.7.2/lib/python3.7/logging/__init__.py", line 1018, in flush
self.stream.flush()
OSError: [Errno 9] Bad file descriptor
If I switch off the logging to a file in the daemon, the logging in my main application works, and if I turn off the logging to a file in my main application, the logging in the daemon works. If I set them up to log to a file (even different files), I get the error above.
After trying many things, here's what worked:
def process_wrapper():
    with self.click_context:
        self.process_queue()

def watch_wrapper():
    with self.click_context:
        self.watch_files()

with daemon.DaemonContext(
    pidfile=daemon.pidfile.PIDLockFile(self.pid_file),
    files_preserve=[logger.handlers[0].stream.fileno()],
    stderr=self.log_file,
    stdout=self.log_file
):
    logging.info("Started cstash daemon")
    while True:
        threading.Thread(target=process_wrapper).start()
        time.sleep(5)
        threading.Thread(target=watch_wrapper).start()
There were two main things wrong:
daemon.DaemonContext needs files_preserve set to the file descriptor of the logging handler's stream, so that it doesn't close the file when the daemon context is entered. This is the actual solution to the original problem.
Additionally, both methods needed to run in separate threads, not just one. The while True loop in the main thread was stopping the other method from running, so putting both into their own threads lets them both run.

Timeout and Exception Function Getting Stuck

I am polling data using a Python 2.7.10 function that I want to time out if a device takes too long to respond, or to catch a RuntimeError if that device is not available.
I am using this Timeout function:
import signal

class Timeout():
    class Timeout(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)

    def raise_timeout(self, *args):
        raise Timeout.Timeout()
This is my loop to make the data polls (Modbus) and catch the exceptions. This loop is called every 60 seconds:
def getDeviceTags(name, tag_data):
    global val_returns
    for tag in tag_data[name]:
        local_vals = []
        local_vals.append(name + "." + tag)
        try:
            with Timeout(3):
                value = modbus.read(str(name), str(tag))
                local_vals.append(str(value.value()))
        except RuntimeError:
            print("RuntimeError on " + str(name))
            local_vals.append(None)
        except Timeout.Timeout:
            print("Timeout on " + str(name))
            local_vals.append(None)
        val_returns.append(local_vals)
This will work for DAYS at a time with no issues, both RuntimeErrors and Timeouts being printed to the console, all data logged - GREAT.
However, recently it's been getting stuck - and this is the only error I'm getting:
Traceback (most recent call last):
File "working_one_min_back.py", line 161, in <module>
job()
File "working_one_min_back.py", line 79, in job
getDeviceTags(str(key), data)
File "working_one_min_back.py", line 57, in getDeviceTags
print("RuntimeError on " + str(name))
File "working_one_min_back.py", line 30, in raise_timeout
raise Timeout.Timeout()
__main__.Timeout
There’s no guarantee that a “Python signal” isn’t delivered after a call to alarm(0). The actual (C) signal might already have been delivered, causing the Python handler to be invoked a few bytecode instructions later.
If you call signal.signal from __exit__, any such pending signal is discarded, which usefully prevents mistaking it for the next one requested. Using that to restore the handler to the value it had before the Timeout was created (as returned by the first signal.signal call) is a good idea anyway. (Reset it after calling alarm(0) to prevent SIG_DFL from killing the process.)
In Python 3, such a call delivers any pending signals instead of discarding them, which is an improvement in that it prevents losing a signal just because the handler changed. (This is no more documented than the Python 2 behavior, unfortunately.) You can try to suppress such a late signal by setting an attribute in __exit__ and ignoring any (Python) signal raised when it is set.
Of course, the signal could be delivered after __exit__ begins execution and before the signal is discarded (or marked to be ignored). You therefore have to handle an operation both completing and timing out, perhaps by having several assignments to a single variable that is then appended in just one place.
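A sketch of the hardened context manager described above, keeping the layout of the Timeout class from the question: restore the previous handler after alarm(0), and use a flag so a signal delivered after the block completes is ignored rather than raised.

import signal

class Timeout():
    class Timeout(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec
        self.done = False

    def __enter__(self):
        self.done = False
        # Keep the previous handler so it can be restored on exit.
        self.old_handler = signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)
        return self

    def __exit__(self, *args):
        signal.alarm(0)
        # Mark the block as finished first, then restore the old handler.
        # A signal that slips in between alarm(0) and this line can still
        # raise inside __exit__; that is the residual race noted above.
        self.done = True
        signal.signal(signal.SIGALRM, self.old_handler)

    def raise_timeout(self, *args):
        if self.done:
            return  # late delivery after the block finished; ignore it
        raise Timeout.Timeout()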

File cannot be found when I use Fabric put in multi-threading to copy a temp file created by tempfile.mkstemp to a remote file

Python Version: 2.6.4
Fabric Version: 1.9.0
I have an automation testing framework that executes cases in parallel using threading.Thread (3 threads in my case).
Each thread worker uses Fabric's put (we wrap this function, though) to copy a temporary file created by tempfile.mkstemp to a remote file.
The problem is that it always gives me an error saying the file cannot be found; judging from the exception, the error happens during the 'put'.
Here is the code that does the 'put':
MyShell.py (parent class of MyFabShell):
def putFileContents(self, file, contents):
    fd, tmpFile = tempfile.mkstemp()
    os.system('chmod 777 %s' % tmpFile)
    contentsFile = open(tmpFile, 'w')
    contentsFile.write(contents)
    contentsFile.close()
    dstTmpFile = self.genTempFile()
    localshell = getLocalShell()
    self.get(localshell, tmpFile, dstTmpFile)  # use local shell to put tmpFile to dstTmpFile
    os.close(fd)
    os.remove(tmpFile)
    #os.remove(tmpFile)
MyFabShell.py:
def get(self, srcShell, srcPath, dstPath):
    srcShell.put(self, srcPath, dstPath)

def put(self, dstShell, srcPath, dstPath):
    if not self.pathExists(srcPath):  # line 158
        raise Exception("Cannot put <%s>, does not exist." % srcPath)
    # do fabric get/put in the following
    # ...
The call to put results in the following error:
...
self.shell.putFileContents(configFile, contents)
File "/path/to/MyShell.py", line 401, in putFileContents
self.get(localShell, tmpFile, dstTmpFile)
File "/path/to/MyFabShell.py", line 134, in get
srcShell.put( self, srcPath, dstPath )
File "/path/to/myFabShell.py", line 158, in put
raise Exception( "Cannot put <%s>, does not exist." % srcPath )
Exception: Cannot put </tmp/tmpwt3hoO>, does not exist.
I initially suspected that the file might be removed during the put, so I commented out os.remove. However, I got the same error again.
From the exception log, it should not be a problem with Fabric's put itself, since the exception is thrown before the Fabric get/put is even executed.
Is mkstemp NOT safe when multithreading is involved? The documentation says "There are no race conditions in the file's creation". Or does my case fail because of the GIL? I suspect something along those lines, because when I use only one thread everything is fine.
Could anyone give me a clue about this error? I have been struggling with the problem for a while :(
My problem was solved when I joined the threads explicitly. None of my threads are daemon threads, and each thread does a lot of I/O (e.g. file writes and reads). Joining explicitly makes sure each thread's job is completed.
I am still not sure about the root cause of my problem. The temp file is actually there, but the thread complains it "cannot find" it when multiple threads are working together. The only guess I can offer is this:
when thread A does an I/O operation, it releases the GIL so that thread B (or C, D, ...) can acquire it during A's I/O time. The problem might happen during that I/O window, because thread A is no longer running in the Python interpreter; that could be why the file "cannot be found" by A. When we join A explicitly, A is guaranteed to finish its job before the program moves on.
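A minimal sketch of the explicit join described above; run_case, cases, shell and the case attributes are illustrative placeholders, not names from the original framework:

import threading

def run_case(case):
    # Hypothetical worker: creates its temp file and performs the Fabric put.
    shell.putFileContents(case.config_file, case.contents)

threads = [threading.Thread(target=run_case, args=(case,)) for case in cases]
for t in threads:
    t.start()
for t in threads:
    # Block until every worker has finished its I/O before the main thread
    # moves on (or exits) and the temporary files are cleaned up.
    t.join()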

"WindowsError: Access is denied" on calling Process.terminate

I enforce a timeout for a block of code using the multiprocessing module. It appears that with inputs of a certain size, the following error is raised:
WindowsError: [Error 5] Access is denied
I can replicate this error with the following code. Note that the code completes with '467,912,040' but not with '517,912,040'.
import multiprocessing, Queue

def wrapper(queue, lst):
    lst.append(1)
    queue.put(lst)
    queue.close()

def timeout(timeout, lst):
    q = multiprocessing.Queue(1)
    proc = multiprocessing.Process(target=wrapper, args=(q, lst))
    proc.start()
    try:
        result = q.get(True, timeout)
    except Queue.Empty:
        return None
    finally:
        proc.terminate()
    return result

if __name__ == "__main__":
    # lst = [0]*417912040 # this works fine
    # lst = [0]*467912040 # this works fine
    lst = [0] * 517912040 # this does not
    print "List length:", len(lst)
    timeout(60*30, lst)
The output (including error):
List length: 517912040
Traceback (most recent call last):
File ".\multiprocessing_error.py", line 29, in <module>
print "List length:",len(lst)
File ".\multiprocessing_error.py", line 21, in timeout
proc.terminate()
File "C:\Python27\lib\multiprocessing\process.py", line 137, in terminate
self._popen.terminate()
File "C:\Python27\lib\multiprocessing\forking.py", line 306, in terminate
_subprocess.TerminateProcess(int(self._handle), TERMINATE)
WindowsError: [Error 5] Access is denied
Am I not permitted to terminate a Process of a certain size?
I am using Python 2.7 on Windows 7 (64bit).
While I am still uncertain regarding the precise cause of the problem, I have some additional observations as well as a workaround.
Workaround.
Add a try/except block around the terminate call in the finally clause:
finally:
    try:
        proc.terminate()
    except WindowsError:
        pass
This also seems to be the solution arrived at in a related (?) issue posted here on GitHub (you may have to scroll down a bit).
Observations.
This error is dependent on the size of the object passed to the Process/Queue, but it is not related to the execution of the Process itself. In the OP, the Process completes before the timeout expires.
proc.is_alive() returns True both before and after the call to proc.terminate() (which then throws the WindowsError). A second or two later, proc.is_alive() returns False and a second call to proc.terminate() succeeds.
Forcing the main thread to sleep (time.sleep(1)) in the finally block also prevents the WindowsError from being thrown. Thanks to @tdelaney's comment on the OP.
My best guess is that proc is busy freeing its memory (or something comparable) as it is being torn down by the OS (having already completed execution) when the call to proc.terminate() attempts to kill it again.
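Putting the workaround and the observations together, a more defensive finally clause might look like this (a sketch, not a definitive fix): give a child that has already finished a moment to exit on its own, and ignore the access-denied error if the OS is still tearing it down.

finally:
    # Let a child that has already completed shut down by itself.
    proc.join(1)
    if proc.is_alive():
        try:
            proc.terminate()
        except WindowsError:
            pass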

cPickle - unpickling error ONLY when the unpickled stuff is used later on

The code as I have it now works fine; I can even print the deserialized objects with no mistakes whatsoever, so I know exactly what is in there.
@staticmethod
def receiveData(self):
    '''
    This method has to be static, as it is the argument of a Thread.
    It receives Wrapper objects from the server (as yet containing only a player)
    and resets the local positions accordingly.
    '''
    logging.getLogger(__name__).info("Server information is now being received")
    from modules.logic import game
    sock = self.sock
    time.sleep(10)
    self.myPlayer = game.get_player()
    while True:
        try:
            wrapPacked = sock.recv(4096)
            self.myList = cPickle.loads(wrapPacked)
            # self.setData(self.myList)
        except Exception as eload:
            print eload
However, if I try to actually use the line that is commented out here (self.setData(self.myList)), I get
unpickling stack underflow
and
invalid load key, ' '.
Just for the record, the code of setData is:
def setData(self, list):
    if list.__sizeof__() > 0:
        first = list[0]
        self.myPlayer.setPos(first[1])
        self.myPlayer.setVelocity(first[2])
I have been on this for 3 days now, and really, I have no idea what is wrong.
Can you help me?
Full Traceback:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "mypath/client.py", line 129, in receiveData
self.myList = cPickle.loads(wrapPacked)
UnpicklingError: unpickling stack underflow
The fact that your exceptions always happen when you try to access the pickled data seems to indicate that you are hitting a bug in the cPickle library instead.
What can happen is that a C library forgets to handle an exception. The exception info is stored, not handled, and is sitting there in the interpreter until another exception happens or another piece of C code does check for an exception. At this point the old, unhandled exception is thrown instead.
Your error is clearly cPickle-related: it is very unhappy about the data you feed it, but the exception itself is thrown in unrelated locations. This could be threading-related, or it could be a regular non-threading bug.
You need to see if you can load the data in a test setting. Write wrapPacked to a file for later testing, load that file in an interpreter shell session with cPickle.loads(), and see what happens. Do the same with the pickle module.
If you do run into similar problems in this test session, and you can reproduce it (weird exceptions being thrown at a later point in the session) you need to file a bug with the Python project to have this looked at.
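A minimal sketch of that debugging step, based on the receive loop in the question; the dump path is arbitrary:

# Inside the receive loop: save the raw bytes for offline inspection.
with open("/tmp/wrap_packed.dump", "wb") as dump:
    dump.write(wrapPacked)

# Later, in a separate interpreter session:
import cPickle
import pickle

with open("/tmp/wrap_packed.dump", "rb") as dump:
    data = dump.read()

print cPickle.loads(data)   # does the C implementation reject it?
print pickle.loads(data)    # does the pure-Python implementation agree?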
