On some occasions, my python program won't response because there seems to be a deadlock. Since I have no idea where this deadlock happens, I'd like to set a breakpoint or dump the stack of all threads after 10 seconds in order to learn what my program is waiting for.
Use the logging module and put e.g. Logger.debug() calls in strategic places through your program. You can disable these messages by one single setting (Logger.setLevel) if you want to. And you can choose if you want to write them to e.g. stderr or to a file.
import pdb
from your_test_module import TestCase
testcase = TestCase()
testcase.setUp()
pdb.runcall(testcase.specific_test)
And then ctrl-c at your leisure. The KeyboardInterupt will cause pdb to drop into debugger prompt.
Well, as it turns out, it was because my database was locked (a connection wasn't closed) and when the tests were tearing down (and the database schema was being erased so that the database is clean for the next tests), psycopg2 just ignored the KeyboardInterrupt exception.
I solved me problem using the faulthandler module (for earlier versions, there is a pypi repo). Fault handler allows me to dump the stack trace to any file (including sys.stderr) after a period of time (repeatingly) using faulthandler.dump_traceback_later(3, repeat=True). That allowed me to set the breakpoint where my program stopped responding and tackle the issue down effectively.
Related
After enabling Write-Ahead Log mode for a volatile SQLite3 database (PRAGMA journal_mode = WAL), my concurrency tests began raising this error. I discovered that this happens when the Python process is forked and a connection is left open to a database in WAL mode. Any subsequent execute() on that database, even with a new connection, throws this 'locking protocol' exception.
Disabling WAL mode (PRAGMA journal_mode = DELETE) makes the problem disappear, and neither does any 'database is locked' error occur either. The 'locking protocol' exception seems reflect the SQLITE_PROTOCOL code underneath which is documented as:
The SQLITE_PROTOCOL result code indicates a problem with the file locking protocol used by SQLite.
I'm using Python 2.7.10 on Mac OS X 10.12.6 Sierra. I think the problem is in Python's sqlite3 module and how it deals with being forked, rather than an issue in SQLite3 itself. I know now how to work around the issue but as per the main question, what is the root cause of this issue?
P.S. - I'm not using any threads and am forking by spawning a daemon child.
SQLite3 is obviously not thread-safe as per the FAQ but, as CL pointed out in the comments to my question, there is a line relating to forking there:
Under Unix, you should not carry an open SQLite database across a fork() system call into the child process.
This doesn't exactly provide an answer as to the cause, however it does point out a solution: close ALL SQLite connections in (or before) a fork() process! Holding onto forked connections prevents new connections from taking place across any process!
I am using a simple python with statement such as below to write to a log file.
with open(filename, 'a+') as f:
do_stuff1()
f.write('stuff1 complete. \n')
do_stuff2()
f.write('stuff2 complete. \n')
do_stuff3()
f.write('stuff3 complete. \n')
I am finding that my script fails intermittently at do_stuff2() however in the log file I do not find the line "stuff1 complete" as I would expect if the file was closed correctly as should happen when using with. The only reason I do know the script is failing at do_stuff2() without my log working is because this function calls an API that does its own logging and that other log file tells me that 2 has been executed even if it did not complete.
My question is what sort of error would have to occur inside the with statement that would not only stop execution but also prevent the file from being closed correctly?
Some additional information:
The script is a scheduled task that runs late at night.
I have never been able to reproduce by running the process interactively.
The problem occurs once in every 2-3 nights.
I do see errors in the Windows event logs that point to a dll file, .NET Framework and a 0xC0000005 error which is a memory violation. The API used by do_stuff2() does use this DLL which in turn uses the .NET Framework.
Obviously I am going to try to fix the problem itself but at this point my question is focused on what could happen inside the with (potentially some number of layers below my code) that could break its intended functionality of closing the file properly regardless of whether the content of the with is executed successfully.
The with can only close the file if there is an exception. If there is a segfault inside an extension, no exception might be raised and the process dies without giving Python a chance to close the file. You can try to use f.flush() at several places to force Python to write to the file.
I read How do you create a daemon in Python? and also this topic, and tried to write a very simple daemon :
import daemon
import time
with daemon.DaemonContext():
while True:
with open('a.txt', 'a') as f:
f.write('Hi')
time.sleep(2)
Doing python script.py works and returns immediately to terminal (that's the expected behaviour). But a.txt is never written and I don't get any error message. What's wrong with this simple daemon?
daemon.DaemonContext() has option working_directory that has default fault value / i.e. your program probably doesn't have permission to create a new file there.
The problem described here is solved by J.J. Hakala's answer.
Two additional (important) things :
Sander's code (mentioned here) is better than python-daemon. It is more reliable. Just one example: try to start two times the same daemon with python-daemon : big ugly error. With Sander's code : a nice notice "Daemon already running."
For those who want to use python-daemon anyway: DaemonContext() only makes a daemon. DaemonRunner() makes a daemon + control tool, allowing to do python script.py start or stop, etc.
One thing that's wrong with it, is it has no way to tell you what's wrong with it :-)
A daemon process is, by definition, detached from the parent process and from any controlling terminal. So if it's got something to say – such as error messages – it will need to arrange that before becoming a daemon.
From the python-daemon FAQ document:
Why does the output stop after opening the daemon context?
The specified behaviour in PEP 3143_ includes the requirement to
detach the process from the controlling terminal (to allow the process
to continue to run as a daemon), and to close all file descriptors not
known to be safe once detached (to ensure any files that continue to
be used are under the control of the daemon process).
If you want the process to generate output via the system streams
‘sys.stdout’ and ‘sys.stderr’, set the ‘DaemonContext’'s ‘stdout’
and/or ‘stderr’ options to a file-like object (e.g. the ‘stream’
attribute of a ‘logging.Handler’ instance). If these objects have file
descriptors, they will be preserved when the daemon context opens.
Set up a working channel of communication, such as a log file. Ensure the files you open aren't closed along with everything else, using the files_preserve option. Then log any errors to that channel.
I have a possibly long running program that currently has 4 processes, but could be configured to have more. I have researched logging from multiple processes using python's logging and am using the SocketHandler approach discussed here. I never had any problems having a single logger (no sockets), but from what I read I was told it would fail eventually and unexpectedly. As far as I understand its unknown what will happen when you try to write to the same file at the same time. My code essentially does the following:
import logging
log = logging.getLogger(__name__)
def monitor(...):
# Spawn child processes with os.fork()
# os.wait() and act accordingly
def main():
log_server_pid = os.fork()
if log_server_pid == 0:
# Create a LogRecordSocketServer (daemon)
...
sys.exit(0)
# Add SocketHandler to root logger
...
monitor(<configuration stuff>)
if __name__ == "__main__":
main()
So my questions are: Do I need to create a new log object after each os.fork()? What happens to the existing global log object?
With doing things the way I am, am I even getting around the problem that I'm trying to avoid (multiple open files/sockets)? Will this fail and why will it fail (I'd like to be able to tell if future similar implementations will fail)?
Also, in what way does the "normal" (one log= expression) method of logging to one file from multiple processes fail? Does it raise an IOError/OSError? Or does it just not completely write data to the file?
If someone could provide an answer or links to help me out, that would be great. Thanks.
FYI:
I am testing on Mac OS X Lion and the code will probably end up running on a CentOS 6 VM on a Windows machine (if that matters). Whatever solution I use does not need to work on Windows, but should work on a Unix based system.
UPDATE: This question has started to move away from logging specific behavior and is more in the realm of what does linux do with file descriptors during forks. I pulled out one of my college textbooks and it seems that if you open a file in append mode from two processes (not before a fork) that they will both be able to write to the file properly as long as your write doesn't exceed the actual kernel buffer (although line buffering might need to be used, still not sure on that one). This creates 2 file table entries and one v-node table entry. Opening a file then forking isn't supposed to work, but it seems to as long as you don't exceed the kernel buffer as before (I've done it in a previous program).
So I guess, if you want platform independent multi-processing logging you use sockets and create a new SocketHandler after each fork to be safe as Vinay suggested below (that should work everywhere). For me, since I have strong control over what OS my software is being run on, I think I'm going to go with one global log object with a FileHandler (opens in append mode by default and line buffered on most OSs). The documentation for open says "A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used." or I could just create my own logging stream to be sure of line buffering. And just to be clear, I'm ok with:
# Process A
a_file.write("A\n")
a_file.write("A\n")
# Process B
a_file.write("B\n")
producing...
A\n
B\n
A\n
as long as it doesn't produce...
AB\n
\n
A\n
Vinay(or anyone else), how wrong am I? Let me know. Thanks for any more clarity/sureness you can provide.
Do I need to create a new log object after each os.fork()? What happens to the existing global log object?
AFAIK the global log object remains pointing to the same logger in parent and child processes. So you shouldn't need to create a new one. However, I think you should create and add the SocketHandler after the fork() in monitor(), so that the socket server has four distinct connections, one to each child process. If you don't do this, then the child processes forked off in monitor() will inherit the SocketHandler and its socket handle from their parent, but I don't know for sure that it will misbehave. The behaviour is likely to be OS-dependent and you may be lucky on OSX.
With doing things the way I am, am I even getting around the problem that I'm trying to avoid (multiple open files/sockets)? Will this fail and why will it fail (I'd like to be able to tell if future similar implementations will fail)?
I wouldn't expect failure if you create the socket connection to the socket server after the last fork() as I suggested above, but I am not sure the behaviour is well-defined in any other case. You refer to multiple open files but I see no reference to opening files in your pseudocode snippet, just opening sockets.
Also, in what way does the "normal" (one log= expression) method of logging to one file from multiple processes fail? Does it raise an IOError/OSError? Or does it just not completely write data to the file?
I think the behaviour is not well-defined, but one would expect failure modes to present as interspersed log messages from different processes in the file, e.g.
Process A writes first part of its message
Process B writes its message
Process A writes second part of its message
Update: If you use a FileHandler in the way you described in your comment, things will not be so good, due to the scenario I've described above: process A and B both begin pointing at the end of file (because of the append mode), but thereafter things can get out of sync because (e.g. on a multiprocessor, but even potentially on a uniprocessor), one process can (preempt another and) write to the shared file handle before another process has finished doing so.
I've ran into a strange situation. I'm writing some test cases for my program. The program is written to work on sqllite or postgresqul depending on preferences. Now I'm writing my test code using unittest. Very basically what I'm doing:
def setUp(self):
"""
Reset the database before each test.
"""
if os.path.exists(root_storage):
shutil.rmtree(root_storage)
reset_database()
initialize_startup()
self.project_service = ProjectService()
self.structure_helper = FilesHelper()
user = model.User("test_user", "test_pass", "test_mail#tvb.org",
True, "user")
self.test_user = dao.store_entity(user)
In the setUp I remove any folders that exist(created by some tests) then I reset my database (drop tables cascade basically) then I initialize the database again and create some services that will be used for testing.
def tearDown(self):
"""
Remove project folders and clean up database.
"""
created_projects = dao.get_projects_for_user(self.test_user.id)
for project in created_projects:
self.structure_helper.remove_project_structure(project.name)
reset_database()
Tear down does the same thing except creating the services, because this test module is part of the same suite with other modules and I don't want things to be left behind by some tests.
Now all my tests run fine with sqllite. With postgresql I'm running into a very weird situation: at some point in the execution, which actually differs from run to run by a small margin (ex one or two extra calls) the program just halts. I mean no error is generated, no exception thrown, the program just stops.
Now only thing I can think of is that somehow I forget a connection opened somewhere and after I while it timesout and something happens. But I have A LOT of connections so before I start going trough all that code, I would appreciate some suggestions/ opinions.
What could cause this kind of behaviour? Where to start looking?
Regards,
Bogdan
PostgreSQL based applications freeze because PG locks tables fairly aggressively, in particular it will not allow a DROP command to continue if any connections are open in a pending transaction, which have accessed that table in any way (SELECT included).
If you're on a unix system, the command "ps -ef | grep 'post'" will show you all the Postgresql processes and you'll see the status of current commands, including your hung "DROP TABLE" or whatever it is that's freezing. You can also see it if you select from the pg_stat_activity view.
So the key is to ensure that no pending transactions remain - this means at a DBAPI level that any result cursors are closed, and any connection that is currently open has rollback() called on it, or is otherwise explicitly closed. In SQLAlchemy, this means any result sets (i.e. ResultProxy) with pending rows are fully exhausted and any Connection objects have been close()d, which returns them to the pool and calls rollback() on the underlying DBAPI connection. you'd want to make sure there is some kind of unconditional teardown code which makes sure this happens before any DROP TABLE type of command is emitted.
As far as "I have A LOT of connections", you should get that under control. When the SQLA test suite runs through its 3000 something tests, we make sure we're absolutely in control of connections and typically only one connection is opened at a time (still, running on Pypy has some behaviors that still cause hangs with PG..its tough). There's a pool class called AssertionPool you can use for this which ensures only one connection is ever checked out at a time else an informative error is raised (shows where it was checked out).
One solution I found to this problem was to call db.session.close() before any attempt to call db.drop_all(). This will close the connection before dropping the tables, preventing Postgres from locking the tables.
See a much more in-depth discussion of the problem here.