I have a python application with multiple threads, with threads 2 to n potentially opening an arbitrary number of files. I want to make sure that when thread 1 tries to open a file, it will definitely not fail because of file descriptor exhaustion. In short, I want to reserve file descriptors without opening them.
I can only control the code being run from thread 1, which is spawned at a time when there's still plenty of file descriptors left.
(As an example, I imagine this could be done by 'reserving' fds: opening /dev/null a couple of times from thread 1 and closing one of those descriptors whenever thread 1 needs to open a file, thus making sure there's at least one unused fd. But this introduces a race condition: another thread could grab the freed descriptor before thread 1 manages to open its file.)
Is there a way to make sure thread 1 will have fds available when it needs them without modifying what threads 2-n do?
You need to use a mutex. For python 2.x that's the mutex or thread module.
In your "thread 1" you will access (obtain/lock) the mutex first, then close the reserved fd and open the real one, then release the mutex.
In your other threads, you simply wait till you can get the mutex, then open the file, then release the mutex.
For python3 it's the Lock from the threading module.
(Note: I'm not commenting on whether or not opening /dev/null achieves what you want in terms of reserving an fd, because I'm not confident about that. I'm just offering a way to avoid the race condition you asked about.)
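A minimal sketch of that scheme, assuming (unlike the constraint in the question) that every thread can be made to use the shared lock, and that descriptors are "reserved" by opening /dev/null; the helper names and the pool size of 4 are made up for illustration:

import os
import threading

fd_lock = threading.Lock()
# Thread 1 reserves a handful of descriptors up front by opening /dev/null.
reserved_fds = [os.open("/dev/null", os.O_RDONLY) for _ in range(4)]

def open_from_thread_1(path, flags):
    # Give one reserved fd back to the kernel, then immediately reuse it.
    with fd_lock:
        os.close(reserved_fds.pop())
        return os.open(path, flags)

def open_from_other_threads(path, flags):
    # The other threads may only open files while holding the same lock.
    with fd_lock:
        return os.open(path, flags)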
How do I correctly fork a child process in Twisted that does not use anything from Twisted but does use data from the parent process, e.g. to write a "snapshot" of some parent data to a file without blocking?
It seems that if I do anything like a clean shutdown in the child process after os.fork(), it closes some of the sockets / descriptors in the parent process; the only way I see to avoid that is to do os.kill(os.getpid(), signal.SIGKILL), which does seem like a bad idea (though not directly problematic).
(Additionally, if a dict is changed in the parent process, can that change show up in the child process too? A quick test shows that it doesn't. The OS/kernels are Debian stable / sid.)
IReactorProcess.spawnProcess (usually available as from twisted.internet import reactor; reactor.spawnProcess) can spawn a process running any available executable on your system. The subprocess does not need to use Twisted, or, indeed, even be in Python.
Do not call os.fork yourself. As you've discovered, it has lots of very peculiar interactions with process state, which spawnProcess will manage for you.
Among the problems with os.fork are:
Forking copies your current process state, but doesn't copy the state of threads. This means that any thread in the middle of modifying some global state will leave things half-broken, possibly holding some locks which will never be released. Don't run any threads in your application? Have you audited every library you use, every one of its dependencies, to ensure that none of them have ever or will ever use a background thread for anything?
You might think you're only touching certain areas of your application memory, but thanks to Python's reference counting, any object which you even peripherally look at (or is present on the stack) may have reference counts being incremented or decremented. Incrementing or decrementing a refcount is a write operation, which means that whole page (not just that one object) gets copied back into your process. So forked processes in Python tend to accumulate a much larger copied set than, say, forked C programs.
Many libraries, famously all of the libraries that make up the systems on macOS and iOS, cannot handle fork() correctly and will simply crash your program if you attempt to use them after fork but before exec.
There's a flag for telling file descriptors to close on exec - but no such flag to have them close on fork. So any files (including log files, and again, any background temp files opened by libraries you might not even be aware of) can get silently corrupted or truncated if you don't manage access to them carefully.
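For example, a rough sketch of handing a snapshot to a non-Twisted child via spawnProcess instead of os.fork might look like the following; snapshot_worker.py is a hypothetical script that reads the snapshot from stdin and writes it to a file, entirely outside the parent's reactor:

import json
import sys

from twisted.internet import reactor
from twisted.internet.protocol import ProcessProtocol

class SnapshotWriter(ProcessProtocol):
    def __init__(self, snapshot):
        self.snapshot = snapshot

    def connectionMade(self):
        # Hand the snapshot to the child over its stdin, then close the pipe.
        self.transport.write(json.dumps(self.snapshot).encode("utf-8"))
        self.transport.closeStdin()

    def processEnded(self, reason):
        print("child finished:", reason.value)

snapshot = {"some": "data"}  # whatever parent state you want to hand over
reactor.spawnProcess(
    SnapshotWriter(snapshot),
    sys.executable,
    [sys.executable, "snapshot_worker.py"],
    env=None,  # on POSIX, pass the parent's environment to the child
)
reactor.run()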
I think I have a problem with a ttyUSB device that was caused by having two ttyUSB fds open at the same time from different processes.
It goes like this:
I have a main Python process which opens several ttyUSB fds, reads, writes, closes them, and then opens child processes (with popen) to handle each ttyUSB (of course after the fd was closed).
When I do 'lsof | grep ttyUSB', it looks as if all the fds that were open in the main process when a child process started are associated with that child process, even though they were already closed by the main process. (By the way, the fds are not associated with the main process.)
Is that behavior normal? (Tiny Core, kernel 2.6.33.3) Is there a way to prevent it?
Thanks.
By default, any file descriptors that a process has open when it forks/execs (which happens during a popen()) are inherited by the child process. If this isn't what you want to happen, you will need to either manually close the file descriptors after forking, or set the fds as close-on-exec using fcntl(fd, F_SETFD, FD_CLOEXEC). (This makes the kernel automatically close the file descriptor when it execs the new process.)
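A minimal sketch of the second option in Python, marking an already-open descriptor close-on-exec so children started with popen() do not inherit it (the device path is just an example):

import fcntl
import os

fd = os.open("/dev/ttyUSB0", os.O_RDWR | os.O_NOCTTY)

# Set FD_CLOEXEC without clobbering any other descriptor flags.
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)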
I have a question about how flock() works, particularly in python. I have a module that opens a serial connection (via os.open()). I need to make this thread safe. It's easy enough making it thread safe when working in the same module using threading.Lock(), but if the module gets imported from different places, it breaks.
I was thinking of using flock(), but I'm having trouble finding enough information about how exactly flock() works. I've read that flock() releases the lock once the file is closed, but is there any situation in which the file could stay open (and locked) after Python crashes?
And what exactly is allowed to use the locked file if LOCK_EX is set? Just the module that locked the file? Any module that was imported from the script that was originally run?
When a process dies, the OS cleans up any open file resources (with some caveats, I'm sure), and the advisory lock is released when the file is closed, which happens as part of that cleanup when the Python process exits.
Remember, flock(2) is merely advisory:
Advisory locks allow cooperating processes to perform consistent operations on files, but [other, poorly behaved] processes may still access those files without using advisory locks.
flock(2) implements a readers-writer lock. You can't flock the same file twice with LOCK_EX, but any number of people can flock it with LOCK_SH simultaneously (as long as nobody else has a LOCK_EX on it).
The locking mechanism allows two types of locks: shared locks and exclusive locks. At any time multiple shared locks may be applied to a file, but at no time are multiple exclusive, or both shared and exclusive, locks allowed simultaneously on a file.
flock works at the OS/process level and is independent of python modules. One module may request n locks, or n locks could be requested across m modules. However, only one process can hold a LOCK_EX lock on a given file at a given time.
YMMV on a "non-UNIX" system or a non-local filesystem.
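As a concrete illustration, a minimal sketch of guarding the serial device across processes with an exclusive flock on a separate lock file (the lock-file path is hypothetical):

import fcntl
import os

lock_fd = os.open("/tmp/serial.lock", os.O_RDWR | os.O_CREAT)
try:
    fcntl.flock(lock_fd, fcntl.LOCK_EX)  # blocks until the exclusive lock is ours
    # ... open and talk to the serial device here ...
finally:
    fcntl.flock(lock_fd, fcntl.LOCK_UN)  # also released implicitly on close/exit
    os.close(lock_fd)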
I'm using the Jython 2.5.1 implementation of Python to write a script that repeatedly invokes another process via subprocess.Popen, using PIPE to pipe stdout and stderr to the parent process and stdin to the child process. After several hundred loop iterations, I seem to run out of file descriptors.
The Python subprocess documentation mentions very little about freeing file descriptors, other than the close_fds option, which isn't described very clearly (Why should there be any file descriptors besides 0, 1 and 2 open in the first place?). I'm assuming that in CPython, reference counting takes care of the resource freeing issue. What's the proper way to make sure all descriptors get freed when one is done with a Popen object in Jython?
Edit: Just in case it makes a difference, this is a multithreaded program, so there are several Popen processes running simultaneously.
This only answers part of your question, but my understanding is that, when you spawn a new process, it normally inherits all the handles of the parent process. That includes such things as open files and sockets that you're listening on.
On UNIX, that's a side-effect of using 'fork', which duplicates the current process and all of its handles before loading the new executable. On Windows it's more explicit, but Python does it anyway, to try to match the behavior across platforms as much as possible.
The close_fds option, when True, closes all these inherited handles after spawning the subprocess, so the new executable starts with a clean slate. But if your subprocesses are run one at a time, and terminating when they're done, then this shouldn't be the problem.
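Since Jython relies on JVM garbage collection rather than CPython's reference counting, explicitly closing each Popen's pipes once you are done with it is a reasonable precaution. A minimal sketch, assuming a POSIX platform and a subprocess implementation that honours close_fds; "some_command" and work_items are placeholders:

from subprocess import PIPE, Popen

for chunk in work_items:  # placeholder for your several hundred iterations
    p = Popen(["some_command"], stdin=PIPE, stdout=PIPE, stderr=PIPE,
              close_fds=True)
    out, err = p.communicate(chunk)  # waits for the child and drains the pipes
    # Release the parent's ends of the pipes without waiting for the GC.
    for f in (p.stdin, p.stdout, p.stderr):
        if f is not None:
            f.close()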
Is there an easy way write to a file asynchronously in Python?
I know the file I/O that comes with Python is blocking, which is fine in most cases. For this particular case, I need writes not to block the application at all, or at least as little as possible.
As I understand things, asynchronous I/O is not quite the same as non-blocking I/O.
In the case of non-blocking I/O, once a file descriptor is setup to be "non-blocking", a read() system call (for instance) will return EWOULDBLOCK (or EAGAIN) if the read operation would block the calling process in order to complete the operation. The system calls select(), poll(), epoll(), etc. are provided so that the process can ask the OS to be told when one or more file descriptors become available for performing some I/O operation.
Asynchronous I/O operates by queuing a request for I/O on the file descriptor, tracked independently of the calling process. For a file descriptor that supports asynchronous I/O (typically raw disk devices), a process can call aio_read() (for instance) to request that a number of bytes be read from the file descriptor. The system call returns immediately, whether or not the I/O has completed. Some time later, the process polls the operating system for the completion of the I/O (that is, the buffer is filled with data).
A process (single-threaded) that only performs non-blocking I/O will be able to read or write from one file descriptor that is ready for I/O when another is not ready. But the process must still synchronously issue the system calls to perform the I/O to all the ready file descriptors. Whereas, in the asynchronous I/O case, the process is just checking for the completion of the I/O (buffer filled with data). With asynchronous I/O, the OS has the freedom to operate in parallel as much as possible to service the I/O, if it so chooses.
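To make the non-blocking half concrete, a minimal sketch using select() on a pipe; a pipe is used purely for illustration, since regular files on a local disk always report ready to select():

import fcntl
import os
import select

r, w = os.pipe()
flags = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, flags | os.O_NONBLOCK)

os.write(w, b"hello")

# Ask the OS to say when the read end is ready, then read without blocking.
readable, _, _ = select.select([r], [], [], 1.0)
if r in readable:
    print(os.read(r, 1024))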
With that, are there any wrappers for the POSIX aio_read/write etc. system calls for Python?
Twisted has non-blocking writes on file descriptors. If you're writing async code, I'd expect you to be using Twisted anyway. :)
You can try to use Thread:
from threading import Thread

# list_file is a list of already-open file objects; data is what to write.
for file in list_file:
    tr = Thread(target=file.write, args=(data,))
    tr.start()
This is more or less pseudocode, but I hope you get the idea. Be aware that the threads here are never joined.
In my experience it worked well, although the interpreter keeps running for some time after the main script has finished (you need to use join()), so the speed gain is not as big as it seems.
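A variant of the snippet above that does wait for the writers (with the same assumed list_file and data) might look like:

from threading import Thread

threads = [Thread(target=f.write, args=(data,)) for f in list_file]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until every write has finished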
I'm developing aio.h bindings for Python: pyaio
It runs on Linux only.
Python 3 seems to have such functionality. See PEP 3116.