Asynchronous file writing possible in python?

Is there an easy way to write to a file asynchronously in Python?
I know the file I/O that comes with Python is blocking, which is fine in most cases. For this particular case, I need writes not to block the application at all, or at least as little as possible.

As I understand things, asynchronous I/O is not quite the same as non-blocking I/O.
In the case of non-blocking I/O, once a file descriptor is set up to be "non-blocking", a read() system call (for instance) will return EWOULDBLOCK (or EAGAIN) if the read operation would block the calling process in order to complete the operation. The system calls select(), poll(), epoll(), etc. are provided so that the process can ask the OS to be told when one or more file descriptors become available for performing some I/O operation.
Asynchronous I/O operates by queuing a request for I/O to the file descriptor, tracked independently of the calling process. For a file descriptor that supports asynchronous I/O (typically raw disk devices), a process can call aio_read() (for instance) to request that a number of bytes be read from the file descriptor. The system call returns immediately, whether or not the I/O has completed. Some time later, the process polls the operating system for the completion of the I/O (that is, the buffer is filled with data).
A process (single-threaded) that only performs non-blocking I/O will be able to read or write from one file descriptor that is ready for I/O when another is not ready. But the process must still synchronously issue the system calls to perform the I/O to all the ready file descriptors. Whereas, in the asynchronous I/O case, the process is just checking for the completion of the I/O (buffer filled with data). With asynchronous I/O, the OS has the freedom to operate in parallel as much as possible to service the I/O, if it so chooses.
With that, are there any wrappers for the POSIX aio_read/write etc. system calls for Python?
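
To make the non-blocking model described above concrete, here is a minimal sketch. It assumes a Unix-like system and Python 3.5+ (for os.set_blocking); a pipe stands in for any file descriptor, and the variable names are only illustrative.

```python
import os
import select

# A pipe stands in for any descriptor that may not be ready yet.
read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)

# Nothing has been written yet, so a non-blocking read fails right away
# with EAGAIN/EWOULDBLOCK, which Python surfaces as BlockingIOError.
try:
    os.read(read_fd, 1024)
    would_block = False
except BlockingIOError:
    would_block = True

# Make data available, then ask the OS which descriptors are ready.
os.write(write_fd, b"hello")
ready, _, _ = select.select([read_fd], [], [], 1.0)

# The process still has to issue the read itself once the fd is ready.
data = os.read(read_fd, 1024) if read_fd in ready else b""

os.close(read_fd)
os.close(write_fd)
```

Note that the final read is still a synchronous system call issued by the process; that is the difference from the aio_read() style, where the OS fills the buffer on its own.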

Twisted has non-blocking writes on file descriptors. If you're writing async code, I'd expect you to be using twisted, anyway. :)

You can try to use Thread:
from threading import Thread
for file in list_file:
    tr = Thread(target=file.write, args=(data,))
    tr.start()
This is more or less pseudocode, but I hope you get the idea. Be aware that the threads here are never joined.
In my experience it worked well, although the interpreter keeps running for a while after the main script has finished (you need to call join() on each thread), so the speed gain is not as big as it seems.
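
A slightly fuller sketch of the same idea, keeping the thread handles so they can be joined. The helper write_file, the temporary directory, and the file names are made up for illustration; each thread opens its own file rather than sharing an already-open file object.

```python
import os
import tempfile
from threading import Thread

def write_file(path, data):
    # Each thread opens, writes, and closes its own file.
    with open(path, "w") as f:
        f.write(data)

def read_file(path):
    with open(path) as f:
        return f.read()

tmp_dir = tempfile.mkdtemp()
paths = [os.path.join(tmp_dir, "out_%d.txt" % i) for i in range(3)]
data = "some payload\n"

threads = [Thread(target=write_file, args=(p, data)) for p in paths]
for t in threads:
    t.start()

# ... the main thread is free to do other work here ...

# Wait for every write to finish before relying on the files.
for t in threads:
    t.join()

contents = [read_file(p) for p in paths]
```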

I'm developing aio.h bindings for Python: pyaio
It runs on Linux only.

Python 3 seems to have such functionality. See PEP 3116.
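
As far as I know, asyncio (Python 3.4+) does not itself offer asynchronous file operations, so a common pattern is to hand the blocking write to a worker thread and await the result. A minimal sketch, assuming Python 3.7+; blocking_write and the output path are illustrative names:

```python
import asyncio
import os
import tempfile

def blocking_write(path, data):
    # Ordinary blocking file I/O, run in a worker thread.
    with open(path, "w") as f:
        f.write(data)
    return len(data)

async def main(path):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    pending = loop.run_in_executor(None, blocking_write, path, "hello\n")
    # The event loop stays free for other coroutines while the write runs.
    await asyncio.sleep(0)
    return await pending

path = os.path.join(tempfile.mkdtemp(), "async_out.txt")
written = asyncio.run(main(path))
```

The write itself is still blocking; it just no longer blocks the thread that runs the event loop.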

Related

Is there anyway to terminate a running function from a thread?

I've recently been trying to write my own socket server in Python.
While I was writing a thread to handle server commands (a sort of command line in the server), I tried to implement code that restarts the server when raw_input() receives a specific command.
Basically, I want to restart the server as soon as the "Running" variable changes its state from True to False. When it does, I would like to stop the function that started the thread (getting back to the main function) and then run it again. Is there a way to do it?
Thank you very much, and I hope I was clear about my problem,
Idan :)

Communication between threads can be done with Events, Queues, Semaphores, etc. Check them out and choose the one, that fits your problem best.
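
For instance, the "Running" flag from the question maps naturally onto a threading.Event. This is only a sketch: server_loop is a stand-in for the real accept/handle loop, and the names are made up.

```python
import threading
import time

stop_requested = threading.Event()
ticks = []

def server_loop():
    # Stand-in for the real server loop: check the flag regularly.
    while not stop_requested.is_set():
        ticks.append(time.monotonic())
        # wait() doubles as a sleep that ends early when the flag is set.
        stop_requested.wait(0.01)

worker = threading.Thread(target=server_loop)
worker.start()

time.sleep(0.05)
stop_requested.set()      # e.g. called by the command-handling thread
worker.join(timeout=2)
stopped = not worker.is_alive()
```

This only works if the loop gets back to the check regularly; a thread stuck in a blocking recv() will not notice the flag.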

You can't abort a thread, or raise an exception into it asynchronously, in Python.
The standard Unix solution to this problem is to use a non-blocking socket, create a pipe with os.pipe, replace all your blocking sock.recv calls with a blocking r, _, _ = select.select([sock, pipe], [], []), and then have the other thread write to the pipe to wake up the waiting thread.
To make this portable to Windows you'll need to create a UDP localhost socket instead of a pipe, which makes things slightly more complicated, but it's still not hard.
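
A rough sketch of the pipe approach, assuming a Unix-like system (select does not accept pipes on Windows). socket.socketpair() stands in for a real client connection, and receive_until_stopped is a made-up name:

```python
import os
import select
import socket
import threading

def receive_until_stopped(sock, stop_fd, events):
    while True:
        # Block until either the socket or the stop pipe is readable.
        readable, _, _ = select.select([sock, stop_fd], [], [])
        if stop_fd in readable:
            events.append("stopped")
            return
        data = sock.recv(1024)
        if not data:
            events.append("peer closed")
            return
        events.append(data)

server_side, client_side = socket.socketpair()
server_side.setblocking(False)
stop_r, stop_w = os.pipe()
events = []

worker = threading.Thread(
    target=receive_until_stopped, args=(server_side, stop_r, events)
)
worker.start()

# Any other thread can wake the worker by writing a byte to the pipe.
os.write(stop_w, b"x")
worker.join(timeout=2)

for fd in (stop_r, stop_w):
    os.close(fd)
server_side.close()
client_side.close()
```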
Or, of course, you can use a higher-level framework, like asyncio in 3.4+, or twisted or another third-party lib, which will wrap this up for you. (Most of them are already running the equivalent of a loop around select to service lots of clients in one thread or a small thread pool, so it's trivial to toss in a stop pipe.)
Are there other alternatives? Yes, but all less portable and less good in a variety of other ways.
Most platforms have a way to asynchronously kill or signal another thread, which you can access via, e.g., ctypes. But this is a bad idea, because it will prevent Python from doing any normal cleanup. Even if you don't get a segfault, this could mean files never get flushed and end up with incomplete/garbage data, locks are left acquired to deadlock your program somewhere completely unrelated a short time later, memory gets leaked, etc.
If you're specifically trying to interrupt the main thread, and you only care about CPython on Unix, you can use a signal handler and the kill function. The signal will take effect on the next Python bytecode, and if the interpreter is blocked on any kind of I/O (or most other syscalls, e.g., inside a sleep), the system will return to the interpreter with an EINTR, allowing it to interrupt immediately. If the interpreter is blocked on something else, like a call to a C library that blocks signals or just does nothing but CPU work for 30 seconds, then you'll have to wait 30 seconds (although that doesn't come up that often, and you should know if it will in your case). Also, threads and signals don't play nice on some older *nix platforms. And signals don't work the same way on Windows, or in some other Python implementations like Jython.
On some platforms (including Windows, but not most modern *nix platforms), you can wake up a blocking socket call just by closing the socket out from under the waiting thread. On other platforms, this will not unblock the thread, or will do it sometimes but not other times (and theoretically it could even segfault your program or leave the socket library in an unusable state, although I don't think either of those will happen on any modern platform).

As far as I understand the documentation, and from some experiments I've run over the last few weeks, there is no way to really force another thread to 'stop' or 'abort', unless the function is aware of the possibility of being stopped and has a foolproof method of avoiding getting stuck in some of the I/O functions. In that case you can use some communication method such as semaphores. The only exception is the specialized threading.Timer class, which has a cancel() method.
So, if you really want to stop the server thread forcefully, you might want to think about running it in a separate process, not a thread.
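
A minimal sketch of the separate-process idea, using subprocess. The child here is just a stand-in that sleeps; in practice it would be the server script.

```python
import subprocess
import sys

# Hypothetical stand-in for the server: a child that would run for a minute.
child_code = "import time; time.sleep(60)"
server = subprocess.Popen([sys.executable, "-c", child_code])

# Unlike a thread, a process can be stopped from the outside.
server.terminate()
exit_code = server.wait(timeout=10)
```

Restarting is then a matter of calling Popen again with the same arguments.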
EDIT: I'm not sure why you want to restart the server - I just thought it was in case of a failure. Normal procedure in a server is to loop waiting for connections on the socket, and when a connection appears, attend it and return to that loop.
A better way is to use the GIO library (part of GLib) and connect handler methods to the connection event, so that each connection is attended to asynchronously. This avoids the loop completely. I don't have any real code for this in Python, but here's an example of a client in Python (which uses GIO for reception events) and a server in C, which uses GIO for connections.
Use of GIO makes life so much easier...

What happens if two python scripts want to write in the same file?

I have a pipeline which at some point splits work into various sub-processes that do the same thing in parallel. Thus their output should go into the same file.
Is it too risky to say all of those processes should write into the same file? Or does python try and retry if it sees that this resource is occupied?

This is system dependent. On Windows, the resource is locked and you get an exception. On Linux, two processes can write to the same file, but the written data may end up interleaved.
Ideally in such cases you should use semaphores to synchronize access to shared resources.
If using semaphores is too heavy for your needs, then the only alternative is to write in separate files...
Edit: As pointed out by eye in a later post, a resource manager is another alternative to handle concurrent writers
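
One way to serialize writers across processes, as a sketch: this swaps the semaphore for an advisory file lock (fcntl.flock), which is Unix-only and only helps if every writer goes through the same locking code. append_record and the file name are made up for illustration.

```python
import fcntl
import os
import tempfile

def append_record(path, text):
    with open(path, "a") as f:
        # Blocks until no other process holds the lock on this file.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(text)
            f.flush()   # push the data out before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

path = os.path.join(tempfile.mkdtemp(), "shared.log")
for i in range(3):
    append_record(path, "record %d\n" % i)

with open(path) as f:
    lines = f.read().splitlines()
```

Each sub-process would call append_record with the same path; the calls above run in a single process only to keep the example short.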

In general, this is not a good idea and will take a lot of care to get right. Since the writes will have to be serialized, it might also adversely affect scalability.
I'd recommend writing to separate files and merging (or just leaving them as separate files).

A better solution is to implement a resource manager (writer) to avoid opening the same file twice. This manager could use threading synchronization mechanisms (threading.Lock) to avoid simultaneous access on some platforms.

How about having all of the different processes write their output into a queue, and have a single process that reads that queue, and writes to the file?

Use multiprocessing.Lock() instead of threading.Lock(). A word of caution: this might slow down your concurrent processing, because each process has to wait for the lock to be released.

Twisted: Making code non-blocking

I'm a bit puzzled about how to write asynchronous code in Python/Twisted. Suppose (for argument's sake) I am exposing a function to the world that will take a number and return True/False if it is prime/non-prime, so it looks vaguely like this:
def IsPrime(numberin):
    for n in range(2, numberin):
        if numberin % n == 0:
            return False
    return True
(just to illustrate).
Now let's say there is a webserver which needs to call IsPrime based on a submitted value. This will take a long time for large numberin.
If in the meantime another user asks for the primality of a small number, is there a way to run the two function calls asynchronously using the reactor/deferreds architecture so that the result of the short calc gets returned before the result of the long calc?
I understand how to do this if the IsPrime functionality came from some other webserver to which my webserver would do a deferred getPage, but what if it's just a local function?
i.e., can Twisted somehow time-share between the two calls to IsPrime, or would that require an explicit invocation of a new thread?
Or, would the IsPrime loop need to be chunked into a series of smaller loops so that control can be passed back to the reactor rapidly?
Or something else?

I think your current understanding is basically correct. Twisted is just a Python library and the Python code you write to use it executes normally as you would expect Python code to: if you have only a single thread (and a single process), then only one thing happens at a time. Almost no APIs provided by Twisted create new threads or processes, so in the normal course of things your code runs sequentially; IsPrime cannot execute a second time until after it has finished executing the first time.
Still considering just a single thread (and a single process), all of the "concurrency" or "parallelism" of Twisted comes from the fact that instead of doing blocking network I/O (and certain other blocking operations), Twisted provides tools for performing the operation in a non-blocking way. This lets your program continue on to perform other work when it might otherwise have been stuck doing nothing waiting for a blocking I/O operation (such as reading from or writing to a socket) to complete.
It is possible to make things "asynchronous" by splitting them into small chunks and letting event handlers run in between these chunks. This is sometimes a useful approach, if the transformation doesn't make the code too much more difficult to understand and maintain. Twisted provides a helper for scheduling these chunks of work, cooperate. It is beneficial to use this helper since it can make scheduling decisions based on all of the different sources of work and ensure that there is time left over to service event sources without significant additional latency (in other words, the more jobs you add to it, the less time each job will get, so that the reactor can keep doing its job).
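
To illustrate the chunking idea without pulling in Twisted, here is a plain-Python sketch. It is not Twisted's cooperate API; run_round_robin is a made-up stand-in for the scheduler, and is_prime_steps is a generator version of the question's IsPrime that yields control after every chunk of trial divisions.

```python
from collections import deque

def is_prime_steps(n, chunk=1000):
    # Generator: pauses after every `chunk` trial divisions; the final
    # answer is delivered as the generator's return value.
    if n < 2:
        return False
    for count, divisor in enumerate(range(2, n), 1):
        if n % divisor == 0:
            return False
        if count % chunk == 0:
            yield
    return True

def run_round_robin(jobs):
    # Give each job one chunk of work in turn until all have finished.
    pending = deque(jobs)
    finished = []
    while pending:
        name, job = pending.popleft()
        try:
            next(job)
        except StopIteration as stop:
            finished.append((name, stop.value))
        else:
            pending.append((name, job))
    return finished

order = run_round_robin([
    ("long", is_prime_steps(104729)),
    ("short", is_prime_steps(97)),
])
```

Even though the long job was submitted first, the short one finishes first, because the long job gives up control after its first chunk.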
Twisted does also provide several APIs for dealing with threads and processes. These can be useful if it is not obvious how to break a job into chunks. You can use deferToThread to run a (thread-safe!) function in a thread pool. Conveniently, this API returns a Deferred which will eventually fire with the return value of the function (or with a Failure if the function raises an exception). These Deferreds look like any other, and as far as the code using them is concerned, it could just as well come back from a call like getPage - a function that uses no extra threads, just non-blocking I/O and event handlers.
Since Python isn't ideally suited for running multiple CPU-bound threads in a single process, Twisted also provides a non-blocking API for launching and communicating with child processes. You can offload calculations to such processes to take advantage of additional CPUs or cores without worrying about the GIL slowing you down, something that neither the chunking strategy nor the threading approach offers. The lowest level API for dealing with such processes is reactor.spawnProcess. There is also Ampoule, a package which will manage a process pool for you and provides an analog to deferToThread for processes, deferToAMPProcess.

Jython: subprocess.Popen runs out of file descriptors

I'm using the Jython 2.51 implementation of Python to write a script that repeatedly invokes another process via subprocess.Popen and uses PIPE to pipe stdout and stderr to the parent process and stdin to the child process. After several hundred loop iterations, I seem to run out of file descriptors.
The Python subprocess documentation mentions very little about freeing file descriptors, other than the close_fds option, which isn't described very clearly (Why should there be any file descriptors besides 0, 1 and 2 open in the first place?). I'm assuming that in CPython, reference counting takes care of the resource freeing issue. What's the proper way to make sure all descriptors get freed when one is done with a Popen object in Jython?
Edit: Just in case it makes a difference, this is a multithreaded program, so there are several Popen processes running simultaneously.

This only answers part of your question, but my understanding is that, when you spawn a new process, it normally inherits all the handles of the parent process. That includes such things as open files and sockets that you're listening on.
On UNIX, that's a side-effect of using 'fork', which duplicates the current process and all of its handles before loading the new executable. On Windows it's more explicit, but Python does it anyway, to try to match the behavior across platforms as much as possible.
The close_fds option, when True, closes all these inherited handles after spawning the subprocess, so the new executable starts with a clean slate. But if your subprocesses are run one at a time, and terminating when they're done, then this shouldn't be the problem.
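
As for releasing the pipe descriptors themselves, here is a sketch for CPython 3, where Popen works as a context manager that closes the pipes and waits for the child on exit. On older versions without context-manager support, the equivalent should be to call communicate(), or to close each pipe and call wait() explicitly. The child command is just a stand-in.

```python
import subprocess
import sys

# Stand-in child: echoes its stdin back in upper case.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

with subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
) as proc:
    # communicate() sends the input, reads both outputs to EOF, and waits.
    out, err = proc.communicate(b"hello")

# After the with block, all three pipe objects have been closed.
pipes_closed = proc.stdin.closed and proc.stdout.closed and proc.stderr.closed
```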

Python multiple threads accessing same file

I have two threads, one which writes to a file, and another which periodically moves the file to a different location. The writer always calls open before writing a message, and calls close after writing the message. The mover uses shutil.move to do the move.
I see that after the first move is done, the writer cannot write to the file anymore, i.e. the size of the file is always 0 after the first move. Am I doing something wrong?

Locking is a possible solution, but I prefer the general architecture of having each external resource (including a file) dealt with by a single, separate thread. Other threads send work requests to the dedicated thread on a Queue.Queue instance (queue.Queue in Python 3), and provide a separate queue of their own as part of the work request's parameters if they need a result back. The dedicated thread spends most of its time waiting on a .get on that queue; whenever it gets a request, it executes it (and returns results on the passed-in queue if needed).
I've provided detailed examples of this approach e.g. in "Python in a Nutshell". Python's Queue is intrinsically thread-safe and simplifies your life enormously.
Among the advantages of this architecture is that it translates smoothly to multiprocessing if and when you decide to switch some work to a separate process instead of a separate thread (e.g. to take advantage of multiple cores) -- multiprocessing provides its own workalike Queue type to make such a transition smooth as silk;-).
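
A minimal sketch of that architecture in Python 3 (where the module is named queue). The names file_worker and produce, the None sentinel, and the file path are choices made for this example only.

```python
import os
import queue
import tempfile
import threading

def file_worker(path, requests):
    # Only this thread ever touches the file.
    while True:
        message = requests.get()
        if message is None:      # sentinel: shut down
            break
        with open(path, "a") as f:
            f.write(message)

def produce(name, requests):
    for i in range(3):
        requests.put("%s %d\n" % (name, i))

path = os.path.join(tempfile.mkdtemp(), "messages.log")
requests = queue.Queue()
worker = threading.Thread(target=file_worker, args=(path, requests))
worker.start()

producers = [
    threading.Thread(target=produce, args=(name, requests))
    for name in ("a", "b")
]
for p in producers:
    p.start()
for p in producers:
    p.join()

requests.put(None)   # all work is queued; ask the worker to stop
worker.join()

with open(path) as f:
    lines = f.read().splitlines()
```

A move request could be sent over the same queue, so the writer and the mover never touch the file at the same time.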

When two threads access the same resources, weird things happen. To avoid that, always lock the resource. Python has the convenient threading.Lock for that, as well as some other tools (see documentation of the threading module).
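
Applied to the writer/mover case from the question, a sketch might look like this. The paths and function names are made up; both operations take the same lock, so a move can never happen in the middle of a write.

```python
import os
import shutil
import tempfile
import threading

file_lock = threading.Lock()
work_dir = tempfile.mkdtemp()
live_path = os.path.join(work_dir, "messages.log")
archive_path = os.path.join(work_dir, "messages.archived.log")

def write_message(text):
    # Open, write, and close while holding the lock.
    with file_lock:
        with open(live_path, "a") as f:
            f.write(text)

def move_file():
    with file_lock:
        if os.path.exists(live_path):
            shutil.move(live_path, archive_path)

write_message("first\n")
mover = threading.Thread(target=move_file)
mover.start()
mover.join()
write_message("second\n")   # recreates the live file after the move

with open(archive_path) as f:
    archived = f.read()
with open(live_path) as f:
    live = f.read()
```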

Check out http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/
You can use a simple lock with his code, as written by Evan Fosmark in an older StackOverflow question:
from filelock import FileLock
with FileLock("myfile.txt"):
    # work with the file as it is now locked
    print("Lock acquired.")
One of the more elegant libraries I've ever seen.
