My application that uses websockets also makes use of several third-party Python modules that appear to be written in a way that blocks the rest of the application when called. For example, I use xlrd to parse Excel files a user has uploaded.
I've monkey patched the builtins like this in the first lines of the application:
import os
import eventlet

if os.name == 'nt':
    eventlet.monkey_patch(os=False)
else:
    eventlet.monkey_patch()
Then I use the following to start the task that contains calls to xlrd.
socketio.start_background_task(my_background_task)
What is the appropriate way to now call these other modules so that my application runs smoothly? Is the multiprocessing module to start another process within the greened thread the right way?
First, you should try the thread pool [1].
If that doesn't work as well as you want, please open an issue [2] and use multiprocessing as a workaround.
eventlet.tpool.execute(xlrd_read, file_path, other=arg)
Execute meth in a Python thread, blocking the current coroutine/greenthread until the method completes.
The primary use case for this is to wrap an object or module that is not amenable to monkeypatching or any of the other tricks that Eventlet uses to achieve cooperative yielding. With tpool, you can force such objects to cooperate with green threads by sticking them in native threads, at the cost of some overhead.
[1] http://eventlet.net/doc/threading.html
[2] https://github.com/eventlet/eventlet/issues
Related
Am I correct in understanding that if I use the default worker type (sync), then if the app blocks for any reason, say while waiting for the result of a database query, the associated worker process will not be able to handle any further requests during this time?
I am looking for a model which doesn't require too much special coding in my app code. I understand there are two async worker types, gevent and gthread, which can solve this problem. What is the difference between these two, and does my app need to be thread-safe to use them?
UPDATE - I did some reading on gevent. It seems it works by monkey patching standard library functions, so I would think that in the case of a database query it probably wouldn't patch whatever db library I am using, and I would need to program my app to cooperatively yield control while waiting on the database. Is this correct?
If you use threads, you must write your application to behave well, e.g. by always using locks to coordinate access to shared resources.
If you use events (e.g. gevent), then you generally don't need to worry about accessing shared resources, because your application is effectively single-threaded.
To answer your second question: if you use a pure python library to access your database, then gevent's monkey patching should successfully render that library nonblocking, which is what you want. But if you use a C library wrapped in Python, then monkey patching is of no use and your application will block when accessing the database.
I'm into threads now and exploring thread and threading libraries. When I started with them, I wrote 2 basic programs. The following are the 2 programs with their corresponding outputs:
threading_1.py :
import threading

def main():
    t1 = threading.Thread(target=prints, args=(3,))
    t2 = threading.Thread(target=prints, args=(5,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

def prints(i):
    while i > 0:
        print "i=" + str(i) + "\n"
        i = i - 1

if __name__ == '__main__':
    main()
output :
i=3
i=2
i=5
i=4
i=1
i=3
i=2
i=1
thread_1.py
import thread
import threading

def main():
    t1 = thread.start_new_thread(prints, (3,))
    t2 = thread.start_new_thread(prints, (5,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

def prints(i):
    while i > 0:
        print "i=" + str(i) + "\n"
        i = i - 1

if __name__ == '__main__':
    main()
output :
Traceback (most recent call last):
i=3
File "thread_1.py", line 19, in <module>
i=2
i=1
main()
i=5
i=4
i=3
i=2
i=1
File "thread_1.py", line 8, in main
t1.start()
AttributeError: 'int' object has no attribute 'start'
My desired output is as in threading_1.py, where the interleaved prints make it a convincing example of threads executing. My understanding is that "threading" is a higher-level library compared to "thread". And the AttributeError I get in thread_1.py is because I am operating on a thread started from the thread library and not threading.
So, now my question is - how do I achieve an output similar to the output of threading_1.py using thread_1.py. Can the program be modified or tuned to produce the same result?
Short answer: ignore the thread module and just use threading.
The thread and threading modules serve quite different purposes. The thread module is a low-level module written in C, designed to abstract away platform differences and provide a minimal cross-platform set of primitives (essentially, threads and simple locks) that can serve as a foundation for higher-level APIs. If you were porting Python to a new platform that didn't support existing threading APIs (like POSIX threads, for example), then you'd have to edit the thread module source so that you could wrap the appropriate OS-level calls to provide those same primitives on your new platform.
As an example, if you look at the current CPython implementation, you'll see that a Python Lock is based on unnamed POSIX semaphores on Linux, on a combination of a POSIX condition variable and a POSIX mutex on OS X (which doesn't support unnamed semaphores), and on an Event and a collection of Windows-specific library calls providing various atomic operations on Windows. As a Python user, you don't want to have to care about those details. The thread module provides the abstraction layer that lets you build higher-level code without worrying about platform-level details.
As such, the thread module is really there as a convenience for those developing Python, rather than for those using it: it's not something that normal Python users are expected to need to deal with. For that reason, the module has been renamed to _thread in Python 3: the leading underscore indicates that it's private, and that users shouldn't rely on its API or behaviour going forward.
In contrast, the threading-module is a Java-inspired module written in Python. It builds on the foundations laid by the thread module to provide a convenient API for starting and joining threads, and a broad set of concurrency primitives (re-entrant locks, events, condition variables, semaphores, barriers and so on) for users. This is almost always the module that you as a Python user want to be using. If you're interested in what's going on behind the scenes, it's worth taking some time to look at the threading source: you can see how the threading module pulls in the primitives it needs from the thread module and puts everything together to provide that higher-level API.
Note that there are different tradeoffs here, from the perspective of the Python core developers. On the one hand, it should be easy to port Python to a new platform, so the thread module should be small: you should only have to implement a few basic primitives to get up and running on your new platform. In contrast, Python users want a wide variety of concurrency primitives, so the threading library needs to be extensive to support the needs of those users. Splitting the threading functionality into two separate layers is a good way of providing what the users need while not making it unnecessarily hard to maintain Python on a variety of platforms.
To answer your specific question: if you must use the thread library directly (despite all I've said above), you can do this:
import thread
import time

def main():
    t1 = thread.start_new_thread(prints, (3,))
    t2 = thread.start_new_thread(prints, (5,))

def prints(i):
    while i > 0:
        print "i=" + str(i) + "\n"
        i = i - 1

if __name__ == '__main__':
    main()
    # Give time for the output to show up.
    time.sleep(1.0)
But of course, using a time.sleep is a pretty shoddy way of handling things in the main thread: really, we want to wait until both child threads have done their job before exiting. So we'd need to build some functionality where the main thread can wait for child threads. That functionality doesn't exist directly in the thread module, but it does in threading: that's exactly the point of the threading module: it provides a rich, easy-to-use API in place of the minimal, hard-to-use thread API. So we're back to the summary line: don't use thread, use threading.
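If you really do want join-like behaviour on top of the low-level module, the usual trick is to pair each child thread with a lock that the child releases when it finishes. Here is a minimal sketch using Python 3's _thread (the renamed module mentioned above); the function and variable names are illustrative:

```python
# Hand-rolled "join" on top of the low-level thread module (_thread in
# Python 3): each child releases a lock when it finishes, and the main
# thread blocks acquiring those locks.
import _thread

def prints(i, done):
    try:
        while i > 0:
            print("i=" + str(i))
            i = i - 1
    finally:
        done.release()  # signal completion to the main thread

def main():
    locks = []
    for n in (3, 5):
        done = _thread.allocate_lock()
        done.acquire()  # held until the corresponding child releases it
        locks.append(done)
        _thread.start_new_thread(prints, (n, done))
    for done in locks:
        done.acquire()  # blocks until that child has finished

if __name__ == '__main__':
    main()
```

This is precisely the kind of machinery that threading.Thread.join packages up for you, which is why the practical advice remains: use threading.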
I am using the multiprocessing package to spawn multiple processes that execute a function, say func (with different arguments). func imports the numpy package, and I was wondering if every process would import the package. In fact, the main thread, or rather the main process, also imports numpy, and that could easily be shared between the different processes executing func.
There would be a major performance hit due to multiple imports of a library.
I was wondering if every process would import the package.
Assuming the import occurs after you've forked the process, then, yes. You could avoid this by doing the import before the fork, though.
There would be a major performance hit due to multiple imports of a library.
Well, there would be a performance hit if you do the import after the fork, but probably not a "major" one. The OS would most likely have all the necessary files in its cache, so it would only be reading from RAM, not disk.
Update
Just noticed this...
In fact, the main thread, or rather main process also imports numpy...
If you're already importing numpy before forking, then the imports in the subprocesses will only create a reference to the existing imported module. This should take less than a millisecond, so I wouldn't worry about it.
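This is easy to observe in plain Python: a second import of an already-loaded module is just a lookup in sys.modules, which is exactly what happens in a forked child whose parent already did the import. A small sketch, using json as a stand-in for numpy:

```python
# A repeated import does not re-execute the module: it is just a
# dictionary lookup in sys.modules, so "re-imports" of an
# already-loaded module (e.g. in a forked child) are near-free.
import sys
import time

import json  # first import: actually loads and executes the module
assert 'json' in sys.modules

t0 = time.perf_counter()
import json  # second import: only a sys.modules lookup
elapsed = time.perf_counter() - t0
print("re-import took %.6f seconds" % elapsed)
```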
The answer to that question is in the documentation of the multiprocessing library: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
The summary is that it depends on which start method you choose. There are three methods available (the default is fork on Unix and spawn on Windows and macOS):
spawn: The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method.
fork: The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process.
forkserver: When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process.
You must set the start method if you want to change it. Example:
import multiprocessing as mp
mp.set_start_method('spawn')
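As a sketch of what that looks like in practice: set_start_method may be called at most once per program, so mp.get_context is the alternative when you need different start methods side by side (note that the 'fork' method is not available on Windows):

```python
# Choosing the start method explicitly. set_start_method sets the
# global default (once per program); get_context scopes the choice
# to a single context object instead.
import multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('spawn')
    print(mp.get_start_method())      # -> spawn

    ctx = mp.get_context('fork')      # a separate, fork-based context
    print(ctx.get_start_method())     # -> fork
```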
My understanding is that once I have called gevent.monkey.patch_all(), the standard threading module is modified to use greenlets instead of python threads. So if I write my application in terms of python threads, locks, semaphores etc, and then call patch_all, am I getting the full benefit of gevent, or am I losing out on something compared with using the explicit gevent equivalents?
The motivation behind this question is that I am writing a module which uses some threads/greenlets, and I am deciding whether it is useful to have an explicit switch between using gevent and using threading, or whether I can just use threading+patch_all without losing anything.
To put it in code, is this...
from gevent import Greenlet

def myfunction():
    print 'ohai'

Greenlet.spawn(myfunction)
...any different to this?
import gevent.monkey
gevent.monkey.patch_all()

import threading

class mythread(threading.Thread):
    def run(self):
        print 'ohai'

mythread().start()
At least you will lose some of the greenlet-specific methods: link, kill, join, etc.
Also, you can't use threads with, for example, the gevent.pool module, which can be very useful.
And there is a little extra overhead for creating a Thread object.
I have two threads: one writes to a file, and another periodically moves the file to a different location. The writer always calls open before writing a message and calls close after writing it. The mover uses shutil.move to do the move.
I see that after the first move is done, the writer cannot write to the file anymore, i.e. the size of the file is always 0 after the first move. Am I doing something wrong?
Locking is a possible solution, but I prefer the general architecture of having each external resource (including a file) dealt with by a single, separate thread. Other threads send work requests to the dedicated thread on a Queue.Queue instance (and provide a separate queue of their own as part of the work request's parameters if they need results back). The dedicated thread spends most of its time waiting on a .get on that queue, and whenever it gets a request it executes it (and returns results on the passed-in queue if needed).
I've provided detailed examples of this approach e.g. in "Python in a Nutshell". Python's Queue is intrinsically thread-safe and simplifies your life enormously.
Among the advantages of this architecture is that it translates smoothly to multiprocessing if and when you decide to switch some work to a separate process instead of a separate thread (e.g. to take advantage of multiple cores) -- multiprocessing provides its own workalike Queue type to make such a transition smooth as silk;-).
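A minimal sketch of this dedicated-thread pattern, using Python 3's queue module; the file name, sentinel value, and message format are illustrative:

```python
# One thread owns the file; all writes arrive as requests on a Queue.
# Other threads never touch the file directly, so no locking is needed.
import queue
import threading

requests = queue.Queue()

def file_owner(path):
    while True:
        msg = requests.get()
        if msg is None:            # sentinel: shut the owner thread down
            break
        with open(path, 'a') as f:
            f.write(msg + '\n')

owner = threading.Thread(target=file_owner, args=('log.txt',))
owner.start()

# Any thread may now submit work without touching the file itself.
requests.put('first message')
requests.put('second message')
requests.put(None)                 # ask the owner to exit
owner.join()
```

In the question's setup, the mover could submit a "move the file now" request on the same queue, so writes and moves are serialized by the owner thread rather than racing each other.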
When two threads access the same resources, weird things happen. To avoid that, always lock the resource. Python has the convenient threading.Lock for that, as well as some other tools (see documentation of the threading module).
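For example, here is a minimal sketch of threading.Lock guarding a shared counter; without the `with lock:` block, concurrent `+=` updates can be lost, because the read-modify-write is not atomic:

```python
# Guarding a shared resource with threading.Lock: only one thread at a
# time may execute the critical section, so no updates are lost.
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:                 # acquire/release around the update
            counter += 1

threads = [threading.Thread(target=bump, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                     # -> 400000 with the lock in place
```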
Check out http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/
You can use a simple lock with his code, as written by Evan Fosmark in an older StackOverflow question:
from filelock import FileLock
with FileLock("myfile.txt"):
# work with the file as it is now locked
print("Lock acquired.")
One of the more elegant libraries I've ever seen.
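The core idea behind such cross-platform file locks can be sketched with nothing but the standard library: os.open with O_CREAT | O_EXCL fails if the lock file already exists, which makes "create the lock file" an atomic test-and-set. The paths, timeout, and polling interval below are illustrative:

```python
# A minimal lock-file sketch: creating the file with O_CREAT | O_EXCL
# is atomic, so whichever process creates it first holds the lock.
import os
import time

def acquire(lockpath, timeout=5.0):
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return                      # we created the file: lock held
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError('could not acquire ' + lockpath)
            time.sleep(0.05)            # someone else holds it; retry

def release(lockpath):
    os.remove(lockpath)

acquire('myfile.txt.lock')
try:
    print('Lock acquired.')             # work with myfile.txt here
finally:
    release('myfile.txt.lock')
```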