Need help understanding python threading flavors - python

I'm into threads now and exploring thread and threading libraries. When I started with them, I wrote 2 basic programs. The following are the 2 programs with their corresponding outputs:
threading_1.py :
import threading

def main():
    t1=threading.Thread(target=prints,args=(3,))
    t2=threading.Thread(target=prints,args=(5,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

def prints(i):
    while(i>0):
        print "i="+str(i)+"\n"
        i=i-1

if __name__=='__main__':
    main()
output :
i=3
i=2
i=5
i=4
i=1
i=3
i=2
i=1
thread_1.py
import thread
import threading

def main():
    t1=thread.start_new_thread(prints,(3,))
    t2=thread.start_new_thread(prints,(5,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

def prints(i):
    while(i>0):
        print "i="+str(i)+"\n"
        i=i-1

if __name__=='__main__':
    main()
output :
Traceback (most recent call last):
i=3
File "thread_1.py", line 19, in <module>
i=2
i=1
main()
i=5
i=4
i=3
i=2
i=1
File "thread_1.py", line 8, in main
t1.start()
AttributeError: 'int' object has no attribute 'start'
My desired output is the one from threading_1.py, where the interleaved prints make it a convincing example of thread execution. My understanding is that "threading" is a higher-level library compared to "thread", and that the AttributeError I get in thread_1.py is because I am operating on a thread started from the thread library and not from threading.
So, now my question is: how do I achieve output similar to that of threading_1.py using thread_1.py? Can the program be modified or tuned to produce the same result?

Short answer: ignore the thread module and just use threading.
The thread and threading modules serve quite different purposes. The thread module is a low-level module written in C, designed to abstract away platform differences and provide a minimal cross-platform set of primitives (essentially, threads and simple locks) that can serve as a foundation for higher-level APIs. If you were porting Python to a new platform that didn't support existing threading APIs (like POSIX threads, for example), then you'd have to edit the thread module source so that you could wrap the appropriate OS-level calls to provide those same primitives on your new platform.
As an example, if you look at the current CPython implementation, you'll see that a Python Lock is based on unnamed POSIX semaphores on Linux, on a combination of a POSIX condition variable and a POSIX mutex on OS X (which doesn't support unnamed semaphores), and on an Event and a collection of Windows-specific library calls providing various atomic operations on Windows. As a Python user, you don't want to have to care about those details. The thread module provides the abstraction layer that lets you build higher-level code without worrying about platform-level details.
As such, the thread module is really there as a convenience for those developing Python, rather than for those using it: it's not something that normal Python users are expected to need to deal with. For that reason, the module has been renamed to _thread in Python 3: the leading underscore indicates that it's private, and that users shouldn't rely on its API or behaviour going forward.
In contrast, the threading module is a Java-inspired module written in Python. It builds on the foundations laid by the thread module to provide a convenient API for starting and joining threads, and a broad set of concurrency primitives (re-entrant locks, events, condition variables, semaphores, barriers and so on) for users. This is almost always the module that you as a Python user want to be using. If you're interested in what's going on behind the scenes, it's worth taking some time to look at the threading source: you can see how the threading module pulls in the primitives it needs from the thread module and puts everything together to provide that higher-level API.
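To make the layering a bit more concrete, here is a small sketch. It assumes Python 3, where the low-level module is called _thread, and the first function leans on a CPython implementation detail (the type of the lock objects), so treat it as an illustration rather than a guaranteed API. The function names are made up for this example.

```python
import _thread      # Python 3 name of the low-level "thread" module
import threading


def locks_share_a_type():
    """Check that threading.Lock() is built on the low-level lock type.

    This is a CPython implementation detail, shown only to illustrate
    the layering; don't rely on it in real code.
    """
    low_level = _thread.allocate_lock()
    high_level = threading.Lock()
    return type(low_level) is type(high_level)


def wait_for_signal():
    """Use threading.Event, a primitive the low-level module doesn't offer."""
    ready = threading.Event()
    seen = []

    def worker():
        ready.wait()            # block until the main thread calls set()
        seen.append("go")

    t = threading.Thread(target=worker)
    t.start()
    ready.set()                 # wake the worker
    t.join()                    # wait for it to finish
    return seen
```

The second function is the kind of thing you'd have to assemble by hand from bare locks if you only had the low-level module.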
Note that there are different tradeoffs here, from the perspective of the Python core developers. On the one hand, it should be easy to port Python to a new platform, so the thread module should be small: you should only have to implement a few basic primitives to get up and running on your new platform. In contrast, Python users want a wide variety of concurrency primitives, so the threading library needs to be extensive to support the needs of those users. Splitting the threading functionality into two separate layers is a good way of providing what the users need while not making it unnecessarily hard to maintain Python on a variety of platforms.
To answer your specific question: if you must use the thread library directly (despite all I've said above), you can do this:
import thread
import time

def main():
    t1=thread.start_new_thread(prints,(3,))
    t2=thread.start_new_thread(prints,(5,))

def prints(i):
    while(i>0):
        print "i="+str(i)+"\n"
        i=i-1

if __name__=='__main__':
    main()
    # Give time for the output to show up.
    time.sleep(1.0)
But of course, using a time.sleep is a pretty shoddy way of handling things in the main thread: really, we want to wait until both child threads have done their job before exiting. So we'd need to build some functionality where the main thread can wait for child threads. That functionality doesn't exist directly in the thread module, but it does in threading. That's exactly the point of the threading module: it provides a rich, easy-to-use API in place of the minimal, hard-to-use thread API. So we're back to the summary line: don't use thread, use threading.
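For illustration only, this is roughly what building that "wait for the children" functionality by hand could look like. It's a sketch, assuming Python 3 (where thread is spelled _thread), and run_and_wait is a made-up helper name. Each worker gets a lock that the main thread acquires up front; the worker releases it when done, and the main thread blocks trying to acquire it again.

```python
import _thread  # Python 3 name of the Python 2 "thread" module


def run_and_wait(counts):
    """Start one low-level thread per count and block until all have finished."""
    output = []       # shared list; appends from several threads are safe in CPython
    done_locks = []

    def prints(i, done):
        try:
            while i > 0:
                output.append("i=" + str(i))
                i -= 1
        finally:
            done.release()      # signal completion to the main thread

    for count in counts:
        done = _thread.allocate_lock()
        done.acquire()          # stays held until the worker releases it
        done_locks.append(done)
        _thread.start_new_thread(prints, (count, done))

    for done in done_locks:
        done.acquire()          # blocks until the matching worker has released
    return output
```

Calling run_and_wait([3, 5]) returns the eight "i=..." lines once both workers have finished, in whatever order the threads happened to run. This is essentially a hand-rolled join(), which is a good argument for letting threading do it for you.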

Related

Use cases for threading and asyncio in python

I've read quite a few articles on threading and asyncio modules in python and the major difference I can seem to draw (correct me if I'm wrong) is that in,
threading: multiple threads can be used to execute the python program and these threads are juggled by the OS itself. Further, only when non blocking I/O is happening on a thread can the GIL be released to allow another thread to use it (since the GIL makes the python interpreter single threaded). This is also more resource intensive than asyncio, since multiple threads will be utilising multiple resources.
asyncio: one single thread can have multiple tasks/coroutines that multitask cooperatively to achieve concurrency. Here, the issue of GIL doesn't arise since it is on a single thread anyway and whenever one non blocking I/O bound task is happening, python interpreter can be used by another coroutine - and all of this is managed by asyncio's event loop.
Also, one article: http://masnun.rocks/2016/10/06/async-python-the-different-forms-of-concurrency/
says,
if io_bound:
    if io_very_slow:
        print("Use Asyncio")
    else:
        print("Use Threads")
else:
    print("Multi Processing")
I'd like to understand, just for better clarity, why exactly we can't use asyncio and threading as substitutes for each other, given we have sufficient resources available. Use cases of when to use what would help understand better. Further, since this topic is very new for me, there might be gaps in my understanding, so any kind of resources, explanations and corrections would be really appreciated.
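One way to see the two models described above side by side is to run the same I/O-style wait both ways. This is only a sketch, assuming Python 3.7+ (for asyncio.run); the sleeps stand in for slow I/O, and the function names and the DELAY value are invented for the example.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05  # stand-in for a slow I/O call (hypothetical value)


def blocking_fetch(n):
    time.sleep(DELAY)           # blocking call; other threads can run meanwhile
    return n * 2


async def async_fetch(n):
    await asyncio.sleep(DELAY)  # hands control back to the event loop
    return n * 2


def run_with_threads(items):
    # One OS thread per item; the OS decides when each one runs.
    with ThreadPoolExecutor(max_workers=len(items)) as pool:
        return list(pool.map(blocking_fetch, items))


async def _gather(items):
    # All coroutines share one thread and switch only at "await" points.
    return await asyncio.gather(*(async_fetch(n) for n in items))


def run_with_asyncio(items):
    return list(asyncio.run(_gather(items)))
```

Both functions return the same results for a non-empty list of items. The practical difference is in the code you can call: the threaded version works with ordinary blocking libraries, while the asyncio version needs awaitable equivalents for every blocking operation.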

Downside of using patched threading vs native gevent greenlets?

My understanding is that once I have called gevent.monkey.patch_all(), the standard threading module is modified to use greenlets instead of python threads. So if I write my application in terms of python threads, locks, semaphores etc, and then call patch_all, am I getting the full benefit of gevent, or am I losing out on something compared with using the explicit gevent equivalents?
The motivation behind this question is that I am writing a module which uses some threads/greenlets, and I am deciding whether it is useful to have an explicit switch between using gevent and using threading, or whether I can just use threading+patch_all without losing anything.
To put it in code, is this...
from gevent import Greenlet

def myfunction():
    print 'ohai'

Greenlet.spawn(myfunction)
...any different to this?
import gevent.monkey
gevent.monkey.patch_all()

import threading

class mythread(threading.Thread):
    def run(self):
        print 'ohai'

mythread().start()
At the very least, you will lose some of the greenlet-specific methods: link, kill, join, etc.
Also, you can't use threads with, for example, the gevent.pool module, which can be very useful.
And there is a small overhead for creating each Thread object.

Parallelize my python program

I have a Python program that reads a line from an input file, does some manipulation and writes it to an output file. I have a quad-core machine, and I want to utilize all of its cores. I think there are three alternatives to do this:
1. Creating n Python processes, each handling (total number of records)/n records.
2. Creating n threads in a single Python process for every input record, with each thread processing a record.
3. Creating a pool of n threads in a single Python process, each processing an input record.
I have never used Python's multiprocessing capabilities. Can the hackers please tell me which method is the best option?
The reference implementation of the Python interpreter (CPython) holds the infamous "Global Interpreter Lock" (GIL), effectively allowing only one thread to execute Python code at a time. As a result, multithreading is very limited in Python -- unless your heavy lifting gets done in C extensions that release the GIL.
The simplest way to overcome this limitation is to use the multiprocessing module instead. It has a similar API to threading and is pretty straight-forward to use. In your case, you could use it like this (assuming that the manipulation is the hard part):
import multiprocessing

def process_line(line):
    # This function is executed in your worker processes.  Manipulate the
    # line and return the results.
    return manipulate(line)

if __name__ == '__main__':
    with open('input.txt') as fin, open('output.txt', 'w') as fout:
        # This creates a pool of N worker processes, where N is the number
        # of CPUs in your machine.
        pool = multiprocessing.Pool()
        # Let the workers do the manipulation and write the results to
        # the output file:
        for manipulated_line in pool.imap(process_line, fin):
            fout.write(manipulated_line)
Number one is the right answer.
First of all, it is easier to create and manage multiple processes than multiple threads. You can use the multiprocessing module or something like pyro to take care of the details. Secondly, threading needs to deal with Python's global interpreter lock which makes it more complicated even if you are an expert at threading with Java or C#. And most importantly, performance on multicore machines is harder to predict than you might think. If you haven't implemented and measured two different ways to do things, your intuition as to which way is fastest, is probably wrong.
By the way if you really are an expert at Java or C# threading, then you probably should go with threading instead, but use Jython or IronPython instead of CPython.
Reading the same file from several processes concurrently is tricky. Is it possible to split the file beforehand?
While CPython has the GIL, neither Jython nor IronPython has that limitation.
Also make sure that a simple single process doesn't already max disk I/O. You will have a hard time gaining anything if it does.

Python Queue - Threads bound to only one core

I wrote a python script that:
1. submits search queries
2. waits for the results
3. parses the returned results(XML)
I used the threading and Queue modules to perform this in parallel (5 workers).
It works great for the querying portion because I can submit multiple search jobs and deal with the results as they come in.
However, it appears that all my threads get bound to the same core. This is apparent when it gets to the part where it processes the XML (CPU intensive).
Has anyone else encountered this problem? Am I missing something conceptually?
Also, I was pondering the idea of having two separate work queues, one for making the queries and one for parsing the XML. As it is now, one worker will do both in serial. I'm not sure what that will buy me, if anything. Any help is greatly appreciated.
Here is the code: (proprietary data removed)
# Assumed setup (not shown in the original post)
import sys
import Queue
from threading import Thread

work_queue = Queue.Queue()
thread_list = []

def addWork(source_list):
    for item in source_list:
        #print "adding: '%s'"%(item)
        work_queue.put(item)

def doWork(thread_id):
    while 1:
        try:
            gw = work_queue.get(block=False)
        except Queue.Empty:
            #print "thread '%d' is terminating..."%(thread_id)
            sys.exit() # no more work in the queue for this thread, die quietly
        ##Here is where I make the call to the REST API
        ##Here is where I wait for the results
        ##Here is where I parse the XML results and dump the data into a "global" dict

#MAIN
# "sources" is the list of items to search for (proprietary data removed)
producer_thread = Thread(target=addWork, args=(sources,))
producer_thread.start() # start the thread (ie call the target/function)
producer_thread.join() # wait for thread/target function to terminate(block)

#start the consumers
for i in range(5):
    consumer_thread = Thread(target=doWork, args=(i,))
    consumer_thread.start()
    thread_list.append(consumer_thread)

for thread in thread_list:
    thread.join()
This is a byproduct of how CPython handles threads. There are endless discussions around the internet (search for GIL) but the solution is to use the multiprocessing module instead of threading. Multiprocessing is built with pretty much the same interface (and synchronization structures, so you can still use queues) as threading. It just gives every thread its own entire process, thus avoiding the GIL and forced serialization of parallel workloads.
Using CPython, your threads will never actually run in parallel in two different cores. Look up information on the Global Interpreter Lock (GIL).
Basically, there's a mutual exclusion lock protecting the actual execution part of the interpreter, so no two threads can compute in parallel. Threading for I/O tasks will work just fine, because of blocking.
edit: If you want to take full advantage of multiple cores, you need to use multiple processes. There are a lot of articles on this topic. I'm trying to look up one that I remember being great, but I can't find it =/.
As Nathon suggested, you can use the multiprocessing module. There are tools to help you share objects between processes (take a look at POSH, Python Object Sharing).
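The point above that threading works fine for I/O, because blocked threads give up the interpreter, can be checked with a small timing sketch. This assumes Python 3; time.sleep stands in for the REST call, and fake_io and timed_io_threads are hypothetical names for this example.

```python
import threading
import time


def fake_io(delay, results, index):
    """Stand-in for a blocking network call (hypothetical helper)."""
    time.sleep(delay)           # the GIL is released while the thread sleeps
    results[index] = index


def timed_io_threads(n_threads=5, delay=0.2):
    """Run n_threads blocking waits concurrently; return (elapsed, results)."""
    results = [None] * n_threads
    threads = [
        threading.Thread(target=fake_io, args=(delay, results, i))
        for i in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results
```

With the defaults, the five waits should finish in roughly 0.2 seconds overall rather than a full second. Swap the sleep for a pure-Python CPU loop (like the XML parsing) and that overlap disappears under CPython, which is the behaviour described in the question.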

Terminate long running python threads

What is the recommended way to terminate unexpectedly long-running threads in Python? I can't use SIGALRM, since:
Some care must be taken if both signals and threads are used in the same program. The fundamental thing to remember in using signals and threads simultaneously is: always perform signal() operations in the main thread of execution. Any thread can perform an alarm(), getsignal(), pause(), setitimer() or getitimer(); only the main thread can set a new signal handler, and the main thread will be the only one to receive signals (this is enforced by the Python signal module, even if the underlying thread implementation supports sending signals to individual threads). This means that signals can’t be used as a means of inter-thread communication. Use locks instead.
Update: each thread in my case blocks. It is downloading a web page using the urllib2 module, and sometimes the operation takes too much time on extremely slow sites. That's why I want to terminate such slow threads.
Since abruptly killing a thread that's in a blocking call is not feasible, a better approach, when possible, is to avoid using threads in favor of other multi-tasking mechanisms that don't suffer from such issues.
For the OP's specific case (the threads' job is to download web pages, and some threads block forever due to misbehaving sites), the ideal solution is twisted -- as it generally is for networking tasks. In other cases, multiprocessing might be better.
More generally, when threads give unsolvable issues, I recommend switching to other multitasking mechanisms rather than trying heroic measures in the attempt to make threads perform tasks for which, at least in CPython, they're unsuitable.
As Alex Martelli suggested, you could use the multiprocessing module. It is very similar to the threading module, so it should be easy to get started. Your code could look like this, for example:
import multiprocessing

def get_page(*args, **kwargs):
    # your web page downloading code goes here
    pass

def start_get_page(timeout, *args, **kwargs):
    p = multiprocessing.Process(target=get_page, args=args, kwargs=kwargs)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # stop the downloading 'thread'
        p.terminate()
        # and then do any post-error processing here

if __name__ == "__main__":
    # timeout, args and kwargs are placeholders for your own values
    start_get_page(timeout, *args, **kwargs)
Of course you need to somehow get the return values of your page downloading code. For that you could use multiprocessing.Pipe or multiprocessing.Queue (or other ways available with multiprocessing). There's more information, as well as samples you could check here.
Lastly, the multiprocessing module is included in Python 2.6. It is also available for Python 2.5 and 2.4 on PyPI (you can use easy_install multiprocessing), or you can just visit PyPI and download and install the packages manually.
Note: I realize this was posted a while ago. I was having a similar problem, stumbled here and saw Alex Martelli's suggestion. I implemented it for my problem and decided to share it. (I'd like to thank Alex for pointing me in the right direction.)
Use synchronization objects and ask the thread to terminate. Basically, write co-operative handling of this.
If you start yanking out the thread beneath the python interpreter, all sorts of odd things can occur, and it's not just in Python either, most runtimes have this problem.
For instance, let's say you kill a thread after it has opened a file, there's no way that file will be closed until the application terminates.
If you are trying to kill a thread whose code you do not have control over, it depends if the thread is in a blocking call or not. In my experience if the thread is properly blocking, there is no recommended and portable way of doing this.
I've run up against this when trying to work with code in the standard library (multiprocessing.manager I'm looking at you) with loops coded with no exit condition: nice!
There are some interruptible thread implementations out there (see here for an example), but then, if you have control of the threaded code yourself, you should be able to write it in such a way that you can interrupt it with a condition variable of some sort.
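As a rough sketch of the co-operative approach described above, here is one way it might look, assuming Python 3 and using threading.Event as the synchronization object (chosen here for brevity; a condition variable would also work). The worker and run_with_deadline names are invented for this example.

```python
import threading


def worker(stop_event, progress):
    """Do work in small steps, checking between steps whether to stop."""
    while not stop_event.is_set():
        progress.append(len(progress))   # placeholder for one unit of real work
        stop_event.wait(0.01)            # pauses, but wakes early if stop is requested


def run_with_deadline(deadline):
    """Run the worker for at most `deadline` seconds, then ask it to stop."""
    stop_event = threading.Event()
    progress = []
    t = threading.Thread(target=worker, args=(stop_event, progress))
    t.start()
    t.join(deadline)        # wait up to `deadline` seconds
    stop_event.set()        # request a co-operative shutdown
    t.join()                # the worker exits at its next check
    return progress, t.is_alive()
```

Note that this only helps when the thread's own code checks the flag regularly. A thread stuck inside a single blocking call, as in the question, never reaches the check, which is why the other answers suggest a separate process instead.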
