I have a thread class defined like this:
#!/usr/bin/python
import threading
import subprocess
class PingThread (threading.Thread):
    ipstatus = ''

    def __init__(self, ip):
        threading.Thread.__init__(self)
        self.ipaddress = ip

    def ping(self, ip):
        print 'Pinging ' + ip + '...'
        ping_response = subprocess.Popen(["ping", "-c", "1", ip], stdout=subprocess.PIPE).stdout.read()
        if '100.0% packet loss' not in str(ping_response):
            return True
        return False

    def set_ip_status(self, status):
        self.ipstatus = status

    def get_ip_status(self):
        return self.ipstatus

    def run(self):
        self.ipaddress = self.ipaddress.strip('\n\t')
        pingResponse = self.ping(self.ipaddress)
        if pingResponse:
            self.set_ip_status(self.ipaddress + ' is up!')
        else:
            self.set_ip_status(self.ipaddress + ' is down!')
I am going through a list of IP addresses, sending each one to a PingThread, and having that class ping the address. When these threads are all done I want to go through and get the status of each one by calling get_ip_status(). I have q.join() in my code, which is supposed to wait until all items in the queue are complete (from my understanding; correct me if I'm wrong, I'm still new to threading), but my code never gets past the q.join(). I tested and all the threads do complete and all the IP addresses get pinged, but q.join() doesn't recognize that. Why is this? What am I doing wrong? I am creating the threads like this:
q = Queue.Queue()
for ip in trainips:
    thread = PingThread(ip)
    thread.start()
    q.put(thread)
q.join()
while not q.empty():
    print q.get().get_ip_status()
You're misunderstanding how Queue.join works. Queue.join is meant to be used with Queue.task_done. On the producer end, you put items into the Queue, then call Queue.join to wait for all the items you've put to be processed. On the consumer end, you get an item from the Queue, process it, then call Queue.task_done when you're done. Once task_done has been called for every item that has been put into the Queue, Queue.join will unblock.
But you're not doing that. You're just starting a bunch of threads, adding them to a Queue, and then calling join on it. You're not using task_done at all, and you're only calling Queue.get after Queue.join, and it looks like you're just using it to fetch the thread objects after they've completed. But that's not really how it works; the Queue has no idea there are Thread objects in it, and simply calling Queue.join won't wait for the Thread objects inside it to complete.
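For reference, here is a minimal sketch of that producer/consumer pattern with put, get, and task_done, using hypothetical work items rather than your ping code:
import threading
import Queue

q = Queue.Queue()

def consumer():
    while True:
        item = q.get()        # blocks until an item is available
        # ... process item here ...
        q.task_done()         # tell the Queue this item is finished

t = threading.Thread(target=consumer)
t.daemon = True               # the endless consumer shouldn't keep the program alive
t.start()

for item in range(10):        # producer side
    q.put(item)
q.join()                      # unblocks once task_done has been called 10 times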
Really, it looks like all you need to do is put the threads in a list, then call join on each thread.
threads = []
for ip in trainips:
    thread = PingThread(ip)
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
    print thread.get_ip_status()
As the docs say, Queue.join
Blocks until all items in the queue have been gotten and processed.
But you don't ever try to get the items until after the join (and even then, you don't mark them processed).
So, you can't get past the join until you finish the while loop, which you can't get to until you get past the join, so you block forever.
To make that join work, you'd have to change those last three lines to something like:
while not q.empty():
    print q.get().get_ip_status()
    q.task_done()
q.join()
However, a much simpler solution is to just not join the queue. Instead, you could join all of the threads; then you know it's safe to get all the values. But note that if you do this, there's no reason for the queue to be a Queue; it can just be a plain old list. At which point you've effectively got dano's answer.
Alternatively, you could change your code to actually make use of the queue. Instead of putting the threads in the queue, pass the queue to the thread function and have it put its results on the queue instead of storing them as an attribute. Then you can just loop over get() as you're doing, and it will automatically take care of all the blocking you need. The example for Queue.join in the docs shows how to do almost exactly what you'd want to do.
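A rough sketch of that approach, assuming PingThread.__init__ is changed to accept the queue and store it as self.result_queue (those names are mine, adapt as needed):
def run(self):
    ip = self.ipaddress.strip('\n\t')
    status = ip + (' is up!' if self.ping(ip) else ' is down!')
    self.result_queue.put(status)   # hand the result to the main thread

q = Queue.Queue()
threads = [PingThread(ip, q) for ip in trainips]
for thread in threads:
    thread.start()
for _ in trainips:
    print q.get()   # blocks until the next result is ready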
The advantage of the latter solution is that you no longer need your tasks and threads to map one-to-one—e.g., use a pool of 16 threads running 128 tasks, and you're still going to end up with 128 values on the queue.*
* But if you want to do that, you probably want to use multiprocessing.dummy.Pool or (from the concurrent.futures backport on PyPI) futures.ThreadPoolExecutor instead of building it yourself.
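For example, a minimal sketch of the multiprocessing.dummy.Pool approach, assuming a standalone ping(ip) function like the method in the question and the same trainips list:
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def check(ip):
    ip = ip.strip('\n\t')
    return ip + (' is up!' if ping(ip) else ' is down!')

pool = Pool(16)                          # 16 worker threads, any number of IPs
for status in pool.map(check, trainips):
    print status
pool.close()
pool.join()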
Related
I have a big dataset in a data acquisition system I wrote in Python, and it takes far too long to pass it over a queue from the child process to the parent. I want to save the data acquired at the end of the acquisition, and I tried doing this with the queue from multiprocessing. Instead of doing it that way, I would prefer to pass a message over the queue from the parent to the child telling it to save my data before I kill the child process. Is this possible? An example of what I thought it might look like is:
def acquireData(self, var1, queue):
    import h5py
    # Put my acquisition code here
    queue.get()
    if queue == True:
        f = h5py.File("FileName", "w")
        f.create_dataset('Data', data=data)
        f.close()
if __name__ == '__main__':
    from multiprocessing import Process, Queue
    queue = Queue()
    inter_thread = Process(target=acquireData, args=(var1, queue))
    queue.put(False)
    inter_thread.start()
    while True:
        if not args.automate:
            # Let c++ threads run for given amount of time
            # Wait for stop from OP GUI
        else:
            queue.put(True)
            break
    print("Acquisition finished, cleaning up...")
    sleep(2)
    inter_thread.terminate()
Is this allowed? If this type of interfacing between processes is allowed, then do I have the right notation? For some reference, I have on the order of 9e7 data points in the array I'm trying to save, and I have 7 such arrays, which simply are not being passed to my parent process in a timely manner when I put them into the queue. Thank you.
First, yes, passing a queue to a child is not only legal, but the main use case for queues. See the first example in the docs, which does exactly that.
However, you've got some problems with your code:
queue.get()
if queue == True:
First, your queue is never going to be the boolean value True, it's going to be a Queue. You almost never want to check if x == True: in Python; you want to check if x:. For example, if [1, 2]: will pass, while if [1, 2] == True: will not.
Second, your queue isn't even the thing you want to check in the first place. It isn't truthy or falsey (or it isn't relevant whether it is); it's the value the main process put on the queue and you pulled off that's either truthy or falsey. Which you discarded as soon as you retrieved it.
So, do this:
flag = queue.get()
if flag:
Or, more simply:
if queue.get():
I'm not sure whether this is exactly what you want or not. That queue.get() will block forever until the main process puts something there. Is that what you wanted? If so, great; you're done with this part of your code. If not, you need to think about what you wanted instead.
As designed, the parent will always wait 2 seconds, even if the child finished long before that. A better solution is to join the child with a timeout of 2 seconds; then you can terminate it if it times out.
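For example (a sketch using your existing names):
inter_thread.join(2)           # wait up to 2 seconds for the child to exit cleanly
if inter_thread.is_alive():    # still running after the timeout
    inter_thread.terminate()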
Plus, are you sure the termination behavior you've designed is what you want? You're doing a "soft kill request" with the queue, then waiting 2 seconds, then doing a "medium-hard kill request" with terminate, and never doing a "hard kill" with kill. That could be a perfectly reasonable design—but if it's not your design, you've implemented the wrong thing.
I have a large dataset in a list that I need to do some work on.
I want to start x threads to work on the list at any given time, until everything in that list has been popped.
I know how to start x threads (let's say 20) at a given time (by using thread1....thread20.start()),
but how do I make it start a new thread when one of the first 20 threads finishes, so that at any given time there are 20 threads running, until the list is empty?
What I have so far:
class queryData(threading.Thread):
    def __init__(self, threadID):
        threading.Thread.__init__(self)
        self.threadID = threadID

    def run(self):
        global lst
        # Get trade from list
        trade = lst.pop()
        tradeId = trade[0][1][:6]
        print tradeId
thread1 = queryData(1)
thread1.start()
Update
I have something going with the following code:
for i in range(20):
    threads.append(queryData(i))

for thread in threads:
    thread.start()

while len(lst) > 0:
    for iter, thread in enumerate(threads):
        thread.join()
        lock.acquire()
        threads[iter] = queryData(i)
        threads[iter].start()
        lock.release()
Now it starts 20 threads in the beginning...and then keeps starting a new thread when one finishes.
However, it is not efficient, as it waits for the first one in the list to finish, and then the second..and so on.
Is there a better way of doing this?
Basically I need:
- Start 20 threads
- While the list is not empty:
  - wait for 1 of the 20 threads to finish
  - reuse it or start a new thread
As I suggested in a comment, I think using a multiprocessing.pool.ThreadPool would be appropriate, because it would automatically handle much of the thread management you're doing manually in your code. Once all the tasks are queued up for processing via ThreadPool's apply_async() method calls, the only thing that needs to be done is wait until they've all finished execution (unless there's something else your code could be doing, of course).
I've translated the code from my linked answer to another related question so that it's more similar to what you appear to be doing, to make it easier to understand in the current context.
from multiprocessing.pool import ThreadPool
from random import randint
import threading
import time

MAX_THREADS = 5
print_lock = threading.Lock()  # Prevent overlapped printing from threads.

def query_data(trade):
    trade_id = trade[0][1][:6]
    time.sleep(randint(1, 3))  # Simulate variable working time for testing.
    with print_lock:
        print(trade_id)

def process_trades(trade_list):
    pool = ThreadPool(processes=MAX_THREADS)
    results = []
    while trade_list:
        trade = trade_list.pop()
        results.append(pool.apply_async(query_data, (trade,)))
    pool.close()  # Done adding tasks.
    pool.join()   # Wait for all tasks to complete.

def test():
    trade_list = [[['abc', ('%06d' % id) + 'defghi']] for id in range(1, 101)]
    process_trades(trade_list)

if __name__ == "__main__":
    test()
You can wait for a thread to complete with thread.join(). This call will block until that thread completes, at which point you can create a new one.
However, instead of respawning a Thread each time, why not recycle your existing threads?
This can be done with tasks, for example: you keep a list of tasks in a shared collection, and when one of your threads finishes a task, it retrieves another one from that collection.
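For illustration, a minimal sketch of that pattern using a shared Queue of trades and a fixed set of 20 threads (the sample data is shaped like the test list in the other answer; with your own lst you would just put its items on the queue instead):
import threading
import Queue

lst = [[['abc', ('%06d' % i) + 'defghi']] for i in range(1, 101)]  # sample trades

task_q = Queue.Queue()
for trade in lst:
    task_q.put(trade)

def worker():
    while True:
        try:
            trade = task_q.get_nowait()   # grab the next trade, if any
        except Queue.Empty:
            return                        # nothing left to do, thread ends
        print trade[0][1][:6]

threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()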
I am trying to get some code working where I can implement logging into a multi-threaded program using gevent. What I'd like to do is set up custom logging handlers to put log events into a Queue, while a listener process is continuously watching for new log events to handle appropriately. I have done this in the past with Multiprocessing, but never with Gevent.
I'm having an issue where the program is getting caught up in the infinite loop (listener process), and not allowing the other threads to "do work"...
Ideally, after the worker processes have finished, I can pass an arbitrary value to the listener process to tell it to break the loop, and then join all the processes together. Here's what I have so far:
import gevent
from gevent.pool import Pool
import Queue
import random
import time

def listener(q):
    while True:
        if not q.empty():
            num = q.get()
            print "The number is: %s" % num
            if num <= 100:
                print q.get()
            # got passed 101, break out
            else:
                break
        else:
            continue

def worker(pid, q):
    if pid == 0:
        listener(q)
    else:
        gevent.sleep(random.randint(0, 2) * 0.001)
        num = random.randint(1, 100)
        q.put(num)

def main():
    q = Queue.Queue()
    all_threads = []
    all_threads = [gevent.spawn(worker, pid, q) for pid in xrange(10)]
    gevent.wait(all_threads[1:])
    q.put(101)
    gevent.joinall(all_threads)

if __name__ == '__main__':
    main()
As I said, the program seems to get hung up on that first process and does not allow the other workers to do their thing. I have also tried spawning the listener process completely separately by itself (which is actually how I would rather do it), but that didn't seem to work either, so I tried this way.
Any help would be appreciated, feel like I am probably just missing something obvious about gevent's back end.
Thanks
The first problem is that your listener is never yielding if the queue is initially empty. The first task you spawn is your listener. When it starts, there's a while True:, the q will be empty, so you go to the else branch, which just continues, looping back to the start of the while loop, and then the q is still empty. So you just sit in the first thread constantly checking the q is empty.
The key thing here is that gevent does not use "native" threads or processes. Unlike "real" threads, which can be switched to at any time by something behind the scenes (like your OS scheduler), gevent uses 'greenlets', which require that you do something to "yield control" to another task. That something is whatever gevent thinks would block, such as reading from the network or disk, or using one of the blocking gevent operations.
One crude fix would be to start your listener when pid == 9 rather than 0. By making it spawn last, there will be items in the q, and it will go into the main if branch. The downside is that this doesn't fix the logic problem, so the first time the queue is empty, you'll get stuck in your infinite loop again.
A more correct fix would be to put gevent.sleep() instead of continue. sleep is a blocking operation, so your other tasks will get a chance to run. Without arguments, it waits for no time, but still gives gevent the chance to decide to switch to another task if it is ready to run. This still isn't very efficient, though, as if the Queue is empty, it's going to spend a lot of pointless time checking that over and over and asking to run again as soon as it can. sleep'ing for longer than the default of 0 will be more efficient, but would delay processing your log messages.
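In sketch form, only the else branch of your listener changes (the sentinel check is also simplified here for clarity):
def listener(q):
    while True:
        if not q.empty():
            num = q.get()
            print "The number is: %s" % num
            if num > 100:          # the 101 sentinel from main()
                break
        else:
            gevent.sleep()         # yield to the worker greenlets instead of spinning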
However, you can instead take advantage of the fact that many of gevent's types, such as Queue, can be used in more Pythonic ways and make your code a lot simpler and easier to understand, as well as more efficient.
import gevent
from gevent.queue import Queue
import random  # needed for the workers' random delays and values

def listener(q):
    for msg in q:
        print "the number is %d" % msg

def worker(pid, q):
    gevent.sleep(random.randint(0, 2) * 0.001)
    num = random.randint(1, 100)
    q.put(num)

def main():
    q = Queue()
    listener_task = gevent.spawn(listener, q)
    worker_tasks = [gevent.spawn(worker, pid, q) for pid in xrange(1, 10)]
    gevent.wait(worker_tasks)
    q.put(StopIteration)
    listener_task.join()  # wait for the listener greenlet to finish

if __name__ == '__main__':
    main()
Here, Queue can operate as an iterator in a for loop. As long as there are messages, it will get an item, run the loop, and then wait for another item. If there are no items, it will just block and hang around until the next one arrives. Since it blocks, though, gevent will switch to one of your other tasks to run, avoiding the infinite loop problem your example code has.
Because this version is using the Queue as a for loop iterator, there's also automatically a nice sentinel value we can put in the queue to make the listener task quit. If a for loop gets StopIteration from its iterator, it will exit cleanly. So when our for loop that's reading from q gets StopIteration from the q, it exits, and then the function exits, and the spawned task is finished.
I would like to define a pool of n workers and have each execute tasks held in a RabbitMQ queue. When a task finishes (fails or succeeds), I want the worker to execute another task from the queue.
I can see in the docs how to spawn a pool of workers and have them all wait for their siblings to complete. I would like something different though: I would like to have a buffer of n tasks, where when one worker finishes it adds another task to the buffer (so no more than n tasks are in the buffer). I'm having difficulty searching for this in the docs.
For context, my non-multithreading code is this:
while True:
    message = get_frame_from_queue()  # get message from rabbit mq
    do_task(message.body)             # body defines urls to download file
    acknowledge_complete(message)     # tell rabbitmq the message is acknowledged
At this stage my "multithreading" implementation will look like this:
#recieves('ask_for_a_job')
def get_a_task():
    # this function is executed when `ask_for_a_job` signal is fired
    message = get_frame_from_queue()
    do_task(message)

def do_tasks(task_info):
    try:
        # do stuff
    finally:
        # once the "worker" has finished start another.
        fire_fignal('ask_for_a_job')

# start the "workers"
for i in range(5):
    fire_fignal('ask_for_a_job')
I don't want to reinvent the wheel. Is there a more built in way to achieve this?
Note get_frame_from_queue is not thread safe.
You should be able to have each subprocess/thread consume directly from the queue, and then within each thread, simply process from the queue exactly as you would synchronously.
from threading import Thread

def do_task(msg):
    pass  # Do stuff here

def consume():
    while True:
        message = get_frame_from_queue()
        do_task(message.body)
        acknowledge_complete(message)

if __name__ == "__main__":
    threads = []
    for i in range(5):
        t = Thread(target=consume)
        t.start()
        threads.append(t)
This way, you'll always have N messages from the queue being processed simultaneously, without any need for signaling to occur between threads.
The only "gotcha" here is the thread-safety of the rabbitmq library you're using. Depending on how it's implemented, you may need a separate connection per thread, or possibly one connection with a channel per thread, etc.
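If you do have to share a single connection, one blunt but simple option is to serialize the non-thread-safe calls with a lock. A sketch, assuming the same get_frame_from_queue, do_task, and acknowledge_complete functions as above are the only things touching the connection:
from threading import Lock

mq_lock = Lock()

def consume():
    while True:
        with mq_lock:                    # only one thread talks to rabbitmq at a time
            message = get_frame_from_queue()
        do_task(message.body)            # the real work still runs in parallel
        with mq_lock:
            acknowledge_complete(message)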
One solution is to leverage the multiprocessing.Pool object. Use an outer loop to get N items from RabbitMQ. Feed the items to the Pool, waiting until the entire batch is done. Then loop through the batch, acknowledging each message. Lastly continue the outer loop.
source
import multiprocessing

def worker(word):
    return bool(word == 'whiskey')

messages = ['syrup', 'whiskey', 'bitters']
BATCHSIZE = 2
pool = multiprocessing.Pool(BATCHSIZE)
while messages:
    # take first few messages, one per worker
    batch, messages = messages[:BATCHSIZE], messages[BATCHSIZE:]
    print 'BATCH:',
    for res in pool.imap_unordered(worker, batch):
        print res,
    print
    # TODO: acknowledge msgs in 'batch'
output
BATCH: False True
BATCH: False
I have two different functions f, and g that compute the same result with different algorithms. Sometimes one or the other takes a long time while the other terminates quickly. I want to create a new function that runs each simultaneously and then returns the result from the first that finishes.
I want to create that function with a higher order function
h = firstresult(f, g)
What is the best way to accomplish this in Python?
I suspect that the solution involves threading. I'd like to avoid discussion of the GIL.
I would simply use a Queue for this. Start the threads and the first one which has a result ready writes to the queue.
Code
from threading import Thread
from time import sleep
from Queue import Queue

def firstresult(*functions):
    queue = Queue()
    threads = []
    for f in functions:
        def thread_main(f=f):  # bind f now; a plain closure could see the loop's later value
            queue.put(f())
        thread = Thread(target=thread_main)
        threads.append(thread)
        thread.start()
    result = queue.get()
    return result

def slow():
    sleep(1)
    return 42

def fast():
    return 0

if __name__ == '__main__':
    print firstresult(slow, fast)
Live demo
http://ideone.com/jzzZX2
Notes
Stopping the threads is an entirely different topic. For this you need to add some state variable to the threads which is checked at regular intervals. To keep this example short I simply skipped that part and assumed that all workers get the time to finish their work, even though the result is never read.
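One common way to do that is with a threading.Event checked between units of work; do_a_chunk_of_work below is just a hypothetical stand-in:
from threading import Thread, Event
import time

stop_requested = Event()

def do_a_chunk_of_work():      # hypothetical stand-in for one unit of real work
    time.sleep(0.1)

def worker():
    while not stop_requested.is_set():   # check the flag between chunks of work
        do_a_chunk_of_work()

t = Thread(target=worker)
t.start()
time.sleep(1)                  # let it work for a while
stop_requested.set()           # ask the worker to stop
t.join()                       # the worker exits after its current chunk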
Skipping the discussion about the GIL as requested by the questioner. ;-)
Now - unlike my suggestion on the other answer, this piece of code does exactly what you are requesting:
from multiprocessing import Process, Queue
import random
import time

def firstresult(func1, func2):
    queue = Queue()
    proc1 = Process(target=func1, args=(queue,))
    proc2 = Process(target=func2, args=(queue,))
    proc1.start(); proc2.start()
    result = queue.get()
    proc1.terminate(); proc2.terminate()
    return result

def algo1(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 1")

def algo2(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 2")

print firstresult(algo1, algo2)
Run each function in a new worker thread, the 2 worker threads send the result back to the main thread in a 1 item queue or something similar. When the main thread receives the result from the winner, it kills (do python threads support kill yet? lol.) both worker threads to avoid wasting time (one function may take hours while the other only takes a second).
Replace the word thread with process if you want.
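Python threads cannot really be killed, but marking the workers as daemon threads gets close in practice: the losing thread no longer keeps the program alive. A rough sketch of this answer's idea (names are mine):
from threading import Thread
from Queue import Queue

def firstresult(f, g):
    results = Queue()
    for func in (f, g):
        t = Thread(target=lambda fn=func: results.put(fn()))
        t.daemon = True          # the slower worker won't block interpreter exit
        t.start()
    return results.get()         # first result wins; the loser is simply ignored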
You will need to run each function in another process (with multiprocessing) or in a different thread.
If both are CPU bound, multithreading won't help much, precisely because of the GIL, so multiprocessing is the way to go.
If the return value is a pickleable (serializable) object, I have this decorator I created that simply runs the function in the background, in another process:
https://bitbucket.org/jsbueno/lelo/src
It is not exactly what you want, as both calls are non-blocking and start executing right away. The trick with this decorator is that it blocks (and waits for the function to complete) only when you try to use the return value.
But on the other hand, it is just a decorator that does all the work.
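The decorator itself isn't reproduced here, but the general idea (start the work in another process immediately, block only when the value is actually used) can be sketched with the standard library's multiprocessing.Pool.apply_async:
from multiprocessing import Pool

def algo(x):
    return x * 2                               # stand-in for an expensive, picklable computation

if __name__ == '__main__':
    pool = Pool(1)
    pending = pool.apply_async(algo, (21,))    # starts running in another process right away
    # ... do other work here ...
    print pending.get()                        # only this call blocks, much like the decorator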