I have two questions about threads and queues in Python.
What does the maxsize argument do in queue.Queue()?
Is there any performance improvement depending on the number of threads (num_worker_threads)? I can't measure any.
-> If the number of threads makes no difference, why do we need them?
import time
import queue
import threading
num_worker_threads = 10
store = []
def worker():
    while True:
        item = q.get()
        if item is None:
            break
        store.append(item)
        q.task_done()
start = time.time()
q = queue.Queue() # what does maxsize do in queue?
threads = []
for i in range(num_worker_threads):
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)
for item in range(1000000):
    q.put(item)
# block until all tasks are done
q.join()
# stop workers
for i in range(num_worker_threads):
    q.put(None)
for t in threads:
    t.join()
end = time.time()
print('running time: {}'.format(end - start))
You set a maximum size on a queue to provide a throttling mechanism. Assuming you have X producers and Y consumers sharing the same queue, a producer that gets a queue.Full exception (from put_nowait(), or from put() with a timeout) can treat it as a signal to slow down. A plain put() on a full queue simply blocks until a consumer frees a slot.
In your case, since worker doesn't actually do anything computationally expensive (it just appends to a list, in an unsafe manner I might add), and you only have 1 producer and 10 consumers, the queue will most likely always be empty, with the consumers idling (they would throw a queue.Empty exception if the get method had a timeout attached).
As a side note, to figure out whether the consumers or the producers are idling, use the timeout parameter of both the put and get methods of the queue.
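To make the throttling idea concrete, here is a minimal sketch. The maxsize of 2, the 0.1 second timeouts and the two flag names are arbitrary choices for illustration, not anything taken from the question.

```python
import queue

q = queue.Queue(maxsize=2)  # at most 2 items may wait in the queue

q.put("a")
q.put("b")

# The queue is now full. A put() with a timeout raises queue.Full
# instead of blocking forever, which tells the producer it is ahead.
try:
    q.put("c", timeout=0.1)
    producer_was_throttled = False
except queue.Full:
    producer_was_throttled = True

# After draining the queue, a get() with a timeout raises queue.Empty,
# which tells a consumer that it is the one waiting.
q.get()
q.get()
try:
    q.get(timeout=0.1)
    consumer_was_idle = False
except queue.Empty:
    consumer_was_idle = True

print(producer_was_throttled, consumer_was_idle)
```

Without the timeout arguments, the third put() and the third get() would each block until another thread made room or supplied an item.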
Related
I am trying to understand how Queue works
from queue import Queue
from threading import Thread
q = Queue()
urls = ['http://www.linkedin.com', 'http://www.amazon.com', 'http://www.facebook.com', 'http://www.uber.com']
def worker():
    item = q.get()
    print(item)
    q.task_done()
for i in range(1):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
for url in urls:
    q.put(url)
q.join()
I was expecting it to print out all of the URLs, but only the first one is printed.
I thought that the worker would get the first item, print it out, then go back to grab the next item. In this case, I'm just creating 1 thread but can add more threads once I understand what is going on.
Why is it only printing the first URL?
Your worker only runs its code once: it grabs one item from the queue, prints it, then exits. To grab everything, you'll need a loop.
Since you've started this thread as a daemon, it's easy to just loop forever. You're essentially spinning off a thread that says "Grab something out of the queue if there's something there. If not, wait 'till there is. Print that thing, then repeat until the program exits."
def worker():
    while True:
        item = q.get()
        print(item)
        q.task_done()
A queue is usually used either as a simple FIFO buffer (for which you could arguably recommend collections.deque in its place) or as a means of coordinating a whole group of workers to do distributed work. Imagine you have a group of 4:
NUM_WORKERS = 4
for _ in range(NUM_WORKERS):
    t = Thread(daemon=True, target=worker)
    t.start()
and wanted to handle a whole bunch of items:
for i in range(1, 1000001):
    # 1..1000000
    q.put(i)
Now the work will be distributed among all four workers, without any worker grabbing the same item as another. This serves to coordinate your concurrency.
I have a Pool of workers and use apply_async to submit work to them.
I do not care for the result of the function applied to each item.
The pool seems to accept any number of apply_async calls, no matter how large the data or how quickly the workers can keep up with the work.
Is there a way to make apply_async block as soon as a certain number of items are waiting to be processed? I am sure that internally the pool is using a Queue, so wouldn't it be trivial to just give that Queue a maximum size?
If this is not supported, would it make sense to submit a bug report? This looks like very basic functionality and rather trivial to add.
It would be a shame if one had to essentially re-implement the whole logic of Pool just to make this work.
Here is some very basic code:
from multiprocessing import Pool
def dowork(item):
    # process the item (for side effects, no return value needed)
    pass
pool = Pool(nprocesses)  # nprocesses: number of worker processes, defined elsewhere
for work in getmorework():  # getmorework(): yields work items, defined elsewhere
    # this should block if we already have too much work waiting!
    pool.apply_async(dowork, (work,))
pool.close()
pool.join()
So something like this?
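import multiprocessing
import time
worker_count = 4
mp = multiprocessing.Pool(processes=worker_count)
workers = [None] * worker_count
# create the iterator once; calling getmorework() on every pass may start over from the first item
work_items = iter(getmorework())
while True:
    try:
        for i in range(worker_count):
            # refill a slot only when it is empty or its task has finished
            if workers[i] is None or workers[i].ready():
                workers[i] = mp.apply_async(dowork, args=(next(work_items),))
    except StopIteration:
        break
    time.sleep(1)
# let the tasks that are still running finish
mp.close()
mp.join()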
I don't know how fast you expect each worker to finish, so the time.sleep may be unnecessary or may need a different interval.
An alternative might be to use queues directly:
from multiprocessing import Process, JoinableQueue
from time import sleep
from random import random
def do_work(i):
    print(f"worker {i}")
    sleep(random())
    print(f"done {i}")
def worker():
    while True:
        item = q.get()
        if item is None:
            break
        do_work(item)
        q.task_done()
def generator(n):
    for i in range(n):
        print(f"gen {i}")
        yield i
# 1 = allow generator to get this far ahead
q = JoinableQueue(1)
# 2 = maximum amount of parallelism
procs = [Process(target=worker) for _ in range(2)]
# and get them running
for p in procs:
    p.daemon = True
    p.start()
# schedule 10 items for processing
for item in generator(10):
    q.put(item)
# wait for jobs to finish executing
q.join()
# signal workers to finish up
for p in procs:
    q.put(None)
# wait for workers to actually finish
for p in procs:
    p.join()
Mostly adapted from the example in the documentation for Python's queue module:
https://docs.python.org/3/library/queue.html#queue.Queue.join
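Another way to get a blocking apply_async without re-implementing Pool is to count in-flight tasks with a semaphore and release it from the completion callbacks. The following is only a sketch: submit_bounded and max_pending are names made up for this example, and it uses multiprocessing.pool.ThreadPool (which offers the same apply_async interface) so that it runs as a plain script without pickling concerns. The helper itself only relies on apply_async with callback and error_callback, so the assumption is that it carries over to a process-based Pool.

```python
import threading
import time
from multiprocessing.pool import ThreadPool


def submit_bounded(pool, func, items, max_pending):
    """Submit func(item) for each item, blocking while max_pending tasks are unfinished."""
    slots = threading.BoundedSemaphore(max_pending)

    def release(_result_or_error):
        # Called by the pool when a task ends, whether it succeeded or raised.
        slots.release()

    for item in items:
        slots.acquire()  # blocks here once max_pending tasks are outstanding
        pool.apply_async(func, (item,), callback=release, error_callback=release)


processed = []


def dowork(item):
    # Stand-in for real work, done only for its side effect.
    time.sleep(0.01)
    processed.append(item)


pool = ThreadPool(4)
submit_bounded(pool, dowork, range(20), max_pending=8)
pool.close()
pool.join()
print(len(processed))
```

Here max_pending counts tasks that are queued or running, so a value somewhat larger than the number of workers keeps them busy while still capping how far the submitting loop can run ahead.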
I'm doing some experiments with threads, and noticed that my code works even without the q.task_done() statement.
import Queue, threading, urllib2  # Python 2 module names
queue = Queue.Queue()
# urls is assumed to be a list of URL strings defined elsewhere
def get_url(url):
    queue.put({url: len(urllib2.urlopen(url).read())})
def read_from_queue():
    m = queue.get()
    print m.items()
    queue.task_done() # <-- this can be removed and still works
def use_threads():
    threads = []
    for u in urls:
        t = threading.Thread(target=get_url, args=(u,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    threads = []
    for r in urls:
        t = threading.Thread(target=read_from_queue)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
This is a simple program that loops over a list of URLs, reads the content of each and measures its length in bytes. It then puts a dict containing the URL and its size on the queue.
I have tested both cases with timeit.timeit; the results are mixed, but that makes sense because most of the work happens on the network.
How does the queue know a task is done? How does t.join() return without task_done() being called on the queue?
queue.task_done only affects queue.join.
queue.task_done doesn't affect thread.join.
You are calling thread.join and never call queue.join, so queue.task_done doesn't matter.
Zang MingJie got it right. I was calling join() on the threads, not on the queue itself.
When the threads complete, the join() returns.
That's the piece I was missing:
The whole idea of task_done() applies when the threads are daemons, or never return until killed. Then you can't join() the threads, because the join would block forever.
So, in such a scenario, you join() the queue. This returns once task_done() has been called for every item that was put on the queue (indicating there is currently no more work).
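A minimal sketch of that scenario, with daemon workers that loop forever, so the main thread waits on the queue rather than on the threads. The worker body (doubling a number) and the thread and item counts are placeholders chosen for the example.

```python
import queue
import threading

q = queue.Queue()
results = []


def worker():
    while True:  # never returns, so joining this thread would hang
        item = q.get()
        results.append(item * 2)
        q.task_done()  # one call per get(); q.join() counts these


for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

for n in range(10):
    q.put(n)

q.join()  # returns once task_done() has been called for all 10 items
print(sorted(results))
```

If the q.task_done() line is removed, q.join() never returns, because the queue's count of unfinished tasks never drops back to zero.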
A python multi-producer & multi-consumer threading pseudocode:
def threadProducer():
    while upstreams_not_done:
        data = do_some_work()
        queue_of_data.put(data)
def threadConsumer():
    while True:
        data = queue_of_data.get()
        do_other_work()
        queue_of_data.task_done()
queue_of_data = queue.Queue()
list_of_producers = create_and_start_producers()
list_of_consumers = create_and_start_consumers()
queue_of_data.join()
# is now all work done?
Here queue_of_data.task_done() is called once for each item in the queue.
When the producers work more slowly than the consumers, is it possible for queue_of_data.join() to return at a moment when no producer has generated new data yet, but the consumers have already called task_done() for everything put so far?
And if Queue.join() is not reliable in this way, how can I check whether all the work is done?
The usual way is to put a sentinel value (like None) on the queue, one for each consumer thread, when the producers are done. Each consumer is then written to exit when it pulls None from the queue.
So, e.g., in the main program:
for t in list_of_producers:
    t.join()
# Now we know all producers are done.
for t in list_of_consumers:
    queue_of_data.put(None) # tell a consumer we're done
for t in list_of_consumers:
    t.join()
and consumers look like:
def threadConsumer():
    while True:
        data = queue_of_data.get()
        if data is None:
            break
        do_other_work()
Note: if producers can overwhelm consumers, create the queue with a maximum size. Then queue.put() will block when the queue reaches that size, until a consumer removes something from the queue.
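Putting the pieces together, a runnable sketch of the sentinel approach with a bounded queue might look like this. The thread counts, the item count, maxsize=5 and the (producer_id, n) payload are all assumptions made for the example, standing in for the real do_some_work and do_other_work.

```python
import queue
import threading

NUM_PRODUCERS = 2  # hypothetical sizes, chosen only for the demo
NUM_CONSUMERS = 3
ITEMS_PER_PRODUCER = 50

queue_of_data = queue.Queue(maxsize=5)  # put() blocks while 5 items are waiting
consumed = []
consumed_lock = threading.Lock()


def thread_producer(producer_id):
    for n in range(ITEMS_PER_PRODUCER):
        queue_of_data.put((producer_id, n))


def thread_consumer():
    while True:
        data = queue_of_data.get()
        if data is None:  # sentinel: no more work will arrive
            break
        with consumed_lock:
            consumed.append(data)


producers = [threading.Thread(target=thread_producer, args=(i,))
             for i in range(NUM_PRODUCERS)]
consumers = [threading.Thread(target=thread_consumer)
             for _ in range(NUM_CONSUMERS)]
for t in producers + consumers:
    t.start()

for t in producers:
    t.join()  # all real items have been put
for _ in consumers:
    queue_of_data.put(None)  # one sentinel per consumer
for t in consumers:
    t.join()

print(len(consumed))
```

Because the sentinels are queued only after every producer has been joined, they sit behind all real items in the FIFO queue, so no consumer exits while work is still pending.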
The Python documentation gives an example of how to wait for enqueued tasks to be completed, but I am not sure how the order of retrieval is determined. Here is the code:
def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()
q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
for item in source():
    q.put(item)
q.join() # block until all tasks are done
As I interpret it, this code starts num_worker_threads threads, then puts however many items the source yields into the queue.
So if you start 20 threads, and put 30 items in the queue, it seems like you will have 20 worker threads all calling
while True:
    item = q.get()
    do_work(item)
So the first time an item is put on a queue, which of the 20 threads actually gets the item just put on the queue?
Generally speaking, there isn't going to be a guaranteed order, only guaranteed mutual exclusion. Assuming you are using something like queue.Queue (Python 3), it uses synchronization primitives to ensure only one thread can get() an item at a time. But the order in which the threads get their chance will be affected by the vagaries of the OS scheduler - load, priorities, etc.
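One way to see this for yourself is to record which thread received which item. This is just an illustrative sketch; the names taken_by and worker-N, and the counts of 4 threads and 30 items, are made up for the example.

```python
import queue
import threading
from collections import defaultdict

q = queue.Queue()
taken_by = defaultdict(list)  # thread name -> items that thread received
lock = threading.Lock()


def worker():
    name = threading.current_thread().name
    while True:
        item = q.get()
        with lock:
            taken_by[name].append(item)
        q.task_done()


for i in range(4):
    threading.Thread(target=worker, name=f"worker-{i}", daemon=True).start()

for item in range(30):
    q.put(item)
q.join()

# Every item was handed out exactly once ...
all_items = sorted(x for received in taken_by.values() for x in received)
print(all_items == list(range(30)))

# ... but how the items are split between threads can differ from run to run.
for name in sorted(taken_by):
    print(name, taken_by[name])
```

The first printed line should always be True, while the per-thread lists depend on scheduling and are not something to rely on.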