I am trying to understand how Queue works
from queue import Queue
from threading import Thread

q = Queue()
urls = ['http://www.linkedin.com', 'http://www.amazon.com', 'http://www.facebook.com', 'http://www.uber.com']

def worker():
    item = q.get()
    print(item)
    q.task_done()

for i in range(1):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for url in urls:
    q.put(url)

q.join()
I was expecting it to print out all of the URLs, but only the first one is being printed.
I thought that the worker would get the first item, print it out, then go back to grab the next item. In this case, I'm just creating 1 thread but can add more threads once I understand what is going on.
Why is it only printing the first URL?
Your worker only runs its code once -- grabbing one item from the queue, printing it, then exiting. To grab everything, you'll need a loop.
Since you've started this thread as a daemon, it's easy to just loop forever. You're essentially spinning off a thread that says "Grab something out of the queue if there's something there. If not, wait until there is. Print that thing, then repeat until the program exits."
def worker():
    while True:
        item = q.get()
        print(item)
        q.task_done()
What a queue is usually used for is either a simple FIFO structure (for which you could arguably recommend collections.deque in its place) or a means of coordinating a whole group of workers to do distributed work. Imagine you have a group of 4:
NUM_WORKERS = 4

for _ in range(NUM_WORKERS):
    t = Thread(daemon=True, target=worker)
    t.start()
and wanted to handle a whole bunch of items
for i in range(1, 1000001):
    # 1..1000000
    q.put(i)
Now the work will be distributed among all four workers, without any worker grabbing the same item as another. This serves to coordinate your concurrency.
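As a runnable sketch of this pattern (the doubling "work" and the results list are illustrative stand-ins, not from the original post):

```python
import threading
from queue import Queue

q = Queue()
results = []
results_lock = threading.Lock()  # protects the shared results list

def worker():
    while True:
        item = q.get()
        with results_lock:
            results.append(item * 2)  # stand-in for real work
        q.task_done()

NUM_WORKERS = 4
for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

for i in range(100):
    q.put(i)

q.join()  # blocks until every item has been acknowledged with task_done()
print(len(results))  # 100 -- each item was handled exactly once
```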
Related
Regarding example from documentation:
https://docs.python.org/2/library/queue.html#Queue.Queue.get
def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()
How does the worker actually know that all work is done, the queue is empty, and we can exit? I don't understand it...
Your worker hangs in a while True: loop, which means that the function/thread will never return.
The "magic" lies in the code you didn't show:
t = Thread(target=worker)
t.daemon = True
t.start()
The daemon flag controls whether this thread can keep the program alive:
The entire Python program exits when no alive non-daemon threads are left.
Which means that the program will exit as soon as the main thread does.
The worker thread technically still lives at that point, but it will be destroyed when the main thread ends (because "there are no alive non-daemon threads left").
The main thread exit condition is
q.join()
The documentation for join shows when it stops blocking execution:
[...] When the count of unfinished tasks drops to zero, join() unblocks.
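A small sketch of that counter in action (unfinished_tasks is an internal attribute of Queue, peeked at here purely for illustration):

```python
from queue import Queue

q = Queue()
q.put('a')
q.put('b')
print(q.unfinished_tasks)  # 2 -- one per put()

q.get()
q.task_done()
q.get()
q.task_done()
print(q.unfinished_tasks)  # 0 -- every task acknowledged

q.join()  # returns immediately, since the count is already zero
```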
I'll keep it simple. A Queue is basically a collection of items, like a list for instance; the difference is that it doesn't allow random access to its elements. You insert and delete items in a certain order. The default type of queue is FIFO (first in, first out). As you can figure from its name, it's like a normal queue you see at any supermarket (or any other place): the first person who entered the line leaves first.
There are three types of queue:
FIFO
LIFO
PRIORITY
FIFO, like I said, follows the rule of first in, first out:
import queue                 # importing the library

q = queue.Queue()            # create a FIFO queue object
for i in range(5):
    q.put(i)                 # adding elements to our queue
while not q.empty():
    print(q.get())           # remove an item and print it
LIFO works on the principle of last in, first out:
import queue                 # importing the library

q = queue.LifoQueue()        # create a LIFO queue object
for i in range(5):
    q.put(i)                 # adding elements to our queue
while not q.empty():
    print(q.get())           # remove an item and print it
A PRIORITY queue gives out data in ascending order, as in, the smallest item will exit the queue first.
import queue                 # importing the library

q = queue.PriorityQueue()    # create a priority queue object
q.put(3)
q.put(7)
q.put(2)
q.put(7)
q.put(1)
while not q.empty():
    print(q.get())           # remove an item and print it
To answer your last question: as you can see in the examples, you can use q.empty() to check whether your queue is empty or not.
If you have any further doubt feel free to ask.
I have 2 questions regarding threads and queues in Python.
What does the maxsize argument do in queue.Queue()?
Is there any performance improvement depending on the number of threads (num_worker_threads)? I can't find any improvement.
-> If there's no improvement depending on the number of threads, why do we need this?
import time
import queue
import threading

num_worker_threads = 10
store = []

def worker():
    while True:
        item = q.get()
        if item is None:
            break
        store.append(item)
        q.task_done()

start = time.time()

q = queue.Queue()  # what does maxsize do in queue?
threads = []
for i in range(num_worker_threads):
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)

for item in range(1000000):
    q.put(item)

# block until all tasks are done
q.join()

# stop workers
for i in range(num_worker_threads):
    q.put(None)
for t in threads:
    t.join()

end = time.time()
print('running time: {}'.format(end - start))
You set the maxsize on a queue in order to provide a throttling mechanism for your workers. Assuming you have X producers and Y consumers using the same queue, if any of the producers catches a queue.Full exception on put, it can use this as a signal to slow down production.
In your case, since the worker doesn't actually do anything computationally expensive (it just appends to a list, in an unsafe manner I might add), and you only have 1 producer and 10 consumers, the queue will most likely always be empty and instead the consumers will be idling (and would raise a queue.Empty exception if the get method had a timeout attached).
As a side note, in order to figure out whether the consumers or the producers are idling, you should use the timeout parameter for both the put and get methods of the queue.
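For illustration, a minimal sketch of maxsize acting as back-pressure (the sizes and the timeout are made up):

```python
import queue

q = queue.Queue(maxsize=2)   # at most 2 items may sit in the queue
q.put(1, timeout=0.1)
q.put(2, timeout=0.1)
try:
    q.put(3, timeout=0.1)    # queue is full and no consumer drains it in time
except queue.Full:
    print('producer should back off')
```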
I'm doing some threading experiments, and noticed that my code works even without the q.task_done() statement.
import Queue, threading
import urllib2

queue = Queue.Queue()

def get_url(url):
    queue.put({url: len(urllib2.urlopen(url).read())})

def read_from_queue():
    m = queue.get()
    print m.items()
    queue.task_done()  # <-- this can be removed and it still works

def use_threads():
    threads = []
    for u in urls:
        t = threading.Thread(target=get_url, args=(u,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    threads = []
    for r in urls:
        t = threading.Thread(target=read_from_queue)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
This is a simple program that loops over a list of urls, reading their content and summing it up to a byte count. It then puts a dict containing the url name and its size into the queue.
I have timeit.timeit tested both cases; the results are mixed, but that makes sense because most of the work happens on the network.
How does the queue know a task is done? How does t.join() return without task_done() being called on the queue?
queue.task_done only affects queue.join
queue.task_done doesn't affect thread.join
You are calling thread.join and never call queue.join, so queue.task_done doesn't matter
Zang MingJie got it right. I was join()ing the threads, not the queue itself.
When the threads complete, join() returns.
That's the piece I was missing:
The whole idea of task_done() is for when the threads are daemons, or never return until killed. Then you can't join() the threads, because that would deadlock.
So, when you have such a scenario, you join() the queue instead. This returns when the queue has no unfinished tasks left (indicating there is currently no more work).
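A minimal sketch of that scenario -- one daemon worker you could never t.join(), coordinated through q.join() instead (the processed list is illustrative):

```python
import threading
from queue import Queue

q = Queue()
processed = []

def worker():
    while True:              # never returns; the daemon dies with the program
        item = q.get()
        processed.append(item)
        q.task_done()

threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    q.put(i)

q.join()                     # returns once all 5 items are task_done()'d
print(processed)             # [0, 1, 2, 3, 4] -- a single worker preserves order
```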
I have two queues for different tasks. The first crawl starts crawling the links in the list; it then generates more links to crawl for queue one and also generates new links for a different task on queue two. My program works, but the problem is: when the workers for queue two start running, the workers for queue one stop. They are basically not running in parallel; they are waiting for each other to finish their tasks. How can I make them run in parallel?
import threading
from queue import Queue

queue = Queue()
queue_two = Queue()
links = ['www.example.com', 'www.example.com', 'www.example.com',
         'www.example.com', 'www.example.com', 'www.example.com',
         'www.example.com', 'www.example.com', 'www.example.com']
new_links = []

def create_workers():
    for _ in range(4):
        t = threading.Thread(target=work)
        t.daemon = True
        t.start()
    for _ in range(2):
        t = threading.Thread(target=work_two)
        t.daemon = True
        t.start()

def work():
    while True:
        work = queue.get()
        # do something
        queue.task_done()

def work_two():
    while True:
        work = queue_two.get()
        # do something
        queue_two.task_done()

def create_jobs():
    for link in links:
        queue.put(link)
    queue.join()
    crawl_two()
    crawl()

def create_jobs_two():
    for link in new_links:
        queue_two.put(link)
    queue_two.join()
    crawl_two()

def crawl():
    queued_links = links
    if len(queued_links) > 0:
        create_jobs()

def crawl_two():
    queued_links = new_links
    if len(queued_links) > 0:
        create_jobs_two()

create_workers()
crawl()
That is because your processing is not actually parallel between work and work_two.
This is what happens:
1. You create workers for work and work_two
2. crawl() is called
3. create_jobs() is called -- the "work" workers start processing
4. create_jobs() waits in queue.join() until all of them have completed
5. crawl_two() is called
6. create_jobs_two() is called -- the "work_two" workers start processing
7. create_jobs_two() waits in queue_two.join() until all of them have completed
8. crawl() is called (start from 2 again)
Basically you never enter a situation where work and work_two would run in parallel, as you use queue.join() to wait until all of the currently running tasks have finished. Only then do you assign tasks to the "other" queue. Your work and work_two do run in parallel within themselves, but the control structure ensures work and work_two are mutually exclusive. You need to redesign the loops and queues if you want both of them to run in parallel.
You will probably also want to investigate the use of threading.Lock() to protect your global new_links variable, as I assume you will be appending things to it in your worker threads. This is absolutely fine but you need a lock to ensure two threads are not trying to do this simultaneously. But this is not related to your current problem. This only helps you avoid the next problem.
I of course do not know what you are trying to achieve here, but you might try solving your problem (and avoiding the next one) by scrapping the global new_links completely. What if work and work_two just fed the queue of the other worker whenever they needed to submit tasks, instead of putting items into a global variable and then feeding them to the queue in the main thread?
Or you could build an "orchestration" thread that would queue tasks to workers, process responses from them and then act on that response accordingly, either queuing it back to one of the queues or accepting the result if it is "ready".
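As a hypothetical sketch of the first idea -- the "work" workers feed queue_two directly instead of going through a global list (the '/about' suffix just simulates discovering a new link; all names here are illustrative):

```python
import threading
from queue import Queue

queue_one = Queue()
queue_two = Queue()
found = []
found_lock = threading.Lock()

def work():
    while True:
        link = queue_one.get()
        # crawling would happen here; pretend each page yields one new link
        queue_two.put(link + '/about')
        queue_one.task_done()

def work_two():
    while True:
        link = queue_two.get()
        with found_lock:
            found.append(link)
        queue_two.task_done()

for _ in range(4):
    threading.Thread(target=work, daemon=True).start()
for _ in range(2):
    threading.Thread(target=work_two, daemon=True).start()

for n in range(10):
    queue_one.put('www.example.com/%d' % n)

queue_one.join()   # every link was crawled (and its follow-up queued)
queue_two.join()   # every follow-up link was handled too
print(len(found))  # 10
```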
The Python documentation gives an example of how to wait for enqueued tasks to be completed, but I am not sure how the order of retrieval is determined. Here is the code:
def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for item in source():
    q.put(item)

q.join()       # block until all tasks are done
As I interpret it, this code starts however many threads the range specifies, then puts however many items are in the source into the queue.
So if you start 20 threads and put 30 items in the queue, it seems like you will have 20 worker threads all calling
while True:
    item = q.get()
    do_work(item)
So the first time an item is put on the queue, which of the 20 threads actually gets that item?
Generally speaking, there isn't going to be a guaranteed order, only guaranteed mutual exclusion. Assuming you are using something like queue.Queue (Python 3), it uses synchronization primitives to ensure only one thread can get() an item at a time. But the order in which the threads get their chance will be affected by the vagaries of the OS scheduler - load, priorities, etc.
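A small sketch of that guarantee: with 20 threads competing for 30 items, each item is consumed exactly once, even though which thread gets which item is up to the scheduler (the names here are illustrative):

```python
import threading
from queue import Queue, Empty

q = Queue()
taken = []
taken_lock = threading.Lock()

def worker():
    while True:
        try:
            item = q.get_nowait()
        except Empty:        # nothing left -- this thread is done
            return
        with taken_lock:
            taken.append(item)

for item in range(30):
    q.put(item)

threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(taken) == list(range(30)))  # True -- no item lost or duplicated
```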