store results ThreadPoolExecutor

store results ThreadPoolExecutor - python

I am fairly new to parallel processing with "concurrent.futures" and I am testing some simple experiments. The code I have written seems to work, but I am not sure how to store the results. I have tried to create a list ("futures") and append the results to that, but that considerably slow down the procedure. I am wondering if there is a better way to do that. Thank you.
import concurrent.futures
import time
couple_ods= []
futures=[]
dtab={}
for i in range(100):
for j in range(100):
dtab[i,j]=i+j/2
couple_ods.append((i,j))
avg_speed=100
def task(i):
origin=i[0]
destination=i[1]
time.sleep(0.01)
distance=dtab[origin,destination]/avg_speed
return distance
start1=time.time()
def main():
with concurrent.futures.ThreadPoolExecutor() as executor:
for number in couple_ods:
future=executor.submit(task,number)
futures.append(future.result())
if __name__ == '__main__':
main()
end1=time.time()

When you call future.result(), that blocks until the value is ready. So, you’re not getting any benefits out of parallelism here—you start one task, wait for it to finish, start another, wait for it to finish, and so on.
Of course your example won’t benefit from threading in the first place. Your tasks are doing nothing but CPU-bound Python computation, which means that (at least in CPython, MicroPython, and PyPy, which are the only complete implementations that come with concurrent.futures), the GIL (Global Interpreter Lock) will prevent more than one of your threads from progressing at a time.
Hopefully your real program is different. If it’s doing I/O-bound stuff (making network requests, reading files, etc.), or using an extension library like NumPy that releases the GIL around heavy CPU work, then it will work fine. But otherwise, you’ll want to use ProcessPoolExecutor here.
Anyway, what you want to do is append future itself to a list, so you get a list of all of the futures before waiting for any of them:
for number in couple_ods:
future=executor.submit(task,number)
futures.append(future)
And then, after you’ve started all of the jobs, you can start waiting for them. There are three simple options, and one complicated one when you need more control.
(1) You can just directly loop over them to wait for them in the order they were submitted:
for future in futures:
result = future.result()
dostuff(result)
(2) If you need to wait for them all to be finished before doing any work, you can just call wait:
futures, _ = concurrent.futures.wait(futures)
for future in futures:
result = future.result()
dostuff(result)
(3) If you want to handle each one as soon as it’s ready, even if they come out of order, use as_completed:
for future in concurrent.futures.as_completed(futures):
dostuff(future.result())
Notice that the examples that use this function in the docs provide some way to identify which task is finished. If you need that, it can be as simple as passing each one an index, then return index, real_result, and then you can for index, result in … for the loop.
(4) If you need more control, you can loop over waiting on whatever’s done so far:
while futures:
done, futures = concurrent.futures.wait(concurrent.futures.FIRST_COMPLETED)
for future in done:
result = future.result()
dostuff(result)
That example does the same thing as as_completed, but you can write minor variations on it to do different things, like waiting for everything to be done but canceling early if anything raises an exception.
For many simple cases, you can just use the map method of the executor to simplify the first option. This works just like the builtin map function, calling a function once for each value in the argument and then giving you something you can loop over to get the results in the same order, but it does it in parallel. So:
for result in executor.map(task, couple_ods):
dostuff(result)

Related

Are these two python codes equivalent using concurrent futures?

I have two pieces of code, representative of a more complex scenario I am trying to debug. I am wondering if they are technically equivalent, and if not, why.
First one:
import time
from concurrent.futures import ThreadPoolExecutor
def cb(res):
print("done", res)
def foo():
time.sleep(3)
res = 5
cb(res)
return res
with ThreadPoolExecutor(max_workers=2) as executor:
future = executor.submit(foo)
print(future.result())
Second one:
def cb2(fut):
print("done", fut.result())
def foo2():
time.sleep(3)
return 5
with ThreadPoolExecutor(max_workers=2) as executor:
future = executor.submit(foo2)
future.add_done_callback(cb2)
print(future.result())
The core of the issue is the following: I need to call a sync, slow operation (here, represented by the sleep). When that operation completes, I have to perform subsequent fast operations. In the first code, I put these operations immediately after the sync slow one. In the second code, I put it in the callback.
In terms of implementation, I suspect the future creates a secondary thread, runs the code in the secondary thread, and this secondary thread will stop at the sync slow operation. Once this operation is completed, the secondary thread will keep going, and it can keep going either by executing the subsequent code or by calling the callbacks. I see no difference in these two pieces of code (apart from the fact that adding the callback allows injecting code from outside, an added flexibility), but I might be wrong, hence the question.
Note that I do understand that in the first case, the print is called when the future is still not resolved and in the second one it is, but it is assumed that the status is not relevant.

These two examples are not equal in terms of events ordering.
Let’s look through the lifecycle of a Future. It is roughly like that (reverse engineered from cpython’s source):
a Future is created
it is added to executor’s queue
it is popped from the queue by some free/idle thread from the threadpool
the function provided to submit() is called in that thread
the future is marked as FINISHED
the future broadcasts the ‘state changed’ event to all its waiters
callbacks are invoked (still in the same worker thread)
the worker thread becomes free/idle and may take another future from the queue
When you execute the statement print(future.result()), your main thread blocks and becomes the future’s waiter. It becomes unblocked right after the future switches to FINISHED, but right before callbacks start to execute. That means that you cannot predict what print goes first in the console - print in any of your callbacks, or print(future(result)) - they now are executing in parallel. If you deal with same data in your callbacks and in the main thread after waiting for future.result() to complete, you are likely to get data corruption.

https://gist.github.com/mangecoeur/9540178
https://docs.python.org/3.4/library/concurrent.futures.html
with concurrent.futures.ProcessPoolExecutor() as executor:
result = executor.map(function, iterable)
executor.map(fun, [data] * 10)
pool = multiprocessing.Pool()
pool.map(…)
with concurrent.futures.ThreadPoolExecutor() as executor:
result = executor.map(function, iterable)

with concurrent.futures.ThreadPoolExecutor() as executor: ... does not wait

I am trying to use the ThreadPoolExecutor() in a method of a class to create a pool of threads that will execute another method within the same class. I have the with concurrent.futures.ThreadPoolExecutor()... however it does not wait, and an error is thrown saying there was no key in the dictionary I query after the "with..." statement. I understand why the error is thrown because the dictionary has not been updated yet because the threads in the pool did not finish executing. I know the threads did not finish executing because I have a print("done") in the method that is called within the ThreadPoolExecutor, and "done" is not printed to the console.
I am new to threads, so if any suggestions on how to do this better are appreciated!
def tokenizer(self):
all_tokens = []
self.token_q = Queue()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
for num in range(5):
executor.submit(self.get_tokens, num)
executor.shutdown(wait=True)
print("Hi")
results = {}
while not self.token_q.empty():
temp_result = self.token_q.get()
results[temp_result[1]] = temp_result[0]
print(temp_result[1])
for index in range(len(self.zettels)):
for zettel in results[index]:
all_tokens.append(zettel)
return all_tokens
def get_tokens(self, thread_index):
print("!!!!!!!")
switch = {
0: self.zettels[:(len(self.zettels)/5)],
1: self.zettels[(len(self.zettels)/5): (len(self.zettels)/5)*2],
2: self.zettels[(len(self.zettels)/5)*2: (len(self.zettels)/5)*3],
3: self.zettels[(len(self.zettels)/5)*3: (len(self.zettels)/5)*4],
4: self.zettels[(len(self.zettels)/5)*4: (len(self.zettels)/5)*5],
}
new_tokens = []
for zettel in switch.get(thread_index):
tokens = re.split('\W+', str(zettel))
tokens = list(filter(None, tokens))
new_tokens.append(tokens)
print("done")
self.token_q.put([new_tokens, thread_index])
'''
Expected to see all print("!!!!!!") and print("done") statements before the print ("Hi") statement.
Actually shows the !!!!!!! then the Hi, then the KeyError for the results dictionary.

As you have already found out, the pool is waiting; print('done') is never executed because presumably a TypeError raises earlier.
The pool does not directly wait for the tasks to finish, it waits for its worker threads to join, which implicitly requires the execution of the tasks to complete, one way (success) or the other (exception).
The reason you do not see that exception raising is because the task is wrapped in a Future. A Future
[...] encapsulates the asynchronous execution of a callable.
Future instances are returned by the executor's submit method and they allow to query the state of the execution and access whatever its outcome is.
That brings me to some remarks I wanted to make.
The Queue in self.token_q seems unnecessary
Judging by the code you shared, you only use this queue to pass the results of your tasks back to the tokenizer function. That's not needed, you can access that from the Future that the call to submit returns:
def tokenizer(self):
all_tokens = []
with ThreadPoolExecutor(max_workers=5) as executor:
futures = [executor.submit(get_tokens, num) for num in range(5)]
# executor.shutdown(wait=True) here is redundant, it is called when exiting the context:
# https://github.com/python/cpython/blob/3.7/Lib/concurrent/futures/_base.py#L623
print("Hi")
results = {}
for fut in futures:
try:
res = fut.result()
results[res[1]] = res[0]
except Exception:
continue
[...]
def get_tokens(self, thread_index):
[...]
# instead of self.token_q.put([new_tokens, thread_index])
return new_tokens, thread_index
It is likely that your program does not benefit from using threads
From the code you shared, it seems like the operations in get_tokens are CPU bound, rather than I/O bound. If you are running your program in CPython (or any other interpreter using a Global Interpreter Lock), there will be no benefit from using threads in that case.
In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.
That means for any Python process, only one thread can execute at any given time. This is not so much of an issue if your task at hand is I/O bound, i.e. frequently pauses to wait for I/O (e.g. for data on a socket). If your tasks need to constantly execute bytecode in a processor, there's no benefit for pausing one thread to let another execute some instructions. In fact, the resulting context switches might even prove detrimental.
You might want to go for parallelism instead of concurrency. Take a look at ProcessPoolExecutor for this.However, I recommend to benchmark your code running sequentially, concurrently and in parallel. Creating processes or threads comes at a cost and, depending on the task to complete, doing so might take longer than just executing one task after the other in a sequential manner.
As an aside, this looks a bit suspicious:
for index in range(len(self.zettels)):
for zettel in results[index]:
all_tokens.append(zettel)
results seems to always have five items, because for num in range(5). If the length of self.zettels is greater than five, I'd expect a KeyError to raise here.If self.zettels is guaranteed to have a length of five, then I'd see potential for some code optimization here.

You need to loop over concurrent.futures.as_completed() as shown here. It will yield values as each thread completes.

Python multithreading without a queue working with large data sets

I am running through a csv file of about 800k rows. I need a threading solution that runs through each row and spawns 32 threads at a time into a worker. I want to do this without a queue. It looks like current python threading solution with a queue is eating up alot of memory.
Basically want to read a csv file row and put into a worker thread. And only want 32 threads running at a time.
This is current script. It appears that it is reading the entire csv file into queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue then spawning the threads?
queue=Queue.Queue()
def worker():
while True:
task=queue.get()
try:
subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
docRoot=docRoot,
task=task
)],shell=True)
except:
pass
with lock:
stats['done']+=1
if int(time.time())!=stats.get('now'):
stats.update(
now=int(time.time()),
percent=(stats.get('done')/stats.get('total'))*100,
ps=(stats.get('done')/(time.time()-stats.get('start')))
)
print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
percent=stats.get('percent'),
progress=('='*int((23*stats.get('percent'))/100))+'>',
persec=stats.get('ps'),
done=int(stats.get('done')),
total=stats.get('total'),
eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
),end='')
queue.task_done()
for i in range(32):
workers=threading.Thread(target=worker)
workers.daemon=True
workers.start()
try:
with open(csvFile,'rb') as fh:
try:
dialect=csv.Sniffer().sniff(fh.readline(),[',',';'])
fh.seek(0)
reader=csv.reader(fh,dialect)
headers=reader.next()
except csv.Error as e:
print("\rERROR[CSV] {error}\n".format(error=e))
else:
while True:
try:
data=reader.next()
except csv.Error as e:
print("\rERROR[CSV] - Line {line}: {error}\n".format( line=reader.line_num, error=e))
except StopIteration:
break
else:
stats['total']+=1
queue.put(urllib.urlencode(dict(zip(headers,data)+dict(campaign=row.get('Campaign')).items())))
queue.join()

32 threads is probably overkill unless you have some humungous hardware available.
The rule of thumb for optimum number of threads or processes is: (no. of cores * 2) - 1
which comes to either 7 or 15 on most hardware.
The simplest way would be to start 7 threads passing each thread an "offset" as a parameter.
i.e. a number from 0 to 7.
Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row it can skip 6 rows and process the 7th -- repeat until no more rows.
This setup works for threads and multiple processes and is very efficient in I/O on most machines as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for python as each thread is more or less independent once started and avoids the dreaded python global lock common to other methods.

I don't understand why you want to spawn 32 threads per row. However data processing in parallel in a fairly common embarassingly paralell thing to do and easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool
def job(args):
# do some work
inputs = [...] # define your inputs
Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattenr.

Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool
def multiplyxy(x,y):
return x*y
def funkytuple(t):
"""
Breaks a tuple into a function to be called and a tuple
of arguments for that function. Changes that new tuple into
a series of arguments and passes those arguments to the
function.
"""
f = t[0]
t = t[1]
return f(*t)
def processparallel(func, arglist):
"""
Takes a function and a list of arguments for that function
and proccesses in parallel.
"""
parallelarglist = []
for entry in arglist:
parallelarglist.append((func, tuple(entry)))
cpu_count = int(multiprocessing.cpu_count() - 1)
pool = Pool(processes = cpu_count)
database = pool.map(funkytuple, parallelarglist)
pool.close()
return database
#Necessary on Windows
if __name__ == '__main__':
x = [23, 23, 42, 3254, 32]
y = [324, 234, 12, 425, 13]
i = 0
arglist = []
while i < len(x):
arglist.append([x[i],y[i]])
i += 1
database = processparallel(multiplyxy, arglist)
print(database)

Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
EDIT
This is current script. It appears that it is reading the
entire csv file into queue and doing a queue.join(). Is
it correct that it is loading the entire csv into a queue
then spawning the threads?
The indentation is messed up in your post, so have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded, and can grow to any size if your main loop puts items on it faster than your threads remove items from it. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But that memory use is out of hand suggests that's the case.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
And note what #JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that depends too a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.

Memory usage keep growing with Python's multiprocessing.pool

Here's the program:
#!/usr/bin/python
import multiprocessing
def dummy_func(r):
pass
def worker():
pass
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
# clean up
pool.close()
pool.join()
I found memory usage (both VIRT and RES) kept growing up till close()/join(), is there any solution to get rid of this? I tried maxtasksperchild with 2.7 but it didn't help either.
I have a more complicated program that calles apply_async() ~6M times, and at ~1.5M point I've already got 6G+ RES, to avoid all other factors, I simplified the program to above version.
EDIT:
Turned out this version works better, thanks for everyone's input:
#!/usr/bin/python
import multiprocessing
ready_list = []
def dummy_func(index):
global ready_list
ready_list.append(index)
def worker(index):
return index
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
result = {}
for index in range(0,1000000):
result[index] = (pool.apply_async(worker, (index,), callback=dummy_func))
for ready in ready_list:
result[ready].wait()
del result[ready]
ready_list = []
# clean up
pool.close()
pool.join()
I didn't put any lock there as I believe main process is single threaded (callback is more or less like a event-driven thing per docs I read).
I changed v1's index range to 1,000,000, same as v2 and did some tests - it's weird to me v2 is even ~10% faster than v1 (33s vs 37s), maybe v1 was doing too many internal list maintenance jobs. v2 is definitely a winner on memory usage, it never went over 300M (VIRT) and 50M (RES), while v1 used to be 370M/120M, the best was 330M/85M. All numbers were just 3~4 times testing, reference only.

I had memory issues recently, since I was using multiple times the multiprocessing function, so it keep spawning processes, and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
from multiprocessing import Pool
from contextlib import closing
with closing(Pool(15)) as p:
res = p.imap_unordered(simple_matching, ahugearray, 100)
return res

Simply create the pool within your loop and close it at the end of the loop with
pool.close().

Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink before you can see its memory usage in top. Change the list to a bigger one to see the difference. But note map_async will first convert the iterable you pass to it to a list to calculate its length if it doesn't have __len__ method. If you have an iterator of a huge number of elements, you can use itertools.islice to process them in smaller chunks.
I had a memory problem in a real-life program with much more data and finally found the culprit was apply_async.
P.S., in respect of memory usage, your two examples have no obvious difference.

I have a very large 3d point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out of memory errors. After some research and testing I determined that I was filling the queue of tasks to be processed much quicker than the subprocesses could empty it. I'm sure by chunking, or using map_async or something I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count%10000 == 0:
sys.stdout.write('.')
if len(pool._cache) > 1e6:
print "waiting for cache to clear..."
last.wait() # Where last is assigned the latest ApplyResult
So every 10k insertion into the pool I check if there are more than 1 million operations queued (about 1G of memory used in the main process). When the queue is full I just wait for the last inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW the _cache member is documented the the multiprocessing module pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache

You can limit the number of task per child process
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool. link

I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.

Dumping a multiprocessing.Queue into a list

I wish to dump a multiprocessing.Queue into a list. For that task I've written the following function:
import Queue
def dump_queue(queue):
"""
Empties all pending items in a queue and returns them in a list.
"""
result = []
# START DEBUG CODE
initial_size = queue.qsize()
print("Queue has %s items initially." % initial_size)
# END DEBUG CODE
while True:
try:
thing = queue.get(block=False)
result.append(thing)
except Queue.Empty:
# START DEBUG CODE
current_size = queue.qsize()
total_size = current_size + len(result)
print("Dumping complete:")
if current_size == initial_size:
print("No items were added to the queue.")
else:
print("%s items were added to the queue." % \
(total_size - initial_size))
print("Extracted %s items from the queue, queue has %s items \
left" % (len(result), current_size))
# END DEBUG CODE
return result
But for some reason it doesn't work.
Observe the following shell session:
>>> import multiprocessing
>>> q = multiprocessing.Queue()
>>> for i in range(100):
... q.put([range(200) for j in range(100)])
...
>>> q.qsize()
100
>>> l=dump_queue(q)
Queue has 100 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 1 items from the queue, queue has 99 items left
>>> l=dump_queue(q)
Queue has 99 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 3 items from the queue, queue has 96 items left
>>> l=dump_queue(q)
Queue has 96 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 1 items from the queue, queue has 95 items left
>>>
What's happening here? Why aren't all the items being dumped?

Try this:
import Queue
import time
def dump_queue(queue):
"""
Empties all pending items in a queue and returns them in a list.
"""
result = []
for i in iter(queue.get, 'STOP'):
result.append(i)
time.sleep(.1)
return result
import multiprocessing
q = multiprocessing.Queue()
for i in range(100):
q.put([range(200) for j in range(100)])
q.put('STOP')
l=dump_queue(q)
print len(l)
Multiprocessing queues have an internal buffer which has a feeder thread which pulls work off a buffer and flushes it to the pipe. If not all of the objects have been flushed, I could see a case where Empty is raised prematurely. Using a sentinel to indicate the end of the queue is safe (and reliable). Also, using the iter(get, sentinel) idiom is just better than relying on Empty.
I don't like that it could raise empty due to flushing timing (I added the time.sleep(.1) to allow a context switch to the feeder thread, you may not need it, it works without it - it's a habit to release the GIL).

# in theory:
def dump_queue(q):
q.put(None)
return list(iter(q.get, None))
# in practice this might be more resilient:
def dump_queue(q):
q.put(None)
return list(iter(lambda : q.get(timeout=0.00001), None))
# but neither case handles all the ways things can break
# for that you need 'managers' and 'futures' ... see Commentary
I prefer None for sentinels, but I would tend to agree with jnoller that mp.queue could use a safe and simple sentinel. His comments on risks of getting empty raised early is also valid, see below.
Commentary:
This is old and Python has changed, but, this does come up has a hit if you're having issues with lists <-> queue in MP Python. So, let's look a little deeper:
First off, this is not a bug, it's a feature: https://bugs.python.org/issue20147. To save you some time from reading that discussion and more details in the documentation, here are some highlights (kind of philosophical but I think it might help some who are starting with MP/MT in Python):
MP Queues are structures capable of being communicated with from different threads, different processes on the same system, and in fact can be different (networked) computers
In general with parallel/distributed systems, strict synchronization is expensive, so every time you use part of the API for any MP/MT datastructures, you need to look at the documentation to see what it promises to do, or not. Hint: if a function doesn't include the word "lock" or "semaphore" or "barrier" etc, then it will be some mixture of "asynchronous" and "best effort" (approximate), or what you might call "flaky."
Specific to this situation: Python is an interpreted language, with a famous single interpreter thread with it's famous "Global Interpreter Lock" (GIL). If your entire program is single-process, single threaded, then everything is hunky dory. If not (and with MP it's egregiously not), you need to give the interpreter some breathing room. time.sleep() is your friend. In this case, timeouts.
In your solution you are only using flaky functions - get() and qsize(). And the code is in fact worse than you might think - dial up the size of the queue and the size of the objects and you're likely to break things:
Now, you can work with flaky routines, but you need to give them room to maneuver. In your example you're just hammering that queue. All you need to do is change the line thing = queue.get(block=False) to instead be thing = queue.get(block=True,timeout=0.00001) and you should be fine.
The time 0.00001 is chosen carefully (10^-5), it's about the smallest that you can safely make it (this is where art meets science).
Some comments on why you need the timout: this relates to the internals of how MP queues work. When you 'put' something into an MP queue, it's not actually put into the queue, it's queued up to eventually be there. That's why qsize() happens to give you a correct result - that part of the code knows there's a pile of things "in" the queue. You just need to realize that an object "in" the queue is not the same thing as "i can now read it." Think of MP queues as sending a letter with USPS or FedEx - you might have a receipt and a tracking number showing that "it's in the mail," but the recipient can't open it yet. Now, to be even more specific, in your case you get '0' items accessible right away. That's because the single interpreter thread you're running hasn't had any chance to process stuff that's "queued up", so your first loop just queues up a bunch of stuff for the queue, but you're immediately forcing your single thread to try to do a get() before it's even had a chance to line up even a single object for you.
One might argue that it slows code down to have these timeouts. Not really - MP queues are heavy-weight constructs, you should only be using them to pass pretty heavy-weight "things" around, either big chunks of data, or at least complex computation. the act of adding 10^-5 seconds actually does is give the interpreter a chance to do thread scheduling - at which point it will see your backed-up put() operations.
Caveat
The above is not completely correct, and this is (arguably) an issue with the design of the get() function. The semantics of setting timeout to non-zero is that the get() function will not block for longer than that before returning Empty. But it might not actually be Empty (yet). So if you know your queue has a bunch of stuff to get, then the second solution above works better, or even with a longer timeout. Personally I think they should have kept the timeout=0 behavior, but had some actual built-in tolerance of 1e-5, because a lot of people will get confused about what can happen around gets and puts to MP constructs.
In your example code, you're not actually spinning up parallel processes. If we were to do that, then you'd start getting some random results - sometimes only some of the queue objects will be removed, sometimes it will hang, sometimes it will crash, sometimes more than one thing will happen. In the below example, one process crashes and the other hangs:
The underlying problem is that when you insert the sentinel, you need to know that the queue is finished. That should be done has part of the logic around the queue - if for example you have a classical master-worker design, then the master would need to push a sentinel (end) when the last task has been added. Otherwise you end up with race conditions.
The "correct" (resilient) approach is to involve managers and futures:
import multiprocessing
import concurrent.futures
def fill_queue(q):
for i in range(5000):
q.put([range(200) for j in range(100)])
def dump_queue(q):
q.put(None)
return list(iter(q.get, None))
with multiprocessing.Manager() as manager:
q = manager.Queue()
with concurrent.futures.ProcessPoolExecutor() as executor:
executor.submit(fill_queue, q) # add stuff
executor.submit(fill_queue, q) # add more stuff
executor.submit(fill_queue, q) # ... and more
# 'step out' of the executor
l = dump_queue(q)
# 'step out' of the manager
print(f"Saw {len(l)} items")
Let the manager handle your MP constructs (queues, dictionaries, etc), and within that let the futures handle your processes (and within that, if you want, let another future handle threads). This assures that things are cleaned up as you 'unravel' the work.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.