Python: Reading from Queue slows the ability to write to Queue? - python

I encountered a very puzzling issue while working with Python's multiprocessing module.
The setup is pretty typical. My machine has 32 cores and 244 GB of RAM (thank you AWS). One process to write to an ingestion queue. N processes to do the work I need done, process_data(). M processes to do some preaggregation, preaggregate_results(). One process to do the final aggregation and write the output.
If N is 'large' and M is only 1 or 2, then process_data() is very fast. It basically keeps up with the ingestion process. But since M is very small, the preaggregation is relatively slow and the intermediate_results queue bloats.
Here is the heart of the issue. Every increase in M results in a MARKED decrease in process_data()'s ability to write to the intermediate_results queue. In fact, if N==M==12, the process is so slow that it's not even reasonable to wait for the job to finish. process_data() goes from pacing with the ingestion queue to getting left in the dust.
I included some skeleton code below that just outlines the work flow I'm talking about. It's not to be taken literally. I'm curious if anyone else has encountered this issue before and knows how to solve it. I've talked to many of my coworkers (including code review) and they are just as stumped as I am.
I use multiprocessing all the time with success. This is the first time I've encountered this issue. Any thoughts would be greatly appreciated.
from multiprocessing import Process, Queue
import pandas as pd
import csv

KILL_TOKEN = 'STOP'
NUM_PROCESS_DATA = 14
NUM_PROCESS_PREAGGREGATE = 1

def ingest_data(ingestion_queue):
    ... pandas data munging
    for blah in univariate_data.itertuples():
        ... write to ingestion_queue

def process_data(ingestion_queue, intermediate_results):
    while True:
        data = ingestion_queue.get()
        if data == KILL_TOKEN:
            break
        ... process data
        ... write to intermediate_results

def preaggregate_results(intermediate_results, output_queue):
    while True:
        data = intermediate_results.get()
        if data == KILL_TOKEN:
            break
        ... preaggregation
        ... write to output_queue after kill token is received

def process_output(output_queue):
    while True:
        data = output_queue.get()
        if data == KILL_TOKEN:
            break
        ... final aggregation
        ... write results

if __name__ == '__main__':
    ... the usual
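For what it's worth, "the usual" wiring for a pipeline shaped like this typically looks something like the sketch below. This is only an illustration built from the names in the skeleton above, not the author's actual code; one kill token is sent per consumer and forwarded stage by stage.

if __name__ == '__main__':
    ingestion_queue = Queue()
    intermediate_results = Queue()
    output_queue = Queue()

    workers = [Process(target=process_data, args=(ingestion_queue, intermediate_results))
               for _ in range(NUM_PROCESS_DATA)]
    preaggers = [Process(target=preaggregate_results, args=(intermediate_results, output_queue))
                 for _ in range(NUM_PROCESS_PREAGGREGATE)]
    final = Process(target=process_output, args=(output_queue,))

    for p in workers + preaggers + [final]:
        p.start()

    ingest_data(ingestion_queue)

    # One kill token per process_data worker, then wait for that stage to drain.
    for _ in workers:
        ingestion_queue.put(KILL_TOKEN)
    for p in workers:
        p.join()

    # Same for the preaggregation stage, then the final writer.
    for _ in preaggers:
        intermediate_results.put(KILL_TOKEN)
    for p in preaggers:
        p.join()

    output_queue.put(KILL_TOKEN)
    final.join()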

Related

dask.distributed: wait for all tasks to finish before shutdown (without futures)

TL;DR:
I'm using fire_and_forget to execute tasks on a dask.distributed cluster, so I don't maintain a future for each task. How can I wait until they are all done before the cluster gets shut down?
Details:
I have a workflow that creates a xarray dataset which is persisted on the cluster. Once the computations are done, I want to save the time slices individually and move on to the next dataset.
Until now, I've been using a delayed function and collected a list of delayed tasks which I then passed on to client.compute - this way I was sure everything was done before I moved on to the next dataset. The downside is that everything is blocked until every last file has been written.
Now I'm looking into fire_and_forget to be able to start the computations on the next dataset while the files of the previous one are still being written.
I'm planning to wait for each dataset to be completed before I start the fire_and_forget tasks, so they should have plenty of time to complete.
The only issue I've encountered is that when processing the last dataset, there's no more waiting and the cluster gets shut down right after the last fire_and_forget call, even though the processes are still running.
So is there any way to tell the client it needs to block until all is completed?
Or am I maybe not properly understanding the use of fire_and_forget and should stay with my previous approach?
Here's some example code that simulates the workflow - it does 10 iterations (simulating the different datasets) and then writes the first 10 time slices to pickle files. So in the end I'm expecting 100 pickle files on disk, which is not the case.
import pickle
import random
from time import sleep

from dask import delayed
from dask.distributed import LocalCluster, Client, wait, fire_and_forget
import xarray as xr

@delayed
def dump_delayed(x, fn):
    with open(fn, "wb") as f:
        random.seed(42)
        sleep(random.randint(1, 2))
        pickle.dump(x, f)

TARGET = "/home/jovyan/"

def main():
    cluster = LocalCluster(n_workers=2, ip="0.0.0.0")
    client = Client(cluster)
    ds = xr.tutorial.open_dataset("rasm")

    for it in range(1, 10):
        print("Iteration %s" % it)
        # simulating the processing and persisting
        ds2 = (ds * it).chunk({"time": 1}).persist()
        _ = wait(ds2)

        for ii in range(10):
            fn = TARGET + f"temp{ii}_{it}.pkl"
            xx = ds2.isel(time=ii)
            f = client.persist(dump_delayed(xx, fn))
            fire_and_forget(f)

if __name__ == "__main__":
    main()
Not sure if this qualifies for a solution, but fire_and_forget is for a specific use case where you do not want to track the status of the task. If you are interested in the status of the tasks, it's better to use the regular future.
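As a rough sketch of that future-based approach, adapted from the inner loop of the question (reusing client, ds2, it, TARGET and dump_delayed from above): keep the futures returned by client.compute and wait on them before the cluster is torn down.

from dask.distributed import wait

pending = []
for ii in range(10):
    fn = TARGET + f"temp{ii}_{it}.pkl"
    xx = ds2.isel(time=ii)
    # compute() on the delayed object returns a Future we can keep track of
    pending.append(client.compute(dump_delayed(xx, fn)))

# Block until every write has finished, e.g. before shutting the cluster down.
wait(pending)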

Separate computation from socket work in Python

I'm serializing column data and then sending it over a socket connection.
Something like:
import array, struct, socket

## Socket setup
s = socket.create_connection((ip, addr))

## Data container setup
ordered_col_list = ('col1', 'col2')
columns = dict.fromkeys(ordered_col_list)

for i in range(num_of_chunks):
    ## Binarize data
    columns['col1'] = array.array('i', range(10000))
    columns['col2'] = array.array('f', [float(num) for num in range(10000)])
    .
    .
    .
    ## Send away
    chunk = b''.join(columns[col_name] for col_name in ordered_col_list)
    s.sendall(chunk)
    s.recv(1000)  # get confirmation
I wish to separate the computation from the sending, put them on separate threads or processes, so I can keep doing computations while data is sent away.
I've put the binarizing part as a generator function, then sent the generator to a separate thread, which then yielded binary chunks via a queue.
I collected the data from the main thread and sent it away. Something like:
import array, struct, socket
from time import sleep

try:
    import thread
    from Queue import Queue
except:
    import _thread as thread
    from queue import Queue

## Socket and queue setup
s = socket.create_connection((ip, addr))
chunk_queue = Queue()

def binarize(num_of_chunks):
    ''' Generator function that yields chunks of binary data. In reality it wouldn't be the same data'''
    ordered_col_list = ('col1', 'col2')
    columns = dict.fromkeys(ordered_col_list)
    for i in range(num_of_chunks):
        columns['col1'] = array.array('i', range(10000)).tostring()
        columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
        .
        .
        yield b''.join((columns[col_name] for col_name in ordered_col_list))

def chunk_yielder(queue):
    ''' Generate binary chunks and put them on a queue. To be used from a thread '''
    while True:
        try:
            data_gen = queue.get_nowait()
        except:
            sleep(0.1)
            continue
        else:
            for chunk in data_gen:
                queue.put(chunk)

## Setup thread and data generator
thread.start_new_thread(chunk_yielder, (chunk_queue,))
num_of_chunks = 100
data_gen = binarize(num_of_chunks)
queue.put(data_gen)

## Get data back and send away
while True:
    try:
        binary_chunk = queue.get_nowait()
    except:
        sleep(0.1)
        continue
    else:
        socket.sendall(binary_chunk)
        socket.recv(1000)  # Get confirmation
However, I did not see any performance improvement - it did not run faster.
I don't understand threads/processes too well, and my question is whether it is possible (at all and in Python) to gain from this type of separation, and what would be a good way to go about it, either with threads or processess (or any other way - async etc).
EDIT:
As far as I've come to understand -
Multiprocessing requires serializing any data it sends, so I would effectively be sending every computed chunk twice.
Sending via socket.send() should release the GIL.
Therefore I think (please correct me if I am mistaken) that a threading solution is the right way. However, I'm not sure how to do it correctly.
I know Cython can release the GIL in threads, but since one of the threads only does socket.send/recv, my understanding is that that shouldn't be necessary.
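For illustration only, the thread-based split described in this edit might look roughly like the sketch below: one background thread owns the socket and drains a bounded queue while the main thread keeps producing chunks. HOST, PORT and the fake chunk contents are placeholders, not anything from the original code.

import socket
import threading
import queue

HOST, PORT = "127.0.0.1", 9000          # placeholders

def sender(sock, q):
    """Drain the queue and push chunks down the socket; blocking I/O releases the GIL."""
    while True:
        chunk = q.get()
        if chunk is None:               # sentinel: nothing more to send
            break
        sock.sendall(chunk)
        sock.recv(1000)                 # wait for confirmation

def main():
    sock = socket.create_connection((HOST, PORT))
    q = queue.Queue(maxsize=10)         # bounded, so computation can't run far ahead
    t = threading.Thread(target=sender, args=(sock, q))
    t.start()

    for i in range(100):
        chunk = bytes(40000)            # stand-in for the real binarized columns
        q.put(chunk)                    # blocks if the sender falls behind

    q.put(None)                         # tell the sender we're done
    t.join()
    sock.close()

if __name__ == "__main__":
    main()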
You have two options for running things in parallel in Python: either use the multiprocessing (docs) library, or write the parallel code in Cython and release the GIL. The latter is significantly more work and generally less applicable.
Python threads are limited by the Global Interpreter Lock (GIL); I won't go into detail here as you will find more than enough information online. In short, the GIL, as the name suggests, is a global lock within the CPython interpreter that ensures multiple threads do not simultaneously modify objects that live within the confines of that interpreter. This is why, for instance, Cython programs can run code in parallel: they can exist outside the GIL.
As to your code, one problem is that you're running both the number crunching (binarize) and the socket.send inside the GIL, which will run them strictly serially. The queue is also connected very strangely, and there is a NameError, but let's leave those aside.
With the caveats already pointed out by Jeremy Friesner in mind, I suggest you re-structure the code in the following manner: you have two processes (not threads) one for binarising the data and the other for sending data. In addition to those, there is also the parent process that started both children, and a queue connecting child 1 to child 2.
Subprocess-1 does number crunching and produces crunched data into a queue
Subprocess-2 consumes data from a queue and does socket.send
In code, the setup would look something like this:
from multiprocessing import Process, Queue
work_queue = Queue()
p1 = Process(target=binarize, args=(100, work_queue))
p2 = Process(target=send_data, args=(ip, port, work_queue))
p1.start()
p2.start()
p1.join()
p2.join()
binarize can remain as it is in your code, with the exception that instead of a yield at the end, you add elements into the queue
def binarize(num_of_chunks, q):
    ''' Generator function that yields chunks of binary data. In reality it wouldn't be the same data'''
    ordered_col_list = ('col1', 'col2')
    columns = dict.fromkeys(ordered_col_list)
    for i in range(num_of_chunks):
        columns['col1'] = array.array('i', range(10000)).tostring()
        columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
        data = b''.join((columns[col_name] for col_name in ordered_col_list))
        q.put(data)
send_data should just be the while loop from the bottom of your code, with the connection open/close functionality
def send_data(ip, addr, q):
    s = socket.create_connection((ip, addr))
    while True:
        try:
            binary_chunk = q.get(False)
        except:
            sleep(0.1)
            continue
        else:
            s.sendall(binary_chunk)
            s.recv(1000)  # Get confirmation
    # maybe remember to close the socket before killing the process
Now you have two (three actually, if you count the parent) processes that are processing data independently. You can force the two processes to synchronise their operations by setting the maxsize of the queue to a single element. The operation of these two separate processes is also easy to monitor from your machine's process monitor: top (Linux), Activity Monitor (OS X), or whatever it's called under Windows.
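For example, making the queue hold at most one element keeps the two stages in lock-step (the size here is just an illustration):

from multiprocessing import Queue

# binarize blocks on put() until send_data has taken the previous chunk.
work_queue = Queue(maxsize=1)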
Finally, Python 3 comes with the option of using co-routines, which are neither processes nor threads, but something else entirely. Co-routines are pretty cool from a CS point of view, but a bit of a head scratcher at first. There are plenty of resources to learn from though, like this post on Medium and this talk by David Beazley.
Even more generally, you might want to look into the producer/consumer pattern, if you are not already familiar with it.
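If you want to experiment with the coroutine route, a bare-bones producer/consumer sketch with asyncio could look like the following. Note that coroutines interleave work on a single thread, so they mainly help with overlapping I/O waits rather than CPU-bound number crunching; the chunk contents and delays here are made up.

import asyncio

async def produce(q, num_of_chunks):
    # Stand-in for building binary chunks.
    for i in range(num_of_chunks):
        chunk = bytes(1000)
        await q.put(chunk)          # cooperatively blocks when the queue is full
    await q.put(None)               # sentinel: no more chunks

async def consume(q):
    # Stand-in for sending each chunk over the network.
    while True:
        chunk = await q.get()
        if chunk is None:
            break
        await asyncio.sleep(0.01)   # would be an awaitable send in real code

async def main():
    q = asyncio.Queue(maxsize=10)   # bounded, so the producer can't run away
    await asyncio.gather(produce(q, 100), consume(q))

asyncio.run(main())                 # Python 3.7+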
If you are trying to use concurrency to improve performance in CPython, I would strongly recommend using the multiprocessing library instead of multithreading. This is because of the GIL (Global Interpreter Lock), which can have a huge impact on execution speed (in some cases, it may cause your code to run slower than the single-threaded version). Also, if you would like to learn more about this topic, I recommend reading this presentation by David Beazley. Multiprocessing bypasses this problem by spawning a new Python interpreter instance for each process, thus allowing you to take full advantage of a multi-core architecture.

Python Multiprocessing pool.map unresponsive with too many worker processes

First question on Stack Overflow, so please bear with me. I am looking to calculate the variance for group ratings (long numpy arrays). Running the program without parallel processing works fine, but given that each process can run independently and there are 32 groups, I am looking to make use of multiprocessing to speed things up. This works OK for small numbers of groups (< 10), but after that the program will often just seemingly stop running, with no error messages, at an unspecified number of groups (usually between 20 and 30), although less frequently it will run all the way through. The arrays are quite large (21451 x 11462 user-item ratings), so I am wondering if the problem is caused by not enough memory, although no error messages are printed.
import numpy as np
from functools import partial
import multiprocessing

def variance_parallel(extra_matrices, group_num):
    # do some variation calculation
    # print confirmation that we have entered function, and group number
    return single_group_var

def variance(extra_matrices, num_groups):
    variance_partial = partial(variance_parallel, extra_matrices)
    for g in list(range(num_groups)):
        group_var = pool.map(variance_partial, range(g))
    return group_var

num_cores = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(processes=num_cores)
variance(extra_matrices, num_groups)
Running the above code shows the program progressively building the number of groups it is checking variance on ([0],[0,1],[0,1,2],...) before eventually printing nothing.
Thanks in advance for any help and apologies if my formatting / question is a bit off!
Multiple processes do not share data
Data sent to processes needs to be copied
Since the arrays are large, the issue is very likely to do with said copying of large arrays to the processes. Furthermore, in Python's multiprocessing, sending data to processes is done by serialisation, which is (a) CPU intensive and (b) takes extra memory in and of itself.
In short, multiprocessing is not a good fit for your use case. Since numpy is a native code extension (where the GIL does not apply) and is thread safe, it is best to use threading instead of multiprocessing. With threading, the worker threads can share data via their parent process's address space, which does away with having to copy.
That should stop the program from running out of memory.
However, for threads to share an address space, the data they share needs to be bound to an object, for example a Python class.
Something like the below - untested as the code sample is incomplete.
import numpy as np
from functools import partial
from threading import Thread
from multiprocessing import cpu_count

class Variance(Thread):
    def __init__(self, extra_matrices, group_num):
        Thread.__init__(self)
        self.extra_matrices = extra_matrices
        self.group_num = group_num
        self.output = None

    def run(self):
        # do some variation calculation
        # print confirmation that we have entered function, and group number
        self.output = single_group_var

num_cores = cpu_count() - 1
results = []
for g in list(range(num_groups)):
    workers = [Variance(extra_matrices, range(g))
               for _ in range(num_cores)]
    # Start threads
    for worker in workers:
        worker.start()
    # Wait for completion
    for worker in workers:
        worker.join()
    results.extend([w.output for w in workers])
print(results)

multiprocessing.Pool.imap_unordered with fixed queue size or buffer?

I am reading data from large CSV files, processing it, and loading it into a SQLite database. Profiling suggests 80% of my time is spent on I/O and 20% is processing input to prepare it for DB insertion. I sped up the processing step with multiprocessing.Pool so that the I/O code is never waiting for the next record. But, this caused serious memory problems because the I/O step could not keep up with the workers.
The following toy example illustrates my problem:
#!/usr/bin/env python
# Python 3.4.3
import time
from multiprocessing import Pool

def records(num=100):
    """Simulate generator getting data from large CSV files."""
    for i in range(num):
        print('Reading record {0}'.format(i))
        time.sleep(0.05)  # getting raw data is fast
        yield i

def process(rec):
    """Simulate processing of raw text into dicts."""
    print('Processing {0}'.format(rec))
    time.sleep(0.1)  # processing takes a little time
    return rec

def writer(records):
    """Simulate saving data to SQLite database."""
    for r in records:
        time.sleep(0.3)  # writing takes the longest
        print('Wrote {0}'.format(r))

if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        writer(pool.imap_unordered(process, data, chunksize=5))
This code results in a backlog of records that eventually consumes all memory because I cannot persist the data to disk fast enough. Run the code and you'll notice that Pool.imap_unordered will consume all the data when writer is at the 15th record or so. Now imagine the processing step is producing dictionaries from hundreds of millions of rows and you can see why I run out of memory. Amdahl's Law in action perhaps.
What is the fix for this? I think I need some sort of buffer for Pool.imap_unordered that says "once there are x records that need insertion, stop and wait until there are less than x before making more." I should be able to get some speed improvement from preparing the next record while the last one is being saved.
I tried using NuMap from the papy module (which I modified to work with Python 3) to do exactly this, but it wasn't faster. In fact, it was worse than running the program sequentially; NuMap uses two threads plus multiple processes.
Bulk import features of SQLite are probably not suited to my task because the data need substantial processing and normalization.
I have about 85G of compressed text to process. I'm open to other database technologies, but picked SQLite for ease of use and because this is a write-once read-many job in which only 3 or 4 people will use the resulting database after everything is loaded.
As I was working on the same problem, I figured that an effective way to prevent the pool from overloading is to use a semaphore with a generator:
from multiprocessing import Pool, Semaphore

def produce(semaphore, from_file):
    with open(from_file) as reader:
        for line in reader:
            # Reduce Semaphore by 1 or wait if 0
            semaphore.acquire()
            # Now deliver an item to the caller (pool)
            yield line

def process(item):
    result = (first_function(item),
              second_function(item),
              third_function(item))
    return result

def consume(semaphore, result):
    database_con.cur.execute("INSERT INTO ResultTable VALUES (?,?,?)", result)
    # Result is consumed, semaphore may now be increased by 1
    semaphore.release()

def main():
    global database_con
    semaphore_1 = Semaphore(1024)
    with Pool(2) as pool:
        for result in pool.imap_unordered(process, produce(semaphore_1, "workfile.txt"), chunksize=128):
            consume(semaphore_1, result)
See also:
K Hong - Multithreading - Semaphore objects & thread pool
Lecture from Chris Terman - MIT 6.004 L21: Semaphores
Since processing is fast but writing is slow, it sounds like your problem is I/O-bound. Therefore there might not be much to be gained from using multiprocessing.
However, it is possible to peel off chunks of data, process each chunk, and wait until that data has been written before peeling off another chunk:
import itertools as IT

if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        chunksize = ...
        for chunk in iter(lambda: list(IT.islice(data, chunksize)), []):
            writer(pool.imap_unordered(process, chunk, chunksize=5))
It sounds like all you really need is to replace the unbounded queues underneath the Pool with bounded (and blocking) queues. That way, if any side gets ahead of the rest, it'll just block until they're ready.
This would be easy to do by peeking at the source and subclassing or monkeypatching Pool, something like:
import multiprocessing.pool
import queue

class Pool(multiprocessing.pool.Pool):
    def _setup_queues(self):
        self._inqueue = self._ctx.Queue(5)
        self._outqueue = self._ctx.Queue(5)
        self._quick_put = self._inqueue._writer.send
        self._quick_get = self._outqueue._reader.recv
        self._taskqueue = queue.Queue(10)
But that's obviously not portable (even to CPython 3.3, much less to a different Python 3 implementation).
I think you can do it portably in 3.4+ by providing a customized context, but I haven't been able to get that right, so…
A simple workaround might be to use psutil to detect the memory usage in each process and, say, if more than 90% of memory is taken, just sleep for a while.
import time
import psutil

while psutil.virtual_memory().percent > 75:
    time.sleep(1)
    print("process paused for 1 second!")

Memory usage keeps growing with Python's multiprocessing.pool

Here's the program:
#!/usr/bin/python
import multiprocessing

def dummy_func(r):
    pass

def worker():
    pass

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    for index in range(0, 100000):
        pool.apply_async(worker, callback=dummy_func)

    # clean up
    pool.close()
    pool.join()
I found that memory usage (both VIRT and RES) kept growing until close()/join(); is there any solution to get rid of this? I tried maxtasksperchild with 2.7 but it didn't help.
I have a more complicated program that calls apply_async() ~6M times, and at around the 1.5M mark I already have 6G+ RES; to avoid all other factors, I simplified the program to the version above.
EDIT:
Turned out this version works better, thanks for everyone's input:
#!/usr/bin/python
import multiprocessing

ready_list = []

def dummy_func(index):
    global ready_list
    ready_list.append(index)

def worker(index):
    return index

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    result = {}
    for index in range(0, 1000000):
        result[index] = (pool.apply_async(worker, (index,), callback=dummy_func))
        for ready in ready_list:
            result[ready].wait()
            del result[ready]
        ready_list = []

    # clean up
    pool.close()
    pool.join()
I didn't put any lock there, as I believe the main process is single-threaded (the callback is more or less an event-driven thing, per the docs I read).
I changed v1's index range to 1,000,000 (the same as v2) and ran some tests - it's weird to me that v2 is even ~10% faster than v1 (33s vs 37s); maybe v1 was doing too many internal list maintenance jobs. v2 is definitely the winner on memory usage: it never went over 300M (VIRT) and 50M (RES), while v1 used to be 370M/120M, the best being 330M/85M. All numbers are from only 3-4 test runs, for reference only.
I had memory issues recently, since I was calling the multiprocessing function many times; it kept spawning processes and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
    from multiprocessing import Pool
    from contextlib import closing
    with closing(Pool(15)) as p:
        res = p.imap_unordered(simple_matching, ahugearray, 100)
        return res
Simply create the pool within your loop and close it at the end of the loop with pool.close().
Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0, 100000):
    pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink, before you can see its memory usage in top. Change the list to a bigger one to see the difference. But note that map_async will first convert the iterable you pass to it into a list to calculate its length if it doesn't have a __len__ method. If you have an iterator with a huge number of elements, you can use itertools.islice to process them in smaller chunks, as sketched below.
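A rough sketch of that chunking idea (not from the original answer): slice a fixed number of items off the iterator, submit them with map_async, and wait for the batch to finish before slicing the next one. The worker and the chunk size are placeholders.

import itertools
import multiprocessing

def worker(index):
    return index

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    items = iter(range(1000000))      # stand-in for a huge iterator
    chunk_size = 10000
    while True:
        chunk = list(itertools.islice(items, chunk_size))
        if not chunk:
            break
        # Wait for each batch so unfinished tasks never pile up in memory.
        pool.map_async(worker, chunk).wait()
    pool.close()
    pool.join()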
I had a memory problem in a real-life program with much more data and finally found the culprit was apply_async.
P.S. With respect to memory usage, your two examples show no obvious difference.
I have a very large 3d point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out of memory errors. After some research and testing I determined that I was filling the queue of tasks to be processed much quicker than the subprocesses could empty it. I'm sure by chunking, or using map_async or something I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count % 10000 == 0:
    sys.stdout.write('.')
    if len(pool._cache) > 1e6:
        print("waiting for cache to clear...")
        last.wait()  # Where last is assigned the latest ApplyResult
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW, the _cache member is documented in the multiprocessing module's pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache
You can limit the number of tasks per child process:
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool (from the docs).
I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.
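A sketch of that throttling idea applied to the apply_async pattern from this question (none of this is the original poster's code): acquire a semaphore before each submit and release it in the callback, so only a bounded number of tasks are ever in flight. The limit of 1000 and the worker are placeholders.

import multiprocessing
import threading

def worker(index):
    return index

def main():
    limit = threading.BoundedSemaphore(1000)   # max tasks in flight at once

    def on_done(result):
        # Consume/write the result here, then free a slot for the next submit.
        limit.release()

    pool = multiprocessing.Pool(processes=16)
    for index in range(1000000):
        limit.acquire()                        # blocks once 1000 tasks are pending
        pool.apply_async(worker, (index,), callback=on_done,
                         error_callback=lambda exc: limit.release())
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()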
