Enqueuing a tf.RandomShuffleQueue from multiple processes using multiprocessing - python

I would like to use multiple processes (not threads) to do some preprocessing and enqueue the results to a tf.RandomShuffleQueue which can be used by my main graph for training.
Is there a way to do that ?
My actual problem
I have converted my dataset into TFRecords split across 256 shards. I want to start 20 processes using multiprocessing and let each process handle a range of shards. Each process should read images, augment them, and push them into a tf.RandomShuffleQueue from which the input can be given to a graph for training.
Some people advised me to go through the inception example in tensorflow. However, it is a very different situation, because there only the reading of the data shards is done by multiple threads (not processes), while the preprocessing (e.g. augmentation) takes place in the main thread.

(This aims to solve your actual problem)
In another topic, someone told you that Python has the global interpreter lock (GIL) and therefore there would be no speed benefits from multi-core, unless you used multiple processes.
This was probably what prompted your desire to use multiprocessing.
However, with TF, Python is normally used only to construct the "graph". The actual execution happens in native code (or GPU), where GIL plays no role whatsoever.
In light of this, I recommend simply letting TF use multithreading. This can be controlled using the intra_op_parallelism_threads option, passed through tf.ConfigProto, such as:
with tf.Session(graph=graph,
                config=tf.ConfigProto(intra_op_parallelism_threads=20)) as sess:
    # ...
(Side note: if you have, say, a 2-CPU, 32-core system, the best argument may very well be intra_op_parallelism_threads=16, depending on a lot of factors)

Comment: The pickling of TFRecords is not that important.
I can pass a list of lists containing names of ranges of sharded TFRecord files.
Then I have to restart the decision process!
Comment: I can pass it to a Pool.map() as an argument.
Verify whether a multiprocessing.Queue() can handle this.
The result of a Tensor function is a Tensor object.
Try the following:
tensor_object = func(TFRecord)
q = multiprocessing.Manager().Queue()
q.put(tensor_object)
data = q.get()
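Whether an object can travel through a multiprocessing queue comes down to whether pickle can serialise it. A minimal sketch of such a check, using only the standard library (the helper name is_picklable is hypothetical, not part of any library):

```python
import pickle


def is_picklable(obj):
    """Return True if obj can be serialised with pickle.

    multiprocessing queues pickle every item they transport, so this is a
    quick proxy for "can this object be sent to another process?".
    """
    try:
        pickle.dumps(obj)
    except Exception:  # pickle raises several different error types
        return False
    return True


if __name__ == "__main__":
    print(is_picklable(["shard-000", "shard-001"]))  # plain data such as file names
    print(is_picklable(lambda x: x))                 # a lambda, which pickle rejects
```

Running the same check on the object returned by func(TFRecord) should tell you whether it is worth sending through the queue, or whether only file names should cross the process boundary.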
Comment: how do I make sure that all the processes enqueue to the same queue?
This is simply done by enqueuing the results from Pool.map(...)
after all processes have finished.
Alternatively, we can enqueue in parallel, queueing data from all processes.
But doing so depends on picklable data, as described above.
For instance:
import multiprocessing as mp
def func(filename):
    TFRecord = read(filename)
    tensor_obj = tf.func(TFRecord)
    return tensor_obj
def main_Tensor(tensor_objs):
    tf = ...  # instantiate Tensor Session
    rsq = tf.RandomShuffleQueue(...)
    for t in tensor_objs:
        rsq.enqueue(t)
if __name__ == '__main__':
    sharded_TFRecords = ['file1', 'file2']
    with mp.Pool(20) as pool:
        tensor_objs = pool.map(func, sharded_TFRecords)
    main_Tensor(tensor_objs)

It seems the recommended way to run TF with multiprocessing is via creating a separate tf.Session for each child as sharing it across processes is unfeasible.
You can take a look at this example, I hope it helps.
[EDIT: Old answer]
You can use a multiprocessing.Pool and rely on its callback mechanism to put results in the tf.RandomShuffleQueue as soon as they are ready.
Here's a very simple example on how to do it.
from multiprocessing import Pool
class Processor(object):
    def __init__(self, random_shuffle_queue):
        self.queue = random_shuffle_queue
        self.pool = Pool()
    def schedule_task(self, task):
        self.pool.apply_async(processing_function, args=[task], callback=self.task_done)
    def task_done(self, results):
        # called in the parent process as soon as a task has finished
        self.queue.enqueue(results)
This assumes Python 2, for Python 3 I'd recommend to use a concurrent.futures.ProcessPoolExecutor.
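For the Python 3 route, a rough sketch of the same callback idea with concurrent.futures could look like the following. It rests on a few assumptions: processing_function is a placeholder for the real preprocessing, a plain queue.Queue stands in for the TensorFlow queue, and the executor_cls parameter is a hypothetical convenience so the wiring can be tried with threads first.

```python
import queue
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def processing_function(task):
    """Stand-in for the real preprocessing; here it just squares a number."""
    return task * task


def process_all(tasks, results, executor_cls=ProcessPoolExecutor):
    """Run processing_function on every task and push each result into
    `results` (anything with a put() method) as soon as it is ready.

    Pass executor_cls=ThreadPoolExecutor to try the wiring without
    starting any processes.
    """
    with executor_cls() as executor:
        for task in tasks:
            future = executor.submit(processing_function, task)
            # The callback runs in the parent process once the future is done.
            future.add_done_callback(lambda f: results.put(f.result()))
    # Leaving the with-block waits for all submitted tasks to finish.
    return results
```

With the default ProcessPoolExecutor, call process_all from under an `if __name__ == '__main__':` guard. Results arrive in completion order, which is usually acceptable when they end up in a shuffling queue anyway.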


How do I run two looping functions parallel to each other? [duplicate]

Suppose I have the following in Python
# A loop
for i in range(10000):
    Do Task A
# B loop
for i in range(10000):
    Do Task B
How do I run these loops simultaneously in Python?
If you want concurrency, here's a very simple example:
from multiprocessing import Process
def loop_a():
    while 1:
        print("a")
def loop_b():
    while 1:
        print("b")
if __name__ == '__main__':
    Process(target=loop_a).start()
    Process(target=loop_b).start()
This is just the most basic example I could think of. Be sure to read http://docs.python.org/library/multiprocessing.html to understand what's happening.
If you want to send data back to the program, I'd recommend using a Queue (which in my experience is easiest to use).
You can use a thread instead if you don't mind the global interpreter lock. Processes are more expensive to instantiate but they offer true concurrency.
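As a small illustration of the thread variant combined with a queue for sending data back, here is a sketch in which two loops run concurrently and report their items to the main thread. The names loop and run_both are made up for this example.

```python
import queue
import threading


def loop(name, count, out):
    """Put `count` tagged items on the shared queue."""
    for i in range(count):
        out.put((name, i))


def run_both(count):
    """Run two loops concurrently in threads and return everything they produced."""
    out = queue.Queue()
    threads = [
        threading.Thread(target=loop, args=("a", count, out)),
        threading.Thread(target=loop, args=("b", count, out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Both producers have finished, so the queue can simply be drained.
    items = []
    while not out.empty():
        items.append(out.get())
    return items


if __name__ == "__main__":
    print(len(run_both(5)))
```

The process version has the same shape with multiprocessing.Process and multiprocessing.Queue, except that there it is safer to read the expected number of items before joining the processes rather than relying on empty().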
There are many possible options for what you wanted:
Use a loop
As many people have pointed out, this is the simplest way.
for i in xrange(10000):
    # use xrange instead of range (Python 2 only; in Python 3 range is already lazy)
    taskA()
    taskB()
Merits: easy to understand and use, no extra library needed.
Drawbacks: taskB must be done after taskA, or the other way around. They can't run simultaneously.
Another thought would be to run two processes at the same time. Python provides the multiprocessing library; the following is a simple example:
from multiprocessing import Process
p1 = Process(target=taskA, args=args, kwargs=kwargs)
p2 = Process(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
Merits: tasks can run simultaneously in the background, you can control tasks (end them, stop them, etc.), tasks can exchange data, and they can be synchronized if they compete for the same resources.
Drawbacks: too heavy! The OS will frequently switch between them, and each has its own data space even if the data is redundant. If you have a lot of tasks (say 100 or more), it's not what you want.
Threading is like multiprocessing, just more lightweight. Check out this post. Their usage is quite similar:
import threading
p1 = threading.Thread(target=taskA, args=args, kwargs=kwargs)
p2 = threading.Thread(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
Libraries like greenlet and gevent provide something called coroutines, which are supposed to be faster than threading. No examples are provided here; please search for how to use them if you're interested.
Merits: more flexible and lightweight.
Drawbacks: extra library needed, learning curve.
Why do you want to run the two processes at the same time? Is it because you think they will go faster (there is a good chance that they won't)? Why not run the tasks in the same loop, e.g.
for i in range(10000):
    doTaskA()
    doTaskB()
The obvious answer to your question is to use threads - see the python threading module. However, threading is a big subject and has many pitfalls, so read up on it before you go down that route.
Alternatively you could run the tasks in separate processes, using the python multiprocessing module. If both tasks are CPU intensive this will make better use of multiple cores on your computer.
There are other options such as coroutines, stackless tasklets, greenlets, CSP, etc., but without knowing more about Task A and Task B and why they need to run at the same time, it is impossible to give a more specific answer.
from threading import Thread
def loopA():
    for i in range(10000):
        pass  # Do task A
def loopB():
    for i in range(10000):
        pass  # Do task B
threadA = Thread(target=loopA)
threadB = Thread(target=loopB)
threadA.start()
threadB.start()
# Do work independent of loopA and loopB
threadA.join()
threadB.join()
You could use threading or multiprocessing.
How about a single loop: for i in range(10000): Do Task A, Do Task B? Without more information I don't have a better answer.
I find that using the Pool class from multiprocessing works amazingly well for executing multiple processes at once within a Python script.
See Section: Using a pool of workers
Look carefully at "# launching multiple evaluations asynchronously may use more processes" in the example. Once you understand what those lines are doing, the following example I constructed will make a lot of sense.
import numpy as np
from multiprocessing import Pool
def desired_function(option, processes, data, etc...):
    # your code will go here. option allows you to make choices within your script
    # to execute desired sections of code for each pool or subprocess.
    return result_array # "for example"
result_array = np.zeros("some shape") # This is normally populated by 1 loop, lets try 4.
processes = 4
pool = Pool(processes=processes)
args = (processes, data, etc...) # Arguments to be passed into desired function.
multiple_results = []
for i in range(processes): # Submits one call per option (1-4 in this case).
    multiple_results.append(pool.apply_async(desired_function, (i+1,)+args)) # Runs asynchronously.
results = np.array([res.get() for res in multiple_results]) # Retrieves results after
                                                            # every call is finished!
for i in range(processes):
    result_array = result_array + results[i] # Combines all datasets!
The code will basically run the desired function for a set number of processes. You will have to carefully make sure your function can distinguish between each process (hence why I added the variable "option".) Additionally, it doesn't have to be an array that is being populated in the end, but for my example, that's how I used it. Hope this simplifies or helps you better understand the power of multiprocessing in Python!
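To make the role of "option" concrete, here is a small self-contained sketch under some assumptions: the data is a plain list, each call handles every processes-th element, and the squared sum is only a placeholder for real work. The helper run_in_pool is hypothetical.

```python
from multiprocessing import Pool


def desired_function(option, processes, data):
    """Handle the share of `data` selected by `option` (1..processes)."""
    share = data[option - 1::processes]  # every processes-th item, offset by option
    return sum(x * x for x in share)     # placeholder for the real computation


def run_in_pool(data, processes=4):
    """Launch one asynchronous call per option and add up the partial results."""
    with Pool(processes=processes) as pool:
        pending = [pool.apply_async(desired_function, (i + 1, processes, data))
                   for i in range(processes)]
        # get() blocks until the corresponding call has finished.
        return sum(res.get() for res in pending)
```

Call run_in_pool(...) from under an `if __name__ == '__main__':` guard. Because each option selects a disjoint share of the data, adding the partial results gives the same total as a single loop over everything.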

Write python with joblib in parallel in the list

I use joblib to work in parallel, I want to write the results in parallel in a list.
So as to avoid problems, I create an ldata = [] list beforehand, so that it can be easily accessed.
During parallelization, the data are available in the list, but no longer when they are put together.
How can data be saved in parallel?
from joblib import Parallel, delayed
import multiprocessing
data = []
def worker(i):
    ldata = []
    ... # create list ldata
    data[i].append(ldata)
for i in range(0, 1000):
    data.append([])
num_cores = multiprocessing.cpu_count()
Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
resultlist = []
for i in range(0, 1000):
    resultlist.extend(data[i])
You should look at Parallel as a parallel map operation that does not allow for side effects. The execution model of Parallel is that, by default, it starts new worker copies of the master process, serialises the input data, sends it over to the workers, has them iterate over it, and then collects the return values. Any change a worker makes to data stays in its own memory space and is thus invisible to the master process. You have two options here:
First, you can have your workers return ldata instead of updating data[i]. In that case, data will have to be assigned the result returned by Parallel(...)(...):
def worker(i):
    ldata = []
    ... # create list ldata
    return ldata
data = Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
The second option is to force shared-memory semantics by using threads instead of processes. When workers execute in threads, their memory space is that of the master process, which is where data lives originally. To enforce these semantics, add the require='sharedmem' keyword argument in the call to Parallel:
Parallel(n_jobs=num_cores, require='sharedmem')(delayed(worker)(i) for i in range(0, 1000))
The different modes and semantics are explained in the joblib documentation here.
Keep in mind that your worker() function is written in pure Python and is therefore interpreted. This means that worker threads can't run fully concurrently even if there is just one thread per CPU due to the dreaded Global Interpreter Lock (GIL). This is also explained in the documentation. Therefore, you'd better stick with the first solution in general, despite the marshalling and interprocess communication overheads.

How to have a multi-processing function return and store values in python?

I have a function which I will run using multi-processing. However the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.
cores = []
for i in range(os.cpu_count()):
    cores.append(Process(target=processImages, args=(dataSets[i],)))
for core in cores:
    core.start()
for core in cores:
    core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
Use those simple multiprocessing concepts like this:
import os
from multiprocessing import Pool
# Renaming this variable just for clarity of the example here
work_queue = dataSets
# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()
# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)
# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)
# Do something with the result
You cannot return the variable from another process. The recommended way would be to create a Queue (multiprocessing.Queue), then have your subprocess put the results to that queue, and once it's done, you may read them back -- this works if you have a lot of results.
If you just need a single number -- using Value or Array could be easier.
Just remember, you cannot use a simple variable for that, it has to be wrapped with above mentioned classes from multiprocessing lib.
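A minimal sketch of the queue approach follows. It assumes each worker tags its result with the index of its dataset so the parent can restore the input order, and it uses sum() as a placeholder for the real work. The function names are made up for the example.

```python
import multiprocessing as mp


def worker(index, dataset, out_queue):
    """Compute a result for one dataset and send it back, tagged with its index."""
    out_queue.put((index, sum(dataset)))  # sum() is a placeholder for real work


def gather(out_queue, count):
    """Read `count` tagged results and return the values in input order."""
    tagged = [out_queue.get() for _ in range(count)]
    return [value for _, value in sorted(tagged)]


def run(datasets):
    """Start one process per dataset and collect the returned values."""
    out_queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, d, out_queue))
             for i, d in enumerate(datasets)]
    for p in procs:
        p.start()
    results = gather(out_queue, len(procs))  # drain the queue before joining
    for p in procs:
        p.join()
    return results
```

Call run(...) from under an `if __name__ == '__main__':` guard so that it also works on platforms where child processes are spawned rather than forked. Reading the results before join() matters: a child that still has data buffered for the queue may not exit until that data has been consumed.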
If you want to use the result object returned by a function that runs in a pool, try this:
from multiprocessing.pool import ThreadPool
def fun(fun_argument1, ... , fun_argumentn):
    return object_1, object_2
pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
Then you can do whatever you want with them. Note that ThreadPool runs the function in threads; multiprocessing.Pool offers the same apply_async interface with processes.

python3 multiprocess shared numpy array(read-only)

I'm not sure if this title is appropriate for my situation: the reason why I want to share numpy array is that it might be one of the potential solutions to my case, but if you have other solutions that would also be nice.
My task: I need to implement an iterative algorithm with multiprocessing, where each of the processes needs to have a copy of the data (this data is large and read-only, and won't change during the iterative algorithm).
I've written some pseudo code to demonstrate my idea:
import multiprocessing
def worker_func(data, args):
    # do sth...
    return res
def compute(data, process_num, niter):
    result = []
    args = init()
    for iter in range(niter):
        args_chunk = split_args(args, process_num)
        pool = multiprocessing.Pool()
        for i in range(process_num):
            result.append(pool.apply_async(worker_func,(data, args_chunk[i])))
        # aggregate result and update args
        for res in result:
            args = update_args(res.get())
if __name__ == "__main__":
    compute(data, 4, 100)
The problem is in each iteration, I have to pass the data to subprocess, which is very time-consuming.
I've come up with two potential solutions:
share data among processes (it's ndarray), that's the title of this question.
Keep subprocess alive, like a daemon process or something...and wait for call. By doing that, I only need to pass the data at the very beginning.
So, is there any way to share a read-only numpy array among processes? Or if you have a good implementation of solution 2, that also works.
Thanks in advance.
If you absolutely must use Python multiprocessing, then you can use Python multiprocessing along with Arrow's Plasma object store to store the object in shared memory and access it from each of the workers. See this example, which does the same thing using a Pandas dataframe instead of a numpy array.
If you don't absolutely need to use Python multiprocessing, you can do this much more easily with Ray. One advantage of Ray is that it will work out of the box not just with arrays but also with Python objects that contain arrays.
Under the hood, Ray serializes Python objects using Apache Arrow, which is a zero-copy data layout, and stores the result in Arrow's Plasma object store. This allows worker tasks to have read-only access to the objects without creating their own copies. You can read more about how this works.
Here is a modified version of your example that runs.
import numpy as np
import ray
ray.init()
@ray.remote
def worker_func(data, i):
    # Do work. This function will have read-only access to
    # the data array.
    return 0
data = np.zeros(10**7)
# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)
# Run worker_func 10 times in parallel. This will not create any copies
# of the array. The tasks will run in separate processes.
result_ids = []
for i in range(10):
    result_ids.append(worker_func.remote(data_id, i))
# Get the results.
results = ray.get(result_ids)
Note that if we omitted the line data_id = ray.put(data) and instead called worker_func.remote(data, i), then the data array would be stored in shared memory once per function call, which would be inefficient. By first calling ray.put, we can store the object in the object store a single time.
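If you stay with the standard library, the second idea from the question (keep the workers alive and hand them the data only once) can be sketched with a pool initializer. This is only an illustration under assumptions: the global _DATA, the helper names, and the placeholder computation are invented here, and the executor_cls parameter exists only so the logic can be tried with threads.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

_DATA = None  # set once per worker by _init_worker


def _init_worker(data):
    """Runs once when a worker starts; keeps the large data in a global."""
    global _DATA
    _DATA = data


def worker_func(args):
    """Use the data stored at start-up; only the small `args` is sent per call."""
    return sum(_DATA) * args  # placeholder computation


def compute(data, args_list, niter, executor_cls=ProcessPoolExecutor):
    """Create the workers once, then reuse them for every iteration.

    ThreadPoolExecutor accepts the same arguments and is handy for a dry run.
    """
    results = []
    with executor_cls(max_workers=4, initializer=_init_worker,
                      initargs=(data,)) as executor:
        for _ in range(niter):
            results = list(executor.map(worker_func, args_list))
            # ... aggregate results and update args_list here ...
    return results
```

With the default ProcessPoolExecutor, call compute from under an `if __name__ == "__main__":` guard. Each worker process still holds its own copy of the data, so this removes the repeated transfer per iteration but not the per-process memory cost.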
Conceptually, for your problem, using mmap is a standard approach.
This way, the data can be read from mapped memory by multiple processes.
For a basic understanding of mmap:
Python has an "mmap" module (import mmap).
See the Python standard library documentation for the mmap module for details and some examples.
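A minimal sketch of the idea using only the standard library: the data is written to a file once as raw float64 values, and any process can then map that file read-only and pull out the slice it needs. The helper names and the file layout are assumptions made for this example.

```python
import mmap
import os
import tempfile
from array import array

ITEM_SIZE = array("d").itemsize  # bytes per float64 value


def write_values(path, values):
    """Write the values to disk once as raw float64 data."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def read_slice(path, start, stop):
    """Map the file read-only and return values[start:stop] as a list."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map copies only the requested bytes.
            raw = mm[start * ITEM_SIZE:stop * ITEM_SIZE]
    out = array("d")
    out.frombytes(raw)
    return out.tolist()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "data.bin")
    write_values(path, range(10))
    print(read_slice(path, 2, 5))
```

Each worker process can call read_slice on the same path independently, and the operating system shares the mapped pages between them. For a NumPy array, numpy.memmap provides an array view over this kind of file.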

Creating a Queue delay in a Python pool without blocking

I have a large program (specifically, a function) that I'm attempting to parallelize using a JoinableQueue and the multiprocessing map_async method. The function that I'm working with does several operations on multidimensional arrays, so I break up each array into sections, and each section evaluates independently; however I need to stitch together one of the arrays early on, but the "stitch" happens before the "evaluate" and I need to introduce some kind of delay in the JoinableQueue. I've searched all over for a workable solution but I'm very new to multiprocessing and most of it goes over my head.
This phrasing may be confusing- apologies. Here's an outline of my code (I can't put all of it because it's very long, but I can provide additional detail if needed)
import numpy as np
import multiprocessing as mp
from multiprocessing import Pool, Pipe, JoinableQueue
def main_function(section_number):
    #define section sizes
    array_this_section = array[:,start:end+1,:]
    histogram_this_section = np.zeros((3, dataset_size, dataset_size))
    #start and end are defined according to the size of the array
    #dataset_size is to show that the histogram is a different size than the array
    for m in range(1,num_iterations+1):
        #do several operations- each section of the array
        #corresponds to a section on the histogram
        #each process sends their own part of the histogram outside of the pool
        #to be combined with every other part- later operations
        #in this function must use the full histogram
        full_histogram = full_hist_queue.get()
        #do many more operations
hist_queue = JoinableQueue()
full_hist_queue = JoinableQueue()
if __name__ == '__main__':
    pool = mp.Pool(num_sections)
    args = np.arange(num_sections)
    pool.map_async(main_function, args, chunksize=1)
    #I need the map_async because the program is designed to display an output at the
    #end of each iteration, and each output must be a compilation of all processes
    #a few variable definitions go here
    for m in range(1,num_iterations+1):
        for i in range(num_sections):
            temp_hist = hist_queue.get() #the code hangs here because the queue
                                         #is attempting to get before anything
                                         #has been put
            hist_full += temp_hist
        for i in range(num_sections):
        for i in range(num_sections):
            full_hist_queue.put(hist_full) #the full histogram is sent back into
                                           #the pool
    #etc etc
I'm pretty sure that your issue is how you're creating the Queues and trying to share them with the child processes. If you just have them as global variables, they may be recreated in the child processes instead of inherited (the exact details depend on what OS and/or context you're using for multiprocessing).
A better way to go about solving this issue is to avoid using multiprocessing.Pool to spawn your processes and instead explicitly create Process instances for your workers yourself. This way you can pass the Queue instances to the processes that need them without any difficulty (it's technically possible to pass the queues to the Pool workers, but it's awkward).
I'd try something like this:
def worker_function(section_number, hist_queue, full_hist_queue): # take queues as arguments
    # ... the rest of the function can work as before
    # note, I renamed this from "main_function" since it's not running in the main process
if __name__ == '__main__':
    hist_queue = JoinableQueue() # create the queues only in the main process
    full_hist_queue = JoinableQueue() # the workers don't need to access them as globals
    processes = [Process(target=worker_function, args=(i, hist_queue, full_hist_queue))
                 for i in range(num_sections)]
    for p in processes:
        p.start()
    # ...
If the different stages of your worker function are more or less independent of one another (that is, the "do many more operations" step doesn't depend directly on the "do several operations" step above it, just on full_histogram), you might be able to keep the Pool and instead split up the different steps into separate functions, which the main process could call via several calls to map on the pool. You don't need to use your own Queues in this approach, just the ones built in to the Pool. This might be best especially if the number of "sections" you're splitting the work up into doesn't correspond closely with the number of processor cores on your computer. You can let the Pool match the number of cores, and have each one work on several sections of the data in turn.
A rough sketch of this would be something like:
import itertools
def worker_make_hist(section_number):
    # do several operations, get a partial histogram
    return histogram_this_section
def worker_do_more_ops(section_number, full_histogram):
    # whatever...
    return some_result
if __name__ == "__main__":
    pool = multiprocessing.Pool() # by default the size will be equal to the number of cores
    for temp_hist in pool.imap_unordered(worker_make_hist, range(number_of_sections)):
        hist_full += temp_hist
    some_results = pool.starmap(worker_do_more_ops, zip(range(number_of_sections),
                                                        itertools.repeat(hist_full)))

