Write python with joblib in parallel in the list - python

I use joblib to run work in parallel, and I want to write the results into a list in parallel.
To avoid problems, I create a list ldata = [] beforehand so that it can be accessed easily.
During parallelization the data are present in the list, but they are gone once the lists are put together.
How can data be saved in parallel?
from joblib import Parallel, delayed
import multiprocessing
data = []
def worker(i):
    ldata = []
    ... # create list ldata
    data[i].append(ldata)
for i in range(0, 1000):
    data.append([])
num_cores = multiprocessing.cpu_count()
Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
resultlist = []
for i in range(0, 1000):
    resultlist.extend(data[i])

You should look at Parallel as a parallel map operation that does not allow for side effects. The execution model of Parallel is that, by default, it starts new workers as copies of the master process, serialises the input data, sends it over to the workers, has them iterate over it, and then collects the return values. Any change a worker performs on data stays in its own memory space and is thus invisible to the master process. You have two options here:
First, you can have your workers return ldata instead of updating data[i]. In that case, data will have to be assigned the result returned by Parallel(...)(...):
def worker(i):
    ...
    return ldata
data = Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
The second option is to force shared-memory semantics, which uses threads instead of processes. When workers execute in threads, their memory space is that of the master process, which is where data lives originally. To enforce these semantics, add the require='sharedmem' keyword argument in the call to Parallel:
Parallel(n_jobs=num_cores, require='sharedmem')(delayed(worker)(i) for i in range(0, 1000))
The different modes and semantics are explained in the joblib documentation here.
Keep in mind that your worker() function is written in pure Python and is therefore interpreted. This means that worker threads can't run fully concurrently even if there is just one thread per CPU due to the dreaded Global Interpreter Lock (GIL). This is also explained in the documentation. Therefore, you'd better stick with the first solution in general, despite the marshalling and interprocess communication overheads.
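The first option can also be sketched with only the standard library, which keeps the illustration free of the joblib dependency. Here concurrent.futures.ProcessPoolExecutor.map stands in for Parallel(...), and the body of worker is a hypothetical placeholder for whatever builds ldata; the pattern of returning the list and flattening the collected results is the same.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import chain


def worker(i):
    # Hypothetical stand-in for the real computation: build and return a
    # list instead of appending to a global.
    ldata = [i, i * i]
    return ldata


def flatten(list_of_lists):
    # Concatenate the per-task lists, preserving task order.
    return list(chain.from_iterable(list_of_lists))


def main():
    with ProcessPoolExecutor() as executor:
        # map() yields results in the order of the inputs.
        data = list(executor.map(worker, range(10)))
    resultlist = flatten(data)
    print(resultlist[:6])


if __name__ == "__main__":
    main()
```

Because every worker returns its own list, nothing depends on memory being shared between processes, and the order of resultlist follows the order of the inputs.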

Related

Exchanging objects between processes while running Python

I would like to create a process in Python that runs constantly in parallel while the main execution of my code is running. It should provide a way around Python's sequential execution, which prevents me from doing asynchronous execution.
So I would like a function RunningFunc to run while my main code is doing some other operation.
I tried to use the threading module. However, the computation does not run in parallel, and RunningFunc is a highly intensive computation that heavily slows down my main code.
I also tried the multiprocessing module, and I guess the answer is a multiprocessing.Manager(): do the computation in a first process while accessing, via shared memory, the data computed over time. But I didn't figure out a way to do that.
For example, RunningFunc increments the Compteur variable.
def RunningFunc(x):
    boolean = True
    Compteur = 0
    while boolean:
        Compteur +=1
While some computations are running in my main code, I sometimes read (not necessarily on each while other_bool iteration) the Compteur variable of RunningFunc.
other_bool = True
Value = 0
while other_bool:
    ## MAKING SOME COMPUTATION
    Value = Compteur # Call the variable compteur that is constantly running
    ## MAKING SOME COMPUTATION
There are many ways to do processing in child processes. Which is best depends on questions such as the size of the data to be shared versus the time spent in the calculation. Following is an example much like your simple increment of a variable, but fleshed out to a slightly larger list of integers to highlight some of the issues you'll bump into.
A multiprocessing.Manager is a convenient way to share data among processes, but it's not particularly fast because it needs to synchronize data among its processes. If the data you want to share is fairly modest and doesn't change that often, it's a good choice. But I will just focus on shared memory here.
Most Python objects cannot be created in shared memory. Things like the object header, reference count or the memory heap are not shareable. Some objects, notably numpy arrays, can be shared, but that is a different answer.
What you can do is serialize and write/read to shared memory. This could be done with any serialization mechanism, but converting to fundamental types via struct is a good way to do it.
That means that you have to write your code to save its data periodically. You also need to worry about synchronization if you are saving anything bigger than a single CPU-level word to memory. The parent could read while the child is writing, giving you inconsistent data.
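The serialization step can be tried on its own, without any processes. The sketch below uses a bytearray as a stand-in for the shared memory buffer (an assumption made so that it runs in a single process) and the same "3Q" format string as the full example.

```python
import struct

# Three unsigned 64-bit integers, packed back to back.
data_format = struct.Struct("3Q")

# A bytearray stands in for the shared memory buffer here.
buf = bytearray(data_format.size)

values = [1, 2, 3]
data_format.pack_into(buf, 0, *values)      # writer side
restored = data_format.unpack_from(buf, 0)  # reader side, returns a tuple

print(data_format.size, restored)
```

The packed record spans several machine words, which is why a reader that runs in the middle of pack_into could see a mix of old and new values unless a lock is held.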
The following example shows one way to handle shared memory:
import multiprocessing as mp
import multiprocessing.shared_memory
import time
import struct
data_format = struct.Struct("3Q") # will share 3 longlong ints
def main():
    # lock keeps shared memory readers from getting intermediate data
    shared_lock = mp.Lock()
    shared = mp.shared_memory.SharedMemory(create=True, size=8*3)
    buf = shared.buf
    try:
        print(shared)
        child = mp.Process(target=running_func, args=(shared.name, shared_lock))
        child.start()
        try:
            print("read for 20 seconds")
            for i in range(20):
                with shared_lock:
                    my_list = data_format.unpack_from(buf, 0)
                print(my_list)
                time.sleep(1)
        finally:
            child.terminate()
            child.join()
    finally:
        shared.close()
        shared.unlink()
def running_func(shared_memory_name, lock):
    shared = mp.shared_memory.SharedMemory(name=shared_memory_name)
    buf = shared.buf
    try:
        my_list = [1,2,3]
        while True:
            my_list = [val+1 for val in my_list]
            with lock:
                data_format.pack_into(buf, 0, *my_list)
    finally:
        shared.close()
if __name__ == "__main__":
    main()

How to implement multiprocessing on a specific function?

I am new to multiprocessing. I am trying to apply multiprocessing to a spelling function to make it run faster. I tried the code below but did not get the results back in their original order. Here, token is a huge list of tokenized sentences.
from spellchecker import SpellChecker
from wordsegment import load, segment
from timeit import default_timer as timer
from multiprocessing import Process, Pool, Queue, Manager
def text_similarity_spellings(self, token):
    """Uses spell checker to separate incorrect spellings and correct them"""
    spell = SpellChecker()
    unknown_words = [list(spell.unknown(word)) for word in token]
    known_words = [list(spell.known(word)) for word in token]
    load()
    segmented = [[segment(word) for word in sub] for sub in unknown_words]
    flat_list = list(self.unpacker(segmented))
    new_list = [[known_words[x], flat_list[x]] for x in range(len(known_words))]
    new_list = list(self.unpacker(new_list))
    newlist = [sorted(set(mylist), key=lambda x: mylist.index(x)) for mylist in new_list]
    return newlist
def run_all(self):
    tread_vta = Manager().list()
    processes = []
    arg_split = np.array_split(np.array(token),10)
    arg_tr_cl = []
    finds = []
    trdclean1 = []
    for count, k in enumerate(arg_split):
        arg_tr_cl.append((k, [], tread_vta, token[t]))
    for j in range(len(arg_tr_cl)):
        p = Process(target= self.text_similarity_spellings, args=arg_tr_cl[j])
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
Can anyone suggest a better way to apply multiprocessing to a specific function and get the results in the correct order?
First, there is a certain amount of overhead in creating processes and then again more overhead in passing arguments from the main process to a subprocess, which "lives" in another address space, and getting return values back (by the way, you have made no provisions for actually getting return values back from worker function text_similarity_spellings). So for you to profit from using multiprocessing, the gains from performing your tasks (invocations of your worker function) in parallel must be enough to offset the additional aforementioned costs. All of this is just a way of saying that your worker function has to be sufficiently CPU-intensive to justify multiprocessing.
Second, given the cost of creating processes, you do not want to be creating more processes than you can possibly use. If you have N tasks to complete (the length of arg_tr_cl) and M CPU processors to run them on and your worker function is pure CPU (no I/O involved), then you can never gain anything by trying to run these tasks using more than M processes. If, however, they do combine some I/O, then perhaps using more processes could be profitable. If there is a lot of I/O involved and only some CPU-intensive processing involved, then using a combination of multithreading and multiprocessing might be the way to go. Finally, if the worker function is mostly I/O, then multithreading is what you want.
There is a solution to using X processes (based on whatever value of X you have settled on) to complete N tasks and to be able to get return values back from your worker function, namely using a process pool of size X.
MULTITHREADING = False
n_tasks = len(arg_tr_cl)
if MULTITHREADING:
    from multiprocessing.dummy import Pool
    # To use multithreading instead (we can use a much larger pool size):
    pool_size = min(n_tasks, 100) # 100 is fairly arbitrary
else:
    from multiprocessing import Pool, cpu_count
    # No point in creating pool size larger than the number of tasks we have
    # Otherwise, assuming we are mostly CPU-intensive, just create pool size
    # equal to the number of cpu cores that we have:
    n_processors = cpu_count()
    pool_size = min(n_tasks, n_processors)
pool = Pool(pool_size)
return_values = pool.map(self.text_similarity_spellings, arg_tr_cl)
# You can now iterate return_values to get the return values:
for return_value in return_values:
    ...
# or create a list, for example: return_values = list(return_values)
But it may be that the SpellChecker is doing lots of I/O if each invocation has to read in an external dictionary. If that is the case, is it not possible that your best performance is to initialize the SpellChecker once and then just loop checking each word and forget completely about multiprocessing (or multithreading)?
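As a rough, self-contained sketch of the pool approach: Pool.map returns its results in the same order as its inputs, and the initializer argument of Pool lets each worker process build an expensive object once instead of once per task. The names init_worker, correct_chunk and split_into_chunks are hypothetical, and a small dictionary stands in for the SpellChecker so that the example needs no third-party packages.

```python
from multiprocessing import Pool, cpu_count

_corrections = None  # per-process state, filled in by init_worker


def init_worker():
    # Runs once in each worker process; a stand-in for loading an
    # expensive resource such as a spelling dictionary.
    global _corrections
    _corrections = {"teh": "the", "wrod": "word"}


def correct_chunk(chunk):
    # Hypothetical worker: fix every token of one chunk of sentences.
    return [[_corrections.get(tok, tok) for tok in sentence] for sentence in chunk]


def split_into_chunks(items, n_chunks):
    # Split items into at most n_chunks contiguous, order-preserving slices.
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def main():
    sentences = [["teh", "cat"], ["a", "wrod"], ["ok"], ["teh", "end"]]
    chunks = split_into_chunks(sentences, 2)
    pool_size = min(len(chunks), cpu_count())
    with Pool(pool_size, initializer=init_worker) as pool:
        # map() blocks and returns one result per chunk, in input order.
        results = pool.map(correct_chunk, chunks)
    corrected = [sentence for chunk in results for sentence in chunk]
    print(corrected)


if __name__ == "__main__":
    main()
```

Whether this beats a plain loop still depends on how much CPU work each chunk carries compared with the cost of pickling the chunks and results.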

Creating a Queue delay in a Python pool without blocking

I have a large program (specifically, a function) that I'm attempting to parallelize using a JoinableQueue and the multiprocessing map_async method. The function that I'm working with does several operations on multidimensional arrays, so I break up each array into sections, and each section evaluates independently; however I need to stitch together one of the arrays early on, but the "stitch" happens before the "evaluate" and I need to introduce some kind of delay in the JoinableQueue. I've searched all over for a workable solution but I'm very new to multiprocessing and most of it goes over my head.
This phrasing may be confusing- apologies. Here's an outline of my code (I can't put all of it because it's very long, but I can provide additional detail if needed)
import numpy as np
import multiprocessing as mp
from multiprocessing import Pool, Pipe, JoinableQueue
def main_function(section_number):
    #define section sizes
    array_this_section = array[:,start:end+1,:]
    histogram_this_section = np.zeros((3, dataset_size, dataset_size))
    #start and end are defined according to the size of the array
    #dataset_size is to show that the histogram is a different size than the array
    for m in range(1,num_iterations+1):
        #do several operations- each section of the array
        #corresponds to a section on the histogram
        hist_queue.put(histogram_this_section)
        #each process sends their own part of the histogram outside of the pool
        #to be combined with every other part- later operations
        #in this function must use the full histogram
        hist_queue.join()
        full_histogram = full_hist_queue.get()
        full_hist_queue.task_done()
        #do many more operations
hist_queue = JoinableQueue()
full_hist_queue = JoinableQueue()
if __name__ == '__main__':
    pool = mp.Pool(num_sections)
    args = np.arange(num_sections)
    pool.map_async(main_function, args, chunksize=1)
    #I need the map_async because the program is designed to display an output at the
    #end of each iteration, and each output must be a compilation of all processes
    #a few variable definitions go here
    for m in range(1,num_iterations+1):
        for i in range(num_sections):
            temp_hist = hist_queue.get() #the code hangs here because the queue
                                         #is attempting to get before anything
                                         #has been put
            hist_full += temp_hist
        for i in range(num_sections):
            hist_queue.task_done()
        for i in range(num_sections):
            full_hist_queue.put(hist_full) #the full histogram is sent back into
                                           #the pool
        full_hist_queue.join()
        #etc etc
    pool.close()
    pool.join()
I'm pretty sure that your issue is how you're creating the Queues and trying to share them with the child processes. If you just have them as global variables, they may be recreated in the child processes instead of inherited (the exact details depend on what OS and/or context you're using for multiprocessing).
A better way to go about solving this issue is to avoid using multiprocessing.Pool to spawn your processes and instead explicitly create Process instances for your workers yourself. This way you can pass the Queue instances to the processes that need them without any difficulty (it's technically possible to pass the queues to the Pool workers, but it's awkward).
I'd try something like this:
from multiprocessing import Process, JoinableQueue
def worker_function(section_number, hist_queue, full_hist_queue): # take queues as arguments
    # ... the rest of the function can work as before
    # note, I renamed this from "main_function" since it's not running in the main process
    ...
if __name__ == '__main__':
    hist_queue = JoinableQueue() # create the queues only in the main process
    full_hist_queue = JoinableQueue() # the workers don't need to access them as globals
    processes = [Process(target=worker_function, args=(i, hist_queue, full_hist_queue))
                 for i in range(num_sections)]
    for p in processes:
        p.start()
    # ...
If the different stages of your worker function are more or less independent of one another (that is, the "do many more operations" step doesn't depend directly on the "do several operations" step above it, just on full_histogram), you might be able to keep the Pool and instead split up the different steps into separate functions, which the main process could call via several calls to map on the pool. You don't need to use your own Queues in this approach, just the ones built in to the Pool. This might be best especially if the number of "sections" you're splitting the work up into doesn't correspond closely with the number of processor cores on your computer. You can let the Pool match the number of cores, and have each one work on several sections of the data in turn.
A rough sketch of this would be something like:
import itertools
import multiprocessing
def worker_make_hist(section_number):
    # do several operations, get a partial histogram
    return histogram_this_section
def worker_do_more_ops(section_number, full_histogram):
    # whatever...
    return some_result
if __name__ == "__main__":
    pool = multiprocessing.Pool() # by default the size will be equal to the number of cores
    hist_full = 0 # running sum of the partial histograms
    for temp_hist in pool.imap_unordered(worker_make_hist, range(number_of_sections)):
        hist_full += temp_hist
    some_results = pool.starmap(worker_do_more_ops, zip(range(number_of_sections),
                                                        itertools.repeat(hist_full)))
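For a version of the two-stage idea that runs as written, here is a minimal sketch that uses plain Python lists instead of numpy arrays. The binning rule, the helper add_hists and the second-stage computation are all assumptions made for illustration; only the structure (sum the partial histograms in the main process, then hand the full histogram to a second map) follows the outline above.

```python
import multiprocessing
from functools import partial

N_BINS = 4


def worker_make_hist(section):
    # Stage 1: histogram one section (here: count values modulo N_BINS).
    hist = [0] * N_BINS
    for value in section:
        hist[value % N_BINS] += 1
    return hist


def add_hists(total, part):
    # Element-wise sum of two histograms of equal length.
    return [a + b for a, b in zip(total, part)]


def worker_do_more_ops(full_histogram, section):
    # Stage 2: needs the complete histogram; here it returns, for each
    # value, the share of all samples that fell into the same bin.
    n_samples = sum(full_histogram)
    return [full_histogram[v % N_BINS] / n_samples for v in section]


def main():
    sections = [list(range(0, 5)), list(range(5, 10)), list(range(10, 16))]
    with multiprocessing.Pool() as pool:
        hist_full = [0] * N_BINS
        # Partial histograms can be summed in any order.
        for part in pool.imap_unordered(worker_make_hist, sections):
            hist_full = add_hists(hist_full, part)
        # partial() binds the shared argument; map() keeps section order.
        results = pool.map(partial(worker_do_more_ops, hist_full), sections)
    print(hist_full, results[0])


if __name__ == "__main__":
    main()
```

The main process is the only place where the partial results meet, so no user-managed queues or joins are needed.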

Enqueuing a tf.RandomShuffleQueue from multiple processes using multiprocessing

I would like to use multiple processes (not threads) to do some preprocessing and enqueue the results to a tf.RandomShuffleQueue which can be used by my main graph for training.
Is there a way to do that ?
My actual problem
I have converted my dataset into TFRecords split across 256 shards. I want to start 20 processes using multiprocessing and let each process handle a range of shards. Each process should read images, augment them and push them into a tf.RandomShuffleQueue, from which the input can be given to a graph for training.
Some people advised me to go through the inception example in tensorflow. However, it is a very different situation because there only reading of the data shards is done by multiple threads (not processes), while the preprocessing (e.g - augmentation) takes place in the main thread.
(This aims to solve your actual problem)
In another topic, someone told you that Python has the global interpreter lock (GIL) and therefore there would be no speed benefits from multi-core, unless you used multiple processes.
This was probably what prompted your desire to use multiprocessing.
However, with TF, Python is normally used only to construct the "graph". The actual execution happens in native code (or GPU), where GIL plays no role whatsoever.
In light of this, I recommend simply letting TF use multithreading. This can be controlled using the intra_op_parallelism_threads argument, such as:
with tf.Session(graph=graph,
                config=tf.ConfigProto(allow_soft_placement=True,
                                      intra_op_parallelism_threads=20)) as sess:
    # ...
(Side note: if you have, say, a 2-CPU, 32-core system, the best argument may very well be intra_op_parallelism_threads=16, depending on a lot of factors)
Comment: The pickling of TFRecords is not that important.
I can pass a list of lists containing names of ranges of sharded TFRecord files.
Therebe I have to restart Decision process!
Comment: I can pass it to a Pool.map() as an argument.
Verify whether a multiprocessing.Queue() can handle this.
Results of Tensor functions are a Tensor object.
Try the following:
tensor_object = func(TFRecord)
q = multiprocessing.Manager().Queue()
q.put(tensor_object)
data = q.get()
print(data)
Comment: how do I make sure that all the processes enqueue to the same queue ?
This is simply done by enqueuing the results from Pool.map(...
after all processes have finished.
Alternatively, we can enqueue in parallel, queueing data from all processes.
But doing so depends on picklable data, as described above.
For instance:
import multiprocessing as mp
def func(filename):
    TFRecord = read(filename)
    tensor_obj = tf.func(TFRecord)
    return tensor_obj
def main_Tensor(tensor_objs):
    tf = ... # instantiate the Tensor session here
    rsq = tf.RandomShuffleQueue(...)
    for t in tensor_objs:
        rsq.enqueue(t)
if __name__ == '__main__':
    sharded_TFRecords = ['file1', 'file2']
    with mp.Pool(20) as pool:
        # map() blocks until every result is back, so no pool.join() is needed
        tensor_objs = pool.map(func, sharded_TFRecords)
    main_Tensor(tensor_objs)
It seems the recommended way to run TF with multiprocessing is via creating a separate tf.Session for each child as sharing it across processes is unfeasible.
You can take a look at this example, I hope it helps.
[EDIT: Old answer]
You can use a multiprocessing.Pool and rely on its callback mechanism to put results in the tf.RandomShuffleQueue as soon as they are ready.
Here's a very simple example on how to do it.
from multiprocessing import Pool
class Processor(object):
    def __init__(self, random_shuffle_queue):
        self.queue = random_shuffle_queue
        self.pool = Pool()
    def schedule_task(self, task):
        self.pool.apply_async(processing_function, args=[task], callback=self.task_done)
    def task_done(self, results):
        self.queue.enqueue(results)
This assumes Python 2. For Python 3, I'd recommend using a concurrent.futures.ProcessPoolExecutor.
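A minimal Python 3 sketch of the same callback idea with ProcessPoolExecutor could look like the following. A plain queue.Queue stands in for the tf.RandomShuffleQueue (an assumption that keeps the example free of TensorFlow), and processing_function is a hypothetical placeholder for the real preprocessing.

```python
import queue
from concurrent.futures import ProcessPoolExecutor

# Stand-in for the TensorFlow queue; a plain thread-safe queue is an
# assumption made to keep the sketch free of external dependencies.
results_queue = queue.Queue()


def processing_function(task):
    # Hypothetical preprocessing step executed in a worker process.
    return task * 2


def task_done(future):
    # Runs in the parent process once the worker's result is available.
    results_queue.put(future.result())


def main():
    with ProcessPoolExecutor() as executor:
        for task in range(5):
            executor.submit(processing_function, task).add_done_callback(task_done)
    # Leaving the with-block waits for all submitted tasks to finish.
    while not results_queue.empty():
        print(results_queue.get())


if __name__ == "__main__":
    main()
```

The callbacks run in the parent process, so only the return values of processing_function have to be picklable, not the queue itself. Results arrive in completion order, which suits a shuffle queue.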

Assembling Numpy Array in Parallel

I am attempting to parallelize an algorithm that I have been working on using the Multiprocessing and Pool.map() commands. I ran into a problem and was hoping someone could point me in the right direction.
Let x denote an array of N rows and 1 column, which is initialized to be a vector of zeros. Let C denote an array of length N by 2. The vector x is constructed iteratively by using information from some subsets of C (doing some math operations). The code (not parallelized) as a large for loop looks roughly as follows:
for j in range(0,N):
    #indx_j will have n_j <<N entries
    indx_j = build_indices(C,j)
    #x_j will be entries to be added to vector x at indices indx_j
    #This part is time consuming
    x_j = build_x_j(indx_j,C)
    #Add x_j into entries of x
    x[indx_j] = x[indx_j] + x_j
I was able to parallelize this using the multiprocessing module and using the pool.map to eliminate the large for loop. I wrote a function that did the above computations, except the step of adding x_j to x[indx_j]. The parallelized function instead returns two data sets back: x_j and indx_j. After those are computed, I run a for loop (not parallel) to build up x by doing the x[indx_j] = x[indx_j] +x_j computation for j=0,N.
The downside to my method is that the pool.map operation returns a gigantic list of N pairs of arrays x_j and indx_j, where both x_j and indx_j are n_j by 1 vectors (n_j << N). For large N (N > 20,000) this was taking up way too much memory. Here is my question: can I somehow do the construction operation x[indx_j] = x[indx_j] + x_j in parallel? It seems to me each process in pool.map() would have to be able to interact with the vector x. Do I place x in some sort of shared memory? How would I do such a thing? I suspect that this has to be possible somehow, as I assume people assemble matrices in parallel for finite element methods all the time. How can I have multiple processes interact with a vector without having some sort of problem? I'm worried that perhaps for j = 20 and j = 23, if they happen simultaneously, they might try to do x[indx_20] = x[indx_20] + x_20 and simultaneously x[indx_23] = x[indx_23] + x_23, and maybe some error will happen. I also don't know how to even have this computation done via pool.map() (I don't think I can feed x in as an input, as it would be changing after each process).
I'm not sure if it matters or not, but the sets indx_j will have non-trivial intersection (e.g., indx_1 and indx_2 may have indices [1,2,3] and [3,4,5] for example).
If this is unclear, please let me know and I will attempt to clarify. This is my first time trying to work in parallel, so I am very unsure of how to proceed. Any information would be greatly appreciated. Thanks!
I don't know if I am qualified to give proper advice on the topic of shared memory arrays, but I recently had a similar need to share arrays across processes in Python and came across a small custom numpy.ndarray implementation for a shared memory array in numpy, using the shared ctypes within multiprocessing. Here is a link to the code: shmarray.py. It acts just like a normal array, except the underlying data is stored in shared memory, meaning separate processes can both read and write to the same array.
Using Shared Memory Array
In threading, all information available to the thread (global and local namespace) can be handled as shared between all other threads that have access to it, but in multiprocessing that data is not so easily accessible. On Linux, the parent's data is available to a child for reading, but writes are not shared: when a write is done, the data is copied first and then written to, meaning no other process can see the change. However, if the memory being written to is shared memory, it is not copied. This means that with shmarray we can do things similar to the way we would with threading, but with the true parallelism of multiprocessing. One way to get access to the shared memory array is with a subclass. I know you are currently using Pool.map(), but I felt limited by the way map works, especially when dealing with n-dimensional arrays. Pool.map() is not really designed to work with numpy-styled interfaces, or at least I don't think it can do so easily. Here is a simple idea where you would spawn a process for each j in N:
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
    def __init__(self, j, C, x):
        super().__init__() # Process.__init__ must be called on this instance
        self.shared_x = x
        self.C = C
        self.j = j
    def run(self):
        #Your Stuff
        #indx_j will have n_j <<N entries
        indx_j = build_indices(self.C,self.j)
        #x_j will be entries to be added to vector x at indices indx_j
        x_j = build_x_j(indx_j,self.C)
        #Add x_j into entries of x
        self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
#And then actually do the work
N = ... #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = ... #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
for j in range(N):
    proc = Worker(j, C, x)
    procs.append(proc)
    proc.start()
#And then join() the processes with the main process
for proc in procs:
    proc.join()
Custom Process Pool and Queues
So this might work, but spawning several thousand processes is not really going to be of any use if you only have a few cores. The way I handled this was to implement a Queue system between my processes. That is to say, we have a Queue that the main process fills with j's, and then a couple of worker processes get numbers from the Queue and do work with them. Note that by implementing this, you are essentially doing exactly what Pool does. Also note that we are actually going to use multiprocessing.JoinableQueue for this, since it lets us use join() to wait until a queue is emptied.
It's not hard to implement this at all, really; we simply have to modify our subclass a bit, along with how we use it.
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
    def __init__(self, C, x, job_queue):
        super().__init__() # Process.__init__ must be called on this instance
        self.shared_x = x
        self.C = C
        self.job_queue = job_queue
    def run(self):
        #New Queue Stuff
        j = None
        while j!='kill': #this is how I kill processes with queues, there might be a cleaner way.
            j = self.job_queue.get() #gets a job from the queue if there is one, otherwise blocks.
            if j!='kill':
                #Your Stuff
                indx_j = build_indices(self.C,j)
                x_j = build_x_j(indx_j,self.C)
                self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
            #This tells the queue that the job that was pulled from it
            #Has been completed (we need this for queue.join())
            self.job_queue.task_done()
#The way we interact has changed, now we need to define a job queue
job_queue = multiprocessing.JoinableQueue()
N = ... #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = ... #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
proc_count = multiprocessing.cpu_count() # create as many procs as cores
for _ in range(proc_count):
    proc = Worker(C, x, job_queue) #now we pass the job queue instead
    procs.append(proc)
    proc.start()
#all the workers are just waiting for jobs now.
for j in range(N):
    job_queue.put(j)
job_queue.join() #this blocks the main process until the queue has been emptied
#Now if you want to kill all the processes, just send a 'kill'
#job for each process.
for proc in procs:
    job_queue.put('kill')
job_queue.join()
Finally, I really cannot say anything about how this will handle writing to overlapping indices at the same time. Worst case is that you could have a serious problem if two things attempt to write at the same time and things get corrupted or crash (I am no expert here, so I really have no idea if that would happen). Best case, since you are just doing addition and the order of operations doesn't matter, everything runs smoothly. If it doesn't run smoothly, my suggestion is to create a second custom Process subclass that specifically does the array assignment. To implement this you would need to pass both a job queue and an 'output' queue to the Worker subclass. Within the while loop, you should have an output_queue.put((indx_j, x_j)). NOTE: If you are putting these into a Queue they are being pickled, which can be slow. I recommend making them shared memory arrays, if they can be, before using put. It may be faster to just pickle them in some cases, but I have not tested this. To assign these as they are generated, you then need to have your Assigner process read these values from a queue as jobs and apply them, such that the work loop would essentially be:
def run(self):
    job = None
    while job!='kill':
        job = self.job_queue.get()
        if job!='kill':
            indx_j, x_j = job
            #Note this is the process which really needs access to the X array.
            self.x[indx_j] += x_j
        self.job_queue.task_done()
This last solution will likely be slower than doing the assignment within the worker processes, but if you are doing it this way, you have no worries about race conditions, and memory use is still lighter since you can use up the indx_j and x_j values as you generate them, instead of waiting until all of them are done.
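On the overlapping-index concern: an update of the form x[i] = x[i] + v is a read followed by a write, so two processes updating the same entry at the same time can lose one of the additions. Another way to rule that out, sketched below under stated assumptions, is to guard the update with a lock. The sketch uses the standard library's multiprocessing.Array rather than shmarray, and the index and value lists in worker are hypothetical stand-ins for build_indices and build_x_j.

```python
import multiprocessing as mp


def add_into(shared_x, indices, values):
    # The read-modify-write below is not atomic, so it is guarded by the
    # lock that multiprocessing.Array creates alongside the buffer.
    with shared_x.get_lock():
        for idx, val in zip(indices, values):
            shared_x[idx] += val


def worker(shared_x, j):
    # Hypothetical stand-ins for build_indices and build_x_j: every job
    # touches three neighbouring entries, so consecutive jobs overlap.
    indices = [j, j + 1, j + 2]
    values = [1.0, 1.0, 1.0]
    add_into(shared_x, indices, values)


def main():
    n = 8
    shared_x = mp.Array("d", n + 2)  # zero-initialised doubles
    procs = [mp.Process(target=worker, args=(shared_x, j)) for j in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared_x))


if __name__ == "__main__":
    main()
```

Holding the lock only around the final accumulation keeps the expensive part of each job parallel; whether the lock becomes a bottleneck depends on how long the accumulation takes relative to building x_j.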
Note for Windows
I didn't do any of this work on Windows, so I am not 100% certain, but I believe the code above will be very memory intensive there, since Windows does not implement a copy-on-write system for spawning independent processes. Essentially, Windows will copy ALL information that a process needs when spawning a new one from the main process. To fix this, I think replacing all your x_j and C with shared memory arrays (anything you will be handing around to other processes) instead of normal arrays should cause Windows to not copy the data, but I am not certain. You did not specify which platform you were on, so I figured better safe than sorry, since multiprocessing is a different beast on Windows than on Linux.
