I am trying to figure out how to perfom a multiprocessing task with an unusual formulation.
Basically, given two lists containing 10 matrices for each list, I have to check if applying an operation (that I'll call fn) gives the same results if the input is (A, B) or vice versa (B, A).
With a sequential approach, the solution is streightforward:
#Given
A = [matrix_a1, ... , matrix_a10]
B = [matrix_b1, ... , matrix_b10]
AB_BA= [fn(A[i], B[i])==fn(B[i], A[i]) for i in range(0, len(A)) ]
The next task is a bit strange because it requires setting strictly more than ten threads and applying multiprocessing. The restriction is that you can not assign all the single comparisons to ten different processes because the remaining processes will be unused. I do not know why the request seems to be using "process" and "thread" interchangeably.
This task seems a bit confusing because in multiprocessing, generally, you set the maximum number of workers, not the minimum.
I tried to use a solution that uses a ProcessPoolExecutor, as follows:
def equality(A, B,i):
res= fn(A[i], B[i]) == fn(B[i],A[i] )
return(res)
with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
idx=range(0, len(A))
results= executor.map(equality, A, B, idx)
for result in results:
print(result)
My problem is that I am not sure how to check resource usage. I have naively tried to monitor the CPU usage using the ubuntu system monitor as well as "top" from the command line.
In addition, this solution is the most efficient among those I tried, but there is not a direct specification to use at least 11 workers, so this solution seems not to stick with what was requested.
I also tried other solutions, such as using pool directly. This causes to evoke 10 python instances using top, but again, not more than 10. Here's what I tried:
def equality(A, B):
res=fn(A, B) == fn(B,A )
return(res)
with mp.Pool(20) as p:
print(p.starmap(equality, ((A[i], B[i]) for i in range(0, len(A)))))
Do you have any suggestions to address this request as well as monitor the resource usage to be sure it is working as expected?
Thank you very much for your help in advance.
I wish you had published the actual problem word for word, since your description is a bit unclear. But this is what I know (or think I know):
Unless the amount of CPU processing done by your worker function equality is great enough so that what is gained by running the function in parallel more than offsets the additional multiprocessing overhead you would not otherwise have if not using multiprocessing (i.e. starting processes, moving data from one address space to another, etc.), your multiprocessing code will run more slowly. Therefore, you should design your worker function to do the most work possible and to pass as little data as possible.
When you specify ...
results = executor.map(equality, A, B, idx)
... your equality function will be invoked once for each element of A, B and idx. So what is being passed is not the entire lists A and B but rather individual elements (e.g. matrix_a1 and matrix_b1). Therefore, there is no point in even passing an idx argument:
def equality(matrix_a, matrix_b):
"""
matrix_a and matrix_a are each single elements of
lists A and B respecticely.
"""
return fn(matrix_a) == fn(matrix_b)
def main():
from os import cpu_count
from concurrent.futures import ProcessPoolExecutor
A = [matrix_a1, ... , matrix_a10]
B = [matrix_b1, ... , matrix_b10]
# Do not create more processes then we have either
# CPU cores or the number of tasks that need to submit:
pool_size = min(cpu_count(), len(A))
with ProcessPoolExecutor(max_workers=pool_size) as executor:
AB_BA = list(executor.map(equality, A, B))
# This will be a list of 10 elements, each either `True` or `False`:
print(AB_BA)
# Required for Windows:
if __name__ == '__main__':
main()
So we will be submitting 10 tasks to a pool size of 10. Internally there is a "task queue" on which all the arguments being passed to equality exist:
matrix_a1, matrix_b1 # task 1
matrix_a2, matrix_b2 # task 2
...
matrix_a10, matrix_b10 # task 10
Any process in the pool that is idle will grap the next task in the queue to work on and the results will be returned in task submission order. But since equality is such a short-running function unless function fn is sufficiently complicated, there is the possibility that the pool process that grabs the first task can complete it and then grab the second task before some other pool process is dispatched by the operating system and can grab it. So there is no guarantee that all 10 tasks will be worked on in parallel by 10 pool processes even if function fn is sufficiently CPU-intensive. If you were to insert a call to time.sleep(.1) at the beginning of equality, that would give the other pool processes a chance to "wake up" and grab its own task from the task queue. But that would slow your program down even more since sleeping for this purposes is totally non-productive. But the point I am trying to make is that you cannot ensure that all pool processes will always be active concurrently.
Related
I'm doing a small task in Python:
1- write a simple add function that will add two given numbers and print their sum.
2- Declare an array of 10 numbers.
3- call sum function in parallel threads and pass the numbers form above array to function as per following rules:
--> sum(array[thread_id], array[9-thread_id])
I have done step 1 and step 2 but I don't know what to do in step 3. Here's the code below:
import array
import multiprocessing
import threading
from multiprocessing import Process
# 1- write a simple add function that will add two given numbers and print their sum
x = 5
y = 10
def add(a,b):
sum = int(a) + int(b)
print ("The sum is: ", sum)
add(x,y)
# 2 Declare an array of 10 numbers
a_list = list(range(1, 11))
print ('an array of 10 numbers: ' + str(a_list))
# 3 call sum function in parallel threads and pass the numbers form above array to function as per following rules
# --> sum(array[thread_id], array[9-thread_id])
Can anyone please help me with what to do in step 3?
I have little knowledge of parallel threads (or multiprocessing)
In Python, you can start a new thread like so
import threading
def function_to_run_in_thread(arg1, arg2): ...
threading.Thread(target=function_to_run_in_thread, args=("some", "arguments")).start()
and you can start a new process similarly like so
import multiprocessing
def function_to_run_in_subprocess(arg1, arg2): ...
multiprocessing.Process(target=function_to_run_in_subprocess, args=("some", "arguments")).start()
however, it is not particularly easy to get at the function return value by threading/multiprocessing in this way. It would be better to use a ThreadPoolExecutor or ProcessPoolExecutor from the concurrent.futures module.
so in your case we can do
# Both the ThreadPoolExecutor and ProcessPoolExecutor are basically interchangeable from an API perspective
from concurrent.futures import ThreadPoolExecutor as Pool
# from concurrent.futures import ProcessPoolExecutor as Pool
pool = Pool(10) # the argument is the number of threads/processes available
sum_futures = [pool.submit(add, *pair) for pair in zip(a_list, reversed(a_list))]
# We must explicitly wait for the calculation to finish for each add call
sum_results = [future.result() for future in sum_futures]
Or, we can also do
sum_results = list(pool.map(lambda args: add(*args), zip(a_list, reversed(a_list))))
to be more concise
In CPython (the reference implementation of Python, and the one that you're probably using), only one thread can be holding the Global Interpreter Lock at a given time. This means that multithreading in code running in the CPython interpreter only gives you a benefit for doing operations that do not require holding the GIL, such as
blocking network/IO operations
running certain code in a C extension that explicitly releases the GIL
Basically what this means is that the operation of adding two numbers together in python cannot truly execute in parallel when using multithreading.
However, mulitprocessing comes with its own drawbacks, one of which being that serialization (using pickle) is required to pass values between processes. Some values cannot be pickled (e.g lambda expressions), which can be a potential roadblock. Threads are not subject to this restriction because they can communicate over using shared regions of memory, which is also a lot faster.
I have a function which I will run using multi-processing. However the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.
cores = []
for i in range(os.cpu_count()):
cores.append(Process(target=processImages, args=(dataSets[i],)))
for core in cores:
core.start()
for core in cores:
core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
from multiprocessing import Pool
# Renaming this variable just for clarity of the example here
work_queue = datasets
# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()
# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)
# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)
# Do something with the result
print(processed_work)
You cannot return the variable from another process. The recommended way would be to create a Queue (multiprocessing.Queue), then have your subprocess put the results to that queue, and once it's done, you may read them back -- this works if you have a lot of results.
If you just need a single number -- using Value or Array could be easier.
Just remember, you cannot use a simple variable for that, it has to be wrapped with above mentioned classes from multiprocessing lib.
If you want to use the result object returned by a multiprocessing, try this
from multiprocessing.pool import ThreadPool
def fun(fun_argument1, ... , fun_argumentn):
<blabla>
return object_1, object_2
pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.
I have to do my study in a parallel way to run it much faster. I am new to multiprocessing library in python, and could not yet make it run successfully.
Here, I am investigating if each pair of (origin, target) remains at certain locations between various frames of my study. Several points:
It is one function, which I want to run faster (It is not several processes).
The process is performed subsequently; it means that each frame is compared with the previous one.
This code is a very simpler form of the original code. The code outputs a residece_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work. Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support
def Main_Residence(total_frames, origin_list, target_list):
Previous_List = {}
residence_list = []
for frame in range(total_frames): #Each frame
Current_List = {} #Dict of pair and their residence for frames
for origin in range(origin_list):
for target in range(target_list):
Pair = (origin, target) #Eahc pair
if Pair in Current_List.keys(): #If already considered, continue
continue
else:
if origin == target:
if (Pair in Previous_List.keys()): #If remained from the previous frame, add residence
print "Origin_Target remained: ", Pair
Current_List[Pair] = (Previous_List[Pair] + 1)
else: #If new, add it to the current
Current_List[Pair] = 1
for pair in Previous_List.keys(): #Add those that exited from residence to the list
if pair not in Current_List.keys():
residence_list.append(Previous_List[pair])
Previous_List = Current_List
return residence_list
if __name__ == '__main__':
pool = Pool(processes=5)
Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
print Residence_List.get(timeout=1)
pool.close()
pool.join()
freeze_support()
Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down to a set of homogenous tasks that can be processed in parallel with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
def largest_prime_factor(n):
p = n
i = 2
while i * i <= n:
if n % i:
i += 1
else:
n //= i
return p, n
if __name__ == '__main__':
pool = Pool(processes=3)
start = datetime.now()
# this delegates half a million individual tasks to the pool, i.e.
# largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
pool.map(largest_prime_factor, range(500000))
pool.close()
pool.join()
print "pool elapsed", datetime.now() - start
start = datetime.now()
# same work just in the main process
[largest_prime_factor(i) for i in range(500000)]
print "single elapsed", datetime.now() - start
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from #Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single process execution of the same amount of work, all while running in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down to homogenous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time, rather than save it.
Edit:
Since you asked for suggestions on the code. I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated by the way! Most people don't take the time to strip down their code to its bare minimum). Requests for a code review are anyway better suited over at Codereview.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty. Execution time increased ninefold, compared to using map. But my example is using just a simple iterable, yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3.Anyway, the structure/nature of your task data basically determines the correct method to use.
I did some q&d testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statment (I/O is costly). I used, starmap, apply, starmap_async and apply_async and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding Style. Your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscore for your names (both, function and variable names), that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking
I have searched the site but I am not sure precisely what terms would yield relevant answers, my apologies if this question is redundant.
I need to process a very very large matrix (14,000,000 * 250,000) and would like to exploit Python's multiprocessing module to speed things up. For each pair of columns in the matrix I need to apply a function which will then store the results in a proprietary class.
I will be implementing a double four loop which provides the necessary combinations of columns.
I do not want to load up a pool with 250,000 tasks as I fear the memory usage will be significant.Ideally, I would like to have one column then be tasked out amongst the pool I.e
Process 1 takes Column A and Column B and a function F takes A,B and G and then stores the result in Class G[A,B]
Process 2 takes Column A and Column C and proceeds similarly
The processes will never access the same element of G.
So I would like to pause the for loop every N tasks. The set/get methods of G will be overriden to perform some back end tasks.
What I do not understand is whether or not pausing the loop is necessary? I.e is Python smart enough to only take what it can work on? Or will it be populating a massive amount of tasks?
Lastly, I am unclear of how the results work. I just want them to be set in G and not return anything. I do not want to have to worry about about .get() etc. but from my understanding the pool method returns a result object. Can I just ignore this?
Is there a better way? Am I completly lost?
First off - you will want to create a multiprocessing pool class. You setup how many workers you want and then use map to start up tasks. I am sure you already know but here is the python multiprocessing docs.
You say that you don't want to return data because you don't need to but how are you planning on viewing results? Will each task write the data to disk? To pass data between your processes you will want to use something like the multiprocessing queue.
Here is example code from the link on how to use process and queue:
from multiprocessing import Process, Queue
def f(q):
q.put([42, None, 'hello'])
if __name__ == '__main__':
q = Queue()
p = Process(target=f, args=(q,))
p.start()
print q.get() # prints "[42, None, 'hello']"
p.join()
And this is an example of using the Pool:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
pool = Pool(processes=4) # start 4 worker processes
result = pool.apply_async(f, [10]) # evaluate "f(10)" asynchronously
print result.get(timeout=1) # prints "100" unless your computer is *very* slow
print pool.map(f, range(10)) # prints "[0, 1, 4,..., 81]"
Edit: #goncalopp makes a very important point that you may not want to do heavy numerical calculations in python due to how slow it is. Numpy is a great package for doing number crunching.
If you are heavily IO bound due to writing to disk on each process you should consider running something like 4*num_processors so that you always have something to do. You also should make sure you have a very fast disk :)
I am attempting to parallelize an algorithm that I have been working on using the Multiprocessing and Pool.map() commands. I ran into a problem and was hoping someone could point me in the right direction.
Let x denote an array of N rows and 1 column, which is initialized to be a vector of zeros. Let C denote an array of length N by 2. The vector x is constructed iteratively by using information from some subsets of C (doing some math operations). The code (not parallelized) as a large for loop looks roughly as follows:
for j in range(0,N)
#indx_j will have n_j <<N entries
indx_j = build_indices(C,j)
#x_j will be entries to be added to vector x at indices indx_j
#This part is time consuming
x_j = build_x_j(indx_j,C)
#Add x_j into entries of x
x[indx_j] = x[indx_j] + x_j
I was able to parallelize this using the multiprocessing module and using the pool.map to eliminate the large for loop. I wrote a function that did the above computations, except the step of adding x_j to x[indx_j]. The parallelized function instead returns two data sets back: x_j and indx_j. After those are computed, I run a for loop (not parallel) to build up x by doing the x[indx_j] = x[indx_j] +x_j computation for j=0,N.
The downside to my method is that pool.map operation returns a gigantic list of N pairs of arrays x_j and indx_j. where both x_j and indx_j were n_j by 1 vectors (n_j << N). For large N (N >20,000) this was taking up way too much memory. Here is my question: Can I somehow, in parallel, do the construction operation x[indx_j] = x[indx_j] + x_j. It seems to me each process in pool.map() would have to be able to interact with the vector x. Do I place x in some sort of shared memory? How would I do such a thing? I suspect that this has to be possible somehow, as I assume people assemble matrices in parallel for finite element methods all the time. How can I have multiple processes interact with a vector without having some sort of problem? I'm worried that perhaps for j= 20 and j = 23, if they happen simultaneously, they might try to add to x[indx_20] = x[indx_20] + x_20 and simultaneously x[indx_30] = x[indx_30] + x_30 and maybe some error will happen. I also don't know how to even have this computation done via the pool.map() (I don't think I can feed x in as an input, as it would be changing after each process).
I'm not sure if it matters or not, but the sets indx_j will have non-trivial intersection (e.g., indx_1 and indx_2 may have indices [1,2,3] and [3,4,5] for example).
If this is unclear, please let me know and I will attempt to clarify. This is my first time trying to work in parallel, so I am very unsure of how to proceed. Any information would be greatly appreciated. Thanks!
I dont know If I am qualified to give proper advice on the topic of shared memory arrays, but I had a similar need to share arrays across processes in python recently and came across a small custom numpy.ndarray implementation for a shared memory array in numpy using the shared ctypes within multiprocessing. Here is a link to the code: shmarray.py. It acts just like a normal array,except the underlying data is stored in shared memory, meaning separate processes can both read and write to the same array.
Using Shared Memory Array
In threading, all information available to the thread (global and local namespace) can be handled as shared between all other threads that have access to it, but in multiprocessing that data is not so easily accessible. On linux data is available for reading, but cannot be written to. Instead when a write is done, the data is copied and then written to, meaning no other process can see those changes. However, if the memory being written to is shared memory, it is not copied. This means with shmarray we can do things similar to the way we would do threading, with the true parallelism of multiprocessing. One way to have access to the shared memory array is with a subclass. I know you are currently using Pool.map(), but I had felt limited by the way map worked, especially when dealing with n-dimensional arrays. Pool.map() is not really designed to work with numpy styled interfaces, at least I don't think it can easily. Here is a simple idea where you would spawn a process for each j in N:
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
def __init__(self, j, C, x):
multiprocessing.Process.__init__()
self.shared_x = x
self.C = C
self.j = j
def run(self):
#Your Stuff
#indx_j will have n_j <<N entries
indx_j = build_indices(self.C,self.j)
#x_j will be entries to be added to vector x at indices indx_j
x_j = build_x_j(indx_j,self.C)
#Add x_j into entries of x
self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
#And then actually do the work
N = #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
for j in range(N):
proc = Worker(j, C, x)
procs.append(proc)
proc.start()
#And then join() the processes with the main process
for proc in procs:
proc.join()
Custom Process Pool and Queues
So this might work, but spawning several thousand processes is not really going to be of any use if you only have a few cores. The way I handled this was to implement a Queue system between my process. That is to say, we have a Queue that the main process fills with j's and then a couple worker processes get numbers from the Queue and do work with it, note that by implementing this, you are essentially doing exactly what Pool does. Also note we are actually going to use multiprocessing.JoinableQueue for this since it lets use join() to wait till a queue is emptied.
Its not hard to implement this at all really, simply we must modify our Subclass a bit and how we use it.
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
def __init__(self, C, x, job_queue):
multiprocessing.Process.__init__()
self.shared_x = x
self.C = C
self.job_queue = job_queue
def run(self):
#New Queue Stuff
j = None
while j!='kill': #this is how I kill processes with queues, there might be a cleaner way.
j = self.job_queue.get() #gets a job from the queue if there is one, otherwise blocks.
if j!='kill':
#Your Stuff
indx_j = build_indices(self.C,j)
x_j = build_x_j(indx_j,self.C)
self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
#This tells the queue that the job that was pulled from it
#Has been completed (we need this for queue.join())
self.job_queue.task_done()
#The way we interact has changed, now we need to define a job queue
job_queue = multiprocessing.JoinableQueue()
N = #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
proc_count = multiprocessing.cpu_count() # create as many procs as cores
for _ in range(proc_count):
proc = Worker(C, x, job_queue) #now we pass the job queue instead
procs.append(proc)
proc.start()
#all the workers are just waiting for jobs now.
for j in range(N):
job_queue.put(j)
job_queue.join() #this blocks the main process until the queue has been emptied
#Now if you want to kill all the processes, just send a 'kill'
#job for each process.
for proc in procs:
job_queue.put('kill')
job_queue.join()
Finally, I really cannot say anything about how this will handle writing to overlapping indices at the same time. Worst case is that you could have a serious problem if two things attempt to write at the same time and things get corrupted/crash(I am no expert here so I really have no idea if that would happen). Best case since you are just doing addition, and order of operations doesn't matter, everything runs smoothly. If it doesn't run smoothly, my suggestion is to create a second custom Process subclass that specifically does the array assignment. To implement this you would need to pass both a job queue, and an 'output' queue to the Worker subclass. Within the while loop, you should have a `output_queue.put((indx_j, x_j)). NOTE: If you are putting these into a Queue they are being pickled, which can be slow. I recommend making them shared memory arrays if they can be before using put. It may be faster to just pickle them in some cases, but I have not tested this. To assign these as they are generated, you then need to have your Assigner process read these values from a queue as jobs and apply them, such that the work loop would essentially be:
def run(self):
job = None
while job!='kill':
job = self.job_queue.get()
if job!='kill':
indx_j, x_j = job
#Note this is the process which really needs access to the X array.
self.x[indx_j] += x_j
self.job_queue.task_done()
This last solution will likely be slower than doing the assignment within the worker threads, but if you are doing it this way, you have no worries about race conditions, and memory is still lighter since you can use up the indx_j and x_j values as you generate them, instead of waiting till all of them are done.
Note for Windows
I didn't do any of this work on windows, so I am not 100% certain, but I believe the code above will be very memory intensive since windows does not implement a copy-on-write system for spawning independent processes. Essentially windows will copy ALL information that a process needs when spawning a new one from the main process. To fix this, I think replacing all your x_j and C with shared memory arrays (anything you will be handing around to other processes) instead of normal arrays should cause windows to not copy the data, but I am not certain. You did not specify which platform you were on so I figured better safe than sorry, since multiprocessing is a different beast on windows than linux.