I have this code:
import string
import itertools

def loop():
    alphabet = string.digits + string.letters
    for key in itertools.product(alphabet, repeat=6):
        ...
I am using 4 processes using this code:
import multiprocessing

if __name__ == '__main__':
    jobs = []
    for i in range(4):
        p = multiprocessing.Process(target=loop)
        jobs.append(p)
        p.start()
Now, this will just run the entire function 4 times. I need to somehow split the workload into 4 parts and have each process work on its own part, so in this case I need to split the characters I'm generating into 4 different groups, for example:
Process 1 workload
100,101,102,103
Process 2 workload
104,105,106,107
Process 3 workload
108,109,110,111
Process 4 workload
112,113,114,115
I think you should understand what I want to do..
I tried looping through and just throwing away the keys that don't belong to the process, but that can get very slow with a large character set. If I had 1,000,000 keys and the process number was 4, it would loop 750,000 times without doing anything and then process the next 250,000; if the process number was 3, it would loop 500,000 times, process the next 250,000, and finish at 750,000. That is a lot of wasted computing power.
You need to divide the workload beforehand and pass it in to your function when you call Process. Generally speaking, this can be a hard problem, but in your case it's pretty trivial since you're just generating cartesian products -- simply slice off the first character and attach it separately.
i.e. instead of generating repeat=6, use repeat=5 and iterate through the possibilities for the first letter yourself, passing each to a separate process.
For example:
def loop(first, sequence):
    for seq in sequence:
        key = (first,) + seq   # or first + ''.join(seq) if you want a string key
        ...
and call it with:
alphabet = ...
for letter in alphabet:
    p = Process(target=loop, args=(letter, itertools.product(alphabet, repeat=5)))
    # etc.
This will spawn one process per letter in your alphabet; you could do exactly four splits or other things like that by passing ranges for the first character, too.
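If you want exactly four processes regardless of the alphabet size, a minimal sketch along those lines (the chunking of the first-character alphabet and the loop body are illustrative; string.letters matches the Python 2 code in the question, use string.ascii_letters on Python 3):

import string
import itertools
import multiprocessing

def loop(first_chars, alphabet):
    # Each process handles every key whose first character is in first_chars.
    for first in first_chars:
        for rest in itertools.product(alphabet, repeat=5):
            key = first + ''.join(rest)
            # ... do the real work with key here ...

if __name__ == '__main__':
    alphabet = string.digits + string.letters  # string.ascii_letters on Python 3
    n = 4
    chunk = -(-len(alphabet) // n)  # ceiling division so no characters are dropped
    jobs = []
    for i in range(n):
        p = multiprocessing.Process(target=loop,
                                    args=(alphabet[i*chunk:(i+1)*chunk], alphabet))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()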
It sounds like each task only requires a small amount of data, so try using multiprocessing.Pool to create a pool of workers. It will start a pool of worker processes, and send a chunk of items to each worker. Use something like imap_unordered to map all the input combinations to their results.
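A minimal sketch of that approach (check_key is a hypothetical worker, and string.letters again follows the Python 2 code above):

import string
import itertools
from multiprocessing import Pool

def check_key(chars):
    # chars is one tuple yielded by itertools.product
    key = ''.join(chars)
    # ... test the key here ...
    return key

if __name__ == '__main__':
    alphabet = string.digits + string.letters  # string.ascii_letters on Python 3
    candidates = itertools.product(alphabet, repeat=6)
    pool = Pool(4)
    # A generous chunksize keeps the inter-process communication overhead low.
    for result in pool.imap_unordered(check_key, candidates, chunksize=10000):
        pass  # inspect each result as it arrives
    pool.close()
    pool.join()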
I am trying to figure out how to perform a multiprocessing task with an unusual formulation.
Basically, given two lists containing 10 matrices for each list, I have to check if applying an operation (that I'll call fn) gives the same results if the input is (A, B) or vice versa (B, A).
With a sequential approach, the solution is straightforward:
# Given
A = [matrix_a1, ... , matrix_a10]
B = [matrix_b1, ... , matrix_b10]

AB_BA = [fn(A[i], B[i]) == fn(B[i], A[i]) for i in range(len(A))]
The next task is a bit strange, because it requires using strictly more than ten threads while applying multiprocessing. The restriction is that you cannot simply assign the ten comparisons to ten different processes, because the remaining processes would then be unused. I do not know why the request seems to use "process" and "thread" interchangeably.
This task seems a bit confusing because in multiprocessing, generally, you set the maximum number of workers, not the minimum.
I tried to use a solution that uses a ProcessPoolExecutor, as follows:
import concurrent.futures

def equality(A, B, i):
    res = fn(A[i], B[i]) == fn(B[i], A[i])
    return res

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    idx = range(len(A))
    results = executor.map(equality, A, B, idx)
    for result in results:
        print(result)
My problem is that I am not sure how to check resource usage. I have naively tried to monitor the CPU usage with the Ubuntu System Monitor as well as with "top" from the command line.
In addition, this solution is the most efficient among those I tried, but there is no direct way to specify that at least 11 workers should be used, so it does not seem to stick to what was requested.
I also tried other solutions, such as using Pool directly. This causes 10 Python instances to appear in top, but again, not more than 10. Here's what I tried:
import multiprocessing as mp

def equality(A, B):
    res = fn(A, B) == fn(B, A)
    return res

with mp.Pool(20) as p:
    print(p.starmap(equality, ((A[i], B[i]) for i in range(len(A)))))
Do you have any suggestions to address this request as well as monitor the resource usage to be sure it is working as expected?
Thank you very much for your help in advance.
I wish you had published the actual problem word for word, since your description is a bit unclear. But this is what I know (or think I know):
Unless the amount of CPU processing done by your worker function equality is great enough that the gain from running it in parallel more than offsets the additional multiprocessing overhead you would not otherwise have (starting processes, moving data from one address space to another, etc.), your multiprocessing code will run more slowly than the serial version. Therefore, you should design your worker function to do as much work as possible and to pass as little data as possible.
When you specify ...
results = executor.map(equality, A, B, idx)
... your equality function will be invoked once for each element of A, B and idx. So what is being passed is not the entire lists A and B but rather individual elements (e.g. matrix_a1 and matrix_b1). Therefore, there is no point in even passing an idx argument:
def equality(matrix_a, matrix_b):
    """
    matrix_a and matrix_b are single elements of
    lists A and B respectively.
    """
    return fn(matrix_a, matrix_b) == fn(matrix_b, matrix_a)
def main():
    from os import cpu_count
    from concurrent.futures import ProcessPoolExecutor

    A = [matrix_a1, ... , matrix_a10]
    B = [matrix_b1, ... , matrix_b10]

    # Do not create more processes than we have either
    # CPU cores or tasks to submit:
    pool_size = min(cpu_count(), len(A))
    with ProcessPoolExecutor(max_workers=pool_size) as executor:
        AB_BA = list(executor.map(equality, A, B))
    # This will be a list of 10 elements, each either `True` or `False`:
    print(AB_BA)

# Required for Windows:
if __name__ == '__main__':
    main()
So we will be submitting 10 tasks to a pool of size 10. Internally there is a "task queue" holding all the arguments being passed to equality:
matrix_a1, matrix_b1 # task 1
matrix_a2, matrix_b2 # task 2
...
matrix_a10, matrix_b10 # task 10
Any process in the pool that is idle will grab the next task from the queue to work on, and the results will be returned in task-submission order. But since equality is such a short-running function (unless fn is sufficiently complicated), there is the possibility that the pool process that grabs the first task completes it and grabs the second task before some other pool process has been dispatched by the operating system and had a chance to take it. So there is no guarantee that all 10 tasks will be worked on in parallel by 10 pool processes, even if fn is sufficiently CPU-intensive. If you were to insert a call to time.sleep(.1) at the beginning of equality, that would give the other pool processes a chance to "wake up" and grab their own tasks from the queue, but it would also slow your program down, since sleeping for this purpose is totally non-productive. The point I am trying to make is that you cannot ensure that all pool processes will always be active concurrently.
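If you want to observe this for yourself, here is a small sketch (the probe function and the integer stand-ins are illustrative, not part of the original code) that reports which pool process handled each task:

import os
import time
from concurrent.futures import ProcessPoolExecutor

def probe(a, b):
    # Stand-in for equality: report which pool process ran this task.
    time.sleep(0.1)  # only here to give every worker a chance to pick up a task
    return os.getpid()

def main():
    A = list(range(10))  # stand-ins for the ten matrices
    B = list(range(10))
    with ProcessPoolExecutor(max_workers=10) as executor:
        pids = list(executor.map(probe, A, B))
    # The number of distinct pids shows how many pool processes actually ran a task.
    print(len(set(pids)), "distinct worker processes handled the 10 tasks")

if __name__ == '__main__':
    main()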
I am trying to write a multiprocessing program, and it seems I have done it: I have verified with the System Monitor app that the Python processes are created. But it appears that almost all of them are not actually utilized. In my program I am splitting audio files into chunks, so I don't consider the workload a "trivial computational load", as discussed in other threads.
A minimal example that shows the same behavior for me:
import os, random, time
from tqdm import tqdm
from multiprocessing import Pool

def myfunc(myli):
    print(len(myli))
    for item in myli:
        x = item*item*item
    time.sleep(2)
    return

mylist = [random.randint(1, 10000) for _ in range(0, 19999)]

with Pool(processes=8) as p, tqdm(total=len(mylist)) as pbar:
    for _ in p.imap_unordered(func=myfunc, iterable=(mylist,)):
        pbar.update()
As you can see, I have added a print() inside the function used, and every time it prints the length of the entire array, as if no splitting is happening.
I have naively tried using different chunksizes and removing tqdm (in case it played any role).
If you could give me any insight, I would appreciate it.
The code is doing what you told it to do: you passed an iterable of length 1, a tuple containing a single item (mylist). So it passes that single item to a single worker to process.
But you can't do iterable=mylist instead, because myfunc() expects to get a sequence, not an integer. Whatever the iterable is, multiprocessing passes it to the worker one element at a time. chunksize has nothing to do with that. Whether chunksize is 1 or a billion, the worker functions see one element at a time. chunksize is an under-the-covers optimization, purely to reduce the number of expensive interprocess communication calls required.
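For example, a per-item version of the worker (a sketch with a hypothetical cube function) lets the pool distribute the elements itself, with chunksize only affecting how they are shipped to the workers:

from multiprocessing import Pool

def cube(item):
    # Receives ONE integer at a time, no matter what chunksize is.
    return item * item * item

if __name__ == '__main__':
    mylist = list(range(19999))
    with Pool(processes=8) as p:
        # chunksize only batches the inter-process communication;
        # cube still sees a single item per call.
        results = p.map(cube, mylist, chunksize=250)
    print(len(results))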
If you want to split a sequence into chunks and use worker functions that expect chunks, then you have to do the "chunking" yourself. For example, add
# Generate slices of `xs` of length (at most) `n`.
def chunkit(xs, n):
    start = 0
    while start < len(xs):
        yield xs[start : start + n]
        start += n
and pass iterable=chunkit(mylist, 40). Then all 8 processes will be busy. One will work on mylist[0:40], another on mylist[40:80], another on mylist[80:120], and so on, until mylist is exhausted.
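Put together with the original example, that might look like the following sketch (the chunk size of 40 is arbitrary):

import random
import time
from multiprocessing import Pool
from tqdm import tqdm

def myfunc(myli):
    print(len(myli))  # now prints the chunk length (at most 40)
    for item in myli:
        x = item * item * item
    time.sleep(2)

def chunkit(xs, n):
    # Generate slices of `xs` of length (at most) `n`.
    start = 0
    while start < len(xs):
        yield xs[start : start + n]
        start += n

if __name__ == '__main__':
    mylist = [random.randint(1, 10000) for _ in range(19999)]
    chunks = list(chunkit(mylist, 40))
    with Pool(processes=8) as p, tqdm(total=len(chunks)) as pbar:
        for _ in p.imap_unordered(myfunc, chunks):
            pbar.update()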
I am very new to multiprocessing and I am only using it to find an image on the screen. The problem is that the code produces duplicates, which slow it down. I have tried using a "not in" check to only place proc into processes if it is not already in the list, but this did not work. Any help or optimization would be welcome; I have no idea what I am doing, as this is just a personal project to learn multiprocessing.
from multiprocessing.context import Process
import pyautogui as auto

screenWidth, screenHeight = auto.size()
currentMouseX, currentMouseY = auto.position()

def bot(aim):
    while True:
        for aim in auto.locateAllOnScreen(r"dot.png", confidence=0.9795):
            auto.click(aim)
            print(aim)

def bot2(aim):
    while True:
        for aim in auto.locateAllOnScreen(r"dot.png", confidence=0.9795):
            auto.click(aim)
            print(aim)

def bot3(aim):
    while True:
        for aim in auto.locateAllOnScreen(r"dot.png", confidence=0.9795):
            auto.click(aim)
            print(aim)

if __name__ == "__main__":
    processes = []
    for t in auto.locateAllOnScreen(r"dot.png", confidence=0.9795):
        proc = Process(target=bot, args=(t,))
        processes.append(proc)
        proc.start()
    for z in auto.locateAllOnScreen(r"dot.png", confidence=0.9795):
        proc = Process(target=bot2, args=(z,))
        processes.append(proc)
        proc.start()
    for x in auto.locateAllOnScreen(r"dot.png", confidence=0.9795):
        proc = Process(target=bot3, args=(x,))
        processes.append(proc)
        proc.start()
    for p in processes:
        p.join()
Unless my eyes deceive me, you have three functions bot, bot2 and bot3 that appear to be identical. You have to ask yourself why you need three identical functions that differ only in name. I certainly don't have an answer.
Presumably auto.locateAllOnScreen returns the locations of all occurrences of "dot.png" on your screen, and you would like to print out information on each occurrence in parallel. Your main process iterates over all of these occurrences 3 times and starts a new process for each occurrence. Each process then totally ignores the occurrence argument, aim, that is passed to it and instead iterates over all the occurrences itself. So if there were 5 occurrences on the screen you would be creating 3 * 5 = 15 processes, and each process would print 5 lines of output (one per occurrence), for a total of 15 * 5 = 75 lines of output, when in reality you should only get 5 lines of output if you were doing this correctly (I am ignoring that there is a while True: loop in which all the output is then repeated). You are also potentially creating more processes than the number of CPU cores on your computer, in which case they would not truly be running in parallel, assuming the bot function(s) are CPU-intensive, which may not be the case.
I am not sure whether this problem is a candidate for multiprocessing, since there is a fair amount of overhead just to create processes and to pass arguments and results from one process to another. So you might not gain any improvement in performance. But if the idea is to see how you would solve this using multiprocessing, then, given that you do not know in advance how many elements the call to auto.locateAllOnScreen might return, and that there is no point in creating more processes than the number of processors you actually have, it is probably best to use a multiprocessing pool of fixed size.
What you want to do is have your worker function bot (and you only need one of these) be passed a single occurrence to process. You then create a pool of processes whose size is the smaller of the number of CPUs you have and the number of tasks you actually have to submit. You then submit to the pool a number of tasks, where each task specifies the worker function to perform it and the argument(s) it requires.
In the code below I have removed from function bot the while True: loop that never terminates. You can put it back in if you want.
from multiprocessing import Pool, cpu_count
import pyautogui as auto

def bot(aim):
    # do the work for the single occurrence of aim
    auto.click(aim)
    print(aim)

if __name__ == "__main__":
    aims = list(auto.locateAllOnScreen(r"dot.png", confidence=0.9795))
    # choose an appropriate pool size:
    pool = Pool(min(len(aims), cpu_count()))
    # bot will be called once for each element returned by auto.locateAllOnScreen
    pool.map(bot, aims)
I am relatively new to Python, and have been able to answer most of my questions based on similar problems answered on forums, but I'm at a point where I'm stuck and could use some help.
I have a simple nested for loop script that generates an output of strings. What I need to do next is have each grouping go through a simulation, based on numerical values that the strings will be matched to.
Really my question is how to go about this in the best way. I'm not sure if multithreading will work, since the strings are generated and then need to undergo the simulation one set at a time. I was reading about queues and wasn't sure if the combinations could be placed into a queue for storage and then undergo the simulation in the same order they entered the queue.
Regardless of the research I've done, I'm open to any suggestion anyone can make on the matter.
thanks!
edit: I'm not looking for an answer on how to do the simulation, but rather a way to store the combinations while the simulations are being computed
example
import itertools

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            print(D)
As was hinted at in the comments, the multiprocessing module is what you're looking for. Threading won't help you because of the Global Interpreter Lock (GIL), which limits execution to one Python thread at a time. In particular, I would look at multiprocessing pools. These objects give you an interface to have a pool of subprocesses do work for you in parallel with the main process, and you can go back and check on the results later.
Your example snippet could look something like this:
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

pool = multiprocessing.Pool()  # by default, this will create a number of workers
                               # equal to the number of CPU cores you have available

combination_list = []  # create a list to store the combinations
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            combination_list.append(D)  # append this combination to the list

results = pool.map(simulation_function, combination_list)
# simulation_function is the function you're using to actually run your
# simulation - assuming it only takes one parameter: the combination
The call to pool.map is blocking - meaning that once you call it, execution in the main process will halt until all the simulations are complete, but it is running them in parallel. When they complete, whatever your simulation function returns will be available in results, in the same order that the input arguments were in the combination_list.
If you don't want to wait for them, you can also use apply_async on your pool and store the result to look at later:
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

pool = multiprocessing.Pool()

result_list = []  # create a list to store the simulation results
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            result_list.append(pool.apply_async(
                simulation_function,
                args=(D,)))  # note the extra comma - args must be a tuple

# do other stuff
# now iterate over result_list to check the results when they're ready
If you use this structure, result_list will be full of multiprocessing.AsyncResult objects, which allow you to check whether they are ready with result.ready() and, once ready, retrieve the result with result.get(). This approach kicks off each simulation as soon as its combination is calculated, instead of waiting until all of them have been calculated before starting to process them. The downside is that it's a little more complicated to manage and retrieve the results: for example, you have to make sure a result is ready before retrieving it, and you need to be prepared to catch exceptions that may have been raised in the worker function. The caveats are explained pretty well in the documentation.
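A minimal sketch of that checking loop, assuming result_list was built as above, could look like this:

import time

# Poll the AsyncResult objects until every simulation has finished.
pending = list(result_list)
while pending:
    still_pending = []
    for result in pending:
        if result.ready():
            try:
                print(result.get())  # get() re-raises any exception from the worker
            except Exception as exc:
                print("simulation failed:", exc)
        else:
            still_pending.append(result)
    pending = still_pending
    if pending:
        time.sleep(0.1)  # avoid busy-waiting while the simulations run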
If calculating the combinations doesn't actually take very long and you don't mind your main process halting until they're all ready, I suggest the pool.map approach.
I am using Python 2.7.5 on a recent vintage Apple MacBook Pro which has four hardware and eight logical CPUs; i.e., the sysctl utility gives:
$ sysctl hw.physicalcpu
hw.physicalcpu: 4
$ sysctl hw.logicalcpu
hw.logicalcpu: 8
I need to perform some rather complicated processing on a large 1-D list or array, and then save the result as an intermediate output which will be used again at a later point in a subsequent calculation within my application. The structure of my problem lends itself rather naturally to parallelization, so I thought that I would try to use Python's multiprocessing module to subdivide the 1-D array into several pieces (either 4 pieces or 8 pieces, I'm not yet sure which), perform the calculations in parallel, and then reassemble the resulting output into its final format afterwards.
I am trying to decide whether to use multiprocessing.Queue() (message queues) or multiprocessing.Array() (shared memory) as my preferred mechanism for communicating the resulting calculations from the child processes back up to the main parent process, and I have been experimenting with a couple of "toy" models in order to make sure that I understand how the multiprocessing module actually works. I've come across a rather unexpected result, however: in creating two essentially equivalent solutions to the same problem, the version which uses shared memory for interprocess communication seems to require much more execution time (like 30X more!) than the version using message queues.
Below, I've included two different versions of sample source code for a "toy" problem which generates a long sequence of random numbers using parallel processes, and communicates the agglomerated result back to a parent process in two different ways: first using message queues, and the second time using shared memory.
Here is the version that uses message queues:
import random
import multiprocessing
import datetime

def genRandom(count, id, q):
    print("Now starting process {0}".format(id))
    output = []
    # Generate a list of random numbers, of length "count"
    for i in xrange(count):
        output.append(random.random())
    # Write the output to a queue, to be read by the calling process
    q.put(output)

if __name__ == "__main__":
    # Number of random numbers to be generated by each process
    size = 1000000
    # Number of processes to create -- the total size of all of the random
    # numbers generated will ultimately be (procs * size)
    procs = 4

    # Create a list of jobs and queues
    jobs = []
    outqs = []
    for i in xrange(0, procs):
        q = multiprocessing.Queue()
        p = multiprocessing.Process(target=genRandom, args=(size, i, q))
        jobs.append(p)
        outqs.append(q)

    # Start time of the parallel processing and communications section
    tstart = datetime.datetime.now()

    # Start the processes (i.e. calculate the random number lists)
    for j in jobs:
        j.start()

    # Read out the data from the queues
    data = []
    for q in outqs:
        data.extend(q.get())

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()

    # End time of the parallel processing and communications section
    tstop = datetime.datetime.now()
    tdelta = datetime.timedelta.total_seconds(tstop - tstart)

    msg = "{0} random numbers generated in {1} seconds"
    print(msg.format(len(data), tdelta))
When I run it, I get a result that typically looks about like this:
$ python multiproc_queue.py
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 0.514805 seconds
Now, here is the equivalent code segment, but refactored just slightly so that it uses shared memory instead of queues:
import random
import multiprocessing
import datetime

def genRandom(count, id, d):
    print("Now starting process {0}".format(id))
    # Generate a list of random numbers, of length "count", and write them
    # directly to a segment of an array in shared memory
    for i in xrange(count*id, count*(id+1)):
        d[i] = random.random()

if __name__ == "__main__":
    # Number of random numbers to be generated by each process
    size = 1000000
    # Number of processes to create -- the total size of all of the random
    # numbers generated will ultimately be (procs * size)
    procs = 4

    # Create a list of jobs and a block of shared memory
    jobs = []
    data = multiprocessing.Array('d', size*procs)
    for i in xrange(0, procs):
        p = multiprocessing.Process(target=genRandom, args=(size, i, data))
        jobs.append(p)

    # Start time of the parallel processing and communications section
    tstart = datetime.datetime.now()

    # Start the processes (i.e. calculate the random number lists)
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()

    # End time of the parallel processing and communications section
    tstop = datetime.datetime.now()
    tdelta = datetime.timedelta.total_seconds(tstop - tstart)

    msg = "{0} random numbers generated in {1} seconds"
    print(msg.format(len(data), tdelta))
When I run the shared memory version, however, the typical result looks more like this:
$ python multiproc_shmem.py
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 15.839607 seconds
My question: why is there such a huge difference in execution speeds (roughly 0.5 seconds vs. 15 seconds, a factor of 30X!) between the two versions of my code? And in particular, how can I modify the shared memory version in order to get it to run faster?
This is because multiprocessing.Array uses a lock by default to prevent multiple processes from accessing it at once:
multiprocessing.Array(typecode_or_type, size_or_initializer, *, lock=True)

...

If lock is True (the default) then a new lock object is created to synchronize
access to the value. If lock is a Lock or RLock object then that will be used
to synchronize access to the value. If lock is False then access to the
returned object will not be automatically protected by a lock, so it will not
necessarily be “process-safe”.
This means you're not really concurrently writing to the array - only one process can access it at a time. Since your example workers are doing almost nothing but array writes, constantly waiting on this lock badly hurts performance. If you use lock=False when you create the array, the performance is much better:
With lock=True:
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 4.811205 seconds
With lock=False:
Now starting process 0
Now starting process 3
Now starting process 1
Now starting process 2
4000000 random numbers generated in 0.192473 seconds
Note that using lock=False means you need to manually protect access to the Array whenever you're doing something that isn't process-safe. Your example is having processes write to unique parts, so it's ok. But if you were trying to read from it while doing that, or had different processes write to overlapping parts, you would need to manually acquire a lock.
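For reference, here is a sketch of the modified setup: the lock=False change plus an explicit multiprocessing.Lock for any section that does need protection (the explicit lock and the guarded block are illustrative only; the disjoint writes in this example don't need it, and xrange follows the Python 2 code above):

import random
import multiprocessing

def genRandom(count, id, d, lock):
    # Each process writes only to its own disjoint slice, so no lock is needed here.
    for i in xrange(count*id, count*(id+1)):
        d[i] = random.random()
    # If you ever had to touch a region shared with other processes,
    # guard just that part explicitly:
    with lock:
        pass  # e.g. update a shared counter or an overlapping slice

if __name__ == "__main__":
    size, procs = 1000000, 4
    # lock=False: no automatic synchronization on every element access
    data = multiprocessing.Array('d', size*procs, lock=False)
    lock = multiprocessing.Lock()
    jobs = [multiprocessing.Process(target=genRandom, args=(size, i, data, lock))
            for i in xrange(procs)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    print("{0} random numbers generated".format(len(data)))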