Multi-Processing nested loops in Python?

I have a multi-nested for loop and I'd like to parallelize it as much as possible in Python.
Suppose I have some arbitrary function func(a, b) that accepts two arguments, and I'd like to compute it for every combination of elements from M and N.
What I've done so far is 'flatten' the indices into a dictionary:
idx_map = {}
count = 0
for i in range(n):
    for j in range(m):
        idx_map[count] = (i, j)
        count += 1
Now that my nested loop is flattened, I can use it like so:
arr = []
for idx in range(n*m):
    i, j = idx_map[idx]
    arr.append(func(M[i], N[j]))
Can I use Python's built-in multiprocessing to parallelize this? Race conditions should not be an issue, because I do not need to aggregate the func calls; I just want to end up with a final array that holds func(a, b) for every combination across M and N. (So async behavior and its complexity should not be relevant here.)
What's the best way to accomplish this?
I found the snippet below in a related question, but I don't understand what the author was trying to illustrate.
if 1:  # multiprocessing
    pool = mp.Pool(28)  # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
    print

Yes, you can accomplish this; however, the amount of work done per function call needs to be quite substantial to overcome the overhead of the processes.
Vectorizing with something like numpy is typically easier, as Jérôme stated previously.
I have altered your code so that you can observe the speed-up you get from multiprocessing.
Feel free to change the largNum variable to see how the multiprocessing scaling improves as the amount of work per function call increases, and how at low values multiprocessing is actually slower.
from concurrent.futures import ProcessPoolExecutor
import time

# Sums all integers below (a+b)**2
def costlyFunc(theArgs):
    a = theArgs[0]
    b = theArgs[1]
    topOfRange = (a+b)**2
    sum = 0
    for i in range(topOfRange):
        sum += i
    return sum

# changed to a list instead of a dict
idx_map = []
largNum = 200

# Your index flattening
for i in range(largNum):
    for j in range(largNum):
        idx_map.append((i, j))
In the single-core version I use the built-in map function to call costlyFunc on every element in the list. Python's concurrent.futures executors provide a similar map method, except that it distributes the calls over multiple processes.
if __name__ == "__main__":
# No multiprocessing
oneCoreTimer=time.time()
result=[x for x in map(costlyFunc,idx_map)]
oneCoreTime=time.time()-oneCoreTimer
print(oneCoreTime," seconds to complete the function without multiprocessing")
# Multiprocessing
mpTimer=time.time()
with ProcessPoolExecutor() as ex:
mpResult=[x for x in ex.map(costlyFunc,idx_map)]
mpTime=time.time()-mpTimer
print(mpTime," seconds to complete the function with multiprocessing")
print(f"Multiprocessing is {oneCoreTime/mpTime} times faster than using one core")

Related

What is the advantage of a generator in terms of memory in these two examples?

One of the advantages of a generator is that it uses less memory and consumes fewer resources. That is, we do not produce all the data at once or allocate memory for all of it; only one value is generated at a time. The state and the values of the local variables are saved, so the code can effectively be paused and resumed by calling it again.
I wrote two pieces of code and compared them, and I see that the generator can also be written as a normal function, so now I do not see the point of the generator. Can anyone tell me what the advantage of this generator is compared to writing it normally? One value is generated with each iteration in both of them.
The first code:
def gen(n):
    for i in range(n):
        i = i ** 2
        i += 1
        yield i

g = gen(3)
for i in g:
    print(i)
The second one:
def func(i):
    i = i ** 2
    i += 1
    return i

for i in range(3):
    print(func(i))
I know that the id of g is constant whereas the id of func(i) is changing.
Is that what the main generator advantage means?
To be specific about the code you have shown: there is no difference in terms of memory between the two approaches, but the first one is preferable because everything you need is inside the same generator function, whereas in the second case the loop and the function live in two different places, and every time you want to use the function you have to repeat the loop outside it, which adds unnecessary redundancy.
Actually, the two functions you have written, the generator and the normal function, are not equivalent.
In the generator you are producing all the values, i.e. the loop is inside the generator function:
def gen(n):
    for i in range(n):
        i = i ** 2
        i += 1
        yield i
But, in the second case, you are just returning one value, and the loop is outside the function:
def func(i):
    i = i ** 2
    i += 1
    return i
In order to make the second function equivalent to the first one, you need to have the loop inside the function:
def func(n):
    for i in range(n):
        i = i ** 2
        i += 1
        return i
Now, of course, the above function always returns a single value (the one computed for i=0) as soon as control enters the loop, so to fix this you need to return an entire sequence, which requires a list or a similar data structure that can store multiple values:
def func(n):
    result = []
    for i in range(n):
        i = i ** 2
        i += 1
        result.append(i)
    return result

for v in func(3):
    print(v)
1
2
5
Now you can clearly differentiate the two cases: in the first one, each value is produced and immediately processed (i.e. printed), while in the second case you end up holding the entire result in memory before you can process any of it.
The main advantage shows up when you have a large dataset. It is basically the idea of lazy evaluation, which means a value is not computed unless it is required. This saves resources, because with a list the entire thing is materialized at once, which can take up a lot of memory if the data is large enough.
The advantage of the first code is with respect to something you did not show: generating and consuming one value at a time takes less memory than first generating all values, collecting them in a list, and then consuming them from the list.
The second code with which to compare the first code should have been:
def gen2(n):
    result = []
    for i in range(n):
        i = i ** 2
        i += 1
        result.append(i)
    return result

g = gen2(3)
for i in g:
    print(i)
Note how the result of gen2 can be used exactly like the result of gen from your first example, but gen2 uses more memory as n gets larger, whereas gen uses the same amount of memory no matter how large n is.
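One quick way to see the difference is to compare the size of the generator object with the size of the equivalent list; a small sketch (the exact byte counts depend on your Python build):

import sys

def gen(n):
    for i in range(n):
        yield i ** 2 + 1

def gen2(n):
    return [i ** 2 + 1 for i in range(n)]

print(sys.getsizeof(gen(10**6)))   # a few hundred bytes, independent of n
print(sys.getsizeof(gen2(10**6)))  # several megabytes, and it grows with n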

Result from a multiprocessing loop

I'm running the following program in order to compare the timing of multiprocessing with single-core processing.
Here is the script:
from multiprocessing import Pool, cpu_count
from time import *

# Amount to calculate
N = 5000

# Function that works on its own
def two_loops(x):
    t = 0
    for i in range(1, x+1):
        for j in range(i):
            t += 1
    return t

# Function that needs to be called in a loop
def single_loop(x):
    tt = 0
    for j in range(x):
        tt += 1
    return tt

print 'Starting loop function'
starttime = time()
tot = 0
for i in range(1, N+1):
    tot += single_loop(i)
print 'Single loop function. Result ', tot, ' in ', time()-starttime, ' seconds'

print 'Starting multiprocessing function'
if __name__ == '__main__':
    starttime = time()
    pool = Pool(cpu_count())
    res = pool.map(single_loop, range(1, N+1))
    pool.close()
    print 'MP function. Result ', res, ' in ', time()-starttime, ' seconds'

print 'Starting two loops function'
starttime = time()
print 'Two loops function. Result ', two_loops(N), ' in ', time()-starttime, ' seconds'
So basically the functions give me the sum of all integers between 1 and N (i.e. N(N+1)/2).
The two_loops function is the basic one, using two for loops. The single_loop function is just there to simulate one loop (the j loop).
When I run this script it works, but I don't get the right result. I get:
Starting loop function
Single loop function. Result 12502500 in 0.380275964737 seconds
Starting multiprocessing function
MP function. Result [1, 2, 3, ... a lot of values here ..., 4999, 5000] in 0.683819055557 seconds
Starting two loops function
Two loops function. Result 12502500 in 0.4114818573 seconds
It looks like the script runs, but I can't manage to get the right result. I saw on the web that the close() function was supposed to handle that, but apparently not.
Do you know how I can fix this?
Thanks a lot!
I don't understand your question but here's how it can be done:
from concurrent.futures.process import ProcessPoolExecutor
from timeit import Timer

def two_loops_multiprocessing():
    with ProcessPoolExecutor() as executor:
        executor.map(single_loop, range(N))

if __name__ == "__main__":
    iterations, elapsed_time = Timer("two_loops(N)", globals=globals()).autorange()
    print(elapsed_time / iterations)
    iterations, elapsed_time = Timer("two_loops_multiprocessing()", globals=globals()).autorange()
    print(elapsed_time / iterations)
What's happening is that the map function chops up the range you provide and runs the single_loop function on each of those numbers separately. See what it does here: https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.map Since your single loop just adds 1 to tt once for every value in a range up to the given number, each call simply returns its input, so you effectively get your range() back as a list, which is the answer you see.
In your other "single loop" version you later add all the values together to get a single number here:
for i in range(1, N+1):
    tot += single_loop(i)
But you forgot to do this with your multiprocessing version. What you should do is add up the values after you have called your map function, and you will get your expected answer.
Besides this, your single_loop function is basically the two-loop function with one loop moved into a function call; I'm not sure what you are trying to accomplish, but there is not a big difference between the two.
Just sum the result list:
res = sum(pool.map(single_loop, range(1, N+1)))
You could avoid calculating the sum in the main process by using some shared memory, but keep in mind that you would lose more time on synchronization. And again, there's no gain from multiprocessing in this case; it all depends on the specific workload. If you needed to call single_loop fewer times and each call took more time, then multiprocessing would speed up your code.
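Putting it together, a minimal corrected sketch in Python 3 syntax (the original script is Python 2), reusing the same single_loop:

from multiprocessing import Pool, cpu_count

N = 5000

def single_loop(x):
    tt = 0
    for j in range(x):
        tt += 1
    return tt

if __name__ == '__main__':
    with Pool(cpu_count()) as pool:
        # sum the list returned by map to get N*(N+1)/2
        res = sum(pool.map(single_loop, range(1, N + 1)))
    print(res)  # 12502500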

How do I replace a nested loop with concurrent.futures?

Say, I have the following loops:
for i in range(50):
    for j in range(50):
        for k in range(50):
            for l in range(50):
                some_function(i, j, k, l)
some_function happens to be a web scraper for a lot of small files, so it makes sense to parallelize this.
But as far as I can tell, concurrent.futures only accepts one iterator, so I'm not sure how to approach this. What I could do is express (two of) them as a single iterator like:
def in_between_function(a):
    x = math.floor(a / 50)
    y = a % 50
    some_function(x, y)

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    future = executor.map(in_between_function, range(50*50))
That doesn't look too bad, but I want to do it properly; if I extend this to more iterators, to negative numbers, or to iterators that aren't linear, it will become a nightmare to maintain.
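One way to keep it maintainable is to let itertools.product generate the index tuples and unpack them in a small wrapper; a sketch, with some_function standing in for the real scraper:

import concurrent.futures
from itertools import product

def some_function(i, j, k, l):
    pass  # placeholder for the real web scraper

def unpack(args):
    # executor.map passes one item per call, so unpack the tuple here
    return some_function(*args)

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    results = list(executor.map(unpack, product(range(50), repeat=4)))

Because product accepts arbitrary iterables, this also covers negative numbers or non-linear index sets without any index arithmetic.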

Any efficient way to update very large matrix?

I have a very large matrix: a 2D array with a few hundred rows but around 2 million columns. My application needs to update this matrix row by row, with one painful constraint: to update a specific row, all columns of the previous row must already have been updated. The process is very slow.
For example:
import random

matrix = [[0 for x in xrange(2000000)] for y in xrange(300)]

for i in xrange(1, 300):
    for j in xrange(2000000):
        k = random.randint(0, 2000000 - 1)  # a random column in row i-1
        matrix[i][j] = matrix[i-1][k] * simple_function(i)  # some calculation
To update matrix[i][j], I need the values of row i-1.
My initial thought was to use a multi-process approach to parallelize the j loop in each i round. However, the calculation in the j loop is too light; the process creation cost is much higher than the calculation itself (I also tried a process pool).
My second thought was to use threads; that works, but with almost no performance gain because of the GIL.
I would like to know if there is any other approach that can accelerate my code. Thank you.
BTW, I know Cython can work without the GIL, but the calculation function needs to access a Python object and it would take a lot of work to modify the original code.
Use Numba for such tasks:
https://numba.pydata.org/
If you want a matrix or an array, create one (and not a list of lists). I would also recommend taking a look at some beginner tutorials on numpy.
Example code:
import numpy as np
import numba as nb
import time

def main():
    matrix = np.random.rand(2000000, 300)
    t1 = time.time()
    k = 100
    Testing(matrix, k)
    print(time.time() - t1)

@nb.jit(cache=True)
def Testing(matrix, k):
    for i in xrange(1, matrix.shape[0]):
        # inline your simple function (in my case np.power(i, 2))
        res_of_func = np.power(i, 2)
        for j in xrange(matrix.shape[1]):
            matrix[i, j] = matrix[i-1, k] * res_of_func

if __name__ == "__main__":
    main()
The matrix calculation takes 0.5 seconds on my machine (Haswell i7).
On Python 2, parallelizing with Numba on Windows isn't working at the moment, but the jit-compiled code should be approximately 100 times faster.
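If simple_function really only depends on i, the inner j loop can also be vectorized with plain numpy fancy indexing instead of being jit-compiled. A sketch under that assumption (dimensions scaled down so the demo fits in memory; simple_function here is just a made-up placeholder):

import numpy as np

rows, cols = 300, 200000   # the real problem is 300 x 2000000
matrix = np.zeros((rows, cols))
matrix[0] = np.random.rand(cols)
rng = np.random.default_rng()

def simple_function(i):
    return i ** 2

for i in range(1, rows):
    ks = rng.integers(0, cols, size=cols)   # one random column of row i-1 per j
    matrix[i] = matrix[i - 1][ks] * simple_function(i)

Each row update then becomes a single vectorized operation, which sidesteps both the per-element Python overhead and the GIL issue mentioned in the question.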

How to make huge loops in python faster on mac?

I am a computer science student and some of the things I do require me to run huge loops on a MacBook with a dual-core i5. Some of the loops take 5-6 hours to complete but they only use 25% of my CPU. Is there a way to make this process faster? I can't change my loops, but is there a way to make them run faster?
Thank you
Mac OS 10.11
Python 2.7 (I have to use 2.7) with IDLE or Spyder on Anaconda
Here is a sample code that takes 15 minutes:
def test_false_pos():
    sumA = [0] * 1000
    for test in range(1000):
        counter = 0
        bf = BloomFilter(4095, 10)
        for i in range(600):
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter/10000.0
    avg = np.mean(sumA)
    return avg
Sure thing: Python 2.7 has to generate a huge list and waste a lot of memory every time you use range(<a huge number>).
Try the xrange function instead. It doesn't create that gigantic list at once; it produces the members of the sequence lazily.
But if you were to use Python 3 (which is the modern version and the future of Python), you'd find that its range is even cooler and faster than xrange in Python 2.
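For example (Python 2; the loop itself is only there to show the memory behaviour):

total = 0
for i in xrange(10**8):    # lazy: one number at a time, constant memory
    total += i
# for i in range(10**8):   # Python 2 range would first build a list of 10**8 ints
#     total += i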
You could split the work up into several parallel processes:
import multiprocessing

def test_false_pos(times, i, q):
    sumA = [0] * times
    for test in range(times):
        counter = 0
        bf = BloomFilter(4095, 10)
        for _ in range(600):
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter/10000.0
    q.put([i, list(sumA)])

def full_test(pieces):
    threads = []
    q = multiprocessing.Queue()
    steps = 1000 // pieces
    for i in range(pieces):
        threads.append(multiprocessing.Process(target=test_false_pos, args=(steps, i, q)))
    [thread.start() for thread in threads]
    results = [None] * pieces
    for i in range(pieces):
        i, result = q.get()
        results[i] = result
    # Flatten the array (`results` looks like this: [[...], [...], [...], [...]])
    # source: https://stackoverflow.com/a/952952/5244995
    sums = [val for result in results for val in result]
    return np.mean(np.array(sums))

if __name__ == '__main__':
    full_test(multiprocessing.cpu_count())
This will run n processes that each do 1/nth of the work, where n is the number of processors on your computer.
The test_false_pos function has been modified to take three parameters:
times is the number of times to run the loop.
i is passed through to the result.
q is a queue to add the results to.
The function loops times times, then places i and sumA into the queue for further processing.
The main process (full_test) waits for each worker to put its result on the queue, then places it in the appropriate position in the results list. Once the list is complete, it is flattened and the mean is calculated and returned.
Consider looking into Numba and its JIT (just-in-time compiler). It works for functions that are Numpy based. It can handle some plain Python routines, but it is mainly for speeding up numerical calculations, especially ones with loops (like doing Cholesky rank-1 up/downdates). I don't think it would work with a BloomFilter, but it is generally super helpful to know about.
In cases where you must use other packages in your flow alongside numpy, separate out the heavy-lifting numpy routines into their own functions and put an @jit decorator on top of those functions. Then call them from your flow with normal Python stuff.
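A minimal sketch of that pattern (the function and array sizes are made up for illustration):

import numpy as np
from numba import jit

@jit(nopython=True)
def row_sums_of_squares(a):
    # the heavy numeric loop lives in its own function so Numba can compile it
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        s = 0.0
        for j in range(a.shape[1]):
            s += a[i, j] * a[i, j]
        out[i] = s
    return out

data = np.random.rand(1000, 1000)
print(row_sums_of_squares(data).mean())

The rest of the program (the BloomFilter logic, random string generation, etc.) stays ordinary Python; only the numeric kernel is compiled.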
