How do I replace a nested loop with concurrent.futures?

Say, I have the following loops:
for i in range(50):
    for j in range(50):
        for k in range(50):
            for l in range(50):
                some_function(i, j, k, l)
some_function happens to be a web scraper for a lot of small files, so it makes sense to parallelize this.
But as far as I can tell, concurrent.futures only accepts one iterator, so I'm not sure how to approach this. What I could do is express (two of) them as a single iterator like:
def in_between_function(a):
    x = a // 50
    y = a % 50
    some_function(x, y)

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    future = executor.map(in_between_function, range(50*50))
That doesn't look too bad, but I want to do it properly; if I extend this to more iterators, to negative ranges, or to iterators that aren't linear, it will become a nightmare to maintain.
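One way to avoid the manual index arithmetic entirely (a sketch, not from the original post, assuming some_function is defined as above) is to let itertools.product generate the index tuples and unpack them in a small wrapper:

import concurrent.futures
import itertools

def call_some_function(args):
    # args is one (i, j, k, l) tuple produced by itertools.product
    some_function(*args)

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    # product accepts any number of iterables, including negative or
    # irregular ranges, so nothing needs to change as the loops grow
    results = list(executor.map(call_some_function,
                                itertools.product(range(50), range(50),
                                                  range(50), range(50))))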

Related

Multi-Processing nested loops in Python?

I have a multi-nested for loop and I'd like to parallelize it as much as possible in Python.
Suppose I have some arbitrary function func(a, b) that accepts two arguments, and I'd like to compute it for all combinations of elements from M and N.
What I've done so far is 'flatten' the indices as a dictionary
idx_map = {}
count = 0
for i in range(n):
    for j in range(m):
        idx_map[count] = (i, j)
        count += 1
Now that my nested loop is flattened, I can use it like so:
arr = []
for idx in range(n*m):
    i, j = idx_map[idx]
    arr.append(func(M[i], N[j]))
Can I use this with Python's built-in multiprocessing to parallelize? Race conditions should not be an issue because I do not need to aggregate the func calls; rather, I just want to arrive at some final array that evaluates all func(a, b) combinations across M and N. (So async behavior and complexity should not be relevant here.)
What's the best way to accomplish this effect?
I found the following in a related question, but I don't understand what the author was trying to illustrate:
if 1:  # multi-threaded
    pool = mp.Pool(28)  # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print 'Multiprocess total time is %4.3f seconds' % (time.time() - st)
    print
Yes, you can accomplish this, but the amount of work you do per function call needs to be substantial enough to overcome the overhead of the processes.
Vectorizing with something like numpy is typically easier, as Jérôme stated previously.
I have altered your code so that you can observe the speedup you get from multiprocessing.
Feel free to change the largNum variable to see how the scaling for multiprocessing improves as the amount of work per function call increases, and how at low values multiprocessing is actually slower.
from concurrent.futures import ProcessPoolExecutor
import time

# Sums the integers below (a+b)**2
def costlyFunc(theArgs):
    a = theArgs[0]
    b = theArgs[1]
    topOfRange = (a + b)**2
    total = 0
    for i in range(topOfRange):
        total += i
    return total

# Changed to a list
idx_map = []
largNum = 200

# Your index flattening
for i in range(largNum):
    for j in range(largNum):
        idx_map.append((i, j))
I use the built-in map function in the single-core version to call costlyFunc on every element in the list. Python's concurrent.futures executors also provide a similar map method, but it distributes the calls over multiple processes.
if __name__ == "__main__":
    # No multiprocessing
    oneCoreTimer = time.time()
    result = [x for x in map(costlyFunc, idx_map)]
    oneCoreTime = time.time() - oneCoreTimer
    print(oneCoreTime, " seconds to complete the function without multiprocessing")

    # Multiprocessing
    mpTimer = time.time()
    with ProcessPoolExecutor() as ex:
        mpResult = [x for x in ex.map(costlyFunc, idx_map)]
    mpTime = time.time() - mpTimer
    print(mpTime, " seconds to complete the function with multiprocessing")

    print(f"Multiprocessing is {oneCoreTime/mpTime} times faster than using one core")

How to make huge loops in python faster on mac?

I am a computer science student and some of the things I do require me to run huge loops on a MacBook with a dual-core i5. Some of the loops take 5-6 hours to complete, but they only use 25% of my CPU. Is there a way to make this process faster? I can't change my loops, but is there a way to make them run faster?
Thank you
Mac OS 10.11
Python 2.7 (I have to use 2.7) with IDLE or Spyder on Anaconda
Here is a code sample that takes 15 minutes:
def test_false_pos():
    sumA = [0] * 1000
    for test in range(1000):
        counter = 0
        bf = BloomFilter(4095, 10)
        for i in range(600):
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter / 10000.0
    avg = np.mean(sumA)
    return avg
Sure thing: Python 2.7 has to generate huge lists and waste a lot of memory each time you use range(<a huge number>).
Try to use the xrange function instead. It doesn't create that gigantic list at once, it produces the members of a sequence lazily.
But if you were to use Python 3 (which is the modern version and the future of Python), you'd find that its range is even better and faster than xrange in Python 2.
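For example, a tiny Python 2 sketch of the difference:

total = 0
for i in xrange(10**7):   # xrange yields the numbers lazily
    total += i            # range(10**7) would first build a list of ten million ints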
You could split it up into 4 loops:
import multiprocessing
def test_false_pos(times, i, q):
    sumA = [0] * times
    for test in range(times):
        counter = 0
        bf = BloomFilter(4095, 10)
        for _ in range(600):  # renamed so this loop doesn't clobber the process index i
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter / 10000.0
    q.put([i, list(sumA)])

def full_test(pieces):
    threads = []
    q = multiprocessing.Queue()
    steps = 1000 // pieces
    for i in range(pieces):
        threads.append(multiprocessing.Process(target=test_false_pos, args=(steps, i, q)))
    [thread.start() for thread in threads]
    results = [None] * pieces
    for i in range(pieces):
        i, result = q.get()
        results[i] = result
    # Flatten the array (`results` looks like this: [[...], [...], [...], [...]])
    # source: https://stackoverflow.com/a/952952/5244995
    sums = [val for result in results for val in result]
    return np.mean(np.array(sums))

if __name__ == '__main__':
    full_test(multiprocessing.cpu_count())
This will run n processes that each do 1/nth of the work, where n is the number of processors on your computer.
The test_false_pos function has been modified to take three parameters:
times is the number of times to run the loop.
i is passed through to the result.
q is a queue to add the results to.
The function loops times times, then places i and sumA into the queue for further processing.
The main process (full_test) waits for each worker to put its result on the queue, then places the results in the appropriate position in the results list. Once the list is complete, it is flattened, and the mean is calculated and returned.
Consider looking into Numba and its JIT (just-in-time) compiler. It works for functions that are numpy based. It can handle some Python routines, but it is mainly for speeding up numerical calculations, especially ones with loops (like doing Cholesky rank-1 up/downdates). I don't think it would work with a BloomFilter, but it is generally super helpful to know about.
In cases where you must use other packages in your flow alongside numpy, separate out the heavy-lifting numpy routines into their own functions, and throw a @jit decorator on top of each of those functions. Then put them into your flows with normal Python code.
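A minimal sketch of what that looks like (illustrative names, not the poster's code):

import numpy as np
from numba import jit

@jit(nopython=True)
def sum_of_squares(arr):
    # A plain Python loop over a numpy array, compiled to machine code by Numba
    total = 0.0
    for x in arr:
        total += x * x
    return total

print(sum_of_squares(np.arange(10**6, dtype=np.float64)))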

Ipyparallel slow execution with scatter/gather

Context: I have an array that I have scattered across my engines (4 engines at this time). I want to apply a function to each point in the array for an arbitrary number of iterations, gather the resulting array from the engines, and perform analysis on it.
For example, I have an array of data points that are scattered, and a number of iterations to run on each data point:
data_points = range(16)
iterations = 10
dview.scatter('points', data_points)
I have a user supplied function as such, which is pushed to the engines:
def user_supplied_function(point):
    return randint(0, point)

dview.push(dict(function_one = user_supplied_function))
A list for my results and the parallel execution:
result_list = []
for i in range(iterations):
    %px engine_result = [function_one(j) for j in points]
    result_list.append(dview.gather('engine_result'))
Issue: This works, and I get the result I want from the engines, however as the number of iterations grows the loop takes longer and longer to execute. To the point where 1000 iterations on 50 points takes upwards of 15 seconds to complete. Whereas a sequential version of this task takes less than a second.
Any idea what could be causing this? Could it be the overhead from the message passing from gather()? If so can anyone suggest any solutions?
Figured it out. It was the overhead from gather() and .append() after all. The easiest fix is to gather() after the engines have finished their work, as opposed to doing it each iteration.
Solution
%autopx
engine_result = []
for i in xrange(iterations):
    engine_result += [[function_one(j) for j in points]]
%autopx
result_list = list(dview.gather('engine_result'))
This, however, returns the results in a poorly ordered list of lists, where the results from each engine are grouped together instead of being ordered by iteration number. The following commands regroup the lists by iteration and flatten the sublists.
gathered_list = [None] * iterations
gathered_list = [[result_list[j * iterations + i] for j in xrange(len(result_list) / iterations)] for i in xrange(iterations)]
gathered_list = [reduce(lambda x, y: x.extend(y) or x, z) for z in gathered_list]
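An equivalent regrouping, as a sketch under the same layout assumption (result_list holds each engine's per-iteration lists back to back), reads a little more directly with itertools.chain:

from itertools import chain

num_engines = len(result_list) // iterations
gathered_list = [
    list(chain.from_iterable(result_list[j * iterations + i]
                             for j in range(num_engines)))
    for i in range(iterations)
]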

Using generator instead of nested loops

I have the following nested loop, but it is inefficient time-wise, so using a generator would be much better. Do you know how to do that?
x_sph[:] = [r*sin_t*cos_p for cos_p in cos_phi for sin_t in sin_theta for r in p]
It seems like some of you are of the opinion (looking at comments) that using a generator was not helpful in this case. I am under the impression that using generators will avoid assigning variables to memory, and thus save memory and time. Am I wrong?
Judging from your code snippet you want to do something numerical and you want to do it fast. A generator won't help much in this respect. But using the numpy module will. Do it like so:
import numpy

# Change your p into an array, you'll see why.
r = numpy.array(p)                  # If p is a list, this turns it into a 1-dimensional vector.
sin_theta = numpy.array(sin_theta)  # Same with the rest.
cos_phi = numpy.array(cos_phi)

# Broadcast an outer product over the three vectors and flatten it; the axes
# are ordered to match the nesting of the original list comprehension.
x_sph = (cos_phi[:, None, None] * sin_theta[None, :, None] * r[None, None, :]).ravel()
In fact I'd use numpy even earlier, by doing:
phi = numpy.array(phi) # I don't know how you calculate this but you can start here with a phi list.
theta = numpy.array(theta)
sin_theta = numpy.sin(theta)
cos_phi = numpy.cos(phi)
You could even skip the intermediate sin_theta and cos_phi assignments and just put all the stuff in one line. It'll be long and complicated so I'll omit it but I do numpy-maths like that sometimes.
And numpy is fast, it'll make a huge difference. At least a noticeable one.
[...] creates a list and (...) creates a generator:
generator = (r*sin_t*cos_p for cos_p in cos_phi for sin_t in sin_theta for r in p)

for value in generator:
    pass  # Do something with each value
To turn a loop into a generator, you can make it a function and yield:
def x_sph(p, cos_phi, sin_theta):
    for r in p:
        for sin_t in sin_theta:
            for cos_p in cos_phi:
                yield r * sin_t * cos_p
However, note that the advantages of generators are generally only realised if you don't need to calculate all values and can break at some point, or if you don't want to store all the values (the latter is a space rather than time advantage). If you end up calling this:
lst = list(x_sph(p, cos_phi, sin_theta))
then you won't see any gain.
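To illustrate when the generator does pay off, here is a sketch with a hypothetical early exit (the threshold is made up for the example):

threshold = 0.5
for value in x_sph(p, cos_phi, sin_theta):
    if value > threshold:   # stop at the first product above the threshold
        break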

How to avoid using for-loops with numpy?

I have already written the following piece of code, which does exactly what I want, but it runs way too slowly. I am certain that there is a way to make it faster, but I can't seem to find out how it should be done. The first part of the code just shows the shapes involved.
two images of measurements (VV1 and HH1)
precomputed values, VV simulated and HH simulated, which both depend on 3 parameters (precomputed for (101, 31, 11) values)
the index 2 is just to put the VV and HH images in the same ndarray, instead of making two 3darrays
VV1 = numpy.ndarray((54, 43)).flatten()
HH1 = numpy.ndarray((54, 43)).flatten()
precomp = numpy.ndarray((101, 31, 11, 2))
two of the three parameters we let vary
comp = numpy.zeros((len(parameter1), len(parameter2)))
for i, (vv, hh) in enumerate(zip(VV1, HH1)):
    comp0 = numpy.zeros((len(parameter1), len(parameter2)))
    for j in range(len(parameter1)):
        for jj in range(len(parameter2)):
            comp0[j, jj] = numpy.min((vv - precomp[j, jj, :, 0])**2 + (hh - precomp[j, jj, :, 1])**2)
    comp += comp0
The obvious thing I know I should do is get rid of as many for-loops as I can, but I don't know how to make numpy.min behave properly when working with more dimensions.
A second thing I noticed (less important if the code can be vectorized, but still interesting) is that it mostly costs CPU time, not RAM. I have searched for a long time already, but I can't find a way to write something like MATLAB's "parfor" instead of "for". Is it possible to make a @parallel decorator if I just put the for-loop in a separate method?
edit: in reply to Janne Karila: yeah, that definitely improves it a lot,
for (vv, hh) in zip(VV1, HH1):
    comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)
This is definitely a lot faster, but is there any possibility to remove the outer for-loop too? And is there a way to make a for-loop parallel, with a @parallel decorator or something?
This can replace the inner loops over j and jj:
comp0 = numpy.min((vv-precomp[...,0])**2+(hh-precomp[...,1])**2, axis=2)
This may be a replacement for the whole loop, though all this indexing is stretching my mind a bit. (It does create a large intermediate array.)

comp = numpy.sum(
    numpy.min((VV1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 0])**2
              + (HH1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 1])**2,
              axis=3),   # the new leading axis shifts the minimised axis from 2 to 3
    axis=0)
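If the full (N, 101, 31, 11) intermediate array is too large, a middle ground (a sketch, with an arbitrary chunk size) is to process the flattened images in chunks:

comp = numpy.zeros(precomp.shape[:2])
for start in range(0, VV1.size, 256):   # 256 pixels per chunk, chosen arbitrarily
    vv = VV1[start:start + 256].reshape(-1, 1, 1, 1)
    hh = HH1[start:start + 256].reshape(-1, 1, 1, 1)
    comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2,
                      axis=3).sum(axis=0)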
One way to parallelize the loop is to construct it in such a way as to use map. In that case, you can then use multiprocessing.Pool to use a parallel map.
I would change this:
for (vv, hh) in zip(VV1, HH1):
    comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)
To something like this:
def buildcomp(vvhh):
    vv, hh = vvhh
    return numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

if __name__ == '__main__':
    from multiprocessing import Pool
    nthreads = 2
    p = Pool(nthreads)
    complist = p.map(buildcomp, np.column_stack((VV1, HH1)))
    comp = np.dstack(complist).sum(-1)
Note that the dstack assumes that each comp.ndim is 2, because it will add a third axis, and sum along it. This will slow it down a bit because you have to build the list, stack it, then sum it, but these are all either parallel or numpy operations.
I also changed the zip to a numpy operation np.column_stack, since zip is much slower for long arrays, assuming they're already 1d arrays (which they are in your example).
I can't easily test this so if there's a problem, feel free to let me know.
In computer science, there is the concept of Big O notation, used for getting an approximation of how much work is required to do something. To make a program fast, do as little as possible.
This is why Janne's answer is so much faster: it does fewer calculations. Taking this principle further, we can apply memoization, because you are CPU bound rather than RAM bound. You can use the memory library if it needs to be more complex than the following example.
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

memo = AutoVivification()

def memoize(n, arr, end):
    # The array argument is not hashable, so key only on the scalar value
    # and the slice index; `in` avoids evaluating an array as a boolean.
    if end not in memo[n]:
        memo[n][end] = (n - arr[..., end])**2
    return memo[n][end]

for (vv, hh) in zip(VV1, HH1):
    first = memoize(vv, precomp, 0)
    second = memoize(hh, precomp, 1)
    comp += numpy.min(first + second, axis=2)
Anything that has already been computed gets saved to memory in the dictionary, and we can look it up later instead of recomputing it. You can even break down the math being done into smaller steps that are each memoized if necessary.
The AutoVivification dictionary is just to make it easier to save the results inside of memoize, because I'm lazy. Again, you can memoize any of the math you do, so if numpy.min is slow, memoize it too.
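On Python 3, a standard-library alternative to the hand-rolled memo dictionary is functools.lru_cache; a sketch, assuming precomp stays fixed (lru_cache cannot hash array arguments, so the array is taken from the enclosing scope):

from functools import lru_cache

@lru_cache(maxsize=None)
def squared_diff(n, end):
    # n is a scalar pixel value and end selects the VV (0) or HH (1) slice
    return (n - precomp[..., end])**2

for (vv, hh) in zip(VV1, HH1):
    comp += numpy.min(squared_diff(vv, 0) + squared_diff(hh, 1), axis=2)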
