I am a computer science student and some of the things I do require me to run huge loops on a MacBook with a dual-core i5. Some of the loops take 5-6 hours to complete, but they only use 25% of my CPU. I can't change the logic of my loops, but is there a way to make them run faster?
Thank you
Mac OS 10.11
Python 2.7 (I have to use 2.7) with IDLE or Spyder on Anaconda
Here is a sample code that takes 15 minutes:
def test_false_pos():
    # assumes numpy imported as np, a random module imported as rnd,
    # and the user's own BloomFilter class
    sumA = [0] * 1000
    for test in range(1000):
        counter = 0
        bf = BloomFilter(4095, 10)
        for i in range(600):
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter / 10000.0
    avg = np.mean(sumA)
    return avg
Sure thing: in Python 2.7, range(<a huge number>) builds the whole list in memory every time you call it, which wastes a lot of memory.
Try the xrange function instead. It doesn't create that gigantic list at once; it produces the members of the sequence lazily, one at a time.
And if you were to use Python 3 (which is the modern version and the future of Python), you'd find that its range is even better and faster than xrange in Python 2.
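For illustration, here is a minimal standalone sketch (not the asker's code) of the difference in Python 2.7:
# Python 2.7: xrange yields values one at a time instead of
# allocating a list of ten million integers up front.
total = 0
for i in xrange(10 ** 7):
    total += i
print(total)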
You could split the work up across several processes, giving each one a chunk of the loop:
import multiprocessing

def test_false_pos(times, i, q):
    # i identifies this worker; its results are put on the queue q
    # (assumes the same BloomFilter class and np/rnd imports as above)
    sumA = [0] * times
    for test in range(times):
        counter = 0
        bf = BloomFilter(4095, 10)
        for _ in range(600):   # renamed so the worker index i is not shadowed
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter / 10000.0
    q.put([i, list(sumA)])

def full_test(pieces):
    processes = []
    q = multiprocessing.Queue()
    steps = 1000 / pieces
    for i in range(pieces):
        processes.append(multiprocessing.Process(target=test_false_pos, args=(steps, i, q)))
    for process in processes:
        process.start()
    results = [None] * pieces
    for _ in range(pieces):
        i, result = q.get()
        results[i] = result
    # Flatten the array (`results` looks like this: [[...], [...], [...], [...]])
    # source: https://stackoverflow.com/a/952952/5244995
    sums = [val for result in results for val in result]
    return np.mean(np.array(sums))

if __name__ == '__main__':
    print full_test(multiprocessing.cpu_count())
This will run n processes that each do 1/nth of the work, where n is the number of processors on your computer.
The test_false_pos function has been modified to take three parameters:
times is the number of times to run the loop.
i is passed through to the result.
q is a queue to add the results to.
The function loops times times, then places i and sumA into the queue for further processing.
The main process (full_test) waits for a result from each worker process, then places it in the appropriate position in the results list. Once the list is complete, it is flattened, and the mean is calculated and returned.
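As a self-contained illustration of the same Process/Queue pattern (a minimal sketch with a dummy workload standing in for the Bloom-filter test):
import multiprocessing

def worker(times, i, q):
    # dummy workload: each worker sums some squares and reports (index, result)
    partial = sum(x * x for x in range(times))
    q.put((i, partial))

if __name__ == '__main__':
    q = multiprocessing.Queue()
    pieces = multiprocessing.cpu_count()
    procs = [multiprocessing.Process(target=worker, args=(10**6, i, q))
             for i in range(pieces)]
    for p in procs:
        p.start()
    results = [None] * pieces
    for _ in range(pieces):
        i, partial = q.get()   # results may arrive in any order
        results[i] = partial   # the index puts them back in place
    for p in procs:
        p.join()
    print(results)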
Consider looking into Numba and its @jit decorator (a just-in-time compiler). It works best for functions that are NumPy based. It can handle some plain Python routines, but it is mainly for speeding up numerical calculations, especially ones with loops (like doing Cholesky rank-1 up/downdates). I don't think it would work with a BloomFilter, but it is generally super helpful to know about.
In cases where you must use other packages in your flow alongside numpy, separate out the heavy-lifting numpy routines into their own functions, and put an @jit decorator on top of each of those functions. Then call them from your flow with the normal Python stuff.
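For example (a minimal sketch, unrelated to the Bloom-filter code, assuming Numba is installed): the numerical kernel is isolated in its own function and decorated with @jit.
import numpy as np
from numba import jit

@jit(nopython=True)
def mean_of_squares(arr):
    # plain loop over a numpy array; numba compiles it to machine code
    total = 0.0
    for x in arr:
        total += x * x
    return total / arr.size

data = np.random.rand(10 ** 6)
print(mean_of_squares(data))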
Related
I have a nested for loop and I'd like to parallelize it as much as possible in Python.
Suppose I have some arbitrary function func(a, b) that accepts two arguments, and I'd like to compute it for all combinations of elements from two collections M and N.
What I've done so far is 'flatten' the indices as a dictionary
idx_map = {}
count = 0
for i in range(n):
    for j in range(m):
        idx_map[count] = (i, j)
        count += 1
Now that my nested loop is flattened, I can use it like so:
arr = []
for idx in range(n*m):
    i, j = idx_map[idx]
    arr.append(func(M[i], N[j]))
Can I use this with Python's built-in multiprocessing to parallelize it? Race conditions should not be an issue because I do not need to aggregate the func calls; I just want to arrive at some final array that evaluates all func(a, b) combinations across M and N. (So async behavior and complexity should not be relevant here.)
What's the best way to accomplish this effect?
I found the snippet below in a related question, but I don't understand what the author was trying to illustrate:
if 1: # multi-threaded
    pool = mp.Pool(28) # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
    print
You can accomplish this, yes; however, the amount of work you do per function call needs to be quite substantial to overcome the overhead of spawning processes.
Vectorizing with something like numpy is typically easier, as Jérôme stated previously.
I have altered your code so that you may observe the speed up you get by using multiprocessing.
Feel free to change the largNum variable to see how the multiprocessing scaling gets better as the amount of work per function call increases, and how at low values multiprocessing is actually slower.
from concurrent.futures import ProcessPoolExecutor
import time
# Sums the integers from 0 up to (a+b)**2
def costlyFunc(theArgs):
    a = theArgs[0]
    b = theArgs[1]
    topOfRange = (a + b) ** 2
    sum = 0
    for i in range(topOfRange):
        sum += i
    return sum

# changed to a list
idx_map = []
largNum = 200

# your index flattening
for i in range(largNum):
    for j in range(largNum):
        idx_map.append((i, j))
I use the map function in the single-core version to call costlyFunc on every element in the list. Python's concurrent.futures module has a similar map function, but it distributes the work over multiple processes.
if __name__ == "__main__":
    # No multiprocessing
    oneCoreTimer = time.time()
    result = [x for x in map(costlyFunc, idx_map)]
    oneCoreTime = time.time() - oneCoreTimer
    print(oneCoreTime, " seconds to complete the function without multiprocessing")

    # Multiprocessing
    mpTimer = time.time()
    with ProcessPoolExecutor() as ex:
        mpResult = [x for x in ex.map(costlyFunc, idx_map)]
    mpTime = time.time() - mpTimer
    print(mpTime, " seconds to complete the function with multiprocessing")

    print(f"Multiprocessing is {oneCoreTime/mpTime} times faster than using one core")
I have a very large matrix: a 2D array with a few hundred rows but around 2 million columns. My application needs to update this matrix row by row, with one painful constraint: to update a specific row, all columns of the previous row must already have been updated. The process is very slow.
For example:
matrix = [[0 for x in xrange(2000000)] for y in xrange(300)]
for i in xrange(1, 300):
    for j in xrange(2000000):
        k = a random column in row i-1
        matrix[i][j] = matrix[i-1][k] * simple_function(i)  # some calculation
To update matrix[i][j], I need the values of row i-1.
My initial thought was a multiprocess approach that parallelizes the j loop in each i round. However, the calculation in the j loop is too light: the process-creation cost is much higher than the calculation itself (I also tried a process pool).
My second thought was to use threads; that runs, but gives almost no performance gain because of the GIL.
I would like to know if there is any other approach that can accelerate my code. Thank you.
BTW, I know Cython can work without the GIL, but the calculation function needs to access a Python object and it would take a lot of work to modify the original code.
Use Numba for such tasks
https://numba.pydata.org/
If you want a matrix or an array, create one (not a list of lists). I would also recommend taking a look at some beginner tutorials on numpy.
Example code:
import numpy as np
import numba as nb
import time

def main():
    matrix = np.random.rand(2000000, 300)
    t1 = time.time()
    k = 100
    Testing(matrix, k)
    print(time.time() - t1)

@nb.jit(cache=True)
def Testing(matrix, k):
    for i in xrange(1, matrix.shape[0]):
        # inline your simple function (in my case np.power(i, 2))
        res_of_func = np.power(i, 2)
        for j in xrange(matrix.shape[1]):
            matrix[i, j] = matrix[i-1, k] * res_of_func

if __name__ == "__main__":
    main()
The matrix calculation takes 0.5 seconds on my machine (Haswell i7).
Parallelizing with Numba on Python 2 on Windows isn't working today, but the jit-compiled code should already be approximately 100 times faster.
Context: I have an array that I have scattered across my engines (4 engines at this time). I want to apply a function to each point in the array for an arbitrary number of iterations, then gather the resulting array from the engines and perform analysis on it.
For example, I have the array of data points that are scattered, and the number of iterations to run on each data point:
data_points = range(16)
iterations = 10
dview.scatter('points', data_points)
I have a user-supplied function like this, which is pushed to the engines:
def user_supplied_function(point):
    return randint(0, point)

dview.push(dict(function_one = user_supplied_function))
A list for my results and the parallel execution:
result_list = []
for i in range(iterations):
    %px engine_result = [function_one(j) for j in points]
    result_list.append(dview.gather('engine_result'))
Issue: This works, and I get the result I want from the engines. However, as the number of iterations grows, the loop takes longer and longer to execute, to the point where 1000 iterations on 50 points takes upwards of 15 seconds to complete, whereas a sequential version of the same task takes less than a second.
Any idea what could be causing this? Could it be the overhead from the message passing from gather()? If so can anyone suggest any solutions?
Figured it out. It was the overhead from gather() and .append() after all. The easiest fix is to gather() after the engines have finished their work, as opposed to doing it each iteration.
Solution
%autopx
engine_result = []
for i in xrange(iterations):
    engine_result += [[function_one(j) for j in points]]
%autopx

result_list = list(dview.gather('engine_result'))
This, however, gets the results in a poorly formatted list of lists where the results from each engine are placed next to each other instead of ordered by iteration number. The following commands distribute the lists and flatten the sublists for each iteration.
gathered_list = [None] * iterations
gathered_list = [[result_list[j * iterations + i] for j in xrange(len(result_list) / iterations)] for i in xrange(iterations)]
gathered_list = [reduce(lambda x, y: x.extend(y) or x, z) for z in gathered_list]
I wanted to test the difference in time between implementations of some simple code. I decided to count how many values out of a random sample of 10,000,000 numbers are greater than 0.5. The random sample is drawn uniformly from the range [0.0, 1.0).
Here is my code:
from numpy.random import random_sample
import time

n = 10000000

t1 = time.clock()
t = 0
z = random_sample(n)
for x in z:
    if x > 0.5: t += 1
print t

t2 = time.clock()
t = 0
for _ in xrange(n):
    if random_sample() > 0.5: t += 1
print t

t3 = time.clock()
t = (random_sample(n) > 0.5).sum()
print t

t4 = time.clock()
print t2-t1
print t3-t2
print t4-t3
This is the output:
4999445
4999511
5001498
7.0348236652
1.75569394301
0.202538106332
I get that the first implementation sucks because creating a massive array and then counting it element-wise is a bad idea, so I thought that the second implementation would be the most efficient.
But how is the third implementation 10 times faster than the second method? Doesn't the third method also create a massive array in the form of random_sample(n) and then go through it checking values against 0.5?
How is this third method different from the first method and why is it ~35 times faster than the first method?
EDIT: @merlin2011 suggested that Method 3 probably doesn't create the full array in memory. So, to test that theory I tried the following:
z = random_sample(n);
t = (z > 0.5).sum();
print t;
which runs in 0.197948451549 seconds, practically identical to Method 3. So this is probably not the factor.
Method 1 generates a full array in memory before looping over it. This is slow because the memory has to be allocated and then accessed, probably missing the cache multiple times.
Method 2 uses xrange, which never creates the full list in memory but instead produces each index on demand.
Method 3 is probably faster because sum() is implemented as a loop in C, but I am not 100% sure. My guess is that this is faster for the same reason that Matlab vectorization is faster than for loops in Matlab.
Update: Separating out each of the three steps, I observe that method 3 is still equally fast, so I have to agree with utdemir that each individual operator is executing instructions closer to machine code.
z = random_sample(n)
z2 = z > 0.5
t = z2.sum();
In each of the first two methods, you are invoking Python's standard functionality to do a loop, and this is much slower than a C-level loop that is baked into the implementation.
AFAIK
Function calls are heavy: in method two you're calling random_sample() 10,000,000 times, but in the third method you call it only once.
NumPy's > and .sum are optimized down to the last bit in C, most probably also using SIMD instructions to avoid explicit loops.
So,
in method 2 you are comparing and looping in Python, but in method 3 you're much closer to the processor, using optimized instructions to compare and sum.
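For illustration (a minimal sketch, not from the original answer), this is what the vectorized comparison and sum are doing:
import numpy as np

z = np.array([0.1, 0.7, 0.4, 0.9])
mask = z > 0.5        # element-wise comparison done in C: [False, True, False, True]
count = mask.sum()    # True counts as 1, so this is 2
print(count)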
I am trying to come up with a faster way of coding what I want to do. Here is the part of my program I am trying to speed up, hopefully by using more built-in functions:
num = 0
num1 = 0
rand1 = rand_pos[0:10]
time1 = time.clock()
for rand in rand1:
    for gal in gal_pos:
        num1 = dist(gal, rand)
        num = num + num1
time2 = time.clock()
time_elap = time2 - time1
print time_elap
Here, rand_pos and gal_pos are lists of length 900 and 1 million respectively.
Here dist is a function that calculates the distance between two points in Euclidean space.
I used a slice of rand_pos to get a time measurement.
My time measurement comes out to about 125 seconds. This is way too long!
It means that if I run the code over all the rand_pos, it will take about three hours to do!
Is there a faster way I can do this?
Here is the dist function:
def dist(pos1, pos2):
    n = 0
    dist_x = pos1[0] - pos2[0]
    dist_y = pos1[1] - pos2[1]
    dist_z = pos1[2] - pos2[2]
    if dist_x < radius and dist_y < radius and dist_z < radius:
        positions = [pos1, pos2]
        distance = scipy.spatial.distance.pdist(positions, metric='euclidean')
        if distance < radius:
            n = 1
    return n
While most of the optimization probably needs to happen within your dist function, there are some tips here to speed things up:
# Don't manually sum
for rand in rand1:
    num += sum([dist(gal, rand) for gal in gal_pos])

# If you can vectorize something, then do
import numpy as np
new_dist = np.vectorize(dist)
for rand in rand1:
    num += np.sum(new_dist(gal_pos, rand))

# Use already-built code whenever possible (as already suggested)
scipy.spatial.distance.cdist(gal_pos, rand1, metric='euclidean')
There is a function in scipy that does exactly what you want to do here:
scipy.spatial.distance.cdist(gal_pos, rand1, metric='euclidean')
It will probably be faster than anything you write in pure Python, since the heavy lifting (looping over the pairwise combinations between the arrays) is implemented in C.
Currently your loop runs in Python, which means there is more overhead per iteration, and on top of that you make many separate calls to pdist. Even though pdist is very optimized, the overhead of making so many calls to it slows down your code. This type of performance issue was once described to me with a very useful analogy: it's like trying to have a conversation with someone over the phone by saying one word per phone call. Even though each word goes across the line very fast, your conversation will take a long time because you have to hang up and dial again repeatedly.
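For instance, the whole nested loop collapses to one cdist call plus a vectorized count (a minimal sketch with synthetic stand-ins for the asker's gal_pos, rand1 and radius):
import numpy as np
from scipy.spatial.distance import cdist

# synthetic stand-ins for the asker's data
gal_arr = np.random.rand(1000000, 3)   # plays the role of gal_pos
rand_arr = np.random.rand(10, 3)       # plays the role of rand1
radius = 0.1

# all pairwise Euclidean distances at once, shape (1000000, 10)
d = cdist(gal_arr, rand_arr, metric='euclidean')
num = int((d < radius).sum())          # number of (gal, rand) pairs closer than radius
print(num)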