Trying to speed up python code by replacing loops with functions - python

I am trying to come up with a faster way of coding what I want to. Here is the part of my program I am trying to speed up, hopefully using more inbuilt functions:
num = 0
num1 = 0
rand1 = rand_pos[0:10]
time1 = time.clock()
for rand in rand1:
for gal in gal_pos:
num1 = dist(gal, rand)
num = num + num1
time2 = time.clock()
time_elap = time2-time1
print time_elap
Here, rand_pos and gal_pos are lists of length 900 and 1 million respectively.
Here dist is function where I calculate the distance between two points in euclidean space.
I used a snippet of the rand_pos to get a time measurement.
My time measurements are coming to be about 125 seconds. This is way too long!
It means that if I run the code over all the rand_pos, it will take about three hours to do!
Is there a faster way I can do this?
Here is the dist function:
def dist(pos1,pos2):
n = 0
dist_x = pos1[0]-pos2[0]
dist_y = pos1[1]-pos2[1]
dist_z = pos1[2]-pos2[2]
if dist_x<radius and dist_y<radius and dist_z<radius:
positions = [pos1,pos2]
distance = scipy.spatial.distance.pdist(positions, metric = 'euclidean')
if distance<radius:
n = 1
return n

While most of the optimization probably needs to happen within your dist function, there are some tips here to speed things up:
# Don't manually sum
for rand in rand1:
num += sum([dist(gal, rand) for gal in gal_pos])
#If you can vectorize something, then do
import numpy as np
new_dist = np.vectorize(dist)
for rand in rand1:
num += np.sum(new_dist(gal_pos, rand))
# use already-built code whenever possible (as already suggested)
scipy.spatial.distance.cdist(gal, rand1, metric='euclidean')

There is a function in scipy that does exactly what you want to do here:
scipy.spatial.distance.cdist(gal, rand1, metric='euclidean')
It will be faster than anything you write in pure Python probably, since the heavy lifting (looping over the pairwise combinations between arrays) is implemented in C.
Currently your loop is happening in Python, which means there is more overhead per iteration, then you are making many calls to pdist. Even though pdist is very optimized, the overhead of making so many calls to it slows down your code. This type of performance issue was once described to me with a very useful analogy: its like trying to have a conversation with someone over the phone by saying one word per phone call, even though each word is going across the line very fast, your conversation will take a long time because you need to hang up and dial again repeatedly.

Related

How to improve the runtime of this matrix generation loop in python3?

I have code that is simulating interactions between lots of particles. Using profiling, I've worked out that the function that's causing the most slowdown is a loop which iterates over all my particles and works out the time for the collision between each of them. This generates a symmetric matrix, which I then take the minimum value out of.
def find_next_collision(self, print_matrix = False):
"""
Sets up a matrix of collision times
Returns the indices of the balls in self.list_of_balls that are due to
collide next and the time to the next collision
"""
self.coll_time_matrix = np.zeros((np.size(self.list_of_balls), np.size(self.list_of_balls)))
for i in range(np.size(self.list_of_balls)):
for j in range(i+1):
if (j==i):
self.coll_time_matrix[i][j] = np.inf
else:
self.coll_time_matrix[i][j] = self.list_of_balls[i].time_to_collision(self.list_of_balls[j])
matrix = self.coll_time_matrix + self.coll_time_matrix.T
self.coll_time_matrix = matrix
ind = np.unravel_index(np.argmin(self.coll_time_matrix, axis = None), self.coll_time_matrix.shape)
dt = self.coll_time_matrix[ind]
if (print_matrix):
print(self.coll_time_matrix)
return dt, ind
This code is a method inside a class which defines the positions of all of the particles. Each of these particles is an object saved in self.list_of_balls (which is a list). As you can see, I'm already only iterating over half of this matrix, but it's still quite a slow function. I've tried using numba, but this is a section of quite a large code, and I don't want to have to optimize every function with numba when this is the slow one.
Does anyone have any ideas for a more efficient way to write this function?
Thank you in advance!
Like Raubsauger mentioned in their answer, evaluating ifs is slow
for j in range(i+1):
if (j==i):
You can get rid of this if by simply doing for j in range(i). That way j goes from 0 to i-1
You should also try to avoid loops when possible. You can do this by expressing your problem in a vectorized way, and using numpy or scipy functions that leverage SIMD operations to speed up calculations. Here's a simplified example assuming the time_to_collision simply divides Euclidean distance by speed. If you store the coordinates and speeds of the balls in a numpy array instead of storing ball objects in a list, you could do:
from scipy.spatial.distance import pdist
rel_distances = pdist(ball_coordinates)
rel_speeds = pdist(ball_speeds)
time = rel_distances / rel_speeds
pdist documentation
Of course, this isn't going to work verbatim if your time_to_collision function is more complicated, but it should point you in the right direction.
First Question: How many particles do you have?
If you have a lot of particles: one improvement would be
for i in range(np.size(self.list_of_balls)):
for j in range(i):
self.coll_time_matrix[i][j] = self.list_of_balls[i].time_to_collision(self.list_of_balls[j])
self.coll_time_matrix[i][i] = np.inf
often executed ifs slow everything down. Avoid them in innerloops
2nd question: is it necessary to compute this every time? Wouldn't it be faster to compute points in time and only refresh those lines and columns which were involved in a collision?
Edit:
Idea here is to initially calculate either the time left or (better solution) the timestamp for the collision as you already as well as the order. But instead to throw the calculated results away, you only update the values when needed. This way you need to calculate only 2*n instead of n^2/2 values.
Sketch:
# init step, done once at the beginning, might need an own function
matrix ... # calculate matrix like before; I asume that you use timestamps instead of time left
min_times = np.zeros(np.size(self.list_of_balls))
for i in range(np.size(self.list_of_balls)):
min_times[i] = min(self.coll_time_matrix[i])
order_coll = np.argsort(min_times)
ind = order_coll[0]
dt = self.coll_time_matrix[ind]
return dt, ind
# function step: if a collision happened, order_coll[0] and order_coll[1] hit each other
for balls in order_coll[0:2]:
for i in range(np.size(self.list_of_balls)):
self.coll_time_matrix[balls][i] = self.list_of_balls[balls].time_to_collision(self.list_of_balls[i])
self.coll_time_matrix[i][balls] = self.coll_time_matrix[balls][i]
self.coll_time_matrix[balls][balls] = np.inf
for i in range(np.size(self.list_of_balls)):
min_times[i] = min(self.coll_time_matrix[i])
order_coll = np.argsort(min_times)
ind = order_coll[0]
dt = self.coll_time_matrix[ind]
return dt, ind
In case you calculate time left in the matrix you have to subtract the time passed from the matrix. Also you somehow need to store the matrix and (optionally) min_times and order_coll.

Simulation for conditional probabilty problem in python

I am trying to simulate a simple conditional probability problem. You hae two boxes. If you open A you have a 50% change of wining the prize, If you open B you have a 75% chance of winning. With some simple (bad) python I have tired
But the appending doesn't work. Any thoughts on a neater way of doing this?
import random
import numpy as np
def liveORdie(prob):
#Takes an argument of the probability of survival
live = 0
for i in range(100):
if random.random() <= prob*1.0:
live =1
return live
def simulate(n):
trials = np.array([0])
for i in range(n):
if random.random() <= 0.5:
np.append(trials,liveORdie(0.5))
print(trials)
else:
np.append(trials,liveORdie(0.75))
return(sum(trials)/n)
simulate(10)
You could make the code tighter by using list comprehensions and numpy's array operations, like so:
import random
import numpy as np
def LiveOrDie():
prob = 0.5 if random.random()<=0.5 else 0.75
return np.sum(np.random.random(100)<=prob)
def simulate(n):
trials = [LiveOrDie() for x in range(n)]
return(sum(trials)/n)
Simulate(10)
append is a list operation; you're forcing it onto a numpy array, which is not the same thing. Stick with a regular list, since you're not using any extensions specific to an array.
def simulate(n):
trials = []
for i in range(n):
if random.random() <= 0.5:
trials.append(liveORdie(0.5))
Now look at your liveORdie routine. I don't think this is what you want: you loop 100 times to produce a single integer ... and if any one of your trials comes up successful, you return a 1. Since you haven't provided documentation for your algorithms, I'm not sure what you want, but I suspect that it's a list of 100 trials, rather than the conjunction of all 100. You need to append here, as well.
Better yet, run through a tutorial on list comprehension, and use those.
The loop in liveORdie() (please consider PEP8 for naming conventions) will cause the probability of winning to increase: each pass of the loop has a prop chance of winning, and you give it 100 tries, so with 50% resp. 75% you are extremely likely to win.
Unless I really misunderstood the problem, you probably just want
def live_or_die(prob):
return random.random() < prob
I'm pretty sure that just reduces to:
from numpy import mean
from numpy.random import choice
from scipy.stats import bernoulli
def simulate(n):
probs = choice([0.5, 0.75], n)
return 1-mean(bernoulli.rvs((1-probs)**100))
which as others have pointed out will basically always return 1 — 0.5**100 is ~1e-30.

How to make huge loops in python faster on mac?

I am a computer science student and some of the things I do require me to run huge loops on Macbook with dual core i5. Some the loops take 5-6 hours to complete but they only use 25% of my CPU. Is there a way to make this process faster? I cant change my loops but is there a way to make them run faster?
Thank you
Mac OS 10.11
Python 2.7 (I have to use 2.7) with IDLE or Spyder on Anaconda
Here is a sample code that takes 15 minutes:
def test_false_pos():
sumA = [0] * 1000
for test in range(1000):
counter = 0
bf = BloomFilter(4095,10)
for i in range(600):
bf.rand_inserts()
for x in range(10000):
randS = str(rnd.randint(0,10**8))
if bf.lookup(randS):
counter += 1
sumA[test] = counter/10000.0
avg = np.mean(sumA)
return avg
Sure thing: Python 2.7 has to generate huge lists and waste a lot of memory each time you use range(<a huge number>).
Try to use the xrange function instead. It doesn't create that gigantic list at once, it produces the members of a sequence lazily.
But if your were to use Python 3 (which is the modern version and the future of Python), you'll find out that there range is even cooler and faster than xrange in Python 2.
You could split it up into 4 loops:
import multiprocessing
def test_false_pos(times, i, q):
sumA = [0] * times
for test in range(times):
counter = 0
bf = BloomFilter(4095,10)
for i in range(600):
bf.rand_inserts()
for x in range(10000):
randS = str(rnd.randint(0,10**8))
if bf.lookup(randS):
counter += 1
sumA[test] = counter/10000.0
q.put([i, list(sumA)])
def full_test(pieces):
threads = []
q = multiprocessing.Queue()
steps = 1000 / pieces
for i in range(pieces):
threads.append(multiprocessing.Process(target=test_false_pos, args=(steps, i, q)))
[thread.start() for thread in threads]
results = [None] * pieces
for i in range(pieces):
i, result = q.get()
results[i] = result
# Flatten the array (`results` looks like this: [[...], [...], [...], [...]])
# source: https://stackoverflow.com/a/952952/5244995
sums = [value for result in results for val in result]
return np.mean(np.array(sums))
if __name__ == '__main__':
full_test(multiprocessing.cpu_count())
This will run n processes that each do 1/nth of the work, where n is the number of processors on your computer.
The test_false_pos function has been modified to take three parameters:
times is the number of times to run the loop.
i is passed through to the result.
q is a queue to add the results to.
The function loops times times, then places i and sumA into the queue for further processing.
The main thread (full_test) waits for each thread to complete, then places the results in the appropriate position in the results list. Once the list is complete, it is flattened, and the mean is calculated and returned.
Consider looking into Numba and Jit (just in time compiler). It works for functions that are Numpy based. It can handle some python routines, but is mainly for speeding up numerical calculations, especially ones with loops (like doing cholesky rank-1 up/downdates). I don't think it would work with a BloomFilter, but it is generally super helpful to know about.
In cases where you must use other packages in your flow with numpy, separate out the heavy-lifting numpy routines into their own functions, and throw a #jit decorator on top of that function. Then put them into your flows with normal python stuff.

optimizing low pass filter smoothing code for activity recognition [duplicate]

This question already has an answer here:
Recursive definitions in Pandas
(1 answer)
Closed 7 years ago.
I'm trying to implement a low pass filter on accelerometer data(with x-acceleration(ax), y-acceleration(ay), z-acceleration(az))
I have calculated my alpha to be 0.2
DC component along the x direction is calculated using the formula
new_ax[n] = (1-alpha)*new_ax[n-1] + (alpha * ax[n])
I'm able to calculate this for a small dataset with few thousand records. But I have a dataset with a million records and it takes forever to run with the below code. I would appreciate any help to improvise my code for time complexity.
### df is a pandas dataframe object
n_ax = []
seq = range(0, 1000000, 128)
for w in range(len(seq)):
prev_x = 0
if w+1 <= len(seq):
subdf = df[seq[w]:seq[w+1]]
for i in range(len(subdf)):
n_ax.append((1-alpha)*prev_x + (alpha*subdf.ax[i]))
prev_x = n_ax[i]
First it seems you don't need
if w+1 <= len(seq):
the w variable will not surpass len(seq).
So to decrease time processing just use numpy module:
import numpy;
Here you will find arrays and methods that are much faster than built-in list. For example instead of looping trough every element in a numpy array to do some processing you can apply a numpy function directly on the array and get the results in seconds rather than hours. as an example:
data = numpy.arange(0, 1000000, 128);
shiftData = numpy.arange(128, 1000000, 128);
result = (1-alpha)*data[:-1] + shiftdata;
Check some tutorials on numpy. I use this module for processing image data and by comparison looping through lists would have taken me 2 weeks to processes 5000+ image while using numpy types takes maximum 2 minutes.
Assuming you are using python 2.7.
Use xrange.
Computing len(seq) inside the loop is not necessary, since its value is not changing.
Accessing seq it is not really needed, since you can compute it on the fly.
You don't really need the if statement, since in your code it always evaluate to true (w in range(len(seq)) means w maximium value will be len(seq)-1).
The slicing you are doing to get subdf is not really necessary, since you can access directly df (and slicing creates a new list).
See the code below.
n_ax = []
SUB_SAMPLE = 128
SAMPLE_LEN = 1000000
seq_len = SAMPLE_LEN/SUB_SAMPLE
for w in xrange(seq_len):
prev_x = 0
for i in xrange(w*SUB_SAMPLE,(w+1)*SUB_SAMPLE):
new_x = (1-alpha)*prev_x + (alpha*df.ax[i])
n_ax.append(new_x)
prev_x = new_x
I cannot think any other obvious optimization. If this is still slow, perhaps you should consider copying df data to a python native data type. If these are all floats, use python array which gives very good performance.
And if you still need better performance, you can try parallelism with the multiprocessing module, or write a C module that takes an array in memory and does the computation, and call it with ctypes python library.

I don't understand why/how one of these methods is faster than the others

I wanted to test the difference in time between implementations of some simple code. I decided to count how many values out of a random sample of 10,000,000 numbers is greater than 0.5. The random sample is grabbed uniformly from the range [0.0, 1.0).
Here is my code:
from numpy.random import random_sample; import time;
n = 10000000;
t1 = time.clock();
t = 0;
z = random_sample(n);
for x in z:
if x > 0.5: t += 1;
print t;
t2 = time.clock();
t = 0;
for _ in xrange(n):
if random_sample() > 0.5: t += 1;
print t;
t3 = time.clock();
t = (random_sample(n) > 0.5).sum();
print t;
t4 = time.clock();
print t2-t1; print t3-t2; print t4-t3;
This is the output:
4999445
4999511
5001498
7.0348236652
1.75569394301
0.202538106332
I get that the first implementation sucks because creating a massive array and then counting it element-wise is a bad idea, so I thought that the second implementation would be the most efficient.
But how is the third implementation 10 times faster than the second method? Doesn't the third method also create a massive array in the form of random_sample(n) and then go through it checking values against 0.5?
How is this third method different from the first method and why is it ~35 times faster than the first method?
EDIT: #merlin2011 suggested that Method 3 probably doesn't create the full array in memory. So, to test that theory I tried the following:
z = random_sample(n);
t = (z > 0.5).sum();
print t;
which runs in a time of 0.197948451549 which is practically identical to Method 3. So, this is probably not a factor.
Method 1 generates a full list in memory before using it. This is slow because the memory has to be allocated and then accessed, probably missing the cache multiple times.
Method 2 uses an generator, which never creates the list in memory but instead generates each element on demand.
Method 3 is probably faster because sum() is implemented as a loop in C but I am not 100% sure. My guess is that this is faster for the same reason that Matlab vectorization is faster than for loops in Matlab.
Update: Separating out each of three steps, I observe that method 3 is still equally fast, so I have to agree with utdemir that each individual operator is executing instructions closer to machine code.
z = random_sample(n)
z2 = z > 0.5
t = z2.sum();
In each of the first two methods, you are invoking Python's standard functionality to do a loop, and this is much slower than a C-level loop that is baked into the implementation.
AFAIK
Function calls are heavy, on method two, you're calling random_sample() 10000000 times, but on third method, you just call it once.
Numpy's > and .sum are optimized to their last bits in C, also most probably using SIMD instructions to avoid loops.
So,
On method 2, you are comparing and looping using Python; but on method 3, you're much closer to the processor and using optimized instructions to compare and sum.

Categories

Resources