I have a function which updates the centroid (mean) in a K-means algorithm.
I ran a profiler and noticed that this function uses a lot of computing time.
It looks like:
def updateCentroid(self, label):
    X=[]; Y=[]
    for point in self.clusters[label].points:
        X.append(point.x)
        Y.append(point.y)
    self.clusters[label].centroid.x = numpy.mean(X)
    self.clusters[label].centroid.y = numpy.mean(Y)
So I ponder, is there a more efficient way to calculate the mean of these points?
If not, is there a more elegant way to formulate it? ;)
EDIT:
Thanks for all great responses!
I was thinking that perhaps I can calculate the mean cumulatively, using something like:
x_bar(t) = ((t-1) * x_bar(t-1) + x(t)) / t
where x_bar(t) is the new mean and x_bar(t-1) is the old mean.
Which would result in a function similar to this:
def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x *= (n-1) / n
    cluster.centroid.x += cluster.points[n-1].x / n
    cluster.centroid.y *= (n-1) / n
    cluster.centroid.y += cluster.points[n-1].y / n
It's not really working, but do you think this could work with some tweaking?
A K-means algorithm is already implemented in scipy.cluster.vq. If there is something about that implementation that you are trying to change, then I'd suggest starting by studying the code there:
In [62]: import scipy.cluster.vq as scv
In [64]: scv.__file__
Out[64]: '/usr/lib/python2.6/dist-packages/scipy/cluster/vq.pyc'
PS. Because the algorithm you posted holds the data behind a dict (self.clusters) and attribute lookup (.points) you are forced to use slow Python looping just to get at your data. A major speed gain could be achieved by sticking with numpy arrays. See the scipy implementation of k-means clustering for ideas on a better data structure.
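As a rough illustration of that kind of layout (a minimal sketch, assuming each cluster's points live in a single (N, 2) numpy array; the names here are made up for the example and are not from the original code):

import numpy as np

# Hypothetical layout: one (N, 2) array of x/y coordinates per cluster,
# instead of a dict of cluster objects holding lists of point objects.
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0]])

centroid = points.mean(axis=0)   # array([x_mean, y_mean]) in one vectorized call

With that structure the whole updateCentroid body collapses to a single mean(axis=0) call and the Python-level loop disappears.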
Why not avoid constructing the extra arrays?
def updateCentroid(self, label):
    sumX=0; sumY=0
    N = len(self.clusters[label].points)
    for point in self.clusters[label].points:
        sumX += point.x
        sumY += point.y
    self.clusters[label].centroid.x = sumX/N
    self.clusters[label].centroid.y = sumY/N
The costly part of your function is most certainly the iteration over the points. Avoid it altogether by making self.clusters[label].points a numpy array itself, and then compute the mean directly on it. For example, if points contains X and Y coordinates interleaved in a 1D array:
points = self.clusters[label].points
x_mean = numpy.mean(points[0::2])
y_mean = numpy.mean(points[1::2])
Without extra lists:
def updateCentroid(self, label):
    self.clusters[label].centroid.x = numpy.fromiter((point.x for point in self.clusters[label].points), dtype=float).mean()
    self.clusters[label].centroid.y = numpy.fromiter((point.y for point in self.clusters[label].points), dtype=float).mean()
Perhaps the added features of numpy's mean are adding a bit of overhead.
>>> def myMean(itr):
...     c = t = 0
...     for item in itr:
...         c += 1
...         t += item
...     return t / c
...
>>> import timeit
>>> a = range(20)
>>> t1 = timeit.Timer("myMean(a)","from __main__ import myMean, a")
>>> t1.timeit()
6.8293311595916748
>>> t2 = timeit.Timer("average(a)","from __main__ import a; from numpy import average")
>>> t2.timeit()
69.697283029556274
>>> t3 = timeit.Timer("average(array(a))","from __main__ import a; from numpy import average, array")
>>> t3.timeit()
51.65147590637207
>>> t4 = timeit.Timer("fromiter(a,npfloat).mean()","from __main__ import a; from numpy import average, fromiter,float as npfloat")
>>> t4.timeit()
18.513712167739868
Looks like numpy's best performance came when using fromiter.
Ok, I figured out a moving average solution which is fast without changing the data structures:
def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x = ((n-1)*cluster.centroid.x + cluster.points[n-1].x)/n
    cluster.centroid.y = ((n-1)*cluster.centroid.y + cluster.points[n-1].y)/n
This lowered the computation time (for the whole k-means algorithm) to 13% of the original. =)
Thank you all for some great insight!
Try this:
def updateCentroid(self, label):
    self.clusters[label].centroid.x = numpy.array([point.x for point in self.clusters[label].points]).mean()
    self.clusters[label].centroid.y = numpy.array([point.y for point in self.clusters[label].points]).mean()
That's the problem with profilers that only tell you about functions. This is the method I use, and it pinpoints costly lines of code, including points where functions are called.
That said, there's a general assumption that data structure is free, and it isn't. As @Michael-Anderson asked, why not avoid making an array? That's the first thing I saw in your code, that you're building arrays by appending. You don't need to.
One way to go is to add an x_sum and y_sum to your "clusters" object and sum the coordinates as points are added. If things are moving around, you can also update the sum as points move. Then getting the centroid is just a matter of dividing the x_sum and y_sum by the number of points. If your points are numpy vectors that can be added, then you don't even need to sum the components, just maintain a sum of all the vectors and multiply by 1/len at the end.
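A minimal sketch of that running-sum idea (the Cluster class, the x_sum/y_sum attributes, and the Point tuple below are hypothetical names, only there to make the example self-contained):

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])   # stand-in for the original point objects

class Cluster:
    def __init__(self):
        self.points = []
        self.x_sum = 0.0
        self.y_sum = 0.0

    def add_point(self, point):
        # keep the running sums up to date as points are added
        self.points.append(point)
        self.x_sum += point.x
        self.y_sum += point.y

    def centroid(self):
        n = len(self.points)
        return self.x_sum / n, self.y_sum / n

Getting the centroid is then O(1) per query instead of O(n).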
Related
I have some complicated function called dis(x), which returns a number.
I am making two lists, let's call them "indices" and "values". So what I do is the following:
for i in np.arange(0.01,4,0.01):
    values.append(dis(i))
    indices.append(i)
So I have the following problem: how do I find the index j (from indices) for which dis(j) (from values) is closest to some number k?
A combination of enumerate and the argmin function in numpy will do the job for you.
import numpy as np
values = []
indices = []
def dis(x):
    return 1e6*x**2

for i in np.arange(0.01,4,0.01):
    values.append(dis(i))
    indices.append(i)
target = 10000
closest_index = np.argmin([np.abs(x-target) for x in values])
print(closest_index)
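A small follow-up on the snippet above, if you also want the corresponding entry from the 'indices' list rather than just the position:

print(indices[closest_index])   # the i for which dis(i) is closest to target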
The way you are stating it, I see two options:
Brute force it (try many indices i, and then see which dis(i) ended up closest to k). Works best when dis is reasonably fast, and the possible indices are reasonably few.
Learn about optimization: https://en.wikipedia.org/wiki/Optimization_problem. This is a pretty extensive field, but the Python SciPy package has many optimization functions.
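As a hedged illustration of option 2 (assuming dis is cheap to evaluate and reasonably well behaved on the interval; minimize_scalar is one of the SciPy optimization functions mentioned above, and the dis below is just a placeholder):

from scipy.optimize import minimize_scalar

def dis(x):
    return 1e6 * x**2          # placeholder for the real, complicated function

k = 10000
# minimise |dis(x) - k| over the same range the original loop scanned
res = minimize_scalar(lambda x: abs(dis(x) - k), bounds=(0.01, 4), method='bounded')
print(res.x, dis(res.x))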
Using numpy
closestindice = np.argmin(np.abs(np.array(values)-k))
But it is a bit strange, as it does not use the 'indices' list.
Maybe you could skip the definition of the 'indices' list and get the values in a numpy array.
import numpy as np
def dis(i):
    return ((i-1)**2)
nprange = np.arange(0.01, 4, 0.01)
npvalues = np.array([dis(x) for x in nprange])
k = .5
closestindice = np.abs(npvalues-k).argmin()
print(closestindice, npvalues[closestindice])
Output:
28 0.5041
By the way, if the 'dis' function is not monotone on the range, you could have more than one correct answer, on both sides of a local extremum.
So I'm programming a Basin Hopping algorithm from scratch to find the potential minima of a system of interacting particles, and I have obtained good results so far, but as I increase the number of particles in the system, the code takes more time to run. I am using scipy conjugate gradient for finding local minima. I'm not getting any error messages and the program seems to work out fine, but I was wondering how I can optimize the code so the computing time is reduced.
I defined the Basin Hopping as a function, given:
def bhmet(n, itmax, pot, derpot, T):
    i = 1
    phi = np.random.rand(n)*2*np.pi
    theta = np.random.rand(n)*np.pi
    x = 3*np.sin(theta)*np.cos(phi)
    y = 3*np.sin(theta)*np.sin(phi)
    z = 3*np.cos(theta)
    xyzat = np.hstack((x, y, z))
    vmintot = 0
    while i <= itmax:
        print(i)
        plmin = optimize.fmin_cg(pot, xyzat, derpot, gtol=1e-5)  # positions for the local minimum
        if pot(plmin) < vmintot or np.exp((1/T)*(vmintot-pot(plmin))) > np.random.rand():
            vmintot = pot(plmin)
        xyzat = plmin + 2*0.3*(np.random.rand(len(plmin))-0.5)
        i = i + 1
    return plmin, vmintot
I tried defining the initial condition (the first 'xyzat') as a matrix, but the parameter of scipy.optimize.fmin_cg is required to be an array (and that's why inside the functions I reshape the array into a matrix).
The function I'm searching the global minima for is:
def ljpot(posiciones):
    r = np.array([])
    matpos = np.zeros((int((1/3)*len(posiciones)), 3))
    matpos[:,0] = posiciones[0:int((1/3)*len(posiciones))]
    matpos[:,1] = posiciones[int((1/3)*len(posiciones)):int((2/3)*len(posiciones))]
    matpos[:,2] = posiciones[int((2/3)*len(posiciones)):]
    for j in range(0, np.shape(matpos)[0]):
        for k in range(j+1, np.shape(matpos)[0]):
            ri = np.sqrt(sum((matpos[k,:]-matpos[j,:])**2))
            r = np.append(r, ri)
    V = 4*((1/r)**12-(1/r)**6)
    vt = sum(V)
    return vt
And its gradient would be:
def gradpot(posiciones):
    gradv = np.array([])
    matposg = np.zeros((int((1/3)*len(posiciones)), 3))
    matposg[:,0] = posiciones[:int((1/3)*len(posiciones))]
    matposg[:,1] = posiciones[int((1/3)*len(posiciones)):int((2/3)*len(posiciones))]
    matposg[:,2] = posiciones[int((2/3)*len(posiciones)):]
    for w in range(0, np.shape(matposg)[1]):      # index that runs over the columns
        for k in range(0, np.shape(matposg)[0]):  # index that runs over the rows
            rkj = np.array([])
            xkj = np.array([])
            for j in range(0, np.shape(matposg)[0]):  # this one also runs over the rows
                if j != k:
                    r = np.sqrt(sum((matposg[j,:]-matposg[k,:])**2))
                    rkj = np.append(rkj, r)
                    xkj = np.append(xkj, matposg[j,w])
            dEdxj = sum(4*(6*(1/rkj)**8 - 12*(1/rkj)**14)*(matposg[k,w]-xkj))
            gradv = np.append(gradv, dEdxj)
    return gradv
The reason I transform the input array into a matrix is that for each particle there are three coordinates x, y, z, so the columns of the matrix are the x's, y's and z's of every particle. I tried to do this using np.reshape(), but it seemed to give me wrong results for systems for which the program had already gotten the correct ones.
The code seems to work out fine, but as I increase the number of particles the running time increases exponentially. I know global optimization can take a long time, but maybe I'm messing something up in the code. I don't know if the way to reduce the running time is obvious; I'm kinda new to optimizing code, so sorry if that's the case. Of course, any advice is welcome. Thank you all very much!
Two things I noticed after a quick look where you can definitely save some time. Both require some thinking up front, but afterwards you will be rewarded with optimised and cleaner code.
1. Try to avoid using append. append is highly inefficient. You start with an empty array, and afterwards need to allocate more memory each time. This leads to inefficient memory handling, as you copy your array each time you append a number. The longer the array gets, the more inefficient append becomes.
Alternative: Use np.zeros((m,n)) to initialise the array, with m and n being the size it will end up with. Then you need a counter that puts the new values in the corresponding places. If the size of the array is not defined before your calculation, you can initialise it as a big number, and afterwards cut it.
2. Try to avoid using for loops. They are generally very slow, especially when iterating through large matrices, as you need to index each entry individually.
Alternative: Matrix operations are generally much faster. For example, instead of r = np.sqrt(sum((matposg[j,:]-matposg[k,:])**2)) inside two for loops, you could first define two matrices A and B that correspond to matposg[j,:] and matposg[k,:] (should be possible without the use of loops!) and then simply use r = np.linalg.norm(A-B, axis=1).
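For example, here is a sketch of what a fully vectorized ljpot could look like using scipy.spatial.distance.pdist (it assumes the same flat x...,y...,z... layout as the original function, and as a bonus it also removes the np.append calls from point 1):

import numpy as np
from scipy.spatial.distance import pdist

def ljpot_vec(posiciones):
    # reshape the flat (x..., y..., z...) vector into an (n, 3) coordinate matrix
    n = len(posiciones) // 3
    matpos = posiciones.reshape(3, n).T
    r = pdist(matpos)                 # all unique pairwise distances, no Python loop
    return np.sum(4*((1/r)**12 - (1/r)**6))

The same idea (pdist plus broadcasting) can be applied to gradpot, although the bookkeeping there is a bit more involved.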
I have code that is simulating interactions between lots of particles. Using profiling, I've worked out that the function that's causing the most slowdown is a loop which iterates over all my particles and works out the time for the collision between each of them. This generates a symmetric matrix, which I then take the minimum value out of.
def find_next_collision(self, print_matrix=False):
    """
    Sets up a matrix of collision times
    Returns the indices of the balls in self.list_of_balls that are due to
    collide next and the time to the next collision
    """
    self.coll_time_matrix = np.zeros((np.size(self.list_of_balls), np.size(self.list_of_balls)))
    for i in range(np.size(self.list_of_balls)):
        for j in range(i+1):
            if (j==i):
                self.coll_time_matrix[i][j] = np.inf
            else:
                self.coll_time_matrix[i][j] = self.list_of_balls[i].time_to_collision(self.list_of_balls[j])
    matrix = self.coll_time_matrix + self.coll_time_matrix.T
    self.coll_time_matrix = matrix
    ind = np.unravel_index(np.argmin(self.coll_time_matrix, axis=None), self.coll_time_matrix.shape)
    dt = self.coll_time_matrix[ind]
    if (print_matrix):
        print(self.coll_time_matrix)
    return dt, ind
This code is a method inside a class which defines the positions of all of the particles. Each of these particles is an object saved in self.list_of_balls (which is a list). As you can see, I'm already only iterating over half of this matrix, but it's still quite a slow function. I've tried using numba, but this is a section of quite a large code, and I don't want to have to optimize every function with numba when this is the slow one.
Does anyone have any ideas for a more efficient way to write this function?
Thank you in advance!
Like Raubsauger mentioned in their answer, evaluating ifs is slow:
for j in range(i+1):
    if (j==i):
You can get rid of this if by simply doing for j in range(i). That way j goes from 0 to i-1.
You should also try to avoid loops when possible. You can do this by expressing your problem in a vectorized way, and using numpy or scipy functions that leverage SIMD operations to speed up calculations. Here's a simplified example assuming the time_to_collision simply divides Euclidean distance by speed. If you store the coordinates and speeds of the balls in a numpy array instead of storing ball objects in a list, you could do:
from scipy.spatial.distance import pdist
rel_distances = pdist(ball_coordinates)
rel_speeds = pdist(ball_speeds)
time = rel_distances / rel_speeds
pdist documentation
Of course, this isn't going to work verbatim if your time_to_collision function is more complicated, but it should point you in the right direction.
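If you do go down that road, a possible follow-up (a hedged sketch with made-up example data; squareform is the standard SciPy helper for turning the condensed pdist vector back into a square matrix) for recovering which pair of balls collides first:

import numpy as np
from scipy.spatial.distance import pdist, squareform

ball_coordinates = np.random.rand(5, 2)     # example data: 5 balls in 2D
ball_speeds = np.random.rand(5, 2)
time = pdist(ball_coordinates) / pdist(ball_speeds)

time_matrix = squareform(time)              # condensed vector -> symmetric square matrix
np.fill_diagonal(time_matrix, np.inf)       # a ball never collides with itself
ind = np.unravel_index(np.argmin(time_matrix), time_matrix.shape)
dt = time_matrix[ind]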
First Question: How many particles do you have?
If you have a lot of particles: one improvement would be
for i in range(np.size(self.list_of_balls)):
    for j in range(i):
        self.coll_time_matrix[i][j] = self.list_of_balls[i].time_to_collision(self.list_of_balls[j])
    self.coll_time_matrix[i][i] = np.inf
Often-executed ifs slow everything down; avoid them in inner loops.
Second question: is it necessary to compute this every time? Wouldn't it be faster to compute the points in time once and only refresh those rows and columns which were involved in a collision?
Edit:
The idea here is to initially calculate either the time left or (better solution) the timestamp for each collision, as you already do, as well as the order. But instead of throwing the calculated results away, you only update the values when needed. This way you need to calculate only 2*n instead of n^2/2 values.
Sketch:
# init step, done once at the beginning, might need its own function
matrix = ...  # calculate the matrix like before; I assume that you use timestamps instead of time left
min_times = np.zeros(np.size(self.list_of_balls))
for i in range(np.size(self.list_of_balls)):
    min_times[i] = min(self.coll_time_matrix[i])
order_coll = np.argsort(min_times)
ind = order_coll[0]
dt = self.coll_time_matrix[ind]
return dt, ind

# function step: if a collision happened, order_coll[0] and order_coll[1] hit each other
for balls in order_coll[0:2]:
    for i in range(np.size(self.list_of_balls)):
        self.coll_time_matrix[balls][i] = self.list_of_balls[balls].time_to_collision(self.list_of_balls[i])
        self.coll_time_matrix[i][balls] = self.coll_time_matrix[balls][i]
    self.coll_time_matrix[balls][balls] = np.inf
for i in range(np.size(self.list_of_balls)):
    min_times[i] = min(self.coll_time_matrix[i])
order_coll = np.argsort(min_times)
ind = order_coll[0]
dt = self.coll_time_matrix[ind]
return dt, ind
In case you calculate time left in the matrix you have to subtract the time passed from the matrix. Also you somehow need to store the matrix and (optionally) min_times and order_coll.
I am trying to simulate a simple conditional probability problem. You have two boxes. If you open A you have a 50% chance of winning the prize; if you open B you have a 75% chance of winning. With some simple (bad) Python I have tried:
But the appending doesn't work. Any thoughts on a neater way of doing this?
import random
import numpy as np
def liveORdie(prob):
    # Takes an argument of the probability of survival
    live = 0
    for i in range(100):
        if random.random() <= prob*1.0:
            live = 1
    return live

def simulate(n):
    trials = np.array([0])
    for i in range(n):
        if random.random() <= 0.5:
            np.append(trials, liveORdie(0.5))
            print(trials)
        else:
            np.append(trials, liveORdie(0.75))
    return(sum(trials)/n)
simulate(10)
You could make the code tighter by using list comprehensions and numpy's array operations, like so:
import random
import numpy as np
def LiveOrDie():
    prob = 0.5 if random.random()<=0.5 else 0.75
    return np.sum(np.random.random(100)<=prob)

def simulate(n):
    trials = [LiveOrDie() for x in range(n)]
    return(sum(trials)/n)

simulate(10)
append is a list operation; you're forcing it onto a numpy array, which is not the same thing. Stick with a regular list, since you're not using any extensions specific to an array.
def simulate(n):
    trials = []
    for i in range(n):
        if random.random() <= 0.5:
            trials.append(liveORdie(0.5))
Now look at your liveORdie routine. I don't think this is what you want: you loop 100 times to produce a single integer ... and if any one of your trials comes up successful, you return a 1. Since you haven't provided documentation for your algorithms, I'm not sure what you want, but I suspect that it's a list of 100 trials, rather than the conjunction of all 100. You need to append here, as well.
Better yet, run through a tutorial on list comprehension, and use those.
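For instance, a minimal sketch of simulate() using a list comprehension (it reuses the liveORdie from the question as-is, so it is only as correct as that routine):

import random

def simulate(n):
    # pick a box for each trial, then run the question's liveORdie for that box
    trials = [liveORdie(0.5 if random.random() <= 0.5 else 0.75) for _ in range(n)]
    return sum(trials) / n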
The loop in liveORdie() (please consider PEP8 for naming conventions) will cause the probability of winning to increase: each pass of the loop has a prob chance of winning, and you give it 100 tries, so with 50% resp. 75% you are extremely likely to win.
Unless I really misunderstood the problem, you probably just want
def live_or_die(prob):
    return random.random() < prob
I'm pretty sure that just reduces to:
from numpy import mean
from numpy.random import choice
from scipy.stats import bernoulli
def simulate(n):
    probs = choice([0.5, 0.75], n)
    return 1-mean(bernoulli.rvs((1-probs)**100))
which as others have pointed out will basically always return 1 — 0.5**100 is ~1e-30.
I have two numpy arrays that describe a spatial curve and intersect at one point. I want to find the nearest value in both arrays for that intersection point. I have this code that works fine, but it's too slow for a large number of points.
from scipy import spatial
def nearest(arr0, arr1):
    ptos = []
    j = 0
    for i in arr0:
        distance, index = spatial.KDTree(arr1).query(i)
        ptos.append([distance, index, j])
        j += 1
    ptos.sort()
    return (arr1[ptos[0][1]].tolist(), ptos[0][1], ptos[0][2])
the result will be (<point coordinates>,<position in arr1>,<position in arr0>)
Your code is doing a lot of things you don't need. First, you're rebuilding the KDTree on every loop iteration, and that's a waste. Also, query takes an array of points, so there is no need to write your own loop. ptos is an odd data structure, and you don't need it (and don't need to sort it). Try something like this.
from scipy import spatial
def nearest(arr0, arr1):
    tree = spatial.KDTree(arr1)
    distance, arr1_index = tree.query(arr0)
    best_arr0 = distance.argmin()
    best_arr1 = arr1_index[best_arr0]
    two_closest_points = (arr0[best_arr0], arr1[best_arr1])
    return two_closest_points, best_arr1, best_arr0
If that still isn't fast enough, you'll need to describe your problem in more detail and figure out if another search algorithm will work better for your problem.