I am trying to implement a K-means algorithm in Python (I know there is libraries for that, but I want to learn how to implement it myself.) Here is the function I am havin problem with:
def AssignPoints(points, centroids):
"""
Takes two arguments:
points is a numpy array such that points.shape = m , n where m is number of examples,
and n is number of dimensions.
centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
k < m should hold.
Returns:
numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
"""
m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
distances = np.hypot(*temp)
return distances.argmin(axis=1)
Purpose of this function, given m points in n dimensional space, and k centroids in n dimensional space, produce a numpy array of (x1 x2 x3 x4 ... xm) where x1 is the index of centroid which is closest to first point. This was working fine, until I tried it with 4 dimensional examples. When I try to put 4 dimensional examples, I get this error:
File "/path/to/the/kmeans.py", line 28, in AssignPoints
distances = np.hypot(*temp)
ValueError: invalid number of arguments
How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?
My Answer
def AssignPoints(points, centroids):
m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
for i in xrange(len(temp)):
temp[i] = temp[i] ** 2
distances = np.add.reduce(temp) ** 0.5
return distances.argmin(axis=1)
Try this:
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
Or:
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
And don't ask what's it doing :D
Edit: nah, just kidding. The broadcasting in the middle (points[np.newaxis] - centroids[:,np.newaxis]) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it diffs.
Then we do the usual operation to calculate the euclidean distance (square root of the squares of differences): np.sqrt((diffs ** 2).sum(axis=2)). We end up with a (k, m) matrix where row 0 contain the distances to centroids[0], etc. So, the .argmin(axis=0) gives you the result you wanted.
You need to define a distance function where you are using hypot. Usually in K-means it is
Distance=sum((point-centroid)^2)
Here is some matlab code that does it ... I can port it if you can't, but give it a go. Like you said, only way to learn.
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
% idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
% in idx for a dataset X where each row is a single example. idx = m x 1
% vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% Go over every example, find its closest centroid, and store
% the index inside idx at the appropriate location.
% Concretely, idx(i) should contain the index of the centroid
% closest to example i. Hence, it should be a value in the
% range 1..K
%
for loop=1:numberOfExamples
Distance = sum(bsxfun(#minus,X(loop,:),centroids).^2,2);
[value index] = min(Distance);
idx(loop) = index;
end;
end
UPDATE
This should return the distance, notice that the above matlab code just returns the distance(and index) of the closest centroid...your function returns all distances, as does the one below.
def FindDistance(X,centroids):
K=shape(centroids)[0]
examples, dimensions = shape(X)
distance = zeros((examples,K))
for ex in xrange(examples):
distance[ex,:] = np.sum((X[ex,:]-centroids)**2,1)
return distance
Related
I want to check which data points within X are close to each other and which are far. by calculating the distances between each other without getting to zero, is it possible?
X = np.random.rand(20, 10)
dist = (X - X) ** 2
print(X)
Using just numpy you can either do,
np.linalg.norm((X - X[:,None]),axis=-1)
or,
np.sqrt(np.square(X - X[:,None]).sum(-1))
Another possible solution:
from scipy.spatial.distance import cdist
X = np.random.rand(20, 10)
cdist(X, X)
You can go though each point in sequence
X = np.random.rand(20, 10)
no_points = X.shape[0]
distances = np.zeros((no_points, no_points))
for i in range(no_points):
for j in range(no_points):
distances[i, j] = np.linalg.norm(X[i, :] - X[j, :])
print(distances,np.max(distances))
I would assume you want a way to actually get some way of keeping track of the distances, correct? If so, you can easily build a dictionary that will contain the distances as the keys and a list of tuples that correspond to the points as the value. Then you would just need to iterate through the keys in asc order to get the distances from least to greatest and the points that correspond to that distance. One way to do so would be to just brute force each possible connection between points.
dist = dict()
X = np.random.rand(20, 10)
for indexOfNumber1 in range(len(X) - 1):
for indexOfNumber2 in range(1, len(X)):
distance = sqrt( (X[indexOfNumber1] - X[indexOfNumber2])**2 )
if distance not in dist.keys():
dist[distance] = [tuple(X[indexOfNumber1], X[indexOfNumber2])]
else:
dist[distance] = dist[distance].append(tuple(X[indexOfNumber1], X[indexOfNumber2]))
The code above will then have a dictionary dist that contains all of the possible distances from the points you are looking at and the corresponding points that achieve that distance.
What would you do if you had n particles on a plane (with positions (x_n,y_n)), with a certain flux flux_n, and you have to pixelate these particles, so you have to go from (x,y) to (pixel_i, pixel_j) space and you have to sum up the flux of the m particles which fall in to every single pixel? Any suggestions? Thank you!
The are several ways with which you can solve your problem.
Assumptions: your positions have been stored into two numpy array of shape (N, ), i.e. the position x_n (or y_n) for n in [0, N), let's call them x and y. The flux is stored into a numpy array with the same shape, fluxes.
1 - INTENSIVE CASE
Create something that looks like a grid:
#get minimums and maximums position
mins = int(x.min()), int(y.min())
maxs = int(x.max()), int(y.max())
#actually you can also add and subtract 1 or more unit
#in order to have a grid larger than the x, y extremes
#something like mins-=epsilon and maxs += epsilon
#create the grid
xx = np.arange(mins[0], maxs[0])
yy = np.arange(mins[1], maxs[1])
Now you can perform a double for loop, tacking, each time, two consecutive elements of xx and yy, to do this, you can simple take:
x1 = xx[:-1] #excluding the last element
x2 = xx[1:] #excluding the first element
#the same for y:
y1 = yy[:-1] #excluding the last element
y2 = yy[1:] #excluding the first element
fluxes_grid = np.zeros((xx.shape[0], yy.shape[0]))
for i, (x1_i, x2_i) in enumerate(zip(x1, x2)):
for j, (y1_j, y2_j) in enumerate(zip(y1, y2)):
idx = np.where((x>=x1_i) & (x<x2_i) & (y>=y1_j) & (y<y2_j))[0]
fluxes_grid[i,j] = np.sum(fluxes[idx])
At the end of this loop you have a grid whose elements are pixels representing the sum of fluxes.
2 - USING A QUANTIZATION ALGORITHM LIKE K-NN
What happen if you have a lot o points, so many that the loop takes hours?
A faster solution is to use a quantization method, like K Nearest Neighbor, KNN on a rigid grid. There are many way to run a KNN (included already implemented version, e.g. sklearn KNN). But this is vary efficient if you can take advantage of a GPU. For example this my tensorflow (vs 2.1) implementation. After you have defined a squared grid:
_min, maxs = min(mins), max(maxs)
xx = np.arange(_min, _max)
yy = np.arange(_min, _max)
You can build the matrix, grid, and your position matrix, X:
grid = np.column_stack([xx, yy])
X = np.column_stack([x, y])
then you have to define a matrix euclidean pairwise-distance function:
#tf.function
def pairwise_dist(A, B):
# squared norms of each row in A and B
na = tf.reduce_sum(tf.square(A), 1)
nb = tf.reduce_sum(tf.square(B), 1)
# na as a row and nb as a co"lumn vectors
na = tf.reshape(na, [-1, 1])
nb = tf.reshape(nb, [1, -1])
# return pairwise euclidead difference matrix
D = tf.sqrt(tf.maximum(na - 2*tf.matmul(A, B, False, True) + nb, 0.0))
return D
Thus:
#compute the pairwise distances:
D = pairwise_dist(grid, X)
D = D.numpy() #get a numpy matrix from a tf tensor
#D has shape M, N, where M is the number of points in the grid and N the number of positions.
#now take a rank and from this the best K (e.g. 10)
ranks = np.argsort(D, axis=1)[:, :10]
#for each point in the grid you have the nearest ten.
Now you have to take the fluxes corresponding to this 10 positions and sum on them.
I had avoid to further specify this second method, I don't know the dimension of your catalogue, if you have or not a GPU or if you want to use such kind of optimization.
If you want I can improve this explanation, only if you are interested.
I'm implementing the PC algorithm in python. Such algorithm constructs the graphical model of a n-variate gaussian distribution. This graphical model is basically the skeleton of a directed acyclic graph, which means that if a structure like:
(x1)---(x2)---(x3)
Is in the graph, then x1 is independent by x3 given x2. More generally if A is the adjacency matrix of the graph and A(i,j)=A(j,i) = 0 (there is a missing edge between i and j) then i and j are conditionally independent, by all the variables that appear in any path from i to j. For statistical and machine learning purposes, it is be possible to "learn" the underlying graphical model.
If we have enough observations of a jointly gaussian n-variate random variable we could use the PC algorithm that works as follows:
given n as the number of variables observed, initialize the graph as G=K(n)
for each pair i,j of nodes:
if exists an edge e from i to j:
look for the neighbours of i
if j is in neighbours of i then remove j from the set of neighbours
call the set of neighbours k
TEST if i and j are independent given the set k, if TRUE:
remove the edge e from i to j
This algorithm computes also the separating set of the graph, that are used by another algorithm that constructs the dag starting from the skeleton and the separation set returned by the pc algorithm. This is what i've done so far:
def _core_pc_algorithm(a,sigma_inverse):
l = 0
N = len(sigma_inverse[0])
n = range(N)
sep_set = [ [set() for i in n] for j in n]
act_g = complete(N)
z = lambda m,i,j : -m[i][j]/((m[i][i]*m[j][j])**0.5)
while l<N:
for (i,j) in itertools.permutations(n,2):
adjacents_of_i = adj(i,act_g)
if j not in adjacents_of_i:
continue
else:
adjacents_of_i.remove(j)
if len(adjacents_of_i) >=l:
for k in itertools.combinations(adjacents_of_i,l):
if N-len(k)-3 < 0:
return (act_g,sep_set)
if test(sigma_inverse,z,i,j,l,a,k):
act_g[i][j] = 0
act_g[j][i] = 0
sep_set[i][j] |= set(k)
sep_set[j][i] |= set(k)
l = l + 1
return (act_g,sep_set)
a is the tuning-parameter alpha with which i will test for conditional independence, and sigma_inverse is the inverse of the covariance matrix of the sampled observations. Moreover, my test is:
def test(sigma_inverse,z,i,j,l,a,k):
def erfinv(x): #used to approximate the inverse of a gaussian cumulative density function
sgn = 1
a = 0.147
PI = numpy.pi
if x<0:
sgn = -1
temp = 2/(PI*a) + numpy.log(1-x**2)/2
add_1 = temp**2
add_2 = numpy.log(1-x**2)/a
add_3 = temp
rt1 = (add_1-add_2)**0.5
rtarg = rt1 - add_3
return sgn*(rtarg**0.5)
def indep_test_ijK(K): #compute partial correlation of i and j given ONE conditioning variable K
part_corr_coeff_ij = z(sigma_inverse,i,j) #this gives the partial correlation coefficient of i and j
part_corr_coeff_iK = z(sigma_inverse,i,K) #this gives the partial correlation coefficient of i and k
part_corr_coeff_jK = z(sigma_inverse,j,K) #this gives the partial correlation coefficient of j and k
part_corr_coeff_ijK = (part_corr_coeff_ij - part_corr_coeff_iK*part_corr_coeff_jK)/((((1-part_corr_coeff_iK**2))**0.5) * (((1-part_corr_coeff_jK**2))**0.5)) #this gives the partial correlation coefficient of i and j given K
return part_corr_coeff_ijK == 0 #i independent from j given K if partial_correlation(i,k)|K == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
def indep_test():
n = len(sigma_inverse[0])
phi = lambda p : (2**0.5)*erfinv(2*p-1)
root = (n-len(k)-3)**0.5
return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)
if l == 0:
return z(sigma_inverse,i,j) == 0 #i independent from j <=> partial_correlation(i,j) == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
elif l == 1:
return indep_test_ijK(k[0])
elif l == 2:
return indep_test_ijK(k[0]) and indep_test_ijK(k[1]) #ASSUMING THAT IJ ARE INDEPENDENT GIVEN Y,Z <=> IJ INDEPENDENT GIVEN Y AND IJ INDEPENDENT GIVEN Z
else: #i have to use the independent test with the z-fisher function
return indep_test()
Where z is a lambda that receives a matrix (the inverse of the covariance matrix), an integer i, an integer j and it computes the partial correlation of i and j given all the rest of variables with the following rule (which I read in my teacher's slides):
corr(i,j)|REST = -var^-1(i,j)/sqrt(var^-1(i,i)*var^-1(j,j))
The main core of this application is the indep_test() function:
def indep_test():
n = len(sigma_inverse[0])
phi = lambda p : (2**0.5)*erfinv(2*p-1)
root = (n-len(k)-3)**0.5
return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)
This function implements a statistical test which uses the fisher's z-transform of estimated partial correlations. I am using this algorithm in two ways:
Generate data from a linear regression model and compare the learned DAG with the expected one
Read a dataset and learn the underlying DAG
In both cases i do not always get correct results, either because I know the DAG underlying a certain dataset, or because i know the generative model but it does not coincide with the one my algorithm learns. I perfectly know that this is a non-trivial task and I may have misunderstand theoretical concept as well as committed error even in parts of the code i have omitted here; but first i'd like to know (from someone who is more experienced than me), if the test i wrote is right, and also if there are library functions that perform this kind of tests, i tried searching but i couldn't find any suitable function.
I get to the point. The most critical issue in the above code, regards the following error:
sqrt(n-len(k)-3)*abs(z(sigma_inverse[i][j])) <= phi(1-alpha/2)
I was mistaking the mean of n, it is not the size of the precision matrix but the number of total multi-variate observations (in my case, 10000 instead of 5). Another wrong assumption is that z(sigma_inverse[i][j]) has to provide the partial correlation of i and j given all the rest. That's not correct, z is the Fisher's transform on a proper subset of the precision matrix which estimates the partial correlation of i and j given the K. The correct test is the following:
if len(K) == 0: #CM is the correlation matrix, we have no variables conditioning (K has 0 length)
r = CM[i, j] #r is the partial correlation of i and j
elif len(K) == 1: #we have one variable conditioning, not very different from the previous version except for the fact that i have not to compute the correlations matrix since i start from it, and pandas provide such a feature on a DataFrame
r = (CM[i, j] - CM[i, K] * CM[j, K]) / math.sqrt((1 - math.pow(CM[j, K], 2)) * (1 - math.pow(CM[i, K], 2))) #r is the partial correlation of i and j given K
else: #more than one conditioning variable
CM_SUBSET = CM[np.ix_([i]+[j]+K, [i]+[j]+K)] #subset of the correlation matrix i'm looking for
PM_SUBSET = np.linalg.pinv(CM_SUBSET) #constructing the precision matrix of the given subset
r = -1 * PM_SUBSET[0, 1] / math.sqrt(abs(PM_SUBSET[0, 0] * PM_SUBSET[1, 1]))
r = min(0.999999, max(-0.999999,r))
res = math.sqrt(n - len(K) - 3) * 0.5 * math.log1p((2*r)/(1-r)) #estimating partial correlation with fisher's transofrmation
return 2 * (1 - norm.cdf(abs(res))) #obtaining p-value
I hope someone could find this helpful
I explain what I have to develop.
Let's say I have to perform a function that is responsible for receiving two matrices, which have the same number of columns but can differ in the number of rows.
In summary, we will have two matrices of vectors with the same dimension but different number N of elements.
I have to calculate the Euclidean distance between each of the vectors that make up my two matrices, and then store it in another matrix that will contain the Euclidean distance between all my vectors.
This is the code I have developed:
def compute_distances(x, y):
# Dimension:
N, d = x.shape
M, d_ = y.shape
# The dimension should be the same
if d != d_:
print "Dimensiones de x e y no coinciden, no puedo calcular las distancias..."
return None
# Calculate distance with loops:
D = np.zeros((N, M))
i = 0
j = 0
for v1 in x:
for v2 in y:
if(j != M):
D[i,j] = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(v1,v2)]))
#print "[",i,",",j,"]"
j = j + 1
else:
j = 0
i = i + 1;
print D
In this method I am receiving the two matrices to later create a matrix that will have the Euclidean distances between the vectors of my matrices x and y.
The problem is the following, I do not know how, to each one of the calculated Euclidean distance values I have to assign the correct position of the new matrix D that I have generated.
My main function has the following structure:
n = 1000
m = 700
d = 10
x = np.random.randn(n, d)
y = np.random.randn(m, d)
print "x shape =", x.shape
print "y shape =", y.shape
D_bucle = da.compute_distances(x, y)
D_cdist = cdist(x, y)
print np.max(np.abs(D_cdist - D_bucle))
B_cdist calculates the Euclidean distance using efficient methods.
It has to have the same result as D_bucle that calculates the same as the other but with non efficient code, but I'm not getting what the result should be.
I think it's when I create my Euclidean matrix D that is not doing it correctly, then the calculations are incorrect.
Updated!!!
I just updated my solution, my problem is that firstly I didnt know how to asign to the D Matrix my correct euclidean vector result for each pair of vectors,
Now I khow how to asign it but now my problem is that only the first line from D Matrix is having a correct result in comparison with cdist function
not fully understanding what you're asking, but I do see one problem which may explain your results:
for v1 in x:
for v2 in y:
D = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(v1,v2)]))
You are overwriting the value of D each of the NxM times you go through this loop. When you're done D only contains the distance of the last compare. You might need something like D[i,j] = math.sqrt(...
I need to efficiently calculate the euclidean weighted distances for every x,y point in a given array to every other x,y point in another array. This is the code I have which works as expected:
import numpy as np
import random
def rand_data(integ):
'''
Function that generates 'integ' random values between [0.,1.)
'''
rand_dat = [random.random() for _ in range(integ)]
return rand_dat
def weighted_dist(indx, x_coo, y_coo):
'''
Function that calculates *weighted* euclidean distances.
'''
dist_point_list = []
# Iterate through every point in array_2.
for indx2, x_coo2 in enumerate(array_2[0]):
y_coo2 = array_2[1][indx2]
# Weighted distance in x.
x_dist_weight = (x_coo-x_coo2)/w_data[0][indx]
# Weighted distance in y.
y_dist_weight = (y_coo-y_coo2)/w_data[1][indx]
# Weighted distance between point from array_1 passed and this point
# from array_2.
dist = np.sqrt(x_dist_weight**2 + y_dist_weight**2)
# Append weighted distance value to list.
dist_point_list.append(round(dist, 8))
return dist_point_list
# Generate random x,y data points.
array_1 = np.array([rand_data(10), rand_data(10)], dtype=float)
# Generate weights for each x,y coord for points in array_1.
w_data = np.array([rand_data(10), rand_data(10)], dtype=float)
# Generate second larger array.
array_2 = np.array([rand_data(100), rand_data(100)], dtype=float)
# Obtain *weighted* distances for every point in array_1 to every point in array_2.
dist = []
# Iterate through every point in array_1.
for indx, x_coo in enumerate(array_1[0]):
y_coo = array_1[1][indx]
# Call function to get weighted distances for this point to every point in
# array_2.
dist.append(weighted_dist(indx, x_coo, y_coo))
The final list dist holds as many sub-lists as points are in the first array with as many elements in each as points are in the second one (the weighted distances).
I'd like to know if there's a way to make this code more efficient, perhaps using the cdist function, because this process becomes quite expensive when the arrays have lots of elements (which in my case they have) and when I have to check the distances for lots of arrays (which I also have)
#Evan and #Martinis Group are on the right track - to expand on Evan's answer, here's a function that uses broadcasting to quickly calculate the n-dimensional weighted euclidean distance without Python loops:
import numpy as np
def fast_wdist(A, B, W):
"""
Compute the weighted euclidean distance between two arrays of points:
D{i,j} =
sqrt( ((A{0,i}-B{0,j})/W{0,i})^2 + ... + ((A{k,i}-B{k,j})/W{k,i})^2 )
inputs:
A is an (k, m) array of coordinates
B is an (k, n) array of coordinates
W is an (k, m) array of weights
returns:
D is an (m, n) array of weighted euclidean distances
"""
# compute the differences and apply the weights in one go using
# broadcasting jujitsu. the result is (n, k, m)
wdiff = (A[np.newaxis,...] - B[np.newaxis,...].T) / W[np.newaxis,...]
# square and sum over the second axis, take the sqrt and transpose. the
# result is an (m, n) array of weighted euclidean distances
D = np.sqrt((wdiff*wdiff).sum(1)).T
return D
To check that this works OK, we'll compare it to a slower version that uses nested Python loops:
def slow_wdist(A, B, W):
k,m = A.shape
_,n = B.shape
D = np.zeros((m, n))
for ii in xrange(m):
for jj in xrange(n):
wdiff = (A[:,ii] - B[:,jj]) / W[:,ii]
D[ii,jj] = np.sqrt((wdiff**2).sum())
return D
First, let's make sure that the two functions give the same answer:
# make some random points and weights
def setup(k=2, m=100, n=300):
return np.random.randn(k,m), np.random.randn(k,n),np.random.randn(k,m)
a, b, w = setup()
d0 = slow_wdist(a, b, w)
d1 = fast_wdist(a, b, w)
print np.allclose(d0, d1)
# True
Needless to say, the version that uses broadcasting rather than Python loops is several orders of magnitude faster:
%%timeit a, b, w = setup()
slow_wdist(a, b, w)
# 1 loops, best of 3: 647 ms per loop
%%timeit a, b, w = setup()
fast_wdist(a, b, w)
# 1000 loops, best of 3: 620 us per loop
You could use cdist if you don't need weighted distances. If you need weighted distances and performance, create an array of the appropriate output size, and use either an automated accelerator like Numba or Parakeet, or hand-tune the code with Cython.
You can avoid looping by using code that looks like the following:
def compute_distances(A, B, W):
Ax = A[:,0].reshape(1, A.shape[0])
Bx = B[:,0].reshape(A.shape[0], 1)
dx = Bx-Ax
# Same for dy
dist = np.sqrt(dx**2 + dy**2) * W
return dist
That will run a lot faster in python that anything that loops as long as you have enough memory for the arrays.
You could try removing the square root, since if a>b, it follows that a squared > b squared... and computers are REALLY slow at square roots normally.