Calculate euclidean distance from a set in Python

I have a set S(i) which contains the indices j of all x-coordinates that are close to x_i, i.e. if S(1) = {2,3}, this means that x2 and x3 are close to x1. In total I have N sets S(1), ..., S(N).
So this part of the code works fine:
import numpy as np

arr = np.array([[1,3], [2,8],[3,1],[6,18], [9,8]])
arr = [item[0] for item in arr] #Extract x-coordinates
def Si(x): #This is the set i want to use
    return [[j for j in range(len(x)) if np.abs(x[j] - x[i]) < 2] for i in range(len(x))]
Now that I have the index sets, I want to calculate the Euclidean distance between (x_i, y_i) and each (x_j, y_j) with j in S(i). For example, for i=1, if S(1) = {7}, find the distance between (x_1, y_1) and (x_7, y_7); for i=2, if S(2) = {3,9}, find the distances between (x_2, y_2) and (x_3, y_3) and between (x_2, y_2) and (x_9, y_9); and repeat for each i.
I don't know how to implement this, I'm really confused! Here is Euclidean distance code that computes the distance for ALL pairs of points in the array, but not just for the sets I want.
def euc_dist(arr):
    arr_x = (arr[:,0,np.newaxis].T - arr[:,0,np.newaxis])**2 ##x-coordinates
    arr_y = (arr[:,1,np.newaxis].T - arr[:,1,np.newaxis])**2 ##y-coordinates
    arr = np.sqrt(arr_x + arr_y)
    return arr

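This should work, provided the 2-D coordinates are kept (here as coords) and only the x-coordinates are passed to Si:
coords = np.array([[1,3], [2,8],[3,1],[6,18], [9,8]])
S = Si(coords[:, 0]) # get the array
def my_fn(i): # take the value of i
    euc_dists = []
    for j in S[i]: # iterate over j's in S[i]
        if i != j:
            dist = np.linalg.norm(coords[i]-coords[j]) # euclidean distance
            euc_dists.append(dist)
    return euc_dists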
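The question already has code for the full distance matrix, so another option is to compute that matrix once and pick out the entries belonging to each set. The following is only a sketch under that assumption; `points`, `close` and `dists_per_point` are hypothetical names, and the threshold of 2 on the x-coordinates is taken from the question.

```python
import numpy as np

points = np.array([[1, 3], [2, 8], [3, 1], [6, 18], [9, 8]])  # (x, y) pairs from the question
x = points[:, 0]

# Full pairwise distance matrix: D[i, j] is the distance between points i and j.
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Boolean neighbour mask: close[i, j] is True when |x_j - x_i| < 2 and j != i.
close = (np.abs(x[:, None] - x[None, :]) < 2) & ~np.eye(len(x), dtype=bool)

# For each i, keep only the distances to the points in its set S(i).
dists_per_point = [D[i, close[i]] for i in range(len(points))]
```

Each entry of `dists_per_point` is a (possibly empty) array, in the same order as the indices in the corresponding set.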

Related

Why is my algorithm showing linear behavior when it's supposed to be O(m4^(m))?

I am trying to understand the complexity of an algorithm I am experimenting with. The site where I found the algorithm states that it has a complexity of O(mn4^(m+n)), but when I held n constant in my experimental analysis, the results showed linear behavior. Shouldn't it be something like O(m4^m)? Can anyone explain why this may be happening?
This is my code:
def longestIncreasingPathDFS(matrix):
    maxlen = [0]
    for i in range(len(matrix)):
        for j in range(len(matrix[0])):
            dfs(matrix, i, j, maxlen, 1)
    return maxlen[0]
def dfs(matrix, i, j, maxlen, length):
    #keeps the longest length in maxlen[0]
    maxlen[0] = max(maxlen[0], length)
    m = len(matrix)
    n = len(matrix[0])
    dx = [-1, 0, 1, 0]
    dy = [0, 1, 0, -1]
    for k in range(4):
        x = i + dx[k]
        y = j + dy[k]
        if x >= 0 and x < m and y >= 0 and y < n and matrix[x][y] > matrix[i][j]:
            dfs(matrix, x, y, maxlen, length+ 1)
This is how I get the linear plot:
import time
import matplotlib.pyplot as plt
import random
times = []
input_sizes = range(1, 500)
for i in input_sizes:
    matrix = [[random.randint(0,100) for _ in range(i)] for _ in range(10)]
    start_time = time.time()
    longestIncreasingPathDFS(matrix)
    end_time = time.time()
    times.append(end_time - start_time)
plt.plot(input_sizes, times)
plt.xlabel("Input size")
plt.ylabel("Time (segs)")
plt.show()
I tried increasing the test sample, but the plot is clearly linear. I also searched for related questions about this algorithm, but with no luck.

Due to the recursion, the worst case is that you go n*m times through, on average, n*m/2 elements, i.e. O((n*m)^4), I'd say.
However, as with many algorithms, the normal case is much more forgiving/efficient than the constructed worst case.
So in most cases, it will be more like a constant times n*m, because the longest path is much shorter than the number of matrix elements.
For a random matrix, that factor may not even grow linearly with size but be truly constant: the probability of having a continuous increasing sequence decreases exponentially with its length, hence your observation.
Edit:
Tip: Try a large matrix like this (instead of a random one), with the values sorted so that the path stretches over all elements:
[[1, 2, ... n],
[2n, 2n-1, ... n+1],
[2n+1, 2n+2, ... 3n],
[.... n*m]]
I expect this to be more like (n*m)^4
Ah, and another limitation: you use random integers between 0 and 100, so a strictly increasing path can never be longer than 101 cells in your cases. So the complexity is limited to O(n*m*p), where p is the largest integer you use in the random matrix.
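To try the sorted matrix suggested in the tip without depending on wall-clock noise, one can count recursive calls instead of measuring time. This is a sketch only: `snake_matrix` and `count_dfs_calls` are hypothetical helper names, the DFS is a re-statement of the one in the question, and the sizes are kept tiny so the recursion stays well within Python's default limit.

```python
def snake_matrix(m, n):
    """m x n matrix holding 1..m*n, with every other row reversed (the layout from the tip)."""
    rows = []
    for r in range(m):
        row = list(range(r * n + 1, (r + 1) * n + 1))
        rows.append(row if r % 2 == 0 else row[::-1])
    return rows

def count_dfs_calls(matrix):
    """Run the un-memoised DFS from the question; return (longest path, number of dfs calls)."""
    m, n = len(matrix), len(matrix[0])
    best, calls = 0, 0

    def dfs(i, j, length):
        nonlocal best, calls
        calls += 1
        best = max(best, length)
        for dx, dy in ((-1, 0), (0, 1), (1, 0), (0, -1)):
            x, y = i + dx, j + dy
            if 0 <= x < m and 0 <= y < n and matrix[x][y] > matrix[i][j]:
                dfs(x, y, length + 1)

    for i in range(m):
        for j in range(n):
            dfs(i, j, 1)
    return best, calls

# Print size, longest path and call count for a few small square matrices.
for size in (2, 3, 4, 5):
    longest, calls = count_dfs_calls(snake_matrix(size, size))
    print(size, longest, calls)
```

Comparing the call count with the number of cells for growing sizes gives a rough picture of how far this input is from the linear behaviour seen with random matrices.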

Proving @Dr. V's point
import time
import matplotlib.pyplot as plt
import random
import numpy as np
def path_exploit(rows, cols):
    """
    Function creates matrix with longest path of size = 2 * (rows + cols) - 4
    (the number of boundary cells, for rows, cols >= 2)
    """
    # Init a zero matrix of size (rows, cols)
    matrix = np.zeros(shape = (rows, cols))
    # Create longest path along the matrix boundary
    bd = ([(0, j) for j in range(matrix.shape[1])]
          + [(i, matrix.shape[1] - 1) for i in range(1, matrix.shape[0])]
          + [(matrix.shape[0] - 1, j) for j in range(matrix.shape[1] - 2, -1 , -1)]
          + [(i, 0) for i in range(matrix.shape[0] - 2, 0, -1)])
    count = 1
    for element in bd:
        matrix[element[0], element[1]] = count
        count += 1
    return matrix.tolist()
times = []
input_sizes = range(1, 1000, 50)
for i in input_sizes:
    matrix = path_exploit(i, 10) #[[random.randint(0,100) for _ in range(i)] for _ in range(10)]
    start_time = time.time()
    longestIncreasingPathDFS(matrix)
    end_time = time.time()
    times.append(end_time - start_time)
plt.plot(input_sizes, times)
plt.xlabel("Input size")
plt.ylabel("Time (segs)")
plt.show()
Time vs. number of rows now starts to look exponential.
[Plot: time vs. input size]

How to calculate the distances between all datapoints among each other

I want to check which data points within X are close to each other and which are far apart, by calculating the distances between them without everything coming out as zero. Is that possible?
X = np.random.rand(20, 10)
dist = (X - X) ** 2
print(X)

Using just numpy you can either do,
np.linalg.norm((X - X[:,None]),axis=-1)
or,
np.sqrt(np.square(X - X[:,None]).sum(-1))
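As a sketch of how the broadcast result might be used to answer "which points are close and which are far": the diagonal of the matrix holds the zero self-distances, so it is masked before looking for nearest neighbours. The names `D`, `D_no_self`, `nearest`, `i_far` and `j_far` are hypothetical, and a seeded generator is assumed only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 10))

# X[:, None] has shape (20, 1, 10); subtracting X (20, 10) broadcasts to (20, 20, 10).
D = np.linalg.norm(X[:, None] - X, axis=-1)   # D[i, j] = distance between rows i and j

# Ignore the zero self-distances on the diagonal when searching for close points.
D_no_self = D.copy()
np.fill_diagonal(D_no_self, np.inf)
nearest = D_no_self.argmin(axis=1)            # index of the closest other point for each row

# Farthest pair: the zeros on the diagonal never win an argmax, so D can be used directly.
i_far, j_far = np.unravel_index(D.argmax(), D.shape)
```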

Another possible solution:
from scipy.spatial.distance import cdist
X = np.random.rand(20, 10)
cdist(X, X)

You can go through each point in sequence:
X = np.random.rand(20, 10)
no_points = X.shape[0]
distances = np.zeros((no_points, no_points))
for i in range(no_points):
    for j in range(no_points):
        distances[i, j] = np.linalg.norm(X[i, :] - X[j, :])
print(distances,np.max(distances))

I assume you want some way of keeping track of the distances, correct? If so, you can build a dictionary that has the distances as keys and, as values, lists of tuples of the points at that distance. Then you just need to iterate through the keys in ascending order to get the distances from least to greatest, together with the points that correspond to each distance. One way to do so is to brute-force every pair of points.
import numpy as np
dist = dict()
X = np.random.rand(20, 10)
for indexOfNumber1 in range(len(X) - 1):
    for indexOfNumber2 in range(indexOfNumber1 + 1, len(X)): # visit each pair once, never a point with itself
        distance = np.linalg.norm(X[indexOfNumber1] - X[indexOfNumber2]) # scalar Euclidean distance
        if distance not in dist:
            dist[distance] = [(X[indexOfNumber1], X[indexOfNumber2])]
        else:
            dist[distance].append((X[indexOfNumber1], X[indexOfNumber2])) # append works in place and returns None
After running the code above, the dictionary dist contains every pairwise distance between the points, mapped to the pairs of points that are that distance apart.
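An alternative sketch that avoids the dictionary altogether: list every unordered pair once with `np.triu_indices`, compute the distances in one call, and sort them. This swaps the dictionary approach for a sorted array; `i_idx`, `j_idx`, `pair_dists`, `closest_pairs` and `farthest_pair` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 10))

# Index pairs (i, j) with i < j: every unordered pair exactly once, no self-pairs.
i_idx, j_idx = np.triu_indices(len(X), k=1)
pair_dists = np.linalg.norm(X[i_idx] - X[j_idx], axis=1)

order = np.argsort(pair_dists)   # positions of the pairs, sorted by ascending distance
closest_pairs = [(int(i_idx[k]), int(j_idx[k]), float(pair_dists[k])) for k in order[:3]]
last = order[-1]
farthest_pair = (int(i_idx[last]), int(j_idx[last]), float(pair_dists[last]))
```

Walking through `order` from start to end visits the pairs from closest to farthest, which is the same traversal the dictionary answer describes.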

Average Directional Unit Vectors - Determinism Test

Say I take this box as an example:
I want to calculate the average directional vectors in the embedding phase space. The average direction vector V_k is calculated from each pass p of the trajectory through the k-th box. Each pass generates a unit vector e_p whose direction is determined by the phase space point where the trajectory enters the box and the phase space point where the trajectory leaves the box. Here are two methods; I'm unsure which one is correct, or whether both are wrong:
import numpy as np

def unit_vector(v):
    return v / np.linalg.norm(v)
# x-line and y-line
x = 0.00066496
y = 0.00069381
# y-vals along x-line
ys = [0.0007515997, 0.0007516736, 0.0007517695, 0.0007517716, 0.0007517978, 0.0007518086, 0.0007518439,
0.0007518738, 0.0007518758, 0.0007518850, 0.0007518883, 0.0007518912, 0.0007518925, 0.0007519232]
# x-vals along y-line
xs = [0.0007860762, 0.0007861990, 0.0007862053, 0.0007862724, 0.0007862800, 0.0007863471, 0.0007864196,
0.0007864439, 0.0007864641, 0.0007864704, 0.0007864773, 0.0007864814, 0.0007864959, 0.0007865132]
# Create the coordinates
A, B = [], []
for i, j in zip(ys, xs):
    A.append([x, i])
    B.append([j, y])
A = np.matrix(A)
B = np.matrix(B)
# Method 1
avg_direction_unit_vector = []
for i in range(len(A)):
    avg_direction_unit_vector.append([unit_vector(A[i, 0] - B[i, 0]), unit_vector(A[i, 1] - B[i, 1])])
V = np.mean(np.array(avg_direction_unit_vector), axis=0)
print(np.abs(V))
# Method 2
avg_direction_unit_vector = []
for i in range(len(A)):
    avg_direction_unit_vector.append(unit_vector([A[i, 0] - B[i, 0], (A[i, 1] - B[i, 1])]))
V = np.mean(avg_direction_unit_vector, axis=0)
print(np.abs(V))
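Read literally, the description above asks for one unit vector per pass, pointing from the entry point to the exit point, and then the mean of those unit vectors. A minimal sketch of that reading follows. It normalises each displacement as a whole 2-D vector, as Method 2 does. The question does not say which of A and B holds the entry points, so the data and the name `average_direction` are purely illustrative assumptions.

```python
import numpy as np

def average_direction(entries, exits):
    """Mean of the unit vectors pointing from each entry point to its exit point."""
    entries = np.asarray(entries, dtype=float)
    exits = np.asarray(exits, dtype=float)
    d = exits - entries                                   # one displacement per pass
    e = d / np.linalg.norm(d, axis=1, keepdims=True)      # unit vector e_p for each pass
    return e.mean(axis=0)                                 # V_k; its length is at most 1

# Hypothetical data: three passes through one box, all heading in the same direction.
entries = [[0.0, 1.0], [0.0, 1.2], [0.0, 0.8]]
exits = [[1.0, 0.0], [1.2, 0.0], [0.8, 0.0]]
V = average_direction(entries, exits)
```

With perfectly aligned passes, as in this toy data, the length of `V` is 1; passes in scattered directions pull it towards 0.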
This is the approach from a slide:

Don't understand this IndexError using numpy

I am given a square matrix and have to do as follows:
For each entry (i,j) in the matrix
    If i = j:
        set y[i,j] = x[i,j].
    Else:
        set y[i,j] = x[i,j] + x[j,i]
I have made the following script:
def symmetrize(x):
    ## The symmetrized matrix that is returned
    y = np.zeros(np.shape(x))
    ## For loop for each element (i,j) in the matrix
    for i in range (np.size(x)):
        for j in range (np.size(x)):
            if i == j:
                y[i,j] = x[i,j]
            else:
                y[i,j] = x[i,j] + x[j,i]
    return y
I get this error message whenever I want to run the code with the following matrix:
np.array([[1.2, 2.3, 3.4],[4.5, 5.6, 6.7], [7.8, 8.9, 10.0]])
Error message:
y[i,j] = x[i,j] + x[j,i]
IndexError: index 3 is out of bounds for axis 1 with size 3
Does anyone know what the problem is?

np.size(), without an axis, gives you the total number of elements in the matrix. So your range()s are going to go from 0 to 8, not from 0 to 2.
You don't need to use np.size() or np.shape() for that matter. Just use the .shape attribute of the array:
y = np.zeros(x.shape)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        # ... loop body as before
There are better ways of producing your output. You could use:
def symmetrize(x):
    return x + x.T - np.diag(x.diagonal())
instead. x.T is the transposed matrix, so rows and columns are swapped. x + x.T is the sum of the original matrix and its transpose, so the numbers on the diagonal are doubled. x.diagonal() is an array of just the numbers on the diagonal; np.diag() turns that array into a matrix with those numbers on the diagonal and zeros elsewhere, which can then be subtracted once.
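A quick way to check that the one-line version matches the loop is to run both on the matrix from the question. This is just a sketch; `symmetrize_loop` is a hypothetical name for the loop version with the `.shape` fix applied.

```python
import numpy as np

def symmetrize_loop(x):
    """Element-by-element version, using x.shape for the loop bounds."""
    y = np.zeros(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            y[i, j] = x[i, j] if i == j else x[i, j] + x[j, i]
    return y

def symmetrize(x):
    """Vectorised version: add the transpose, then remove the doubled diagonal once."""
    return x + x.T - np.diag(x.diagonal())

x = np.array([[1.2, 2.3, 3.4], [4.5, 5.6, 6.7], [7.8, 8.9, 10.0]])
assert np.allclose(symmetrize_loop(x), symmetrize(x))
```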

You are using np.size() the wrong way: it doesn't tell you how many rows or columns your array has, but the total number of elements in it, in your case 9. You could use the shape of your array like so:
def symmetrize(x):
    ## The symmetrized matrix that is returned
    y = np.zeros(np.shape(x))
    ## For loop for each element (i,j) in the matrix
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if i == j:
                y[i,j] = x[i,j]
            else:
                y[i,j] = x[i,j] + x[j,i]
    return y

Distance calculation on matrix using numpy

I am trying to implement a K-means algorithm in Python (I know there are libraries for that, but I want to learn how to implement it myself). Here is the function I am having a problem with:
def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = m , n where m is number of examples,
    and n is number of dimensions.
    centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
    k < m should hold.
    Returns:
    numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
    """
    m ,n = points.shape
    temp = []
    for i in range(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    distances = np.hypot(*temp)
    return distances.argmin(axis=1)
The purpose of this function is, given m points in n-dimensional space and k centroids in n-dimensional space, to produce a numpy array (x1 x2 x3 x4 ... xm) where x1 is the index of the centroid which is closest to the first point, and so on. This was working fine until I tried it with 4-dimensional examples. When I pass in 4-dimensional examples, I get this error:
  File "/path/to/the/kmeans.py", line 28, in AssignPoints
    distances = np.hypot(*temp)
ValueError: invalid number of arguments
How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?
My Answer
def AssignPoints(points, centroids):
    m ,n = points.shape
    temp = []
    for i in range(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    for i in range(len(temp)):
        temp[i] = temp[i] ** 2
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)

Try this:
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
Or:
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
And don't ask what's it doing :D
Edit: nah, just kidding. The broadcasting in the middle (points[np.newaxis] - centroids[:,np.newaxis]) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the differences between all the points and one of the centroids. Let's call it diffs.
Then we do the usual operation to calculate the Euclidean distance (square root of the sum of the squared differences): np.sqrt((diffs ** 2).sum(axis=2)). We end up with a (k, m) matrix where row 0 contains the distances to centroids[0], etc. So, the .argmin(axis=0) gives you the result you wanted.
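The shapes involved can be made explicit by wrapping the expression in a function and trying it on 4-dimensional input, the case that failed with np.hypot. This is a sketch; `assign_points` and the sample data are hypothetical.

```python
import numpy as np

def assign_points(points, centroids):
    """Index of the nearest centroid for each point, for any number of dimensions."""
    # (1, m, n) - (k, 1, n) broadcasts to (k, m, n): one "plane" of differences per centroid.
    diffs = points[np.newaxis] - centroids[:, np.newaxis]
    dists = np.sqrt((diffs ** 2).sum(axis=2))   # shape (k, m)
    return dists.argmin(axis=0)                 # shape (m,)

# Hypothetical 4-dimensional example with m = 4 points and k = 2 centroids.
points = np.array([[0., 0., 0., 0.],
                   [1., 1., 1., 1.],
                   [9., 9., 9., 9.],
                   [10., 10., 10., 9.]])
centroids = np.array([[0.5, 0.5, 0.5, 0.5],
                      [9.5, 9.5, 9.5, 9.0]])
labels = assign_points(points, centroids)
```

Here the first two points are assigned to centroid 0 and the last two to centroid 1.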

You need to define a distance function where you are using hypot. Usually in K-means it is
Distance=sum((point-centroid)^2)
Here is some MATLAB code that does it ... I can port it if you can't, but give it a go. Like you said, it's the only way to learn.
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
% idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
% in idx for a dataset X where each row is a single example. idx = m x 1
% vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% Go over every example, find its closest centroid, and store
% the index inside idx at the appropriate location.
% Concretely, idx(i) should contain the index of the centroid
% closest to example i. Hence, it should be a value in the
% range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
    [value index] = min(Distance);
    idx(loop) = index;
end;
end
UPDATE
This should return the (squared) distances. Notice that the MATLAB code above only finds the distance (and index) of the closest centroid, whereas your function computes all distances, as does the one below.
def FindDistance(X,centroids):
    K = centroids.shape[0]
    examples, dimensions = X.shape
    distance = np.zeros((examples,K))
    for ex in range(examples):
        # squared Euclidean distance from example ex to every centroid
        distance[ex,:] = np.sum((X[ex,:]-centroids)**2,1)
    return distance
