Calculate average weighted Euclidean distance between values in numpy - python

I searched a bit around and found comparable questions/answers, but none of them returned the correct results for me.
Situation:
I have an array with a number of clumps of values == 1, while the rest of the cells are set to zero. Each cell is a square (width=height).
Now I want to calculate the average distance between all 1 values.
The formula should be like this: d = sqrt(((x2 - x1) * size)**2 + ((y2 - y1) * size)**2)
Example:
import numpy as np
from scipy.spatial.distance import pdist
a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])
# Given that each cell is 10m wide/high
val = 10
d = pdist(a, lambda u, v: np.sqrt((((u - v) * val)**2).sum()))
d
array([ 14.14213562,  10.        ,  10.        ])
After that I would calculate the average via d.mean(). However, the result in d is obviously wrong: the distance between the cells in the top row alone should already be 20 (two crossed cells * 10). Is there something wrong with my formula, my math, or my approach?

You need the actual coordinates of the non-zero markers to compute the distance between them:
>>> import numpy as np
>>> from scipy.spatial.distance import squareform, pdist
>>> a = np.array([[1, 0, 1],
...               [0, 0, 0],
...               [0, 0, 1]])
>>> np.where(a)
(array([0, 0, 2]), array([0, 2, 2]))
>>> x, y = np.where(a)
>>> coords = np.vstack((x, y)).T
>>> coords
array([[0, 0],   # That's the coordinate of the "1" in the top left,
       [0, 2],   # top right,
       [2, 2]])  # and bottom right.
Next you want to calculate the distance between these points. You use pdist for this, like so:
>>> dists = pdist(coords) * 10  # Uses the Euclidean distance metric by default.
>>> squareform(dists)
array([[  0.        ,  20.        ,  28.28427125],
       [ 20.        ,   0.        ,  20.        ],
       [ 28.28427125,  20.        ,   0.        ]])
In this last matrix, you will find (above the diagonal) the distance between each marked point in a and every other marked point. In this case, you had 3 coordinates, so it gives you the distance between node 0 (a[0,0]) and node 1 (a[0,2]), node 0 and node 2 (a[2,2]), and finally between node 1 and node 2. To put it in different words: if S = squareform(dists), then S[i,j] returns the distance between the coordinates on row i of coords and row j.
Only the values in the upper triangle of that last matrix are present in the variable dists, from which you can derive the mean easily, without having to perform the relatively expensive calculation of the squareform (shown here just for demonstration purposes):
>>> dists
array([ 20.        ,  28.28427125,  20.        ])
>>> dists.mean()
22.761423749153966
Remark that your computed solution "looks" nearly correct (aside from a factor of 2), because of the example you chose. What pdist does is take the Euclidean distance between the first point in the n-dimensional space and the second, then between the first and the third, and so on. In your example, that means it computes the distance between a point on row 0: that point has coordinates in 3-dimensional space given by [1,0,1]. The 2nd point is [0,0,0]. The Euclidean distance between those two is sqrt(2) ≈ 1.4. Then, the distance between the first and the 3rd coordinate (the last row in a) is only 1. Finally, the distance between the 2nd coordinate (row 1: [0,0,0]) and the 3rd (last row, row 2: [0,0,1]) is also 1. So remember: pdist interprets its first argument as a stack of coordinates in n-dimensional space, n being the number of elements in the tuple of each node.
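To see this concretely: calling pdist on a directly reproduces exactly those row-to-row distances, and the values from the question are just these times 10:
>>> pdist(a)  # the rows of a are treated as points in 3-dimensional space
array([ 1.41421356,  1.        ,  1.        ])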

Related

How can I use weighted labels in the knn algorithm?

I am working on my own implementation of the weighted knn algorithm.
To simplify the logic, let's represent this as a predict method, which takes three parameters:
indices - matrix of the j nearest neighbors from the training sample for object i (i = 1...n, n objects in total); indices[i, j] is the index of an object from the training sample.
For example, for 4 objects and 3 neighbors:
indices = np.asarray([[0, 3, 1],
                      [0, 3, 1],
                      [1, 2, 0],
                      [5, 4, 3]])
distances - matrix of distances from the j nearest neighbors from the training sample to object i (i = 1...n, n objects in total). For example, for 4 objects and 3 neighbors:
distances = np.asarray([[   4.12310563,    7.07106781,    7.54983444],
                        [   4.89897949,    6.70820393,    8.24621125],
                        [   0.        ,    1.73205081,    3.46410162],
                        [1094.09368886, 1102.55022561, 1109.62245832]])
labels - vector with the true class labels for each object j of the training sample. For example:
labels = np.asarray([0, 0, 0, 1, 1, 2])
Thus, the function signature is:
def predict(indices, distances, labels):
    ....
    # return [np.bincount(x).argmax() for x in labels[indices]]
    return predict
In the comment you can see the code that returns the prediction for the "non-weighted" knn method, which does not use distances. Can you please show how predictions can be calculated using the distance matrix? I found the algorithm, but now I'm completely stumped because I don't know how to implement it with numpy.
Thank you!
This should work:
# compute inverses of distances;
# suppress the division-by-0 warning and
# replace np.inf with a very large number
with np.errstate(divide='ignore'):
    dinv = np.nan_to_num(1 / distances)
# an array with distinct class labels
distinct_labels = np.array(list(set(labels)))
# an array with labels of neighbors
neigh_labels = labels[indices]
# compute the weighted score for each potential label
weighted_scores = ((neigh_labels[:, :, np.newaxis] == distinct_labels)
                   * dinv[:, :, np.newaxis]).sum(axis=1)
# choose the label with the highest score
predictions = distinct_labels[weighted_scores.argmax(axis=1)]
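For completeness, here is a minimal sketch that wraps the above into the predict signature from the question; the expected output is worked out by hand from the example arrays above:
def predict(indices, distances, labels):
    # inverse-distance weights; zero distance becomes a very large weight
    with np.errstate(divide='ignore'):
        dinv = np.nan_to_num(1 / distances)
    distinct_labels = np.array(list(set(labels)))
    neigh_labels = labels[indices]
    weighted_scores = ((neigh_labels[:, :, np.newaxis] == distinct_labels)
                       * dinv[:, :, np.newaxis]).sum(axis=1)
    return distinct_labels[weighted_scores.argmax(axis=1)]

print(predict(indices, distances, labels))
# [0 0 0 1]  (object 2 has a zero distance, so its nearest neighbor's label dominates)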

How to put one entry across an entire diagonal for a sparse matrix in Python

I am seeking to construct a matrix of which I will calculate the inverse. This will be used in an implicit method for solving a nonlinear parabolic PDE. My current calculations are giving me a singular (non-invertible) matrix, for reasons that will become obvious. For context, in reality the matrix will be of dimension 30 by 30, but in these examples I am using smaller matrices for testing purposes.
Say I want to create a large square sparse matrix. Using spdiags only allows you to input members of the main, lower and upper diagonals individually. So how do you make it so that each diagonal has one value for all of its entries?
Example Code:
import numpy as np
from scipy.sparse import spdiags
from numpy.linalg import inv
updiag = -0.25
diag = 0.5
lowdiag = -0.25
Jdata = np.array([[diag], [lowdiag], [updiag]])
Diags = [0, -1, 1]
J = spdiags(Jdata, Diags, 3, 3).toarray()
print(J)
inverseJ = inv(J)
print(inverseJ)
This produces a 3 x 3 matrix, but only with the first entry of each diagonal filled in. I wondered about using np.fill_diagonal, but that would require a matrix first and only handles the main diagonal. Am I misunderstanding something?
The first argument of spdiags is a matrix of values to be used as the diagonals. You can use it this way:
Jdata = np.array([3 * [diag], 3 * [lowdiag], 3 * [updiag]])
Diags = [0, -1, 1]
J = spdiags(Jdata, Diags, 3, 3).toarray()
print(J)
# [[ 0.5  -0.25  0.  ]
#  [-0.25  0.5  -0.25]
#  [ 0.   -0.25  0.5 ]]
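As an aside, scipy.sparse.diags (as opposed to spdiags) broadcasts scalar diagonal values when a shape is given, so the repetition isn't needed at all; something like:
from scipy.sparse import diags

# each scalar is broadcast along its whole diagonal
J = diags([lowdiag, diag, updiag], [-1, 0, 1], shape=(3, 3)).toarray()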

Minimum absolute difference between elements in two numpy arrays

Consider two 1d numpy arrays.
import numpy as np
X = np.array([-43, 21, 4, 6, -1, 22, 8])
Y = np.array([13, 5, -12, 0])
I want to find the value(s) from X that have the minimum absolute difference with the value(s) from Y. In the example shown, the minimum absolute difference is 1, given by [[4, 5], [6, 5], [-1, 0]]. There are lots of resources on this site about finding minimum element of arrays, but that's not what I'm after.
For the present question, both starting arrays are 1d, though their sizes may differ. I'd also be interested, though, on tips about how to proceed if the starting arrays had different shapes. Is it simply a matter of flattening both then proceeding as before?
You can calculate the absolute-distance array and then find the minimum in that array. This method works for different X and Y lengths. If they are multi-dimensional, simply flatten them first (using X.flatten(), ...) and apply this solution to the flattened arrays.
If you want ALL pairs with minimum absolute distance:
# absolute distance between X and Y
dist = np.abs(X[:, None] - Y)
# elements of X with minimum absolute distance
X[np.where(dist == dist.min())[0]]
# corresponding elements of Y with minimum absolute distance
Y[np.where(dist == dist.min())[1]]
output:
[ 4  6 -1]
[5 5 0]
And if you want them in a single array:
idx = np.where(dist == dist.min())
np.stack((X[idx[0]], Y[idx[1]])).T
[[ 4  5]
 [ 6  5]
 [-1  0]]
If you want only the first occurrence of the minimum absolute distance, there is a faster solution:
X[dist.argmin() // Y.size]
Y[dist.argmin() % Y.size]
or, equivalently, another solution (which I think would be faster):
idx = np.unravel_index(np.argmin(dist), dist.shape)
X[idx[0]]
Y[idx[1]]
output:
4
5
Note: Another way of getting the absolute-distance array is:
dist = np.abs(np.subtract.outer(X, Y))
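Both constructions give the same array:
assert np.array_equal(np.abs(X[:, None] - Y), np.abs(np.subtract.outer(X, Y)))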

Adjusted Cosine Similarity in Python

Referring to this link, which calculates the adjusted cosine similarity matrix (given the ratings matrix M having m users and n items) as below:
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
I cannot see how the 'both rated' condition is met, as per this definition.
I have manually calculated the adjusted cosine similarities and they seem to differ from the values I get from the above code.
Could anyone please clarify this?
Let's first try to understand the formulation: the matrix is stored such that each row is a user and each column is an item. Users are indexed by u and items by i.
Each user has a different judgement rule for how good or how bad something is. A 1 from one user could be a 3 from another user. That is why we subtract the average of each R_u from each R_{u,i}. This is computed as item_mean_subtracted in your code. Notice that we subtract from each element its row mean, to normalize the user's bias. After that, we normalize each column (item) by dividing it by its norm, and then compute the cosine similarity between each pair of columns.
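In symbols, the adjusted cosine similarity between items i and j is
sim(i, j) = sum_u (R_{u,i} - avg(R_u)) * (R_{u,j} - avg(R_u)) / ( sqrt(sum_u (R_{u,i} - avg(R_u))^2) * sqrt(sum_u (R_{u,j} - avg(R_u))^2) )
In the textbook definition you linked, the sums run only over users u who rated both i and j; the code above sums over all users instead, which is presumably where your manually calculated values diverge.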
pdist(item_mean_subtracted.T, 'cosine') computes the cosine distance between the items, and it is known that
cosine similarity = 1 - cosine distance
which is why the code works.
Now, what if we compute it directly according to the definition? I have commented what is being performed in each step; try to copy and paste the code, and you can compare with your calculation by printing out more intermediate steps.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm
M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
print(similarity_matrix)
#Computing the cosine similarity directly
n = len(M[0])  # find out the number of columns (items)
# divide each column by its norm, i.e. normalize it
normalized = item_mean_subtracted / norm(item_mean_subtracted, axis=0).reshape(1, n)
normalized = normalized.T  # transpose it
# compute the similarity matrix by taking the inner product of any two items
similarity_matrix2 = np.asarray([[np.inner(normalized[i], normalized[j])
                                  for i in range(n)]
                                 for j in range(n)])
print(similarity_matrix2)
Both of the codes give the same result:
[[ 1.          0.86743396  0.39694169 -0.67525773 -0.72426278]
 [ 0.86743396  1.          0.80099604 -0.64553225 -0.90790362]
 [ 0.39694169  0.80099604  1.         -0.37833504 -0.80337196]
 [-0.67525773 -0.64553225 -0.37833504  1.          0.26594024]
 [-0.72426278 -0.90790362 -0.80337196  0.26594024  1.        ]]

How to find all neighbors of a given point in a delaunay triangulation using scipy.spatial.Delaunay?

I have been searching for an answer to this question but cannot find anything useful.
I am working with the Python scientific computing stack (scipy, numpy, matplotlib), and I have a set of 2-dimensional points, for which I compute the Delaunay triangulation (wiki) using scipy.spatial.Delaunay.
I need to write a function that, given any point a, will return all other points which are vertices of any simplex (i.e. triangle) that a is also a vertex of (the neighbors of a in the triangulation). However, the documentation for scipy.spatial.Delaunay (here) is pretty bad, and I can't for the life of me understand how the simplices are being specified, or how I would go about doing this. Even just an explanation of how the neighbors, vertices and vertex_to_simplex arrays in the Delaunay output are organized would be enough to get me going.
Many thanks for any help.
I figured it out on my own, so here's an explanation for any future person who is confused by this.
As an example, let's use the simple lattice of points that I was working with in my code, which I generate as follows:
import numpy as np
import itertools as it
from matplotlib import pyplot as plt
import scipy.spatial as sp

def mksite(x, y):
    # mksite was not shown in the original post; this is a plausible
    # definition that reproduces the triangular lattice printed below
    return (x + 0.5 * y, y * np.sqrt(3) / 2)

inputs = list(it.product([0, 1, 2], [0, 1, 2]))
lattice = [mksite(x, y) for x, y in inputs]
The details here are not really important; suffice it to say it generates a regular triangular lattice in which the distance between a point and any of its six nearest neighbors is 1.
To plot it
plt.plot(*np.transpose(lattice), marker='o', ls='')
plt.gca().set_aspect('equal')
Now compute the triangulation:
triang = sp.Delaunay(lattice)
Let's look at what this gives us.
triang.points
output:
array([[ 0.        ,  0.        ],
       [ 0.5       ,  0.8660254 ],
       [ 1.        ,  1.73205081],
       [ 1.        ,  0.        ],
       [ 1.5       ,  0.8660254 ],
       [ 2.        ,  1.73205081],
       [ 2.        ,  0.        ],
       [ 2.5       ,  0.8660254 ],
       [ 3.        ,  1.73205081]])
simple, just an array of all nine points in the lattice illustrated above. Now let's look at:
triang.vertices
output:
array([[4, 3, 6],
       [5, 4, 2],
       [1, 3, 0],
       [1, 4, 2],
       [1, 4, 3],
       [7, 4, 6],
       [7, 5, 8],
       [7, 5, 4]], dtype=int32)
In this array, each row represents one simplex (triangle) in the triangulation. The three entries in each row are the indices of that simplex's vertices in the points array we just saw. (Note: in recent SciPy versions this attribute is named simplices rather than vertices.) So, for example, the first simplex in this array, [4, 3, 6], is composed of the points:
[ 1.5 , 0.8660254 ]
[ 1. , 0. ]
[ 2. , 0. ]
It's easy to see this by drawing the lattice on a piece of paper, labeling each point according to its index, and then tracing through each row in triang.vertices.
This is all the information we need to write the function I specified in my question.
It looks like:
def find_neighbors(pindex, triang):
    neighbors = list()
    for simplex in triang.vertices:
        if pindex in simplex:
            neighbors.extend([simplex[i] for i in range(len(simplex)) if simplex[i] != pindex])
            '''
            this is a one-liner for: if a simplex contains the point we're interested in,
            extend the neighbors list by appending all the *other* point indices in the simplex
            '''
    # now we just have to strip out all the duplicate indices and return the neighbors list:
    return list(set(neighbors))
And that's it! I'm sure the function above could do with some optimization; it's just what I came up with in a few minutes. If anyone has any suggestions, feel free to post them. Hopefully this helps somebody in the future who is as confused about this as I was.
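As a quick sanity check on the lattice above, the centre point (index 4) comes back with its six nearest neighbours:
>>> sorted(find_neighbors(4, triang))
[1, 2, 3, 5, 6, 7]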
The methods described above cycle through all the simplices, which could take very long when there is a large number of points. A better way might be to use Delaunay.vertex_neighbor_vertices, which already contains all the information about the neighbors. Unfortunately, extracting the information requires a little index arithmetic:
def find_neighbors(pindex, triang):
    return triang.vertex_neighbor_vertices[1][
        triang.vertex_neighbor_vertices[0][pindex]:
        triang.vertex_neighbor_vertices[0][pindex + 1]]
The following code demonstrates how to get the neighbor indices of some vertex (number 17, in this example):
import scipy.spatial
import numpy
import pylab

x_list = numpy.random.random(200)
y_list = numpy.random.random(200)

tri = scipy.spatial.Delaunay(numpy.array([[x, y] for x, y in zip(x_list, y_list)]))

pindex = 17
neighbor_indices = find_neighbors(pindex, tri)

pylab.plot(x_list, y_list, 'b.')
pylab.plot(x_list[pindex], y_list[pindex], 'dg')
pylab.plot([x_list[i] for i in neighbor_indices],
           [y_list[i] for i in neighbor_indices], 'ro')
pylab.show()
I know it's been a while since this question was posed. However, I just had the same problem and figured out how to solve it. Just use the (somewhat poorly documented) method vertex_neighbor_vertices of your Delaunay triangulation object (let us call it 'tri').
It will return two arrays:
def get_neighbor_vertex_ids_from_vertex_id(vertex_id, tri):
    index_pointers, indices = tri.vertex_neighbor_vertices
    result_ids = indices[index_pointers[vertex_id]:index_pointers[vertex_id + 1]]
    return result_ids
The neighbor vertices of the point with index vertex_id are stored somewhere in the second array, which I named 'indices'. But where? This is where the first array (which I called 'index_pointers') comes in. The starting position (in the second array 'indices') is index_pointers[vertex_id]; the first position past the relevant sub-array is index_pointers[vertex_id + 1]. So the solution is indices[index_pointers[vertex_id]:index_pointers[vertex_id + 1]].
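Usage is then a one-liner per vertex, e.g. for the (arbitrary) vertex index 4:
neighbour_ids = get_neighbor_vertex_ids_from_vertex_id(4, tri)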
Here is an elaboration on @astrofrog's answer. This also works in more than 2D.
It took about 300 ms on a set of 2430 points in 3D (about 16000 simplices).
from collections import defaultdict

def find_neighbors(tess):
    neighbors = defaultdict(set)
    for simplex in tess.simplices:
        for idx in simplex:
            other = set(simplex)
            other.remove(idx)
            neighbors[idx] = neighbors[idx].union(other)
    return neighbors
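Usage, assuming tess is a Delaunay triangulation built from your points:
all_neighbors = find_neighbors(tess)
print(all_neighbors[0])  # the set of vertex indices adjacent to vertex 0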
Here is also a simple one-line version of James Porter's own answer, using a comprehension:
find_neighbors = lambda x, triang: list(set(indx for simplex in triang.simplices if x in simplex for indx in simplex if indx != x))
I needed this too and came across the following answer. It turns out that if you need the neighbors for all initial points, it's much more efficient to produce a dictionary of neighbors in one go (the following example is for 2D):
def find_neighbors(tess, points):
    neighbors = {}
    for point in range(points.shape[0]):
        neighbors[point] = []
    for simplex in tess.simplices:
        neighbors[simplex[0]] += [simplex[1], simplex[2]]
        neighbors[simplex[1]] += [simplex[2], simplex[0]]
        neighbors[simplex[2]] += [simplex[0], simplex[1]]
    return neighbors
The neighbors of point v are then neighbors[v]. For 10,000 points this takes 370 ms to run on my laptop. Maybe others have ideas on optimizing this further?
All the answers here are focused on getting the neighbors of one point (except astrofrog's, but that is in 2D, and this is 6x faster); however, it's equally expensive to get a mapping of all points → all neighbors.
You can do this with:
from collections import defaultdict
from itertools import permutations

tri = Delaunay(...)

_neighbors = defaultdict(set)
for simplex in tri.vertices:
    for i, j in permutations(simplex, 2):
        _neighbors[i].add(j)

points = [tuple(p) for p in tri.points]
neighbors = {}
for k, v in _neighbors.items():
    neighbors[points[k]] = [points[i] for i in v]
This works in any dimension, and this solution, finding all neighbors of all points, is faster than finding only the neighbors of one point (the accepted answer of James Porter).
Here's mine; it takes around 30 ms on a cloud of 11,000 points in 2D.
It gives you a 2xP array of indices, where P is the number of pairs of neighbours that exist.
def get_delaunay_neighbour_indices(vertices: "Array['N,D', int]") -> "Array['2,P', int]":
    """
    Find each pair of neighbouring vertices in the Delaunay triangulation.
    :param vertices: The vertices of the points to perform the Delaunay triangulation on
    :return: The pairs of indices of vertices
    """
    tri = Delaunay(vertices)
    spacing_indices, neighbours = tri.vertex_neighbor_vertices
    ixs = np.zeros((2, len(neighbours)), dtype=int)
    # the argmax is unfortunately needed when multiple final elements are the same
    np.add.at(ixs[0], spacing_indices[1:int(np.argmax(spacing_indices))], 1)
    ixs[0, :] = np.cumsum(ixs[0, :])
    ixs[1, :] = neighbours
    assert np.max(ixs) < len(vertices)
    return ixs
We can find one simplex containing the vertex (tri.vertex_to_simplex[vertex]) and then recursively search the neighbors of this simplex (tri.neighbors) to find other simplices containing the vertex.
from scipy.spatial import Delaunay

tri = Delaunay(points)  # points is the list of input points

neighbors = []  # neighbors for all vertices
for vertex in range(len(points)):  # vertex index
    vertexneighbors = []  # neighbors of this vertex
    neighbour1 = -1
    neighbour2 = -1
    firstneighbour = -1
    neighbour1index = -1
    currentsimplexno = tri.vertex_to_simplex[vertex]
    # find the starting neighbors in the first simplex containing the vertex
    for i in range(0, 3):
        if tri.simplices[currentsimplexno][i] == vertex:
            firstneighbour = tri.simplices[currentsimplexno][(i + 1) % 3]
            vertexneighbors.append(firstneighbour)
            neighbour1index = (i + 1) % 3
            neighbour1 = tri.simplices[currentsimplexno][(i + 1) % 3]
            neighbour2 = tri.simplices[currentsimplexno][(i + 2) % 3]
    # walk around the vertex through neighboring simplices until we return
    # to the first neighbor
    while neighbour2 != firstneighbour:
        vertexneighbors.append(neighbour2)
        currentsimplexno = tri.neighbors[currentsimplexno][neighbour1index]
        for i in range(0, 3):
            if tri.simplices[currentsimplexno][i] == vertex:
                neighbour1index = (i + 1) % 3
                neighbour1 = tri.simplices[currentsimplexno][(i + 1) % 3]
                neighbour2 = tri.simplices[currentsimplexno][(i + 2) % 3]
    neighbors.append(vertexneighbors)

print(neighbors)
