adding points per pixel along some axis for 2D polygon - python

Assume I have an open polygon represented by a list of 2D points; e.g., the representation of some kind of triangle polygon without its base would be:
import numpy as np
polygon_arr = np.array([[0,0], [15,10], [2,4]])
I'm looking for an elegant way to enrich the representation, i.e. to add points to polygon_arr such that the polygon itself doesn't change, but for every y value (within the polygon's range) there is a matching point in polygon_arr.
Example:
simple_line_polygon = np.array([[0,0], [10,5]])
enriched_representation = foo(simple_line_polygon)
# foo() should return: np.array([[0,0], [2,1], [4,2], [6,3], [8,4], [10,5]])
I thought of considering each pair of adjacent points in the polygon, constructing a line equation (y = mx + n) and sampling it for each y within the range; then treating special cases such as two vertically aligned points (where the line equation is undefined) and points that are already closer to each other than a one-pixel change in y. However, this is not so elegant, and I would appreciate better ideas.

There is no need for a line equation here. You can just scale the x and y distances between points separately. If there should be a minimum distance between the points, you can check that by computing the Euclidean distance between corners. Here is a small function that hopefully does what you are after:
import numpy as np

def enrich_polygon(polygon, maxpoints=5, mindist=1):
    result = []
    ## looping over all lines of the polygon:
    for start, end in zip(polygon, np.vstack([polygon[1:], polygon[:1]])):
        dist = np.sqrt(np.sum((start - end)**2))     ## distance between points
        N = int(min(maxpoints + 1, dist / mindist))  ## number of sub-sections
        if N < 2:  ## mindist already reached
            result += [start]
        else:
            ## generating the new points:
            ## put all points (including the original start) on the line into result
            result += [
                start + i * (end - start) / (N - 1) for i in range(N - 1)
            ]
    return np.array(result)

polygon_arr = np.array([[0,0], [15,10], [2,4]])
res = enrich_polygon(polygon_arr)
print(res)
polygon_arr = np.array([[0,0], [15,10], [2,4]])
res = enrich_polygon(polygon_arr)
print(res)
The function takes the original polygon and iterates over pairs of neighbouring corner points. If the distance between two corners is larger than mindist, new points will be added, up to maxpoints (the maximum number of points to be added per line). For the given example the result looks like this:
[[ 0. 0. ]
[ 3. 2. ]
[ 6. 4. ]
[ 9. 6. ]
[12. 8. ]
[15. 10. ]
[12.4 8.8 ]
[ 9.8 7.6 ]
[ 7.2 6.4 ]
[ 4.6 5.2 ]
[ 2. 4. ]
[ 1.33333333 2.66666667]
[ 0.66666667 1.33333333]]
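If you specifically need one interpolated point per integer y step, as in the simple_line_polygon example, the same scaling idea can be driven by the y-span of each segment instead of maxpoints. A minimal sketch (densify_per_y is my own name; the input is treated as an open polyline rather than a closed polygon):

import numpy as np

def densify_per_y(polyline):
    ## interpolate every segment of an open polyline so that
    ## consecutive points are at most one y-unit apart
    result = []
    for start, end in zip(polyline[:-1], polyline[1:]):
        n = max(int(abs(end[1] - start[1])), 1) + 1  ## one sample per y pixel
        t = np.linspace(0, 1, n)[:-1]  ## drop the end; it starts the next segment
        result.append(start + np.outer(t, end - start))
    result.append(polyline[-1:])  ## keep the final corner
    return np.vstack(result)

print(densify_per_y(np.array([[0, 0], [10, 5]])))
## -> [[ 0.  0.] [ 2.  1.] [ 4.  2.] [ 6.  3.] [ 8.  4.] [10.  5.]]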


How to rotate a rotation by an Euler rotation in Python?

What I want
Input: non-normalized axis rotation
Output: quaternion rotation, but additionally rotated by -90 degrees about the y-axis (Euler)
What I have
#!/usr/bin/env python3
#from math import radians, degrees, cos, sin, atan2, asin, pow, floor
#import numpy as np
from scipy.spatial.transform import Rotation
#r = Rotation.from_rotvec(rotation_by_axis).as_quat()
r = Rotation.from_quat([-0.0941422, 0.67905384, -0.2797612, 0.67212856]) # example
print("Input (as Euler): " + str(r.as_euler('xyz', degrees=True)))
print("Output (as Euler): " + str(r.apply([0, -90, 0])))
The result:
Input (as Euler): [-83.23902624 59.33323676 -98.88314731]
Output (as Euler): [-22.33941658 -74.31676511 45.58474405]
How to get the output [-83.23902624 -30.66676324 -98.88314731] instead?
Bad workaround
This works only sometimes (why?).
rotation = r.from_quat([rotation.x, rotation.y, rotation.z, rotation.w])
rotation = rotation.as_euler('xyz', degrees=True)
print(rotation)
rotation = r.from_euler('xyz', [rotation[0], rotation[1]-90, rotation[2]], degrees=True)
print(rotation.as_euler('xyz', degrees=True))
rotation = rotation.as_quat()
How to do it in a better way?
Because sometimes I get wrong values:
[ -8.25897711 -16.54712028 -1.90525288]
[ 171.74102289 -73.45287972 178.09474712]
[ -7.18492129 22.22525264 0.44373851]
[ -7.18492129 -67.77474736 0.44373851]
[ 7.52491766 -37.71896037 -40.86915413]
[-172.47508234 -52.28103963 139.13084587]
[ -1.79610826 37.83068221 31.20184248]
[ -1.79610826 -52.16931779 31.20184248]
[-113.5719734 -54.28744892 141.73007557]
[ 66.4280266 -35.71255108 -38.26992443]
[ -83.23903078 59.33323752 -98.88315157]
[ -83.23903078 -30.66676248 -98.88315157]
[ -9.67960912 -7.23784945 13.56800885]
[ 170.32039088 -82.76215055 -166.43199115]
[ -6.21695895 5.66996884 -11.16152822]
[ -6.21695895 -84.33003116 -11.16152822]
[ 0. 0. 0. ]
[ 0. -90. 0. ]
[ 0. 0. 0. ]
[ 0. -90. 0. ]
Here wrong:
[ -8.25897711 -16.54712028 -1.90525288]
[ 171.74102289 -73.45287972 178.09474712]
Here okay:
[ -7.18492129 22.22525264 0.44373851]
[ -7.18492129 -67.77474736 0.44373851]
I require it for this:
https://github.com/Arthur151/ROMP/issues/193#issuecomment-1156960708
apply is for applying a rotation to vectors; it won't work on, e.g., Euler rotation angles, which aren't "mathematical" vectors: it doesn't make sense to add or scale them as triples of numbers.
To combine rotations, use *. So, e.g., to rotate by an additional 20 degrees about a y-axis defined by the first rotation:
In [1]: import numpy as np
In [2]: np.set_printoptions(suppress=True) # don't show round-off
In [3]: from scipy.spatial.transform import Rotation
In [4]: def e(x,y,z): return Rotation.from_euler('xyz', [x,y,z], degrees=True)
In [5]: def s(r): return r.as_euler('xyz', degrees=True)
In [6]: display(s(e(0,20,0) * e(10,0,0)))
Out[6]: array([10., 20., 0.])
However, in general such a rotation won't simply add to the y-component of the total rotation. This is because the additional rotation's axes are defined by the first rotation, but the total rotation includes everything combined:
In [7]: s(e(0,20,0) * e(0,0,10))
Out[7]: array([ 3.61644157, 19.68349808, 10.62758414])
Combining rotations as shown above is quite standard; e.g., in a multi-jointed robot, to find the orientation of the final element, you'd use the "combining" technique shown above, with one rotation per joint, defined by the appropriate axes (e.g., z for a "hip" yawing rotation, x for a "wrist" rolling rotation).
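For example, chaining three such joint rotations with the helpers above (the angles are made up) recovers each component exactly, because scipy's * applies the right-hand factor first, and extrinsic 'xyz' Euler angles decompose a rotation as exactly x first, then y, then z about the fixed axes:

s(e(0, 0, 40) * e(0, 20, 0) * e(10, 0, 0))   # -> array([10., 20., 40.])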
If you do need to manipulate Euler angles, your "bad workaround" is fine. Bear in mind that the middle rotation in Euler representations is normally limited to be under 90 degrees in absolute value:
In [8]: s(e(0,135,0))
Out[8]: array([180., 45., 180.])
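For the specific output asked for (90 degrees subtracted from the y component), the workaround boils down to the following; a minimal sketch, keeping the gimbal-range caveat above in mind:

import numpy as np
from scipy.spatial.transform import Rotation

r = Rotation.from_quat([-0.0941422, 0.67905384, -0.2797612, 0.67212856])
ex, ey, ez = r.as_euler('xyz', degrees=True)  # decompose into Euler angles
shifted = Rotation.from_euler('xyz', [ex, ey - 90, ez], degrees=True)
print(shifted.as_euler('xyz', degrees=True))  # [-83.239... -30.667... -98.883...]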

Find three closest points which triangle contains a given point on a sphere

I have a 3D sphere with points on its surface. These points are represented in spherical coordinates, so azimuth, elevation and r.
For example my dataset is a matrix with all the available points on the given sphere:
azimuth elevation r
[[ 0. 90. 1.47 ]
[ 0. 85.2073787 1.47 ]
[ 0. 78.16966379 1.47 ]
[ 0. 70.30452954 1.47 ]
[ 0. 62.0367492 1.47 ]
[ 0. 53.56289304 1.47 ]
[ 0. 45. 1.47 ]
[ 0. 36.43710696 1.47 ]
[ 0. 27.9632508 1.47 ]
[ 0. 19.69547046 1.47 ]
[ 0. 11.83033621 1.47 ]
[ 0. 4.7926213 1.47 ]
[ 0. 0. 1.47 ]
[ 0. -4.7926213 1.47 ]
[ 0. -11.83033621 1.47 ]
[ 0. -19.69547046 1.47 ]
[ 0. -27.9632508 1.47 ]
[ 0. -36.43710696 1.47 ]
[ 0. -45. 1.47 ]
[ 0. -53.56289304 1.47 ]
[ 0. -62.0367492 1.47 ]
[ 0. -70.30452954 1.47 ]
[ 0. -78.16966379 1.47 ]
[ 0. -85.2073787 1.47 ]
[ 0. -90. 1.47 ]
[ 1.64008341 -1.6394119 1.47 ]
[ 1.64008341 1.6394119 1.47 ]
[ 2.37160039 8.01881788 1.47 ]
[ 2.37160039 -8.01881788 1.47 ]
[ 2.80356493 -15.58649429 1.47 ]
[ 2.80356493 15.58649429 1.47 ]
[ 3.16999007 23.70233802 1.47 ]
[ 3.16999007 -23.70233802 1.47 ]
[ 3.56208248 -32.09871039 1.47 ]
[ 3.56208248 32.09871039 1.47 ]
[ 4.04606896 40.63141594 1.47 ]
[ 4.04606896 -40.63141594 1.47 ]
[ 4.1063771 -4.09587122 1.47 ]
etc...
NB: I am omitting the full data matrix on purpose since it contains quite a lot of data. If needed/required to make the problem fully reproducible, I will provide the full data.
This matrix represents something like this image:
Given an arbitrary point, I would like to compute the 3 closest points in the dataset that "contain" the input point.
My code so far is the following:
def compute_three_closest_positions(self, azimuth_angle, elevation):
    requested_position = np.array([azimuth_angle, elevation, 0])
    # computing the absolute difference between the requested angles and the available ones in the dataset
    result = abs(self.sourcePositions - requested_position)  # subtracting between dataset and requested point
    result = np.delete(result, 2, 1)  # removing the r data
    result = result.sum(axis=1)       # summing the overall difference for each point
    # returning indices of the closest points
    indexes = result.argsort()[:3]
    closest_points = self.sourcePositions[indexes]
    return closest_points
Basically I am subtracting the requested azimuth and elevation from all the points in the matrix dataset (self.sourcePositions), then I sum these differences for each point, compute the indices of the 3 smallest sums, and then use these indices to access the points in my dataset.
The code works fine, the problem is that sometimes I get 3 closest points that do not contain the requested point.
Examples:
Wrong one:
Requested point: azimuth, elevation, distance
[200 0 1.47]
# As you might notice, the requested point is not inside the triangle created by the 3 closest points
Three closest points: azimuth, elevation, distance
[[199.69547046 0. 1.47 ]
[199.40203659 5.61508214 1.47 ]
[199.40203659 -5.61508214 1.47 ]]
Good one:
Requested position:
[190 0 1.47]
# As you can notice, in this case the requested point is inside the triangle generated by the closest 3 points
Three closest points:
[[191.83033621 0. 1.47 ]
[188.02560265 2.34839855 1.47 ]
[188.02560265 -2.34839855 1.47 ]]
How could I fix this problem? I would like to get the 3 closest points which "triangle" (I am on a spherical surface, so it is not a real triangle) contain my requested point.
For starters, azimuth+elevation does not sound right for this; I would prefer the terms latitude+longitude, because the ones you used mean something else.
Now, as mentioned in the comments, if your points form a regular grid, you can derive an equation that maps topology to point and back, where the topology is described by the integer index of your point in the array (or by 2 indexes). However, if you cannot infer this, or the grid is not regular, then you can do this:
reorder your array so it is sorted by latitude and longitude
Longitude is in <0, 2*PI> and latitude in <-PI/2, +PI/2>, so sort your array by both, with latitude having the bigger priority (weight). Let's call this new array pnt[].
map point p to the closest vertex of the sphere
Simply binary-search pnt[] until you find the point index ix with the biggest latitude smaller than or equal to p's. Then search linearly from ix (binary search could be used if you reorder pnt[] into slices, or remember how many points there are per latitude) until you find the biggest longitude that is smaller than or equal to p's.
Now pnt[ix] is the "closest" point on the sphere to p.
list the closest neighbors of pnt[ix]
pnt[ix] is on the left side in longitude, so pnt[ix+1] should be the next point (in case you run past the array size or cross a pole, you need to check the points by brute force, but only the last few in the array).
Now we just need to find the corresponding points below or above these 2 points (depending on which side your p is on). So find the 2 closest points to p in the same way as in step #2, but for the smaller and bigger latitude (one slice before, one after). This gets you 3*2 points, which form (always using the 2 points found first) 4 potential triangles.
test the possible triangles
So you have a potential triangle p0,p1,p2 that is "parallel" to the sphere surface. Simply project your point onto its plane:
u = p1-p0
v = p2-p0
v = cross(u,v)
v = cross(u,v)
p' = p0 + dot(p-p0,u)*u + dot(p-p0,v)*v
So u,v are basis vectors (normalize them to unit length before projecting) and p0 is the origin of the plane... Now just test whether p' is inside the triangle: either go to 2D and use barycentric coordinates, or exploit the cross product and check for CW/CCW like this:
if (dot(cross(p1-p0,p'-p0),p')>=0)
 if (dot(cross(p2-p1,p'-p1),p')>=0)
  if (dot(cross(p0-p2,p'-p2),p')>=0)
   point_is_inside;

if (dot(cross(p1-p0,p'-p0),p')<=0)
 if (dot(cross(p2-p1,p'-p1),p')<=0)
  if (dot(cross(p0-p2,p'-p2),p')<=0)
   point_is_inside;
So if all 3 sides of the triangle have the same CW/CCW-ness relative to p', you have found your triangle.
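A minimal NumPy sketch of the same projection and same-side test (the function name is mine; p, p0, p1, p2 are 3D position vectors on a sphere centred at the origin, as assumed above):

import numpy as np

def point_in_spherical_triangle(p, p0, p1, p2):
    # build an orthonormal in-plane basis u, v for the triangle's plane
    u = p1 - p0
    u = u / np.linalg.norm(u)
    v = np.cross(u, np.cross(u, p2 - p0))
    v = v / np.linalg.norm(v)
    # project p onto the plane: the p' from the answer
    d = p - p0
    pp = p0 + np.dot(d, u) * u + np.dot(d, v) * v
    # same CW/CCW-ness of all three edges relative to p'
    signs = [np.dot(np.cross(b - a, pp - a), pp)
             for a, b in ((p0, p1), (p1, p2), (p2, p0))]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)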

How to order an array of 2-dimensional arrays based on y-coordinate

I have code that iterates over some interval in x and calculates the difference between the computed and the expected value, y.
def find_alpha_max():
    alpha = 0
    alphamax = 1
    alphamin = 0
    n = 10
    alphastep = (alphamax - alphamin) / n
    data = np.zeros([n, 2])
    for i in range(1, len(data)):
        alpha = i * alphastep
        data[i, 0] = alpha
        data[i, 1] = abs(wavelen(alpha) - 121)
    print(data)

def wavelen(alpha):
    val1, val2 = energy_levels(alpha)
    transition = abs(val1 - val2)
    wavelength = (h * c / abs(transition * (1e-19))) / (1e-9)
    return wavelength
This gives the output:
[[ 0. 0. ]
[ 0.1 61.37844557]
[ 0.2 47.07247138]
[ 0.3 33.72835938]
[ 0.4 21.50950112]
[ 0.5 10.44535124]
[ 0.6 0.49201146]
[ 0.7 8.43091667]
[ 0.8 16.41859421]
[ 0.9 23.56862328]]
What I want to be able to do is sort the order of the arrays based on their y-value, from smallest to largest, so that I can obtain an x-value that I can use for numerical minimisation. How do I do this?
You can use the following:
sorted(your_array, key=lambda row: row[i])
This will let you sort by the i-th element of each one-dimensional array within your larger array; to sort by the y-value here, use i = 1.
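Since data is a NumPy array, an argsort-based variant avoids converting to a list; a minimal sketch with made-up rows:

import numpy as np

data = np.array([[0.1, 61.378], [0.6, 0.492], [0.5, 10.445]])
sorted_data = data[data[:, 1].argsort()]  # rows ordered by the y column
print(sorted_data[0, 0])                  # x-value with the smallest y -> 0.6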

How to find all neighbors of a given point in a delaunay triangulation using scipy.spatial.Delaunay?

I have been searching for an answer to this question but cannot find anything useful.
I am working with the python scientific computing stack (scipy, numpy, matplotlib) and I have a set of 2-dimensional points, for which I compute the Delaunay triangulation (wiki) using scipy.spatial.Delaunay.
I need to write a function that, given any point a, will return all other points which are vertices of any simplex (i.e. triangle) that a is also a vertex of (the neighbors of a in the triangulation). However, the documentation for scipy.spatial.Delaunay (here) is pretty bad, and I can't for the life of me understand how the simplices are being specified, or how I would go about doing this. Even just an explanation of how the neighbors, vertices and vertex_to_simplex arrays in the Delaunay output are organized would be enough to get me going.
Much thanks for any help.
I figured it out on my own, so here's an explanation for any future person who is confused by this.
As an example, let's use the simple lattice of points that I was working with in my code, which I generate as follows
import numpy as np
import itertools as it
from matplotlib import pyplot as plt
import scipy as sp
import scipy.spatial  # make sp.spatial available

inputs = list(it.product([0,1,2], [0,1,2]))
i = 0
lattice = [None] * len(inputs)  # list of lattice points
for pair in inputs:
    lattice[i] = mksite(pair[0], pair[1])  # mksite maps grid indices to 2D coordinates (definition omitted)
    i = i + 1
The details here are not really important; suffice it to say that it generates a regular triangular lattice in which the distance between a point and any of its six nearest neighbors is 1.
To plot it:
plt.plot(*np.transpose(lattice), marker='o', ls='')
plt.gca().set_aspect('equal')
Now compute the triangulation:
dela = sp.spatial.Delaunay
triang = dela(lattice)
Let's look at what this gives us.
triang.points
output:
array([[ 0. , 0. ],
[ 0.5 , 0.8660254 ],
[ 1. , 1.73205081],
[ 1. , 0. ],
[ 1.5 , 0.8660254 ],
[ 2. , 1.73205081],
[ 2. , 0. ],
[ 2.5 , 0.8660254 ],
[ 3. , 1.73205081]])
simple, just an array of all nine points in the lattice illustrated above. Now let's look at:
triang.vertices
output:
array([[4, 3, 6],
[5, 4, 2],
[1, 3, 0],
[1, 4, 2],
[1, 4, 3],
[7, 4, 6],
[7, 5, 8],
[7, 5, 4]], dtype=int32)
In this array, each row represents one simplex (triangle) in the triangulation. The three entries in each row are the indices of the vertices of that simplex in the points array we just saw. So for example the first simplex in this array, [4, 3, 6] is composed of the points:
[ 1.5 , 0.8660254 ]
[ 1. , 0. ]
[ 2. , 0. ]
It's easy to see this by drawing the lattice on a piece of paper, labeling each point according to its index, and then tracing through each row in triang.vertices.
This is all the information we need to write the function I specified in my question.
It looks like:
def find_neighbors(pindex, triang):
    neighbors = list()
    for simplex in triang.vertices:
        if pindex in simplex:
            neighbors.extend([simplex[i] for i in range(len(simplex)) if simplex[i] != pindex])
            '''
            this is a one-liner for: if a simplex contains the point we're interested in,
            extend the neighbors list by appending all the *other* point indices in the simplex
            '''
    # now we just have to strip out all the duplicate indices and return the neighbors list:
    return list(set(neighbors))
And that's it! I'm sure the function above could do with some optimization; it's just what I came up with in a few minutes. If anyone has any suggestions, feel free to post them. Hopefully this helps somebody in the future who is as confused about this as I was.
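For example, with the lattice above, the centre point (index 4) is a vertex of every simplex except two, so it should neighbor everything except the far corners 0 and 8 (set ordering may vary):

print(sorted(find_neighbors(4, triang)))  # -> [1, 2, 3, 5, 6, 7]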
The methods described above cycle through all the simplices, which could take very long if there is a large number of points. A better way might be to use Delaunay.vertex_neighbor_vertices, which already contains all the information about the neighbors. Unfortunately, extracting the information is a little awkward:
def find_neighbors(pindex, triang):
    return triang.vertex_neighbor_vertices[1][
        triang.vertex_neighbor_vertices[0][pindex]:
        triang.vertex_neighbor_vertices[0][pindex + 1]
    ]
The following code demonstrates how to get the indices of the neighbors of some vertex (number 17, in this example):
import scipy.spatial
import numpy
import pylab

x_list = numpy.random.random(200)
y_list = numpy.random.random(200)
tri = scipy.spatial.Delaunay(numpy.array([[x, y] for x, y in zip(x_list, y_list)]))
pindex = 17
neighbor_indices = find_neighbors(pindex, tri)

pylab.plot(x_list, y_list, 'b.')
pylab.plot(x_list[pindex], y_list[pindex], 'dg')
pylab.plot([x_list[i] for i in neighbor_indices],
           [y_list[i] for i in neighbor_indices], 'ro')
pylab.show()
I know it's been a while since this question was posed. However, I just had the same problem and figured out how to solve it. Just use the (somewhat poorly documented) method vertex_neighbor_vertices of your Delaunay triangulation object (let us call it 'tri').
It will return two arrays:
def get_neighbor_vertex_ids_from_vertex_id(vertex_id, tri):
    index_pointers, indices = tri.vertex_neighbor_vertices
    result_ids = indices[index_pointers[vertex_id]:index_pointers[vertex_id + 1]]
    return result_ids
The neighbor vertices to the point with the index vertex_id are stored somewhere in the second array that I named 'indices'. But where? This is where the first array (which I called 'index_pointers') comes in. The starting position (for the second array 'indices') is index_pointers[vertex_id], the first position past the relevant sub-array is index_pointers[vertex_id+1]. So the solution is indices[index_pointers[vertex_id]:index_pointers[vertex_id+1]]
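A small usage demo (the point set here is made up for illustration):

import numpy as np
from scipy.spatial import Delaunay

points = np.random.rand(20, 2)
tri = Delaunay(points)
print(get_neighbor_vertex_ids_from_vertex_id(0, tri))  # neighbor indices of vertex 0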
Here is an elaboration on @astrofrog's answer. This also works in more than 2D.
It took about 300 ms on a set of 2430 points in 3D (about 16000 simplices).
from collections import defaultdict

def find_neighbors(tess):
    neighbors = defaultdict(set)
    for simplex in tess.simplices:
        for idx in simplex:
            other = set(simplex)
            other.remove(idx)
            neighbors[idx] = neighbors[idx].union(other)
    return neighbors
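A quick usage sketch, assuming tess is an existing scipy.spatial.Delaunay object:

all_neighbors = find_neighbors(tess)
print(all_neighbors[17])  # set of neighbor indices of vertex 17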
Here is also a simple one line version of James Porter's own answer using list comprehension:
find_neighbors = lambda x, triang: list(set(indx for simplex in triang.simplices if x in simplex for indx in simplex if indx != x))
I needed this too and came across the following answer. It turns out that if you need the neighbors for all initial points, it's much more efficient to produce a dictionary of neighbors in one go (the following example is for 2D):
def find_neighbors(tess, points):
    neighbors = {}
    for point in range(points.shape[0]):
        neighbors[point] = []
    for simplex in tess.simplices:
        neighbors[simplex[0]] += [simplex[1], simplex[2]]
        neighbors[simplex[1]] += [simplex[2], simplex[0]]
        neighbors[simplex[2]] += [simplex[0], simplex[1]]
    return neighbors
The neighbors of point v are then neighbors[v]. For 10,000 points this takes 370 ms to run on my laptop. Maybe others have ideas on optimizing this further?
All the answers here are focused on getting the neighbors of one point (except astrofrog's, but that is 2D-only and this is 6x faster); however, it's equally expensive to get a mapping of all of the points → all neighbors.
You can do this with
from collections import defaultdict
from itertools import permutations
from scipy.spatial import Delaunay

tri = Delaunay(...)
_neighbors = defaultdict(set)
for simplex in tri.vertices:
    for i, j in permutations(simplex, 2):
        _neighbors[i].add(j)

points = [tuple(p) for p in tri.points]
neighbors = {}
for k, v in _neighbors.items():
    neighbors[points[k]] = [points[i] for i in v]
This works in any dimension, and this solution, finding all neighbors of all points, is faster than finding only the neighbors of one point (the accepted answer of James Porter).
Here's mine, it takes around 30ms on a cloud of 11000 points in 2D.
It gives you a 2xP array of indices, where P is the number of pairs of neighbours that exist.
import numpy as np
from scipy.spatial import Delaunay

def get_delaunay_neighbour_indices(vertices: "Array['N,D', int]") -> "Array['2,P', int]":
    """
    Find each pair of neighbouring vertices in the Delaunay triangulation.
    :param vertices: The vertices of the points to perform Delaunay triangulation on
    :return: The pairs of indices of vertices
    """
    tri = Delaunay(vertices)
    spacing_indices, neighbours = tri.vertex_neighbor_vertices
    ixs = np.zeros((2, len(neighbours)), dtype=int)
    np.add.at(ixs[0], spacing_indices[1:int(np.argmax(spacing_indices))], 1)  # the argmax is unfortunately needed when multiple final elements are the same
    ixs[0, :] = np.cumsum(ixs[0, :])
    ixs[1, :] = neighbours
    assert np.max(ixs) < len(vertices)
    return ixs
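As a follow-up (mine, not part of the answer): row 0 holds the source vertex of each pair and row 1 its neighbour, so the neighbours of a single vertex can be recovered with a boolean mask:

pairs = get_delaunay_neighbour_indices(points)  # points: your (N, D) array
neighbours_of_5 = pairs[1, pairs[0] == 5]       # all neighbours of vertex 5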
We can find one simplex containing the vertex (tri.vertex_to_simplex[vertex]) and then recursively search the neighbors of this simplex (tri.neighbors) to find other simplices containing the vertex.
from scipy.spatial import Delaunay

tri = Delaunay(points)  # points is the list of input points
neighbors = []  # list of neighbors for all vertices
for vertex in range(len(points)):  # vertex index
    vertexneighbors = []  # list of neighbors for this vertex
    neighbour1 = -1
    neighbour2 = -1
    firstneighbour = -1
    neighbour1index = -1
    currentsimplexno = tri.vertex_to_simplex[vertex]
    for i in range(0, 3):
        if tri.simplices[currentsimplexno][i] == vertex:
            firstneighbour = tri.simplices[currentsimplexno][(i + 1) % 3]
            vertexneighbors.append(firstneighbour)
            neighbour1index = (i + 1) % 3
            neighbour1 = tri.simplices[currentsimplexno][(i + 1) % 3]
            neighbour2 = tri.simplices[currentsimplexno][(i + 2) % 3]
    while neighbour2 != firstneighbour:
        vertexneighbors.append(neighbour2)
        currentsimplexno = tri.neighbors[currentsimplexno][neighbour1index]
        for i in range(0, 3):
            if tri.simplices[currentsimplexno][i] == vertex:
                neighbour1index = (i + 1) % 3
                neighbour1 = tri.simplices[currentsimplexno][(i + 1) % 3]
                neighbour2 = tri.simplices[currentsimplexno][(i + 2) % 3]
    neighbors.append(vertexneighbors)
print(neighbors)

scipy linkage format

I have written my own clustering routine and would like to produce a dendrogram. The easiest way to do this would be to use the scipy dendrogram function. However, this requires the input to be in the same format that the scipy linkage function produces. I cannot find an example of how the output of this is formatted. I was wondering whether someone out there can enlighten me.
I agree with https://stackoverflow.com/users/1167475/mortonjt that the documentation does not fully explain the indexing of intermediate clusters, while I do agree with https://stackoverflow.com/users/1354844/dkar that the format is otherwise precisely explained.
Using the example data from this question: Tutorial for scipy.cluster.hierarchy
A = np.array([[0.1, 2.5],
[1.5, .4 ],
[0.3, 1 ],
[1 , .8 ],
[0.5, 0 ],
[0 , 0.5],
[0.5, 0.5],
[2.7, 2 ],
[2.2, 3.1],
[3 , 2 ],
[3.2, 1.3]])
A linkage matrix can be built using single linkage (i.e., joining on the closest matching points):
import numpy as np
import scipy.cluster.hierarchy as hac
from numpy.linalg import norm
z = hac.linkage(A, method="single")
array([[ 7. , 9. , 0.3 , 2. ],
[ 4. , 6. , 0.5 , 2. ],
[ 5. , 12. , 0.5 , 3. ],
[ 2. , 13. , 0.53851648, 4. ],
[ 3. , 14. , 0.58309519, 5. ],
[ 1. , 15. , 0.64031242, 6. ],
[ 10. , 11. , 0.72801099, 3. ],
[ 8. , 17. , 1.2083046 , 4. ],
[ 0. , 16. , 1.5132746 , 7. ],
[ 18. , 19. , 1.92353841, 11. ]])
As the documentation explains, the clusters below n (here: 11) are simply the data points in the original matrix A. The intermediate clusters going forward are indexed successively.
Thus, clusters 7 and 9 (the first merge) are merged into cluster 11, and clusters 4 and 6 into cluster 12. Then observe line three, merging cluster 5 (from A) and cluster 12 (the intermediate cluster formed above), resulting in a within-cluster distance (WCD) of 0.5. The single method entails that the new WCD is 0.5, which is the distance between A[5] and the closest of cluster 12's points, A[4] and A[6]. Let's check:
In [198]: norm(A[5] - A[4])
Out[198]: 0.70710678118654757
In [199]: norm(A[5] - A[6])
Out[199]: 0.5
This cluster should now be intermediate cluster 13, which subsequently is merged with A[2]. Thus, the new distance should be the closest one between the point A[2] and the points A[4], A[5], A[6].
In [200]: norm(A[2] - A[4])
Out[200]: 1.019803902718557
In [201]: norm(A[2] - A[5])
Out[201]: 0.58309518948452999
In [202]: norm(A[2] - A[6])
Out[202]: 0.53851648071345048
Which, as can be seen, also checks out, and explains the intermediate format of new clusters.
This is from the scipy.cluster.hierarchy.linkage() function documentation, I think it's a pretty clear description for the output format:
A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
Do you need something more?
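Since the original goal was to feed a hand-built matrix to dendrogram, here is a minimal round-trip sketch (the Z rows are made up, but follow the format quoted above):

import numpy as np
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

# 4 observations (ids 0..3); each row is [cluster_i, cluster_j, distance, n_members]
Z = np.array([[0., 1., 0.5, 2.],   # merge 0 and 1 -> new cluster 4
              [2., 3., 0.8, 2.],   # merge 2 and 3 -> new cluster 5
              [4., 5., 1.5, 4.]])  # merge 4 and 5 -> new cluster 6
dendrogram(Z)
plt.show()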
The scipy documentation is accurate as dkar pointed out ... but it's a little bit difficult to turn the returned data into something that is usable for further analysis.
In my opinion they should include the ability to return the data in a tree-like data structure. The code below will iterate through the matrix and build a tree:
from scipy.cluster.hierarchy import linkage
import numpy as np

a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
centers = np.concatenate((a, b),)

def create_tree(centers):
    clusters = {}
    to_merge = linkage(centers, method='single')
    for i, merge in enumerate(to_merge):
        if merge[0] <= len(to_merge):
            # if it is an original point, read it from the centers array
            a = centers[int(merge[0])]
        else:
            # otherwise read the cluster that has been created
            a = clusters[int(merge[0])]
        if merge[1] <= len(to_merge):
            b = centers[int(merge[1])]
        else:
            b = clusters[int(merge[1])]
        # new clusters are numbered from len(centers) upwards by scipy
        clusters[1 + i + len(to_merge)] = {
            'children': [a, b]
        }
        # ^ you could optionally store other info here (e.g. distances)
    return clusters

print(create_tree(centers))
Here's another piece of code that performs the same function. This version tracks the distance (size) of each cluster (node_id) and confirms the number of members.
This uses the scipy linkage() function, which is the same foundation as the agglomerative clusterer.
from scipy.cluster.hierarchy import linkage
import numpy as np
import copy

Z = linkage(data_x, 'ward')  # data_x: the (n_points, n_features) input data
n_points = data_x.shape[0]
clusters = [dict(node_id=i, left=i, right=i, members=[i], distance=0,
                 log_distance=0, n_members=1) for i in range(n_points)]
for z_i in range(Z.shape[0]):
    row = Z[z_i]
    cluster = dict(node_id=z_i + n_points, left=int(row[0]), right=int(row[1]),
                   members=[], log_distance=np.log(row[2]), distance=row[2],
                   n_members=int(row[3]))
    cluster["members"].extend(copy.deepcopy(clusters[cluster["left"]]["members"]))
    cluster["members"].extend(copy.deepcopy(clusters[cluster["right"]]["members"]))
    clusters.append(cluster)

on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
[input and output images]
Consider the input to be the data for which you are interested in drawing a dendrogram. When you use linkage, it returns a matrix with four columns.
Columns 1 and 2 represent the formation of the clusters in order, i.e. 2 and 3 make a cluster first; this cluster is named 5 (2 and 3 are indices, i.e. the 2nd and 3rd rows); then 1 and 5 form the second cluster, which is named 6.
Column 3 represents the distance between the clusters.
Column 4 represents how many data points are involved in making this cluster.
[dendrogram image]
