I'm trying to find the distance of a point (in 4 dimensions, only 2 are shown here; any of the coloured crosses in the figure) to a supposed Pareto frontier (black line). This line represents the best representation of the Pareto frontier found so far during an optimization process.
Pareto = [[0.3875575798354123, -2.4122340425531914], [0.37707675586149786, -2.398936170212766], [0.38176077842761763, -2.4069148936170213], [0.4080534133844003, -2.4914285714285715], [0.35963459448268725, -2.3631532329495126], [0.34395217638838566, -2.3579931972789114], [0.32203302106516224, -2.344858156028369], [0.36742404637441123, -2.3886054421768708], [0.40461156254852226, -2.4141156462585034], [0.36387868122767975, -2.375], [0.3393199109776927, -2.348404255319149]]
Right now, I calculate the distance from any point to the Pareto frontier like this:
import numpy as np

def dominates(row, rowCandidate):
    return all(r >= rc for r, rc in zip(row, rowCandidate))

def dist2Pareto(pareto, candidate):
    listDist = []
    dominateN = 0
    dominatePoss = 0
    if len(pareto) >= 2:
        for i in pareto:
            if i != candidate:
                dominatePoss += 1
                dominate = dominates(candidate, i)
                if dominate:
                    dominateN += 1
                listDist.append(np.linalg.norm(np.array(i) - np.array(candidate)))
        listDist.sort()
        if dominateN == len(pareto):
            print("beyond")
            return listDist[0]
        else:
            return listDist[0]
Here I calculate the distance to each point of the black line and retrieve the shortest one (the distance to the closest point of the known frontier).
However, I feel I should calculate the distance to the closest line segment instead. How would I go about achieving this?
The formula for the distance to the nearest point on the line is given here. Specifically, you are interested in the one called "line defined by two points". For posterity, with segment endpoints (x1, y1), (x2, y2) and candidate point (x0, y0), the formula is:
distance = |(x2 - x1)·(y1 - y0) - (x1 - x0)·(y2 - y1)| / sqrt((x2 - x1)² + (y2 - y1)²)
Because the frontier is relatively simple, you can loop through each two-point line segment in the frontier, and calculate the closest distance for each, keeping the smallest. You could introduce other constraints / pre-computations to limit the number of calculations required.
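If it helps, here is a minimal sketch of that loop, assuming NumPy and assuming the frontier points chain into consecutive segments once sorted by the first objective; clamping the projection parameter t to [0, 1] is what turns the point-to-line formula into a point-to-segment distance:
import numpy as np

def dist_to_segment(p, a, b):
    # Distance from point p to the segment a-b (works in any dimension)
    p, a, b = np.asarray(p, float), np.asarray(a, float), np.asarray(b, float)
    ab = b - a
    denom = np.dot(ab, ab)
    if denom == 0.0:                      # a and b coincide
        return np.linalg.norm(p - a)
    t = np.clip(np.dot(p - a, ab) / denom, 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def dist_to_frontier(pareto, candidate):
    # Shortest distance from candidate to the piecewise-linear frontier
    pareto = sorted(pareto)               # sort by first objective to chain the segments
    return min(dist_to_segment(candidate, pareto[i], pareto[i + 1])
               for i in range(len(pareto) - 1))
For example, dist_to_frontier(Pareto, [0.36, -2.40]) returns the distance to the nearest segment rather than to the nearest vertex.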
I am trying to check if there is any Matlab/Python procedure to underestimate f(x) by using a piecewise linear function g(x). That is, g(x) needs to be less than or equal to f(x). See the picture and code below. Could you please help me modify this code to find how to underestimate this function?
f = @(x) log2(x);  %# assumed: the answer below uses abs(log2(x)) as the test function
x = 0.000000001:0.001:1;
y = abs(f(x));
%# Find section sizes, by using an inverse of the approximation of the derivative
numOfSections = 5;
totalRange = max(x(:))-min(x(:));
%# NOTE: the original adaptive computation of sectionSize is not shown here;
%# equal-width sections are used as a placeholder so the snippet runs
sectionSize = repmat(totalRange/numOfSections, 1, numOfSections);
%# The relevant nodes
xNodes = x(1) + [ 0 cumsum(sectionSize)];
yNodes = abs(f(xNodes));
figure;plot(x,y);
hold on;
plot (xNodes,yNodes,'r');
scatter (xNodes,yNodes,'r');
legend('abs(f(x))','adaptive linear interpolation');
This approach is based on Luis Mendo's comment. The idea is the following:
Select a number of points from the original curve, your final piecewise linear curve will pass through these points
For each point calculate the equation of the tangent to the original curve. Because your graph is convex, the tangents of consecutive points in your sample will intersect below the curve
Calculate, for each set of consecutive tangents, the x-coordinate of the point of intersection. Use the equation of the tangent to calculate the corresponding y-coordinate
Now, after reordering the points, this gives you a piecewise linear approximation with the constraints you want.
h = 0.001;
x = 0.000000001:h:1;
y = abs(log2(x));
% Derivative of function on all the points
der = diff(y)/h;
NPts = 10; % Number of sample points
% Draw the index of the points by which the output will pass at random
% Still make sure you got first and last point
idx = randperm(length(x)-3,NPts-2);
idx = [1 idx+1 length(x)-1];
idx = sort(idx);
x_pckd = x(idx);
y_pckd = y(idx);
der_pckd = der(idx);
% Use obscure math to calculate the points of intersection
xder = der_pckd.*x_pckd;
x_2add = -(diff(y_pckd)-(diff(xder)))./diff(der_pckd);
y_2add = der_pckd(1:(end-1)).*(x_2add-(x_pckd(1:(end-1))))+y_pckd(1:(end-1));
% Calculate the error as the sum of the errors made at the middle points
Err_add = sum(abs(y_2add-interp1(x,y,x_2add)));
% Get final x and y coordinates of interpolant
x_final = [reshape([x_pckd(1:end-1);x_2add],1,[]) x_pckd(end)];
y_final = [reshape([y_pckd(1:end-1);y_2add],1,[]) y_pckd(end)];
figure;
plot(x,y,'-k');
hold on
plot(x_final,y_final,'-or')
You can see in my code that the points are drawn at random. If you want to do some sort of optimization (e.g. find the set of points that minimizes the error), you can just run this a large number of times and keep track of the best contender. For example, after 10000 random draws, the best contender looked like this:
Given a 10x10 grid (2D array) filled randomly with the numbers 0, 1, or 2, how can I find the Euclidean distance (the l2-norm of the distance vector) between two given points while taking periodic boundaries into account?
Let us consider an arbitrary grid point called centre. Now, I want to find the nearest grid point containing the same value as centre. I need to take periodic boundaries into account, such that the matrix/grid can be seen rather as a torus instead of a flat plane. In that case, say the centre = matrix[0,2], and we find that there is the same number in matrix[9,2], which would be at the southern boundary of the matrix. The Euclidean distance computed with my code would be for this example np.sqrt(0**2 + 9**2) = 9.0. However, because of periodic boundaries, the distance should actually be 1, because matrix[9,2] is the northern neighbour of matrix[0,2]. Hence, if periodic boundary values are implemented correctly, distances of magnitude above 8 should not exist.
So, I would be interested in how to implement in Python a function to compute the Euclidean distance between two arbitrary points on a torus by applying a wrap-around at the boundaries.
import numpy as np

matrix = np.random.randint(0,3,(10,10))
centre = matrix[0,2]
#rewrite the centre to be the number 5 (to exclude itself as shortest distance)
matrix[0,2] = 5
#find the points where entries are the same as the centre
same = np.where(matrix == centre)
idx_row, idx_col = same
#find distances from the centre to all cells with the same value
dist = np.zeros(len(same[0]))
for i in range(0, len(same[0])):
    delta_row = same[0][i] - 0 #row coord of centre
    delta_col = same[1][i] - 2 #col coord of centre
    dist[i] = np.sqrt(delta_row**2 + delta_col**2)
#retrieve the index of the smallest distance
idx = dist.argmin()
print('Centre value: %i. The nearest cell with same value is at (%i,%i)'
      % (centre, same[0][idx], same[1][idx]))
For each axis, you can check whether the distance is shorter when you wrap around or when you don't. Consider the row axis, with rows i and j.
When not wrapping around, the difference is abs(i - j).
When wrapping around, the difference is "flipped", as in 10 - abs(i - j). In your example with i == 0 and j == 9 you can check that this correctly produces a distance of 1.
Then simply take whichever is smaller:
delta_row = abs(same[0][i] - 0)  #row coord of centre
delta_row = min(delta_row, 10 - delta_row)
And similarly for delta_column.
The final dist[i] calculation needs no changes.
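Putting the two cases together, a minimal helper for the wrapped distance might look like this (a sketch, assuming a square n-by-n grid):
import numpy as np

def torus_distance(p, q, n=10):
    # Euclidean distance between grid points p and q with wrap-around
    delta = np.abs(np.asarray(p) - np.asarray(q))
    delta = np.minimum(delta, n - delta)  # take the shorter way around each axis
    return np.sqrt(np.sum(delta**2))

# Example from the question: rows 0 and 9 are neighbours on the torus.
print(torus_distance((0, 2), (9, 2)))  # 1.0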
I have a working 'sketch' of how this could work. In short, I calculate the distance 9 times: once for the normal distance, and 8 more times with shifted copies to possibly find a closer 'torus' distance.
As n gets larger, the calculation cost can go sky high. But the torus effect is probably not needed, as there is usually a point nearby without any wrap-around.
You can easily test this, because for a grid of size 1, if a point is found at distance 1/2 or closer, you know there is no closer torus point (right?)
import numpy as np
from sklearn.neighbors import BallTree  # used by the torus distance function below

n = 10000
np.random.seed(1)
A = np.random.randint(low=0, high=10, size=(n,n))
I create a 10000x10000 grid of points and store the locations of the matching cells (here the zeros) in ONES.
ONES = np.argwhere(A == 0)
Now I define my torus distance, which checks which of the 9 mirrors gives the closest point.
def distance_on_torus(point=[500, 500]):
    # The 8 mirror shifts: which axes to shift and in which direction
    index_diff = [[1],[1],[0],[0],[0,1],[0,1],[0,1],[0,1]]
    coord_diff = [[-1],[1],[-1],[1],[-1,-1],[-1,1],[1,-1],[1,1]]

    tree = BallTree(ONES, leaf_size=5*n, metric='euclidean')
    dist, indi = tree.query([point], k=1, return_distance=True)
    distances = [dist[0]]

    for indici_to_shift, coord_direction in zip(index_diff, coord_diff):
        MIRROR = ONES.copy()
        for i, shift in zip(indici_to_shift, coord_direction):
            MIRROR[:, i] = MIRROR[:, i] + (shift * n)
        tree = BallTree(MIRROR, leaf_size=5*n, metric='euclidean')
        dist, indi = tree.query([point], k=1, return_distance=True)
        distances.append(dist[0])

    return np.min(distances)
%%time
distance_on_torus([2,3])
It is slow: the above takes 15 minutes. For n = 1000 it runs in less than a second.
An optimisation would be to first consider the non-torus distance, and only if that minimum might not be the smallest, recompute with the minimal set of extra 'blocks' around. This would greatly increase speed.
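A rough sketch of that pruning idea (my own reasoning, untested at n = 10000): any wrapped neighbour has to cross a grid edge, so its distance is at least the query point's separation from the nearest edge, and the mirror trees are only needed when the plain nearest distance exceeds that bound.
import numpy as np
from sklearn.neighbors import BallTree

def nearest_with_pruning(point, ONES, n):
    tree = BallTree(ONES, metric='euclidean')
    dist, _ = tree.query([point], k=1)
    d_plain = dist[0][0]
    r, c = point
    # Lower bound on any wrapped (mirror) distance: the separation across
    # the nearest grid edge along either axis.
    wrap_bound = min(r + 1, n - r, c + 1, n - c)
    if d_plain <= wrap_bound:
        return d_plain  # no mirror copy can possibly be closer
    # ...otherwise fall back to the full 9-mirror search above
    return distance_on_torus(point)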
What is the most efficient way to compute the (Euclidean) distance of the nearest neighbor for each point in an array?
I have a list of 100k (X,Y,Z) points and I would like to compute a list of nearest neighbor distances. The index of the distance would correspond to the index of the point.
I've looked into PyOD and sklearn.neighbors, but those seem to require "teaching" (fitting a model). I think my problem is simpler than that. For each point: find the nearest neighbor, compute the distance.
Example data:
points = [
    (0.0,         0, 1322.1695),
    (0.006711111, 0, 1322.1696),
    (0.026844444, 0, 1322.1697),
    (0.0604,      0, 1322.1649),
    (0.107377778, 0, 1322.1651),
    (0.167777778, 0, 1322.1634),
    (0.2416,      0, 1322.1629),
    (0.328844444, 0, 1322.1631),
    (0.429511111, 0, 1322.1627),
    ...]
compute k = 1 nearest neighbor distances
result format:
results = [nearest neighbor distance]
example results:
results = [
    0.005939372,
    0.005939372,
    0.017815632,
    0.030118587,
    0.041569616,
    0.053475883,
    0.065324964,
    0.077200014,
    0.089077602
]
UPDATE:
I've implemented two of the approaches suggested.
Use scipy.spatial.cdist to compute the full distance matrix.
Use the nearest X neighbors within radius R to find the subset of neighbor distances for every point and return the smallest.
Results are that Method 2 is faster than Method 1 but took a lot more effort to implement (makes sense).
It seems the limiting factor for Method 1 is the memory needed to run the full computation, especially when my data set is approaching 10^5 (x, y, z) points. For my data set of 23k points, it takes ~ 100 seconds to capture the minimum distances.
For method 2, the speed scales as n_radius^2. That is, "neighbor radius squared", which really means that the algorithm scales ~ linearly with number of included neighbors. Using a Radius of ~ 5 (more than enough given application) it took 5 seconds, for the set of 23k points, to provide a list of mins in the same order as the point_list themselves. The difference matrix between the "exact solution" and Method 2 is basically zero.
Thanks for everyone's help!
Similar to Caleb's answer, but you could stop the iterative loop if you get a distance greater than some previous minimum distance (sorry - no code).
I used to program video games. It would take too much CPU to calculate the actual distance between two points. What we did was divide the "screen" into larger Cartesian squares and avoid the actual distance calculation if the Delta-X or Delta-Y was "too far away" - that's just subtraction, so maybe something like that could qualify where the actual Euclidean distance metric calculation is needed (extend to n dimensions as needed)?
EDIT - expanding "too far away" candidate pair selection comments.
For brevity, I'll assume a 2-D landscape.
Take the point of interest (X0,Y0) and "draw" an nxn square around that point, with (X0,Y0) at the origin.
Go through the initial list of points and form a list of candidate points that are within that square. While doing that, if the DeltaX [ABS(Xi-X0)] is outside of the square, there is no need to calculate the DeltaY.
If there are no candidate points, make the square larger and iterate.
If there is exactly one candidate point and it is within the radius of the circle inscribed by the square, that is your minimum.
If there are "too many" candidates, make the square smaller, but you only need to reexamine the candidate list from this iteration, not all the points.
If there are not "too many" candidates, then calculate the distance for that list. When doing so, first calculate DeltaX^2 + DeltaY^2 for the first candidate. If for subsequent candidates the DeltaX^2 is greater than the minimum so far, no need to calculate the DeltaY^2.
The minimum from that calculation is the minimum if it is within the radius of the circle inscribed by the square.
If not, you need to go back to a previous candidate list that includes points within the circle that has the radius of that minimum. For example, if you ended with one candidate in a 2x2 square that happened to be on the vertex X=1, Y=1, the distance/radius would be SQRT(2). So go back to a previous candidate list that has a square with side greater than or equal to 2xSQRT(2).
If warranted, generate a new candidate list that only includes points within the +/- SQRT(2) square.
Calculate the distance for those candidate points as described above - omitting any that exceed the minimum calculated so far.
No need to do the square root of the sum of the Delta^2 until you have only one candidate.
How to size the initial square, or if it should be a rectangle, and how to increase or decrease the size of the square/rectangle could be influenced by application knowledge of the data distribution.
I would consider recursive algorithms for some of this if the language you are using supports that.
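A minimal 2-D sketch of that square-filter idea (my own illustration; the answer above deliberately gives no code, and the sizing/growth parameters here are arbitrary):
def nearest_by_square(p0, points, half_side=1.0, grow=2.0):
    # Nearest-neighbour distance of p0, filtering candidates with a square first.
    # Assumes 2-D points and at least one point other than p0.
    x0, y0 = p0
    while True:
        # Cheap rejection: compare |dx| and |dy| against the half side length
        # before any squaring or square roots.
        candidates = [(x, y) for (x, y) in points
                      if (x, y) != p0
                      and abs(x - x0) <= half_side
                      and abs(y - y0) <= half_side]
        if candidates:
            d2 = min((x - x0) ** 2 + (y - y0) ** 2 for x, y in candidates)
            # Accept only if the minimum lies within the circle inscribed in the
            # square; otherwise a closer point could hide just outside the square.
            if d2 <= half_side ** 2:
                return d2 ** 0.5
        half_side *= grow  # enlarge the square and try again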
How about this?
from scipy.spatial import distance
A = (0.003467119 ,0.01422762 ,0.0101960126)
B = (0.007279433 ,0.01651597 ,0.0045558849)
C = (0.005392258 ,0.02149997 ,0.0177409387)
D = (0.017898802 ,0.02790659 ,0.0006487222)
E = (0.013564214 ,0.01835688 ,0.0008102952)
F = (0.013375397 ,0.02210725 ,0.0286032185)
points = [A, B, C, D, E, F]
results = []
for point in points:
    distances = [{'point': point, 'neighbor': p, 'd': distance.euclidean(point, p)}
                 for p in points if p != point]
    results.append(min(distances, key=lambda k: k['d']))
results will be a list of objects, like this:
results = [
{'point':(x1, y1, z1), 'neighbor':(x2, y2, z2), 'd':"distance from point to neighbor"},
...]
Where point is the reference point and neighbor is point's closest neighbor.
The fastest option available to you may be scipy.spatial.distance.cdist, which finds the pairwise distances between all of the points in its input. While finding all of those distances may not be the fastest algorithm to find the nearest neighbors, cdist is implemented in C, so it is likely to run faster than anything you try in Python.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array(...)
distances = cdist(points, points)
# An element is not its own nearest neighbor
np.fill_diagonal(distances, np.inf)
# Find the index of each element's nearest neighbor
mins = distances.argmin(0)
# Extract the nearest neighbors from the data by row indexing
nearest_neighbors = points[mins, :]
# Put the arrays in the specified shape
results = np.stack((points, nearest_neighbors), 1)
You could theoretically make this run faster (mostly by combining all of the steps into one algorithm), but unless you're writing in C, you won't be able to compete with SciPy/NumPy.
(cdist runs in Θ(n²) time (if the size of each point is fixed), and every other part of the algorithm in O(n) time, so even if you did try to optimize the code in Python, you wouldn't notice the change for small amounts of data, and the improvements would be overshadowed by cdist for more data.)
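If the full matrix becomes too large in memory (as the update above found around 10^5 points), a k-d tree query gives the same nearest-neighbour distances without materializing all pairs. This is not from the original answer, just a standard alternative using scipy.spatial.cKDTree:
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100000, 3)  # stand-in for the real (x, y, z) list
tree = cKDTree(points)
# k=2 because the closest match of each point is the point itself (distance 0)
dists, idx = tree.query(points, k=2)
nearest_neighbor_distances = dists[:, 1]  # same order as the input points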
I have written some code to find the geometric median for a set of weighted points. It is based on this Google Kick Start challenge. I don't want a better solution, but I want to know what is wrong in my code.
The code iterates against a given precision value of 10^-6 to arrive at a value close to the geometric median. The problem I face is that it returns the correct value for digits down to 10^-3, and after that it goes wrong. I cannot figure out what is going wrong. I also noted that changing the initialization value alters the result, but I don't know why. The code also holds good if the weight of the points is not considered.
Here is the formula I use to find the distance to each point: max(abs(i.x - k.x), abs(i.y - k.y)) * weight_of_i (it is the Chebyshev distance).
Here is the iteration function I used:
# c = previous centre, stp = previous step, listy_r = list of points (x, y, wt), k = previous sum of distances
def move_ct(c, stp, listy_r, k):
    # Calculates the minimum centre iteratively; returns c -> centre, stp -> step, k -> sum of distances
    while True:
        tmp = list()
        moves = [(c[0], c[1] + stp), (c[0], c[1] - stp),
                 (c[0] + stp, c[1]), (c[0] - stp, c[1])]
        for each in moves:
            tmp.append(sdist(listy_r, each))  # sdist = weighted sum of distances (defined elsewhere)
        tmp_min = min(tmp)
        if tmp_min < k:
            k = tmp_min
            index = tmp.index(tmp_min)
            c = moves[index]
            break
        else:
            stp *= 0.5
    return (c, stp, k)
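For reference, sdist is not shown in the question; based on the formula above it should be the weighted sum of Chebyshev distances from a candidate centre to all points, roughly:
def sdist(listy_r, c):
    # Weighted sum of Chebyshev distances from centre c to all points (x, y, wt)
    return sum(max(abs(x - c[0]), abs(y - c[1])) * wt for x, y, wt in listy_r)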
Here are the values I initialized:
initial geometric centre = centroid of the weighted points
precision = 10**-6
step = half of distance between highest and lowest coordinates on x,y
I have attached an input text file here that contains 10000 points (it is test case 1 for the large input of the challenge) in the format
one point for each line and each point has 3 parameters (x,y,weight)
eg: 980.69 595.86 619.03 where
980.69 = x coordinate
595.86 = y coordinate
619.03 = weight
The result for the 10000 points should be 3288079343.471880, but my code gives 3288079343.4719906. Notice it is off only after 10^-3.
I have a large quantity of pixel colors (96 thousand different colors):
And I want to get some kind of a mathematically-defined probability region like in this question:
The main obstacle I see right now: all the methods I can find on Google are mainly about visualisations and two-dimensional spaces, and there is no algorithm for finding the coefficients of an equation like:
a1·x² + b1·y² + c1·z² + a2·xy + b2·xz + c2·yz + a3·x + b3·y + c3·z = 0
And this paper is too difficult for me to implement in Python. :(
Anyway, what I just want is to determine whether a given pixel lies more or less within the color range I have.
I tried doing it with scikit-learn clustering, but I failed, probably due to having only one set of data. And creating an array of 256³ elements representing each pixel color seems the wrong way.
I wonder if there is an easy way to determine boundaries of this point cluster?
Or maybe I'm just overthinking it and there is something like the OpenCV cv2.inRange() function?
This can be solved by optimization and fitting of the ellipsoid polynomial. However, I would start with a geometrical approach, which is much faster:
find avg point position
that will be the center of your ellipsoid
p0 = sum (p[i]) / n // average
i = { 0,1,2,3,...,n-1 } // of all points
If your point density is not homogeneous then it is safer to use the bounding box center instead. So find xmin,ymin,zmin,xmax,ymax,zmax and the middle between them is your center.
find most distant point to center
that will give you main semi axis
pa = p[j];
|p[j]-p0| >= |p[i]-p0| // max
i = { 0,1,2,3,...,n-1 } // of all points
find the second semi-axis
so the vector pa-p0 is normal to the plane in which the other semi-axes should lie. So find the most distant point from p0 within that plane:
pb = p[j];
|p[j]-p0| >= |p[i]-p0| // max
dot(pa-p0,p[j]-p0) == 0 // but only if inside the plane
i = { 0,1,2,3,...,n-1 } // from all points
beware that the result of dot product may not be precisely zero so it is better to test against something like this:
|dot(pa-p0,p[j]-p0)| <= 1e-3
You can use any threshold you want (should be based on the ellipsoid size).
find last semi-axis
So we know that last semi-axis should be perpendicular to both
(pa-p0) AND (pb-p0)
So find point such that:
pc = p[j];
|p[j]-p0| >= |p[i]-p0| // max
dot(pa-p0,p[j]-p0) == 0 // but only if inside the plane (perpendicular to the a semi-axis)
dot(pb-p0,p[j]-p0) == 0 // and perpendicular also to b semi-axis
i = { 0,1,2,3,...,n-1 } // from all points
Ellipsoid
Now you have all the parameters you need to form your ellipsoid. vectors
(pa-p0),(pb-p0),(pc-p0)
are the basis vectors of your ellipsoid (you can make them perpendicular by using the cross product). Their sizes give you the radii. And p0 is the center. You can also use this parametric equation:
a=pa-p0;
b=pb-p0;
c=pc-p0;
p(u,v) = p0 + a*cos(u)*cos(v)
+ b*cos(u)*sin(v)
+ c*sin(u);
u = < -0.5*PI , +0.5*PI >
v = < 0.0 , 2.0*PI >
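A rough NumPy translation of the steps above (a sketch with a relative tolerance for the perpendicularity tests; a real implementation should grow the tolerance if no point passes it):
import numpy as np

def ellipsoid_axes(pts, tol=1e-3):
    pts = np.asarray(pts, dtype=float)
    p0 = pts.mean(axis=0)              # centre: average point position
    d = pts - p0
    r = np.linalg.norm(d, axis=1)

    a = d[np.argmax(r)]                # main semi-axis: most distant point

    # second semi-axis: most distant point that is (nearly) perpendicular to a
    mask_b = np.abs(d @ a) <= tol * np.linalg.norm(a) * r
    b = d[mask_b][np.argmax(r[mask_b])]

    # third semi-axis: perpendicular to both a and b
    mask_c = mask_b & (np.abs(d @ b) <= tol * np.linalg.norm(b) * r)
    c = d[mask_c][np.argmax(r[mask_c])]
    return p0, a, b, c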
This whole process is just O(n), and the results can be used as a starting point for both optimization and fitting, to speed them up without loss of accuracy. If you want to further improve accuracy, see:
How approximation search works
The sub-links show you examples of fitting ...
You can also take a look at this:
Algorithms: Ellipse matching
which is basically similar to your task, but only in 2D; still, it may bring you some ideas.
Here is a non-strict solution with a fast and simple random search approach*. Best side: no heavy linear algebra library required**. It seems to have worked fine for mesh collision detection.
It assumes that the ellipsoid center matches the cloud center and then uses some sort of mirrored average to search for the main axis.
The full working code is slightly bigger and placed on git; the idea of the main axis search is here:
import numpy as np

np.random.shuffle(pts)
pts_len = len(pts)
pt_average = np.sum(pts, axis=0) / pts_len
vec_major = pt_average * 0
minor_max, major_max = 0, 0

# may be improved with overlapped pass
for pt_cur in pts:
    vec_cur = pt_cur - pt_average
    proj_len, rej_len = proj_length(vec_cur, vec_major)  # projection / rejection onto the running axis estimate
    if proj_len < 0:
        vec_cur = -vec_cur
    vec_major += (vec_cur - vec_major) / pts_len
    major_max = max(major_max, abs(proj_len))
    minor_max = max(minor_max, rej_len)
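proj_length is not shown in this excerpt (the full code is on the linked git); from the way it is used above it should return the scalar projection of a vector onto the current major axis and the length of the rejection, roughly:
import numpy as np

def proj_length(vec, axis):
    # Scalar projection of vec onto axis, and the length of the rejection
    axis_len = np.linalg.norm(axis)
    if axis_len == 0:                 # first iteration: no axis estimate yet
        return 0.0, np.linalg.norm(vec)
    proj = np.dot(vec, axis) / axis_len
    rej = np.linalg.norm(vec - proj * axis / axis_len)
    return proj, rej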
It can be improved/optimized even more at some points. Examples of what it will produce:
And full experiment code with plots
*i.e. adjusting code lines randomly until they work
**was actually reason to figure out this solution