Optimal method to find perpendicular distance between numpy poly1d and arbitrary point - python

I have a numpy.poly1d that can be any arbitrary 1-D polynomial. Given an arbitrary point P that does not lie on the curve itself, I wish to find the perpendicular (shortest) distance between the curve and P, and the point on the curve where that shortest-distance line meets it.
Now I could do something like the question Distance between a point and a curve in python, by sampling a set of points. But I don't believe that will be fast enough, considering that I have fewer than 100000 polynomials and fewer than 1000000 points per polynomial whose distances I wish to find. Can someone suggest a more efficient method?
Here is a sample illustration of what I am looking for: given P and the curve B(t), I wish to find the length of g(x) and the point Q.
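One way to avoid dense sampling: the squared distance from P to the curve is itself a polynomial in x, so its stationary points are exactly the roots of its derivative, which np.roots handles directly. A minimal sketch along those lines (the function name and test values are mine):

import numpy as np

def closest_point_on_poly(B, px, py):
    """Return the point Q on y = B(x) closest to P = (px, py) and the distance.

    The squared distance D(x) = (x - px)**2 + (B(x) - py)**2 is a polynomial,
    so the candidate x values are the real roots of D'(x).
    """
    dist2 = np.poly1d([1.0, -px]) ** 2 + (B - py) ** 2
    crit = dist2.deriv().roots                      # possibly complex candidates
    xs = crit[np.isclose(crit.imag, 0.0)].real      # keep only the real roots
    ys = B(xs)
    d = np.hypot(xs - px, ys - py)
    i = np.argmin(d)
    return (xs[i], ys[i]), d[i]

B = np.poly1d([1.0, 0.0, 0.0])                      # y = x**2
Q, g = closest_point_on_poly(B, 2.0, 0.5)           # Q on the curve, g = |PQ|

The per-point root finding (an eigenvalue problem on a small companion matrix) is still the dominant cost, so for very large batches it is worth profiling whether this actually beats dense sampling in practice.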

Related

Calculate maximum diameter in 3D binary mask

I want to calculate the maximum diameter of a 3D binary mask of a nodule (irregular shape).
I have implemented a function that calculates the distances between all pairs of boundary points. This method is very computationally expensive when dealing with tumors or larger volumes.
So my question is: what methods could calculate the maximum diameter of a 3D binary mask at a lower computational cost?
Something similar to a Gradient Descent could be implemented.
Start with 2 points (A and B), located randomly on the 3D mask.
For point A, calculate the direction to travel along the 3D mask that will most increase its distance from point B.
Make point A take a small step in that direction.
For point B, calculate the direction to travel along the 3D mask that will most increase its distance from point A.
Make point B take a small step in that direction.
Repeat until it converges.
This will very likely find a local maximum, so you would probably have to repeat the experiment several times to find the true global maximum; a rough sketch of this idea follows below.
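Here is a sketch of a closely related heuristic: instead of small steps, each point jumps straight to the boundary point farthest from the other one, with random restarts to escape local maxima. It assumes boundary_pts is a (K, 3) array of boundary voxel coordinates, e.g. np.argwhere(mask & ~scipy.ndimage.binary_erosion(mask)).

import numpy as np

def approx_diameter(boundary_pts, n_restarts=5, seed=0):
    """Approximate the maximum diameter of a (K, 3) boundary point cloud by
    alternating 'farthest point' hops, restarted a few times."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_restarts):
        a = boundary_pts[rng.integers(len(boundary_pts))]
        prev = -1.0
        while True:
            d = np.linalg.norm(boundary_pts - a, axis=1)
            cur = d.max()
            if cur <= prev:                  # no improvement: converged
                break
            prev = cur
            a = boundary_pts[d.argmax()]     # hop to the farthest point and repeat
        best = max(best, prev)
    return best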

Line f(0) = 0 closest to a set of points

Does anyone know a fast way to find the line closest to a set of points in Python? (The line should always cross the origin, in other words f(0) = 0.)
Given the equation of the line y = mx, I want to find the m that minimizes the total distance to every point in the set.
The image above is an example; the line should be closest to all the points. I tried doing this using scipy.optimize.minimize_scalar, but the performance was not good enough. I wonder if there is a faster algorithm or an analytical way of doing this.
The distance formula can be found here: https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line. You want to minimize the sum of the distances, but the distance formula contains an absolute value, which creates a discontinuity in the first derivative and can prevent numerical solvers from finding the solution. Alternatively, you could minimize the sum of the squared distances; however, that is not the same as minimizing the sum of the distances.
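If the squared-distance objective is acceptable, there is a closed form: the best direction through the origin is the leading right singular vector of the data matrix (total least squares through the origin), so no iterative solver is needed. A minimal sketch (function name and sample data are mine):

import numpy as np

def slope_min_squared_perp(x, y):
    """Slope m of y = m*x minimizing the sum of SQUARED perpendicular distances."""
    A = np.column_stack([x, y])
    # The leading right singular vector of A is the direction that captures the
    # most energy, i.e. minimizes the squared perpendicular residuals.
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    vx, vy = vt[0]
    return vy / vx            # breaks down if the best fit is a vertical line (vx == 0)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
m = slope_min_squared_perp(x, y)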

Efficient calculation of euclidean distance

I have a MxN array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum euclidean distance between the vectors.
In my mind, this requires me to calculate C(M, 2) = M(M - 1)/2 pairwise distances, i.e. O(M^2) work. My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.
Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.
You didn't describe where your vectors come from, nor what use you will put the mean and minimum to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit of a more efficient approach.
The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing the centroid of the M points, and then the M distances to that centroid, might turn out to be useful in your problem domain; hard to tell.
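A few lines are enough to sketch that centroid proxy (X below is random stand-in data for your (M, N) observation array):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1_000))                   # stand-in for the real observations

centroid = X.mean(axis=0)
d_to_centroid = np.linalg.norm(X - centroid, axis=1)   # M distances, O(M*N) total work
mean_to_centroid = d_to_centroid.mean()                # a cheap proxy, not the mean pairwise distance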
The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distances, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use a spatial UB-tree to model each point as a positive integer. This involves finding N minima for M x N values, adding constants so each minimum becomes zero, scaling so the estimated global minimum distance corresponds to at least 1.0, and then truncating to integer.
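A minimal sketch of that estimate-and-discretize step, again on stand-in data:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 1_000))                 # stand-in for the real observations

# Estimate the global minimum distance from ~100 random pairs (i != j).
i = rng.integers(0, len(X), size=100)
j = (i + rng.integers(1, len(X), size=100)) % len(X)
est_min = 0.5 * np.linalg.norm(X[i] - X[j], axis=1).min()

# Shift each dimension so its minimum is zero, scale so the estimated global
# minimum distance maps to at least 1.0, then truncate to integers.
X_int = np.floor((X - X.min(axis=0)) / est_min).astype(np.int64)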
With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest-neighbor spatial queries on the sorted values. For each point, compute an integer: shift the low-order bit of each dimension's value into the result, then iterate, continuing over all dimensions until all non-zero bits have been consumed and appear in the result; then proceed to the next point. Numerically sort the resulting integers, yielding a data structure similar to a PostGIS index.
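A rough sketch of that bit-interleaving (a Z-order / Morton key), using the integer vectors X_int from the sketch above; Python's arbitrary-precision integers keep it correct for many dimensions, though the keys get long and the pure-Python loop is slow at N = 1e3:

def morton_key(coords, bits=8):
    """Interleave the low `bits` bits of each non-negative integer coordinate
    into a single integer that can be sorted numerically."""
    key = 0
    ndim = len(coords)
    for b in range(bits):
        for d, c in enumerate(coords):
            key |= ((int(c) >> b) & 1) << (b * ndim + d)
    return key

# Sorting by key groups spatially nearby points together, which is what the
# coarse nearest-neighbour queries below rely on.
order = sorted(range(len(X_int)), key=lambda r: morton_key(X_int[r]))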
Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N = 1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by a single bit from their nearest neighbor, e.g. locations of oxygen atoms where each has a buddy, then increase the global minimum distance estimate so the low-order bits offer adequate discrimination.
A similar discretization approach would be to appropriately scale e.g. 2-dimensional inputs, mark an initially empty grid, and then scan immediate neighborhoods. This relies on the global minimum being within a "small" neighborhood, due to the appropriate scaling. In your case you would be marking an N-dimensional grid.
You may be able to speed things up with some sort of Space Partitioning.
For the minimum distance calculation, you would only need to consider pairs of points in the same or neighbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
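As one concrete form of space partitioning, scipy's cKDTree recovers the exact minimum pairwise distance from nearest-neighbour queries, though KD-trees lose much of their advantage once the dimensionality climbs into the hundreds. A sketch on lower-dimensional stand-in data:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))        # KD-trees work best at moderate dimensionality

tree = cKDTree(X)
d, _ = tree.query(X, k=2)                # k=2: each point's nearest neighbour besides itself
min_pairwise = d[:, 1].min()             # column 0 holds the zero self-distances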
I had the same issue before, and it worked for me once I normalized the values. So try to normalize the data before calculating the distance.

Exemplar-Based Inpainting - how to compute the normal to the contour and the isophote

I am using the Exemplar-Based algorithm by Criminisi. Section 3 of his paper describes the algorithm. The target region that needs to be inpainted is denoted Ω (omega); the rest of the image is denoted Φ (phi); and the border, or contour, of Ω where it meets Φ is δΩ (delta omega).
Now on page four of the paper, it states that n_p is the normal to the contour δΩ at p, and ∇I_p⊥ is the isophote at point p, which is the image gradient rotated 90 degrees.
My multivariable calculus is rusty, but how do we go about computing n_p and ∇I_p⊥ with Python libraries? Also, isn't n_p different for each point p on δΩ?
There are different ways of computing those variables, all depending on your numeric description of that boundary. n_p is the normal direction of the contour at p.
Generally, if your contour is described by an analytic equation, or if you can write an analytic equation that approximates the contour (e.g. a spline curve fitted to 5 points, 2 on each side of the point you want), you can differentiate that spline and compute the tangent line at the point you want.
Then get a unit vector along that line and take the vector orthogonal to it. All this is very easy to do (ask if you don't understand).
Then you have the isophote. It is the vector orthogonal to the gradient, with the same modulus. Computing the directional gradient of an image is a very commonly used technique in image processing. You can get the X and Y derivatives of the image easily (hint: numpy.gradient, or search SO for "python gradient"). Then the total gradient of the image is ∇I = (∂I/∂x, ∂I/∂y).
So just create a vector with the x and y gradients (taken from numpy.gradient). Then get the orthogonal vector to that one.
NOTE: to get an orthogonal vector in 2D: [v2x, v2y] = [v1y, -v1x]
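Here is a rough numpy sketch of both quantities, assuming a boolean mask of the target region Ω and a grayscale image; the function name and the smoothing-free finite differences are my own simplification, not the paper's exact formulation:

import numpy as np

def normal_and_isophote(mask, image, p):
    """Approximate n_p (contour normal of the target region) and the isophote
    (image gradient rotated 90 degrees) at pixel p = (row, col)."""
    # The gradient of the 0/1 mask points across the contour; normalizing it
    # gives an approximation of the normal direction n_p.
    m_dy, m_dx = np.gradient(mask.astype(float))
    n = np.array([m_dx[p], m_dy[p]])
    n /= np.linalg.norm(n) + 1e-12

    # Image gradient at p, stored as (v1x, v1y), then rotated 90 degrees
    # using the note above: (v2x, v2y) = (v1y, -v1x).
    i_dy, i_dx = np.gradient(image.astype(float))
    g = np.array([i_dx[p], i_dy[p]])
    isophote = np.array([g[1], -g[0]])
    return n, isophote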

Implementing k-means with Euclidean distance vs Manhattan distance?

I am implementing the k-means algorithm from scratch in Python and on Spark (it is actually my homework). The problem is to implement k-means with predefined centroids using two initialization methods, random initialization (c1) and k-means++ (c2), and with two distance metrics, Euclidean distance and Manhattan distance. For each metric, the cost function to be minimized is the sum of squared Euclidean distances in the first case and the sum of Manhattan distances in the second.
I have implemented both of them, but I think there is a problem. This is the graph of the cost function per iteration of k-means under the different settings:
The first graph looks fine, but the second one seems to have a problem because, as far as I know, the cost in k-means must decrease after each iteration. So what is the problem? Is it my code or the formula?
And these are my functions for computing distances and cost:
import numpy as np

def Euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

def Manhattan_distance(point1, point2):
    return np.sum(np.absolute(point1 - point2))

def cost_per_point(point, center, cost_type='E'):
    if cost_type == 'E':
        return Euclidean_distance(point, center) ** 2
    else:
        return Manhattan_distance(point, center)
And here is my full code on GitHub:
https://github.com/mrasoolmirzaei/My-Data-Science-Projects/blob/master/Implementing%20Kmeans%20With%20Spark.ipynb
K-means does not minimize distances.
It minimizes the sum of squares (which is not a metric).
If you assign points to the nearest cluster by Euclidean distance, it will still minimize the sum of squares, not Euclidean distances. In particular, the sum of euclidean distances may increase.
Minimizing Euclidean distances is the Weber problem. The mean is not optimal; you need the (more complex) geometric median to minimize Euclidean distances.
If you assign points by Manhattan distance, it is not clear what is being minimized... you have two competing objectives. While I would assume that it will still converge, that may be tricky to prove, because using the mean may increase the sum of Manhattan distances.
I think I posted a counterexample for k-means minimizing Euclidean distance here at SO or stats.SE some time ago. So your code and analysis may even be fine - it is the assignment that is flawed.
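To see that last point concretely, here is a tiny check (hypothetical toy data) showing that the mean can have a strictly larger Manhattan cost than the coordinate-wise median, which is why k-medians updates centers with the median:

import numpy as np

pts = np.array([[0.0], [0.0], [10.0]])        # one toy 1-D cluster

def manhattan_cost(center):
    return np.sum(np.abs(pts - center))

mean_center = pts.mean(axis=0)                # [3.33...]
median_center = np.median(pts, axis=0)        # [0.0]

print(manhattan_cost(mean_center))            # 13.33...
print(manhattan_cost(median_center))          # 10.0  (never worse than the mean)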
