Does anyone know a fast way to find the line closest to a set of points in Python? (The line should always pass through the origin, in other words f(0) = 0.)
Given the equation of the line y = mx + 0, I want to find the m that minimizes the distance to every point in the set.
The image above is an example: the line should be closest to all the points. I tried doing this using scipy.optimize.minimize_scalar, but the performance was not good enough. I wonder if there is a faster algorithm or an analytical way of doing this.
The distance formula can be found here: https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line. You want to minimize the sum of the distances, but the distance formula contains an absolute value, which creates a discontinuity in the first derivative and can keep derivative-based numerical solvers from converging to the solution. Alternatively, you could minimize the sum of the squared distances. However, that is not the same as minimizing the sum of the distances.
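If the squared version is acceptable, there is a closed form, so no iterative solver is needed. Here is a minimal sketch, assuming x and y are NumPy arrays of equal length; the function names are made up for illustration. The first function minimizes squared vertical offsets (ordinary least squares through the origin), the second minimizes squared perpendicular offsets, which matches the Wikipedia distance formula.

```python
import numpy as np

def slope_vertical_lsq(x, y):
    """Slope m minimizing sum((y - m*x)**2) for a line through the origin."""
    return np.dot(x, y) / np.dot(x, x)

def slope_perpendicular_lsq(x, y):
    """Slope m minimizing the sum of squared perpendicular distances
    to the line y = m*x. A (near-)vertical best line gives a huge slope."""
    sxx, syy, sxy = np.dot(x, x), np.dot(y, y), np.dot(x, y)
    # Angle of the best-fit direction, from setting the derivative of
    # sum((x*sin(t) - y*cos(t))**2) with respect to t to zero.
    theta = 0.5 * np.arctan2(2.0 * sxy, sxx - syy)
    return np.tan(theta)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m_vertical = slope_vertical_lsq(x, y)
m_perpendicular = slope_perpendicular_lsq(x, y)
```

Both are a handful of dot products, so they should be much faster than calling minimize_scalar. Neither minimizes the plain (unsquared) sum of distances.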
Related
I need to approximate a function y(x) with a step function of height h where each "high" segment has a length l_i=n_i*l_0 and every "low" segment has a length of d_j=n_j*d_0 where n_i must be an integer. The function is strictly positive, (not strictly) steadily decreasing and continuous.
My function has been derived in sympy and is available as symbolic equation but it's acceptable to convert to numpy/scipy if beneficial.
My first approach was to solve the segments pairwise.
The end application requires the total difference, i.e. the integral between the approximation and target function, to be minimized pairwise.
Another practical constraint is for the segments to be as short as possible, with the constraint of n being an integer.
I would also need to take over any residual of the integral sum into the next calculation because the total approximation should also minimize the accumulated error.
The approach I thought about taking would involve doing a segment wise integral from x_0 to x_1 and from x_1 to x_2, find for which x_1, x_2 the sum of these integrals changes sign (or is minimized) and then find the lowest common denominator of n_i and n_j.
integral = smp.integrate(y-h,(x,x_0,x_1)) + smp.integrate(y,(x,x_1,x_2))
One approach would be to switch over to scipy.optimize.minimize at this point, however, I have read it has problems with integer values? On the other hand, I don't know how I could find a relationship for x_1(x_2) for which the integral would be close to 0 in sympy either as I just started using sympy yesterday. Any help would be hugely appreciated!
I have a numpy.poly1d that can be any arbitrary 1d polynomial. Given an arbitrary point P, I wish to find the perpendicular distance between the curve and the point, and the point on the curve where the shortest-distance line intersects the curve, if the point P does not lie on the polynomial itself.
Now I can do something like the question Distance between a point and a curve in python, by sampling a set of points. But I believe that won't be fast enough considering that I have <100000 polynomials and <1000000 points per polynomial that I wish to find the distances to. Can someone suggest a more efficient method?
Here is a sample image illustration of what I am looking for. Basically, given P and B(t) I wish to find the length of g(x) and point Q.
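One way to avoid sampling, sketched under the assumption that the curve is y = poly(x) for a numpy.poly1d: the squared distance from P = (px, py) to the curve is (x - px)^2 + (poly(x) - py)^2, and its derivative is itself a polynomial, so the candidate points Q are the real roots of that derivative. The function name below is hypothetical.

```python
import numpy as np

def closest_point_on_poly(poly, px, py, imag_tol=1e-9):
    """Return (qx, qy, distance) for the point on y = poly(x) nearest (px, py)."""
    x = np.poly1d([1.0, 0.0])
    # Half the derivative of the squared distance; its real roots are
    # the stationary points (odd degree, so at least one real root).
    stationary = (x - px) + (poly - py) * poly.deriv()
    roots = stationary.roots
    real = roots[np.abs(roots.imag) < imag_tol].real
    d2 = (real - px) ** 2 + (poly(real) - py) ** 2
    best = np.argmin(d2)
    return real[best], poly(real[best]), np.sqrt(d2[best])

# Straight line y = x and the point (0, 2): nearest point is (1, 1).
qx, qy, dist = closest_point_on_poly(np.poly1d([1.0, 0.0]), 0.0, 2.0)
```

This solves one root-finding problem per point, so for very many points per polynomial it may still need batching or a good initial guess with Newton steps; the tolerance used to decide that a root is real is an arbitrary choice here.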
I have a set of points p and I need to transform them so that they align with another given set of points q (find the transform T from source to target).
So far it is an easy problem. My problem is that I have some freedom in aligning these points, i.e., I only have to keep the alignment error below some given threshold (alpha) rather than minimize the distance. I want to exploit this alignment freedom to minimize distances between p and a different set of points r. I marked the vectors to be optimized E = Tp - r
So basically I want to use the first alignment as a hard constraint and try to minimize another set of correspondences (I attached a picture). I want to minimize |E| (the green distances) under the constraint that the black points are within the red circles (alpha) after applying the transformation T.
I tried some heuristic solutions like calculating the maximum allowed rotation around the centroid and only then taking the maximum allowed translation but none of these solutions guarantee the optimal solution.
Have you heard about Lagrange optimization?
Here's the corresponding article.
You minimize a cost function (in your case E) under certain inequality
constraints and equality constraints (in your case no equality constraints).
This may be an approach for your solution?
Step 1:
Build augmented cost function: E - L * (Tp - q - alpha)
Step 2:
Find partial derivatives w.r.t T and L
Step 3:
Solve for zeros in partial derivatives
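If solving the derivative equations by hand gets messy, a numerical solver that handles inequality constraints can do the same job. Below is a rough sketch using scipy.optimize.minimize with the SLSQP method, not the analytical Lagrange derivation itself. It assumes 2D points, a rigid transform T (rotation plus translation), and that p, q, r are arrays of shape (n, 2) with matching rows; the function names are made up.

```python
import numpy as np
from scipy.optimize import minimize

def transform(params, pts):
    """Apply a 2D rigid transform given as (theta, tx, ty)."""
    theta, tx, ty = params
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return pts @ rot.T + np.array([tx, ty])

def constrained_align(p, q, r, alpha):
    """Minimize sum |Tp - r|^2 subject to |Tp_i - q_i| <= alpha for all i."""
    def objective(params):
        return np.sum((transform(params, p) - r) ** 2)

    def slack(params):
        # One entry per point; SLSQP keeps every entry >= 0.
        return alpha ** 2 - np.sum((transform(params, p) - q) ** 2, axis=1)

    # The identity transform is used as the starting guess (assumed feasible).
    return minimize(objective, x0=np.zeros(3), method="SLSQP",
                    constraints=[{"type": "ineq", "fun": slack}])

p = np.array([[0.0, 0.0], [1.0, 0.0]])
q = p.copy()
r = p + np.array([[0.5, 0.0], [0.0, 0.5]])
result = constrained_align(p, q, r, alpha=0.2)
```

SLSQP only finds a local optimum, so the starting guess matters; starting from the unconstrained best alignment to q is probably a sensible default.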
I have a MxN array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum euclidean distance between the vectors.
In my mind, this requires me to calculate C(M, 2) (M choose 2) distances, which is an O(n * min(k, n-k)) algorithm. My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.
Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.
You didn't describe where your vectors come from, nor what use you will put the mean and minimum to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit a more efficient approach.
The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing centroid of M points, and then computing M distances to centroid, might turn out to be useful in your problem domain, hard to tell.
The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distance, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use spatial UB-tree to model each point as a positive integer. This involves finding N minima for M x N values, adding constants so min becomes zero, scaling so estimated global min distance corresponds to at least 1.0, and then truncating to integer.
With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest neighbor spatial queries on the sorted values. For each point compute an integer. Shift the low-order bit of each dimension's value into the result, then iterate. Continue iterating over all dimensions until non-zero bits have all been consumed and appear in the result, and proceed to the next point. Numerically sort the integer result values, yielding a data structure similar to a PostGIS index.
Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N=1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by a single bit from their nearest neighbor, e.g. locations of oxygen atoms where each has a buddy, then increase the global min distance estimate so the low-order bits offer adequate discrimination.
A similar discretization approach would be appropriately scaling e.g. 2-dimensional inputs and marking an initially empty grid, then scanning immediate neighborhoods. This relies on global min being within a "small" neighborhood, due to appropriate scaling. In your case you would be marking an N-dimensional grid.
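Before building any index, it may be worth checking how far plain NumPy gets you. The sketch below is not the UB-tree approach; it shows two simpler things under my own assumptions: an exact mean/min computed in chunks via the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b (so the heavy lifting is a matrix product), and a random-pair estimate of the mean. The function names and the chunk size are arbitrary.

```python
import numpy as np

def pairwise_mean_min(X, chunk=1000):
    """Exact mean and minimum Euclidean distance over all pairs i < j."""
    sq = np.einsum("ij,ij->i", X, X)           # squared norms of the rows
    m = X.shape[0]
    cols = np.arange(m)[None, :]
    total, count, dmin = 0.0, 0, np.inf
    for start in range(0, m, chunk):
        stop = min(start + chunk, m)
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * (X[start:stop] @ X.T)
        np.maximum(d2, 0.0, out=d2)            # clip tiny negative round-off
        rows = np.arange(start, stop)[:, None]
        vals = np.sqrt(d2[cols > rows])        # each pair counted once
        if vals.size:
            total += vals.sum()
            count += vals.size
            dmin = min(dmin, vals.min())
    return total / count, dmin

def sampled_mean(X, n_pairs=10000, seed=0):
    """Estimate the mean pairwise distance from randomly drawn pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, X.shape[0], n_pairs)
    j = rng.integers(0, X.shape[0], n_pairs)
    keep = i != j
    return np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1).mean()

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
mean_d, min_d = pairwise_mean_min(X)
est = sampled_mean(X)
```

Random sampling gives a reasonable estimate of the mean, but it is a poor estimator of the minimum, since the closest pair is unlikely to be drawn.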
You may be able to speed things up with some sort of Space Partitioning.
For the minimum distance calculation, you would only need to consider pairs of points in the same or neighbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
I had the same issue before, and it worked for me once I normalized the values. So try to normalize the data before calculating the distance.
I am implementing kmeans algorithm from scratch in python and on Spark. Actually, it is my homework. The problem is to implement kmeans with predefined centroids with different initialization methods, one of them is random initialization(c1) and the other is kmeans++(c2). Also, it is required to use different distance metrics, Euclidean distance, and Manhattan distance. The formula for both of them is introduced as follows:
The second formula in each section is for the corresponding cost function which is going to be minimized. I have implemented both of them but I think there is a problem. This is the graph for the cost function per iteration of kmeans using different settings:
The first graph looks fine, but the second one seems to have a problem because, as far as I know, the cost of kmeans must decrease after each iteration. So what is the problem? Is it in my code or in the formula?
And these are my functions for computing distances and cost:
import numpy as np

def Euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

def Manhattan_distance(point1, point2):
    return np.sum(np.absolute(point1 - point2))

def cost_per_point(point, center, cost_type='E'):
    # 'E': squared Euclidean distance; anything else: Manhattan distance
    if cost_type == 'E':
        return Euclidean_distance(point, center) ** 2
    else:
        return Manhattan_distance(point, center)
And here is my full code on GitHub:
https://github.com/mrasoolmirzaei/My-Data-Science-Projects/blob/master/Implementing%20Kmeans%20With%20Spark.ipynb
K-means does not minimize distances.
It minimizes the sum of squares (which is not a metric).
If you assign points to the nearest cluster by Euclidean distance, it will still minimize the sum of squares, not Euclidean distances. In particular, the sum of Euclidean distances may increase.
Minimizing Euclidean distances is the Weber problem. The mean is not optimal. You need the geometric median, which is more complex to compute, to minimize Euclidean distances.
If you assign points with Manhattan distance, it is not clear what is being minimized... You have two competing objectives. While I would assume that it will still converge, that may be tricky to prove, because using the mean may increase the sum of Manhattan distances.
I think I posted a counterexample for k-means minimizing Euclidean distance here at SO or stats.SE some time ago. So your code and analysis may even be fine - it is the assignment that is flawed.
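This is not the counterexample I posted back then, just a tiny illustration with made-up points: for a single cluster, the mean gives the lowest sum of squares, but the coordinate-wise median gives a lower sum of Manhattan distances. So a mean update can increase a Manhattan cost.

```python
import numpy as np

# Three made-up points forming one cluster.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0]])

def sum_of_squares(center):
    return np.sum((points - center) ** 2)

def sum_of_manhattan(center):
    return np.sum(np.abs(points - center))

mean_center = points.mean(axis=0)            # what the k-means update uses
median_center = np.median(points, axis=0)    # coordinate-wise median

ssq_mean = sum_of_squares(mean_center)
ssq_median = sum_of_squares(median_center)
l1_mean = sum_of_manhattan(mean_center)
l1_median = sum_of_manhattan(median_center)
```

Here the mean wins on sum of squares and the median wins on Manhattan distance, which is why the Manhattan variant is usually paired with a median update (k-medians) rather than the mean.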