The task is to find a point with coordinates (x, 0) such that the Euclidean distance from it to the most distant point of the original set is minimal.
My idea is to minimize a function that computes the Euclidean distance, like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

def function_3(points_x, points_y):
    dots = np.array([points_x, points_y])
    ans = minimize(cdist(dots, points1), x0=0)
    return ans
But it seems like I'm doing something wrong... Can somebody give me some advice?
Solution
Here's a complete working example for fitting points of the form (x, 0):
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

# set up a test set of 100 points to fit against
n = 100
xyTestset = np.random.rand(n, 2) * 10

def fun(x, xycomp):
    # x is a vector, assumed to be of size 1
    # cdist expects a 2D array, so we reshape xy into a 1x2 array
    xy = np.array((x[0], 0)).reshape(1, -1)
    return cdist(xy, xycomp).max()

fit = minimize(fun, x0=0, args=xyTestset)
print(fit.x)
which outputs:
[5.06807808]
This means, roughly speaking, that the minimization is finding the middle of the cloud of random test points, as expected. If you want to do a 2D fit to a point of the form (x, y) instead, you can do:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

# set up a test set of 100 points to fit against
n = 100
xyTestset = np.random.rand(n, 2) * 10

def fun(x, xycomp):
    # x is a vector, assumed to be of size 2
    return cdist(x.reshape(1, -1), xycomp).max()

fit = minimize(fun, x0=(0, 0), args=xyTestset)
print(fit.x)
which outputs:
[5.21292828 5.01491085]
which, again, is roughly the middle of the 100 random points in xyTestset, as you'd expect.
Complete explanation
The problem that you're running into is that scipy.optimize.minimize has very specific expectations about the form of its first argument fun. fun is supposed to be a function that takes x as its first argument, where x is a 1D vector of the values to be minimized over. fun can also take additional arguments. These have to be passed into minimize via the args parameter, and their values are constant (i.e. they won't change over the course of the minimization).
Also, be aware that your case of fitting (x, 0) has only one free parameter, so the search itself is one-dimensional. The y coordinates of the data points still contribute to each distance, though, so they can't simply be dropped from the distance calculation.
Additionally, note that the point that minimizes the distance to the farthest point of the set is the centre of the set's smallest enclosing circle, which is not, in general, the centroid, although for a roughly uniform cloud of points the two are close. If the centroid is what you're after, you don't need a minimization at all: its coordinates are the means of each coordinate, so if your points are stored in an Nx2 array xydata you can calculate it by just doing:
xydata.mean(axis=0)
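To see how close these are in practice, here is a minimal sketch (not part of the original answer) that reuses xyTestset and fit from the first example above; for uniformly scattered test points all three numbers land near the middle of the cloud:
import numpy as np

# Sketch: compare the minimizer's result with two cheap estimates,
# reusing xyTestset and fit from the first example above.
print("minimax x  :", fit.x[0])
print("centroid x :", xyTestset[:, 0].mean())
print("midpoint x :", 0.5 * (xyTestset[:, 0].min() + xyTestset[:, 0].max()))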
Related
I have a cloud of data points (x,y) that I would like to interpolate and smooth.
Currently, I am using scipy:
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter

spl = interp1d(Cloud[:,1], Cloud[:,0]) # interpolation
x = np.linspace(Cloud[:,1].min(), Cloud[:,1].max(), 1000)
smoothed = savgol_filter(spl(x), 21, 1) # smoothing
This is working pretty well, except that I would like to give some weights to the data points passed to interp1d. Any suggestion for another function that handles this?
Basically, I thought that I could just duplicate each point of the cloud according to its weight, but that is not very efficient, as it greatly increases the number of points to interpolate and slows down the algorithm.
The default interp1d uses linear interpolation, i.e., it simply computes a line between two points. A weighted interpolation does not make much sense mathematically in such a scenario - there is only one way in Euclidean space to draw a straight line between two points.
Depending on your goal, you can look into other methods of interpolation, e.g., B-splines. Then you can use scipy's scipy.interpolate.splrep and set the w argument:
w - Strictly positive rank-1 array of weights the same length as x and y. The weights are used in computing the weighted least-squares spline fit. If the errors in the y values have standard-deviation given by the vector d, then w should be 1/d. Default is ones(len(x)).
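As a rough sketch of how the question's snippet could be adapted (assuming Cloud is your Nx2 array sorted by Cloud[:,1], weights is a strictly positive array you supply, and the smoothing factor s is just a placeholder to tune):
import numpy as np
from scipy.interpolate import splrep, splev

# Weighted B-spline smoothing (sketch): heavier points pull the curve closer.
weights = np.ones(len(Cloud))                                   # replace with your own weights
tck = splrep(Cloud[:,1], Cloud[:,0], w=weights, s=len(Cloud))   # s: smoothing factor to tune
x = np.linspace(Cloud[:,1].min(), Cloud[:,1].max(), 1000)
smoothed = splev(x, tck)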
I am having issues implementing a somewhat unusual interpolation problem. I have some (x, y) data points scattered along a curve which a priori I don't know, and I want to reconstruct this curve as well as I can, interpolating my points with minimum square error. I thought of using scipy.interpolate.splrep for this purpose (but maybe there are better options you would advise). The additional difficulty in my case is that I want to constrain the spline curve to pass through some specific points of my original data. I assume that playing with knots and weights could do the trick, but I don't know how (I have been putting off learning spline interpolation theory beyond basic fitting procedures). Also, for some undisclosed reason, when I try to set up knots in my splrep I get the same error as in this post, which keeps complicating things. The following is my sample code:
from __future__ import division
import numpy as np
import scipy.interpolate as spi
import matplotlib.pylab as plt
# Some surrogate sample data
f = lambda x : x**2 - x/2.
x = np.arange(0.,20.,0.1)
y = f(4*(x + np.random.normal(size=np.size(x))))
# I want to use spline interpolation with least-square fitting criterion, making sure though that the spline starts
# from the origin (or in general passes through a precise point of my dataset).
# In my case for example I would like the spline to originate from the point in x=0. So I attempted to include as first knot x=0...
# but it won't work, nor I am sure this is the right procedure...
fy = spi.splrep(x,y)
fy = spi.splrep(x,y,t=fy[0])
yy = spi.splev(x,fy)
plt.plot(x,y,'-',x,yy,'--')
plt.show()
which, despite the fact that I am passing knots computed from a first call of splrep, gives me:
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/fitpack.py", line 289, in splrep
res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 515, in splrep
raise _iermess[ier][1](_iermess[ier][0])
ValueError: Error on input data
Use the weights argument of splrep: give the points you need fixed very large weights. This is a workaround for sure, so keep an eye on the fit quality and stability.
Setting high weights for specific points is indeed a working solution, as suggested by @ev-br. In addition, because there is no direct way to match derivatives at the extrema of the curve, the same rationale can be applied there as well. Say you want the derivatives at y[0] and y[-1] to match the derivatives of your data points; then you also add large weights for y[1] and y[-2], i.e.
weights = np.ones(len(x))
weights[[0,-1]] = 100 # Promote spline interpolant through first and last point
weights[[1,-2]] = 50 # Make spline interpolant derivative tend to derivatives at first/last point
fy = spi.splrep(x,y,w=weights,s=0.1)
yy = spi.splev(x,fy)
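As a quick visual check (a sketch, reusing x, y and yy from the snippet above), you can plot the weighted fit and confirm that it now passes (nearly) through the pinned endpoints:
import matplotlib.pylab as plt

# the heavily weighted endpoints should now lie (almost) on the spline
plt.plot(x, y, '.', label='data')
plt.plot(x, yy, '--', label='weighted spline fit')
plt.plot(x[[0, -1]], yy[[0, -1]], 'o', label='pinned endpoints')
plt.legend()
plt.show()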
The k-means clustering algorithm's objective is to find cluster assignments S = {S_1, ..., S_k} that minimize the within-cluster sum of squared distances, i.e. arg min over S of sum_{i=1}^{k} sum_{x in S_i} ||x - mu_i||^2, where mu_i is the centroid of cluster S_i.
I looked at several implementations of it in python, and in some of them the norm is not squared.
For example (taken from here):
import numpy as np

def form_clusters(labelled_data, unlabelled_centroids):
    """
    given some data and centroids for the data, allocate each
    datapoint to its closest centroid. This forms clusters.
    """
    # enumerate because centroids are arrays which are unhashable
    centroids_indices = range(len(unlabelled_centroids))
    # initialize an empty list for each centroid. The list will
    # contain all the datapoints that are closer to that centroid
    # than to any other. That list is the cluster of that centroid.
    clusters = {c: [] for c in centroids_indices}
    for (label, Xi) in labelled_data:
        # for each datapoint, pick the closest centroid.
        smallest_distance = float("inf")
        for cj_index in centroids_indices:
            cj = unlabelled_centroids[cj_index]
            distance = np.linalg.norm(Xi - cj)
            if distance < smallest_distance:
                closest_centroid_index = cj_index
                smallest_distance = distance
        # allocate that datapoint to the cluster of that centroid.
        clusters[closest_centroid_index].append((label, Xi))
    return clusters.values()
And, for contrast, the expected implementation (taken from here; this is just the distance calculation):
import numpy as np
from numpy.linalg import norm

def compute_distance(self, X, centroids):
    distance = np.zeros((X.shape[0], self.n_clusters))
    for k in range(self.n_clusters):
        row_norm = norm(X - centroids[k, :], axis=1)
        distance[:, k] = np.square(row_norm)
    return distance
Now, I know there are several ways to calculate the norm/distance, but I looked only at implementations that use np.linalg.norm with ord=None or ord=2, and, as I said, in some of them the norm is not squared, yet they cluster correctly.
Why?
In my experience, using the norm or the squared norm as the objective function of an optimization algorithm yields the same result: the minimum value of the objective function changes, but the parameters obtained are the same. The reason is that the square root is a monotonically increasing function on non-negative values, so it changes the magnitude of the objective but not which point minimizes it; in particular, the centroid closest to a given data point is the same whether you compare distances or squared distances. A more detailed answer can be found here: https://math.stackexchange.com/questions/2253443/difference-between-least-squares-and-minimum-norm-solution
Hope it helps.
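Here is a small numpy sketch (not from the answer above) illustrating that point for the assignment step: the index of the nearest centroid is the same whether you use the norm or the squared norm.
import numpy as np

# Sketch: squaring is monotonic for non-negative values, so argmin over
# distances equals argmin over squared distances.
rng = np.random.default_rng(0)
X = rng.random((5, 2))           # 5 sample points
centroids = rng.random((3, 2))   # 3 centroids

d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # (5, 3) distances
print((d.argmin(axis=1) == (d ** 2).argmin(axis=1)).all())          # True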
So I was doing my assignment and we are required to use linear interpolation for it. We have been asked to use interp1d from scipy.interpolate and use it to generate new y values given new x values and the old coordinates (x1,y1) and (x2,y2).
To get the new x coordinates (let's call them x_new) I used np.linspace between (x1,x2), and the new y coordinates (let's call them y_new) I found using the interp1d function on x_new.
However, I also noticed that applying np.linspace to (y1,y2) generates exactly the same values as the y_new I got from interp1d on x_new.
Can anyone please explain to me why this is so? And if this is true, is it always true?
And if this is always true, why do we need the interp1d function at all when we can use np.linspace in its place?
Here is the code I wrote:
import scipy.interpolate as ip
import numpy as np
x = [-1.5, 2.23]
y = [0.1, -11]
x_new = np.linspace(start=x[0], stop=x[-1], num=10)
print(x_new)
y_new = np.linspace(start=y[0], stop=y[-1], num=10)
print(y_new)
f = ip.interp1d(x, y)
y_new2 = f(x_new)
print(y_new2) # y_new2 values always the same as y_new
The reason why you stumbled upon this is that you only use two points for the interpolation of a linear function. You have as input two different x values with corresponding y values. You then ask interp1d to find the linear function f(x) = m*x + b that best fits your input data. As you only have two points as input data, there is an exact solution, because a linear function is exactly defined by two points. To see this: take a piece of paper, draw two dots and then think about how many straight lines you can draw to connect them.
The linear function that you get from two input points is defined by the parameters m = (y1-y2)/(x1-x2) and b = y1 - m*x1, where (x1,y1), (x2,y2) are your two input points (or elements of your x and y arrays in your code snippet).
So, now what does np.linspace(start, stop, num, ...) do? It gives you num evenly spaced points between start and stop. These points are start, start + delta, ..., stop. The step width delta is given by delta = (stop - start)/(num - 1). The -1 comes from the fact that you want to include your endpoint. So the nth point in your interval will lie at xn = x1 + n*(x2-x1)/(num-1). At what y values will these points end up after we apply our linear function from interp1d? Let's plug it in:
f(xn) = m*xn + b = (y1-y2)/(x1-x2) * (x1 + n*(x2-x1)/(num-1)) + y1 - (y1-y2)/(x1-x2)*x1. Simplifying this results in f(xn) = (y2-y1)*n/(num-1) + y1. And this is exactly what you get from np.linspace(y1, y2, num), i.e. f(xn) = yn!
Now, does this always work? No! We made use of the fact that our linear function is defined by the two endpoints of the interval we use in np.linspace. So this will not work in general. Try adding one more x value and one more y value to your input lists and then compare the results, as in the sketch below.
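Here is a minimal sketch (building on the question's snippet, with an extra made-up middle point) showing that the two approaches no longer agree:
import numpy as np
import scipy.interpolate as ip

# with three points the linear interpolant is piecewise, so evenly spaced
# y values no longer coincide with the interpolated values
x = [-1.5, 0.0, 2.23]
y = [0.1, 5.0, -11.0]

x_new = np.linspace(x[0], x[-1], 10)
y_lin = np.linspace(y[0], y[-1], 10)   # ignores the middle point entirely
y_int = ip.interp1d(x, y)(x_new)       # follows both line segments

print(np.allclose(y_lin, y_int))       # False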
I am interested in computing the power spectrum of a system of particles (~100,000) in 3D space with Python. What I have found so far is a group of functions in Numpy (fft,fftn,..) which compute the discrete Fourier transform, of which the square of the absolute value is the power spectrum. My question is a matter of how my data are being represented - and truthfully may be fairly simple to answer.
The data structure I have is an array with a shape of (n,3), n being the number of particles I have, and each column representing the x, y, or z coordinate of the n particles. The function I believe I should be using is the fftn() function, which takes the discrete Fourier transform of an n-dimensional array - but it says nothing about the format. How should the data be represented as a data structure to be fed into fftn?
Here is what I've tried so far to test the function:
import numpy as np
import random
import matplotlib.pyplot as plt

DATA = np.zeros((100,3))
for i in range(len(DATA)):
    DATA[i,0] = random.uniform(-1,1)
    DATA[i,1] = random.uniform(-1,1)
    DATA[i,2] = random.uniform(-1,1)

FFT = np.fft.fftn(DATA)
PS = abs(FFT)**2

plt.plot(PS)
plt.show()
The array entitled DATA is a mock array; ultimately the real thing will be 100,000 by 3 in shape. The output of the code gives me something like:
As you can see, I think this is giving me three 1D power spectra (one for each column of my data), but really I'd like a power spectrum as a function of radius.
Does anybody have any advice or know of alternative methods/packages to compute the power spectrum (I'd even settle for the two-point autocorrelation function)?
It doesn't quite work the way you are setting it out...
You need a function, lets call it f(x, y, z), that describes the density of mass in space. In your case, you can consider the galaxies as point masses, so you will have a delta function centered at the location of each galaxy. It is for this function that you can calculate the three-dimensional autocorrelation, from which you could calculate the power spectrum.
If you want to use numpy to do that for you, you are first going to have to discretize your function. A possible mock example would be:
import numpy as np
import matplotlib.pyplot as plt

# a 100x100x100 grid of cell counts; mark the cells containing 1000 random "galaxies"
space = np.zeros((100, 100, 100), dtype=np.uint8)
x, y, z = np.random.randint(100, size=(3, 1000))
space[x, y, z] += 1

# power spectrum = |FFT|^2; its inverse FFT is the (circular) autocorrelation
space_ps = np.abs(np.fft.fftn(space))
space_ps *= space_ps
space_ac = np.fft.ifftn(space_ps).real.round()
space_ac /= space_ac[0, 0, 0]
And now space_ac holds the three-dimensional autocorrelation function for the data set. This is not quite what you are after, and to get your one-dimensional correlation function you would have to average the values on spherical shells around the origin:
# squared wrap-around distance from the origin along one axis
dist = np.minimum(np.arange(100), np.arange(100, 0, -1))
dist *= dist
# radial distance of every cell from the origin (taking wrap-around into account)
dist_3d = np.sqrt(dist[:, None, None] + dist[:, None] + dist)
# group cells by radius and average the autocorrelation over each spherical shell
distances, shell_idx = np.unique(dist_3d, return_inverse=True)
shell_idx = shell_idx.ravel()  # newer numpy may return the inverse with the input's shape
values = np.bincount(shell_idx, weights=space_ac.ravel()) / np.bincount(shell_idx)
plt.plot(distances[1:], values[1:])
There is another issue with doing things yourself this way: when you compute the power spectrum as above, mathematically it is as if your three-dimensional array wrapped around at the borders, i.e. point [99, y, z] is a neighbour of [0, y, z]. So your autocorrelation could show two very distant galaxies as close neighbours. The simplest way to deal with this is by making your array twice as large along every dimension, padding with extra zeros, and then discarding the extra data.
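A minimal sketch of that padding, reusing space from the snippet above (the factor of two and the slicing are the only new ingredients):
# embed the grid in an array twice as large along every axis so the
# FFT-based autocorrelation cannot wrap around
padded = np.zeros((200, 200, 200))
padded[:100, :100, :100] = space

padded_ps = np.abs(np.fft.fftn(padded)) ** 2
padded_ac = np.fft.ifftn(padded_ps).real[:100, :100, :100]  # keep physical lags only
padded_ac /= padded_ac[0, 0, 0]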
Alternatively, you could use scipy.ndimage.filters.correlate (scipy.ndimage.correlate in recent SciPy versions) with mode='constant' to do all the dirty work for you.
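For instance, here is a sketch on a deliberately small grid (direct correlation is far too slow for a 100^3 array):
import numpy as np
from scipy import ndimage

# mode='constant' pads with zeros, so there is no wrap-around; the result is
# the autocorrelation as a function of shift relative to the output's centre
space = np.zeros((20, 20, 20))
x, y, z = np.random.randint(20, size=(3, 200))
space[x, y, z] += 1

space_ac = ndimage.correlate(space, space, mode='constant', cval=0.0)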