I'm trying to fit a line segment to a set of points, but I'm having trouble finding an algorithm for it. I have a 2D line segment L and a set of 2D points C. L can be represented in any suitable way (I don't care), e.g. a support point and direction vector, two points, a linear equation with left and right bounds, ... The only important thing is that the line has a beginning and an end, so it is not infinite.
I want to fit L to C so that the sum of all distances of c to L (where c is a point in C) is minimized. This is a least squares problem, but I think I cannot use polynomial fitting, because L is only a segment. My mathematical knowledge in this area is a bit lacking, so any hints on further reading would be appreciated as well.
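For concreteness, the objective can be written out as follows (my own formalization of the statement above, not part of the original post):
minimize over p0, p1 in R^2:   sum over c in C of  d(c, [p0, p1])^2
where  d(c, [p0, p1]) = min over t in [0, 1] of  || c - (p0 + t*(p1 - p0)) ||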
Here is an illustration of my problem:
The orange line should be fitted to the blue points so that the sum of squares of distances of each point to the line is minimal. I don't mind if the solution is in a different language or not code at all, as long as I can extract an algorithm from it.
Since this is more of a mathematical question I'm not sure if it's ok for SO or should be moved to cross validated or math exchange.
This solution is relatively similar to one already posted here, but I think it is slightly more efficient, elegant, and understandable, which is why I posted it despite the similarity.
As was already written, the min(max(...)) formulation makes it hard to solve this problem analytically, which is why scipy.optimize fits well.
The solution is based on the mathematical formulation for distance between a point and a finite line segment outlined in https://math.stackexchange.com/questions/330269/the-distance-from-a-point-to-a-line-segment
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize, NonlinearConstraint

def calc_distance_from_point_set(v_):
    # v_ is accepted as a 1d array to make it easier to use with scipy.optimize
    # Reshape into two points
    v = (v_[:2].reshape(2, 1), v_[2:].reshape(2, 1))
    # Calculate t* for s(t*) = v_0 + t*(v_1-v_0), for the line segment w.r.t each point
    t_star_matrix = np.minimum(np.maximum(np.matmul(P - v[0].T, v[1] - v[0]) / np.linalg.norm(v[1] - v[0])**2, 0), 1)
    # Calculate s(t*)
    s_t_star_matrix = v[0] + ((t_star_matrix.ravel()) * (v[1] - v[0]))
    # Take distance between all points and respective point on segment
    distance_from_every_point = np.linalg.norm(P.T - s_t_star_matrix, axis=0)
    return np.sum(distance_from_every_point)
if __name__ == '__main__':
    # Random points from bounding box
    box_1 = np.random.uniform(-5, 5, 20)
    box_2 = np.random.uniform(-5, 5, 20)
    P = np.stack([box_1, box_2], axis=1)
    segment_length = 3
    segment_length_constraint = NonlinearConstraint(fun=lambda x: np.linalg.norm(np.array([x[0], x[1]]) - np.array([x[2], x[3]])), lb=[segment_length], ub=[segment_length])
    point = minimize(calc_distance_from_point_set, (0.0, -.0, 1.0, 1.0), options={'maxiter': 100, 'disp': True}, constraints=segment_length_constraint).x
    plt.scatter(box_1, box_2)
    plt.plot([point[0], point[2]], [point[1], point[3]])
Example result:
Here is a proposition in python. The distance between the points and the line is computed based on the approach proposed here: Fit a line segment to a set of points
The fact that the segment has a finite length, which imposes the use of min and max functions, or if-tests to see whether we have to use the perpendicular distance or the distance to one of the end points, makes it really difficult (impossible?) to get an analytic solution.
The proposed solution will thus use an optimization algorithm to approach the best solution. It uses scipy.optimize.minimize, see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
Since the segment length is fixed, we have only three degrees of freedom. In the proposed solution I use the x and y coordinates of the segment's starting point and the segment slope as free parameters. I use the getCoordinates function to get the starting and ending points of the segment from these 3 parameters and the length.
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
import math as m
from scipy.spatial import distance

# Plot the points and the segment
def plotFunction(points, x1, x2):
    'Plotting function for plane and iterations'
    plt.plot(points[:, 0], points[:, 1], 'ro')
    plt.plot([x1[0], x2[0]], [x1[1], x2[1]])
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.show()
# Get the sum of the distances between all the points and the segment
# The segment is defined by guess and length where:
# guess[0] = x coordinate of the starting point
# guess[1] = y coordinate of the starting point
# guess[2] = slope
# Since distance is always > 0 there is no need to use root mean square values
def getDist(guess, points, length):
    start_pt = np.array([guess[0], guess[1]])
    slope = guess[2]
    [x1, x2] = getCoordinates(start_pt, slope, length)
    total_dist = 0
    # Loop over each point to get the distance between the point and the segment
    for pt in points:
        total_dist += minimum_distance(x1, x2, pt, length)
    return total_dist
# Return minimum distance between line segment x1-x2 and point pt
# Adapted from https://stackoverflow.com/questions/849211/shortest-distance-between-a-point-and-a-line-segment
def minimum_distance(x1, x2, pt, length):
    length2 = length**2  # i.e. |x1-x2|^2 - avoid a sqrt, we reuse the length that we already know
    if length2 == 0.0:
        return distance.euclidean(pt, x1)
    # Consider the line extending the segment, parameterized as x1 + t (x2 - x1).
    # We find the projection of point pt onto the line.
    # It falls where t = [(pt-x1) . (x2-x1)] / |x2-x1|^2
    # We clamp t to [0,1] to handle points outside the segment x1-x2.
    t = max(0, min(1, np.dot(pt - x1, x2 - x1) / length2))
    projection = x1 + t * (x2 - x1)  # Projection falls on the segment
    return distance.euclidean(pt, projection)
# Get coordinates of start and end point of the segment from start_pt,
# slope and length, obtained by solving slope=dy/dx, dx^2+dy^2=length^2
def getCoordinates(start_pt, slope, length):
    x1 = start_pt
    dx = length / m.sqrt(slope**2 + 1)
    dy = slope * dx
    x2 = start_pt + np.array([dx, dy])
    return [x1, x2]
if __name__ == '__main__':
    # Generate random points
    num_points = 20
    points = np.random.rand(num_points, 2)

    # Starting position
    length = 0.5
    start_pt = np.array([0.25, 0.5])
    slope = 0

    # Use scipy.optimize.minimize to find the best start_pt and slope combination
    res = minimize(getDist, x0=[start_pt[0], start_pt[1], slope], args=(points, length), method="Nelder-Mead")

    # Retrieve best parameters
    start_pt = np.array([res.x[0], res.x[1]])
    slope = res.x[2]
    [x1, x2] = getCoordinates(start_pt, slope, length)

    print("\n** The best segment found is defined by:")
    print("\t** start_pt:\t", x1)
    print("\t** end_pt:\t", x2)
    print("\t** slope:\t", slope)
    print("** The total distance is:", getDist([x1[0], x1[1], slope], points, length), "\n")

    # Plot results
    plotFunction(points, x1, x2)
I have several points (x,y,z coordinates) in a 3D box with associated masses. I want to draw an histogram of the mass-density that is found in spheres of a given radius R.
I have written a code that, provided I did not make any errors (which I think I may have), works in the following way:
1. My "real" data is something huge, so I wrote a little code to generate randomly placed, non-overlapping points with arbitrary masses in a box.
2. I compute a 3D histogram (weighted by mass) with a binning about 10 times smaller than the radius of my spheres.
3. I take the FFT of my histogram, compute the wave-modes (kx, ky and kz) and use them to multiply my histogram in Fourier space by the analytic expression of the 3D top-hat window (sphere filtering) function in Fourier space.
4. I inverse FFT my newly computed grid.
Thus drawing a 1D histogram of the values in each bin would give me what I want.
My issue is the following: given what I do, there should not be any negative values in my inverted FFT grid (step 4), but I get some, with values much higher than the numerical error.
If I run my code on a small box (300x300x300 cm3 with the points separated by at least 1 cm) I do not get the issue. I do get it for 600x600x600 cm3 though.
If I set all the masses to 0, thus working on an empty grid, I do get back my 0 without any noted issues.
I give my code here in one full block so that it can easily be copied.
import numpy as np
import matplotlib.pyplot as plt
import random
from numba import njit
# 1. Generate a bunch of points with masses from 1 to 3 separated by a radius of 1 cm
radius = 1
rangeX = (0, 100)
rangeY = (0, 100)
rangeZ = (0, 100)
rangem = (1,3)
qty = 20000 # or however many points you want
# Generate a set of all points within 1 of the origin, to be used as offsets later
deltas = set()
for x in range(-radius, radius+1):
    for y in range(-radius, radius+1):
        for z in range(-radius, radius+1):
            if x*x + y*y + z*z <= radius*radius:
                deltas.add((x, y, z))
X = []
Y = []
Z = []
M = []
excluded = set()
for i in range(qty):
    x = random.randrange(*rangeX)
    y = random.randrange(*rangeY)
    z = random.randrange(*rangeZ)
    m = random.uniform(*rangem)
    if (x, y, z) in excluded: continue
    X.append(x)
    Y.append(y)
    Z.append(z)
    M.append(m)
    excluded.update((x+dx, y+dy, z+dz) for (dx, dy, dz) in deltas)
print("There are ", len(X), " points in the box")
# Compute the 3D histogram
a = np.vstack((X, Y, Z)).T
b = 200
H, edges = np.histogramdd(a, weights=M, bins = b)
# Compute the FFT of the grid
Fh = np.fft.fftn(H, axes=(-3,-2, -1))
# Compute the different wave-modes
kx = 2*np.pi*np.fft.fftfreq(len(edges[0][:-1]))*len(edges[0][:-1])/(np.amax(X)-np.amin(X))
ky = 2*np.pi*np.fft.fftfreq(len(edges[1][:-1]))*len(edges[1][:-1])/(np.amax(Y)-np.amin(Y))
kz = 2*np.pi*np.fft.fftfreq(len(edges[2][:-1]))*len(edges[2][:-1])/(np.amax(Z)-np.amin(Z))
# I create a matrix containing the values of the filter in each point of the grid in Fourier space
R = 5
Kh = np.empty((len(kx),len(ky),len(kz)))
@njit(parallel=True)
def func_njit(kx, ky, kz, Kh):
    for i in range(len(kx)):
        for j in range(len(ky)):
            for k in range(len(kz)):
                if np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2) != 0:
                    Kh[i][j][k] = (np.sin((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)-(np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R*np.cos((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R))*3/((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)**3
                else:
                    Kh[i][j][k] = 1
    return Kh
Kh = func_njit(kx, ky, kz, Kh)
# I multiply each point of my grid by the associated value of the filter (multiplication in Fourier space = convolution in real space)
Gh = np.multiply(Fh, Kh)
# I take the inverse FFT of my filtered grid. I take the real part to get back floats but there should only be zeros for the imaginary part.
Density = np.real(np.fft.ifftn(Gh,axes=(-3,-2, -1)))
# Here it shows if there are negative values the magnitude of the error
print(np.min(Density))
D = Density.flatten()
N = np.mean(D)
# I then compute the histogram I want
hist, bins = np.histogram(D/N, bins='auto', density=True)
bin_centers = (bins[1:]+bins[:-1])*0.5
plt.plot(bin_centers, hist)
plt.xlabel('rho/rhom')
plt.ylabel('P(rho)')
plt.show()
Do you know why I'm getting these negative values? Do you think there is a simpler way to proceed?
Sorry if this is a very long post, I tried to make it very clear and will edit it with your comments, thanks a lot!
-EDIT-
A follow-up question on the issue can be found [here].
The filter you create in the frequency domain is only an approximation to the filter you want to create. The problem is that we are dealing with the DFT here, not the continuous-domain FT (with its infinite frequencies). The Fourier transform of a ball is indeed the function you describe; however, this function has infinite extent -- it is not band-limited!
By sampling this function only within a window, you are effectively multiplying it with an ideal low-pass filter (the rectangle of the domain). This low-pass filter, in the spatial domain, has negative values. Therefore, the filter you create also has negative values in the spatial domain.
This is a slice through the origin of the inverse transform of Kh (after I applied fftshift to move the origin to the middle of the image, for better display):
As you can tell here, there is some ringing that leads to negative values.
One way to overcome this ringing is to apply a windowing function in the frequency domain. Another option is to generate a ball in the spatial domain, and compute its Fourier transform. This second option would be the simplest to achieve. Do remember that the kernel in the spatial domain must also have the origin at the top-left pixel to obtain a correct FFT.
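A minimal sketch of that second option, reusing H, edges and R from the question's code (my own illustration, not tested on the original data):
cell = edges[0][1] - edges[0][0]                  # bin width, assumed equal on all axes
n = H.shape[0]
coords = (np.arange(n) - n // 2) * cell           # grid coordinates centred on the box
dx, dy, dz = np.meshgrid(coords, coords, coords, indexing='ij')
ball = (dx**2 + dy**2 + dz**2 <= R**2).astype(float)
ball /= ball.sum()                                # normalise so the filter averages the mass
Kh_spatial = np.fft.fftn(np.fft.ifftshift(ball))  # move the origin to the corner pixel, then FFT
# then Gh = Fh * Kh_spatial and ifftn as before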
A windowing function is typically applied in the spatial domain to avoid issues with the image border when computing the FFT. Here, I propose to apply such a window in the frequency domain to avoid similar issues when computing the IFFT. Note, however, that this will always further reduce the bandwidth of the kernel (the windowing function would work as a low-pass filter after all), and therefore yield a smoother transition of foreground to background in the spatial domain (i.e. the spatial domain kernel will not have as sharp a transition as you might like). The best known windowing functions are Hamming and Hann windows, but there are many others worth trying out.
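For the first option, a sketch of a separable Hann window applied to Kh in the frequency domain could look like this (my own code, reusing kx, ky, kz and Kh from the question):
wx = np.fft.ifftshift(np.hanning(len(kx)))   # shift the window peak to zero frequency
wy = np.fft.ifftshift(np.hanning(len(ky)))
wz = np.fft.ifftshift(np.hanning(len(kz)))
Kh = Kh * wx[:, None, None] * wy[None, :, None] * wz[None, None, :]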
Unsolicited advice:
I simplified your code to compute Kh to the following:
kr = np.sqrt(kx[:,None,None]**2 + ky[None,:,None]**2 + kz[None,None,:]**2)
kr *= R
Kh = (np.sin(kr)-kr*np.cos(kr))*3/(kr)**3
Kh[0,0,0] = 1
I find this easier to read than the nested loops. It should also be significantly faster, and avoid the need for njit. Note that you were computing the same distance (what I call kr here) 5 times. Factoring out such computation is not only faster, but yields more readable code.
Just a guess:
Where do you get the idea that the imaginary part MUST be zero? Have you ever tried to take the absolute values (sqrt(re^2 + im^2)) and forget about the phase instead of just taking the real part? Just something that came to my mind.
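For reference, with the variables from the question that check is a one-liner (my suggestion, untested on the original data):
Density_abs = np.abs(np.fft.ifftn(Gh, axes=(-3, -2, -1)))  # magnitude instead of real part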
This question was edited after answers were posted, to show the final solution I used.
I have unstructured 2D datasets coming from different sources, for example:
These datasets are 3 numpy.ndarray (X, Y coordinates and Z value).
My final aim is to interpolate these data on a grid for conversion to an image/matrix.
So, I need to find the "best grid" on which to interpolate these data. And, for this, I need to find the best X and Y step between pixels of that grid.
Determine the step based on the Euclidean distance between points:
Use the mean of the Euclidean distances between each point and its nearest neighbour.
Use KDTree/cKDTree from scipy.spatial to build a tree of the X,Y data.
Use the query method with k=2 to get the distances (if k=1, the distances are only zero because the query for each point finds the point itself).
import numpy as np
from scipy import spatial

# Generate KD Tree
xy = np.c_[x, y]  # X,Y data converted for use with KDTree
tree = spatial.cKDTree(xy)  # Create KDTree for X,Y coordinates

# Calculate step
distances, points = tree.query(xy, k=2)  # Query distances for X,Y points
distances = distances[:, 1:]  # Remove k=1 zero distances
step = np.mean(distances)  # Result
Performance tweaking:
Use scipy.spatial.cKDTree and not scipy.spatial.KDTree because it is really faster.
Use balanced_tree=False with scipy.spatial.cKDTree: a big speed-up in my case, but it may not be true for all data.
Use n_jobs=-1 with cKDTree.query to use multithreading.
Use p=1 with cKDTree.query to use the Manhattan distance in place of the Euclidean distance (p=2): faster but possibly less accurate.
Query the distances for only a random subsample of the points: a big speed-up with large datasets, but possibly less accurate and less repeatable. A combined sketch of these options follows this list.
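This is a sketch of how those options combine (my own example; the subsample size of 5000 is an arbitrary choice, and recent SciPy versions name the multithreading argument workers instead of n_jobs):
import numpy as np
from scipy import spatial

xy = np.c_[x, y]
tree = spatial.cKDTree(xy, balanced_tree=False)              # skip the balancing step
sample_idx = np.random.choice(len(xy), size=min(5000, len(xy)), replace=False)
dists, _ = tree.query(xy[sample_idx], k=2, p=1, workers=-1)  # Manhattan distance, all cores
step = np.mean(dists[:, 1])                                  # column 0 is the zero self-distance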
Interpolate points on grid:
Interpolate the dataset points on a grid using the calculated step.
# Generate grid
import scipy.interpolate

def interval(axe):
    '''Return a numpy.linspace interval for the specified axis'''
    cent = axe.min() + axe.ptp() / 2      # Interval center
    nbs = int(np.ceil(axe.ptp() / step))  # Number of steps in the interval
    hwid = nbs * step / 2                 # Half interval width
    return np.linspace(cent - hwid, cent + hwid, nbs)  # linspace

xg, yg = np.meshgrid(interval(x), interval(y))  # Generate grid

# Interpolate X,Y,Z data on the grid
zg = scipy.interpolate.griddata((x, y), z, (xg, yg))
Set NaN if a pixel is too far from the initial points:
Set NaN for pixels of the grid that are too far (distance > step) from the points of the initial X,Y,Z data. The previously generated KDTree is used.
# Calculate pixel to X,Y,Z data distances
dist, _ = tree.query(np.c_[xg.ravel(), yg.ravel()])
dist = dist.reshape(xg.shape)
# Set NaN value for too far pixels
zg[dist > step] = np.nan
The problem you want to solve is called the "all-nearest-neighbors problem". See this article for example: http://link.springer.com/article/10.1007/BF02187718
I believe solutions to this are O(N log N), so on the same order as KDTree.query, but in practice much, much faster than a bunch of separate queries. I'm sorry, I don't know of a python implementation of this.
I suggest you go with KDTree.query.
You are searching for a characteristic distance to scale your binning: I suggest you take only a random subset of your points, and use the Manhattan distance, because KDTree.query is very slow (even though it has n*log(n) complexity).
Here is my code:
# Create tree
import numpy
import scipy.spatial

tree = scipy.spatial.KDTree(numpy.array(points))  # better give it a copy?
# Create random subsample of points
n_repr = 1000
shuffled_points = numpy.array(points)
numpy.random.shuffle(shuffled_points)
shuffled_points = shuffled_points[:n_repr]
# Query the tree
(dists, indexes) = tree.query(shuffled_points, k=2, p=1)
# Get _estimate_ of average distance:
avg_dists = numpy.average(dists[:, 1])  # column 0 holds the zero distance to the point itself
print('average distance Manhattan with nearest neighbour is:', avg_dists)
I suggest you use the Manhattan distance ( https://en.wikipedia.org/wiki/Taxicab_geometry ) because it is faster to compute than the Euclidean distance. And since you only need an estimate of the average distance, it should be sufficient.
I have a function (f : black line) which varies sharply in a specific, small region (derivative f' : blue line, and second derivative f'' : red line). I would like to integrate this function numerically, and if I distribute points evenly (in log-space) I end up with fairly large errors in the sharply varying region (near 2E15 in the plot).
How can I construct an array spacing such that it is very well sampled in the area where the second derivative is large (i.e. a sampling frequency proportional to the second derivative)?
I happen to be using python, but I'm interested in a general algorithm.
Edit:
1) It would be nice to be able to still control the number of sampling points (at least roughly).
2) I've considered constructing a probability distribution function shaped like the second derivative and drawing randomly from that --- but I think this will offer poor convergence, and in general, it seems like a more deterministic approach should be feasible.
Assuming the sampled values of f'' are stored in a NumPy array (called fpp below, since f'' is not a valid Python name), you could do the following
# Scale these deltas as you see fit
deltas = 1 / fpp
domain = deltas.cumsum()
To account only for order of magnitude swings, this could be adjusted as follows...
deltas = 1 / (-np.log10(1 / fpp))
I'm just spitballing here ... (as I don't have time to try this out for real)...
Your data looks (roughly) linear on a log-log plot (at least, each segment seems to be). So, I might consider doing a sort-of integration in log-space.
log_x = np.log(x)
log_y = np.log(y)
Now, for each of your points, you can get the slope (and intercept) in log-log space:
rise = np.diff(log_y)
run = np.diff(log_x)
slopes = rise / run
And, similarly, the intercept can be calculated:
# y = mx + b
# :. b = y - mx
intercepts = log_y[:-1] - slopes * log_x[:-1]
Alright, now we have a bunch of (straight) lines in log-log space. But a straight line in log-log space corresponds to y = exp(intercept) * x^slope in real space. We can integrate that easily enough: the antiderivative of a*x^k is a/(k+1) * x^(k+1), so...
def _eval_log_log_integrate(a, k, x):
    return np.exp(a)/(k+1) * x ** (k+1)

def log_log_integrate(a, k, x1, x2):
    return _eval_log_log_integrate(a, k, x2) - _eval_log_log_integrate(a, k, x1)
partial_integrals = []
for a, k, x_lower, x_upper in zip(intercepts, slopes, x[:-1], x[1:]):
    partial_integrals.append(log_log_integrate(a, k, x_lower, x_upper))
total_integral = sum(partial_integrals)
You'll want to check my math -- It's been a while since I've done this sort of thing :-)
1) The Cool Approach
At the moment I implemented an 'adaptive refinement' approach inspired by hydrodynamics techniques. I have a function which I want to sample, f, and I choose some initial array of sample points x_i. I construct a "sampling" function g, which determines where to insert new sample points.
In this case I chose g as the slope of log(f) --- since I want to resolve rapid changes in log space. I then divide the span of g into L=3 refinement levels. If g(x_i) exceeds a refinement level, that span is subdivided into N=2 pieces, those subdivisions are added into the samples and are checked against the next level. This yields something like this:
The solid grey line is the function I want to sample, and the black crosses are my initial sampling points.
The dashed grey line is the derivative of the log of my function.
The colored dashed lines are my 'refinement levels'
The colored crosses are my refined sampling points.
This is all shown in log-space.
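A rough sketch of that refinement loop (my own reconstruction from the description above, not the original code; it assumes x > 0 and f(x) > 0 so that logs are defined):
import numpy as np

def refine_samples(f, x, levels=3, pieces=2):
    x = np.sort(np.asarray(x, dtype=float))
    g0 = np.abs(np.gradient(np.log(f(x)), np.log(x)))               # slope of log(f) vs log(x)
    thresholds = np.linspace(g0.min(), g0.max(), levels + 2)[1:-1]  # the L refinement levels
    for thr in thresholds:
        g = np.abs(np.gradient(np.log(f(x)), np.log(x)))
        new_pts = []
        for left, right, g_left in zip(x[:-1], x[1:], g[:-1]):
            if g_left > thr:                                        # span exceeds this level
                new_pts.extend(np.geomspace(left, right, pieces + 1)[1:-1])  # split into N pieces
        if new_pts:
            x = np.unique(np.concatenate([x, new_pts]))
    return x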
2) The Simple Approach
After I finished (1), I realized that I probably could have just chosen a maximum spacing in y, and chosen x-spacings to achieve that. Similarly, just divide the function evenly in y, and find the corresponding x points... The results of this are shown below:
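Here is a quick sketch of that even-in-y spacing (my own code; it assumes f is monotonic over the interval so the inversion by interpolation is valid):
import numpy as np

def sample_even_in_y(f, x_min, x_max, n_samples, n_dense=10000):
    x_dense = np.geomspace(x_min, x_max, n_dense)   # dense tabulation of f
    y_dense = f(x_dense)
    order = np.argsort(y_dense)                     # np.interp needs increasing y values
    y_targets = np.linspace(y_dense.min(), y_dense.max(), n_samples)
    return np.sort(np.interp(y_targets, y_dense[order], x_dense[order]))  # invert y -> x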
A simple approach would be to split the x-axis array into three parts and use different spacing for each of them. It would allow you to maintain the total number of points and also the required spacing in different regions of the plot. For example:
x = np.linspace(10**13, 10**15, 100)
x = np.append(x, np.linspace(10**15, 10**16, 100))
x = np.append(x, np.linspace(10**16, 10**18, 100))
You may want to choose a better spacing based on your data, but you get the idea.
I have some geo data (the image below shows the path of a river as red dots) which I want to approximate using a multi segment cubic bezier curve. Through other questions on stackoverflow here and here I found the algorithm by Philip J. Schneider from "Graphics Gems". I successfully implemented it and can report that even with thousands of points it is very fast. Unfortunately that speed comes with some disadvantages, namely that the fitting is done quite sloppily. Consider the following graphic:
The red dots are my original data and the blue line is the multi segment bezier created by the algorithm by Schneider. As you can see, the input to the algorithm was a tolerance which is at least as high as the green line indicates. Nevertheless, the algorithm creates a bezier curve which has too many sharp turns. You can see two of these unnecessary sharp turns in the image. It is easy to imagine a bezier curve with fewer sharp turns for the shown data while still maintaining the maximum tolerance condition (just push the bezier curve a bit in the direction of the magenta arrows). The problem seems to be that the algorithm picks data points from my original data as end points of the individual bezier curves (the magenta arrows indicate some suspects). With the endpoints of the bezier curves restricted like that, it is clear that the algorithm will sometimes produce rather sharp curvatures.
What I am looking for is an algorithm which approximates my data with a multi segment bezier curve with two constraints:
the multi segment bezier curve must never be more than a certain distance away from the data points (this is provided by the algorithm by Schneider)
the multi segment bezier curve must never create curvatures that are too sharp. One way to check this criterion would be to roll a circle with the minimum curvature radius along the multi segment bezier curve and check whether it touches all parts of the curve along its path. Though it seems there is a better method involving the cross product of the first and second derivative (sketched below).
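For reference, that cross-product check boils down to the standard formula for the radius of curvature of a parametric curve, radius = (x'^2 + y'^2)^(3/2) / |x' y'' - y' x''|. A small sketch (my own code, with the derivatives sampled along the curve):
import numpy as np

def min_curvature_radius(xp, yp, xpp, ypp):
    # xp, yp: first derivatives along the curve; xpp, ypp: second derivatives
    # straight sections give an infinite radius (division by zero -> inf)
    return np.min(np.power(xp**2 + yp**2, 1.5) / np.abs(xp * ypp - yp * xpp))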
The solutions I found which create better fits sadly either work only for single bezier curves (and omit the question of how to find good start and end points for each bezier curve in the multi segment bezier curve) or do not allow a minimum curvature constraint. I feel that the minimum curvature constraint is the tricky condition here.
Here is another example (this is hand drawn and not 100% precise):
Let's suppose that figure one shows both the curvature constraint (the circle must fit along the whole curve) and the maximum distance of any data point from the curve (which happens to be the radius of the circle in green). A successful approximation of the red path in figure two is shown in blue. That approximation honors the curvature condition (the circle can roll inside the whole curve and touches it everywhere) as well as the distance condition (shown in green). Figure three shows a different approximation to the path. While it honors the distance condition, it is clear that the circle does not fit into the curvature any more. Figure four shows a path which is impossible to approximate with the given constraints because it is too pointy. This example is supposed to illustrate that, to properly approximate some pointy turns in the path, it is necessary that the algorithm chooses control points which are not part of the path. Figure three shows that if control points along the path were chosen, the curvature constraint could not be fulfilled anymore. This example also shows that the algorithm must give up on some inputs, as it is not possible to approximate them with the given constraints.
Does there exist a solution to this problem? The solution does not have to be fast. If it takes a day to process 1000 points, then that's fine. The solution does also not have to be optimal in the sense that it must result in a least squares fit.
In the end I will implement this in C and Python but I can read most other languages too.
I found the solution that fulfills my criteria. The solution is to first find a B-spline that approximates the points in the least-squares sense and then convert that spline into a multi segment bezier curve. B-splines have the advantage that, in contrast to bezier curves, they will not pass through the control points, and they also provide a way to specify a desired "smoothness" of the approximation curve. The needed functionality to generate such a spline is implemented in the FITPACK library, to which scipy offers a Python binding. Let's suppose I read my data into the lists x and y, then I can do:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
tck,u = interpolate.splprep([x,y],s=3)
unew = np.arange(0,1.01,0.01)
out = interpolate.splev(unew,tck)
plt.figure()
plt.plot(x,y,out[0],out[1])
plt.show()
The result then looks like this:
If I want the curve to be smoother, I can increase the s parameter of splprep. If I want the approximation closer to the data, I can decrease the s parameter for less smoothness. By going through multiple s parameters programmatically I can find a good parameter that fits the given requirements.
The question though is how to convert that result into a bezier curve. The answer is given in this email by Zachary Pincus. I will replicate his solution here to give a complete answer to my question:
def b_spline_to_bezier_series(tck, per = False):
    """Convert a parametric b-spline into a sequence of Bezier curves of the same degree.

    Inputs:
      tck : (t,c,k) tuple of b-spline knots, coefficients, and degree returned by splprep.
      per : if tck was created as a periodic spline, per *must* be true, else per *must* be false.

    Output:
      A list of Bezier curves of degree k that is equivalent to the input spline.
      Each Bezier curve is an array of shape (k+1,d) where d is the dimension of the
      space; thus the curve includes the starting point, the k-1 internal control
      points, and the endpoint, where each point is of d dimensions.
    """
    from scipy.interpolate import insert  # originally "from fitpack import insert"
    from numpy import asarray, unique, split, sum, transpose
    t, c, k = tck
    t = asarray(t)
    try:
        c[0][0]
    except:
        # I can't figure out a simple way to convert nonparametric splines to
        # parametric splines. Oh well.
        raise TypeError("Only parametric b-splines are supported.")
    new_tck = tck
    if per:
        # ignore the leading and trailing k knots that exist to enforce periodicity
        knots_to_consider = unique(t[k:-k])
    else:
        # the first and last k+1 knots are identical in the non-periodic case, so
        # no need to consider them when increasing the knot multiplicities below
        knots_to_consider = unique(t[k+1:-k-1])
    # For each unique knot, bring its multiplicity up to the next multiple of k+1.
    # This removes all continuity constraints between each of the original knots,
    # creating a set of independent Bezier curves.
    desired_multiplicity = k + 1
    for x in knots_to_consider:
        current_multiplicity = sum(t == x)
        remainder = current_multiplicity % desired_multiplicity
        if remainder != 0:
            # add enough knots to bring the current multiplicity up to the desired multiplicity
            number_to_insert = desired_multiplicity - remainder
            new_tck = insert(x, new_tck, number_to_insert, per)
    tt, cc, kk = new_tck
    # strip off the last k+1 knots, as they are redundant after knot insertion
    bezier_points = transpose(cc)[:-desired_multiplicity]
    if per:
        # again, ignore the leading and trailing k knots
        bezier_points = bezier_points[k:-k]
    # group the points into the desired bezier curves
    return split(bezier_points, len(bezier_points) // desired_multiplicity, axis=0)
So B-Splines, FITPACK, numpy and scipy saved my day :)
polygonize data
find the order of the points: just find the closest points to each other and try to connect them 'by lines'. Avoid looping back to the origin point (a minimal sketch of this greedy ordering follows below).
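A minimal sketch of this greedy ordering (my own code, not part of the original answer):
import numpy as np

def order_points(pts):
    pts = np.asarray(pts, dtype=float)
    order = [0]                                   # start from the first point
    remaining = list(range(1, len(pts)))
    while remaining:
        last = pts[order[-1]]
        dists = [np.linalg.norm(pts[i] - last) for i in remaining]
        order.append(remaining.pop(int(np.argmin(dists))))   # connect to nearest unused point
    return pts[order]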
compute the derivative along the path
it is the change of direction of the 'lines'; where you hit a local min or max, there is your control point ... Do this to reduce your input data (leave just the control points).
curve
now use these points as control points. I strongly recommend an interpolation polynomial for both x and y separately, for example something like this:
x=a0+a1*t+a2*t*t+a3*t*t*t
y=b0+b1*t+b2*t*t+b3*t*t*t
where a0..a3 are computed like this:
d1=0.5*(p2.x-p0.x);
d2=0.5*(p3.x-p1.x);
a0=p1.x;
a1=d1;
a2=(3.0*(p2.x-p1.x))-(2.0*d1)-d2;
a3=d1+d2+(2.0*(-p2.x+p1.x));
b0 .. b3 are computed in the same way but using the y coordinates of course
p0..p3 are the control points for the cubic interpolation curve
t = <0.0,1.0> is the curve parameter from p1 to p2
this ensures that the position and the first derivative are continuous (C1); you can also use BEZIER, but it will not be as good a match as this.
[edit1] too sharp edges is a BIG problem
To solve it you can remove points from your dataset before obtaining the control points. I can think of two ways to do it right now ... choose whichever is better for you:
remove points from the dataset with too high a first derivative
dx/dl or dy/dl where x,y are coordinates and l is the curve length (along its path). The exact computation of the curvature radius from the curve derivatives is tricky
remove points from the dataset that lead to too small a curvature radius
compute the intersection of the perpendicular axes through the midpoints of neighboring line segments (black lines), as on the image (red lines); the distance between that intersection and the joint point (blue line) is your curvature radius. When the curvature radius is smaller than your limit, remove that point ... (a small sketch of this radius check follows)
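A small sketch of that radius check (my own code): for three consecutive points, the intersection of the two perpendicular bisectors is the circumcenter, and its distance to the joint point is the local curvature radius, i.e. the circumradius R = abc / (4K):
import numpy as np

def curvature_radius(p0, p1, p2):
    a = np.linalg.norm(p1 - p0)
    b = np.linalg.norm(p2 - p1)
    c = np.linalg.norm(p2 - p0)
    area2 = abs((p1[0]-p0[0])*(p2[1]-p0[1]) - (p1[1]-p0[1])*(p2[0]-p0[0]))  # twice the triangle area
    if area2 == 0.0:
        return np.inf                 # collinear points: infinite radius, nothing to remove
    return a * b * c / (2.0 * area2)  # circumradius R = abc / (4K), with K = area2 / 2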
now if you really need only BEZIER cubics then you can convert my interpolation cubic to BEZIER cubic like this:
// ---------------------------------------------------------------------------
// x=cx[0]+(t*cx[1])+(tt*cx[2])+(ttt*cx[3]); // cubic x=f(t), t = <0,1>
// ---------------------------------------------------------------------------
// cubic matrix bz4 = it4
// ---------------------------------------------------------------------------
// cx[0]= ( x0) = ( X1)
// cx[1]= (3.0*x1)-(3.0*x0) = (0.5*X2) -(0.5*X0)
// cx[2]= (3.0*x2)-(6.0*x1)+(3.0*x0) = -(0.5*X3)+(2.0*X2)-(2.5*X1)+( X0)
// cx[3]= ( x3)-(3.0*x2)+(3.0*x1)-( x0) = (0.5*X3)-(1.5*X2)+(1.5*X1)-(0.5*X0)
// ---------------------------------------------------------------------------
const double m=1.0/6.0;
double x0,y0,x1,y1,x2,y2,x3,y3;
x0 = X1; y0 = Y1;
x1 = X1-(X0-X2)*m; y1 = Y1-(Y0-Y2)*m;
x2 = X2+(X1-X3)*m; y2 = Y2+(Y1-Y3)*m;
x3 = X2; y3 = Y2;
In case you need the reverse conversion see:
Bezier curve with control points within the curve
The question was posted long ago, but here is a simple solution based on splprep, finding the minimal value of s that fulfills a minimum curvature radius criterion.
route is the set of input points, the first dimension being the number of points.
import numpy as np
from scipy.interpolate import splprep, splev
#The minimum curvature radius we want to enforce
minCurvatureConstraint = 2000
#Relative tolerance on the radius
relTol = 1.e-6
#Initial values for bisection search, should bound the solution
s_0 = 0
minCurvature_0 = 0
s_1 = 100000000 #Should be high enough to produce curvature radius larger than constraint
s_1 *= 2
minCurvature_1 = np.inf

while np.abs(minCurvature_0 - minCurvature_1) > minCurvatureConstraint * relTol:
    s = 0.5 * (s_0 + s_1)
    tck, u = splprep(np.transpose(route), s=s)
    smoothed_route = splev(u, tck)
    # Compute radius of curvature
    derivative1 = splev(u, tck, der=1)
    derivative2 = splev(u, tck, der=2)
    xprim = derivative1[0]
    xprimprim = derivative2[0]
    yprim = derivative1[1]
    yprimprim = derivative2[1]
    curvature = 1.0 / np.abs((xprim*yprimprim - yprim*xprimprim) / np.power(xprim*xprim + yprim*yprim, 3 / 2))
    minCurvature = np.min(curvature)
    print("s is %g => Minimum curvature radius is %g" % (s, np.min(curvature)))
    # Perform bisection
    if minCurvature > minCurvatureConstraint:
        s_1 = s
        minCurvature_1 = minCurvature
    else:
        s_0 = s
        minCurvature_0 = minCurvature
It may require some refinements, such as iterations to find a suitable s_1, but it works.