I have been trying to code a piece of software to bootstrap data where every data point has a different and unique uncertainty. I take this uncertainty as the standard deviation when sampling that point from a gaussian distribution.
I run many samples, however the bootstrapped results does not agree with the curve_fit best result (the only difference I can think of is that this takes the datapoints and assumes they have no uncertainty), but this best result, by definition, be identical. Any ideas why?
The code is as follows:
with inputs as:
def f(x, a, b):
y = a*x + b
return y
x (array, x data points)
y (array, y data points)
x_err (array, uncertainty in each x point)
y_err (array, uncertainty in each y point)
n_samples = 10000
conf_pct = 68 (% for a 1 sigma test)
So just for clarity, x[i], y[i], x_err[i] and y_err[i] is all the information associated with the ith datapoint. (I did have these in a dataframe but took it out into arrays because I understood the processing of it better)
def bootstrap_fit(f, x, y, x_err, y_err, n_samples, conf_pct):
# n_samples number of draws from each data point, and then
# take the transpose to make n_samples number of samples (because n_samples >> len(x) )
x_sampling = []
y_sampling = []
a_boot = []
b_boot = []
# cov_boot = [] # don't think we'll need this but just in case?
for i, this_x in enumerate(x):
this_x_err = x_err[i]
this_y = y[i]
this_y_err = y_err[i]
this_x_samp = np.random.normal(loc=this_x, scale=this_x_err, size=n_samples)
this_y_samp = np.random.normal(loc=this_y, scale=this_y_err, size=n_samples)
x_sampling.append(this_x_samp)
y_sampling.append(this_y_samp)
# convert to np arrays and take the transpose
x_sampling = np.array(x_sampling).T
y_sampling = np.array(y_sampling).T
# ok, now that we have n_samples number of datasets randomly sampled within
# the actual errorbars of the data, let's fit each of those datasets
# notice how this_x and this_y and i, etc, are temporary variables
# that will get overwritten from the past loop
for i, this_x in enumerate(x_sampling):
this_y = y_sampling[i]
p_opt, p_cov = curve_fit(f, this_x, this_y)
a_boot.append(p_opt[0])
b_boot.append(p_opt[1])
# cov_boot.append(p_cov)
# make these into np arrays as well
a_boot = np.array(a_boot)
b_boot = np.array(b_boot)
# set up an array to use to plot the lines (because each x, y random dataset
# actually has slightly different min and max x values, and that gets messy)
x_fit = np.linspace(np.min(x), np.max(x), num=1000, endpoint=True)
y_fit = []
for i, this_a in enumerate(a_boot):
this_b = b_boot[i]
this_y = f(x_fit, this_a, this_b)
y_fit.append(this_y)
y_fit = np.array(y_fit)
# figure out from that what percentiles we actually need to identify
conf_lo = (100. - conf_pct)/2.
conf_hi = 100. - conf_lo
# set up the lists that will hold the upper and lower lines
y_upper = []
y_lower = []
y_median = []
y_difference = []
for i, this_x in enumerate(x_fit):
# we need to extract all the y-values for every random sample that correspond
# to this x value. We will just take the ith array of the transpose of y_boot.
this_y = y_fit.T[i]
# add the percentile values to each list for this value of x
y_lower.append(np.percentile(this_y, conf_lo))
y_upper.append(np.percentile(this_y, conf_hi))
y_median.append(np.percentile(this_y, 50.))
# make them numpy arrays because sometimes matplotlib doesn't like plotting lists
y_lower = np.array(y_lower)
y_upper = np.array(y_upper)
y_median = np.array(y_median)
# finding equation for the median line
p_opt, p_cov = curve_fit(f, x_fit, y_median)
a = float("{:.4f}".format(p_opt[0]))
b = float("{:.4f}".format(p_opt[1]))
for i, this_x in enumerate(x_fit):
this_y = y_fit.T[i]
spread_above = abs(point_line_distance(x, np.percentile(this_y, conf_hi), p_opt[0], p_opt[1]))
spread_below = abs(point_line_distance(x, np.percentile(this_y, conf_lo), p_opt[0], p_opt[1]))
orthog_distance = spread_above + spread_below
y_difference.append(orthog_distance)
spread = float("{:.4f}".format(np.amin(y_difference)))
print("narrowest orthogonal point on bootstrap_2 "+str(spread))
plt.fill_between(x_fit, y_lower, y_upper, alpha=0.4, label='Bootstrapped uncertainty at '+str(conf_pct)+'%')
plt.plot(x_fit, y_median, label='Bootstrapped curve_fit: y = ('+str(a)+')x + ('+str(b)+')')
def CURVE_fit(x, y):
# do a standard linear fitting to the data
p_opt, p_cov = curve_fit(f, x, y)
p_err = np.sqrt(np.diag(p_cov))
# y=ax+b for a linear fit
a, b = p_opt
a_err, b_err = p_err
x_plot = np.sort(x)
plt.plot(x, a*x+b, label='best curve_fit: y = ('+result(a,a_err)+')x + ('+result(b,b_err)+')', color = 'purple', linestyle = 'dashed')
I've tried playing around with the number of samples, the data input, and I've even coded an entirely seperate straight line fit using ODR and a corresponding independant bootstrapping method (they don't agree but thats a whole different issue) and nothing seems to reconsile these two values. Any ideas would be much appreciated.
Related
My idea is to apply linear regression to draw a line on a time series dataset to approximate the direction it is evolving in (first I draw the line, then I calculate the slope and I see if my plot is increasing decreasing, or constant).
For that, I relied on this code
def estimate_coef(x, y):
# number of observations/points
n = np.size(x)
# mean of x and y vector
m_x = np.mean(x)
m_y = np.mean(y)
# calculating cross-deviation and deviation about x
SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x
# calculating regression coefficients
b_1 = SS_xy / SS_xx
b_0 = m_y - b_1*m_x
return (b_0, b_1)
def plot_regression_line(x, y, b):
# plotting the actual points as scatter plot
plt.scatter(x, y, color = "m",
marker = "o", s = 30)
# predicted response vector
y_pred = b[0] + b[1]*x
# plotting the regression line
plt.plot(x, y_pred, color = "g")
# putting labels
plt.xlabel('x')
plt.ylabel('y')
# function to show plot
plt.show()
For that I need an X and Y array.
The data I extracted had an index in the format of a date "Y-M-D".
enter image description here
As you may know for linear regression it does not make sense to have the "date" as index, hence I used the A.reset_index() to get numeric indexes
enter image description here
Now that I got my data I need to extract the indexes to put them in an array "X" and the data to be plotted in an array "Y".
Therefore my question would be how to extract these new indexes and put them in the array X.
You can do:
x=[i + 1 for i in A.index] # to make data x starts with 1 instead of 0
y=A['lift']
And you apply your functions on those x and y
I am working with a projected coordinate dataset that contains x,y,z data (432 line csv with X Y Z headers, not attached). I wish to import this dataset, calculate a new grid based on user input and then start performing some statistics on points that fall within the new grid. I've gotten to the point that I have two lists (raw_lst with 431(x,y,z) and grid_lst with 16(x,y) (calling n,e)) but when I try to iterate through to start calculating average and density for the new grid it all falls apart. I am trying to output a final list that contains the grid_lst x and y values along with the calculated average z and density values.
I searched numpy and scipy libraries thinking that they may have already had something to do what I am wanting but was unable to find anything. Let me know if any of you all have any thoughts.
sample_xyz_reddot_is_newgrid_pictoral_representation
import pandas as pd
import math
df=pd.read_csv("Sample_xyz.csv")
N=df["X"]
E=df["Y"]
Z=df["Z"]
#grid = int(input("Specify grid value "))
grid = float(0.5) #for quick testing the grid value is set to 0.5
#max and total calculate the input area extents
max_N = math.ceil(max(N))
max_E = math.ceil(max(E))
min_E = math.floor(min(E))
min_N = math.floor(min(N))
total_N = max_N - min_N
total_E = max_E - min_E
total_N = int(total_N/grid)
total_E = int(total_E/grid)
#N_lst and E_lst calculate the mid points based on the input file extents and the specified grid file
N_lst = []
n=float(max_N)-(0.5*grid)
for x in range(total_N):
N_lst.append(n)
n=n-grid
E_lst = []
e=float(max_E)-(0.5*grid)
for x in range(total_E):
E_lst.append(e)
e=e-grid
grid_lst = []
for n in N_lst:
for e in E_lst:
grid_lst.append((n,e))
#converts the imported dataframe to list
raw_lst = df.to_records(index=False)
raw_lst = list(raw_lst)
#print(grid_lst) # grid_lst is a list of 16 (n,e) tuples for the new grid coordinates.
#print(raw_lst) # raw_lst is a list of 441 (n,e,z) tuples from the imported file - calling these x,y,z.
#The calculation where it all falls apart.
t=[]
average_lst = []
for n, e in grid_lst:
for x, y, z in raw_lst:
if n >= x-(grid/2) and n <= x+(grid/2) and e >= y-(grid/2) and e <= y+(grid/2):
t.append(z)
average = sum(t)/len(t)
density = len(t)/grid
average_lst = (n,e,average,density)
print(average_lst)
# print("The length of this list is " + str(len(average_lst)))
# print("The length of t is " + str(len(t)))
SAMPLE CODE FOR RUNNING
import random
grid=5
raw_lst = [(random.randrange(0,10), random.randrange(0,10), random.randrange(0,2))for i in range(100)]
grid_lst = [(2.5,2.5),(2.5,7.5),(7.5,2.5),(7.5,7.5)]
t=[]
average_lst = []
for n, e in grid_lst:
for x, y, z in raw_lst:
if n >= x-(grid/2) and n <= x+(grid/2) and e >= y-(grid/2) and e <= y+(grid/2):
t.append(z)
average = sum(t)/len(t)
density = len(t)/grid
average_lst = (n,e,average,density)
print(average_lst)
Some advices
when working with arrays, use numpy. It has more functionalities
when working with grids it's often more handy the use x-coords, y-coords as single arrays
Comments to the solution
obviousley you have a grid, or rather a box, grd_lst. We generate it as a numpy meshgrid (gx,gy)
you have a number of points raw_list. We generate each elemnt of it as 1-dimensional numpy arrays
you want to select the r_points that are in the g_box. We use the percentage formula for that: tx = (rx-gxMin)/(gxMax-gxMin)
if tx, ty are within [0..1] we store the index
as an intermediate result we get all indices of raw_list that are within the g_box
with that index you can extract the elements of raw_list that are within the g_box and can do some statistics
note that I have omitted the z-coord. You will have to improve this solution.
--
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.colors as mclr
from matplotlib import cm
f10 = 'C://gcg//picStack_10.jpg' # output file name
f20 = 'C://gcg//picStack_20.jpg' # output file name
def plot_grid(gx,gy,rx,ry,Rx,Ry,fOut):
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
myCmap = mclr.ListedColormap(['blue','lightgreen'])
ax.pcolormesh(gx, gy, gx, edgecolors='b', cmap=myCmap, lw=1, alpha=0.3)
ax.scatter(rx,ry,s=150,c='r', alpha=0.7)
ax.scatter(Rx,Ry,marker='s', s=150,c='gold', alpha=0.5)
ax.set_aspect('equal')
plt.savefig(fOut)
plt.show()
def get_g_grid(nx,ny):
ix = 2.5 + 5*np.linspace(0,1,nx)
iy = 2.5 + 5*np.linspace(0,1,ny)
gx, gy = np.meshgrid(ix, iy, indexing='ij')
return gx,gy
def get_raw_points(N):
rx,ry,rz,rv = np.random.randint(0,10,N), np.random.randint(0,10,N), np.random.randint(0,2,N), np.random.uniform(low=0.0, high=1.0, size=N)
return rx,ry,rz,rv
N = 100
nx, ny = 2, 2
gx,gy = get_base_grid(nx,ny)
rx,ry,rz,rv = get_raw_points(N)
plot_grid(gx,gy,rx,ry,0,0,f10)
def get_the_points_inside(gx,gy,rx,ry):
#----- run throuh the g-grid -------------------------------
nx,ny = gx.shape
N = len(rx)
index = []
for jx in range(0,nx-1):
for jy in range(0,ny-1):
#--- run through the r_points
for jr in range(N):
test_x = (rx[jr]-gx[jx,jy]) / (gx[jx+1,jy] - gx[jx,jy])
test_y = (ry[jr]-gy[jx,jy]) / (gy[jx,jy+1] - gy[jx,jy])
if (0.0 <= test_x <= 1.0) and (0.0 <= test_y <= 1.0):
index.append(jr)
return index
index = get_the_points_inside(gx,gy,rx,ry)
Rx, Ry, Rz, Rv = rx[index], ry[index], rz[index], rv[index]
plot_grid(gx,gy,rx,ry,Rx,Ry,f20)
I have a dataset that resembles the data created in the MWE below:
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
plt.grid()
That is, the errors around the line I want are mostly negative for x>0 and mostly positive for x<0. What I'm looking for is a way to estimate the coefficient K (which in this case is -2).
Really I think the way to do it would be to minimize the error only of the points that fall above the line for x<0 and below the line for x>0, but I'm not sure how to go about it effectively in Python, since everything I can think of involves iterative processes which are slow in Python.
Basically you want to include something that can account for the mean variable in your data generating model. You can do this by modeling a discontinuity at the point x=0 by including a variable in your model that is 0 where x < 0 and 1 where x > 0.
We can even just include the "mean" variable itself and get the same model (with a different interpretation for the second coefficient). Here is a linear model that recovers the correct value for the slope of this discontinuous line. Note that this assumes the slope is the same on the right side of 0 as the left side.
from sklearn.linear_model import LinearRegression
X = np.array([x, mean]).T
reg = LinearRegression().fit(X, y)
print(reg.coef_)
Here is my attempt where I A) fit all data to a straight line, and then B) separate data depending on two criteria: whether x is greater than or less than zero and whether predicted Y is above or below that straight line, and finally C) fit the separated data. The slope is here -2.417 and will vary from run to run depending on the random data.
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
###############################
# new section for calculatiing new line
allDataFirstOrderParameters = np.polyfit(x, y, 1)
allDataFirstOrderErrors = y - np.polyval(allDataFirstOrderParameters, x)
newX = []
newY = []
for i in range(len(x)):
if x[i] < 0 and allDataFirstOrderErrors[i] < 0:
newX.append(x[i])
newY.append(y[i])
if x[i] > 0 and allDataFirstOrderErrors[i] > 0:
newX.append(x[i])
newY.append(y[i])
newX = np.array(newX)
newY = np.array(newY)
newFirstOrderParameters = np.polyfit(newX, newY, 1)
print("New Parameters", newFirstOrderParameters)
plotNewX = np.linspace(min(x), max(x))
plotNewY = np.polyval(newFirstOrderParameters, plotNewX)
plt.plot(plotNewX, plotNewY, label="New line")
plt.legend()
plt.show()
It is the first time I am trying to write a Poincare section code at Python.
I borrowed the piece of code from here:
https://github.com/williamgilpin/rk4/blob/master/rk4_demo.py
and I have tried to run it for my system of second order coupled odes. The problem is that I do not see what I was expecting to. Actually, I need the Poincare section when x=0 and px>0.
I believe that my implementation is not the best out there. I would like to:
Improve the way that the initial conditions are chosen.
Apply the correct conditions (x=0 and px>0) in order to acquire the correct Poincare section.
Create one plot with all the collected poincare section data, not four separate ones.
I would appreciate any help.
This is the code:
from matplotlib.pyplot import *
from scipy import *
from numpy import *
# a simple Runge-Kutta integrator for multiple dependent variables and one independent variable
def rungekutta4(yprime, time, y0):
# yprime is a list of functions, y0 is a list of initial values of y
# time is a list of t-values at which solutions are computed
#
# Dependency: numpy
N = len(time)
y = array([thing*ones(N) for thing in y0]).T
for ii in xrange(N-1):
dt = time[ii+1] - time[ii]
k1 = dt*yprime(y[ii], time[ii])
k2 = dt*yprime(y[ii] + 0.5*k1, time[ii] + 0.5*dt)
k3 = dt*yprime(y[ii] + 0.5*k2, time[ii] + 0.5*dt)
k4 = dt*yprime(y[ii] + k3, time[ii+1])
y[ii+1] = y[ii] + (k1 + 2.0*(k2 + k3) + k4)/6.0
return y
# Miscellaneous functions
n= 1.0/3.0
kappa1 = 0.1
kappa2 = 0.1
kappa3 = 0.1
def total_energy(valpair):
(x, y, px, py) = tuple(valpair)
return .5*(px**2 + py**2) + (1.0/(1.0*(n+1)))*(kappa1*np.absolute(x)**(n+1)+kappa2*np.absolute(y-x)**(n+1)+kappa3*np.absolute(y)**(n+1))
def pqdot(valpair, tval):
# input: [x, y, px, py], t
# takes a pair of x and y values and returns \dot{p} according to the Hamiltonian
(x, y, px, py) = tuple(valpair)
return np.array([px, py, -kappa1*np.sign(x)*np.absolute(x)**n+kappa2*np.sign(y-x)*np.absolute(y-x)**n, kappa2*np.sign(y-x)*np.absolute(y-x)**n-kappa3*np.sign(y)*np.absolute(y)**n]).T
def findcrossings(data, data1):
# returns indices in 1D data set where the data crossed zero. Useful for generating Poincare map at 0
prb = list()
for ii in xrange(len(data)-1):
if (((data[ii] > 0) and (data[ii+1] < 0)) or ((data[ii] < 0) and (data[ii+1] > 0))) and data1[ii] > 0:
prb.append(ii)
return array(prb)
t = linspace(0, 1000.0, 100000)
print ("step size is " + str(t[1]-t[0]))
# Representative initial conditions for E=1
E = 1
x0=0
y0=0
init_cons = [[x0, y0, np.sqrt(2*E-(1.0*i/10.0)*(1.0*i/10.0)-2.0/(n+1)*(kappa1*np.absolute(x0)**(n+1)+kappa2*np.absolute(y0-x0)**(n+1)+kappa3*np.absolute(y0)**(n+1))), 1.0*i/10.0] for i in range(-10,11)]
outs = list()
for con in init_cons:
outs.append( rungekutta4(pqdot, t, con) )
# plot the results
fig1 = figure(1)
for ii in xrange(4):
subplot(2, 2, ii+1)
plot(outs[ii][:,1],outs[ii][:,3])
ylabel("py")
xlabel("y")
title("Full trajectory projected onto the plane")
fig1.suptitle('Full trajectories E = 1', fontsize=10)
# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in xrange(4):
subplot(2, 2, ii+1)
xcrossings = findcrossings(outs[ii][:,0], outs[ii][:,3])
yints = [.5*(outs[ii][cross, 1] + outs[ii][cross+1, 1]) for cross in xcrossings]
pyints = [.5*(outs[ii][cross, 3] + outs[ii][cross+1, 3]) for cross in xcrossings]
plot(yints, pyints,'.')
ylabel("py")
xlabel("y")
title("Poincare section x = 0")
fig2.suptitle('Poincare Sections E = 1', fontsize=10)
show()
You need to compute the derivatives of the Hamiltonian correctly. The derivative of |y-x|^n for x is
n*(x-y)*|x-y|^(n-2)=n*sign(x-y)*|x-y|^(n-1)
and the derivative for y is almost, but not exactly (as in your code), the same,
n*(y-x)*|x-y|^(n-2)=n*sign(y-x)*|x-y|^(n-1),
note the sign difference. With this correction you can take larger time steps, with correct linear interpolation probably even larger ones, to obtain the images
I changed the integration of the ODE to
t = linspace(0, 1000.0, 2000+1)
...
E_kin = E-total_energy([x0,y0,0,0])
init_cons = [[x0, y0, (2*E_kin-py**2)**0.5, py] for py in np.linspace(-10,10,8)]
outs = [ odeint(pqdot, con, t, atol=1e-9, rtol=1e-8) ) for con in init_cons[:8] ]
Obviously the number and parametrization of initial conditions may change.
The computation and display of the zero-crossings was changed to
def refine_crossing(a,b):
tf = -a[0]/a[2]
while abs(b[0])>1e-6:
b = odeint(pqdot, a, [0,tf], atol=1e-8, rtol=1e-6)[-1];
# Newton step using that b[0]=x(tf) and b[2]=x'(tf)
tf -= b[0]/b[2]
return [ b[1], b[3] ]
# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in xrange(8):
#subplot(4, 2, ii+1)
xcrossings = findcrossings(outs[ii][:,0], outs[ii][:,3])
ycrossings = [ refine_crossing(outs[ii][cross], outs[ii][cross+1]) for cross in xcrossings]
yints, pyints = array(ycrossings).T
plot(yints, pyints,'.')
ylabel("py")
xlabel("y")
title("Poincare section x = 0")
and evaluating the result of a longer integration interval
I was trying to implement a Radial Basis Function in Python and Numpy as describe by CalTech lecture here. The mathematics seems clear to me so I find it strange that its not working (or it seems to not work). The idea is simple, one chooses a subsampled number of centers for each Gaussian form a kernal matrix and tries to find the best coefficients. i.e. solve Kc = y where K is the guassian kernel (gramm) matrix with least squares. For that I did:
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
but when I try to plot my interpolation with the original data they don't look at all alike:
with 100 random centers (from the data set). I also tried 10 centers which produces essentially the same graph as so does using every data point in the training set. I assumed that using every data point in the data set should more or less perfectly copy the curve but it didn't (overfit). It produces:
which doesn't seem correct. I will provide the full code (that runs without error):
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
## Data sets
def get_labels_improved(X,f):
N_train = X.shape[0]
Y = np.zeros( (N_train,1) )
for i in range(N_train):
Y[i] = f(X[i])
return Y
def get_kernel_matrix(x,W,S):
beta = get_beta_np(S)
#beta = 0.5*tf.pow(tf.div( tf.constant(1.0,dtype=tf.float64),S), 2)
Z = -beta*euclidean_distances(X=x,Y=W,squared=True)
K = np.exp(Z)
return K
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = low_x + (high_x - low_x) * np.random.rand(N,1)
# f(x) = 2*(2(cos(x)^2 - 1)^2 -1
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = get_labels_improved(X , f)
K = 2 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 100
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
Y_pred = np.dot( Kern , C )
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
Since the plots look strange I decided to read the docs for the ploting functions but I couldn't find anything obvious that was wrong.
Scaling of interpolating functions
The main problem is unfortunate choice of standard deviation of the functions used for interpolation:
stddev = 100
The features of your functions (its humps) are of size about 1. So, use
stddev = 1
Order of X values
The mess of red lines is there because plt from matplotlib connects consecutive data points, in the order given. Since your X values are in random order, this results in chaotic left-right movements. Use sorted X:
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
Efficiency issues
Your get_labels_improved method is inefficient, looping over the elements of X. Use Y = f(X), leaving the looping to low-level NumPy internals.
Also, the computation of least-squared solution of an overdetermined system should be done with lstsq instead of computing the pseudoinverse (computationally expensive) and multiplying by it.
Here is the cleaned-up code; using 30 centers gives a good fit.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = f(X)
K = 30 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 1
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points,squared=True))
C = np.linalg.lstsq(Kern, Y)[0]
Y_pred = np.dot(Kern, C)
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()