Curve fitting different in Python than in Matlab

I have code in Matlab that I want to convert to Python. In the Matlab code, I'm using the Curve Fitting Toolbox to fit some data to a Fourier series of order 3. Here is how I did it in Matlab:
ft= fittype('fourier3');
myfit = fit(x,y,ft)
figure(20)
plot(y)
hold
figure(20)
plot(myfit)
And here is the plot of the data
So, to convert it to Python, I searched for a library equivalent to the Curve Fitting Toolbox and found one named 'symfit', which serves the same purpose. I looked at the documentation, found an example that could help, and used it with my data as follows:
from symfit import parameters, variables, sin, cos, Fit
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def fourier_series(x, f, n=0):
    """
    Returns a symbolic Fourier series of order `n`.
    :param n: Order of the Fourier series.
    :param x: Independent variable
    :param f: Frequency of the Fourier series
    """
    # Make the parameter objects for all the terms
    a0, *cos_a = parameters(','.join(['a{}'.format(i) for i in range(0, n + 1)]))
    sin_b = parameters(','.join(['b{}'.format(i) for i in range(1, n + 1)]))
    # Construct the series
    series = a0 + sum(ai * cos(i * f * x) + bi * sin(i * f * x)
                      for i, (ai, bi) in enumerate(zip(cos_a, sin_b), start=1))
    return series
T = pd.read_excel('data.xls')
A = pd.DataFrame(T)
x, y = variables('x, y')
w, = parameters('w')
model_dict = {y: fourier_series(x, f=w, n=3)}
print(model_dict)
xdata = np.array(A.iloc[:, 0])
ydata = np.array(A.iloc[:, 1])
# Define a Fit object for this model and data
fit = Fit(model_dict, x=xdata, y=ydata)
fit_result = fit.execute()
print(fit_result)
# Plot the result
plt.plot(xdata, ydata)
plt.plot(xdata, fit.model(x=xdata, **fit_result.params).y, ls=':')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
But when running the code, here is the plot I get:
I don't know why the fitted data is a straight line. Can anyone help with that problem? I don't know whether I used the wrong algorithm or I'm plotting the data incorrectly.
Edit:
Here is the data file for those who would like to try: https://docs.google.com/spreadsheets/d/18lL1iMZ3kdaqUUtRDLNRK4A3uCPzOrXt/edit?usp=sharing&ouid=112684448221465330517&rtpof=true&sd=true
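One thing worth checking (the thread never resolves it): symfit parameters default to an initial value of 1.0, and a Fourier-series fit is very sensitive to the starting frequency w. If w starts far from the data's true angular frequency, the optimizer can settle on a near-constant curve, which would look exactly like a straight-line fit. A minimal sketch of seeding w from the data, assuming the series spans roughly one period (the heuristic is mine, not from the question):
# Hypothetical initial guess: angular frequency for ~one period over the x-range.
w.value = 2 * np.pi / (xdata.max() - xdata.min())
fit = Fit(model_dict, x=xdata, y=ydata)
fit_result = fit.execute()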

Related

Error when using symfit in Python to fit a curve made from Excel-file data

The code was taken straight from the symfit documentation:
https://symfit.readthedocs.io/en/stable/examples/ex_fourier_series.html
I only modified xdata and ydata to read the data from a file using pandas instead. The problem is mismatched data types, as far as I understood from other similar Q&As. Could anyone tell me the solution to this?
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('float64'), dtype('<U1')) -> None
from symfit import parameters, variables, sin, cos, Fit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def fourier_series(x, f, n=0):
    """
    Returns a symbolic Fourier series of order `n`.
    :param n: Order of the Fourier series.
    :param x: Independent variable
    :param f: Frequency of the Fourier series
    """
    # Make the parameter objects for all the terms
    a0, *cos_a = parameters(','.join(['a{}'.format(i) for i in range(0, n + 1)]))
    sin_b = parameters(','.join(['b{}'.format(i) for i in range(1, n + 1)]))
    # Construct the series
    series = a0 + sum(ai * cos(i * f * x) + bi * sin(i * f * x)
                      for i, (ai, bi) in enumerate(zip(cos_a, sin_b), start=1))
    return series
x, y = variables('x, y')
w, = parameters('w')
model_dict = {y: fourier_series(x, f=w, n=12)}
print(model_dict)
# Make step function data
data= pd.read_csv('/content/Vel_Prof.csv')
xdata = ['t']
ydata = ['v']
# Define a Fit object for this model and data
fit = Fit(model_dict, x=xdata, y=ydata)
fit_result = fit.execute()
print(fit_result)
# Plot the result
plt.plot(xdata, ydata)
plt.plot(xdata, fit.model(x=xdata, **fit_result.params).y, ls=':')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
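Judging from the traceback, the UFuncTypeError (multiplying float64 by dtype '<U1', i.e. one-character strings) comes from xdata = ['t'] and ydata = ['v'] being literal lists of strings rather than the DataFrame columns. A minimal sketch of the intended extraction, assuming the CSV really has columns named t and v:
# Pull the numeric columns out of the DataFrame instead of passing the strings 't'/'v'.
xdata = data['t'].to_numpy(dtype=float)
ydata = data['v'].to_numpy(dtype=float)
With numeric arrays in place, the rest of the snippet matches the symfit documentation example.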

How to fit a specific exponential function with numpy

I'm trying to fit a series of data to an exponential equation. I found a great answer here (How to do exponential and logarithmic curve fitting in Python? I found only polynomial fitting), but it didn't contain the step forward that I need for this question.
I'm trying to fit y and x against the equation y = -A*e^(B*x) + A. The final A has proven to be big trouble: I don't know how to transform the equation into something like log(y) = log(A) + B*x, as I could if the final A were not there.
Any help is appreciated.
You can always just use scipy.optimize.curve_fit as long as your equation isn't too crazy:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as sio
def f(x, A, B):
    return -A*np.exp(B*x) + A
A = 2
B = 1
x = np.linspace(0,1)
y = f(x, A, B)
scale = (max(y) - min(y))*.10
noise = np.random.normal(size=x.size)*scale
y += noise
fit = sio.curve_fit(f, x, y)
plt.scatter(x, y)
plt.plot(x, f(x, *fit[0]))
plt.show()
This produces:
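As an aside, a hedged sketch of the log-transform the asker was after: rearranging y = -A*e^(B*x) + A gives A - y = A*e^(B*x), so log(A - y) = log(A) + B*x, which is linear in x. Since A is unknown, it has to be guessed before taking logs (the guess below is a heuristic, not part of this answer):
# A - y must stay positive, so guess A slightly above the largest observed y.
A_guess = y.max() + 0.1*(y.max() - y.min())
B_est, logA_est = np.polyfit(x, np.log(A_guess - y), 1)
print(B_est, np.exp(logA_est))  # exp(logA_est) landing near A_guess suggests a sane guess
curve_fit avoids this guessing game entirely, which is why it is the cleaner choice here.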

Generate random numbers from exponential distribution and model using python

My goal is to create a dataset of random points whose histogram looks like an exponential decay function and then plot an exponential decay function through those points.
First I tried to create a series of random numbers (but did not do so successfully since these should be points, not numbers) from an exponential distribution.
from pylab import *
from scipy.optimize import curve_fit
import random
import numpy as np
import pandas as pd
testx = pd.DataFrame(range(10)).astype(float)
testx = testx[0]
data = {}  # collect the samples (this initialization is missing in the original snippet)
for i in range(1, 11):
    x = random.expovariate(15)  # rate = 15 arrivals per second
    data[i] = [x]
testy = pd.DataFrame(data).T.astype(float)
testy = testy[0]; testy
plot(testx, testy, 'ko')
The result could look something like this.
And then I define a function to draw a line through my points:
def func(x, a, e):
    return a*np.exp(-a*x)+e
popt, pcov = curve_fit(f=func, xdata=testx, ydata=testy, p0=None, sigma=None)
print(popt)  # parameters
print(pcov)  # covariance
plot(testx, testy, 'ko')
xx = np.linspace(0, 15, 1000)
plot(xx, func(xx,*popt))
plt.show()
What I'm looking for is: (1) a more elegant way to create an array of random numbers from an exponential (decay) distribution and (2) how to test that my function is indeed going through the data points.
I would guess that the following is close to what you want. You can generate some random numbers drawn from an exponential distribution with numpy,
data = numpy.random.exponential(5, size=1000)
You can then create a histogram of them using numpy.histogram and draw the histogram values into a plot. You may decide to take the middle of each bin as the position for the point (this assumption is of course wrong, but becomes more valid the more bins you use).
Fitting then works as in the code from the question. You will find that the fit roughly recovers the parameter used for the data generation (here, ~5).
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
data = np.random.exponential(5, size=1000)
hist,edges = np.histogram(data,bins="auto",density=True )
x = edges[:-1]+np.diff(edges)/2.
plt.scatter(x,hist)
func = lambda x,beta: 1./beta*np.exp(-x/beta)
popt, pcov = curve_fit(f=func, xdata=x, ydata=hist)
print(popt)
xx = np.linspace(0, x.max(), 101)
plt.plot(xx, func(xx,*popt), ls="--", color="k",
         label=r"fit, $\beta = {:.3f}$".format(popt[0]))
plt.legend()
plt.show()
I think you are actually asking about a regression problem, which is what Praveen was suggesting.
You have a bog-standard exponential decay that arrives at the y-axis at about y=0.27. Its equation is therefore y = 0.27*exp(-0.27*x). I can model Gaussian error around the values of this function and plot the result using the following code.
import matplotlib.pyplot as plt
from math import exp
from scipy.stats import norm
x = range(0, 16)
Y = [0.27*exp(-0.27*_) for _ in x]
error = norm.rvs(0, scale=0.05, size=9)
simulated_data = [max(0, y+e) for (y,e) in zip(Y[:9],error)]
plt.plot(x, Y, 'b-')
plt.plot(x[:9], simulated_data, 'r.')
plt.show()
print (x[:9])
print (simulated_data)
Here's the plot. Notice that I save the output values for subsequent use.
Now I can calculate the nonlinear regression of the exponential decay values, contaminated with noise, on the independent variable, which is what curve_fit does.
from math import exp
from scipy.optimize import curve_fit
import numpy as np
def model(x, p):
    return p*np.exp(-p*x)
x = list(range(9))
Y = [0.22219001972988275, 0.15537454187341937, 0.15864069451825827, 0.056411162886672819, 0.037398831058143338, 0.10278251869912845, 0.03984605649260467, 0.0035360087611421981, 0.075855255999424692]
popt, pcov = curve_fit(model, x, Y)
print (popt[0])
print (pcov)
The bonus is that curve_fit not only calculates an estimate for the parameter (0.207962159793), it also offers an estimate of that estimate's variance (0.00086071) as an element of pcov. This would appear to be a fairly small value, given the small sample size.
Here's how to calculate the residuals. Notice that each residual is the difference between the data value and the value estimated from x using the parameter estimate.
residuals = [y-model(_, popt[0]) for (y, _) in zip(Y, x)]
print (residuals)
If you wanted to further 'test that my function is indeed going through the data points', then I would suggest looking for patterns in the residuals: Q-Q and P-P plots, plots of residuals vs y or x, and so on. (Discussions like this might be beyond what's welcome on Stack Overflow.)
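For what it's worth, a minimal sketch of those checks, reusing x and residuals from the snippets above:
import matplotlib.pyplot as plt
from scipy import stats
# Residuals vs x: systematic curvature here would hint that the model form is wrong.
plt.scatter(x, residuals)
plt.axhline(0, color='k', lw=0.5)
plt.xlabel('x')
plt.ylabel('residual')
plt.show()
# Q-Q plot of the residuals against a normal distribution.
stats.probplot(residuals, dist='norm', plot=plt)
plt.show()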
I agree with the solution from @ImportanceOfBeingErnest, but I'd like to add a (well-known?) general recipe for distributions: inverse transform sampling. If you have a density function f with integral F (i.e. f = dF/dx), then you get the required distribution by mapping uniform random numbers through inv F, the inverse function of F. In the case of the exponential function, the integral is again an exponential and the inverse is the logarithm, so it can be done like this:
import matplotlib.pyplot as plt
import numpy as np
from random import random
def gen( a ):
    y = random()
    return( -np.log( y ) / a )
def dist_func( x, a ):
    return( a * np.exp( -a * x) )
data = [ gen(3.14) for x in range(20000) ]
fig = plt.figure()
ax = fig.add_subplot( 1, 1, 1 )
ax.hist(data, bins=80, density=True, histtype="step")  # 'normed' was removed from matplotlib; use density
ax.plot(np.linspace(0,5,150), dist_func( np.linspace(0,5,150), 3.14 ) )
plt.show()
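As a cross-check on this approach: NumPy provides the exponential distribution directly, so data = np.random.exponential(scale=1/3.14, size=20000) draws from the same distribution as [gen(3.14) for x in range(20000)] above.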

How does one implement a subsampled RBF (Radial Basis Function) in Numpy?

I was trying to implement a Radial Basis Function (RBF) fit in Python and NumPy, as described in the Caltech lecture here. The mathematics seems clear to me, so I find it strange that it's not working (or seems not to work). The idea is simple: choose a subsampled set of centers, form the Gaussian kernel (Gram) matrix K from them, and find the best coefficients by solving Kc = y with least squares. For that I did:
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
but when I try to plot my interpolation against the original data, they don't look alike at all:
with 100 random centers (chosen from the data set). I also tried 10 centers, which produces essentially the same graph, as does using every data point in the training set. I assumed that using every data point should more or less perfectly copy the curve (overfit), but it didn't. It produces:
which doesn't seem correct. Here is the full code (it runs without error):
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
## Data sets
def get_labels_improved(X,f):
    N_train = X.shape[0]
    Y = np.zeros( (N_train,1) )
    for i in range(N_train):
        Y[i] = f(X[i])
    return Y
def get_kernel_matrix(x,W,S):
    beta = get_beta_np(S)
    #beta = 0.5*tf.pow(tf.div( tf.constant(1.0,dtype=tf.float64),S), 2)
    Z = -beta*euclidean_distances(X=x,Y=W,squared=True)
    K = np.exp(Z)
    return K
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = low_x + (high_x - low_x) * np.random.rand(N,1)
# f(x) = 2*(2(cos(x)^2 - 1)^2 -1
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = get_labels_improved(X , f)
K = 2 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 100
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
Y_pred = np.dot( Kern , C )
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
Since the plots look strange, I decided to read the docs for the plotting functions, but I couldn't find anything obvious that was wrong.
Scaling of interpolating functions
The main problem is an unfortunate choice of the standard deviation of the functions used for interpolation:
stddev = 100
The features of your function (its humps) are of size about 1, so use
stddev = 1
Order of X values
The mess of red lines is there because plt from matplotlib connects consecutive data points, in the order given. Since your X values are in random order, this results in chaotic left-right movements. Use sorted X:
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
Efficiency issues
Your get_labels_improved method is inefficient, looping over the elements of X. Use Y = f(X), leaving the looping to low-level NumPy internals.
Also, the least-squares solution of an overdetermined system should be computed with lstsq instead of forming the pseudoinverse (computationally expensive) and multiplying by it.
Here is the cleaned-up code; using 30 centers gives a good fit.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = f(X)
K = 30 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 1
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points,squared=True))
C = np.linalg.lstsq(Kern, Y)[0]
Y_pred = np.dot(Kern, C)
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()

Predicting values given a sinusoidal fit

I'm using Python to fit a time series with a sinusoidal function. I found quite a good match, and now I want to be able to predict future values, but I'm at a loss here.
Here's what I've got:
timeSeries = [0.01146, 0.00724, 0.00460, 0.00192, 0.00145, 0.01559, 0.02585, 0.04118, 0.05073, 0.01966, 0.01486, 0.02784]
import numpy as np
from scipy.optimize import curve_fit
def createSinFromFit(x, freq, amplitude, phase, offset):
    return np.sin(x * freq + phase) * amplitude + offset
def sinRegr(series):
    t = np.linspace(0, 4*np.pi, len(series))
    guess_freq = 1
    guess_amplitude = 3*np.std(series)/(2**0.5)
    guess_phase = 0
    guess_offset = np.mean(series)
    p0 = [guess_freq, guess_amplitude, guess_phase, guess_offset]
    fit = curve_fit(createSinFromFit, t, series, p0=p0)
    results = createSinFromFit(t, *fit[0])
    return results
plotThis = sinRegr(timeSeries)
This code produces the fitting you see in this picture:
How can I extend the sine function so that it predicts future points of the series? That is, how can I have the sine plot continue to the right, beyond the area covered by the 'known' data points?
You need to distinguish a data timeline (input) and a fit timeline (output). Once you do that, the approach is fairly clear. Below I called them tdata and tfit:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
tdata = np.linspace(0, 10)
timeSeries = np.sin(tdata) + .4*np.random.random(tdata.shape)
def createSinFromFit(x, freq, amplitude, phase, offset):
    return np.sin(x * freq + phase) * amplitude + offset
def sinRegr(tdata, series):
    tfit = np.linspace(0, 6*np.pi, len(series))
    guess_freq = .2
    guess_amplitude = 3*np.std(series)/(2**0.5)
    guess_phase = 0
    guess_offset = np.mean(series)
    p0 = [guess_freq, guess_amplitude, guess_phase, guess_offset]
    fit = curve_fit(createSinFromFit, tdata, series, p0=p0)  # use tdata to create the fit
    results = createSinFromFit(tfit, *fit[0])                # use tfit to generate a new curve
    return tfit, results
tfit, plotThis = sinRegr(tdata, timeSeries)
plt.plot(tfit, plotThis)
plt.plot(tdata, timeSeries, "ro")
plt.show()
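Note that tfit runs out to 6π ≈ 18.8 while the data stop at t = 10, so the fitted curve already extends well past the last observation; to predict further ahead, simply widen the tfit range.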
