SciPy + Numpy: Finding the slope of a sigmoid curve - python

I have some data that follow a sigmoid distribution, as you can see in the following image:
After normalizing and scaling my data, I fitted the curve shown at the bottom using scipy.optimize.curve_fit and some initial parameters:
popt, pcov = curve_fit(sigmoid_function, xdata, ydata, p0 = [0.05, 0.05, 0.05])
>>> print popt
[ 2.82019932e+02 -1.90996563e-01 5.00000000e-02]
So popt, according to the documentation, returns "Optimal values for the parameters so that the sum of the squared error of f(xdata, popt) - ydata is minimized". I understand from this that the slope is not calculated by curve_fit, because I do not think the slope of this gentle curve is 282, nor is it negative.
Then I tried with scipy.optimize.leastsq, because the documentation says it returns "The solution (or the result of the last iteration for an unsuccessful call).", so I thought the slope would be returned. Like this:
p, cov, infodict, mesg, ier = leastsq(residuals, p_guess, args = (nxdata, nydata), full_output=True)
>>> print p
Param(x0=281.73193626250207, y0=-0.012731420027056234, c=1.0069006606656596, k=0.18836680131910222)
But again, I did not get what I expected. curve_fit and leastsq returned almost the same values, which is not surprising, I guess, as curve_fit uses a least squares method internally to find the curve. But no slope came back... unless I overlooked something.
So, how to calculate the slope in a point, say, where X = 285 and Y = 0.5?
I am trying to avoid manual methods, like taking two nearby points, say (285.5, 0.55) and (284.5, 0.45), and subtracting and dividing by hand to approximate the derivative. I would like to know if there is a more automatic method for this.
Thank you all!
EDIT #1
This is my "sigmoid_function", used by curve_fit and leastsq methods:
def sigmoid_function(xdata, x0, k, p0):  # p0 not used anymore, only its components (x0, k)
    # This function is called by two different methods: curve_fit and leastsq,
    # the latter through the function "residuals". I don't know if it makes sense
    # to use a single function for two (somewhat similar) methods, but there
    # it goes.
    # p0:
    #   + Is the initial parameter for scipy.optimize.curve_fit.
    #   + For the residuals calculation it is left empty.
    #   + It is initialized to [0.05, 0.05, 0.05].
    # x0:
    #   + Is the convergence parameter on the X-axis and also the shift.
    #   + It starts at 0.05 and ends up being around ~282 (days in a year).
    # k:
    #   + Set up either by curve_fit or leastsq.
    #   + In least squares it is initially fixed at 0.5 and in curve_fit
    #     at 0.05. Why? I just did this in two different ways and
    #     it seems to be working.
    #   + But honestly, I have no clue what it represents.
    # xdata:
    #   + Positions on the X-axis. In this case from 240 to 365.
    # Finally I changed those parameters as suggested in the answer.
    # The sigmoid curve has 2 degrees of freedom, therefore the initial
    # guess only needs to be this size. In this case, p0 = [282, 0.5].
    y = np.exp(-k*(xdata-x0)) / (1 + np.exp(-k*(xdata-x0)))
    return y

def residuals(p_guess, xdata, ydata):
    # For the residuals calculation, there is no need to set up the initial parameters.
    # After fixing the initial guess and the sigmoid_function header, remove [].
    # return ydata - sigmoid_function(xdata, p_guess[0], p_guess[1], [])
    return ydata - sigmoid_function(xdata, p_guess[0], p_guess[1], [])
I am sorry if I made mistakes while describing the parameters or confused technical terms. I am very new to numpy and I have not studied maths for years, so I am catching up again.
So, again, what is your advice for calculating the slope at X = 285, Y = 0.5 (more or less the midpoint) for this dataset? Thanks!!
EDIT #2
Thanks to Oliver W., I updated my code as he suggested and understood the problem a bit better.
There is a final detail I do not fully get. Apparently, curve_fit returns a popt array (x0, k) with the optimal parameters for the fit:
x0 seems to indicate how shifted the curve is, by giving the central point of the curve
the k parameter is the slope at y = 0.5, also at the center of the curve (I think!)
Why, if the sigmoid function is an increasing one, is the derivative/slope in popt negative? Does that make sense?
I used sigmoid_derivative to calculate the slope and, yes, I obtained the same results as popt, but with a positive sign.
# Year 2003, 2005, 2007. Slope in midpoint.
k = [-0.1910, -0.2545, -0.2259] # Values coming from popt
slope = [0.1910, 0.2545, 0.2259] # Values coming from sigmoid_derivative function
I know I am being a bit picky here, because I could use either. The relevant value is in there, just with a negative sign, but I was wondering why this happens.
So, the calculation of the derivative function you suggested is only required if I need to know the slope at points other than y = 0.5. For the midpoint alone, I can use popt.
Thanks for your help, it saved me a lot of time. :-)

You're never using the parameter p0 you're passing to your sigmoid function. Hence, the curve fitting has no way to constrain that parameter, because it can take any value. You should first rewrite your sigmoid function like this:
def sigmoid_function(xdata, x0, k):
    y = np.exp(-k*(xdata-x0)) / (1 + np.exp(-k*(xdata-x0)))
    return y
This means your model (the sigmoid) has only two degrees of freedom. This will be returned in popt:
initial_guess = [282, 1] # (x0, k): at x0, the sigmoid reaches 50%, k is slope related
popt, pcov = curve_fit(sigmoid_function, xdata, ydata, p0=initial_guess)
Now popt will be a tuple (or an array of 2 values) containing the best possible x0 and k.
To get the slope of this function at any point, to be honest, I would just calculate the derivative symbolically as the sigmoid is not such a hard function. You will end up with:
def sigmoid_derivative(x, x0, k):
    f = np.exp(-k*(x-x0))
    return -k / f
If you have the results from your curve fitting stored in popt, you could pass this easily to this function:
print(sigmoid_derivative(285, *popt))
which will return the derivative at x=285. But because you ask specifically for the midpoint, i.e. where x==x0 and y==0.5, you'll see (from sigmoid_derivative) that the derivative there is just -k, which can be read off immediately from the curve_fit output you've already obtained. In the output you've shown, that's about 0.19.
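Putting the pieces together, a minimal sketch (assuming xdata and ydata are the normalized NumPy arrays from the question, and sigmoid_function and sigmoid_derivative are defined as above):
import numpy as np
from scipy.optimize import curve_fit

initial_guess = [282, 1]                  # (x0, k)
popt, pcov = curve_fit(sigmoid_function, xdata, ydata, p0=initial_guess)

print(sigmoid_derivative(285, *popt))     # slope of the fitted curve at x = 285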

Related

How to prioritise some points over others using curve fit from SciPy

I want to model the following curve:
To do this, I'm using curve_fit from SciPy, fitting an exponential function.
def exponenial_func(x, a, b, c):
    return a * b**(c*x)

popt, pcov = curve_fit(exponenial_func, x, y, p0=(1, 2, 2),
                       bounds=((0, 0, 0), (np.inf, np.inf, np.inf)))
When I first do it, I get this:
This minimises the residuals, giving each point the same level of importance.
What I want is a fit that gives more importance to the last values of the curve (from x = 30 onwards, for example) than to the first ones, so that it fits better at the end of the curve than at the beginning.
I know there are many ways to approach this (first of all, defining what importance I want to give to each residual). My question here is to get some idea of how to approach it.
One idea that I had, is to change the sigma value to weight each data point by its inverse value.
popt, pcov = curve_fit(exponenial_func, x, y, p0=(1, 2, 2),
                       bounds=((0, 0, 0), (np.inf, np.inf, np.inf)),
                       sigma=1/y)
In this case, I get something like I was looking for:
It doesn't look bad, but I'm looking for another way of doing this, so that I can "control" the weight of each data point: for example, weighting the residuals linearly, or exponentially, or even choosing the weights manually (rather than weighting all of them by the inverse, as in the previous case).
Thanks in advance
First of all, note that there's no need for three coefficients. Since
a * b**(c*x) = a * exp(log(b)*c*x),
we can define k = log(b)*c and fit only a and k.
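A quick numerical check of that identity (the example values are arbitrary):
import numpy as np

a, b, c = 2.0, 3.0, 0.5
x = np.linspace(0.0, 5.0, 11)
k = np.log(b) * c
print(np.allclose(a * b**(c*x), a * np.exp(k*x)))   # True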
Here's a suggestion how you could tackle your problem by hand with scipy.optimize.least_squares and a priority vector:
import numpy as np
from scipy.optimize import least_squares

def exponenial_func2(x, a, k):
    return a * np.exp(k*x)

# returns the vector of residuals
def fitwrapper2(coeffs, *args):
    xdata, ydata, prio = args
    return prio*(exponenial_func2(xdata, *coeffs)-ydata)

# Data
n = 31
xdata = np.arange(n)
ydata = np.array([155.0,229,322,453,655,888,1128,1694,
                  2036,2502,3089,3858,4636,5883,7375,
                  9172,10149,12462,12462,17660,21157,
                  24747,27980,31506,35713,41035,47021,
                  53578,59138,63927,69176])

# The priority vector
prio = np.ones(n)
prio[-1] = 5

res = least_squares(fitwrapper2, x0=[1.0, 2.0], bounds=(0, np.inf), args=(xdata, ydata, prio))
With prio[-1] = 5 we give the last point a high priority.
res.x contains your optimal coefficients. Here a, k = res.x.
Note that for prio = np.ones(n) it's a normal least squares fitting (like curve_fit does) where all points have the same priority.
You can control the priority of each point by increasing its value in the prio array. Comparing both results gives me:
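For example, to weight later points progressively more heavily instead of singling out the last one, prio can be filled with an increasing profile (a sketch reusing xdata, ydata and fitwrapper2 from above; the particular weight profiles are arbitrary):
prio_lin = 1.0 + xdata / 10.0      # linearly increasing weights
prio_exp = np.exp(xdata / 15.0)    # exponentially increasing weights

res_lin = least_squares(fitwrapper2, x0=[1.0, 2.0], bounds=(0, np.inf),
                        args=(xdata, ydata, prio_lin))
res_exp = least_squares(fitwrapper2, x0=[1.0, 2.0], bounds=(0, np.inf),
                        args=(xdata, ydata, prio_exp))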

Better fitting for a power-law curve

So I had some points in a dataframe that led me to believe I was dealing with a power law curve. After some googling, I used what I found in this post to go about curve fitting.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func_powerlaw(x, m, c, c0):
    return c0 + x**m * c

target_func = func_powerlaw

X = np.array(selection_to_feed.selection[1:])
y = np.array(selection_to_feed.avg_feed_size[1:])

popt, pcov = curve_fit(func_powerlaw, X, y, p0=np.asarray([-1, 10**5, 0]))

curvex = np.linspace(0, 5000, 1000)
curvey = target_func(curvex, *popt)

plt.figure(figsize=(10, 5))
plt.plot(curvex, curvey, '--')
plt.plot(X, y, 'ro')
plt.legend()
plt.show()
This is the result:
The problem is that the curve fit produces negative values for the first few X values (as you can see in the blue line), and in the actual relationship no negative Y values can exist.
A few questions:
What can I do to make sure no negative Y values are output? Really, an X of 0 should have a Y value of 0 as well.
Is power law curve fitting even the right thing to do? How would you describe this curve?
Thank you!
If you are only looking for a simple approximating equation with a better fit: I extracted data from your plot and added the known data point [0,0] per your note. Since the uncertainty for the [0,0] point is zero (that is, you are 100% certain of that value), I used a weighted regression in which that one known point was given an extremely high weight and all other points were given a weight of 1. This has the effect of forcing the curve through the [0,0] point, and it can be done with any software that allows weighted fitting. I found that a Standard Geometric plus offset equation, "y = a * pow(x, (b * x)) + offset", with parameters:
a = -1.0704001788540748E+02
b = -1.5095055897637395E-03
Offset = 1.0704001788540748E+02
fits as shown in the attached plot and passes through [0,0]. My suggestion is to perform a regression using this equation with the actual data plus the known [0,0] point, using these values as the initial parameter estimates, and, if possible, using a very large weight for the [0,0] point as I did.
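A possible way to do that weighted fit with curve_fit is sketched below; the function name and the use of the sigma argument to implement the weighting are my assumptions, and X and y are the arrays from the question:
import numpy as np
from scipy.optimize import curve_fit

def geometric_offset(x, a, b, offset):
    # "Standard Geometric plus offset": y = a * x**(b*x) + offset
    return a * np.power(x, b * x) + offset

# actual data plus the known [0, 0] point
x_all = np.concatenate(([0.0], X.astype(float)))
y_all = np.concatenate(([0.0], y.astype(float)))

# sigma is the per-point uncertainty; a tiny value for [0, 0] gives it a very large weight
sigma = np.ones_like(y_all)
sigma[0] = 1e-6

p0 = [-1.07e+02, -1.51e-03, 1.07e+02]   # the answer's values as initial estimates
popt, pcov = curve_fit(geometric_offset, x_all, y_all, p0=p0, sigma=sigma, maxfev=10000)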

Gaussian fit returning negative sigma

One of my algorithms performs automatic peak detection based on a Gaussian function, and then later determines the edges based either on a multiplier (user setting) of the sigma or on the 'full width at half maximum'. In the scenario where a user specifies that he/she wants the peak limited at 2 sigma, the algorithm takes -/+ 2*sigma from the peak center (mu). However, I noticed that the sigma returned by curve_fit can be negative, which is something that has been noticed before, as can be seen here. However, as I determine the borders by doing -/+, this can lead to the algorithm 'failing' (because subtracting a negative sigma moves the border the wrong way), as can be seen in the following code.
MVCE
#! /usr/bin/env python
from scipy.optimize import curve_fit
import bisect
import numpy as np
X = [16.4697402328,16.4701402404,16.4705402481,16.4709402557,16.4713402633,16.4717402709,16.4721402785,16.4725402862,16.4729402938,16.4733403014,16.473740309,16.4741403166,16.4745403243,16.4749403319,16.4753403395,16.4757403471,16.4761403547,16.4765403623,16.47694037,16.4773403776,16.4777403852,16.4781403928,16.4785404004,16.4789404081,16.4793404157,16.4797404233,16.4801404309,16.4805404385,16.4809404462,16.4813404538,16.4817404614,16.482140469,16.4825404766,16.4829404843,16.4833404919,16.4837404995,16.4841405071,16.4845405147,16.4849405224,16.48534053,16.4857405376,16.4861405452,16.4865405528,16.4869405604,16.4873405681,16.4877405757,16.4881405833,16.4885405909,16.4889405985,16.4893406062,16.4897406138,16.4901406214,16.490540629,16.4909406366,16.4913406443,16.4917406519,16.4921406595,16.4925406671,16.4929406747,16.4933406824,16.49374069,16.4941406976,16.4945407052,16.4949407128,16.4953407205,16.4957407281,16.4961407357,16.4965407433,16.4969407509,16.4973407585,16.4977407662,16.4981407738,16.4985407814,16.498940789,16.4993407966,16.4997408043,16.5001408119,16.5005408195,16.5009408271,16.5013408347,16.5017408424,16.50214085,16.5025408576,16.5029408652,16.5033408728,16.5037408805,16.5041408881,16.5045408957,16.5049409033,16.5053409109,16.5057409186,16.5061409262,16.5065409338,16.5069409414,16.507340949,16.5077409566,16.5081409643,16.5085409719,16.5089409795,16.5093409871,16.5097409947,16.5101410024,16.51054101,16.5109410176,16.5113410252,16.5117410328,16.5121410405,16.5125410481,16.5129410557,16.5133410633,16.5137410709,16.5141410786,16.5145410862,16.5149410938,16.5153411014,16.515741109,16.5161411166,16.5165411243,16.5169411319,16.5173411395,16.5177411471,16.5181411547,16.5185411624,16.51894117,16.5193411776,16.5197411852,16.5201411928,16.5205412005,16.5209412081,16.5213412157,16.5217412233,16.5221412309,16.5225412386,16.5229412462,16.5233412538,16.5237412614,16.524141269,16.5245412767,16.5249412843,16.5253412919,16.5257412995,16.5261413071,16.5265413147,16.5269413224,16.52734133,16.5277413376,16.5281413452,16.5285413528,16.5289413605,16.5293413681,16.5297413757,16.5301413833,16.5305413909,16.5309413986,16.5313414062,16.5317414138,16.5321414214,16.532541429,16.5329414367,16.5333414443,16.5337414519,16.5341414595,16.5345414671,16.5349414748,16.5353414824,16.53574149,16.5361414976,16.5365415052,16.5369415128,16.5373415205,16.5377415281,16.5381415357,16.5385415433,16.5389415509,16.5393415586,16.5397415662,16.5401415738,16.5405415814,16.540941589,16.5413415967,16.5417416043,16.5421416119,16.5425416195,16.5429416271,16.5433416348,16.5437416424,16.54414165,16.5445416576,16.5449416652,16.5453416729,16.5457416805,16.5461416881,16.5465416957,16.5469417033,16.5473417109,16.5477417186,16.5481417262,16.5485417338,16.5489417414,16.549341749,16.5497417567,16.5501417643,16.5505417719,16.5509417795,16.5513417871,16.5517417948,16.5521418024,16.55254181,16.5529418176,16.5533418252,16.5537418329,16.5541418405,16.5545418481,16.5549418557,16.5553418633,16.5557418709,16.5561418786,16.5565418862,16.5569418938,16.5573419014,16.557741909,16.5581419167,16.5585419243,16.5589419319,16.5593419395,16.5597419471,16.5601419548,16.5605419624,16.56094197,16.5613419776,16.5617419852,16.5621419929,16.5625420005,16.5629420081,16.5633420157,16.5637420233,16.564142031]
Y = [11579127.8554,11671781.7263,11764419.0191,11857026.0444,11949589.1124,12042094.5338,12134528.6188,12226877.6781,12319128.0219,12411265.9609,12503277.8053,12595149.8657,12686868.4525,12778419.8762,12869790.334,12960965.209,13051929.5278,13142668.3154,13233166.5969,13323409.3973,13413381.7417,13503068.6552,13592455.1627,13681526.2894,13770267.0602,13858662.5004,13946697.6348,14034357.4886,14121627.0868,14208491.4544,14294935.6166,14380944.5984,14466503.4248,14551597.1208,14636210.7116,14720329.3102,14803938.4081,14887023.5981,14969570.4732,15051564.6263,15132991.6503,15213837.1383,15294086.683,15373725.8775,15452740.3147,15531115.5875,15608837.2888,15685891.0116,15762262.3488,15837936.8934,15912900.2382,15987137.9762,16060635.7004,16133379.0036,16205353.4789,16276544.72,16346938.7731,16416522.8674,16485284.4226,16553210.8587,16620289.5956,16686508.0531,16751853.6511,16816313.8096,16879875.9485,16942527.4876,17004255.8468,17065048.446,17124892.7052,17183776.0442,17241685.8829,17298609.6412,17354534.739,17409448.5962,17463338.6327,17516192.2683,17567996.9463,17618741.7702,17668418.588,17717019.5043,17764536.6238,17810962.0514,17856287.8916,17900506.2493,17943609.2292,17985588.936,18026437.4744,18066146.9493,18104709.4653,18142117.1271,18178362.0396,18213436.3074,18247332.0352,18280041.3279,18311556.2901,18341869.0265,18370971.642,18398856.332,18425517.6188,18450952.493,18475158.064,18498131.4412,18519869.7341,18540370.0523,18559629.505,18577645.202,18594414.2525,18609933.7661,18624200.8523,18637212.6205,18648966.1802,18659458.6408,18668687.1119,18676648.7029,18683340.5233,18688759.6825,18692903.29,18695768.4553,18697352.5327,18697655.9558,18696681.2608,18694431.0245,18690907.8241,18686114.2363,18680052.838,18672726.2063,18664136.918,18654287.5501,18643180.6795,18630818.883,18617204.7377,18602340.8204,18586229.7081,18568873.9777,18550276.2061,18530438.9703,18509364.8471,18487056.4135,18463516.2464,18438747.4526,18412756.9228,18385553.1936,18357144.808,18327540.3094,18296748.2409,18264777.1456,18231635.5669,18197332.0479,18161875.1318,18125273.3619,18087535.2812,18048669.4331,18008684.3606,17967588.6071,17925390.7158,17882099.2297,17837722.6922,17792269.6464,17745748.6355,17698168.2027,17649537.512,17599868.3744,17549173.3069,17497464.8262,17444755.4492,17391057.6927,17336384.0736,17280747.1087,17224159.3148,17166633.2088,17108181.3075,17048816.1277,16988550.1864,16927396.0002,16865366.0862,16802472.961,16738729.1416,16674147.1447,16608739.4873,16542518.6861,16475497.2591,16407688.2541,16339106.0951,16269765.4262,16199680.8916,16128867.1358,16057338.8029,15985110.5372,15912196.9829,15838612.7844,15764372.5859,15689491.0316,15613982.7659,15537862.4329,15461144.6771,15383844.1425,15305975.4735,15227553.3143,15148592.3093,15069107.1026,14989112.3386,14908622.6595,14827652.5673,14746216.3337,14664328.209,14582002.4435,14499253.2874,14416094.9911,14332541.8049,14248607.9791,14164307.764,14079655.4098,13994665.1668,13909351.2855,13823728.016,13737809.6086,13651610.3137,13565144.3816,13478426.0625,13391469.6068,13304289.2646,13216899.2865,13129313.8865,13041546.3657,12953609.0623,12865514.2686,12777274.277,12688901.3798,12600407.8693,12511806.0378,12423108.1777,12334326.5812,12245473.5407,12156561.3486,12067602.297,11978608.6785,11889592.7852]
def gaussFunction(x, *p):
    """Define and return a Gaussian function.

    This function returns the value of a Gaussian function, using the
    A, mu and sigma value that is provided as *p.

    Keyword arguments:
    x -- number
    p -- A, mu and sigma numbers
    """
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))
newGaussX = np.linspace(10, 25, 2500*(X[-1]-X[0]))
p0 = [np.max(Y), X[np.argmax(Y)],0.1]
coeff, var_matrix = curve_fit(gaussFunction, X, Y, p0)
newGaussY = gaussFunction(newGaussX, *coeff)
print "Sigma is "+str(coeff[2])
# Original
low = bisect.bisect_left(newGaussX,coeff[1]-2*coeff[2])
high = bisect.bisect_right(newGaussX,coeff[1]+2*coeff[2])
print newGaussX[low], newGaussX[high]
# Absolute
low = bisect.bisect_left(newGaussX,coeff[1]-2*abs(coeff[2]))
high = bisect.bisect_right(newGaussX,coeff[1]+2*abs(coeff[2]))
print newGaussX[low], newGaussX[high]
Bottom line: is taking the abs() of the sigma 'correct', or should this problem be solved in a different way?
You are fitting a function gaussFunction that does not care whether sigma is positive or negative. So whether you get a positive or negative result is mostly a matter of luck, and taking the absolute value of the returned sigma is fine. Also consider other possibilities:
(Suggested by Thomas Kühn): modify the model function so that it cares about the sign of sigma. Bringing it closer to the normalized Gaussian form would be enough: the formula A/np.sqrt(sigma)*np.exp(-(x-mu)**2/(2.*sigma**2)) would ensure that you get positive sigma only. A possible, mild downside is that the function takes a bit longer to compute.
Use the variance, sigma_squared, as a parameter:
A, mu, sigma_squared = p
return A*np.exp(-(x-mu)**2/(2.*sigma_squared))
This is probably easiest in terms of keeping the model equation simple. You will need to square your initial guess for that parameter, and take the square root when you need sigma itself.
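For instance, a minimal sketch assuming gaussFunction has been changed as above so that its third parameter is the variance:
# square the original guess of 0.1 for the standard deviation
p0 = [np.max(Y), X[np.argmax(Y)], 0.1**2]
coeff, var_matrix = curve_fit(gaussFunction, X, Y, p0)
sigma = np.sqrt(coeff[2])   # recover sigma itself; always non-negative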
Aside: you hardcoded 0.1 as a guess for standard deviation. This probably should be based on data, like this:
peak = X[Y > np.exp(-0.5)*Y.max()]
guess_sigma = 0.5*(peak.max() - peak.min())
The idea is that within one standard deviation of the mean, the values of the Gaussian are greater than np.exp(-0.5) times the maximum value. So the first line locates this "peak" and the second takes half of its width as the guess for sigma.
For the above to work, X and Y should already be converted to NumPy arrays, e.g., X = np.array([16.4697402328,16.4701402404,..... This is a good idea in general: otherwise, every NumPy method that receives X or Y has to make this conversion again.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful for this. It includes a GaussianModel for curve fitting that normalizes the Gaussian and also restricts sigma to be positive, using a parameter transformation that is gentler than abs(sigma). Your example would look like this:
from lmfit.models import GaussianModel
xdat = np.array(X)
ydat = np.array(Y)
model = GaussianModel()
params = model.guess(ydat, x=xdat)
result = model.fit(ydat, params, x=xdat)
print(result.fit_report())
which will print a report with best-fit values and estimated uncertainties for all the parameters, including the FWHM:
[[Model]]
Model(gaussian)
[[Fit Statistics]]
# function evals = 31
# data points = 237
# variables = 3
chi-square = 95927408861.607
reduced chi-square = 409946191.716
Akaike info crit = 4703.055
Bayesian info crit = 4713.459
[[Variables]]
sigma: 0.04880178 +/- 1.57e-05 (0.03%) (init= 0.0314006)
center: 16.5174203 +/- 8.01e-06 (0.00%) (init= 16.51754)
amplitude: 2.2859e+06 +/- 586.4103 (0.03%) (init= 670578.1)
fwhm: 0.11491942 +/- 3.51e-05 (0.03%) == '2.3548200*sigma'
height: 1.8687e+07 +/- 910.0152 (0.00%) == '0.3989423*amplitude/max(1.e-15, sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(sigma, amplitude) = 0.949
The values for center +/- 2*sigma would be found with
xlo = result.params['center'].value - 2 * result.params['sigma'].value
xhi = result.params['center'].value + 2 * result.params['sigma'].value
You can use the result to evaluate the model with fitted parameters and different X values:
newGaussX = np.linspace(10, 25, 2500*(X[-1]-X[0]))
newGaussY = result.eval(x=newGaussX)
I would also recommend using numpy.where to find the location of center+/-2*sigma instead of bisect:
low = np.where(newGaussX > xlo)[0][0] # replace bisect_left
high = np.where(newGaussX <= xhi)[0][-1] + 1 # replace bisect_right
I had the same problem and came up with a trivial but effective solution: use the variance in the Gaussian function definition instead of the standard deviation, since the variance is always positive. Then you get the std_dev by taking the square root of the variance, so it will always be positive. Problem solved easily ;)
I mean, create the function this way:
def gaussian(x, Heigh, Mean, Variance):
    return Heigh * np.exp(- (x-Mean)**2 / (2 * Variance))
Instead of:
def gaussian(x, Heigh, Mean, Std_dev):
    return Heigh * np.exp(- (x-Mean)**2 / (2 * Std_dev**2))
And then do the fit as usual.

Fitting a vector function with curve_fit in Scipy

I want to fit a function with vector output using Scipy's curve_fit (or something more appropriate if available). For example, consider the following function:
import numpy as np

def fmodel(x, a, b):
    return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])
Each component is a different function but they share the parameters I wish to fit. Ideally, I would do something like this:
x = np.linspace(1, 20, 50)
a = 0.1
b = 0.5
y = fmodel(x, a, b)
y_noisy = y + 0.2 * np.random.normal(size=y.shape)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=fmodel, xdata=x, ydata=y_noisy, p0=[0.3, 0.1])
But curve_fit does not work with functions that have vector output, and the error "Result from function call is not a proper array of floats." is thrown. What I did instead is flatten the output like this:
def fmodel_flat(x, a, b):
    return fmodel(x[0:len(x)//3], a, b).flatten()

popt, pcov = curve_fit(f=fmodel_flat, xdata=np.tile(x, 3),
                       ydata=y_noisy.flatten(), p0=[0.3, 0.1])
and this works. If instead of a vector function I am actually fitting several functions with different inputs as well but which share model parameters, I can concatenate both input and output.
Is there a more appropriate way to fit a vector function with Scipy, or perhaps some additional module? A main consideration for me is efficiency: the actual functions to fit are much more complex, and fitting can take some time, so if this use of curve_fit is mangled and leads to excessive runtimes, I would like to know what I should do instead.
If I can be so blunt as to recommend my own package symfit, I think it does precisely what you need. An example on fitting with shared parameters can be found in the docs.
Your specific problem stated above would become:
from symfit import variables, parameters, Model, Fit, sin, exp

x, y_1, y_2, y_3 = variables('x, y_1, y_2, y_3')
a, b = parameters('a, b')
a.value = 0.3
b.value = 0.1

model = Model({
    y_1: a * sin(b * x),
    y_2: a * x**2 - b * x,
    y_3: a * exp(b / x),
})

xdata = np.linspace(1, 20, 50)
ydata = model(x=xdata, a=0.1, b=0.5)
y_noisy = ydata + 0.2 * np.random.normal(size=(len(model), len(xdata)))

fit = Fit(model, x=xdata, y_1=y_noisy[0], y_2=y_noisy[1], y_3=y_noisy[2])
fit_result = fit.execute()
Check out the docs for more!
I think what you're doing is perfectly fine from an efficiency standpoint. I'll try to look at the implementation and come up with something more quantitative, but for the time being here is my reasoning.
What you're doing during curve fitting is optimizing the parameters (a,b) such that
res = sum_i |f(x_i; a,b)-y_i|^2
is minimal. By this I mean that you have data points (x_i,y_i) of arbitrary dimensionality, two parameters (a,b) and a fitting model that approximates the data at query points x_i.
The curve fitting algorithm starts from an initial (a,b) pair, puts this into a black box that computes the above square error, and tries to come up with a new (a',b') pair that produces a smaller error. My point is that the error above is really a black box for the fitting algorithm: the configurational space of the fitting is defined merely by the (a,b) parameters. If you were to implement a simple curve fitting function yourself, you might, for example, do a gradient descent with the square error as the cost function.
Now, it should be irrelevant to the fitting procedure how the black box computes the error. It's easy to see that the dimensionality of x_i is really irrelevant for scalar functions, since it doesn't matter if you have 1000 1d query points to fit for, or a 10x10x10 grid in 3d space. What matters is that you have 1000 points x_i for which you need to compute f(x_i) ~ y_i from the model.
The only subtlety that should further be noted is that in case of a vector-valued function, the calculation of the error is not trivial. In my opinion, it's fine to define the error at each x_i point using the 2-norm of the vector-valued function. But hey: in this case, the square error at point x_i is
|f(x_i; a,b)-y_i|^2 == sum_k (f(x_i; a,b)[k]-y_i[k])^2
which implies that the square error for each component is accumulated. This just means that what you're doing right now is just right: by replicating your x_i points and taking into account each component of the function individually, your square error will contain exactly the 2-norm of the error at each point.
So my point is that what you're doing is mathematically correct, and I don't expect any behaviour of the fitting procedure to depend on how multivariate/vector-valued functions are handled.
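A quick numerical check of that equivalence (a sketch reusing fmodel, x, y_noisy and the fitted popt from the question):
import numpy as np

r = fmodel(x, *popt) - y_noisy                         # shape (3, 50): vector residual at each x_i
sum_of_norms = np.sum(np.linalg.norm(r, axis=0)**2)    # sum over points of the squared 2-norms
sum_of_flat = np.sum(r.flatten()**2)                   # what the flattened fit minimises
print(np.isclose(sum_of_norms, sum_of_flat))           # True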

Fitting a step function

I am trying to fit a step function using scipy.optimize.leastsq. Consider the following example:
import numpy as np
from scipy.optimize import leastsq

def fitfunc(p, x):
    y = np.zeros(x.shape)
    y[x < p[0]] = p[1]
    y[p[0] < x] = p[2]
    return y

errfunc = lambda p, x, y: fitfunc(p, x) - y  # Distance to the target function

x = np.arange(1000)
y = np.random.random(1000)
y[x < 250.] -= 10

p0 = [500., 0., 0.]
p1, success = leastsq(errfunc, p0, args=(x, y))
print p1
The parameters are the location of the step and the level on either side. What's strange is that the first free parameter never varies; if you run that, scipy will give
[ 5.00000000e+02 -4.49410173e+00 4.88624449e-01]
whereas the first parameter would be optimal at 250 and the second at -10.
Does anyone have any insight as to why this might not be working and how to get it to work?
If I run
print np.sum(errfunc(p1, x, y)**2.)
print np.sum(errfunc([250.,-10.,0.], x, y)**2.)
I find:
12547.1054663
320.679545235
where the first number is what leastsq is finding, and the second is the value for the actual optimal function it should be finding.
It turns out that the fitting is much better if I add the epsfcn= argument to leastsq:
p1, success = leastsq(errfunc, p0, args=(x, y), epsfcn=10.)
and the result is
[ 248.00000146 -8.8273455 0.40818216]
My basic understanding is that the first free parameter has to be moved by more than the spacing between neighboring points to affect the square of the residuals, and epsfcn sets how big a step is used for the finite-difference approximation of the gradient.
I don't think that least squares fitting is the way to go about coming up with an approximation for a step. I don't believe it will give you a satisfactory description of the discontinuity. Least squares would not be my first thought when attacking this problem.
Why wouldn't you use a Fourier series approximation instead? You'll always be stuck with Gibbs' phenomenon at the discontinuity, but the rest of the function can be approximated as well as you and your CPU can afford.
What exactly are you going to use this for? Some context might help.
I propose approximating the step function. Instead of an infinite slope at the "change point", make it linear over one x distance (1.0 in the example). E.g. if the x parameter, xp, for the function is defined as the midpoint of this line, then the value at xp-0.5 is the lower y value, the value at xp+0.5 is the higher y value, and intermediate values of the function in the interval [xp-0.5; xp+0.5] are a linear interpolation between these two points.
If it can be assumed that the step function (or its approximation) goes from a lower value to a higher value, then I think the initial guesses for the last two parameters should be the lowest y value and the highest y value respectively, instead of 0.0 and 0.0.
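A sketch of that proposal, reusing x, y and leastsq from the question (the ramp function and its name are illustrative):
import numpy as np

def fitfunc_ramp(p, x):
    xp, y_low, y_high = p
    y = np.empty(x.shape, dtype=float)
    y[x <= xp - 0.5] = y_low
    y[x >= xp + 0.5] = y_high
    mid = (x > xp - 0.5) & (x < xp + 0.5)
    # linear interpolation across the ramp of width 1.0 centred at xp
    y[mid] = y_low + (y_high - y_low) * (x[mid] - (xp - 0.5))
    return y

errfunc_ramp = lambda p, x, y: fitfunc_ramp(p, x) - y

# initial guess: midpoint of the x range, lowest and highest y values
p0 = [0.5 * (x.min() + x.max()), y.min(), y.max()]
p1, success = leastsq(errfunc_ramp, p0, args=(x, y))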
I have 2 corrections:
1) np.random.random() returns random numbers in the range 0.0 to 1.0. Thus the mean is +0.5, which is also the value of the third parameter (instead of 0.0), and the second parameter is then -9.5 (+0.5 - 10.0) instead of -10.0. Thus
print np.sum(errfunc([250.,-10.,0.], x, y)**2.)
should be
print np.sum(errfunc([250.,-9.5,0.5], x, y)**2.)
2) In the original fitfunc(), one value of y becomes 0.0 if x is exactly equal to p[0]. Thus it is not a step function in that case (more like a sum of two step functions). E.g. this happens when the start value of the first parameter is 500.
Most probably your optimization is stuck in a local minimum. I don't know how leastsq really works, but if you give it an initial estimate of (0, 0, 0), it gets stuck there, too.
You can check the gradient at the initial estimate numerically (evaluate at +/- epsilon for a very small epsilon, take the difference and divide by 2*epsilon), and I bet it will be something around 0.
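A rough sketch of that numerical check, reusing errfunc, x and y from the question:
import numpy as np

def cost(p):
    return np.sum(errfunc(p, x, y)**2.)

p0 = np.array([500., 0., 0.])
eps = 1e-6
grad = np.zeros_like(p0)
for i in range(len(p0)):
    step = np.zeros_like(p0)
    step[i] = eps
    grad[i] = (cost(p0 + step) - cost(p0 - step)) / (2 * eps)
print(grad)   # the component for the step location (the first entry) comes out as 0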
Use statsmodels OLS. OLS uses ordinary least squares for curve fitting.
