Let's say I have S-curve shaped data like below:
S-Curved data
I would like to find the simplest way to fit this kind of curve AND use this fit to find the midpoint (aka the point where y=0.5). The fact is that I don't know beforehand where the midpoint is.
Thanks a lot for your answers,
Cheers
This is clearly a case of fitting a logistic curve with L=1:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
data = np.loadtxt(r"\data.txt", delimiter=",")
x = data[:, 0]
y = data[:, 1]
def f(x: np.ndarray, k: float, x0: float):
    return 1 / (1 + np.exp(-k*(x - x0)))

popt, pcov = curve_fit(f, x, y, p0=[1, 120])
fig, ax = plt.subplots(figsize=(8, 5.6))
plt.scatter(x, y)
plt.plot(x, f(x, *popt), color="red")
plt.show()
x0 is given by popt[1], i.e. 121.18.
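If you also want an error bar on that midpoint, the square roots of the diagonal of pcov give the 1-sigma parameter uncertainties. A minimal follow-up sketch, assuming popt and pcov from the fit above:

# 1-sigma standard errors of (k, x0) from the covariance matrix of the fit
perr = np.sqrt(np.diag(pcov))
k_fit, x0_fit = popt
print("midpoint x0 = %.2f +/- %.2f" % (x0_fit, perr[1]))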
I am trying to do a linear regression on some limited and scattered data. I know from theory that the gradient should be 1, but it may have a y-offset. I found a lot of resources on how to force an intercept for linear regression, but never on forcing a gradient. I need the linear regression statistics to be reported and the gradient to be precisely 1.
Would I need to manually calculate the statistics? Or is there a way to use some packages like "statsmodels," "scipy," or "scikit-learn"? Or do I need to use a Bayesian approach with previous knowledge of the gradient?
Here is a graphical example of what I am trying to achieve.
import numpy as np
import matplotlib.pyplot as plt
# Generate random data to illustrate the point
n = 20
x = np.random.uniform(10, 20, n)
y = x - np.random.normal(1, 1, n) # Add noise to the 1:1 relationship
plt.scatter(x, y, ec="k", label="Measured data")
true_x = np.array((8, 20))
plt.plot(true_x, true_x, "k--") # 1:1 line
plt.plot(true_x, true_x-1, "r:", label="Forced gradient") # Theoretical line
m, c = np.polyfit(x, y, 1)
plt.plot(true_x, true_x*m + c, "g:", label="Linear regression")
plt.xlabel("Theoretical value")
plt.ylabel("Measured value")
plt.legend()
I suggest using scipy.optimize.curve_fit, which has the benefit of being flexible and easy to use for non-linear regressions as well. You just need to define a function that represents a line with a known, fixed gradient and an offset as the only free parameter:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a):
    gradient = 1  # fixed gradient, not optimized
    return gradient * x + a
xdata = np.linspace(0, 4, 50)
y = func(xdata, 2.5)
rng = np.random.default_rng()
y_noise = 0.2 * rng.normal(size=xdata.size)
ydata = y + y_noise
plt.plot(xdata, ydata, 'b-', label='data')
popt, pcov = curve_fit(func, xdata, ydata)
print(popt)
plt.plot(xdata, func(xdata, *popt), 'r-',
         label='fit: a=%5.3f' % tuple(popt))
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
That generates the plot:
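As a sanity check (not part of the original answer): with the gradient pinned to exactly 1, the least-squares offset has a closed form, namely the mean of the residuals y - x, so the two estimates should agree closely:

# With slope fixed at 1, minimizing sum((y - (x + a))**2) gives a = mean(y - x)
a_closed = np.mean(ydata - xdata)
a_se = np.std(ydata - xdata, ddof=1) / np.sqrt(xdata.size)  # standard error of the offset
print(a_closed, popt[0])  # should match the curve_fit result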
I used numpy's polyfit to fit some noisy data and then wanted to use polyval to evaluate the fit at some new points. For some reason, fitting works fine but polyval only gives correct results when I reverse the order of the coefficients of the polynomial:
import numpy as np
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
# a noisy line
x = np.linspace(0, 10, 100)
y = x + np.random.normal(0, 1, x.shape)
# calculate fit polynomial
fit_coeffs_poly = poly.polyfit(x, y, deg=1)
fit_polynomial_poly = poly.Polynomial(fit_coeffs_poly)(x)
# plot to check fit
plt.plot(x, y, label='noisy')
plt.plot(x, fit_polynomial_poly, '-r', label='polyfit')
plt.legend(loc='lower right')
plt.show()
The fit looks good:
polyval only works when the coefficients are reversed:
>>> for i in range(0, 10):
...     print(np.polyval(fit_coeffs_poly, i))
0.9792056688016727
1.139755470535941
1.3003052722702093
1.4608550740044774
1.6214048757387456
1.781954677473014
1.9425044792072823
2.1030542809415502
2.2636040826758186
2.424153884410087
>>> for i in range(0, 10):
...     print(np.polyval(fit_coeffs_poly[::-1], i))
0.16054980173426825
1.139755470535941
2.1189611393376135
3.0981668081392866
4.077372476940959
5.056578145742631
6.035783814544304
7.0149894833459765
7.9941951521476495
8.973400820949323
I can't help but feel like this is wrong somehow; it doesn't make sense for the coefficients to be backwards.
I dug around a lot and found out what was going on. It turns out numpy has two sets of polynomial tools: one in the base numpy namespace and another in numpy.polynomial, and they expect coefficients in opposite order. Both polyfit and polyval exist in both, and they appear to behave the same in this simple case, but their parameters are different:
From numpy:
def polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
def polyval(p, x)
Both polyfit and polyval expect and return polynomial coefficients ordered from high to low degree.
From numpy.polynomial:
def polyfit(x, y, deg, rcond=None, full=False, w=None)
def polyval(x, c, tensor=True)
Both polyfit and polyval expect and return polynomial coefficients ordered from low to high degree. Also note that this polyval takes its arguments in the opposite order: x first, then the coefficients.
Here's a quick demo:
import numpy as np
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
# a noisy line
x = np.linspace(0, 10, 100)
y = x + np.random.normal(0, 1, x.shape)
# calculate fit polynomial with base numpy
fit_coeffs_np = np.polyfit(x, y, deg=1)
fit_polynomial_np = poly.Polynomial(fit_coeffs_np[::-1])(x)
print("numpy", fit_coeffs_np)
# calculate fit polynomial with numpy.polynomial.polynomial
fit_coeffs_poly = poly.polyfit(x, y, deg=1)
fit_polynomial_poly = poly.Polynomial(fit_coeffs_poly)(x)
print("poly", fit_coeffs_poly)
# test some values
for i in range(0, 10):
    print(np.polyval(fit_coeffs_np, i), poly.polyval(i, fit_coeffs_poly))
# make a nice plot
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, label='noisy')
plt.plot(x, fit_polynomial_poly, '-g', label='np.poly')
plt.plot(x, fit_polynomial_np, '-r', label='np')
plt.legend(loc='lower right')
plt.text(0, 10, 'np: {}'.format(fit_coeffs_np))
plt.text(0, 9, 'np.poly: {}'.format(fit_coeffs_poly))
plt.savefig('good.png')
plt.show()
-0.009244843991578149 -0.009244843991576678
1.0080792020754397 1.0080792020754408
2.0254032481424575 2.025403248142458
3.042727294209475 3.0427272942094756
4.060051340276494 4.060051340276493
5.077375386343512 5.07737538634351
6.094699432410529 6.094699432410528
7.112023478477547 7.112023478477545
8.129347524544565 8.129347524544563
9.146671570611582 9.14667157061158
The two appear to be nominally equivalent, but you can't mix and match without being aware of the slight differences.
If anyone is aware of why there are two different implementations, I would love to know.
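If you ever need to move coefficients from one convention to the other, reversing the array is enough. A small sketch, assuming fit_coeffs_np and fit_coeffs_poly from the demo above:

# Reverse the coefficient array to switch between the two orderings
print(np.polyval(fit_coeffs_poly[::-1], 3.0))   # low-to-high coeffs reversed for base numpy
print(poly.polyval(3.0, fit_coeffs_np[::-1]))   # high-to-low coeffs reversed for numpy.polynomial
# both evaluations should agree closely with each other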
I need to fit some data that look like the points in the figure with scipy.optimize.curve_fit. I use a function y(x) (see def below) which is a constant y(x)=c for x < x0 and a polynomial otherwise (e.g. a second, tilted line y1 = m*x + q).
I give a reasonable initial guess for the parameters (x0, c, m, q), as shown in the figure. The result from the fit shows that all the parameters are optimized except for the first one, x0.
Why so?
Is it how I define the function testfit(x, *p), where x0 (=p[0]) appears within another function?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# generate some data:
x = np.linspace(0,100,1000)
y1 = np.repeat(0, 500)
y2 = x[500:] - 50
y = np.concatenate((y1,y2))
y = y + np.random.randn(len(y))
def testfit(x, *p):
    ''' piecewise function used to fit:
    it's a constant (= p[1]) for x < p[0]
    or a polynomial (coefficients p[2:]) for x >= p[0]
    '''
    x = x.astype(float)
    y = np.piecewise(x, [x < p[0], x >= p[0]],
                     [p[1], lambda x: np.poly1d(p[2:])(x)])
    return y
# initial guess, one horizontal and one tilted line:
p0_guess = (30, 5, 0.3, -10)
popt, pcov = curve_fit(testfit, x, y, p0=p0_guess)
print('params guessed : '+str(p0_guess))
print('params from fit : '+str(popt))
plt.plot(x,y, '.')
plt.plot(x, testfit(x, *p0_guess), label='initial guess')
plt.plot(x, testfit(x, *popt), label='final fit')
plt.legend()
Output
params guessed : (30, 5, 0.3, -10)
params from fit : [ 30. 0.04970411 0.80106256 -34.17194401]
OptimizeWarning: Covariance of the parameters could not be estimated
As suggested by kazemakase, I solved the problem by using a smooth transition between the two functions I fit with (one horizontal line followed by a polynomial). The reason the original fit never moved x0 is that with np.piecewise a small change in x0 does not change the predicted values at the data points, so the numerically estimated gradient with respect to x0 is zero and the optimizer leaves it untouched. The trick was to multiply one function by sigmoid(x) and the other by 1-sigmoid(x) (where sigmoid(x) is defined below).
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,100,1000)
y1 = np.repeat(0, 500)
y2 = x[500:] - 50
y = np.concatenate((y1,y2))
y = y + np.random.randn(len(y))
def testfit(x, *p):
    ''' function to fit the indentation curve
    p = [x0, c, poly1d_coeffs] '''
    x = x.astype(float)
    y = p[1]*(1 - sigmoid(x - p[0], k=1)) + np.poly1d(p[2:])(x) * sigmoid(x - p[0], k=1)
    return y

def sigmoid(x, k=1):
    return 1/(1 + np.exp(-k*x))

p0_guess = (30, 5, 0.3, -10)
popt, pcov = curve_fit(testfit, x, y, p0=p0_guess)
print('params guessed : '+str(p0_guess))
print('params from fit : '+str(popt))
plt.figure(1)
plt.clf()
plt.plot(x,y, 'y.')
plt.plot(x, testfit(x, *p0_guess), label='initial guess')
plt.plot(x, testfit(x, *popt), 'k', label='final fit')
plt.legend()
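A possible refinement, not part of the original answer: instead of hard-coding the sigmoid steepness k=1, it could be exposed as an extra fit parameter. A sketch of how the model function might then look (testfit_free_k is a hypothetical name):

def testfit_free_k(x, *p):
    ''' same blend as above, but the sigmoid steepness k = p[-1] is fitted too;
    p = [x0, c, poly1d_coeffs..., k] '''
    x = x.astype(float)
    s = sigmoid(x - p[0], k=p[-1])
    return p[1]*(1 - s) + np.poly1d(p[2:-1])(x) * s

popt2, pcov2 = curve_fit(testfit_free_k, x, y, p0=(30, 5, 0.3, -10, 1))

Whether this helps depends on the data; for a very sharp kink the fitted k tends to grow large and the model effectively reduces to the hard piecewise version again.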
I had a similar problem. I ended up using np.gradient and a convolution to smooth the curve, then plotting it. Something like:
def mov_avg(n, data):
    return np.convolve(data, np.ones((n,))/n, mode='valid')
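For example (a sketch, assuming y is the noisy piecewise array from the question and the window width of 25 is an arbitrary choice), the transition shows up as a step in the smoothed gradient:

smoothed = mov_avg(25, np.gradient(y))  # moving average of the pointwise gradient
plt.plot(smoothed)
plt.show()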
If you want a more direct approach, you can try this:
def find_change(data):
    def test_flag(pos):
        grad = np.gradient(data) - np.gradient(data).mean()
        return (grad[:pos] < 0).sum() + (grad[pos:] > 0).sum()
    return np.vectorize(test_flag)(np.arange(len(data)-1)).argmax()

def find_gradient(pos, data):
    return np.gradient(data[:pos]).mean(), np.gradient(data[pos:]).mean()

pos = find_change(data)  # `data` is the noisy y array from the question
print(pos, find_gradient(pos, data))
The first function finds the point at which the gradient changes by comparing each point's gradient against the mean gradient, i.e. the point from which the gradients are "mostly positive".
Hope it helps
Given two arrays x and y, I was trying to use the np.polyfit function to fit the data in the following way:
z = np.polyfit(x, y, 20)
f = np.poly1d(z)
Since I want to plot a line chart instead of a smooth curve, I then use this function f to sample an array of points for plotting the line:
x_new = np.linspace(x[0], x[-1], fitting_size)
y_new = np.zeros(fitting_size)
for t in range(fitting_size):
    y_new[t] = f(x_new[t])
plt.plot(x_new, y_new, marker='v', ms=1)
The problem is that the above code segment still gives me a smooth curve. How can I fix it? Thanks.
Unfortunately the intention behind the question is a bit unclear. However, if you want to perform a linear fit, you need to provide the degree deg=1 to polyfit. There is then no reason to sample from the fit; one can simply use the same input array and apply the fitted function to it.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1,5,20)
y = 3*x**2+np.random.rand(len(x))*10
z = np.polyfit(x, y, 1)
f = np.poly1d(z)
z2 = np.polyfit(x, y, 2)
f2 = np.poly1d(z2)
plt.plot(x,y, marker=".", ls="", c="k", label="data")
plt.plot(x, f(x), label="linear fit")
plt.plot(x, f2(x), label="quadratic fit")
plt.legend()
plt.show()
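If, on the other hand, the degree-20 fit was intentional and the goal was a segmented, line-chart look rather than a smooth curve, one option (just a sketch, using f from the question) is to sample the fit at only a few points so matplotlib draws visible straight segments between the markers:

x_coarse = np.linspace(x[0], x[-1], 8)   # few samples -> straight segments between markers
plt.plot(x_coarse, f(x_coarse), marker='v')
plt.show()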
I have a question that I have been fighting with for days now.
How do I calculate the (95%) confidence band of a fit?
Fitting curves to data is the everyday job of every physicist -- so I think this should be implemented somewhere -- but I can't find an implementation for this, nor do I know how to do it mathematically.
The only thing I found is seaborn, which does a nice job for linear least-squares.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
x = np.linspace(0,10)
y = 3*np.random.randn(50) + x
data = {'x':x, 'y':y}
frame = pd.DataFrame(data, columns=['x', 'y'])
sns.lmplot(x='x', y='y', data=frame, ci=95)
plt.savefig("confidence_band.pdf")
But this is just linear least-squares. When I want to fit e.g. a saturation curve, I'm screwed.
Sure, I can calculate the t-distribution from the std-error of a least-square method like scipy.optimize.curve_fit but that is not what I'm searching for.
Thanks for any help!!
You can achieve this easily using the StatsModels module.
Also see this example and this answer.
Here is an answer for your question:
import numpy as np
from matplotlib import pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import summary_table
x = np.linspace(0,10)
y = 3*np.random.randn(50) + x
X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
st, data, ss2 = summary_table(res, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x, y, 'o', label="data")
ax.plot(x, fittedvalues, 'r-', label='OLS')
ax.plot(x, predict_ci_low, 'b--')
ax.plot(x, predict_ci_upp, 'b--')
ax.plot(x, predict_mean_ci_low, 'g--')
ax.plot(x, predict_mean_ci_upp, 'g--')
ax.legend(loc='best');
plt.show()
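In more recent statsmodels versions the same intervals are available without summary_table, via get_prediction. A sketch of that variant, assuming res, X, x and y from above:

# mean confidence interval and prediction interval via the newer prediction API
pred = res.get_prediction(X).summary_frame(alpha=0.05)
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'o', label="data")
ax.plot(x, pred["mean"], 'r-', label='OLS')
ax.plot(x, pred["obs_ci_lower"], 'b--')
ax.plot(x, pred["obs_ci_upper"], 'b--')
ax.plot(x, pred["mean_ci_lower"], 'g--')
ax.plot(x, pred["mean_ci_upper"], 'g--')
ax.legend(loc='best')
plt.show()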
kmpfit's confidence_band() calculates the confidence band for non-linear least squares. Here for your saturation curve:
from pylab import *
from kapteyn import kmpfit
def model(p, x):
    a, b = p
    return a*(1 - np.exp(b*x))
x = np.linspace(0, 10, 100)
y = .1*np.random.randn(x.size) + model([1, -.4], x)
fit = kmpfit.simplefit(model, [.1, -.1], x, y)
a, b = fit.params
dfdp = [1-np.exp(b*x), -a*x*np.exp(b*x)]
yhat, upper, lower = fit.confidence_band(x, dfdp, 0.95, model)
scatter(x, y, marker='.', color='#0000ba')
for i, l in enumerate((upper, lower, yhat)):
    plot(x, l, c='g' if i == 2 else 'r', lw=2)
savefig('kmpfit confidence bands.png', bbox_inches='tight')
The dfdp are the partial derivatives ∂f/∂p of the model f = a*(1 - e^(b*x)) with respect to each parameter p (i.e., a and b); see my answer to a similar question for background links. And here is the output:
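If you prefer to stay within scipy, roughly the same band can be obtained from curve_fit's covariance matrix with the delta method: the variance of the fitted curve at each x is J·pcov·Jᵀ, where J is the row of partial derivatives listed above. A sketch under that assumption:

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t

def model_sat(x, a, b):
    # same saturation model, written in curve_fit's (x, params...) convention
    return a * (1 - np.exp(b * x))

x = np.linspace(0, 10, 100)
y = 0.1 * np.random.randn(x.size) + model_sat(x, 1, -0.4)

popt, pcov = curve_fit(model_sat, x, y, p0=[0.1, -0.1])
a, b = popt
yhat = model_sat(x, *popt)

# rows of partial derivatives [df/da, df/db] evaluated at the fitted parameters
J = np.column_stack([1 - np.exp(b * x), -a * x * np.exp(b * x)])
se = np.sqrt(np.einsum('ij,jk,ik->i', J, pcov, J))  # delta-method std. error of the curve
tval = t.ppf(0.975, x.size - popt.size)             # 95% two-sided t quantile
lower, upper = yhat - tval * se, yhat + tval * se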