I have N datapoints in 3d that lie on a line. The y-direction is fixed, so I want to fit x,z against y.
Lets say we have 6 datapoints, that align with the y axis:
x=[0,0,0,0,0,0]
y=[1,2,3,4,5,6]
z=[0,0,0,0,0,0]
what I want to do:
I want to get the best set of fitting parameters, the gof and fitting error.
So far with a least squarefit, I get a reduced chi2 of < 1, which means I might be overfitting (or misunderstanding something).
Questions:
1.) For example, for the above example I receive a reduced chi2 of 0- this seems false to me?
2.) Also, I am wondering if a least square fit is adequate for this as well- maybe someone can shed some insight on this? Would svd be a better choice for this?
import scipy.optimize
import numpy as np
#define a model (line)
def linear(params, y):
a, b = params
data = [a * y[i] + b for i in range(0, len(y))]
return data
#define the residuals that need to me minimized
def fitting_cost(params, x, y, z):
a_x, b_x, a_z, b_z = params
x_pred = linear((a_x, b_x), y)
z_pred = linear((a_z, b_z), y)
res_x = [x_pred[i] - x[i] for i in range(0, 6)]
res_z = [z_pred[i] - z[i] for i in range(0, 6)]
return res_x + res_z
#do the fit and return parameters plus gof
def least_squares_fit(x, y, z):
sp = [0,0,0,0]
result = scipy.optimize.leastsq(fitting_cost, sp,
args=(x, y, z),
full_output=True)
s_sq = (result[2]['fvec'] ** 2).sum() / (
len(result[2]['fvec']) - len(result[0]))
return result[0], s_sq
I am trying to approximate the volume of a 2 variable definite integral sin^2(x)-cos^2(y) using while and for loops. I've changed the code quite often and with the most recent change, it broke. I am very new to python so I'm still figuring out how to work with arrays properly.
This is what I have untill now (EDIT: With alani's comment I managed to fix the error, but now I'm not receiving an answer when running the code)
import numpy as np
import scipy.integrate
def f(x,y):
return np.sin(x)**2-np.cos(y)**2
print(scipy.integrate.dblquad(f,0,1,0,2))
def Riemann(x0,xn,y0,yn,N):
e = 1;
while e > 1e-3:
x = np.linspace(0,1,N)
y = np.linspace(0,2,N)
dx = (x0-xn)/N
dy = (y0-yn)/N
for i in range(N):
V = (dx*dy)*(f(x,y))
np.sum(V)
e = abs(1-V)
print(Riemann(0,1,0,2,1000))
When running this code I receive:
(-0.2654480895858587, 9.090239973208559e-15)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-c654507b2f73> in <module>
19 np.sum(V)
20 e = abs(1-V)
---> 21 print(Riemann(0,1,0,2,10))
22
23
<ipython-input-9-c654507b2f73> in Riemann(x0, xn, y0, yn, N)
10 def Riemann(x0,xn,y0,yn,N):
11 e = 1;
---> 12 while e > 1e-3:
13 x = np.linspace(0,1,N)
14 y = np.linspace(0,2,N)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Your code has multiple issues, I'll address them so that you can improve. First off all, the formatting is pretty terrible. Put spaces after commas, separate things more with white-space, etc. You can see the difference in my code, and I'm by no means an expert at code formatting.
Second, your method is not doing what you think it's doing. Every time you iterate i, you create an entire array of values and you assign it to V, since x and y are both arrays. Neither x nor y are being updated here. The loop does the same thing every time, and V get re-assigned the same value every time. np.sum(V) never gets assigned anywhere, so the only thing getting updated at all in that loop is e. Of course, that bit is incorrect since you cannot subtract a vector from a scalar, since, as I wrote above, V is a vector.
Your function didn't use x0, y0, etc. for your bounds of integration, since your linspaces were hardcoded.
Now we come to the solution. There are two approaches to this problem. There's the "slow" pure Python way, where we just loop over our y's and x's and take function values multiplied by the base dx * dy. That version looks like this:
# This a naive version, using two for loops. It's very slow.
def Riemann(x0, xn, y0, yn, N):
xs = np.linspace(x0, xn, N)
ys = np.linspace(y0, yn, N)
dx = (x0 - xn) / N
dy = (y0 - yn) / N
V = 0
for y in ys:
for x in xs:
V += f(x, y)
return dx * dy * V
Note I moved the multiplication outside to save some on performance.
The other way is to use numpy, that version looks like this:
def Riemann(x0, xn, y0, yn, N):
points = itertools.product(np.linspace(x0, xn, N), np.linspace(y0, yn, N))
points = np.array(list(points))
xs = points[:, 0]
ys = points[:, 1]
dx = (x0 - xn) / N
dy = (y0 - yn) / N
return dx * dy * np.sum(f(xs, ys))
Here we avoid the double for-loop. Note that you must include import itertools for this to work. Here we use the Cartesian product to create all points we wish to evaluate, and then give those points to your function f which is designed to work with numpy arrays. We get a vector back from f of all the function values at each point, and we simply just sum all the elements, just like we did in the for-loop. Then we can multiply by the common base dx * dy and return that.
The only thing I do not understand about your code is what you want e to do, and how it relates to N. I'm guessing it's some sort of error tolerance, but why you were trying to subtract the total volume so far (even if your code did nothing of the sort) from 1 I can't understand.
I am trying to use scipy to numerically solve the following differential equation
x''+x=\sum_{k=1}^{20}\delta(t-k\pi), y(0)=y'(0)=0.
Here is the code
from scipy.integrate import odeint
import numpy as np
import matplotlib.pyplot as plt
from sympy import DiracDelta
def f(t):
sum = 0
for i in range(20):
sum = sum + 1.0*DiracDelta(t-(i+1)*np.pi)
return sum
def ode(X, t):
x = X[0]
y = X[1]
dxdt = y
dydt = -x + f(t)
return [dxdt, dydt]
X0 = [0, 0]
t = np.linspace(0, 80, 500)
sol = odeint(ode, X0, t)
x = sol[:, 0]
y = sol[:, 1]
plt.plot(t,x, t, y)
plt.xlabel('t')
plt.legend(('x', 'y'))
# phase portrait
plt.figure()
plt.plot(x,y)
plt.plot(x[0], y[0], 'ro')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
However what I got from python is zero solution, which is different from what I got from Mathematica. Here are the mathematica code and the graph
so=NDSolve[{x''(t)+x(t)=\sum _{i=1}^{20} DiraDelta (t-i \pi ),x(0)=0,x'(0)=0},x(t),{t,0,80}]
It seems to me that scipy ignores the Dirac delta function. Where am I wrong? Any help is appreciated.
Dirac delta is not a function. Writing it as density in an integral is still only a symbolic representation. It is, as mathematical object, a functional on the space of continuous functions. delta(t0,f)=f(t0), not more, not less.
One can approximate the evaluation, or "sifting" effect of the delta operator by continuous functions. The usual good approximations have the form N*phi(N*t) where N is a large number and phi a non-negative function, usually with a somewhat compact shape, that has integral one. Popular examples are box functions, tent functions, the Gauß bell curve, ... So you could take
def tentfunc(t): return max(0,1-abs(t))
N = 10.0
def rhs(t): return sum( N*tentfunc(N*(t-(i+1)*np.pi)) for i in range(20))
X0 = [0, 0]
t = np.linspace(0, 80, 1000)
sol = odeint(lambda x,t: [ x[1], rhs(t)-x[0]], X0, t, tcrit=np.pi*np.arange(21), atol=1e-8, rtol=1e-10)
x,v = sol.T
plt.plot(t,x, t, v)
which gives
Note that the density of the t array also influences the accuracy, while the tcrit critical points did not do much.
Another way is to remember that delta is the second derivative of max(0,x), so one can construct a function that is the twice primitive of the right side,
def u(t): return sum(np.maximum(0,t-(i+1)*np.pi) for i in range(20))
so that now the equation is equivalent to
(x(t)-u(t))'' + x(t) = 0
set y = x-u then
y''(t) + y(t) = -u(t)
which now has a continuous right side.
X0 = [0, 0]
t = np.linspace(0, 80, 1000)
sol = odeint(lambda y,t: [ y[1], -u(t)-y[0]], X0, t, atol=1e-8, rtol=1e-10)
y,v = sol.T
x=y+u(t)
plt.plot(t,x)
odeint :
does not handle sympy symbolic objects
it's unlikely it can ever handle Dirac Delta terms.
The best bet is probably to turn dirac deltas into boundary conditions: assume that the function is continuous at the location of the Dirac delta, but the first derivative jumps. Integrating over infinitesimal interval around the location of the delta function gives you the boundary condition for the derivative just left and just right from the delta.
My first py file is the function that I want to find the roots, like this:
def myfun(unknowns,a,b):
x = unknowns[0]
y = unknowns[1]
eq1 = a*y+b
eq2 = x**b
z = x*y + y/x
return eq1, eq2
And my second one is to find the value of x and y from a starting point, given the parameter value of a and b:
a = 3
b = 2
x0 = 1
y0 = 1
x, y = scipy.optimize.fsolve(myfun, (x0,y0), args= (a,b))
My question is: I actually need the value of z after plugging in the result of found x and y, and I don't want to repeat again z = x*y + y/x + ..., which in my real case it's a middle step variable without an explicit expression.
However, I cannot replace the last line of fun with return eq1, eq2, z, since fslove only find the roots of eq1 and eq2.
The only solution now is to rewrite this function and let it return z, and plug in x and y to get z.
Is there a good solution to this problem?
I believe that's the wrong approach. Since you have z as a direct function of x and y, then what you need is to retrieve those two values. In the listed case, it's easy enough: given b you can derive x as the inverse of eqn2; also given a, you can invert eqn1 to get y.
For clarity, I'm changing the names of your return variables:
ret1, ret2 = scipy.optimize.fsolve(myfun, (x0,y0), args= (a,b))
Now, invert the two functions:
# eq2 = x**b
x = ret2**(1/b)
# eq1 = a*y+b
y = (ret1 - b) / a
... and finally ...
z = x*y + y/x
Note that you should remove the z computation from your function, as it serves no purpose.
I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit and trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
although despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
one = a*x + b
two = c*x + d
return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
""" hyperbola(x) with parameters
a/b = asymptotic slope
c = curvature at vertex
d = offset to vertex
e = vertical offset
"""
return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e
def rot_hyperbola(x, a, b, c, d, e, th):
pars = a, b, c, 0, 0 # do the shifting after rotation
xd = x - d
hsin = hyperbola(xd, *pars)*np.sin(th)
xcos = xd*np.cos(th)
return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little, Applied Linear Regression by Sanford, and the Correlation and Regression lecture by Steiger had some good info on it. They all however lack the right model, the piecewise function should be
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
dfseg = pd.read_csv('segreg.csv')
def err(w):
th0 = w['th0'].value
th1 = w['th1'].value
th2 = w['th2'].value
gamma = w['gamma'].value
fit = th0 + th1*dfseg.Temp + th2*np.maximum(0,dfseg.Temp-gamma)
return fit-dfseg.C
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
b0 = mi.params['th0']; b1=mi.params['th1'];b2=mi.params['th2']
gamma = int(mi.params['gamma'].value)
import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma,gamma), data=dfseg).fit()
print reslin.summary()
x0 = np.array(range(0,gamma,1))
x1 = np.array(range(0,80-gamma,1))
y0 = b0 + b1*x0
y1 = (b0 + b1 * float(gamma) + (b1 + b2)* x1)
plt.scatter(dfseg.Temp, dfseg.C)
plt.hold(True)
plt.plot(x0,y0)
plt.plot(x1+gamma,y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used #user423805 's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) in #user423805 's answer, I used the same linear spline calculation for both the minimizer and end-usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, three gamma params):
#
# for 0 < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
if(len(thresholds) < 2):
print("Error: expected at least two thresholds")
return None
applicable_gammas = filter(lambda gamma: x > gamma , gammas)
#base result
y = thresholds[0] + (thresholds[1] * x)
#additional factors calculated depending on x value
for i in range(0, len(applicable_gammas)):
y = y + ( thresholds[i + 2] * ( x - applicable_gammas[i] ) )
return y
def least_splines_calc_array(x_array, thresholds, gammas):
y_array = map(lambda x: least_splines_calc(x, thresholds, gammas), x_array)
return y_array
def err(params, x, data):
th0 = params['th0'].value
th1 = params['th1'].value
th2 = params['th2'].value
th3 = params['th3'].value
gamma1 = params['gamma1'].value
gamma2 = params['gamma2'].value
thresholds = np.array([th0, th1, th2, th3])
gammas = np.array([gamma1, gamma2])
fit = least_splines_calc_array(x, thresholds, gammas)
return np.array(fit)-np.array(data)
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('th3', 0.0),('gamma1', 9.),('gamma2', 9.3)) #NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err_alt, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and gammas to re-use linear_splines_calc to plot the linear splines regression.
Reference: While there's various places that explain least splines (I think #user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations) , the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton) : https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines. Excerpt below:
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta. Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2* SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
There is a straightforward method (not iterative, no initial guess) pp.12-13 in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from the scanning of the figure published by IanRoberts in his question. Scanning for the coordinates of the pixels in not accurate. So, don't be surprised by additional deviation.
Note that the abscisses and ordinates scales have been devised by 1000.
The equations of the two segments are
The approximate values of the five parameters are written on the above figure.