Simultaneous fitting to N datasets in Python

Simultaneous fitting to N datasets in Python - python

I have a single function that I want to fit to a number of different datasets, all with the same number of points. For example, I might want to fit a polynomial to all rows of an image. Is there an efficient and vectorized way of doing this with scipy or other packages, or do I have to resort to a single loop (or use multiprocessing to speed it up a bit)?

You can use numpy.linalg.lstsq:
import numpy as np
# independent variable
x = np.arange(100)
# some sample outputs with random noise
y1 = 3*x**2 + 2*x + 4 + np.random.randn(100)
y2 = x**2 - 4*x + 10 + np.random.randn(100)
# coefficient matrix, where each column corresponds to a term in your function
# this one is simple quadratic polynomial: 1, x, x**2
a = np.vstack((np.ones(100), x, x**2)).T
# result matrix, where each column is one set of outputs
b = np.vstack((y1, y2)).T
solutions, residuals, rank, s = np.linalg.lstsq(a, b)
# each column in solutions is the coefficients of terms
# for the corresponding output
for i, solution in enumerate(zip(*solutions),1):
print "y%d = %.1f + (%.1f)x + (%.1f)x^2" % ((i,) + solution)
# outputs:
# y1 = 4.4 + (2.0)x + (3.0)x^2
# y2 = 9.8 + (-4.0)x + (1.0)x^2

Related

Turning a for loop function in Vectorized form with numpy

I am trying to make my program faster with the use of numpy arrays however all the time I have tried modifying the vanilla python in the form of vectors it has given me errors. How can I vectorize the code so that I dont have to use the for loop.In the for loop code down below I have the linear regression and standard deviation formulas that are dependent on the PC_list values to be calculated.
PC_list= [457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000]
#x_mean and x_squared is used for the lin regressions and stand dev
x_mean = number/2*(1 + number)
x_squared_mean = number*(number+1)*(2*number+1)/6
for i in range(len(PC_list)-number):
y_mean = sum(PC_list[i:i+number])/number
xy_mean = sum([x * (i + 1) for i, x in enumerate(PC_list[i:i+number])])/number
#Linear regression slope(m) and b vert shift
m = (x_mean* y_mean- xy_mean)/((x_mean)**2- x_squared_mean)
b = y_mean - m*x_mean
#Standard Dev function = square root((first list value - y_mean)+(second list value - y_mean) + (third list value - y_mean)/n-1)
std = (sum([(k - y_mean)**2 for k in PC_list[i:i+number]])/(number-1))**0.5
#Upper and lower boundary calculations
Upper_Boundary = round((m*(i)+b + Upper*std),1)
Lower_Boundary = round((m*(i)+b + Lower*std),1)
#appends the upper and lower boundary to a list
upper.append(Upper_Boundary)
lower.append(Lower_Boundary)
#Boundary x and y positions appended in list for graphing
Boundary_x = number + i
Boundary_x_list.append(Boundary_x)

There is a good implementation of simple linear regression with Python and Numpy here: Simple Linear Regression in Python
The first thing I would recommend is converting your original dataset to a numpy array.
import numpy as np
X = np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
# Calculating mean of the array is made trivial
x_mean = X.mean()
# values of array are squared first and then we get the mean
x_squared_mean = np.power(X, 2).mean()
# covariance (b)
cov = np.sum((X - x_mean) * (y - y_mean)) / np.sum(np.power(X - x_mean, 2))
# variance (m)
variance = x_mean - (cov * x_mean)
# regression line
reg_line = cov + variance * X
This is just an example, but in general the first step is to convert your data to numpy arrays and then you get access to all the non-loop type functions that are implemented in C.

Calculating the derivative of points in python

I want to calculate the derivative of points, a few internet posts suggested using np.diff function. However, I tried using np.diff against manually calculated results (chose a random polynomial equation and differentiated it) to see if I end up with the same results. I used the following eq : Y = (X^3) + (X^2) + 7 and the results i ended up with were different. Any ideas why?. Is there any other method to calculate the differential.
In the problem, I am trying to solve, I have recieved data points of fitted spline function ( not the original data that need to be fitted by spline but the points of the already fitted spline). The x-values are at equal intervals. I only have the points and no equation, what I require is to calculate, the first, second and third derivatives. i.e dy/dx, d2y/dx2, d3y/dx3. Any ideas on how to do this?. Thanks in advance.
xval = [1,2,3,4,5]
yval = []
yval_dashList = []
#selected a polynomial equation
def calc_Y(X):
Y = (X**3) + (X**2) + 7
return(Y)
#calculate y values using equatuion
for i in xval:
yval.append(calc_Y(i))
#output: yval = [9,19,43,87,157]
#manually differentiated the equation or use sympy library (sym.diff(x**3 + x**2 + 7))
def calc_diffY(X):
yval_dash = 3*(X**2) + 2**X
#store differentiated y-values in a list
for i in xval:
yval_dashList.append(yval_dash(i))
#output: yval_dashList = [5,16,35,64,107]
#use numpy diff method on the y values(yval)
numpyDiff = np.diff(yval)
#output: [10,24,44,60]
The values of numpy diff method [10,24,44,60] is different from yval_dashList = [5,16,35,64,107]

The idea behind what you are trying to do is correct, but there are a couple of points to make it work as intended:
There is a typo in calc_diffY(X), the derivative of X**2 is 2*X, not 2**X:
def calc_diffY(X):
yval_dash = 3*(X**2) + 2*X
By doing this you don't obtain much better results:
yval_dash = [5, 16, 33, 56, 85]
numpyDiff = [10. 24. 44. 70.]
To calculate the numerical derivative you should do a "Difference quotient" which is an approximation of a derivative
numpyDiff = np.diff(yval)/np.diff(xval)
The approximation becomes better and better if the values of the points are more dense.
The difference between your points on the x axis is 1, so you end up in this situation (in blue the analytical derivative, in red the numerical):
If you reduce the difference in your x points to 0.1, you get this, which is much better:
Just to add something to this, have a look at this image showing the effect of reducing the distance of the points on which the derivative is numerically calculated, taken from Wikipedia:

I like #lgsp's answer. I will add that you can directly estimate the derivative without having to worry about how much space there is between the values. This just uses the symmetric formula for calculating finite differences, described at this wikipedia page.
Take note, though, of the way delta is specified. I found that when it is too small, higher-order estimates fail. There's probably not a 100% generic value that will always work well!
Also, I simplified your code by taking advantage of numpy broadcasting over arrays to eliminate for loops.
import numpy as np
# selecte a polynomial equation
def f(x):
y = x**3 + x**2 + 7
return y
# manually differentiate the equation
def f_prime(x):
return 3*x**2 + 2*x
# numerically estimate the first three derivatives
def d1(f, x, delta=1e-10):
return (f(x + delta) - f(x - delta)) / (2 * delta)
def d2(f, x, delta=1e-5):
return (d1(f, x + delta, delta) - d1(f, x - delta, delta)) / (2 * delta)
def d3(f, x, delta=1e-2):
return (d2(f, x + delta, delta) - d2(f, x - delta, delta)) / (2 * delta)
# demo output
# note that functions operate in parallel on numpy arrays -- no for loops!
xval = np.array([1,2,3,4,5])
print('y = ', f(xval))
print('y\' = ', f_prime(xval))
print('d1 = ', d1(f, xval))
print('d2 = ', d2(f, xval))
print('d3 = ', d3(f, xval))
And the outputs:
y = [ 9 19 43 87 157]
y' = [ 5 16 33 56 85]
d1 = [ 5.00000041 16.00000132 33.00002049 56.00000463 84.99995374]
d2 = [ 8.0000051 14.00000116 20.00000165 25.99996662 32.00000265]
d3 = [6. 6. 6. 6. 5.99999999]

How to get the coefficient vector of a PolyBoRi polynomial in Sage

I am trying to use SageMath for something that involves a lot of manipulation of boolean polynomials.
Here are some examples of what a coefficient vector is:
x0*x1*x2 + 1 has coefficient vector 10000001
x1 + x0 + 1 has coefficient vector 11100000
x1 + x0 has coefficient vector 01100000
(x0 is the least significant bit.)
The problem is that Sage's API doesn't seem to encourage direct manipulation of monomials or coefficient vectors, probably because the data structures it uses internally are ZDDs instead of bit arrays.
sage: P
x0*x1*x2 + x0*x1 + x0*x2 + x0 + x1*x2 + x1 + x2 + 1
sage: list(P.set())
[x0*x1*x2, x0*x1, x0*x2, x0, x1*x2, x1, x2, 1]
sage: P.terms()
[x0*x1*x2, x0*x1, x0*x2, x0, x1*x2, x1, x2, 1]
In this code, it appears that the problem is that the endianness is the opposite of what one may expect; which would be with x0 being the least significant bit.
The problem is actually more than that. For example:
#!/usr/bin/env python
from sage.all import *
from sage.crypto.boolean_function import BooleanFunction
# "input_str" is a truth table.
# Returns a polynomial that has that truth table.
def truth_str_to_poly(input_str):
# assumes that the length of input_str is a power of two
num_vars = int(log(len(input_str),2))
truth_table = []
# convert string to list of ints expected by BooleanFunction
for i in list(input_str):
truth_table.append(int(i))
B = BooleanFunction(truth_table)
P = B.algebraic_normal_form()
return P
# Return the polynomial with coefficient vector 1,1,...,1.
# (This is neccessary because we can't directly manipulate coef. vectors.)
def super_poly(num_vars):
input_str = ["0"]*(2**num_vars)
input_str[0] = "1"
input_str = "".join(input_str)
return truth_str_to_poly(input_str)
# Return the coefficient vector of P.
# This is the function that is broken.
def poly_to_coef_str(P, num_vars):
res = ""
#for i in super_poly(num_vars).terms():
for i in list(super_poly(num_vars).set()):
res += str(P.monomial_coefficient(i))
return res
num_vars = 3
print(super_poly(num_vars).monomials())
# This should have coefficient vector "01000000" = x0
input_poly = "01010101"
P = truth_str_to_poly(input_poly) # Gives correct result
res = poly_to_coef_str(P, 3)
print(" in:"+input_poly)
print("out:"+res)
Output:
[x0*x1*x2, x0*x1, x0*x2, x0, x1*x2, x1, x2, 1]
in:01010101
out:00010000
This means that it's not just getting the endianness wrong, it's somehow treating the place value of variables inconsistently.
Is there a better way to do this?

equivalent to numpy.linalg.lstsq that allows weighting

I am fitting a 2d polynomial with the numpy function linalg.lstsq:
coeffs = np.array([y*0+1, y, x, x**2, y**2]).T
coeff_r, r, rank, s =np.linalg.lstsq(coeffs, values)
Some points that I am trying to fit are more reliable than others.
Is there a way to weigh the points differently?
Thanks

lstsq is enough for this; the weights can be applied to the equations. That is, if in an overdetermined system
3*a + 2*b = 9
2*a + 3*b = 4
5*a - 4*b = 2
you care about the first equation more than about the others, multiply it by some number greater than 1. For example, by 5:
15*a + 10*b = 45
2*a + 3*b = 4
5*a - 4*b = 2
Mathematically, the system is the same, but the least squares solution will be different because it minimizes the sum of squares of the residuals, and the residual of the 1st equation got multiplied by 5.
Here is an example based on your code (with small adjustments to make it more NumPythonic). First, unweighted fit:
import numpy as np
x, y = np.meshgrid(np.arange(0, 3), np.arange(0, 3))
x = x.ravel()
y = y.ravel()
values = np.sqrt(x+y+2) # some values to fit
functions = np.stack([np.ones_like(y), y, x, x**2, y**2], axis=1)
coeff_r = np.linalg.lstsq(functions, values, rcond=None)[0]
values_r = functions.dot(coeff_r)
print(values_r - values)
This displays the residuals as
[ 0.03885814 -0.00502763 -0.03383051 -0.00502763 0.00097465 0.00405298
-0.03383051 0.00405298 0.02977753]
Now I give the 1st data point greater weight.
weights = np.ones_like(x)
weights[0] = 5
coeff_r = np.linalg.lstsq(functions*weights[:, None], values*weights, rcond=None)[0]
values_r = functions.dot(coeff_r)
print(values_r - values)
Residuals:
[ 0.00271103 -0.01948647 -0.04828936 -0.01948647 0.00820407 0.0112824
-0.04828936 0.0112824 0.03700695]
The first residual is now an order of magnitude smaller, of course at the expense of others residuals.

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit and trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
although despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.

If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
one = a*x + b
two = c*x + d
return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
""" hyperbola(x) with parameters
a/b = asymptotic slope
c = curvature at vertex
d = offset to vertex
e = vertical offset
"""
return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e
def rot_hyperbola(x, a, b, c, d, e, th):
pars = a, b, c, 0, 0 # do the shifting after rotation
xd = x - d
hsin = hyperbola(xd, *pars)*np.sin(th)
xcos = xd*np.cos(th)
return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))

I researched this a little, Applied Linear Regression by Sanford, and the Correlation and Regression lecture by Steiger had some good info on it. They all however lack the right model, the piecewise function should be
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
dfseg = pd.read_csv('segreg.csv')
def err(w):
th0 = w['th0'].value
th1 = w['th1'].value
th2 = w['th2'].value
gamma = w['gamma'].value
fit = th0 + th1*dfseg.Temp + th2*np.maximum(0,dfseg.Temp-gamma)
return fit-dfseg.C
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
b0 = mi.params['th0']; b1=mi.params['th1'];b2=mi.params['th2']
gamma = int(mi.params['gamma'].value)
import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma,gamma), data=dfseg).fit()
print reslin.summary()
x0 = np.array(range(0,gamma,1))
x1 = np.array(range(0,80-gamma,1))
y0 = b0 + b1*x0
y1 = (b0 + b1 * float(gamma) + (b1 + b2)* x1)
plt.scatter(dfseg.Temp, dfseg.C)
plt.hold(True)
plt.plot(x0,y0)
plt.plot(x1+gamma,y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph

I used #user423805 's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) in #user423805 's answer, I used the same linear spline calculation for both the minimizer and end-usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, three gamma params):
#
# for 0 < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
if(len(thresholds) < 2):
print("Error: expected at least two thresholds")
return None
applicable_gammas = filter(lambda gamma: x > gamma , gammas)
#base result
y = thresholds[0] + (thresholds[1] * x)
#additional factors calculated depending on x value
for i in range(0, len(applicable_gammas)):
y = y + ( thresholds[i + 2] * ( x - applicable_gammas[i] ) )
return y
def least_splines_calc_array(x_array, thresholds, gammas):
y_array = map(lambda x: least_splines_calc(x, thresholds, gammas), x_array)
return y_array
def err(params, x, data):
th0 = params['th0'].value
th1 = params['th1'].value
th2 = params['th2'].value
th3 = params['th3'].value
gamma1 = params['gamma1'].value
gamma2 = params['gamma2'].value
thresholds = np.array([th0, th1, th2, th3])
gammas = np.array([gamma1, gamma2])
fit = least_splines_calc_array(x, thresholds, gammas)
return np.array(fit)-np.array(data)
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('th3', 0.0),('gamma1', 9.),('gamma2', 9.3)) #NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err_alt, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and gammas to re-use linear_splines_calc to plot the linear splines regression.
Reference: While there's various places that explain least splines (I think #user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations) , the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton) : https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines. Excerpt below:

If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta. Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2* SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.

There is a straightforward method (not iterative, no initial guess) pp.12-13 in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from the scanning of the figure published by IanRoberts in his question. Scanning for the coordinates of the pixels in not accurate. So, don't be surprised by additional deviation.
Note that the abscisses and ordinates scales have been devised by 1000.
The equations of the two segments are
The approximate values of the five parameters are written on the above figure.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Simultaneous fitting to N datasets in Python - python

Related

Turning a for loop function in Vectorized form with numpy

Calculating the derivative of points in python

How to get the coefficient vector of a PolyBoRi polynomial in Sage

equivalent to numpy.linalg.lstsq that allows weighting

Fit a curve for data made up of two distinct regimes

Categories

Resources