I have the following dataset:
x = [1, 6, 11, 21, 101]
y = [5, 4, 3, 2, 1]
and my goal is to create a smooth curve that looks like this:
Is there a way to do it in Python?
I tried the method shown here, using the following code:
from scipy.interpolate import spline
import matplotlib.pyplot as plt
import numpy as np
x = [1, 6, 11, 21, 101]
y = [5, 4, 3, 2, 1]
xnew = np.linspace(min(x), max(x), 100)
y_smooth = spline(x, y, xnew)
plt.plot(xnew, y_smooth)
plt.show()
but the output shows a weird line.
First, interpolate.spline() was deprecated and has since been removed from SciPy, so you should not use it. Use interpolate.splrep() and interpolate.splev() instead. It's not a difficult conversion:
y_smooth = interpolate.spline(x, y, xnew)
becomes
tck = interpolate.splrep(x, y)
y_smooth = interpolate.splev(xnew, tck)
But that's not really the issue here. By default, splrep() fits a cubic spline (piecewise polynomials of degree 3), which is not a natural shape for your data. Because there are so few points, the spline can still pass through all of them, even though the resulting curve is non-intuitive. You can set the degree that it tries to fit with the k=... argument to splrep(). But the same is true even for degree 2: it's trying to fit a parabola, and your data can be fit by a parabola with a bow in the middle (which is what happens now, since the slope is so steep at the beginning and there are no data points in the middle).
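To see what the k argument does, here is a minimal sketch (assuming SciPy is installed; the names ks and curves are only for illustration) that fits the same five points with splines of degree 1, 2 and 3. Each spline passes through the data, but they behave differently between the points:

```python
import numpy as np
from scipy.interpolate import splrep, splev

x = np.array([1, 6, 11, 21, 101])
y = np.array([5, 4, 3, 2, 1])
xnew = np.linspace(x.min(), x.max(), 100)

# Fit an interpolating spline for each degree (k=3 is the default).
ks = (1, 2, 3)
curves = {}
for k in ks:
    tck = splrep(x, y, k=k)
    curves[k] = splev(xnew, tck)

# k=1 is piecewise linear and stays inside the data range;
# higher degrees may overshoot between widely spaced points.
for k in ks:
    print(k, curves[k].min(), curves[k].max())
```

Comparing the printed minimum and maximum with the range of y is a quick way to spot a spline that bows away from the data.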
In your case, your data is much more accurately represented as an exponential, so it'd be best to fit an exponential. I'd recommend using scipy.optimize.curve_fit(). It lets you specify your own fitting function which contains parameters and it'll fit the parameters for you:
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
x = [1, 6, 11, 21, 101]
y = [5, 4, 3, 2, 1]
xnew = np.linspace(min(x), max(x), 100)
def expfunc(x, a, b, c):
    return a * np.exp(-b * x) + c
popt, pcov = curve_fit(expfunc, x, y)
plt.plot(xnew, expfunc(xnew, *popt))
plt.show()
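If the fit struggles to converge, giving curve_fit a starting guess usually helps. The sketch below is one possible way to do that; the guess in p0 is an assumption read off the data (amplitude from the spread of y, offset from its minimum, and an arbitrary small decay rate), not something taken from the original answer. It also turns the covariance matrix into rough one-sigma parameter errors:

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([1, 6, 11, 21, 101], dtype=float)
y = np.array([5, 4, 3, 2, 1], dtype=float)

def expfunc(x, a, b, c):
    return a * np.exp(-b * x) + c

# Hypothetical starting guess: amplitude, decay rate, offset.
p0 = (y.max() - y.min(), 0.1, y.min())
popt, pcov = curve_fit(expfunc, x, y, p0=p0)

# Square roots of the diagonal of pcov estimate the parameter uncertainties.
perr = np.sqrt(np.diag(pcov))

# Sum of squared residuals as a quick goodness-of-fit check.
sse = np.sum((expfunc(x, *popt) - y) ** 2)
print(popt, perr, sse)
```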
For a set of points, I want to get the straight line that is the closest approximation of the points using a least squares fit.
I can find a lot of overly complex solutions here on SO and elsewhere but I have not been able to find something simple. And this should be very simple.
x = np.array([1, 2, 3, 4])
y = np.array([23, 31, 42, 43])
slope, intercept = leastSquares(x, y)
Is there some library function that implements the above leastSquares()?
numpy.linalg.lstsq can compute such a fit for you. There is an example in the documentation that does exactly what you need.
https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html#numpy-linalg-lstsq
To summarize it here …
>>> x = np.array([0, 1, 2, 3])
>>> y = np.array([-1, 0.2, 0.9, 2.1])
>>> A = np.stack([x, np.ones(len(x))]).T
>>> m, c = np.linalg.lstsq(A, y, rcond=None)[0]
>>> m, c
(1.0, -0.95) # may vary
Well, for one, I think for an ordinary least squares fit with a single line you can derive a closed-form solution for the coefficients, if I'm not utterly mistaken. There are some pitfalls with numerical stability, though.
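A minimal sketch of that closed-form solution, assuming the usual simple-regression formulas (slope = covariance of x and y divided by the variance of x, intercept from the means). The function name fit_line is made up for this example. Centering x and y before multiplying is one common way to reduce the stability pitfalls:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least squares fit of y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                      # centered x values
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [23, 31, 42, 43])
print(slope, intercept)  # approximately 7.1 and 17.0
```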
If you look for least squares in general, you'll find more general and thus more complex solutions, because least squares can be done for many more models than just the linear one.
But maybe the sklearn package with its LinearRegression model might do easily what you want to do? https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
or for more detailed control the scipy package, https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html
import numpy as np
from scipy.linalg import lstsq
# Turn x into 2d array; raise it to powers 0 (for y-axis-intercept)
# and 1 (for the slope part)
M = x[:, np.newaxis] ** [0, 1]
p, res, rnk, s = lstsq(M, y)
intercept, slope = p[0], p[1]
Here's one way to implement the least squares regression:
import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([23, 31, 42, 43])
def leastSquares(x, y):
    # Design matrix: one column for x, one column of ones for the intercept
    A = np.vstack([x, np.ones(len(x))]).T
    y = y[:, np.newaxis]
    # Solve the normal equations: (A^T A)^-1 A^T y
    slope, intercept = np.dot((np.dot(np.linalg.inv(np.dot(A.T,A)),A.T)),y)
    return slope, intercept
slope, intercept = leastSquares(x, y)
You can try with Moore-Penrose pseudo-inverse:
import numpy as np
from scipy import linalg
x = np.array([1, 2, 3, 4])
y = np.array([23, 31, 42, 43])
x = np.array([x, np.ones(len(x))])
B = linalg.pinv(x)
sol = np.reshape(y,(1,len(y))) @ B
slope, intercept = sol[0,0], sol[0,1]
I can't figure out how to do this in Python without writing a for loop. I'm hoping you can show me a simpler way.
I trimmed the code down to the relevant parts. I'm doing a polyfit and then want to use the a and b coefficients, coeff[0] and coeff[1], to fill an array with the corresponding y values, like: y = ax + b
I can brute force it and included two methods here, but they're both clunky.
import numpy as np
raw = [0, 3, 6, 8, 11, 15]
coeff = np.polyfit(np.arange(0, len(raw)), raw[:], 1) #fits slope of values in raw
fit = np.zeros(shape=(len(raw), 2))
fit[:,0] = np.arange(0,fit.shape[0]) # this creates an index so I can use the row index as the "x" variable
fit[:,1] = fit[:,0]*coeff[0] + fit[:,0]*coeff[1] # calculating y = ax * b in column [1]
## Alternate method with the for loop
for_fit = np.zeros(len(raw))
for i in range(0,len(raw)) :
    for_fit[i] = i*coeff[0] + i*coeff[1]
I tried to make it a little bit cleaner. The main issue I saw is that you did not use the formula y = ax+b but rather y=ax+bx.
import numpy as np
raw = [0, 3, 6, 8, 11, 15]
x = np.arange(0, len(raw))
coeff = np.polyfit(x, raw[:], 1)
y = x*coeff[0] + coeff[1]
To visualise the result we can use:
import matplotlib.pyplot as plt
plt.plot(x, raw, 'bo')
plt.plot(x, y, 'r')
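As a possible alternative to writing out the formula, NumPy can evaluate the fitted polynomial directly with np.polyval. This sketch also builds the two-column array from the question; np.column_stack is my choice here, not part of the original code:

```python
import numpy as np

raw = [0, 3, 6, 8, 11, 15]
x = np.arange(len(raw))
coeff = np.polyfit(x, raw, 1)      # coeff[0] is the slope, coeff[1] the intercept

y_manual = coeff[0] * x + coeff[1] # y = ax + b written out by hand
y_polyval = np.polyval(coeff, x)   # same result, works for any degree

# Two columns: the index used as "x" and the fitted y values.
fit = np.column_stack((x, y_polyval))
print(fit)
```

Because np.polyval takes the coefficient array as returned by np.polyfit, the same line keeps working if the degree of the fit changes.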
#EDIT
Are you looking for something like this?
y_arr = np.empty((10, len(x)))
for i in range(10):
    ...
    y_arr[i] = y
I’d like to rotate a line graph horizontally. So far, I have the target angle but I’m not able to rotate the graph array (the blue graph in the plot).
import matplotlib.pyplot as plt
import numpy as np
x = [5, 6.5, 7, 8, 6, 5, 3, 4, 3, 0]
y = range(len(x))
best_fit_line = np.poly1d(np.polyfit(y, x, 1))(y)
angle = np.rad2deg(np.arctan2(y[-1] - y[0], best_fit_line[-1] - best_fit_line[0]))
print("angle: " + str(angle))
plt.figure(figsize=(8, 6))
plt.plot(x)
plt.plot(best_fit_line, "--", color="r")
plt.show()
The target calculations of the array should look like this (please ignore the red line):
If you have some advice, please let me know. Thanks.
This question is very helpful, in particular the answer by @Mr Tsjolder. Adapting that to your question, I had to subtract 90 from the angle you calculated to get the result you want:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import transforms
x = [5, 6.5, 7, 8, 6, 5, 3, 4, 3, 0]
y = range(len(x))
best_fit_line = np.poly1d(np.polyfit(y, x, 1))(y)
angle = np.rad2deg(np.arctan2(y[-1] - y[0], best_fit_line[-1] - best_fit_line[0]))
print("angle: " + str(angle))
plt.figure(figsize=(8, 6))
base = plt.gca().transData
rotation = transforms.Affine2D().rotate_deg(angle - 90)
plt.plot(x, transform = rotation + base)
plt.plot(best_fit_line, "--", color="r", transform = rotation + base)
Follow-up question: What if we just need the numerical values of the rotated points?
Then the matplotlib approach can still be useful. From the rotation object we introduced above, matplotlib can extract the transformation matrix, which we can use to transform any point:
# extract transformation matrix from the rotation object
M = transforms.Affine2DBase.get_matrix(rotation)[:2, :2]
# example: transform the first point
print(M @ [0, 5])
[-2.60096617 4.27024297]
The slicing was done to get the dimensions we're interested in, since the rotation happens only in 2D. You can see that the first point from your original data gets transformed to (-2.6, 4.3), agreeing with my plot of the rotated graph above.
In this way you can rotate any point you're interested in, or write a loop to catch them all.
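Instead of a loop, all points can be rotated in one matrix product. The sketch below is NumPy-only and builds the standard 2D rotation matrix by hand, on the assumption that it matches the 2x2 block extracted from the Affine2D rotation (the first rotated point agrees with the value printed earlier). The names points, rotated and rotated_fit are illustrative:

```python
import numpy as np

x = np.array([5, 6.5, 7, 8, 6, 5, 3, 4, 3, 0])
y = np.arange(len(x))
best_fit_line = np.poly1d(np.polyfit(y, x, 1))(y)
angle = np.rad2deg(np.arctan2(y[-1] - y[0], best_fit_line[-1] - best_fit_line[0]))

# Counter-clockwise rotation by (angle - 90) degrees.
theta = np.radians(angle - 90)
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# One row per point, in plot coordinates: (index, value).
points = np.column_stack((y, x))
rotated = points @ M.T             # rotates every row at once

# The best-fit line becomes horizontal after the same rotation.
rotated_fit = np.column_stack((y, best_fit_line)) @ M.T
print(rotated[0])
```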
Arne's answer is perfect if you'd like to rotate the graph with matplotlib. If not, you can take a look at this code:
import matplotlib.pyplot as plt
import numpy as np
def rotate_vector(data, angle):
    # source:
    # https://datascience.stackexchange.com/questions/57226/how-to-rotate-the-plot-and-find-minimum-point
    # make rotation matrix
    theta = np.radians(angle)
    co = np.cos(theta)
    si = np.sin(theta)
    rotation_matrix = np.array(((co, -si), (si, co)))
    # rotate data vector
    rotated_vector = data.dot(rotation_matrix)
    return rotated_vector
x = [5, 6.5, 7, 8, 6, 5, 3, 4, 3, 0]
y = range(len(x))
best_fit_line = np.poly1d(np.polyfit(y, x, 1))(y)
angle = np.rad2deg(np.arctan2(y[-1] - y[0], best_fit_line[-1] - best_fit_line[0]))
print("angle:", angle)
# rotate blue line
d = np.hstack((np.vstack(y), np.vstack(x)))
xr = rotate_vector(d, -(angle - 90))
# rotate red line
dd = np.hstack((np.vstack(y), np.vstack(best_fit_line)))
xxr = rotate_vector(dd, -(angle - 90))
plt.figure(figsize=(8, 6))
plt.plot(xr[:, 1]) # or plt.plot(xr[:, 0], xr[:, 1])
plt.plot(xxr[:, 1], "--", color="r")
plt.show()
A test code for this type of data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,1,20)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0])
n = np.size(x)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
def gaus(x,a,x0,sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = curve_fit(gaus,x,y,p0=[max(y),mean,sigma])
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
I need to fit lots of data which is just like the y array given above to a Gaussian distribution.
Using the standard gaussian fitting routine using scipy.optimize gives this kind of fit:
I have tried many different initial values, but cannot get any kind of fit.
Does anyone have any ideas how I could get this data fitted to a Gaussian?
Thanks
The problem
Your fundamental problem is that you have a severely underdetermined fitting problem. Think about it like this: you have three unknowns but only one datapoint. This is akin to solving for x, y, z when you only have one equation. Because the height of your Gaussian can vary independently of its width, there are infinitely many distributions, all with different widths, that will satisfy the constraints of your fit.
More directly, your a and sigma parameters can both change the maximum height of the distribution, which is pretty much the only thing that matters in terms of achieving a good fit (at least once the distribution is centered and fairly narrow). Thus, the fitting routines in Scipy can't figure out which to change at any given step.
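A quick way to see the ambiguity is to evaluate the residual for several widths while keeping the peak centered on the one non-zero datapoint. This is only an illustrative sketch; the sigma values are arbitrary choices that are small compared with the spacing of x:

```python
import numpy as np

x = np.linspace(0, 1, 20)
y = np.zeros(20)
y[10] = 10                          # the single non-zero datapoint

def gaus(x, a, x0, sigma):
    return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

x0 = x[np.argmax(y)]
residuals = []
for sigma in (0.001, 0.005, 0.01):
    # Sum of squared differences between the model and the data.
    residuals.append(np.sum((gaus(x, 10, x0, sigma) - y) ** 2))
    print(sigma, residuals[-1])
```

All three widths give a residual that is essentially zero, so the data alone cannot tell the optimizer which sigma to prefer.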
The fix
The simplest way to solve the problem is to lock down one of your parameters. You don't need to change your equation, but you do need to make at least one of a, x0, or sigma a constant. The best choice of parameter to fix is probably x0, since it's trivial to determine the mean/median/mode of your data by just getting the x coordinate of the one datapoint that is non-zero in y. You'll also need to get a little more clever about how you set your initial guesses. Here's what that looks like:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,1,20)
xdiff = x[1] - x[0]
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# the mean/median/mode all occur at the x coordinate of the one datapoint that is non-zero in y
mean = x[np.argmax(y)]
# sigma should be tiny, since we want a narrow distribution
sigma = xdiff
# the scaling factor should be roughly equal to the "height" of the one datapoint
a = y.max()
def gaus(x,a,sigma):
    return a*np.exp(-(x-mean)**2/(2*sigma**2))
bounds = ((1, .015), (20, 1))
popt,pcov = curve_fit(gaus, x, y, p0=[a, sigma], maxfev=20000, bounds=bounds)
residual = ((gaus(x,*popt) - y)**2).sum()
plt.figure(figsize=(8,6))
plt.plot(x,y,'b+:',label='data')
xdist = np.linspace(x.min(), x.max(), 1000)
plt.plot(xdist,gaus(xdist,*popt),'C0', label='fit distribution')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.text(.1,6,"residual: %.6e" % residual)
plt.legend()
plt.show()
Output:
The better fix
You don't need a fit to get the kind of Gaussians you want. You can instead use a simple closed form expression to calculate the parameters that you need, as in the fitonegauss function in the code below:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def gauss(x, a, mean, sigma):
    return a*np.exp(-(x - mean)**2/(2*sigma**2))
def fitonegauss(x, y, fwhm=None):
    if fwhm is None:
        # determine full width at half maximum from the spacing between the x points
        fwhm = (x[1] - x[0])
    # the mean/median/mode all occur at the x coordinate of the one datapoint that is non-zero in y
    mean = x[np.argmax(y)]
    # solve for sigma in terms of the desired full width at half maximum
    sigma = fwhm/(2*np.sqrt(2*np.log(2)))
    # gauss() is not normalized, so its peak height equals a; use the data maximum
    a = y.max()
    return a, mean, sigma
N = 20
x = np.linspace(0,1,N)
y = np.zeros(N)
y[N//2] = 10
popt = fitonegauss(x, y)
plt.figure(figsize=(8,6))
plt.plot(x,y,'b+:',label='data')
xdist = np.linspace(x.min(), x.max(), 1000)
plt.plot(xdist,gauss(xdist,*popt),'C0', label='fit distribution')
residual = ((gauss(x,*popt) - y)**2).sum()
plt.plot(x, gauss(x,*popt),'ro:',label='fit')
plt.text(.1,6,"residual: %.6e" % residual)
plt.legend()
plt.show()
Output:
The advantages of this approach are many. It's far more computationally efficient than any fit could be, it will (for the most part) never fail, and it gives you far more control over the actual width of the distribution that you end up with.
The fitonegauss function is set up so that you can directly set the full width at half maximum of the fitted distribution. If you leave it unset, the code will automatically guess it from the spacing of the x data. This seems to produce reasonable results for your application.
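The relation between sigma and the full width at half maximum can be checked numerically. In this sketch the values of a, mean and fwhm are arbitrary illustrative choices; the curve should drop to half its peak height at a distance of fwhm/2 on either side of the mean:

```python
import numpy as np

def gauss(x, a, mean, sigma):
    return a * np.exp(-(x - mean) ** 2 / (2 * sigma ** 2))

a, mean, fwhm = 10.0, 0.5, 0.1      # illustrative values only

# Same conversion as in fitonegauss: FWHM = 2 * sqrt(2 * ln 2) * sigma.
sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))

half_left = gauss(mean - fwhm / 2, a, mean, sigma)
half_right = gauss(mean + fwhm / 2, a, mean, sigma)
print(half_left, half_right)        # both close to a / 2
```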
Don't use a general "a" parameter, use the proper normal distribution equation instead:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,1,20)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0])
n = np.size(x)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
def gaus(x, x0, sigma):
    return 1/np.sqrt(2 * np.pi * sigma**2)*np.exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = curve_fit(gaus,x,y,p0=[mean,sigma])
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
What is the easiest and fastest way to interpolate between two arrays to get a new array?
For example, I have 3 arrays:
x = np.array([0,1,2,3,4,5])
y = np.array([5,4,3,2,1,0])
z = np.array([0,5])
x and y correspond to data points and z is the argument: at z=0 the x array is valid, and at z=5 the y array is valid. But I need to get a new array for z=1. In the linear case this is easily solved by:
a = (y-x)/(z[1]-z[0])*1+x
The problem is that the data is not linearly dependent and there are more than 2 arrays of data. Is it possible to use spline interpolation somehow?
This is a univariate to multivariate regression problem. Scipy supports univariate to univariate regression, and multivariate to univariate regression. But you can instead iterate over the outputs, so this is not such a big problem. Below is an example of how it can be done. I've changed the variable names a bit and added a new point:
import numpy as np
from scipy.interpolate import interp1d
X = np.array([0, 5, 10])
Y = np.array([[0, 1, 2, 3, 4, 5],
[5, 4, 3, 2, 1, 0],
[8, 6, 5, 1, -4, -5]])
XX = np.array([0, 1, 5]) # Find YY for these
YY = np.zeros((len(XX), Y.shape[1]))
for i in range(Y.shape[1]):
    f = interp1d(X, Y[:, i])
    for j in range(len(XX)):
        YY[j, i] = f(XX[j])
So YY are the result for XX. Hope it helps.
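The two loops can likely be avoided, because interp1d accepts an axis argument and can interpolate all columns at once. The sketch below uses the same X, Y and XX as above; the kind="quadratic" variant is an extra I added to illustrate a non-linear option, and it needs at least three rows in Y:

```python
import numpy as np
from scipy.interpolate import interp1d

X = np.array([0, 5, 10])
Y = np.array([[0, 1, 2, 3, 4, 5],
              [5, 4, 3, 2, 1, 0],
              [8, 6, 5, 1, -4, -5]])
XX = np.array([0, 1, 5])

# Interpolate along the first axis: one row of Y per value of X.
f = interp1d(X, Y, axis=0)
YY = f(XX)                          # shape (len(XX), Y.shape[1])

# Same call with a quadratic spline instead of straight lines.
f_quad = interp1d(X, Y, axis=0, kind="quadratic")
YY_quad = f_quad(XX)
print(YY)
```

With the default linear kind, the row for XX = 1 lies one fifth of the way from the first row of Y to the second.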