Order of input data affects result of 3D polynomial fit - python

I am working on a 3D polynomial fit. The data I have are the x and y coordinates of 5 stations and the velocities at these 5 stations. What I want to do is fit a grid through these points. I will then use this grid to determine the velocity gradients from the predicted velocities at each grid point.
My code is:
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

xx_vel = np.array([[4.78,52.32] ,[5.18,52.10], [4.45,51.97], [4.92,51.97], [5.15,51.85]]) #location of stations in degrees longitude and latitude
X = xx_vel #coordinates
Z = np.array([-0.00, -0.766, -0.00, -1.732, -1.00]) #velocities at 5 stations
deg_of_poly = 3
poly = PolynomialFeatures(degree=deg_of_poly)
X_ = poly.fit_transform(X) #expand coordinates into polynomial terms
clf = linear_model.LinearRegression()
clf.fit(X_, Z)
x_pred = np.linspace(4, 6, 27) #defining grid points
y_pred = np.linspace(51.5, 52.7, 27) #defining grid points
predict_x, predict_y = np.meshgrid(x_pred, y_pred)
predict_xy = np.concatenate((predict_x.reshape(-1, 1), predict_y.reshape(-1, 1)), axis=1)
predict_x_ = poly.transform(predict_xy) #reuse the already fitted feature expansion
predict_z = clf.predict(predict_x_)
predict_z_poly = predict_z.reshape(predict_x.shape)
Using this code I obtain the following fit:
This all seemed fine until I changed the order of the input data. For example, if I switch the first and last stations so that my input arrays are:
xx_vel = np.array([[5.15,51.85],[5.18,52.10], [4.45,51.97], [4.92,51.97], [4.78,52.32]])
Z = np.array([-1.00, -0.766, -0.00, -1.732, -0.00])
I obtain a different fit. Is there something I am doing wrong? Or is there a way I can make sure I obtain the same results no matter in what order the data is given? I would think that this should not have an effect on the result.
Thanks in advance!
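One thing worth checking here, offered as a sketch rather than a diagnosis: a degree-3 polynomial in two variables has 10 coefficients, while only 5 stations are available, so the least-squares problem is underdetermined and many coefficient sets reproduce the data exactly. Which one a solver returns may then depend on numerical details such as row order, especially with uncentred coordinates near 52 whose cubes make the problem poorly conditioned. The snippet below uses plain NumPy (not the scikit-learn pipeline from the question) to count the terms and to check that a degree-1 fit on centred coordinates, which has only 3 coefficients, gives the same plane for both orderings. The helper name fit_plane and the reference point are made up for this illustration.

```python
from math import comb
import numpy as np

# Station coordinates (lon, lat) and velocities, copied from the question.
X = np.array([[4.78, 52.32], [5.18, 52.10], [4.45, 51.97],
              [4.92, 51.97], [5.15, 51.85]])
Z = np.array([-0.00, -0.766, -0.00, -1.732, -1.00])

# Number of monomials of total degree <= 3 in 2 variables (bias included).
n_terms = comb(2 + 3, 3)
print(n_terms, "coefficients vs", len(Z), "data points")

def fit_plane(X, Z):
    """Least-squares plane z = c0 + c1*dx + c2*dy on centred coordinates.

    Hypothetical helper for this sketch. Centring around a fixed
    reference point keeps the design matrix well conditioned.
    """
    centre = np.array([4.9, 52.0])  # fixed reference point, an arbitrary choice
    d = X - centre
    A = np.column_stack([np.ones(len(d)), d[:, 0], d[:, 1]])
    coef, *_ = np.linalg.lstsq(A, Z, rcond=None)
    return coef

# Swap the first and last stations, as in the question.
order = [4, 1, 2, 3, 0]
coef_a = fit_plane(X, Z)
coef_b = fit_plane(X[order], Z[order])
print(np.allclose(coef_a, coef_b))
```

With more stations than coefficients and a well-conditioned design matrix the least-squares solution is unique, so row order should no longer matter. Whether a plane is an adequate model for the velocity field is a separate question that this sketch does not address.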

Related

Finding two linear fits on different domains in the same data

I'm trying to plot a 3rd-order polynomial, and two linear fits on the same set of data. My data looks like this:
,Frequency,Flux Density,log_freq,log_flux
0,1.25e+18,1.86e-07,18.096910013008056,-6.730487055782084
1,699000000000000.0,1.07e-06,14.84447717574568,-5.97061622231479
2,541000000000000.0,1.1e-06,14.73319726510657,-5.958607314841775
3,468000000000000.0,1e-06,14.670245853074125,-6.0
4,458000000000000.0,1.77e-06,14.660865478003869,-5.752026733638194
5,89400000000000.0,3.01e-05,13.951337518795917,-4.521433504406157
6,89400000000000.0,9.3e-05,13.951337518795917,-4.031517051446065
7,89400000000000.0,0.00187,13.951337518795917,-2.728158393463501
8,65100000000000.0,2.44e-05,13.813580988568193,-4.61261017366127
9,65100000000000.0,6.28e-05,13.813580988568193,-4.2020403562628035
10,65100000000000.0,0.00108,13.813580988568193,-2.96657624451305
11,25900000000000.0,0.000785,13.413299764081252,-3.1051303432547472
12,25900000000000.0,0.00106,13.413299764081252,-2.9746941347352296
13,25900000000000.0,0.000796,13.413299764081252,-3.099086932262331
14,13600000000000.0,0.00339,13.133538908370218,-2.469800301796918
15,13600000000000.0,0.00372,13.133538908370218,-2.4294570601181023
16,13600000000000.0,0.00308,13.133538908370218,-2.5114492834995557
17,12700000000000.0,0.00222,13.103803720955957,-2.653647025549361
18,12700000000000.0,0.00204,13.103803720955957,-2.6903698325741012
19,230000000000.0,0.133,11.361727836017593,-0.8761483590329142
22,90000000000.0,0.518,10.954242509439325,-0.28567024025476695
23,61000000000.0,1.0,10.785329835010767,0.0
24,61000000000.0,0.1,10.785329835010767,-1.0
25,61000000000.0,0.4,10.785329835010767,-0.3979400086720376
26,42400000000.0,0.8,10.627365856592732,-0.09691001300805639
27,41000000000.0,0.9,10.612783856719735,-0.045757490560675115
28,41000000000.0,0.7,10.612783856719735,-0.1549019599857432
29,41000000000.0,0.8,10.612783856719735,-0.09691001300805639
30,41000000000.0,0.6,10.612783856719735,-0.2218487496163564
31,41000000000.0,0.7,10.612783856719735,-0.1549019599857432
32,37000000000.0,1.0,10.568201724066995,0.0
33,36800000000.0,1.0,10.565847818673518,0.0
34,36800000000.0,0.98,10.565847818673518,-0.00877392430750515
35,33000000000.0,0.8,10.518513939877888,-0.09691001300805639
36,33000000000.0,1.0,10.518513939877888,0.0
37,31400000000.0,0.92,10.496929648073214,-0.036212172654444715
38,23000000000.0,1.4,10.361727836017593,0.146128035678238
39,23000000000.0,1.1,10.361727836017593,0.04139268515822508
40,23000000000.0,1.11,10.361727836017593,0.045322978786657475
41,23000000000.0,1.1,10.361727836017593,0.04139268515822508
42,22200000000.0,1.23,10.346352974450639,0.08990511143939793
43,22200000000.0,1.24,10.346352974450639,0.09342168516223506
44,21700000000.0,0.98,10.33645973384853,-0.00877392430750515
45,21700000000.0,1.07,10.33645973384853,0.029383777685209667
46,20000000000.0,1.44,10.301029995663981,0.15836249209524964
47,15400000000.0,1.32,10.187520720836464,0.12057393120584989
48,15000000000.0,1.5,10.176091259055681,0.17609125905568124
49,15000000000.0,1.5,10.176091259055681,0.17609125905568124
50,15000000000.0,1.42,10.176091259055681,0.15228834438305647
51,15000000000.0,1.43,10.176091259055681,0.1553360374650618
52,15000000000.0,1.42,10.176091259055681,0.15228834438305647
53,15000000000.0,1.47,10.176091259055681,0.1673173347481761
54,15000000000.0,1.38,10.176091259055681,0.13987908640123647
55,10700000000.0,2.59,10.02938377768521,0.4132997640812518
56,8870000000.0,2.79,9.947923619831727,0.44560420327359757
57,8460000000.0,2.69,9.927370363039023,0.42975228000240795
58,8400000000.0,2.8,9.924279286061882,0.4471580313422192
59,8400000000.0,2.53,9.924279286061882,0.40312052117581787
60,8400000000.0,2.06,9.924279286061882,0.31386722036915343
61,8300000000.0,2.58,9.919078092376074,0.41161970596323016
62,8080000000.0,2.76,9.907411360774587,0.4409090820652177
63,5010000000.0,3.68,9.699837725867246,0.5658478186735176
64,5000000000.0,0.81,9.698970004336019,-0.09151498112135022
65,5000000000.0,3.5,9.698970004336019,0.5440680443502757
66,5000000000.0,3.57,9.698970004336019,0.5526682161121932
67,4980000000.0,3.46,9.697229342759718,0.5390760987927766
68,4900000000.0,2.95,9.690196080028514,0.46982201597816303
69,4850000000.0,3.46,9.685741738602264,0.5390760987927766
70,4850000000.0,3.45,9.685741738602264,0.5378190950732742
71,4780000000.0,2.16,9.679427896612118,0.3344537511509309
72,4540000000.0,3.61,9.657055852857104,0.557507201905658
73,2700000000.0,3.5,9.431363764158988,0.5440680443502757
74,2700000000.0,3.7,9.431363764158988,0.568201724066995
75,2700000000.0,3.92,9.431363764158988,0.5932860670204573
76,2700000000.0,3.92,9.431363764158988,0.5932860670204573
77,2250000000.0,4.21,9.352182518111363,0.6242820958356683
78,1660000000.0,3.69,9.220108088040055,0.5670263661590603
79,1660000000.0,3.8,9.220108088040055,0.5797835966168101
80,1410000000.0,3.5,9.14921911265538,0.5440680443502757
81,1400000000.0,3.45,9.146128035678238,0.5378190950732742
82,1400000000.0,3.28,9.146128035678238,0.5158738437116791
83,1400000000.0,3.19,9.146128035678238,0.5037906830571811
84,1400000000.0,3.51,9.146128035678238,0.5453071164658241
85,1340000000.0,3.31,9.127104798364808,0.5198279937757188
86,1340000000.0,3.31,9.127104798364808,0.5198279937757188
87,750000000.0,3.14,8.8750612633917,0.49692964807321494
88,408000000.0,1.46,8.61066016308988,0.1643528557844371
89,408000000.0,1.46,8.61066016308988,0.1643528557844371
90,365000000.0,1.62,8.562292864456476,0.20951501454263097
91,365000000.0,1.56,8.562292864456476,0.1931245983544616
92,333000000.0,1.32,8.52244423350632,0.12057393120584989
93,302000000.0,1.23,8.48000694295715,0.08990511143939793
94,151000000.0,2.13,8.178976947293169,0.3283796034387377
95,73800000.0,3.58,7.868056361823042,0.5538830266438743
and my code is
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import numpy.polynomial.polynomial as poly
def find_extrema(poly, bounds):
    '''
    Finds the extrema of the polynomial; ensure real.
    https://stackoverflow.com/questions/72932816/python-finding-local-maxima-minima-for-multiple-polynomials-efficiently
    '''
    deriv = poly.deriv()
    extrema = deriv.roots()
    # Filter out complex roots
    extrema = extrema[np.isreal(extrema)]
    # Get real part of root
    extrema = np.real(extrema)
    # Apply bounds check
    lb, ub = bounds
    extrema = extrema[(lb <= extrema) & (extrema <= ub)]
    return extrema
def find_maximum(poly, bounds):
    '''
    Find the maximum point; returns the value of the turnover frequency.
    https://stackoverflow.com/questions/72932816/python-finding-local-maxima-minima-for-multiple-polynomials-efficiently
    '''
    extrema = find_extrema(poly, bounds)
    # Either bound could end up being the maximum. Check those too.
    extrema = np.concatenate((extrema, bounds))
    value_at_extrema = poly(extrema)
    maximum_index = np.argmax(value_at_extrema)
    return extrema[maximum_index]
# LOAD THE DATA FROM FILE HERE
# CARRY ON...
xvar = 'log_freq'
yvar = 'log_flux'
x, y = pks[xvar], pks[yvar]
lower = min(x)
upper = max(x)
# Find the 3rd-order polynomial which fits the SED
coefs = poly.polyfit(x, y, 3) # find the coeffs
x_new = np.linspace(lower, upper, num=len(x)*10) # space to plot the fit
ffit = poly.Polynomial(coefs) # find the polynomial
# Find turnover frequency and peak flux
nu_to = find_maximum(ffit, (lower, upper))
F_p = ffit(nu_to)
# HERE'S THE TRICKY BIT
# Find the straight line to fit to the left of nu_to
left_linefit = poly.polyfit(x, y, 1)
x_left = np.linspace(lower, nu_to, num=len(x)*10) # space to plot the fit
ffit_left = poly.Polynomial(left_linefit,
                            domain = (lower, nu_to)
                            )
# PLOTS THE POLYNOMIAL WELL
ax1 = plt.subplot(1, 1, 1)
ax1.scatter(pks[xvar], pks[yvar], label = 'PKS 0742+10', c = 'b')
ax1.plot(x_new, ffit(x_new), color = 'r')
ax1.plot(x_left, ffit_left(x_left), color = 'gold')
ax1.set_yscale('linear')
ax1.set_xscale('linear')
ax1.legend()
ax1.set_xlabel(r'$\log\nu$ ($\nu$ in Hz)')
ax1.set_ylabel(r'$\log F_{\nu}$ ($F_{\nu}$ in Jy)')
ax1.grid(axis = 'both', which = 'major')
The code produces the poly fit well:
I'm trying to plot the straight-line fits for the points on either side of the maximum, as shown schematically below:
I thought I could do it with
ffit_left = poly.Polynomial(left_linefit,
domain = (lower, nu_to)
)
and similarly for ffit_right, but that produces the straight-line fit for the whole dataset, plotted only over that domain. I don't want to manipulate the dataset, because eventually I'll have to do this for a lot of datasets.
The fitting part of the code comes from an answer to this question.
How can I fit a straight line to just a subset of the points without manipulating the dataset?
My guess is that I have to make left_linefit = poly.polyfit(x, y, 1) recognise a domain, but I can't see anything in the numpy polyfit docs.
Sorry for the long question!
I am not sure I fully understand your request. If you want to fit a piecewise function made of three linear segments, a method is described in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf with theory and numerical examples.
Several cases are considered. Among them the case below might be convenient for you.
H(*) is the Heaviside step function.
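A simpler alternative to a full piecewise regression, sketched here under assumptions: select the points on each side of the turnover with a boolean mask and pass only those to polyfit. Indexing with a mask builds temporary arrays for the fit and leaves the original data untouched. The synthetic x and y and the turnover value below are stand-ins for pks['log_freq'], pks['log_flux'] and nu_to from the question.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

# Synthetic stand-in data: slope +1 up to x = 10, slope -2 afterwards.
x = np.linspace(8.0, 12.0, 41)
y = np.where(x <= 10.0, x - 10.0, -2.0 * (x - 10.0))
nu_to = 10.0  # assumed turnover; in the question this comes from find_maximum

# Boolean masks pick out each side without modifying x or y.
left = x <= nu_to
right = x >= nu_to

# polyfit returns coefficients in increasing order: [intercept, slope].
left_coefs = poly.polyfit(x[left], y[left], 1)
right_coefs = poly.polyfit(x[right], y[right], 1)

ffit_left = poly.Polynomial(left_coefs)
ffit_right = poly.Polynomial(right_coefs)

print("left slope:", left_coefs[1])
print("right slope:", right_coefs[1])
```

Note that the domain argument of Polynomial rescales the input variable onto the polynomial's window; it does not restrict which points are used in a fit, which is why it is left at its default in this sketch.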

How do I plot 2 separate average lines on my residuals between different x values - python

In the residual plot resulting from the code below, there is a substantial drop in values around the halfway point.
I would like to help visualise this for those less statistically inclined by plotting 2 average lines on the residual plot:
one for x in (0, 110)
and the second for x in (110, 240)
Here is the code
# FINAL LINEAR MODEL
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
# merged is the asker's DataFrame (not shown in the question)
x = merged[['Imp_Col_LNG', 'AveSH_LNG']].values
y = merged['Unproductive_LNG'].values
reg = LinearRegression()
reg.fit(x,y)
# plt.scatter(x, y)
yp=reg.predict(x)
# residuals; computed here because the bounds below need them
res = pd.Series(y - yp)
# plt.plot(xp,yp)
# plt.text(x.max()*0.7,y.max()*0.1,'$R^2$ = {score:.4f}'.format(score=reg.score(x,y)))
# plt.show()
plt.scatter(yp, y)
s = yp.argsort()
plt.plot(yp[s], yp[s],color='k',ls='--')
# 95% band around the fit, based on the residual standard deviation
ub = yp + norm.ppf(0.5+0.95/2) * res.std(ddof=1)
lb = yp - norm.ppf(0.5+0.95/2) * res.std(ddof=1)
plt.plot(yp[s], ub[s],color='k',ls='--')
plt.plot(yp[s], lb[s],color='k',ls='--')
plt.text(x.max()*0.7,y.max()*0.1,'$R^2$ = {score:.4f}'.format(score=reg.score(x,y)))
plt.xlabel('Predicted Values')
plt.ylabel('Observed Values')
plt.title('LNG_Shuffles')
plt.show()
# RESIDUAL PLOTS
checkresiduals(res)  # helper not shown in the question
plt.plot(res)
Since we are trying to plot the average of the residuals from (0, 110) and (110, 240), we first have to calculate the averages for each section.
Here, res stores the residual data in the form of a pd.Series object. To get the array information from it, we can use the to_numpy method of the pd.Series objects.
res_data = res.to_numpy()
Now, let's calculate the mean for each part.
first_average = res_data[:110].mean()
second_average = res_data[110:].mean()
Now, since we are going to plot these over two different ranges, we need a constant array holding each average across its range before plotting.
plt.plot(np.arange(110), np.ones(110) * first_average)
plt.plot(np.arange(110, 240), np.ones(130) * second_average)
This should give you the piecewise residual average plot.
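If the split point or the number of segments may change, the same idea can be wrapped in a small helper. This is only a sketch: segment_means is a hypothetical name, and the residuals below are synthetic stand-ins for res_data.

```python
import numpy as np

def segment_means(values, breakpoints):
    """Return (start, stop, mean) for each consecutive index segment."""
    edges = [0, *breakpoints, len(values)]
    return [(a, b, float(np.mean(values[a:b])))
            for a, b in zip(edges[:-1], edges[1:])]

# Synthetic residuals: +1 for the first 110 points, -1 for the remaining 130.
res_data = np.concatenate([np.full(110, 1.0), np.full(130, -1.0)])

segments = segment_means(res_data, [110])
for start, stop, mean in segments:
    print(start, stop, mean)
    # With matplotlib imported as plt, each segment could be drawn with:
    # plt.hlines(mean, start, stop - 1)
```

Using the array length for the final edge avoids hard-coding 130 and 240, so the helper keeps working if the number of residuals changes.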

Python GPy module: how to plot model predictions over simple x-axis?

In Python, I was attempting to dive into the GPy library for estimating Gaussian Process models, when I encountered a stumbling block early on with simple plotting.
For my data, I generated a simple sine wave with a squared growth rate added in midway, and GPy successfully estimated the initial model.
Data generation:
import numpy as np
import matplotlib.pyplot as plt
import GPy
## Generating data for regression
# First, regular sine wave + normal noise
x = np.linspace(0,40, num=300)
noise1 = np.random.normal(0,0.3,300)
y = np.sin(x) + noise1
# Second, an upward trend starting midway, with its own noise as well
temp = x[150:]
noise2 = 0.004*temp**2 + np.random.normal(0,0.1,150)
y[150:] = y[150:] + noise2
plt.plot(x, y)
Initial model:
## Pre-processing
X = np.expand_dims(x, axis=1)
Y = np.expand_dims(y, axis=1)
## Model
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model1 = GPy.models.GPRegression(X, Y, kernel)
## Plotting
fig = model1.plot()
GPy.plotting.show(fig, filename='basic_gp_regression_notebook')
However, this model is mis-specified, since the data was only created using sin(X) and X^2, and not just X, so I create the next model:
X_all = np.hstack((np.sin(X), np.square(X)))
model2 = GPy.models.GPRegression(X_all, Y, kernel)
fig = model2.plot()
GPy.plotting.show(fig, filename='basic_correct_gp_regression_notebook')
However, now, I am getting plotting errors,
Invalid value of type 'builtins.str' received for the 'size' property of scatter.marker Received value: '5'
I assume this is because the plot does not know to use "X" as the x-axis, having been supplied only sin(X) and X^2.
How could I fix this?

Python stats module: How to extract confidence/prediction intervals from GPy?

After having looked through all the docs and examples online, I have not been able to find a way to extract information regarding the confidence or prediction intervals from GPy models.
I generate dummy data like this,
import numpy as np
import matplotlib.pyplot as plt
import GPy
## Generating data for regression
# First, regular sine wave + normal noise
x = np.linspace(0,40, num=300)
noise1 = np.random.normal(0,0.3,300)
y = np.sin(x) + noise1
## Second, an upward trend starting midway, with its own noise as well
temp = x[150:]
noise2 = 0.004*temp**2 + np.random.normal(0,0.1,150)
y[150:] = y[150:] + noise2
plt.plot(x, y)
and then estimate a basic model,
## Pre-processing
X = np.expand_dims(x, axis=1)
Y = np.expand_dims(y, axis=1)
## Model
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model1 = GPy.models.GPRegression(X, Y, kernel)
However, nothing makes it clear how to proceed from there... Another question here tried asking the same thing, but that answer does not work any more, and seems rather unsatisfactory, for such an important element of statistical modelling.
Given a model, and a set of target x values we want to generate the intervals at, you can extract the intervals using:
intervals = model.predict_quantiles( X = target_x_vals, quantiles = (2.5, 97.5) )
You can change the quantiles argument to get intervals of a different width. The documentation for this function can be found at: https://gpy.readthedocs.io/en/deploy/_modules/GPy/core/gp.html
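For intuition about what such quantiles represent, here is a sketch that does not call GPy at all. It assumes a Gaussian predictive distribution with a mean and a variance at each target point (in GPy these would typically come from model.predict, which is not run here; the arrays below are invented) and computes the 2.5% and 97.5% quantiles directly. The function name gaussian_quantiles is hypothetical.

```python
from statistics import NormalDist
import numpy as np

# Invented predictive means and variances at three target points.
mean = np.array([0.0, 1.0, 2.0])
var = np.array([0.25, 1.0, 4.0])

def gaussian_quantiles(mean, var, quantiles=(2.5, 97.5)):
    """Quantiles (given in percent) of N(mean, var) at each point."""
    std = np.sqrt(var)
    return [mean + NormalDist().inv_cdf(q / 100.0) * std for q in quantiles]

lower, upper = gaussian_quantiles(mean, var)
print(lower)
print(upper)
```

The interval is symmetric around the mean and its half-width is roughly 1.96 standard deviations, which is the usual 95% band for a Gaussian.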

Combining two arrays in python term by term

long = np.array(data.Longitude)
lat = np.array(data.Latitude)
coordinates = np.array(385)
for i in range(385):
    coordinates[i] = np.array([lat[i], long[i]])
#x, y = kmeans2(whiten(coordinates), 3, iter = 20)
#plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
#plt.show()
I have a dataset with two columns, and I wish to merge the latitude and longitude term by term so that I can apply k-means clustering afterwards. Please help with the array part.
coordinates = np.array([lat, long])
or am I missing something here...
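One detail that may matter for the clustering step: np.array([lat, long]) has shape (2, N), whereas routines such as kmeans2 expect one observation per row, i.e. shape (N, 2). A minimal sketch with made-up coordinates (the real values would come from data.Latitude and data.Longitude):

```python
import numpy as np

# Made-up stand-ins for data.Latitude and data.Longitude.
lat = np.array([52.32, 52.10, 51.97])
long = np.array([4.78, 5.18, 4.45])

# Pair the values term by term: row i is [lat[i], long[i]].
coordinates = np.column_stack((lat, long))
print(coordinates.shape)

# Equivalent alternative: transpose the (2, N) array.
same = np.array([lat, long]).T
print(np.array_equal(coordinates, same))
```

Either form avoids the explicit loop and the hard-coded length of 385.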
