GLM posterior predictive not plotting over complete range of data - python

I cannot figure out why the GLM posterior predictive is not plotted over the entire data range, but only over a fraction of it, and there seems to be no parameter that alters this. This is the code that generates the problematic plot below.
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

plt.figure(figsize=(7, 7))
x = np.linspace(0, 10, 30)
y = x + np.random.normal(2, 0.6, len(x))
plt.scatter(x, y)
data = dict(x=x, y=y)

with pm.Model() as model:
    pm.glm.GLM.from_formula('y ~ x', data)
    trace = pm.sample(1000)

plt.plot(x, y, 'x', label='data')
pm.plot_posterior_predictive_glm(trace, samples=100,
                                 label='posterior predictive regression lines')
plt.plot(x, trace['Intercept'].mean() + trace['x'].mean() * x,
         label='true regression line', lw=3., c='y')
plt.title('Posterior predictive regression lines')
plt.legend(loc=0)
plt.xlabel('x')
plt.ylabel('y');
https://i.stack.imgur.com/2NLtP.png

Looking at the source code, plot_posterior_predictive_glm's default x-axis values are between 0 and 1. You can change that by passing your own evaluation points via the eval argument:
pm.plot_posterior_predictive_glm(trace, samples=100, eval=x,
                                 label='posterior predictive regression lines')
Running your code with the above modification, I get the following plot:
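As a further note, if you want the predictive lines drawn on a smoother grid that spans the observed range, you can also pass a denser set of evaluation points. A minimal sketch, assuming the same x, trace, and imports as above:

# hedged sketch: an explicit evaluation grid instead of the raw data points
x_grid = np.linspace(x.min(), x.max(), 200)
pm.plot_posterior_predictive_glm(trace, samples=100, eval=x_grid,
                                 label='posterior predictive regression lines')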

Related

Polynomial regression plot looking weird

I'm trying to plot a fitted polynomial using matplotlib.
My code:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# 'data' is the DataFrame holding the LSTAT and MEDV columns
x = data['LSTAT'].values.reshape(-1,1).copy()
y = data['MEDV'].values.reshape(-1,1).copy()
plt.figure(figsize=(8,5))
polynomial_features = PolynomialFeatures(degree=2)
xp = polynomial_features.fit_transform(x)
#xp.sort(axis=0)
model = LinearRegression().fit(xp,y)
y_pred = model.predict(xp)
plt.scatter(x,y)
plt.plot(x, y_pred, color='r')
plt.show()
My resulting plot:
Now, I have tried the fix proposed in these two posts:
wrong polynomial regression plot
why is my draw of 3-degree polymonial so weird?
If I uncomment xp.sort(axis=0), which is the proposed solution of 1), I get the following plot:
This is not correct.
If I try the proposed solution of 2),
plt.plot(np.sort(x),y_pred[np.argsort(x)], color='r')
I get the following error:
ValueError: x and y can be no greater than 2D, but have shapes (506, 1) and (506, 1, 1)
I'm not sure what is going on...
The problem is the order in which matplotlib plots the points.
I fixed it this way, but I'm sure there are easier fixes:
#fixing indexes and sorting
x_pd = pd.Series(x.flatten())
y_pred_pd = pd.Series(y_pred.flatten())
x_sorted = x_pd.sort_values()
Y_pred = np.array(y_pred_pd[x_sorted.index])
x_arr = np.array(x_sorted)
#plotting
plt.scatter(x,y)
plt.plot(x,y_pred, color='r', alpha=0.5)
plt.plot(x_arr, Y_pred, color='g')
plt.show()
plt.close()
The resulting plot:
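For what it's worth, an arguably simpler fix is to sort by x directly with NumPy before plotting. A minimal sketch, assuming x and y_pred from the question:

import numpy as np
import matplotlib.pyplot as plt

# hedged sketch: flatten, get the ordering of x, and plot both arrays in that order
order = np.argsort(x.flatten())
plt.scatter(x, y)
plt.plot(x.flatten()[order], y_pred.flatten()[order], color='g')
plt.show()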

Quantile regression for 2nd order polynomial using StatsModels

I have followed the examples here by PJW for plotting a 2nd order polynomial quantile regression. The OLS model seems to be a good fit for my data, but the quantile lines came out really wacky and I haven't been able to figure out where the code has gone wrong. I have attached my code below, along with the chart showing only the OLS regression line and the chart with the funky quantiles. Any help would be appreciated!
Scatter graph with a 2nd order polynomial regression line in red:
Same scatter graph with an OLS 2nd order polynomial regression line (black) and quantile lines (0.05, 0.5, 0.95) that are clearly wrong (red dotted):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

def plot_poly_centiles(parameter_df):
    # function to plot quantile lines using polynomial regression
    par_name = parameter_df.columns[1]
    # plot a scatter graph of the data
    plt.subplots(figsize=(10, 6))
    sns.scatterplot(x='age', y=par_name, data=parameter_df, marker='.', color='blue', alpha=0.1)
    model = smf.quantreg(f'{par_name} ~ age + np.power(age, 2)', parameter_df)
    result = model.fit(q=0.5)
    print(result.summary())
    # Quantile regression for 5 quantiles
    quantiles = [.05, .25, .50, .75, .95]
    # get all model result instances in a list
    result_all = [model.fit(q=q) for q in quantiles]
    result_ols = smf.ols(f'{par_name} ~ age + np.power(age, 2)', parameter_df).fit()
    # create x for prediction
    x = np.arange(parameter_df.age.min(), parameter_df.age.max(), 50)
    predicted_df = pd.DataFrame({'age': x})
    # plot quantile lines
    for qm, result in zip(quantiles, result_all):
        # get prediction for the model and plot
        # here we use a dict which works the same way as the df in ols
        y_cent = result.predict({'age': x})
        plt.plot(x, y_cent, linestyle='--', linewidth=1, color='red')
    # plot ols line
    y_ols_predicted = result_ols.predict(predicted_df)
    plt.plot(x, y_ols_predicted, color='k', linewidth=1, label='OLS')
    plt.xlabel('age in days')
    plt.ylabel(f'{par_name}')
    plt.title(f'Polynomial regression centiles of {par_name} in children')
    plt.show()
    return parameter_df

Fitting a curve to only a few data points

I have a scatter plot with only 5 data points, which I would like to fit a curve to. I have tried both polyfit and the following code, but neither is able to produce a curve with so few data points.
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

plt.plot(xdata, ydata, ".", label="Data")
optimizedParameters, pcov = opt.curve_fit(func, xdata, ydata)
plt.plot(xdata, func(xdata, *optimizedParameters), label="fit")
Attached is an example of the plot, along with an example of the kind of curve I am trying to produce (apologies for the bad drawing). Thanks!
Here is an example graphical Python fitter using the data in your comment, fitting to a Polytrope type of equation. In this example there is no need to take logs of the data. Here the X axis is plotted on a decade logarithmic scale. Please note that the data in the example code is in the form of floating point numbers.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

xData = numpy.array([7e-09, 9e-09, 1e-08, 2e-8, 1e-6])
yData = numpy.array([790.0, 870.0, 2400.0, 2450.0, 3100.0])

def func(x, a, b, offset): # polytrope equation from zunzun.com
    return a / numpy.power(x, b) + offset

# these are the same as the scipy defaults
initialParameters = numpy.array([1.0, 1.0, 1.0])

# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)

modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData

SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))

print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData), 1000)
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.xscale('log') # comment this out for default linear scaling

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
You would have to choose what you want to fit the curve to. From the look of your drawing, it seems you are trying to shape it to be somewhat logarithmic.
Here's a picture of a logarithmic regression:
A logarithmic regression follows the form y = A + B ln(x).
This is essentially a linear regression where, instead of fitting y vs. x,
we fit y vs. ln(x).
So you can just take the natural log of the x-values of the points in your dataset and run a linear regression algorithm on them. The resulting coefficients are then A and B for y = A + B ln(x).
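A minimal sketch of that idea with NumPy, assuming xdata and ydata are your arrays from the question:

import numpy as np

# hedged sketch: fit y = A + B*ln(x) by linear least squares of y against ln(x)
B, A = np.polyfit(np.log(xdata), ydata, 1)   # polyfit returns [slope, intercept]
x_plot = np.linspace(min(xdata), max(xdata), 200)
y_plot = A + B * np.log(x_plot)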
Picture Credits:
http://mathworld.wolfram.com/LeastSquaresFittingLogarithmic.html
Edit: As James Phillips pointed out in his answer, it is also possible to model the curve in the form y = Ax^(-B) + C, since with so few points it cannot be determined whether the graph has a horizontal asymptote or is always growing but decelerating. A lot of curves are possible (for instance, y = A*B^(-x) + C could be another one), but you would need to choose what to model the data after.
The exponential function does not fit your data well. Consider another modeling function.
Given
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
%matplotlib inline
x_samp = np.array([7e-09, 9e-09, 1e-08, 2e-8, 1e-6])
y_samp = np.array([790, 870, 2400, 2450, 3100])
def func(x, a, b):
    """Return a logarithmic model result."""
    return a + b*np.log(x)

def func2(x, a, b, c):
    """Return a 'power law' result."""
    return a/np.power(x, b) + c
Code
From @Allan Lago's logarithmic model:
# REGRESSION ------------------------------------------------------------------
x_lin = np.linspace(x_samp.min(), x_samp.max(), 50)
w, _ = opt.curve_fit(func, x_samp, y_samp)
print("Estimated Parameters", w)
# Model
y_model = func(x_lin, *w)
# PLOT ------------------------------------------------------------------------
# Visualize data and fitted curves
plt.plot(x_samp, y_samp, "ko", label="Data")
plt.plot(x_lin, y_model, "k--", label="Fit")
plt.xticks(np.arange(0, x_samp.max(), x_samp.max()/2))
plt.title("Least squares regression")
plt.legend(loc="upper left")
Estimated Parameters [8339.61062739 367.6992259 ]
Using @James Phillips' "Polytrope" model:
# REGRESSION ------------------------------------------------------------------
p0 = [1, 1, 1]
w, _ = opt.curve_fit(func2, x_samp, y_samp, p0=p0)
print("Estimated Parameters", w)
# Model
y_model = func2(x_lin, *w)
# PLOT ------------------------------------------------------------------------
# Visualize data and fitted curves
plt.plot(x_samp, y_samp, "ko", label="Data")
plt.plot(x_lin, y_model, "k--", label="Fit")
plt.xticks(np.arange(0, x_samp.max(), x_samp.max()/2))
plt.title("Least squares regression")
plt.legend()
Estimated Parameters [-3.49305043e-10 1.57259788e+00 3.05801283e+03]

How to plot a trendline on scatter-plot matplotlib based on KDE?

I am currently trying to plot a trend line on my scatter plot in matplotlib.
I am aware of numpy's polyfit function, but it does not do what I want.
So here is what I have so far:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# 'samples' and 'density_sm' are my data arrays
plot = plt.figure(figsize=(10,10)) #Set up the size of the figure
cmap = "viridis" #Set up the color map
plt.scatter(samples[1], samples[0], s=0.1, c=density_sm, cmap=cmap) #Plot the Cross-Plot
plt.colorbar().set_label('Density of points')
plt.axis('scaled')
plt.xlim(-0.3,0.3)
plt.ylim(-0.3,0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.axhline(0, color='green', alpha=0.5, linestyle="--")
plt.axvline(0, color='green', alpha=0.5, linestyle="--")
#Trend-line_1
z = np.polyfit(samples[1], samples[0], 1)
p = np.poly1d(z)
plt.plot(samples[0],p(samples[0]),color="#CC3333", linewidth=0.5)
#Trend-line_2
reg = sm.WLS(samples[0], samples[1]).fit()
plt.plot(samples[1], reg.fittedvalues)
And here is the result:
Scatter-plot with trends
What I want is:
Scatter-Plot_desired
The trend can easily be seen, but the question is which function to use.
The behaviour of polyfit is as expected and the result is correct; the problem is that polyfit does not do what you expect. All (typical) fitting routines minimize the vertical (y-axis) distance between the fit and the data points to be fit. What you seem to expect, however, is that it minimizes the euclidean distance between the fit and the data. See the difference in this figure:
Here is also code that illustrates the point with random data. Note that the linear relationship of the data (parameter a) is recovered by the fit, which would not be the case for a euclidean fit. Therefore the seemingly off fit is to be preferred.
import numpy as np
import matplotlib.pyplot as plt

N = 10000
a = -1
b = 0.1
datax = 0.3*b*np.random.randn(N)
datay = a*datax + b*np.random.randn(N)

plot = plt.figure(1, figsize=(10,10)) #Set up the size of the figure
plot.clf()
plt.scatter(datax, datay) #Plot the Cross-Plot

popt = np.polyfit(datax, datay, 1)
print("Result is {0:1.2f} and should be {1:1.2f}".format(popt[-2], a))
xplot = np.linspace(-1, 1, 1000)

def pol(x, popt):
    # evaluate the fitted polynomial (np.polyfit returns coefficients highest order first)
    popt = popt[::-1]
    res = 0
    for i, p in enumerate(popt):
        res += p*x**i
    return res

plt.plot(xplot, pol(xplot, popt))

plt.xlim(-0.3,0.3)
plt.ylim(-0.3,0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.tight_layout()
plt.show()
samples[0] is your "y" and samples[1] is your "x". In the trend-line plot, use samples[1].
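A minimal sketch of the corrected trend-line call, assuming samples from the question:

import numpy as np
import matplotlib.pyplot as plt

# hedged sketch: evaluate the fitted polynomial on the x values (samples[1]), not the y values
z = np.polyfit(samples[1], samples[0], 1)
p = np.poly1d(z)
plt.plot(samples[1], p(samples[1]), color="#CC3333", linewidth=0.5)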

Plotting confidence and prediction intervals with repeated entries

I have a correlation plot for two variables: the predictor variable (temperature) on the x-axis and the response variable (density) on the y-axis. My best-fit least-squares regression line is a 2nd order polynomial. I would also like to plot confidence and prediction intervals. The method described in this answer seems perfect. However, my dataset (n=2340) has repeated entries for many (x, y) pairs. My resulting plot looks like this:
Here is my relevant code (slightly modified from linked answer above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import summary_table
d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)
x = df.temp
y = df.dens
plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')
# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})
# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)', data=df).fit()
# this correctly plots my single 2nd-order poly best-fit line:
plt.plot(x1.temp, poly_2.predict(x1), 'g-', label='Poly n=2 $R^2$=%.2f' % poly_2.rsquared,
alpha=0.9)
prstd, iv_l, iv_u = wls_prediction_std(poly_2)
st, data, ss2 = summary_table(poly_2, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
# check we got the right things
print(np.max(np.abs(poly_2.fittedvalues - fittedvalues)))
print(np.max(np.abs(iv_l - predict_ci_low)))
print(np.max(np.abs(iv_u - predict_ci_upp)))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
The print statements evaluate to 0.0, as expected.
However, I need single lines for the polynomial best fit line, and the confidence and prediction intervals (rather than the multiple lines I currently have in my plot). Any ideas?
Update:
Following the first answer from @kpie, I ordered my confidence and prediction interval arrays according to temperature:
data_intervals = {'temp': x, 'predict_low': predict_ci_low, 'predict_upp': predict_ci_upp, 'conf_low': predict_mean_ci_low, 'conf_high': predict_mean_ci_upp}
df_intervals = pd.DataFrame(data=data_intervals)
df_intervals_sort = df_intervals.sort_values(by='temp')
This achieved the desired result:
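For reference, a minimal sketch of how the sorted frame can then be plotted as single smooth lines, assuming df_intervals_sort from above:

import matplotlib.pyplot as plt

# hedged sketch: plot each interval as one line, in temperature-sorted order
plt.plot(df_intervals_sort['temp'], df_intervals_sort['predict_low'], 'r--', lw=2)
plt.plot(df_intervals_sort['temp'], df_intervals_sort['predict_upp'], 'r--', lw=2)
plt.plot(df_intervals_sort['temp'], df_intervals_sort['conf_low'], 'r--', lw=2)
plt.plot(df_intervals_sort['temp'], df_intervals_sort['conf_high'], 'r--', lw=2)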
You need to order your predicted values based on temperature, I think.
To get nice curvy lines you will have to use numpy.polynomial.polynomial.polyfit. This will return a list of coefficients. You will have to split the x and y data into two lists so it fits in the function.
You can then plot this function with:
import numpy as np
import matplotlib.pyplot as plt

def strPolynomialFromArray(coeffs):
    # build a string such as "c0*x**0+c1*x**1+..." from the coefficient list
    return "".join([str(k) + "*x**" + str(n) + "+" for n, k in enumerate(coeffs)])[0:-1]

# xs, ys are your data lists; degree is the polynomial degree you choose
coeffs = np.polynomial.polynomial.polyfit(xs, ys, degree)
x = np.linspace(-15, 45, 300)  # your smooth line will be made of 300 smooth pieces
y = eval(strPolynomialFromArray(coeffs))
plt.plot(x, y)
You can look more into plotting smooth lines here; just remember all lines are linear splines, because continuous curvature is irrational.
I believe that the polynomial fitting is done with least-squares fitting (the process is described here).
Good luck!
