I have followed the examples here by PJW for plotting a 2nd order polynomial quantile regression. The OLS model seems to be a good fit for my data, but the quantile lines came out really wacky and I haven't been able to figure out where the code has gone wrong. I have attached my code below, along with the chart showing only the OLS regression line and the chart with the funky quantiles. Any help would be appreciated!
Scatter graph with the 2nd order polynomial regression line in red:
Same scatter graph with an OLS 2nd order polynomial regression line (black) and quantile lines (0.05, 0.5, 0.95) that are clearly wrong (red dotted):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

def plot_poly_centiles(parameter_df):
    # function to plot quantile lines using polynomial regression
    par_name = parameter_df.columns[1]

    # plot a scatter graph of the data
    plt.subplots(figsize=(10, 6))
    sns.scatterplot(x='age', y=par_name, data=parameter_df, marker='.', color='blue', alpha=0.1)

    model = smf.quantreg(f'{par_name} ~ age + np.power(age, 2)', parameter_df)
    result = model.fit(q=0.5)
    print(result.summary())

    # quantile regression for 5 quantiles
    quantiles = [.05, .25, .50, .75, .95]
    # get all model result instances in a list
    result_all = [model.fit(q=q) for q in quantiles]
    result_ols = smf.ols(f'{par_name} ~ age + np.power(age, 2)', parameter_df).fit()

    # create x for prediction
    x = np.arange(parameter_df.age.min(), parameter_df.age.max(), 50)
    predicted_df = pd.DataFrame({'age': x})

    # plot quantile lines
    for qm, result in zip(quantiles, result_all):
        # get prediction for the model and plot
        # here we use a dict which works the same way as the df in ols
        y_cent = result.predict({'age': x})
        plt.plot(x, y_cent, linestyle='--', linewidth=1, color='red')

    # plot ols line
    y_ols_predicted = result_ols.predict(predicted_df)
    plt.plot(x, y_ols_predicted, color='k', linewidth=1, label='OLS')

    plt.xlabel('age in days')
    plt.ylabel(f'{par_name}')
    plt.title(f'Polynomial regression centiles of {par_name} in children')
    plt.show()
    return parameter_df
I'm trying to plot a fitted polynomial using matplotlib:
my code:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = data['LSTAT'].values.reshape(-1, 1).copy()
y = data['MEDV'].values.reshape(-1, 1).copy()
plt.figure(figsize=(8, 5))
from sklearn.preprocessing import PolynomialFeatures
polynomial_features = PolynomialFeatures(degree=2)
xp = polynomial_features.fit_transform(x)
#xp.sort(axis=0)
model = LinearRegression().fit(xp, y)
y_pred = model.predict(xp)
plt.scatter(x, y)
plt.plot(x, y_pred, color='r')
plt.show()
my resulting plot:
Now, I have tried the fix proposed in these two posts:
wrong polynomial regression plot
why is my draw of 3-degree polynomial so weird?
If I uncomment the xp.sort(axis=0), which is the proposed solution of 1), I get the following plot:
Which is not correct.
If I try the proposed solution of 2)
plt.plot(np.sort(x),y_pred[np.argsort(x)], color='r')
I get the following error:
ValueError: x and y can be no greater than 2D, but have shapes (506, 1) and (506, 1, 1)
I'm not sure what is going on...
The problem is the order in which matplotlib plots the points.
I fixed this way, but I'm sure there are easier fixes:
#fixing indexes and sorting
x_pd = pd.Series(x.flatten())
y_pred_pd = pd.Series(y_pred.flatten())
x_sorted = x_pd.sort_values()
Y_pred = np.array(y_pred_pd[x_sorted.index])
x_arr = np.array(x_sorted)
#plotting
plt.scatter(x,y)
plt.plot(x,y_pred, color='r', alpha=0.5)
plt.plot(x_arr, Y_pred, color='g')
plt.show()
plt.close()
The resulting plot:
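For reference, a more compact route to the same result (a sketch, assuming the x and y_pred arrays from the question, both of shape (506, 1)) is to flatten before sorting, which also avoids the (506, 1, 1) shape error quoted above:

import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(x.ravel())                         # indices that put x in ascending order
plt.scatter(x, y)
plt.plot(x.ravel()[order], y_pred.ravel()[order], color='g')
plt.show()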
I have an example dataframe like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'a': [0.05, 0.11, 0.18, 0.20, 0.22, 0.27],
                   'b': [3.14, 1.56, 33.10, 430.00, 239.10, 2600.22]})
I would like to plot these properties as a scatter plot and then show the linear tendency line of these samples, and I need to put the data on the y-axis (df['b']) on a log scale.
However, when I try to do that with the aid of np.polyfit, I get a strange line.
# Coefficients for polynomial function (degree 1)
coefs = np.polyfit(df['a'], df['b'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b'], s = 50, edgecolors = 'black')
plt.plot(df['a'], fit_coefs(df['a']), color='red',linestyle='--')
plt.xlabel('a')
plt.ylabel('b')
plt.yscale('log')
And if I convert df['b'] to log before the plot, I do get the right linear tendency, but then the y-axis shows the converted log values (as in the plot below) rather than the original values of the previous plot:
df['b_log'] = np.log10(df['b'])
coefs = np.polyfit(df['a'], df['b_log'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b_log'], s = 50, edgecolors = 'black')
plt.plot(df['a'], fit_coefs(df['a']), color='red', linestyle='--')
plt.xlabel('a')
plt.ylabel('b_log')
So basically, I need a plot like the last one, but with the y-axis values shown as in the second plot, while still getting the right linear tendency. Could anyone help me?
You are doing two different things there: First, you are fitting a linear curve to your exponential data (which is presumably not what you want), then you are fitting a linear curve to your log data, which is ok.
In order to get the linear curve from the linear coefficients in the logarithmic plot, you can just plot 10**fit_coefs(df['a']): the fit gives log10(b) ≈ m*a + c, so raising 10 to that value recovers b ≈ 10**(m*a + c), which appears as a straight line on the log-scaled axis:
df['b_log'] = np.log10(df['b'])
coefs = np.polyfit(df['a'], df['b_log'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b'], s = 50, edgecolors = 'black')
plt.plot(df['a'], 10**fit_coefs(df['a']), color='red', linestyle='--')
plt.xlabel('a')
plt.ylabel('b')
plt.yscale("log")
I cannot figure out why the GLM posterior predictive is not plotted over the entire data, but only over a fraction of it, and there seem to be no parameters which can alter this. This is the code which generates the following problematic plot.
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

plt.figure(figsize=(7, 7))
x = np.linspace(0, 10, 30)
y = x + np.random.normal(2, 0.6, len(x))
plt.scatter(x, y)
data = dict(x=x, y=y)

with pm.Model() as model:
    pm.glm.GLM.from_formula('y ~ x', data)
    trace = pm.sample(1000)

plt.plot(x, y, 'x', label='data')
pm.plot_posterior_predictive_glm(trace, samples=100,
                                 label='posterior predictive regression lines')
plt.plot(x, trace['Intercept'].mean() + trace['x'].mean()*x,
         label='true regression line', lw=3., c='y')
plt.title('Posterior predictive regression lines')
plt.legend(loc=0)
plt.xlabel('x')
plt.ylabel('y');
https://i.stack.imgur.com/2NLtP.png
Looking at the source code, plot_posterior_predictive_glm's default x-axis values are between 0 and 1. You can change that by calling the function as follows:
pm.plot_posterior_predictive_glm(trace, samples=100, eval=x,
                                 label='posterior predictive regression lines')
Running your code with the above modification I get the following plot:
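If you prefer not to rely on the helper's defaults at all, you can also draw the regression lines yourself from the trace over the full x range (a rough sketch, assuming the trace variable names 'Intercept' and 'x' from the model above):

import numpy as np

draws = np.random.randint(0, len(trace['Intercept']), 100)  # 100 random posterior draws
for i in draws:
    plt.plot(x, trace['Intercept'][i] + trace['x'][i] * x, color='C1', alpha=0.1)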
I am currently trying to plot a trend line on my scatter plot in matplotlib.
I am aware of numpy's polyfit function, but it does not do what I want.
So here what I have so far:
plot = plt.figure(figsize=(10,10)) #Set up the size of the figure
cmap = "viridis" #Set up the color map
plt.scatter(samples[1], samples[0], s=0.1, c=density_sm, cmap=cmap) #Plot the Cross-Plot
plt.colorbar().set_label('Density of points')
plt.axis('scaled')
plt.xlim(-0.3,0.3)
plt.ylim(-0.3,0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.axhline(0, color='green', alpha=0.5, linestyle="--")
plt.axvline(0, color='green', alpha=0.5, linestyle="--")
#Trend-line_1
z = np.polyfit(samples[1], samples[0], 1)
p = np.poly1d(z)
plt.plot(samples[0],p(samples[0]),color="#CC3333", linewidth=0.5)
#Trend-line_2
reg = sm.WLS(samples[0], samples[1]).fit()
plt.plot(samples[1], reg.fittedvalues)
And here is the result:
Scatter-plot with trends
What I want is:
Scatter-Plot_desired
The trend can easily be seen, but the question is what function to use?
The behaviour of polyfit is as expected and the result is correct. The problem is that polyfit does not do what you expect. All (typical) fitting routines minimize the vertical (y-axis) distance between the fit and the data points to be fit. What you seem to expect, however, is that it minimizes the euclidean distance between the fit and the data. See the difference in this figure:
Below is also code that illustrates this with random data. Note that the linear relationship of the data (parameter a) is recovered by the fit, which would not be the case for a euclidean fit. Therefore the seemingly off fit is to be preferred.
import numpy as np
import matplotlib.pyplot as plt

N = 10000
a = -1
b = 0.1
datax = 0.3*b*np.random.randn(N)
datay = a*datax + b*np.random.randn(N)

plot = plt.figure(1, figsize=(10, 10))  # set up the size of the figure
plot.clf()
plt.scatter(datax, datay)  # plot the cross-plot

popt = np.polyfit(datax, datay, 1)
print("Result is {0:1.2f} and should be {1:1.2f}".format(popt[-2], a))
xplot = np.linspace(-1, 1, 1000)

def pol(x, popt):
    # evaluate the polynomial with coefficients popt (highest order first)
    popt = popt[::-1]
    res = 0
    for i, p in enumerate(popt):
        res += p*x**i
    return res

plt.plot(xplot, pol(xplot, popt))
plt.xlim(-0.3, 0.3)
plt.ylim(-0.3, 0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.tight_layout()
plt.show()
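For completeness: if a fit that minimizes the euclidean (orthogonal) distance really is what is wanted, scipy's odr module can provide one. A minimal sketch on the same random data (datax, datay and xplot from the block above; the beta0 starting guess is mine), offered for comparison rather than as a recommendation:

from scipy import odr

def linear(beta, x):
    # straight line: beta[0]*x + beta[1]
    return beta[0]*x + beta[1]

odr_result = odr.ODR(odr.RealData(datax, datay), odr.Model(linear), beta0=[-1.0, 0.0]).run()
print("Orthogonal-distance slope: {0:1.2f}".format(odr_result.beta[0]))
# add this before plt.show() above if you want the line on the same figure
plt.plot(xplot, linear(odr_result.beta, xplot), color='orange')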
samples[0] is your "y" and samples[1] is your "x". In the trend line plot use samples[1].
I have a correlation plot for two variables, the predictor variable (temperature) on the x-axis, and the response variable (density) on the y-axis. My best fit least squares regression line is a 2nd order polynomial. I would like to also plot confidence and prediction intervals. The method described in this answer seems perfect. However, my dataset (n=2340) has repeated entries for many (x,y) pairs. My resulting plot looks like this:
Here is my relevant code (slightly modified from linked answer above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import summary_table
d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)
x = df.temp
y = df.dens
plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')
# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})
# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)', data=df).fit()
# this correctly plots my single 2nd-order poly best-fit line:
plt.plot(x1.temp, poly_2.predict(x1), 'g-',
         label='Poly n=2 $R^2$=%.2f' % poly_2.rsquared, alpha=0.9)
prstd, iv_l, iv_u = wls_prediction_std(poly_2)
st, data, ss2 = summary_table(poly_2, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
# check we got the right things
print(np.max(np.abs(poly_2.fittedvalues - fittedvalues)))
print(np.max(np.abs(iv_l - predict_ci_low)))
print(np.max(np.abs(iv_u - predict_ci_upp)))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
The print statements evaluate to 0.0, as expected.
However, I need single lines for the polynomial best fit line, and the confidence and prediction intervals (rather than the multiple lines I currently have in my plot). Any ideas?
Update:
Following the first answer from @kpie, I ordered my confidence and prediction interval arrays according to temperature:
data_intervals = {'temp': x, 'predict_low': predict_ci_low, 'predict_upp': predict_ci_upp, 'conf_low': predict_mean_ci_low, 'conf_high': predict_mean_ci_upp}
df_intervals = pd.DataFrame(data=data_intervals)
df_intervals_sort = df_intervals.sort_values(by='temp')
This achieved desired results:
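For reference, newer statsmodels versions also offer get_prediction, which gives single smooth interval lines directly on the evenly spaced x1 grid with no sorting needed (a sketch reusing poly_2 and x1 from the code above):

pred = poly_2.get_prediction(x1).summary_frame(alpha=0.05)
plt.plot(x1.temp, pred['mean'], 'g-')            # fitted line
plt.plot(x1.temp, pred['mean_ci_lower'], 'r--')  # confidence interval
plt.plot(x1.temp, pred['mean_ci_upper'], 'r--')
plt.plot(x1.temp, pred['obs_ci_lower'], 'b--')   # prediction interval
plt.plot(x1.temp, pred['obs_ci_upper'], 'b--')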
You need to order your predicted values based on temperature, I think.
To get nice curvy lines you will have to use numpy.polynomial.polynomial.polyfit. This will return a list of coefficients. You will have to split the x and y data into two lists so it fits into the function.
You can then plot this function with:
import numpy as np
import matplotlib.pyplot as plt

def strPolynomialFromArray(coeffs):
    # build a string like "c0*x**0+c1*x**1+..." from the coefficient array
    return "".join([str(k) + "*x**" + str(n) + "+" for n, k in enumerate(coeffs)])[0:-1]

x = np.linspace(-15, 45, 300)  # your smooth line will be made of 300 smooth pieces
y = eval(strPolynomialFromArray(np.polynomial.polynomial.polyfit(xs, ys, degree)))
plt.plot(x, y)
You can look more into plotting smooth lines here; just remember that all plotted lines are really linear splines, because matplotlib only draws straight segments between points.
I believe that the polynomial fitting is done with least squares fitting (process described here)
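As a side note, the string-and-eval step above can be avoided with numpy's own polynomial evaluation; a small sketch assuming xs, ys and degree as in the answer:

import numpy as np
import matplotlib.pyplot as plt

coeffs = np.polynomial.polynomial.polyfit(xs, ys, degree)  # lowest-order coefficient first
x_smooth = np.linspace(-15, 45, 300)
plt.plot(x_smooth, np.polynomial.polynomial.polyval(x_smooth, coeffs))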
Good Luck!