Polynomial regression plot looking weird - python

I'm trying to plot a fitted polynomial using matplotlib:
my code:
x = data['LSTAT'].values.reshape(-1,1).copy()
y = data['MEDV'].values.reshape(-1,1).copy()
plt.figure(figsize=(8,5))
from sklearn.preprocessing import PolynomialFeatures
polynomial_features= PolynomialFeatures(degree=2)
xp = polynomial_features.fit_transform(x)
#xp.sort(axis=0)
model = LinearRegression().fit(xp,y)
y_pred = model.predict(xp)
plt.scatter(x,y)
plt.plot(x, y_pred, color='r')
plt.show()
my resulting plot:
Now, I have tried the fix proposed in these two posts:
wrong polynomial regression plot
why is my draw of 3-degree polymonial so weird?
if I uncomment the xp.sort(axis=0), which is the proposed solution of 1), I get the following plot:
Which is not correct.
If I try the proposed solution of 2)
plt.plot(np.sort(x),y_pred[np.argsort(x)], color='r')
I get the following error:
ValueError: x and y can be no greater than 2D, but have shapes (506, 1) and (506, 1, 1)
I'm not sure what is going on...

The problem is the order that matplotlib plots.
I fixed this way, but I'm sure there are easier fixes:
#fixing indexes and sorting
x_pd = pd.Series(x.flatten())
y_pred_pd = pd.Series(y_pred.flatten())
x_sorted = x_pd.sort_values()
Y_pred = np.array(y_pred_pd[x_sorted.index])
x_arr = np.array(x_sorted)
#plotting
plt.scatter(x,y)
plt.plot(x,y_pred, color='r', alpha=0.5)
plt.plot(x_arr, Y_pred, color='g')
plt.show()
plt.close()
The resulting plot:

Related

Python: log-lin plot with negative values

I am attempting to make a log-lin plot where my y-axis has negative values. I also want to fit a best-fit line to it.
Here's the lin-lin plot:
Here's the plot with the (wrong) code:
plt.scatter(x, y, c='indianred', alpha=0.5)
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"-", color="grey", alpha=0.5)
plt.yscale('log')
plt.show()
Even tho the line for the plot is wrong, it plots the y-axis as I want it (I believe). However, if I comment plt.plot(x,p(x),"-", color="grey", alpha=0.5), I get the following plot instead:
This is essentially the same I get when writing the code the proper way:
plt.scatter(x, y, c='indianred', alpha=0.5)
p = np.polyfit(x, np.log(y), 1)
plt.semilogy(x, np.exp(p[0] * x + p[1]), 'g--')
plt.yscale('log')
plt.show()
but I obviously get the following error due to my negative y-values:
RuntimeWarning: invalid value encountered in log
result = getattr(ufunc, method)(*inputs, **kwargs)
What alternative do I have so my y-values can be better seen? A best-fit line is not a must. I just want to better observe the relationship between these variables (which should be correlated theoretically).

Difficult to plot linear regression line on scatter plot with log scale

I have a example dataframe like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'a':[0.05, 0.11, 0.18, 0.20, 0.22, 0.27],
'b':[3.14, 1.56, 33.10, 430.00, 239.10, 2600.22]})
I would like to plot these properties as a scatter plot and then show the linear tendency line of these samples. And I need to put the data on the y axis (df['b']) on log scale.
Although, when I try to do that using the aid of np.polyfit, I get a strange line.
# Coefficients for polynomial function (degree 1)
coefs = np.polyfit(df['a'], df['b'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b'], s = 50, edgecolors = 'black')
plt.plot(df['a'], fit_coefs(df['a']), color='red',linestyle='--')
plt.xlabel('a')
plt.ylabel('b')
plt.yscale('log')
And if I convert df['b] to log before the plot, I can get the right linear tendency, but I would like to show the y-axis with the values of the last plot and not as converted log values as this one below:
df['b_log'] = np.log10(df['b'])
coefs = np.polyfit(df['a'], df['b_log'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b_log'], s = 50, edgecolors = 'black')
plt.plot(df['a'], fit_coefs(df['a']), color='red', linestyle='--')
plt.xlabel('a')
plt.ylabel('b_log')
So basically, I need a plot like the last one, but the values on y-axis should be like the second plot and I still would get the right linear tendency. Anyone could help me?
You are doing two different things there: First, you are fitting a linear curve to your exponential data (which is presumably not what you want), then you are fitting a linear curve to your log data, which is ok.
In order to get the linear curve from the linear coefficients in the logarithmic plot, you can just do 10**fit_coefs(df['a']):
df['b_log'] = np.log10(df['b'])
coefs = np.polyfit(df['a'], df['b_log'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b'], s = 50, edgecolors = 'black')
plt.plot(df['a'], 10**fit_coefs(df['a']), color='red', linestyle='--')
plt.xlabel('a')
plt.ylabel('b_log')
plt.yscale("log")

How to plot a trendline on scatter-plot matplotlib based on KDE?

I am currently trying to plot a trend-line plot on my scatter-plot in MatPlotLib.
I am aware of numpy polyfit function. It does not do what I want.
So here what I have so far:
plot = plt.figure(figsize=(10,10)) #Set up the size of the figure
cmap = "viridis" #Set up the color map
plt.scatter(samples[1], samples[0], s=0.1, c=density_sm, cmap=cmap) #Plot the Cross-Plot
plt.colorbar().set_label('Density of points')
plt.axis('scaled')
plt.xlim(-0.3,0.3)
plt.ylim(-0.3,0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.axhline(0, color='green', alpha=0.5, linestyle="--")
plt.axvline(0, color='green', alpha=0.5, linestyle="--")
#Trend-line_1
z = np.polyfit(samples[1], samples[0], 1)
p = np.poly1d(z)
plt.plot(samples[0],p(samples[0]),color="#CC3333", linewidth=0.5)
#Trend-line_2
reg = sm.WLS(samples[0], samples[1]).fit()
plt.plot(samples[1], reg.fittedvalues)
And here is the result:
Scatter-plot with trends
What I want is:
Scatter-Plot_desired
Trend can easily be seen, but the question is what function to use?
The behaviour of polyfit is as excepted and the result is correct. The problem is that polyfit does not do, what you expect. All (typical) fitting routines minimize the vertical (y-axis) distance between the fit and the data points to be fit. What you seem to expect is however that it minimizes the euclidean distance between the fit and the data. See the difference in this figure:
Here see also code that illustrates the fact with random data. Note that the linear relationship of the data (parameter a) is recovered by the fit, which would not be the case for the euclidean fit. Therefore the seemingly off fit is to be prefered.
N = 10000
a = -1
b = 0.1
datax = 0.3*b*np.random.randn(N)
datay = a*datax+b*np.random.randn(N)
plot = plt.figure(1,figsize=(10,10)) #Set up the size of the figure
plot.clf()
plt.scatter(datax,datay) #Plot the Cross-Plot
popt = np.polyfit(datax,datay,1)
print("Result is {0:1.2f} and should be {1:1.2f}".format(popt[-2],a))
xplot = np.linspace(-1,1,1000)
def pol(x,popt):
popt = popt[::-1]
res = 0
for i,p in enumerate(popt):
res += p*x**i
return res
plt.plot(xplot,pol(xplot,popt))
plt.xlim(-0.3,0.3)
plt.ylim(-0.3,0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.tight_layout()
plt.show()
samples[0] is your "y" and samples[1] is your "x". In the trend line plot use samples[1].

how to fit data and then sample from the fitted function to draw curve

Given two arrays x and y,I was trying to use np.polyfit function to fit the data,using the following way:
z = np.polyfit(x, y, 20)
f = np.poly1d(z)
but since i want to plot a line chart instead of a smooth curve, so then i use this function f to sample an array for plotting line.
x_new = np.linspace(x[0], x[-1], fitting_size)
y_new = np.zeros(fitting_size)
for t in range(fitting_size):
y_new[t] = f(x_new[t])
plt.plot(x_new, y_new, marker='v', ms=1)
The problem is that the above segment code stills gives me a smooth curve. How can i fix it? Thanks.
Unfortunately the intention behind the question is a bit unclear. However, if you want to perform a linear fit, you need to provide the degree deg=1 to polyfit. There is then no reason to sample from the fit; one can simply use the same input array and apply the fitting function to it.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1,5,20)
y = 3*x**2+np.random.rand(len(x))*10
z = np.polyfit(x, y, 1)
f = np.poly1d(z)
z2 = np.polyfit(x, y, 2)
f2 = np.poly1d(z2)
plt.plot(x,y, marker=".", ls="", c="k", label="data")
plt.plot(x, f(x), label="linear fit")
plt.plot(x, f2(x), label="quadratic fit")
plt.legend()
plt.show()

Plotting confidence and prediction intervals with repeated entries

I have a correlation plot for two variables, the predictor variable (temperature) on the x-axis, and the response variable (density) on the y-axis. My best fit least squares regression line is a 2nd order polynomial. I would like to also plot confidence and prediction intervals. The method described in this answer seems perfect. However, my dataset (n=2340) has repeated entries for many (x,y) pairs. My resulting plot looks like this:
Here is my relevant code (slightly modified from linked answer above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import summary_table
d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)
x = df.temp
y = df.dens
plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')
# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})
# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)', data=df).fit()
# this correctly plots my single 2nd-order poly best-fit line:
plt.plot(x1.temp, poly_2.predict(x1), 'g-', label='Poly n=2 $R^2$=%.2f' % poly_2.rsquared,
alpha=0.9)
prstd, iv_l, iv_u = wls_prediction_std(poly_2)
st, data, ss2 = summary_table(poly_2, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
# check we got the right things
print np.max(np.abs(poly_2.fittedvalues - fittedvalues))
print np.max(np.abs(iv_l - predict_ci_low))
print np.max(np.abs(iv_u - predict_ci_upp))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
The print statements evaluate to 0.0, as expected.
However, I need single lines for the polynomial best fit line, and the confidence and prediction intervals (rather than the multiple lines I currently have in my plot). Any ideas?
Update:
Following first answer from #kpie, I ordered my confidence and prediction interval arrays according to temperature:
data_intervals = {'temp': x, 'predict_low': predict_ci_low, 'predict_upp': predict_ci_upp, 'conf_low': predict_mean_ci_low, 'conf_high': predict_mean_ci_upp}
df_intervals = pd.DataFrame(data=data_intervals)
df_intervals_sort = df_intervals.sort(columns='temp')
This achieved desired results:
You need to order your predict values based on temperature. I think*
So to get nice curvy lines you will have to use numpy.polynomial.polynomial.polyfit This will return a list of coefficients. You will have to split the x and y data into 2 lists so it fits in the function.
You can then plot this function with:
def strPolynomialFromArray(coeffs):
return("".join([str(k)+"*x**"+str(n)+"+" for n,k in enumerate(coeffs)])[0:-1])
from numpy import *
from matplotlib.pyplot import *
x = linespace(-15,45,300) # your smooth line will be made of 300 smooth pieces
y = exec(strPolynomialFromArray(numpy.polynomial.polynomial.polyfit(xs,ys,degree)))
plt.plot(x , y)
You can look more into plotting smooth lines here just remember all lines are linear splines, becasue continuous curvature is irrational.
I believe that the polynomial fitting is done with least squares fitting (process described here)
Good Luck!

Categories

Resources