Plotting a linear regression with dates in matplotlib.pyplot - python

How would I plot a linear regression with dates in pyplot? I wasn't able to find a definitive answer to this question. This is what I've tried (courtesy of W3Schools' tutorial on linear regression).
import matplotlib.pyplot as plt
from scipy import stats
x = ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019', '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019', '01/10/2019', '01/11/2019', '01/12/2019']
y = [12050, 17044, 14066, 16900, 19979, 17593, 14058, 16003, 15095, 12785, 12886, 20008]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

You first have to convert your dates into numbers to be able to do a regression (and to plot for that matter). Then you can instruct matplotlib to interpret the x-values as dates to get a nicely formatted axis:
import matplotlib.pyplot as plt
import matplotlib.dates
from scipy import stats
import datetime
x = ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019', '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019', '01/10/2019', '01/11/2019', '01/12/2019']
y = [12050, 17044, 14066, 16900, 19979, 17593, 14058, 16003, 15095, 12785, 12886, 20008]
# convert the dates to numbers, using the datetime module
x = [datetime.datetime.strptime(i, '%m/%d/%Y').toordinal() for i in x]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x, mymodel)
# instruct matplotlib on how to convert the numbers back into dates for the x-axis
l = matplotlib.dates.AutoDateLocator()
f = matplotlib.dates.AutoDateFormatter(l)
ax.xaxis.set_major_locator(l)
ax.xaxis.set_major_formatter(f)
plt.show()
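Since the fit works on ordinals, you can also evaluate it for any date; a small illustrative example (the date below is an assumption, not from the original question):
# predict the fitted value for a future date (illustrative)
future = datetime.datetime.strptime('02/01/2020', '%m/%d/%Y').toordinal()
print(slope * future + intercept)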


Plotting regression line with log y scale

I have two plots I want to show (the original data and then its regression line). Whenever I run this code, the regression line doesn't run through the data at all. I suspect this has to do with plotting the original data on a log scale for the y axis (I tried including this when running polyfit, but I'm still having issues).
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(a, b)
plt.yscale('log')
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.plot(a, (slope*a)+intercept)
plt.show()
You are fitting log(b) = slope * a + intercept, which is equivalent to b = np.exp(slope*a + intercept).
In matplotlib, you either have to make the plot using a linear scale, with log(b) as a variable:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.figure()
plt.scatter(a, np.log(b))
plt.plot(a, (slope*a)+intercept)
plt.show()
In this case, you do not use plt.yscale('log') as your axis is already scaled with respect to log(b).
On the other hand, you can plot the linear variables with a logarithmic scale:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.figure()
plt.yscale('log')
plt.scatter(a, b)
plt.plot(a, np.exp((slope*a)+intercept))
plt.show()
Equivalently, you can wrap the regression line in a helper function; note that the arguments must be passed in the order the function declares them:
import numpy as np
import matplotlib.pyplot as plt
def regression(m, x, b):
    return m * x + b
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.figure()
plt.scatter(a, np.log(b))
plt.plot(a, regression(slope, a, intercept))
plt.show()

Iterate through linear regression while outputting plots In Python (SciPy & MatPlotLib)

I'm trying to iterate through a for loop that runs three regressions over a pandas DataFrame, plotting the regression line for each variable.
year = crime_df.iloc[:,0]
violent_crime_rate = crime_df.iloc[:,3]
murder_rate = crime_df.iloc[:,5]
aggravated_assault_rate = crime_df.iloc[:,11]
x_axis = [violent_crime_rate, murder_rate, aggravated_assault_rate]
for x in x_axis:
    slope, intercept, r_value, p_value, std_err = linregress(year, x)
    fit = slope * year + intercept
    fig, ax = plt.subplots()
    fig.suptitle('x', fontsize=16, fontweight="bold")
    ax.plot(year, x, linewidth=0, marker='o')
    ax.plot(year, fit, 'b--')
plt.show()
The code produces 3 plots with the literal title 'x' and distinct regression lines, but I would like to know how to set a title (and labels) for each plot from the variable used in that loop iteration. I'm unsure how to retrieve the variable names from the list I'm referencing. I tried str(x) in the suptitle line, but that returned the values in the column rather than the variable name.
something like this?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
matrix = np.random.rand(4,12) # emulate some data
crime_df = pd.DataFrame(matrix)# emulate some data
year = crime_df.iloc[:,0]
violent_crime_rate = crime_df.iloc[:,3]
murder_rate = crime_df.iloc[:,5]
aggravated_assault_rate = crime_df.iloc[:,11]
names = ['violent_crime_rate','murder_rate','aggravated_assault_rate']
x_axis = [violent_crime_rate, murder_rate, aggravated_assault_rate]
def linregress(year, x):  # emulate some data
    return np.random.rand(5)
for ind, x in enumerate(x_axis):
    slope, intercept, r_value, p_value, std_err = linregress(year, x)
    fit = slope * year + intercept
    fig, ax = plt.subplots()
    fig.suptitle('x:' + str(names[ind]), fontsize=16, fontweight="bold")
    ax.plot(year, x, linewidth=0, marker='o', label=names[ind] + ':1')
    ax.plot(year, fit, 'b--', label=names[ind] + ':2')
    ax.legend()
plt.show()
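If you'd rather not maintain parallel name and value lists, a small variant (same emulated data and the same stand-in for linregress) keeps each name attached to its series in a dict:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

crime_df = pd.DataFrame(np.random.rand(4, 12))  # emulate some data
year = crime_df.iloc[:, 0]
series = {
    'violent_crime_rate': crime_df.iloc[:, 3],
    'murder_rate': crime_df.iloc[:, 5],
    'aggravated_assault_rate': crime_df.iloc[:, 11],
}
for name, x in series.items():
    slope, intercept, r_value, p_value, std_err = np.random.rand(5)  # stand-in for linregress
    fit = slope * year + intercept
    fig, ax = plt.subplots()
    fig.suptitle(name, fontsize=16, fontweight="bold")
    ax.plot(year, x, linewidth=0, marker='o', label=name)
    ax.plot(year, fit, 'b--', label='fit')
    ax.legend()
plt.show()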

How to smoothen data in Python?

I am trying to smooth the scatter plot shown below using SciPy's B-spline representation of a 1-D curve. The data is available here.
The code I used is:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
data = np.genfromtxt("spline_data.dat", delimiter = '\t')
x = 1000 / data[:, 0]
y = data[:, 1]
x_int = np.linspace(x[0], x[-1], 100)
tck = interpolate.splrep(x, y, k = 3, s = 1)
y_int = interpolate.splev(x_int, tck, der = 0)
fig = plt.figure(figsize = (5.15,5.15))
plt.subplot(111)
plt.plot(x, y, marker = 'o', linestyle='')
plt.plot(x_int, y_int, linestyle = '-', linewidth = 0.75, color='k')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
I tried changing the order of the spline and the smoothing condition, but I am not getting a smooth plot.
B-spline interpolation should be able to smooth the data, so what is wrong? Is there an alternate method to smooth this data?
Use a larger smoothing parameter. For example, s=1000:
tck = interpolate.splrep(x, y, k=3, s=1000)
This produces a much smoother curve.
Assuming we are dealing with noisy observations of some phenomenon, Gaussian Process Regression might also be a good choice. Knowledge about the variance of the noise can be included in the parameters (nugget), and other parameters can be found using maximum-likelihood estimation. Here's a simple example of how it could be applied (note this uses the old scikit-learn GaussianProcess class, which was replaced by GaussianProcessRegressor in scikit-learn 0.18):
import matplotlib.pyplot as plt
import numpy as np
from sklearn.gaussian_process import GaussianProcess
data = np.genfromtxt("spline_data.dat", delimiter='\t')
x = 1000 / data[:, 0]
y = data[:, 1]
x_pred = np.linspace(x[0], x[-1], 100)
# <GP regression>
gp = GaussianProcess(theta0=1, thetaL=0.00001, thetaU=1000, nugget=0.000001)
gp.fit(np.atleast_2d(x).T, y)
y_pred = gp.predict(np.atleast_2d(x_pred).T)
# </GP regression>
fig = plt.figure(figsize=(5.15, 5.15))
plt.subplot(111)
plt.plot(x, y, marker='o', linestyle='')
plt.plot(x_pred, y_pred, linestyle='-', linewidth=0.75, color='k')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
which gives a similarly smooth result.
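As an aside, a rough equivalent with the current scikit-learn API (0.18 and later) might look like the sketch below; the RBF kernel and the alpha value, which plays the role of the old nugget, are illustrative assumptions rather than part of the original answer:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

data = np.genfromtxt("spline_data.dat", delimiter='\t')
x = 1000 / data[:, 0]
y = data[:, 1]
x_pred = np.linspace(x[0], x[-1], 100)
# kernel hyperparameters are tuned by maximum likelihood inside fit()
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6)  # alpha ~ the old nugget
gp.fit(np.atleast_2d(x).T, y)
y_pred = gp.predict(np.atleast_2d(x_pred).T)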
In your specific case, you could also try changing the last argument of the np.linspace function to a smaller number, np.linspace(x[0], x[-1], 10), for example.
Demo code:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
data = np.random.rand(100,2)
tempx = list(data[:, 0])
tempy = list(data[:, 1])
x = np.array(sorted([point*10 + tempx.index(point) for point in tempx]))
y = np.array([point*10 + tempy.index(point) for point in tempy])
x_int = np.linspace(x[0], x[-1], 10)
tck = interpolate.splrep(x, y, k = 3, s = 1)
y_int = interpolate.splev(x_int, tck, der = 0)
fig = plt.figure(figsize = (5.15,5.15))
plt.subplot(111)
plt.plot(x, y, marker = 'o', linestyle='')
plt.plot(x_int, y_int, linestyle = '-', linewidth = 0.75, color='k')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
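A further option, offered here only as a hedged sketch since it is not one of the original answers: a Savitzky-Golay filter smooths the measured y directly, without constructing a spline; the window length and polynomial order below are illustrative.
import numpy as np
from scipy.signal import savgol_filter

data = np.genfromtxt("spline_data.dat", delimiter='\t')
x = 1000 / data[:, 0]
y = data[:, 1]
# odd window length, fitted polynomial of order 3 within each window
y_smooth = savgol_filter(y, window_length=11, polyorder=3)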
You could also smooth the data with a rolling mean in pandas:
import pandas as pd
data = [...(your data here)...]
smoothed_data = pd.Series(data).rolling(window=5).mean()
The window argument of rolling is the moving-average (rolling-mean) period. (The older pd.rolling_mean(data, 5) spelling has been removed from pandas.) You can also reverse the data with data.reverse(), take a rolling mean that way, and combine it with the forward rolling mean, as in the sketch after the links below. Another option is exponentially weighted moving averages:
Pandas: Exponential smoothing function for column
or using bandpass filters:
fft bandpass filter in python
http://docs.scipy.org/doc/scipy/reference/signal.html
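A minimal sketch of the forward/backward combination described above, with illustrative stand-in data and window size:
import numpy as np
import pandas as pd

data = pd.Series(np.random.rand(100))                 # stand-in data
forward = data.rolling(window=5).mean()
backward = data[::-1].rolling(window=5).mean()[::-1]  # rolling mean over reversed data
combined = (forward + backward) / 2                   # symmetric smoothing
ewma = data.ewm(span=5).mean()                        # exponentially weighted variant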

Calculate confidence band of least-square fit

I have a question that I have been fighting with for days now.
How do I calculate the (95%) confidence band of a fit?
Fitting curves to data is the everyday job of every physicist, so I think this should be implemented somewhere, but I can't find an implementation, nor do I know how to do this mathematically.
The only thing I found is seaborn, which does a nice job for linear least squares.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
x = np.linspace(0,10)
y = 3*np.random.randn(50) + x
data = {'x':x, 'y':y}
frame = pd.DataFrame(data, columns=['x', 'y'])
sns.lmplot('x', 'y', frame, ci=95)
plt.savefig("confidence_band.pdf")
But this is just linear least squares. When I want to fit e.g. a saturation curve like y = a(1 - e^(b*x)), I'm screwed.
Sure, I can calculate the t-distribution from the std-error of a least-squares method like scipy.optimize.curve_fit, but that is not what I'm searching for.
Thanks for any help!!
You can achieve this easily using the StatsModels module.
Also see this example and this answer.
Here is an answer for your question:
import numpy as np
from matplotlib import pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import summary_table
x = np.linspace(0,10)
y = 3*np.random.randn(50) + x
X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
st, data, ss2 = summary_table(res, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x, y, 'o', label="data")
ax.plot(x, fittedvalues, 'r-', label='OLS')
ax.plot(x, predict_ci_low, 'b--')
ax.plot(x, predict_ci_upp, 'b--')
ax.plot(x, predict_mean_ci_low, 'g--')
ax.plot(x, predict_mean_ci_upp, 'g--')
ax.legend(loc='best')
plt.show()
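For what it's worth, recent statsmodels versions expose the same bands without summary_table; a minimal sketch reusing res and X from above:
pred = res.get_prediction(X).summary_frame(alpha=0.05)
# pred['mean_ci_lower'] / pred['mean_ci_upper'] -> confidence band of the mean
# pred['obs_ci_lower'] / pred['obs_ci_upper']   -> prediction band for new observations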
kmpfit's confidence_band() calculates the confidence band for non-linear least squares. Here for your saturation curve:
from pylab import *
from kapteyn import kmpfit
def model(p, x):
    a, b = p
    return a*(1-np.exp(b*x))
x = np.linspace(0, 10, 100)
y = .1*np.random.randn(x.size) + model([1, -.4], x)
fit = kmpfit.simplefit(model, [.1, -.1], x, y)
a, b = fit.params
dfdp = [1-np.exp(b*x), -a*x*np.exp(b*x)]
yhat, upper, lower = fit.confidence_band(x, dfdp, 0.95, model)
scatter(x, y, marker='.', color='#0000ba')
for i, l in enumerate((upper, lower, yhat)):
    plot(x, l, c='g' if i == 2 else 'r', lw=2)
savefig('kmpfit confidence bands.png', bbox_inches='tight')
The dfdp are the partial derivatives ∂f/∂p of the model f = a*(1-e^(b*x)) with respect to each parameter p (i.e., a and b); see my answer to a similar question for background links.
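If you would rather avoid the kapteyn dependency, here is a hedged sketch of the same band computed with the delta method from scipy.optimize.curve_fit; the covariance matrix pcov stands in for kmpfit's internals, and the model and dfdp are the ones defined above:
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t

def f(x, a, b):
    return a * (1 - np.exp(b * x))

x = np.linspace(0, 10, 100)
y = .1 * np.random.randn(x.size) + f(x, 1, -.4)
popt, pcov = curve_fit(f, x, y, p0=[.1, -.1])
a, b = popt
J = np.column_stack([1 - np.exp(b * x), -a * x * np.exp(b * x)])  # dfdp at the fit
se = np.sqrt(np.sum((J @ pcov) * J, axis=1))  # pointwise standard error of the fit
tval = t.ppf(0.975, len(x) - len(popt))       # 95% two-sided critical value
yhat = f(x, *popt)
upper, lower = yhat + tval * se, yhat - tval * se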

matplotlib contourf: get Z value under cursor

When I plot something with contourf, I see the current x and y values under the mouse cursor at the bottom of the plot window.
Is there a way to also see the z value?
Here is an example contourf:
import matplotlib.pyplot as plt
import numpy as np
plt.contourf(np.arange(16).reshape(-1,4))
The text that shows the position of the cursor is generated by ax.format_coord. You can override the method to also display a z-value. For instance,
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate as si
data = np.arange(16).reshape(-1, 4)
X, Y = np.mgrid[:data.shape[0], :data.shape[1]]
cs = plt.contourf(X, Y, data)
func = si.interp2d(X, Y, data)
def fmt(x, y):
    z = np.take(func(x, y), 0)
    return 'x={x:.5f} y={y:.5f} z={z:.5f}'.format(x=x, y=y, z=z)
plt.gca().format_coord = fmt
plt.show()
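Note that interp2d has been deprecated and removed in recent SciPy releases; a rough equivalent sketch with RegularGridInterpolator, assuming the same grid and data as above:
from scipy.interpolate import RegularGridInterpolator

# the grid axes are the row and column indices of data, as in the mgrid above
rgi = RegularGridInterpolator(
    (np.arange(data.shape[0]), np.arange(data.shape[1])), data,
    bounds_error=False, fill_value=np.nan)

def fmt(x, y):
    z = rgi((x, y)).item()
    return 'x={x:.5f} y={y:.5f} z={z:.5f}'.format(x=x, y=y, z=z)

plt.gca().format_coord = fmt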
The documentation example shows how you can insert z-value labels into your plot
Script: http://matplotlib.sourceforge.net/mpl_examples/pylab_examples/contour_demo.py
Basically, it's
plt.figure()
CS = plt.contour(X, Y, Z)
plt.clabel(CS, inline=1, fontsize=10)
plt.title('Simplest default with labels')
Just a variant of wilywampa's answer. If you already have a pre-computed grid of interpolated contour values because your data is sparse or if you have a huge data matrix, this might be suitable for you.
import matplotlib.pyplot as plt
import numpy as np
resolution = 100
Z = np.arange(resolution**2).reshape(-1, resolution)
X, Y = np.mgrid[:Z.shape[0], :Z.shape[1]]
cs = plt.contourf(X, Y, Z)
Xflat, Yflat, Zflat = X.flatten(), Y.flatten(), Z.flatten()
def fmt(x, y):
    # get closest point with known data
    dist = np.linalg.norm(np.vstack([Xflat - x, Yflat - y]), axis=0)
    idx = np.argmin(dist)
    z = Zflat[idx]
    return 'x={x:.5f} y={y:.5f} z={z:.5f}'.format(x=x, y=y, z=z)
plt.colorbar()
plt.gca().format_coord = fmt
plt.show()
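For very large grids, the brute-force distance above is recomputed on every cursor move; a hedged variant using scipy's cKDTree makes the nearest-point lookup much faster (same Xflat, Yflat, Zflat as above):
from scipy.spatial import cKDTree

tree = cKDTree(np.column_stack([Xflat, Yflat]))

def fmt(x, y):
    _, idx = tree.query([x, y])  # index of the nearest grid point
    z = Zflat[idx]
    return 'x={x:.5f} y={y:.5f} z={z:.5f}'.format(x=x, y=y, z=z)

plt.gca().format_coord = fmt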
