Iterate through linear regression while outputting plots in Python (SciPy & Matplotlib)

I am trying to write a for loop that runs 3 regressions over a pandas DataFrame and plots the fitted line for each variable.
year = crime_df.iloc[:,0]
violent_crime_rate = crime_df.iloc[:,3]
murder_rate = crime_df.iloc[:,5]
aggravated_assault_rate = crime_df.iloc[:,11]
x_axis = [violent_crime_rate, murder_rate, aggravated_assault_rate]
for x in x_axis:
    slope, intercept, r_value, p_value, std_err = linregress(year, x)
    fit = slope * year + intercept
    fig, ax = plt.subplots()
    fig.suptitle('x', fontsize=16, fontweight="bold")
    ax.plot(year, x, linewidth=0, marker='o')
    ax.plot(year, fit, 'b--')
    plt.show()
The code produces 3 plots, each titled 'x', with distinct regression lines. How can I set a title (and labels) for each plot that matches the variable being plotted in the loop? I am unsure how to retrieve the variable names from the list I am iterating over. I tried str(x) in the suptitle line, but that returned the values in the column rather than the variable name.

Something like this?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
matrix = np.random.rand(4, 12)   # emulate some data
crime_df = pd.DataFrame(matrix)  # emulate some data
year = crime_df.iloc[:,0]
violent_crime_rate = crime_df.iloc[:,3]
murder_rate = crime_df.iloc[:,5]
aggravated_assault_rate = crime_df.iloc[:,11]
# keep the names in a list parallel to the list of columns
names = ['violent_crime_rate','murder_rate','aggravated_assault_rate']
x_axis = [violent_crime_rate, murder_rate, aggravated_assault_rate]
def linregress(year, x):  # stand-in for scipy.stats.linregress, returns 5 random numbers
    return np.random.rand(5)
for ind, x in enumerate(x_axis):
    slope, intercept, r_value, p_value, std_err = linregress(year, x)
    fit = slope * year + intercept
    fig, ax = plt.subplots()
    fig.suptitle('x:'+str(names[ind]), fontsize=16, fontweight="bold")
    ax.plot(year, x, linewidth=0, marker='o', label = names[ind] + ':1')
    ax.plot(year, fit, 'b--', label = names[ind] + ':2')
    ax.legend()
    plt.show()
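If the DataFrame has named columns, another option is to loop over the column names themselves, so the title and labels come straight from the loop variable. The following is only a sketch under assumptions: the DataFrame contents, the column names and the helper plot_regressions are made up for illustration, and the real crime_df may be laid out differently.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Hypothetical stand-in for crime_df: a year column plus three rate columns.
rng = np.random.default_rng(0)
years = np.arange(2000, 2020)
crime_df = pd.DataFrame({
    "year": years,
    "violent_crime_rate": 500 - 5.0 * (years - 2000) + rng.normal(0, 10, years.size),
    "murder_rate": 6 - 0.05 * (years - 2000) + rng.normal(0, 0.3, years.size),
    "aggravated_assault_rate": 320 - 3.0 * (years - 2000) + rng.normal(0, 8, years.size),
})

def plot_regressions(df, x_col, y_cols):
    """Fit y = slope * x + intercept per column and draw one figure per column."""
    results = {}
    for name in y_cols:
        res = linregress(df[x_col], df[name])
        fit = res.slope * df[x_col] + res.intercept
        fig, ax = plt.subplots()
        # The column name doubles as title, axis label and legend entry.
        fig.suptitle(name, fontsize=16, fontweight="bold")
        ax.plot(df[x_col], df[name], linewidth=0, marker="o", label=name)
        ax.plot(df[x_col], fit, "b--", label="fit")
        ax.set_xlabel(x_col)
        ax.set_ylabel(name)
        ax.legend()
        results[name] = (res.slope, res.intercept, fig)
    return results

results = plot_regressions(
    crime_df, "year",
    ["violent_crime_rate", "murder_rate", "aggravated_assault_rate"],
)
```

Call plt.show() afterwards to display the three figures.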

Related

How to create a confidence interval with plt.fill_between inside a scatter plot

I created a scatter plot that uses data from two sources: x = [] and y = []. In a second step, I added a linear regression line for the two lists of data above using the following code:
(m, b) = np.polyfit(x, y, 1)
Y_Polyval = np.polyval([m, b], x)
plt.plot(x, Y_Polyval, linewidth=3, c="black")
The result of that is a standard scatterplot as shown below.
Now I would like to add a 95% confidence interval to the black regression line, using plt.fill_between. I know that there are many topics on this, I read through many of them, but I cannot solve the problem, i.e., adapting a code to my particular code and regression line.
Adding
CI = 1.96 * np.std(y) / np.mean(y)
plt.fill_between(y, (y-CI), (y+CI), color='blue', alpha=0.1)
to my code results in the following output below.
The blueish confidence interval by plt.fill_between is somewhere drawn on the left side of the image, but not around the regression line. What I would like to achieve is that the confidence interval draws around the black regression line. The full code is shown subsequently:
import numpy as np
import matplotlib.pyplot as plt
# Scatter plot
x = [0.472202, 0.685151, 0.287613, 0.546364, 0.518002, 0.675128, 0.462418, 0.61817, 0.692822, 0.23433,
0.194009, 0.720232, 0.597321, 0.625955, 0.660571, 0.737754, 0.436876, 0.689937, 0.483067, 0.646723,
0.699367, 0.384102, 0.561493]
y = [0.131113, 0.123865, 0.150355, 0.138914, 0.140417, 0.119358, 0.130019, 0.129782, 0.113508, 0.13434,
0.15162, 0.125768, 0.128473, 0.128056, 0.114403, 0.142878, 0.139192, 0.118033, 0.132616, 0.133043,
0.133973, 0.146611, 0.129792]
(m, b) = np.polyfit(x, y, 1)
Y_Polyval = np.polyval([m, b], x)
plt.plot(x, Y_Polyval, linewidth=3, c="black")
CI = 1.96 * np.std(y) / np.mean(y)
plt.fill_between(y, (y-CI), (y+CI), color='blue', alpha=0.1)
plt.scatter(x, y, s=250, linewidths=2, zorder=2)
plt.show()
Draw the band around the predicted values Y_Polyval instead of the true values y, pass x (not y) as the horizontal coordinate, and sort the (x, y) pairs by x so the filled area is drawn from left to right:
plt.fill_between(x, (Y_Polyval-CI), (Y_Polyval+CI), color='blue', alpha=0.1)
Full Example
import numpy as np
import matplotlib.pyplot as plt
# Scatter plot
x = [0.472202, 0.685151, 0.287613, 0.546364, 0.518002, 0.675128, 0.462418, 0.61817, 0.692822, 0.23433,
     0.194009, 0.720232, 0.597321, 0.625955, 0.660571, 0.737754, 0.436876, 0.689937, 0.483067, 0.646723,
     0.699367, 0.384102, 0.561493]
y = [0.131113, 0.123865, 0.150355, 0.138914, 0.140417, 0.119358, 0.130019, 0.129782, 0.113508, 0.13434,
     0.15162, 0.125768, 0.128473, 0.128056, 0.114403, 0.142878, 0.139192, 0.118033, 0.132616, 0.133043,
     0.133973, 0.146611, 0.129792]
# Sort the coordinate pairs by x
coords = sorted(zip(x, y), key=lambda p: p[0])
x, y = zip(*coords)
(m, b) = np.polyfit(x, y, 1)
Y_Polyval = np.polyval([m, b], x)
# Band half-width from the question (a constant offset, not a true confidence interval)
CI = 1.96 * np.std(y) / np.mean(y)
plt.plot(x, Y_Polyval, linewidth=3, c="black")
plt.scatter(x, y, s=250, linewidths=2, zorder=2)
plt.fill_between(x, (Y_Polyval-CI), (Y_Polyval+CI), color='blue', alpha=0.1)
plt.show()
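For what it is worth, the constant CI above shifts the line up and down by a fixed amount. A 95% confidence band for the fitted mean is usually computed from the residual standard error and a t quantile, and it widens away from the mean of x. Below is a minimal sketch of that textbook formula, assuming ordinary least squares with independent, normally distributed errors of constant variance; the helper regression_band and the grid x_grid are names made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([0.472202, 0.685151, 0.287613, 0.546364, 0.518002, 0.675128,
              0.462418, 0.61817, 0.692822, 0.23433, 0.194009, 0.720232,
              0.597321, 0.625955, 0.660571, 0.737754, 0.436876, 0.689937,
              0.483067, 0.646723, 0.699367, 0.384102, 0.561493])
y = np.array([0.131113, 0.123865, 0.150355, 0.138914, 0.140417, 0.119358,
              0.130019, 0.129782, 0.113508, 0.13434, 0.15162, 0.125768,
              0.128473, 0.128056, 0.114403, 0.142878, 0.139192, 0.118033,
              0.132616, 0.133043, 0.133973, 0.146611, 0.129792])

def regression_band(x, y, x_grid, level=0.95):
    """Return the fitted line and the confidence band for the mean response."""
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    y_fit = slope * x_grid + intercept
    residuals = y - (slope * x + intercept)
    s = np.sqrt(np.sum(residuals**2) / (n - 2))   # residual standard error
    sxx = np.sum((x - x.mean())**2)
    # Standard error of the fitted mean at each grid point
    se_mean = s * np.sqrt(1.0 / n + (x_grid - x.mean())**2 / sxx)
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    half_width = t_crit * se_mean
    return y_fit, y_fit - half_width, y_fit + half_width

# An evenly spaced, sorted grid avoids having to sort the data points.
x_grid = np.linspace(x.min(), x.max(), 200)
y_fit, lower, upper = regression_band(x, y, x_grid)

fig, ax = plt.subplots()
ax.scatter(x, y, s=250, linewidths=2, zorder=2)
ax.plot(x_grid, y_fit, linewidth=3, c="black")
ax.fill_between(x_grid, lower, upper, color="blue", alpha=0.1)
```

Call plt.show() to display the figure. The band is narrowest near the mean of x and widens toward both ends.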

Plotting a linear regression with dates in matplotlib.pyplot

How would I plot a linear regression with dates in pyplot? I wasn't able to find a definitive answer to this question. This is what I've tried (courtesy of w3school's tutorial on linear regression).
import matplotlib.pyplot as plt
from scipy import stats
x = ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019', '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019', '01/10/2019', '01/11/2019', '01/12/2019', '01/01/2020']
y = [12050, 17044, 14066, 16900, 19979, 17593, 14058, 16003, 15095, 12785, 12886, 20008]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
You first have to convert your dates into numbers to be able to do a regression (and to plot for that matter). Then you can instruct matplotlib to interpret the x-values as dates to get a nicely formatted axis:
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from scipy import stats
x = ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019', '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019', '01/10/2019', '01/11/2019', '01/12/2019']
y = [12050, 17044, 14066, 16900, 19979, 17593, 14058, 16003, 15095, 12785, 12886, 20008]
# convert the dates to a number, using the datetime module and matplotlib's date2num
# '%m' is the month ('%M' would be minutes); for day-first strings use '%d/%m/%Y' instead
x = [mdates.date2num(datetime.datetime.strptime(i, '%m/%d/%Y')) for i in x]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x, mymodel)
# instruct matplotlib on how to convert the numbers back into dates for the x-axis
locator = mdates.AutoDateLocator()
formatter = mdates.AutoDateFormatter(locator)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)
plt.show()
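A variation that keeps the datetime objects for plotting and regresses on "days since the first date", so the slope reads as units per day. This is only a sketch, and it assumes the strings are day-first (the first day of each month); if they are month-first, the format string would be '%m/%d/%Y' instead. The names parsed, days and fitted are introduced here for illustration.

```python
import datetime
import matplotlib.pyplot as plt
from scipy import stats

dates = ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019',
         '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019',
         '01/09/2019', '01/10/2019', '01/11/2019', '01/12/2019']
values = [12050, 17044, 14066, 16900, 19979, 17593,
          14058, 16003, 15095, 12785, 12886, 20008]

# Assumption: day-first strings, i.e. one observation per month.
parsed = [datetime.datetime.strptime(d, '%d/%m/%Y') for d in dates]

# Regress on elapsed days so the numbers stay small and the slope is per day.
days = [(d - parsed[0]).days for d in parsed]
result = stats.linregress(days, values)
fitted = [result.slope * d + result.intercept for d in days]

# Matplotlib accepts datetime objects directly on the x-axis.
fig, ax = plt.subplots()
ax.scatter(parsed, values)
ax.plot(parsed, fitted)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

Call plt.show() to display the figure.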

Modify my code- better ways to insert changing values

I have code that builds a scatter plot and displays the linear regression trend line and the R squared.
I calculate the R squared by computing the slope, intercept and r_value as follows:
#Try for Linear Regression Model- still couldn't display anything on any scatter plot.
x = merged_data['NDVI']
y = merged_data['nitrogen']
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(x, y)
print('slope:',slope)
print('intercept:',intercept)
print('R:',r_value)
print('R^2:',(r_value**2))
## Create Figure (empty canvas)
fig = plt.figure()
##Add set of axes to figure
axes = fig.add_axes([1,1,1,1]) # left, bottom, width, height (range 0 to 1)
##plot
plt.scatter(x,y,alpha=0.5)
plt.title('NDVI vs Nitrogen 17/6/2019')
plt.xlabel('NDVI')
#here I insert the calculated values manually according to the printed values
plt.figtext(1.8,1.6, "y=-7.269X+10.11")
plt.figtext(1.8,1.55, "R^2=-0.017")
plt.ylabel('Nitrogen')
plt.show()
I have many different datasets that I want to check this for, and I don't want to change the text in the plot manually every time. Is there any way to tell Python to take those values automatically and put them in the right place?
Check this out:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
# generate dataset 1
x1 = np.random.normal(0,1,1000)
epsilon1 = np.random.normal(0,1,1000)
y1 = x1 + epsilon1
# generate dataset 2
x2 = np.random.normal(0,1,1000)
epsilon2 = np.random.normal(0,1,1000)
y2 = -x2 + epsilon2
def give_me_scatter(x, y, title, xlabel, ylabel):
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    print('slope:',slope)
    print('intercept:',intercept)
    print('R:',r_value)
    print('R^2:',(r_value**2))
    ## Create Figure (empty canvas)
    fig = plt.figure()
    ##Add set of axes to figure
    axes = fig.add_axes([1,1,1,1]) # left, bottom, width, height (range 0 to 1)
    ##plot
    plt.scatter(x,y,alpha=0.5)
    plt.title(title)
    plt.xlabel(xlabel)
    # the fitted values are formatted into the text automatically
    plt.figtext(1.0,1.95, "y={0:.3}X+{1:.3}".format(slope, intercept))
    plt.figtext(1.0,1.90, "R^2={0:.3}".format(r_value**2))
    plt.ylabel(ylabel)
    plt.show()
For dataset 1:
give_me_scatter(x1, y1, 'x1 vs y1 10/12/2019', 'x1', 'y1')
slope: 0.9505854192888193
intercept: -0.0499255665055585
R: 0.6949004149189184
R^2: 0.482886586654485
For dataset 2:
give_me_scatter(x2, y2, 'x2 vs y2 10/12/2019', 'x2', 'y2')
slope: -0.9288542869184935
intercept: -0.008475040216075778
R: -0.6781390024143394
R^2: 0.4598725065955155
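A possible refinement, sketched under assumptions rather than taken from the answer above: place the text in axes coordinates with ax.text and transform=ax.transAxes, so the label stays in the same corner regardless of the data range or figure layout. The helper names regression_label and annotated_scatter are made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

def regression_label(slope, intercept, r_value):
    """Build the annotation text from the fitted values."""
    # '+' in the format spec prints the sign of the intercept explicitly.
    return "y = {0:.3f}x {1:+.3f}\nR^2 = {2:.3f}".format(slope, intercept, r_value**2)

def annotated_scatter(x, y, title, xlabel, ylabel):
    """Scatter plot with trend line and an automatically generated label."""
    res = linregress(x, y)
    fig, ax = plt.subplots()
    ax.scatter(x, y, alpha=0.5)
    xs = np.array([np.min(x), np.max(x)])
    ax.plot(xs, res.slope * xs + res.intercept, "k-")
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    # In axes coordinates (0, 0) is the bottom-left corner and (1, 1) the top-right.
    ax.text(0.05, 0.95, regression_label(res.slope, res.intercept, res.rvalue),
            transform=ax.transAxes, va="top")
    return fig, ax

# Synthetic data, only for demonstration
rng = np.random.default_rng(0)
x_demo = rng.normal(0, 1, 200)
y_demo = x_demo + rng.normal(0, 1, 200)
fig, ax = annotated_scatter(x_demo, y_demo, "x vs y", "x", "y")
```

Call plt.show() to display the figure. The same function can then be called once per dataset.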

How to smoothen data in Python?

I am trying to smoothen a scatter plot shown below using SciPy's B-spline representation of 1-D curve. The data is available here.
The code I used is:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
data = np.genfromtxt("spline_data.dat", delimiter = '\t')
x = 1000 / data[:, 0]
y = data[:, 1]
x_int = np.linspace(x[0], x[-1], 100)
tck = interpolate.splrep(x, y, k = 3, s = 1)
y_int = interpolate.splev(x_int, tck, der = 0)
fig = plt.figure(figsize = (5.15,5.15))
plt.subplot(111)
plt.plot(x, y, marker = 'o', linestyle='')
plt.plot(x_int, y_int, linestyle = '-', linewidth = 0.75, color='k')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
I tried changing the order of the spline and the smoothing condition, but I am not getting a smooth plot.
B-spline interpolation should be able to smooth the data, so what is going wrong? Is there an alternative method for smoothing this data?
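Use a larger smoothing parameter. splrep keeps the sum of squared residuals at or below s, so a small s forces the curve to stay close to every data point. For example, s=1000:
tck = interpolate.splrep(x, y, k=3, s=1000)
This produces a smoother curve.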
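To get a feel for how s behaves, here is a small sketch on synthetic data (a noisy sine, not the data from the question; the helper fit_spline is made up for this example). With unit weights, a reasonable starting guess for s is roughly the number of points times the noise variance, which is about 200 * 0.3**2 = 18 here; that rule of thumb is an assumption about the noise, not something splrep works out for you.

```python
import numpy as np
from scipy import interpolate

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = np.sin(x) + rng.normal(0.0, 0.3, x.size)  # synthetic noisy data

def fit_spline(x, y, s):
    """Fit a cubic smoothing spline; return it with its residual sum and knot count."""
    tck = interpolate.splrep(x, y, k=3, s=s)
    residual = np.sum((y - interpolate.splev(x, tck))**2)
    n_knots = len(tck[0])
    return tck, residual, n_knots

summary = {}
for s in (1.0, 18.0, 100.0):
    tck, residual, n_knots = fit_spline(x, y, s)
    summary[s] = (residual, n_knots)
    print(f"s={s:6.1f}  knots={n_knots:3d}  sum of squared residuals={residual:.2f}")
```

A larger s lets splrep get away with fewer knots, which is what makes the curve smoother.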
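Assuming we are dealing with noisy observations of some phenomenon, Gaussian Process Regression might also be a good choice. Knowledge about the variance of the noise can be included in the parameters (the nugget, exposed as alpha in current scikit-learn), and the other parameters can be found using maximum likelihood estimation. Here's a simple example of how it could be applied:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
data = np.genfromtxt("spline_data.dat", delimiter='\t')
x = 1000 / data[:, 0]
y = data[:, 1]
x_pred = np.linspace(x[0], x[-1], 100)
# <GP regression>
# GaussianProcessRegressor stands in for the older GaussianProcess class, which was removed
# from scikit-learn. alpha plays the role of the nugget; the kernel bounds are illustrative.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0, length_scale_bounds=(1e-5, 1e3))
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.000001, normalize_y=True)
gp.fit(np.atleast_2d(x).T, y)
y_pred = gp.predict(np.atleast_2d(x_pred).T)
# </GP regression>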
fig = plt.figure(figsize=(5.15, 5.15))
plt.subplot(111)
plt.plot(x, y, marker='o', linestyle='')
plt.plot(x_pred, y_pred, linestyle='-', linewidth=0.75, color='k')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
In your specific case, you could also try changing the last argument of the np.linspace function to a smaller number, np.linspace(x[0], x[-1], 10), for example.
Demo code:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
data = np.random.rand(100,2)
tempx = list(data[:, 0])
tempy = list(data[:, 1])
x = np.array(sorted([point*10 + tempx.index(point) for point in tempx]))
y = np.array([point*10 + tempy.index(point) for point in tempy])
x_int = np.linspace(x[0], x[-1], 10)
tck = interpolate.splrep(x, y, k = 3, s = 1)
y_int = interpolate.splev(x_int, tck, der = 0)
fig = plt.figure(figsize = (5.15,5.15))
plt.subplot(111)
plt.plot(x, y, marker = 'o', linestyle='')
plt.plot(x_int, y_int, linestyle = '-', linewidth = 0.75, color='k')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
You could also smooth the data with a rolling mean in pandas:
import pandas as pd
data = [...(your data here)...]
# Series.rolling(...).mean() replaces the removed pd.rolling_mean
smoothendData = pd.Series(data).rolling(5).mean()
The argument of rolling is the moving average (rolling mean) period. You can also reverse the data, take a rolling mean of the reversed data, and combine it with the forward rolling mean. Another option is exponentially weighted moving averages:
Pandas: Exponential smoothing function for column
or using bandpass filters:
fft bandpass filter in python
http://docs.scipy.org/doc/scipy/reference/signal.html
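A small sketch of the rolling-mean ideas above on a toy series. The values are illustrative only, and the window of 3 and span of 3 are arbitrary choices for the example.

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # illustrative values only

# Forward (trailing) rolling mean: the first window - 1 entries are NaN.
forward = series.rolling(window=3).mean()

# Rolling mean of the reversed data, flipped back to the original order.
backward = series[::-1].rolling(window=3).mean()[::-1]

# Average of the two; NaN wherever either side lacks a full window.
two_sided = (forward + backward) / 2

# A centered window is another way to avoid the lag of a trailing mean.
centered = series.rolling(window=3, center=True).mean()

# Exponentially weighted moving average; span=3 corresponds to alpha = 0.5.
ewm = series.ewm(span=3, adjust=False).mean()

print(pd.DataFrame({"forward": forward, "backward": backward,
                    "two_sided": two_sided, "centered": centered, "ewm": ewm}))
```

The trailing mean lags behind the data, which is why combining it with the reversed pass (or using center=True) can look better on a plot.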

Pandas plotting linear regression on scatter graph

I'm trying to plot a linear regression on a scatter graph.
def chart1(df, yr, listcols):
    temp = df[(df['YEAR']==yr)]
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize = (12,12))
    for e in [['WD','pk_h',0,0],['WD','of_h',0,1],['SAT','of_h',1,0],['SUN','of_h',1,1]]:
        # .loc replaces .ix, which no longer exists in pandas
        temp.loc[(temp['daytype']==e[0])&(temp['hourtype']==e[1]),listcols].plot(kind='scatter', title=str(yr)+' '+e[0]+' '+e[1], x=listcols[0], y=listcols[1], ax=axes[e[2],e[3]])
    fig.tight_layout()
    return temp
chartd = chart1(o2, 2017,['PROD', 'option_exercise'])
I can't figure out how to make it possible in my loop.
It should work this way:
In your for loop, run a regression and store the results in 'res'. Manually calculate the predicted y ('yhat') using the stored coefficients. Then chart both x vs. y and x vs. yhat:
import matplotlib.pyplot as plt
from scipy.stats import linregress  # stands in for pandas.stats.api.ols, which no longer exists in pandas
def chart4(df, yr, day, Y, sensi):
    temp = df[(df['YEAR']==yr)]
    temp = temp[(temp['daytype']==day)].copy()
    fig = plt.figure(figsize=(15,13))
    for i, var in enumerate(sensi):
        res = linregress(temp[var], temp[Y])
        label = 'R2: ' + str(res.rvalue**2)
        temp['yhat'] = temp[var]*res.slope + res.intercept
        axis = fig.add_subplot(4,3,i+1)
        temp.plot(ax=axis, kind='scatter', x=var, y=Y, title=var)
        temp.plot(ax=axis, kind='scatter', x=var, y='yhat', color='grey', s=1, label=label)
        axis.set_xlabel(r'alpha', fontsize=18)
    fig.tight_layout()
    return
