I have two plots I want to show: the original data and its regression line. Whenever I run this code, the regression line doesn't pass through the data at all. I think this has to do with plotting the original data on a log scale for the y axis (I tried accounting for this when running polyfit, but I'm still having issues).
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(a, b)
plt.yscale('log')
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.plot(a, (slope*a)+intercept)
plt.show()
You are fitting log(b) = slope * a + intercept, which is equivalent to b = np.exp(slope*a + intercept).
In matplotlib, you either have to make the plot using a linear scale, with log(b) as the variable:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.figure()
plt.scatter(a, np.log(b))
plt.plot(a, (slope*a)+intercept)
plt.show()
In this case, you do not use plt.yscale('log') as your axis is already scaled with respect to log(b).
On the other hand, you can plot the linear variables with a logarithmic scale:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.figure()
plt.yscale('log')
plt.scatter(a, b)
plt.plot(a, np.exp((slope*a)+intercept))
plt.show()
Or, equivalently, you can wrap the regression line in a small helper function:
import numpy as np
import matplotlib.pyplot as plt
def regression(m, x, b):
    return m * x + b
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
slope, intercept = np.polyfit(a, np.log(b), 1)
plt.figure()
plt.scatter(a, np.log(b))
plt.plot(a, regression(slope, a, intercept))
plt.show()
How would I plot a linear regression with dates in pyplot? I wasn't able to find a definitive answer to this question. This is what I've tried (courtesy of W3Schools' tutorial on linear regression).
import matplotlib.pyplot as plt
from scipy import stats
x = ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019', '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019', '01/10/2019', '01/11/2019', '01/12/2019']
y = [12050, 17044, 14066, 16900, 19979, 17593, 14058, 16003, 15095, 12785, 12886, 20008]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
You first have to convert your dates into numbers to be able to do a regression (and to plot for that matter). Then you can instruct matplotlib to interpret the x-values as dates to get a nicely formatted axis:
import matplotlib.pyplot as plt
import matplotlib.dates
from scipy import stats
import datetime
x = ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019', '01/05/2019', '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019', '01/10/2019', '01/11/2019', '01/12/2019']
y = [12050, 17044, 14066, 16900, 19979, 17593, 14058, 16003, 15095, 12785, 12886, 20008]
# convert the dates to a number, using the datetime module
x = [datetime.datetime.strptime(i, '%m/%d/%Y').toordinal() for i in x]  # note: %m is month, %M would be minutes
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x, mymodel)
# instruct matplotlib on how to convert the numbers back into dates for the x-axis
l = matplotlib.dates.AutoDateLocator()
f = matplotlib.dates.AutoDateFormatter(l)
ax.xaxis.set_major_locator(l)
ax.xaxis.set_major_formatter(f)
plt.show()
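As a side note, a minimal sketch of an alternative (assuming a reasonably recent matplotlib): keep the x values as datetime objects for plotting and convert to numbers only for the regression itself, via matplotlib.dates.date2num, so matplotlib formats the date axis for you:
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from scipy import stats
dates = [datetime.datetime(2019, m, 1) for m in range(1, 13)]
y = [12050, 17044, 14066, 16900, 19979, 17593, 14058, 16003, 15095, 12785, 12886, 20008]
x = mdates.date2num(dates)  # floats (days), suitable for the regression
slope, intercept, r, p, std_err = stats.linregress(x, y)
fig, ax = plt.subplots()
ax.scatter(dates, y)  # datetimes are converted to the same numbers internally
ax.plot(dates, slope * x + intercept)  # regression evaluated on the numeric x
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
plt.show()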
The idea is to plot the curve C(t) = (1 + cos(t))i + (1 + sin(t))j + (1 - sin(t) - cos(t))k. Following the instructions in the Plotting Module documentation at https://docs.sympy.org/latest/modules/plotting.html, one can get it using plot3d_parametric_line:
Method 1:
%matplotlib notebook
import sympy as sp
from sympy import cos, sin
from sympy.plotting import plot3d_parametric_line
t = sp.symbols('t',real=True)
plot3d_parametric_line(1 + cos(t), 1 + sin(t), 1-sin(t)-cos(t), (t, 0, 2*sp.pi))
Though this is a valid method, there is another way to plot it without using plot3d_parametric_line, namely with ax.plot. What I have tried:
Method 2:
fig = plt.figure(figsize=(8, 6))
ax = fig.gca(projection='3d')
ax.set_xlim([-0.15, 2.25])
ax.set_ylim([-0.15, 2.25])
ax.set_zlim([-0.75, 2.50])
ax.plot(1+sp.cos(t),1+sp.sin(t),1-sp.sin(t)-sp.cos(t))
plt.show()
But TypeError: object of type 'Add' has no len() comes up...
How can I fix it so that I get the same curve as with Method 1?
Thanks
You can use the 3D plotting from matplotlib after defining a linear array of parameter values with NumPy and computing your x, y, z variables from it:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib versions
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
t = np.linspace(0, 2*np.pi, 100)
x = 1 + np.cos(t)
y = 1 + np.sin(t)
z = 1 - np.sin(t) - np.cos(t)
ax.plot(x, y, z)
plt.show()
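If you would rather start from the SymPy expressions instead of retyping them with NumPy, here is a small sketch using sympy.lambdify, which compiles a symbolic expression into a function that accepts NumPy arrays; this also explains the original TypeError, since ax.plot needs numeric arrays rather than symbolic Add objects:
import numpy as np
import sympy as sp
import matplotlib.pyplot as plt
t = sp.symbols('t', real=True)
exprs = (1 + sp.cos(t), 1 + sp.sin(t), 1 - sp.sin(t) - sp.cos(t))
# compile each symbolic expression into a NumPy-callable function
fx, fy, fz = (sp.lambdify(t, e, 'numpy') for e in exprs)
ts = np.linspace(0, 2*np.pi, 100)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot(fx(ts), fy(ts), fz(ts))
plt.show()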
In Python, I have estimated the parameters for the density of a model of my distribution, and I would like to plot the density function above the histogram of the distribution. In R this is similar to using the option prob=TRUE.
import numpy as np
from scipy.stats import norm  # matplotlib.mlab.normpdf has been removed from matplotlib
import matplotlib.pyplot as plt
# initialization of the list "data"
# estimation of the parameter, in my case, mean and variance of a normal distribution
plt.hist(data, bins="auto") # data is the list of data
# here I would like to draw the density above the histogram
plt.show()
I guess the trickiest part is to make it fit.
Edit: I have tried this according to the first answer:
mean = np.mean(logdata)
var = np.var(logdata)
std = np.sqrt(var)  # standard deviation; the pdf takes the standard deviation rather than the variance
plt.hist(logdata, bins="auto", alpha=0.5, label="empirical data")
x = np.linspace(min(logdata), max(logdata), 100)
plt.plot(x, norm.pdf(x, mean, std))
plt.xlabel("log(file size)")
plt.ylabel("number of files")
plt.legend(loc='upper right')
plt.grid(True)
plt.show()
But it doesn't fit the graph; here is how it looks:
Edit 2: It works with the option density=True in the histogram function.
If I understand you correctly, you have the mean and standard deviation of some data; you have plotted a histogram of this and would like to plot the normal distribution line over the histogram. This line can be generated using scipy.stats.norm.pdf() (matplotlib.mlab.normpdf() used to do the same, but it has been removed from matplotlib).
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
mean = 100
sigma = 5
data = np.random.normal(mean,sigma,1000) # generate fake data
x = np.linspace(min(data), max(data), 100)
plt.hist(data, bins="auto",normed=True)
plt.plot(x, mlab.normpdf(x, mean, sigma))
plt.show()
Which gives the following figure:
Edit: The above only works with density=True. If this is not an option, we can define our own function:
def gauss_function(x, a, x0, sigma):
    return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
mean = 100
sigma = 5
data = np.random.normal(mean,sigma,1000) # generate fake data
x = np.linspace(min(data), max(data), 1000)
test = gauss_function(x, max(data), mean, sigma)  # max(data) is only a rough guess for the amplitude
plt.hist(data, bins="auto")
plt.plot(x, test)
plt.show()
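A more principled alternative, as a sketch: instead of guessing the amplitude, scale the density by the sample size times the bin width, which converts a probability density into expected counts per bin:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mean, sigma = 100, 5
data = np.random.normal(mean, sigma, 1000)
counts, bins, _ = plt.hist(data, bins="auto")
x = np.linspace(min(data), max(data), 1000)
# pdf * N * bin_width is the expected count per bin
plt.plot(x, norm.pdf(x, mean, sigma) * len(data) * np.diff(bins)[0])
plt.show()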
Everything you are looking for is already in seaborn.
You just have to use distplot:
import seaborn as sns
import numpy as np
data = np.random.normal(5, 2, size=1000)
sns.distplot(data)
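Note that distplot is deprecated in recent seaborn releases; assuming seaborn >= 0.11, the rough equivalent with the current API is:
import seaborn as sns
import numpy as np
data = np.random.normal(5, 2, size=1000)
sns.histplot(data, kde=True, stat="density")  # histogram with a KDE overlay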
I have a question that I have been fighting with for days now.
How do I calculate the (95%) confidence band of a fit?
Fitting curves to data is the everyday job of every physicist, so I think this should be implemented somewhere, but I can't find an implementation, nor do I know how to do this mathematically.
The only thing I found is seaborn, which does a nice job for linear least squares.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
x = np.linspace(0,10)
y = 3*np.random.randn(50) + x
data = {'x':x, 'y':y}
frame = pd.DataFrame(data, columns=['x', 'y'])
sns.lmplot(data=frame, x='x', y='y', ci=95)
plt.savefig("confidence_band.pdf")
But this is just linear least squares. When I want to fit e.g. a saturation curve like a*(1 - exp(b*x)), I'm stuck.
Sure, I can calculate the t-distribution from the standard errors of a least-squares method like scipy.optimize.curve_fit, but that is not what I'm searching for.
Thanks for any help!!
You can achieve this easily using the statsmodels module.
Here is an answer for your question:
import numpy as np
from matplotlib import pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import summary_table
x = np.linspace(0,10)
y = 3*np.random.randn(50) + x
X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
st, data, ss2 = summary_table(res, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x, y, 'o', label="data")
ax.plot(x, fittedvalues, 'r-', label='OLS')
ax.plot(x, predict_ci_low, 'b--')
ax.plot(x, predict_ci_upp, 'b--')
ax.plot(x, predict_mean_ci_low, 'g--')
ax.plot(x, predict_mean_ci_upp, 'g--')
ax.legend(loc='best');
plt.show()
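As a side note, newer statsmodels versions expose the same intervals through get_prediction, which can be more convenient than indexing into summary_table; a minimal sketch, assuming statsmodels >= 0.8:
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
x = np.linspace(0, 10)
y = 3*np.random.randn(50) + x
X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
pred = res.get_prediction(X).summary_frame(alpha=0.05)  # DataFrame of mean and interval columns
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'o', label='data')
ax.plot(x, pred['mean'], 'r-', label='OLS')
ax.fill_between(x, pred['mean_ci_lower'], pred['mean_ci_upper'], alpha=0.3)  # confidence band
ax.fill_between(x, pred['obs_ci_lower'], pred['obs_ci_upper'], alpha=0.15)  # prediction band
ax.legend(loc='best')
plt.show()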
kmpfit's confidence_band() calculates the confidence band for non-linear least squares. Here it is for your saturation curve:
import numpy as np
from matplotlib import pyplot as plt
from kapteyn import kmpfit
def model(p, x):
    a, b = p
    return a*(1-np.exp(b*x))
x = np.linspace(0, 10, 100)
y = .1*np.random.randn(x.size) + model([1, -.4], x)
fit = kmpfit.simplefit(model, [.1, -.1], x, y)
a, b = fit.params
dfdp = [1-np.exp(b*x), -a*x*np.exp(b*x)]
yhat, upper, lower = fit.confidence_band(x, dfdp, 0.95, model)
plt.scatter(x, y, marker='.', color='#0000ba')
for i, l in enumerate((upper, lower, yhat)):
    plt.plot(x, l, c='g' if i == 2 else 'r', lw=2)
plt.savefig('kmpfit confidence bands.png', bbox_inches='tight')
The dfdp are the partial derivatives ∂f/∂p of the model f = a*(1 - e^(b*x)) with respect to each parameter p (i.e., a and b); see my answer to a similar question for background links. And here is the output:
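For reference, the derivatives follow directly from differentiating the model: with f = a*(1 - e^(b*x)), one gets ∂f/∂a = 1 - e^(b*x) and ∂f/∂b = -a*x*e^(b*x), which are exactly the two entries of dfdp above.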
I'm trying to do a little bit of distribution plotting and fitting in Python using SciPy for stats and matplotlib for the plotting. I'm having good luck with some things like creating a histogram:
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
np.random.seed(2)
alpha = 5
loc = 100
beta = 22
data = ss.gamma.rvs(alpha, loc=loc, scale=beta, size=5000)
myHist = plt.hist(data, 100, density=True)
Brilliant!
I can even take the same gamma parameters and plot the probability density function as a line (after some googling):
rv = ss.gamma(5,100,22)
x = np.linspace(0,600)
h = plt.plot(x, rv.pdf(x))
How would I go about plotting the histogram myHist with the PDF line h superimposed on top of it? I'm hoping this is trivial, but I have been unable to figure it out.
Just put both pieces together:
import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt
alpha, loc, beta=5, 100, 22
data=ss.gamma.rvs(alpha,loc=loc,scale=beta,size=5000)
myHist = plt.hist(data, 100, density=True)
rv = ss.gamma(alpha,loc,beta)
x = np.linspace(0,600)
h = plt.plot(x, rv.pdf(x), lw=2)
plt.show()
To make sure you get what you want in any specific plot instance, try creating a figure object first:
import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt
# setting up the axes
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
# now plot
alpha, loc, beta=5, 100, 22
data=ss.gamma.rvs(alpha,loc=loc,scale=beta,size=5000)
myHist = ax.hist(data, 100, density=True)
rv = ss.gamma(alpha,loc,beta)
x = np.linspace(0,600)
h = ax.plot(x, rv.pdf(x), lw=2)
# show
plt.show()
One could also be interested in plotting an estimated density function over any histogram.
This can be done using seaborn's kdeplot function:
import numpy as np # for random data
import pandas as pd # for convenience
import matplotlib.pyplot as plt # for graphics
import seaborn as sns # for nicer graphics
v1 = pd.Series(np.random.normal(0,10,1000), name='v1')
v2 = pd.Series(2*v1 + np.random.normal(60,15,1000), name='v2')
# plot a kernel density estimation over a stacked barchart
plt.figure()
plt.hist([v1, v2], histtype='barstacked', density=True);
v3 = np.concatenate((v1,v2))
sns.kdeplot(v3);
plt.show()
From a Coursera course on data visualization with Python.
Expanding on Malik's answer, and trying to stick with vanilla NumPy, SciPy, and Matplotlib: I've pulled in Seaborn, but it's only used to provide nicer defaults and small visual tweaks.
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks')
# parameterise our distributions
d1 = sps.norm(0, 10)
d2 = sps.norm(60, 15)
# sample values from above distributions
y1 = d1.rvs(300)
y2 = d2.rvs(200)
# combine mixture
ys = np.concatenate([y1, y2])
# create new figure with size given explicitly
plt.figure(figsize=(10, 6))
# add histogram showing individual components
plt.hist([y1, y2], 31, histtype='barstacked', density=True, alpha=0.4, edgecolor='none')
# get X limits and fix them
mn, mx = plt.xlim()
plt.xlim(mn, mx)
# add our distributions to figure
x = np.linspace(mn, mx, 301)
plt.plot(x, d1.pdf(x) * (len(y1) / len(ys)), color='C0', ls='--', label='d1')
plt.plot(x, d2.pdf(x) * (len(y2) / len(ys)), color='C1', ls='--', label='d2')
# estimate Kernel Density and plot
kde = sps.gaussian_kde(ys)
plt.plot(x, kde.pdf(x), label='KDE')
# finish up
plt.legend()
plt.ylabel('Probability density')
sns.despine()
gives us the following plot:
I've tried to stick with a minimal feature set while producing relatively nice output; notably, using SciPy to estimate the KDE is very easy.