Confidence interval for LOWESS in Python - python

How would I calculate the confidence intervals for a LOWESS regression in Python? I would like to add these as a shaded region to the LOESS plot created with the following code (other packages than statsmodels are fine as well).
import numpy as np
import pylab as plt
import statsmodels.api as sm
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
lowess = sm.nonparametric.lowess(y, x, frac=0.1)
plt.plot(x, y, '+')
plt.plot(lowess[:, 0], lowess[:, 1])
plt.show()
I've added an example plot with a confidence interval below, taken from the blog Serious Stats (it was created using ggplot in R).

LOESS doesn't have an explicit concept of standard error. It just doesn't mean anything in this context. Since that's out, you're stuck with the brute-force approach.
Bootstrap your data. You're going to fit a LOESS curve to each bootstrapped sample. See the middle of this page for a pretty picture of what you're doing: http://statweb.stanford.edu/~susan/courses/s208/node20.html
Once you have a large number of different LOESS curves, you can find the top and bottom Xth percentile at each x value.

This is a very old question, but it's one of the first that pops up in a Google search. You can do this using the loess() function from scikit-misc. Here's an example (I tried to keep your original variable names, but I bumped up the noise a bit to make it more visible):
import numpy as np
import pylab as plt
from skmisc.loess import loess
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.4
l = loess(x,y)
l.fit()
pred = l.predict(x, stderror=True)
conf = pred.confidence()
lowess = pred.values
ll = conf.lower
ul = conf.upper
plt.plot(x, y, '+')
plt.plot(x, lowess)
plt.fill_between(x,ll,ul,alpha=.33)
plt.show()

For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie: a Python library for time-series smoothing and outlier detection in a vectorized way.
It provides different smoothing algorithms together with the ability to compute intervals.
In the case of LowessSmoother:
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import LowessSmoother
from tsmoothie.utils_func import sim_randomwalk
# generate 10 randomwalks of length 200
np.random.seed(33)
data = sim_randomwalk(n_series=10, timesteps=200,
                      process_noise=10, measure_noise=30)
# operate smoothing
smoother = LowessSmoother(smooth_fraction=0.1, iterations=1)
smoother.smooth(data)
# generate intervals
low, up = smoother.get_intervals('prediction_interval', confidence=0.05)
# plot the first smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)
I would also point out that tsmoothie can smooth multiple time series in a vectorized way. Hope this helps someone.

Related

Can I plot the error band using the uncertainties of curve fitting (Python)?

Here's the least-squares quadratic fitting result for my data: y = 0.06(+/- 0.16)x**2-0.65(+/-0.04)x+1.2(+/-0.001). I wonder whether there is a direct way to plot the fit as well as the error band. I found a similar example which used the plt.fill_between method. However, in that example the boundaries are known, while in my case I'm not sure which exact parameters correspond to the boundaries. I don't know whether I could use plt.fill_between or need a different approach. Thanks!
You can use seaborn.regplot to calculate the fit and plot it directly (order=2 gives a second-order fit).
Here is a dummy example:
import seaborn as sns
import numpy as np
xs = np.linspace(0, 10, 50)
ys = xs**2+xs+1+np.random.normal(scale=20, size=50)
sns.regplot(x=xs, y=ys, order=2)
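If you would rather build the band yourself from the fit uncertainties, one option is to propagate the covariance matrix of the coefficients through the polynomial, instead of shifting each coefficient by its own error (which ignores correlations between them). The sketch below assumes roughly Gaussian errors and uses a band of two standard errors; the dummy data and the names x_fit, design and y_err are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy data, in the spirit of the example above.
xs = np.linspace(0, 10, 50)
ys = xs**2 + xs + 1 + rng.normal(scale=20, size=50)

# Quadratic least-squares fit; cov=True also returns the covariance
# matrix of the coefficients (highest power first).
coeffs, cov = np.polyfit(xs, ys, 2, cov=True)

# Design matrix with one row [x**2, x, 1] per evaluation point.
x_fit = np.linspace(xs.min(), xs.max(), 200)
design = np.vander(x_fit, 3)

y_fit = design @ coeffs
# Variance of the fitted curve at each x: diag(design @ cov @ design.T).
y_var = np.einsum("ij,jk,ik->i", design, cov, design)
y_err = np.sqrt(y_var)

# Approximate two-standard-error band around the fitted curve.
lower = y_fit - 2 * y_err
upper = y_fit + 2 * y_err
```

Plotting is then plt.plot(x_fit, y_fit) followed by plt.fill_between(x_fit, lower, upper, alpha=0.3).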

matplotlib hist function argument density not working

plt.hist's density argument does not work.
I tried to use the density argument in the plt.hist function to normalize stock returns in my plot, but it didn't work.
The following code worked fine for me and gave me the probability density function I wanted.
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(19680801)
# example data
mu = 100 # mean of distribution
sigma = 15 # standard deviation of distribution
x = mu + sigma * np.random.randn(437)
num_bins = 50
plt.hist(x, num_bins, density=True)
plt.show()
But when I tried it with stock data, it simply didn't work. The result gave the unnormalized data. I didn't find any abnormal data in my data array.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
plt.hist(returns, 50,density = True)
plt.show()
# "returns" is a np array consisting of 360 days of stock returns
This is a known issue in Matplotlib.
As stated in the bug report "The density flag in pyplot.hist() does not work correctly":
When density = False, the histogram plot would have counts on the Y-axis. But when density = True, the Y-axis does not mean anything useful. I think a better implementation would plot the PDF as the histogram when density = True.
The developers view this as a feature, not a bug, since it maintains compatibility with numpy. They have already closed several bug reports about it, since it is working as intended. Adding to the confusion, the example on the matplotlib site appears to show this feature working, with the y-axis being assigned a meaningful value.
What you want to do with matplotlib is reasonable, but matplotlib will not let you do it that way.
It is not a bug.
The area of the bars is equal to 1.
The numbers only seem strange because your bin sizes are small.
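The point about area can be checked numerically with np.histogram, which plt.hist uses for the binning. The returns array below is hypothetical stand-in data with a small spread, not the asker's stock data.

```python
import numpy as np

rng = np.random.default_rng(19680801)

# Hypothetical daily returns with a small spread (illustrative values).
returns = rng.normal(loc=0.0, scale=0.01, size=360)

heights, edges = np.histogram(returns, bins=50, density=True)
widths = np.diff(edges)

area = np.sum(heights * widths)   # total area under the bars
tallest = heights.max()           # can be far above 1 when bins are narrow
```

With bins this narrow the bar heights are large, but the heights multiplied by the bin widths still sum to 1, which is what density=True guarantees.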
Since this isn't resolved: @user14518925's response is actually correct. Matplotlib treats the bin width as an actual number, whereas, from my understanding, you want each bin to have a width of 1 so that the sum of the frequencies is 1. More succinctly, what you're seeing is:
\sum_{i}y_{i}\times\text{bin size} =1
Whereas what you want is:
\sum_{i}y_{i} =1
Therefore, all you really need to change is the tick labels on the y-axis. One way to do this is to disable the density option:
density=False
and instead divide by the total sample size, as follows (using your example):
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(19680801)
# example data
mu = 0 # mean of distribution
sigma = 0.0000625 # standard deviation of distribution
x = mu + sigma * np.random.randn(437)
fig = plt.figure()
plt.hist(x, 50, density=False)
locs, _ = plt.yticks()
print(locs)
plt.yticks(locs,np.round(locs/len(x),3))
plt.show()
Another approach, besides that of tvbc, is to change the yticks on the plot.
import matplotlib.pyplot as plt
import numpy as np
steps = 10
bins = np.arange(0, 101, steps)
data = np.random.random(100000) * 100
plt.hist(data, bins=bins, density=True)
yticks = plt.gca().get_yticks()
plt.yticks(yticks, np.round(yticks * steps, 2))
plt.show()
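A third option, which avoids relabelling ticks, is to give every sample a weight of 1/N so that the bar heights themselves are fractions of the data. The sketch below checks this with np.histogram; the data and bins mirror the example above, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000) * 100
bins = np.arange(0, 101, 10)

# Each sample contributes 1/N, so the bar heights are fractions of the data.
weights = np.full(data.size, 1.0 / data.size)
fractions, edges = np.histogram(data, bins=bins, weights=weights)
```

plt.hist accepts the same weights argument, so plt.hist(data, bins=bins, weights=weights) draws bars whose heights sum to 1.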

Python piecewise function interpolation

I am trying to construct a function that gives me interpolated values of a piecewise linear function. I tried linear spline interpolation (which should be able to do exactly this?), but without any luck. The problem is most visible on a log-scale plot. Below is the code of a small example I prepared:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
from scipy import interpolate
#Original Data
pwl_data = np.array([[0,1e3, 1e5, 1e8], [-90,-90, -90, -130]])
# spline interpolation
pwl_spline = interpolate.splrep(pwl_data[0], pwl_data[1])
spline_x = np.linspace (0,1e8, 10000)
legend = []
plt.plot(pwl_data[0],pwl_data[1])
plt.plot(spline_x,interpolate.splev(spline_x,pwl_spline ),'*')
legend.append("Data")
legend.append("Interpolated Data")
plt.xscale('log')
plt.legend(legend)
plt.grid(True)
plt.grid(True, which='minor', linestyle='--')
plt.show()
What am I doing wrong?
The spline fitting has to be performed on the linearized data, i.e. using log(x) instead of x:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
#Original Data
pwl_data = np.array([[1, 1e3, 1e5, 1e8], [-90, -90, -90, -130]])
x = pwl_data[0]
y = pwl_data[1]
log_x = np.log(x)
# spline interpolation
pwl_spline = interpolate.splrep(log_x, y)
spline_log_x = np.linspace(0, 18, 30)
spline_y = interpolate.splev(spline_log_x, pwl_spline )
plt.plot(log_x, y, '-o')
plt.plot(spline_log_x, spline_y, '-*')
plt.xlabel('log(x)');
Note: I removed the zero from the data, since log(0) is undefined. Also, spline fitting may not be the best choice if you want a piecewise linear function; you could have a look at this question, for example: https://datascience.stackexchange.com/q/8457/53362
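If the goal is a function that is exactly piecewise linear between the breakpoints on a log-scale x axis, plain linear interpolation in log10(x) may be enough. This is a minimal sketch under that assumption (linear in log10(x), not in x); the helper name pwl_logx is made up for the example.

```python
import numpy as np

# Breakpoints from the answer above (zero removed so the log is defined).
x = np.array([1, 1e3, 1e5, 1e8])
y = np.array([-90, -90, -90, -130])

def pwl_logx(x_query):
    """Piecewise linear interpolation in log10(x)."""
    return np.interp(np.log10(x_query), np.log10(x), y)

x_query = np.logspace(0, 8, 9)   # 1, 10, ..., 1e8
y_query = pwl_logx(x_query)
```

Unlike the cubic spline, this cannot overshoot between breakpoints: the flat part stays at -90 and the last segment falls linearly (in log10(x)) to -130.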
For plotting with matplotlib, consider matplotlib's step function, which internally performs a piecewise constant interpolation.
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.step.html
You can invoke it simply via
plt.step(x,y), given your inputs x and y.
In plotly, the argument line_shape='hv' for the Scatter plot achieves similar results; see https://plotly.com/python/line-charts/

How to estimate density function and calculate its peaks?

I have started to use Python for analysis. I would like to do the following:
Get the distribution of a dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? I am using data that changes over time, so if the data has one distribution (e.g. Gaussian) now, it could have another distribution later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in question to try to fit the data to every distribution in order to find the data distribution. So what is the difference between using gaussian_kde and the answer provided in question? I used the code below. I would also like to know whether gaussian_kde is a good way to estimate the pdf if the data changes over time. I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as in here. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
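df = pd.read_csv(r'D:\dataset.csv')
# assumes the data of interest is the first column of the CSV
data_column = df.iloc[:, 0]
pdf = scipy.stats.gaussian_kde(data_column)
x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))
y = pdf(x)
pl.plot(x, y, color='r')
pl.hist(data_column, density=True)
pl.show(block=True)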
I think you need to distinguish non-parametric density estimation (the one implemented in scipy.stats.gaussian_kde) from parametric density estimation (the one in the StackOverflow question you mention). To illustrate the difference between the two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df.values.ravel())
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values.ravel(), bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, for a particular observation x=x0, we use a bar to represent it (put all probability mass on that single point x=x0 and zero elsewhere), whereas in non-parametric density estimation we use a bell-shaped curve (the Gaussian kernel) to represent that point (it spreads over its neighbourhood). The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with your distributional assumption about the underlying data x. Its sole purpose is smoothing.
To get the mode of the non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to be unimodal. As shown in the example above, if your quasi-Newton optimization algorithm starts between [5,10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search over the evaluation grid
x[np.argmax(nparam_density)]
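The grid search above returns only the global mode. To list every peak of the estimated density, one possibility is to run scipy.signal.find_peaks on the density values evaluated on a grid. This is a rough sketch: the grid size and the prominence threshold (5% of the tallest peak) are illustrative assumptions, and the sample is regenerated here rather than reusing the arrays above.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Same kind of two-component mixture as in the example above.
samples = np.hstack([-6 + 3 * rng.standard_normal(1700),
                     4 + 1.5 * rng.standard_normal(300)])

grid = np.linspace(-20, 10, 600)
density = gaussian_kde(samples)(grid)

# Keep local maxima whose prominence is at least 5% of the tallest peak
# (an illustrative threshold, used to ignore tiny bumps in the tails).
peak_idx, _ = find_peaks(density, prominence=0.05 * density.max())
peak_x = grid[peak_idx]
peak_y = density[peak_idx]

# The global mode is the highest of the detected peaks.
mode = peak_x[np.argmax(peak_y)]
```

The peak locations are only as precise as the grid spacing, so a finer grid (or a local optimizer started at each detected peak) would be needed for more accuracy.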

Basic plotting of wavelet analysis output in matplotlib

I am discovering wavelets in practice thanks to the Python module pywt.
I have browsed some examples of pywt usage, but I could not grasp the essential step: basically, I don't know how to display the multidimensional output of a wavelet analysis with matplotlib.
This is what I tried (given a pyplot axes ax):
import pywt
data_1_dimension_series = [0,0.1,0.2,0.4,-0.1,-0.1,-0.3,-0.4,1.0,1.0,1.0,0]
# indeed my data_1_dimension_series is much longer
cA, cD = pywt.dwt(data_1_dimension_series, 'haar')
ax.set_xlabel('seconds')
ax.set_ylabel('wavelet affinity by scale factor')
ax.plot(axe_wt_time, zip(cA,cD))
or also
data_wt_analysis = pywt.dwt(data_1_dimension_series, 'haar')
ax.plot(axe_wt_time, data_wt_analysis)
Neither ax.plot(axe_wt_time, data_wt_analysis) nor ax.plot(axe_wt_time, zip(cA,cD)) works. Both raise the error: x and y must have the same first dimension
The thing is, data_wt_analysis does contain several 1D series, one for each wavelet scale factor.
I could certainly display as many graphs as there are scale factors, but I want them all in the same graph.
How can I simply display such data in a single graph with matplotlib?
Something like the colourful square below:
You should extract the different 1D series from your array of interest and use matplotlib as in the simplest example
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
from the docs.
You wish to superimpose 1D plots (or line plots). So, if you have lists l1, l2, l3, you will do
import matplotlib.pyplot as plt
plt.plot(l1)
plt.plot(l2)
plt.plot(l3)
plt.show()
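The "same first dimension" error in the question arises because each coefficient array from a single-level transform is about half as long as the input, so it cannot share the original time vector. The sketch below computes a single-level Haar transform directly in numpy to keep the example self-contained; for even-length input it is expected to give the same cA and cD as pywt.dwt(data, 'haar'). The time axis (one sample per second) and the helper plot_coefficients are assumptions made for illustration.

```python
import numpy as np

data = np.array([0, 0.1, 0.2, 0.4, -0.1, -0.1, -0.3, -0.4, 1.0, 1.0, 1.0, 0])
time = np.arange(data.size)   # hypothetical time axis: one sample per second

# Single-level Haar transform on non-overlapping pairs of samples.
pairs = data.reshape(-1, 2)   # requires an even number of samples
cA = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # approximation coefficients
cD = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # detail coefficients

# Each coefficient covers two samples, so give it the midpoint of its pair.
coeff_time = time.reshape(-1, 2).mean(axis=1)

def plot_coefficients(ax):
    """Draw both coefficient series on one existing matplotlib axes."""
    ax.plot(coeff_time, cA, label="cA (approximation)")
    ax.plot(coeff_time, cD, label="cD (detail)")
    ax.set_xlabel("seconds")
    ax.legend()
```

Calling plot_coefficients(ax) with the question's axes puts both series on the same graph, each against a time vector of matching length.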
For a scalogram, what I used was imshow(). That was not for wavelets, but it's the same idea: a colormap.
I found this sample that uses imshow() with wavelets; I haven't tried it, though:
from pylab import *
import pywt
import scipy.io.wavfile as wavfile
# Find the highest power of two less than or equal to the input.
def lepow2(x):
    # cast to int so the result can be used as a slice index
    return int(2 ** floor(log2(x)))
# Make a scalogram given an MRA tree.
def scalogram(data):
    bottom = 0
    vmin = min(map(lambda x: min(abs(x)), data))
    vmax = max(map(lambda x: max(abs(x)), data))
    gca().set_autoscale_on(False)
    for row in range(0, len(data)):
        scale = 2.0 ** (row - len(data))
        imshow(
            array([abs(data[row])]),
            interpolation='nearest',
            vmin=vmin,
            vmax=vmax,
            extent=[0, 1, bottom, bottom + scale])
        bottom += scale
# Load the signal, take the first channel, limit length to a power of 2 for simplicity.
rate, signal = wavfile.read('kitten.wav')
signal = signal[0:lepow2(len(signal)), 0]
tree = pywt.wavedec(signal, 'db5')
# Plotting.
gray()
scalogram(tree)
show()
