I have generated a probability density function in Python with scipy using the code below:
import matplotlib.pyplot as plt
from scipy.stats import gumbel_l
import numpy as np
data = gumbel_l.rvs(size=100000)
data = np.sort(data)
plt.hist(data, bins=50, density=True)
plt.plot(data, gumbel_l.pdf(data))
plt.show()
My question is whether there is a way to get a one-tailed version of this distribution, namely the left side, both to generate variates from it and to fit a pdf to it.
You can create a custom rv_continuous subclass. The minimum requirement is that the custom class provides a pdf. The pdf can be obtained from gumbel_l's pdf up to x = 0 and set to zero for positive x. The pdf then needs to be normalized so its area equals 1, which is done by dividing by gumbel_l's cdf(0).
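In formula form, with f and F denoting gumbel_l's pdf and cdf, the truncated pdf is f_trunc(x) = f(x) / F(0) for x <= 0, and f_trunc(x) = 0 for x > 0.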
With only the pdf implemented, you'll notice that obtaining random variates (.rvs) is rather slow. The question "Scipy rv_continuous very slow" explains that this can be remedied either by generating more variates than needed and discarding the values above the truncation point, or by providing an implementation for the ppf.
As the ppf can be obtained straightforwardly from gumbel_l's ppf, the code below implements that solution. A similar approach can be used to truncate at another position, or even to truncate at two points.
import matplotlib.pyplot as plt
from scipy.stats import gumbel_l, rv_continuous
import numpy as np
class gumbel_l_trunc_gen(rv_continuous):
    "truncated gumbel_l distribution"

    def __init__(self, name='gumbel_l_trunc'):
        self.gumbel_l_cdf_0 = gumbel_l.cdf(0)
        self.gumbel_trunc_normalize = 1 / self.gumbel_l_cdf_0
        super().__init__(name=name)

    def _pdf(self, x):
        return np.where(x <= 0, gumbel_l.pdf(x) * self.gumbel_trunc_normalize, 0)

    def _cdf(self, x):
        return np.where(x <= 0, gumbel_l.cdf(x) * self.gumbel_trunc_normalize, 1)

    def _ppf(self, x):
        return gumbel_l.ppf(x * self.gumbel_l_cdf_0)
gumbel_l_trunc = gumbel_l_trunc_gen()
data = gumbel_l_trunc.rvs(size=100000)
x = np.linspace(min(data), 1, 500)
plt.hist(data, bins=50, density=True)
plt.plot(x, gumbel_l_trunc.pdf(x))
plt.show()
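For reference, the oversample-and-discard alternative mentioned above could look roughly like the sketch below (my own illustration, not part of the class above; truncated_gumbel_l_rvs is a made-up name):
import numpy as np
from scipy.stats import gumbel_l

def truncated_gumbel_l_rvs(size, random_state=None):
    """Draw gumbel_l variates truncated at 0 by rejection sampling."""
    rng = np.random.default_rng(random_state)
    accept_prob = gumbel_l.cdf(0)  # fraction of draws expected to survive the cut
    accepted = np.empty(0)
    while accepted.size < size:
        # oversample so that, on average, enough draws fall at or below 0
        n_draw = int((size - accepted.size) / accept_prob * 1.1) + 1
        draws = gumbel_l.rvs(size=n_draw, random_state=rng)
        accepted = np.concatenate([accepted, draws[draws <= 0]])
    return accepted[:size]

data = truncated_gumbel_l_rvs(100000)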
I'm trying to determine the different time-periods present in a waveform (shown below) using find_peaks().
The dataset for which I'm finding the peaks, and eventually the time-period, is an autocorrelation plot (though it could have been any plot). I generated it using the following code (which you can use as-is for your reference):
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#GENERATION OF A FUNCTION WITH DUAL SEASONALITY & NOISE
def white_noise(mu, sigma, num_pts):
    """ Function to generate Gaussian Normal Noise
    Args:
        sigma: std value
        num_pts: no of points
        mu: mean value
    Returns:
        generated Gaussian Normal Noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise

def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
    """ Function to plot a time series signal
    Args:
        input_signal: time series signal that you want to plot
        title: title on plot
        y_label: label of the signal being plotted
    Returns:
        signal plot
    """
    plt.plot(input_signal)
    plt.title(title)
    plt.ylabel(y_label)
    plt.show()
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#x_daily_weekly_long is the final waveform on which I'm carrying out Autocorrelation
#PERFORMING AUTOCORRELATION:
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode = "same")
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#VISUALIZATION:
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
My approach to determining the time-period is as follows:
#1) Finding peak indices
indices = find_peaks(autocorr.flatten())[0]
#2) Determining Period
diff = [(indices[i - 1] - x) for i, x in enumerate(indices)][1:]
short_period = abs(np.mean(diff))/fs  # fs is the sampling frequency (not defined in this snippet)
short_period  # the average distance between consecutive peaks
In this way, I'm only able to find the time-period (distance) between individual peaks. But what I need is to determine the time-periods of all the multiple periodicities that the data may contain.
Can anyone please help with how to determine all possible time-periods present in the data?
I'm trying to remove the trend present in the waveform which looks like the following:
For doing so, I use scipy.signal.detrend() as follows:
autocorr = scipy.signal.detrend(autocorr)
But I don't see any significant flattening of the trend. I get the following:
My objective is to have the trend completely eliminated from the waveform. I also need to generalize it so that it can detrend any kind of waveform - be it linear, piece-wise linear, polynomial, etc.
Can you please suggest a way to do the same?
Note: In order to replicate the above waveform, you can simply run the following code that I used to generate it:
#Loading Libraries
import warnings
warnings.filterwarnings("ignore")
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#Generating a function with Dual Seasonality:
def white_noise(mu, sigma, num_pts):
    """ Function to generate Gaussian Normal Noise
    Args:
        sigma: std value
        num_pts: no of points
        mu: mean value
    Returns:
        generated Gaussian Normal Noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise

def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
    """ Function to plot a time series signal
    Args:
        input_signal: time series signal that you want to plot
        title: title on plot
        y_label: label of the signal being plotted
    Returns:
        signal plot
    """
    plt.plot(input_signal)
    plt.title(title)
    plt.ylabel(y_label)
    plt.show()
# Square wave with two periodicities, daily and weekly. With 15-min sampling this means 4*24=96 samples per day and 4*24*7=672 per week
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#Finding Autocorrelation & Lags for the signal [WHICH ARE THE FINAL PARAMETERS TO BE PLOTTED]:
#Determining Autocorrelation & Lag values
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode="same")
#Normalize the autocorr values (such that the highest peak value is at 1)
autocorr = (autocorr-min(autocorr))/(max(autocorr)-min(autocorr))
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#Visualization
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
#DETRENDING:
autocorr = signal.detrend(autocorr)
#Visualization
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
Since it's an auto-correlation, it will always be even; so detrending with a breakpoint at lag=0 should get you part of the way there.
An alternative way to detrend is to use a high-pass filter; you could do this in two ways. What will be tricky is deciding what the cut-off frequency should be.
Here's a possible way to do this:
#Loading Libraries
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
#Generating a function with Dual Seasonality:
def white_noise(mu, sigma, num_pts):
    """ Function to generate Gaussian Normal Noise
    Args:
        sigma: std value
        num_pts: no of points
        mu: mean value
    Returns:
        generated Gaussian Normal Noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise

# High-pass filter via discrete Fourier transform
# Drop all components from 0th to dropcomponent-th
def dft_highpass(x, dropcomponent):
    fx = np.fft.rfft(x)
    fx[:dropcomponent] = 0
    return np.fft.irfft(fx)
# Square wave with two periodicities, daily and weekly. With 15-min sampling this means 4*24=96 samples per day and 4*24*7=672 per week
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*signal.square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*signal.square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
#Finding Autocorrelation & Lags for the signal [WHICH ARE THE FINAL PARAMETERS TO BE PLOTTED]:
#Determining Autocorrelation & Lag values
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode="same")
#Normalize the autocorr values (such that the highest peak value is at 1)
autocorr = (autocorr-min(autocorr))/(max(autocorr)-min(autocorr))
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
# detrend w/ breakpoints
dautocorr = signal.detrend(autocorr, bp=len(lags)//2)
# detrend w/ high-pass filter
# use `filtfilt` to get zero-phase
b, a = signal.butter(1, 1e-3, 'high')
fautocorr = signal.filtfilt(b, a, autocorr)
# detrend with DFT HPF
rautocorr = dft_highpass(autocorr, len(autocorr) // 1000)
#Visualization
fig, ax = plt.subplots(3)
for i in range(3):
    ax[i].plot(lags, autocorr, label='orig')
ax[0].plot(lags, dautocorr, label='detrend w/ bp')
ax[1].plot(lags, fautocorr, label='HPF')
ax[2].plot(lags, rautocorr, label='DFT')
for i in range(3):
    ax[i].legend()
    ax[i].set_ylabel('autocorr')
ax[-1].set_xlabel('lags')
giving
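On the cut-off choice mentioned above, one rough heuristic (my own assumption, not something this answer prescribes) is to place it safely below the slowest periodicity you want to keep, here the weekly one at 672 samples:
from scipy import signal

# autocorr is the normalized autocorrelation computed in the script above
weekly_period = 672                    # samples per week at 15-min sampling
weekly_freq = 2 / weekly_period        # ~0.003 in units of the Nyquist frequency
b, a = signal.butter(1, 0.5 * weekly_freq, 'high')  # cut-off below the weekly component
fautocorr = signal.filtfilt(b, a, autocorr)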
I have been trying to implement an analog Bessel filter with a cutoff frequency of 2 kHz using scipy.signal, and I am confused about what value of Wn to set, as the documentation states that Wn (for analog filters) should be an angular frequency (approximately 12566 rad/s here). But if I apply this to my 1 second of dummy data, with a half-second pulse sampled at 500,000 Hz, I get a string of 0s and NaNs. What is it that I am missing?
import numpy as np
import scipy
import matplotlib.pyplot as plt
import scipy.signal
def make_signal(pulse_length, rate=500000):
    new_x = np.zeros(rate)
    end_signal = 250000 + pulse_length
    new_x[250000:end_signal] = 1
    data = new_x
    print(np.shape(data))
    # pad on both sides
    data = np.concatenate((np.zeros(rate), data, np.zeros(rate)))
    return data

def conv_time(t):
    pulse_length = t * 500000
    pulse_length = int(pulse_length)
    return pulse_length

def make_data(ti):  # give time in seconds
    pulse_length = conv_time(ti)
    print(pulse_length)
    data = make_signal(pulse_length)
    return data
time_scale = np.linspace(0,1,500000)
data = make_data(0.5)
[b,a] = scipy.signal.bessel(4, 12566.37, btype='low', analog=True, output='ba', norm='phase', fs=None)
output_signal = scipy.signal.filtfilt(b, a, data)
plt.plot(data[600000:800000])
plt.plot(output_signal[600000:800000])
When plotting the response using freqs, it doesn't seem that bad to me; where am I making a mistake?
You are passing an analog filter to a function, scipy.signal.filtfilt, that expects a digital (i.e. discrete time) filter. If you are going to use filtfilt or lfilter, the filter must be digital.
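As a sketch of that digital route (my own example, assuming the goal is still a 2 kHz low-pass applied with filtfilt; when fs is given, Wn is specified in Hz rather than rad/s):
import scipy.signal

fs = 500000  # sampling rate of the dummy data, in Hz
b, a = scipy.signal.bessel(4, 2000, btype='low', analog=False,
                           output='ba', norm='phase', fs=fs)
# data is the padded pulse returned by make_data(0.5) in the question
output_signal = scipy.signal.filtfilt(b, a, data)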
To work with continuous time systems, take a look at the functions
scipy.signal.impulse (scipy.signal.impulse2)
scipy.signal.step (scipy.signal.step2)
scipy.signal.lsim (scipy.signal.lsim2)
(The 2 versions solve the same mathematical problem as the version without 2 but use a different method. In most cases, the version without 2 is fine and is much faster than the 2 version.)
Other related functions and classes are listed in the section Continuous-Time Linear Systems of the SciPy documentation.
For example, here's a script that plots the impulse and step responses of your Bessel filter:
import numpy as np
from scipy.signal import bessel, step, impulse
import matplotlib.pyplot as plt
order = 4
Wn = 2*np.pi * 2000
b, a = bessel(order, Wn, btype='low', analog=True, output='ba', norm='phase')
# Note: the upper limit for t was chosen after some experimentation.
# If you don't give a T argument to impulse or step, it will choose
# a "pretty good" time span.
t = np.linspace(0, 0.00125, 2500, endpoint=False)
timp, yimp = impulse((b, a), T=t)
tstep, ystep = step((b, a), T=t)
plt.subplot(2, 1, 1)
plt.plot(timp, yimp, label='impulse response')
plt.legend(loc='upper right', framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.title('Impulse and step response of the Bessel filter')
plt.subplot(2, 1, 2)
plt.plot(tstep, ystep, label='step response')
plt.legend(loc='lower right', framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.xlabel('t')
plt.show()
The script generates this plot:
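If you also want to run the analog filter over a sampled input signal, scipy.signal.lsim (listed above) can simulate the continuous-time system. A minimal sketch, using a shorter pulse than in the question so it runs quickly:
import numpy as np
from scipy.signal import bessel, lsim
import matplotlib.pyplot as plt

fs = 500000
b, a = bessel(4, 2*np.pi*2000, btype='low', analog=True, output='ba', norm='phase')

t = np.arange(0, 0.02, 1/fs)                   # 20 ms of signal
u = ((t > 0.005) & (t < 0.015)).astype(float)  # 10 ms rectangular pulse

tout, yout, _ = lsim((b, a), U=u, T=t)         # simulate the analog filter

plt.plot(t, u, label='input pulse')
plt.plot(tout, yout, label='filtered (lsim)')
plt.legend()
plt.show()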
Following this post, I tried to create a logit-normal distribution by creating the LogitNormal class:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit
from scipy.stats import norm, rv_continuous
class LogitNormal(rv_continuous):
    def _pdf(self, x, **kwargs):
        return norm.pdf(logit(x), **kwargs)/(x*(1-x))

class OtherLogitNormal:
    def pdf(self, x, **kwargs):
        return norm.pdf(logit(x), **kwargs)/(x*(1-x))
fig, ax = plt.subplots()
values = np.linspace(10e-10, 1-10e-10, 1000)
sigma, mu = 1.78, 0
ax.plot(
values, LogitNormal().pdf(values, loc=mu, scale=sigma), label='subclassed'
)
ax.plot(
values, OtherLogitNormal().pdf(values, loc=mu, scale=sigma),
label='not subclassed'
)
ax.legend()
fig.show()
However, the LogitNormal class does not produce the desired results. When I don't subclass rv_continuous it works. Why is that? I need the subclassing to work because I also need the other methods that come with it, like rvs.
By the way, the only reason I am creating my own logit-normal distribution in Python is that the only implementations of it I could find are in the PyMC3 and TensorFlow packages, both of which are pretty heavy / overkill if you only need them for that one function. I already tried PyMC3, but apparently it doesn't play well with scipy; it always crashed for me. But that's a whole different story.
Foreword
I came across this problem this week, and the only relevant discussion I found about it is this post. I have almost the same requirement as the OP:
Having a random variable for the Logit Normal distribution.
But I also need:
To be able to perform statistical tests as well;
While being compliant with the scipy random variable interface.
As @Jacques Gaudin pointed out, the rv_continuous interface (see the distribution architecture for details) does not pass the loc and scale parameters through when you inherit from this class, which is somewhat misleading and unfortunate.
Implementing the __init__ method does, of course, create the missing binding, but the trade-off is that it breaks the pattern scipy currently uses to implement random variables (see, for example, the lognormal implementation).
So I took the time to dig into the scipy code and created an MCVE for this distribution. Although it is not completely finished (it mainly lacks moment overrides), it fits the bill for both the OP's purposes and mine, with satisfactory accuracy and performance.
MCVE
An interface-compliant implementation of this random variable could be:
import numpy as np
from scipy import special, stats

class logitnorm_gen(stats.rv_continuous):
    def _argcheck(self, m, s):
        return (s > 0.) & (m > -np.inf)

    def _pdf(self, x, m, s):
        return stats.norm(loc=m, scale=s).pdf(special.logit(x))/(x*(1-x))

    def _cdf(self, x, m, s):
        return stats.norm(loc=m, scale=s).cdf(special.logit(x))

    def _rvs(self, m, s, size=None, random_state=None):
        return special.expit(m + s*random_state.standard_normal(size))

    def fit(self, data, **kwargs):
        return stats.norm.fit(special.logit(data), **kwargs)

logitnorm = logitnorm_gen(a=0.0, b=1.0, name="logitnorm")
This implementation unlocks most of the potential of scipy random variables.
N = 1000
law = logitnorm(0.24, 1.31) # Defining a RV
sample = law.rvs(size=N) # Sampling from RV
params = logitnorm.fit(sample) # Infer parameters w/ MLE
check = stats.kstest(sample, law.cdf) # Hypothesis testing
bins = np.arange(0.0, 1.1, 0.1) # Bin boundaries
expected = np.diff(law.cdf(bins)) # Expected bin counts
As it relies on scipy's normal distribution, we may assume the underlying functions have the same accuracy and performance as the normal random variable object. But it might indeed be subject to floating-point inaccuracy, especially when dealing with highly skewed distributions at the support boundary.
Tests
To check how it performs, we draw some distributions of interest and check them.
Let's create some fixtures:
import itertools

def generate_fixtures(
    locs=[-2.0, -1.0, 0.0, 0.5, 1.0, 2.0],
    scales=[0.32, 0.56, 1.00, 1.78, 3.16],
    sizes=[100, 1000, 10000],
    seeds=[789, 123456, 999999]
):
    for (loc, scale, size, seed) in itertools.product(locs, scales, sizes, seeds):
        yield {"parameters": {"loc": loc, "scale": scale}, "size": size, "random_state": seed}
And perform checks on related distributions and samples:
eps = 1e-8
x = np.linspace(0. + eps, 1. - eps, 10000)

for fixture in generate_fixtures():
    # Reference:
    parameters = fixture.pop("parameters")
    normal = stats.norm(**parameters)
    sample = special.expit(normal.rvs(**fixture))

    # Logit Normal Law:
    law = logitnorm(m=parameters["loc"], s=parameters["scale"])
    check = law.rvs(**fixture)

    # Fit:
    p = logitnorm.fit(sample)
    trial = logitnorm(*p)
    resample = trial.rvs(**fixture)

    # Hypothesis Tests:
    ks = stats.kstest(check, trial.cdf)
    bins = np.histogram(resample)[1]
    obs = np.diff(trial.cdf(bins))*fixture["size"]
    ref = np.diff(law.cdf(bins))*fixture["size"]
    chi2 = stats.chisquare(obs, ref, ddof=2)
Some fits with n=1000 and seed=789 (this sample is quite normal) are shown below:
If you look at the source code of the pdf method, you will notice that _pdf is called without the scale and loc keyword arguments.
if np.any(cond):
    goodargs = argsreduce(cond, *((x,)+args+(scale,)))
    scale, goodargs = goodargs[-1], goodargs[:-1]
    place(output, cond, self._pdf(*goodargs) / scale)
As a result, the kwargs in your overriding _pdf method is always an empty dictionary.
If you look a bit closer at the code, you will also notice that the scaling and location are handled by pdf as opposed to _pdf.
In your case, the _pdf method calls norm.pdf, so the loc and scale parameters must somehow be made available in LogitNormal._pdf.
You could, for example, pass scale and loc when creating an instance of LogitNormal and store the values as instance attributes:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit
from scipy.stats import norm, rv_continuous
class LogitNormal(rv_continuous):
    def __init__(self, scale=1, loc=0):
        super().__init__()
        self.scale = scale
        self.loc = loc

    def _pdf(self, x):
        return norm.pdf(logit(x), loc=self.loc, scale=self.scale)/(x*(1-x))
fig, ax = plt.subplots()
values = np.linspace(10e-10, 1-10e-10, 1000)
sigma, mu = 1.78, 0
ax.plot(
values, LogitNormal(scale=sigma, loc=mu).pdf(values), label='subclassed'
)
ax.legend()
fig.show()
I have a dataset whose samples are discrete values (in particular, the size of a queue over time). Now I'd like to find out which distribution they follow. To achieve this goal, I'd proceed the same way I did for the other quantities, i.e. plotting a qqplot by running
import statsmodels.api as sm
sm.qqplot(df, dist = 'geom', sparams = (.5,), line ='s', alpha = 0.3, marker ='.')
This works if dist is not a discrete random variable (e.g. 'exp' or 'norm'), and indeed I got some results, but when the distribution is discrete (say, 'geom'), I get
AttributeError: 'geom_gen' object has no attribute 'fit'
I searched the Internet for how to make a qqplot (or something similar) to spot which distribution my samples follow, but I found nothing.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def discreteQQ(x_sample):
    # probabilities at which to compare quantiles
    p_test = np.arange(0, 1001) / 1000
    x_sample = np.sort(x_sample)
    # empirical CDF positions of the sorted sample
    ecdf_sample = np.arange(1, len(x_sample) + 1) / (len(x_sample) + 1)
    # theoretical quantiles of a geometric distribution with p = 0.5
    x_theor = stats.geom.ppf(ecdf_sample, p=0.5)
    for p in p_test:
        plt.scatter(np.quantile(x_theor, p), np.quantile(x_sample, p), c='blue')
    plt.xlabel('Theoretical quantiles')
    plt.ylabel('Sample quantiles')
    plt.show()
Generate a theoretical geometric distribution using scipy.stats.geom, convert the sample and theoretical data using statsmodels' ProbPlot and pass these to statsmodels' qqplot_2samples.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.graphics.gofplots import qqplot_2samples
p_theor = 1/4 # The probability we check for
p_sample = 1/5 # The true probability of the sample distribution
# The experimental data
x_sample = stats.geom.rvs(p_sample, size=50)
# The model data
x_theor = stats.geom.rvs(p_theor, size=100)
qqplot_2samples(ProbPlot(x_sample), ProbPlot(x_theor), line='45')
plt.show()