I am trying to understand how to fit a probability distribution function, such as Pearson type 3, to a data set (specifically, mean annual rainfall in an area). I've read some questions about this, but I am still missing something and the fit doesn't come out right. For now, my code is this (the specific data file can be downloaded from here):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearson3
year,mm = np.loadtxt('yearly_mm_sde_boker_month_1960_2016.csv',delimiter=',').T
fig,ax=plt.subplots(1,2,figsize=(2*1.62*3,3))
ax[0].plot(year,mm)
dump=ax[1].hist(mm)
size = len(year)
param = pearson3.fit(mm)
pdf_fitted = pearson3.pdf(year, *param[:-2], loc=param[-2], scale=param[-1]) * size
plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0,len(year))
plt.legend(loc='upper right')
plt.show()
What am I missing?
Well, this works:
param = pearson3.fit(mm)  # distribution fitting
# param is (skew, loc, scale): the shape parameter comes first,
# followed by the location and scale of the fitted distribution
x = np.linspace(0, 200, 100)
# fitted distribution, evaluated on rainfall values (not on the years)
pdf_fitted = pearson3.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
plt.title('Pearson 3 distribution')
plt.plot(x, pdf_fitted, 'r-')
# density=True normalizes the histogram so it is comparable to the PDF
# (the older normed=1 argument was removed from matplotlib)
dump = plt.hist(mm, density=True, alpha=.3)
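The same steps can be checked without the CSV file. The sketch below is only an illustration: the synthetic series `mm_demo` and the "true" parameter values are made up and stand in for the real rainfall data. It fits `pearson3`, evaluates the fitted PDF over the range of the data, and confirms that the PDF integrates to roughly 1 there.

```python
import numpy as np
from scipy.stats import pearson3

# Synthetic stand-in for the rainfall series (hypothetical values, not the real CSV)
true_skew, true_loc, true_scale = 0.8, 100.0, 30.0
mm_demo = pearson3.rvs(true_skew, loc=true_loc, scale=true_scale,
                       size=2000, random_state=0)

# fit returns the shape parameter (skew) first, then loc and scale
skew_hat, loc_hat, scale_hat = pearson3.fit(mm_demo)

# Evaluate the fitted PDF on a grid of rainfall values
x_demo = np.linspace(mm_demo.min(), mm_demo.max(), 200)
pdf_demo = pearson3.pdf(x_demo, skew_hat, loc=loc_hat, scale=scale_hat)

# Trapezoidal integration of the PDF over the sampled range
area = np.sum(0.5 * (pdf_demo[1:] + pdf_demo[:-1]) * np.diff(x_demo))
print(skew_hat, loc_hat, scale_hat, area)
```

Plotting `x_demo` against `pdf_demo` on top of a histogram drawn with `density=True` gives the same kind of overlay as in the answer.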
I'm trying to remove the trend present in the waveform which looks like the following:
For doing so, I use scipy.signal.detrend() as follows:
autocorr = scipy.signal.detrend(autocorr)
But I don't see any significant flattening in trend. I get the following:
My objective is to have the trend completely eliminated from the waveform. I also need to generalize this so that it can detrend any kind of waveform, be it linear, piece-wise linear, polynomial, etc.
Can you please suggest a way to do this?
Note: In order to replicate the above waveform, you can simply run the following code that I used to generate it:
#Loading Libraries
import warnings
warnings.filterwarnings("ignore")
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#Generating a function with Dual Seasonality:
def white_noise(mu, sigma, num_pts):
    """ Function to generate Gaussian Normal Noise
    Args:
        sigma: std value
        num_pts: no of points
        mu: mean value
    Returns:
        generated Gaussian Normal Noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise
def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
    """ Function to plot a time series signal
    Args:
        input_signal: time series signal that you want to plot
        title: title on plot
        y_label: label of the signal being plotted
    Returns:
        signal plot
    """
    plt.plot(input_signal)
    plt.title(title)
    plt.ylabel(y_label)
    plt.show()
# Square with two periodicities of daily and weekly. With #15min sampling frequency it means 4*24=96 samples and 4*24*7=672
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#Finding Autocorrelation & Lags for the signal [THESE ARE THE FINAL PARAMETERS TO BE PLOTTED]:
#Determining Autocorrelation & Lag values
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode="same")
#Normalize the autocorr values (such that the highest peak value is at 1)
autocorr = (autocorr-min(autocorr))/(max(autocorr)-min(autocorr))
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#Visualization
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
#DETRENDING:
autocorr = signal.detrend(autocorr)  # scipy.signal was imported above as `signal`
#Visualization
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
Since it's an auto-correlation, it will always be even; so detrending with a breakpoint at lag=0 should get you part of the way there.
An alternative way to detrend is to use a high-pass filter; you could do this in two ways. What will be tricky is deciding what the cut-off frequency should be.
Here's a possible way to do this:
#Loading Libraries
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
#Generating a function with Dual Seasonality:
def white_noise(mu, sigma, num_pts):
    """ Function to generate Gaussian Normal Noise
    Args:
        sigma: std value
        num_pts: no of points
        mu: mean value
    Returns:
        generated Gaussian Normal Noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise
# High-pass filter via discrete Fourier transform
# Drop all components from 0th to dropcomponent-th
def dft_highpass(x, dropcomponent):
    fx = np.fft.rfft(x)
    fx[:dropcomponent] = 0
    # pass n explicitly so that odd-length inputs keep their length
    return np.fft.irfft(fx, n=len(x))
# Square with two periodicities of daily and weekly. With #15min sampling frequency it means 4*24=96 samples and 4*24*7=672
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*signal.square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*signal.square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
#Finding Autocorrelation & Lags for the signal [THESE ARE THE FINAL PARAMETERS TO BE PLOTTED]:
#Determining Autocorrelation & Lag values
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode="same")
#Normalize the autocorr values (such that the highest peak value is at 1)
autocorr = (autocorr-min(autocorr))/(max(autocorr)-min(autocorr))
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
# detrend w/ breakpoints
dautocorr = signal.detrend(autocorr, bp=len(lags)//2)
# detrend w/ high-pass filter
# use `filtfilt` to get zero-phase
b, a = signal.butter(1, 1e-3, 'high')
fautocorr = signal.filtfilt(b, a, autocorr)
# detrend with DFT HPF
rautocorr = dft_highpass(autocorr, len(autocorr) // 1000)
#Visualization
fig, ax = plt.subplots(3)
for i in range(3):
    ax[i].plot(lags, autocorr, label='orig')
ax[0].plot(lags, dautocorr, label='detrend w/ bp')
ax[1].plot(lags, fautocorr, label='HPF')
ax[2].plot(lags, rautocorr, label='DFT')
for i in range(3):
    ax[i].legend()
    ax[i].set_ylabel('autocorr')
ax[-1].set_xlabel('lags')
giving
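To see why the breakpoint matters, here is a minimal sketch on a made-up signal: a small cosine riding on a tent-shaped trend, which is the even, piece-wise linear shape an autocorrelation tends to have. The names `lags_demo`, `trend` and `periodic` are illustrative only and are not taken from the code above.

```python
import numpy as np
from scipy import signal

# Hypothetical stand-in for an autocorrelation: a periodic component
# riding on a tent-shaped (piece-wise linear, even) trend
n = 2001
lags_demo = np.arange(n) - n // 2
trend = 1.0 - np.abs(lags_demo) / (n // 2)
periodic = 0.1 * np.cos(2 * np.pi * lags_demo / 100)
y = trend + periodic

# A single straight-line fit cannot follow the tent shape ...
flat_one_segment = signal.detrend(y)
# ... but a breakpoint at lag 0 (the middle sample) gives each half its own line
flat_two_segments = signal.detrend(y, bp=n // 2)

print(np.std(flat_one_segment), np.std(flat_two_segments))
```

With the breakpoint, the residual is essentially the periodic component alone. A trend that is not piece-wise linear would need a different approach, such as the high-pass filters shown above.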
I have 136 numbers which follow an overlapping distribution of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian distribution. Can you find any mistakes in my code?
file = open("1.txt",'r') #data is in 1.txt like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y=[int (i) for i in list((file.read()).split(','))] # I want to make list which element is above data
x=list(range(1,len(y)+1)) # it is x values
z=list(zip(x,y)) # z elements consist as (1, 0), (2, 0), ...
Through the above process, I obtained a list z whose elements are the 136 points (x, y) on the xy plane, with the given data as the y values.
Now I want to obtain each Gaussian distribution's mean and variance. The basic assumption is that the given data consists of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to make the code print the 8 means and variances. Can anyone help me?
You can use this; I have made some sample code for your visualization:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and standard deviations (square roots of the variances)
mu = model.means_.flatten()
sd = np.sqrt(model.covariances_.flatten())
#Plotting
extend_window = 50 #controls the zoom of the graph: the higher it is, the more zoomed out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
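To print the fitted parameters themselves, something along these lines may help. It is a sketch on made-up samples from two Gaussians (the array `samples` is hypothetical), and it reads the per-component means, variances and weights from the fitted model. Note that `GaussianMixture` expects raw samples; if the 136 numbers are counts per bin rather than samples, which is only a guess about the data, they would first have to be expanded into samples.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical samples drawn from two well-separated Gaussians
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.0, 1.0, 500),
                          rng.normal(10.0, 2.0, 500)])

# GaussianMixture expects an array of shape (n_samples, n_features)
gmm = GaussianMixture(n_components=2, random_state=0).fit(samples.reshape(-1, 1))

# One mean, one variance and one mixing weight per component
means = gmm.means_.ravel()
variances = gmm.covariances_.ravel()
weights = gmm.weights_

# Print the components ordered by their mean
for k in np.argsort(means):
    print(f"component {k}: mean={means[k]:.2f}, "
          f"variance={variances[k]:.2f}, weight={weights[k]:.2f}")
```

For 8 components, the same loop applies with `n_components=8`.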
I have a set whose samples are discrete values (in particular, the size of a queue over time). Now I'd like to find what distribution they belong to. To achieve this goal I'd act the same way I did for the other quantities, i.e. plotting a qqplot, launching
import statsmodels.api as sm
sm.qqplot(df, dist = 'geom', sparams = (.5,), line ='s', alpha = 0.3, marker ='.')
This works if dist is not a discrete random variable (e.g. 'exp' or 'norm'), and indeed I used to get some results, but when the distribution is discrete (say, 'geom'), I get
AttributeError: 'geom_gen' object has no attribute 'fit'
I searched the Internet for how to make a qqplot (or something similar) to spot what distribution my samples belong to, but I found nothing.
def discreteQQ(x_sample):
    p_test = np.array([])
    for i in range(0, 1001):
        p_test = np.append(p_test, i/1000)
        i = i + 1
    x_sample = np.sort(x_sample)
    x_theor = stats.geom.rvs(.5, size=len(x_sample))
    ecdf_sample = np.arange(1, len(x_sample) + 1)/(len(x_sample)+1)
    x_theor = stats.geom.ppf(ecdf_sample, p=0.5)
    for p in p_test:
        plt.scatter(np.quantile(x_theor, p), np.quantile(x_sample, p), c = 'blue')
    plt.xlabel('Theoretical quantiles')
    plt.ylabel('Sample quantiles')
    plt.show()
Generate a theoretical geometric distribution using scipy.stats.geom, convert the sample and theoretical data using statsmodels' ProbPlot and pass these to statsmodels' qqplot_2samples.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.graphics.gofplots import qqplot_2samples
p_theor = 1/4 # The probability we check for
p_sample = 1/5 # The true probability of the sample distribution
# The experimental data
x_sample = stats.geom.rvs(p_sample, size=50)
# The model data
x_theor = stats.geom.rvs(p_theor, size=100)
qqplot_2samples(ProbPlot(x_sample), ProbPlot(x_theor), line='45')
plt.show()
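A variant that avoids drawing a second random sample is to take the theoretical quantiles directly from the distribution's `ppf`. The sketch below assumes a geometric candidate with p = 0.2 and uses a simulated sample in place of the observed queue sizes; both choices are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; in practice this would be the observed queue sizes
x_obs = np.sort(stats.geom.rvs(0.2, size=500, random_state=0))

# Plotting positions strictly inside (0, 1), one per observation
probs = (np.arange(1, len(x_obs) + 1) - 0.5) / len(x_obs)

# Theoretical quantiles of the candidate distribution at those positions
q_theor = stats.geom.ppf(probs, 0.2)

# If the candidate fits, the points (q_theor, x_obs) lie close to the line y = x
corr = np.corrcoef(q_theor, x_obs)[0, 1]
print(corr)
```

Scattering `q_theor` against `x_obs` and adding the line y = x gives the Q-Q plot. Because the distribution is discrete, the points fall on a staircase rather than a smooth curve.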
I am writing code for a mono-energetic gamma beam in which the dominant interaction is photoelectric absorption, with mu = 2 cm^-1. I need to generate 50000 random numbers and sample the interaction depth (I do not know whether I did that or not).
I know that the mean free path is mu^-1, but I need to find the mean free path both from the simulation and from mu and compare them. Is what I did in the code right or not?
import random
import matplotlib.pyplot as plt
import numpy as np
mu=(2)
random.seed=()
data = np.random.randn(50000)*10
bins = np.arange(data.min(), data.max()+1e-8, 0.1)
meanfreepath = 1/mu
print(meanfreepath)
plt.hist(data, bins=bins)
plt.show()
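Well, the interaction depth follows an exponential distribution, not a Gaussian one.
So the code would be
import numpy as np
lmbda = 2 # attenuation coefficient mu, cm^-1
beta = 1.0/lmbda # mean free path from mu, cm
data = np.random.exponential(scale=beta, size=50000)
mfp = np.mean(data) # mean free path from the simulation
print(mfp)
# build histogram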
More details at https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.exponential.html
Code above produced
0.4977168417102998
which looks like 2^-1 to me
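If the exercise asks for the depths to be sampled from uniform random numbers explicitly, inverse-transform sampling does the same job. The sketch below inverts the exponential CDF F(x) = 1 - exp(-mu x); the variable names are illustrative.

```python
import numpy as np

mu_att = 2.0  # attenuation coefficient in cm^-1 (value from the question)
rng = np.random.default_rng(0)

# 50000 uniform random numbers in [0, 1)
u = rng.random(50000)

# Invert the CDF F(x) = 1 - exp(-mu * x) to get interaction depths in cm
depth = -np.log(1.0 - u) / mu_att

mfp_sim = depth.mean()      # mean free path estimated from the simulation
mfp_theory = 1.0 / mu_att   # mean free path from mu
print(mfp_sim, mfp_theory)
```

The two values should agree to within the statistical error of the sample mean, which is about (1/mu)/sqrt(N) for N samples.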
I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi, a simple way to do it is as follows:
from pymc3 import *
import numpy as np
from numpy import ones, array
# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])
def invlogit(x):
    return np.exp(x) / (1 + np.exp(x))
with Model() as model:
    # Logit-linear model parameters
    alpha = Normal('alpha', 0, 0.01)
    beta = Normal('beta', 0, 0.01)
    # Calculate probabilities of death
    theta = Deterministic('theta', invlogit(alpha + beta * dose))
    # Data likelihood
    deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(2000, step, start=start, progressbar=True)
import matplotlib.pyplot as plt
# Posterior median of theta at each dose level
death_fit = np.percentile(trace.theta,50,axis=0)
plt.plot(dose, death_fit,'g', marker='.', lw=1.25, ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
If instead you want to plot dose vs. pred_death with pred_death computed taking into account the uncertainty in the posterior for alpha and beta, then probably the easiest way is to use the function sample_ppc.
Maybe something like:
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
    plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using Posterior Predictive Checks (ppc) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.
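As a rough sketch of that last option, the posterior draws can be turned into a mean curve with an interval using NumPy alone. The arrays `alpha_draws` and `beta_draws` below are made-up stand-ins for `trace['alpha']` and `trace['beta']`, so the numbers themselves mean nothing.

```python
import numpy as np

# Hypothetical posterior draws standing in for trace['alpha'] and trace['beta']
rng = np.random.default_rng(0)
alpha_draws = rng.normal(0.8, 0.5, 2000)
beta_draws = rng.normal(7.0, 2.0, 2000)

# Grid of log-dose values on which to evaluate the curves
dose_grid = np.linspace(-1.0, 1.0, 50)

# One dose-response curve per posterior draw: shape (2000, 50)
logit = alpha_draws[:, None] + beta_draws[:, None] * dose_grid
theta_draws = 1.0 / (1.0 + np.exp(-logit))

# Point-wise posterior mean and a 95% interval
theta_mean = theta_draws.mean(axis=0)
theta_lo, theta_hi = np.percentile(theta_draws, [2.5, 97.5], axis=0)
print(theta_mean[:3], theta_lo[:3], theta_hi[:3])
```

Plotting `theta_mean` against `dose_grid` with `plt.plot`, and shading between `theta_lo` and `theta_hi` with `plt.fill_between`, then shows the fit together with its uncertainty.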