Efficient sampling from the beta-binomial distribution in Python

For a stochastic simulation I need to draw a lot of random numbers that are beta-binomial distributed.
At the moment I have implemented it this way (using Python):
from scipy import special
from scipy.stats import rv_discrete

class beta_binomial(rv_discrete):
    """
    beta-binomial distribution, created by defining its pmf
    """
    def _pmf(self, k, a, b, n):
        return special.binom(n, k) * special.beta(k + a, n - k + b) / special.beta(a, b)
Sampling a random number x can then be done by:
betabinomial = beta_binomial(name="betabinomial")
x = betabinomial.rvs(0.5, 0.5, 3)  # with some parameters
The problem is that sampling one random number takes ca. 0.5 ms, which in my case dominates the whole simulation runtime. The limiting element is the evaluation of the beta functions (or the gamma functions within them).
Does anyone have a good idea how to speed up the sampling?

Well, here is working and lightly tested code which seems to be faster, using the compound-distribution property of the beta-binomial: we sample p from a beta distribution and then use it as the parameter of a binomial. If you sample large vectors at once, it is even faster.
import numpy as np

def sample_Beta_Binomial(a, b, n, size=None):
    p = np.random.beta(a, b, size=size)
    r = np.random.binomial(n, p)
    return r
np.random.seed(777777)
q = sample_Beta_Binomial(0.5, 0.5, 3, size=10)
print(q)
Output is
[3 1 3 2 0 0 0 3 0 3]
Quick test
np.random.seed(777777)
n = 10
a = 2.
b = 2.
N = 100000
q = sample_Beta_Binomial(a, b, n, size=N)
h = np.zeros(n+1, dtype=np.float64)  # histogram
for v in q:  # fill it
    h[v] += 1.0
h /= np.float64(N)  # normalization
print(h)
prints the histogram
[0.03752 0.07096 0.09314 0.1114 0.12286 0.12569 0.12254 0.1127 0.09548 0.06967 0.03804]
which is quite similar to the green graph on the Wikipedia page for the beta-binomial distribution.
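To quantify the speed-up from vectorization, here is a minimal timing sketch (an illustrative addition, not from the original answer; absolute times will vary by machine):
import timeit
import numpy as np

def sample_Beta_Binomial(a, b, n, size=None):
    p = np.random.beta(a, b, size=size)
    return np.random.binomial(n, p)

# 100000 draws in one vectorized call vs. one draw per call
t_vec = timeit.timeit(lambda: sample_Beta_Binomial(0.5, 0.5, 3, size=100000), number=10)
t_scalar = timeit.timeit(lambda: [sample_Beta_Binomial(0.5, 0.5, 3) for _ in range(100000)], number=10)
print(t_vec, t_scalar)  # expect the vectorized call to be far cheaper per sample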

Related

Python - unwanted periodicity in the autocorrelation result

I'm calculating the autocorrelation sampled at every kth element. The two estimators are

R(τ) = (1/M) Σ_{n=0}^{M-1} x[n]·x[n+τ]        (1)
R_k(τ) = (1/M) Σ_{n=0}^{M-1} x[kn]·x[kn+τ]    (2)
I'm currently debugging a function that calculates the autocorrelation. The function uses two methods: equation (1), Rtau, and equation (2), Rktau. Rtau is the original method, while Rktau is a modified version that takes the sampling interval k into account.
My problem is that unwanted periodicity appears with the modified autocorrelation function, equation (2).
I know that the original autocorrelation function is defined by equation (1). However, to relax the high sample-rate requirement, equation (2) is used instead, because under the WSS assumption the autocorrelation depends only on the lag τ rather than on n (absolute time).
import numpy as np
k = 7 # sampling interval
M = 10000 # list size for autocorrelation calculation
R = 2**9 # autocorrelation size
N = 2**20 # data size
data = np.random.normal(1e-1, 1e-2, N)
Rtau = []
Rktau = []
for tau in range(R):
    r = data[:M] * data[tau:tau+M]
    rk = data[:M*k:k] * data[tau:tau+M*k:k]
    Rtau.append(np.mean(r))
    Rktau.append(np.mean(rk))
As expected, the results from Rktau are quite similar to those from Rtau. However, I've noticed that Rktau shows unwanted periodicity that is determined by the sampling interval k. This periodicity isn't present in Rtau.
Unwanted periodicity in Rktau:
import numpy as np
from matplotlib import pyplot as plt
k = 7 # sampling rate divider
M = 10000 # list size for autocorrelation calculation
R = 2**9 # autocorrelation size
N = 2**20 # data size
data = np.random.normal(1e-1, 1e-2, N)
Rtau = []
Rktau = []
for tau in range(R):
    r = data[:M] * data[tau:tau+M]
    Rtau.append(np.mean(r))

for ktau in range(0, R, k):
    rk = data[:M] * data[ktau:ktau+M]
    Rktau.append(np.mean(rk))

plt.plot(range(R), Rtau)
plt.plot(range(0, R, k), Rktau)
plt.show()

Simulating an elementary stochastic process in Python

I'm trying to simulate a simple stochastic process in Python, but with no success. The process is the following:
x(t + δt) = r(t) * x(t)
where r(t) is a Bernoulli random variable that can assume the values 1.5 or 0.6.
I've tried the following:
import numpy as np
from scipy.stats import bernoulli

n = 10
r = np.zeros((1, n))
for i in range(0, n, 1):
    if r[1, i] == r[1, 0]:
        r[1, i] = 1
    else:
        B = bernoulli.rvs(0.5, size=1)
        if B == 0:
            r[1, i] = r[1, i-1] * 0.6
        else:
            r[1, i] = r[1, i-1] * 1.5
Can you explain why this is wrong and a possible solution?
So, the first thing is that the SDE should be viewed over time, meaning you also need to consider the discretization rather than just giving the number of steps n.
Essentially, what you are asking for is just a simple random walk with a Bernoulli random variable taking on the values 0.6 and 1.5 instead of a Gaussian (standard normal) random variable.
So I have created an answer here, using NumPy to create the Bernoulli random variable for efficiency (NumPy is faster than SciPy for this), then running the simulation with a step size of 0.01 and plotting the solution using matplotlib.
One thing to note is that this SDE is one-dimensional, so we can just store the state and time in separate vectors and plot them at the end.
import numpy as np

# Function generating a Bernoulli trial (your r(t))
def get_bernoulli(p=0.5):
    '''
    Uses numpy (faster than scipy.stats) to generate a Bernoulli
    random variable taking the values 0.6 or 1.5
    '''
    B = np.random.binomial(1, p, 1)
    if B == 0:
        return 0.6
    else:
        return 1.5
This is then used in the simulation as
import numpy as np
import matplotlib.pyplot as plt

dt = 0.01  # step size
x0 = 1     # initial condition
tfinal = 1
sqrtdt = np.sqrt(dt)  # not used below; would enter a diffusion term
n = int(tfinal/dt)

# State and time vectors
xtraj = np.zeros(n+1, float)
trange = np.linspace(start=0, stop=tfinal, num=n+1)

# Initialize
xtraj[0] = x0

for i in range(n):
    xtraj[i+1] = xtraj[i] * get_bernoulli(p=0.5)

plt.plot(trange, xtraj, label=r'$x(t)$')
plt.xlabel("time")
plt.ylabel(r"$X$")
plt.legend()
plt.show()
Here we assumed the Bernoulli trial is fair, but it can be customized to add some more variation.
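As an aside, a vectorized variant (a sketch of an alternative, not part of the original answer) draws all the multipliers at once with np.random.choice and takes a cumulative product:
import numpy as np
import matplotlib.pyplot as plt

dt = 0.01
tfinal = 1
n = int(tfinal/dt)
x0 = 1.0

# draw all multipliers r(t) in {0.6, 1.5} in one call
r = np.random.choice([0.6, 1.5], size=n, p=[0.5, 0.5])
xtraj = x0 * np.concatenate(([1.0], np.cumprod(r)))

plt.plot(np.linspace(0, tfinal, n+1), xtraj)
plt.show()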

Difference in result between fmin and fminsearch in Matlab and Python

My objective is to perform an inverse Laplace transform on some decay data (NMR T2 decay via CPMG). For that, we were provided with the CONTIN algorithm, which was adapted to Matlab by Iari-Gabriel Marino and works very well. I want to adapt this code to Python. The core of the problem is scipy.optimize.fmin, which does not minimize the mean square deviation (MSD) in any way comparable to Matlab's fminsearch: the latter achieves a good minimization, while the former doesn't.
I have gone through my adapted Python code and the original Matlab line by line, checking every matrix and every output; this is how I identified fmin as the critical point. I also tried scipy.optimize.minimize and other minimization algorithms, but none gave even remotely satisfactory results.
I have made two MWEs, for Python and Matlab, to make this reproducible for everyone. The example data were obtained from the documentation of the Matlab function. Apologies for the long code, but I don't really know how to shorten it without sacrificing readability and clarity; I tried to make the lines match as closely as possible. I am using Python 3.7.3, scipy 1.3.0, numpy 1.16.2, and Matlab R2018b on Windows 8.1, with a relatively recent Anaconda install (<2 months old).
My code:
import numpy as np
from scipy.optimize import fmin
import matplotlib.pyplot as plt
def msd(g, y, A, alpha, R, w, constraints):
    """msd: mean square deviation. This is the function to be minimized by fmin."""
    if 'zero_at_extremes' in constraints:
        g[0] = 0
        g[-1] = 0
    if 'g>0' in constraints:
        g = np.abs(g)
    r = np.diff(g, axis=0, n=2)  # second derivative of g
    yfit = A @ g
    # Sum of weighted square residuals
    VAR = np.sum(w * (y - yfit) ** 2)
    # Regularizer
    REG = alpha ** 2 * np.sum((r - R @ g) ** 2)
    # Output to be minimized
    return VAR + REG
# Objective: match this distribution
g0 = np.array([0, 0, 10.1625, 25.1974, 21.8711, 1.6377, 7.3895, 8.736, 1.4256, 0, 0]).reshape((-1, 1))
s0 = np.logspace(-3, 6, len(g0)).reshape((-1, 1))
t = np.linspace(0.01, 500, 100).reshape((-1, 1))
sM, tM = np.meshgrid(s0, t)
A = np.exp(-tM / sM)
np.random.seed(1)
# Creates data from the initial distribution with some random noise.
data = (A @ g0) + 0.07 * np.random.rand(t.size).reshape((-1, 1))
# Parameters and function start
alpha = 1E-2  # regularization parameter
s = np.logspace(-3, 6, 20).reshape((-1, 1))  # x of the ILT
g0 = np.ones(s.size).reshape((-1, 1))  # guess of y of ILT
y = data  # noisy data
options = {'maxiter': 1e8, 'maxfun': 1e8}  # for the fmin function
constraints = ['g>0', 'zero_at_extremes']  # constraints for the MSD function
R = np.zeros((len(g0) - 2, len(g0)), order='F')  # Regularizer
w = np.ones(y.reshape(-1, 1).size).reshape((-1, 1))  # Weights
sM, tM = np.meshgrid(s, t, indexing='xy')
A = np.exp(-tM/sM)
g0 = g0 * y.sum() / (A @ g0).sum()  # Makes a "better guess" for the distribution, according to the algorithm
print('msd of input data:\n', msd(g0, y, A, alpha, R, w, constraints))
for i in range(5):  # Just for testing. If this is extremely high, ~1000, it's still bad.
    g = fmin(func=msd,
             x0=g0,
             args=(y, A, alpha, R, w, constraints),
             **options,
             disp=True)[:, np.newaxis]
    msdfit = msd(g, y, A, alpha, R, w, constraints)
    if 'zero_at_extremes' in constraints:
        g[0] = 0
        g[-1] = 0
    if 'g>0' in constraints:
        g = np.abs(g)
    g0 = g
print('New guess', g)
print('Final msd of g', msdfit)
# Visualize the fit
plt.plot(s, g, label='Initial approximation')
plt.plot(np.logspace(-3, 6, 11), np.array([0, 0, 10.1625, 25.1974, 21.8711, 1.6377, 7.3895, 8.736, 1.4256, 0, 0]), label='Distribution to match')
plt.xscale('log')
plt.legend()
plt.show()
Matlab:
% Objective: match this distribution
g0 = [0 0 10.1625 25.1974 21.8711 1.6377 7.3895 8.736 1.4256 0 0]';
s0 = logspace(-3,6,length(g0))';
t = linspace(0.01,500,100)';
[sM,tM] = meshgrid(s0,t);
A = exp(-tM./sM);
rng(1);
% Creates data from the initial distribution with some random noise.
data = A*g0 + 0.07*rand(size(t));
% Parameters and function start
alpha = 1e-2; % regularization parameter
s = logspace(-3,6,20)'; % x of the ILT
g0 = ones(size(s)); % initial guess of y of ILT
y = data; % noisy data
options = optimset('MaxFunEvals',1e8,'MaxIter',1e8); % options for fminsearch
constraints = {'g>0','zero_at_the_extremes'}; % constraints for MSD
R = zeros(length(g0)-2,length(g0));
w = ones(size(y(:)));
[sM,tM] = meshgrid(s,t);
A = exp(-tM./sM);
g0 = g0*sum(y)/sum(A*g0); % Makes a "better guess" for the distribution
disp('msd of input data:')
disp(msd(g0, y, A, alpha, R, w, constraints))
for k = 1:5
    [g,msdfit] = fminsearch(@msd,g0,options,y,A,alpha,R,w,constraints);
    if ismember('zero_at_the_extremes',constraints)
        g(1) = 0;
        g(end) = 0;
    end
    if ismember('g>0',constraints)
        g = abs(g);
    end
    g0 = g;
end
disp('New guess')
disp(g)
disp('Final msd of g')
disp(msdfit)
% Visualize the fit
semilogx(s, g)
hold on
semilogx(logspace(-3,6,11), [0 0 10.1625 25.1974 21.8711 1.6377 7.3895 8.736 1.4256 0 0])
legend('First approximation', 'Distribution to match')
hold off
function out = msd(g,y,A,alpha,R,w,constraints)
    % msd: The mean square deviation; this is the function
    % that has to be minimized by fminsearch
    % Constraints and any 'a priori' knowledge
    if ismember('zero_at_the_extremes',constraints)
        g(1) = 0;
        g(end) = 0;
    end
    if ismember('g>0',constraints)
        g = abs(g); % must be g(i)>=0 for each i
    end
    r = diff(diff(g(1:end))); % second derivative of g
    yfit = A*g;
    % Sum of weighted square residuals
    VAR = sum(w.*(y-yfit).^2);
    % Regularizer
    REG = alpha^2 * sum((r-R*g).^2);
    % Output to be minimized
    out = VAR+REG;
end
[Plots of the optimization results in Python and in Matlab appeared here.]
I have checked the output of msd on g0 before starting, and both give the value 2651. After minimization, Python goes up to 4547, while Matlab goes down to 0.1381.
I think the problem is one of the following: it's in my implementation, that is, I am using fmin wrong, or there's some other passage I got wrong, but I can't figure out what. The fact that the MSD increases when it should decrease under a minimization function is damning. Reading the documentation, the scipy implementation is different from Matlab's (Matlab uses the Nelder-Mead method described in Lagarias et al., per its documentation, while scipy uses the original Nelder-Mead). Maybe that matters significantly? Or perhaps my initial guess is too bad for scipy's algorithm?
So, quite a long time since I posted this, but I wanted to share what I ended up learning and doing.
The Inverse Laplace Transform for CPMG data is a bit of a misnomer, and it's more properly called just inversion. The general problem is solving a Fredholm integral of the first kind. One way of doing this is the Tikhonov regularization method. Turns out, you can describe this problem quite easily using numpy, and solve it with a scipy package, so I don't have to "reinvent" the wheel with this.
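In equations (my paraphrase of the method, not a quotation from the original post): the Tikhonov-regularized non-negative inversion solves

min_{x >= 0} ||K x − y||² + α² ||x||²

which is equivalent to the plain NNLS problem min_{x >= 0} ||C x − d||² with the stacked matrix C = [K; αI] and the zero-padded vector d = [y; 0].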
I used the solution shown in this post, and the names here reflect that solution.
import numpy as np
from scipy.optimize import nnls

def tikhonov_regularized_inversion(
    kernel: np.ndarray, alpha: float, data: np.ndarray
) -> np.ndarray:
    data = data.reshape(-1, 1)
    I = alpha * np.eye(*kernel.shape)
    C = np.concatenate([kernel, I], axis=0)
    d = np.concatenate([data, np.zeros_like(data)])
    x, _ = nnls(C, d.flatten())
    return x
Here, kernel is a matrix containing all the possible exponential decay curves, and the solution judges the contribution of each decay curve to the data I received. First, I stack my data as a column, then pad it with zeros, creating the vector d. I then stack my kernel on top of a diagonal matrix containing the regularization parameter alpha along the diagonal, of the same size as the kernel. Last, I call the convenient nnls, a non-negative least squares solver in scipy.optimize, because there's no reason to have a negative contribution, only no contribution.
This solved my problem, it's quick and convenient.
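For context, a usage sketch (the kernel construction mirrors the MWE earlier in this question; the synthetic data line is an illustrative stand-in, not real measurements):
import numpy as np

t = np.linspace(0.01, 500, 100)   # acquisition times
s = np.logspace(-3, 6, 20)        # candidate decay constants
sM, tM = np.meshgrid(s, t)
kernel = np.exp(-tM / sM)         # each column is one decay curve
data = kernel @ np.ones(s.size)   # stand-in for a measured decay
g = tikhonov_regularized_inversion(kernel, 1e-2, data)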

Python ARMA MLE (Implementing Algorithms from Literature)

Overview
I am trying to implement autoregressive moving average (ARMA) parameter optimization using maximum likelihood estimation (MLE) via the Kalman Filter. I know that I can fit ARMA models using the statsmodels package in Python, but I want to write my own implementation of the ARMA likelihood and subsequent optimization as a prototype for a future C/C++ implementation. Also, when I look through the statsmodels documentation, I find that the statsmodels Kalman Filter Log Likelihood implements a slightly different expression than I have found in the literature.
Algorithms
In order to calculate the ARMA log likelihood, I am following the 1980 paper by Pearlman:
Pearlman, J. G. "An algorithm for the exact likelihood of a high-order autoregressive-moving average process." Biometrika 67.1 (1980): 232-233. Available from JSTOR.
In order to calculate the initial P matrix, I am following an algorithm in
Gardner, G., Andrew C. Harvey, and Garry DA Phillips. "Algorithm AS 154: An algorithm for exact maximum likelihood estimation of autoregressive-moving average models by means of Kalman filtering." Journal of the Royal Statistical Society. Series C (Applied Statistics) 29.3 (1980): 311-322. Available from JSTOR.
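For reference (my own summary of the linear-algebra step, not a quotation from either paper): the initial state covariance satisfies the discrete Lyapunov equation P = T P T' + R R'. Using the identity vec(T P T') = (T ⊗ T) vec(P), this becomes the linear system vec(P) = (I − T ⊗ T)⁻¹ vec(R R'), which is what __initial_P below implements with np.kron and np.linalg.solve.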
For the initial parameter values, I am currently using the internal method that statsmodels ARMA models use to compute the initial guess for ARMA parameters. In the future I plan to move to my own implementation, but I am using _fit_start_params_hr while I debug my MLE.
For optimizing the MLE, I am simply using the L-BFGS solver in Scipy.
Code
import numpy as np
import statsmodels.api as sm
import statsmodels.tsa.arima_model
import scipy.optimize
class ARMA(object):
    def __init__(self, endo, nar, nma):
        np.random.seed(0)
        # endogenous variables
        self.endo = endo
        # Number of AR terms
        self.nar = nar
        # Number of MA terms
        self.nma = nma
        # "Dimension" of the ARMA fit
        self.dim = max(nar, nma+1)
        # Current ARMA parameters
        self.params = np.zeros(self.nar+self.nma, dtype='float')
    def __g(self, ma_params):
        '''
        Build MA parameter vector
        '''
        g = np.zeros(self.dim, dtype='float')
        g[0] = 1.0
        if self.nma > 0:
            g[1:self.nma+1] = ma_params
        return g
    def __F(self, ar_params):
        '''
        Build AR parameter matrix
        '''
        F = np.zeros((self.dim, self.dim), dtype='float')
        F[:self.nar, 0] = ar_params
        for i in range(1, self.dim):
            F[i-1, i] = 1.0
        return F
    def __initial_P(self, R, T):
        '''
        Solve for the initial P matrix
        Solves P = TPT' + RR' via vec(P) = (I - kron(T, T))^{-1} vec(RR')
        '''
        R = np.array([R])
        S = np.identity(self.dim**2, dtype='float') - np.kron(T, T)
        V = np.outer(R, R).ravel('F')
        Pmat = np.linalg.solve(S, V).reshape(self.dim, self.dim, order='F')
        return Pmat
    def __likelihood(self, params):
        '''
        Compute the log likelihood for a parameter vector
        Implements the Pearlman 1980 algorithm
        '''
        # these checks are pilfered from statsmodels
        if self.nar > 0 and not np.all(np.abs(np.roots(np.r_[1, -params[:self.nar]])) < 1):
            print('AR coefficients are not stationary')
        if self.nma > 0 and not np.all(np.abs(np.roots(np.r_[1, -params[-self.nma:]])) < 1):
            print('MA coefficients are not stationary')
        ar_params = params[:self.nar]
        ma_params = params[-self.nma:]
        g = self.__g(ma_params)
        F = self.__F(ar_params)
        w = self.endo
        P = self.__initial_P(g, F)
        n = len(w)
        z = np.zeros(self.dim, dtype='float')
        R = np.zeros(n, dtype='float')
        a = np.zeros(n, dtype='float')
        K = np.dot(F, P[:, 0])
        L = K.copy()
        R[0] = P[0, 0]
        for i in range(1, n):
            a[i-1] = w[i-1] - z[0]
            z = np.dot(F, z) + K*(a[i-1]/R[i-1])
            Kupdate = -(L[0]/R[i-1])*np.dot(F, L)
            Rupdate = -L[0]*L[0]/R[i-1]
            P -= np.outer(L, L)/R[i-1]
            L = np.dot(F, L) - (L[0]/R[i-1])*K
            K += Kupdate
            R[i] = R[i-1] + Rupdate
            if np.abs(R[i] - 1.0) < 1e-9:
                R[i:] = 1.0
                break
        for j in range(i, n):
            a[j] = w[j] - z[0]
            z = np.dot(F, z) + K*(a[i-1]/R[i-1])
        likelihood = 0.0
        for i in range(n):
            likelihood += np.log(R[i])
        likelihood *= -0.5
        ssum = 0.0
        for i in range(n):
            ssum += a[i]*a[i]/R[i]
        likelihood += -0.5*n*np.log(ssum)
        return likelihood
    def fit(self):
        '''
        Fit the ARMA model by minimizing the log likelihood
        Uses scipy.optimize.minimize
        '''
        sm_arma = statsmodels.tsa.arima_model.ARMA(endog=self.endo, order=(self.nar, self.nma, 0))
        params = statsmodels.tsa.arima_model.ARMA._fit_start_params_hr(sm_arma, order=(self.nar, self.nma, 0))
        opt = scipy.optimize.minimize(fun=self.__likelihood, x0=params, method='L-BFGS-B')
        print(opt)
# Test the code on statsmodels sunspots data
nar = 2
nma = 1
endo = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY'].tolist()
arma = ARMA(endo=endo, nar=nar, nma=nma)
arma.fit()
Issues
I find that the above example does not converge. In the third call of ARMA._likelihood, the code throws the following warning:
RuntimeWarning: invalid value encountered in log
likelihood += np.log(R[i])
which happens because ARMA._initial_P solves for a matrix where P[0][0] < 0.0. At this point, the current estimates of the AR parameters become non-stationary. All subsequent iterations then warn that the AR and MA parameters are non-stationary.
Questions
Is this implementation correct? I have checked that the initial P matrix satisfies the equation it is supposed to satisfy. For the likelihood calculation, I see several behaviors that I expect from the Pearlman paper:
R tends to one. For a pure AR process with p AR parameters, it achieves this limit in p steps. Basically, the break statement in the _likelihood function comes into effect after p iterations of the Pearlman algorithm steps.
L tends to the zero vector.
K tends to F·g. I check this by looking at abs(K - F·g) while calculating the likelihood.
After the warning about the negative value in the logarithm, the above limits are no longer obeyed.
I have also tried implementing a transformation of the ARMA parameters to prevent overflow/underflow, as recommended in
Jones, Richard H. "Maximum likelihood fitting of ARMA models to time series with missing observations." Technometrics 22.3 (1980): 389-395. Available from JSTOR.
This transformation seemed to have no effect on the errors I observed.
If the implementation is correct, then how do I handle the negative R values? The issue seems to arise when scipy.optimize returns a parameter vector that corresponds to a P matrix for which the top diagonal element is negative. Is the optimization routine supposed to be bounded to prevent negative R values? I have also tried using complex logarithms for negative values as well as changing all numpy dtype parameters to 'complex'. For example:
def complex_log(val):
    '''
    Complex logarithm for negative values
    Returns log(val) + I*pi
    '''
    if val < 0.0:
        return complex(np.log(np.abs(val)), np.pi)
    return np.log(val)
However, scipy.optimize cannot handle complex-valued functions, so this supposed fix has not worked so far. Any recommendations for preventing or handling these behaviors?
Thanks for reading this far. Any help is much appreciated.

Exercise on calculating and plotting the cumulative empirical distribution

I was trying to finish an exercise in John Stachurski's book (a textbook devoted to teaching economists how to use Python). One of the exercises is about how to calculate and plot the cumulative empirical distribution. The book provides a class called ECDF to calculate the empirical distribution function:
# Filename: ecdf.py
# Author: John Stachurski
# Date: December 2008
# Corresponds to: Listing 6.3
class ECDF:
    def __init__(self, observations):
        self.observations = observations

    def __call__(self, x):
        counter = 0.0
        for obs in self.observations:
            if obs <= x:
                counter += 1
        return counter / len(self.observations)
And the exercise reads:
【Exercise 6.1.12】 Add a method to the ECDF class that uses Matplotlib to plot the empirical distribution over a specified interval. Replicate the four graphs in figure 6.3 (modulo randomness).
[The figure to be replicated and an illustration of the algorithm appeared here as images.]
The following is my initial attempt
from ecdf import ECDF
import numpy as np
import matplotlib.pyplot as plt
from srs import SRS
from math import sqrt
from random import lognormvariate
# =========================
# parameters and arguments
# =========================
alpha, sigma2, s, delta = 0.3, 0.2, 0.5, 0.1
# numbers of draws
n = 1000
# length of each markov chain
t = 20
num_simu = [4,25,100,5000]
# Define F(k, z) = s k^alpha z + (1 - delta) k
F = lambda k, z: s * (k**alpha) * z + (1 - delta) * k
lognorm = lambda: lognormvariate(0, sqrt(sigma2))
# =====================
# create empirical distribution
# =====================
# different draw numbers
k = np.linspace(0, 25, 500)
for n in num_simu:
    # list used to store the capital stock (kt) in the last period (t=20)
    kt = []
    for x in range(n):
        solow_srs = SRS(F=F, phi=lognorm, X=1.0)
        px = solow_srs.sample_path(t)
        kt.append(px[-1])
    # generate the empirical distribution function
    # (stored as ecdf_n rather than F, to avoid shadowing the lambda F above)
    ecdf_n = ECDF(kt)
    prob_kt_n = [ecdf_n(i) for i in k]  # need to determine range
    # n refers to the n-th draw

# ==================================
# use a for-loop to create subplots
# ==================================
#k = np.linspace(0,25,500)
#num_rows, num_cols = 2, 2
The difficulties for me are: (1) how to store the lists/arrays of empirical distribution results for the different draw numbers in the given graph; (2) how to create subplots using a for-loop. I also encountered some other tiny errors.
Thank you for your suggestions.
About (1), my advice is to create a dictionary (i.e. something like d = {} and then d[n] = ECDF(data) for each number n of observations).
Dunno about (2).
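A minimal sketch of that dictionary idea, with one common plt.subplots layout thrown in for (2) (both are illustrative assumptions, not part of the original answer; the lognormal draws are placeholders for the simulated kt values):
import numpy as np
import matplotlib.pyplot as plt
from ecdf import ECDF  # the class from the question

d = {}
for n in [4, 25, 100, 5000]:
    draws = np.random.lognormal(0.0, 1.0, size=n)  # placeholder data
    d[n] = ECDF(draws)

k = np.linspace(0, 25, 500)
fig, axes = plt.subplots(2, 2)
for ax, n in zip(axes.ravel(), sorted(d)):
    ax.plot(k, [d[n](x) for x in k])
    ax.set_title('n = %d' % n)
plt.show()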
