I'm trying to use pyMC to provide a Bayesian estimate of a covariance matrix given some data. I'm roughly following the stock covariance example provided in this online guide (link here), but I have a simpler example model that I made up. I've got two values that I draw from a multivariate normal distribution, and I've constructed it in such a way that I know the covariance/correlation between the two variables.
I've posted my short code below. Essentially what I'm doing is constructing an artificial data set where the correlation matrix should be [[1, -0.5], [-0.5, 1]]. At the end of the mcmc sampling, I get a predicted value for the off-diagonal term that is quite a bit different. I've looked at the convergence criteria, and it looks like the autocorrelation is low and the distribution is stationary. However, I will admit I'm still wrapping my head around all the nuances here and there could be aspects of this that are still beyond my grasp.
This question is related to and very much based on these other two SO questions (One and Two). I felt the need to ask my own question despite the similarity because I'm not getting the answer I expect to get. If any of you computational statisticians out there can help provide insight into this problem it would be greatly appreciated!
import numpy as np
import pandas as pd
import pymc as pm
import matplotlib.pyplot as plt
import seaborn as sns
p=2
prior_mu=np.ones(p)
prior_sdev=np.ones(p)
prior_corr_inv=np.eye(p)
def cov2corr(A):
    """
    covariance matrix to correlation matrix.
    """
    d = np.sqrt(A.diagonal())
    A = ((A.T / d).T) / d
    #A[ np.diag_indices(A.shape[0]) ] = np.ones( A.shape[0] )
    return A
# construct artificial data set
muVector=[10,5]
sdevVector=[3.,5.]
corrMatrix=np.matrix([[1,-0.5],[-0.5, 1]])
cov_matrix=np.diag(sdevVector)*corrMatrix*np.diag(sdevVector)
n_obs = 500
x = np.random.multivariate_normal(muVector,cov_matrix,n_obs)
prior_mu = np.array(muVector)
prior_std = np.array(sdevVector)
inv_cov_matrix = pm.Wishart( "inv_cov_matrix", n_obs, np.diag(prior_std**2) )
mu = pm.Normal( "returns", prior_mu, 1, size = 2)
# create the model and sample
obs = pm.MvNormal( "observed returns", mu, inv_cov_matrix, observed = True, value = x )
model = pm.Model( [obs, mu, inv_cov_matrix] )
mcmc = pm.MCMC(model)
mcmc.use_step_method(pm.AdaptiveMetropolis,inv_cov_matrix)
mcmc.sample( 1e5, 2e4, 10)
# Determine prediction - Does not equal corrMatrix!
inv_cov_samples = mcmc.trace("inv_cov_matrix")[:]
mean_covariance_matrix = np.linalg.inv( inv_cov_samples.mean(axis=0) )
prediction = cov2corr(mean_covariance_matrix*n_obs)
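A quick sanity check, not part of the model itself: the empirical correlation of the simulated data should already be close to the target [[1, -0.5], [-0.5, 1]], so any large discrepancy in the posterior points at the model rather than the data.
print(np.corrcoef(x, rowvar=False))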
I'm trying to get my head around the DCP rules. I am looking at the portfolio optimisation example provided on the CVXPY website (original code below). I had a look at some of the other questions that deal with DCP rules but couldn't find the answer I wanted.
I tried replacing the Sigma (i.e. the covariance) in their code, which is randomly generated, with a covariance matrix computed from historic returns for some of the asset classes. Everything else is the same.
Yet I get cvxpy.error.DCPError: Problem does not follow DCP rules.
I have also added pictures of the two Sigmas (one generated randomly by the CVXPY code; the other, Sigma(1), is the historic covariance array I use).
Both are 9x9 arrays, but as I mentioned, replacing the randomly generated array with the array of historic numbers gives me that error, with all other code kept the same. Any idea what's causing this issue?
# Generate data for long only portfolio optimization.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
np.random.seed(1)
n = 10
mu = np.abs(np.random.randn(n, 1))
Sigma = np.random.randn(n, n)
Sigma = Sigma.T.dot(Sigma)
# Long only portfolio optimization.
import cvxpy as cp
w = cp.Variable(n)
gamma = cp.Parameter(nonneg=True)
ret = mu.T*w
risk = cp.quad_form(w, Sigma)
prob = cp.Problem(cp.Maximize(ret - gamma*risk),
[cp.sum(w) == 1,
w >= 0])
# Compute trade-off curve.
SAMPLES = 100
risk_data = np.zeros(SAMPLES)
ret_data = np.zeros(SAMPLES)
gamma_vals = np.logspace(-2, 3, num=SAMPLES)
for i in range(SAMPLES):
    gamma.value = gamma_vals[i]
    prob.solve()
    risk_data[i] = cp.sqrt(risk).value
    ret_data[i] = ret.value
# Plot long only trade-off curve.
import matplotlib.pyplot as plt
#%matplotlib inline
#%config InlineBackend.figure_format = 'svg'
markers_on = [29, 40]
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(risk_data, ret_data, 'g-')
for marker in markers_on:
    plt.plot(risk_data[marker], ret_data[marker], 'bs')
    ax.annotate(r"$\gamma = %.2f$" % gamma_vals[marker], xy=(risk_data[marker]+.08, ret_data[marker]-.03))
for i in range(n):
    plt.plot(cp.sqrt(Sigma[i,i]).value, mu[i], 'ro')
plt.xlabel('Standard deviation')
plt.ylabel('Return')
plt.show()
Observation: it is possible that the variance-covariance matrix in a portfolio optimization is not positive semi-definite. In theory the covariance matrix can be shown to be positive semi-definite. However, due to floating-point rounding errors, we may actually see (slightly) negative eigenvalues. (Note: a positive semi-definite matrix has non-negative eigenvalues).
I know of three approaches to handle this:
Use a statistical technique called shrinkage,
Perturb the diagonal a bit by adding small constant to each diagonal element,
Use a variant of the standard portfolio model based on mean adjusted returns.
For details see link.
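A rough sketch of the second approach (the helper name make_psd and the jitter value are my own choices, not from any particular library): symmetrize the matrix and lift the diagonal just enough to remove the negative eigenvalues before passing it to cp.quad_form.
import numpy as np
def make_psd(Sigma, jitter=1e-8):
    # symmetrize first, then inspect the smallest eigenvalue
    Sigma = (Sigma + Sigma.T) / 2
    min_eig = np.linalg.eigvalsh(Sigma).min()
    if min_eig < 0:
        # shift the diagonal just enough to make every eigenvalue non-negative
        Sigma = Sigma + (-min_eig + jitter) * np.eye(Sigma.shape[0])
    return Sigma
Printing np.linalg.eigvalsh(Sigma).min() for the historic covariance is also a quick way to confirm that a slightly negative eigenvalue is indeed what triggers the DCPError.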
I am illustrating hyperopt's TPE algorithm for my master project and can't seem to get the algorithm to converge. From what I understand from the original paper and a YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
Start by creating a search history of [x,y], say 10 points.
Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized)
Make a kernel density estimation for both the poor hyperparameter group (g(x)) and good hyperparameter group (l(x))
Good estimations will have low probability in g(x) and high probability in l(x), so we propose to evaluate the function at argmin(g(x)/l(x))
Evaluate (x,y) pair at the proposed point and repeat steps 2-5.
I have implemented this in python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde
def objective_func(x):
    return x**2

def measure(x):
    noise = np.random.randn(len(x))*0
    return x**2+noise

def split_meassures(x_obs,y_obs,gamma=1/2):
    #split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x':x_obs[:size],'y':y_obs[:size]}
    g = {'x':x_obs[size:],'y':y_obs[size:]}
    y_star = (l['y'][-1]+g['y'][0])/2
    return l,g,y_star

#sample objective function values for illustration
x_obj = np.linspace(-5,5,10000)
y_obj = objective_func(x_obj)
#start by sampling a parameter search history
x_obs = np.linspace(-5,5,10)
y_obs = measure(x_obs)
nr_iterations = 100
for i in range(nr_iterations):
    #sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs,y_obs = x_obs[sort_idx],y_obs[sort_idx]
    #split sorted observations in two groups (l and g)
    l,g,y_star = split_meassures(x_obs,y_obs)
    #approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)
    #define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l
    if i%10==0:
        plt.figure()
        plt.subplot(2,2,1)
        plt.plot(x_obj,y_obj,label='Objective')
        plt.plot(x_obs,y_obs,'*',label='Observations')
        plt.plot([-5,5],[y_star,y_star],'k')
        plt.subplot(2,2,2)
        plt.plot(x_obj,kde_l)
        plt.subplot(2,2,3)
        plt.plot(x_obj,kde_g)
        plt.subplot(2,2,4)
        plt.semilogy(x_obj,eval_measure)
        plt.draw()
    #find point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs,[best_search])
    y_obs = np.append(y_obs,[measure(np.asarray([best_search]))])
plt.show()
I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?
So, I am still learning about TPE as well, but here are the two problems in this code:
This code will only ever evaluate a few unique points. The next location is always the one most strongly recommended by the kernel density estimates, and there is no way for the code to explore the search space, which is what acquisition functions do in other Bayesian optimization methods.
Because the code simply appends new observations to the x and y lists, it accumulates a lot of duplicates. The duplicates skew the set of observations, which leads to a very strange split, and you can easily see that in the later plots: the eval_measure starts out similar to the objective function but diverges later on.
If you remove the duplicates in x_obs and y_obs, you fix problem no. 2. The first problem, however, can only be fixed by adding some way of exploring the search space. A rough sketch of both fixes is below.
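A minimal sketch, reusing x_obs, y_obs, x_obj, eval_measure and measure from the question's loop (the 20% exploration probability is an arbitrary choice); it would replace the point-selection step at the end of each iteration:
# drop duplicate observations so the KDE split is not skewed by repeats
x_obs, keep_idx = np.unique(x_obs, return_index=True)
y_obs = y_obs[keep_idx]
# occasionally evaluate a uniformly random point instead of argmin(g/l), to keep exploring
if np.random.rand() < 0.2:
    best_search = np.random.uniform(-5, 5)
else:
    best_search = x_obj[np.argmin(eval_measure)]
x_obs = np.append(x_obs, [best_search])
y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])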
I have created two mini encoding networks, one for a standard autoencoder and one for a VAE, and plotted each. I would just like to know if my understanding is correct for this mini case. Note it's only one epoch and it ends with the encoding.
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(0)
fig, (ax,ax2) = plt.subplots(2,1)
def relu(x):
    c = np.where(x>0,x,0)
    return c
#Standard autoencoder
x = np.random.randint(0,2,[100,5])
w_autoencoder = np.random.normal(0,1,[5,2])
bottle_neck = relu(x.dot(w_autoencoder))
ax.scatter(bottle_neck[:,0],bottle_neck[:,1])
#VAE autoencoder
w_vae1 = np.random.normal(0,1,[5,2])
w_vae2 = np.random.normal(0,1,[5,2])
mu = relu(x.dot(w_vae1))
sigma = relu(x.dot(w_vae2))
epsilon_sample = np.random.normal(0,1,[100,2])
latent_space = mu+np.log2(sigma)*epsilon_sample
ax2.scatter(latent_space[:,0], latent_space[:,1],c='red')
Since your motive is "understanding", I should say you are in the right direction and working on this sort of implementation definitely helps you in understanding. But I strongly believe "understanding" has to be achieved first in books/papers and only then via the implementation/code.
At a quick glance, your standard autoencoder looks fine. Via your implementation you are making the assumption that your latent code will be in the range (0, infinity), because of relu(x).
However, while doing the implementation of VAE, you can't achieve the latent code with relu(x) function. This is where your "theoretical" understanding is missing. In standard VAE, we make an assumption that the latent code is a sample from a Gaussian distribution and as such we approximate the parameters of that Gaussian distribution i.e. mean and covariance. Further, we also make another assumption that this Gaussian distribution is factorial which means the covariance matrix is diagonal. In your implementation, you are approximating the mean and diagonal covariance as:
mu = relu(x.dot(w_vae1))
sigma = relu(x.dot(w_vae2))
which seems fine, but when drawing the sample (the reparameterization trick), I'm not sure why you introduced np.log2(). Since you are using a ReLU activation, you may end up with 0 in your sigma variable, and np.log2(0) gives -inf. I believe you were motivated by some available code where they do:
mu = relu(x.dot(w_vae1)) #same as yours
logOfSigma = x.dot(w_vae2) #you are forcing your network to learn log(sigma)
Now, since you are approximating the log of sigma, the output is allowed to be negative: to get sigma you do something like np.exp(logOfSigma), which ensures you always get positive values in your diagonal covariance matrix. To do the sampling, you can then simply do:
latent_code = mu + np.exp(logOfSigma)*epsilon_sample
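Putting that together with the mini example from the question (just a sketch of the same idea, reusing x, w_vae1, w_vae2 and epsilon_sample defined above):
mu = relu(x.dot(w_vae1))
log_sigma = x.dot(w_vae2)            # no ReLU here: log(sigma) may be negative
sigma = np.exp(log_sigma)            # exponentiating guarantees a positive sigma
latent_space = mu + sigma*epsilon_sample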
Hope this helps!
Ok, so my current curve fitting code has a step that uses scipy.stats to determine the right distribution based on the data,
distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
mles = []
for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)
    mles.append(mle)
results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]
for dist in sorted(zip(distributions, mles), key=lambda d: d[1]):
    print(dist)
best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print('Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1]))
print([mod[0].name for mod in sorted(zip(distributions, mles), key=lambda d: d[1])])
Where data is a list of numeric values. This is working great so far for fitting unimodal distributions, confirmed in a script that randomly generates values from random distributions and uses curve_fit to redetermine the parameters.
Now I would like to make the code able to handle bimodal distributions, like the example below:
Is it possible to get an MLE for a pair of models from scipy.stats in order to determine whether a particular pair of distributions is a good fit for the data? Something like:
distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
distributionPairs = [[modelA.name, modelB.name] for modelA in distributions for modelB in distributions]
and use those pairs to get an MLE value of that pair of distributions fitting the data?
It's not a complete answer, but it may help you solve your problem. Let's say you know your data is generated by two densities.
A solution would be to use k-means or the EM algorithm.
Initialization.
You initialize the algorithm by assigning every observation to one density or the other, and you initialize the two densities themselves (i.e. their parameters; one of the "parameters" in your case is the family: Gaussian, Laplace, and so on).
Iteration.
Then, iteratively, you run the two following steps:
Step 1.
Optimize the parameters assuming that the current assignment of every point is correct. You can use any optimization solver here. This step provides an estimate of the two densities (with their parameters) that best fit your data.
Step 2.
Reassign every observation to the density under which it has the greater likelihood.
You repeat until convergence.
This is very well explained in this web-page
https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
If you do not know how many densities generated your data, the problem is more difficult: you have to work with a penalized classification problem, which is a bit harder.
Here is a coding example for an easy case: you know that your data comes from 2 different Gaussians (you don't know how many observations were generated from each density). In your case, you can adjust this code to loop over every possible pair of densities (computationally longer, but I presume it would empirically work); a rough sketch of that follows the code.
import scipy.stats as st
import numpy as np
#hard coded data generation
data = np.random.normal(-3, 1, size = 1000)
data[600:] = np.random.normal(loc = 3, scale = 2, size=400)
#initialization
mu1 = -1
sigma1 = 1
mu2 = 1
sigma2 = 1
#criterion to stop iteration
epsilon = 0.1
stop = False
while not stop :
    #step1
    classification = np.zeros(len(data))
    classification[st.norm.pdf(data, mu1, sigma1) > st.norm.pdf(data, mu2, sigma2)] = 1
    mu1_old, mu2_old, sigma1_old, sigma2_old = mu1, mu2, sigma1, sigma2
    #step2
    pars1 = st.norm.fit(data[classification == 1])
    mu1, sigma1 = pars1
    pars2 = st.norm.fit(data[classification == 0])
    mu2, sigma2 = pars2
    #stopping criterion
    stop = ((mu1_old - mu1)**2 + (mu2_old - mu2)**2 +(sigma1_old - sigma1)**2 +(sigma2_old - sigma2)**2) < epsilon
#result
print("The first density is gaussian :", mu1, sigma1)
print("The first density is gaussian :", mu2, sigma2)
print("A rate of ", np.mean(classification), "is classified in the first density")
Hope it helps.
I've closely followed this book (http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/MorePyMC.ipynb) but have found myself running into problems when trying to use Pymc for my own problem.
I've got a bunch of order values from customers who have placed an order and they look reasonably like a Gamma distribution. I'm running an AB test and want to see how the distribution of order values changes - enter Pymc. I was following the example in the book but found it didn't really work for me - first attempt was this:
import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
from pylab import savefig
## Replace these with the actual order values in the test set
## Have made slightly different to be able to see differing distributions
observations_A = pm.rgamma(3.5, 0.013, size=1000)
observations_B = pm.rgamma(3.45, 0.016, size=2000)
## Identical prior assumptions
prior_a = pm.Gamma('prior_a', 3.5, 0.015)
prior_b = pm.Gamma('prior_b', 3.5, 0.015)
## The difference in the test groups is the most important bit
@pm.deterministic
def delta(p_A = prior_a, p_B = prior_b):
    return p_A - p_B
## Add observations
observation_a = pm.Gamma('observation_a', prior_a, value=observations_A, observed=True)
observation_b = pm.Gamma('observation_b', prior_b, value=observations_B, observed=True)
mcmc = pm.MCMC([prior_a, prior_b, delta, observation_a, observation_b])
mcmc.sample(20000,1000)
Looking at the mean of the trace for prior_a and prior_b I see values of around 3.97/3.98 and when I look at the stats of these priors I see a similar story. However, upon defining the priors, calling the rand() method on the prior gives me the kind of values I would expect (between 100 and 400). Basically, one of the updating stages (I'm least certain about the observation stages) is doing something I don't expect.
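(Not in the original code above, but to make that comparison concrete, this is the kind of check I mean:)
print(mcmc.trace('prior_a')[:].mean(), mcmc.trace('prior_b')[:].mean())   # comes out around 3.97/3.98
print(prior_a.rand(), prior_b.rand())   # single draws from the prior are roughly 100-400, as expected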
Having struggled with this for a bit I found this page (http://matpalm.com/blog/2012/12/27/dead_simple_pymc/) and decided a different approach may be in order:
import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
from pylab import savefig
## Replace these with the actual order values in the test set
observations_A = pm.rgamma(3.5, 0.013, size=1000)
observations_B = pm.rgamma(3.45, 0.016, size=2000)
## Initial assumptions
A_Rate = pm.Uniform('A_Rate', 2, 4)
B_Rate = pm.Uniform('B_Rate', 2, 4)
A_Shape = pm.Uniform('A_Shape', 0.005, 0.05)
B_Shape = pm.Uniform('B_Shape', 0.005, 0.05)
p_A = pm.Gamma('p_A', A_Rate, A_Shape, value=observations_A, observed=True)
p_B = pm.Gamma('p_B', B_Rate, B_Shape, value=observations_B, observed=True)
## Sample
mcmc = pm.MCMC([p_A, p_B, A_Rate, B_Rate, A_Shape, B_Shape])
mcmc.sample(20000, 1000)
## Plot the A_Rate, B_Rate, A_Shape, B_Shape
## Using those, determine the Gamma distribution
## Plot both - and draw 1000000... samples from each.
## Perform statistical tests on these.
So instead of going straight for the Gamma distribution, we're looking to find the parameters (I think). This seems to work a treat in that it gives me values in the traces of the right order of magnitude. However, now I can plot a histogram of samples for alpha for both test groups and for beta but that's not really what I'm after. I want to be able to plot each of the test group's 'gamma-like' distributions, calculated from a prior and the values I supply. I also want to be able to plot a 'delta' as the AB testing example shows. I feel a deterministic variable on the second example is going to be my best bet but I don't really know the best way to go about constructing this.
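(A rough sketch of what such a deterministic might look like, not something verified against the code above: it assumes the quantity to compare is each group's implied Gamma mean, alpha/beta; note that in the second model A_Rate is passed where pm.Gamma expects the shape alpha and A_Shape where it expects the rate beta, so the implied group mean is A_Rate/A_Shape.)
@pm.deterministic
def mean_A(a=A_Rate, b=A_Shape):
    return a / b          # mean of a Gamma(alpha, beta) is alpha/beta
@pm.deterministic
def mean_B(a=B_Rate, b=B_Shape):
    return a / b
@pm.deterministic
def delta(m_A=mean_A, m_B=mean_B):
    return m_A - m_B
mcmc = pm.MCMC([p_A, p_B, A_Rate, B_Rate, A_Shape, B_Shape, mean_A, mean_B, delta])
mcmc.sample(20000, 1000)
delta_samples = mcmc.trace('delta')[:]   # a histogram of this gives the difference plot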
Long story short - I've got data drawn from a Gamma distribution that I'd like to AB test. I've got a gamma prior view of the data, though could be persuaded that I've got a normal prior view if that's easier. I'd like to update identical priors with the data I've collected, in a sensible way, and plot the distributions and the difference between them.
Cheers,
Matt