I have created two mini encoding networks, one for a standard autoencoder and one for a VAE, and plotted each. I would just like to know if my understanding is correct for this mini case. Note it's only one epoch and it ends with the encoding.
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(0)
fig, (ax,ax2) = plt.subplots(2,1)
def relu(x):
    c = np.where(x > 0, x, 0)
    return c
#Standard autoencoder
x = np.random.randint(0,2,[100,5])
w_autoencoder = np.random.normal(0,1,[5,2])
bottle_neck = relu(x.dot(w_autoencoder))
ax.scatter(bottle_neck[:,0],bottle_neck[:,1])
#VAE autoencoder
w_vae1 = np.random.normal(0,1,[5,2])
w_vae2 = np.random.normal(0,1,[5,2])
mu = relu(x.dot(w_vae1))
sigma = relu(x.dot(w_vae2))
epsilon_sample = np.random.normal(0,1,[100,2])
latent_space = mu+np.log2(sigma)*epsilon_sample
ax2.scatter(latent_space[:,0], latent_space[:,1],c='red')
Since your motive is "understanding", I would say you are headed in the right direction, and working on this sort of implementation definitely helps with understanding. But I strongly believe understanding has to be achieved first through books/papers and only then via the implementation/code.
At a quick glance, your standard autoencoder looks fine. Through your implementation you are making the assumption that the latent code lies in the range (0, infinity), because of relu(x).
However, in the VAE implementation you can't obtain the latent code with the relu(x) function. This is where your "theoretical" understanding is missing. In a standard VAE, we assume that the latent code is a sample from a Gaussian distribution, and so we approximate the parameters of that Gaussian distribution, i.e. the mean and covariance. Further, we also assume that this Gaussian distribution is factorial, which means the covariance matrix is diagonal. In your implementation, you approximate the mean and diagonal covariance as:
mu = relu(x.dot(w_vae1))
sigma = relu(x.dot(w_vae2))
which seems fine, but when drawing the sample (the reparameterization trick), it is not clear why you introduced np.log2(). Since you are using a ReLU activation, you may end up with 0 in your sigma variable, and np.log2(0) gives -inf. I believe you were motivated by some available code where they do:
mu = relu(x.dot(w_vae1)) #same as yours
logOfSigma = x.dot(w_vae2) #you are forcing your network to learn log(sigma)
Now, since you are approximating the log of sigma, the output is allowed to be negative: to get sigma you would do something like np.exp(logOfSigma), which ensures you always get positive values in your diagonal covariance matrix. To sample, you can then simply do:
latent_code = mu + np.exp(logOfSigma)*epsilon_sample
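Putting it together, here is a minimal self-contained sketch of the corrected VAE encoding in the same style as your snippet (same variable names; the only change is that the log-sigma branch has no ReLU, so it can go negative):
import numpy as np
from matplotlib import pyplot as plt

np.random.seed(0)

def relu(x):
    return np.where(x > 0, x, 0)

x = np.random.randint(0, 2, [100, 5])

# VAE encoder: one branch for the mean, one for log(sigma)
w_vae1 = np.random.normal(0, 1, [5, 2])
w_vae2 = np.random.normal(0, 1, [5, 2])
mu = relu(x.dot(w_vae1))       # mean of the latent Gaussian
log_sigma = x.dot(w_vae2)      # no activation: log(sigma) may be negative

# reparameterization trick: z = mu + sigma * epsilon
epsilon_sample = np.random.normal(0, 1, [100, 2])
latent_code = mu + np.exp(log_sigma) * epsilon_sample

plt.scatter(latent_code[:, 0], latent_code[:, 1], c='red')
plt.show()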
Hope this helps!
I'm trying to fit an asymmetric gaussian to my data. My data is just a numpy array called wave (x) and a numpy array called spec (y) that looks like an asymmetric gaussian.
Here is an image of the data with an asymmetric gaussian fitted using curve_fit (the data also has a continuum, but that is not important right now).
This is the function:
def agauss(amp, cen, b_sigma, r_sigma, x):
    y = np.zeros(len(x))
    for i in range(len(x)):
        if x[i] < cen:
            y[i] = amp*np.exp(-((x[i] - cen)**2)/(2*b_sigma**2))
        else:
            y[i] = amp*np.exp(-((x[i] - cen)**2)/(2*r_sigma**2))
    return y
I'm using this code to fit the parameters:
with pm.Model() as asym:
    cen = pm.Uniform('cen', lower=5173, upper=5179)
    bsigma = pm.HalfCauchy('bsigma', beta=3)
    rsigma = pm.HalfCauchy('rsigma', beta=3)
    amp = pm.Uniform('amp', lower=1e-19, upper=1e-16)
    err = pm.HalfCauchy('err', beta=0.0000001)

    ag_pred = pm.Normal('ag_pred', mu=agauss(amp, cen, bsigma, rsigma, wave), sigma=err, observed=spec)

    agdata = pm.sample(3000, cores=2)
But I get the error "Variables do not support boolean operations" from the theano.tensor module. How should I define the function in order to fit the parameters? Is there a better way to do this? Thanks!!
144 err = pm.HalfCauchy('err', beta=0.0000001)
145
--> 146 ag_pred = pm.Normal('ag_pred', mu=agauss(amp, cen, bsigma, rsigma, wave), sigma=err, observed=spec)
147
148 agdata = pm.sample(3000, cores=2)
~/Documents/OIII_emitters/m2fs_reduction/test/assets/scripts/analysis.py in agauss(amp, cen, b_sigma, r_sigma, x)
118 y = np.zeros(len(x))
119 for i in range(len(x)):
--> 120 if x[i] < cen:
121 y[i] = amp*np.exp(-((x[i] - cen)**2)/(2*b_sigma**2))
122 else:
~/anaconda3/envs/data_science/lib/python3.7/site-packages/theano/tensor/var.py in __bool__(self)
92 else:
93 raise TypeError(
---> 94 "Variables do not support boolean operations."
95 )
96
TypeError: Variables do not support boolean operations.
Here is an approach that follows the switchpoint trick of pymc3. It's basically a non-linear extension of this SO question. Below I show the code and how it works with some mock data that I created. I don't expect the mock data to be too similar to your actual data, but it should give you a starting point. This approach does have some modelling limitations (i.e. fitting two gaussians), but again they could go away with more realistic input data.
Firstly I create some mock input data. Note that two gaussians attached to each other will have different normalisations and will also approach the continuum of your spectrum somewhat differently. I didn't attempt to include the latter subtlety in the mock data, but I kept the normalisation consistent. Another complication is the fact that the amplitude of the two gaussians is different. This probably won't be a problem with real data, but here I simply add a constant to match them. From these mock data we would like to recover the parameters cen_mock, sigma_mock_1 and sigma_mock_2. Finally, note that I kept the same frequency limits as in your question.
import numpy as np
import pymc3 as pm
import theano.tensor as tt
import matplotlib.pyplot as plt
def gaussian_pdf(x, sigma, cen):
    g1 = (1/(sigma*np.sqrt(2*np.pi)))*np.exp(
        -0.5*np.square(x-cen)/np.square(sigma))
    return g1
# create mock data
wavelength_min = 5173
wavelength_max = 5179
cen_mock = 5175
x1 = np.arange(5173,cen_mock,0.01)
x2 = np.arange(cen_mock,5179,0.01)
x = np.arange(wavelength_min,wavelength_max,0.01)
sigma_mock_1 = 1.4
g1 = gaussian_pdf(x[x<=cen_mock],sigma_mock_1,cen_mock)
sigma_mock_2 = 1.9
g2 = gaussian_pdf(x[x>cen_mock],sigma_mock_2,cen_mock)
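The snippet stops before the observed vector is built; one possible way to join the two halves into the y_mock array used in the model below (the constant offset that matches the two peak amplitudes, as described above, is my assumption about how the original mock data were assembled):
# join the two halves into a single observed spectrum;
# shift the right half by a constant so the two peaks match
amp_offset = g1.max() - g2.max()
y_mock = np.concatenate([g1, g2 + amp_offset])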
Once we have some mock data we build the model in pymc3 and sample:
# construct model
with pm.Model() as asym:
    switchpoint = pm.DiscreteUniform("switchpoint", lower=x.min(), upper=x.max())
    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)
    sigma_1 = pm.HalfNormal('sigma_1', sd=10)
    sigma_2 = pm.HalfNormal('sigma_2', sd=10)
    sd = pm.math.switch(switchpoint > x, sigma_1, sigma_2)
    likelihood = pm.Normal(
        'y', mu=(1/(sd*np.sqrt(2*np.pi)))*tt.exp(
            -0.5*tt.square(x-switchpoint)/tt.square(sd)),
        sd=sigma, observed=y_mock)

with asym:
    step1 = pm.NUTS([sigma_1, sigma_2])
    step2 = pm.Metropolis([switchpoint])
    trace = pm.sample(4000, step=[step1, step2], cores=4, tune=4000, chains=4)
The parameters of the model are the switchpoint, which represents where the gaussians change, and the two standard deviations. So the sampler estimates the "right" or "left" standard deviation depending on the value of x.
We now check the results:
df = pm.summary(trace)
x1_m = x[x<=df.loc['switchpoint']['mean']]
x2_m = x[x>df.loc['switchpoint']['mean']]
g1_m = gaussian_pdf(x1_m,df.loc['sigma_1']['mean'],df.loc['switchpoint']['mean'])
g2_m = gaussian_pdf(x2_m,df.loc['sigma_2']['mean'],df.loc['switchpoint']['mean'])
amp_m = g1_m.max() - g2_m.max()
plt.plot(x[x<=cen_mock],g1,label='left gaussian mock')
plt.plot(x[x>cen_mock],g2,label='right gaussian mock')
plt.plot(x1_m,g1_m,label='left gaussian model')
plt.plot(x2_m,g2_m+amp_m,label='right gaussian model')
plt.text(5177,0.23,'sd_mock1='+str(sigma_mock_1))
plt.text(5177,0.2,'sd_mock2='+str(sigma_mock_2))
plt.legend(frameon=False)
There are of course much better ways of checking the sampling results (which are actually Bayesian), but this is a quick and dirty way to get a feel for what is going on. Note that after the modelling I still need to add a constant to one of the gaussians to join them naturally. For three different pairs of standard deviations I get the following plots:
Now for some discussion. To evaluate how good this approach is, we need a better idea of the real data. As you can see, if the standard deviations differ significantly, the model starts failing. However, for somewhat similar standard deviations, including the edge case of equal standard deviations, the model is acceptable. Furthermore, another important input is whether the number of frequency datapoints is equal for the two gaussians. This will determine how each gaussian approaches the continuum. It will also determine whether there is any need to arbitrarily add a constant to the second gaussian in order to join them in a more natural way when fitting real data.
To sum up, this is an approach that follows your desired parametrisation closely, but with some extra work in evaluating it on mock data. From your question it seems to me that the switchpoint approach is what you need, but only by applying it to real data (or more realistic mock data) can one know with some certainty whether it's sufficient.
I am illustrating hyperopt's TPE algorithm for my master's project and can't seem to get the algorithm to converge. From what I understand from the original paper and a YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
Start by creating a search history of [x,y], say 10 points.
Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized)
Make a kernel density estimation for both the poor hyperparameter group (g(x)) and good hyperparameter group (l(x))
Good estimations will have low probability in g(x) and high probability in l(x), so we propose to evaluate the function at argmin(g(x)/l(x))
Evaluate (x,y) pair at the proposed point and repeat steps 2-5.
I have implemented this in python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde
def objective_func(x):
    return x**2

def measure(x):
    noise = np.random.randn(len(x))*0
    return x**2 + noise

def split_meassures(x_obs, y_obs, gamma=1/2):
    # split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x': x_obs[:size], 'y': y_obs[:size]}
    g = {'x': x_obs[size:], 'y': y_obs[size:]}
    y_star = (l['y'][-1] + g['y'][0])/2
    return l, g, y_star
# sample objective function values for illustration
x_obj = np.linspace(-5, 5, 10000)
y_obj = objective_func(x_obj)

# start by sampling a parameter search history
x_obs = np.linspace(-5, 5, 10)
y_obs = measure(x_obs)

nr_iterations = 100
for i in range(nr_iterations):
    # sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs, y_obs = x_obs[sort_idx], y_obs[sort_idx]

    # split sorted observations into two groups (l and g)
    l, g, y_star = split_meassures(x_obs, y_obs)

    # approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)

    # define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l

    if i % 10 == 0:
        plt.figure()
        plt.subplot(2, 2, 1)
        plt.plot(x_obj, y_obj, label='Objective')
        plt.plot(x_obs, y_obs, '*', label='Observations')
        plt.plot([-5, 5], [y_star, y_star], 'k')
        plt.subplot(2, 2, 2)
        plt.plot(x_obj, kde_l)
        plt.subplot(2, 2, 3)
        plt.plot(x_obj, kde_g)
        plt.subplot(2, 2, 4)
        plt.semilogy(x_obj, eval_measure)
        plt.draw()

    # find the point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs, [best_search])
    y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])
plt.show()
I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?
So, I am still learning about TPE as well, but here are the two problems in this code:
This code will only evaluate a few unique points, because the best location is chosen purely from the ratio of the kernel density estimates and there is no mechanism for exploring the search space (the role that acquisition functions play in other Bayesian optimization methods).
Because the code simply appends new observations to the lists of x and y, it adds a lot of duplicates. The duplicates lead to a skewed set of observations, which in turn leads to a very strange split; you can easily see this in the later plots. The eval_measure starts out similar to the objective function but diverges later on.
If you remove the duplicates in x_obs and y_obs, problem no. 2 goes away. The first problem, however, can only be fixed by adding some way of exploring the search space; a rough sketch of both fixes follows.
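As a minimal illustration of both fixes (my own sketch, not part of the original code; the epsilon value is an arbitrary choice), you could deduplicate the observations at the start of each iteration and occasionally replace the greedy proposal with a uniformly random point:
# at the start of each loop iteration, before sorting/splitting:
# fix 2: drop duplicate observations so the KDE split is not skewed
x_obs, unique_idx = np.unique(x_obs, return_index=True)
y_obs = y_obs[unique_idx]

# ... sort, split, KDE and eval_measure as before ...

# fix 1: epsilon-greedy exploration instead of always taking argmin(g/l)
epsilon = 0.2  # exploration probability (arbitrary)
if np.random.random() < epsilon:
    best_search = np.random.uniform(-5, 5)          # explore the search space
else:
    best_search = x_obj[np.argmin(eval_measure)]    # exploit the TPE proposal
x_obs = np.append(x_obs, [best_search])
y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])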
I'm trying to fit some observations, which have measurement errors, to some other data with no measurement error. How do I take into account the measurement error in pyMC3? I have the following approach, which seems to give me reasonable results, but is it the right way to go about it?
n_samples = 20000
with pymc3.Model() as predictive_model:
    intercept = pymc3.Normal('Intercept', mu=1.0, sd=0.2)
    exponent = pymc3.Normal('A', mu=4.2, sd=0.15)
    likelihood = pymc3.Normal('Observed',
                              mu=intercept*x_values**exponent,
                              observed=observed_values,
                              sd=observed_errors)
    start = pymc3.find_MAP()
    step = pymc3.NUTS(scaling=start)
    trace_predictive = pymc3.sample(n_samples, step, start=start, njobs=4)
where x_values, observed_values and observed_errors are 1D numpy arrays of the same length.
It looks like you have a model C x^A, but you believe the data you collected look like C x^A + eps. It also looks like you know the measurement error exactly, somehow (this surprises me!).
If your goal is to infer something about the intercept C, exponent A, and measurement noise eps, I would write the model like this:
with pymc3.Model() as predictive_model:
    intercept = pymc3.Normal('Intercept', mu=1.0, sd=0.2)
    exponent = pymc3.Normal('A', mu=4.2, sd=0.15)
    eps = pymc3.HalfNormal('eps', 10.)
    likelihood = pymc3.Normal('Observed',
                              mu=intercept*x_values**exponent,
                              sd=eps,
                              observed=observed_values - observed_errors)
    trace_predictive = pymc3.sample(n_samples, njobs=4)
(note that there are better ways to initialize than the MAP now, and they get chosen automatically!)
I'm working through a book called Bayesian Analysis in Python. The book focuses heavily on the package PyMC3 but is a little vague on the theory behind it so I'm quite confused.
Say I have data like this:
data = np.array([51.06, 55.12, 53.73, 50.24, 52.05, 56.40, 48.45, 52.34, 55.65, 51.49, 51.86, 63.43, 53.00, 56.09, 51.93, 52.31, 52.33, 57.48, 57.44, 55.14, 53.93, 54.62, 56.09, 68.58, 51.36, 55.47, 50.73, 51.94, 54.95, 50.39, 52.91, 51.5, 52.68, 47.72, 49.73, 51.82, 54.99, 52.84, 53.19, 54.52, 51.46, 53.73, 51.61, 49.81, 52.42, 54.3, 53.84, 53.16])
And I'm looking at a model like this (a Gaussian likelihood for the data, with a Normal prior on mu and a half-normal prior on sigma):
Using Metropolis sampling, how can I fit a model that estimates mu and sigma?
Here is my guess at pseudo-code from what I've read:
M, S = 50, 1
G = 1

# These are priors right?
mu = stats.norm(loc=M, scale=S)
sigma = stats.halfnorm(scale=G)

target = stats.norm

steps = 1000

mu_samples = [50]
sigma_samples = [1]

for i in range(steps):
    # proposed sample...
    mu_i, sigma_i = mu.rvs(), sigma.rvs()

    # Something happens here
    # How do I calculate the likelihood??
    "..."

    # some evaluation of a likelihood ratio??
    a = "some"/"ratio"

    acceptance_bar = np.random.random()
    if a > acceptance_bar:
        mu_samples.append(mu_i)
        sigma_samples.append(sigma_i)
What am I missing??
I hope the following example helps you. In this example I am going to assume we know the value of sigma, so we only have a prior for mu.
import numpy as np
from scipy import stats

M, S = 50, 1  # prior parameters for mu, as in the question

sigma = data.std()  # we are assuming we know sigma

steps = 1000
mu_old = data.mean()  # initial value, just a good guess
mu_samples = []

# we evaluate the prior for the initial point
prior_old = stats.norm(M, S).pdf(mu_old)
# we evaluate the likelihood for the initial point
likelihood_old = np.prod(stats.norm(mu_old, sigma).pdf(data))
# Bayes' theorem (omitting the denominator) for the initial point
post_old = prior_old * likelihood_old

for i in range(steps):
    # proposal distribution: propose a new value from the old one
    mu_new = stats.norm.rvs(mu_old, 0.1)
    # we evaluate the prior
    prior_new = stats.norm(M, S).pdf(mu_new)
    # we evaluate the likelihood
    likelihood_new = np.prod(stats.norm(mu_new, sigma).pdf(data))
    # Bayes' theorem (omitting the denominator)
    post_new = prior_new * likelihood_new
    # the ratio of posteriors (we do not need to know the normalizing constant)
    a = post_new / post_old
    if np.random.random() < a:
        mu_old = mu_new
        post_old = post_new
    mu_samples.append(mu_old)
Notes:
Notice that I have defined a proposal distribution, which in this case is a Gaussian centered at mu_old with a standard deviation of 0.1 (an arbitrary value). In practice the efficiency of MH depends heavily on the proposal distribution, so PyMC3 (as well as other practical implementations of MH) uses some heuristic to tune the proposal distribution.
For simplicity I used the pdf in this example, but in practice it is more convenient to work with the logpdf. This avoids underflow problems without changing the results (see the sketch after these notes).
The likelihood is computed as a product of the densities of the individual data points (assuming the data are independent).
The ratio you were missing is the ratio of the posteriors.
If you do not accept the new proposed value, then you save (again) the old value.
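For reference, here is a minimal log-space version of the same update (my own sketch, not from the book; it reuses the variable names from the code above):
# same Metropolis step, but working with log densities to avoid underflow
log_post_old = (stats.norm(M, S).logpdf(mu_old)
                + np.sum(stats.norm(mu_old, sigma).logpdf(data)))

for i in range(steps):
    mu_new = stats.norm.rvs(mu_old, 0.1)
    log_post_new = (stats.norm(M, S).logpdf(mu_new)
                    + np.sum(stats.norm(mu_new, sigma).logpdf(data)))
    # accept with probability min(1, exp(log_post_new - log_post_old))
    if np.log(np.random.random()) < log_post_new - log_post_old:
        mu_old = mu_new
        log_post_old = log_post_new
    mu_samples.append(mu_old)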
Remember to check this repository for the errata and for updated versions of the code. The major difference between the updated code and the code in the book is that the preferred way to run a model with PyMC3 is now to just use pm.sample() and let PyMC3 choose the samplers and initialization points for you.
I'm trying to automate a process that at some point needs to draw samples from a truncated multivariate normal. That is, it's a multivariate normal distribution (i.e. Gaussian), but the variables are constrained to a cuboid. My given inputs are the mean and covariance of the full multivariate normal, but I need samples in my box.
Up to now, I'd just been rejecting samples outside the box and resampling as necessary, but I'm starting to find that my process sometimes gives me (a) large covariances and (b) means that are close to the edges. These two events conspire against the speed of my system.
So what I'd like to do is sample the distribution correctly in the first place. Googling led only to this discussion or the truncnorm distribution in scipy.stats. The former is inconclusive and the latter seems to be for one variable. Is there any native multivariate truncated normal? And is it going to be any better than rejecting samples, or should I do something smarter?
I'm going to start working on my own solution, which would be to rotate the untruncated Gaussian to its principal axes (with an SVD decomposition or something), use a product of truncated Gaussians to sample the distribution, then rotate that sample back, and reject/resample as necessary. If the truncated sampling is more efficient, I think this should sample the desired distribution faster.
So, according to the Wikipedia article, sampling a multivariate truncated normal distribution (MTND) is more difficult. I ended up taking a relatively easy way out and using an MCMC sampler to relax an initial guess towards the MTND as follows.
I used emcee to do the MCMC work. I find this package phenomenally easy-to-use. It only requires a function that returns the log-probability of the desired distribution. So I defined this function
import numpy as np
from numpy.linalg import inv

def lnprob_trunc_norm(x, mean, bounds, C):
    if np.any(x < bounds[:, 0]) or np.any(x > bounds[:, 1]):
        return -np.inf
    else:
        return -0.5*(x-mean).dot(inv(C)).dot(x-mean)
Here, C is the covariance matrix of the multivariate normal. Then, you can run something like
S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm, args = (mean, bounds, C))
pos, prob, state = S.run_mcmc(pos, Nsteps)
for given mean, bounds and C. You need an initial guess for the walkers' positions pos, which could be a ball around the mean,
pos = emcee.utils.sample_ball(mean, np.sqrt(np.diag(C)), size=Nwalkers)
or sampled from an untruncated multivariate normal,
pos = numpy.random.multivariate_normal(mean, C, size=Nwalkers)
and so on. I personally do several thousand steps of burn-in first (discarding those samples), because it's fast, then force any remaining out-of-bounds walkers back within the bounds, and then run the actual MCMC sampling; a rough sketch of that setup is shown below.
The number of steps for convergence is up to you.
Note also that emcee easily supports basic parallelization by adding the argument threads=Nthreads to the EnsembleSampler initialization. So you can make this blazing fast.
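As a minimal sketch of the initialization and clipping step described above (assuming bounds is an (Ndim, 2) array of lower and upper limits, that mean, C, Nwalkers, Ndim and Nsteps are already defined, and using the emcee 2.x interface from the snippets above):
import numpy as np
import emcee

# initial walker positions: a ball around the mean...
pos = emcee.utils.sample_ball(mean, np.sqrt(np.diag(C)), size=Nwalkers)
# ...then push any walkers that start outside the box back onto its boundary
pos = np.clip(pos, bounds[:, 0], bounds[:, 1])

S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm, args=(mean, bounds, C))

# short burn-in run whose samples we discard
pos, prob, state = S.run_mcmc(pos, 1000)
S.reset()

# the actual sampling run
pos, prob, state = S.run_mcmc(pos, Nsteps)
samples = S.flatchain  # shape (Nwalkers*Nsteps, Ndim)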
I have reimplemented an algorithm which does not depend on MCMC but creates independent and identically distributed (iid) samples from the truncated multivariate normal distribution. Having iid samples can be very useful! I used to also use emcee as described in the answer by Warrick, but for convergence the number of samples needed exploded in higher dimensions, making it impractical for my use case.
The algorithm was introduced by Botev (2016) and uses an accept-reject algorithm based on minimax exponential tilting. It was originally implemented in MATLAB but reimplementing it for Python increased the performance significantly compared to running it using the MATLAB engine in Python. It also works well and is fast at higher dimensions.
The code is available at: https://github.com/brunzema/truncated-mvn-sampler.
An Example:
import numpy as np
# TruncatedMVN is provided by the repository linked above

d = 10  # dimensions

# random mu and cov
mu = np.random.rand(d)
cov = 0.5 - np.random.rand(d ** 2).reshape((d, d))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())
cov = np.dot(cov, cov)

# constraints
lb = np.zeros_like(mu) - 1
ub = np.ones_like(mu) * np.inf

# create the truncated normal and sample from it
n_samples = 100000
tmvn = TruncatedMVN(mu, cov, lb, ub)
samples = tmvn.sample(n_samples)
Plotting the first dimension results in:
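(The original answer includes a histogram plot here. A rough sketch of how such a plot could be produced, assuming the returned samples are arranged with one row per dimension; use samples[:, 0] instead if the orientation is one column per dimension:)
import matplotlib.pyplot as plt

# histogram of the first dimension of the drawn samples
plt.hist(samples[0], bins=100, density=True)
plt.xlabel('first dimension')
plt.show()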
Reference:
Botev, Z. I. (2016), "The normal law under linear restrictions: simulation and estimation via minimax tilting", Journal of the Royal Statistical Society: Series B, 79(1), 125-148.
Simulating truncated multivariate normal can be tricky and usually involves some conditional sampling by MCMC.
My short answer is, you can use my code (https://github.com/ralphma1203/trun_mvnt)!!! It implements a Gibbs sampler algorithm that can handle general linear constraints of the form lower <= Dx <= upper, even when D is not full rank and you have more constraints than the dimensionality.
import numpy as np
from trun_mvnt import rtmvn, rtmvt
########## Traditional problem, probably what you need... ##########
##### lower < X < upper #####
# So D = identity matrix
D = np.diag(np.ones(4))
lower = np.array([-1,-2,-3,-4])
upper = -lower
Mean = np.zeros(4)
Sigma = np.diag([1,2,3,4])
n = 10 # number of final samples to draw
burn = 100 # burn-in: discard the first 100 iterations
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin)
# Numpy array n-by-p as result!
random_sample
########## Non-full rank problem (more constraints than dimension) ##########
Mean = np.array([0,0])
Sigma = np.array([1, 0.5, 0.5, 1]).reshape((2,2)) # bivariate normal
D = np.array([1,0,0,1,1,-1]).reshape((3,2)) # non-full rank problem
lower = np.array([-2,-1,-2])
upper = np.array([2,3,5])
n = 500 # want 500 final samples
burn = 100 # burn-in: discard the first 100 iterations
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin) # Numpy array n-by-p as result!
A little late, I guess, but for the record: you could use Hamiltonian Monte Carlo. There is a MATLAB module for this named "HMC exact". It shouldn't be too difficult to translate to Python.