Learning Discrete HMM parameters in PyMC - python

I am trying to learn the parameters of a simple discrete HMM using PyMC. I am modeling the rainy-sunny model from the Wiki page on HMM. The model looks as follows:
I am using the following priors.
theta_start_state ~ beta(20,10)
theta_transition_rainy ~beta(8,2)
theta_transition_sunny ~beta(2,8)
theta_emission_rainy ~ Dirichlet(3,4,3)
theta_emission_sunny ~ Dirichlet(10,6,4)
Initially, I use this setup to create a training set as follows.
import numpy as np
import pymc as pm

## Some not so informative priors!
# Prior on start state
theta_start_state = pm.Beta('theta_start_state',12,8)
# Prior on transition from rainy
theta_transition_rainy = pm.Beta('transition_rainy',8,2)
# Prior on transition from sunny
theta_transition_sunny = pm.Beta('transition_sunny',2,8)
# Prior on emission from rainy
theta_emission_rainy = pm.Dirichlet('emission_rainy',[3,4,3])
# Prior on emission from sunny
theta_emission_sunny = pm.Dirichlet('emission_sunny',[10,6,4])
# Start state
x_train_0 = pm.Categorical('x_0',[theta_start_state, 1-theta_start_state])
N = 100
# Create a train set for hidden states
x_train = np.empty(N, dtype=object)
# Creating a train set of observations
y_train = np.empty(N, dtype=object)
x_train[0] = x_train_0
for i in xrange(1, N):
    if x_train[i-1].value==0:
        x_train[i] = pm.Categorical('x_train_%d'%i,[theta_transition_rainy, 1- theta_transition_rainy])
    else:
        x_train[i] = pm.Categorical('x_train_%d'%i,[theta_transition_sunny, 1- theta_transition_sunny])
for i in xrange(0,N):
    if x_train[i].value == 0:
        # Rain
        y_train[i] = pm.Categorical('y_train_%d' %i, theta_emission_rainy)
    else:
        # Sunny
        y_train[i] = pm.Categorical('y_train_%d' %i, theta_emission_sunny)
However, I am not able to understand how to learn these parameters using PyMC. I made a start as follows.
def y(x=x_train, value =y_train):
    N = len(x)
    out = np.empty(N, dtype=object)
    for i in xrange(0,N):
        if x[i].value == 0:
            # Rain
            out[i] = pm.Categorical('y_%d' %i, theta_emission_rainy)
        else:
            # Sunny
            out[i] = pm.Categorical('y_%d' %i, theta_emission_sunny)
    return out
The complete notebook containing this code can be found here.
Aside: The gist containing HMM code for a Gaussian is really hard to understand! (not documented)
Based on the answers below, I tried changing my code as follows:
def y(value=y_train, hidden_states = x_train):
    def logp(value, hidden_states):
        logprob = 0
        for i in xrange(0,len(hidden_states)):
            if hidden_states[i].value == 0:
                # Rain
                logprob = logprob + pm.categorical_like(value[i], theta_emission_rainy)
            else:
                # Sunny
                logprob = logprob + pm.categorical_like(value[i], theta_emission_sunny)
        return logprob
The next step would be to create a model and then run the MCMC algorithm. However, the edited code above does not work either: it gives a ZeroProbability error.
I am not sure if I have interpreted the answers correctly.

Just some thoughts on this:
Sunny and Rainy are mutually exclusive and exhaustive hidden states. Why don't you encode them as a single categorical weather variable which can take one of two values (coding for sunny, rainy) ?
In your likelihood function, you seem to observe Rainy / Sunny. The way I see it in your model graph, these seem to be the hidden, not the observed variables (that would be "walk", "shop" and "clean")
In your likelihood function, you need to sum (for all time steps t) the log-probability of the observed values (of walk, shop and clean respectively) given the current (sampled) values of rainy/sunny (i.e., Weather) at the same time step t.
If you want to learn parameters for the model, you might want to consider switching to PyMC3, which is better suited to automatically computing gradients for your logp function. But in this case (since you chose conjugate priors) this is not really necessary. If you don't know what conjugate priors are, or need an overview, see the Wikipedia article on conjugate priors, which includes a list of them.
Depending on what you want to do, you have a few choices here. If you want to sample from the posterior distribution of all parameters, just specify your MCMC model as you did and run the sampler. After that, plot and summarize the marginal distributions of the parameters you're interested in, and you are done.
If you are not interested in marginal posterior distributions, but rather in finding the joint MAP parameters, you might consider using Expectation Maximization (EM) learning or Simulated Annealing. Both should work reasonably well within the MCMC framework.
For EM Learning simply repeat these steps until convergence:
E (Expectation) Step: Run the MCMC chain for a while and collect samples
M (Maximization) Step: Update the hyperparameters for your Beta and Dirichlet priors as if these samples had been actual observations. Look up the Beta and Dirichlet distributions if you don't know how to do that.
I would use a small learning rate factor so you don't fall into the first local optimum (now we're approaching Simulated Annealing): Instead of treating the N samples you generated from the MCMC chain as actual observations, treat them as K observations (for a value K << N) by scaling the updates to the hyperparameters down by a learning rate factor of K/N.
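A minimal sketch of the damped M-step described above, assuming the samples collected in the E-step have been reduced to success/failure counts for a single Beta prior. The function name `damped_beta_update` and the example counts are hypothetical, and this is plain Python rather than PyMC code.

```python
def damped_beta_update(alpha, beta, successes, failures, k):
    """Update Beta(alpha, beta) hyperparameters from sampled pseudo-observations.

    The n = successes + failures sampled outcomes are treated as only k
    observations by scaling the conjugate update by k / n.
    """
    n = successes + failures
    if n == 0:
        return alpha, beta  # nothing sampled, nothing to update
    rate = k / float(n)
    return alpha + rate * successes, beta + rate * failures


# Example: prior Beta(8, 2); the chain produced 1000 sampled outcomes,
# 700 of them successes. Treat them as K = 10 observations.
new_alpha, new_beta = damped_beta_update(8.0, 2.0, 700, 300, k=10)
print(new_alpha, new_beta)
```

The same scaling would apply component-wise to the concentration parameters of a Dirichlet prior.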

The first thing that pops out at me is the return value of your likelihood. PyMC expects a scalar return value, not a list/array. You need to sum the array before returning it.
Also, when you use a Dirichlet as a prior for the Categorical, PyMC detects this and fills in the last probability. Here's how I would code your x_train/y_train loops:
p = []
for i in xrange(1, N):
    # This will return the first variable if prev=0, and the second otherwise
    p.append(pm.Lambda('p_%i' % i, lambda prev=x_train[i-1]: (theta_transition_rainy, theta_transition_sunny)[bool(prev)]))
    x_train[i] = pm.Categorical('x_train_%i' % i, p[-1])
So, you grab the appropriate probabilities with a Lambda, and use it as the argument for the Categorical.
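To illustrate the point about returning a scalar, here is a minimal plain-Python sketch of an HMM emission log-likelihood that sums over all time steps. The function name `emission_logp` and the example probabilities are assumptions for illustration; inside PyMC the same sum would be accumulated from `pm.categorical_like` calls.

```python
import math

def emission_logp(observations, hidden_states, emission_probs):
    """Sum of log P(y_t | x_t) over all time steps, returned as one float."""
    total = 0.0
    for y_t, x_t in zip(observations, hidden_states):
        p = emission_probs[x_t][y_t]
        if p <= 0.0:
            return float('-inf')  # impossible observation
        total += math.log(p)
    return total

# Hypothetical emission tables: state 0 = rainy, state 1 = sunny;
# observations 0 = walk, 1 = shop, 2 = clean.
emission_probs = [[0.1, 0.4, 0.5],
                  [0.6, 0.3, 0.1]]
logp = emission_logp([0, 1, 2], [1, 0, 0], emission_probs)
print(logp)
```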


Understanding hyperopt's TPE algorithm

I am illustrating hyperopt's TPE algorithm for my master's project and can't seem to get the algorithm to converge. From what I understand from the original paper and a YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
1. Start by creating a search history of [x,y], say 10 points.
2. Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized).
3. Make a kernel density estimate for both the poor hyperparameter group (g(x)) and the good hyperparameter group (l(x)).
4. Good hyperparameters will have low probability under g(x) and high probability under l(x), so we propose to evaluate the function at argmin(g(x)/l(x)).
5. Evaluate the (x,y) pair at the proposed point and repeat steps 2-5.
I have implemented this in python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde
def objective_func(x):
    return x**2
def measure(x):
    noise = np.random.randn(len(x))*0
    return x**2+noise
def split_meassures(x_obs,y_obs,gamma=1/2):
    #split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x':x_obs[:size],'y':y_obs[:size]}
    g = {'x':x_obs[size:],'y':y_obs[size:]}
    y_star = (l['y'][-1]+g['y'][0])/2
    return l,g,y_star
#sample objective function values for ilustration
x_obj = np.linspace(-5,5,10000)
y_obj = objective_func(x_obj)
#start by sampling a parameter search history
x_obs = np.linspace(-5,5,10)
y_obs = measure(x_obs)
nr_iterations = 100
for i in range(nr_iterations):
    #sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs,y_obs = x_obs[sort_idx],y_obs[sort_idx]
    #split sorted observations in two groups (l and g)
    l,g,y_star = split_meassures(x_obs,y_obs)
    #approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)
    #define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l
    if i%10==0:
        pass  # (the body of this branch is missing from the source)
    #find point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs,[best_search])
    y_obs = np.append(y_obs,[measure(np.asarray([best_search]))])
I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?
So, I am still learning about TPE as well. But here are the two problems in this code:
1. This code will only evaluate a few unique points. The best location is computed from the kernel density estimates, but the code has no way of exploring the search space, which is what acquisition functions do, for example.
2. Because this code simply appends new observations to the lists of x and y, it adds a whole lot of duplicates. The duplicates lead to a skewed set of observations, which leads to a very weird split, and you can easily see that in the later plots. The eval_measure starts out as something similar to the objective function but diverges later on.
If you remove the duplicates in x_obs and y_obs, you can remove problem no. 2. However, the first problem can only be removed by adding some way of exploring the search space.
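One way to add exploration, roughly in the spirit of the TPE paper, is to draw a handful of random candidates from l(x) and evaluate the one with the highest l(x)/g(x), instead of taking a deterministic argmin over a fixed grid. The sketch below is an illustration under assumptions, not hyperopt's implementation: it uses a hand-rolled fixed-bandwidth Gaussian KDE in plain Python, and the names `kde_pdf`, `kde_sample`, `tpe_step`, `bandwidth` and `n_candidates` are made up here.

```python
import math
import random

def kde_pdf(x, points, bandwidth):
    """Gaussian kernel density estimate at x from a list of points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

def kde_sample(points, bandwidth, rng):
    """Draw one sample from the KDE: pick a point, then add Gaussian noise."""
    return rng.gauss(rng.choice(points), bandwidth)

def tpe_step(x_obs, y_obs, rng, gamma=0.25, bandwidth=0.5, n_candidates=24):
    """Propose the next x by sampling from l(x) and maximising l(x)/g(x)."""
    order = sorted(range(len(x_obs)), key=lambda i: y_obs[i])
    n_good = max(1, int(gamma * len(order)))
    good = [x_obs[i] for i in order[:n_good]]
    bad = [x_obs[i] for i in order[n_good:]]
    candidates = [kde_sample(good, bandwidth, rng) for _ in range(n_candidates)]
    # The small constant keeps the ratio finite where g(x) underflows to zero.
    return max(candidates,
               key=lambda x: kde_pdf(x, good, bandwidth) / (kde_pdf(x, bad, bandwidth) + 1e-12))

def objective(x):
    return x ** 2

rng = random.Random(0)
x_obs = [-5.0 + 10.0 * i / 9 for i in range(10)]  # same initial grid as the question
y_obs = [objective(x) for x in x_obs]
for _ in range(50):
    x_new = tpe_step(x_obs, y_obs, rng)
    x_obs.append(x_new)
    y_obs.append(objective(x_new))

best_y = min(y_obs)
print(best_y)
```

Because the candidates are random draws, repeated proposals differ from each other, which also avoids the duplicate observations described in problem no. 2.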

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. The only output a regressor gives you is a predicted value, which is why it is called regression, so a plain regressor does not give you the probability of a prediction. That only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs and compare the prediction on that curve, and check the pvalue. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
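# density=True replaces the old normed=True argument, which newer NumPy/Matplotlib no longer accept
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2 # converting bin edges to centers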
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolation method is not great for outliers. If a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.
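For the normal case mentioned above, a minimal sketch using only the standard library: fit a normal distribution to the known outputs and report a two-sided tail probability for a prediction. As with the histogram approach, this ignores the inputs entirely. The function name `tail_probability` and the simulated outputs are assumptions for illustration.

```python
import random
from statistics import NormalDist

def tail_probability(prediction, outputs):
    """Two-sided tail probability of `prediction` under a normal fit to `outputs`."""
    dist = NormalDist.from_samples(outputs)
    cdf = dist.cdf(prediction)
    return 2.0 * min(cdf, 1.0 - cdf)

rng = random.Random(0)
outputs = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for the real outputs
p_typical = tail_probability(0.1, outputs)
p_extreme = tail_probability(4.0, outputs)
print(p_typical, p_extreme)
```

A prediction near the centre of the known outputs gets a tail probability close to 1, while one far out in the tails gets a value close to 0.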
I suggest you look into NGBoost (a gradient boosting library in the same family as XGBoost, which produces a probabilistic model).
Here you can find slides on how NGBoost works and the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit a boosted model to estimate the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model splits the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.

PyMC3 - Differences in ways observations are passed to model -> difference in results?

I'm trying to understand if there is any meaningful difference in the ways of passing data into a model - either aggregated or as single trials (note this will only be a sensical question for certain distributions e.g. Binomial).
Predicting p for a yes/no trial, using a simple model with a Binomial distribution.
What is the difference in the computation/results of the following models (if any)?
I chose the two extremes, either passing in a single trial at a time (reducing to Bernoulli) or passing in the sum of the entire series of trials, to illustrate what I mean, though I am also interested in what happens in between these extremes.
import numpy as np
import pymc3 as pm
import scipy.stats
# set up constants
p_true = 0.1
N = 3000
observed = scipy.stats.bernoulli.rvs(p_true, size=N)
Model 1: combining all observations into a single data point
with pm.Model() as binomial_model1:
    p = pm.Uniform('p', lower=0, upper=1)
    observations = pm.Binomial('observations', N, p, observed=np.sum(observed))
    trace1 = pm.sample(40000)
Model 2: using each observation individually
with pm.Model() as binomial_model2:
    p = pm.Uniform('p', lower=0, upper=1)
    observations = pm.Binomial('observations', 1, p, observed=observed)
    trace2 = pm.sample(40000)
There isn't any noticeable difference in the trace or posteriors in this case. I attempted to dig into the pymc3 source code to see how the observations were being processed but couldn't find the right part.
Possible expected answers:
pymc3 aggregates the observations under the hood for Binomial anyway, so there is no difference
the resultant posterior surface (which is explored in the sample process) is identical in each case -> there is no meaningful/statistical difference in the two models
there are differences in the resultant statistics because of this and that...
This is an interesting example! Your second suggestion is correct: you can actually work out the posterior analytically. With the Uniform(0, 1) prior, which is the same as Beta(1, 1), it will be distributed according to
Beta(1 + sum(observed), 1 + N - sum(observed))
in either case.
The difference in modelling approach would show up if you used, for example, pm.sample_ppc, in that the first would be distributed according to Binomial(N, p) and the second would be N draws of Binomial(1, p).
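The reason both models give the same posterior is that their log-likelihoods differ only by the log binomial coefficient, which does not depend on p. A minimal plain-Python check of this, with hypothetical data and helper names (`bernoulli_loglik`, `binomial_loglik`) chosen here for illustration:

```python
import math

def bernoulli_loglik(p, observed):
    """Sum of per-trial Bernoulli log-likelihoods."""
    return sum(math.log(p) if y else math.log(1.0 - p) for y in observed)

def binomial_loglik(p, n, k):
    """Binomial log-likelihood of k successes in n trials."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)

observed = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # hypothetical data: 3 successes in 10 trials
n, k = len(observed), sum(observed)
# The difference is the same for every p, so it cancels when the posterior is normalised.
diffs = [binomial_loglik(p, n, k) - bernoulli_loglik(p, observed) for p in (0.1, 0.3, 0.7)]
print(diffs)
```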

PyMC: Estimating population parameters where each observation is the sum of two Weibull-distributed variables

I have a list of n observations, each of which is the sum of two Weibull-distributed variables:
x[i] = t1[i] + t2[i]
t1[i] ~ Weibull(shape1, scale1)
t2[i] ~ Weibull(shape2, scale2)
My goal is:
1) Estimate the shape and scale parameters for both Weibull distributions (shape1, scale1, shape2, scale2),
2) For each observation x[i], estimate t1[i] (and t2[i] follows from this).
(Aside: Each observation x[i] is the age of cancer diagnosis, and t1[i] and t2[i] are two different time periods in the development of the tumor. The actual model involves mutation data as well, but before I try that out, I want to make sure that I can use PyMC for this simpler problem.)
I am using PyMC2 to make these estimates, and it looks like the run converges, but to incorrect results. I do not know whether there is a problem with my PyMC model syntax, with the MCMC settings, or both. I tried adapting this advice on using Potentials to model latent variables. First I define x[i] and t1[i] for each observation:
for i in xrange(n):
    x[i] = pm.Index('x_%i'%i, x=data, index=i) # data is a list of observations
    t1[i] = pm.Weibull('t1_%i'%i, alpha=shape1, beta=scale1)
    # Ensure that initial guess for t1 is not more than the observed sum:
    if t1[i].value >= x[i].value:
        t1[i].value = 0.95 * x[i].value
Then I define a Deterministic for t2[i] = x[i] - t1[i]:
for i in xrange(n):
    def subtractfunc(t1=t1, x=x, ii=i):
        return x[ii] - t1[ii]
    t2[i] = pm.Lambda('t2_%i'%i, subtractfunc)
And last I define the Potential for t2[i]:
t2dist = np.empty(n, dtype=object)
for i in xrange(n):
    def weibfunc(t2=t2, shape2=shape2, scale2=scale2, ii=i):
        return pm.weibull_like(t2[ii], alpha=shape2, beta=scale2)
    t2dist[i] = pm.Potential(logp = weibfunc,
                             name = 't2dist_%i'%i,
                             parents = {'shape2':shape2, 'scale2':scale2, 't2':t2},
                             doc = 'weibull potential for t2',
                             verbose = 0,
                             cache_depth = 2)
You can see my full code here. I test by simulating 60 independent observations, with shape1 = 1, scale1 = 30, shape2 = 6.5, scale2 = 10, and I run 1e5 iterations of AdaptiveMetropolis. The results converge to a mean of shape1=1.94, scale1=37.9, shape2=0.55, scale2=36.1, and the 95% HPDs do not include the true values. This resulting distribution is not even in the right ballpark, as this histogram shows. (Blue shows the simulated data x[i] that I used, while the red shows the completely different inferred distribution from a representative iteration in the MCMC run.)
Running again with a different random seed, I get shape1=4.65, scale1=23.3, shape2=0.83, scale2=21.3. This distribution is somewhat closer to the truth. Is there some way to change the MCMC settings to consistently get decent results for this sort of problem? Any advice about using PyMC more effectively is much appreciated.
Update -- tried an "assisted" MCMC run:
I also tried assisting the MCMC run by initializing population-level parameters with values close to the truth. The results are somewhat better, but I now find a systematic bias. The histogram below shows the true distribution of observations (blue) against the fitted distribution (red). The right tail fits nicely, but the fit fails to capture the sharp peak at the left side. This bias occurs consistently, for population sizes n = 60 and 100. I am not sure if this is more of a PyMC question or a general MCMC algorithm issue.
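For reference, the simulated test data described above (60 observations with shape1 = 1, scale1 = 30, shape2 = 6.5, scale2 = 10) could be generated along these lines. This is a standard-library sketch, not the code from the linked script; the function name `simulate_sums` and the seed are assumptions.

```python
import random

def simulate_sums(n, shape1, scale1, shape2, scale2, seed=0):
    """Return (x, t1, t2) where x[i] = t1[i] + t2[i], both Weibull distributed."""
    rng = random.Random(seed)
    # Note: random.weibullvariate takes (scale, shape) in that order.
    t1 = [rng.weibullvariate(scale1, shape1) for _ in range(n)]
    t2 = [rng.weibullvariate(scale2, shape2) for _ in range(n)]
    x = [a + b for a, b in zip(t1, t2)]
    return x, t1, t2

x, t1, t2 = simulate_sums(60, shape1=1.0, scale1=30.0, shape2=6.5, scale2=10.0)
print(len(x), min(x), max(x))
```

Keeping the true t1 and t2 alongside x makes it possible to compare the per-observation estimates of t1[i] against the values that generated the data.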

Calculating Nearest Match to Mean/Stddev Pair With LibSVM

I'm new to SVMs, and I'm trying to use the Python interface to libsvm to classify a sample containing a mean and stddev. However, I'm getting nonsensical results.
Is this task inappropriate for SVMs or is there an error in my use of libsvm? Below is the simple Python script I'm using to test:
#!/usr/bin/env python
# Simple classifier test.
# Adapted from the svm_test.py file included in the standard libsvm distribution.
from collections import defaultdict
from svm import *
# Define our sparse data formatted training and testing sets.
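labels = [1,2,3,4]
train = [ # key: 0=mean, 1=stddev
    # (training samples are missing from the source)
]
problem = svm_problem(labels, train)
test = [
    ({0:3, 1:3.11},1),
    # (remaining test samples are missing from the source)
]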
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
    print kt
    param = svm_parameter(kernel_type = kt, C=10, probability = 1)
    model = svm_model(problem, param)
    for test_sample,correct_label in test:
        pred_label, pred_probability = model.predict_probability(test_sample)
        correct[kn] += pred_label == correct_label
# Show results.
print '-'*80
print 'Accuracy:'
for kn,correct_count in correct.iteritems():
    print '\t',kn, '%.6f (%i of %i)' % (correct_count/float(len(test)), correct_count, len(test))
The domain seems fairly simple. I'd expect that if it's trained to know a mean of 2.5 means label 1, then when it sees a mean of 2.4, it should return label 1 as the most likely classification. However, each kernel has an accuracy of 0%. Why is this?
A couple of side notes: is there a way to hide all the verbose training output that libsvm dumps to the terminal? I've searched libsvm's docs and code, but I can't find any way to turn this off.
Also, I had wanted to use simple strings as the keys in my sparse dataset (e.g. {'mean':2.5,'stddev':3.5}). Unfortunately, libsvm only supports integers. I tried using the long integer representation of the string (e.g. 'mean' == 1109110110971110), but libsvm seems to truncate these to normal 32-bit integers. The only workaround I see is to maintain a separate "key" file that maps each string to an integer ('mean'=0, 'stddev'=1). But obviously that'll be a pain since I'll have to maintain and persist a second file along with the serialized classifier. Does anyone see an easier way?
The problem seems to be coming from combining multiclass prediction with probability estimates.
If you configure your code not to make probability estimates, it actually works, e.g.:
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
    print kt
    param = svm_parameter(kernel_type = kt, C=10) # Here -> rm probability = 1
    model = svm_model(problem, param)
    for test_sample,correct_label in test:
        # Here -> change predict_probability to just predict
        pred_label = model.predict(test_sample)
        correct[kn] += pred_label == correct_label
With this change, I get:
polynomial 1.000000 (4 of 4)
rbf 1.000000 (4 of 4)
linear 1.000000 (4 of 4)
Prediction with probability estimates does work if you double up the data in the training set (i.e., include each data point twice). However, I couldn't find any way to parametrize the model so that multiclass prediction with probabilities would work with just the original four training points.
If you are interested in a different way of doing this, you could do the following. This way is theoretically more sound, however not as straightforward.
By mentioning mean and std, it seems as if you refer to data that you assume to be distributed in some way, e.g. the data you observe is Gaussian distributed. You can then use the symmetrised Kullback-Leibler divergence as a distance measure between those distributions, and use something like k-nearest neighbour to classify.
For two probability densities p and q, you have KL(p, q) = 0 only if p and q are the same. However, KL is not symmetric, so in order to have a proper distance measure, you can use
distance(p1, p2) = KL(p1, p2) + KL(p2, p1)
For Gaussians, KL(p1, p2) = { (μ1 - μ2)^2 + σ1^2 - σ2^2 } / (2 σ2^2) + ln(σ2/σ1). (I stole that from here, where you can also find a derivation :)
Long story short:
Given a training set D of (mean, std, class) tuples and a new p = (mean, std) pair, find the q in D for which distance(q, p) is minimal and return its class.
To me that feels better than the SVM approach with several kernels, since the way of classifying is not so arbitrary.
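A minimal sketch of this nearest-neighbour rule in plain Python, using the Gaussian KL formula with the distance symmetrised as KL(p1, p2) + KL(p2, p1). The training set `D` below is hypothetical (the original training data is not shown), and the function names are chosen here for illustration.

```python
import math

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL divergence KL(p1, p2) between two univariate Gaussians."""
    return (((mu1 - mu2) ** 2 + sigma1 ** 2 - sigma2 ** 2) / (2.0 * sigma2 ** 2)
            + math.log(sigma2 / sigma1))

def distance(p1, p2):
    """Symmetrised KL divergence between two (mean, std) pairs."""
    return kl_gauss(p1[0], p1[1], p2[0], p2[1]) + kl_gauss(p2[0], p2[1], p1[0], p1[1])

def classify(p, training_set):
    """Return the class of the training tuple (mean, std, cls) nearest to p."""
    mean, std, cls = min(training_set, key=lambda q: distance((q[0], q[1]), p))
    return cls

# Hypothetical training set of (mean, std, class) tuples.
D = [(2.5, 3.5, 1), (5.0, 1.2, 2), (7.0, 3.3, 3), (10.3, 0.3, 4)]
label = classify((3.0, 3.11), D)
print(label)
```

Extending this to k > 1 neighbours would mean sorting `D` by distance and taking a majority vote over the first k classes.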

