Related
I am trying to estimate a model in tensorflow using NUTS by providing it a likelihood function. I have checked the likelihood function is returning reasonable values. I am following the setup here for setting up NUTS:
https://rlhick.people.wm.edu/posts/custom-likes-tensorflow.html
and some of the examples here for setting up priors, etc.:
https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Multilevel_Modeling_Primer.ipynb
My code is in a colab notebook here:
https://drive.google.com/file/d/1L9JQPLO57g3OhxaRCB29do2m808ZUeex/view?usp=sharing
I get the error: OperatorNotAllowedInGraphError: iterating overtf.Tensoris not allowed: AutoGraph did not convert this function. Try decorating it directly with #tf.function. This is my first time using tensorflow and I am quite lost interpreting this error. It would also be ideal if I could pass the starting parameter values as a single input (example I am working off doesn't do it, but I assume it is possible).
Update
It looks like I had to change the position of the #tf.function decorator. The sampler now runs, but it gives me the same value for all samples for each of the parameters. Is it a requirement that I pass a joint distribution through the log_prob() function? I am clearly missing something. I can run the likelihood through bfgs optimization and get reasonable results (I've estimated the model via maximum likelihood with fixed parameters in other software). It looks like I need to define the function to return a joint distribution and call log_prob(). I can do this if I set it up as a logistic regression (logit choice model is logistically distributed in differences). However, I lose the standard closed form.
My function is as follows:
#tf.function
def mmnl_log_prob(init_mu_b_time,init_sigma_b_time,init_a_car,init_a_train,init_b_cost,init_scale):
# Create priors for hyperparameters
mu_b_time = tfd.Sample(tfd.Normal(loc=init_mu_b_time, scale=init_scale),sample_shape=1).sample()
# HalfCauchy distributions are too wide for logit discrete choice
sigma_b_time = tfd.Sample(tfd.Normal(loc=init_sigma_b_time, scale=init_scale),sample_shape=1).sample()
# Create priors for parameters
a_car = tfd.Sample(tfd.Normal(loc=init_a_car, scale=init_scale),sample_shape=1).sample()
a_train = tfd.Sample(tfd.Normal(loc=init_a_train, scale=init_scale),sample_shape=1).sample()
# a_sm = tfd.Sample(tfd.Normal(loc=init_a_sm, scale=init_scale),sample_shape=1).sample()
b_cost = tfd.Sample(tfd.Normal(loc=init_b_cost, scale=init_scale),sample_shape=1).sample()
# Define a heterogeneous random parameter model with MultivariateNormalDiag()
# Use MultivariateNormalDiagPlusLowRank() to define nests, etc.
b_time = tfd.Sample(tfd.MultivariateNormalDiag( # b_time
loc=mu_b_time,
scale_diag=sigma_b_time),sample_shape=num_idx).sample()
# Definition of the utility functions
V1 = a_train + tfm.multiply(b_time,TRAIN_TT_SCALED) + b_cost * TRAIN_COST_SCALED
V2 = tfm.multiply(b_time,SM_TT_SCALED) + b_cost * SM_COST_SCALED
V3 = a_car + tfm.multiply(b_time,CAR_TT_SCALED) + b_cost * CAR_CO_SCALED
print("Vs",V1,V2,V3)
# Definition of loglikelihood
eV1 = tfm.multiply(tfm.exp(V1),TRAIN_AV_SP)
eV2 = tfm.multiply(tfm.exp(V2),SM_AV_SP)
eV3 = tfm.multiply(tfm.exp(V3),CAR_AV_SP)
eVD = eV1 + eV2 +
eV3
print("eVs",eV1,eV2,eV3,eVD)
l1 = tfm.multiply(tfm.truediv(eV1,eVD),tf.cast(tfm.equal(CHOICE,1),tf.float32))
l2 = tfm.multiply(tfm.truediv(eV2,eVD),tf.cast(tfm.equal(CHOICE,2),tf.float32))
l3 = tfm.multiply(tfm.truediv(eV3,eVD),tf.cast(tfm.equal(CHOICE,3),tf.float32))
ll = tfm.reduce_sum(tfm.log(l1+l2+l3))
print("ll",ll)
return ll
The function is called as follows:
nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size=.3
init = [0.,0.,0.,0.,0.,.5]
##
## NUTS (using inner step size averaging step)
##
#tf.function
def nuts_sampler(init):
nuts_kernel = tfp.mcmc.NoUTurnSampler(
target_log_prob_fn=mmnl_log_prob,
step_size=init_step_size,
)
adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
inner_kernel=nuts_kernel,
num_adaptation_steps=nuts_burnin,
step_size_getter_fn=lambda pkr: pkr.step_size,
log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
)
samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
num_results=nuts_samples,
current_state=init,
kernel=adapt_nuts_kernel,
num_burnin_steps=100,
parallel_iterations=5)
return samples_nuts_, stats_nuts_
samples_nuts, stats_nuts = nuts_sampler(init)
I have an answer to my question! It is simply a matter of different nomenclature. I need to define my model as a softmax function, which I knew was what I would call a "logit model", but it just wasn't clicking for me. The following blog post gave me the epiphany:
http://khakieconomics.github.io/2019/03/17/Putting-it-all-together.html
My first time trying to train a dataset containing 8 variables in a time-series of 20 years or so using GRU RNN. The biomass value is what I'm trying to predict based on the other variables. I'm trying first with 1 layer GRU. I'm not using softmax for the output layer. MSE is used for my cost function.
It is basic GRU with forward propagation and backward gradient update. Here are the main function I defined:
'x_t is the input training dataset with a dimension of 7572x8. So T = 7572, input_dim = 8, hidden_dim =128. y_train is my train label.'
def forward_prop_step(self, x_t,y_train, s_t1_prev,V, U, W, b, c,learning_rate):
T = x_t.shape[0]
z_t1 = np.zeros((T,self.hidden_dim))
r_t1 = np.zeros((T,self.hidden_dim))
h_t1 = np.zeros((T,self.hidden_dim))
s_t1 = np.zeros((T+1,self.hidden_dim))
o_s = np.zeros((T,self.input_dim))
for i in xrange(T):
x_e = x_t[i].T
z_t1[i] = sigmoid(U[0].dot(x_e) + W[0].dot(s_t1[i]) + b[0])#128x1
r_t1[i] = sigmoid(U[1].dot(x_e) + W[1].dot(s_t1[i]) + b[1])#128x1
h_t1[i] = np.tanh(U[2].dot(x_e) + W[2].dot(s_t1[i] * r_t1[i]) + b[2])#128x1
s_t1[i+1] = (np.ones_like(z_t1[i]) - z_t1[i]) * h_t1[i] + z_t1[i] * s_t1[i]#128x1
o_s[i] = np.dot(V,s_t1[i+1]) + c#8x1
return [o_s,z_t1,r_t1,h_t1,s_t1]
def bptt(self, x,y_train,o,z_t1,r_t1,h_t1,s_t1,V, U, W, b, c):
bptt_truncate = 360
T = x.shape[0]#length of time scale of input data (train)
dLdU = np.zeros(U.shape)
dLdV = np.zeros(V.shape)
dLdW = np.zeros(W.shape)
dLdb = np.zeros(b.shape)
dLdc = np.zeros(c.shape)
y_train_sp = np.repeat(y_train,self.input_dim)
for t in np.arange(T)[::-1]:
dLdy = 2 * (o[t] - y_train_sp[t])
dydV = s_t1[t]
dydc = 1.0
dLdV += np.outer(dLdy,dydV)
dLdc += dLdy*dydc
for i in np.arange(max(0, t-bptt_truncate), t+1)[::-30]:#every month in the past year
s_t1_pre = s_t1[i]
dydst1 = V #8x128
dst1dzt1 = -h_t1[i] + s_t1_pre #128x1
dst1dht1 = np.ones_like(z_t1[i]) - z_t1[i] #128x1
dzt1dU = np.outer(z_t1[i]*(1.0-z_t1[i]),x[i]) #128x8
#print dzt1dU.shape
dzt1dW = np.outer(z_t1[i]*(1.0-z_t1[i]),s_t1_pre) #128x128
dzt1db = z_t1[i]*(1.0-z_t1[i]) #128x1
dht1dU = np.outer((1.0-h_t1[i] ** 2),x[i]) #128x8
dht1dW = np.outer((1.0-h_t1[i] ** 2),s_t1_pre * r_t1[i]) #128x128
dht1db = 1.0-h_t1[i] ** 2 #128x1
dht1drt1 = (1.0-h_t1[i] ** 2)*(W[2].dot(s_t1_pre))#128x1
drt1dU = np.outer((r_t1[i]*(1.0-r_t1[i])),x[i]) #128x8
drt1dW = np.outer((r_t1[i]*(1.0-r_t1[i])),s_t1_pre) #128x128
drt1db = (r_t1[i]*(1.0-r_t1[i]))#128x1
dLdW[0] += np.outer(dydst1.T.dot(dLdy),dzt1dW.dot(dst1dzt1)) #128x128
dLdU[0] += np.outer(dydst1.T.dot(dLdy),dst1dzt1.dot(dzt1dU)) #128x8
dLdb[0] += (dydst1.T.dot(dLdy))*dst1dzt1*dzt1db#128x1
dLdW[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dW)#128x128
dLdU[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dU) #128x8
dLdb[1] += (dydst1.T.dot(dLdy))*dst1dht1*dht1drt1*drt1db#128x1
dLdW[2] += np.outer(dydst1.T.dot(dLdy),dht1dW.dot(dst1dht1)) #128x128
dLdU[2] += np.outer(dydst1.T.dot(dLdy),dst1dht1.dot(dht1dU))#128x8
dLdb[2] += (dydst1.T.dot(dLdy))*dst1dht1*dht1db#128x1
return [ dLdV,dLdU, dLdW, dLdb, dLdc ]
def predict( self, x):
pred = np.amax(x, axis = 1)
pred_f = relu(pred)
return pred_f
Parameters V,U,W,b,c are updated by gradient dLdV,dLdU,dLdW,dLdb,dLdc calculated by bptt.
I have tried different weight initialization (xavier or just random), tried different time truncation. But all lead to the same outcome. Probably the weight update wasn't right? The network set-up seems simple though. Really struggle on understanding the predication and translate to actual biomass too. The function predict is what I defined to translate the output layer from the GRU network to biomass value by taking the maximum value. But the output layer gives similar value for almost all time iterations. Not sure the best way to do the job though. Thanks for any help or suggestions in advance.
I doubt anyone on stackoverflow is going to debug a custom implementation of GRU for you. If you were using Tensorflow or another high level library, I might take a stab at it, or if it was a simple fully connected network, but as it is all I can do is give some advice on how to proceed with debugging.
First, it sounds like you're running a brand new implementation on your own data set right off the bat. Instead, try starting out by testing your network on trivial, synthetic data sets first. Can it learn an identity function? A response which is simply the weighted average of the three previous time stamps? And so on. It's easier to debug small simple problems. Once you know your implementation can learn things that a GRU based recurrent network should be able to learn, then you can start using your own data.
Second, this comment of yours was very insightful:
Probably the weight update wasn't right?
While it's impossible to say for sure, this is a very common - perhaps the most common - source of bugs in for backprop implementations. Andrew Ng recommends gradient checking to debug an implementation like this. Essentially, this involves numerically approximating the gradient. It's computationally inefficient but relies only on a correct implementation of forward propagation, which makes it very useful for debugging. For one, if the algorithm converges when the numerically approximated gradient is used, you can be more sure that your forward prop is correct and focus on debugging backprop. (On the other hand, if it is still does not succeed, it is likely the issue in your forward prop function.) For another, once the algorithm is working with the numerically approximated gradient then you can compare the output of your analytic gradient function with it and debug any discrepancies. This makes it a lot easier because you now know the correct answer that it should return.
I am using PyMC3 to calculate something which I won't get into here but you can get the idea from this link if interested.
The '2-lambdas' case is basically a switch function, which needs to be compiled to a Theano function to avoid dtype errors and looks like this:
import theano
from theano.tensor import lscalar, dscalar, lvector, dvector, argsort
#theano.compile.ops.as_op(itypes=[lscalar, dscalar, dscalar], otypes=[dvector])
def lambda_2_distributions(tau, lambda_1, lambda_2):
"""
Return values of `lambda_` for each observation based on the
transition value `tau`.
"""
out = zeros(num_observations)
out[: tau] = lambda_1 # lambda before tau is lambda1
out[tau:] = lambda_2 # lambda after (and including) tau is lambda2
return out
I am trying to generalize this to apply to 'n-lambdas', where taus.shape[0] = lambdas.shape[0] - 1, but I can only come up with this horribly slow numpy implementation.
#theano.compile.ops.as_op(itypes=[lvector, dvector], otypes=[dvector])
def lambda_n_distributions(taus, lambdas):
out = zeros(num_observations)
np_tau_indices = argsort(taus).eval()
num_taus = taus.shape[0]
for t in range(num_taus):
if t == 0:
out[: taus[np_tau_indices[t]]] = lambdas[t]
elif t == num_taus - 1:
out[taus[np_tau_indices[t]]:] = lambdas[t + 1]
else:
out[taus[np_tau_indices[t]]: taus[np_tau_indices[t + 1]]] = lambdas[t]
return out
Any ideas on how to speed this up using pure Theano (avoiding the call to .eval())? It's been a few years since I've used it and so don't know the right approach.
Using a switch function is not recommended, as it breaks the nice geometry of the parameters space and makes sampling using modern sampler like NUTS difficult.
Instead, you can try model it using a continuous relaxation of a switch function. The main idea here would be to model the rate before the first switch point as a baseline; and add the prediction from a logistic function after each switch point:
def logistic(L, x0, k=500, t=np.linspace(0., 1., 1000)):
return L/(1+tt.exp(-k*(t_-x0)))
with pm.Model() as m2:
lambda0 = pm.Normal('lambda0', mu, sd=sd)
lambdad = pm.Normal('lambdad', 0, sd=sd, shape=nbreak-1)
trafo = Composed(pm.distributions.transforms.LogOdds(), Ordered())
b = pm.Beta('b', 1., 1., shape=nbreak-1, transform=trafo,
testval=[0.3, 0.5])
theta_ = pm.Deterministic('theta', tt.exp(lambda0 +
logistic(lambdad[0], b[0]) +
logistic(lambdad[1], b[1])))
obs = pm.Poisson('obs', theta_, observed=y)
trace = pm.sample(1000, tune=1000)
There are a few tricks I used here as well, for example, the composite transformation that is not on the PyMC3 code base yet. You can have a look at the full code here: https://gist.github.com/junpenglao/f7098c8e0d6eadc61b3e1bc8525dd90d
If you have more question, please post to https://discourse.pymc.io with your model and (simulated) data. I check and answer on the PyMC3 discourse much more regularly.
I have a constrained optimization problem where I have a number of
products I want to spend money on and estimate my total returns based on models
I built for each individual product.
I'm using scipy.optimzie.minimize to find the optimal spend given the
individual models output. The problem I'm having is that the optimizer
finishes with a "optimizer terminated successfully" flag, but very clearly
does not find the optimal solution. In fact, the output using the original
seed/x0 is better than the output produced from the optimizer. I put a print
statement in the objective function and you can see at one point it just drops off
a cliff. Does anyone have any idea why this would happen/how to fix it?
I've included a reduced version of my code below.
If my products are ['P1','P2',... 'P9'], and I have a model for each of them
# estimates returns on spend for each product
model1 = lambda money : func1(money *betas_a)
model2 = lambda money : func2(money,*betas_b)
... etc
where func is one of
def power_curve(x,beta1,beta2):
return beta1*x**beta2
def mm_curve(x,beta1,beta2):
"Beta2 >= 0"
return (beta1*x)/(1+beta2*x)
def dbl_exponential(x,beta1,beta2,beta3,beta4):
return beta1**(beta2**(x/beta4))*beta3
def neg_exp(x,beta1,beta2):
"Beta2 > 0"
return beta1*(1-np.exp(-beta2*x))
I now want to optimize my spend in each in order to maximize my total returns.
To do this is use scipy.optimize.minimize with a wrapper around the following function:
def budget(products, budget, betas, models):
"""
Given a budget distributed across each product, estimate total returns.
products = str: names of each individual product
budget = list-floats: amount of money/spend corresponding to each product
models = list-funcs: function to use to predict individual returns corresponding to each product
betas = dict: keys are product names - values are list of betas to feed to corresponding model
"""
results = []
target_total = 0 # total returns
assert len(products) == len(budget) == len(betas)
# for each product, calculate return using corresponding model
for v,b,m in zip(products,budget,models):
tpred = m(b,*betas[v])
target_total+=tpred
results.append({'val':v,'budget':b, 'tpred':tpred})
# if you watch this you can see it drops off dramatically towards the end
print(target_total)
return results, target_total
Minimum Reproducible Code:
import numpy as np
from scipy import optimize
### Setup inputs to the optimizer
vals = ['P1','P2','P3','P4','P5','P6','P7','P8','P9']
funcs = [dbl_exponential,
mm_curve,
power_curve,
mm_curve,
neg_exp,
power_curve,
power_curve,
mm_curve,
dbl_exponential]
betas = {'P1': [0.018631215601097723,0.6881958654622713,43.84956270498627,
1002.1010110475437],
'P2': [0.002871159199956573, 1.1388317502737174e-06],
'P3': [0.06863672099961649, 0.7295132426289046],
'P4': [0.009954885796211378, 3.857169894090025e-05],
'P5': [307.624705578708, 1.4454030580404746e-05],
'P6': [0.0875910297422766, 0.6848303282418671],
'P7': [0.12147343508583974, 0.6573539731442877],
'P8': [0.002789390181221983, 5.72554293489956e-07],
'P9': [0.02826834133593836,0.8999750236756555,1494.677373273538,
6529.1531372261725]
}
bounds = [(4953.474502264151, 14860.423506792453),
(48189.39647820754, 144568.18943462262),
(10243.922611886792, 30731.767835660376),
(6904.288592358491, 20712.865777075473),
(23440.199762641503, 70320.5992879245),
(44043.909679905664, 132131.729039717),
(9428.298255754717, 28284.89476726415),
(53644.56626556605, 160933.69879669815),
(8205.906018773589, 24617.718056320766)]
seed = [9906.949005,
96378.792956,
20487.845224,
13808.577185,
46880.399525,
88087.81936,
18856.596512,
107289.132531,
16411.812038]
wrapper = lambda b: -budget(vals,b,betas, funcs)[1] # negative to get *maximum* output
## Optimizer Settings
tol = 1e-16
maxiter = 10000
max_budget = 400000
# total spend can't exceed max budget
constraint = [{'type':'ineq', 'fun': lambda budget: max_budget-sum(budget)}]
# The output from the seed is better than the final "optimized" output
print('DEFAULT OUTPUT TOTAL:', wrapper(seed))
res = optimize.minimize(wrapper, seed, bounds = bounds,
tol = tol, constraints = (constraint),
options={'maxiter':maxiter})
print("Optimizer Func Output:", res.fun)
As with most of my issues, it turned out to be something silly. The sum of the seed values i passed was greater than the max_budget constraint I gave it. Hence x0 was incompatible with my constraints. Why scipy didn't produce a corresponding warning or error i'm not sure. But this turned out to be the problem.
Probably you are getting stuck in a local minima as your function seems to be quite non-linear. Have you tried to perhaps change the optimisation method to something like BFGS.
res = optimize.minimize(wrapper, seed, bounds = bounds, method='BFGS',
tol = tol, constraints = (constraint),
options={'maxiter':maxiter})
I've tried solving the problem using this algorithm, and got -78464.52052455483.
Hope it helps
I'm a bit new to Python and PyMC, and making rapid progress. But I'm just confused about the use of setting deterministic values of a 2D matrix. I have a model below, that I cannot get to parse correctly. The problem relates to setting the value theta in the model.
import numpy as np
import pymc
define known variables
N = 2
T = 10
tau = 1
define model... which I cannot get to parse correctly. It's the allocation of theta that I'm having trouble with. The aim to to get samples of D and x. Theta is just an intermediate variable, but I need to keep it as it's used in more complex variations of the model.
def NAFCgenerator():
D = np.empty(T, dtype=object)
theta = np.empty([N,T], dtype=object)
x = np.empty([N,T], dtype=object)
# true location of signal
for t in range(T):
D[t] = pymc.DiscreteUniform('D_%i' % t, lower=0, upper=N-1)
for t in range(T):
for n in range(N):
#pymc.deterministic(plot=False)
def temp_theta(dt=D[t], n=n):
return dt==n
theta[n,t] = temp_theta
x[n,t] = pymc.Normal('x_%i,%i' % (n,t),
mu=theta[n,t], tau=tau)
return locals()
** EDIT **
Explicit indexing is useful for me as I'm learning both PyMC and Python. But it seems that extracting MCMC samples is a bit clunky, e.g.
D0values = pymc_generator.trace('D_0')[:]
But I am probably missing something. But did I managed to get a vectorised version working
# Approach 1b - actually quite promising
def NAFCgenerator():
# NOTE TO SELF. It's important to declare these as objects
D = np.empty(T, dtype=object)
theta = np.empty([N,T], dtype=object)
x = np.empty([N,T], dtype=object)
# true location of signal
D = pymc.Categorical('D', spatial_prior, size=T)
# displayed stimuli
#pymc.deterministic(plot=False)
def theta(D=D):
theta = np.zeros([N,T])
theta[0,D==0]=1
theta[1,D==1]=1
return theta
#for n in range(N):
x = pymc.Normal('x', mu=theta, tau=tau)
return locals()
Which seems easier to get at MCMC samples using this for example
Dvalues = pymc_generator.trace('D')[:]
In PyMC2, when creating deterministic nodes with decorators, the default is to take the node name from the function name. The solution is simple: specify the node name as a parameter for the decorator.
#pymc.deterministic(name='temp_theta_%d_%d'%(t,n), plot=False)
def temp_theta(dt=D[t], n=n):
return dt==n
theta[n,t] = temp_theta
Here is a notebook that puts this in context.