Implementing deep networks in theano gives me precise control over my layers from the ground up. I know it's no longer supported but it appears to me there is still active use (from me at least :->). Anyway, I noticed some behavior related to gradients calculation through a scan operation.
I have the following inner function (part of a RNN inner iteration step, I can provide if nec.):
def iter_step(x_step, h):
...
return hidden, obj, est
I have a set of parameters h_params, o_params, e_params that I use to compute gradients:
h, o, e = iter_step(x_step, h)
hidden_grads = T.grad(T.sum(h), h_params)
obj_grads = T.grad(T.sum(o), o_params)
est_grads = T.grad(T.sum(est), e_params)
everything is fine. I add a scan operation
[h_n, obj_n, x_n], _ = theano.scan(
fn=iter_step,
sequences=[x],
outputs_info=[T.as_tensor_variable(np.zeros(model.h_shape), model.h.dtype),
None,
None],
)
Evaluation is fine, but computing gradients on the same parameters is now a problem:
# Works
h_n0 = theano.function([], h_n)()
# Fails
h_n_grads = T.grad(T.sum(h_n), h_params)
---------------------------------------------------------------------------
NullTypeGradError Traceback (most recent call last)
<ipython-input-32-0d7c8a65d265> in <module>
----> 1 h_n_grads = T.grad(T.sum(h_n), h_params)
/usr/local/lib/python3.6/dist-packages/theano/gradient.py in grad(cost, wrt, consider_constant, disconnected_inputs, add_names, known_grads, return_disconnected, null_gradients)
609 if null_gradients == 'raise':
610 raise NullTypeGradError("tensor.grad encountered a NaN. " +
--> 611 rval[i].type.why_null)
612 else:
613 assert null_gradients == 'return'
NullTypeGradError: tensor.grad encountered a NaN. This variable is Null because the grad method for input 4 (Subtensor{int64}.0) of the for{cpu,scan_fn} op is mathematically undefined. Depends on a shared variable
Why is this? I have not been able to debug yet - graph is not disconnected, unrolling scan by hand gives good gradients. The gradient calculations should carry through the scan operation. If gradients on h (1st output of iter_step) can be calculated, why not on the analogous output of the scan?
The problem was solved. iter_step above contained a sampling step along the lines of
def sample(self, mu, logSigma):
global SEED
srng = T.shared_randomstreams.RandomStreams(seed=SEED)
dev = srng.normal((self.batch_size, self.n_latent[-1]))
z = mu + T.exp(0.5 * logSigma) * dev
return z
Gradients through this calculation do not survive through the scan operation. Separating out all the parametric dependencies in sample still didn't work. What finally worked was creating the deviates in dev and passing it in as one of the non_sequences in scan, as in
[h_n, obj, x],inner_updates = theano.scan(
fn=iter_step,
sequences=[x_in],
outputs_info=[T.as_tensor_variable(np.zeros(self.h_shape), self.h.dtype),
None,
None],
non_sequences=[T.as_tensor_variable(self.srng.normal((self.batch_size, self.n_latent[-1])), self.h.dtype)],
)
I have an idea why this gradients over this scan fail, while simple scan operations even with randomstreams.RandomStreams objects work. I'll take a deeper look.
Summary: Use the reparamerization trick if your distribution allows it, pass in whitened deviates, pre-parameterized as non_sequences. That seems to work in all cases.
Related
I am writing CS231n assignment1 two-layer-net and I meet difficulty in relu_backward. My impletment is as below:
def relu_backward(dout, cache):
"""
Computes the backward pass for a layer of rectified linear units (ReLUs).
Input:
- dout: Upstream derivatives, of any shape
- cache: Input x, of same shape as dout
Returns:
- dx: Gradient with respect to x
"""
dx, x = None, cache
###########################################################################
# TODO: Implement the ReLU backward pass. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
dx = dout
dx[x<=0.0] = 0.0
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
return dx
but the result always says error is 1.0.
np.random.seed(231)
x = np.random.randn(10, 10)
dout = np.random.randn(*x.shape)
dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)
_, cache = relu_forward(x)
dx = relu_backward(dout, cache)
# The error should be on the order of e-12
print('Testing relu_backward function:')
print('dx error: ', rel_error(dx_num, dx))
Testing relu_backward function:
dx error: 1.0
Does anyone meet the same problem?
I was also experienced about this problem but actually, I solved this problem when I had debugged the 'relu_forward' function.
If you didn't use the np.maximum function, you can make that kind of error. Actually, the input and upstream derivative values come from the random values of numpy. So, if you used just max function, which is usually used in python, the calculation of max function is not concrete.
I hope this answer would be helpful to you. Then, have a good day.
1. The core problem and question
I will provide an executable example below, but let me first walk you through the problem first.
I am using solve_ivp from scipy.integrate to solve an initial value problem (see documentation). In fact I have to call the solver twice, to once integrate forward and once backward in time. (I would have to go unnecessarily deep into my concrete problem to explain why this is necessary, but please trust me here--it is!)
sol0 = solve_ivp(rhs,[0,-1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
sol1 = solve_ivp(rhs,[0, 1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
Here rhs is the right hand side function of the initial value problem y(t) = rhs(t,y). In my case, y has six components y[0] to y[5]. y0=y(0) is the initial condition. [0,±1e8] are the respective integration ranges, one forward and the other backward in time. rtol and atol are tolerances.
Importantly, you see that I flagged dense_output=True, which means that the solver does not only return the solutions on the numerical grids, but also as interpolation functions sol0.sol(t) and sol1.sol(t).
My main goal now is to define a piecewise function, say sol(t) which takes the value sol0.sol(t) for t<0 and the value sol1.sol(t) for t>=0. So the main question is: How do I do that?
I thought that numpy.piecewise should be tool of choice to do this for me. But I am having trouble using it, as you will see below, where I show you what I tried so far.
2. Example code
The code in the box below solves the initial value problem of my example. Most of the code is the definition of the rhs function, the details of which are not important to the question.
import numpy as np
from scipy.integrate import solve_ivp
# aux definitions and constants
sin=np.sin; cos=np.cos; tan=np.tan; sqrt=np.sqrt; pi=np.pi;
c = 299792458
Gm = 5.655090674872875e26
# define right hand side function of initial value problem, y'(t) = rhs(t,y)
def rhs(t,y):
p,e,i,Om,om,f = y
sinf=np.sin(f); cosf=np.cos(f); Q=sqrt(p/Gm); opecf=1+e*cosf;
R = Gm**2/(c**2*p**3)*opecf**2*(3*(e**2 + 1) + 2*e*cosf - 4*e**2*cosf**2)
S = Gm**2/(c**2*p**3)*4*opecf**3*e*sinf
rhs = np.zeros(6)
rhs[0] = 2*sqrt(p**3/Gm)/opecf*S
rhs[1] = Q*(sinf*R + (2*cosf + e*(1 + cosf**2))/opecf*S)
rhs[2] = 0
rhs[3] = 0
rhs[4] = Q/e*(-cosf*R + (2 + e*cosf)/opecf*sinf*S)
rhs[5] = sqrt(Gm/p**3)*opecf**2 + Q/e*(cosf*R - (2 + e*cosf)/opecf*sinf*S)
return rhs
# define initial values, y0
y0=[3.3578528933149297e13,0.8846,2.34921,3.98284,1.15715,0]
# integrate twice from t = 0, once backward in time (sol0) and once forward in time (sol1)
sol0 = solve_ivp(rhs,[0,-1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
sol1 = solve_ivp(rhs,[0, 1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
The solution functions can be addressed from here by sol0.sol and sol1.sol respectively. As an example, let's plot the 4th component:
from matplotlib import pyplot as plt
t0 = np.linspace(-1,0,500)*1e8
t1 = np.linspace( 0,1,500)*1e8
plt.plot(t0,sol0.sol(t0)[4])
plt.plot(t1,sol1.sol(t1)[4])
plt.title('plot 1')
plt.show()
3. Failing attempts to build piecewise function
3.1 Build vector valued piecewise function directly out of sol0.sol and sol1.sol
def sol(t): return np.piecewise(t,[t<0,t>=0],[sol0.sol,sol1.sol])
t = np.linspace(-1,1,1000)*1e8
print(sol(t))
This leads to the following error in piecewise in line 628 of .../numpy/lib/function_base.py:
TypeError: NumPy boolean array indexing assignment requires a 0 or 1-dimensional input, input has 2 dimensions
I am not sure, but I do think this is because of the following: In the documentation of piecewise it says about the third argument:
funclistlist of callables, f(x,*args,**kw), or scalars
[...]. It should take a 1d array as input and give an 1d array or a scalar value as output. [...].
I suppose the problem is, that the solution in my case has six components. Hence, evaluated on a time grid the output would be a 2d array. Can someone confirm, that this is indeed the problem? Since I think this really limits the usefulness of piecewiseby a lot.
3.2 Try the same, but just for one component (e.g. for the 4th)
def sol4(t): return np.piecewise(t,[t<0,t>=0],[sol0.sol(t)[4],sol1.sol(t)[4]])
t = np.linspace(-1,1,1000)*1e8
print(sol4(t))
This results in this error in line 624 of the same file as above:
ValueError: NumPy boolean array indexing assignment cannot assign 1000 input values to the 500 output values where the mask is true
Contrary to the previous error, unfortunately here I have so far no idea why it is not working.
3.3 Similar attempt, however first defining functions for the 4th components
def sol40(t): return sol0.sol(t)[4]
def sol41(t): return sol1.sol(t)[4]
def sol4(t): return np.piecewise(t,[t<0,t>=0],[sol40,sol41])
t = np.linspace(-1,1,1000)
plt.plot(t,sol4(t))
plt.title('plot 2')
plt.show()
Now this does not result in an error, and I can produce a plot, however this plot doesn't look like it should. It should look like plot 1 above. Also here, I so far have no clue what is going on.
Am thankful for help!
You can take a look to numpy.piecewise source code. There is nothing special in this function so I suggest to do everything manually.
def sol(t):
ans = np.empty((6, len(t)))
ans[:, t<0] = sol0.sol(t[t<0])
ans[:, t>=0] = sol1.sol(t[t>=0])
return ans
Regarding your failed attempts. Yes, piecewise excpect functions return 1d array. Your second attempt failed because documentation says that funclist argument should be list of functions or scalars but you send the list of arrays. Contrary to the documentation it works even with arrays, you just should use the arrays of the same size as t < 0 and t >= 0 like:
def sol4(t): return np.piecewise(t,[t<0,t>=0],[sol0.sol(t[t<0])[4],sol1.sol(t[t>=0])[4]])
I am playing with a mixed multinomial discrete choice model in Tensorflow Probability. The function should take an input of a choice among 3 alternatives. The chosen alternative is specified by CHOSEN (a # observationsx3 tensor). Below is an update to the code to reflect my progress on the problem (but the problem remains).
I currently get the error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [6768,3] vs. [1,3,6768] [Op:Mul]
with the traceback suggesting the issue is in the call to log_prob() for the final component of the joint distrubtion (i.e., tfp.Independent(tfp.Multinomial(...))
The main components are (thank you to Padarn Wilson for helping to fix the joint distribution definition):
#tf.function
def affine(x, kernel_diag, bias=tf.zeros([])):
"""`kernel_diag * x + bias` with broadcasting."""
kernel_diag = tf.ones_like(x) * kernel_diag
bias = tf.ones_like(x) * bias
return x * kernel_diag + bias
def mmnl_func():
adj_AV_train = (tf.ones(num_idx) - AV[:,0]) * tf.constant(-9999.)
adj_AV_SM = (tf.ones(num_idx) - AV[:,1]) * tf.constant(-9999.)
adj_AV_car = (tf.ones(num_idx) - AV[:,2]) * tf.constant(-9999.)
return tfd.JointDistributionSequential([
tfd.Normal(loc=0., scale=1e5), # mu_b_time
tfd.HalfCauchy(loc=0., scale=5), # sigma_b_time
lambda sigma_b_time,mu_b_time: tfd.MultivariateNormalDiag( # b_time
loc=affine(tf.ones([num_idx]), mu_b_time[..., tf.newaxis]),
scale_diag=sigma_b_time*tf.ones(num_idx)),
tfd.Normal(loc=0., scale=1e5), # a_train
tfd.Normal(loc=0., scale=1e5), # a_car
tfd.Normal(loc=0., scale=1e5), # b_cost
lambda b_cost,a_car,a_train,b_time: tfd.Independent(tfd.Multinomial(
total_count=1,
logits=tf.stack([
affine(DATA[:,0], tf.gather(b_time, IDX[:,0], axis=-1), (a_train + b_cost * DATA[:,1] + adj_AV_train)),
affine(DATA[:,2], tf.gather(b_time, IDX[:,0], axis=-1), (b_cost * DATA[:,3] + adj_AV_SM)),
affine(DATA[:,4], tf.gather(b_time, IDX[:,0], axis=-1), (a_car + b_cost * DATA[:,5] + adj_AV_car))
], axis=1)
),reinterpreted_batch_ndims=1)
])
#tf.function
def mmnl_log_prob(mu_b_time, sigma_b_time, b_time, a_train, a_car, b_cost):
return mmnl_func().log_prob(
[mu_b_time, sigma_b_time, b_time, a_train, a_car, b_cost, CHOICE])
# Based on https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args
# change constant values to tf.constant()
nuts_samples = tf.constant(1000)
nuts_burnin = tf.constant(500)
num_chains = tf.constant(1)
## Initial step size
init_step_size= tf.constant(0.3)
# Set the chain's start state.
initial_state = [
tf.zeros([num_chains], dtype=tf.float32, name="init_mu_b_time"),
tf.zeros([num_chains], dtype=tf.float32, name="init_sigma_b_time"),
tf.zeros([num_chains, num_idx], dtype=tf.float32, name="init_b_time"),
tf.zeros([num_chains], dtype=tf.float32, name="init_a_train"),
tf.zeros([num_chains], dtype=tf.float32, name="init_a_car"),
tf.zeros([num_chains], dtype=tf.float32, name="init_b_cost")
]
## NUTS (using inner step size averaging step)
##
#tf.function
def nuts_sampler(init):
nuts_kernel = tfp.mcmc.NoUTurnSampler(
target_log_prob_fn=mmnl_log_prob,
step_size=init_step_size,
)
adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
inner_kernel=nuts_kernel,
num_adaptation_steps=nuts_burnin,
step_size_getter_fn=lambda pkr: pkr.step_size,
log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
)
samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
num_results=nuts_samples,
current_state=initial_state,
kernel=adapt_nuts_kernel,
num_burnin_steps=tf.constant(100),
parallel_iterations=tf.constant(5))
return samples_nuts_, stats_nuts_
samples_nuts, stats_nuts = nuts_sampler(initial_state)
More than likely this is an issue with your initial state and number of chains. You can try to initialize your kernel outside of the sampler call:
nuts_kernel = tfp.mcmc.NoUTurnSampler(
target_log_prob_fn=mmnl_log_prob,
step_size=init_step_size,
)
adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
inner_kernel=nuts_kernel,
num_adaptation_steps=nuts_burnin,
step_size_getter_fn=lambda pkr: pkr.step_size,
log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
)
and then do
nuts_kernel.bootstrap_results(initial_state)
and investigate the shapes of logL, and proposal states are being returned.
Another thing to do is to feed your initial state into your log-likelihood/posterior and see if the dimensions of the returned log-likelihoods match what you think it should be (if you are doing multiple chains then maybe it should be returning # chains log likelihoods).
It is my understanding that the batch dimension (# chains) has to be the first one in all your vectorized calculations.
The very last part of my blog post on tensorflow and custom likelihoods has working code for an example that does this.
I was able to get reasonable results from my model. Thank you to everyone for the help! The following points helped solve the various issues.
Use of JointDistributionSequentialAutoBatched() to produce consistent batch shapes. You need tf-nightly installed for access.
More informative priors for hyperparameters. The exponential transformation in the Multinomial() distribution means that uninformative hyperparameters (i.e., with sigma = 1e5) mean you quickly have large positive numbers entering the exp(), leading to infinities.
Setting the step size, etc. was also important.
I found an answer by Christopher Suter to a recent question on the Tensorflow Probability forum specifying a model from STAN useful. I took the use of taking a sample from my prior as a starting point for the initial likelihood parameters useful.
Despite JointDistributionSequentialAutoBatched() correcting the batch shapes, I went back and corrected my joint distribution shapes so that printing log_prob_parts() gives consistent shapes (i.e., [10,1] for 10 chains). I still get a shape error without using JointDistributionSequentialAutoBatched() but the combination seem to work.
I separated my affine() into two functions. They do the same thing but remove retracing warnings. Basically, affine() was able to broadcast the inputs but they differed and it was easier to write two functions that setup the inputs with consistent shapes. Differently shaped inputs causes Tensorflow to trace the function multiple times.
I am trying to estimate a model in tensorflow using NUTS by providing it a likelihood function. I have checked the likelihood function is returning reasonable values. I am following the setup here for setting up NUTS:
https://rlhick.people.wm.edu/posts/custom-likes-tensorflow.html
and some of the examples here for setting up priors, etc.:
https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Multilevel_Modeling_Primer.ipynb
My code is in a colab notebook here:
https://drive.google.com/file/d/1L9JQPLO57g3OhxaRCB29do2m808ZUeex/view?usp=sharing
I get the error: OperatorNotAllowedInGraphError: iterating overtf.Tensoris not allowed: AutoGraph did not convert this function. Try decorating it directly with #tf.function. This is my first time using tensorflow and I am quite lost interpreting this error. It would also be ideal if I could pass the starting parameter values as a single input (example I am working off doesn't do it, but I assume it is possible).
Update
It looks like I had to change the position of the #tf.function decorator. The sampler now runs, but it gives me the same value for all samples for each of the parameters. Is it a requirement that I pass a joint distribution through the log_prob() function? I am clearly missing something. I can run the likelihood through bfgs optimization and get reasonable results (I've estimated the model via maximum likelihood with fixed parameters in other software). It looks like I need to define the function to return a joint distribution and call log_prob(). I can do this if I set it up as a logistic regression (logit choice model is logistically distributed in differences). However, I lose the standard closed form.
My function is as follows:
#tf.function
def mmnl_log_prob(init_mu_b_time,init_sigma_b_time,init_a_car,init_a_train,init_b_cost,init_scale):
# Create priors for hyperparameters
mu_b_time = tfd.Sample(tfd.Normal(loc=init_mu_b_time, scale=init_scale),sample_shape=1).sample()
# HalfCauchy distributions are too wide for logit discrete choice
sigma_b_time = tfd.Sample(tfd.Normal(loc=init_sigma_b_time, scale=init_scale),sample_shape=1).sample()
# Create priors for parameters
a_car = tfd.Sample(tfd.Normal(loc=init_a_car, scale=init_scale),sample_shape=1).sample()
a_train = tfd.Sample(tfd.Normal(loc=init_a_train, scale=init_scale),sample_shape=1).sample()
# a_sm = tfd.Sample(tfd.Normal(loc=init_a_sm, scale=init_scale),sample_shape=1).sample()
b_cost = tfd.Sample(tfd.Normal(loc=init_b_cost, scale=init_scale),sample_shape=1).sample()
# Define a heterogeneous random parameter model with MultivariateNormalDiag()
# Use MultivariateNormalDiagPlusLowRank() to define nests, etc.
b_time = tfd.Sample(tfd.MultivariateNormalDiag( # b_time
loc=mu_b_time,
scale_diag=sigma_b_time),sample_shape=num_idx).sample()
# Definition of the utility functions
V1 = a_train + tfm.multiply(b_time,TRAIN_TT_SCALED) + b_cost * TRAIN_COST_SCALED
V2 = tfm.multiply(b_time,SM_TT_SCALED) + b_cost * SM_COST_SCALED
V3 = a_car + tfm.multiply(b_time,CAR_TT_SCALED) + b_cost * CAR_CO_SCALED
print("Vs",V1,V2,V3)
# Definition of loglikelihood
eV1 = tfm.multiply(tfm.exp(V1),TRAIN_AV_SP)
eV2 = tfm.multiply(tfm.exp(V2),SM_AV_SP)
eV3 = tfm.multiply(tfm.exp(V3),CAR_AV_SP)
eVD = eV1 + eV2 +
eV3
print("eVs",eV1,eV2,eV3,eVD)
l1 = tfm.multiply(tfm.truediv(eV1,eVD),tf.cast(tfm.equal(CHOICE,1),tf.float32))
l2 = tfm.multiply(tfm.truediv(eV2,eVD),tf.cast(tfm.equal(CHOICE,2),tf.float32))
l3 = tfm.multiply(tfm.truediv(eV3,eVD),tf.cast(tfm.equal(CHOICE,3),tf.float32))
ll = tfm.reduce_sum(tfm.log(l1+l2+l3))
print("ll",ll)
return ll
The function is called as follows:
nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size=.3
init = [0.,0.,0.,0.,0.,.5]
##
## NUTS (using inner step size averaging step)
##
#tf.function
def nuts_sampler(init):
nuts_kernel = tfp.mcmc.NoUTurnSampler(
target_log_prob_fn=mmnl_log_prob,
step_size=init_step_size,
)
adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
inner_kernel=nuts_kernel,
num_adaptation_steps=nuts_burnin,
step_size_getter_fn=lambda pkr: pkr.step_size,
log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
)
samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
num_results=nuts_samples,
current_state=init,
kernel=adapt_nuts_kernel,
num_burnin_steps=100,
parallel_iterations=5)
return samples_nuts_, stats_nuts_
samples_nuts, stats_nuts = nuts_sampler(init)
I have an answer to my question! It is simply a matter of different nomenclature. I need to define my model as a softmax function, which I knew was what I would call a "logit model", but it just wasn't clicking for me. The following blog post gave me the epiphany:
http://khakieconomics.github.io/2019/03/17/Putting-it-all-together.html
My first time trying to train a dataset containing 8 variables in a time-series of 20 years or so using GRU RNN. The biomass value is what I'm trying to predict based on the other variables. I'm trying first with 1 layer GRU. I'm not using softmax for the output layer. MSE is used for my cost function.
It is basic GRU with forward propagation and backward gradient update. Here are the main function I defined:
'x_t is the input training dataset with a dimension of 7572x8. So T = 7572, input_dim = 8, hidden_dim =128. y_train is my train label.'
def forward_prop_step(self, x_t,y_train, s_t1_prev,V, U, W, b, c,learning_rate):
T = x_t.shape[0]
z_t1 = np.zeros((T,self.hidden_dim))
r_t1 = np.zeros((T,self.hidden_dim))
h_t1 = np.zeros((T,self.hidden_dim))
s_t1 = np.zeros((T+1,self.hidden_dim))
o_s = np.zeros((T,self.input_dim))
for i in xrange(T):
x_e = x_t[i].T
z_t1[i] = sigmoid(U[0].dot(x_e) + W[0].dot(s_t1[i]) + b[0])#128x1
r_t1[i] = sigmoid(U[1].dot(x_e) + W[1].dot(s_t1[i]) + b[1])#128x1
h_t1[i] = np.tanh(U[2].dot(x_e) + W[2].dot(s_t1[i] * r_t1[i]) + b[2])#128x1
s_t1[i+1] = (np.ones_like(z_t1[i]) - z_t1[i]) * h_t1[i] + z_t1[i] * s_t1[i]#128x1
o_s[i] = np.dot(V,s_t1[i+1]) + c#8x1
return [o_s,z_t1,r_t1,h_t1,s_t1]
def bptt(self, x,y_train,o,z_t1,r_t1,h_t1,s_t1,V, U, W, b, c):
bptt_truncate = 360
T = x.shape[0]#length of time scale of input data (train)
dLdU = np.zeros(U.shape)
dLdV = np.zeros(V.shape)
dLdW = np.zeros(W.shape)
dLdb = np.zeros(b.shape)
dLdc = np.zeros(c.shape)
y_train_sp = np.repeat(y_train,self.input_dim)
for t in np.arange(T)[::-1]:
dLdy = 2 * (o[t] - y_train_sp[t])
dydV = s_t1[t]
dydc = 1.0
dLdV += np.outer(dLdy,dydV)
dLdc += dLdy*dydc
for i in np.arange(max(0, t-bptt_truncate), t+1)[::-30]:#every month in the past year
s_t1_pre = s_t1[i]
dydst1 = V #8x128
dst1dzt1 = -h_t1[i] + s_t1_pre #128x1
dst1dht1 = np.ones_like(z_t1[i]) - z_t1[i] #128x1
dzt1dU = np.outer(z_t1[i]*(1.0-z_t1[i]),x[i]) #128x8
#print dzt1dU.shape
dzt1dW = np.outer(z_t1[i]*(1.0-z_t1[i]),s_t1_pre) #128x128
dzt1db = z_t1[i]*(1.0-z_t1[i]) #128x1
dht1dU = np.outer((1.0-h_t1[i] ** 2),x[i]) #128x8
dht1dW = np.outer((1.0-h_t1[i] ** 2),s_t1_pre * r_t1[i]) #128x128
dht1db = 1.0-h_t1[i] ** 2 #128x1
dht1drt1 = (1.0-h_t1[i] ** 2)*(W[2].dot(s_t1_pre))#128x1
drt1dU = np.outer((r_t1[i]*(1.0-r_t1[i])),x[i]) #128x8
drt1dW = np.outer((r_t1[i]*(1.0-r_t1[i])),s_t1_pre) #128x128
drt1db = (r_t1[i]*(1.0-r_t1[i]))#128x1
dLdW[0] += np.outer(dydst1.T.dot(dLdy),dzt1dW.dot(dst1dzt1)) #128x128
dLdU[0] += np.outer(dydst1.T.dot(dLdy),dst1dzt1.dot(dzt1dU)) #128x8
dLdb[0] += (dydst1.T.dot(dLdy))*dst1dzt1*dzt1db#128x1
dLdW[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dW)#128x128
dLdU[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dU) #128x8
dLdb[1] += (dydst1.T.dot(dLdy))*dst1dht1*dht1drt1*drt1db#128x1
dLdW[2] += np.outer(dydst1.T.dot(dLdy),dht1dW.dot(dst1dht1)) #128x128
dLdU[2] += np.outer(dydst1.T.dot(dLdy),dst1dht1.dot(dht1dU))#128x8
dLdb[2] += (dydst1.T.dot(dLdy))*dst1dht1*dht1db#128x1
return [ dLdV,dLdU, dLdW, dLdb, dLdc ]
def predict( self, x):
pred = np.amax(x, axis = 1)
pred_f = relu(pred)
return pred_f
Parameters V,U,W,b,c are updated by gradient dLdV,dLdU,dLdW,dLdb,dLdc calculated by bptt.
I have tried different weight initialization (xavier or just random), tried different time truncation. But all lead to the same outcome. Probably the weight update wasn't right? The network set-up seems simple though. Really struggle on understanding the predication and translate to actual biomass too. The function predict is what I defined to translate the output layer from the GRU network to biomass value by taking the maximum value. But the output layer gives similar value for almost all time iterations. Not sure the best way to do the job though. Thanks for any help or suggestions in advance.
I doubt anyone on stackoverflow is going to debug a custom implementation of GRU for you. If you were using Tensorflow or another high level library, I might take a stab at it, or if it was a simple fully connected network, but as it is all I can do is give some advice on how to proceed with debugging.
First, it sounds like you're running a brand new implementation on your own data set right off the bat. Instead, try starting out by testing your network on trivial, synthetic data sets first. Can it learn an identity function? A response which is simply the weighted average of the three previous time stamps? And so on. It's easier to debug small simple problems. Once you know your implementation can learn things that a GRU based recurrent network should be able to learn, then you can start using your own data.
Second, this comment of yours was very insightful:
Probably the weight update wasn't right?
While it's impossible to say for sure, this is a very common - perhaps the most common - source of bugs in for backprop implementations. Andrew Ng recommends gradient checking to debug an implementation like this. Essentially, this involves numerically approximating the gradient. It's computationally inefficient but relies only on a correct implementation of forward propagation, which makes it very useful for debugging. For one, if the algorithm converges when the numerically approximated gradient is used, you can be more sure that your forward prop is correct and focus on debugging backprop. (On the other hand, if it is still does not succeed, it is likely the issue in your forward prop function.) For another, once the algorithm is working with the numerically approximated gradient then you can compare the output of your analytic gradient function with it and debug any discrepancies. This makes it a lot easier because you now know the correct answer that it should return.