I am trying to build a custom loss function in keras. Unfortunately i have little knowledge with tensor flow. Is there a way i can convert the incoming tensors into a numpy array so i can compute my loss function?
Here is my function:
def getBalance(x_true, x_pred):
x_true = np.round(x_true)
x_pred = np.round(x_pred)
NumberOfBars = len(x_true)
NumberOfHours = NumberOfBars/60
TradeIndex = np.where( x_pred[:,1] == 0 )[0]
##remove predictions that are not tradable
x_true = np.delete(x_true[:,0], TradeIndex)
x_pred = np.delete(x_pred[:,0], TradeIndex)
CM = confusion_matrix(x_true, x_pred)
correctPredictions = CM[0,0]+CM[1,1]
wrongPredictions = CM[1,0]+CM[0,1]
TotalTrades = correctPredictions+wrongPredictions
Accuracy = (correctPredictions/TotalTrades)*100
return Accuracy
If its not possible to use numpy array's what is the best way to compute that function with tensorflow? Any direction would be greatly appreciated, thank you!
Edit 1:
Here are some details of my model. I am using a LSTM network with heavy drop out. The inputs are a multi-variable multi-time step.
The outputs are a 2d array of binary digits (20000,2)
model = Sequential()
model.add(Dropout(0.4, input_shape=(train_input_data_NN.shape[1], train_input_data_NN.shape[2])))
model.add(LSTM(30, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(2))
model.compile(loss='getBalance', optimizer='adam')
history = model.fit(train_input_data_NN, outputs_NN, epochs=50, batch_size=64, verbose=1, validation_data=(test_input_data_NN, outputs_NN_test))
EDIT: 1 Here is an untested substitution:
(took the liberty of normalizing the variable names )
def get_balance(x_true, x_pred):
x_true = K.tf.round(x_true)
x_pred = K.tf.round(x_pred)
# didnt see the need for these
# NumberOfBars = (x_true)
# NumberOfHours = NumberOfBars/60
trade_index = K.tf.not_equal(x_pred[:,1], 0 )
##remove predictions that are not tradable
x_true_tradeable = K.tf.boolean_mask(x_true[:,0], trade_index)
x_pred_tradeable = K.tf.boolean_mask(x_pred[:,0], trade_index)
cm = K.tf.confusion_matrix(x_true_tradeable, x_pred_tradeable)
correct_predictions = cm[0,0]+cm[1,1]
wrong_predictions = cm[1,0]+cm[0,1]
total_trades = correction_predictions + wrong_predictions
accuracy = (correct_predictions/total_trades)*100
return accuracy
Original Answer
Welcome to SO. As you might know we need to compute the the gradient on the loss function. We can't compute the gradient correctly on numpy arrays (they're just constants).
What is done ( in keras/theano which are the backends one uses with keras) is automatic differentiation on Tensors (e.g tf.placeholder()).This is not the entire story but what you should know at this point is that tf / theano gives us gradients by default on operators like tf.max, tf.sum.
What that means for you is all the operations on tensors (y_true and y_pred) should be rewritten to use tf / theano operators.
I'll comment with what I think would be rewritten and you can substitute accordingly and test.
See tf.round used as K.tf.round where K is the reference to the keras backend imported as
import keras.backend as K
x_true = np.round(x_true)
x_pred = np.round(x_pred)
Grab the shape of the tensor x_true. K.shape. Compute the ratio over a constant could remain as
it as Here
NumberOfBars = len(x_true)
NumberOfHours = NumberOfBars/60
See tf.where used as K.tf.where
TradeIndex = np.where( x_pred[:,1] == 0 )[0]
You could mask the tensor w/ a condition instead of deleting - see masking
##remove predictions that are not tradable
x_true = np.delete(x_true[:,0], TradeIndex)
x_pred = np.delete(x_pred[:,0], TradeIndex)
See tf.confusion_matrix
CM = confusion_matrix(x_true, x_pred)
The computation that follow are computation overs constants and so remain essentially the same ( conditioned on
whatever changes have to made given the new API )
Hopefully I can update this answer with a valid substitution that runs. But I hope this sets on the right path.
A suggestion on coding style: I see you use three version of variable naming in your code choose one and stick with it.
Related
Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the neural network's output w.r.t. x, so that I can use the result when training the weights theta using a loss depending both on the output and the partial derivatives of the output. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar, however, adapting it to a neural network context within pytorch seems like a huge amount of work and a recipee for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting my neural network)
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)
def trainable_function(time):
return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
deriv_time = torch.tensor(time, requires_grad=True)
fun_value = trainable_function(deriv_time)
gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
deriv_time.requires_grad = False
return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are besides the point.
def objective(train_times, observations):
predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
return torch.sum((predictions - observations)**2)
optimizer = Adam([theta], lr=0.1)
for iteration in range(200):
optimizer.zero_grad()
loss = objective(data_times, noisy_targets)
loss.backward()
optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives in the way I do, I do not really create a computational graph through which autodiff could differentiate through. Thus, the connection to the parameters theta somehow gets lost and now it looks to the optimizer as if the loss is completely independent of the parameters theta. However, I could be totally wrong..
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advise, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1
def true_function(time):
return true_a*time**3 + true_b*time**2 + true_c*time
def true_derivative(time):
deriv_time = torch.tensor(time, requires_grad=True)
fun_value = true_function(deriv_time)
return torch.autograd.grad(fun_value, deriv_time)
data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
grad = jac[torch.arange(n),:,torch.arange(n),0]
loss = (x -xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()
I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example, to understand the functioning of CNN better. Here is my attempt of constructing a small CNN for a very simple task, which provides unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolvded2d(x,gaussian_kernel,boundary='fill',mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training =np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',\
shape = [size_image,size_image,1,1],\
dtype = tf.float64,\
initializer = tf.contrib.layers.xavier_initializer(seed = 0),\
trainable = True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
# Run the initialization
sess.run(init)
# Do the training loop
for epoch in range(nb_epochs):
minibatch_cost = 0.
seed = seed + 1
permutation = list(np.random.permutation(n_train))
shuffled_ind_samples = np.array(ind_samples_training)[permutation]
# Learning rate update
if learning_rate.shape[0]>1:
lr = learning_rate[epoch]
nb_minibatches = int(np.ceil(n_train/minibatch_size))
for num_minibatch in range(nb_minibatches):
# Minibatch indices
ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
# Loading of the original image (X) and the blurred image (Y)
minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch,size_image, normalization, sigma)
_ , temp_cost, filter_learnt = sess.run([optimizer,cost,filter_to_learn],\
feed_dict = {X:minibatch_X, Y_blur_true:minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the xavier initiliazer method provided by tensorflow, and with the true values of the gaussian kernel we actually would like to learn. In both cases, the filter that is learnt results too different from the true gaussian one as it can be seen on the two images available at https://github.com/megalinier/Helsinki-project.
By examining the photos it seems like the network is learning OK, as the predicted image is not so far off the true label - for better results you can tweak some hyperparams but that is not the case.
I think what you are missing is the fact that different kernels can get quite similar results since it is a convolution.
Think about it, you are multiplying some matrix with another, and then summing all the results to create a new pixel. Now if the true label sum is 10, it could be a results of 2.5 + 2.5 + 2.5 + 2.5 and -10 + 10 + 10 + 0.
What I am trying to say, is that your network could be learning just fine, but you will get a different values in the conv kernel than the filter.
I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong but there could be multiple culprits here. For one, squared error provides a weak signal in the case that target and prediction are already quite similar -- and while the xavier-initalized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).
I have been following a tutorial that shows how to make a word2vec model.
This tutorial uses this piece of code:
similarity = merge([target, context], mode='cos', dot_axes=0) (no other info was given, but I suppose this comes from keras.layers)
Now, I've researched a bit on the merge method but I couldn't find much about it.
From what I understand, it has been replaced by a lot of functions like layers.Add(), layers.Concat()....
What should I use? There's .Dot(), which has an axis parameter (which seems to be correct) but no mode parameter.
What can I use in this case?
The Dot layer in Keras now supports built-in Cosine similarity using the normalize = True parameter.
From the Keras Docs:
keras.layers.Dot(axes, normalize=True)
normalize: Whether to L2-normalize samples along the dot product axis before taking the dot product. If set to True, then the output of the dot product is the cosine proximity between the two samples.
Source
There are a few things that are unclear from the Keras documentation that I think are crucial to understanding:
For each function in the keras documentation for Merge, there is a lower case and upper case one defined i.e. add() and Add().
On Github, farizrahman4u outlines the differences:
Merge is a layer.
Merge takes layers as input
Merge is usually used with Sequential models
merge is a function.
merge takes tensors as input.
merge is a wrapper around Merge.
merge is used in Functional API
Using Merge:
left = Sequential()
left.add(...)
left.add(...)
right = Sequential()
right.add(...)
right.add(...)
model = Sequential()
model.add(Merge([left, right]))
model.add(...)
using merge:
a = Input((10,))
b = Dense(10)(a)
c = Dense(10)(a)
d = merge([b, c])
model = Model(a, d)
To answer your question, since Merge has been deprecated, we have to define and build a layer ourselves for the cosine similarity. In general this will involve using those lowercase functions, which we wrap within a Lambda to create a layer that we can use within a model.
I found a solution here:
from keras import backend as K
def cosine_distance(vests):
x, y = vests
x = K.l2_normalize(x, axis=-1)
y = K.l2_normalize(y, axis=-1)
return -K.mean(x * y, axis=-1, keepdims=True)
def cos_dist_output_shape(shapes):
shape1, shape2 = shapes
return (shape1[0],1)
distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([processed_a, processed_b])
Depending on your data, you may want to remove the L2 normalization. What is important to note about the solution is that it is built using the Keras function api e.g. K.mean() - I think this is necessary when defining custom layer or even loss functions.
Hope I was clear, this was my first SO answer!
Maybe this will help you
(I spent a lot of time to make sure that these are the same things)
import tensorflow as tf
with tf.device('/CPU:' + str(0)):
print(tf.losses.CosineSimilarity()([1.0,1.0,1.0,-1.0],[4.0,4.0,4.0,5.0]))
print(tf.keras.layers.dot([tf.Variable([[1.0,1.0,1.0,-1.0]]),tf.Variable([[4.0,4.0,4.0,5.0]])], axes=1, normalize=True))
Output (Pay attention to the sign):
tf.Tensor(-0.40964404, shape=(), dtype=float32)
tf.Tensor([[0.40964404]], shape=(1, 1), dtype=float32)
If you alter the last code block of the tutorial as follows, you can see that the (average) loss is decreasing nicely with the Dot solution suggested by SantoshGuptaz7 (comment in the question above):
display_after_epoch = 10000
display_after_epoch_2 = 10 * display_after_epoch
loss_sum = 0
for cnt in range(epochs):
idx = np.random.randint(0, len(labels)-1)
arr_1[0,] = word_target[idx]
arr_2[0,] = word_context[idx]
arr_3[0,] = labels[idx]
loss = model.train_on_batch([arr_1, arr_2], arr_3)
loss_sum += loss
if cnt % display_after_epoch == 0 and cnt != 0:
print("\nIteration {}, loss={}".format(cnt, loss_sum / cnt))
loss_sum = 0
if cnt % display_after_epoch_2 == 0:
sim_cb.run_sim()
I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses wavelet transform (WT) and residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). Xavier method suggests to initialize the variables normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters)
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in details . I am unable to find the equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)
Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you). example
Q2: Is it possible to set the learning rate in log scale?
You can but do you need it? This is not the first thing that you need to solve in this network. Please check #3
However, just for reference, use following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
At first, you have not written "pred" to "image" conversion to this message(Based on the paper you need to apply subtraction and IDWT to obtain final image).
There is one problem here, logits have to be calculated based on your label data. i.e. if you will use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is L1 distance and it may suffer from some issues (check). Think following situation
Let's say you have following array as output [10, 10, 10, 0, 0] and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For same case, think following output [6, 6, 6, 6, 6]. It still has loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply threshold of 5, you can achieve 5/5 success. Hence, this is the case that we want.
If you use L2 loss, for the first case, you will have 10^2 + 10^2 = 200 as loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, optimizer will try to run away from #1 as quick as possible to achieve global success rather than perfect success of some outputs and complete failure of the others. You can apply loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))
Q1. How should I initialize the variables? Is the code below correct?
That's correct (although missing an opening parentheses). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.
Other answers are very detailed and helpful. Here is a code example that uses placeholder to decay learning rate at log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print y.shape
#modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
for i in range (100000):
if i > 0 and i % N == 0:
# epoch done, decrease learning rate by 2
learning_rate /= 2
print "Epoch completed. LR =", learning_rate
idx = i/batch_size + i%batch_size
f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
_, err = sess.run([train_op, mse], feed_dict = f)
acc_err += err
if i%5000 == 0:
print "Average error = {}".format(acc_err/5000)
acc_err = 0.0
I'm trying to train an LSTM to classify sequences of various lengths. I want to get the weights of this model, so I can use them in stateful version of the model. Before training, the weights are normal. Also, the training seems to run successfully, with a gradually decreasing error. However, when I change the mask value from -10 to np.Nan, mod.get_weights() starts returning arrays of NaNs and the validation error drops suddenly to a value close to zero. Why is this occurring?
from keras import models
from keras.layers import Dense, Masking, LSTM
from keras.optimizers import RMSprop
from keras.losses import categorical_crossentropy
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import matplotlib.pyplot as plt
def gen_noise(noise_len, mag):
return np.random.uniform(size=noise_len) * mag
def gen_sin(t_val, freq):
return 2 * np.sin(2 * np.pi * t_val * freq)
def train_rnn(x_train, y_train, max_len, mask, number_of_categories):
epochs = 3
batch_size = 100
# three hidden layers of 256 each
vec_dims = 1
hidden_units = 256
in_shape = (max_len, vec_dims)
model = models.Sequential()
model.add(Masking(mask, name="in_layer", input_shape=in_shape,))
model.add(LSTM(hidden_units, return_sequences=False))
model.add(Dense(number_of_categories, input_shape=(number_of_categories,),
activation='softmax', name='output'))
model.compile(loss=categorical_crossentropy, optimizer=RMSprop())
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
validation_split=0.05)
return model
def gen_sig_cls_pair(freqs, t_stops, num_examples, noise_magnitude, mask, dt=0.01):
x = []
y = []
num_cat = len(freqs)
max_t = int(np.max(t_stops) / dt)
for f_i, f in enumerate(freqs):
for t_stop in t_stops:
t_range = np.arange(0, t_stop, dt)
t_len = t_range.size
for _ in range(num_examples):
sig = gen_sin(f, t_range) + gen_noise(t_len, noise_magnitude)
x.append(sig)
one_hot = np.zeros(num_cat, dtype=np.bool)
one_hot[f_i] = 1
y.append(one_hot)
pad_kwargs = dict(padding='post', maxlen=max_t, value=mask, dtype=np.float32)
return pad_sequences(x, **pad_kwargs), np.array(y)
if __name__ == '__main__':
noise_mag = 0.01
mask_val = -10
frequencies = (5, 7, 10)
signal_lengths = (0.8, 0.9, 1)
dt_val = 0.01
x_in, y_in = gen_sig_cls_pair(frequencies, signal_lengths, 50, noise_mag, mask_val)
mod = train_rnn(x_in[:, :, None], y_in, int(np.max(signal_lengths) / dt_val), mask_val, len(frequencies))
This persists even if I change the network architecture to return_sequences=True and wrap the Dense layer with TimeDistributed, nor does removing the LSTM layer.
I had the same problem. In your case I can see it was probably something different but someone might have the same problem and come here from Google. So in my case I was passing sample_weight parameter to fit() method and when the sample weights contained some zeros in it, get_weights() was returning an array with NaNs. When I omitted the samples where sample_weight=0 (they were useless anyway if sample_weight=0), it started to work.
The weights are indeed changing. The unchanging weights are from the edge of the image, and they may have not changed because the edge isn't helpful for classifying digits.
to check select a specific layer and see the result:
print(model.layers[70].get_weights()[1])
70 : is the number of the last layer in my case.
get_weights() method of keras.engine.training.Model instance should retrieve the weights of the model.
This should be a flat list of Numpy arrays, or in other words this should be the list of all weight tensors in the model.
mw = model.get_weights()
print(mw)
If you got the NaN(s) this has a specific meaning. You are dealing simple with vanishing gradients problem. (In some cases even with Exploding gradients).
I would first try to alter the model to reduce the chances for the vanishing gradients. Try reducing the hidden_units first, and normalize your activations.
Even though LSTM are there to solve the problem of vanishing/exploding gradients problem you need to set the right activations from (-1, 1) interval.
Note this interval is where float points are most precise.
Working with np.nan under the masking layer is not a predictable operation since you cannot do comparison with np.nan.
Try print(np.nan==np.nan) and it will return False. This is an old problem with the IEEE 754 standard.
Or it may actually be this is a bug in Tensorflow, based on the IEEE 754 standard weakness.