PyTorch optimizer.step() function doesn't update weights - python

The code can be seen below.
The problem is that the optimizer.step() call doesn't seem to do anything. I'm printing model.parameters() before and after training, and the weights don't change.
I'm trying to make a perceptron that can solve the AND-problem. I've been successful in doing this with my own tiny library, where I've implemented a perceptron with the two functions predict() and train().
Just to clarify, I've just started learning deep learning using PyTorch, so it's probably a very newbie problem. I've tried searching for a solution, but without luck. I've also compared my code with other codes that work, but I don't know what I'm doing wrong.
import torch
from torch import nn, optim
from random import randint

class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(2, 1)

    def forward(self, input):
        out = input
        out = self.layer1(out)
        out = torch.sign(out)
        out = torch.clamp(out, 0, 1)  # 0=false, 1=true
        return out

data = torch.Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
target = torch.Tensor([0, 0, 0, 1])

model = NeuralNet()
epochs = 1000
lr = 0.01

print(list(model.parameters()))
print()  # Print parameters before training

loss_func = nn.L1Loss()
optimizer = optim.Rprop(model.parameters(), lr)

for epoch in range(epochs + 1):
    optimizer.zero_grad()

    rand_int = randint(0, len(data) - 1)
    x = data[rand_int]
    y = target[rand_int]

    pred = model(x)
    loss = loss_func(pred, y)
    loss.backward()
    optimizer.step()

# Print parameters again
# But they haven't changed
print(list(model.parameters()))

Welcome to Stack Overflow!
The issue here is that you are trying to perform back-propagation through a non-differentiable function. Non-differentiable means that no gradients can flow back through it, implying that all trainable weights applied before it will not be updated by your optimizer. Such functions are easy to spot; they are discrete, sharp operations that resemble 'if' statements. In your case it is the sign() function.
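For example, a quick check (an illustrative snippet, not from the question) shows that autograd propagates a zero gradient through sign():

x = torch.tensor([0.5], requires_grad=True)
y = torch.sign(x)
y.backward()
print(x.grad)  # tensor([0.]) -- zero (almost) everywhere, so nothing upstream ever gets updated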
Unfortunately, PyTorch does not do any hand-holding in this regard and will not point you to the issue. What you could do to alleviate this would be to transform the range of your output to [-1, 1] and apply a Tanh() non-linearity instead of the sign() and clamp() operators.
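A minimal sketch of that change (remapping the targets to -1/1 is my assumption about how the labels would then be encoded):

class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(2, 1)

    def forward(self, input):
        return torch.tanh(self.layer1(input))  # differentiable, output in (-1, 1)

target = torch.Tensor([-1, -1, -1, 1])  # -1 = false, 1 = true, instead of 0/1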

Related

LSTM to Predict Pattern 010101... Understanding Hidden State

I did a quick experiment to see if I could understand what the hidden state in an LSTM does...
I tried to make an LSTM predict a sequence of [1,0,1,0,1...] based off an input sequence of X with X[0] = 1 and the remainder as random noise.
X = [1, randFloat, randFloat, randFloat...]
label = [1, 0, 1, 0...]
In my head, the model would understand:
The inputs X mean nothing, or at least very little (as it's noise) - so it'd discard these values for the most part
Solely the hidden state from the previous sequence/timestep n would be used to predict the next timestep n+1... [1, 0, 1, 0...]
I also set X[0] = 1 as an initial cue, in an attempt to guide the net to predict 1 on the first item (which it does).
So, this didn't work. In theory, shouldn't it? Can someone explain?
It essentially never converges, and is on the cusp of guessing between 0 or 1
## Code
import os
import numpy as np
import torch
from torchvision import transforms
from torch import nn
import torch.nn.functional as F  # added: F.log_softmax is used in the model below
from sklearn import preprocessing
from util import create_sequences
import torch.optim as optim

Create some fake data

sequence_1 = torch.tensor(np.random.uniform(size=50)).float().detach()
sequence_1[0] = 1
sequence_2 = torch.tensor(np.random.uniform(size=50)).float().detach()
sequence_2[0] = 1

labels_1 = np.zeros(50)
labels_1[::2] = 1
labels_1 = torch.tensor(labels_1, dtype=torch.long)
labels_2 = labels_1.clone()

training_data = [sequence_1, sequence_2]
label_data = [labels_1, labels_2]
Create simple LSTM Model
class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTM, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, seq):
        lstm_out, _ = self.lstm(seq.view(len(seq), 1, -1))
        out = self.fc(lstm_out.view(len(seq), -1))
        out = F.log_softmax(out, dim=1)
        return out
We try to overfit on the dataset
INPUT_DIM = 1
HIDDEN_DIM = 6
model = LSTM(INPUT_DIM, HIDDEN_DIM, 2)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(500):
    for i, seq in enumerate(training_data):
        labels = label_data[i]
        model.zero_grad()
        scores = model(seq)
        loss = loss_function(scores, labels)
        loss.backward()
        print(loss)
        optimizer.step()

with torch.no_grad():
    seq_d = training_data[0]
    tag_scores = model(seq_d)
    for score in tag_scores:
        print(np.argmax(score))
I would say it's not meant to work.
The model always tries to find patterns in the data it's trained on, i.e. sequence_1, and uses labels_1 to "verify" that it has found them. Since the data is random noise, the model fails to find any pattern.
The pattern the model looks for is in the data, not in the label, so it doesn't matter how the labels are arranged; the labels never actually pass through the model. So no, in theory it shouldn't work.
If you trained it on a single example, then yes: the model would overfit, give you your ones and zeros, and fail miserably on other examples. Otherwise it just won't be able to make sense of the random data, no matter the dataset size.
Hidden State
Solely the hidden state from the previous sequence/timestep n would be used to predict the next timestep n+1... [1, 0, 1, 0...]
Concerning the hidden state: note that it is not a trainable parameter. It is the result of performing operations on the data and the parameters, meaning that the input data determines the hidden state.
The hidden state holds the information the model has extracted from previous timesteps and passes it on to the next timestep, or returns it as output. In the case of an LSTM, some forgetting and updating happens before it is passed on.
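A quick way to see this (illustrative snippet, not from the question): the hidden state comes back from the forward pass and is never listed among the module's learnable parameters.

import torch
from torch import nn

lstm = nn.LSTM(input_size=1, hidden_size=6)
x = torch.randn(50, 1, 1)                     # (seq_len, batch, features)
out, (h_n, c_n) = lstm(x)                     # h_n, c_n are computed from x and the weights
print([p.shape for p in lstm.parameters()])   # only weight/bias tensors, no hidden state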

why Gradient Descent doesn't work as expected with pytorch

So I'm starting with PyTorch and tried to begin with an easy linear regression example. I made a simple implementation of linear regression with PyTorch to fit the equation 2*x + 1, but the loss stays stuck around 120 and gradient descent doesn't converge to a small loss value. I don't know why this is happening, and it's driving me crazy because I don't see what's wrong; this example should be very easy to solve. This is the code I'm using:
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
import numpy as np

X = np.array([i for i in np.arange(1, 20)]).reshape(-1, 1)
X = torch.tensor(X, dtype=torch.float32, requires_grad=True)
y = np.array([2*i + 1 for i in np.arange(1, 20)]).reshape(-1, 1)
y = torch.tensor(y, dtype=torch.float32, requires_grad=True)
print(X.shape, y.shape)

class LR(torch.nn.Module):
    def __init__(self, n_features, n_hidden1, n_out):
        super(LR, self).__init__()
        self.linear = torch.nn.Linear(n_features, n_hidden1)
        self.predict = torch.nn.Linear(n_hidden1, n_out)

    def forward(self, x):
        x = F.relu(self.linear(x))
        x = self.predict(x)
        return x

model = LR(1, 10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

def train(epochs=100):
    for e in range(epochs):
        pred = model(X)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"epoch: {e} and loss= {loss}")
The desired output is a small loss value and a model that trains well enough to give good predictions later.
Your learning rate is too large. The model takes a few steps in the right direction, but it can't land on an actually good minimizer and from then on zigzags around it. If you try lr=0.001 instead, the performance will be much better. This is why it's often useful to decay your learning rate over time when using first-order optimizers.
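A minimal sketch of both suggestions together (the StepLR schedule and its parameters are an illustrative assumption, not part of the answer):

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # smaller learning rate, as suggested
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)  # optional decay over time

def train(epochs=100):
    for e in range(epochs):
        pred = model(X)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # decay the learning rate on the chosen schedule
        print(f"epoch: {e} and loss= {loss}")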

Keras historical averaging custom loss function

I am currently experimenting with generative adversarial networks in Keras.
As proposed in this paper, I want to use the historical averaging loss function. Meaning that I want to penalize the change of the network weights.
I am not sure how to implement it in a clever way.
I was implementing the custom loss function according to the answer to this post.
def historical_averaging_wrapper(current_weights, prev_weights):
    def historical_averaging(y_true, y_pred):
        diff = 0
        for i in range(len(current_weights)):
            diff += abs(np.sum(current_weights[i]) + np.sum(prev_weights[i]))
        return K.binary_crossentropy(y_true, y_pred) + diff
    return historical_averaging
The network weights are penalized, and they change after each batch of data.
My first idea was therefore to update the loss function after each batch, roughly like this:
prev_weights = model.get_weights()
for i in range(len(data) / batch_len):
    current_weights = model.get_weights()
    model.compile(loss=historical_averaging_wrapper(current_weights, prev_weights), optimizer='adam')
    model.fit(training_data[i*batch_size:(i+1)*batch_size],
              training_labels[i*batch_size:(i+1)*batch_size],
              epochs=1, batch_size=batch_size)
    prev_weights = current_weights
Is this reasonable? That approach seems to be a bit "messy" in my opinion.
Is there another possibility to do this in a "smarter" way?
Maybe updating the loss function in a data generator and using fit_generator()?
Thanks in advance.
Loss functions are operations on the graph using tensors.
You can define additional tensors in the loss function to hold previous values. This is an example:
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
keras = tf.keras

class HistoricalAvgLoss(object):
    def __init__(self, model):
        # create tensors (initialized to zero) to hold the previous values of the weights
        self.prev_weights = []
        for w in model.get_weights():
            self.prev_weights.append(K.variable(np.zeros(w.shape)))

    def loss(self, y_true, y_pred):
        err = keras.losses.mean_squared_error(y_true, y_pred)
        werr = [K.mean(K.abs(c - p)) for c, p in zip(model.get_weights(), self.prev_weights)]
        self.prev_weights = K.in_train_phase(
            [K.update(p, c) for c, p in zip(model.get_weights(), self.prev_weights)],
            self.prev_weights
        )
        return K.in_train_phase(err + K.sum(werr), err)
The variable prev_weights holds the previous values. Note that we added a K.update operation after the weight errors are calculated.
A sample model for testing:
model = keras.models.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(8),
    keras.layers.Dense(4),
    keras.layers.Dense(1),
])

loss_obj = HistoricalAvgLoss(model)

model.compile('adam', loss_obj.loss)
model.summary()
Some test data and objective function:
import numpy as np

def test_fn(x):
    return x[0]*x[1] + 2.0 * x[1]**2 + x[2]/x[3] + 3.0 * x[3]

X = np.random.rand(1000, 4)
y = np.apply_along_axis(test_fn, 1, X)

hist = model.fit(X, y, validation_split=0.25, epochs=10)
The model losses decrease over time, in my test.

AND-gate with Pytorch

I'm new to PyTorch and deep learning generally.
The code I wrote can be seen further down.
I'm trying to learn the simple AND problem, which is linearly separable.
The problem is that I'm getting poor results: only around 2 out of 10 runs reach the correct answer.
Sometimes the loss.item() value gets stuck at 0.250.
Just to clear things up: why does it only work 2 out of 10 times?
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autog

data_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
data_y = np.array([[0, 0, 0, 1]]).T
data_x = autog.Variable(torch.FloatTensor(data_x))
data_y = autog.Variable(torch.FloatTensor(data_y), requires_grad=False)

in_dim = 2
out_dim = 1
epochs = 15000
epoch_print = epochs / 5
l_rate = 0.001

class NeuralNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(NeuralNet, self).__init__()
        self.lin1 = nn.Linear(input_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = x
        out = self.lin1(out)
        out = self.relu(out)
        return out

model = NeuralNet(in_dim, out_dim)
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=l_rate)

for epoch in range(epochs):
    pred = model(data_x)
    loss = criterion(pred, data_y)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % epoch_print == 0:
        print("Epoch %d Loss %.3f" % (epoch + 1, loss.item()))

for x, y in zip(data_x, data_y):
    pred = model(x)
    print("Input", list(map(int, x)), "Pred", int(pred), "Output", int(y))
1. Using zero_grad with optimizer
You are not using optimizer.zero_grad() to clear the gradient. Your learning loop should look like this:
for epoch in range(epochs):
    optimizer.zero_grad()
    pred = model(data_x)
    loss = criterion(pred, data_y)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % epoch_print == 0:
        print("Epoch %d Loss %.3f" % (epoch + 1, loss.item()))
In this particular case it will not have any detrimental effect: the gradient accumulates, but since you loop over the same dataset over and over it makes barely any difference. You should get into this habit anyway, as you will use it throughout your deep learning journey.
2. Cost Function
You are using Mean Absolute Error, which is a regression loss function, not a classification one (what you are doing is binary classification).
Accordingly, you should either use BCELoss with a sigmoid activation or (I prefer it that way) return logits from the network and use BCEWithLogitsLoss; both of them calculate binary cross-entropy (a simplified version of cross-entropy).
See below:
class NeuralNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(NeuralNet, self).__init__()
        self.lin1 = nn.Linear(input_size, output_size)

    def forward(self, x):
        # You may want to use torch.nn.functional.sigmoid activation
        return self.lin1(x)

...

# Change your criterion to nn.BCELoss() if using sigmoid
criterion = nn.BCEWithLogitsLoss()

...
3. Predictions
If you use the logits version, the classifier learns to assign negative values to the 0 label and positive values to indicate 1. Your display function has to be modified to incorporate this:
for x, y in zip(data_x, data_y):
    pred = model(x)
    # See int(pred > 0), that's the only change
    print("Input", list(map(int, x)), "Pred", int(pred > 0), "Output", int(y))
This step does not apply if your forward applies sigmoid to the output. Oh, and it's better to use torch.round instead of casting to int.
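If you go the sigmoid route instead, a rough sketch of the display step (hypothetical, not from the answer) would be:

for x, y in zip(data_x, data_y):
    prob = torch.sigmoid(model(x))  # assuming forward() still returns logits
    print("Input", list(map(int, x)), "Pred", int(torch.round(prob)), "Output", int(y))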

Convergence of LSTM network using Tensorflow

I am trying to detect micro-events in a long time series. For this purpose, I will train an LSTM network.
Data. Input for each time sample is 11 different features, somewhat normalized to fit 0-1. Output will be either one of two classes.
Batching. Due to huge class imbalance I have extracted the data in batches of 60 time samples each, of which at least 5 will always be class 1 and the rest class 0. In this way the class imbalance is reduced from 150:1 to around 12:1. I have then randomized the order of all my batches.
Model. I am attempting to train an LSTM, with an initial configuration of 3 different cells with 5 delay steps. I expect the micro-events to arrive in sequences of at least 3 time steps.
Problem: When I try to train the network it will quickly converge towards saying that EVERYTHING belongs to the majority class. When I implement a weighted loss function, at a certain threshold it will change to saying that EVERYTHING belongs to the minority class. I suspect (without being an expert) that there is no learning in my LSTM cells, or that my configuration is off?
Below is the code for my implementation. I am hoping that someone can tell me
Is my implementation correct?
What other reasons could there be for such behaviour?
ar_model.py
import numpy as np
import tensorflow as tf
from tensorflow.models.rnn import rnn
import ar_config

config = ar_config.get_config()

class ARModel(object):

    def __init__(self, is_training=False, config=None):
        # Config
        if config is None:
            config = ar_config.get_config()

        # Placeholders
        self._features = tf.placeholder(tf.float32, [None, config.num_features], name='ModelInput')
        self._targets = tf.placeholder(tf.float32, [None, config.num_classes], name='ModelOutput')

        # Hidden layer
        with tf.variable_scope('lstm') as scope:
            lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
            cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * config.num_delays)
            self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
            outputs, state = rnn.rnn(cell, [self._features], dtype=tf.float32)

        # Output layer
        output = outputs[-1]
        softmax_w = tf.get_variable('softmax_w', [config.num_hidden, config.num_classes], tf.float32)
        softmax_b = tf.get_variable('softmax_b', [config.num_classes], tf.float32)
        logits = tf.matmul(output, softmax_w) + softmax_b

        # Evaluate
        ratio = (60.00 / 5.00)
        class_weights = tf.constant([ratio, 1 - ratio])
        weighted_logits = tf.mul(logits, class_weights)
        loss = tf.nn.softmax_cross_entropy_with_logits(weighted_logits, self._targets)
        self._cost = cost = tf.reduce_mean(loss)
        self._predict = tf.argmax(tf.nn.softmax(logits), 1)
        self._correct = tf.equal(tf.argmax(logits, 1), tf.argmax(self._targets, 1))
        self._accuracy = tf.reduce_mean(tf.cast(self._correct, tf.float32))
        self._final_state = state

        if not is_training:
            return

        # Optimize
        optimizer = tf.train.AdamOptimizer()
        self._train_op = optimizer.minimize(cost)

    @property
    def features(self):
        return self._features

    @property
    def targets(self):
        return self._targets

    @property
    def cost(self):
        return self._cost

    @property
    def accuracy(self):
        return self._accuracy

    @property
    def train_op(self):
        return self._train_op

    @property
    def predict(self):
        return self._predict

    @property
    def initial_state(self):
        return self._initial_state

    @property
    def final_state(self):
        return self._final_state
ar_train.py
import os
from datetime import datetime
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import ar_network
import ar_config
import ar_reader

config = ar_config.get_config()

def main(argv=None):
    if gfile.Exists(config.train_dir):
        gfile.DeleteRecursively(config.train_dir)
    gfile.MakeDirs(config.train_dir)
    train()

def train():
    train_data = ar_reader.ArousalData(config.train_data, num_steps=config.max_steps)
    test_data = ar_reader.ArousalData(config.test_data, num_steps=config.max_steps)

    with tf.Graph().as_default(), tf.Session() as session, tf.device('/cpu:0'):
        initializer = tf.random_uniform_initializer(minval=-0.1, maxval=0.1)
        with tf.variable_scope('model', reuse=False, initializer=initializer):
            m = ar_network.ARModel(is_training=True)
            s = tf.train.Saver(tf.all_variables())

        tf.initialize_all_variables().run()

        for batch_input, batch_target in train_data:
            step = train_data.iter_steps

            dict = {
                m.features: batch_input,
                m.targets: batch_target
            }

            session.run(m.train_op, feed_dict=dict)
            state, cost, accuracy = session.run([m.final_state, m.cost, m.accuracy], feed_dict=dict)

            if not step % 10:
                test_input, test_target = test_data.next()
                test_accuracy = session.run(m.accuracy, feed_dict={
                    m.features: test_input,
                    m.targets: test_target
                })
                now = datetime.now().time()
                print('%s | Iter %4d | Loss= %.5f | Train= %.5f | Test= %.3f' % (now, step, cost, accuracy, test_accuracy))

            if not step % 1000:
                destination = os.path.join(config.train_dir, 'ar_model.ckpt')
                s.save(session, destination)

if __name__ == '__main__':
    tf.app.run()
ar_config.py
class Config(object):
    # Directories
    train_dir = '...'
    ckpt_dir = '...'
    train_data = '...'
    test_data = '...'

    # Data
    num_features = 13
    num_classes = 2
    batch_size = 60

    # Model
    num_hidden = 3
    num_delays = 5

    # Training
    max_steps = 100000

def get_config():
    return Config()
UPDATED ARCHITECTURE:
# Placeholders
self._features = tf.placeholder(tf.float32, [None, config.num_features, config.num_delays], name='ModelInput')
self._targets = tf.placeholder(tf.float32, [None, config.num_output], name='ModelOutput')

# Weights
weights = {
    'hidden': tf.get_variable('w_hidden', [config.num_features, config.num_hidden], tf.float32),
    'out': tf.get_variable('w_out', [config.num_hidden, config.num_classes], tf.float32)
}
biases = {
    'hidden': tf.get_variable('b_hidden', [config.num_hidden], tf.float32),
    'out': tf.get_variable('b_out', [config.num_classes], tf.float32)
}

# Layer in
with tf.variable_scope('input_hidden') as scope:
    inputs = self._features
    inputs = tf.transpose(inputs, perm=[2, 0, 1])  # (BatchSize, NumFeatures, TimeSteps) -> (TimeSteps, BatchSize, NumFeatures)
    inputs = tf.reshape(inputs, shape=[-1, config.num_features])  # (TimeSteps, BatchSize, NumFeatures) -> (TimeSteps*BatchSize, NumFeatures)
    inputs = tf.add(tf.matmul(inputs, weights['hidden']), biases['hidden'])

# Layer hidden
with tf.variable_scope('hidden_hidden') as scope:
    inputs = tf.split(0, config.num_delays, inputs)  # -> n_steps * (batchsize, features)
    cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
    self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
    outputs, state = rnn.rnn(cell, inputs, dtype=tf.float32)

# Layer out
with tf.variable_scope('hidden_output') as scope:
    output = outputs[-1]
    logits = tf.add(tf.matmul(output, weights['out']), biases['out'])
Odd elements
Weighted loss
I am not sure your "weighted loss" does what you want it to do:
ratio = (60.00 / 5.00)
class_weights = tf.constant([ratio, 1 - ratio])
weighted_logits = tf.mul(logits, class_weights)
This is applied before the loss function is calculated (also, did you want an element-wise multiplication here? And your ratio is above 1, which makes the second weight negative), so it forces your predictions to behave in a certain way before the softmax is applied.
If you want a weighted loss, you should instead apply the weighting after the unweighted loss is computed, i.e. after
loss = tf.nn.softmax_cross_entropy_with_logits(logits, self._targets)
multiply the per-example loss element-wise by your weights,
loss = loss * weights
where the per-class weights have a shape like [2,]. A rough sketch of one way to do this follows.
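This sketch keeps the same pre-1.0 TensorFlow style as the question; deriving a per-example weight from the one-hot targets is my assumption, and class_weights should then hold positive per-class values (e.g. [12.0, 1.0]):

loss = tf.nn.softmax_cross_entropy_with_logits(logits, self._targets)     # per-example loss, shape [batch_size]
example_weights = tf.reduce_sum(tf.mul(class_weights, self._targets), 1)  # weight of each example's true class
self._cost = tf.reduce_mean(tf.mul(loss, example_weights))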
However, I would not recommend using weighted losses. Perhaps try increasing the ratio even further than 1:6 instead.
Architecture
As far as I can read, you are using 5 stacked LSTMs with 3 hidden units per layer?
Try removing the multi RNN and just use a single LSTM/GRU (maybe even just a vanilla RNN), and jack the hidden units up to ~100-1000, as in the sketch below.
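Keeping the question's API, that could look roughly like this (128 units is just an illustrative value, not a recommendation from the answer):

with tf.variable_scope('lstm') as scope:
    cell = tf.nn.rnn_cell.BasicLSTMCell(128, forget_bias=0.0)  # single LSTM, no MultiRNNCell
    self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
    outputs, state = rnn.rnn(cell, [self._features], dtype=tf.float32)
    # softmax_w would then need shape [128, config.num_classes] to match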
Debugging
Often when you are facing problems with an odd behaving network, it can be a good idea to:
Print everything
Literally print the shapes and values of every tensor in your model, use sess to fetch it and then print it. Your input data, the first hidden representation, your predictions, your losses etc.
You can also use TensorFlow's tf.Print(), e.g. x_tensor = tf.Print(x_tensor, [tf.shape(x_tensor)]).
Use tensorboard
Using TensorBoard summaries on your gradients, accuracy metrics and histograms will reveal patterns that might explain certain behaviour, such as what led to exploding weights, whether your forget bias goes to infinity, or whether you're not tracking gradients through a certain layer, etc.
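A small sketch with the TF1 tf.summary API (summary names and the log directory are illustrative; on the very old release the question appears to use, the equivalents are tf.scalar_summary, tf.histogram_summary and tf.train.SummaryWriter):

tf.summary.scalar('loss', self._cost)
tf.summary.scalar('accuracy', self._accuracy)
tf.summary.histogram('softmax_w', softmax_w)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter(config.train_dir, session.graph)
# inside the training loop:
summary = session.run(merged, feed_dict=dict)
writer.add_summary(summary, step)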
Other questions
How large is your dataset?
How long are your sequences?
Are the 13 features categorical or continuous? You should not normalize categorical variables or represent them as integers; instead you should use one-hot encoding (see the small example below).
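For example, a hypothetical categorical feature with values 0..3 could be one-hot encoded like this (NumPy, purely illustrative):

import numpy as np

categories = np.array([0, 2, 1, 3])   # hypothetical categorical feature column
one_hot = np.eye(4)[categories]       # shape (4, 4): one indicator row per sample
print(one_hot)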
Gunnar has already made lots of good suggestions. A few more small things worth paying attention to in general for this sort of architecture:
Try tweaking the Adam learning rate. You should determine the proper learning rate by cross-validation; as a rough start, you could just check whether a smaller learning rate saves your model from crashing on the training data.
You should definitely use more hidden units. It's cheap to try larger networks when you first start out on a dataset. Go as large as necessary to avoid the underfitting you've observed. Later you can regularize / pare down the network after you get it to learn something useful.
Concretely, how long are the sequences you are passing into the network? You say you have a 30k-long time sequence; I assume you are passing in subsections / samples of this sequence?
