I am building a general-purpose NN that classifies images (Dog/No Dog) and movie reviews (Good/Bad). I have to stick to a very specific architecture and loss function, so changing those two is out of the question. My architecture is a two-layer network with ReLU followed by a sigmoid, and a cross-entropy loss function. With 1000 epochs and a learning rate of around 0.001 I am getting 100 percent training accuracy and 0.72 testing accuracy. I am looking for suggestions to improve my testing accuracy. This is the layout of what I have:
def train_net(epochs, batch_size, train_x, train_y, model_size, lr):
    n_x, n_h, n_y = model_size
    model = Net(n_x, n_h, n_y)
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_function = nn.BCELoss()
    train_losses = []
    accuracy = []

    for epoch in range(epochs):
        count = 0
        model.train()
        train_loss = []
        batch_accuracy = []

        for idx in range(0, train_x.shape[0], batch_size):
            batch_x = torch.from_numpy(train_x[idx : idx + batch_size]).float()
            batch_y = torch.from_numpy(train_y[:, idx : idx + batch_size]).float()

            model_output = model(batch_x)
            loss = loss_function(model_output, batch_y)
            train_loss.append(loss.item())

            preds = model_output > 0.5
            nb_correct = (preds == batch_y).sum()
            count += nb_correct.item()

            optim.zero_grad()
            loss.backward()
            # Scheduler made it worse
            # scheduler.step(loss.item())
            optim.step()

        if epoch % 100 == 1:
            train_losses.append(train_loss)
            print("Iteration : {}, Training loss: {}, Accuracy %: {}".format(
                epoch, np.mean(train_loss), (count / train_x.shape[0]) * 100))

    plt.plot(np.squeeze(train_losses))
    plt.ylabel('loss')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate =" + str(lr))
    plt.show()
    return model
My model parameters:
batch_size = 32
lr = 0.0001
epochs = 1500
n_x = 12288   # num_px * num_px * 3
n_h = 7
n_y = 1
model_size = (n_x, n_h, n_y)

model = train_net(epochs, batch_size, train_x, train_y, model_size, lr)
And this is the testing phase:
model.eval()  # Setting the model to eval mode, hence making it deterministic.
test_loss = []
count = 0
loss_function = nn.BCELoss()

for idx in range(0, test_x.shape[0], batch_size):
    with torch.no_grad():
        batch_x = torch.from_numpy(test_x[idx : idx + batch_size]).float()
        batch_y = torch.from_numpy(test_y[:, idx : idx + batch_size]).float()

        model_output = model(batch_x)
        preds = model_output > 0.5

        loss = loss_function(model_output, batch_y)
        test_loss.append(loss.item())

        nb_correct = (preds == batch_y).sum()
        count += nb_correct.item()

print("test loss: {}, test accuracy: {}".format(np.mean(test_loss), count / test_x.shape[0]))
Things I have tried:
Messing around with the learning rate, adding momentum, using schedulers, and changing batch sizes. Of course, these were mainly guesses and not based on any valid assumptions.
The issue you're facing is overfitting. With 100% accuracy on the training set, your model is effectively memorizing the training data and then failing to generalize to unseen samples. The good news is that this is a very common challenge.
You need regularization. One method is dropout, whereby on each training pass a random set of the network's connections is dropped, forcing the network to "learn" alternate pathways and weights and softening sharp peaks in parameter space. Since you need to keep your architecture and loss function the same, you won't be able to add such an option (though for completeness, read this article for a description and implementation of dropout in PyTorch).
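Purely for completeness (your constraints rule this out), here is a minimal sketch of what dropout looks like in a hypothetical PyTorch module of the same shape as yours; NetWithDropout is an illustrative name, not your Net:

import torch
import torch.nn as nn

class NetWithDropout(nn.Module):
    def __init__(self, n_x, n_h, n_y, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(n_x, n_h)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=p)   # randomly zeroes hidden activations during training
        self.fc2 = nn.Linear(n_h, n_y)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # dropout is active only in model.train() mode; model.eval() disables it
        return self.sigmoid(self.fc2(self.dropout(self.relu(self.fc1(x)))))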
Given your constraints, you'll want to use something like L2 or L1 weight regularization. This typically takes the form of an additional term in the cost/loss function that penalizes large weights. In PyTorch, L2 regularization is available through the torch.optim constructors via the weight_decay option. (See the torch.optim documentation and search for 'L2'.)
For your code, try something like:
def train_net(epochs, batch_size, train_x, train_y, model_size, lr):
    ...
    optim = torch.optim.Adam(model.parameters(), ..., weight_decay=0.01)
    ...
Based on your statement that your training accuracy is 100% while your testing accuracy is significantly lower at 72%, you are clearly overfitting your dataset.
In short, this means that your model is training itself too specifically to the training data that you've given it, picking up on quirks that may exist in the training data but which are not inherent to the classification. For example, if the dogs in your training data were all white, the model would eventually learn to associate the color white with dogs, and be hard-pressed to recognize dogs of other colors given to it in the test data set.
There are many avenues to address this issue; a well-sourced overview of the subject written in simple terms can be found here.
Without more information on the specific constraints you have around changing the architecture of the neural network, it's tough to say for sure what you will and will not be able to change. However, weight regularization and dropout are often used to great effect (and are described in the article above). You should also be free to implement early stopping and a weight constraint on the model.
I'll leave it to you to find resources on how to implement these specific strategies in PyTorch, but this should provide a good jumping-off point.
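As a rough starting point, here is a minimal early-stopping sketch. Note that train_one_epoch, evaluate, val_x, and val_y are hypothetical helpers and a hypothetical validation split, not part of your code:

# Stop when validation loss hasn't improved for `patience` consecutive epochs.
best_val_loss = float("inf")
patience, epochs_without_improvement = 10, 0

for epoch in range(epochs):
    train_one_epoch(model, optim, loss_function, train_x, train_y)  # your existing inner loop
    val_loss = evaluate(model, val_x, val_y)                        # hypothetical validation helper
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")             # keep the best weights so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")
            break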
Related
I am trying to further pretrain a Dutch BERT model with MLM on an in-domain dataset (law-related). I have set up my entire preprocessing and training stages, but when I use the trained model to predict a masked word, it always outputs the same words in the same order, including the [PAD] token. This is weird, because I thought it wasn't even supposed to be able to predict the pad token at all (since my code makes sure pad tokens are not masked).
See the picture of my model's predictions.
I have tried using more data (more than 50,000 instances) and more epochs (about 20). I have gone through my code and am pretty sure it feeds the right input to the model. The English version of the model seems to work, which makes me wonder if the Dutch model is less robust.
Would anyone know any possible causes/solutions for this? Or is it possible that my language model just simply doesn't work?
I will add my training loop and mask-function just in case I overlooked a mistake in them:
def mlm(tensor):
    rand = torch.rand(tensor.shape)
    # mask ~15% of positions, skipping the special tokens (ids 0-3)
    mask_arr = (rand < 0.15) * (tensor > 3)
    for i in range(tensor.shape[0]):
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = 4   # 4 is the mask token id
    return tensor
model.train()
optim = optim.Adam(model.parameters(), lr=0.005)
epochs = 1
losses = []

for epoch in range(epochs):
    epochloss = []
    loop = tqdm(loader, leave=True)
    for batch in loop:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        epochloss.append(loss)

        loss.backward()
        optim.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
    losses.append(epochloss)
As a simplified version of my actual research problem, let's say I have a second-order polynomial function y = ax^2 + bx + c and I want to use a deep neural network to predict the parameters a, b and c given the variable x and the value of the function y. The variable x and the parameters a, b, c are drawn from a uniform distribution on the range [0, 1].
When I train the network using different architectures, cost functions and hyperparameter combinations among the most common ones, I always hit the same issue: the train and test losses rapidly converge to a value significantly higher than 0, then start to fluctuate in a strange way, and the predictions are not accurate (see the figures for a general example; the predictions for b are similar, and c is slightly better but still not satisfactory). This happens even if I set higher momentum or lower learning rates. I also get the same issue if I try to recover one parameter at a time.
As an example, here is the PyTorch code I used for my first test (4 layers, first 3 followed by ReLU, MSELoss, RMSprop optimizer with learning rate = 0.001 and momentum 0.9).
class PRNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(PRNet, self).__init__()
        self.input_size = input_size
        self.fc1 = nn.Linear(self.input_size, 32)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(32, 64)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(64, 64)
        self.relu3 = nn.ReLU()
        self.fc4 = nn.Linear(64, output_size)

    def forward(self, x):
        output = self.fc1(x)
        output = self.relu1(output)
        output = self.fc2(output)
        output = self.relu2(output)
        output = self.fc3(output)
        output = self.relu3(output)
        output = self.fc4(output)
        return output
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
var_x = np.random.rand(100000)
pars_abc = np.random.rand(3, 100000)
func_y = pars_abc[0] * var_x**2 + pars_abc[1] * var_x + pars_abc[2]
data = np.vstack((var_x, func_y)).T
parameters = pars_abc.T
X = torch.Tensor(data).to(device).float()
y = torch.Tensor(parameters).to(device).float()
train_size = int(0.8 * len(data))
batch_size = 100
train_dataset = TensorDataset(X[:train_size], y[:train_size])
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
prnet = PRNet(X.shape[1], 3).to(device)
loss_function = nn.MSELoss()
optimizer = torch.optim.RMSprop(prnet.parameters(), lr=1e-4, momentum=0.9)
num_epochs = 25
for epoch in range(0, num_epochs):
    print(f'Starting epoch {epoch+1}')
    current_loss = 0.0

    for i, batch in enumerate(train_dataloader, 0):
        inputs, targets = batch
        optimizer.zero_grad()

        outputs = prnet(inputs)
        test_outputs = prnet(X[train_size:].to(device))

        train_loss = loss_function(outputs, targets)
        test_loss = loss_function(test_outputs, y[train_size:])
        train_loss_plot[epoch, i] = train_loss.item()
        test_loss_plot[epoch, i] = test_loss.item()

        train_loss.backward()
        optimizer.step()
What could be the cause of this issue? Are the features not representative enough? Do I need a custom loss more suitable for this problem?
During training, when a model's loss starts fluctuating, the most probable cause of such a pattern is that the learning rate is too high for the weights to settle at the required values.
Consider this example. Suppose in your model, a parameter (weight), initialized with a value of 0.1, needs to get to a value of 0.00423 and the learning rate is set to 0.001.
Now, let's assume that the parameter has reached a value of 0.004 after a few epochs of training. Gradient descent will try to increase it toward the target value, but because the update step is on the order of the learning rate (0.001), the parameter overshoots to 0.005. The value is now too large, so gradient descent pushes it back down to 0.004, and the parameter oscillates around the target instead of converging to it.
Simply using a smaller learning rate will not solve this, because then the model will learn too slowly and might not converge in a reasonable amount of time. What you are probably looking for is a variable learning rate policy. With such a policy you begin with a large learning rate so that the model learns quickly, and then, as the parameters approach their target values, the learning rate is decreased automatically so that they can get as close as possible to the target. These policies are called learning rate schedulers.
There are several functions in PyTorch that let you use a learning rate scheduler of your choice. You can look for them in their documentation.
I suggest the ReduceLROnPlateau scheduler. It lets you set a patience and a factor: whenever your model's loss does not improve for the specified number of epochs (the patience), the learning rate is multiplied by the factor.
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html#torch.optim.lr_scheduler.ReduceLROnPlateau
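A rough sketch of how that could slot into your training loop (the factor and patience values below are placeholders, not tuned recommendations):

optimizer = torch.optim.RMSprop(prnet.parameters(), lr=1e-3, momentum=0.9)
# Halve the LR when the monitored loss hasn't improved for 5 consecutive epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5)

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for inputs, targets in train_dataloader:
        optimizer.zero_grad()
        loss = loss_function(prnet(inputs), targets)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # pass the metric being monitored; the scheduler decides whether to decay the LR
    scheduler.step(epoch_loss / len(train_dataloader))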
I know this has been previously discussed, but I did not find a concrete answer, and some answers did not work when I tried them. The case is simple: I have a model, and if I use batch norm, the training accuracy reported by model.fit(training_data) is above 0.9 (it consistently increases, and the loss decreases), but after training, if I run model.evaluate(training_data) (notice it is the same data) it returns 0.09, and predictions are really bad too (the accuracy is also low if calculated manually from the results of model.predict(training_data)). I know the difference between training and testing time in batch norm, and I know differences should be expected, but a drop from 0.9 to 0.09 seems just wrong (and the model is completely unusable). I tried some solutions from other threads:
use the same batch_size in .evaluate as in .fit: did not make a difference
set tf.keras.backend.set_learning_phase(0): got a message saying it is now deprecated, and it made no difference
set all batch norm layers to layer.trainable=False before .predict and .evaluate: it did not make a difference
If I remove the batch norm layers, the report from model.fit(training_data) coincides with model.evaluate(training_data), but then the training makes no progress (results are consistent but bad), so I need to keep them.
Is this a major bug in TF 2.6?
Update: also tested TF 2.5, result is the same.
Sample code(omitting irrelevant code, like data reading and pre-processing):
### model definition
class CLS_BERT_Embedding(tf.keras.Model):
    """Will only use the CLS token"""
    def __init__(self, bert_trainable=False, number_filters=50, FNN_units=512,
                 number_clases=2, dropout_rate=0.1, name="dcnn"):
        super(CLS_BERT_Embedding, self).__init__(name)
        self.checkpoint_id = "CLS_BERT_Embedding_bn_3fc_{}filters_{}fc_units_berttrainable{}".format(
            number_filters, FNN_units, bert_trainable)

        # trainable=False so we don't fine-tune BERT, we just use it as an embedding layer
        self.bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                                         trainable=bert_trainable,
                                         input_shape=(3, 376))

        self.dense_1 = layers.Dense(units=FNN_units, activation="relu")
        self.bn1 = layers.BatchNormalization()
        self.dense_2 = layers.Dense(units=FNN_units, activation="relu")
        self.bn2 = layers.BatchNormalization()
        self.dense_3 = layers.Dense(units=FNN_units, activation="relu")
        self.bn3 = layers.BatchNormalization()
        self.dropout = layers.Dropout(rate=dropout_rate)

        if number_clases == 2:
            self.last_dense = layers.Dense(units=1, activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=number_clases, activation="softmax")

    def get_bert_embeddings(self, all_tokens):
        CLS_embedding, embeddings = self.bert_layer([all_tokens[:, 0, :],
                                                     all_tokens[:, 1, :],
                                                     all_tokens[:, 2, :]])
        return CLS_embedding, embeddings

    def call(self, inputs, training):
        CLS_embedding, x_seq = self.get_bert_embeddings(inputs)
        x = self.dense_1(CLS_embedding)
        x = self.bn1(x, training)
        x = self.dense_2(x)
        x = self.bn2(x, training)
        x = self.dense_3(x)
        x = self.bn3(x, training)
        output = self.last_dense(x)
        return output
#### config and hyper-params
NUMBER_FILTERS = 1024
FNN_UNITS = 2048
BERT_TRAINABLE = False
NUMBER_CLASSES = len(tokenizer.vocab)
DROPOUT_RATE = 0.2
NUMBER_EPOCHS = 3
LR = 0.001
DEVICE = '/GPU:0'
#### optimization definition
with tf.device(DEVICE):
    model = CLS_BERT_Embedding(
        bert_trainable=BERT_TRAINABLE,
        number_filters=NUMBER_FILTERS,
        FNN_units=FNN_UNITS,
        number_clases=NUMBER_CLASSES,
        dropout_rate=DROPOUT_RATE)

    if NUMBER_CLASSES == 2:
        loss = "binary_crossentropy"
        metrics = ["accuracy"]
    else:
        loss = "sparse_categorical_crossentropy"
        metrics = ["sparse_categorical_accuracy"]

    optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
    loss = "sparse_categorical_crossentropy"
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
### training
with tf.device(DEVICE):
    model.fit(train_dataset,
              batch_size=BATCH_SIZE,
              epochs=NUMBER_EPOCHS,
              shuffle=True,
              callbacks=[MyCustomCallback(),
                         tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", patience=5),
                         tensorboard, lr_tensorboard])

### testing
train_results = model.evaluate(train_dataset, batch_size=BATCH_SIZE)
print(train_results)
Try running inference without adjusting the trainable flag at all, and verify that self.bn1.trainable is True. Then run forward prop by calling the model as a callable on each batch of your training data, but with training=True, evaluating each time. That would look something like:
for idx, d in enumerate(train_dataset):
    _ = model(d[0], training=True)   # forward pass with batch statistics; updates the BN moving stats
    model.evaluate(d)
    if idx > 100:
        break
If your loss starts dropping, then this is a case of your batch norm moving statistics not updating fast enough, which is possible given the BNs are not trained but your BERT model/layer is. If not, ignore the rest of this, because you may have a different issue.
If that's the case, you have two options. One is to keep calling the model to allow the BN moving statistics to stabilize.
The other is to statistically analyze the output of your BERT layer (get the mean and variance) and directly update the BN layers' moving-statistic weights. Probably the first BN is sufficient, given your later Dense layers are Xavier (Glorot) initialized, but since you are using ReLU you might also try Kaiming (He) initialization on them.
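If you go the second route, a rough sketch could look like the following. This is not a drop-in fix: it assumes train_dataset yields (inputs, labels) batches as in the code above, and the number of batches sampled is arbitrary.

import numpy as np

# Estimate the statistics of the activations feeding bn1 over a sample of the
# training data, then write them into bn1's moving statistics directly.
activations = []
for idx, (x_batch, _) in enumerate(train_dataset):
    cls_embedding, _ = model.get_bert_embeddings(x_batch)
    activations.append(model.dense_1(cls_embedding).numpy())
    if idx > 100:
        break

activations = np.concatenate(activations, axis=0)
model.bn1.moving_mean.assign(activations.mean(axis=0).astype("float32"))
model.bn1.moving_variance.assign(activations.var(axis=0).astype("float32"))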
I agree with @Yaoshiang: it is likely that the internal statistics (moving average, moving variance) do not match the per-batch mean and variance, hence a different normalization in the BN layers at training time and at test time. Thinking about it, if we use the same batch size for training and testing, then we can keep training=True at test time without it being a real problem (when using predict or evaluate). Otherwise, we can force the training to use the moving average and moving variance for normalization, rather than the per-batch mean and variance, while still estimating beta and gamma. (This implies a minor modification of the BatchNormalization class.)
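As a rough illustration of the first idea (keeping training=True at evaluation time), you can bypass model.evaluate and call the model directly on each batch, computing the accuracy yourself. This assumes train_dataset yields (inputs, labels) batches and a binary sigmoid output as in the model above:

import tensorflow as tf

correct, total = 0, 0
for x_batch, y_batch in train_dataset:
    probs = model(x_batch, training=True)   # normalize with batch statistics, as during training
    preds = tf.cast(tf.reshape(probs, [-1]) > 0.5, tf.float32)
    labels = tf.cast(tf.reshape(y_batch, [-1]), tf.float32)
    correct += int(tf.reduce_sum(tf.cast(preds == labels, tf.int32)))
    total += int(tf.size(labels))

print("accuracy with training=True:", correct / total)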
In PyTorch, I want to evaluate my model on the validation set every eval_step during training, and I wrote code like this:
def tune(model, loader_train, loader_dev, optimizer, epochs, eval_step):
    for epoch in range(epochs):
        for step, x in enumerate(loader_train):
            optimizer.zero_grad()
            loss = model(x)
            loss.backward()
            optimizer.step()

            if step % eval_step == 0:
                model.eval()
                test(model, loader_dev)
                model.train()
When eval_step = int(len(loader_train)/2) and eval_step = int(len(loader_train)/8), they lead to quite different metric results after training through one whole epoch (i.e., the second output for the former differs from the eighth output for the latter).
Could anyone explain why?
The length of loader_train is 20000 (it depends on batch size), and here is my testing script:
def test(model, loader_dev):
    preds = []
    labels = []
    for step, x in enumerate(loader_dev):
        preds.append(model(x).view(-1))
        labels.append(x['label'].view(-1))
    metric = cal_metric(preds, labels)
    logger.info(metric)
I think you probably set shuffle=True in your dataloader. Even though you fix the random seed, a torch DataLoader will generate different results if you iterate over another dataloader while the current one is in use. In the scenario you describe, this may cause your model to receive the training data in a different order, which then results in a different metric.
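One way to decouple the two loaders' randomness is to give the training DataLoader its own generator, so that consuming loader_dev mid-epoch cannot perturb the order in which loader_train shuffles the data. This is a sketch, assuming train_set and dev_set are your Dataset objects and batch size 32:

import torch
from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(42)  # dedicated RNG for the training loader's shuffling

loader_train = DataLoader(train_set, batch_size=32, shuffle=True, generator=g)
loader_dev = DataLoader(dev_set, batch_size=32, shuffle=False)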
I started using PyTorch and I'm currently working on a project where I'm using a simple feed-forward neural network for linear regression. The problem is that I didn't find anything in PyTorch that lets me get the accuracy of a linear regression model the way Keras or scikit-learn do. In Keras it would be simple: just set metrics=["accuracy"] inside the compile function. I searched the docs and the official PyTorch website but didn't find anything; it seems this API doesn't exist in PyTorch. I know that I can observe the loss during training, or simply get the test loss and from it know whether the loss decreased, but I want that Keras structure where I get a loss value and also an accuracy value. The Keras way looks clearer to me. I also tried to implement an accuracy function using r2_score from sklearn, but it gave me weird values:
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train(model, optimizer, loss_fn):
    def train_step(x, y):
        model.train()
        optimizer.zero_grad()
        out = model(x)
        loss = loss_fn(out, y)
        loss.backward()
        optimizer.step()
        return loss.item()
    return train_step

def fit(epochs=100):
    train_func = train(model, optimizer, criterion)
    count, total = 0, 0
    loss_list, accuracy_list, iters = [], [], []
    for e in range(epochs):
        for X, y in train_loader:
            loss = train_func(X, y)
            count += 1
            total += len(y)
            if count % 50 == 0:
                print("loss= ", loss)
                loss_list.append(loss)
                iters.append(total)
            if count % 100 == 0:
                model.eval()  # I'm not sure if we can do this in PyTorch, i.e. evaluating the model while training; it would be great if you could tell me whether this is OK or not
                out = model(X)
                out = out.detach().numpy()
                y = y.detach().numpy()
                accuracy = r2_score(y, out)  # r2_score is the scikit-learn r2 score function
                print("accuracy = ", accuracy)  # here I get weird values and it doesn't get better over time; in contrast, the loss decreased over time
                accuracy_list.append(accuracy)
    return iters, loss_list, accuracy_list
I know how to implement an accuracy function for a classification problem, because that uses discrete values: I only have to check which predictions the model got right and then calculate the accuracy. But in this case I have continuous values, so I couldn't implement the function myself, and it surprised me that PyTorch doesn't have a built-in function for this. Could someone tell me how to implement this, or where to find an implementation of it?
Another thing: where should I run the evaluation, and where should I set the model to evaluation mode by calling eval()? Should I evaluate during training, as I did in my code, or should I train first and test after training? And if I test during training, should I call eval() as I did there, or will it affect training when the loop goes back to training mode? One more thing I didn't find in PyTorch is cross-validation: how should I implement it if there is no API for it like in Keras?
Accuracy does not exist in regression problems.
A similar measure of "accuracy" for a regression problem might be the R-squared score.
If you are using PyTorch to train your neural networks, have a look at the torchmetrics package. You may find what you need there.
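For example, torchmetrics provides an R2Score metric. A minimal sketch, where model and test_loader stand in for your own regression model and data loader:

import torch
from torchmetrics import R2Score

r2 = R2Score()

model.eval()
with torch.no_grad():
    for X, y in test_loader:
        preds = model(X)
        # accumulate predictions and targets batch by batch
        r2.update(preds.view(-1), y.view(-1))

print("R2 score:", r2.compute().item())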
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
Look here for more info: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html