Hyperparameter tuning with Hparam dashboard throwing error - python

I am trying to find the best hyperparameters for a sequential model I am building in Keras.
I have recently come across the HParams dashboard, which looks like a really nice way of doing this. However, I am running into a problem at the stage of actually running the models to carry out the hyperparameter optimization.
The code I am running is, to begin with, taken directly from the TensorFlow page:
https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I have adapted the HParams code from that page to my sequential model. For the purpose of practice I have removed the dropout layer (as I don't have any in my model) as well as the optimizer hyperparameter. For now I would like to see how my model is affected by changing the number of nodes in the layers. My code is as follows:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp  # imports as in the linked tutorial

HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))

METRIC_ACCURACY = 'accuracy'

with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(
        hparams=[HP_NUM_UNITS],
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
    )

def train_test_model(hparams):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=tf.nn.relu),
        tf.keras.layers.Dense(24, activation=tf.nn.sigmoid),
    ])
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(X_train.values, y_train, epochs=50)
    _, accuracy = model.evaluate(X_test, y_test)
    return accuracy

def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        accuracy = train_test_model(hparams)
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
Up to this point, everything works fine. For this first attempt I have not changed much apart from removing the dropout layer and the optimizer hyperparameter, plus plugging in my own model. I will need more units than 16 and 32, but this is just for the purpose of building a pipeline...
When I run the following code to execute the optimization, I get the error. The code is:
session_num = 0

for num_units in HP_NUM_UNITS.domain.values:
    hparams = {
        HP_NUM_UNITS: num_units,
    }
    run_name = "run-%d" % session_num
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})
    run('logs/hparam_tuning/' + run_name, hparams)
    session_num += 1
This throws the error, which I don't quite understand:
ValueError: Cannot create an execution function which is comprised of elements from multiple graphs.
The error occurs after what looks like the first attempt at fitting a model: for the first number of units (16) a model is fit. If I look at the traceback I see the progress report:
Epoch 1/50
140/140 [==============================] - 0s 3ms/sample - loss: 0.6847 - accuracy: 0.5723......
Epoch 50/50
140/140 [==============================] - 0s 206us/sample - loss: 0.2661 - accuracy: 0.8857
And after this is when I get the error (cannot create an execution function... etc.).
I am unsure about how to fix this and any help would be much appreciated!
I am more than happy to provide any more detail/code!
Thank you!

I had the same error and I fixed it by converting my train and test values from pandas DataFrames to NumPy arrays. So just use X_train.values and so on.
If this doesn't help, tell me at exactly which line the error is occurring.
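For reference, a minimal sketch of that change applied to the fit/evaluate calls in train_test_model (assuming X_test, y_train and y_test are also pandas objects; drop .values wherever the variable is already a NumPy array):

# convert pandas objects to NumPy arrays before handing them to Keras
model.fit(X_train.values, y_train.values, epochs=50)
_, accuracy = model.evaluate(X_test.values, y_test.values)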

Related

tf.keras custom metric is giving incorrect results

I have implemented a custom metric in tf.keras for a multi-label classification problem.
def multilabel_TP(y_true, y_pred, thres=0.4):
    return tf.math.count_nonzero(
        tf.math.logical_and(tf.cast(y_true, tf.bool),
                            tf.cast(y_pred >= thres, tf.bool))
    )
The count_nonzero function produces integer results, but while running the model it gives me float values. The custom function gives me correct results when tried outside the scope of the Keras model.
8/33 [======>.......................] - ETA: 27s - loss: 0.4294 - multilabel_TP: **121.6250**
model.compile(loss = 'binary_crossentropy', metrics = multilabel_TP, optimizer= 'adam')
model.fit(train_sentences, y_train, batch_size= 128, epochs = 20, validation_data= (test_sentences, y_test))
Why is this happening?
What is presented in the Keras progress bar is a running mean of your loss/metrics over batches, since the model is being trained batch by batch and the weights change after each batch. This is why you get a floating-point value.
Your metric should also return a floating-point value, for example by dividing by the number of elements in the batch. Then the metric values will make more sense.
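For illustration, a sketch of that idea (hypothetical name multilabel_TP_rate), dividing the true-positive count by the number of label entries in the batch so the metric returns a rate rather than a raw count:

import tensorflow as tf

def multilabel_TP_rate(y_true, y_pred, thres=0.4):
    true_pos = tf.math.count_nonzero(
        tf.math.logical_and(tf.cast(y_true, tf.bool),
                            tf.cast(y_pred >= thres, tf.bool)))
    # number of label entries in the batch, as a float
    n = tf.cast(tf.size(y_true), tf.float32)
    return tf.cast(true_pos, tf.float32) / n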

Big loss and low accuracy on training data in both BERT and ALBERT

I am using the huggingface TFBertModel to do a classification task (from here: ). I am using the bare TFBertModel with an added dense head layer, and not TFBertForSequenceClassification, since I didn't see how I could use the latter with pretrained weights to only fine-tune the model.
As far as I know, fine-tuning should give me about 80% or more accuracy in both BERT and ALBERT, but I am not coming anywhere near that number:
Train on 3600 samples, validate on 400 samples
Epoch 1/2
3600/3600 [==============================] - 177s 49ms/sample - loss: 0.6531 - accuracy: 0.5792 - val_loss: 0.5296 - val_accuracy: 0.7675
Epoch 2/2
3600/3600 [==============================] - 172s 48ms/sample - loss: 0.6288 - accuracy: 0.6119 - val_loss: 0.5020 - val_accuracy: 0.7850
More epochs don't make much difference.
I am using the public CoLA data set for fine-tuning; this is what the data looks like:
gj04 1 Our friends won't buy this analysis, let alone the next one we propose.
gj04 1 One more pseudo generalization and I'm giving up.
gj04 1 One more pseudo generalization or I'm giving up.
gj04 1 The more we study verbs, the crazier they get.
...
And this is the code that loads the data into python:
import csv

def get_cola_data(max_items=None):
    csv_file = open('cola_public/raw/in_domain_train.tsv')
    reader = csv.reader(csv_file, delimiter='\t')
    x = []
    y = []
    for row in reader:
        x.append(row[3])
        y.append(float(row[1]))
    if max_items is not None:
        x = x[:max_items]
        y = y[:max_items]
    return x, y
I verified that the data is in the format that I want it to be in the lists, and this is the code of the model itself:
#!/usr/bin/env python

import tensorflow as tf
from tensorflow import keras
from transformers import BertTokenizer, TFBertModel
import numpy as np

from cola_public import get_cola_data

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')
bert_model.trainable = False

x_input = keras.Input(shape=(512,), dtype=tf.int64)
x_mask = keras.Input(shape=(512,), dtype=tf.int64)

_, output = bert_model([x_input, x_mask])
output = keras.layers.Dense(1)(output)

model = keras.Model(
    inputs=[x_input, x_mask],
    outputs=output,
    name='bert_classifier',
)

model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy'],
)

train_data_x, train_data_y = get_cola_data(max_items=4000)

encoded_data = [tokenizer.encode_plus(data, add_special_tokens=True, pad_to_max_length=True) for data in train_data_x]

train_data_x = np.array([data['input_ids'] for data in encoded_data])
mask_data_x = np.array([data['attention_mask'] for data in encoded_data])
train_data_y = np.array(train_data_y)

model.fit(
    [train_data_x, mask_data_x],
    train_data_y,
    epochs=2,
    validation_split=0.1,
)

cmd_input = ''
while True:
    print("Type an opinion: ")
    cmd_input = input()
    # print('Your opinion is: %s' % cmd_input)
    if cmd_input == 'exit':
        break
    cmd_input_tokens = tokenizer.encode_plus(cmd_input, add_special_tokens=True, pad_to_max_length=True)
    cmd_input_ids = np.array([cmd_input_tokens['input_ids']])
    cmd_mask = np.array([cmd_input_tokens['attention_mask']])
    model.reset_states()
    result = model.predict([cmd_input_ids, cmd_mask])
    print(result)
Now, no matter whether I use another dataset, a different number of items from the dataset, a dropout layer before the last dense layer, an extra dense layer with more units before the last one, or ALBERT instead of BERT, I always get low accuracy and high loss, and often the validation accuracy is higher than the training accuracy.
I get the same results if I try to use BERT/ALBERT for an NER task: always the same outcome, which makes me believe I am systematically making some fundamental mistake in fine-tuning.
I know that I have bert_model.trainable = False and that is what I want, since I want to train only the head and not the pretrained weights, and I know that people train this way successfully. Even if I do train the pretrained weights, the results are much worse.
I can see that the model is badly underfitting, but I just can't put my finger on where I could improve, especially seeing that people tend to have good results with just a single dense layer on top of the model.
The default learning rate is too high for BERT. Try setting it to one of the learning rates recommended in Appendix A.3 of the original paper: 5e-5, 3e-5 or 2e-5.
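A minimal sketch of that change in the question's compile() call (picking one of the three rates arbitrarily):

model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=3e-5),  # was Adam() with the default 1e-3
    metrics=['accuracy'],
)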

Validation metrics stagnate while training keeps improving

This is a model I've been using. It takes a pretrained InceptionV3 model and adds some fully connected layers on top of it. The whole thing is made trainable (including the pretrained InceptionV3 layers).
with tf.device('/cpu:0'):
    pretrained_model = InceptionV3(weights='imagenet', include_top=False)
    x = pretrained_model.output
    x = GlobalAveragePooling2D(name='gap_final')(x)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.01))(x)
    x = Dropout(0.2)(x)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.01))(x)
    x = Dropout(0.2)(x)
    x = Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.01))(x)
    x = Dropout(0.2)(x)
    preds = Dense(len(config.classes), activation='softmax')(x)
    model = Model(inputs=pretrained_model.input, outputs=preds)

parallel_model = multi_gpu_model(model, gpus=16)
parallel_model.compile(optimizer=Adam(lr=0.0005), loss='categorical_crossentropy', metrics=['accuracy'])
I've tried training it with different image augmentation configurations, and no matter what I do the results are always similar to below:
Epoch 1/20
181/181 [====] - 1372s 8s/step - loss: 19.2332 - acc: 0.3330 - val_loss: 8.7765 - val_acc: 0.4747
Epoch 2/20
181/181 [====] - 1379s 8s/step - loss: 4.9885 - acc: 0.5474 - val_loss: 3.5256 - val_acc: 0.4084
Epoch 3/20
181/181 [====] - 1354s 7s/step - loss: 2.0334 - acc: 0.6469 - val_loss: 2.5382 - val_acc: 0.4275
Epoch 4/20
181/181 [====] - 1361s 8s/step - loss: 1.3522 - acc: 0.7117 - val_loss: 2.2028 - val_acc: 0.4741
Epoch 5/20
181/181 [====] - 1356s 7s/step - loss: 1.0838 - acc: 0.7599 - val_loss: 2.3402 - val_acc: 0.4738
From this point on (epoch 5/20), if I let the model train forever the training loss/acc will keep improving while the validation loss/acc will keep stagnating at these values.
This is a classification problem with 28 different classes, so a validation accuracy of 0.47 is not that bad given randomness would give an accuracy of 0.035, however I don't understand how the training set can be so perfectly fitted while the validation set leaves that much to be desired.
The total dataset is made of 32,000 pretty well-labeled images, all in the same configuration (think of a facial classification problem). Training uses roughly 27,000 of them, augmented by horizontal flipping and greyscaling (giving a total of 93,000 training images), while validation images are not augmented. From a visual perspective, training and validation images look very similar and I notice no striking difference between the two sets (before augmenting the training set, obviously).
Classes are slightly unbalanced, but not that much: the biggest class has 2,600 images and smallest has 610 (class size distribution is linear between these two extremes).
Note a few things that I've tried that don't alter the results:
dropouts: little impact if I play around with the dropout rates
regularization: using L1 or L2 with different values doesn't change the results much
batch normalisation: with or without, same thing
number of fully-connected layers: one, two or even three (like above), little difference
type of pretrained network: I've tried using VGG16 with similar results
No matter what I do, training metrics always improve significantly, while validation stagnates.
Is it only a problem of "getting more data in", with 32,000 images just being "not enough" for 28 classes, especially for the currently smaller classes (e.g. the one which currently has 610 images), or am I doing something wrong? Should I use a smaller learning rate, even though the one currently used is already fairly small?
Is it wrong to augment images from the training set and not from the validation set? I've read that it's standard practice, and it also seems to make sense to be doing so...
Lastly, should I limit the layers being trainable? E.g. should I make only the last 10 or 20 layers trainable instead of the full InceptionV3 network? Although choosing the trainable layers is straightforward when using a VGGxx model (being purely sequential), it seems a bit trickier for Inception. Any recommendation regarding this would be welcome.
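For reference, a sketch (not from the original post) of one common way to limit the trainable layers, keeping only the last N layers of the pretrained base trainable; N = 20 is an arbitrary choice, and the model must be compiled after changing the flags:

N_TRAINABLE = 20  # arbitrary; tune for your problem
for layer in pretrained_model.layers[:-N_TRAINABLE]:
    layer.trainable = False
for layer in pretrained_model.layers[-N_TRAINABLE:]:
    layer.trainable = True
# ...then build and compile the model as above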
After having tried several models and had a more thorough look at the data, it seems that the labels are not as clear as what I thought, and there is a lot of porosity between the different 28 classes.
Every time the model makes a "wrong" prediction on test data, a careful inspection of the picture makes it apparent that the model was "somehow right" and the labelling was questionable. E.g. think of a face smiling and frowning at the same time: the model could say "happy" or "unhappy" with equal legitimacy, and the "ground truth" labelling would be pretty arbitrary.
So, it seems that 45-ish percent accuracy on the validation set is in the top of what any model (or any human) could get to considering these porous classes.
The ability of InceptionV3 to reach 85% accuracy on the training set with tens of thousands of images, after one epoch, says something about its power to find specific patterns that a human couldn't. As this example indicates, this ability must be balanced with equally strong regularization.
It also means that given a high-quality dataset with little porosity between labels, InceptionV3 should be able to give good results very quickly, e.g. compared to VGG16.

Keras: Huge loss after adding class weights

I'm working on an LSTM model in Keras with the goal of next-word prediction, utilizing BERT word vectors as part of the inputs to the model.
This is a multi-class categorical problem, and I've done some weird steps to simplify English into clusters of words using BERT and stop-words and k-means, and for my initial practice model I'm using 144 target categories. I plan to up that to about 1000 after working out some kinks.
Here's the architecture of my Keras model:
model = Sequential()
model.add(LSTM(32, input_shape=(SENTENCE_LENGTH, COM_WORDS), dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(COM_WORDS))
model.add(Activation('softmax'))
optimizer = Adam(lr=lr)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.fit(X, y, validation_split=0.05, batch_size=128, epochs=epochs)
My loss starts at around 6 and goes down, which isn't unusual as far as I know. I then tried to incorporate class weights, since the model was over-predicting common words like 'the', which is expected. So I used this code to make the weights:
max_count = 0
for word in range(COM_WORDS):
    if Ys.count(word) > max_count:
        max_count = Ys.count(word)

class_weights = {}
for word in range(COM_WORDS):
    class_weights[word] = (max_count - Ys.count(word) + 1)
So my most common y-input would have a value of 1 in the dictionary, and a y-input that is only represented once would be weighted at the count of the most common y-input: around 1 million in this case. Then I added it to my fit() call and restarted the model.
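For reference, a compact equivalent of this weighting scheme (a sketch, assuming Ys is a flat list of class indices), counting each class once instead of rescanning the list with Ys.count() on every iteration:

from collections import Counter

counts = Counter(Ys)  # class index -> frequency
max_count = max(counts.values())
class_weights = {word: max_count - counts.get(word, 0) + 1
                 for word in range(COM_WORDS)}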
When I run my model with the weights, I get an insanely high loss (this is just a batch of 100,000 of all my inputs being run):
Epoch 1/3
950000/950000 [==============================] - 160s 168us/step - loss: 3014409.5359 - acc: 0.1261 - val_loss: 2808283.0898 - val_acc: 0.1604
The accuracy is fine though! Not too different from when I didn't use weights.
MY QUESTION(s):
Does this high loss matter? Is it just a reflection of my huge weight numbers, or is it indicating something sinister? Are loss numbers relative?
Side question: Should I use a better method to weight my inputs?
Thank you!

When is it appropriate to use sample_weights in keras?

According to this question, I learnt that class_weight in Keras applies a weighted loss during training, and that sample_weight does something sample-wise if I don't have equal confidence in all the training samples.
So my questions would be,
Is the loss during validation weighted by the class_weight, or is it only weighted during training?
My dataset has 2 classes, and I don't actually have a seriously imbalanced class distribution. The ratio is approx. 1.7 : 1. Is it necessary to use class_weight to balance the loss, or even to use oversampling? Is it OK to leave the slightly imbalanced data alone and treat it as a usual dataset?
Can I simply consider sample_weight as the weight I give to each training sample? My training samples can be treated with equal confidence, so I probably don't need to use this.
The Keras documentation says:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
So class_weight only affects the loss during training. I have myself been interested in understanding how the class and sample weights are handled during testing and training. Looking at the Keras GitHub repo and the code for metrics and losses, it does not seem that either is affected by them. The printed values are quite hard to track through training code like model.fit() and its corresponding TensorFlow backend training functions, so I decided to write a test covering the possible scenarios; see the code below. The conclusion is that both class_weight and sample_weight only affect the training loss, and have no effect on any metric or on the validation loss. A little surprising, as val_sample_weights (which you can specify) seems to do nothing(??).
This type of question always depends on your problem, how skewed the data is, and in what way you are trying to optimize the model. Are you optimizing for accuracy? Then, as long as the training data is as skewed as the data the model will see in production, the best result will be achieved just by training without any over/under-sampling and/or class weights.
If, on the other hand, one class is more important (or expensive) than another, then you should weight the data. For example in fraud prevention, fraud is normally much more expensive than the income from non-fraud. I would suggest you try unweighted classes, weighted classes and some under/over-sampling, and check which gives the best validation results. Use a validation function (or write your own) that best compares the different models (for example, weighting true positives, false positives, true negatives and false negatives differently depending on cost).
A relatively new loss function that has shown great results in Kaggle competitions on skewed data is focal loss. Focal loss reduces the need for over/under-sampling. Unfortunately focal loss is not a built-in function in Keras (yet), but it can be programmed manually.
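As a rough sketch of what such a manual implementation could look like for a binary problem (following the standard gamma/alpha formulation; treat it as a starting point, not a verified drop-in):

import tensorflow as tf
from tensorflow.keras import backend as K

def binary_focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        # clip probabilities to avoid log(0)
        y_pred = tf.clip_by_value(y_pred, K.epsilon(), 1.0 - K.epsilon())
        # p_t is the probability the model assigns to the true class
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss

# usage: model.compile(loss=binary_focal_loss(), optimizer='adam', metrics=['accuracy'])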
Yes, I think you are correct. I normally use sample_weight for two reasons: 1) the training data has some kind of measurement uncertainty, which, if known, can be used to weight accurate data more heavily than inaccurate measurements, or 2) we can weight newer data more than old data, forcing the model to adapt to new behavior more quickly without ignoring valuable old data.
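As an illustration of the second case, a sketch (hypothetical ages and decay constant) that down-weights older samples exponentially before passing the weights to fit():

import numpy as np

age_in_days = np.array([400.0, 200.0, 30.0, 1.0])  # hypothetical per-sample ages
tau = 180.0                                         # hypothetical decay time scale
sample_weight = np.exp(-age_in_days / tau)          # newer samples get weights closer to 1

# model.fit(x_train, y_train, sample_weight=sample_weight, ...)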
Here is the code for comparing with and without class_weights and sample_weights, while holding the model and everything else fixed:
import tensorflow as tf
import numpy as np

data_size = 100
input_size = 3
classes = 3

x_train = np.random.rand(data_size, input_size)
y_train = np.random.randint(0, classes, data_size)
#sample_weight_train = np.random.rand(data_size)
x_val = np.random.rand(data_size, input_size)
y_val = np.random.randint(0, classes, data_size)
#sample_weight_val = np.random.rand(data_size)

inputs = tf.keras.layers.Input(shape=(input_size))
pred = tf.keras.layers.Dense(classes, activation='softmax')(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=pred)

loss = tf.keras.losses.sparse_categorical_crossentropy
metrics = tf.keras.metrics.sparse_categorical_accuracy

model.compile(loss=loss, metrics=[metrics], optimizer='adam')

# Make the model static, so we can compare it between different scenarios
for layer in model.layers:
    layer.trainable = False

# base model, no weights (same result as without class_weights)
# model.fit(x=x_train, y=y_train, validation_data=(x_val, y_val))

class_weights = {0: 1., 1: 1., 2: 1.}
model.fit(x=x_train, y=y_train, class_weight=class_weights, validation_data=(x_val, y_val))
# which outputs:
# > loss: 1.1882 - sparse_categorical_accuracy: 0.3300 - val_loss: 1.1965 - val_sparse_categorical_accuracy: 0.3100

# changing the class weights to zero, to check which loss and metric are affected
class_weights = {0: 0, 1: 0, 2: 0}
model.fit(x=x_train, y=y_train, class_weight=class_weights, validation_data=(x_val, y_val))
# which outputs:
# > loss: 0.0000e+00 - sparse_categorical_accuracy: 0.3300 - val_loss: 1.1945 - val_sparse_categorical_accuracy: 0.3100

# changing the sample_weights to zero, to check which loss and metric are affected
sample_weight_train = np.zeros(100)
sample_weight_val = np.zeros(100)
model.fit(x=x_train, y=y_train, sample_weight=sample_weight_train, validation_data=(x_val, y_val, sample_weight_val))
# which outputs:
# > loss: 0.0000e+00 - sparse_categorical_accuracy: 0.3300 - val_loss: 1.1931 - val_sparse_categorical_accuracy: 0.3100
There are some small deviations between using weights and not using them (even when all weights are one), possibly because fit uses different backend functions for weighted and unweighted data, or due to rounding errors.
