I have a dataset on which I train a DNN model.
My dataset is like this (single example row):
hash name surname datetime total nb_success nb_loss
axBBdnn78 aaa bbb 2016-01-01 00:01:26 50.00 1 2
I replaced the header names for confidentiality, but they are all meaningful. The value I try to predict is the state column (present in the dataset, though omitted from the sample above), which can take 2 values: 'ok' and 'nok'.
By doing some preparation on the dataset and one-hot encoding the strings with this code:
from sklearn.preprocessing import LabelEncoder

data = data.select_dtypes(exclude=['number']).apply(LabelEncoder().fit_transform).join(data.select_dtypes(include=['number']))
I then got my final training set, which looks like this:
hash name surname datetime total nb_success nb_loss
1696 4 37 01 50.00 1 2
I then use the following Keras DNN model:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(16, input_shape=(7,)))
model.add(Activation('relu'))
# model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Dense(1)) # 2 outputs possible (ok or nok)
model.add(Activation('relu'))
model.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
But my loss doesn't decrease, nor does my accuracy increase.
5000/5000 [==============================] - 0s - loss: 0.5070 - acc: 0.4930 - val_loss: 0.4900 - val_acc: 0.5100
Epoch 2/10
5000/5000 [==============================] - 0s - loss: 0.5054 - acc: 0.4946 - val_loss: 0.5100 - val_acc: 0.4900
Epoch 3/10
5000/5000 [==============================] - 0s - loss: 0.5112 - acc: 0.4888 - val_loss: 0.4140 - val_acc: 0.5860
Epoch 4/10
5000/5000 [==============================] - 0s - loss: 0.4900 - acc: 0.5100 - val_loss: 0.4660 - val_acc: 0.5340
I tried several other loss functions as well as other optimizers, but each time I only achieve roughly 50% accuracy (i.e., no better than chance, since there are 2 classes).
I have two questions:
Is my one-hot encoding method correct?
What am I missing for the model to actually train/converge?
You have label-encoded your data but not one-hot encoded it. For every label in each of your features, you need to create one binary column.
E.g.: if you have 4 names, then you need 3 binary columns to represent the 4 possible states.
You can use Scikit-learn's OneHotEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
One of the reasons for this is that right now name #37 is "more" or "higher" than name #36 or #10 or #5, which doesn't make any sense (since these are categorical values rather than continuous ones).
On your last Dense layer you need to use sigmoid as the activation function.
One more thing: you might need to increase your model's size. Also, you have two consecutive model.add(Activation('relu')) lines in your code.
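For reference, a minimal sketch of the corrected preprocessing and model, assuming `data` is the DataFrame described above with a 'state' target column (column names are illustrative; a near-unique column like the hash is usually better dropped than one-hot encoded):

import pandas as pd
from tensorflow import keras

# Sketch only: one-hot encode the categorical columns, keep time numeric.
y = (data.pop('state') == 'ok').astype('float32')                     # 1 = 'ok', 0 = 'nok'
data['datetime'] = pd.to_datetime(data['datetime']).astype('int64')   # numeric timestamp
X = pd.get_dummies(data, columns=['hash', 'name', 'surname']).astype('float32')

model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid'),                      # sigmoid for the 2-class output
])
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X.values, y.values, epochs=10, validation_split=0.2)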
First I read in my CSV file, which contained a 1-or-0 label matrix:
import pandas as pd

df = pd.read_csv(url)
print(df.head())
print(df.columns)
Next I gathered the pictures and resized them:
import os

image_directory = 'Directory/'
dir_list = os.listdir(image_directory)
print("Files and directories in '", image_directory, "' :")
# print the list
print(dir_list)
They were saved into an X2 variable.
import numpy as np
import tensorflow as tf
from tqdm import tqdm

SIZE = 200
X_dataset = []
for i in tqdm(range(df.shape[0])):
    img = tf.keras.preprocessing.image.load_img(image_directory + df['ID'][i], target_size=(SIZE, SIZE, 3))
    # numpy array of each image at size 200, 200, 3 (color)
    img = np.array(img)
    img = img / 255.
    X_dataset.append(img)
X2 = np.array(X_dataset)
print(X2.shape)
I created the y2 data by taking the CSV data, dropping two columns, and getting a shape of (1000, 16):
y2 = np.array(df.drop(['Outcome', 'ID'], axis=1))
print(y2.shape)
I then did the train_test_split. I wonder if my random_state or test_size is suboptimal:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=10, test_size=0.3)
Next, I created a sequential model. SIZE = 200 was set above when resizing the images, so the input shape is (200, 200, 3).
model2 = Sequential()
model2.add(Conv2D(filters=16, kernel_size=(10, 10), activation="relu", input_shape=(SIZE,SIZE,3)))
model2.add(BatchNormalization())
model2.add(MaxPooling2D(pool_size=(5, 5)))
model2.add(Dropout(0.2))
model2.add(Conv2D(filters=32, kernel_size=(5, 5), activation='relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(BatchNormalization())
model2.add(Dropout(0.2))
model2.add(Conv2D(filters=64, kernel_size=(5, 5), activation="relu"))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(BatchNormalization())
model2.add(Dropout(0.2))
model2.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(BatchNormalization())
model2.add(Dropout(0.2))
model2.add(Flatten())
model2.add(Dense(512, activation='relu'))
model2.add(Dropout(0.5))
model2.add(Dense(128, activation='relu'))
model2.add(Dropout(0.5))
model2.add(Dense(16, activation='sigmoid'))
#Do not use softmax for multilabel classification
#Softmax is useful for mutually exclusive classes, either cat or dog but not both.
#Also, softmax outputs all add to 1. So good for multi class problems where each
#class is given a probability and all add to 1. Highest one wins.
#Sigmoid outputs probability. Can be used for non-mutually exclusive problems.
#like multi label, in this example.
#But, also good for binary mutually exclusive (cat or not cat).
model2.summary()
#Binary cross entropy of each label. So not really a binary classification problem, but
#calculating binary cross entropy for each label.
opt = tf.keras.optimizers.Adamax(
    learning_rate=0.02,
    beta_1=0.8,
    beta_2=0.9999,
    epsilon=1e-9,
    name='Adamax')
model2.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy', 'mse' ])
The model uses a custom optimizer, and the resulting model has 473,632 trainable params.
I then specify the sample weights, which were calculated by taking the highest sample count and dividing the other counts by it:
sample_weight = {
    0: 1,
    1: 0.5197368421,
    2: 0.4385964912,
    3: 0.2324561404,
    4: 0.2302631579,
    5: 0.399122807,
    6: 0.08114035088,
    7: 0.5723684211,
    8: 0.08552631579,
    9: 0.2061403509,
    10: 0.3815789474,
    11: 0.125,
    12: 0.08333333333,
    13: 0.1206140351,
    14: 0.1403508772,
    15: 0.4824561404
}
Finally, I ran model.fit:
history = model2.fit(X_train2, y_train2, epochs=25, validation_data=(X_test2, y_test2), batch_size=64, class_weight = sample_weight, shuffle = False)
My issue was that the model was maxing out at around 30 to 40% accuracy. I looked into it, and what I read said tuning the learning rate was important. I also saw that raising the epochs would help to a point, as would lowering the batch size.
Is there anything else I may have missed? I noticed the worse models only predicted one class frequently (100% normal, 0% anything else), but the better model predicted on a sliding scale, where some items were at 10% and some at 70%.
I also wonder if I inverted my sample weights; class 0 has the most items in it... Should it be inverted, so that 1 sample of class 1 counts for 2 samples of class 0?
Thanks.
Things I tried:
Changing the batch size down to 16 or 8. (resulted in longer epoch times, slightly better results)
Changing the learning rate to a lower number (resulted in slightly better results, but over more epochs)
Changing it to 100 epochs (results plateaued around 20 epochs usually.)
Attempting to create more params: higher filter counts, a larger initial kernel size, a larger initial pool size, more and wider dense layers. (This resulted in it eating the RAM and not getting much better results.)
Changing the optimizer to Adam or RAdam or Adamax. (Didn't really change much; the other optimizers sucked though.) I messed with the beta_1 and epsilon too.
Revising the CSV. (The data is fairly vague; I had help and it still was hard to tell.)
Removing bad data. (I didn't want to get rid of too many pictures.)
Edit: Added sample accuracy. This one was unusually low, but starts off well enough (initial accuracy is 25.9%):
14/14 [==============================] - 79s 6s/step - loss: 0.4528 - accuracy: 0.2592 - mse: 0.1594 - val_loss: 261.8521 - val_accuracy: 0.3881 - val_mse: 0.1416
Epoch 2/25
14/14 [==============================] - 85s 6s/step - loss: 0.2817 - accuracy: 0.3188 - mse: 0.1310 - val_loss: 22.7037 - val_accuracy: 0.3881 - val_mse: 0.1416
Epoch 3/25
14/14 [==============================] - 79s 6s/step - loss: 0.2611 - accuracy: 0.3555 - mse: 0.1243 - val_loss: 11.9977 - val_accuracy: 0.3881 - val_mse: 0.1416
Epoch 4/25
14/14 [==============================] - 80s 6s/step - loss: 0.2420 - accuracy: 0.3521 - mse: 0.1172 - val_loss: 6.6056 - val_accuracy: 0.3881 - val_mse: 0.1416
Epoch 5/25
14/14 [==============================] - 80s 6s/step - loss: 0.2317 - accuracy: 0.3899 - mse: 0.1151 - val_loss: 4.9567 - val_accuracy: 0.3881 - val_mse: 0.1415
Epoch 6/25
14/14 [==============================] - 80s 6s/step - loss: 0.2341 - accuracy: 0.3899 - mse: 0.1141 - val_loss: 2.7395 - val_accuracy: 0.3881 - val_mse: 0.1389
Epoch 7/25
14/14 [==============================] - 76s 5s/step - loss: 0.2277 - accuracy: 0.4128 - mse: 0.1107 - val_loss: 2.3758 - val_accuracy: 0.3881 - val_mse: 0.1375
Epoch 8/25
14/14 [==============================] - 85s 6s/step - loss: 0.2199 - accuracy: 0.4106 - mse: 0.1094 - val_loss: 1.4526 - val_accuracy: 0.3881 - val_mse: 0.1319
Epoch 9/25
14/14 [==============================] - 76s 5s/step - loss: 0.2196 - accuracy: 0.4151 - mse: 0.1086 - val_loss: 0.7962 - val_accuracy: 0.3881 - val_mse: 0.1212
Epoch 10/25
14/14 [==============================] - 80s 6s/step - loss: 0.2187 - accuracy: 0.4140 - mse: 0.1087 - val_loss: 0.6308 - val_accuracy: 0.3744 - val_mse: 0.1211
Epoch 11/25
14/14 [==============================] - 81s 6s/step - loss: 0.2175 - accuracy: 0.4071 - mse: 0.1086 - val_loss: 0.5986 - val_accuracy: 0.3242 - val_mse: 0.1170
Epoch 12/25
14/14 [==============================] - 80s 6s/step - loss: 0.2087 - accuracy: 0.3968 - mse: 0.1034 - val_loss: 0.4003 - val_accuracy: 0.3333 - val_mse: 0.1092
Epoch 13/25
12/14 [========================>.....] - ETA: 10s - loss: 0.2092 - accuracy: 0.3945 - mse: 0.1044
Here are some notes that might help:
When using Batch Normalization, avoid too small batch sizes. For more details see the Group Normalization paper by Yuxin Wu and Kaiming He.
It may be worth looking at metrics like AUC and F1 as well, since you have an imbalanced multi-label case. You can add tf.keras.metrics.AUC(curve='PR') to your metrics list.
The training loss seems to have stalled at the end of epoch 13. If the training loss does not decrease anymore, you may want to (1) use a smaller learning rate, and/or (2) reduce your dropout parameters. Particularly, the relatively large dropout right before your last layer seems suspicious to me. First, try to obtain a model that fits your training dataset well (with low to no dropout); that is an important step. If your model cannot fit your training dataset well without any regularization, it may need more trainable parameters. After achieving a minimal model that fits the training dataset, you can then add regularization mechanisms to mitigate the overfitting issue.
Unless you have a good reason to do otherwise, set shuffle = True (which is also the default setting) to shuffle the training data before each epoch.
Although it is probably not the root cause of your problem, there is a debate on whether the normalization should come before the activation or after it. Some prefer to use it before the activation; an illustration follows below.
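For concreteness, the two orderings look like this (a sketch; the layer sizes are illustrative):

import tensorflow as tf
from tensorflow.keras import Sequential, layers

# Variant A: activation first, then normalization (what activation='relu'
# inside Conv2D gives you, as in the model above).
act_then_norm = Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(200, 200, 3)),
    layers.BatchNormalization(),
])

# Variant B: normalization between the convolution and the activation.
norm_then_act = Sequential([
    layers.Conv2D(32, 3, input_shape=(200, 200, 3)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
])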
The following was not clear to me: "I then specify the sample weights, which were calculated by taking the highest sample count and dividing the other counts by it."
Your class weights may have already been calculated correctly. I'd like to emphasize though that an under-represented class should be assigned a larger weight. Refer to this tutorial from TensorFlow as needed.
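For illustration, one common scheme (the one used in the TensorFlow tutorial mentioned above) weights each label inversely to how often it appears. A sketch, assuming `y2` is the (1000, 16) binary label matrix built in the question:

import numpy as np

# counts[i] = number of positive examples for label i
counts = y2.sum(axis=0)
# total / (2 * positives): rare labels get weights > 1, frequent ones < 1
# (assumes every label appears at least once; otherwise guard against division by zero)
class_weight = {i: y2.shape[0] / (2.0 * c) for i, c in enumerate(counts)}
print(class_weight)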
I want to create a neural network which can add two integer numbers. I have designed it as follows:
Question: I have really low accuracy of 0.002%. What can I do to increase it?
For creating data:
import numpy as np
import random

a = []
b = []
c = []
for i in range(1, 1001):
    a.append(random.randint(1, 999))
    b.append(random.randint(1, 999))
    c.append(a[i-1] + b[i-1])
X = np.array([a, b]).transpose()
y = np.array(c).transpose().reshape(-1, 1)
Scaling my data:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
minmax2 = MinMaxScaler()
X = minmax.fit_transform(X)
y = minmax2.fit_transform(y)
The network:
from keras import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
clfa = Sequential()
clfa.add(Dense(2, input_dim=2, activation='sigmoid', kernel_initializer='he_uniform'))
clfa.add(Dense(2, activation='sigmoid', kernel_initializer='uniform'))
clfa.add(Dense(2, activation='sigmoid', kernel_initializer='uniform'))
clfa.add(Dense(2, activation='sigmoid', kernel_initializer='uniform'))
clfa.add(Dense(1, activation='relu'))
opt = SGD(lr=0.01)
clfa.compile(opt, loss='mean_squared_error', metrics=['acc'])
clfa.fit(X, y, epochs=140)
Outputs:
Epoch 133/140
1000/1000 [==============================] - 0s 39us/step - loss: 0.0012 - acc: 0.0020
Epoch 134/140
1000/1000 [==============================] - 0s 40us/step - loss: 0.0012 - acc: 0.0020
Epoch 135/140
1000/1000 [==============================] - 0s 41us/step - loss: 0.0012 - acc: 0.0020
Epoch 136/140
1000/1000 [==============================] - 0s 40us/step - loss: 0.0012 - acc: 0.0020
Epoch 137/140
1000/1000 [==============================] - 0s 41us/step - loss: 0.0012 - acc: 0.0020
Epoch 138/140
1000/1000 [==============================] - 0s 42us/step - loss: 0.0012 - acc: 0.0020
Epoch 139/140
1000/1000 [==============================] - 0s 40us/step - loss: 0.0012 - acc: 0.0020
Epoch 140/140
1000/1000 [==============================] - 0s 42us/step - loss: 0.0012 - acc: 0.0020
That is my code with console outputs.
I have tried many different combinations of optimizers, losses, and activations, plus this data fits a linear regression perfectly.
Two mistakes, several issues.
The mistakes:
This is a regression problem, so the activation of the last layer should be linear, not relu (leaving it without specifying anything will work, since linear is the default activation in a Keras layer).
Accuracy is meaningless in regression; remove metrics=['acc'] from your model compilation - you should judge the performance of your model only with your loss.
The issues:
We don't use sigmoid activations for the intermediate layers; change all of them to relu.
Remove the kernel_initializer argument, thus leaving the default glorot_uniform, which is the recommended one.
A number of Dense layers, each with only two nodes, is not a good idea; try reducing the number of layers and increasing the number of nodes. See here for a simple example network for the iris data. A combined sketch follows below.
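Putting the fixes and suggestions together, a minimal sketch (reusing the scaled X and y from the question; the layer sizes are illustrative):

from keras import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

clfa = Sequential()
clfa.add(Dense(8, input_dim=2, activation='relu'))    # fewer layers, more nodes, relu
clfa.add(Dense(8, activation='relu'))
clfa.add(Dense(1))                                    # linear output for regression
clfa.compile(SGD(lr=0.01), loss='mean_squared_error')  # judge performance by loss alone
clfa.fit(X, y, epochs=140)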
You are trying to fit a linear function, but internally use sigmoid nodes, which map values to the range (0,1). Sigmoid is very useful for classification, but not really for regression if the values are outside (0,1). It could MAYBE work if you restricted your random numbers to floating point in the interval [0,1]. Or feed all the bits into your nodes separately and have it learn an adder.
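A hypothetical sketch of that bit-encoding idea (10 bits per addend covers 1..999; the helper name is made up):

import numpy as np

def to_bits(n, width=10):
    # little-endian bit representation of n
    return [(n >> k) & 1 for k in range(width)]

a = np.random.randint(1, 1000, size=1000)
b = np.random.randint(1, 1000, size=1000)
# 20 binary features per example instead of two raw magnitudes
X_bits = np.array([to_bits(x) + to_bits(y) for x, y in zip(a, b)], dtype='float32')
y_sum = (a + b).astype('float32').reshape(-1, 1) / 2000.0   # scale target into (0,1)
# X_bits and y_sum can now feed a small Dense network with a linear output.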
I'm trying to design a CNN in Keras to classify small images of emojis in other images. Below is an example of one of the 13 classes. All images are the same size, and all the emojis are of the same size as well. I would think that one should rather easily be able to achieve VERY high accuracy when classifying, as emojis from one class are exactly the same! My intuition told me that if an emoji is 50x50, I could create a convolutional layer of the same size to match one type of emoji. My supervisor did not think that was feasible, however. Anyway, my problem is that, no matter how I design my model, I always get the same validation accuracy for each epoch, which corresponds to 1/13 (i.e., simply guessing that every emoji belongs to the same class).
My model looks like this:
model = Sequential()
model.add(Conv2D(16, kernel_size=3, activation="relu", input_shape=IMG_SIZE))
model.add(Dropout(0.5))
model.add(Conv2D(32, kernel_size=3, activation="relu"))
model.add(Conv2D(64, kernel_size=3, activation="relu"))
model.add(Conv2D(128, kernel_size=3, activation="relu"))
#model.add(Conv2D(256, kernel_size=3, activation="relu"))
model.add(Dropout(0.5))
model.add(Flatten())
#model.add(Dense(256, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(NUM_CLASSES, activation='softmax', name="Output"))
And I train it like this:
# ------------------ Compile and train ---------------
sgd = optimizers.SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True)
rms = optimizers.RMSprop(lr=0.004, rho=0.9, epsilon=None, decay=0.0)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=["accuracy"]) # TODO Read more about this
train_hist = model.fit_generator(
    train_generator,
    steps_per_epoch=train_generator.n // BATCH_SIZE,
    validation_steps=validation_generator.n // BATCH_SIZE,  # TODO que?
    epochs=EPOCHS,
    validation_data=validation_generator,
    #callbacks=[EarlyStopping(patience=3, restore_best_weights=True)]
)
Even with this model, which has over 200 million parameters, I get exactly 0.0773 in validation accuracy for each epoch:
Epoch 1/10
56/56 [==============================] - 21s 379ms/step - loss: 14.9091 - acc: 0.0737 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 2/10
56/56 [==============================] - 6s 108ms/step - loss: 14.9308 - acc: 0.0737 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 3/10
56/56 [==============================] - 6s 108ms/step - loss: 14.7869 - acc: 0.0826 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 4/10
56/56 [==============================] - 6s 108ms/step - loss: 14.8948 - acc: 0.0759 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 5/10
56/56 [==============================] - 6s 109ms/step - loss: 14.8897 - acc: 0.0762 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 6/10
56/56 [==============================] - 6s 109ms/step - loss: 14.8178 - acc: 0.0807 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 7/10
56/56 [==============================] - 6s 108ms/step - loss: 15.0747 - acc: 0.0647 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 8/10
56/56 [==============================] - 6s 108ms/step - loss: 14.7509 - acc: 0.0848 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 9/10
56/56 [==============================] - 6s 108ms/step - loss: 14.8948 - acc: 0.0759 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 10/10
56/56 [==============================] - 6s 108ms/step - loss: 14.8228 - acc: 0.0804 - val_loss: 14.8719 - val_acc: 0.0773
Because it's not learning anything, I'm starting to think that it's not my model's fault, but maybe the dataset or how I train it. I have tried training with "adam" as well but get the same result. I tried to change the input size of the images, but still, same result. Below is a sample from my dataset. Do you have any ideas what could be wrong?
I think the main issue currently is that your model has way too many parameters relative to how few samples you have for training. For image classification nowadays, you generally want to just have conv layers, a GlobalSomethingPooling layer, and then a single Dense layer for your outputs. You just need to make sure that your conv section ends up with a large enough receptive field to be able to find all the features you need.
The first thing to think about is making sure that you have a large enough receptive field (further reading about that here). There are three main ways to achieve that: pooling, stride >= 2, and/or dilation >= 2. Because you have only 13 "features" that you want to identify, and all of them will always be pixel-perfect, I'm thinking dilation will be the way to go so that the model can easily "overfit" on those 13 "features". If we use 4 conv layers with dilations of 1, 2, 4, and 8, respectively, then we'll end up with a receptive field of 31. This should be enough to easily recognize 50-pixel emoji.
Next, how many filters should each layer have? Normally, you start with a few filters, and increase as you go through the model, as you are doing here. However, we want to "overfit" on specific features, so we should probably increase the amounts in the earlier layers. Just to make it easy, let's give all layers 64 filters.
Last, how do we convert this all to a single prediction? Rather than using a dense layer, which would use a ton of parameters and would not be translation invariant, people nowadays use GlobalAveragePooling or GlobalMaxPooling. GlobalAveragePooling is more common, because it's good for helping to find combinations of many features. However, we just want to find 13 or so exact features here, so GlobalMaxPooling might work even better. Then a single dense layer after that will be enough to get a prediction. Because we're using GlobalMaxPooling, it doesn't need to be flattened first—global pooling already does that for us.
Resulting model:
Conv2D(64, kernel_size=3, activation="relu", input_shape=IMG_SIZE)
Conv2D(64, kernel_size=3, dilation_rate=2, activation="relu")
Conv2D(64, kernel_size=3, dilation_rate=4, activation="relu")
Conv2D(64, kernel_size=3, dilation_rate=8, activation="relu")
GlobalMaxPooling2D()
Dense(NUM_CLASSES, activation='softmax', name="Output")
Try that. You'll probably also want to add BatchNormalization layers in there after every layer except the last. Once you get decent training accuracy, check if it's overfitting. If it is (which is likely), try these steps:
batch norm if you haven't already
weight decay on all conv and dense layers
SpatialDropout2D after the conv layers and regular Dropout after the pooling layer
augment your data with an ImageDataGenerator
Designing nets like this is almost more of an art than a science, so change stuff around if you think you should. Eventually, you will get an intuition for what is more or less likely to work in any given situation.
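Assembled into runnable Keras code, a sketch of the suggested model (with the BatchNormalization layers mentioned above; IMG_SIZE and NUM_CLASSES as defined in the question):

from tensorflow.keras import Sequential, layers

model = Sequential([
    layers.Conv2D(64, 3, activation='relu', input_shape=IMG_SIZE),
    layers.BatchNormalization(),
    layers.Conv2D(64, 3, dilation_rate=2, activation='relu'),
    layers.BatchNormalization(),
    layers.Conv2D(64, 3, dilation_rate=4, activation='relu'),
    layers.BatchNormalization(),
    layers.Conv2D(64, 3, dilation_rate=8, activation='relu'),
    layers.BatchNormalization(),
    layers.GlobalMaxPooling2D(),      # pools each filter map to one value; no Flatten needed
    layers.Dense(NUM_CLASSES, activation='softmax', name='Output'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])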
I'm new to machine learning; kindly bear with me if the way of asking is not ideal or the question is too simple.
The issue is that my model is returning a loss of nan. Please advise me if I made anything wrong; below are the details.
Program Logic
import numpy as np
import tensorflow as tf
import pandas as pd
# Reading the csv file from local drive as a dataframe
bike_df = pd.read_csv('C:\\Users\\HOME\\MLPythonPractice\\Data sets\\Bike-Sharing-Dataset\\day.csv')
bike_result_df = pd.read_csv('C:\\Users\\HOME\\MLPythonPractice\\Data sets\\Bike-Sharing-Dataset\\day.csv')
# Remove unwanted columns from the data frame
bike_df = bike_df.drop(columns=['instant','dteday','cnt'])
# shape of the dataframe
print(bike_df.shape)
# Exact attribute to see the columns of the dataframe
print(bike_df.columns)
# To know the type
print(type(bike_df))
# To see the information of the dataframe
print(bike_df.info())
# Converting from dataframe to ndarray
bike_s = bike_df.values
print(type(bike_s))
print(bike_s.shape)
# Remove all the columns except cnt column which is result set
bike_result_df['cnt'] = bike_result_df['cnt'].values.astype(np.float64) #converting to float
bike_result_df = bike_result_df['cnt'] # Removing all columns except cnt column
bike_result_s = bike_result_df.values # Converting dataframe to ndarray
print(type(bike_result_s))
print(bike_result_s)
print(type(bike_df))
print(bike_df.shape)
print(bike_result_df.shape)
#As the data frame is available, we will build the graph using keras (## are part of build graph)
## Initialise the sequential model
model = tf.keras.models.Sequential()
## Normalize the input data by creating a normalisation layer
model.add(tf.keras.layers.BatchNormalization(input_shape = (13,)))
## Add dense layer for prediction -- Keras declares weights and bias; Dense(1): 1 is the expected output size
model.add(tf.keras.layers.Dense(1))
# Compile the model - add loss and gradient descent optimiser
model.compile(optimizer='sgd',loss='mse')
print(type(bike_s))
print(type(bike_result_s))
print(bike_s.shape)
print(bike_result_s.shape)
print(bike_result_s)
# Execute the graph
model.fit(bike_s,bike_result_s,epochs=10)
model.save('models/bike_sharing_lr.h5')
I'm getting this output:
Epoch 1/10
731/731 [==============================] - 1s 895us/step - loss: nan
Epoch 2/10
731/731 [==============================] - 0s 44us/step - loss: nan
Epoch 3/10
731/731 [==============================] - 0s 46us/step - loss: nan
Epoch 4/10
731/731 [==============================] - 0s 44us/step - loss: nan
Epoch 5/10
731/731 [==============================] - 0s 39us/step - loss: nan
Epoch 6/10
731/731 [==============================] - 0s 39us/step - loss: nan
Epoch 7/10
731/731 [==============================] - 0s 47us/step - loss: nan
Epoch 8/10
731/731 [==============================] - 0s 40us/step - loss: nan
Epoch 9/10
731/731 [==============================] - 0s 43us/step - loss: nan
Epoch 10/10
731/731 [==============================] - 0s 42us/step - loss: nan
To prevent your gradient from exploding, you can clip it like so:
model.compile(optimizer=tf.keras.optimizers.SGD(clipnorm=1), loss='mse')
According to https://keras.io/optimizers/, setting clipnorm=1 makes the gradient descent optimizer clip all parameter gradients to a maximum norm of 1. This prevents your loss function from diverging.
See also https://www.dlology.com/blog/how-to-deal-with-vanishingexploding-gradients-in-keras/ for other ways to control exploding gradients.
With the above tweak, the loss function doesn't diverge, but it doesn't decrease over time either. I've noticed the way you've set up your model is weird. Batch normalization should typically follow an activation layer. I'm not sure why you need to normalize your input, but you should not be using BatchNormalization for that. If you change your model to,
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(1, activation='relu', input_shape=(13,)))
model.add(tf.keras.layers.BatchNormalization())
model.compile(optimizer='sgd', loss='mse')
you will get a more meaningful result, and the loss function value now decreases from some 20 million to 120,000.
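An additional option worth sketching (an assumption on my part, not something demonstrated above): scale the features and the raw `cnt` target before fitting, since the target is in the thousands and that alone inflates the MSE. Reusing `bike_s`, `bike_result_s`, and `tf` from the question:

from sklearn.preprocessing import StandardScaler

# Standardize features and target so SGD on MSE starts near loss values of order 1.
x_scaler = StandardScaler()
y_scaler = StandardScaler()
bike_x = x_scaler.fit_transform(bike_s)
bike_y = y_scaler.fit_transform(bike_result_s.reshape(-1, 1))

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(1, input_shape=(13,)))   # plain linear regression
model.compile(optimizer='sgd', loss='mse')
model.fit(bike_x, bike_y, epochs=10)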
I am training a model on fruits360 dataset from kaggle.
I have 0 dense layers and 3 convolutional layers in my Keras model. My input shape is (60,60,3), as the images are loaded in RGB format. Please help me troubleshoot what the problem with this model is and why it is not training properly. I've tried different combinations of layers, but the accuracy and loss remain constant no matter what I change.
Following is the model:
import time
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense
from keras.callbacks import TensorBoard

dense_layers = [0]
layer_sizes = [64]
conv_layers = [3]

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            NAME = "{}-conv-{}-nodes-{}-dense-{}".format(conv_layer, layer_size, dense_layer, int(time.time()))
            print(NAME)

            model = Sequential()
            model.add(Conv2D(layer_size, (3, 3), input_shape=(60, 60, 3)))
            model.add(Activation('relu'))
            model.add(MaxPooling2D(pool_size=(2, 2)))

            for l in range(conv_layer - 1):
                model.add(Conv2D(layer_size, (3, 3)))
                model.add(Activation('relu'))
                model.add(MaxPooling2D(pool_size=(2, 2)))

            model.add(Flatten())
            for _ in range(dense_layer):
                model.add(Dense(layer_size))
                model.add(Activation('relu'))

            model.add(Dense(1))
            model.add(Activation('sigmoid'))

            tensorboard = TensorBoard(log_dir="logs/")

            model.compile(loss='sparse_categorical_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])

            model.fit(X_norm, y,
                      batch_size=32,
                      epochs=10,
                      validation_data=(X_norm_test, y_test),
                      callbacks=[tensorboard])
But the accuracy remains constant, as follows:
Epoch 1/10
42798/42798 [==============================] - 27s 641us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 2/10
42798/42798 [==============================] - 27s 638us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 3/10
42798/42798 [==============================] - 27s 637us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 4/10
42798/42798 [==============================] - 27s 635us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 5/10
42798/42798 [==============================] - 27s 635us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 6/10
42798/42798 [==============================] - 27s 631us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 7/10
42798/42798 [==============================] - 27s 631us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 8/10
42798/42798 [==============================] - 27s 631us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 9/10
42798/42798 [==============================] - 27s 635us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
Epoch 10/10
42798/42798 [==============================] - 27s 626us/step - loss: nan - acc: 0.0115 - val_loss: nan - val_acc: 0.0114
What can I do to train this model properly and increase the accuracy?
I'm not sure sparse_categorical_crossentropy is a proper loss for an output with only 1 unit.
Notice your loss is nan. This means there is a mathematical error in your model/data/loss somewhere. Many times this is caused by division by zero, overflowing numbers, etc.
I suppose you should be using 'binary_crossentropy' as a loss function.
Notice that you will still have a risk of frozen losses because of 'relu' activations. If this happens, you can add BatchNormalization() layers before the Activation('relu') layers.
Please take into account #desertnaut's comments. You're creating a new Sequential model inside every loop.
The bug is with the last dense layer. It is set to one (i.e., one class). The Fruits-360 dataset (depending on the version) has upwards of 100 classes. The 1% accuracy is another giveaway: the labels from the training set have 100 different values, and a dense layer with one output node can only (randomly) get it right 1 out of 100 -- hence the 1%.
The 'sparse_categorical_crossentropy' is okay as long as the labels are not one-hot encoded; if they are, you should use 'categorical_crossentropy'. The Fruits-360 dataset, when used with Adam, tends not to converge at high learning rates (like 0.1 and 0.01); it's best to set the learning rate to something between 0.001 and 0.0001.
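A sketch of the corrected head and compile step (reusing `model`, `Dense`, and `Activation` from the question, in place of the final Dense(1)/Activation('sigmoid') pair; `num_classes` is illustrative and should match the actual number of fruit classes):

from tensorflow.keras.optimizers import Adam

num_classes = 131   # illustrative; depends on the Fruits-360 version used
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=Adam(learning_rate=0.0001),   # low learning rate, as noted above
              metrics=['accuracy'])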