I am using the LeNet architecture below to train my image classification model, but I have noticed that neither the training nor the validation accuracy improves from one epoch to the next. Can anyone with expertise in this area explain what might have gone wrong?
Training samples: 110 images belonging to 2 classes.
Validation samples: 50 images belonging to 2 classes.
#LeNet
import keras
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
#import dropout class if needed
from keras.layers import Dropout
from keras import regularizers
model = Sequential()
#Layer 1
#Conv Layer 1
model.add(Conv2D(filters = 6,
kernel_size = 5,
strides = 1,
activation = 'relu',
input_shape = (32,32,3)))
#Pooling layer 1
model.add(MaxPooling2D(pool_size = 2, strides = 2))
#Layer 2
#Conv Layer 2
model.add(Conv2D(filters = 16,
kernel_size = 5,
strides = 1,
activation = 'relu',
input_shape = (14,14,6)))
#Pooling Layer 2
model.add(MaxPooling2D(pool_size = 2, strides = 2))
#Flatten
model.add(Flatten())
#Layer 3
#Fully connected layer 1
model.add(Dense(units=128, activation='relu', kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(rate=0.2))
#Layer 4
#Fully connected layer 2
model.add(Dense(units=64, activation='relu', kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(rate=0.2))
#layer 5
#Fully connected layer 3
model.add(Dense(units=64, activation='relu', kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(rate=0.2))
#layer 6
#Fully connected layer 4
model.add(Dense(units=64, activation='relu', kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(rate=0.2))
#Layer 7
#Output Layer
model.add(Dense(units = 2, activation = 'softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
from keras.preprocessing.image import ImageDataGenerator
#Image Augmentation
train_datagen = ImageDataGenerator(
rescale=1./255, # rescaling pixel values to between 0 and 1
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True)
#Just Feature scaling
test_datagen = ImageDataGenerator(rescale=1./255)
training_set = train_datagen.flow_from_directory(
'/Dataset/Skin_cancer/training',
target_size=(32, 32),
batch_size=32,
class_mode='categorical')
test_set = test_datagen.flow_from_directory(
'/Dataset/Skin_cancer/testing',
target_size=(32, 32),
batch_size=32,
class_mode='categorical')
model.fit_generator(
training_set,
steps_per_epoch=50, #number of input (image)
epochs=25,
validation_data=test_set,
validation_steps=10) # number of training sample
Epoch 1/25
50/50 [==============================] - 52s 1s/step - loss: 0.8568 - accuracy: 0.4963 - val_loss: 0.7004 - val_accuracy: 0.5000
Epoch 2/25
50/50 [==============================] - 50s 1s/step - loss: 0.6940 - accuracy: 0.5000 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 3/25
50/50 [==============================] - 48s 967ms/step - loss: 0.6932 - accuracy: 0.5065 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/25
50/50 [==============================] - 50s 1s/step - loss: 0.6932 - accuracy: 0.4824 - val_loss: 0.6933 - val_accuracy: 0.5000
Epoch 5/25
50/50 [==============================] - 49s 974ms/step - loss: 0.6932 - accuracy: 0.4949 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 6/25
50/50 [==============================] - 51s 1s/step - loss: 0.6932 - accuracy: 0.4854 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 7/25
50/50 [==============================] - 49s 976ms/step - loss: 0.6931 - accuracy: 0.5015 - val_loss: 0.6918 - val_accuracy: 0.5000
Epoch 8/25
50/50 [==============================] - 51s 1s/step - loss: 0.6932 - accuracy: 0.4986 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 9/25
50/50 [==============================] - 49s 973ms/step - loss: 0.6932 - accuracy: 0.5000 - val_loss: 0.6929 - val_accuracy: 0.5000
Epoch 10/25
50/50 [==============================] - 50s 1s/step - loss: 0.6931 - accuracy: 0.5044 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 11/25
50/50 [==============================] - 49s 976ms/step - loss: 0.6931 - accuracy: 0.5022 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 12/25
Most importantly, you are using loss = 'categorical_crossentropy'; change it to loss = 'binary_crossentropy', as you have just 2 classes. Also change class_mode='categorical' to class_mode='binary' in flow_from_directory.
As @desertnaut rightly mentioned, categorical_crossentropy goes hand in hand with a softmax activation in the last layer, and if you change the loss to binary_crossentropy, the last activation should also be changed to sigmoid (with a single output unit).
Other Improvements:
You have very limited data (160 images), and you have used almost a third of it (50 of 160 images) as validation data.
As you are building a model for image classification, note that you have just two Conv2D layers but four Dense layers. The Dense layers add a huge number of weights to be learnt. Add a few more Conv2D layers and reduce the number of Dense layers.
Set batch_size = 1 and remove steps_per_epoch. As you have very few inputs, let every epoch have the same number of steps as there are input records.
Use the default glorot_uniform kernel initializer.
To further tune your model, build it from multiple Conv2D layers, followed by a GlobalAveragePooling2D layer, a fully connected layer, and a final classification layer (softmax, or a single sigmoid unit for the binary case); see the sketch after this list.
Use data augmentation techniques like horizontal_flip, vertical_flip, shear_range, and zoom_range in ImageDataGenerator to increase the effective number of training images.
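Putting these structural suggestions together, a minimal sketch of what such a model could look like (the filter counts here are illustrative assumptions, not tuned values; it uses the single sigmoid unit with binary_crossentropy recommended above):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

model = Sequential()
# Deeper conv stack in place of most of the Dense layers
model.add(Conv2D(32, kernel_size=3, activation='relu', input_shape=(32, 32, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(64, kernel_size=3, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(128, kernel_size=3, activation='relu'))
# Global pooling avoids the huge Flatten -> Dense weight matrices
model.add(GlobalAveragePooling2D())
# Single sigmoid unit paired with binary_crossentropy, as recommended above
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Remember to also pass class_mode='binary' to both flow_from_directory calls so the generators yield labels in the matching format.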
Moving the comments to the answer section, as suggested by @desertnaut -
Question - Thanks! Yes, too little data is the problem, I figured. One additional question - why does adding more Dense layers than Conv layers negatively affect the model, and is there any rule to follow when deciding how many Conv and Dense layers to use? – Arun_Ramji_Shanmugam 2 days ago
Answer - To answer the first part of your question: a Conv2D layer maintains the spatial information of the image, and the weights to be learnt depend on the kernel size and stride specified in the layer, whereas a Dense layer needs the output of Conv2D to be flattened before it can be used, hence losing the spatial information. Also, Dense layers add a larger number of weights; for example, 2 dense layers of 512 units add 512*512 = 262144 params or weights to the model (which have to be learnt by the model). That means you have to train for more epochs and with good hyperparameter settings to learn these weights. – Tensorflow Warriors 2 days ago
Answer - To answer the second part of your question: use systematic experiments to discover what works best for your specific dataset. It also depends on the processing power you have. Remember, deeper networks can perform better, but at the cost of more data and increased learning complexity. A conventional approach is to look for similar problems and deep learning architectures which have already been shown to work. We also have the flexibility to utilize pretrained models like ResNet, VGG, etc.: use these models by freezing part of the layers and training the remaining layers. – Tensorflow Warriors 2 days ago
Question - Thank you for the detailed answer!! If you don't mind one more question - when we use an already trained model (maybe some of its layers), doesn't it need to have been trained on the same kind of input data as the data we are going to work with? – Arun_Ramji_Shanmugam yesterday
Answer - The intuition behind transfer learning for image classification is that if a model is trained on a large and general enough dataset, it will effectively serve as a generic model of the visual world. You can find a transfer learning example with an explanation here - tensorflow.org/tutorials/images/transfer_learning . – Tensorflow Warriors yesterday
Remove all kernel_initializer='uniform' arguments from your layers; don't specify anything here. The default initializer, glorot_uniform, is the highly recommended one (and uniform is a particularly bad one).
As a general rule, keep in mind that the default values for such rather advanced settings are there for your convenience; they are implicitly recommended, and you had better not mess with them unless you have specific reasons to do so and you know exactly what you are doing.
For the kernel_initializer argument in particular, I have started believing that it has caused a lot of unnecessary pain to people (just see here for the most recent example).
Also, dropout should not be used by default, especially in cases like this one, where the model seems to struggle to learn anything; start without any dropout (comment out the respective layers), and only add it back if you see signs of overfitting.
Related
I have a problem with image classification using Keras: I always get poor accuracy, only about 0.02. I tried to follow the cat and dog classification example, which reaches 0.8 accuracy, but it did not work in my case with 30 classes.
Let's say I have a dataset of around 100K images categorized into 30 classes. I split it with 80% for training and 20% for validation.
The folder structure looks like this:
|-train
|---category1
|---category2
|---category3
|---category4
|---.....
|---category30
|
|-validation
|---category1
|---category2
|---category3
|---category4
|---.....
|---category30
Each category in the train folder contains around 2000 to 4000 images.
My model:
model = tf.keras.Sequential([
Conv2D(kernel_size=3, filters=16, padding='same', activation='relu', input_shape=[150,150, 3]),
Conv2D(kernel_size=3, filters=30, padding='same', activation='relu'),
MaxPooling2D(pool_size=2),
Conv2D(kernel_size=3, filters=60, padding='same', activation='relu'),
MaxPooling2D(pool_size=2),
Conv2D(kernel_size=3, filters=90, padding='same', activation='relu'),
MaxPooling2D(pool_size=2),
Conv2D(kernel_size=3, filters=110, padding='same', activation='relu'),
MaxPooling2D(pool_size=2),
Conv2D(kernel_size=3, filters=130, padding='same', activation='relu'),
Conv2D(kernel_size=1, filters=40, padding='same', activation='relu'),
GlobalAveragePooling2D(),
Dense(1,'sigmoid'),
Activation('softmax')
])
model.compile(optimizer=keras.optimizers.Adam(lr=.00001),
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
Training the model:
history = model.fit_generator(
train_generator,
steps_per_epoch=100,
epochs=10,
validation_data=validation_generator,
validation_steps=50,
verbose=2)
And I always get low accuracy, around 0.02 or 0.03:
Epoch 6/10
100/100 - 172s - loss: -1.8906e+01 - accuracy: 0.0265 - val_loss: -1.8923e+01 - val_accuracy: 0.0270
Epoch 7/10
100/100 - 171s - loss: -1.8773e+01 - accuracy: 0.0230 - val_loss: -1.8396e+01 - val_accuracy: 0.0330
Epoch 8/10
100/100 - 170s - loss: -1.8780e+01 - accuracy: 0.0295 - val_loss: -1.9882e+01 - val_accuracy: 0.0180
Epoch 9/10
100/100 - 170s - loss: -1.8895e+01 - accuracy: 0.0240 - val_loss: -1.8572e+01 - val_accuracy: 0.0210
Epoch 10/10
100/100 - 170s - loss: -1.9091e+01 - accuracy: 0.0265 - val_loss: -1.8685e+01 - val_accuracy: 0.0300
So how can I improve my model? Is there something wrong?
You should have as many neurons in your final layer as you have classes. So your final dense layer should be:
Dense(n_classes),
Activation('softmax')
Also, since your task is not binary classification, your loss function should be:
loss=tf.keras.losses.CategoricalCrossentropy()
from_logits=True should only be set to True if you don't have an activation function on your final dense layer (which you currently do have). If you want to keep from_logits=True, remove the softmax activation.
For this loss function, make sure that in your flow_from_directory call, class_mode='categorical'.
One more thing: your learning rate seems very small. The default learning rate of 0.001 should be fine.
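Putting those points together, a sketch of what the model tail and the compile call could look like (the conv stack here is abbreviated and illustrative, not the asker's full architecture):
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

n_classes = 30  # number of categories in this dataset

model = tf.keras.Sequential([
    # Abbreviated conv stack; keep your own feature extractor here
    Conv2D(16, kernel_size=3, padding='same', activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D(pool_size=2),
    Conv2D(32, kernel_size=3, padding='same', activation='relu'),
    GlobalAveragePooling2D(),
    # One output neuron per class, with softmax applied on the layer itself
    Dense(n_classes, activation='softmax'),
])

# from_logits is left at its default (False) because softmax is applied above;
# the default Adam learning rate of 0.001 is a reasonable starting point
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])
With this loss, keep class_mode='categorical' in flow_from_directory, as noted above.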
First of all, you should change the following lines:
Dense(1,'sigmoid'),
Activation('softmax')
into:
Dense(number_of_classes, activation='softmax'),
and in model.compile():
loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False)
where number_of_classes is the number of categories in your case (30).
Second of all, since you have a limited number of images for each of your classes, you should use a pretrained network. A good starting point would be ResNet50; a sketch follows.
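A minimal sketch of such a transfer learning setup, assuming 150x150 RGB inputs as in the question (the names and layer choices here are illustrative):
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

n_classes = 30

# Pretrained ResNet50 as a frozen feature extractor (top classifier removed)
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
base_model.trainable = False

model = tf.keras.Sequential([
    base_model,
    GlobalAveragePooling2D(),
    Dense(n_classes, activation='softmax'),
])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])
Once the new head trains well, you can optionally unfreeze some of the top ResNet blocks and fine-tune with a low learning rate.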
I'm making a simple classification algorithm with a Keras neural network. The goal is to take 3 data points on weather and decide whether or not there's a wildfire. Here's an image of the .csv dataset that I'm using to train the model (this image shows only the top few lines, not the entire file):
wildfire weather dataset
As you can see, there are 4 columns, with the fourth being either a "1", which means "fire", or a "0", which means "no fire". I want the algorithm to predict either a 1 or a 0. This is the code that I wrote:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import csv
#THIS IS USED TO TRAIN THE MODEL
# Importing the dataset
dataset = pd.read_csv('Fire_Weather.csv')
dataset.head()
X=dataset.iloc[:,0:3]
Y=dataset.iloc[:,3]
X.head()
obj=StandardScaler()
X=obj.fit_transform(X)
X_train,X_test,y_train,y_test=train_test_split(X, Y, test_size=0.25)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=3))
# classifier.add(Dropout(rate=0.1))
# Adding the second hidden layer
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
# classifier.add(Dropout(rate=0.1))
# Adding the output layer
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
# Compiling the ANN
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, batch_size = 3, epochs = 10)
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
print(y_pred)
classifier.save("weather_model.h5")
The problem is that whenever I run this, my accuracy is always "0.0000e+00" and my training output looks like this:
Epoch 1/10
2146/2146 [==============================] - 2s 758us/step - loss: nan - accuracy: 0.0238
Epoch 2/10
2146/2146 [==============================] - 1s 625us/step - loss: nan - accuracy: 0.0000e+00
Epoch 3/10
2146/2146 [==============================] - 1s 604us/step - loss: nan - accuracy: 0.0000e+00
Epoch 4/10
2146/2146 [==============================] - 1s 609us/step - loss: nan - accuracy: 0.0000e+00
Epoch 5/10
2146/2146 [==============================] - 1s 624us/step - loss: nan - accuracy: 0.0000e+00
Epoch 6/10
2146/2146 [==============================] - 1s 633us/step - loss: nan - accuracy: 0.0000e+00
Epoch 7/10
2146/2146 [==============================] - 1s 481us/step - loss: nan - accuracy: 0.0000e+00
Epoch 8/10
2146/2146 [==============================] - 1s 476us/step - loss: nan - accuracy: 0.0000e+00
Epoch 9/10
2146/2146 [==============================] - 1s 474us/step - loss: nan - accuracy: 0.0000e+00
Epoch 10/10
2146/2146 [==============================] - 1s 474us/step - loss: nan - accuracy: 0.0000e+00
Does anyone know why this is happening and what I could do to my code to fix this?
Thank You!
EDIT: I realized that my earlier response was highly misleading, which was thankfully pointed out by @xdurch0 and @Timbus Calin. Here is an edited answer.
Check that all your input values are valid. Are there any nan or inf values in your training data?
Try using different activation functions. ReLU is good, but it is prone to what is known as the dying ReLU problem, where parts of the network effectively learn nothing because no updates are made to their weights. One possibility is to use Leaky ReLU or PReLU.
Try using gradient clipping, a technique for tackling vanishing or exploding gradients (which is likely what is happening in your case). Keras allows users to configure a clipnorm or clipvalue argument on optimizers; see the sketch below.
There are posts on SO that report similar problems, such as this one, which might also be of interest to you.
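As a sketch of how the last two suggestions could be combined for this 3-feature binary problem (the clipnorm value and layer sizes are illustrative assumptions, not tuned values):
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
from keras.optimizers import Adam

model = Sequential()
# LeakyReLU keeps a small gradient for negative inputs, avoiding dying ReLUs
model.add(Dense(6, input_dim=3))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(6))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(1, activation='sigmoid'))

# clipnorm rescales any gradient whose L2 norm exceeds 1.0,
# which tames exploding gradients (a common cause of nan losses)
model.compile(optimizer=Adam(clipnorm=1.0),
              loss='binary_crossentropy',
              metrics=['accuracy'])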
I'm working on a project that has 700 images for each of 2 classes (1400 total). I'm using VGG16, but I'm new to this model and I don't know what I could do to improve it.
This is my model:
vgg16_model = VGG16(weights="imagenet", include_top=True)
# (1) visualize layers
print("VGG16 model layers")
for i, layer in enumerate(vgg16_model.layers):
print(i, layer.name, layer.output_shape)
# (2) remove the top layer
base_model = Model(input=vgg16_model.input,
output=vgg16_model.get_layer("block5_pool").output)
# (3) attach a new top layer
base_out = base_model.output
base_out = Reshape((25088,))(base_out)
top_fc1 = Dense(256, activation="relu")(base_out)
top_fc1 = Dropout(0.5)(top_fc1)
# output layer: (None, 5)
top_preds = Dense(1, activation="sigmoid")(top_fc1)
# (4) freeze weights until the last but one convolution layer (block4_pool)
for layer in base_model.layers[0:14]:
layer.trainable = False
# (5) create new hybrid model
model = Model(input=base_model.input, output=top_preds)
# (6) compile and train the model
sgd = SGD(lr=1e-4, momentum=0.9)
model.compile(optimizer=sgd, loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit([data], [labels], nb_epoch=NUM_EPOCHS,
batch_size=BATCH_SIZE, validation_split=0.1)
# evaluate final model
vlabels = model.predict(np.array(valid))
model.save('model.h5')
... that gives me the following output:
Train on 1260 samples, validate on 140 samples
Epoch 1/5
1260/1260 [==============================] - 437s 347ms/step - loss: 0.2200 - acc: 0.9746 - val_loss: 2.4432e-05 - val_acc: 1.0000
Epoch 2/5
1260/1260 [==============================] - 456s 362ms/step - loss: 0.0090 - acc: 0.9984 - val_loss: 1.5452e-04 - val_acc: 1.0000
Epoch 3/5
1260/1260 [==============================] - 438s 347ms/step - loss: 1.3702e-07 - acc: 1.0000 - val_loss: 8.4489e-05 - val_acc: 1.0000
Epoch 4/5
1260/1260 [==============================] - 446s 354ms/step - loss: 4.2592e-06 - acc: 1.0000 - val_loss: 7.6768e-05 - val_acc: 1.0000
Epoch 5/5
1260/1260 [==============================] - 457s 363ms/step - loss: 0.0017 - acc: 0.9992 - val_loss: 1.1921e-07 - val_acc: 1.0000
It seems to be overfitting a bit.
My predict.py:
def fix_layer0(filename, batch_input_shape, dtype):
with h5py.File(filename, 'r+') as f:
model_config = json.loads(f.attrs['model_config'].decode('utf-8'))
layer0 = model_config['config']['layers'][0]['config']
layer0['batch_input_shape'] = batch_input_shape
layer0['dtype'] = dtype
f.attrs['model_config'] = json.dumps(model_config).encode('utf-8')
fix_layer0('model.h5', [None, 224, 224, 3], 'float32')
model = load_model('model.h5')
for filename in os.listdir(r'v/'):
if filename.endswith(".jpg") or filename.endswith(".ppm") or filename.endswith(".jpeg") or filename.endswith(".png"):
ImageCV = cv2.resize(cv2.imread(os.path.join(TEST_DIR) + filename), (224,224))
ImageCV = cv2.addWeighted(ImageCV,4, cv2.GaussianBlur(ImageCV,(0,0), 224/25), -4, 120) #The same process made when I get data in train
ImageCV = ImageCV.reshape(-1,224,224,3)
print(model.predict(ImageCV))
And the results are strange, because only the first 2 images are of 'class 0'; the others are 'class 1':
[[0.99905235]]
[[0.]]
[[1.]]
[[0.012198]]
[[0.]]
[[1.]]
[[1.6363418e-07]]
[[0.99997246]]
[[0.00433112]]
[[0.9996668]]
[[1.]]
[[6.183685e-08]]
What can I do to improve it? I'm a little confused.
ImageCV = cv2.addWeighted(ImageCV,4, cv2.GaussianBlur(ImageCV,(0,0),
224/25), -4, 120)
I'm not sure why you do this for the test data. For validation/test data, usually only normalization is done. During training as well, you need to apply the same normalization as a final step before feeding the data to the network.
Refer to this example of fine-tuning VGG16 for a two-class problem (dogs vs. cats):
https://gist.github.com/fchollet/7eb39b44eb9e16e59632d25fb3119975
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
To reduce overfitting, you can do data augmentation for the training data, i.e. feed both the original data and augmented data (applying operations like flips, zoom, etc.). Keras ImageDataGenerator makes the augmentation easy; it is explored in the above tutorial as well, and sketched below.
https://keras.io/preprocessing/image/
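As a sketch, here is a pair of generators that apply the same normalization to training and test data, with augmentation only on the training side (the directory paths are placeholders):
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import preprocess_input

# Training generator: VGG16's own normalization plus augmentation
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=15,
    zoom_range=0.2,
    horizontal_flip=True)

# Validation/test generator: the same normalization, but no augmentation
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_gen = train_datagen.flow_from_directory(
    'data/train', target_size=(224, 224), batch_size=32, class_mode='binary')
test_gen = test_datagen.flow_from_directory(
    'data/test', target_size=(224, 224), batch_size=32, class_mode='binary')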
First of all, Keras predict returns the prediction scores (probabilities for each class), while predict_classes returns the most likely class. For example, if you classify between cats and dogs, predict could output 0.2 for cat and 0.8 for dog.
So, if you use predict, there should be two values per picture, one for each class.
The reason why you only have one value is that your network only has one output neuron. It should have two, as there are two classes:
top_preds = Dense(2, activation="sigmoid")(top_fc1)
If you now want to see the most likely class rather than the probabilities, you should use predict_classes.
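One caveat: predict_classes is only available on Sequential models, and the model above was built with the functional Model API; an equivalent there is to take the argmax of predict yourself. A minimal sketch, where images stands for a preprocessed input batch:
import numpy as np

probs = model.predict(images)        # shape (n_samples, 2) with a 2-unit output layer
classes = np.argmax(probs, axis=-1)  # index of the most likely class per sample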
I'm trying to design a CNN in Keras to classify small images of emojis in other images. Below is an example of one of the 13 classes. All images are the same size, and all the emojis are the same size as well. I would think that one should rather easily be able to achieve VERY high accuracy when classifying, as emojis from one class are exactly the same! My intuition told me that if an emoji is 50x50, I could create a convolutional layer of the same size to match one type of emoji. My supervisor did not think that was feasible, however. Anyway, my problem is that, no matter how I design my model, I always get the same validation accuracy for each epoch, which corresponds to 1/13 (i.e. simply guessing that every emoji belongs to the same class).
My model looks like this:
model = Sequential()
model.add(Conv2D(16, kernel_size=3, activation="relu", input_shape=IMG_SIZE))
model.add(Dropout(0.5))
model.add(Conv2D(32, kernel_size=3, activation="relu"))
model.add(Conv2D(64, kernel_size=3, activation="relu"))
model.add(Conv2D(128, kernel_size=3, activation="relu"))
#model.add(Conv2D(256, kernel_size=3, activation="relu"))
model.add(Dropout(0.5))
model.add(Flatten())
#model.add(Dense(256, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(NUM_CLASSES, activation='softmax', name="Output"))
And I train it like this:
# ------------------ Compile and train ---------------
sgd = optimizers.SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True)
rms = optimizers.RMSprop(lr=0.004, rho=0.9, epsilon=None, decay=0.0)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=["accuracy"]) # TODO Read more about this
train_hist = model.fit_generator(
train_generator,
steps_per_epoch=train_generator.n // BATCH_SIZE,
validation_steps=validation_generator.n // BATCH_SIZE, # TODO que?
epochs=EPOCHS,
validation_data=validation_generator,
#callbacks=[EarlyStopping(patience=3, restore_best_weights=True)]
)
Even with this model, which has over 200 million parameters, I get exactly 0.0773 validation accuracy in every epoch:
Epoch 1/10
56/56 [==============================] - 21s 379ms/step - loss: 14.9091 - acc: 0.0737 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 2/10
56/56 [==============================] - 6s 108ms/step - loss: 14.9308 - acc: 0.0737 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 3/10
56/56 [==============================] - 6s 108ms/step - loss: 14.7869 - acc: 0.0826 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 4/10
56/56 [==============================] - 6s 108ms/step - loss: 14.8948 - acc: 0.0759 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 5/10
56/56 [==============================] - 6s 109ms/step - loss: 14.8897 - acc: 0.0762 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 6/10
56/56 [==============================] - 6s 109ms/step - loss: 14.8178 - acc: 0.0807 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 7/10
56/56 [==============================] - 6s 108ms/step - loss: 15.0747 - acc: 0.0647 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 8/10
56/56 [==============================] - 6s 108ms/step - loss: 14.7509 - acc: 0.0848 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 9/10
56/56 [==============================] - 6s 108ms/step - loss: 14.8948 - acc: 0.0759 - val_loss: 14.8719 - val_acc: 0.0773
Epoch 10/10
56/56 [==============================] - 6s 108ms/step - loss: 14.8228 - acc: 0.0804 - val_loss: 14.8719 - val_acc: 0.0773
Because it's not learning anything, I'm starting to think that it's not my model's fault, but maybe the dataset or how I train it. I have tried training with "adam" as well, but I get the same result. I have tried changing the input size of the images, but still, the same result. Below is a sample from my dataset. Do you guys have any ideas what could be wrong?
I think the main issue currently is that your model has way too many parameters relative to how few samples you have for training. For image classification nowadays, you generally want to just have conv layers, a GlobalSomethingPooling layer, and then a single Dense layer for your outputs. You just need to make sure that your conv section ends up with a large enough receptive field to be able to find all the features you need.
The first thing to think about is making sure that you have a large enough receptive field (further reading about that here). There are three main ways to achieve that: pooling, stride >= 2, and/or dilation >= 2. Because you have only 13 "features" that you want to identify, and all of them will always be pixel-perfect, I'm thinking dilation will be the way to go so that the model can easily "overfit" on those 13 "features". If we use 4 conv layers with dilations of 1, 2, 4, and 8, respectively, then we'll end up with a receptive field of 31. This should be enough to easily recognize 50-pixel emoji.
Next, how many filters should each layer have? Normally, you start with a few filters, and increase as you go through the model, as you are doing here. However, we want to "overfit" on specific features, so we should probably increase the amounts in the earlier layers. Just to make it easy, let's give all layers 64 filters.
Last, how do we convert this all to a single prediction? Rather than using a dense layer, which would use a ton of parameters and would not be translation invariant, people nowadays use GlobalAveragePooling or GlobalMaxPooling. GlobalAveragePooling is more common, because it's good for helping to find combinations of many features. However, we just want to find 13 or so exact features here, so GlobalMaxPooling might work even better. Then a single dense layer after that will be enough to get a prediction. Because we're using GlobalMaxPooling, it doesn't need to be flattened first—global pooling already does that for us.
Resulting model:
Conv2D(64, kernel_size=3, activation="relu", input_shape=IMG_SIZE)
Conv2D(64, kernel_size=3, dilation_rate=2, activation="relu")
Conv2D(64, kernel_size=3, dilation_rate=4, activation="relu")
Conv2D(64, kernel_size=3, dilation_rate=8, activation="relu")
GlobalMaxPooling2D()
Dense(NUM_CLASSES, activation='softmax', name="Output")
Try that. You'll probably also want to add BatchNormalization layers after every layer except the last. Once you get decent training accuracy, check whether it's overfitting. If it is (which is likely), try these steps:
batch norm if you haven't already
weight decay on all conv and dense layers
SpatialDropout2D after the conv layers and regular Dropout after the pooling layer
augment your data with an ImageDataGenerator
Designing nets like this is almost more of an art than a science, so change stuff around if you think you should. Eventually, you will get an intuition for what is more or less likely to work in any given situation.
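Assembled as a runnable sketch with the suggested BatchNormalization layers (IMG_SIZE here is an assumed placeholder, since the question's actual value isn't shown):
from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, GlobalMaxPooling2D, Dense

IMG_SIZE = (128, 128, 3)  # assumed input shape; substitute your own
NUM_CLASSES = 13

model = Sequential()
# Dilations 1, 2, 4, 8 give a receptive field of 31 pixels with only 4 layers
model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=IMG_SIZE))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=3, dilation_rate=2, activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=3, dilation_rate=4, activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=3, dilation_rate=8, activation='relu'))
model.add(BatchNormalization())
# Global max pooling picks the strongest response per filter and flattens
model.add(GlobalMaxPooling2D())
model.add(Dense(NUM_CLASSES, activation='softmax', name='Output'))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])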
I'm creating a very simple 2-layer feed-forward network, but am finding that the loss is not updating at all. I have some ideas, but I wanted to get additional feedback/guidance.
Details about the data:
X_train:
(336876, 158)
X_dev:
(42109, 158)
Y_train counts:
0 285793
1 51083
Name: default, dtype: int64
Y_dev counts:
0 35724
1 6385
Name: default, dtype: int64
And here is my model architecture:
# define the architecture of the network
model = Sequential()
model.add(Dense(50, input_dim=X_train.shape[1], init="uniform", activation="relu"))
model.add(Dense(3print("[INFO] compiling model...")
adam = Adam(lr=0.01)
model.compile(loss="binary_crossentropy", optimizer=adam,
metrics=['accuracy'])
model.fit(np.array(X_train), np.array(Y_train), epochs=12, batch_size=128, verbose=1)Dense(1, activation = 'sigmoid'))
Now, with this, my loss after the first few epochs are as follows:
Epoch 1/12
336876/336876 [==============================] - 8s - loss: 2.4441 - acc: 0.8484
Epoch 2/12
336876/336876 [==============================] - 7s - loss: 2.4441 - acc: 0.8484
Epoch 3/12
336876/336876 [==============================] - 6s - loss: 2.4441 - acc: 0.8484
Epoch 4/12
336876/336876 [==============================] - 7s - loss: 2.4441 - acc: 0.8484
Epoch 5/12
336876/336876 [==============================] - 7s - loss: 2.4441 - acc: 0.8484
Epoch 6/12
336876/336876 [==============================] - 7s - loss: 2.4441 - acc: 0.8484
Epoch 7/12
336876/336876 [==============================] - 7s - loss: 2.4441 - acc: 0.8484
Epoch 8/12
336876/336876 [==============================] - 6s - loss: 2.4441 - acc: 0.8484
Epoch 9/12
336876/336876 [==============================] - 6s - loss: 2.4441 - acc: 0.8484
And when I test the model afterwards, my f1_score is 0. My main thought was that I may need more data, but I'd still expect it to perform better than it does now on the test set. Could it be that it is overfitting? I added Dropout, but no luck there either.
Any help would be much appreciated.
At first glance, I believe that your learning rate is too high. Also, please consider normalizing your data, especially if different features have different ranges of values (look at scaling). Also, please consider changing your layer activations depending on whether your labels are multi-class or not. Assuming your code is of this form (you seem to have some typos in the problem description):
# define the architecture of the network
model = Sequential()
#also what is the init="uniform" argument? I did not find this in keras documentation, consider removing this.
model.add(Dense(50, input_dim=X_train.shape[1], init="uniform", activation="relu"))
model.add(Dense(1, activation='sigmoid'))
#a slightly more conservative learning rate, play around with this.
adam = Adam(lr=0.0001)
model.compile(loss="binary_crossentropy", optimizer=adam,
metrics=['accuracy'])
model.fit(np.array(X_train), np.array(Y_train), epochs=12, batch_size=128,
verbose=1)
This should lead the loss to converge. If not, please consider deepening your neural net (think about how many parameters you may need).
Consider adding the classification layer before compiling your model:
model.add(Dense(1, activation = 'sigmoid'))
adam = Adam(lr=0.01)
model.compile(loss="binary_crossentropy", optimizer=adam,
metrics=['accuracy'])
model.fit(np.array(X_train), np.array(Y_train), epochs=12, batch_size=128, verbose=1)