Firstly, I know that similar questions have been asked before, but mainly for classification problems. Mine is a regression-style problem.
I am trying to train a neural network using keras to evaluate chess positions using stockfish evaluations. The input is boards in a (12,8,8) array (representing piece placement for each individual piece) and output is the evaluation in pawns. When training, the loss stagnates at around 500,000-600,000. I have a little over 12 million boards + evaluations and I train on all the data at once. The loss function is MSE.
This is my current code:
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten

model = Sequential()
model.add(Dense(16, activation = "relu", input_shape = (12, 8, 8)))
model.add(Dropout(0.2))
model.add(Dense(16, activation = "relu"))
model.add(Dense(10, activation = "relu"))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1, activation = "linear"))
model.compile(optimizer = "adam", loss = "mean_squared_error", metrics = ["mse"])
model.summary()
# model = load_model("model.h5")
boards = np.load("boards.npy")
evals = np.load("evals.npy")
perf = model.fit(boards, evals, epochs = 10).history
model.save("model.h5")
plt.figure(dpi = 600)
plt.title("Loss")
plt.plot(perf["loss"])
plt.show()
This is the output of a previous epoch:
145856/398997 [=========>....................] - ETA: 26:23 - loss: 593797.4375 - mse: 593797.4375
The loss will remain at 570,000-580,000 upon further fitting, which is not ideal. The loss should decrease by a few more orders of magnitude if I am not wrong.
What is the problem and how can I fix it to make the model learn better?
I would suspect that your evaluation data contains very big values, like 100,000 pawns if one of the sides forcibly wins. Then, if your model predicts something like 0 in the same position, the squared error is very high, and this pushes the MSE up as well. You might want to check your evaluation data and ensure the values are in some limited range like [-20, 20].
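For example, a quick check and clip along these lines (a sketch assuming evals is the NumPy array of evaluations in pawns loaded in the question) keeps the targets bounded:
import numpy as np

evals = np.load("evals.npy")
print(evals.min(), evals.max())      # inspect the raw range first
evals = np.clip(evals, -20, 20)      # cap forced-win scores to a limited range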
Furthermore, evaluating a chess position is a very complex problem. It looks like your model has too few parameters for the task. Possible improvements:
Increase the number of neurons in your dense layers (say to 300, 200, 100).
Increase the number of hidden layers (say to 10).
Use convolutional layers (a rough sketch follows this list).
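A minimal convolutional sketch for the (12, 8, 8) board encoding might look like this (filter counts and depths are illustrative, not tuned):
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

model = Sequential()
# channels_first because the 12 piece planes come first in the (12, 8, 8) input
model.add(Conv2D(64, (3, 3), padding="same", activation="relu",
                 data_format="channels_first", input_shape=(12, 8, 8)))
model.add(Conv2D(64, (3, 3), padding="same", activation="relu",
                 data_format="channels_first"))
model.add(Flatten())
model.add(Dense(256, activation="relu"))
model.add(Dense(1, activation="linear"))
model.compile(optimizer="adam", loss="mean_squared_error")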
Besides this, you might want to create a simple "baseline model" to better evaluate the performance of your neural network. This baseline model could be just a Python function which runs on the input data and evaluates each position by counting material (bishop = 3 pawns, rook = 5, etc.). Then you can run this function on your dataset and see its MSE. If your neural network produces a smaller MSE than this baseline model, then it is really learning some useful patterns.
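A rough sketch of such a baseline, assuming a hypothetical plane ordering (first six planes White's P, N, B, R, Q, K, last six Black's); adjust the ordering and values to your own encoding:
import numpy as np

PIECE_VALUES = np.array([1, 3, 3, 5, 9, 0])       # pawn, knight, bishop, rook, queen, king

def material_eval(board):                          # board: (12, 8, 8) array of piece planes
    white = board[:6].sum(axis=(1, 2)) @ PIECE_VALUES
    black = board[6:].sum(axis=(1, 2)) @ PIECE_VALUES
    return white - black                           # evaluation in pawns from White's point of view

baseline_preds = np.array([material_eval(b) for b in boards])
baseline_mse = np.mean((baseline_preds - evals) ** 2)
print("baseline MSE:", baseline_mse)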
I also recommend the following book: "Neural Networks For Chess: The magic of deep and reinforcement learning revealed" by Dominik Klein. The book contains a description of network architecture used in AlphaZero chess engine and a neural network used in Stockfish.
Related
I am working on a project to implement CNN-LSTM sentiment analysis. Below is the code
from keras.models import Sequential
from keras import regularizers
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Conv1D, MaxPool1D, Flatten, Dropout
from keras.layers import Embedding, LSTM, Bidirectional
from keras.layers import BatchNormalization
model7 = Sequential()
model7.add(Embedding(max_words, 40,input_length=max_len)) #The embedding layer
model7.add(Conv1D(20, 5, activation='relu', kernel_regularizer = regularizers.l2(l = 0.0001), bias_regularizer=regularizers.l2(0.01)))
model7.add(Dropout(0.5))
model7.add(Bidirectional(LSTM(20,dropout=0.5, kernel_regularizer=regularizers.l2(0.01), recurrent_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))))
model7.add(Dense(1,activation='sigmoid'))
model7.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
checkpoint7 = ModelCheckpoint("best_model7.hdf5", monitor='val_accuracy', verbose=1,save_best_only=True, mode='auto', period=1,save_weights_only=False)
history = model7.fit(X_train_padded, y_train, epochs=10,validation_data=(X_test_padded, y_test),callbacks=[checkpoint7])
Even after adding regularizers and dropout, my model has very high validation loss and low accuracy.
Epoch 3: val_accuracy improved from 0.54517 to 0.57010, saving model to best_model7.hdf5
2188/2188 [==============================] - 290s 132ms/step - loss: 0.4241 - accuracy: 0.8301 - val_loss: 0.9713 - val_accuracy: 0.5701
My train and test data:
train: (70000, 7)
test: (30000, 7)
train['sentiment'].value_counts()
1 41044
0 28956
test['sentiment'].value_counts()
1 17591
0 12409
Can anyone please let me know how to reduce overfitting?
Since your code works, I believe that your network is failing silently by 'not learning' a lot from the data. Here's a list of some of the things you can generally check:
Is your textual data well transformed into numerical data? Is it well represented using TF-IDF, bag of words, or any other method that returns a numerical representation?
I see that you imported batch normalization but you do not apply it. Batch norm actually helps and, most importantly, does the job of the regularizers, since each input to each layer is normalized using the mini-batch the network has seen. So maybe remove your L2 regularization in all layers and apply a simple batch norm instead, which should reduce overfitting (also, use it without the dropout, since some empirical studies show that the two should not be combined).
Your embedding output dimension is currently set to 40, that is, 40 numerical elements for a text vector that may contain more than 10,000 elements. That seems a bit low. Try something more 'standard' such as 128 or 256 instead of 40.
Lastly, you use the Adam optimizer with all the default parameters. However, the learning rate can have a big impact on the way your loss function is minimized. As I am sure you know, the gradient step uses this learning rate to progress in its calculation of the derivatives for each neuron. The default is learning_rate=0.001. So try the following code and increase the learning rate a bit (for example 0.01 or even 0.1).
A simple example:
from keras.models import Sequential
from keras.layers import LSTM, BatchNormalization, Dense
import keras

# define model
model = Sequential()
model.add(LSTM(32))  # or CNN
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))  # sigmoid output for binary classification
# define optimizer with a larger learning rate
optimizer = keras.optimizers.Adam(0.01)
# define loss function
loss = keras.losses.binary_crossentropy
# define metric to optimize
metric = [keras.metrics.BinaryAccuracy(name='accuracy')]  # you can add more
# compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metric)
Final thought: I see that you went for a combination of CNN and LSTM, which has great merit. However, it is always recommended to try a simple MLP network to establish a baseline score that you may later try to beat. Does a simple MLP with 1 or 2 layers and not a lot of units produce a low accuracy score as well? If it performs better, then maybe the problem is in the implementation or in the hyperparameters that you chose for the layers (or even theoretical).
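For instance, a rough MLP baseline might look like this (max_words and max_len are taken from the question; the layer sizes are illustrative):
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

baseline = Sequential()
baseline.add(Embedding(max_words, 128, input_length=max_len))
baseline.add(Flatten())
baseline.add(Dense(64, activation='relu'))
baseline.add(Dense(1, activation='sigmoid'))
baseline.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
baseline.fit(X_train_padded, y_train, epochs=5, validation_data=(X_test_padded, y_test))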
I hope this answer helps and cheers!
I have the following Neural Network:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(50, activation='relu', return_sequences=True, input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(150, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(10))
model.add(Dropout(0.3))
model.add(Dense(2, activation='softmax')) # Activation_layer
opt = Adam(lr=1e-3, decay=1e-6)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
The network is fed sequential data and tries to classify each sample as either 1 or 0.
Example of one of the samples:
X:
[[0.56450562 0.69825955 0.57768099 0.69077864]
[0.58818427 0.70355375 0.61725885 0.30270281]
[0.57407927 0.72501532 0.59603936 0.29196058]
[0.56501804 0.69072662 0.59064673 0.66034622]
[0.56552001 0.70354009 0.59136487 0.1586415 ]
[0.56501496 0.68205159 0.57877241 0.62252169]
[0.54535762 0.67067675 0.58414928 0.9077868 ]
[0.56197241 0.71226839 0.5920788 0.1339519 ]
[0.57308813 0.70469134 0.59749238 0.27085101]
[0.56146488 0.69258436 0.58377929 0.7065891 ]
[0.55943607 0.69106406 0.59569036 0.69378783]
[0.5670203 0.68271571 0.58702014 0.70585781]
[0.58320254 0.71228948 0.60867704 0.19280208]
[0.56904526 0.71490986 0.59027546 0.35757948]
[0.56398908 0.67858148 0.58197139 0.75064535]
[0.57005691 0.7062191 0.60363236 0.38345417]
[0.5705625 0.70394121 0.58630169 0.19171352]
[0.56145905 0.69106039 0.58340288 0.76821359]
[0.55183665 0.68991404 0.5935228 0.53419864]
[0.56549613 0.68800419 0.58013082 0.74470123]
[0.54926442 0.67315638 0.58336904 0.77819332]
[0.56802882 0.71842805 0.60222782 0.12845991]
[0.59591035 0.70927878 0.61161172 0.68023463]
[0.56904526 0.713053 0.58773435 0.20017562]
[0.58321778 0.69939555 0.61194041 0.47063807]
[0.57814777 0.71113559 0.58991151 0.62149082]
[0.56044844 0.69257776 0.58738045 0.39285414]
[0.56853912 0.70091102 0.59713724 0.21938703]
[0.56398364 0.69939514 0.59316136 0.43031303]
[0.56701957 0.69901619 0.5935228 0.39333831]
[0.56701916 0.68082684 0.58701647 0.84346823]
[0.57765044 0.70812209 0.60147335 0.38961049]
[0.58975543 0.71340576 0.6050683 0.61008348]
[0.57207508 0.70280098 0.59821004 0.44573693]
[0.56702537 0.71035313 0.59424384 0.30333905]
[0.58417429 0.69901619 0.60288387 0.7210835 ]
[0.56400225 0.70128289 0.59028243 0.42721302]
[0.5725759 0.70241467 0.60000056 0.22784863]
[0.57055816 0.69561772 0.59136355 0.66855609]
[0.58766922 0.70995564 0.60538235 0.71163122]
[0.57206444 0.69788453 0.59567842 0.707679 ]
[0.5775922 0.70956495 0.60249313 0.32745877]
[0.57407031 0.6997696 0.57952909 0.54327415]
[0.55346759 0.69223554 0.58920848 0.27867972]
[0.58612784 0.7031614 0.617901 0.76338596]
[0.58659902 0.72005896 0.60604811 0.48696192]
[0.57004823 0.70539865 0.59173347 0.47288217]
[0.57405756 0.7023936 0.59030119 0.49981083]
[0.55801818 0.68813345 0.58564415 0.38486918]
[0.55900944 0.69300306 0.58527681 0.41875207]
[0.56351994 0.68585174 0.58239563 0.70965566]
[0.5509523 0.69524821 0.59280378 0.46280846]
[0.56753474 0.69713124 0.59172507 0.29915786]
[0.56753451 0.69939326 0.5978358 0.59996518]
[0.56954889 0.69109776 0.57734904 0.27905973]
[0.55595081 0.68429475 0.59424321 0.86881108]
[0.57005376 0.71486763 0.60215717 0.20096972]
[0.57509255 0.70467308 0.59028491 0.29196681]
[0.5584625 0.68958804 0.59028342 0.24039387]
[0.57005412 0.70203582 0.5964024 0.59344888]]
y:
1
The issue I am having is that the loss starts out at around 0.69 and never decreases significantly (it fluctuates a bit), and the loss and validation loss both stay around 0.5.
What I've tried so far:
Checked training and validation data for NaNs or values < 0 or > 1 -> none found
Reduced sample size dramatically (down to 50 samples) and a network that should be more than large enough to overfit, but alas still the same result.
Preprocessed the data in a completely different way
Using sigmoid activation instead of softmax to classify the labels.
Reduced learning rate
Removing the second last dense layer
Used LeakyReLU with alpha=0.05
Although the data could be next to random, shouldn't a sufficiently large network easily overfit onto 50 samples or less?
2 suggestions:
It appears you have a binary classification problem (either 0 or 1), so perhaps you could try a binary cross entropy loss instead (see the sketch after this list)?
Are you using a method such as to_categorical to one-hot encode your labels?
Other factors that can sometimes dramatically affect accuracies that you haven't mentioned trying/changing:
Using different optimizers
Exploring different architectures: have you considered maybe a CNN-LSTM model? Have you tested different architectures, and do some learn better than others?
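A minimal sketch of the binary setup (reusing X_train from the question; layer sizes are illustrative):
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(50, input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))   # single unit instead of Dense(2, softmax)
model.compile(optimizer=Adam(1e-3),
              loss='binary_crossentropy',   # matches the 0/1 labels directly, no one-hot needed
              metrics=['accuracy'])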
I want to classify patterns in images. My original images are 200,000 x 200,000 pixels; I resize them to 96 x 96, and the patterns are still recognizable to the human eye. Pixel values are 0 or 1.
I'm using the following neural network.
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout

train_X, test_X, train_Y, test_Y = train_test_split(cnn_mat, img_bin["Classification"], test_size = 0.2, random_state = 0)
class_weights = class_weight.compute_class_weight('balanced',
np.unique(train_Y),
train_Y)
train_Y_one_hot = to_categorical(train_Y)
test_Y_one_hot = to_categorical(test_Y)
train_X,valid_X,train_label,valid_label = train_test_split(train_X, train_Y_one_hot, test_size=0.2, random_state=13)
model = Sequential()
model.add(Conv2D(24,kernel_size=3,padding='same',activation='relu',
input_shape=(96,96,1)))
model.add(MaxPool2D())
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(16, activation='softmax'))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
train = model.fit(train_X, train_label, batch_size=80,epochs=20,verbose=1,validation_data=(valid_X, valid_label),class_weight=class_weights)
I have already run some experiments to find a "good" number of hidden layers and fully connected layers. It's probably not the most optimal architecture, since my computer is slow: I just ran different models once and selected the best one using the confusion matrix, and I didn't use cross validation. I didn't try more complex architectures since my amount of data is small, and I have read that small architectures are best. Is it worth trying a more complex architecture?
Here are the results with 5 and 12 epochs, batch size 80. This is the confusion matrix for my test set.
As you can see, it looks like I'm overfitting. When I only run 5 epochs, most of the classes are assigned to class 0; with more epochs, class 0 is less dominant but the classification is still bad.
I added 0.8 dropout after each convolutional layer, e.g.:
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
With dropout, 95% of my images are classified as class 0.
I tried image augmentation; I rotated all of my training images and still used the class weights, but the results didn't improve. Should I try to augment only the classes with a small number of images? Most of what I have read says to augment the whole dataset...
To summarize, my questions are:
Should I try a more complex model?
Is it useful to do image augmentation only on the underrepresented classes? If so, should I still use class weights (I guess not)?
Can I hope to find a "good" model with a CNN given the size of my dataset?
Given the imbalanced data, I think it is better to create a custom data generator for your model so that each generated batch contains at least one sample from each class. It is also better to use the Dropout layer after each dense layer instead of after the conv layers. For data augmentation it is better to use at least a combination of rotation, horizontal flip and vertical flip. There are some other approaches for data augmentation, like using a GAN network or random pixel replacement.
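A rough sketch of such a balanced batch generator (names and the batch size are illustrative; train_X and train_label are the arrays from the question):
import numpy as np

def balanced_batches(X, y_onehot, batch_size=80):
    labels = y_onehot.argmax(axis=1)
    classes = np.unique(labels)
    per_class = max(1, batch_size // len(classes))
    while True:
        # draw the same number of indices (with replacement) from every class
        idx = np.concatenate([
            np.random.choice(np.where(labels == c)[0], per_class, replace=True)
            for c in classes
        ])
        np.random.shuffle(idx)
        yield X[idx], y_onehot[idx]

model.fit_generator(balanced_batches(train_X, train_label),
                    steps_per_epoch=len(train_X) // 80, epochs=20,
                    validation_data=(valid_X, valid_label))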
For GANs you can check this SO post.
For using a GAN as a data augmenter you can read this article.
For a combination of pixel-level augmentation and GANs, see the article on pixel-level data augmentation.
What I used, in a different setting, was to upsample my data with ADASYN. This algorithm calculates the amount of new data required to balance your classes, and then takes the available data to sample novel examples.
There is a Python implementation (in the imbalanced-learn package). Otherwise, note that you also have very little data. SVMs perform well even with little data. You might want to try them, or other image classification algorithms, depending on whether the expected pattern is always at the same position or varies. Then you could also try the Viola–Jones object detection framework.
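An illustrative sketch using the imbalanced-learn implementation (ADASYN expects a 2-D feature matrix, so the 96x96 images are flattened, resampled, and reshaped back; train_X and train_Y are from the question and each class needs at least a handful of samples):
import numpy as np
from imblearn.over_sampling import ADASYN

X_flat = train_X.reshape(len(train_X), -1)            # (n_samples, 96*96)
X_res, y_res = ADASYN().fit_resample(X_flat, train_Y)
X_res = X_res.reshape(-1, 96, 96, 1)                  # back to image shape for the CNN
y_res_one_hot = to_categorical(y_res)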
While trying to implement an LSTM network for trajectory classification, I have been struggling to get decent classification results even for simple trajectories. Also, my training accuracy keeps fluctuating without increasing significantly, as can also be seen in TensorBoard:
Training accuracy:
This is my model:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40,2)))
model1.add(LSTM(8,return_sequences=True))
model1.add(LSTM(8,return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
and my training code:
model1.compile(optimizer='adagrad',loss='binary_crossentropy', metrics=['accuracy'])
hist1 = model1.fit(dataScatter[:,70:110,:],outputScatter,validation_split=0.25,epochs=50, batch_size=20, callbacks = [tensorboard], verbose = 2)
I think the problem is probably due to the input and output shapes of the data, since the model itself seems to be fine. The data input has shape (2000, 40, 2) and the output has shape (2000, 1).
Can anyone spot a mistake?
Try to change:
model1.add(Dense(1, activation='sigmoid'))
to:
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
TimeDistributed applies the same Dense layer (the same weights) to the LSTM's output at each time step.
I recommend this tutorial as well https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/ .
I was able to increase the accuracy to 97% with a few adjustments that were data related. The main obstacle was an unbalanced dataset split for the training and validation set. Further improvements came from normalizing the input trajectories. I also increased the number of cells in the first layer.
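A rough sketch of those data-side fixes, a class-balanced (stratified) train/validation split and per-feature normalization of the trajectories (dataScatter and outputScatter are from the question; everything else is illustrative):
import numpy as np
from sklearn.model_selection import train_test_split

X = dataScatter[:, 70:110, :]
X_train, X_val, y_train, y_val = train_test_split(
    X, outputScatter, test_size=0.25, stratify=outputScatter, random_state=0)

# normalize each trajectory feature using statistics from the training set only
mean = X_train.mean(axis=(0, 1))
std = X_train.std(axis=(0, 1)) + 1e-8
X_train = (X_train - mean) / std
X_val = (X_val - mean) / std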
I am trying to estimate the third band (blue) of an RGB image using a convolutional neural network. My design in Keras is a sequential model with a Convolution2D input layer, two hidden layers, and an output neuron. If I want the loss (RMSE) to be zero, how should I change my model?
My model in Python goes like this:
import numpy as np
import skimage.io
import keras

in_image = skimage.io.imread('test.jpg')[0:50,0:50,:].astype(float)
data = in_image[:,:,0:2]
target = in_image[:,:,2:3]
model1 = keras.models.Sequential()
model1.add(keras.layers.Convolution2D(50,(3,3),strides = (1,1),padding = "same",input_shape=(None,None,2))) #Convolution Layer
model1.add(keras.layers.Dense(50,activation = 'relu')) # Hidden layer 1
model1.add(keras.layers.Dense(50,activation = 'sigmoid')) # Hidden Layer 2
model1.add(keras.layers.Dense(1)) # Output Layer
adadelta = keras.optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=1e-08, decay=0.0)
model1.compile(loss='mean_squared_error', optimizer=adadelta) # Compile the model
model1.fit(np.array([data]),np.array([target]),epochs = 5000)
estimated_band = model1.predict(np.array([data]))
Given your problem setup, it looks like you're trying to train a neural network on one image such that it is able to predict the blue channel from the other two channels. Putting aside the usefulness of such an experiment, there are a few important factors in training neural networks properly, including:
learning rate
weight initialization
optimizer
model complexity.
Yann LeCun's Efficient BackProp is a late-90s paper that covers numbers 1, 2 and 3. Number 4 rests on the assumption that, as the number of free parameters increases, at some point you'll be able to match each parameter to each output.
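As a minimal sketch of making points 1-3 explicit in the model from the question (the particular learning rate and initializers are illustrative, not recommendations):
import keras

model = keras.models.Sequential()
model.add(keras.layers.Convolution2D(50, (3, 3), padding="same", activation="relu",
                                     kernel_initializer="he_normal",   # explicit weight init
                                     input_shape=(None, None, 2)))
model.add(keras.layers.Dense(50, activation="relu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(1, kernel_initializer="glorot_uniform"))
optimizer = keras.optimizers.Adadelta(lr=0.5)                          # explicit learning rate
model.compile(loss="mean_squared_error", optimizer=optimizer)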
Note that achieving zero-loss provides no guarantees on generalization nor does it mean that your model will not generalize, as brilliantly described in a paper presented at ICLR.