Negative huge loss in tensorflow - python

I am trying to predict price values from a dataset using Keras. I am following this tutorial: https://keras.io/examples/structured_data/structured_data_classification_from_scratch/, but when I get to fitting the model, I get a huge negative loss and a very small accuracy:
Epoch 1/50
1607/1607 [==============================] - ETA: 0s - loss: -117944.7500 - accuracy: 3.8897e-05
2022-05-22 11:14:28.922065: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
1607/1607 [==============================] - 15s 10ms/step - loss: -117944.7500 - accuracy: 3.8897e-05 - val_loss: -123246.0547 - val_accuracy: 7.7791e-05
Epoch 2/50
1607/1607 [==============================] - 15s 9ms/step - loss: -117944.7734 - accuracy: 3.8897e-05 - val_loss: -123246.0547 - val_accuracy: 7.7791e-05
Epoch 3/50
1607/1607 [==============================] - 15s 10ms/step - loss: -117939.4844 - accuracy: 3.8897e-05 - val_loss: -123245.9922 - val_accuracy: 7.7791e-05
Epoch 4/50
1607/1607 [==============================] - 16s 10ms/step - loss: -117944.0859 - accuracy: 3.8897e-05 - val_loss: -123245.9844 - val_accuracy: 7.7791e-05
Epoch 5/50
1607/1607 [==============================] - 15s 10ms/step - loss: -117944.7422 - accuracy: 3.8897e-05 - val_loss: -123246.0547 - val_accuracy: 7.7791e-05
Epoch 6/50
1607/1607 [==============================] - 15s 10ms/step - loss: -117944.8203 - accuracy: 3.8897e-05 - val_loss: -123245.9766 - val_accuracy: 7.7791e-05
Epoch 7/50
1607/1607 [==============================] - 15s 10ms/step - loss: -117944.8047 - accuracy: 3.8897e-05 - val_loss: -123246.0234 - val_accuracy: 7.7791e-05
Epoch 8/50
1607/1607 [==============================] - 15s 10ms/step - loss: -117944.7578 - accuracy: 3.8897e-05 - val_loss: -123245.9766 - val_accuracy: 7.7791e-05
Epoch 9/50
This is my graph; as far as the code goes, it looks like the one from the example, but adapted:
# Categorical feature encoded as string
desc = keras.Input(shape=(1,), name="desc", dtype="string")
# Numerical features
date = keras.Input(shape=(1,), name="date")
quant = keras.Input(shape=(1,), name="quant")
all_inputs = [
    desc,
    quant,
    date,
]
# String categorical features
desc_encoded = encode_categorical_feature(desc, "desc", train_ds)
# Numerical features
quant_encoded = encode_numerical_feature(quant, "quant", train_ds)
date_encoded = encode_numerical_feature(date, "date", train_ds)
all_features = layers.concatenate(
    [
        desc_encoded,
        quant_encoded,
        date_encoded,
    ]
)
x = layers.Dense(32, activation="sigmoid")(all_features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="relu")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
And the dataset looks like this:
date desc quant price
0 20140101.0 CARBONATO DE DIMETILO 999.00 1428.57
1 20140101.0 HIDROQUINONA 137.00 1314.82
2 20140101.0 1,5 PENTANODIOL TECN. 495.00 2811.60
3 20140101.0 SOSA CAUSTICA LIQUIDA 50% 567160.61 113109.14
4 20140101.0 BOROHIDRURO SODICO 6.24 299.27
Also, I am converting the date from YYYY-MM-DD to a number using:
dataset['date'] = pd.to_datetime(dataset["date"]).dt.strftime("%Y%m%d").astype('float64')
What am I doing wrong? :(
EDIT: I thought the encoder function from the tutorial was normalizing the data, but it wasn't. Is there any other tutorial you know of that can guide me better? The loss problem has been fixed! (It was due to the missing normalization.)

You seem to be quite confused by the components of your model.
Binary cross-entropy is a classification loss; your problem is regression, so use MSE. "Accuracy" also makes no sense for regression; change it to MSE as well.
Your data is huge, and thus your loss is huge. You have a price of 113109.14 in the data; what if your model is bad initially and predicts 0? You get a loss of roughly 100,000^2 = 10,000,000,000. Normalise your data, in your case the output variable (the target, price), so it lies between -1 and 1.
There are some use cases where an output neuron should have an activation function, but unless you know why you are doing this, leaving it linear is a much safer choice.
Dropout is a method for regularising your model. Do not start with it; always begin with the simplest possible model, make sure it can learn, and only then try to maximise test score.
Neural networks do not extrapolate, so feeding in an ever-growing signal (the date) in raw format will almost surely cause problems.
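Putting those points together, here is a minimal regression sketch. It is not your exact pipeline: the toy features and layer sizes are illustrative stand-ins, and the target is standardised here rather than mapped exactly to [-1, 1]; the actual fixes are the target normalisation, the linear output, and the MSE loss.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy numeric features and a large-valued price target
# (stand-ins for the encoded features in the question).
X = np.random.rand(1000, 3).astype("float32")
y = (50_000 + 100_000 * np.random.rand(1000, 1)).astype("float32")

# Normalise the target so the loss stays in a sane range.
y_mean, y_std = y.mean(), y.std()
y_norm = (y - y_mean) / y_std

inputs = keras.Input(shape=(3,))
x = layers.Dense(32, activation="relu")(inputs)
output = layers.Dense(1)(x)  # linear output for regression
model = keras.Model(inputs, output)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X, y_norm, epochs=5, verbose=0)

# Undo the scaling when reading predictions back.
pred = model.predict(X[:5]) * y_std + y_mean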

Related

Multilabel classification with imbalanced dataset

I am trying to do a multilabel classification problem, which has an imbalanced dataset.
The total number of samples is 1130; out of these, the first class occurs in 913 of them, the second class 215 times, and the third one 423 times.
In the model architecture, I have 3 output nodes, and have applied sigmoid activation.
input_tensor = Input(shape=(256, 256, 3))
base_model = VGG16(input_tensor=input_tensor,weights='imagenet',pooling=None, include_top=False)
#base_model.summary()
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = tf.math.reduce_max(x,axis=0,keepdims=True)
x = Dense(512,activation='relu')(x)
output_1 = Dense(3, activation='sigmoid')(x)
sagittal_model_abn = Model(inputs=base_model.input, outputs=output_1)
for layer in base_model.layers:
    layer.trainable = True
I am using binary cross-entropy loss, which I calculate using the function below.
I am using a weighted loss to deal with the imbalance.
if y_true[0] == 1:
    loss_abn = -1 * K.log(y_pred[0][0]) * cwb[0][1]
elif y_true[0] == 0:
    loss_abn = -1 * K.log(1 - y_pred[0][0]) * cwb[0][0]
if y_true[1] == 1:
    loss_acl = -1 * K.log(y_pred[0][1]) * cwb[1][1]
elif y_true[1] == 0:
    loss_acl = -1 * K.log(1 - y_pred[0][1]) * cwb[1][0]
if y_true[2] == 1:
    loss_men = -1 * K.log(y_pred[0][2]) * cwb[2][1]
elif y_true[2] == 0:
    loss_men = -1 * K.log(1 - y_pred[0][2]) * cwb[2][0]
loss_value_ds = loss_abn + loss_acl + loss_men
cwb contains the class weights.
y_true is the ground truth labels having length 3.
y_pred is a numpy array with shape (1,3)
I weight the classes individually as occurrence and non-occurrence of the classes.
Like, if the label is 1, I count it as an occurrence and if it is 0, then it is a non-occurrence.
So, the first class's label 1 occurs 913 times out of 1130
So the class weight of label 1 for the first class is 1130/913, which is about 1.23, and the weight of label 0 for the first class is 1130/(1130-913).
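For illustration, per-label weights like cwb can be computed along these lines (the labels array below is a random stand-in, not the real data):
import numpy as np

labels = np.random.randint(0, 2, size=(1130, 3))  # stand-in for the real (N, 3) binary label matrix
n = len(labels)

cwb = {}
for c in range(labels.shape[1]):
    pos = labels[:, c].sum()         # number of samples where this label is 1
    cwb[c] = {0: n / (n - pos),      # weight for label 0 (non-occurrence)
              1: n / pos}            # weight for label 1 (occurrence)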
When I train the model, the accuracy oscillates (or stays almost the same), and the loss decreases.
And I am getting predictions like this for every sample
[[0.51018655 0.5010625 0.50482965]]
The prediction values are in the range 0.49 - 0.51 in every iteration for all the classes
I tried changing the number of nodes in the FC layer, but it still behaves the same way.
Can anyone help?
Does using tf.math.reduce_max cause the problem? Would it be beneficial to use @tf.function for the operation that I am doing with tf.math.reduce_max?
NOTE:
I am weighting the labels 1 and 0 for each class separately.
cwb = {0: {0: 5.207373271889401, 1: 1.2376779846659365},
1: {0: 1.2255965292841648, 1: 5.4326923076923075},
2: {0: 1.5416098226466575, 1: 2.8463476070528966}}
EDIT:
The results when I train using model.fit().
Epoch 1/20
1130/1130 [==============================] - 1383s 1s/step - loss: 4.1638 - binary_accuracy: 0.4558 - val_loss: 5.0439 - val_binary_accuracy: 0.3944
Epoch 2/20
1130/1130 [==============================] - 1397s 1s/step - loss: 4.1608 - binary_accuracy: 0.4165 - val_loss: 5.0526 - val_binary_accuracy: 0.5194
Epoch 3/20
1130/1130 [==============================] - 1402s 1s/step - loss: 4.1608 - binary_accuracy: 0.4814 - val_loss: 5.1469 - val_binary_accuracy: 0.6361
Epoch 4/20
1130/1130 [==============================] - 1407s 1s/step - loss: 4.1722 - binary_accuracy: 0.4472 - val_loss: 5.0501 - val_binary_accuracy: 0.5583
Epoch 5/20
1130/1130 [==============================] - 1397s 1s/step - loss: 4.1591 - binary_accuracy: 0.4991 - val_loss: 5.0521 - val_binary_accuracy: 0.6028
Epoch 6/20
1130/1130 [==============================] - 1375s 1s/step - loss: 4.1596 - binary_accuracy: 0.5431 - val_loss: 5.0515 - val_binary_accuracy: 0.5917
Epoch 7/20
1130/1130 [==============================] - 1370s 1s/step - loss: 4.1595 - binary_accuracy: 0.4962 - val_loss: 5.0526 - val_binary_accuracy: 0.6000
Epoch 8/20
1130/1130 [==============================] - 1387s 1s/step - loss: 4.1591 - binary_accuracy: 0.5316 - val_loss: 5.0523 - val_binary_accuracy: 0.6028
Epoch 9/20
1130/1130 [==============================] - 1391s 1s/step - loss: 4.1590 - binary_accuracy: 0.4909 - val_loss: 5.0521 - val_binary_accuracy: 0.6028
Epoch 10/20
1130/1130 [==============================] - 1400s 1s/step - loss: 4.1590 - binary_accuracy: 0.5369 - val_loss: 5.0519 - val_binary_accuracy: 0.6028
Epoch 11/20
1130/1130 [==============================] - 1397s 1s/step - loss: 4.1590 - binary_accuracy: 0.4808 - val_loss: 5.0519 - val_binary_accuracy: 0.6028
Epoch 12/20
1130/1130 [==============================] - 1394s 1s/step - loss: 4.1590 - binary_accuracy: 0.5469 - val_loss: 5.0522 - val_binary_accuracy: 0.6028
I would try the label powerset method.
Instead of 3 output nodes, try setting that to the total number of label combinations possible in your dataset. For example, for a multi-label classification with 3 distinct classes, there are 7 possible non-empty combinations.
Say, labels are A, B and C. Map output 0 to A, 1 to B, 2 to C, 3 to AB, 4 to AC and so on.
Using a simple transformation function before training and for testing, this problem can be converted to a multi-class, single label problem.
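A minimal sketch of that transformation, assuming y is an (N, 3) binary multilabel array; the bit-pattern encoding below is just one illustrative convention:
import numpy as np

y = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# Encode each row as a single integer class (bit pattern of the three labels),
# giving up to 8 classes, or 7 if the all-zero combination never occurs.
powerset_class = y[:, 0] * 4 + y[:, 1] * 2 + y[:, 2]

# Train a single-label classifier on powerset_class (e.g. a softmax output with
# sparse_categorical_crossentropy), then decode predictions back to the labels:
decoded = np.stack([(powerset_class // 4) % 2,
                    (powerset_class // 2) % 2,
                    powerset_class % 2], axis=1)
assert (decoded == y).all()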

High train accuracy poor test accuracy

I have a neural network which classifies 3 classes. My dataset is very small: I have 340 images for training and 60 images for testing. I built a model, and when I train it, my result is this:
Epoch 97/100
306/306 [==============================] - 46s 151ms/step - loss: 0.2453 - accuracy: 0.8824 - val_loss: 0.3557 - val_accuracy: 0.8922
Epoch 98/100
306/306 [==============================] - 47s 152ms/step - loss: 0.2096 - accuracy: 0.9031 - val_loss: 0.3795 - val_accuracy: 0.8824
Epoch 99/100
306/306 [==============================] - 47s 153ms/step - loss: 0.2885 - accuracy: 0.8627 - val_loss: 0.4501 - val_accuracy: 0.7745
Epoch 100/100
306/306 [==============================] - 46s 152ms/step - loss: 0.1998 - accuracy: 0.9150 - val_loss: 0.4586 - val_accuracy: 0.8627
When I predict the test images, test accuracy is poor.
What should I do? I also use ImageDataGenerator for data augmentation, but the result is the same. Is it because I have a small dataset?
You can use regularization on the fully connected layers. But given that you already have high validation accuracy, it is probably your data: your train data might not fully represent your test data. Try to analyze that, and make sure you do all the preprocessing on the test data before testing, exactly as you did for the train data.
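As a rough illustration of the regularization suggestion (the architecture, input size, and coefficients below are placeholders, not the asker's model):
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty on the FC layer
    layers.Dropout(0.3),                                     # dropout as extra regularization
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])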

Keras: val_loss is increasing and evaluate loss is too high

I'm new to Keras and I'm using it to build a normal neural network to classify the MNIST digits dataset.
Beforehand I have already split the data into 3 parts: 55000 to train, 5000 to evaluate, and 10000 to test, and I have scaled the pixel values down (by dividing by 255.0).
My model looks like this:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28,28]))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(10, activation='softmax'))
And here is the compile:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])
I train the model:
his = model.fit(xTrain, yTrain, epochs = 20, validation_data=(xValid, yValid))
At first the val_loss decreases, then it increases although the accuracy is increasing.
Train on 55000 samples, validate on 5000 samples
Epoch 1/20
55000/55000 [==============================] - 5s 91us/sample - loss: 0.2822 - accuracy: 0.9199 - val_loss: 0.1471 - val_accuracy: 0.9588
Epoch 2/20
55000/55000 [==============================] - 5s 82us/sample - loss: 0.1274 - accuracy: 0.9626 - val_loss: 0.1011 - val_accuracy: 0.9710
Epoch 3/20
55000/55000 [==============================] - 5s 83us/sample - loss: 0.0899 - accuracy: 0.9734 - val_loss: 0.0939 - val_accuracy: 0.9742
Epoch 4/20
55000/55000 [==============================] - 5s 84us/sample - loss: 0.0674 - accuracy: 0.9796 - val_loss: 0.0760 - val_accuracy: 0.9770
Epoch 5/20
55000/55000 [==============================] - 5s 94us/sample - loss: 0.0541 - accuracy: 0.9836 - val_loss: 0.0842 - val_accuracy: 0.9742
Epoch 15/20
55000/55000 [==============================] - 4s 82us/sample - loss: 0.0103 - accuracy: 0.9967 - val_loss: 0.0963 - val_accuracy: 0.9788
Epoch 16/20
55000/55000 [==============================] - 5s 84us/sample - loss: 0.0092 - accuracy: 0.9973 - val_loss: 0.0956 - val_accuracy: 0.9774
Epoch 17/20
55000/55000 [==============================] - 5s 82us/sample - loss: 0.0081 - accuracy: 0.9977 - val_loss: 0.0977 - val_accuracy: 0.9770
Epoch 18/20
55000/55000 [==============================] - 5s 85us/sample - loss: 0.0076 - accuracy: 0.9977 - val_loss: 0.1057 - val_accuracy: 0.9760
Epoch 19/20
55000/55000 [==============================] - 5s 83us/sample - loss: 0.0063 - accuracy: 0.9980 - val_loss: 0.1108 - val_accuracy: 0.9774
Epoch 20/20
55000/55000 [==============================] - 5s 85us/sample - loss: 0.0066 - accuracy: 0.9980 - val_loss: 0.1056 - val_accuracy: 0.9768
And when I evaluate, the loss is too high:
model.evaluate(xTest, yTest)
Result:
10000/10000 [==============================] - 0s 41us/sample - loss: 25.7150 - accuracy: 0.9740
[25.714989705941953, 0.974]
Is this ok, or is it a sign of overfitting? Should I do something to improve it? Thanks in advance.
Usually, it is not OK. You want the loss to be as small as possible. Your result is typical for overfitting: your network 'knows' its training data, but isn't capable of analysing new images. You may want to add some layers, maybe convolutional layers or a Dropout layer; another idea would be to augment your training images. The ImageDataGenerator class provided by Keras might help you out here.
Another thing to look at could be your hyperparameters. Why do you use 100 nodes in the first dense layer? Maybe something like 784 (28*28) would be more interesting if you want to start with a dense layer. I would suggest some combination of convolutional, dropout, and dense layers (see the sketch below); then your dense layer maybe doesn't need that many nodes...
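A minimal sketch of that convolutional-dropout-dense combination for 28x28 grayscale digits; the layer sizes are illustrative, not tuned:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                     # regularize before the classifier head
    layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])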

Keras train and validation metric values are different even when using same data (Logistic regression)

I have been trying to better understand the train/validation sequence in the Keras model fit() loop, so I tried out a simple training loop where I attempted to fit a simple logistic regression model with input data consisting of a single feature.
I feed the same data for both training and validation. Under those conditions, with the batch size equal to the total data size, one would expect to obtain exactly the same loss and accuracy for training and validation. But this is not the case.
Here is my code:
Generate some random data with two classes:
N = 100
x = np.concatenate([np.random.randn(N//2, 1), np.random.randn(N//2, 1)+2])
y = np.concatenate([np.zeros(N//2), np.ones(N//2)])
And plotting the two class data distribution (one feature x):
data = pd.DataFrame({'x': x.ravel(), 'y': y})
sns.violinplot(x='x', y='y', inner='point', data=data, orient='h')
pyplot.tight_layout(0)
pyplot.show()
Build and fit the keras model:
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid', input_dim=1)])
model.compile(optimizer=tf.keras.optimizers.SGD(2), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x, y, epochs=10, validation_data=(x, y), batch_size=N)
Notice that I have specified the data x and targets y for both training and for validation_data. Also, the batch_size is same as total size batch_size=N.
The training results are:
100/100 [==============================] - 1s 5ms/step - loss: 1.4500 - acc: 0.2300 - val_loss: 0.5439 - val_acc: 0.7200
Epoch 2/10
100/100 [==============================] - 0s 18us/step - loss: 0.5439 - acc: 0.7200 - val_loss: 0.4408 - val_acc: 0.8000
Epoch 3/10
100/100 [==============================] - 0s 16us/step - loss: 0.4408 - acc: 0.8000 - val_loss: 0.3922 - val_acc: 0.8300
Epoch 4/10
100/100 [==============================] - 0s 16us/step - loss: 0.3922 - acc: 0.8300 - val_loss: 0.3659 - val_acc: 0.8400
Epoch 5/10
100/100 [==============================] - 0s 17us/step - loss: 0.3659 - acc: 0.8400 - val_loss: 0.3483 - val_acc: 0.8500
Epoch 6/10
100/100 [==============================] - 0s 16us/step - loss: 0.3483 - acc: 0.8500 - val_loss: 0.3356 - val_acc: 0.8600
Epoch 7/10
100/100 [==============================] - 0s 17us/step - loss: 0.3356 - acc: 0.8600 - val_loss: 0.3260 - val_acc: 0.8600
Epoch 8/10
100/100 [==============================] - 0s 18us/step - loss: 0.3260 - acc: 0.8600 - val_loss: 0.3186 - val_acc: 0.8600
Epoch 9/10
100/100 [==============================] - 0s 18us/step - loss: 0.3186 - acc: 0.8600 - val_loss: 0.3127 - val_acc: 0.8700
Epoch 10/10
100/100 [==============================] - 0s 23us/step - loss: 0.3127 - acc: 0.8700 - val_loss: 0.3079 - val_acc: 0.8800
The results show that val_loss and loss are not the same at the end of each epoch, and also acc and val_acc are not exactly the same. However, based on this setup, one would expect them to be the same.
I have been going through the code in keras, particularly this part:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L1364
and so far, all I can say is that the difference is due to some different computation through the computation graph.
Does anyone have any idea why there would be such a difference?
So after looking more closely at the results, the loss and acc values from the training step are computed BEFORE the current batch is used to update the model.
Thus, in the case of a single batch per epoch, the train acc and loss are evaluated when the batch is fed in, then the model parameters are updated based on the provided optimizer. After the train step is finished, we compute loss and accuracy by feeding in the validation data, which is now evaluated using a new updated model.
This is evident from the training output, where the validation accuracy and loss in epoch 1 are equal to the train accuracy and loss in epoch 2, and so on.
A quick check using tensorflow confirmed that values are fetched before variables are updated:
import tensorflow as tf
import numpy as np
np.random.seed(1)
x = tf.placeholder(dtype=tf.float32, shape=(None, 1), name="x")
y = tf.placeholder(dtype=tf.float32, shape=(None), name="y")
W = tf.get_variable(name="W", shape=(1, 1), dtype=tf.float32, initializer=tf.constant_initializer(0))
b = tf.get_variable(name="b", shape=1, dtype=tf.float32, initializer=tf.constant_initializer(0))
z = tf.matmul(x, W) + b
error = tf.square(z - y)
obj = tf.reduce_mean(error, name="obj")
opt = tf.train.MomentumOptimizer(learning_rate=0.025, momentum=0.9)
grads = opt.compute_gradients(obj)
train_step = opt.apply_gradients(grads)
N = 100
x_np = np.random.randn(N).reshape(-1, 1)
y_np = 2*x_np + 3 + np.random.randn(N)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        res = sess.run([obj, W, b, train_step], feed_dict={x: x_np, y: y_np})
        print('MSE: {}, W: {}, b: {}'.format(res[0], res[1][0, 0], res[2][0]))
Output:
MSE: 14.721437454223633, W: 0.0, b: 0.0
MSE: 13.372591018676758, W: 0.08826743811368942, b: 0.1636980175971985
Since the parameters W and b were initialized to 0, it is clear that the fetched values are still 0 even though the session was run with a gradient update request.

How to understand loss acc val_loss val_acc in Keras model fitting

I'm new to Keras and have some questions on how to understand my model results. Here is my result (for your convenience, I only paste the loss, acc, val_loss, and val_acc after each epoch here).
Train on 4160 samples, validate on 1040 samples as below:
Epoch 1/20
4160/4160 - loss: 3.3455 - acc: 0.1560 - val_loss: 1.6047 - val_acc: 0.4721
Epoch 2/20
4160/4160 - loss: 1.7639 - acc: 0.4274 - val_loss: 0.7060 - val_acc: 0.8019
Epoch 3/20
4160/4160 - loss: 1.0887 - acc: 0.5978 - val_loss: 0.3707 - val_acc: 0.9087
Epoch 4/20
4160/4160 - loss: 0.7736 - acc: 0.7067 - val_loss: 0.2619 - val_acc: 0.9442
Epoch 5/20
4160/4160 - loss: 0.5784 - acc: 0.7690 - val_loss: 0.2058 - val_acc: 0.9433
Epoch 6/20
4160/4160 - loss: 0.5000 - acc: 0.8065 - val_loss: 0.1557 - val_acc: 0.9750
Epoch 7/20
4160/4160 - loss: 0.4179 - acc: 0.8296 - val_loss: 0.1523 - val_acc: 0.9606
Epoch 8/20
4160/4160 - loss: 0.3758 - acc: 0.8495 - val_loss: 0.1063 - val_acc: 0.9712
Epoch 9/20
4160/4160 - loss: 0.3202 - acc: 0.8740 - val_loss: 0.1019 - val_acc: 0.9798
Epoch 10/20
4160/4160 - loss: 0.3028 - acc: 0.8788 - val_loss: 0.1074 - val_acc: 0.9644
Epoch 11/20
4160/4160 - loss: 0.2696 - acc: 0.8923 - val_loss: 0.0581 - val_acc: 0.9856
Epoch 12/20
4160/4160 - loss: 0.2738 - acc: 0.8894 - val_loss: 0.0713 - val_acc: 0.9837
Epoch 13/20
4160/4160 - loss: 0.2609 - acc: 0.8913 - val_loss: 0.0679 - val_acc: 0.9740
Epoch 14/20
4160/4160 - loss: 0.2556 - acc: 0.9022 - val_loss: 0.0599 - val_acc: 0.9769
Epoch 15/20
4160/4160 - loss: 0.2384 - acc: 0.9053 - val_loss: 0.0560 - val_acc: 0.9846
Epoch 16/20
4160/4160 - loss: 0.2305 - acc: 0.9079 - val_loss: 0.0502 - val_acc: 0.9865
Epoch 17/20
4160/4160 - loss: 0.2145 - acc: 0.9185 - val_loss: 0.0461 - val_acc: 0.9913
Epoch 18/20
4160/4160 - loss: 0.2046 - acc: 0.9183 - val_loss: 0.0524 - val_acc: 0.9750
Epoch 19/20
4160/4160 - loss: 0.2055 - acc: 0.9120 - val_loss: 0.0440 - val_acc: 0.9885
Epoch 20/20
4160/4160 - loss: 0.1890 - acc: 0.9236 - val_loss: 0.0501 - val_acc: 0.9827
Here are my understandings:
The two losses (both loss and val_loss) are decreasing and the two accuracies (acc and val_acc) are increasing. So this indicates the model is being trained in a good way.
The val_acc is the measure of how good the predictions of your model are. So in my case, it looks like the model was trained pretty well after 6 epochs, and the rest of the training was not necessary.
My Questions are:
The acc (the accuracy on the training set) is always smaller, actually much smaller, than val_acc. Is this normal? Why does this happen? In my mind, acc should usually be similar to, or better than, val_acc.
After 20 epochs, the acc is still increasing. So should I use more epochs and stop when acc stops increasing? Or should I stop where val_acc stops increasing, regardless of the trend of acc?
Is there any other thoughts on my results?
Thanks!
Answering your questions:
As described on official keras FAQ
the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss.
Training should be stopped when val_acc stops increasing, otherwise your model will probably overfit. You can use the EarlyStopping callback to stop training (a minimal sketch is shown below).
Your model seems to achieve very good results. Keep up the good work.
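A self-contained sketch of the EarlyStopping callback; the toy data and model below are placeholders purely to show the callback wiring:
import numpy as np
from tensorflow import keras

# Toy data and model just to demonstrate the callback.
x = np.random.rand(200, 8).astype("float32")
y = (x.sum(axis=1) > 4).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_accuracy",      # stop when validation accuracy stops improving
    patience=3,                  # tolerate a few flat epochs before stopping
    restore_best_weights=True,   # roll the weights back to the best epoch
)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])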
What are loss and val_loss?
In deep learning, the loss is the value that a neural network is trying to minimize: it's the distance between the ground truth and the predictions. In order to minimize this distance, the neural network learns by adjusting weights and biases in a manner that reduces the loss.
For instance, in regression tasks, you have a continuous target, e.g., height. What you want to minimize is the difference between your predictions, and the actual height. You can use mean_absolute_error as loss so the neural network knows this is what it needs to minimize.
In classification, it's a little more complicated, but very similar. Predicted classes are based on probability, so the loss is also based on probability. In classification, the neural network minimizes the likelihood of assigning a low probability to the actual class. The loss is typically categorical_crossentropy.
loss and val_loss differ because the former is applied to the train set, and the latter the test set. As such, the latter is a good indication of how the model performs on unseen data. You can get a validation set by using validation_data=[x_test, y_test] or validation_split=0.2.
It's best to rely on the val_loss to prevent overfitting. Overfitting is when the model fits the training data too closely, and the loss keeps decreasing while the val_loss stays flat or increases.
In Keras, you can use EarlyStopping to stop training when the val_loss stops decreasing. Read here.
Read more about deep learning losses here: Loss and Loss Functions for Training Deep Learning Neural Networks.
What are acc and val_acc?
Accuracy is a metric only for classification. It makes no sense on a task with a continuous target. It gives the percentage of instances that are correctly classified.
Once again, acc is on the training data, and val_acc is on the validation data. It's best to rely on val_acc for a fair representation of model performance because a good neural network will end up fitting the training data at 100%, but would perform poorly on unseen data.
