I am trying to implement a custom loss function in Keras.
As a first step, I wanted to make sure the existing loss function still works when called from my custom function. And this is where the weird behaviour begins:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
works as expected.
Now the implementation of "sparse_categorical_crossentropy" in keras.losses is as follows:
def sparse_categorical_crossentropy(y_true, y_pred):
    return K.sparse_categorical_crossentropy(y_true, y_pred)
I concluded that passing K.sparse_categorical_crossentropy directly should also work. However, it throws an error: expected activation_6 to have shape (4,) but got array with shape (1,).
Also, defining a custom loss function like this:
def custom_loss(y_true, y_pred):
    return keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
does not work either. During training it reduces the loss (which seems correct), but the accuracy does not improve (whereas it does improve when using the built-in loss function).
I am not sure what is happening, nor do I know how to debug it properly. Any help would be highly appreciated.
I tested what you are saying with my own code, and yes, you are right. I initially ran into the same issue you describe, but once I changed the metrics parameter from accuracy to sparse_categorical_accuracy, I started getting the expected high accuracy.
One important thing to note here: when we tell Keras to use accuracy as the metric, it falls back to the default accuracy, which is categorical_accuracy. So, if we want to use our own custom loss function, we have to set the metrics parameter accordingly.
Read about the available metrics functions in Keras here.
Case 1:
def sparse_categorical_crossentropy(y_true, y_pred):
    return K.sparse_categorical_crossentropy(y_true, y_pred)

model.compile(optimizer='adam',
              loss=sparse_categorical_crossentropy,
              metrics=['accuracy'])
output:
ValueError: Error when checking target: expected dense_71 to have
shape (10,) but got array with shape (1,)
Case 2:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
output:
Epoch 1/2
60000/60000 [==============================] - 2s 38us/step - loss: 0.4714 - acc: 0.8668
Epoch 2/2
60000/60000 [==============================] - 1s 22us/step - loss: 0.2227 - acc: 0.9362
10000/10000 [==============================] - 1s 94us/step
Case 3:
def custom_sparse_categorical_crossentropy(y_true, y_pred):
    return K.sparse_categorical_crossentropy(y_true, y_pred)

model.compile(optimizer='adam',
              loss=custom_sparse_categorical_crossentropy,
              metrics=['accuracy'])
output:
Epoch 1/2
60000/60000 [==============================] - 2s 41us/step - loss: 0.4558 - acc: 0.1042
Epoch 2/2
60000/60000 [==============================] - 1s 22us/step - loss: 0.2164 - acc: 0.0997
10000/10000 [==============================] - 1s 89us/step
Case 4:
def custom_sparse_categorical_crossentropy(y_true, y_pred):
    return K.sparse_categorical_crossentropy(y_true, y_pred)

model.compile(optimizer='adam',
              loss=custom_sparse_categorical_crossentropy,
              metrics=['sparse_categorical_accuracy'])
output:
Epoch 1/2
60000/60000 [==============================] - 2s 40us/step - loss: 0.4736 - sparse_categorical_accuracy: 0.8673
Epoch 2/2
60000/60000 [==============================] - 1s 23us/step - loss: 0.2222 - sparse_categorical_accuracy: 0.9372
10000/10000 [==============================] - 1s 85us/step
Full Code:
from __future__ import absolute_import, division, print_function
import tensorflow as tf
import keras.backend as K

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(100, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.10),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

def custom_sparse_categorical_crossentropy(y_true, y_pred):
    return K.sparse_categorical_crossentropy(y_true, y_pred)

#def sparse_categorical_accuracy(y_true, y_pred):
#    # reshape in case it's in shape (num_samples, 1) instead of (num_samples,)
#    if K.ndim(y_true) == K.ndim(y_pred):
#        y_true = K.squeeze(y_true, -1)
#    # convert dense predictions to labels
#    y_pred_labels = K.argmax(y_pred, axis=-1)
#    y_pred_labels = K.cast(y_pred_labels, K.floatx())
#    return K.cast(K.equal(y_true, y_pred_labels), K.floatx())

model.compile(optimizer='adam',
              loss=custom_sparse_categorical_crossentropy,
              metrics=['sparse_categorical_accuracy'])

history = model.fit(x_train, y_train, epochs=2, batch_size=200)
model.evaluate(x_test, y_test)
Check out the implementation of sparse_categorical_accuracy from here and sparse_categorical_crossentropy from here.
What happens is that when you use the accuracy metric, Keras actually selects a different accuracy implementation depending on the loss, because how the accuracy should be computed depends on the format of the labels and of the model's predictions:
for categorical_crossentropy it uses categorical_accuracy as accuracy metric.
for binary_crossentropy it uses binary_accuracy as accuracy metric.
for sparse_categorical_crossentropy it uses sparse_categorical_accuracy as accuracy metric.
Keras can only do this if you use the predefined losses, as it cannot guess otherwise. For your custom loss you can use one of the three accuracy implementations directly, e.g. metrics=['sparse_categorical_accuracy'].
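As an illustration, here is a minimal sketch (reusing the custom loss from the cases above and assuming the standalone Keras package, as in the full code) that passes the metric as a function rather than relying on Keras to infer it from the loss; sparse_categorical_accuracy is imported from keras.metrics:

import keras.backend as K
from keras.metrics import sparse_categorical_accuracy

def custom_sparse_categorical_crossentropy(y_true, y_pred):
    return K.sparse_categorical_crossentropy(y_true, y_pred)

# With a custom loss Keras cannot guess the right accuracy, so name it explicitly.
model.compile(optimizer='adam',
              loss=custom_sparse_categorical_crossentropy,
              metrics=[sparse_categorical_accuracy])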
Related
How is Accuracy defined when the loss function is mean square error? Is it mean absolute percentage error?
The model I use has a linear output activation and is compiled with loss='mean_squared_error':
model.add(Dense(1))
model.add(Activation('linear')) # number
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
and the output looks like this:
Epoch 99/100
1000/1000 [==============================] - 687s 687ms/step - loss: 0.0463 - acc: 0.9689 - val_loss: 3.7303 - val_acc: 0.3250
Epoch 100/100
1000/1000 [==============================] - 688s 688ms/step - loss: 0.0424 - acc: 0.9740 - val_loss: 3.4221 - val_acc: 0.3701
So what does e.g. val_acc: 0.3250 mean? Mean squared error should be a scalar, not a percentage - shouldn't it? So is val_acc the mean squared error, the mean percentage error, or some other function?
From the definition of MSE on Wikipedia (https://en.wikipedia.org/wiki/Mean_squared_error):
The MSE is a measure of the quality of an estimator—it is always
non-negative, and values closer to zero are better.
Does that mean a value of val_acc: 0.0 is better than val_acc: 0.325?
Edit: more examples of the output of the accuracy metric as I train. The accuracy increases as I train more, while the loss (MSE) should decrease. Is accuracy well defined for MSE at all, and how is it defined in Keras?
1000/1000 [==============================] - 453s 453ms/step - loss: 17.4875 - acc: 0.1443 - val_loss: 98.0973 - val_acc: 0.0333
Epoch 2/100
1000/1000 [==============================] - 443s 443ms/step - loss: 6.6793 - acc: 0.1973 - val_loss: 11.9101 - val_acc: 0.1500
Epoch 3/100
1000/1000 [==============================] - 444s 444ms/step - loss: 6.3867 - acc: 0.1980 - val_loss: 6.8647 - val_acc: 0.1667
Epoch 4/100
1000/1000 [==============================] - 445s 445ms/step - loss: 5.4062 - acc: 0.2255 - val_loss: 5.6029 - val_acc: 0.1600
Epoch 5/100
783/1000 [======================>.......] - ETA: 1:36 - loss: 5.0148 - acc: 0.2306
There are at least two separate issues with your question.
The first one should be clear by now from the comments by Dr. Snoopy and the other answer: accuracy is meaningless in a regression problem such as yours; see also the comment by patyork in this Keras thread. For good or bad, the fact is that Keras will not "protect" you or any other user from putting meaningless requests in your code, i.e. you will not get any error, or even a warning, that you are attempting something that does not make sense, such as requesting accuracy in a regression setting.
Having clarified that, the other issue is:
Since Keras does indeed return an "accuracy", even in a regression setting, what exactly is it and how is it calculated?
To shed some light here, let's revert to a public dataset (since you do not provide any details about your data), namely the Boston house price dataset (saved locally as housing.csv), and run a simple experiment as follows:
import numpy as np
import pandas
import keras
from keras.models import Sequential
from keras.layers import Dense

# load dataset
dataframe = pandas.read_csv("housing.csv", delim_whitespace=True, header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:, 0:13]
Y = dataset[:, 13]

model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))

# Compile model asking for accuracy, too:
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])

model.fit(X, Y,
          batch_size=5,
          epochs=100,
          verbose=1)
As in your case, the model fitting history (not shown here) shows a decreasing loss and a roughly increasing accuracy. Let's now evaluate the model performance on the same training set, using the appropriate Keras built-in function:
score = model.evaluate(X, Y, verbose=0)
score
# [16.863721372581754, 0.013833992168483997]
The exact contents of the score array depend on what exactly we have requested during model compilation; in our case here, the first element is the loss (MSE), and the second one is the "accuracy".
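As a quick side check (a minimal sketch, assuming the score just computed above), model.metrics_names tells you which entry of score is which:

print(model.metrics_names)                      # e.g. ['loss', 'acc']
print(dict(zip(model.metrics_names, score)))    # pair each name with its value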
At this point, let us have a look at the definition of Keras binary_accuracy in the metrics.py file:
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
So, after Keras has generated the predictions y_pred, it first rounds them, and then checks to see how many of them are equal to the true labels y_true, before getting the mean.
Let's replicate this operation using plain Python & Numpy code in our case, where the true labels are Y:
y_pred = model.predict(X)
l = len(Y)
acc = sum([np.round(y_pred[i])==Y[i] for i in range(l)])/l
acc
# array([0.01383399])
Well, bingo! This is actually the same value returned by score[1] above...
To make a long story short: since you (erroneously) request metrics=['accuracy'] in your model compilation, Keras will do its best to satisfy you, and will return some "accuracy" indeed, calculated as shown above, despite this being completely meaningless in your setting.
There are quite a few settings where Keras, under the hood, performs rather meaningless operations without giving any hint or warning to the user; two of them I have happened to encounter are:
Giving meaningless results when, in a multi-class setting, one happens to request loss='binary_crossentropy' (instead of categorical_crossentropy) with metrics=['accuracy'] - see my answers in Keras binary_crossentropy vs categorical_crossentropy performance? and Why is binary_crossentropy more accurate than categorical_crossentropy for multiclass classification in Keras?
Disabling completely Dropout, in the extreme case when one requests a dropout rate of 1.0 - see my answer in Dropout behavior in Keras with rate=1 (dropping all input units) not as expected
The loss function (mean squared error in this case) is used to indicate how far your predictions deviate from the target values. In the training phase, the weights are updated based on this quantity. If you are dealing with a classification problem, it is quite common to define an additional metric called accuracy. It monitors the fraction of cases in which the correct class was predicted. This is expressed as a value between 0 and 1: a value of 0.0 means no correct decisions, and 1.0 means only correct decisions.
While your network is training, the loss decreases and the accuracy usually increases.
Note that, in contrast to the loss, the accuracy is usually not used to update the parameters of your network. It helps to monitor the learning progress and the current performance of the network.
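As a tiny illustration of that definition (hypothetical numbers, binary classification with a 0.5 threshold):

import numpy as np

# Hypothetical example: 4 out of 5 thresholded predictions match the true labels,
# so the accuracy is 0.8 (80% correct decisions).
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.4, 0.1, 0.9])   # model outputs (e.g. sigmoid)
y_pred = (y_prob > 0.5).astype(int)            # predicted classes

accuracy = np.mean(y_pred == y_true)
print(accuracy)                                # 0.8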
@desertnaut has said it very clearly.
Consider the following two pieces of code
compile code
binary_accuracy code
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
Your labels should be integers: because Keras rounds y_pred but does not round y_true, rounded predictions can coincide exactly with integer labels, and that is how you can end up with a high "accuracy".
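A tiny illustrative sketch (hypothetical numbers) of what that rounding comparison does in a regression setting:

import numpy as np

# Hypothetical regression targets that happen to be integer-valued:
y_true = np.array([3.0, 2.0, 5.0])
y_pred = np.array([2.7, 2.2, 4.9])

# This mirrors binary_accuracy: y_pred is rounded, y_true is not.
acc = np.mean(np.equal(y_true, np.round(y_pred)))
print(acc)   # 1.0 - "perfect accuracy" although every prediction is off

# With non-integer targets (e.g. 2.5) the rounded predictions can never match
# exactly, and the reported "accuracy" collapses towards 0.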
I am doing a time series analysis using Tensorflow/ Keras in Python.
The overall LSTM model looks like this:
model = keras.models.Sequential()
model.add(keras.layers.LSTM(25, input_shape = (1,1), activation = 'relu', dropout = 0.2, return_sequences = False))
model.add(keras.layers.Dense(1))
model.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics=['acc'])
tensorboard = keras.callbacks.TensorBoard(log_dir="logs/{}".format(time()))
es = keras.callbacks.EarlyStopping(monitor='val_acc', mode='max', verbose=1, patience=50)
mc = keras.callbacks.ModelCheckpoint('/home/sukriti/best_model.h5', monitor='val_loss', mode='min', save_best_only=True)
history = model.fit(trainX_3d, trainY_1d, epochs=50, batch_size=10, verbose=2, validation_data = (testX_3d, testY_1d), callbacks=[mc, es, tensorboard])
I am getting the following output:
Train on 14015 samples, validate on 3503 samples
Epoch 1/50
- 3s - loss: 0.0222 - acc: 7.1352e-05 - val_loss: 0.0064 - val_acc: 0.0000e+00
Epoch 2/50
- 2s - loss: 0.0120 - acc: 7.1352e-05 - val_loss: 0.0054 - val_acc: 0.0000e+00
Epoch 3/50
- 2s - loss: 0.0108 - acc: 7.1352e-05 - val_loss: 0.0047 - val_acc: 0.0000e+00
Now the val_acc remains unchanged. Is this normal? What does it signify?
As signified by loss = 'mean_squared_error', you are in a regression setting, where accuracy is meaningless (it is meaningful only in classification problems).
Unfortunately, Keras will not "protect" you in such a case, insisting on computing and reporting back an "accuracy", despite the fact that it is meaningless and inappropriate for your problem - see my answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)?
You should simply remove metrics=['acc'] from your model compilation and not bother - in regression settings, MSE itself can (and usually does) also serve as the performance metric. A minimal sketch is shown below.
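A minimal sketch of the corrected compilation (using the model object from the question; mean absolute error is added here only as an optional, regression-appropriate extra metric):

model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['mae'])   # optional; you can also omit metrics entirely

Note that if you drop the accuracy, the EarlyStopping callback in the question should also be changed to monitor 'val_loss' with mode='min' instead of 'val_acc'.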
In my case I had validation accuracy of 0.0000e+00 throughout training (using Keras and CNTK-GPU backend) when my batch size was 64 but there were only 120 samples in my validation set (divided into three classes). After I changed the batch size to 60, I got normal accuracy values.
It will not improve by changing the batch size or the metrics. I had the same problem, but once I shuffled my training and validation data sets, the 0.0000e+00 value went away. A minimal shuffling sketch is shown below.
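A minimal sketch of that shuffling step, assuming hypothetical arrays X and y (sklearn's shuffle keeps features and labels aligned):

from sklearn.utils import shuffle

# Shuffle features and labels together before splitting off the validation data.
X, y = shuffle(X, y, random_state=42)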
I have a dataset that I used for building an NN model in Keras. I took 2000 rows from that dataset to use as validation data; those 2000 rows are the ones I pass to the .predict function.
I wrote the code for the Keras NN and for now it works well, but I noticed something that is very strange to me. It gives me a very good accuracy of more than 83% and a loss of around 0.12, but when I make predictions on unseen data (those 2000 rows), it is only correct for about 65% of them on average.
When I add a Dropout layer, it only decreases the accuracy.
Then I added EarlyStopping, and it gave me an accuracy of around 86% and a loss of around 0.10, but still, when I make predictions on unseen data, I get a final prediction accuracy of 67%.
Does this mean that the model makes correct predictions in 87% of situations? My logic is: if I feed 100 samples to my .predict function, the program should predict correctly for about 87 of those 100 samples, or at least somewhere in that range (say, more than 80). I have tried feeding 100, 500, 1000, 1500 and 2000 samples to .predict, and it always predicts correctly for 65-68% of the samples.
Why is that? Am I doing something wrong?
I have tried playing with the number of layers, the number of nodes, different activation functions and different optimizers, but it only changes the results by 1-2%.
My dataset looks like this:
DataFrame shape (59249, 33)
x_train shape (47399, 32)
y_train shape (47399,)
x_test shape (11850, 32)
y_test shape (11850,)
testing_features shape (1000, 32)
This is my NN model:
model = Sequential()
model.add(Dense(64, input_dim = x_train.shape[1], activation = 'relu')) # input layer requires input_dim param
model.add(Dropout(0.2))
model.add(Dense(32, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation = 'relu'))
model.add(Dense(1, activation='sigmoid')) # sigmoid instead of relu for final probability between 0 and 1
# compile the model, adam gradient descent (optimized)
model.compile(loss="binary_crossentropy", optimizer= "adam", metrics=['accuracy'])
# call the function to fit to the data training the network)
es = EarlyStopping(monitor='val_loss', min_delta=0.0, patience=1, verbose=0, mode='auto')
model.fit(x_train, y_train, epochs = 15, shuffle = True, batch_size=32, validation_data=(x_test, y_test), verbose=2, callbacks=[es])
scores = model.evaluate(x_test, y_test)
print(model.metrics_names[0], round(scores[0]*100,2), model.metrics_names[1], round(scores[1]*100,2))
These are the results:
Train on 47399 samples, validate on 11850 samples
Epoch 1/15
- 25s - loss: 0.3648 - acc: 0.8451 - val_loss: 0.2825 - val_acc: 0.8756
Epoch 2/15
- 9s - loss: 0.2949 - acc: 0.8689 - val_loss: 0.2566 - val_acc: 0.8797
Epoch 3/15
- 9s - loss: 0.2741 - acc: 0.8773 - val_loss: 0.2468 - val_acc: 0.8849
Epoch 4/15
- 9s - loss: 0.2626 - acc: 0.8816 - val_loss: 0.2416 - val_acc: 0.8845
Epoch 5/15
- 10s - loss: 0.2566 - acc: 0.8827 - val_loss: 0.2401 - val_acc: 0.8867
Epoch 6/15
- 8s - loss: 0.2503 - acc: 0.8858 - val_loss: 0.2364 - val_acc: 0.8893
Epoch 7/15
- 9s - loss: 0.2480 - acc: 0.8873 - val_loss: 0.2321 - val_acc: 0.8895
Epoch 8/15
- 9s - loss: 0.2450 - acc: 0.8886 - val_loss: 0.2357 - val_acc: 0.8888
11850/11850 [==============================] - 2s 173us/step
loss 23.57 acc 88.88
And this is for prediction:
# testing_features are 2000 rows that I extracted from the dataset (these samples are not used in training; this is a separate dataset that is imported)
prediction = model.predict(testing_features, batch_size=32)

res = []
for p in prediction:
    res.append(p[0].round(0))

# Accuracy with sklearn - also much lower
acc_score = accuracy_score(testing_results, res)
print("Sklearn acc", acc_score)

result_df = pd.DataFrame({"label": testing_results,
                          "prediction": res})
result_df["prediction"] = result_df["prediction"].astype(int)

s = 0
for x, y in zip(result_df["label"], result_df["prediction"]):
    if x == y:
        s += 1
print(s, "/", len(result_df))
acc = s * 100 / len(result_df)
print('TOTAL ACC:', round(acc, 2))
The problem is... now I get an accuracy of 52% with sklearn and 52% with my manual check.
Why do I get such a low accuracy on this data, when the validation accuracy reported during training is much higher?
The training data you posted gives a high validation accuracy, so I'm a bit confused as to where you get that 65% from, but in general, when your model performs much better on training data than on unseen data, it means you are overfitting. This is a big and recurring problem in machine learning, and there is no method guaranteed to prevent it, but there are a couple of things you can try (a combined sketch follows the list):
regularizing the weights of your network, e.g. using l2 regularization
using stochastic regularization techniques such as drop-out during training
early stopping
reducing model complexity (but you say you've already tried this)
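A minimal sketch combining the first three suggestions (L2 weight penalty, dropout, early stopping); the layer sizes, rates and patience values are arbitrary choices, and x_train/y_train/x_test/y_test are the arrays from the question:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(32, input_dim=x_train.shape[1], activation='relu',
                kernel_regularizer=l2(1e-3)))   # L2 penalty on the weights
model.add(Dropout(0.3))                          # stochastic regularization
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Stop when the validation loss stops improving and keep the best weights seen.
es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          epochs=50, batch_size=32, callbacks=[es])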
I will list the problems/recommendations that I see with your model.
What are you trying to predict? You are using a sigmoid activation function in the last layer, which suggests a binary classification, but in your loss function you used mse, which seems strange. You can try binary_crossentropy instead of mse as the loss function for your model.
Your model seems to suffer from overfitting, so you can increase the Dropout probability and also add new Dropout layers between the other hidden layers, or you can remove one of the hidden layers, because it seems your model is too complex.
You can make the layers narrower, e.g. 64 -> 32 -> 16 -> 1, or try different NN architectures.
Try the adam optimizer instead of sgd.
If you have 57849 samples, you can use 47000 samples for training+validation and the rest will be your test set.
Don't use the same set for both evaluation and validation. First split your data into train and test sets. Then, when fitting your model, pass a validation_split ratio and Keras will automatically take the validation set from your training set (a minimal sketch is shown below).
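A minimal sketch of that split-then-validate workflow (hypothetical variable names X and y; train_test_split comes from scikit-learn):

from sklearn.model_selection import train_test_split

# Hold out a test set that is never touched during training.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Let Keras carve a validation set out of the training data.
model.fit(x_train, y_train, epochs=15, batch_size=32, validation_split=0.2)

# Evaluate exactly once, on the held-out test set.
model.evaluate(x_test, y_test)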
I am working my way through an ML example in Google Colab. The documentation says that when I run model.fit, the loss and accuracy metrics are displayed. I am not seeing any loss or accuracy metrics.
I have added accuracy as a metric in model.compile:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Here is a screenshot of what I am seeing.
How do I get the loss and accuracy metrics to be displayed when I am fitting the model?
You can use the verbose flag: set it to 2 to display one line per epoch, or to 1 for a progress bar.
import keras
import numpy as np

model = keras.Sequential()
model.add(keras.layers.Dense(10, input_shape=(5, 6)))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy')

x_data = np.random.random((32, 5, 6))
y_data = np.random.randint(0, 9, size=(32, 5, 1))

model.fit(x=x_data, y=y_data, batch_size=16, epochs=3)
Epoch 1/3
32/32 [==============================] - 1s 20ms/step - loss: 9.9664
Epoch 2/3
32/32 [==============================] - 0s 293us/step - loss: 9.9537
Epoch 3/3
32/32 [==============================] - 0s 164us/step - loss: 9.9425
I hope it solves your problem.
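If the metrics still do not appear, here is a minimal sketch (hypothetical x_train/y_train arrays, building on the compile call from the question) that requests accuracy explicitly and prints one summary line per epoch:

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# verbose=2 prints one line per epoch (loss and accuracy); verbose=1 shows a progress bar.
history = model.fit(x_train, y_train, epochs=5, verbose=2)

# The same values are stored in history.history after training.
print(history.history['loss'])
print(history.history['accuracy'])   # the key may be 'acc' in older Keras versions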
I'm programming a model in tf.keras, and running model.evaluate() on the training set usually yields ~96% accuracy. My evaluation on the test set is usually close, about 93%. However, when I predict manually, the model is usually inaccurate. This is my code:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd
!git clone https://github.com/DanorRon/data
%cd data
!ls
batch_size = 100
epochs = 15
alpha = 0.001
lambda_ = 0.001
h1 = 50
train = pd.read_csv('/content/data/mnist_train.csv.zip')
test = pd.read_csv('/content/data/mnist_test.csv.zip')
train = train.loc['1':'5000', :]
test = test.loc['1':'2000', :]
train = train.sample(frac=1).reset_index(drop=True)
test = test.sample(frac=1).reset_index(drop=True)
x_train = train.loc[:, '1x1':'28x28']
y_train = train.loc[:, 'label']
x_test = test.loc[:, '1x1':'28x28']
y_test = test.loc[:, 'label']
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values
nb_classes = 10
targets = y_train.reshape(-1)
y_train_onehot = np.eye(nb_classes)[targets]
nb_classes = 10
targets = y_test.reshape(-1)
y_test_onehot = np.eye(nb_classes)[targets]
model = tf.keras.Sequential()
model.add(layers.Dense(784, input_shape=(784,), kernel_initializer='random_uniform', bias_initializer='zeros'))
model.add(layers.Dense(h1, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(lambda_), kernel_initializer='random_uniform', bias_initializer='zeros'))
model.add(layers.Dense(10, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(lambda_), kernel_initializer='random_uniform', bias_initializer='zeros'))
model.compile(optimizer='SGD',
              loss='mse',
              metrics=['categorical_accuracy'])
model.fit(x_train, y_train_onehot, epochs=epochs, batch_size=batch_size)
model.evaluate(x_test, y_test_onehot, batch_size=batch_size)
prediction = model.predict_classes(x_test)
print(prediction)
print(y_test[1:])
I've heard that a lot of the time when people have this problem, it's just a problem with the data input. But I can't see any problem with that here, since it almost always predicts incorrectly (about as often as you would expect if it were random). How do I fix this problem?
Edit: Here are the specific results:
Last training step:
Epoch 15/15
49999/49999 [==============================] - 3s 70us/sample - loss: 0.0309 - categorical_accuracy: 0.9615
Evaluation output:
2000/2000 [==============================] - 0s 54us/sample - loss: 0.0352 - categorical_accuracy: 0.9310
[0.03524150168523192, 0.931]
Output from model.predict_classes:
[9 9 0 ... 5 0 5]
Output from print(y_test):
[9 0 0 7 6 8 5 1 3 2 4 1 4 5 8 4 9 2 4]
First thing is, your loss function is wrong: you are in a multi-class classification setting, and you are using a loss function suitable for regression and not classification (MSE).
Change your model compilation to:
model.compile(loss='categorical_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])
See the Keras MNIST MLP example for corroboration, and my own answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)? for more details (although here you actually have the inverse problem, i.e. a regression loss in a classification setting).
Moreover, it is not clear whether the MNIST variant you are using is already normalized; if not, you should normalize the inputs yourself:
x_train = x_train.values/255
x_test = x_test.values/255
It is also not clear why you ask for a 784-unit layer, since this is actually the second layer of your NN (the first is implicitly set by the input_shape argument - see Keras Sequential model input layer), and it certainly does not need to contain one unit for each one of your 784 input features.
UPDATE (after comments):
But why is MSE meaningless for classification?
This is a theoretical issue, not exactly appropriate for SO; roughly speaking, it is for the same reason we don't use linear regression for classification - we use logistic regression, the actual difference between the two approaches being exactly the loss function. Andrew Ng, in his popular Machine Learning course at Coursera, explains this nicely - see his Lecture 6.1 - Logistic Regression | Classification at Youtube (explanation starts at ~ 3:00), as well as section 4.2 Why Not Linear Regression [for classification]? of the (highly recommended and freely available) textbook An Introduction to Statistical Learning by Hastie, Tibshirani and coworkers.
And MSE does give a high accuracy, so why doesn't that matter?
Nowadays, almost anything you throw at MNIST will "work", which of course neither makes it correct nor a good approach for more demanding datasets...
UPDATE 2:
whenever I run with crossentropy, the accuracy just flutters around at ~10%
Sorry, cannot reproduce the behavior... Taking the Keras MNIST MLP example with a simplified version of your model, i.e.:
model = Sequential()
model.add(Dense(784, activation='linear', input_shape=(784,)))
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(),
              metrics=['accuracy'])
we easily end up with a ~ 92% validation accuracy after only 5 epochs:
history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=5,
                    verbose=1,
                    validation_data=(x_test, y_test))
Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 4s - loss: 0.8974 - acc: 0.7801 - val_loss: 0.4650 - val_acc: 0.8823
Epoch 2/10
60000/60000 [==============================] - 4s - loss: 0.4236 - acc: 0.8868 - val_loss: 0.3582 - val_acc: 0.9034
Epoch 3/10
60000/60000 [==============================] - 4s - loss: 0.3572 - acc: 0.9009 - val_loss: 0.3228 - val_acc: 0.9099
Epoch 4/10
60000/60000 [==============================] - 4s - loss: 0.3263 - acc: 0.9082 - val_loss: 0.3024 - val_acc: 0.9156
Epoch 5/10
60000/60000 [==============================] - 4s - loss: 0.3061 - acc: 0.9132 - val_loss: 0.2845 - val_acc: 0.9196
Notice the activation='linear' of the first Dense layer, which is the equivalent of not specifying anything, like in your case (as I said, practically everything you throw at MNIST will "work")...
Final advice: Try modifying your model as:
model = tf.keras.Sequential()
model.add(layers.Dense(784, activation='relu', input_shape=(784,)))
model.add(layers.Dense(h1, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
in order to use the better (and default) 'glorot_uniform' initializer, and remove the kernel_regularizer args (they may be the cause of any issue - always start simple!)...