Deep learning model predicts mostly around the mean of the actuals? - python

I have a 1DCNN model that seems to predict only values close to the mean of the actuals in my test dataset. Is this a poor model, given that the distributions of the actuals and the predictions are radically different?
Actual vs Predicted density plot
My questions are:
What could this graph indicate about my predictions? Is it normal for a model to simply predict the mean? Is that what will happen if it can't learn much about the dataset?
Doesn't it seem that MAE is not a good metric? Is it misleading and should I use a different one?
I am trying to improve the model by decreasing MAE, but as MAE decreases, the predictions simply move toward the mean of the actual data and further from the spread of the real distribution. You can see the SD of my predictions is around 9, while the SD of the actual data is about 22. The users want the results in the actual units, which is why I report MAE, and I also have other baselines to compare against with MAE. Still, I feel it is a very misleading metric.
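A quick way to quantify the mismatch described above is to compare standard deviations and overlay the two densities; a minimal sketch, assuming y_test holds the actuals and y_pred the model's test predictions (both hypothetical names):
import numpy as np
import matplotlib.pyplot as plt

# Compare the spread of the two distributions (here, roughly 22 vs 9).
print('actual SD:', np.std(y_test), 'predicted SD:', np.std(y_pred))

# Overlay normalized histograms as a stand-in for the density plot above.
plt.hist(y_test, bins=50, density=True, alpha=0.5, label='actual')
plt.hist(y_pred, bins=50, density=True, alpha=0.5, label='predicted')
plt.legend()
plt.show()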
I have about 30 weather and soil features, all continuous and scaled: 6 years of daily weather data at several thousand weather locations, with a single target value per location per year. The 1DCNN architecture is shown below. I split my data with the first 5 years in training and the last year as test. The data spans 3 US states, with about 9 physical districts per state. I tried building a model per state (just 3 models), but performance is poor for each. If I build down to the district level, I can get acceptable results. I don't expect great results; I'm really just trying to figure out why the predictions circle the mean.
My model looks like this:
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Conv1D(filters=13, kernel_size=3, activation='relu', input_shape=input_shape))
model.add(Conv1D(filters=13, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))
opt = Adam(lr=.0001)
model.compile(loss='mean_squared_error', optimizer=opt)
I'm training the model on different-sized datasets to capture results throughout the year as more weather data is added; however, the predictions are similar for each model.
multiple models showing same situation

In the question you say you are using MAE (mean absolute error), but in the code the loss is mean_squared_error. Any reason for that?
Some debugging ideas:
Check the distribution of the dependent variable in your training and validation splits.
Bucket the target values into classes and try to build a classifier, and see if you still face the same issue; this could help you pinpoint which part of the distribution is not being captured properly.
Since you are using a structured dataset, try tree-based algorithms such as XGBoost or Random Forest to see whether the algorithm itself is the problem (see the sketch below).
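A minimal sketch of the last two ideas, assuming X_train/X_test are the features flattened to 2-D arrays and y_train/y_test the continuous targets (all hypothetical names):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Bucket the continuous target into quantile classes for a classifier.
edges = np.quantile(y_train, [0.25, 0.5, 0.75])
y_train_classes = np.digitize(y_train, edges)  # class labels 0..3

# Tree-based baseline to check whether the algorithm is the problem.
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_train, y_train)
print('Random Forest baseline MAE:', mean_absolute_error(y_test, rf.predict(X_test)))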

Related

Keras neural network predicts the same number for all inputs

I am trying to create a Keras neural network to predict the road distance between two points in a city. I use Google Maps to get the travel distance and then train the neural network on it.
import pandas as pd

# generateTwoPoints is the user-defined helper that queries Google Maps.
arr = []
for i in range(0, 100):
    arr.append(generateTwoPoints(55.901819, 37.344735, 55.589537, 37.832254))
df = pd.DataFrame(arr, columns=['p1Lat', 'p1Lon', 'p2Lat', 'p2Lon', 'distnaceInMeters', 'timeInSeconds'])
print(df)
Neural network architecture:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

sgd = SGD(lr=0.00000001)
model = Sequential()
model.add(Dense(100, input_dim=4, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Note: passing the string 'sgd' creates a fresh default SGD optimizer,
# so the `sgd` object above (with its tiny learning rate) is never used.
model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
Then I split the data into train/test sets:
Xtrain=train[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytrain=train[['distnaceInMeters']]/100000
Xtest=test[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytest=test[['distnaceInMeters']]/100000
Then I fit the data to the model, but the loss stays the same:
history = model.fit(Xtrain, Ytrain,
                    batch_size=1,
                    epochs=1000,
                    # pass validation data to monitor validation
                    # loss and metrics at the end of each epoch
                    validation_data=(Xtest, Ytest))
I later print the data:
prediction = model.predict(Xtest)
print(prediction)
print (Ytest)
But the result is essentially the same for all the inputs:
[[0.26150784]
[0.26171574]
[0.2617755 ]
[0.2615582 ]
[0.26173398]
[0.26166356]
[0.26185763]
[0.26188275]
[0.2614446 ]
[0.2616575 ]
[0.26175532]
[0.2615183 ]
[0.2618127 ]]
distnaceInMeters
2 0.13595
6 0.27998
7 0.48849
16 0.36553
21 0.37910
22 0.40176
33 0.09173
39 0.24542
53 0.04216
55 0.38212
62 0.39972
64 0.29153
87 0.08788
I cannot find the problem. What is it? I am new to machine learning.
You are making a very elementary mistake: since you are in a regression setting, you should not use a sigmoid activation for your final layer (this is used for binary classification cases); change your last layer to
model.add(Dense(1,activation='linear'))
or even
model.add(Dense(1))
since, according to the docs, if you do not specify the activation argument it defaults to linear.
Various other advice offered already in the other answer and the comments may be useful (lower LR, more layers, other optimizers e.g. Adam), and you certainly need to increase your batch size; but nothing will work with the sigmoid activation function you currently use for your last layer.
Irrelevant to the issue, but in regression settings you don't need to repeat your loss function as a metric; this
model.compile(loss='mse', optimizer='sgd')
will suffice.
It would be very useful if you could post the progression of the loss and MSE (for both the training and validation/test sets) throughout training. Even better, visualize it as per https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/ and post the visualization here.
In the meantime, based on the facts:
1) You say the loss isn't decreasing (I'm assuming on the training set, during training, based on your compile args).
2) You say that the prediction "accuracy" on your test set is bad.
3) My experience/intuition (not an empirical assessment) tells me that your two-layer dense model is a little too small to capture the complexity inherent in your data. In other words, your model is suffering from too high a bias: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
The fastest and easiest thing you can try is to add both more layers and more nodes to each layer.
However, I should note that there is a lot of causal information affecting driving distance and driving time beyond the straight-line distance between two coordinates, which is probably the feature your neural network will most readily extract. For example: whether you drive on a highway or side streets, traffic lights, whether the roads twist and turn or go straight... To infer all of that from the data alone, you will need enormous amounts of examples, in my opinion. If you could add input columns with, e.g., the distance to the nearest highway from both points, you might be able to train with less data.
I would also recommend that you double-check that you are feeding as input what you think you are feeding (and its shape), and use some standardization, e.g. from sklearn, which might help the model learn faster and converge to a higher "accuracy".
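For example, a minimal sketch of that standardization with sklearn's StandardScaler (fit on the training inputs only, then reused for the test set to avoid leaking test statistics):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn mean/SD from the training inputs only.
Xtrain_scaled = scaler.fit_transform(Xtrain)
# Apply the same transform to the test inputs.
Xtest_scaled = scaler.transform(Xtest)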
If and when you post more code or the training history (and how many training samples you have), I can help you more.
EDIT 1: Try changing the batch size to a larger number, preferably batch_size=32, if it fits in your memory. You can use a small batch size (such as 1) when working with an "info-rich" input like an image, but with a very "info-poor" datum like 4 floats (2 coordinates), the gradient of each batch (with batch_size=1) points in a practically (pseudo-)random direction and does not necessarily get any closer to a local minimum. Only when taking the gradient on the collective loss of a larger batch (like 32, and perhaps more) will you get a gradient that points at least approximately in the direction of the local minimum and converge to a better result. Also, I suggest that you don't mess with the learning rate manually, and perhaps change to an optimizer like "adam" or "RMSProp".
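Putting those two suggestions together, the compile/fit calls might look like the sketch below (not a drop-in fix; the output-layer correction from the other answer is still required):
from keras.optimizers import Adam

# Let Adam adapt the step size instead of hand-tuning a tiny SGD rate.
model.compile(loss='mse', optimizer=Adam())

# A larger batch gives a less noisy gradient for "info-poor" 4-float inputs.
history = model.fit(Xtrain, Ytrain, batch_size=32, epochs=1000,
                    validation_data=(Xtest, Ytest))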
Edit 2: @desertnaut made an excellent point that I totally missed, a correction without which your code will not work properly. He deserves the credit, so I will not include it here; please refer to his answer. Also, don't forget to raise your batch size, and not "manually mess" with your learning rate: "adam", for example, will do it for you.

Predicting the percentage accuracy based on limited features

A practice problem, predicting whether (and with what accuracy/probability) an Uber ride gets completed after being ordered, has the following features:
Available Drivers int64
Placed Time float64
Response Distance float64
Car Type int32
Day Of Week int64
Response Delay float64
Order Completion int32 [target]
My approach has been to use tf.Keras Sequential to predict the target. Here's what it looks like:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=input_shape),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
binary_crossentropy_loss = tf.keras.losses.BinaryCrossentropy()
model.compile(optimizer=adam_optimizer,
              loss=binary_crossentropy_loss,
              metrics=['accuracy'])
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=ES_PATIENCE)
history = model.fit(train_dataset, validation_data=validation_dataset, epochs=EPOCHS, verbose=2,
                    callbacks=[early_stop])
I normalize the data like this (note that train_data is a dataframe):
train_data = tf.keras.utils.normalize(train_data)
And then for predicting,
predictions = model.predict_proba(prediction_dataset, batch_size=None)
Training results:
loss: 0.3506 - accuracy: 0.8817 - val_loss: 0.3493 - val_accuracy: 0.8773
But this still gives me poor-quality probabilities for the corresponding outcome. Is this the wrong approach?
What approach would you suggest for a problem like this, and am I doing it completely wrong? Are neural networks a bad idea for this solution? Thanks a lot!
As you framed the problem, this is a classic machine learning classification problem.
Given N features (independent variables), you have to predict 1 (one) dependent variable.
The way in which you constructed the neural network is good.
Since you have a binary classification problem, the sigmoid activation is the correct one.
With respect to the complexity of your model (number of layers, number of neurons per layer) it depends very much on your dataset.
If you have a comprehensive dataset with a lot of features and a lot of examples (an example is a row in the dataframe with X1, X2, X3..., Y, where the X are the features and Y the dependent variable), your model can vary in complexity.
If you have a small dataset with a few features, a small model is recommended. Always begin with a small model.
If you run into the issue of underfitting (poor accuracy on the training set and also on the validation and test set), you can gradually increase the complexity of the model (add more layers, add more neurons per layer).
If you run into the issue of overfitting, implementing regularisation techniques may help (Dropout, L1/L2 Regularisation, Noise Addition, Data Augmentation).
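As an illustration of those regularisation techniques, the model above could be extended as in this sketch (the L2 factor and dropout rate are assumptions you would need to tune):
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.models.Sequential([
    layers.Dense(16, activation='relu', input_shape=input_shape,
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 weight penalty
    layers.Dropout(0.2),  # randomly drop units to reduce co-adaptation
    layers.Dense(16, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])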
What you have to take into consideration is that, if you have a small dataset, a classical machine learning algorithm could outperform the deep learning model. This happens because neural networks are very 'hungry': compared to classical machine learning models, they require much more data in order to work properly. You could choose SVM/kernel SVM/Random Forest/XGBoost and other similar algorithms.
EDIT!
Asking "whether or not, and with what accuracy/probability" automatically splits the problem into two parts, not just a simple classification one.
What I would personally do is the following. Since probabilities fall between 0% and 100%, if you had probability as a feature in your X columns (which you don't), then, depending on the number of data points (rows) you have, you could assign a label to each probability band: 1 to (0%, 25%), 2 to (25%, 50%), 3 to (50%, 75%), 4 to (75%, 100%). But that depends exclusively on having that prior probability information as a feature. Then, if inference returned label 3, you would know the probability band of the ride being completed.
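A tiny sketch of that hypothetical bucketing (again, this assumes a probability column the dataset does not actually have):
import numpy as np

probs = np.array([0.10, 0.40, 0.65, 0.90])  # hypothetical probabilities
labels = np.digitize(probs, [0.25, 0.50, 0.75]) + 1  # bands 1..4
print(labels)  # [1 2 3 4]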
Otherwise, you cannot frame your current problem as both a classification and a probability one.
I hope that I have given you an introductory insight. Happy coding.
If you are doing classification, you may want to look into ensemble methods (forests, boosts, etc.)
If you are calculating probability, you may want to look into probabilistic graphical models (Bayesian networks, etc.)

I train a full CNN on 3-D data and I get NaN in the training loss

I am trying to train a full CNN on 3-D data (I use Conv3D). First, some context: the input is a 3-D matrix that represents the density map of a protein, and the output is a 3-D matrix where the locations of C-alpha atoms are labeled 1 and the rest are labeled 0. As expected, this leads to a massive class imbalance, so I implemented a custom cross-entropy cost function that focuses the model on class = 1, as shown below:
custom cost function
These maps tend to be large, so training time increases steeply with even a slight increase in map dimension, or if I make my network a little deeper. In addition, a large part of each map is empty space, but I have to keep it to maintain the real distances between different C-alpha locations. To work around this, I split each map into smaller boxes of dimension (5,5,5). The benefit of this approach is that I get to ignore the empty space, which significantly reduces the memory and computation needed for training.
The problem I have now is that I get NaN in the training loss and the training is terminated, as shown below:
Network training behavior
this is the network I am using:
model = Sequential()
model.add(Conv3D(15, kernel_size=(3,3,3), activation='relu', input_shape=(5,5,5,1), padding='same'))
model.add(Conv3D(30, kernel_size=(3,3,3), activation='relu', padding='same'))
model.add(Conv3D(60, kernel_size=(3,3,3), activation='relu', padding='same'))
model.add(Conv3D(30, kernel_size=(3,3,3), activation='relu', padding='same'))
model.add(Conv3D(15, kernel_size=(3,3,3), activation='relu', padding='same'))
model.add(Conv3D(1, kernel_size=(3,3,3), activation='sigmoid', padding='same'))
model.compile(loss=weighted_cross_entropp_loss, optimizer='nadam',metrics=['accuracy'])
############# model training ######################################
model.fit(x_train, y_train, batch_size=32, epochs=epochs, verbose=1,
          validation_split=0.2, shuffle=True,
          callbacks=[stop_immediately, save_best_model, stop_here_please])
model.save('my_map_model_weighted_custom_box_5.h5')
Can anyone please help me? I have been working on this problem for many weeks.
Regards
So I asked this question some time ago, and since then I have been trying to find a solution to it. I think I managed to find it, so I thought I should share it. But before I do, I want to share something I read that I think points to the core of my problem. It goes roughly like this: "if a hyperparameter works for someone else, it does not necessarily work for you. If you are trying to solve a new problem using a new architecture, then you need new hyperparameters."
Sadly, the Keras documentation (Keras is a good tool, by the way, if used within its expected domain) states that the optimizer parameters should be kept as they are, which is wrong. Now to the solution.
The short answer: the learning rate was too high.
The long answer: here is how to figure that out. If you get a NaN within the first 100 iterations, that is a straight indication that the learning rate is too high. However, if you get it after the first 100 iterations, the problem can be one of two things:
if you are using an RCNN, then you have to use gradient clipping;
if you have a custom layer (I had a custom cost function), then that is the likely reason. An important thing to keep in mind is that there is a difference between a correct implementation and a stable implementation: you can implement your custom layer correctly, yet there are tricks you need to add to make it numerically stable, and that is what was missing from my custom loss.
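To make that concrete, here is a minimal sketch of one common stabilization trick for a custom weighted cross-entropy (the pos_weight value and the exact form of the loss are assumptions; the original loss from the question was only shown as an image):
from keras import backend as K

def weighted_cross_entropy_loss(y_true, y_pred, pos_weight=10.0):
    # Clip predictions away from exactly 0 and 1; without this,
    # log(0) yields -inf and the training loss turns into NaN.
    eps = K.epsilon()
    y_pred = K.clip(y_pred, eps, 1.0 - eps)
    # Weighted binary cross-entropy: up-weight the rare positive class.
    loss = -(pos_weight * y_true * K.log(y_pred)
             + (1.0 - y_true) * K.log(1.0 - y_pred))
    return K.mean(loss)

# Gradient clipping and a lower learning rate are set on the optimizer:
from keras.optimizers import Nadam
opt = Nadam(lr=1e-4, clipnorm=1.0)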

How to get the prediction of new data by LSTM in python

This is a univariate time-series prediction problem. As the following code shows, I divide the initial data into a train dataset (trainX) and a test dataset (testX), then I create an LSTM network with Keras. Next, I train the model on the train dataset. However, when I want to get the prediction, I need to know the test values, so my problem is: why do I have to predict at all, since I already know the true values (the test dataset)? What I want is the predicted value at future times. If I have some misunderstanding about LSTM networks, please tell me.
Thank you!
from keras.models import Sequential
from keras.layers import Dense, LSTM

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
Since we don't have the future values while training the model, we just divide the data into train and test sets. Then we imagine that the test set holds future values. We train our model using the train set (and usually a validation set as well), and after the model is trained, we test it using the test set to check our model's performance.
why do I have to predict, since I already know the true values (the test dataset)? What I want is the predicted value at future times.
In ML, we give the model test data X and it returns Y. In the case of time series, this may mislead a beginner a bit, because the input and output apparently come from the same series: the difference is that we input old values of the time series as X, and the output Y is a value of the same series at a later time (the same idea can be applied to the present or even the past), as you have correctly identified.
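To make the X/Y distinction concrete, here is a minimal sketch of how such windows are usually built, assuming series is a 1-D array and look_back matches the model above:
import numpy as np

def create_dataset(series, look_back):
    # Each X row holds `look_back` consecutive past values;
    # Y is the single value that immediately follows them.
    X, Y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        Y.append(series[i + look_back])
    X = np.array(X).reshape(-1, 1, look_back)  # (samples, 1, look_back)
    return X, np.array(Y)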
(P.S.: I would recommend you begin with simple regression and then move on to LSTMs etc., if all you want is to learn machine learning.)
I think the correct term in this context is 'Forecasting'.
A good explanation is: after you train and test your model with the data you already have (as others said before me), you want to predict future data, which is, I think, the truly interesting thing about recurrent networks.
To do this, you start by predicting the value one day after the final date in your original dataset, using the model (which was trained on this past data). Once you predict this value, you do the same thing again, but now taking the last predicted value into account, and so on.
The fact that you are using predictions to make further predictions means it is much harder to get good results, so it is common to predict only short ranges of time.
The exact code you need to do this could vary, but I think that is the core concept; a sketch follows.
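A minimal sketch of that recursive loop, assuming model is the trained LSTM above and last_window holds the final look_back observed values, shaped (1, 1, look_back) to match the model's input:
import numpy as np

def forecast(model, last_window, n_steps):
    window = last_window.copy()
    predictions = []
    for _ in range(n_steps):
        # Predict one step ahead from the current window.
        next_value = model.predict(window)[0, 0]
        predictions.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = np.roll(window, -1, axis=2)
        window[0, 0, -1] = next_value
    return np.array(predictions)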
In the link below, in the last part, where a forecast is performed, the author shows code and an explanation of how he did it:
https://towardsdatascience.com/time-series-forecasting-with-recurrent-neural-networks-74674e289816
I guess that's it.

Neural net with duplicated inputs - Keras

I have a dataset of N videos; each video is characterized by some metrics (which will be the inputs for a neural net), and my goal is to predict the score a person will give when he or she watches the video.
The problem is that in my dataset each video was watched more than once by different subjects, so I was forced to duplicate the same metrics (inputs) as many times as the video was watched, to keep all the scores given by the subjects.
I built an MLP model to predict the scores, but when I calculate the RMSE it is always higher than 0.7.
I want to know whether having a dataset like that would affect the performance of my model, and how I can deal with it.
Here is what the dataset looks like:
The first 5 columns are the inputs and the last one is the subjects' score. Note that all of them are normalized.
Here is my Model:
import numpy
from keras.models import Sequential
from keras.layers import Dense

def mlp_model():
    # create model
    model = Sequential()
    model.add(Dense(100, input_dim=5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

seed = 100
numpy.random.seed(seed)
myModel = mlp_model()
myModel.fit(x=x_train, y=y_train, batch_size=10, epochs=45, validation_split=0.3, shuffle=True, callbacks=[plot_losses])
predictions = myModel.predict(x_test)
print(predictions)
Your problem statement reveals an inherent flaw in the design. As you correctly pointed out, you have no way of knowing what the user does, how she has rated other videos, and how she will rate the current video.
It would be helpful to explain what your current input values are, and whether they could differ at all. For example, a metric like "time spent watching the video" might be different for different users.
On a larger scale, try to answer the question of whether you could determine the rating with a completely deterministic judgement, i.e. would it be possible to come up with the same answer given the same input, and consistently get the same result?
Since that is currently not the case, I would say you should invest more time in finding a suitable approach to your problem, for example recommender systems, though that also requires you to use a lot of different input information.
Alternatively, you could try to find more input data, which specifically identifies the users, and allows you to make more suitable predictions; even then, it will be hard to base a reasonable prediction on such proxy metrics, since you might end up creating an unwanted bias in your preprocessing.
In any case, getting much better results with the current format of the input is very unlikely.
