Keras neural network predicts the same number for all inputs

Keras neural network predicts the same number for all inputs - python

I am trying to create a keras neural network to predict distance on roads between two points in city. I am using Google Maps to get travel distance and then train neural network to do that.
import pandas as pd
arr=[]
for i in range(0,100):
arr.append(generateTwoPoints(55.901819,37.344735,55.589537,37.832254))
df=pd.DataFrame(arr,columns=['p1Lat','p1Lon','p2Lat','p2Lon', 'distnaceInMeters', 'timeInSeconds'])
print(df)
Neural network architecture:
from keras.optimizers import SGD
sgd = SGD(lr=0.00000001)
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(100, input_dim=4 , activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
Then i divide sets to test/train
Xtrain=train[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytrain=train[['distnaceInMeters']]/100000
Xtest=test[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytest=test[['distnaceInMeters']]/100000
Then i fit data into the model, but loss stays the same:
history = model.fit(Xtrain, Ytrain,
batch_size=1,
epochs=1000,
# We pass some validation for
# monitoring validation loss and metrics
# at the end of each epoch
validation_data=(Xtest, Ytest))
I later print the data:
prediction = model.predict(Xtest)
print(prediction)
print (Ytest)
But result is the same for all the inputs:
[[0.26150784]
[0.26171574]
[0.2617755 ]
[0.2615582 ]
[0.26173398]
[0.26166356]
[0.26185763]
[0.26188275]
[0.2614446 ]
[0.2616575 ]
[0.26175532]
[0.2615183 ]
[0.2618127 ]]
distnaceInMeters
2 0.13595
6 0.27998
7 0.48849
16 0.36553
21 0.37910
22 0.40176
33 0.09173
39 0.24542
53 0.04216
55 0.38212
62 0.39972
64 0.29153
87 0.08788
I can not find the problem. What is it? I am new to machine learning.

You are doing a very elementary mistake: since you are in a regression setting, you should not use a sigmoid activation for your final layer (this is used for binary classification cases); change your last layer to
model.add(Dense(1,activation='linear'))
or even
model.add(Dense(1))
since, according to the docs, if you do not specify the activation argument it defaults to linear.
Various other advice offered already in the other answer and the comments may be useful (lower LR, more layers, other optimizers e.g. Adam), and you certainly need to increase your batch size; but nothing will work with the sigmoid activation function you currently use for your last layer.
Irrelevant to the issue, but in regression settings you don't need to repeat your loss function as a metric; this
model.compile(loss='mse', optimizer='sgd')
will suffice.

It would be very useful if you could post the the progression of the loss and MSE (of both the training and validation/test set) as it goes throughout the training. Even better, it would be best if you can visualize it as per https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/ and post the vizualization here.
In the meantime, based on the facts:
1) You say the loss isn't decreasing (I'm assuming on the training set, during training, based on your compile args).
2) You say that the prediction "accuracy" on your test set is bad.
3) My experience/intuition (not an empirical assessment) tells me that your two layer dense model is a little too small to be able to capture the complexity inherent in your data. AKA your model is suffering from too high a Bias https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
The fastest and easiest thing you can try, is to try to add both more layers and more nodes to each layer.
However, I should note that there is a lot of causal information that can affect the driving distance and driving time beyond just the the distance between two coordinates, which might be the feature that your Neural network will most readily extract. For example, whether you drive on a highway or sides treets, traffic lights, whehter the roads twist and turn or go straight... to infer all of that just from that the data you will need enormous amounts of data(examples) in my opinion. If you could add input columns with e.g. disatance to nearest higway from both points, you might be able to train with less data
I would also reccomend that you souble check that you are feeding as input what you think you are feeding (and its shape), and also, you should use some standardization from function sklearn which might help the model learn faster and converge faster to a higher "accuracy".
If and when you post either more code or the training history I can help you more (and also how many training samples).
EDIT 1: Try changing batch size to a larger number preferably batch_size=32 if it fits in your memory. you can use a small batch size (such as 1) when working with an "info rich" input like an image, but when using a very "info poor" datum like 4 floats (2 coordinates), the gradient will point each batch (with batch_size=1) to a practically random (pseudo...) direction and not neccessarily get any closer to a local minimum. Only when taking the gradient on the collective loss of a larger batch (like 32, and perhaps more) will you get a gradient that points at least approximately in the direction of the local minimum and converge to a better result. Also, I suggest that you don't mess with the learning rate manually and perhaps change to an optimizer like "adam" or "RMSProp".
Edit 2: #Desertnaut made an excellent point that I totally missed, a correction without which, your code will not work properly. He deserves the credit so I will not include it here. Please refer to his answer. Also, don't forget to raise your batch size, and not "manually mess" with your learning rate, "adam" for example, will do it for you.

Related

Need help defining a simple neural network

I am very new to this and I have several question. I have code snippets of a neural network created python with keras. The model is used for sentiment anaylsis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(train_x, train_y,
batch_size=32,
epochs=5,
verbose=1,
validation_split=0.1,
shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment anaylsis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolusional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!

Categorical Cross Entropy, which is the loss function for this neural network, makes it usable for Sentiment Analysis. Cross Entropy loss returns probabilities for different classes. In your case, you need probabilities for two possible classes (0 or 1).
I am not sure if you are using a tokenizer since it is not apparent from the code you provided but if you are, then yes, it is a Bad of words model. A Bag of words model essentially creates a storage for the word roots you have in your text.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
The network architecture you are using is not Convolutional, rather it is a feedforward model, which connects all units from one layer to all the units in the next, providing a dot product of the values from the two layers.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense reflects to the fact that all units in the first layer (512) are connected to every other unit in the next layer, i.e., a total of 512x256 connections between first layer and the second.
Yes, the connections between the 512 units in the first layer to the 256 units in the second layer resulting in a 512x256 dimensional matrix of parameters makes it dense. But the usage of Dense here is more from an API perspective rather than semantic. Similarly, the parameter matrix between the second and third layer would be 256x2 dimensional.
If you exclude the input layer (having 512 units) and output layer (having 2 possible outputs, i.e., 0/1), then your network here has one layer, with 256 units.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as being a supervisor to the network indicating it whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions being used here serve the purpose of providing nonlinearity to the network's computations. In a little more detail, sigmoid has a nice property that its output can be interpreted as probabilities. So if the network is outputting 0.89 for a data point, then it would mean that the model evaluates that data point to be positive with a probability of 0.89 .
The usage of sigmoid is probably for teaching purposes since ReLU activation units are favored over sigmoid/tanh because of better convergence properties and I don't see a convincing reason to use sigmoid instead of ReLU.

Predicting the percentage accuracy based on limited features

A practice problem based on whether or not and with what accuracy/probability an uber ride gets completed after being ordered has the following features:
Available Drivers int64
Placed Time float64
Response Distance float64
Car Type int32
Day Of Week int64
Response Delay float64
Order Completion int32 [target]
My approach has been to use tf.Keras Sequential to predict the target. Here's what it looks like:
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(16, activation='relu', input_shape=input_shape),
tf.keras.layers.Dense(16, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(1, activation='sigmoid')
])
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
binary_crossentropy_loss = tf.keras.losses.BinaryCrossentropy()
model.compile(optimizer=adam_optimizer,
loss=binary_crossentropy_loss,
metrics=['accuracy'])
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=ES_PATIENCE)
history = model.fit(train_dataset, validation_data=validation_dataset, epochs=EPOCHS, verbose=2,
callbacks=[early_stop])
I normalize the data like this (note that train_data is a dataframe):
train_data = tf.keras.utils.normalize(train_data)
And then for predicting,
predictions = model.predict_proba(prediction_dataset, batch_size=None)
Training results:
loss: 0.3506 - accuracy: 0.8817 - val_loss: 0.3493 - val_accuracy: 0.8773
But this still gives me a poor quality probability for the corresponding occurrence. Is this the wrong approach ?
What approach would you suggest for a problem like this and am I doing it completely wrong ? Are Neural Networks a bad idea for this solution? Thanks a lot!

As you framed the problem, this is a classic machine learning classification problem.
Given N features(independent variables) you have to predict 1(one) dependent variable.
The way in which you constructed the neural network is good.
Since you have a binary classification problem, the sigmoid activation is the correct one.
With respect to the complexity of your model (number of layers, number of neurons per layer) it depends very much on your dataset.
If you have a comprehensive dataset with a lot of features and a lot of examples(an example is a row in dataframe with X1,X2,X3... Y), where X are the features and Y the dependent variable, your model can vary in complexity.
If you have a small dataset with a few features, a small model is recommended. Always begin with a small model.
If you run into the issue of underfitting (poor accuracy on the training set and also on the validation and test set), you can gradually increase the complexity of the model (add more layers, add more neurons per layer).
If you run into the issue of overfitting, implementing regularisation techniques may help (Dropout, L1/L2 Regularisation, Noise Addition, Data Augmentation).
What you have to take into consideration is that, if you have a small dataset, then a classical machine learning algorithm could outperform the deep learning model. This happens because neural networks are very 'hungry' ---> as compared to machine learning models, they require much more data in order to properly work. You could choose SVM/Kernel SVM/Random Forest/ XGBoost and other similar algorithms.
EDIT!
Whether or not and with what accuracy/probability automatically splits the problem into two parts, not only a simple classification one.
What I would personally do is the following: Since the probabilities occur between 0% and 100%, if you had probability as a feature in your X columns (which you don't), then, according to the number of data points(rows) you have you could do the following: I would assign a label to each probability section: 1 to (0%,25%), 2 to (25%, 50%), 3 to (50%,75%), 4 to (75%, 100%). But that depends exclusively on the prior probability information(if you had the probability as a feature). Then if you inferred and you get label 3, you would know the probability of the ride being completed.
Otherwise, you cannot frame your current problem as both a classification and a probablity one.
I hope that I have given you an introductory insight. Happy coding.

If you are doing classification, you may want to look into ensemble methods (forests, boosts, etc.)
If you are calculating probability, you may want to look into probabilistic graphical models (Bayesian networks, etc.)

Accuracy Decreasing with higher epochs

I am a newbie to Keras and machine learning in general. I'm trying to build a binary classification model using the Sequential model. After some experimenting, I saw that on multiple runs(not always) I was getting a an accuracy of even 97% on my validation data in the second or third epoch itself but this dramatically decreased to as much as 12%. What is the reason behind this ? How do I fine tune my model ?
Here's my code -
model = Sequential()
model.add(Flatten(input_shape=(6,size)))
model.add(Dense(6,activation='relu'))
model.add(Dropout(0.35))
model.add(Dense(3,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['binary_accuracy'])
model.fit(x, y,epochs=60,batch_size=40,validation_split=0.2)

According to me, you can take into consideration the following factors.
Reduce your learning rate to a very small number like 0.001 or even 0.0001.
Provide more data.
Set Dropout rates to a number like 0.2. Keep them uniform across the network.
Try decreasing the batch size.
Using appropriate optimizer: You may need to experiment a bit on this. Use different optimizers on the same network, and select an optimizer which gives you the least loss.
If any of the above factors work for you, please let me know about it in the comments section.

When I have this issue, I solved by changing the optimizer from RMSprop to adam. I also reduced the learning rate and added dropout after each fully connected layers. If your FC layers have small number of neurons then adding dropout does not make a major difference.

Reduce the learning rate (like 0.001 or 0.0001) and add batch normalization after every convolution layer.

Validation and training accuracy high in the first epoch [Keras]

I am training an image classifier with 2 classes and 53k images, and validating it with 1.3k images using keras. Here is the structure of the neural network :
model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
loss='binary_crossentropy', metrics=['accuracy'])
Training accuracy increases from ~50% to ~85% in the first epoch, with 85% validation accuracy. Subsequent epochs increase the training accuracy consistently, however, validation accuracy stays in the 80-90% region.
I'm curious, is it possible to get high validation and training accuracy in the first epoch? If my understanding is correct, it starts small and increases steadily with each passing epoch.
Thanks
EDIT : The image size is 150x150 after rescaling and the mini-batch size is 16.

Yes, it is entirely possible to get high accuracy on first epoch and then only modest improvements.
If there is enough redundancy in the data and you make enough updates (wrt. the complexity of your model, which seems fairly easy to optimize) in the first epoch (i.e. you use small minibatches), it's entirely possible that you learn most of the important stuff during the first epoch. When you show the data again, the model will start overfitting to pecularities introduced by the specific images in your train set (thus you get increasing training accuracy), but since you do not provide any novel samples, it will not learn anything new about the underlying properties of your classes.
You can think of your training data as an infinite stream (which actually SGD would like to enjoy all the convergence theorems). Do you think that you need more than 50k samples to learn what is important? You can actually test the data-hunger of your model by providing less data or reporting performance after a some sub-epoch number of updates.

You cannot expect to get an accuracy over 90-95% with image classification using feed forward neural networks.
You need to use another architecture called Convolution Neural network, state of the art in image recognition.
Also it is very easy to build that using keras, but it is computationally more intensive than this.
If you want to stick with feed forward layers the best thing you can do is early stopping, but even that wouldn't give you accuracy over 90%.

Yeah, epoch are supposed to fit the data on the model.
Try using 2 neurons at the end and one hot encode on ur Class label!
Like I have seen one case where I got better results doing that, instead of binary output.

lstm prediction result delay phenomenon

Recently I'm using lstm to predict time series. I'm using keras 2.0 to construct my lstm model. It has a structure like this:
model = Sequential()
model.add(LSTM(128, input_shape=(timesteps, 1), return_sequences=False, stateful=False)
model.add(Dropout(rate=0.1))
model.add(Dense(1))
I have tried to use this network to predict several time series including sin(t) and a real traffic flow dataset. I found that the prediction for sin is fine while the prediction for real dataset is just like shifting the last input value by one step. I don't know whether it's a prediction error or the network doesn't learn the pattern of the dataset at all. Does anyone get similar results? Are there any solutions to this annoying shift? Thanks a lot.
Here are some of my predictions:
3 frequencies sin prediction result
real traffic dataset prediction result

This is simply the starting point for your network and you'll have to work through it by trying various things.
To name only a few:
Try different window lengths (timesteps fed into network)
Try adding dense layers, or multiple LSTM layers, or fewer LTSM nodes
Try different optimizers, with various learning rates
Look for additional datapoints to feed into the network
How much data do you have? You may need more to get a good prediction
Try different offsets for the Y variable, how many timesteps do you need to be able to predict out for your specific problem?
The list goes on....

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.