Big difference when using Relu over Tanh on a simple problem

I was testing a toy problem in which the input is a vector of zeros and ones and the output is whether the number of ones is odd or even (simplicity itself). With an MLP that uses Tanh activation, I never managed to get beyond random-guess performance (~50%)! Just by chance, I tried ReLU (out of desperation), and... it worked perfectly (reaching 100% accuracy most of the time).
Then, while discussing it with a friend, we wanted to see what would happen if we replaced the zeros with -1 (the task stays the same: an odd or even number of ones). To my sheer surprise, it worked with Tanh (accuracy between 75% and 90%). ReLU still performs better.
The code
import numpy as np
from sklearn.neural_network import MLPClassifier
# from sklearn.preprocessing import StandardScaler
def generate_data(batch_size, data_length=10, zeros=True):
    x = np.random.randint(0, 2, (batch_size, data_length))
    y = x.sum(axis=1) % 2
    y = y.astype(np.int16).reshape(-1)
    if not zeros:  # in this case, convert the zeros to -1
        x[x == 0] = -1
    return x, y
# With ReLU, it is perfect! With Tanh, it is shit
# clf = MLPClassifier(solver='adam', verbose=True, batch_size=512, activation="relu")
clf = MLPClassifier(solver='adam', verbose=True, batch_size=512, activation="tanh")
X_train, y_train = generate_data(batch_size=10000, data_length=10, zeros=True)
X_test, y_test = generate_data(batch_size=1000, data_length=10, zeros=True)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
To use -1 instead of zeros, just pass zeros=False when calling the generate_data function.
Can someone please explain what is happening here?
Edit:
Thanks to @BlackBear and @Andreas K. for their answers. Apparently, using Tanh leads the neurons to saturate (the gradients become very small, so the weights barely move). With a better choice of learning rate, or by letting the network optimize for longer, it does work. For example, after changing the classifier settings to
clf = MLPClassifier(solver='adam', verbose=True, batch_size=512, activation="tanh", max_iter=5000, learning_rate="adaptive", n_iter_no_change=100)
It always works!

It is just an issue with the optimization procedure, which is not able to find good values for the weights. A solution certainly exists: you can construct a network with 2^10 = 1024 neurons in the hidden layer, one for each input pattern, and let the output neuron respond to the hidden neurons whose patterns contain an even number of ones. With this construction, you can model every boolean function.
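The construction described above can be written out by hand, with no training at all. The following is a minimal sketch under a few assumptions that are not in the original answer: ReLU hidden units, a specific choice of weights and biases, and an output that fires for an odd number of ones so that it matches the label y = x.sum(axis=1) % 2 from the question (swap the output weights to detect even counts instead).

```python
import itertools
import random

N_BITS = 10

# One hidden neuron per possible input pattern: 2**10 = 1024 neurons.
patterns = list(itertools.product([0, 1], repeat=N_BITS))

# The neuron for pattern p has weight +1 where p is 1 and -1 where p is 0,
# and bias 1 - sum(p). Its pre-activation equals 1 - hamming_distance(x, p),
# so after ReLU it outputs 1 when x == p and 0 for every other binary input.
weights = [[2 * bit - 1 for bit in p] for p in patterns]
biases = [1 - sum(p) for p in patterns]

# Output weight is 1 for patterns with an odd number of ones, 0 otherwise.
output_weights = [sum(p) % 2 for p in patterns]

def relu(z):
    return max(0, z)

def predict(x):
    """Forward pass of the hand-built network; returns 0 or 1."""
    hidden = [relu(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
              for w, b in zip(weights, biases)]
    return sum(h * v for h, v in zip(hidden, output_weights))

random.seed(0)
samples = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(200)]
accuracy = sum(predict(x) == sum(x) % 2 for x in samples) / len(samples)
print(accuracy)  # 1.0
```

Since exactly one hidden neuron is active for any input, the network is a lookup table. That shows a perfect set of weights exists, so a failure to reach it is a failure of the optimizer, not of the architecture.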

Related

Multiple outcome values for simple neural network. What activate function to use

Hi, I'm trying to build a simple neural network with TensorFlow. I give the model training_data, which contains the standard values, and target_data, which is the result I want it to produce when the input is close to one of those values.
For example, if I give y_test a value of 3.5, the model should predict a number close to 4. So the condition would say it was a light smoker. I searched a bit for activation functions and learned I can't use sigmoid for what I want to do. I'm quite new to this. What I've done so far has been trial and error.
import random
import tensorflow as tf
import numpy as np
training_data=[]
for i in range(0,5):
    training_data.append([random.uniform(0,0.2944)])
for i in range(0,5):
    training_data.append([random.uniform(0.2944,1.7394)])
for i in range(0,5):
    training_data.append([random.uniform(1.7394,3.2394)])
for i in range(0,5):
    training_data.append([random.uniform(3.2394,6)])
target_data=[]
for i in range(0,5):
    target_data.append([1])
for i in range(0,5):
    target_data.append([2])
for i in range(0,5):
    target_data.append([3])
for i in range(0,5):
    target_data.append([4])
y_test= np.array([100])
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(len(target_data),input_dim=1,activation='softmax'))
model.add(tf.keras.layers.Dense(1,activation='relu'))
model.compile( loss='mean_squared_error',
               optimizer='adam',
               metrics=['accuracy'])
training_data = np.asarray(training_data)
target_data = np.asarray(target_data)
model.fit(training_data, target_data, epochs=50, verbose=0)
target_pred= model.predict(y_test)
target_pred=float(target_pred)
print("X=%s, Predicted=%s" % (y_test, target_pred))
if( 0<= target_pred <= 1.5):
    print("\nNon-Smoker")
elif(1.5<= target_pred <2.5):
    print("\nPassive Smoker")
elif(2.5<= target_pred <3.5 ):
    print("Lghtsmoker")
else:
    print("Smoker\n")

Here is a helpful guide to using activation functions in the final layer, as well as the corresponding losses for different types of problems.
In your case, I am assuming you are working on a regression task with arbitrary output values (any float, not restricted to the range 0 to 1 or -1 to 1). So skip the activation function on the final layer and keep mse (mean_squared_error) as your loss function.
EDIT:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(3,input_shape=(1,),activation='relu'))
model.add(tf.keras.layers.Dense(1))

You are defining your problem as a regression problem, where the result of model.predict is a continuous value. For that kind of situation the last layer in your model is a linear layer that does not have an activation function, and mse is fine as your loss.
Alternatively, you could define your problem as a classification problem with 4 classes: Non-Smoker, Passive-Smoker, Light-Smoker and Smoker. In that case, your target data in training is not a number in the numerical sense but an integer that indicates which class the training sample belongs to. For example, you could have Non_Smoker with the label 0, Passive_Smoker with the label 1, Light_Smoker with the label 2 and Smoker with the label 3. The last layer in your model would then have one unit per class and use a softmax activation function. In model.compile your loss would be sparse_categorical_crossentropy, because your labels are integers. If you one-hot encode your labels instead, for example Non_Smoker as 1000, Passive_Smoker as 0100, Light_Smoker as 0010 and Smoker as 0001, then your loss function would be categorical_crossentropy.
When you run model.predict on a test sample, it will produce a list containing 4 probabilities: the first is the probability of class 0 (Non_Smoker), the second of class 1 (Passive_Smoker), and so on. Use np.argmax to find the index with the highest probability; that index is the model's prediction.
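To make the decoding step concrete, here is a small pure-Python sketch of what happens to the output of a softmax layer. The class names come from the four branches in the question's code. The logits are made-up numbers standing in for a model's raw output, and the helper functions are illustrative, not Keras APIs.

```python
import math

# Class names taken from the four branches in the question's code;
# the integer label of a class is simply its index in this list.
CLASS_NAMES = ["Non-Smoker", "Passive Smoker", "Light Smoker", "Smoker"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    """One-hot encoding of an integer label, e.g. 1 -> [0, 1, 0, 0]."""
    return [1 if i == label else 0 for i in range(num_classes)]

def decode(probabilities):
    """Same idea as np.argmax: index of the largest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, CLASS_NAMES[best]

# Hypothetical raw outputs of the last layer for one test sample.
logits = [0.2, 0.1, 2.5, 0.4]
probs = softmax(logits)
label, name = decode(probs)
print(label, name)  # 2 Light Smoker
```

With integer targets such as 2 you would pair this output with a sparse categorical cross-entropy loss; with targets such as one_hot(2, 4) you would use the categorical variant.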

Tensorflow NN not giving any reasonable output

I want to train a network on the isolet dataset, consisting of 6238 samples with 300 features each.
This is my code so far:
import tensorflow as tf
import sklearn.preprocessing as prep
import numpy as np
import matplotlib.pyplot as plt
def main():
    X, C, Xtst, Ctst = load_isolet()
    #normalize
    #X = (X - np.mean(X, axis = 1)[:, np.newaxis]) / np.std(X, axis = 1)[:, np.newaxis]
    #Xtst = (Xtst - np.mean(Xtst, axis = 1)[:, np.newaxis]) / np.std(Xtst, axis = 1)[:, np.newaxis]
    scaler = prep.MinMaxScaler(feature_range=(0,1))
    scaledX = scaler.fit_transform(X)
    scaledXtst = scaler.transform(Xtst)
    # Build the tf.keras.Sequential model by stacking layers. Choose an optimizer and loss function for training:
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(X.shape[1], activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(26, activation='softmax')
    ])
    ES_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', min_delta=1e-2, patience=10, verbose=1)
    initial_learning_rate = 0.01
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate,decay_steps=100000,decay_rate=0.9999,staircase=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    history = model.fit(scaledX, C, epochs=100, callbacks=[ES_callback], batch_size = 32)
    plt.figure(1)
    plt.plot(range(len(history.history['loss'])), history.history['loss']);
    plt.plot(range(len(history.history['accuracy'])), history.history['accuracy']);
    plt.show()
Up to now, I have pretty much turned every knob I know:
different number of layers
different sizes of layers
different activation functions
different learning rates
different optimizers (we are supposed to test with 'adam' and stochastic gradient descent)
different batch sizes
different data preparations (the features range from -1 to 1; I tried normalizing along the feature axis, z-score normalizing (z_i = (x_i - mean) / std(x_i)) and, as seen in the code above, scaling the values to the range 0 to 1, since I guess 'relu' activation won't work well with negative input values)
Pretty much everything I tried gives weird outputs, with extremely high loss values (depending on the learning rate) and very low accuracies during training. The loss increases over the epochs pretty much all of the time, but seems to be independent of the accuracy values.
For the code, I followed tutorials I was given. However, something is very off: I am supposed to find the best hyperparameters, but I'm not able to find any good ones whatsoever.
I'd be very glad to get some pointers on where I got the code wrong or need to preprocess the data differently.
Edit: Using loss='categorical_crossentropy' was given, so at least this one should be correct.

First of all:
Your convergence problems may be due to an "incorrect" loss function. tf.keras supports a variety of losses, and the right one depends on the format of your labels.
Pick the one that matches your labels:
tf.keras.losses.CategoricalCrossentropy if your labels are one-hot vectors,
tf.keras.losses.SparseCategoricalCrossentropy if your labels are integer class indices (0, 1, 2, ...),
or tf.keras.losses.BinaryCrossentropy if your labels are just 0 and 1.
Honestly, this part of tf.keras is a bit tricky, and settings like this might need tuning.
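The relationship between the two multi-class losses can be checked by hand. The functions below are simplified single-sample versions written for illustration. They are not the Keras implementations, and the probabilities are made up.

```python
import math

def categorical_crossentropy(one_hot_label, probs):
    """Cross-entropy for a one-hot label such as [0, 1, 0]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def sparse_categorical_crossentropy(int_label, probs):
    """Cross-entropy for an integer class index such as 1."""
    return -math.log(probs[int_label])

# Hypothetical softmax output for one sample with three classes.
probs = [0.7, 0.2, 0.1]

# The same ground truth expressed in the two label formats.
int_label = 1
one_hot_label = [0, 1, 0]

loss_a = categorical_crossentropy(one_hot_label, probs)
loss_b = sparse_categorical_crossentropy(int_label, probs)
print(loss_a, loss_b)
```

Both calls compute the negative log of the probability assigned to the true class, so the two losses agree. The only difference is the label format each one expects, which is why feeding one format to the other loss tends to produce shape errors or meaningless loss values.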
Second of all - this part:
scaler = prep.MinMaxScaler(feature_range=(0,1))
scaledX = scaler.fit_transform(X)
scaledXtst = scaler.fit_transform(Xtst)
Assuming Xtst is your test set, you want to scale it using the statistics learned from your training set, not refit the scaler on it. So the correct scaling would be just
scaledXtst = scaler.transform(Xtst)
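To see why this matters, here is a tiny single-column sketch of min-max scaling in plain Python. The helper names and the numbers are made up for illustration; they only mimic the fit/transform split of the scikit-learn scaler.

```python
def fit_min_max(column):
    """Learn the scaling statistics (min and max) from the training data."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Scale values with previously learned statistics."""
    return [(v - lo) / (hi - lo) for v in column]

train = [2.0, 4.0, 6.0, 10.0]
test = [3.0, 12.0]

# Fit on the training data only, then reuse the same statistics for the test data.
lo, hi = fit_min_max(train)
scaled_train = apply_min_max(train, lo, hi)
scaled_test = apply_min_max(test, lo, hi)

# Refitting on the test data maps the same raw values to different numbers.
refit_test = apply_min_max(test, *fit_min_max(test))

print(scaled_train)  # [0.0, 0.25, 0.5, 1.0]
print(scaled_test)   # [0.125, 1.25]
print(refit_test)    # [0.0, 1.0]
```

A test value may land outside [0, 1] when it lies outside the training range. That is expected, and it keeps the mapping from raw value to scaled value identical for training and test data.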
Hope this helps!

Neural network versus random forest performance discrepancy

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.
The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems, such as number of variables, maximum clause length, etc. The data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward: 1000 rows, with the rating in the final column. It started off with some tens of input columns, with all the numbers scaled to the range 0-1. I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in the Git history.
I started off using separate training and test sets, but have set aside the test set for the moment, because the question of whether training performance generalizes to testing doesn't arise until training performance has been obtained in the first place.
Simple linear regression on this data set has a mean squared error of about 0.14.
I implemented a simple feedforward neural network, code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below, that after a couple hundred training epochs also has a mean squared error of 0.14.
So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.
Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py
... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.
Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
# data
df = pd.read_csv("test.csv")
print(df)
print()
# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)
# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)
# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)
# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500
# model
class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.L0 = nn.Linear(in_features, hidden_size)
        self.N0 = nn.ReLU()
        self.L1 = nn.Linear(hidden_size, hidden_size)
        self.N1 = nn.Tanh()
        self.L2 = nn.Linear(hidden_size, hidden_size)
        self.N2 = nn.ReLU()
        self.L3 = nn.Linear(hidden_size, 1)
    def forward(self, x):
        x = self.L0(x)
        x = self.N0(x)
        x = self.L1(x)
        x = self.N1(x)
        x = self.L2(x)
        x = self.N2(x)
        x = self.L3(x)
        return x
model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# train
print("training")
for epoch in range(1, epochs + 1):
    # forward
    output = model(X_tensor)
    cost = criterion(output, y_tensor)
    # backward
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()
    # print progress
    if epoch % (epochs // 10) == 0:
        print(f"{epoch:6d} {cost.item():10f}")
print()
output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())

Can you please print the shape of your input?
I would check these things first:
That your target y has the shape (-1, 1). I don't know if PyTorch throws an error in this case. You can use y.reshape(-1, 1) if it isn't 2-dimensional.
Your learning rate is high. When using Adam, the default value is usually good enough; otherwise simply try lowering your learning rate. 0.1 is a high learning rate to start with.
Place optimizer.zero_grad() on the first line inside the for loop.
Normalize/standardize your data (this is usually good for NNs).
Remove outliers from your data (my opinion: I think they don't affect a random forest much, but they can affect NNs badly).
Use cross-validation (maybe skorch can help you here; it's a scikit-learn wrapper for PyTorch and easy to use if you know Keras).
Note that a random forest regressor, or any other regressor, can outperform neural nets in some cases. There are some fields where neural nets are the heroes, like image classification or NLP, but you need to be aware that a simple regression algorithm can outperform them, usually when your data is not big enough.
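The first check in the list above (the shape of y) is worth illustrating, because a shape mismatch does not have to raise an error. The sketch below uses NumPy rather than PyTorch and random made-up targets; it assumes only that the same broadcasting rule applies when an (n, 1) output is compared with an (n,) target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

y = rng.random(n)          # targets with shape (n,), like y_tensor in the question
pred = y.reshape(-1, 1)    # a *perfect* prediction, but with shape (n, 1)

# (n, 1) - (n,) broadcasts to an (n, n) matrix of all pairwise differences.
diff = pred - y
mse_broadcast = np.mean(diff ** 2)

# Reshaping the target to (n, 1) gives the intended element-wise error.
mse_matched = np.mean((pred - y.reshape(-1, 1)) ** 2)

print(diff.shape, mse_broadcast, mse_matched)
```

Under broadcasting even a perfect prediction gets a large error, and the loss is minimized by predicting the mean of y for every row, which makes the training error plateau at the variance of the target regardless of model capacity. Whether that is what happens in the question depends on the actual tensor shapes, so printing them is a cheap first check.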

By which technique adapted to time-series can I replace cross-validation in my Keras MLP regression model in Python

I'm currently working with a time-series dataset of 46 rows of meteorological measurements taken approximately every 3 hours over one week. My explanatory variables (X) consist of 26 variables, some of which have different units of measurement (degrees, millimeters, g/m3, etc.). My variable to explain (y) is a single variable, temperature.
My goal is to predict the temperature (y) over a 12h-24h window using the set of variables (X).
For that I used Keras (TensorFlow) and Python, with an MLP regressor model:
X = df_forcast_cap.loc[:, ~df_forcast_cap.columns.str.startswith('l')]
X = X.drop(['temperature_Y'],axis=1)
y = df_forcast_cap['temperature_Y']
y = pd.DataFrame(data=y)
# normalize the dataset X
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit_transform(X)
normalized = scaler.transform(X)
# normalize the dataset y
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit_transform(y)
normalized = scaler.transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# define base model
def norm_model():
    # create model
    model = Sequential()
    model.add(Dense(26, input_dim=26, kernel_initializer='normal', activation='relu'))# 30 is then number of neurons
    #model.add(Dense(6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# evaluate model with standardized dataset
estimator = KerasRegressor(build_fn=norm_model, epochs=(100), batch_size=5, verbose=1)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(estimator, X, y, cv=kfold)
print(results)
[-0.00454741 -0.00323181 -0.00345096 -0.00847261 -0.00390925 -0.00334816
-0.00239754 -0.00681044 -0.02098541 -0.00140129]
# invert predictions
X_train = scaler.inverse_transform(X_train)
y_train = scaler.inverse_transform(y_train)
X_test = scaler.inverse_transform(X_test)
y_test = scaler.inverse_transform(y_test)
results = scaler.inverse_transform(results)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
Results: -0.01 (0.01) MSE
(1) I read that cross-validation is not suited to time-series prediction. So I'm wondering which other techniques exist and which one is best suited to time series.
(2) Secondly, I decided to normalize my data because my X dataset is composed of different units (degrees, millimeters, g/m3, etc.) and my variable to explain, y, is in degrees. I know this makes the MSE harder to interpret, because it is no longer in the same unit as my y variable. But for the next step of my study I need to save the predicted y values (produced by the MLP model), and I need these values to be in degrees. So I tried to invert the normalization, but without success: when I print my results, the predicted values are still in normalized form (see my code above). Does anyone see my mistake(s)?

The model that you present above looks at a single instance of 26 measurements to make a prediction. From your description it seems that you would like to make predictions from a sequence of these measurements. I'm not sure I fully understood the description, but I'll assume that you have a sequence of 46 measurements, each with 26 values that you believe should be good predictors of the temperature. If that is the case, the input shape of your model should be (46, 26). The 46 here is called time_steps, and 26 is the number of features.
For a time series you need to select a model design. There are two approaches: a recurrent network or a convolutional network (or a mixture of the two). A convolutional network is typically used to detect patterns that may be located anywhere in the input data. For instance, if you want to detect a given shape in an image, convolutional networks are a good starting point. Recurrent networks update their internal state after each time step. They can detect patterns as well as a convolutional network can, but you can think of them as being less position-independent.
Simple example of a convolutional approach.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential, Model
average_tmp = 0.0
model = Sequential([
    InputLayer(input_shape=(46,26,)),
    Conv1D(16, 4),
    Conv1D(32, 4),
    Conv1D(64, 2),
    Conv1D(128, 4),
    MaxPooling1D(),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(1, bias_initializer=keras.initializers.Constant(average_tmp)),
])
model.compile('adam', 'mse')
model.summary()
A mixed approach would replace the Flatten layer above with an LSTM layer. That would probably be a reasonable starting point for experimenting.
(1) I read that cross-validation is not suited to time-series prediction. So I'm wondering which other techniques exist and which one is best suited to time series.
Cross-validation is a technique that is very well suited to this problem. If you try the example model above, I can almost guarantee that it will overfit your dataset very significantly. Cross-validation can help you determine the right regularisation parameters for your model in order to avoid overfitting.
Examples of regularisation techniques that you probably want to consider:
Saving the model weights at the epoch with the lowest validation loss.
Dropout and/or BatchNormalization.
Kernel regularisation.
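If the temporal order of the rows matters, one commonly used variant of cross-validation is an expanding-window (forward-chaining) split, in which every validation block lies strictly after the data used for training. The function below is a hand-written sketch of that idea, not code from the answer; the fold sizes are one arbitrary choice, and scikit-learn ships a ready-made splitter of this kind as sklearn.model_selection.TimeSeriesSplit.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training window always starts at index 0 and grows with each split;
    the test window is the block that immediately follows it.
    """
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last test block absorbs any leftover rows.
        test_end = fold * (k + 1) if k < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))

# 46 rows, as in the question; 4 splits is an arbitrary choice.
splits = list(expanding_window_splits(46, 4))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Each split can then be used exactly like a k-fold split: fit on the training indices, score on the test indices, and average the scores. The difference is that the model is never evaluated on rows that precede its training data.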
(2) Secondly, I decided to normalize my data because my X dataset is composed of different units (degrees, millimeters, g/m3, etc.) and my variable to explain, y, is in degrees.
Good call. It avoids spending training cycles on discovering a bias at very high values starting from the random initialisation.
I know this makes the MSE harder to interpret, because it is no longer in the same unit as my y variable.
This is orthogonal. The inputs are not assumed to be in the same unit as y. In a DNN we assume that we can build the output from a combination of linear transformations (the weights) plus non-linear activations. That carries no implicit assumption about units.
But for the next step of my study I need to save the predicted y values (produced by the MLP model), and I need these values to be in degrees. So I tried to invert the normalization, but without success: when I print my results, the predicted values are still in normalized form (see my code above). Does anyone see my mistake(s)?
Calling inverse_transform on the model's predictions, using the scaler that was fitted on y, should do the trick. Note that the results returned by cross_val_score are scores, not predictions, so inverse-transforming them does not give temperatures.
It doesn't make sense to inverse transform the inputs X_ and Y_. And it would probably help you keep your code straight to not use the same variable name for both the X and Y scalers.
It is also possible to refrain from scaling Y. If you choose to do so, I'd suggest that you initialise the output layer bias with the mean of the Ys.
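The point about keeping two separately named scalers can be sketched without any library. The class below is a tiny stand-in for a min-max scaler on a single column, written only for illustration. The humidity and temperature values and the scaled predictions are made up.

```python
class MinMaxScaler1D:
    """Tiny single-column min-max scaler, for illustration only."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.lo) / (self.hi - self.lo) for v in values]

    def inverse_transform(self, values):
        return [v * (self.hi - self.lo) + self.lo for v in values]

# Hypothetical data: one input feature and the temperature target.
humidity = [40.0, 55.0, 70.0, 85.0]
temperature = [9.0, 12.0, 15.5, 20.0]

# One scaler per variable, each with its own name.
x_scaler = MinMaxScaler1D().fit(humidity)
y_scaler = MinMaxScaler1D().fit(temperature)

x_scaled = x_scaler.transform(humidity)
y_scaled = y_scaler.transform(temperature)

# Pretend these came out of a model trained on the scaled data.
predicted_scaled = [0.0, 0.5, 1.0]

# Only the predictions go through the y scaler's inverse transform.
predicted_degrees = y_scaler.inverse_transform(predicted_scaled)
print(predicted_degrees)  # [9.0, 14.5, 20.0]
```

Reusing a single variable for both scalers means the second fit silently replaces the first, so an inverse transform would use the wrong minimum and maximum for one of the two variables.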

LSTM parity generator

I am trying to learn deep learning, and I stumbled on an exercise here.
It is the first warm-up exercise, and I am stuck. For constant sequences of small length (2, 3) it solves the task with no problem. However, when I try the whole sequence of 50, it stops at 50% accuracy, which is basically random guessing.
According to here, the loss surface is too big and flat, and it can't find a gradient to follow. So I tried the approach of gradually increasing the length and saving the model each time (2, 5, 10, 15, 20, 30, 40, 50). It does not seem to generalise well: if I feed it a longer sequence than the ones it was trained on, it fails.
According to here, it should be an easy problem. I can't figure it out. A different LSTM architecture is used there, however.
And one solution here to exactly the same problem says it works with the Adagrad optimizer and a learning rate of 0.5.
I am unsure whether I am feeding it one bit at a time correctly in the first place. I hope I got it right.
And for variable lengths, I tried and failed miserably.
Code:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM
from keras.optimizers import SGD, Adagrad, Adadelta
from keras.callbacks import TensorBoard
from keras.models import load_model
import numpy as np
import time
import os.path
# building the model
def build_model():
    model = Sequential()
    model.add(LSTM(
        32,
        input_shape=(None, 1),
        return_sequences=False))
    model.add(Dropout(0.1))
    model.add(Dense(
        1))
    model.add(Activation("sigmoid"))
    return model
# generating random data
def generate_data(num_points, seq_length):
    #seq_rand = np.random.randint(1,12)
    x = np.random.randint(2, size=(num_points, seq_length, 1))
    y = x.sum(axis=1) % 2
    return x, y
X, y = generate_data(100000, 50)
X_test, y_test = generate_data(1000, 50)
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0,
                          write_graph=True, write_images=False)
if os.path.isfile('model.h5'):
    model = load_model('model.h5')
else:
    model = build_model()
opti = Adagrad(lr=0.5)
model.compile(loss="mse", optimizer=opti, metrics=['binary_accuracy'])
model.fit(
    X, y,
    batch_size=10, callbacks=[tensorboard],
    epochs=5)
score = model.evaluate(X_test, y_test,
                       batch_size=1,
                       verbose=1)
print('Test score:', score)
print('Model saved')
model.save('model.h5')
I am so confused now. Thanks for any response!
Edit: Fixed return_sequences to False typo from previous experiments.

Well, this might be a really valuable exercise about LSTMs and vanishing gradients. So let's dive into it. I'd start by changing the task a little bit. Let's change our dataset to:
def generate_data(num_points, seq_length):
    #seq_rand = np.random.randint(1,12)
    x = np.random.randint(2, size=(num_points, seq_length, 1))
    y = x.cumsum(axis=1) % 2
    return x, y
and the model by setting return_sequences=True, changing the loss to binary_crossentropy and setting epochs=10. If we solve this task perfectly, then we have also solved the initial task. In 10 out of 10 runs of the setup I provided, I observed the following behavior: for the first few epochs the model plateaued at around 50% accuracy, and then it suddenly jumped to 99% accuracy.
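The claim that solving the per-step task also solves the original one follows from the fact that the last running parity equals the parity of the whole sequence. Here is a small pure-Python check of that relationship. It uses a single random sequence instead of the batched NumPy arrays above, and the function name is made up for the illustration.

```python
import itertools
import random

def generate_sequence(seq_length, rng):
    """Return one bit sequence, its running parity, and its overall parity."""
    x = [rng.randint(0, 1) for _ in range(seq_length)]
    # Running parity: parity of the number of ones seen up to each step,
    # the single-sequence analogue of x.cumsum(axis=1) % 2.
    y_per_step = [s % 2 for s in itertools.accumulate(x)]
    # Original target: parity of the whole sequence.
    y_final = sum(x) % 2
    return x, y_per_step, y_final

rng = random.Random(0)
x, y_per_step, y_final = generate_sequence(50, rng)

# Each step only needs the previous target and the current bit (an XOR).
xor_steps = all(y_per_step[t] == (y_per_step[t - 1] ^ x[t])
                for t in range(1, len(x)))
print(y_per_step[-1] == y_final, xor_steps)  # True True
```

The per-step targets give the network a training signal at every position, and each target depends only on the previous target and the current bit. With the original target, a single bit at the end has to explain all 50 inputs at once.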
Why did this happen?
In an LSTM, a sweet spot for the parameters is a synchrony between the memory cell dynamics and the normal activation dynamics. Very often one has to wait a long time to reach such behavior. Moreover, the architecture needs to be sufficient to capture the relevant dependencies. In the changed task we give the network much more information, thanks to which it can be trained faster. Still, it takes some time to find the sweet spot.
Why did your network fail?
The vanishing gradient problem and problem complexity: it is not at all obvious what information the network should extract if it gets only a single signal at the end of the sequence of computations. This is why it needs either supervision in the form I provided (cumsum) or a lot of time and luck to finally find the sweet spot on its own.
