Recovering multiple parameters using deep learning - Python

As a simplified version of my actual research problem, let's say I have a second-order polynomial function y = ax^2 + bx + c and I want to use a deep neural network to predict the parameters a, b and c given the variable x and the value of the function y. The variable x and the parameters a, b, c are drawn from a uniform distribution on the range [0, 1].
When I train the network, whatever architecture, cost function and hyperparameter combination I try among the most commonly used ones, I always get the same issue: the train and test losses rapidly converge to a value significantly higher than 0, then start to fluctuate in a strange way, and the predictions are not accurate (see the figures as a general example; the predictions for b are similar, and c is slightly better but still not satisfactory). This happens even if I use a higher momentum or a lower learning rate. I also get the same issue if I try to recover one parameter at a time.
As an example, here is the PyTorch code I used for my first test (4 layers, the first 3 followed by ReLU, MSELoss, RMSprop optimizer with learning rate 1e-4 and momentum 0.9).
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

class PRNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(PRNet, self).__init__()
        self.input_size = input_size
        self.fc1 = nn.Linear(self.input_size, 32)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(32, 64)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(64, 64)
        self.relu3 = nn.ReLU()
        self.fc4 = nn.Linear(64, output_size)

    def forward(self, x):
        output = self.fc1(x)
        output = self.relu1(output)
        output = self.fc2(output)
        output = self.relu2(output)
        output = self.fc3(output)
        output = self.relu3(output)
        output = self.fc4(output)
        return output

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# generate x and the parameters a, b, c uniformly in [0, 1]
var_x = np.random.rand(100000)
pars_abc = np.random.rand(3, 100000)
func_y = pars_abc[0] * var_x**2 + pars_abc[1] * var_x + pars_abc[2]

data = np.vstack((var_x, func_y)).T
parameters = pars_abc.T

X = torch.Tensor(data).to(device).float()
y = torch.Tensor(parameters).to(device).float()

train_size = int(0.8 * len(data))
batch_size = 100
train_dataset = TensorDataset(X[:train_size], y[:train_size])
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)

prnet = PRNet(X.shape[1], 3).to(device)
loss_function = nn.MSELoss()
optimizer = torch.optim.RMSprop(prnet.parameters(), lr=1e-4, momentum=0.9)

num_epochs = 25
num_batches = int(np.ceil(train_size / batch_size))
train_loss_plot = np.zeros((num_epochs, num_batches))  # arrays for logging the losses
test_loss_plot = np.zeros((num_epochs, num_batches))

for epoch in range(0, num_epochs):
    print(f'Starting epoch {epoch+1}')
    current_loss = 0.0
    for i, batch in enumerate(train_dataloader, 0):
        inputs, targets = batch
        optimizer.zero_grad()
        outputs = prnet(inputs)
        # evaluate the test split (done every batch here)
        test_outputs = prnet(X[train_size:].to(device))
        train_loss = loss_function(outputs, targets)
        test_loss = loss_function(test_outputs, y[train_size:])
        train_loss_plot[epoch, i] = train_loss.item()
        test_loss_plot[epoch, i] = test_loss.item()
        train_loss.backward()
        optimizer.step()
What could be the cause of this issue? Are the features not representative enough? Do I need a custom loss more suitable for this problem?

During training, when a model's loss starts fluctuating, the most probable cause of such a pattern is that the learning rate is too high for the weights to settle at the values they need to reach.
Consider this example. Suppose a parameter (weight) in your model is initialized to 0.1, needs to reach a value of 0.00423, and the learning rate is set to 0.001.
Now, let's assume that the parameter has reached a value of 0.004 after a few epochs of training. Gradient descent will try to increase it towards the target value, but because the step size is on the order of the learning rate (three decimal digits), the parameter jumps to about 0.005. Since the value has now overshot the target, gradient descent will try to decrease it, which moves the parameter back to about 0.004, and so a fluctuation pattern begins.
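A minimal numeric sketch of this kind of fluctuation (my own toy example, not from the thread): plain gradient descent on a steep 1-D quadratic loss, with the curvature chosen so that a learning rate of 0.001 is too large for the weight to settle.
# Toy illustration: loss(w) = 0.5 * curvature * (w - target)**2
# With lr * curvature = 2, each step flips the error around the target,
# so the weight keeps bouncing instead of converging.
target = 0.00423
curvature = 2000.0
w, lr = 0.1, 0.001
for step in range(10):
    grad = curvature * (w - target)
    w -= lr * grad
    print(f"step {step}: w = {w:.5f}")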
To solve this issue, simply using a smaller learning rate will not help: the model will then learn too slowly and might not converge at all. What you are probably looking for is a variable learning rate policy for your training. With such a policy, you can begin with a large learning rate so that the model learns quickly, and later, when the model parameters get close to their target values, the learning rate decreases automatically so the parameters can get as close to the targets as possible. These policies are called learning rate schedulers.
There are several classes in PyTorch that let you use a learning rate scheduler of your choice; you can find them in the torch.optim.lr_scheduler documentation.
I suggest the ReduceLROnPlateau scheduler. It lets you set a patience (a number of epochs) and a factor: whenever your loss has not improved for that many epochs (beyond a small threshold), the learning rate is multiplied by the factor.
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html#torch.optim.lr_scheduler.ReduceLROnPlateau
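For example, a minimal sketch of how it could be plugged into the training loop from the question (the factor and patience values below are placeholders to experiment with, not recommendations):
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=2)

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for inputs, targets in train_dataloader:
        optimizer.zero_grad()
        outputs = prnet(inputs)
        train_loss = loss_function(outputs, targets)
        train_loss.backward()
        optimizer.step()
        epoch_loss += train_loss.item()
    # step the scheduler once per epoch on the monitored quantity
    scheduler.step(epoch_loss / len(train_dataloader))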

Related

BatchNorm makes accuracy at prediction time around 10% of what's reported during training in tensorflow 2.6

I know this has been previously discussed, but I did not find a concrete answer, and some answers did not work after trying them. The case is simple: I have a model, and if I use batch norm, the training accuracy reported by model.fit(training_data) is above 0.9 (it consistently increases, and the loss decreases), but after training, model.evaluate(training_data) (note: the same data) returns 0.09, and the predictions are really bad (the accuracy is also low if calculated manually from the results of model.predict(training_data)). I know the difference between training and testing time in batch norm, and I know differences should be expected, but a drop from 0.9 to 0.09 seems just wrong (and the model is completely unusable). I tried some solutions from other threads:
use batch_size in .evaluate to be the same as .fit: did not make a difference
set tf.keras.backend.set_learning_phase(0): got a message saying it is now deprecated and made no difference.
set all batch norm layers to have layer.trainable=False before .predict and .evaluate: it did not make a difference.
If I remove the batch norm layers, the report from model.fit(training_data) coincides with model.evaluate(training_data), but the training does not make any progress (results are consistent but bad), so I need to keep them.
Is this a major bug in TF 2.6?
Update: also tested TF 2.5, result is the same.
Sample code(omitting irrelevant code, like data reading and pre-processing):
### model definition
class CLS_BERT_Embedding(tf.keras.Model):
    """Will only use the CLS token"""
    def __init__(self, bert_trainable=False, number_filters=50, FNN_units=512,
                 number_clases=2, dropout_rate=0.1, name="dcnn"):
        super(CLS_BERT_Embedding, self).__init__(name)
        self.checkpoint_id = "CLS_BERT_Embedding_bn_3fc_{}filters_{}fc_units_berttrainable{}".format(
            number_filters, FNN_units, bert_trainable)
        # trainable=False so we don't fine-tune bert, just use it as an embedding layer
        self.bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                                         trainable=bert_trainable,
                                         input_shape=(3, 376))
        self.dense_1 = layers.Dense(units=FNN_units, activation="relu")
        self.bn1 = layers.BatchNormalization()
        self.dense_2 = layers.Dense(units=FNN_units, activation="relu")
        self.bn2 = layers.BatchNormalization()
        self.dense_3 = layers.Dense(units=FNN_units, activation="relu")
        self.bn3 = layers.BatchNormalization()
        self.dropout = layers.Dropout(rate=dropout_rate)
        if number_clases == 2:
            self.last_dense = layers.Dense(units=1, activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=number_clases, activation="softmax")

    def get_bert_embeddings(self, all_tokens):
        CLS_embedding, embeddings = self.bert_layer([all_tokens[:, 0, :],
                                                     all_tokens[:, 1, :],
                                                     all_tokens[:, 2, :]])
        return CLS_embedding, embeddings

    def call(self, inputs, training):
        CLS_embedding, x_seq = self.get_bert_embeddings(inputs)
        x = self.dense_1(CLS_embedding)
        x = self.bn1(x, training)
        x = self.dense_2(x)
        x = self.bn2(x, training)
        x = self.dense_3(x)
        x = self.bn3(x, training)
        output = self.last_dense(x)
        return output

#### config and hyper-params
NUMBER_FILTERS = 1024
FNN_UNITS = 2048
BERT_TRAINABLE = False
NUMBER_CLASSES = len(tokenizer.vocab)
DROPOUT_RATE = 0.2
NUMBER_EPOCHS = 3
LR = 0.001
DEVICE = '/GPU:0'

#### optimization definition
with tf.device(DEVICE):
    model = CLS_BERT_Embedding(
        bert_trainable=BERT_TRAINABLE,
        number_filters=NUMBER_FILTERS,
        FNN_units=FNN_UNITS,
        number_clases=NUMBER_CLASSES,
        dropout_rate=DROPOUT_RATE)

    if NUMBER_CLASSES == 2:
        loss = "binary_crossentropy"
        metrics = ["accuracy"]
    else:
        loss = "sparse_categorical_crossentropy"
        metrics = ["sparse_categorical_accuracy"]

    optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
    loss = "sparse_categorical_crossentropy"
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)

### training
with tf.device(DEVICE):
    model.fit(train_dataset,
              batch_size=BATCH_SIZE,
              epochs=NUMBER_EPOCHS,
              shuffle=True,
              callbacks=[MyCustomCallback(),
                         tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", patience=5),
                         tensorboard, lr_tensorboard])

### testing
train_results = model.evaluate(train_dataset, batch_size=BATCH_SIZE)
print(train_results)
Try running inference without adjusting the trainable flag at all, and then verify that self.bn1.trainable=True. Then run forward prop by calling the model as a callable on each batch of your training data, but with training=True, and evaluate each time. That would be something like
for idx, d in enumerate(train_dataset):
    _ = model(d[0], training=True)
    model.evaluate(d)
    if idx > 100:
        break
If your loss starts dropping, then this is an example of your batch norm moving statistics not updating fast enough, which is possible given that the BNs are not trained but your bert model/layer is. If not, ignore the rest of this, because you may have a different issue.
If that's the case, you have two options. One is to keep calling the model to allow the BN moving statistics to stabilize.
The other is to statistically analyze the output of your bert layer (get the mean and variance) and directly update the BN's moving statistic weights. The first BN is probably sufficient, given that your later Dense layers are Xavier Glorot initialized, but since you are using ReLU you might also try Kaiming He initialization on them.
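A rough sketch of that second option, under my own assumptions about the setup (train_dataset yields (inputs, labels) batches as in the question, and the statistics are taken from the activations that actually feed bn1, i.e. dense_1 applied to the BERT CLS embedding):
import numpy as np

# Collect the pre-BN activations over a small sample of training batches
feats = []
for idx, (x_batch, y_batch) in enumerate(train_dataset):
    cls_emb, _ = model.get_bert_embeddings(x_batch)
    feats.append(model.dense_1(cls_emb).numpy())
    if idx >= 20:  # a few batches are usually enough for a rough estimate
        break
feats = np.concatenate(feats, axis=0)

# Overwrite bn1's moving statistics with the measured mean and variance
model.bn1.moving_mean.assign(feats.mean(axis=0))
model.bn1.moving_variance.assign(feats.var(axis=0))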
I agree with @Yaoshiang: it is likely that the internal statistics (moving average, moving variance) do not coincide with the per-batch mean and variance, hence a different normalization in the BN layer at training time and at test time. Thinking about it, if we use the same batch size for training and testing, then we can keep training=True at test time without it being a real problem (when using predict or evaluate). Otherwise, we can force training to use the moving average and moving variance for the normalization, rather than the per-batch mean and variance, while still estimating beta and gamma. (This implies a minor modification of the BatchNormalization class.)

What is the correct way to maximize one loss and minimize another during NN training?

I have a simple NN:
import torch
import torch.nn as nn
import torch.optim as optim

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(1, 5)
        self.fc2 = nn.Linear(5, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Model()
opt = optim.Adam(net.parameters())
I also have some input features:
features = torch.rand((3,1))
I can train it normally with a simple loss function that will be minimized:
for i in range(10):
    opt.zero_grad()
    out = net(features)
    loss = torch.mean(torch.square(torch.tensor(5) - torch.sum(out)))
    print('loss:', loss)
    loss.backward()
    opt.step()
However, if I add another loss component, loss2, that I want to maximize:
loss2s = []
for i in range(10000):
    opt.zero_grad()
    out = net(features)
    loss1 = torch.mean(torch.square(torch.tensor(5) - torch.sum(out)))
    loss2 = torch.sum(torch.tensor([torch.sum(w_arr) for w_arr in net.parameters()]))
    loss2s.append(loss2)
    loss = loss1 + loss2
    loss.backward()
    opt.step()
It becomes seemingly unstable, as the two losses have different scales. Also, I'm not sure this is the correct way, because how would the loss know to maximize one part and minimize the other? Note that this is just an example; obviously there's no point in increasing the weights.
import matplotlib.pyplot as plt
plt.plot(loss2s, c='r')
plt.plot(loss1s, c='b')
Also, I believe that minimizing functions is the common way to train in ML, so I wasn't sure whether turning the maximization problem into a minimization problem in some way would be better.
The standard way to switch between "minimization" and "maximization" is to change the sign. PyTorch always minimizes a loss when you call
loss.backward()
So, if another loss, loss2, needs to be maximized, we add the negative of it:
overall_loss = loss + (- loss2)
overall_loss.backward()
since minimizing a negative quantity is equivalent to maximizing the original positive quantity.
With regard to "scale", yes scales do matter. Often the following is done in order to match scales
overall_loss = loss + alpha * (- loss2)
where alpha is a fraction denoting the relative importance of one loss w.r.t. the other. It's a hyperparameter and needs to be experimented with.
Keeping technicalities aside, whether the resulting loss will be stable depends a lot on the specific problem and the loss functions involved. If the losses are contradictory, you may experience instability. Ways of dealing with that are a research problem in themselves and far beyond the scope of this question.
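As a minimal sketch of this on the setup from the question (alpha is an illustrative value; the weight sum is built with torch.stack so gradients can flow through it, which the torch.tensor([...]) construction in the question does not allow):
alpha = 0.01  # relative weight of the maximized term; a hyperparameter to tune

for i in range(1000):
    opt.zero_grad()
    out = net(features)
    loss1 = torch.mean(torch.square(torch.tensor(5.0) - torch.sum(out)))
    loss2 = torch.sum(torch.stack([torch.sum(w) for w in net.parameters()]))
    overall_loss = loss1 + alpha * (-loss2)  # minimize loss1, maximize loss2
    overall_loss.backward()
    opt.step()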

Is there a way to run a function before the optimizer updates the weights?

I'm going through PyTorch tutorial and just learned about optimizer.step and how it makes an update to the network's parameters (here).
Is there a way to create a function so that, whenever there is a gradient update to a learnable parameter (e.g. a weight), it takes the weight value and the loss and multiplies that value by some percentage, say 90%?
So if the update should be:
w1 -= lr * loss_value = 1e-5 * 50
I want it to go through the function before the update and make it 1e-5 * 50 * 90%
def func(loss_value, percentage):
    return loss_value * percentage  # new update should be w1 -= loss_value * percentage
Example model:
import torch
import torch.nn as nn
import torch.optim as optim

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(1, 5)
        self.fc2 = nn.Linear(5, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Model()
opt = optim.Adam(net.parameters())
features = torch.rand((3, 1))

opt.zero_grad()
out = net(features)
loss = torch.tensor(5) - torch.sum(out)
loss.backward()
# need to have the function change the value of the loss update before the optimizer?
opt.step()
I got this bit of code from https://discuss.pytorch.org/t/how-to-modify-the-gradient-manually/7483/2 and edited it slightly:
loss.backward()
for p in model.parameters():
    weights = p.data
    scales = def_scales(weights)
    p.grad *= scales  # or whatever other operation
optimizer.step()
This goes through every parameter in the model (between loss.backward() and BEFORE optimizer.step()) and adjusts its stored gradient before the optimizer applies the update.
An example def_scales would look something like this (SUPER ugly), where vals are the compared parameter values and scales are the desired loss scaling values:
def def_scales(weights, scales=[0.1, 0.5, 1, 1], vals=[0, 5, 10, float('inf')]):
    out = torch.zeros_like(weights)
    for V, v in enumerate(vals[::-1]):  # backwards because we're doing less-than
        out[weights <= v] = scales[len(scales)-V-1]  # might want to compare to abs
    return out
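As an aside (my own addition, not part of the linked thread): the same kind of gradient scaling can be registered once as a hook on each parameter, so it runs automatically on every backward pass; the 0.9 factor below mirrors the 90% example in the question.
for p in net.parameters():
    p.register_hook(lambda grad: grad * 0.9)  # scale every gradient by 90%

opt.zero_grad()
out = net(features)
loss = torch.tensor(5.0) - torch.sum(out)
loss.backward()  # the hooks rescale the gradients here
opt.step()       # the optimizer then applies the scaled gradients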
This is a completely unnecessary step, since the whole purpose of the learning rate is already to take a small fraction of the gradient when making the update to the weights, so I really don't understand what you are trying to do here. Also, the weight update does not follow the equation you wrote; you need to understand the underlying calculus and the gradient descent algorithm: we take the partial derivative of the cost/loss value w.r.t. the weights, multiply it by a small number (the learning rate), and subtract it from the current weights.
If you are doing classification and want to give more importance to one class than the other, you can use the Focal Loss function.
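For reference, a minimal sketch of a binary focal loss (alpha=0.25 and gamma=2 are the usual defaults from the focal loss paper; this is my own illustration, not code from the thread):
import torch

def binary_focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    # probs: sigmoid outputs in (0, 1); targets: 0/1 labels of the same shape
    p_t = torch.where(targets == 1, probs, 1 - probs)
    alpha_t = torch.where(targets == 1,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1 - alpha))
    return torch.mean(-alpha_t * (1 - p_t) ** gamma * torch.log(p_t + eps))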

Cannot improve model accuracy

I am building a general-purpose NN that would classify images (Dog/No Dog) and movie reviews (Good/Bad). I have to stick to a very specific architecture and loss function, so changing these two seems out of the question. My architecture is a two-layer network with ReLU followed by a sigmoid and a cross-entropy loss function. With 1000 epochs and a learning rate of around 0.001, I am getting 100 percent training accuracy and 0.72 testing accuracy. I am looking for suggestions to improve my testing accuracy. This is the layout of what I have:
def train_net(epochs, batch_size, train_x, train_y, model_size, lr):
    n_x, n_h, n_y = model_size
    model = Net(n_x, n_h, n_y)
    optim = torch.optim.Adam(model.parameters(), lr=0.005)
    loss_function = nn.BCELoss()
    train_losses = []
    accuracy = []
    for epoch in range(epochs):
        count = 0
        model.train()
        train_loss = []
        batch_accuracy = []
        for idx in range(0, train_x.shape[0], batch_size):
            batch_x = torch.from_numpy(train_x[idx : idx + batch_size]).float()
            batch_y = torch.from_numpy(train_y[:, idx : idx + batch_size]).float()
            model_output = model(batch_x)
            batch_accuracy = []
            loss = loss_function(model_output, batch_y)
            train_loss.append(loss.item())
            preds = model_output > 0.5
            nb_correct = (preds == batch_y).sum()
            count += nb_correct.item()
            optim.zero_grad()
            loss.backward()
            # Scheduler made it worse
            # scheduler.step(loss.item())
            optim.step()
        if epoch % 100 == 1:
            train_losses.append(train_loss)
            print("Iteration : {}, Training loss: {} ,Accuracy %: {}".format(epoch, np.mean(train_loss), (count/train_x.shape[0])*100))
    plt.plot(np.squeeze(train_losses))
    plt.ylabel('loss')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate =" + str(lr))
    plt.show()
    return model
My model parameters:
batch_size = 32
lr = 0.0001
epochs = 1500
n_x = 12288 # num_px * num_px * 3
n_h = 7
n_y = 1
model_size=n_x,n_h,n_y
model = train_net(epochs, batch_size, train_x, train_y, model_size, lr)
and this is the testing phase.
model.eval()  # Setting the model to eval mode, hence making it deterministic.
test_loss = []
count = 0
loss_function = nn.BCELoss()
for idx in range(0, test_x.shape[0], batch_size):
    with torch.no_grad():
        batch_x = torch.from_numpy(test_x[idx : idx + batch_size]).float()
        batch_y = torch.from_numpy(test_y[:, idx : idx + batch_size]).float()
        model_output = model(batch_x)
        preds = model_output > 0.5
        loss = loss_function(model_output, batch_y)
        test_loss.append(loss.item())
        nb_correct = (preds == batch_y).sum()
        count += nb_correct.item()
print("test loss: {}, test accuracy: {}".format(np.mean(test_loss), count/test_x.shape[0]))
Things I have tried:
Messing around with the learning rate, adding momentum, using schedulers, and changing batch sizes. Of course, these were mainly guesses and not based on any valid assumptions.
The issue you're facing is overfitting. With 100% accuracy on the training set, your model is effectively memorizing the training data and then failing to generalize to unseen samples. The good news is that this is a very common challenge!
You need regularization. One method is dropout, whereby on different training epochs a random set of the NN connections are dropped, forcing the network to "learn" alternate pathways and weights, and softening sharp peaks in parameter space. Since you need to keep your architecture and loss function the same, you won't be able to add such an option in (though for completeness, read this article for a description and implementation of dropout in PyTorch).
Given your constraints, you'll want to use something like L2 or L1 weight regularization. This typically shows up in the way of adding an additional term to the cost/loss function, which penalizes large weights. In PyTorch, L2 regularization is implemented via the torch.optim construct, with the option weight_decay. (See documentation: torch.optim, search for 'L2')
For your code, try something like:
def train_net(epochs, batch_size, train_x, train_y, model_size, lr):
    ...
    optim = torch.optim.Adam(model.parameters(), ..., weight_decay=0.01)
    ...
Based on your statement that your training accuracy is 100%, while your testing accuracy is significantly lower at 72%, it seems that you are significantly overfitting your dataset.
In short, this means that your model is training itself too specifically to the training data that you've given it, picking up on quirks that may exist in the training data but which are not inherent to the classification. For example, if the dogs in your training data were all white, the model would eventually learn to associate the color white with dogs, and be hard-pressed to recognize dogs of other colors given to it in the test data set.
There are many avenues to address this issue: a well sourced overview of the subject written in simple terms can be found here.
Without more information on the specific constraints you have around changing the architecture of the neural network, it's tough to say for sure what you will and will not be able to change. However, weight regularization and dropout are often used to great effect (and are described in the above article.) You should also be free to implement early stopping and a weight constraint to the model.
I'll leave it to you to find resources on how to implement these specific strategies in PyTorch, but this should provide a good jumping-off point.
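For instance, a minimal early-stopping sketch along those lines (assuming a held-out validation split val_x / val_y, which is not shown in the question):
best_val_loss = float('inf')
patience, bad_epochs = 20, 0

for epoch in range(epochs):
    # ... run one training epoch exactly as in train_net above ...
    model.eval()
    with torch.no_grad():
        val_output = model(torch.from_numpy(val_x).float())
        val_loss = loss_function(val_output, torch.from_numpy(val_y).float()).item()
    model.train()
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # stop once validation loss has not improved for `patience` epochs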

Neural network versus random forest performance discrepancy

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.
The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems, such as number of variables, maximum clause length, etc. The data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward: 1000 rows, with the final column being the rating. I started off with some tens of input columns, all the numbers scaled to the range 0-1. I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in the Git history.
I started off using separate training and test sets, but have set aside the test set for the moment, because the question about whether training performance generalizes to testing, doesn't arise until training performance has been obtained in the first place.
Simple linear regression on this data set has a mean squared error of about 0.14.
I implemented a simple feedforward neural network, code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below, that after a couple of hundred training epochs also has a mean squared error of 0.14.
So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.
Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py
... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.
Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
# data
df = pd.read_csv("test.csv")
print(df)
print()
# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)
# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)
# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)
# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500
# model
class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.L0 = nn.Linear(in_features, hidden_size)
        self.N0 = nn.ReLU()
        self.L1 = nn.Linear(hidden_size, hidden_size)
        self.N1 = nn.Tanh()
        self.L2 = nn.Linear(hidden_size, hidden_size)
        self.N2 = nn.ReLU()
        self.L3 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = self.L0(x)
        x = self.N0(x)
        x = self.L1(x)
        x = self.N1(x)
        x = self.L2(x)
        x = self.N2(x)
        x = self.L3(x)
        return x

model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# train
print("training")
for epoch in range(1, epochs + 1):
    # forward
    output = model(X_tensor)
    cost = criterion(output, y_tensor)

    # backward
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # print progress
    if epoch % (epochs // 10) == 0:
        print(f"{epoch:6d} {cost.item():10f}")
print()
output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())
Can you please print the shape of your input?
I would say check these things first (a combined sketch of the first three fixes is shown after this list):
that your target y has the shape (-1, 1); I don't know if PyTorch throws an error in this case. You can use y.reshape(-1, 1) if it isn't 2-dim
your learning rate is high. Usually when using Adam the default value is good enough, or simply try lowering your learning rate; 0.1 is a high value to start with
place optimizer.zero_grad() on the first line inside the for loop
normalize/standardize your data (this is usually good for NNs)
remove outliers from your data (my opinion: I don't think this affects random forests much, but it can affect NNs badly)
use cross-validation (maybe skorch can help you here; it's a scikit-learn wrapper for PyTorch and is easy to use if you know Keras)
Note that a random forest regressor, or any other regressor, can outperform neural nets in some cases. There are fields where neural nets are the heroes, like image classification or NLP, but you need to be aware that a simple regression algorithm can outperform them, usually when your data is not big enough.
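As a combined sketch of the first three points applied to the training code from the question (the learning rate value is just an illustrative lower setting):
y_tensor = torch.from_numpy(y_ar).reshape(-1, 1)            # target shaped (N, 1) to match the model output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # lower learning rate (Adam's default)

for epoch in range(1, epochs + 1):
    optimizer.zero_grad()          # zero the gradients at the top of the loop
    output = model(X_tensor)
    cost = criterion(output, y_tensor)
    cost.backward()
    optimizer.step()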
