My activation function right now is the logistic function f(x) = 1/(1+e^-x). But the values of x range from 10,000 to 100,000, so I don't think it's feasible. Is there another way?
I'm creating a Deep Neural Network for linear regression. The net has 3 hidden layers with 256 units per layer, and each unit has ReLU as its activation function. I also used early stopping to make sure it doesn't overfit.
The target is an integer, and in the training set its values range from 0 to 7860.
After training I got the following losses:
train_MSE = 33640.5703, train_MAD = 112.6294,
val_MSE = 53932.8125, val_MAD = 138.7836,
test_MSE = 52595.9414, test_MAD = 137.2564
I've tried many different configurations of the net (different optimizers, loss functions, normalizations, regularizers, ...), but nothing seems to help me reduce the loss any further. Even when the training error decreases, the test error never goes below MAD = 130.
Here's the behavior of my net:
My question is whether there's a way to improve my DNN to make more accurate predictions, or whether this is the best I can achieve with my dataset.
If your problem is linear by nature, meaning the real function behind your data is of the form y = a*x + b + epsilon, where the last term is just random noise, then you won't get any better than fitting the underlying function y = a*x + b. Fitting epsilon would only result in a loss of generalization over new data.
You can try many different things to improve a DNN:
Increase the number of hidden layers
Scale or normalize your data (see the sketch after this list)
Try a rectified linear unit (ReLU) as the activation
Use more data
Tune learning-algorithm parameters such as the learning rate
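As a concrete example of the scaling point above, here is a minimal sketch that standardizes the features and the 0-7860 target before training and maps predictions back afterwards. The arrays and the model are stand-ins for illustration, not from the original post:

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))                       # stand-in features
y_train = rng.integers(0, 7861, size=1000).astype(float)    # target in [0, 7860]

# Standardize features and target to zero mean, unit variance
X_mean, X_std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
y_mean, y_std = y_train.mean(), y_train.std() + 1e-8
X_scaled = (X_train - X_mean) / X_std
y_scaled = (y_train - y_mean) / y_std

# After training on (X_scaled, y_scaled), map predictions back:
# y_pred = model.predict(X_test_scaled) * y_std + y_mean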
I am trying to implement a neural network in PyTorch to solve an ordinary differential equation (ODE). The network architecture is straightforward: just a feed-forward neural network with n inputs and outputs and k layers.
import torch

class PINN(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        # Layers
        self.L1 = torch.nn.Linear(1, n)
        self.L2 = torch.nn.Linear(n, n)
        self.L3 = torch.nn.Linear(n, 1)
        # Activation functions
        self.t = torch.nn.Tanh()
        self.r = torch.nn.ReLU()

    def forward(self, x):
        a1 = self.r(self.L1(x))
        a2 = self.r(self.L2(a1))
        a3 = self.r(self.L3(a2)) + x
        return a3
I want to minimize the loss between the gradient of the output of my neural network and the right-hand side of the ODE. I have chosen to work with a mean-squared error loss. I know that PyTorch includes a built-in MSE loss. However, I defined my own loss function since I have to pass in the gradient of a tensor.
def ODELoss(x, y, x0):
    # Number of collocation points to sample
    n = len(x)
    # Initialize the loss to zero
    loss = torch.tensor(0., requires_grad=True)
    # Loop over the "data". Technically, this is an unsupervised problem.
    # The "data" are points sampled on the domain which are then evaluated
    # according to the ODE.
    for (xx, yy) in zip(x, y):
        xx = torch.tensor([[xx]], requires_grad=True)
        yy = torch.tensor([[yy]])
        g(xx, x0).backward()
        dg = xx.grad.clone().requires_grad_(True)
        loss = loss + (dg - yy)**2
    loss = loss / n
    return loss
Here, g(x) is called the universal predictor. It is used in the literature to account for the initial condition(s).
def g(x, x0):
    return x*model(x) + x0
This doesn't seem to work, and I suspect I am not passing in the gradient of the output correctly. Can anyone give me some guidance on how to do this?
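For reference, one common pattern for this kind of loss is to compute the derivative with torch.autograd.grad instead of calling .backward() inside the loop. A minimal sketch, under the assumption that x and y are plain 1-D float tensors and reusing g and x0 from the question:

def ODELoss_autograd(x, y, x0):
    # x: collocation points, y: ODE right-hand-side values at those points
    x = x.reshape(-1, 1).requires_grad_(True)
    y = y.reshape(-1, 1)
    out = g(x, x0)
    # grad_outputs is needed because out is not a scalar; create_graph=True
    # keeps the graph so the derivative can itself be backpropagated into
    # the network parameters.
    dg = torch.autograd.grad(out, x, grad_outputs=torch.ones_like(out),
                             create_graph=True)[0]
    return torch.mean((dg - y) ** 2)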
I am trying to apply an inverse sigmoid function to the last layer of my Convolutional Neural Network.
I am building the network in PyTorch, and I want to take the output from the last convolutional layer and then apply the inverse sigmoid function to it.
I have read that the logit function is the inverse of the sigmoid function, and I tried implementing it, but it's not working.
I used the logit function from the SciPy library inside my function:
from scipy.special import logit

def InverseSigmoid(self, x):
    x = logit(x)
    return x
Sigmoid is just 1 / (1 + e**-x), so if you want to invert it you can compute -ln((1 / x) - 1). For numerical stability you can also use -ln((1 / (x + 1e-8)) - 1). This is the inverse function of the sigmoid, and the implementation is straightforward.
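For illustration, a minimal PyTorch sketch of that formula (recent PyTorch versions also ship torch.logit, which does the same thing with an eps argument for clamping):

import torch

def inverse_sigmoid(x, eps=1e-8):
    # -ln(1/x - 1) == ln(x / (1 - x)); eps guards against division by zero at x = 0
    return -torch.log(1.0 / (x + eps) - 1.0)

# Roughly equivalent built-in: torch.logit(x, eps=1e-8)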
I have a neural net with two loss functions, one is binary cross entropy for the 2 classes, and another is a regression. Now I want the regression loss to be evaluated only for class_2, and return 0 for class_1, because the regressed feature is meaningless for class_1.
How can I implement such an algorithm in Keras?
Training it separately on only class_1 data doesn't work because I get a NaN loss. Is there a more elegant way to define the loss to be 0 for one half of the dataset and mean squared error for the other half?
This is a question that's important in multi-task learning where you have multiple loss functions, a shared neural network structure in the middle, and inputs that may not all be valid for all loss functions.
You can pass in a binary mask, which is 1 or 0 for each of your loss functions, in the same way that you pass in the labels. Then multiply each loss by its corresponding mask. The derivative of 1*x is just dx, and the derivative of 0*x is 0, so you end up zeroing out the gradient in the appropriate loss functions. Virtually all optimizers are additive, meaning you're summing the gradients, and adding a zero is a null operation. Your final loss function should be the sum of all your other losses.
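As an illustration of the masking idea, here is a minimal Keras/TensorFlow sketch. The output names and the convention of packing the mask into the regression target are assumptions made for the example, not something from the original post:

import tensorflow as tf

def masked_mse(y_true, y_pred):
    # Column 0 holds the regression target, column 1 holds the mask
    # (1.0 for class_2 samples, 0.0 for class_1 samples).
    target = y_true[:, 0:1]
    mask = y_true[:, 1:2]
    squared_error = tf.square(target - y_pred)
    # Masked-out samples contribute zero loss and therefore zero gradient.
    return tf.reduce_mean(squared_error * mask)

# Hypothetical wiring, assuming two named outputs 'cls' and 'reg':
# model.compile(optimizer='adam',
#               loss={'cls': 'binary_crossentropy', 'reg': masked_mse})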
I don't know much about Keras. Another solution is to change your loss function to use the labels only: L = cross_entropy * (label / (label + 1e-6)). That term will be almost 0 or almost 1, which is close enough for government work and neural networks at least. This is what I actually used the first time, before I realized it was as simple as multiplying by an array of mask values.
Another solution to this problem is to use tf.where and tf.gather_nd to select only the subset of labels and outputs that you want to compare, and then pass that subset to the appropriate loss function. I've actually switched to using this method rather than multiplying by a mask, but both work.
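A minimal sketch of the selection approach with tf.where and tf.gather_nd (tensor names are illustrative); note that if a batch contains no valid samples, the mean is taken over an empty tensor, which you may want to guard against:

import tensorflow as tf

def subset_mse(y_true, y_pred, is_class_2):
    # is_class_2: boolean tensor of shape [batch] marking samples where
    # the regression target is meaningful.
    idx = tf.where(is_class_2)               # indices of the valid samples
    true_subset = tf.gather_nd(y_true, idx)  # keep only those labels...
    pred_subset = tf.gather_nd(y_pred, idx)  # ...and the matching predictions
    return tf.reduce_mean(tf.square(true_subset - pred_subset))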
I have written some code to implement backpropagation in a deep neural network with the logistic activation function and softmax output.
def backprop_deep(node_values, targets, weight_matrices):
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    for i in xrange(-2, -len(weight_matrices) - 1, -1):
        delta_nodes = dsigmoid(node_values[i][:, :-1]) * delta_nodes.dot(weight_matrices[i+1])[:, :-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
The code works well, but when I switched to ReLU as the activation function it stopped working. In the backprop routine I only changed the derivative of the activation function:
def backprop_relu(node_values, targets, weight_matrices):
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    for i in xrange(-2, -len(weight_matrices) - 1, -1):
        delta_nodes = (node_values[i] > 0)[:, :-1] * delta_nodes.dot(weight_matrices[i+1])[:, :-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
However, the network no longer learns, and the weights quickly go to zero and stay there. I am totally stumped.
Although I have determined the source of the problem, I'm going to leave this up in case it might be of benefit to someone else.
The problem was that I did not adjust the scale of the initial weights when I changed activation functions. While logistic networks learn very well when node inputs are near zero and the logistic function is approximately linear, ReLU networks learn well for moderately large inputs to nodes. The small weight initialization used in logistic networks is therefore not necessary, and in fact harmful. The behavior I was seeing was the ReLU network ignoring the features and attempting to learn the bias of the training set exclusively.
I am currently using initial weights distributed uniformly from -.5 to .5 on the MNIST dataset, and it is learning very quickly.
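For reference, a minimal NumPy sketch of a fan-in-scaled ("He") initialization, which is the usual way to set the weight scale for ReLU layers; the 784 -> 256 layer size is just an example:

import numpy as np

def he_init(fan_in, fan_out, rng):
    # Gaussian scaled by sqrt(2 / fan_in) so ReLU pre-activations keep
    # roughly unit variance from layer to layer.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = he_init(784, 256, rng)   # e.g. the first layer of an MNIST network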