I'm trying to use the following (custom) loss function to train a keras neural network:
y_pred and y_true are arrays of length 40. Say y_true is 0 everywhere except on the jth component, where it equals 1. Writing y and z for y_true and y_pred respectively:
blank">
{i<40}(|i-j|+1)\cdot(y_i-z_i)^2" title="boostSquare(y,z)=\sum_{i<40}(|i-j|+1)\cdot(y_i-z_i)^2" />
Here's the code I intended to use:
import keras.backend as K
import numpy as np

def boost_square(y_true, y_pred):
    w = K.constant(np.array([[np.abs(i - j) + 1 for i in range(40)]
                             for j in range(40)]), dtype=np.float64)
    return K.sum(K.transpose(w * y_true) * K.square(y_true - y_pred))
Running this works and prints 2.25 as expected:
y_true = np.array([int(i == 2) for i in range(40)])
y_pred = np.array([0.5 * int(i < 2) for i in range(40)])
print(K.eval(boost_square(y_true, y_pred)))
Yet, the following fails to compile with the error message shown below:
from keras.layers import Input, Dense
from keras.models import Model

input_layer = Input(shape=(40,), name='input_layer')
output_layer = Dense(units=40, name='output_layer')(input_layer)
model = Model([input_layer], [output_layer])
model.compile(optimizer='adam', loss=boost_square, metrics=['accuracy'])

TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type float64 of argument 'x'.
Since I'm stubborn, I also tried this, which didn't fix anything and might hinder performance:
def boost_square_bis(y_true, y_pred):
    z_true = K.cast(y_true, np.float64)
    z_pred = K.cast(y_pred, np.float64)
    w = K.constant(np.array([[np.abs(i - j) + 1 for i in range(40)]
                             for j in range(40)]), dtype=np.float64)
    boost = K.transpose(w * z_true)
    boost = K.cast(boost, dtype=np.float64)
    square = K.square(z_true - z_pred)
    square = K.cast(square, np.float64)
    ret = K.sum(boost * square)
    return K.cast(ret, dtype=np.float64)
What am I missing? Where does this error come from?
Solution 1
Credits to AnnaKrogager: the dtype of w wasn't compatible with the model. The model compiles when one defines:
def boost_square(y_true, y_pred):
    w = K.constant(np.array([[np.abs(i - j) + 1 for i in range(40)]
                             for j in range(40)]), dtype=np.float32)
    return K.sum(K.transpose(w * y_true) * K.square(y_true - y_pred))
Iteration 1
Now the model compiles, but won't fit; I get this error message (128 is the batch_size):
ValueError: Dimensions must be equal, but are 40 and 128 for 'mul_2' (op: 'Mul') with input shapes: [40,40], [128,40].
Indeed, my custom loss function behaves oddly with respect to the first (batch) axis; this code raises the very same error:
fake_input = np.random.rand(128, 40)
fake_output = np.random.rand(128, 40)
print(K.eval(boost_square(fake_input, fake_output)))
Iteration 2
As AnnaKrogager pointed out, it is more consistent to use a proper matrix product (K.dot) rather than * followed by a transposition (which messes with the batch axis). So I came up with this new definition of boost_square:
def boost_square(y_true, y_pred):
    w = K.constant(np.array([[np.abs(i - j) + 1 for i in range(40)]
                             for j in range(40)]), dtype=np.float32)
    return K.sum(K.dot(w, y_true) * K.square(y_true - y_pred))
But this triggers the following when I try to fit the model:
AttributeError: 'numpy.ndarray' object has no attribute 'get_shape'
Hence, I tried
def boost_square(y_true, y_pred):
    w = K.constant(np.array([[np.abs(i - j) + 1 for i in range(40)]
                             for j in range(40)]), dtype=np.float32)
    return K.sum(K.dot(K.dot(w, y_true), K.square(y_true - y_pred)))
And got a brand new error message \o/:
Matrix size-incompatible: In[0]: [40,40], In[1]: [32,40]
Definitive Solution
Credits to AnnaKrogager
Ingredients
Use the proper matrix product K.dot rather than *.
Though w was meant to be applied to y_true, don't use K.dot(w, y_true), since it messes with the batch axis. Rather, use K.dot(y_true, w) and transpose to get matching shapes.
If you want to test the loss function with np.arrays, say y_true and y_pred, make sure you recast them with K.constant.
Here's the code:
def boost_square(y_true, y_pred):
    w = K.constant(np.array([[np.abs(i - j) + 1 for i in range(40)]
                             for j in range(40)]), dtype=np.float32)
    return K.sum(K.dot(K.dot(y_true, w), K.transpose(K.square(y_true - y_pred))))
And for the test:
y_true = K.constant(np.array([[int(i == 2) for i in range(40)]], dtype=np.float32))
y_pred = K.constant(np.array([[0.5 * int(i < 2) for i in range(40)]], dtype=np.float32))
print(K.eval(boost_square(y_true, y_pred)))
>>2.25
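As a final sanity check (a sketch of my own; the random arrays here are not from the original post), the toy model defined earlier now both compiles and fits with this loss:

model.compile(optimizer='adam', loss=boost_square, metrics=['accuracy'])
model.fit(np.random.rand(128, 40), np.random.rand(128, 40), epochs=1, batch_size=128)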
The problem is that your model outputs float32 whereas the constant w inside your loss function is of type float64. You can fix this by simply changing the data type of w:
def boost_square(y_true, y_pred):
    w = K.constant(np.array([[np.abs(i - j) + 1 for i in range(40)]
                             for j in range(40)]), dtype=np.float32)
    return K.sum(K.transpose(w * y_true) * K.square(y_true - y_pred))
Answer to your second question: if you multiply tensors in Keras, they get multiplied element-wise, hence they must have the same (broadcastable) shape. What you want is the matrix product, so you should use K.dot(y, w) instead of w * y.
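To illustrate the shape behaviour (a toy sketch of my own, not from the answer):

import numpy as np
import keras.backend as K

w = K.constant(np.ones((40, 40)), dtype=np.float32)  # (40, 40) weight matrix
y = K.constant(np.ones((3, 40)), dtype=np.float32)   # batch of 3 samples
print(K.int_shape(K.dot(y, w)))  # (3, 40): the batch axis is preserved
# w * y would fail here: shapes (40, 40) and (3, 40) do not broadcast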
Related
I am trying to create a multi-layered perceptron for the purpose of classifying a dataset of hand-drawn digits obtained from the MNIST database. It implements two hidden layers with a sigmoid activation function, while the output layer uses SoftMax. However, for whatever reason I am not able to get it to work. I have attached the training loop from my code below; this, I am confident, is where the problem stems from. Can anyone identify possible issues with my implementation of the perceptron?
def train(self, inputs, targets, eta, niterations):
    """
    inputs is a numpy array of shape (num_train, D) containing the training images
    consisting of num_train samples each of dimension D.
    targets is a numpy array of shape (num_train, D) containing the training labels
    consisting of num_train samples each of dimension D.
    eta is the learning rate for optimization
    niterations is the number of iterations for updating the weights
    """
    ndata = np.shape(inputs)[0]  # number of data samples
    # adding the bias
    inputs = np.concatenate((inputs, -np.ones((ndata, 1))), axis=1)
    # numpy arrays to store the weight updates
    updatew1 = np.zeros((np.shape(self.weights1)))
    updatew2 = np.zeros((np.shape(self.weights2)))
    updatew3 = np.zeros((np.shape(self.weights3)))
    for n in range(niterations):
        # forward phase
        self.outputs = self.forwardPass(inputs)
        # error using the sum-of-squares error function
        error = 0.5 * np.sum((self.outputs - targets) ** 2)
        if np.mod(n, 100) == 0:
            print("Iteration: ", n, " Error: ", error)
        # backward phase
        deltao = self.outputs - targets
        placeholder = np.zeros(np.shape(self.outputs))
        for j in range(np.shape(self.outputs)[1]):
            y = self.outputs[:, j]
            placeholder[:, j] = y * (1 - y)
            for y in range(np.shape(self.outputs)[1]):
                if not y == j:
                    placeholder[:, j] += -y * self.outputs[:, y]
        deltao *= placeholder
        # compute the derivative of the second hidden layer
        deltah2 = np.dot(deltao, np.transpose(self.weights3))
        deltah2 = self.hidden2 * self.beta * (1.0 - self.hidden2) * deltah2
        # compute the derivative of the first hidden layer
        deltah1 = np.dot(deltah2[:, :-1], np.transpose(self.weights2))
        deltah1 = self.hidden1 * self.beta * (1.0 - self.hidden1) * deltah1
        # update the weights of the three layers: self.weights1, self.weights2 and self.weights3
        updatew1 = eta * (np.dot(np.transpose(inputs), deltah1[:, :-1])) + (self.momentum * updatew1)
        updatew2 = eta * (np.dot(np.transpose(self.hidden1), deltah2[:, :-1])) + (self.momentum * updatew2)
        updatew3 = eta * (np.dot(np.transpose(self.hidden2), deltao)) + (self.momentum * updatew3)
        self.weights1 -= updatew1
        self.weights2 -= updatew2
        self.weights3 -= updatew3
def forwardPass(self, inputs):
    """
    inputs is a numpy array of shape (num_train, D) containing the training images
    consisting of num_train samples each of dimension D.
    """
    # layer 1: forward pass on the first hidden layer with the sigmoid function
    self.hidden1 = np.dot(inputs, self.weights1)
    self.hidden1 = 1.0 / (1.0 + np.exp(-self.beta * self.hidden1))
    self.hidden1 = np.concatenate((self.hidden1, -np.ones((np.shape(self.hidden1)[0], 1))), axis=1)
    # layer 2: forward pass on the second hidden layer with the sigmoid function
    self.hidden2 = np.dot(self.hidden1, self.weights2)
    self.hidden2 = 1.0 / (1.0 + np.exp(-self.beta * self.hidden2))
    self.hidden2 = np.concatenate((self.hidden2, -np.ones((np.shape(self.hidden2)[0], 1))), axis=1)
    # output layer: forward pass with the softmax function
    outputs = np.dot(self.hidden2, self.weights3)
    outputs = np.exp(outputs)
    outputs /= np.repeat(np.sum(outputs, axis=1), outputs.shape[1], axis=0).reshape(outputs.shape)
    return outputs
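As an aside (my own suggestion, not part of the original post), the softmax normalization at the end can be written more compactly, and made numerically stable, with keepdims:

outputs = np.dot(self.hidden2, self.weights3)
outputs = np.exp(outputs - outputs.max(axis=1, keepdims=True))  # subtract row max for stability
outputs /= np.sum(outputs, axis=1, keepdims=True)               # row-wise normalization
return outputs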
Update: I have since figured out something that I messed up during the backpropagation of the SoftMax algorithm. The actual deltao should be:
deltao = self.outputs - targets
placeholder = np.zeros(np.shape(self.outputs))
for j in range(np.shape(self.outputs)[1]):
    y = self.outputs[:, j]
    placeholder[:, j] = y * (1 - y)
    # the counter for the loop below used to also be named y, causing confusion
    for i in range(np.shape(self.outputs)[1]):
        if not i == j:
            placeholder[:, j] += -y * self.outputs[:, i]
deltao *= placeholder
After this correction the overflow errors seem to have sorted themselves out. However, there is now a new problem: no matter what variables I change, the accuracy of the perceptron does not exceed 15%.
Second Update: After a long time I have finally found a way to get my code to work. I had to change the backpropagation of SoftMax (called deltao in the code) to the following:
deltao = np.exp(self.outputs)
deltao /= np.repeat(np.sum(deltao, axis=1), deltao.shape[1]).reshape(deltao.shape)
deltao = deltao * (1 - deltao)
deltao *= (self.outputs - targets) / np.shape(inputs)[0]
The only problem is that I have no idea why this works as a derivative of SoftMax. Could anyone explain this?
Problem
I am trying to write a custom loss function for my Tensorflow 2 model. I have written the following function that calculates the loss I am seeking when I manually pass in an input and output Tensor.
def on_off_balance_loss(y_true: EagerTensor, y_pred: EagerTensor) -> float:
    y_true_array: ndarray = np.asarray(y_true).flatten()
    y_predict_array: ndarray = np.asarray(y_pred).flatten()
    on_delta: float = 0.999
    on_loss: float = 0
    off_loss: float = 0
    on_count: int = 0
    off_count: int = 0
    for i in range(len(y_true_array)):
        loss: float = cell_loss(y_true_array[i], y_predict_array[i])
        if y_true_array[i] > on_delta:
            on_count += 1
            on_loss = on_loss * ((on_count - 1) / on_count) + (loss / on_count)
        else:
            off_count += 1
            off_loss = off_loss * ((off_count - 1) / off_count) + (loss / off_count)
    on_factor: int = 4
    return (on_factor * on_loss + off_loss) / (on_factor + 1)
For context, y_true consists of a 2D matrix of 1's and 0's as floats, where 0's are much more common. As such, my model was getting a good loss value by just getting most of the 0's correct, even though where the 1's are is the more important metric. This custom loss puts more proportional emphasis on the location of the 1's.
I changed model.compile(loss="binary_crossentropy") to model.compile(loss=on_off_balance_loss) in the attempt to use the new loss function. This doesn't seem to work, as the loss function is supposed to take in an entire batch of data. So, I tried something like this with model.compile(loss=on_off_balance_batch_loss):
def on_off_balance_batch_loss(y_true, y_pred) -> float:
    y_trues: list = tf.unstack(y_true)
    y_preds: list = tf.unstack(y_pred)
    loss: float = 0
    for i in range(0, len(y_trues)):
        loss = loss * (i / (i + 1)) + (on_off_balance_loss(y_trues[i], y_preds[i]) / (i + 1))
    return loss
This doesn't work. The shape of y_true is (None, None, None), and the shape of y_pred is (None, X, Y), where X and Y are the dimensions of the 2D array of 1's and 0's.
I am working in Google Colaboratory. However, locally, np.asarray() seems to work where it throws an error on Colaboratory. So I'm not really sure whether the error lies in my loss function or in some setup detail of Colaboratory. I have ensured that I am using Tensorflow 2.3.0 both locally and on Colaboratory.
EDITS:
I tried adding run_eagerly=True to model.compile() and using .numpy() instead of np.asarray() in on_off_balance_loss(). This changed the type of input in on_off_balance_batch_loss from Tensor to EagerTensor. This leads to the error ValueError: No gradients provided for any variable: ['lstm_3/lstm_cell_3/kernel:0', 'lstm_3/lstm_cell_3/recurrent_kernel:0', 'lstm_3/lstm_cell_3/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0', 'lstm_4/lstm_cell_4/kernel:0', 'lstm_4/lstm_cell_4/recurrent_kernel:0', 'lstm_4/lstm_cell_4/bias:0', 'dense_3/kernel:0', 'dense_3/bias:0', 'lstm_5/lstm_cell_5/kernel:0', 'lstm_5/lstm_cell_5/recurrent_kernel:0', 'lstm_5/lstm_cell_5/bias:0'].. The same error occurs if I use
def on_off_balance_batch_loss(y_true: EagerTensor, y_pred: EagerTensor) -> float:
    y_trues = tf.TensorArray(tf.float32, 1, dynamic_size=True, infer_shape=False).unstack(y_true)
    y_preds = tf.TensorArray(tf.float32, 1, dynamic_size=True, infer_shape=False).unstack(y_pred)
    loss: float = 0.0
    i: int = 0
    for tensor in range(y_trues.size()):
        elem_loss: float = on_off_balance_loss(y_trues.read(i), y_preds.read(i))
        loss = loss * (i / (i + 1)) + (elem_loss / (i + 1))
        i += 1
    return loss
and omit run_eagerly=True. Even before the errors are reached, it seems that the whole program runs slower than when I used a default loss function.
Well, it turns out the solution is much simpler than what I was trying above. Below is the style in which such a function should be implemented. I compared it to the output of my original function, on_off_balance_loss, and it matches.
def on_off_equal_loss(y_true: Tensor, y_pred: Tensor) -> Tensor:
    on_delta: float = 0.99
    on_mask: Tensor = tf.greater_equal(y_true, on_delta)
    off_mask: Tensor = tf.less(y_true, on_delta)
    on_loss: Tensor = tf.divide(tf.reduce_sum(tf.abs(tf.subtract(
        y_true[on_mask], y_pred[on_mask]
    ))), tf.cast(tf.math.count_nonzero(on_mask), tf.float32))
    off_loss: Tensor = tf.divide(tf.reduce_sum(tf.abs(tf.subtract(
        y_true[off_mask], y_pred[off_mask]
    ))), tf.cast(tf.math.count_nonzero(off_mask), tf.float32))
    on_factor: float = 4.0
    return tf.divide(tf.add(tf.multiply(on_factor, on_loss), off_loss), on_factor + 1.0)
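A minimal sanity check (my own toy numbers; `model` here stands for whatever model you compile the loss into):

import tensorflow as tf

model.compile(optimizer="adam", loss=on_off_equal_loss)

y_true = tf.constant([[1.0, 0.0, 0.0, 0.0]])
y_pred = tf.constant([[0.8, 0.1, 0.0, 0.1]])
print(on_off_equal_loss(y_true, y_pred).numpy())  # (4*0.2 + 0.2/3) / 5 ≈ 0.173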
I'm new to ML. I've been trying to implement a neural network in Python, but when I use the minimize function with the 'tnc' method from the scipy library I get the following error:
ValueError: tnc: invalid gradient vector.
I looked it up a bit and found this in the SciPy source code:
arr_grad = (PyArrayObject *)PyArray_FROM_OTF((PyObject *)py_grad, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
if (arr_grad == NULL)
{
    PyErr_SetString(PyExc_ValueError, "tnc: invalid gradient vector.");
    goto failure;
Edit: This is my implementation of backpropagation and the cost function as methods of the Network class I created. I am currently using a [400 25 10] structure similar to the one used in Andrew Ng's ML Coursera course.
def cost_function(self, theta, x, y):
    u = self.num_layers
    m = len(x)
    Reg = 0  # regularization term init and calculation
    for i in range(u - 1):
        k = np.power(theta[i], 2)
        Reg = np.sum(Reg + np.sum(k))
    Reg = lmbda / (2 * m) * Reg
    h = self.forwardprop(x)[-1]  # activation of the last layer
    J = (-1 / m) * np.sum(np.multiply(y, np.log(h)) + np.multiply((1 - y), np.log(1 - h))) + Reg  # cost function
    return J
def backprop(self, theta, x, y):
    m = len(x)  # number of training examples
    theta = np.asmatrix(theta)
    theta = self.rollPara(theta)  # roll weights into matrices, original shape (1, 10285), after rolling [(25, 401), (26, 10)]
    tot_delta = list(range((self.num_layers - 1)))  # accumulated error init
    delta = list(range(self.num_layers - 1))  # error from each example init
    for i in range(m):  # loop for calculating error
        a = self.forwardprop(x[i:i+1, :])  # activation of each layer for the ith example
        delta[-1] = a[-1] - y[i]  # error of the output layer for the ith example
        for j in range(1, self.num_layers - 1):  # error of each layer for the ith example
            theta_ = theta[-1-j+1][:, 1:]  # weights of the jth layer, from back to front, excluding bias units
            act = (a[:-1])[-1-j+1][:, 1:]  # activation of the current layer, excluding the output layer and bias units
            delta_prv = delta[-1-j+1]  # error of the previous layer
            delta[-1-j] = np.multiply(delta_prv @ theta_, act)  # error of the current layer
        delta = delta[::-1]  # reverse the order of elements since BP goes from back to front
        for j in range(self.num_layers - 1):  # add the ith example's error to the accumulated error
            tot_delta[j] = tot_delta[j] + np.transpose(delta[j]) @ a[self.num_layers-2-j]
    ThetaGrad = np.add((1/m) * np.asarray(tot_delta[::-1]), (lmbda/m) * np.asarray(theta))  # gradient
    grad = self.unrollPara(ThetaGrad)
    return grad
maxiter = 500
options = {'maxiter': maxiter}
initTheta = N.unrollPara(N.weights)  # flatten the weights into a vector
res = op.minimize(fun=N.cost_function, x0=initTheta, jac=N.backprop, method='tnc', args=(x, Y), options=options)  # x, Y are the training set, already initialized
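As an aside (my own suggestion, not from the original post), a quick way to verify the analytic gradient against finite differences before handing it to the optimizer, assuming N, initTheta, x and Y are as defined above:

from scipy.optimize import check_grad

err = check_grad(N.cost_function, N.backprop, initTheta, x, Y)
print(err)  # should be small (roughly < 1e-4) if the backprop gradient is correct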
This is the scipy source code
Thanks in Advance,
After carefully reading the code I realized the grad vector has to be a list and not a NumPy array. Not sure if my implementation works properly yet, but the error is gone.
This code is built up as follows: my robot takes a picture, and a TF computer-vision model calculates where in the picture the target object starts. This information (the x1 and x2 coordinates) is passed to a PyTorch model, which should learn to predict the correct motor activations in order to get closer to the target. After the movement is executed, the robot takes a picture again, and the TF CV model should calculate whether the motor activation brought the robot closer to the desired state (x1 coordinate at 10, x2 coordinate at 31).
However, every time I run the code, PyTorch is not able to calculate the gradients.
I'm wondering if this is some data-type problem or if it is a more general one: Is it impossible to calculate the gradients if the loss is not calculated directly from the pytorch network's output?
Any help and suggestions will be greatly appreciated.
# define policy model (model to learn a policy for my robot)
import torch
import torch.nn as nn
import torch.nn.functional as F

class policy_gradient_model(nn.Module):
    def __init__(self):
        super(policy_gradient_model, self).__init__()
        self.fc0 = nn.Linear(2, 2)
        self.fc1 = nn.Linear(2, 32)
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 32)
        self.fc5 = nn.Linear(32, 2)

    def forward(self, x):
        x = self.fc0(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = F.relu(self.fc5(x))
        return x

policy_model = policy_gradient_model().double()
print(policy_model)
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=0.005, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
# make the robot move as predicted by the pytorch network (not all code included)
def move(motor_controls):
    # define curvature
    # motor_controls[0] = sigmoid(motor_controls[0])
    activation_left = 1 + (motor_controls[0]) * 99
    activation_right = 1 + (1 - motor_controls[0]) * 99
    print("activation left:", activation_left, ". activation right:", activation_right, ". time:", motor_controls[1] * 100)
    # start movement
# main
import cv2
import numpy as np
import time
from torch.autograd import Variable

print("start training")
losses = []
losses_end_of_epoch = []
number_of_steps_each_epoch = []
loss_function = nn.MSELoss(reduction='mean')

# each epoch
for epoch in range(2):
    count = 0
    target_reached = False
    while not target_reached:
        print("epoch: ", epoch, ". step:", count)
        ### process and take picture
        indices = process_picture()
        ### binary_network(sliced)=indices as input for policy model
        optimizer.zero_grad()
        ### output: 1 for curvature, 1 for duration of movement
        motor_controls = policy_model(Variable(torch.from_numpy(indices))).detach().numpy()
        print("NO TANH output for motor: 1)activation left, 2)time ", motor_controls)
        motor_controls[0] = np.tanh(motor_controls[0])
        motor_controls[1] = np.tanh(motor_controls[1])
        print("TANH output for motor: 1)activation left, 2)time ", motor_controls)
        ### execute suggested action
        move(motor_controls)
        ### take and process picture2 (after movement)
        indices = process_picture()
        ### loss = (binary_network(picture2) - desired)
        print("calculate loss")
        print("idx", indices, type(torch.tensor(indices)))
        # loss = 0
        # loss = (indices[0]-10)**2+(indices[1]-31)**2
        # loss = loss/2
        print("shape of indices", indices.shape)
        array = np.zeros((1, 2))
        array[0] = indices
        print(array.shape, type(array))
        array2 = torch.ones([1, 2])
        loss = loss_function(torch.tensor(array).double(), torch.tensor([[10.0, 31.0]]).double()).float()
        print("loss: ", loss, type(loss), loss.shape)
        # array2[0] = loss_function(torch.tensor(array).double(), torch.tensor([[10.0,31.0]]).double()).float()
        losses.append(loss)

        # start of the lines causing the error message (still part of main)
        ### calculate gradients
        loss.backward()
        # end of the lines causing the error message (still part of main)
        ### apply gradients
        optimizer.step()
# Output (so far as intended) (not all included)
# calculate loss
idx [14. 15.] <class 'torch.Tensor'>
shape of indices (2,)
(1, 2) <class 'numpy.ndarray'>
loss:  tensor(136.) <class 'torch.Tensor'> torch.Size([])
# Error message:
Traceback (most recent call last):
  File "/home/pi/Desktop/GradientPolicyLearning/PolicyModel.py", line 259, in <module>
    array2.backward()
  File "/home/pi/.local/lib/python3.7/site-packages/torch/tensor.py", line 134, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pi/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
If you call .detach() on the prediction, that will delete the gradients. Since you are first getting indices from the model and then trying to backprop the error, I would suggest:
prediction = policy_model(torch.from_numpy(indices))
motor_controls = prediction.clone().detach().numpy()
This keeps the prediction as it is, with the computation graph through which the gradients can be backpropagated.
Now you can do:
loss = loss_function(prediction, torch.tensor([[10.0, 31.0]]).double()).float()
Note, you might want to call .double() on the prediction if it throws an error.
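Putting it together, a minimal sketch of one corrected training step (my own assembly, assuming process_picture, move and the model/optimizer from the question, and that indices has shape (2,)):

indices = process_picture()
optimizer.zero_grad()
prediction = policy_model(torch.from_numpy(indices))  # stays in the graph
motor_controls = prediction.clone().detach().numpy()  # safe copy for the robot
move(motor_controls)
loss = loss_function(prediction, torch.tensor([10.0, 31.0]).double())
loss.backward()   # gradients now flow through `prediction`
optimizer.step()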
It is indeed impossible to calculate the gradients if the loss is not calculated directly from the PyTorch network's output, because then you cannot apply the chain rule that backpropagation uses to compute them.
A simple solution: turn on the context manager that sets gradient calculation to ON, if it is off:
torch.set_grad_enabled(True) # Context-manager
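A toy illustration (my own, assuming gradient calculation had been globally disabled earlier, e.g. inside an inference block):

import torch
torch.set_grad_enabled(True)  # re-enable autograd globally
x = torch.ones(2, requires_grad=True)
x.sum().backward()
print(x.grad)  # tensor([1., 1.])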
Make sure that all your inputs into the NN, the output of NN and ground truth/target values are all of type torch.tensor and not list, numpy.array or any other iterable.
Also, make sure that they are not converted to list or numpy.array at any point either.
In my case, I got this error because I performed a list comprehension on the tensor containing the predicted values from the NN, in order to get the max value in each row, and then converted the list back to a torch.tensor before calculating the loss. This back-and-forth conversion disables the gradient calculations.
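A toy illustration of that pitfall (my own example, not from the original answer):

import torch
pred = torch.randn(4, 3, requires_grad=True)
bad = torch.tensor([float(row.max()) for row in pred])  # new leaf tensor, graph is lost
print(bad.requires_grad)                                # False
good = pred.max(dim=1).values                           # stays in the graph
print(good.requires_grad)                               # True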
In my case, I got past this error by specifying requires_grad=True when defining my input tensors
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('dark_background')

# define rosenbrock function and gradient
a = 1
b = 5
def f(x):
    return (a - x[0]) ** 2 + b * (x[1] - x[0] ** 2) ** 2

def jac(x):
    dx1 = -2 * a + 4 * b * x[0] ** 3 - 4 * b * x[0] * x[1] + 2 * x[0]
    dx2 = 2 * b * (x[1] - x[0] ** 2)
    return np.array([dx1, dx2])

# create stochastic rosenbrock function and gradient
def f_rand(x):
    return f(x) * np.random.uniform(0.5, 1.5)

def jac_rand(x):
    return jac(x) * np.random.uniform(0.5, 1.5)

# use hand-coded adam
x = np.array([0.1, 0.1])
x0 = x.copy()
j = jac_rand(x)
beta1 = 0.9
beta2 = 0.999
eps = 1e-8
m = x * 0
v = x * 0
learning_rate = .1
for ii in range(200):
    m = (1 - beta1) * j + beta1 * m  # first moment estimate
    v = (1 - beta2) * (j ** 2) + beta2 * v  # second moment estimate
    mhat = m / (1 - beta1 ** (ii + 1))  # bias correction
    vhat = v / (1 - beta2 ** (ii + 1))
    x = x - learning_rate * mhat / (np.sqrt(vhat) + eps)
    j = jac_rand(x)
print('hand code finds optimal to be ', x, f(x))

# attempt to use pytorch
import torch
x_tensor = torch.tensor(x0, requires_grad=True)
optimizer = torch.optim.Adam([x_tensor], lr=learning_rate)

def closure():
    optimizer.zero_grad()
    loss = f_rand(x_tensor)
    loss.backward()
    return loss

for ii in range(200):
    optimizer.step(closure)
print('My PyTorch attempt found ', x_tensor, f(x_tensor))
The following worked for me:
loss.requires_grad = True
loss.backward()
I have the following loop, where I am calculating the softmax transform for batches of different sizes, as below:
import numpy as np
from numba import prange  # prange needs numba; a plain range also works here

def softmax(Z, arr):
    """
    :param Z: numpy array of any shape (output from hidden layer)
    :param arr: numpy array of any shape (start, end)
    :return A: output of multinum_logit(Z,arr), same shape as Z
    :return cache: returns Z as well, useful during back propagation
    """
    A = np.zeros(Z.shape)
    for i in prange(len(arr)):
        shiftx = Z[:, arr[i, 1]:arr[i, 2] + 1] - np.max(Z[:, int(arr[i, 1]):int(arr[i, 2]) + 1])
        A[:, arr[i, 1]:arr[i, 2] + 1] = np.exp(shiftx) / np.exp(shiftx).sum()
    cache = Z
    return A, cache
Since this for loop is not vectorized, it is the bottleneck in my code. What is a possible solution to make it faster? I have tried numba's @jit, which makes it a little faster, but not enough. I was wondering if there is another way to speed it up or to vectorize/parallelize it.
Sample input data for the function:
Z = np.random.random([1, 10000])
arr = np.zeros([100, 3])
arr[:, 0] = 1
temp = int(Z.shape[1] / arr.shape[0])
for i in range(arr.shape[0]):
    arr[i, 1] = i * temp
    arr[i, 2] = (i + 1) * temp - 1
arr = arr.astype(int)
EDIT:
I forgot to stress here that my number of classes varies. For example, batch 1 may have 10 classes while batch 2 has 15. Therefore I am passing an array arr which keeps track of which indices belong to batch 1, and so on. These batches are different from the batches in a traditional neural-network framework.
In the above example, arr keeps track of the starting and ending index of each segment, so the denominator in the softmax function is the sum over only those observations whose index lies between the starting and ending index.
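For what it's worth, here is one possible fully vectorized sketch (my own, not from the question), assuming the segments are contiguous, ordered, and cover Z's single row as in the sample data above; it uses np.maximum.reduceat / np.add.reduceat to compute the per-segment max and sum in one shot:

def softmax_segments(Z, arr):
    # Z: (1, N); arr[i, 1] / arr[i, 2] are the inclusive start / end column of segment i
    starts = arr[:, 1]
    lengths = arr[:, 2] - arr[:, 1] + 1
    seg_id = np.repeat(np.arange(len(starts)), lengths)  # segment id for every column
    seg_max = np.maximum.reduceat(Z[0], starts)          # per-segment max (for stability)
    e = np.exp(Z[0] - seg_max[seg_id])                   # shifted exponentials
    seg_sum = np.add.reduceat(e, starts)                 # per-segment sums
    return (e / seg_sum[seg_id])[None, :], Z             # same (A, cache) contract as softmax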
Here's a vectorized softmax function. It's the implementation of an assignment from Stanford's cs231n course on conv nets.
The function takes in optimizable parameters, input data, targets, and a regularizer. (You can ignore the regularizer as that references another class exclusive to some cs231n assignments).
It returns a loss and gradients of the parameters.
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.
    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]

    scores = X.dot(W)
    shift_scores = scores - np.amax(scores, axis=1).reshape(-1, 1)
    softmax = np.exp(shift_scores) / np.sum(np.exp(shift_scores), axis=1).reshape(-1, 1)
    loss = -np.sum(np.log(softmax[range(num_train), list(y)]))
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

    dSoftmax = softmax.copy()
    dSoftmax[range(num_train), list(y)] += -1
    dW = (X.T).dot(dSoftmax)
    dW = dW / num_train + reg * W
    return loss, dW
For comparison's sake, here is a naive (non-vectorized) implementation of the same method.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    num_classes = W.shape[1]
    for i in range(num_train):  # xrange in the original Python 2 code
        scores = X[i].dot(W)
        shift_scores = scores - max(scores)
        loss_i = -shift_scores[y[i]] + np.log(sum(np.exp(shift_scores)))
        loss += loss_i
        for j in range(num_classes):
            softmax = np.exp(shift_scores[j]) / sum(np.exp(shift_scores))
            if j == y[i]:
                dW[:, j] += (-1 + softmax) * X[i]
            else:
                dW[:, j] += softmax * X[i]
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W  # note: `dW /= num_train + reg * W` was a precedence bug
    return loss, dW
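A quick smoke test (my own, with hypothetical sizes) to confirm the two implementations agree:

import numpy as np
np.random.seed(0)
W = 0.0001 * np.random.randn(50, 10)
X = np.random.randn(20, 50)
y = np.random.randint(10, size=20)
l1, g1 = softmax_loss_naive(W, X, y, reg=1e-3)
l2, g2 = softmax_loss_vectorized(W, X, y, reg=1e-3)
print(abs(l1 - l2), np.abs(g1 - g2).max())  # both should be ~0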