I am a newbie to neural networks and I am trying to implement one in Python/NumPy, based on the code I found at:
"Create a Simple Neural Network in Python from Scratch"
My input array is:
array([[5.71, 5.77, 5.94],
       [5.77, 5.94, 5.51],
       [5.94, 5.51, 5.88],
       [5.51, 5.88, 5.73]])
Output array is:
array([[5.51],
       [5.88],
       [5.73],
       [6.41]])
After running the code, I see the following results, which are not correct:
synaptic_weights after training
[[1.90625275]
[2.54867698]
[1.07698312]]
outputs after training
[[1.]
[1.]
[1.]
[1.]]
Here is the core of the code:
for iteration in range(1000):
    input_layer = tr_input
    outputs = sigmoid(np.dot(input_layer, synaptic_weights))
    error = tr_output - outputs
    adjustments = error * sigmoid_derivative(outputs)
    synaptic_weights += np.dot(input_layer.T, adjustments)

print('synaptic_weights after training')
print(synaptic_weights)
print('outputs after training')
print(outputs)
What should I change in this code so it works for my data? Or should I take a different approach? Any help is highly appreciated.
That's because you are using the wrong activation function (i.e. sigmoid). The main reason we use the sigmoid function is that its output lies between 0 and 1, so it is especially useful for models where we have to predict a probability as the output. Since a probability only exists in the range 0 to 1, sigmoid is the right choice there; but it can never reach your target values, which are all greater than 1, which is why your trained outputs saturate at 1.
If you want to train a model to predict the values in your array, you should use a regression model (a linear output with a squared-error loss). Otherwise, you can convert your output into labels (for example, map 5.x to 0 and 6.x to 1) and retrain your model.
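For example, here is a minimal sketch of the regression route: drop the sigmoid (and its derivative) from the training loop so the network becomes a plain least-squares regression, and shrink the learning rate to keep the updates stable. The learning rate and iteration count here are illustrative choices, not values from the tutorial:

import numpy as np

tr_input = np.array([[5.71, 5.77, 5.94],
                     [5.77, 5.94, 5.51],
                     [5.94, 5.51, 5.88],
                     [5.51, 5.88, 5.73]])
tr_output = np.array([[5.51], [5.88], [5.73], [6.41]])

np.random.seed(1)
synaptic_weights = 2 * np.random.random((3, 1)) - 1

learning_rate = 0.001  # small enough for these unscaled inputs
for iteration in range(10000):
    outputs = np.dot(tr_input, synaptic_weights)  # linear output, no sigmoid
    error = tr_output - outputs
    synaptic_weights += learning_rate * np.dot(tr_input.T, error)

print(outputs)  # approximates tr_output instead of saturating at 1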
These are the steps involved in my neural network implementation.
Randomly initialize the weights (θ, theta)
Implement forward propagation
Compute the cost function
Implement backpropagation to compute the partial derivatives
Use gradient descent (steps 1 and 5 are sketched after the cost function below)
def forward_prop(X, theta_list):
    m = X.shape[0]
    a_list = []
    z_list = []
    # input layer with a bias column prepended
    a_list.append(np.insert(X, 0, values=np.ones(m), axis=1))
    for idx, theta in enumerate(theta_list):
        z_list.append(a_list[idx] * theta.T)
        if idx != (len(theta_list) - 1):
            # hidden activations also get a bias column
            a_list.append(np.insert(sigmoid(z_list[idx]), 0, values=np.ones(m), axis=1))
        else:
            a_list.append(sigmoid(z_list[idx]))
    return a_list, z_list
def back_prop(params, input_size, hidden_layers, num_labels, X, y, regularization, regularize):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    # unroll the flat parameter vector into one weight matrix per layer
    theta_list = []
    startCount = 0
    for idx, val in enumerate(hidden_layers):
        if idx == 0:
            startCount = val * (input_size + 1)
            theta_list.append(np.matrix(np.reshape(params[:startCount], (val, (input_size + 1)))))
        if idx != 0:
            tempCount = startCount
            startCount += (val * (hidden_layers[idx - 1] + 1))
            theta_list.append(np.matrix(np.reshape(params[tempCount:startCount], (val, (hidden_layers[idx - 1] + 1)))))
        if idx == (len(hidden_layers) - 1):
            theta_list.append(np.matrix(np.reshape(params[startCount:], (num_labels, (val + 1)))))

    a_list, z_list = forward_prop(X, theta_list)
    J = cost(X, y, a_list[len(a_list) - 1], theta_list, regularization, regularize)

    # propagate the error backwards, layer by layer
    d_list = []
    d_list.append(a_list[len(a_list) - 1] - y)
    idx = 0
    while idx < (len(theta_list) - 1):
        d_temp = np.multiply(d_list[idx] * theta_list[len(a_list) - 2 - idx],
                             sigmoid_gradient(a_list[len(a_list) - 2 - idx]))
        d_list.append(d_temp[:, 1:])
        idx += 1

    # accumulate the gradients for each layer
    delta_list = []
    for theta in theta_list:
        delta_list.append(np.zeros(theta.shape))
    for idx, delta in enumerate(delta_list):
        delta_list[idx] = delta_list[idx] + ((d_list[len(d_list) - 1 - idx].T) * a_list[idx])
        delta_list[idx] = delta_list[idx] / m
    if regularize:
        # do not regularize the bias column
        for idx, delta in enumerate(delta_list):
            delta_list[idx][:, 1:] = delta_list[idx][:, 1:] + (theta_list[idx][:, 1:] * regularization)

    # unroll the gradients back into a flat vector
    grad_list = np.ravel(delta_list[0])
    idx = 1
    while idx < (len(delta_list)):
        grad_list = np.concatenate((grad_list, np.ravel(delta_list[idx])), axis=None)
        idx += 1
    return J, grad_list
def cost(X, y, h, theta_list, regularization, regularize):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    # cross-entropy cost
    J = (np.multiply(-y, np.log(h)) - np.multiply((1 - y), np.log(1 - h))).sum() / m
    if regularize:
        # L2 penalty over all weights except the bias column
        regularization_value = 0.0
        for theta in theta_list:
            regularization_value += np.sum(np.power(theta[:, 1:], 2))
        J += (float(regularization) / (2 * m)) * regularization_value
    return J
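Steps 1 and 5 are not implemented in the functions above. Here is a minimal sketch of how they might be wired together; the layer sizes, epsilon, learning rate, and the X/y training matrices are illustrative assumptions, not values from my original code:

import numpy as np

# hypothetical network shape: 3 inputs, one hidden layer of 5 units, 1 output
input_size, hidden_layers, num_labels = 3, [5], 1

# Step 1: randomly initialize the unrolled weights in [-epsilon, epsilon]
epsilon = 0.12
n_params = hidden_layers[0] * (input_size + 1) + num_labels * (hidden_layers[0] + 1)
params = np.random.uniform(-epsilon, epsilon, n_params)

# Step 5: plain batch gradient descent on the unrolled parameter vector,
# assuming X and y are your training matrices
learning_rate = 0.5
for _ in range(1000):
    J, grad = back_prop(params, input_size, hidden_layers, num_labels,
                        X, y, regularization=1.0, regularize=True)
    params = params - learning_rate * grad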
I'm trying to implement my own neural network with (almost) fully vectorized operations. There are lots of posts out there but I can't seem to find one that fits all three of these:
separate cross-entropy and softmax terms in the gradient calculation (so I can interchange the last activation and loss)
multi-class classification (y is one-hot encoded)
all operations are fully vectorized
My main question is: How do I get to dE/dz (N x K) given dE/da (N x K) and da/dz (N x K x K) using a fully vectorized operation? i.e. How do I vectorize dE_dz_test2?
My second question is:
Is there a better way to write softmax_derivative?
I used this as a reference for calculating the gradient one sample at a time:
http://saitcelebi.com/tut/output/part2.html
and this for figuring out how to do backprop
https://peterroelants.github.io/posts/neural-network-implementation-part04/
def one_hot_encode(y, n_classes):
    y_onehot = np.zeros((len(y), n_classes))
    for i, y_i in enumerate(y):
        y_onehot[i, y_i] = 1
    return y_onehot

def cross_entropy_derivative(y_true, y_pred):
    # dE/da
    # input: N x K array
    # output: N x K array
    N = len(y_true)
    return -(y_true / y_pred) / N

def softmax(x):
    # activation (a)
    # input: N x K array
    # output: N x K array
    # https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
    exp = np.exp(x - np.max(x))
    return exp / np.sum(exp, axis=1)[:, None]

def softmax_derivative(Z):
    # da/dz
    # input: N x K array
    # output: N x K x K array
    # http://saitcelebi.com/tut/output/part2.html
    N, K = Z.shape
    s = softmax(Z)[:, :, np.newaxis]
    a = np.tensordot(s, np.ones((1, K)), axes=([-1], [0]))
    I = np.repeat(np.eye(K, K)[np.newaxis, :, :], N, axis=0)
    b = I - np.tensordot(np.ones((K, 1)), s.T, axes=([-1], [0])).T
    return a * np.swapaxes(b, 1, 2)

def softmax_derivative_test(Z):
    # da/dz
    # non-vectorized softmax gradient calculation
    # http://saitcelebi.com/tut/output/part2.html
    N, K = Z.shape
    da_dz = np.zeros((N, K, K))
    kron_delta = np.eye(K)
    s = softmax(Z)
    for n in range(N):
        for i in range(K):
            for j in range(K):
                da_dz[n, i, j] = s[n, i] * (kron_delta[i, j] - s[n, j])
    return da_dz

def dE_dz_test2(dE_da, da_dz):
    # input: array (N x K)
    # input: array (N x K x K)
    # output: array (N x K)
    N, K = dE_da.shape
    dE_dz = np.zeros((N, K))
    for n in range(N):
        dE_dz[n, :] = np.matmul(da_dz[n], dE_da[n, :, np.newaxis]).T
    return dE_dz

def some_type_of_matrix_multiplication_(dE_da, da_dz):
    # how do I get dE/dz from dE_da and da_dz?
    pass

X = np.random.rand(100, 2)
W = np.random.rand(2, 4)
y = np.random.randint(0, 4, size=100)
y = one_hot_encode(y, 4)
Z = X @ W
S = softmax(Z)
N, K = Z.shape

# da/dz for softmax
da_dz = softmax_derivative(Z)            # (100, 4, 4)
da_dz_test = softmax_derivative_test(Z)  # (100, 4, 4) - non-vectorized implementation
print(np.isclose(da_dz, da_dz_test).all())  # equivalence test

dE_da = cross_entropy_derivative(y, S)   # (100, 4)
dE_dz = some_type_of_matrix_multiplication_(dE_da, da_dz)  # what do I do here? *****
dE_dz_test = (S - y) / N  # (100, 4) if you combine the dE/da and da/dz terms
dE_dz_test2 = dE_dz_test2(dE_da, da_dz)
print(np.isclose(dE_dz_test, dE_dz_test2).all())  # equivalence test
True
True
Here is an approach using np.einsum:
def da_dz_pp(z, sm=None):
    if sm is None:
        sm = softmax(z)
    res = np.einsum('ij,ik->ijk', sm, -sm)
    np.einsum('ijj->ij', res)[...] += sm
    return res

def dE_dz_pp(y, z, sm=None):
    if sm is None:
        sm = softmax(z)
    dE_da = cross_entropy_derivative(y, sm)
    da_dz = da_dz_pp(z, sm)
    return np.einsum('ij,ijk->ik', dE_da, da_dz)
It seems to reproduce what your code outputs and is a bit faster.
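As a quick sanity check, reusing the names from the question's script (y, Z, S and N as defined above):

dE_dz_einsum = dE_dz_pp(y, Z)
print(np.isclose(dE_dz_einsum, (S - y) / N).all())  # expected: True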
What I am trying to do is to simulate an American option on a stock with Monte Carlo and use TensorFlow to price it.
I use two helper functions: get_continuation_function to create the TF operators, and pricing_function to create the computational graph for the pricing.
The npv operator is the sum of the optimal exercise decisions. At each time I check whether the exercise value is greater than the predicted continuation value (in other words, whether the option is in the money).
The actual pricing function is american_tf. I execute the function to create the paths and the exercise values for the training paths. Then I iterate backwards through the training_functions and learn the value and decision at each exercise date.
def get_continuation_function():
    X = tf.placeholder(tf.float32, (None, 1), name="X")
    y = tf.placeholder(tf.float32, (None, 1), name="y")
    w = tf.Variable(tf.random_uniform((1, 1)) * 0.1, name="w")
    b = tf.Variable(initial_value=tf.ones(1) * 1, name="b")
    y_hat = tf.add(tf.matmul(X, w), b)
    pre_error = tf.pow(y - y_hat, 2)
    error = tf.reduce_mean(pre_error)
    train = tf.train.AdamOptimizer(0.1).minimize(error)
    return (X, y, train, w, b, y_hat)

def pricing_function(number_call_dates):
    S = tf.placeholder(tf.float32, name="S")
    # first exercise date
    dts = tf.placeholder(tf.float32, name="dts")
    # 2nd exercise date
    K = tf.placeholder(tf.float32, name="K")
    r = tf.placeholder(tf.float32, name="r")
    sigma = tf.placeholder(tf.float32, name="sigma")
    dW = tf.placeholder(tf.float32, name="dW")

    S_t = S * tf.cumprod(tf.exp((r - sigma**2 / 2) * dts + sigma * tf.sqrt(dts) * dW), axis=1)
    E_t = tf.exp(-r * tf.cumsum(dts)) * tf.maximum(K - S_t, 0)

    continuationValues = []
    training_functions = []
    previous_exercises = 0
    npv = 0
    for i in range(number_call_dates - 1):
        (input_x, input_y, train, w, b, y_hat) = get_continuation_function()
        training_functions.append((input_x, input_y, train, w, b, y_hat))
        X = tf.keras.activations.relu(S_t[:, i])
        contValue = tf.add(tf.matmul(X, w), b)
        continuationValues.append(contValue)
        inMoney = tf.cast(tf.greater(E_t[:, i], 0.), tf.float32)
        exercise = tf.cast(tf.greater(E_t[:, i], contValue[:, 0]), tf.float32) * inMoney * (1 - previous_exercises)
        previous_exercises += exercise
        npv += exercise * E_t[:, i]

    # Last exercise date
    inMoney = tf.cast(tf.greater(E_t[:, -1], 0.), tf.float32)
    exercise = inMoney * (1 - previous_exercises)
    npv += exercise * E_t[:, -1]
    npv = tf.reduce_mean(npv)
    return [S, dts, K, r, sigma, dW, S_t, E_t, npv, training_functions]
def american_tf(S_0, strike, M, impliedvol, riskfree_r, random_train, random_pricing):
    n_exercise = len(M)
    with tf.Session() as sess:
        S, dts, K, r, sigma, dW, S_t, E_t, npv, training_functions = pricing_function(n_exercise)
        sess.run(tf.global_variables_initializer())
        paths, exercise_values = sess.run([S_t, E_t], {
            S: S_0,
            dts: M,
            K: strike,
            r: riskfree_r,
            sigma: impliedvol,
            dW: random_train
        })
        for i in range(n_exercise - 1)[::-1]:
            (input_x, input_y, train, w, b, y_hat) = training_functions[i]
            y = exercise_values[:, i + 1:i + 2]
            X = paths[:, i]
            print(input_x.shape)
            print((exercise_values[:, i] > 0).shape)
            for epochs in range(100):
                _ = sess.run(train, {input_x: X[exercise_values[:, i] > 0],
                                     input_y: y[exercise_values[:, i] > 0]})
            cont_value = sess.run(y_hat, {input_x: X, input_y: y})
            exercise_values[:, i + 1:i + 2] = np.maximum(exercise_values[:, i + 1:i + 2], cont_value)
        npv = sess.run(npv, {S: S_0, K: strike, r: riskfree_r, sigma: impliedvol, dW: N_pricing})
    return npv

N_samples_learn = 1000
N_samples_pricing = 1000
calldates = 12
N = np.random.randn(N_samples_learn, calldates)
N_pricing = np.random.randn(N_samples_pricing, calldates)
american_tf(100., 90., [1.] * calldates, 0.25, 0.05, N, N_pricing)
calldates is the number of steps; the training sample set and the pricing sample set each contain 1000 samples. But the error I get is very weird:
---> 23 nput_y:y[exercise_values[:,i]>0]})
ValueError: Cannot feed value of shape (358,) for Tensor 'Placeholder_441:0', which has shape '(?, 1)'
There are a bunch of things discussed in the comments with @hallo12. I just want to upload a working version incorporating all the changes. The code is tested and runs without error, but to make sure the final training output is correct, you may want to compare it against some benchmark.
General comment: It's good to separate the variable and time dimensions in this type of application, especially when you only have one variable. For example, your input array should be 3D with
[time, training sample, input variable]
rather than 2D with [training sample, time]. That way, when you iterate over the time dimension, the remaining dimensions stay unchanged.
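For instance, a small sketch of that conversion (shapes only; the names mirror the driver code below):

import numpy as np

N_samples_learn, calldates = 1000, 12
N_2d = np.random.randn(N_samples_learn, calldates)  # 2D: [training sample, time]
N_3d = N_2d.T[:, :, np.newaxis]                     # 3D: [time, training sample, input variable]
print(N_3d.shape)                                   # (12, 1000, 1)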
import tensorflow as tf
import numpy as np

def get_continuation_function():
    X = tf.placeholder(tf.float32, (None, 1), name="X")
    y = tf.placeholder(tf.float32, (None, 1), name="y")
    w = tf.Variable(tf.random_uniform((1, 1)) * 0.1, name="w")
    b = tf.Variable(initial_value=tf.ones(1) * 1, name="b")
    y_hat = tf.add(tf.matmul(X, w), b)
    pre_error = tf.pow(y - y_hat, 2)
    error = tf.reduce_mean(pre_error)
    train = tf.train.AdamOptimizer(0.1).minimize(error)
    return (X, y, train, w, b, y_hat)

def pricing_function(number_call_dates):
    S = tf.placeholder(tf.float32, name="S")
    # first exercise date
    dts = tf.placeholder(tf.float32, name="dts")
    # 2nd exercise date
    K = tf.placeholder(tf.float32, name="K")
    r = tf.placeholder(tf.float32, name="r")
    sigma = tf.placeholder(tf.float32, name="sigma")
    dW = tf.placeholder(tf.float32, name="dW")

    S_t = S * tf.cumprod(tf.exp((r - sigma**2 / 2) * dts + sigma * tf.sqrt(dts) * dW), axis=1)
    E_t = tf.exp(-r * tf.cumsum(dts)) * tf.maximum(K - S_t, 0)

    continuationValues = []
    training_functions = []
    previous_exercises = 0
    npv = 0
    for i in range(number_call_dates - 1):
        (input_x, input_y, train, w, b, y_hat) = get_continuation_function()
        training_functions.append((input_x, input_y, train, w, b, y_hat))
        X = tf.keras.activations.relu(S_t[:, i:i + 1])
        contValue = tf.add(tf.matmul(X, w), b)
        continuationValues.append(contValue)
        inMoney = tf.cast(tf.greater(E_t[:, i], 0.), tf.float32)
        exercise = tf.cast(tf.greater(E_t[:, i], contValue[:, 0]), tf.float32) * inMoney * (1 - previous_exercises)
        previous_exercises += exercise
        npv += exercise * E_t[:, i]

    # Last exercise date
    inMoney = tf.cast(tf.greater(E_t[:, -1], 0.), tf.float32)
    exercise = inMoney * (1 - previous_exercises)
    npv += exercise * E_t[:, -1]
    npv = tf.reduce_mean(npv)
    return [S, dts, K, r, sigma, dW, S_t, E_t, npv, training_functions]
def american_tf(S_0, strike, M, impliedvol, riskfree_r, random_train, random_pricing):
    n_exercise = len(M)
    with tf.Session() as sess:
        S, dts, K, r, sigma, dW, S_t, E_t, npv, training_functions = pricing_function(n_exercise)
        sess.run(tf.global_variables_initializer())
        paths, exercise_values = sess.run([S_t, E_t], {
            S: S_0,
            dts: M,
            K: strike,
            r: riskfree_r,
            sigma: impliedvol,
            dW: random_train
        })
        # iterate backwards over the exercise dates, learning the
        # continuation-value regression at each date
        for i in range(n_exercise - 1)[::-1]:
            (input_x, input_y, train, w, b, y_hat) = training_functions[i]
            y = exercise_values[:, i + 1:i + 2]
            X = paths[:, i]
            print(input_x.shape)
            print((exercise_values[:, i] > 0).shape)
            for epochs in range(100):
                _ = sess.run(train, {input_x: (X[exercise_values[:, i] > 0]).reshape(len(X[exercise_values[:, i] > 0]), 1),
                                     input_y: (y[exercise_values[:, i] > 0]).reshape(len(y[exercise_values[:, i] > 0]), 1)})
            cont_value = sess.run(y_hat, {input_x: X.reshape(len(X), 1), input_y: y.reshape(len(y), 1)})
            exercise_values[:, i + 1:i + 2] = np.maximum(exercise_values[:, i + 1:i + 2], cont_value)
        npv = sess.run(npv, {S: S_0, K: strike, dts: M, r: riskfree_r, sigma: impliedvol, dW: N_pricing})
    return npv

N_samples_learn = 1000
N_samples_pricing = 1000
calldates = 12
N = np.random.randn(N_samples_learn, calldates)
N_pricing = np.random.randn(N_samples_pricing, calldates)
print(american_tf(100., 90., [1.] * calldates, 0.25, 0.05, N, N_pricing))
I am trying to implement GMM clustering for both 24-dimensional and 32-dimensional feature vectors, where the initial parameters are assigned by the k-means algorithm (k-means provides only the cluster centers, MU).
I am following this link, where it's implemented only for a 2D feature vector with predefined mu and sigma.
If anyone has code for GMM clustering, kindly post it.
A predefined GMM implementation also exists in sklearn, but it doesn't give me the likelihood for each iteration. sklearn GMM
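As a side note, it may still be possible to get a per-iteration likelihood out of sklearn by running one EM step at a time with warm_start; GaussianMixture exposes the current average log-likelihood lower bound as lower_bound_ after each fit. This is only a sketch, and the data array X here is a placeholder:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 24)  # placeholder for your 24-D feature vectors
gmm = GaussianMixture(n_components=8, init_params='kmeans',
                      max_iter=1, warm_start=True)
log_likelihoods = []
for _ in range(100):
    gmm.fit(X)                             # one more EM iteration per call
    log_likelihoods.append(gmm.lower_bound_)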
def kmeans(dataSet, k, c):
    # 1. Randomly choose initial cluster centers
    rng = np.random.RandomState(c)
    p = rng.permutation(dataSet.shape[0])[:k]
    centers = dataSet[p]
    while True:
        labels = pairwise_distances_argmin(dataSet, centers)
        new_centers = np.array([dataSet[labels == i].mean(0) for i in range(k)])
        if np.all(centers == new_centers):
            break
        centers = new_centers
    cluster_data = [dataSet[labels == i] for i in range(k)]
    l = []
    covs = []
    for i in range(k):
        # mixing proportion and covariance estimated from each cluster
        l.append(len(cluster_data[i]) * 1.0 / len(dataSet))
        covs.append(np.cov(np.array(cluster_data[i]).T))
    return centers, l, covs, cluster_data
class gaussian_Mix_Model:

    def __init__(self, k=8, eps=0.0000001):
        self.k = k      # number of clusters
        self.eps = eps  # threshold `epsilon` used to stop

    def fit_EM(self, X, max_iters=1000):
        # n = number of data points, d = dimension of data points
        n, d = X.shape

        # Initialize means and covariances from k-means
        # (kmeans returns centers, mixing proportions, covariances, clusters;
        # the seed 0 passed as c is arbitrary)
        mu, l, Cov, cluster_data = kmeans(X, self.k, 0)

        # initialize the weights uniformly
        w = [1. / self.k] * self.k
        R = np.zeros((n, self.k))

        ### log-likelihoods
        LLhoods = []

        # multivariate Gaussian density
        P = lambda mu, s: np.linalg.det(s) ** -.5 * (2 * np.pi) ** (-X.shape[1] / 2.) \
            * np.exp(-.5 * np.einsum('ij, ij -> i',
                                     X - mu, np.dot(np.linalg.inv(s), (X - mu).T).T))

        # Iterate up to max_iters iterations
        while len(LLhoods) < max_iters:
            # Expectation: membership of each point in each of the K clusters
            for k in range(self.k):
                R[:, k] = w[k] * P(mu[k], Cov[k])

            # Find the log-likelihood and store it in the list
            LLhood = np.sum(np.log(np.sum(R, axis=1)))
            LLhoods.append(LLhood)

            # Normalize responsibilities; effective number of points per cluster
            R = (R.T / np.sum(R, axis=1)).T
            N_ks = np.sum(R, axis=0)

            # Maximization: calculate the new parameters
            for k in range(self.k):
                # Calculate the new means
                mu[k] = 1. / N_ks[k] * np.sum(R[:, k] * X.T, axis=1).T
                x_mu = np.matrix(X - mu[k])
                # Calculate the new covariance
                Cov[k] = np.array(1 / N_ks[k] * np.dot(np.multiply(x_mu.T, R[:, k]), x_mu))
                # Calculate the new mixing weight
                w[k] = 1. / n * N_ks[k]

            # check for convergence
            if len(LLhoods) > 1 and np.abs(LLhood - LLhoods[-2]) < self.eps:
                break

        from collections import namedtuple
        Params = namedtuple('Params', ['mu', 'Cov', 'w', 'LLhoods', 'num_iters'])
        self.params = Params(mu=mu, Cov=Cov, w=w, LLhoods=LLhoods, num_iters=len(LLhoods))
        return self.params
# Call the GMM to fit the model
gmm = gaussian_Mix_Model(3, 0.000001)
params = gmm.fit_EM(X, max_iters=150)

# Plot log-likelihood vs. iterations (repeat for each class-specific
# model, e.g. the three classes of Dataset 2A)
plt.plot(params.LLhoods)
plt.savefig('Dataset_2A_GMM_Class_1_K_16.png')
plt.clf()
I am writing a recurrent neural network (specifically, a ConvLSTM). Recently, I noticed an interesting inconsistency that I cannot quite figure out. I have written this neural network from scratch using NumPy (technically CuPy for the GPU) and a few Chainer lines (specifically their F.convolution_2d function).
When running this same network twice, the losses are EXACTLY the same for the first 4 or so training examples. However, around the 5th training example, the losses of the two runs start to differ.
I have ensured that each run reads from the same initial-state text file (and thus has the same initial weights and biases). I have also ensured that the input data are exactly the same.
Is there some inconsistency in NumPy at the root of this problem? The only thing I can think of that is different around the 4th training example is the first use of gradient clipping. Is there some problem with numpy's linalg function? Is there some rounding error I am not familiar with? I have scanned through my code and there is no use of random numbers.
I have added my backpropagation function below:
def bptt(x2, y2, iteration):
    x = cp.asarray(x2)
    y = cp.asarray(y2)
    global connected_weights
    global main_kernel
    global bias_i
    global bias_f
    global bias_c
    global bias_o
    global bias_y
    global learning_rate

    # Perform forward prop
    prediction, pre_sigmoid_prediction, hidden_prediction, i, f, a, c, o, h = forward_prop(x)
    loss = calculate_loss(prediction, y)
    print("LOSS BEFORE: ")
    print(loss)

    # Calculate loss with respect to final layer
    dLdy_2 = loss_derivative(prediction, y)
    # Calculate loss with respect to pre-sigmoid layer
    dLdy_1 = cp.multiply(sigmoid_derivative(pre_sigmoid_prediction), dLdy_2)
    # Calculate loss with respect to last layer of lstm
    dLdh = cp.zeros([T + 1, channels_hidden, M, N])
    dLdh[T - 1] = cp.reshape(cp.matmul(cp.transpose(connected_weights), dLdy_1.reshape(1, M * N)), (channels_hidden, M, N))  # reshape dLdh to the appropriate size
    dLdw_0 = cp.matmul(dLdy_1.reshape(1, M * N), hidden_prediction.transpose(1, 0))
    # Calculate loss with respect to bias y
    dLdb_y = dLdy_1

    # -------------------- fully connected ------------------
    bias_y = bias_y - learning_rate * dLdb_y
    connected_weights = connected_weights - learning_rate * dLdw_0

    # Initialize corresponding matrices
    dLdo = cp.zeros([T, channels_hidden, M, N])
    dLdc = cp.zeros([T + 1, channels_hidden, M, N])
    dLda = cp.zeros([T, channels_hidden, M, N])
    dLdf = cp.zeros([T, channels_hidden, M, N])
    dLdi = cp.zeros([T, channels_hidden, M, N])
    dLdI = cp.zeros([T, channels_hidden + channels_img, M, N])
    dLdW = cp.zeros([4 * channels_hidden, channels_img + channels_hidden, kernel_dimension, kernel_dimension])

    # Initialize other stuff
    dLdo_hat = cp.zeros([T, channels_hidden, M, N])
    dLda_hat = cp.zeros([T, channels_hidden, M, N])
    dLdf_hat = cp.zeros([T, channels_hidden, M, N])
    dLdi_hat = cp.zeros([T, channels_hidden, M, N])

    # initialize biases
    dLdb_c = cp.empty([channels_hidden, M, N])
    dLdb_i = cp.empty([channels_hidden, M, N])
    dLdb_f = cp.empty([channels_hidden, M, N])
    dLdb_o = cp.empty([channels_hidden, M, N])

    for t in cp.arange(T - 1, -1, -1):
        dLdo[t] = cp.multiply(dLdh[t], tanh(c[t]))
        dLdc[t] += cp.multiply(cp.multiply(dLdh[t], o[t]), (cp.ones((channels_hidden, M, N)) - cp.multiply(tanh(c[t]), tanh(c[t]))))
        dLdi[t] = cp.multiply(dLdc[t], a[t])
        dLda[t] = cp.multiply(dLdc[t], i[t])
        dLdf[t] = cp.multiply(dLdc[t], c[t - 1])
        dLdc[t - 1] = cp.multiply(dLdc[t], f[t])
        dLda_hat[t] = cp.multiply(dLda[t], (cp.ones((channels_hidden, M, N)) - cp.multiply(a[t], a[t])))
        dLdi_hat[t] = cp.multiply(cp.multiply(dLdi[t], i[t]), cp.ones((channels_hidden, M, N)) - i[t])
        dLdf_hat[t] = cp.multiply(cp.multiply(dLdf[t], f[t]), cp.ones((channels_hidden, M, N)) - f[t])
        dLdo_hat[t] = cp.multiply(cp.multiply(dLdo[t], o[t]), cp.ones((channels_hidden, M, N)) - o[t])
        dLdb_c += dLda_hat[t]
        dLdb_i += dLdi_hat[t]
        dLdb_f += dLdf_hat[t]
        dLdb_o += dLdo_hat[t]

        # CONCATENATE Z IN THE SAME ORDER AS THE WEIGHTS
        dLdz_hat = cp.concatenate((dLdi_hat[t], dLdf_hat[t], dLda_hat[t], dLdo_hat[t]), axis=0)

        # determine convolution derivatives
        # here we will use the fact that in z = w * I, dLdW = dLdz * I
        temporary = cp.concatenate((x[t], h[t - 1]), axis=0).reshape(channels_hidden + channels_img, 1, M, N)
        dLdI[t] = cp.asarray(F.convolution_2d(dLdz_hat.reshape(1, 4 * channels_hidden, M, N), main_kernel.transpose(1, 0, 2, 3), b=None, pad=1)[0].data)  # reshape into flipped kernel dimensions
        dLdW += cp.asarray((F.convolution_2d(temporary, dLdz_hat.reshape(4 * channels_hidden, 1, M, N), b=None, pad=1).data).transpose(1, 0, 2, 3))  # reshape into kernel dimensions

        # gradient clipping
        if cp.amax(dLdW) > 1 or cp.amin(dLdW) < -1:
            dLdW = dLdW / cp.linalg.norm(dLdW)
        if cp.amax(dLdb_c) > 1 or cp.amin(dLdb_c) < -1:
            dLdb_c = dLdb_c / cp.linalg.norm(dLdb_c)
        if cp.amax(dLdb_i) > 1 or cp.amin(dLdb_i) < -1:
            dLdb_i = dLdb_i / cp.linalg.norm(dLdb_i)
        if cp.amax(dLdb_f) > 1 or cp.amin(dLdb_f) < -1:
            dLdb_f = dLdb_f / cp.linalg.norm(dLdb_f)
        if cp.amax(dLdb_o) > 1 or cp.amin(dLdb_o) < -1:
            dLdb_o = dLdb_o / cp.linalg.norm(dLdb_o)
        if cp.amax(dLdw_0) > 1 or cp.amin(dLdw_0) < -1:
            dLdw_0 = dLdw_0 / cp.linalg.norm(dLdw_0)
        if cp.amax(dLdb_y) > 1 or cp.amin(dLdb_y) < -1:
            dLdb_y = dLdb_y / cp.linalg.norm(dLdb_y)

        print("dLdW on step: " + str(t) + " is this: " + str(dLdW[0][0][0][0]))
        # print("dLdw_0")
        # print(str(cp.amax(dLdw_0)) + " : " + str(cp.amin(dLdw_0)))
        # print("dLdW")
        # print(str(cp.amax(dLdW)) + " : " + str(cp.amin(dLdW)))
        # print("dLdb_c")
        # print(str(cp.amax(dLdb_c)) + " : " + str(cp.amin(dLdb_c)))
        dLdh[t - 1] = dLdI[t][channels_img: channels_img + channels_hidden]

    # update weights with convolution derivatives
    # ---------------------- adam optimizer code -----------------------------------
    # --------------------- update main kernel ---------
    main_kernel = main_kernel - learning_rate * dLdW
    # -------------------- update bias c -----------------------
    bias_c = bias_c - learning_rate * dLdb_c
    # -------------------- update bias i -----------------------
    bias_i = bias_i - learning_rate * dLdb_i
    # -------------------- update bias f -----------------------
    bias_f = bias_f - learning_rate * dLdb_f
    # -------------------- update bias o -----------------------
    bias_o = bias_o - learning_rate * dLdb_o

    prediction2, pre_sigmoid_prediction2, hidden_prediction2, i2, f2, a2, c2, o2, h2 = forward_prop(x)
    print("dLdW is: " + str(dLdW[0][0][0][0]))
    loss2 = calculate_loss(prediction2, y)
    print("LOSS AFTER: ")
    print(loss2)
    print("backpropagation complete")
Wow, that took some time.
If you look at the back propagation code, look closely at these lines:
dLdb_c = cp.empty([channels_hidden, M, N])
dLdb_i = cp.empty([channels_hidden, M, N])
dLdb_f = cp.empty([channels_hidden, M, N])
dLdb_o = cp.empty([channels_hidden, M, N])
However, notice how the code proceeds to use the += operator on these empty arrays. cp.empty does not initialize the allocated memory, so those accumulators start from arbitrary leftover values that differ between runs. Simply change the arrays to cp.zeros, and the code gives consistent losses.
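To see why, compare np.empty with np.zeros (cupy behaves the same way): empty only allocates memory and leaves whatever bytes happen to be there, so accumulating into it with += starts from garbage rather than from zero:

import numpy as np

a = np.empty(3)   # uninitialized: contents depend on what was in memory
a += 1.0          # result can differ from run to run
b = np.zeros(3)   # initialized to 0
b += 1.0          # always [1., 1., 1.]
print(a, b)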
I am trying to code logistic regression from scratch. In the code below, I thought my cost derivative term acted as the regularization, but I've been tasked with adding L1-norm regularization. How do you add this in Python? Should it be added where I have defined the cost derivative? Any help in the right direction is appreciated.
def Sigmoid(z):
    return 1 / (1 + np.exp(-z))

def Hypothesis(theta, X):
    return Sigmoid(X @ theta)

def Cost_Function(X, Y, theta, m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1 / float(m) * np.sum(-_y * np.log(hi) - (1 - _y) * np.log(1 - hi))
    return J

def Cost_Function_Derivative(X, Y, theta, m, alpha):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha / float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X, Y, theta, m, alpha):
    new_theta = theta - Cost_Function_Derivative(X, Y, theta, m, alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length) * 100
    print('LR Accuracy: ', my_accuracy, "%")

def Logistic_Regression(X, Y, alpha, theta, num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X, Y, theta, m, alpha)
        theta = new_theta
        if x % 100 == 0:
            # print('theta: ', theta)
            # print('cost: ', Cost_Function(X, Y, theta, m))
            pass
    Accuracy(theta)

ep = .012
initial_theta = np.random.rand(X_train.shape[1], 1) * 2 * ep - ep
alpha = 0.5
iterations = 10000
Logistic_Regression(X_train, Y_train, alpha, initial_theta, iterations)
Regularization adds a term to the cost function so that there is a compromise between minimizing the cost and minimizing the model parameters, which reduces overfitting. You can control how much of a compromise you would like by scaling the regularization term with a scalar e.
So just add the L1 norm of theta to the original cost function:
J = J + e * np.sum(abs(theta))
Since this term is added to the cost function, it should also be considered when computing the gradient of the cost function.
This is simple, since the derivative of a sum is the sum of the derivatives. So now we just need to figure out the derivative of the term sum(abs(theta)). Since it is piecewise linear, the derivative is constant: +1 if theta >= 0, and -1 if theta < 0 (note there is a mathematical indeterminacy at 0, but we don't care about it).
So in the function Cost_Function_Derivative we add:
J = J + alpha * e * np.where(theta >= 0, 1.0, -1.0)
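Putting it together, a sketch of the modified derivative with the L1 term added (e is the regularization strength, and np.where implements the +1/-1 subgradient described above):

def Cost_Function_Derivative(X, Y, theta, m, alpha, e):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha / float(m) * X.T @ (hi - _y)
    # L1 subgradient: +1 where theta >= 0, -1 where theta < 0
    J = J + alpha * e * np.where(theta >= 0, 1.0, -1.0)
    return J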