Vectorizing softmax cross-entropy gradient - python

I'm trying to implement my own neural network with (almost) fully vectorized operations. There are lots of posts out there but I can't seem to find one that fits all three of these:
- separate cross-entropy and softmax terms in the gradient calculation (so I can interchange the last activation and loss)
- multi-class classification (y is one-hot encoded)
- all operations are fully vectorized
My main question is: How do I get to dE/dz (N x K) given dE/da (N x K) and da/dz (N x K x K) using a fully vectorized operation? i.e. How do I vectorize dE_dz_test2?
My second question is:
Is there a better way to write softmax_derivative?
I used this as a reference for calculating the gradient one sample at a time:
http://saitcelebi.com/tut/output/part2.html
and this for figuring out how to do backprop
https://peterroelants.github.io/posts/neural-network-implementation-part04/
import numpy as np

def one_hot_encode(y, n_classes):
    y_onehot = np.zeros((len(y), n_classes))
    for i, y_i in enumerate(y):
        y_onehot[i, y_i] = 1
    return y_onehot
def cross_entropy_derivative(y_true, y_pred):
    # dE/da
    # input: N x K arrays
    # output: N x K array
    N = len(y_true)
    return -(y_true / y_pred) / N
def softmax(x):
    # activation (a)
    # input: N x K array
    # output: N x K array
    # https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
    exp = np.exp(x - np.max(x))
    return exp / np.sum(exp, axis=1)[:, None]
def softmax_derivative(Z):
    # da/dz
    # input: N x K array
    # output: N x K x K array
    # http://saitcelebi.com/tut/output/part2.html
    N, K = Z.shape
    s = softmax(Z)[:, :, np.newaxis]
    a = np.tensordot(s, np.ones((1, K)), axes=([-1], [0]))
    I = np.repeat(np.eye(K, K)[np.newaxis, :, :], N, axis=0)
    b = I - np.tensordot(np.ones((K, 1)), s.T, axes=([-1], [0])).T
    return a * np.swapaxes(b, 1, 2)
def softmax_derivative_test(Z):
    # da/dz
    # non-vectorized softmax gradient calculation
    # http://saitcelebi.com/tut/output/part2.html
    N, K = Z.shape
    da_dz = np.zeros((N, K, K))
    kron_delta = np.eye(K)
    s = softmax(Z)
    for n in range(N):
        for i in range(K):
            for j in range(K):
                da_dz[n, i, j] = s[n, i] * (kron_delta[i, j] - s[n, j])
    return da_dz
def dE_dz_test2(dE_da, da_dz):
    # input: array (N x K)
    # input: array (N x K x K)
    # output: array (N x K)
    N, K = dE_da.shape
    dE_dz = np.zeros((N, K))
    for n in range(N):
        dE_dz[n, :] = np.matmul(da_dz[n], dE_da[n, :, np.newaxis]).T
    return dE_dz
def some_type_of_matrix_multiplication_(dE_da, da_dz):
    # how do I get dE/dz from dE_da and da_dz?
    pass
X = np.random.rand(100, 2)
W = np.random.rand(2, 4)
y = np.random.randint(0, 4, size=100)
y = one_hot_encode(y, 4)
Z = X @ W
S = softmax(Z)
N, K = Z.shape
# da/dz for softmax
da_dz = softmax_derivative(Z)            # (100, 4, 4)
da_dz_test = softmax_derivative_test(Z)  # (100, 4, 4) - non-vectorized implementation
print(np.isclose(da_dz, da_dz_test).all())  # equivalence test
dE_da = cross_entropy_derivative(y, S)   # (100, 4)
dE_dz = some_type_of_matrix_multiplication_(dE_da, da_dz)  # what do I do here? *****
dE_dz_test = (S - y) / N                 # (100, 4) if you combine dE/da and da/dz terms
dE_dz_test2 = dE_dz_test2(dE_da, da_dz)
print(np.isclose(dE_dz_test, dE_dz_test2).all())  # equivalence test
Output:
True
True

Here is an approach using np.einsum:
def da_dz_pp(z, sm=None):
    if sm is None:
        sm = softmax(z)
    res = np.einsum('ij,ik->ijk', sm, -sm)
    np.einsum('ijj->ij', res)[...] += sm
    return res

def dE_dz_pp(y, z, sm=None):
    if sm is None:
        sm = softmax(z)
    dE_da = cross_entropy_derivative(y, sm)
    da_dz = da_dz_pp(z, sm)
    return np.einsum('ij,ijk->ik', dE_da, da_dz)
It seems to reproduce what your code outputs and is a bit faster.
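If you prefer batched matrix multiplication over einsum, the same contraction can be written with np.matmul broadcasting over the sample axis; a minimal sketch (the function name is just illustrative):

def dE_dz_matmul(dE_da, da_dz):
    # (N, K, K) @ (N, K, 1) -> (N, K, 1); drop the trailing axis to get (N, K)
    return np.matmul(da_dz, dE_da[..., None])[..., 0]

This is exactly the per-sample matrix-vector product from dE_dz_test2, just without the Python loop.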

Related

How to efficiently calculate score = dot(a, LeakyReLU(x_i+y_j)) for each i, j in [N]?

I have to compute score = dot(a, LeakyReLU(x_i + y_j)) for each i, j in [N], where a, x_i, y_j are D-dimensional vectors and dot() is the dot product that outputs a scalar. So in the end I get an N x N score matrix.
In keras, I implemented as:
# given X (N x D), Y (N x D), A (D x 1)
X = tf.expand_dims(X, axis=1)  # (N x 1 x D)
Y = tf.expand_dims(Y, axis=0)  # (1 x N x D)
feature_sum = X + Y            # (N x N x D), broadcast automatically
dense = K.dot(LeakyReLU(alpha=0.1)(feature_sum), A)  # (N x N x 1)
The problem is that feature_sum is expensive in GPU memory when N, D > 1000. Is there a more memory-efficient implementation?
The dot product is distributive with respect to the sum. Therefore:
dot(LRelu(X + Y), A) = dot(LRelu(X), A) + dot(LRelu(Y), A)
So, you can do:
dense_x = K.dot(LRelu(X), A)
dense_y = K.dot(LRelu(Y), A)
dense_x = tf.expand_dims(dense_x, axis=1)
dense_y = tf.expand_dims(dense_y, axis=0)
dense = dense_x + dense_y
In this way, all operations are done at most on N x D elements and you only have to store a maximum of N x N elements (assuming N > D).
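To put numbers on it: with N = D = 1000 and float64, feature_sum alone takes 1000 × 1000 × 1000 × 8 bytes = 8 GB, whereas dense_x, dense_y and the final N × N output together are on the order of 8 MB.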
Quantitative comparison with N = 1000, D = 500:
import time
import numpy as np

def timeit(func):
    def run(*args, **kwargs):
        start = time.time()
        out = func(*args, **kwargs)
        end = time.time()
        print(f"Exec: {(end - start) * 1000:.4f}ms")
        return out
    return run

@timeit
def slow(X, Y, A, N, D):
    X = X.reshape(N, 1, D)
    Y = Y.reshape(1, N, D)
    feature_sum = X + Y
    dense = feature_sum @ A
    return dense

@timeit
def fast(X, Y, A, N, D):
    dense_x = X @ A
    dense_y = Y @ A
    dense_x = dense_x.reshape(N, 1, 1)
    dense_y = dense_y.reshape(1, N, 1)
    dense = dense_x + dense_y
    return dense

def main():
    N = 1000
    D = 500
    X = np.random.rand(N, D)
    Y = np.random.rand(N, D)
    A = np.random.rand(D, 1)
    dense1 = slow(X, Y, A, N, D)
    dense2 = fast(X, Y, A, N, D)
    print("Same result: ", np.allclose(dense1, dense2))

main()
Output:
Exec: 1547.9290ms # slow
Exec: 2.9860ms # fast
Same result: True

Neural Networks Using Python and NumPy

I am a newbie to NNs and I am trying to implement a NN with Python/NumPy from the code I found at:
"Create a Simple Neural Network in Python from Scratch"
My input array is:
array([[5.71, 5.77, 5.94],
[5.77, 5.94, 5.51],
[5.94, 5.51, 5.88],
[5.51, 5.88, 5.73]])
Output array is:
array([[5.51],
[5.88],
[5.73],
[6.41]])
After running the code, I see the following results, which are not correct:
synaptic_weights after training
[[1.90625275]
[2.54867698]
[1.07698312]]
outputs after training
[[1.]
[1.]
[1.]
[1.]]
Here is the core of the code:
for iteration in range(1000):
    input_layer = tr_input
    outputs = sigmoid(np.dot(input_layer, synaptic_weights))
    error = tr_output - outputs
    adjustments = error * sigmoid_derivative(outputs)
    synaptic_weights += np.dot(input_layer.T, adjustments)

print('synaptic_weights after training')
print(synaptic_weights)
print('outputs after training')
print(outputs)
What should I change in this code so that it works for my data? Or should I take a different approach? Any help is highly appreciated.
That's because you are using the wrong activation function (i.e. sigmoid). The main reason we use the sigmoid function is that its output lies between 0 and 1, so it is used for models where we have to predict a probability as the output. Since a probability only ever lies in that range, sigmoid is the right choice there, but it can never reach targets like 5.51.
If you want to train a model to predict the values in your array, you should use a regression model. Otherwise, you can convert your output into labels (for example map 5.x to 0 and 6.x to 1) and retrain your model.
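If you go the regression route, here is a minimal sketch of what that training loop might look like with a linear output and a squared-error gradient; the learning rate and iteration count are arbitrary choices, not tuned values:

import numpy as np

tr_input = np.array([[5.71, 5.77, 5.94],
                     [5.77, 5.94, 5.51],
                     [5.94, 5.51, 5.88],
                     [5.51, 5.88, 5.73]])
tr_output = np.array([[5.51], [5.88], [5.73], [6.41]])

rng = np.random.default_rng(0)
synaptic_weights = rng.random((3, 1))

learning_rate = 1e-3
for iteration in range(10000):
    outputs = tr_input @ synaptic_weights                    # linear output, no sigmoid
    error = tr_output - outputs
    synaptic_weights += learning_rate * tr_input.T @ error   # gradient step on squared error

print(outputs)

With four samples and only three weights, this converges to the least-squares fit rather than an exact reproduction of the targets.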
These are the steps involved in my neural network implementation:
1. Randomly initialize weights (θ)
2. Implement forward propagation
3. Compute the cost function
4. Implement backpropagation to compute the partial derivatives
5. Use gradient descent (a sketch of this step follows the cost function below)
def forward_prop(X, theta_list):
    m = X.shape[0]
    a_list = []
    z_list = []
    a_list.append(np.insert(X, 0, values=np.ones(m), axis=1))
    for idx, theta in enumerate(theta_list):
        z_list.append(a_list[idx] * theta.T)
        if idx != (len(theta_list) - 1):
            a_list.append(np.insert(sigmoid(z_list[idx]), 0, values=np.ones(m), axis=1))
        else:
            a_list.append(sigmoid(z_list[idx]))
    return a_list, z_list
def back_prop(params, input_size, hidden_layers, num_labels, X, y, regularization, regularize):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    theta_list = []
    startCount = 0
    for idx, val in enumerate(hidden_layers):
        if idx == 0:
            startCount = val * (input_size + 1)
            theta_list.append(np.matrix(np.reshape(params[:startCount], (val, (input_size + 1)))))
        if idx != 0:
            tempCount = startCount
            startCount += (val * (hidden_layers[idx - 1] + 1))
            theta_list.append(np.matrix(np.reshape(params[tempCount:startCount], (val, (hidden_layers[idx - 1] + 1)))))
        if idx == (len(hidden_layers) - 1):
            theta_list.append(np.matrix(np.reshape(params[startCount:], (num_labels, (val + 1)))))
    a_list, z_list = forward_prop(X, theta_list)
    J = cost(X, y, a_list[len(a_list) - 1], theta_list, regularization, regularize)
    d_list = []
    d_list.append(a_list[len(a_list) - 1] - y)
    idx = 0
    while idx < (len(theta_list) - 1):
        d_temp = np.multiply(d_list[idx] * theta_list[len(a_list) - 2 - idx], sigmoid_gradient(a_list[len(a_list) - 2 - idx]))
        d_list.append(d_temp[:, 1:])
        idx += 1
    delta_list = []
    for theta in theta_list:
        delta_list.append(np.zeros(theta.shape))
    for idx, delta in enumerate(delta_list):
        delta_list[idx] = delta_list[idx] + ((d_list[len(d_list) - 1 - idx].T) * a_list[idx])
        delta_list[idx] = delta_list[idx] / m
    if regularize:
        for idx, delta in enumerate(delta_list):
            delta_list[idx][:, 1:] = delta_list[idx][:, 1:] + (theta_list[idx][:, 1:] * regularization)
    grad_list = np.ravel(delta_list[0])
    idx = 1
    while idx < len(delta_list):
        grad_list = np.concatenate((grad_list, np.ravel(delta_list[idx])), axis=None)
        idx += 1
    return J, grad_list
def cost(X, y, h, theta_list, regularization, regularize):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    J = (np.multiply(-y, np.log(h)) - np.multiply((1 - y), np.log(1 - h))).sum() / m
    if regularize:
        regularization_value = 0.0
        for theta in theta_list:
            regularization_value += np.sum(np.power(theta[:, 1:], 2))
        J += (float(regularization) / (2 * m)) * regularization_value
    return J
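For step 5, note that back_prop returns both the cost and the flattened gradient, so an off-the-shelf optimizer can drive the training loop. A sketch using scipy.optimize.minimize, assuming X and a one-hot encoded y_onehot are already loaded; the layer sizes and hyperparameters below are placeholders, not values taken from the code above:

import numpy as np
from scipy.optimize import minimize

# Hypothetical network shape: 400 inputs, one hidden layer of 25 units, 10 labels
input_size, hidden_layers, num_labels = 400, [25], 10
n_params = hidden_layers[0] * (input_size + 1) + num_labels * (hidden_layers[0] + 1)
params = (np.random.random(n_params) - 0.5) * 0.25  # step 1: random initialization

# jac=True tells minimize that the objective returns (cost, gradient)
result = minimize(fun=back_prop, x0=params,
                  args=(input_size, hidden_layers, num_labels, X, y_onehot, 1.0, True),
                  method='TNC', jac=True, options={'maxiter': 250})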

Wiki example for Arnoldi iteration only works for real matrices?

The Wikipedia entry for the Arnoldi method provides a Python example that produces a basis of the Krylov subspace of a matrix A. Supposedly, if A is Hermitian (i.e. if A == A.conj().T) then the Hessenberg matrix h generated by this algorithm is tridiagonal (source). However, when I use the Wikipedia code on a real-world Hermitian matrix, the Hessenberg matrix is not at all tridiagonal. When I perform the computation on the real part of A (so that A == A.T), I do get a tridiagonal Hessenberg matrix, so there seems to be a problem with the imaginary components of A. Does anybody know why the Wikipedia code doesn't produce the expected results?
Working example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import circulant

def arnoldi_iteration(A, b, n):
    m = A.shape[0]
    h = np.zeros((n + 1, n), dtype=complex)
    Q = np.zeros((m, n + 1), dtype=complex)
    q = b / np.linalg.norm(b)  # Normalize the input vector
    Q[:, 0] = q                # Use it as the first Krylov vector
    for k in range(n):
        v = A.dot(q)                # Generate a new candidate vector
        for j in range(k + 1):      # Subtract the projections on previous vectors
            h[j, k] = np.dot(Q[:, j], v)
            v = v - h[j, k] * Q[:, j]
        h[k + 1, k] = np.linalg.norm(v)
        eps = 1e-12  # If v is shorter than this threshold it is the zero vector
        if h[k + 1, k] > eps:    # Add the produced vector to the list, unless
            q = v / h[k + 1, k]  # the zero vector is produced.
            Q[:, k + 1] = q
        else:  # If that happens, stop iterating.
            return Q, h
    return Q, h
# Construct matrix A
N = 2**4
I = np.eye(N)
k = np.fft.fftfreq(N, 1.0 / N) + 0.5
alpha = np.linspace(0.1, 1.0, N)*2e2
c = np.fft.fft(alpha) / N
C = circulant(c)
A = np.einsum("i, ij, j->ij", k, C, k)
# Show that A is Hermitian
print(np.allclose(A, A.conj().T))
# Arbitrary (random) initial vector
np.random.seed(0)
v = np.random.rand(N)
# Perform Arnoldi iteration with complex A
_, h = arnoldi_iteration(A, v, N)
# Perform Arnoldi iteration with real A
_, h2 = arnoldi_iteration(np.real(A), v, N)
# Plot results
plt.subplot(121)
plt.imshow(np.abs(h))
plt.title("Complex A")
plt.subplot(122)
plt.imshow(np.abs(h2))
plt.title("Real A")
plt.tight_layout()
plt.show()
Result: (plots of np.abs(h): the "Complex A" panel is not tridiagonal, while the "Real A" panel is.)
After browsing through some conference presentation slides, I realised that at some point Q had to be conjugated when A is complex. The correct algorithm is posted below for reference, with the code change marked (note that this correction has also been submitted to the Wikipedia entry):
import numpy as np

def arnoldi_iteration(A, b, n):
    m = A.shape[0]
    h = np.zeros((n + 1, n), dtype=complex)
    Q = np.zeros((m, n + 1), dtype=complex)
    q = b / np.linalg.norm(b)
    Q[:, 0] = q
    for k in range(n):
        v = A.dot(q)
        for j in range(k + 1):
            h[j, k] = np.dot(Q[:, j].conj(), v)  # <-- Q needs conjugation!
            v = v - h[j, k] * Q[:, j]
        h[k + 1, k] = np.linalg.norm(v)
        eps = 1e-12
        if h[k + 1, k] > eps:
            q = v / h[k + 1, k]
            Q[:, k + 1] = q
        else:
            return Q, h
    return Q, h
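A quick way to confirm the fix, reusing A, v and N from the question's setup: with the conjugate in place, the entries of h outside the tridiagonal band should be negligible relative to the rest of the matrix.

_, h_fixed = arnoldi_iteration(A, v, N)
# entries with |row - col| > 1 lie outside the tridiagonal band
off_band = np.abs(np.subtract.outer(np.arange(N + 1), np.arange(N))) > 1
print(np.max(np.abs(h_fixed[off_band])) / np.max(np.abs(h_fixed)))  # ~ machine precision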

How to implement GMM clustering with the EM algorithm (Expectation Maximisation) for N-dimensional feature vectors in Python

I am trying to implement GMM clustering for both 24-dimensional and 32-dimensional feature vectors, where the initial parameters are assigned by the k-means algorithm (k-means clustering provides the cluster centers, MU, only).
I am following this link, where it's implemented only for 2D feature vectors with predefined mu and sigma.
If anyone has working code for GMM clustering, kindly post it.
There is also a predefined GMM implementation in sklearn, but it doesn't give me the likelihood for each iteration. sklearn GMM
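As an aside, sklearn's GaussianMixture can be coaxed into reporting a per-iteration likelihood by running one EM step at a time with warm_start; a sketch, where n_components, the data and the iteration count are arbitrary placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 24)  # placeholder 24-dimensional data
gm = GaussianMixture(n_components=8, init_params='kmeans', warm_start=True, max_iter=1)
log_likelihoods = []
for _ in range(150):
    gm.fit(X)  # one EM step per call, continuing from the previous state
    log_likelihoods.append(gm.lower_bound_)  # mean per-sample log-likelihood bound

Each fit call will emit a convergence warning because max_iter=1; that is expected here.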
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin

def kmeans(dataSet, k, c):
    # 1. Randomly choose clusters
    rng = np.random.RandomState(c)
    p = rng.permutation(dataSet.shape[0])[:k]
    centers = dataSet[p]
    while True:
        labels = pairwise_distances_argmin(dataSet, centers)
        new_centers = np.array([dataSet[labels == i].mean(0) for i in range(k)])
        if np.all(centers == new_centers):
            break
        centers = new_centers
    cluster_data = [dataSet[labels == i] for i in range(k)]
    l = []
    covs = []
    for i in range(k):
        l.append(len(cluster_data[i]) * 1.0 / len(dataSet))
        covs.append(np.cov(np.array(cluster_data[i]).T))
    return centers, l, covs, cluster_data
class gaussian_Mix_Model:
    def __init__(self, k=8, eps=0.0000001):
        self.k = k      # number of clusters
        self.eps = eps  # threshold to stop `epsilon`

    def fit_EM(self, X, max_iters=1000):
        # n = number of data points, d = dimension of data points
        n, d = X.shape
        # Initialize means and covariances from k-means (the seed 0 is arbitrary)
        mu, l, Cov, cluster_data = kmeans(X, self.k, 0)
        # initialize the weights
        w = [1. / self.k] * self.k
        R = np.zeros((n, self.k))
        ### log-likelihoods
        LLhoods = []
        # multivariate Gaussian density evaluated at every row of X
        P = lambda mu, s: np.linalg.det(s) ** -0.5 * (2 * np.pi) ** (-X.shape[1] / 2.) \
            * np.exp(-0.5 * np.einsum('ij, ij -> i',
                                      X - mu, np.dot(np.linalg.inv(s), (X - mu).T).T))
        # Iterate up to max_iters iterations
        while len(LLhoods) < max_iters:
            # Expectation step:
            # membership for each of the K clusters
            for k in range(self.k):
                R[:, k] = w[k] * P(mu[k], Cov[k])
            # Compute the log-likelihood
            LLhood = np.sum(np.log(np.sum(R, axis=1)))
            # Store the log-likelihood in the list
            LLhoods.append(LLhood)
            # Normalize responsibilities and count points per cluster
            R = (R.T / np.sum(R, axis=1)).T
            N_ks = np.sum(R, axis=0)
            # Maximization step: calculate the new parameters
            for k in range(self.k):
                # Calculate the new means
                mu[k] = 1. / N_ks[k] * np.sum(R[:, k] * X.T, axis=1).T
                x_mu = np.matrix(X - mu[k])
                # Calculate the new covariance
                Cov[k] = np.array(1 / N_ks[k] * np.dot(np.multiply(x_mu.T, R[:, k]), x_mu))
                # Calculate the new mixing weight pi_k
                w[k] = 1. / n * N_ks[k]
            # check for convergence
            if len(LLhoods) > 1 and np.abs(LLhood - LLhoods[-2]) < self.eps:
                break
        from collections import namedtuple
        self.params = namedtuple('params', ['mu', 'Cov', 'w', 'LLhoods', 'num_iters'])
        self.params.mu = mu
        self.params.Cov = Cov
        self.params.w = w
        self.params.LLhoods = LLhoods
        self.params.num_iters = len(LLhoods)
        return self.params
# Call the GMM to find the model
gmm = gaussian_Mix_Model(3, 0.000001)
params = gmm.fit_EM(X, max_iters=150)
# Plot log-likelihood vs. iterations
# (to reproduce the original per-class plots, fit one GMM per class and repeat this)
plt.plot(params.LLhoods)
plt.savefig('Dataset_2A_GMM_Class_1_K_16.png')
plt.clf()

Different loss values with same data, same initial state, same recurrent neural network

I am writing a recurrent neural network (specifically, a ConvLSTM). Recently, I have noticed an interesting inconsistency that I cannot quite figure out. I have written this neural network from scratch using numpy (technically cupy for gpu) and a few Chainer lines (specifically for their F.convolution_2d function).
When running this same network twice, for the first 4 or so training examples, the losses are EXACTLY the same. However, around the 5th training example, the losses start to fluctuate in their value.
I have ensured that each time I am running this network, they are reading from the same initial state text file (and thus have the same initial weights and biases). I have also ensured that the data they are inputting are exactly the same.
Is there some inconsistency with Numpy that is the root of this problem? The only thing I can think of that is different around the 4th training example is the first use of gradient clipping. Is there some problem with numpy's linalg function? Is there some rounding error I am not familiar with? I have scanned through my code and there is no instance of utilizing random numbers.
I have added my backpropagation function below:
def bptt(x2, y2, iteration):
    x = cp.asarray(x2)
    y = cp.asarray(y2)
    global connected_weights
    global main_kernel
    global bias_i
    global bias_f
    global bias_c
    global bias_o
    global bias_y
    global learning_rate

    # Perform forward prop
    prediction, pre_sigmoid_prediction, hidden_prediction, i, f, a, c, o, h = forward_prop(x)
    loss = calculate_loss(prediction, y)
    print("LOSS BEFORE: ")
    print(loss)

    # Calculate loss with respect to final layer
    dLdy_2 = loss_derivative(prediction, y)
    # Calculate loss with respect to pre sigmoid layer
    dLdy_1 = cp.multiply(sigmoid_derivative(pre_sigmoid_prediction), dLdy_2)
    # Calculate loss with respect to last layer of lstm
    dLdh = cp.zeros([T + 1, channels_hidden, M, N])
    dLdh[T - 1] = cp.reshape(cp.matmul(cp.transpose(connected_weights), dLdy_1.reshape(1, M * N)), (channels_hidden, M, N))  # reshape dLdh to the appropriate size
    dLdw_0 = cp.matmul(dLdy_1.reshape(1, M * N), hidden_prediction.transpose(1, 0))
    # Calculate loss with respect to bias y
    dLdb_y = dLdy_1

    # -------------------- fully connected ------------------
    bias_y = bias_y - learning_rate * dLdb_y
    connected_weights = connected_weights - learning_rate * dLdw_0

    # Initialize corresponding matrices
    dLdo = cp.zeros([T, channels_hidden, M, N])
    dLdc = cp.zeros([T + 1, channels_hidden, M, N])
    dLda = cp.zeros([T, channels_hidden, M, N])
    dLdf = cp.zeros([T, channels_hidden, M, N])
    dLdi = cp.zeros([T, channels_hidden, M, N])
    dLdI = cp.zeros([T, channels_hidden + channels_img, M, N])
    dLdW = cp.zeros([4 * channels_hidden, channels_img + channels_hidden, kernel_dimension, kernel_dimension])
    # Initialize other stuff
    dLdo_hat = cp.zeros([T, channels_hidden, M, N])
    dLda_hat = cp.zeros([T, channels_hidden, M, N])
    dLdf_hat = cp.zeros([T, channels_hidden, M, N])
    dLdi_hat = cp.zeros([T, channels_hidden, M, N])
    # initialize biases
    dLdb_c = cp.empty([channels_hidden, M, N])
    dLdb_i = cp.empty([channels_hidden, M, N])
    dLdb_f = cp.empty([channels_hidden, M, N])
    dLdb_o = cp.empty([channels_hidden, M, N])

    for t in cp.arange(T - 1, -1, -1):
        dLdo[t] = cp.multiply(dLdh[t], tanh(c[t]))
        dLdc[t] += cp.multiply(cp.multiply(dLdh[t], o[t]), (cp.ones((channels_hidden, M, N)) - cp.multiply(tanh(c[t]), tanh(c[t]))))
        dLdi[t] = cp.multiply(dLdc[t], a[t])
        dLda[t] = cp.multiply(dLdc[t], i[t])
        dLdf[t] = cp.multiply(dLdc[t], c[t - 1])
        dLdc[t - 1] = cp.multiply(dLdc[t], f[t])
        dLda_hat[t] = cp.multiply(dLda[t], (cp.ones((channels_hidden, M, N)) - cp.multiply(a[t], a[t])))
        dLdi_hat[t] = cp.multiply(cp.multiply(dLdi[t], i[t]), cp.ones((channels_hidden, M, N)) - i[t])
        dLdf_hat[t] = cp.multiply(cp.multiply(dLdf[t], f[t]), cp.ones((channels_hidden, M, N)) - f[t])
        dLdo_hat[t] = cp.multiply(cp.multiply(dLdo[t], o[t]), cp.ones((channels_hidden, M, N)) - o[t])
        dLdb_c += dLda_hat[t]
        dLdb_i += dLdi_hat[t]
        dLdb_f += dLdf_hat[t]
        dLdb_o += dLdo_hat[t]
        # CONCATENATE Z IN THE RIGHT ORDER: SAME ORDER AS THE WEIGHTS
        dLdz_hat = cp.concatenate((dLdi_hat[t], dLdf_hat[t], dLda_hat[t], dLdo_hat[t]), axis=0)
        # determine convolution derivatives
        # here we will use the fact that in z = w * I, dLdW = dLdz * I
        temporary = cp.concatenate((x[t], h[t - 1]), axis=0).reshape(channels_hidden + channels_img, 1, M, N)
        dLdI[t] = cp.asarray(F.convolution_2d(dLdz_hat.reshape(1, 4 * channels_hidden, M, N), main_kernel.transpose(1, 0, 2, 3), b=None, pad=1)[0].data)  # reshape into flipped kernel dimensions
        dLdW += cp.asarray((F.convolution_2d(temporary, dLdz_hat.reshape(4 * channels_hidden, 1, M, N), b=None, pad=1).data).transpose(1, 0, 2, 3))  # reshape into kernel dimensions

        # gradient clipping
        if cp.amax(dLdW) > 1 or cp.amin(dLdW) < -1:
            dLdW = dLdW / cp.linalg.norm(dLdW)
        if cp.amax(dLdb_c) > 1 or cp.amin(dLdb_c) < -1:
            dLdb_c = dLdb_c / cp.linalg.norm(dLdb_c)
        if cp.amax(dLdb_i) > 1 or cp.amin(dLdb_i) < -1:
            dLdb_i = dLdb_i / cp.linalg.norm(dLdb_i)
        if cp.amax(dLdb_f) > 1 or cp.amin(dLdb_f) < -1:
            dLdb_f = dLdb_f / cp.linalg.norm(dLdb_f)
        if cp.amax(dLdb_o) > 1 or cp.amin(dLdb_o) < -1:
            dLdb_o = dLdb_o / cp.linalg.norm(dLdb_o)
        if cp.amax(dLdw_0) > 1 or cp.amin(dLdw_0) < -1:
            dLdw_0 = dLdw_0 / cp.linalg.norm(dLdw_0)
        if cp.amax(dLdb_y) > 1 or cp.amin(dLdb_y) < -1:
            dLdb_y = dLdb_y / cp.linalg.norm(dLdb_y)

        print("dLdW on step: " + str(t) + " is this: " + str(dLdW[0][0][0][0]))
        dLdh[t - 1] = dLdI[t][channels_img: channels_img + channels_hidden]

    # update weights with convolution derivatives
    # ---------------------- adam optimizer code -----------------------------------
    # --------------------- update main kernel ---------
    main_kernel = main_kernel - learning_rate * dLdW
    # -------------------- update bias c -----------------------
    bias_c = bias_c - learning_rate * dLdb_c
    # -------------------- update bias i -----------------------
    bias_i = bias_i - learning_rate * dLdb_i
    # -------------------- update bias f -----------------------
    bias_f = bias_f - learning_rate * dLdb_f
    # -------------------- update bias o -----------------------
    bias_o = bias_o - learning_rate * dLdb_o

    prediction2, pre_sigmoid_prediction2, hidden_prediction2, i2, f2, a2, c2, o2, h2 = forward_prop(x)
    print("dLdW is: " + str(dLdW[0][0][0][0]))
    loss2 = calculate_loss(prediction2, y)
    print("LOSS AFTER: ")
    print(loss2)
    print("backpropagation complete")
Wow, that took some time.
If you look at the back propagation code, look closely at these lines:
dLdb_c = cp.empty([channels_hidden, M, N])
dLdb_i = cp.empty([channels_hidden, M, N])
dLdb_f = cp.empty([channels_hidden, M, N])
dLdb_o = cp.empty([channels_hidden, M, N])
However, notice how the code proceeds to use the += operator on these arrays. cp.empty leaves the memory uninitialized, so the first += adds the gradients to whatever happened to be in that allocation, which can differ from run to run. Simply change the arrays to cp.zeros, and the code gives consistent loss.
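The effect is easy to reproduce in plain NumPy (cupy behaves the same way here):

import numpy as np

acc = np.empty(4)  # uninitialized: contents are whatever was left in that memory
acc += 1.0         # garbage plus one is still garbage, and can differ between runs

acc = np.zeros(4)  # initialized to 0.0
acc += 1.0         # deterministic: always [1. 1. 1. 1.]
print(acc)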
