Part 1
I'm going through this article and wanted to try to calculate a forward and backward pass with batch normalization.
When doing the steps after the first layer, I get a batch norm output that is equal for all features.
Here is the code (I have on purpose done it in very small steps):
import numpy as np

w1 = np.array([[0.3, 0.4], [0.5, 0.1], [0.2, 0.3]])
X = np.array([[0.7, 0.1], [0.3, 0.8], [0.4, 0.6]])

def mu(x, axis=0):
    return np.mean(x, axis=axis)

def sigma(z, mu):
    Ai = np.sum(z, axis=0)
    return np.sqrt((1/len(Ai)) * (Ai-mu)**2)

def Ai(z):
    return np.sum(z, axis=0)

def norm(Ai, mu, sigma):
    return (Ai-mu)/sigma

z1 = np.dot(w1, X.T)
mu1 = mu(z1)
A1 = Ai(z1)
sigma1 = sigma(z1, mu1)
gamma1 = np.ones(len(A1))
beta1 = np.zeros(len(A1))
Ahat = norm(A1, mu1, sigma1)  # since gamma is just ones it doesn't change anything here
The output I get from this is:
[1.73205081 1.73205081 1.73205081]
Part 2
In this image:
Should the sigma_mov and mu_mov be set to zero for the first layer?
EDIT: I think I found what I did wrong. In the normalization step I used A1 and not z1. I also found that it seems to be standard to initialize the moving average with zeros for the mean and ones for the variance. It would be nice if anyone could confirm this.
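Based on that, here is a minimal sketch of what the corrected forward step might look like, normalizing z1 directly and initializing the moving statistics with zero mean and unit variance (the momentum value and the choice of batch axis are my assumptions):

import numpy as np

w1 = np.array([[0.3, 0.4], [0.5, 0.1], [0.2, 0.3]])
X = np.array([[0.7, 0.1], [0.3, 0.8], [0.4, 0.6]])

z1 = np.dot(w1, X.T)  # pre-activations, shape (3, 3): rows = hidden units, columns = samples

# batch statistics per hidden unit, taken over the batch axis
mu1 = np.mean(z1, axis=1, keepdims=True)
var1 = np.var(z1, axis=1, keepdims=True)

eps = 1e-5
z1_hat = (z1 - mu1) / np.sqrt(var1 + eps)  # normalize z1, not the summed A1

gamma1 = np.ones((z1.shape[0], 1))
beta1 = np.zeros((z1.shape[0], 1))
out1 = gamma1 * z1_hat + beta1

# moving statistics start at zero mean / unit variance and are updated per batch
mu_mov = np.zeros_like(mu1)
var_mov = np.ones_like(var1)
momentum = 0.9  # assumed value
mu_mov = momentum * mu_mov + (1 - momentum) * mu1
var_mov = momentum * var_mov + (1 - momentum) * var1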
Related
I'm trying to implement multiclass logistic regression from scratch. The dataset is MNIST.
I built some functions such as hypothesis, sigmoid, cost function, cost function derivative, and gradient descent. My code is below.
I'm struggling with the following:
All images are labeled with the digit they represent, so there are a total of 10 classes.
Inside the gradient descent function, I need to loop through each class, but I do not know how to apply it using the One vs All method.
In other words, what I need to do is:
How to filter each class inside the gradient descent function.
After that, how to build a function to predict the test set.
Here is my code.
import numpy as np
import pandas as pd
# Only training data set
# the test data will be loaded later.
url='https://drive.google.com/file/d/1-MO8oCfq4KU361QeeL4DdafVBhZePUNT/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url,header = None)
X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
def hypothesis(X, thetas):
    return sigmoid(X.dot(thetas)) #- 0.0000001

def sigmoid(z):
    return 1/(1+np.exp(-z))

def losscost(X, y, m, thetas):
    h = hypothesis(X, thetas)
    return -(1/m) * ( y.dot(np.log(h)) + (1-y).dot(np.log(1-h)) )

def derivativelosscost(X, y, m, thetas):
    h = hypothesis(X, thetas)
    return (h-y).dot(X)/m

def descendinggradient(X, y, m, epoch, alpha, thetas):
    n = np.size(X, 1)
    J_historico = []

    for i in range(epoch):
        for j in range(0,10): # 10 classes
            # How to filter each class inside here (inside this def descendinggradient)?
            # The 2 lines below are wrong.
            #thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
            #J_historico = J_historico + [losscost(X, y, m, thetas)]
            pass  # placeholder so the loop body is valid

    return [thetas, J_historico]
alpha = 0.01
epoch = 50
(thetas, J_historico) = descendinggradient(X, y, m, epoch, alpha)
# After that, how to build a function to predict the test set.
Let me explain this problem step-by-step:
First, since your code doesn't provide the actual data or a link I can use, I've created a random dataset, followed by the same commands you used to create X and y:
batch_size = 20
num_classes = 10

rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    4 * rng.random((batch_size, num_classes + 1)) - 2,  # random values between -2 and 2
    columns=['X0','X1','X2','X3','X4','X5','X6','X7','X8','X9','Y']
)

X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
Next, let's take a look at your hypothesis function. If we just run hypothesis and look at the first sample, we get a vector with 10 entries, one per class. I also needed to provide initial thetas for this case:
thetas = rng.random((X.shape[1],num_classes))
h = hypothesis(X, thetas)
print(h[0])
>>>[0.89701729 0.90050806 0.98358408 0.81786334 0.96636732 0.97819512
0.89118488 0.87238045 0.70612173 0.30256924]
Basically, the function calculates a "probability"[1] for each class.
At this point we get to the first issue in your code. The sigmoid function returns "probabilities" that are not "connected" to each other. To put those "probabilities" in relation to one another we need another function: softmax. You will find plenty of implementations of it. In short: it rescales the per-class values so that the sum over all class "probabilities" is 1.
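As a rough sketch (this is just one common formulation, with the usual max-subtraction for numerical stability; it is not taken from your code):

def softmax(z):
    # subtract the row-wise max for numerical stability, then normalize per row
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)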
So for your second question "How to implement a predict after training", we only need to find the argmax value to determine the class:
h = hypothesis(X, thetas)
p = softmax(h) # needs to be implemented
prediction = np.argmax(p, axis=1)
print(prediction)
>>>[2 5 5 8 3 5 2 1 3 5 2 3 8 3 3 9 5 1 1 8]
Now that we know how to predict a class, we also need to know where to set up the training. We want to do this directly after the softmax function, but instead of using the argmax to determine the winning class, we use the cost function and its derivative. The problem in your code: you used the cross-entropy loss for a binary problem. The binary case also doesn't need the softmax function, because the sigmoid already connects the two classes. Since we are not interested in the value of the multiclass cross-entropy loss itself, only in its derivative, we can compute the derivative directly.
The conversion from binary to multiclass cross-entropy is somewhat unintuitive at first glance. I recommend reading a bit about it before implementing it. After that, you basically use your line:
thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
for updating the thetas.
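As an illustration only (the names Y and grad are mine, this variant computes softmax on the linear scores rather than on sigmoid outputs, and it assumes y holds integer class labels 0-9 as in MNIST rather than the random data above), a multiclass update could look roughly like this:

# hypothetical sketch: softmax-regression style update with one-hot targets
num_classes = 10
Y = np.zeros((m, num_classes))
Y[np.arange(m), y.astype(int).ravel()] = 1    # one-hot encode the labels

thetas = np.zeros((X.shape[1], num_classes))
for i in range(epoch):
    p = softmax(X.dot(thetas))                # (m, num_classes) class "probabilities"
    grad = X.T.dot(p - Y) / m                 # derivative of the multiclass cross-entropy
    thetas = thetas - alpha * grad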
[1] These are not actual probabilities, but that is a completely different topic.
I am taking this Coursera class on machine learning / linear regression. Here is how they describe the gradient descent algorithm for solving for the estimated OLS coefficients:
So they use w for the coefficients, H for the design matrix (or "features" as they call it), and y for the dependent variable. Their convergence criterion is the usual one of the norm of the gradient of the RSS being less than a tolerance epsilon; that is, their definition of "not converged" is ||grad(RSS(w))|| >= epsilon.
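In LaTeX, my reconstruction of the update they describe (based on the surrounding text and the code comment below, not a quote from the course notes) is:

\nabla \mathrm{RSS}(w) = -2 H^\top (y - Hw)
w^{(t+1)} = w^{(t)} - \eta \, \nabla \mathrm{RSS}\left(w^{(t)}\right)
\text{repeat while } \lVert \nabla \mathrm{RSS}(w^{(t)}) \rVert \ge \epsilon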
I am having trouble getting this algorithm to converge and was wondering if I was overlooking something in my implementation. Below is the code. Please note that I also ran the sample dataset I use in it (df) through the statsmodels regression library, just to see that a regression could converge and to get coefficient values to tie out with. It did and they were:
Intercept 4.344435
x1 4.387702
x2 0.450958
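For what it's worth, a minimal way to reproduce that statsmodels check looks something like this (my own sketch, using the df defined further down):

import statsmodels.api as sm

X_sm = sm.add_constant(df[["x1", "x2"]])  # intercept plus the two features
ols_fit = sm.OLS(df["y"], X_sm).fit()
print(ols_fit.params)                     # Intercept, x1, x2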
Here is my implementation. At each iteration, it prints the norm of the gradient of RSS:
import numpy as np
import numpy.linalg as LA
import pandas as pd
from pandas import DataFrame
# First define the grad function: grad(RSS) = -2H'(y-Hw)
def grad_rss(df, var_name_y, var_names_h, w):
    # Set up feature matrix H
    H = DataFrame({"Intercept" : [1 for i in range(0,len(df))]})
    for var_name_h in var_names_h:
        H[var_name_h] = df[var_name_h]

    # Set up y vector
    y = df[var_name_y]

    # Calculate the gradient of the RSS: -2H'(y - Hw)
    result = -2 * np.transpose(H.values) @ (y.values - H.values @ w)
    return result

def ols_gradient_descent(df, var_name_y, var_names_h, epsilon = 0.0001, eta = 0.05):
    # Set all initial w values to 0.0001 (not related to our choice of epsilon)
    w = np.array([0.0001 for i in range(0, len(var_names_h) + 1)])

    # Iteration counter
    t = 0

    # Basic algorithm: keep subtracting eta * grad(RSS) from w until
    # ||grad(RSS)|| < epsilon.
    while True:
        t = t + 1
        grad = grad_rss(df, var_name_y, var_names_h, w)
        norm_grad = LA.norm(grad)

        if norm_grad < epsilon:
            break
        else:
            print("{} : {}".format(t, norm_grad))
            w = w - eta * grad

            if t > 10:
                raise Exception("Failed to converge")

    return w
# ##########################################
df = DataFrame({
    "y" : [20,40,60,80,100],
    "x1" : [1,5,7,9,11],
    "x2" : [23,29,60,85,99]
})
# Run
ols_gradient_descent(df, "y", ["x1", "x2"])
Unfortunately this does not converge, and in fact prints a norm that is exploding with each iteration:
1 : 44114.31506051333
2 : 98203544.03067812
3 : 218612547944.95386
4 : 486657040646682.9
5 : 1.083355358314664e+18
6 : 2.411675439503567e+21
7 : 5.368670935963926e+24
8 : 1.1951287949674022e+28
9 : 2.660496151835357e+31
10 : 5.922574875391406e+34
11 : 1.3184342751414824e+38
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
......
Exception: Failed to converge
If I increase the maximum number of iterations enough, it doesn't converge, but just blows out to infinity.
Is there an implementation error here, or am I misinterpreting the explanation in the class notes?
Updated w/ Answer
As @Kant suggested, the eta needs to be updated at each iteration. The course itself had some sample formulas for this, but none of them helped with convergence. This section of the Wikipedia page about gradient descent mentions the Barzilai-Borwein approach as a good way of updating the eta. I implemented it and altered my code to update the eta at each iteration, and the regression converged successfully. Below is my translation of the Wikipedia version of the formula into the variables used in the regression, as well as code that implements it. Again, this code is called in the loop of my original ols_gradient_descent to update the eta.
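In formula form (my reconstruction of the Barzilai-Borwein step, matching the variables in the code below rather than a quote from Wikipedia):

\eta_t = \frac{\left(w_t - w_{t-1}\right)^\top \left(\nabla \mathrm{RSS}(w_t) - \nabla \mathrm{RSS}(w_{t-1})\right)}{\left\lVert \nabla \mathrm{RSS}(w_t) - \nabla \mathrm{RSS}(w_{t-1}) \right\rVert^{2}}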
def eta_t(w_t, w_t_minus_1, grad_t, grad_t_minus_1):
    delta_w = w_t - w_t_minus_1
    delta_grad = grad_t - grad_t_minus_1
    eta_t = (delta_w.T @ delta_grad) / (LA.norm(delta_grad))**2
    return eta_t
Try decreasing the value of eta. Gradient descent can diverge if eta is too high.
I previously implemented the original Bayesian Probabilistic Matrix Factorization (BPMF) model in pymc3. See my previous question for reference, data source, and problem setup. Per the answer to that question from @twiecki, I've implemented a variation of the model using LKJCorr priors for the correlation matrices and uniform priors for the standard deviations. In the original model, the covariance matrices are drawn from Wishart distributions, but due to current limitations of pymc3, the Wishart distribution cannot be sampled from properly. This answer to a loosely related question provides a succinct explanation for the choice of LKJCorr priors. The new model is below.
import logging
import pymc3 as pm
import numpy as np
import theano.tensor as t

n, m = train.shape
dim = 10  # dimensionality
beta_0 = 1  # scaling factor for lambdas; unclear on its use
alpha = 2  # fixed precision for likelihood function
std = .05  # how much noise to use for model initialization

# We will use separate priors for sigma and correlation matrix.
# In order to convert the upper triangular correlation values to a
# complete correlation matrix, we need to construct an index matrix:
n_elem = dim * (dim - 1) / 2
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)

logging.info('building the BPMF model')
with pm.Model() as bpmf:
    # Specify user feature matrix
    sigma_u = pm.Uniform('sigma_u', shape=dim)
    corr_triangle_u = pm.LKJCorr(
        'corr_u', n=1, p=dim,
        testval=np.random.randn(n_elem) * std)

    corr_matrix_u = corr_triangle_u[tri_index]
    corr_matrix_u = t.fill_diagonal(corr_matrix_u, 1)
    cov_matrix_u = t.diag(sigma_u).dot(corr_matrix_u.dot(t.diag(sigma_u)))
    lambda_u = t.nlinalg.matrix_inverse(cov_matrix_u)

    mu_u = pm.Normal(
        'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
        testval=np.random.randn(dim) * std)
    U = pm.MvNormal(
        'U', mu=mu_u, tau=lambda_u,
        shape=(n, dim), testval=np.random.randn(n, dim) * std)

    # Specify item feature matrix
    sigma_v = pm.Uniform('sigma_v', shape=dim)
    corr_triangle_v = pm.LKJCorr(
        'corr_v', n=1, p=dim,
        testval=np.random.randn(n_elem) * std)

    corr_matrix_v = corr_triangle_v[tri_index]
    corr_matrix_v = t.fill_diagonal(corr_matrix_v, 1)
    cov_matrix_v = t.diag(sigma_v).dot(corr_matrix_v.dot(t.diag(sigma_v)))
    lambda_v = t.nlinalg.matrix_inverse(cov_matrix_v)

    mu_v = pm.Normal(
        'mu_v', mu=0, tau=beta_0 * lambda_v, shape=dim,
        testval=np.random.randn(dim) * std)
    V = pm.MvNormal(
        'V', mu=mu_v, tau=lambda_v,
        shape=(m, dim), testval=np.random.randn(m, dim) * std)

    # Specify rating likelihood function
    R = pm.Normal(
        'R', mu=t.dot(U, V.T), tau=alpha * np.ones((n, m)),
        observed=train)

# `start` is the start dictionary obtained from running find_MAP for PMF.
# See the previous post for PMF code.
for key in bpmf.test_point:
    if key not in start:
        start[key] = bpmf.test_point[key]

with bpmf:
    step = pm.NUTS(scaling=start)
The goal with this reimplementation was to produce a model that could be estimated using the NUTS sampler. Unfortunately, I'm still getting the same error at the last line:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [ 0 1 2 3 ... 1030 1031 1032 1033 1034 ]
I've made all the code for PMF, BPMF, and this modified BPMF available in this gist to make it simple to replicate the error. All you need to do is download the data (also referenced in the gist).
It looks like you are passing the complete precision matrix into the normal distribution:
mu_u = pm.Normal(
'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
testval=np.random.randn(dim) * std)
I assume you only want to pass the diagonal values:
mu_u = pm.Normal(
'mu_u', mu=0, tau=beta_0 * t.diag(lambda_u), shape=dim,
testval=np.random.randn(dim) * std)
Does this change to mu_u and mu_v fix it for you?
To simplify the problem: say that when a dimension (or feature) has already been updated n times, the next time I see that feature I want to set the learning rate to 1/n.
I came up with these codes:
import numpy as np
import theano
import theano.tensor as T

def test_adagrad():
    embedding = theano.shared(value=np.random.randn(20,10), borrow=True)
    times = theano.shared(value=np.ones((20,1)))
    lr = T.dscalar()
    index_a = T.lvector()
    hist = times[index_a]

    cost = T.sum(theano.sparse_grad(embedding[index_a]))
    gradients = T.grad(cost, embedding)
    updates = [(embedding, embedding + lr*(1.0/hist)*gradients)]
    ### Here should be some code to also update `times`, which is omitted ###

    train = theano.function(inputs=[index_a, lr], outputs=cost, updates=updates)
    for i in range(10):
        print(train([1,2,3], 0.05))
Theano does not give any error, but the training result sometimes gives NaN. Does anybody know how to correct this, please?
Thank you for your help
PS: I suspect it is the operations in sparse space that create the problem, so I tried replacing * with theano.sparse.mul. This gave the same results as mentioned above.
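For what it's worth, here is a minimal sketch (my own fragment, reusing the variables from the snippet above, and not a verified fix for the NaN) of how the omitted `times` update could be expressed in the same updates list, using inc_subtensor so only the indexed rows are incremented:

# hypothetical sketch: also bump the per-row counters in the same update step
updates = [
    (embedding, embedding + lr * (1.0 / hist) * gradients),
    (times, T.inc_subtensor(times[index_a], 1.0)),
]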
Perhaps you can use the following example of an adadelta implementation to derive your own. Please update if you succeed :-)
I was looking for the same thing and ended up implementing it myself in the style of the resource zuuz already pointed out. So maybe this helps anyone looking for help here.
def adagrad(lr, tparams, grads, inp, cost):
    # stores the current grads
    gshared = [theano.shared(np.zeros_like(p.get_value(),
                                           dtype=theano.config.floatX),
                             name='%s_grad' % k)
               for k, p in tparams.iteritems()]
    grads_updates = zip(gshared, grads)

    # stores the sum of all grads squared
    hist_gshared = [theano.shared(np.zeros_like(p.get_value(),
                                                dtype=theano.config.floatX),
                                  name='%s_grad' % k)
                    for k, p in tparams.iteritems()]
    rgrads_updates = [(rg, rg + T.sqr(g)) for rg, g in zip(hist_gshared, grads)]

    # calculate cost and store grads
    f_grad_shared = theano.function(inp, cost,
                                    updates=grads_updates + rgrads_updates,
                                    on_unused_input='ignore')

    # apply actual update with the initial learning rate lr
    n = 1e-6
    updates = [(p, p - (lr/(T.sqrt(rg) + n))*g)
               for p, g, rg in zip(tparams.values(), gshared, hist_gshared)]

    f_update = theano.function([lr], [], updates=updates, on_unused_input='ignore')

    return f_grad_shared, f_update
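Usage is then roughly as follows (my own sketch; tparams, grads, inp, cost and the minibatch iterator are assumed to exist as in the surrounding training code):

lr = T.scalar('lr')
f_grad_shared, f_update = adagrad(lr, tparams, grads, inp, cost)

for x_batch, y_batch in minibatches:              # hypothetical minibatch iterator
    batch_cost = f_grad_shared(x_batch, y_batch)  # compute cost and accumulate gradients
    f_update(0.01)                                # apply the Adagrad step with base rate 0.01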
I find this implementation from Lasagne very concise and readable. You can use it pretty much as it is:
for param, grad in zip(params, grads):
    value = param.get_value(borrow=True)
    accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                         broadcastable=param.broadcastable)
    accu_new = accu + grad ** 2
    updates[accu] = accu_new
    updates[param] = param - (learning_rate * grad /
                              T.sqrt(accu_new + epsilon))
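Note that the snippet assumes params, grads, updates, learning_rate and epsilon already exist. Wrapped up as a self-contained helper, the same idea might look like this (default values are my own choice):

from collections import OrderedDict
import numpy as np
import theano
import theano.tensor as T

def adagrad_updates(cost, params, learning_rate=0.01, epsilon=1e-6):
    updates = OrderedDict()
    grads = T.grad(cost, params)
    for param, grad in zip(params, grads):
        value = param.get_value(borrow=True)
        # per-parameter accumulator of squared gradients
        accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
        accu_new = accu + grad ** 2
        updates[accu] = accu_new
        updates[param] = param - (learning_rate * grad /
                                  T.sqrt(accu_new + epsilon))
    return updates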
I am writing a program to do neural networks in Python, and I am trying to set up the backpropagation algorithm. The basic idea is that I loop through 5,000 training examples, collect the errors, find out in which direction I need to move the thetas, and then move them in that direction. There are the training examples, then one hidden layer, and then an output layer. However, I am getting the gradient/derivative/error wrong here, because I am not moving the thetas the way they need to be moved. I put 8 hours into this today and I'm not sure what I'm doing wrong. Thanks for your help!!
# x = 401x5000 matrix
# y = 10x5000 matrix  (10 possible output classes, so one column will look like
#                      [0, 0, 0, 1, 0, ..., 0] to indicate the output class was 4)
# theta_1 = 25x401
# theta_2 = 10x26

alpha = .01
sigmoid = lambda theta, x: 1 / (1 + np.exp(-(theta*x)))

# move thetas in the right direction for each iteration
for iter in range(0,1):
    all_delta_1, all_delta_2 = 0, 0

    # loop through each training example, 1...m
    for t in range(0,5000):
        hidden_layer = np.matrix(np.concatenate((np.ones((1,1)), sigmoid(theta_1, x[:,t]))))
        output_layer = sigmoid(theta_2, hidden_layer)

        delta_3 = output_layer - y[:,t]
        delta_2 = np.multiply((theta_2.T*delta_3), np.multiply(hidden_layer, (1-hidden_layer)))

        #print type(delta_3), delta_3.shape, type(hidden_layer.T), hidden_layer.T.shape
        all_delta_2 += delta_3*hidden_layer.T
        all_delta_1 += delta_2[1:]*x[:,t].T

    delta_gradient_2 = all_delta_2 / m
    delta_gradient_1 = all_delta_1 / m
    theta_1 = theta_1 - (alpha * delta_gradient_1)
    theta_2 = theta_2 - (alpha * delta_gradient_2)
It looks like your gradients are with respect to the unsquashed output layer.
Try changing output_layer = sigmoid(theta_2,hidden_layer) to output_layer = theta_2*hidden_layer.
Or recompute the gradients for squashed output.