I'm trying to build my own implementation of the neural network backpropagation algorithm. This is the training code I have written so far:
def train(x,labels,n):
    lam = 0.5
    w1 = np.random.uniform(0,0.01,(20,120)) #weights
    w2 = np.random.uniform(0,0.01,20)
    for i in xrange(n):
        w1 = w1/np.linalg.norm(w1)
        w2 = w2/np.linalg.norm(w2)
        for j in xrange(x.shape[0]):
            y1 = np.zeros((600)) #output
            d1 = np.zeros((20))
            p = np.mat(x[j,:])
            a = np.dot(w1,p.T) #activation
            z = 1/(1 + np.exp((-1)*a))
            y1[j] = np.dot(w2,z)
            for k in xrange(20):
                d1[k] = z[k]*(1 - z[k])*(y1[j] - labels[j])*np.sum(w2) #delta update rule
                w1[k,:] = w1[k,:] - lam*d1[k]*x[j,:] #weight update
                w2[k] = w2[k] - lam*(y1[j]-labels[j])*z[k]
            E = 1/2*pow((y1[j]-labels[j]),2) #mean squared error
            print E
    return 0
No of input units- 120,
No of hidden units- 20,
No of output units- 1,
No of training samples- 600
x is a 600*120 training set, with zero mean and unit variance, with max value 3.28 and min value -4.07. The first 200 samples belong to class 1, the second 200 to class 2 and last 200 to class 3. Labels are the class labels assigned to each sample, n is the number of iterations required for convergence. Each sample has 120 features.
I have initialized the weights between 0 and 0.01, and the input data is scaled to zero mean and unit variance, yet the code still throws an overflow warning, resulting in 'a' (the activation values) being NaN. I can't understand what the problem is.
Every sample has 120 elements. A sample row of x :
[ 0.80145231 1.29567936 0.91474224 1.37541992 1.16183938 1.43947296
1.32440357 1.43449479 1.32742415 1.40533852 1.28817561 1.37977183
1.2290933 1.34720161 1.15877069 1.29699635 1.05428735 1.21923531
0.92312685 1.1061345 0.66647463 1.00044203 0.34270708 1.05589558
0.28770958 1.21639524 0.31522575 1.32862243 0.42135899 1.3997094
0.5780146 1.44444501 0.75872771 1.47334256 0.95372771 1.48878048
1.13968139 1.49119962 1.33121905 1.47326017 1.47548571 1.4450047
1.58272343 1.39327328 1.62929132 1.31126604 1.62705274 1.21790335
1.59951034 1.12756958 1.56253815 1.04096709 1.52651382 0.95942134
1.48875633 0.87746762 1.45248623 0.78782313 1.40446404 0.68370011
Overflow
The logistic sigmoid function is prone to overflow in NumPy as the signal strength increases. Try adding the following line before the sigmoid is computed:
signal = np.clip( signal, -500, 500 )
This limits the values in NumPy arrays to the given interval, which in turn prevents the overflow in the sigmoid function. I find +-500 to be a convenient signal saturation level.
>>> arr
array([[-900, -600, -300],
[ 0, 300, 600]])
>>> np.clip( arr, -500, 500)
array([[-500, -500, -300],
[ 0, 300, 500]])
Implementation
This is the snippet I'm using in my projects:
def sigmoid_function( signal ):
    # Prevent overflow.
    signal = np.clip( signal, -500, 500 )
    # Calculate activation signal
    signal = 1.0/( 1 + np.exp( -signal ))
    return signal
#end
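As an alternative to clipping, the sigmoid can be evaluated so that np.exp only ever sees non-positive arguments. The following is a minimal sketch of that idea, not part of the snippet above; the name stable_sigmoid is hypothetical and the function assumes an array input.

```python
import numpy as np

def stable_sigmoid(signal):
    """Sigmoid that never calls np.exp on a large positive number."""
    signal = np.asarray(signal, dtype=float)
    out = np.empty_like(signal)
    pos = signal >= 0
    # For non-negative inputs, exp(-x) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-signal[pos]))
    # For negative inputs, rewrite 1/(1+exp(-x)) as exp(x)/(1+exp(x)).
    exp_neg = np.exp(signal[~pos])
    out[~pos] = exp_neg / (1.0 + exp_neg)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

Both branches compute the same mathematical function; only the form of the expression changes, so no saturation level has to be chosen.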
Why does the Sigmoid function overflow?
As the training progresses, the activation function improves its precision. The sigmoid signal will converge on 1 from below or 0 from above as the accuracy approaches perfection, e.g., either 0.99999999999... or 0.00000000000000001...
The warning itself comes from np.exp( -signal ): once -signal exceeds roughly 709, the result is larger than the biggest float64 value (about 1.8e308), so NumPy returns inf and reports an overflow.
Note: This error message could be ignored by setting:
np.seterr( over='ignore' )
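If changing the global setting feels too broad, np.errstate can limit the suppression to a single block. This is a small sketch with a hypothetical input value, shown only to illustrate the scoping:

```python
import numpy as np

signal = np.array([-1000.0])  # hypothetical input that makes exp(-signal) overflow

# Overflow warnings are silenced only inside this block.
with np.errstate(over='ignore'):
    result = 1.0 / (1.0 + np.exp(-signal))

print(result)
```

Outside the with block, NumPy's previous error handling is restored automatically.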
Related
I'm trying to implement multiclass logistic regression from scratch. The dataset is MNIST.
I built some functions such as hypothesis, sigmoid, cost function, cost function derivative, and gradient descent. My code is below.
I'm struggling with the following:
All images are labeled with the digit they represent, so there are 10 classes in total.
Inside the gradient descent function, I need to loop through each class, but I do not know how to do that using the One-vs-All method.
In other words, what I need to know is:
How to filter each class inside the gradient descent function.
After that, how to build a function to predict the test set.
Here is my code.
import numpy as np
import pandas as pd
# Only training data set
# the test data will be load later.
url='https://drive.google.com/file/d/1-MO8oCfq4KU361QeeL4DdafVBhZePUNT/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url,header = None)
X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
def hypothesis(X, thetas):
    return sigmoid( X.dot(thetas)) #- 0.0000001
def sigmoid(z):
    return 1/(1+np.exp(-z))
def losscost(X, y, m, thetas):
    h = hypothesis(X, thetas)
    return -(1/m) * ( y.dot(np.log(h)) + (1-y).dot(np.log(1-h)) )
def derivativelosscost(X, y, m, thetas):
    h = hypothesis(X, thetas)
    return (h-y).dot(X)/m
def descendinggradient(X, y, m, epoch, alpha, thetas):
    n = np.size(X, 1)
    J_historico = []
    for i in range(epoch):
        for j in range(0,10): # 10 classes
            # How to filter each class inside here (inside this def descendinggradient)?
            # 2 lines below are wrong.
            #thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
            #J_historico = J_historico + [losscost(X, y, m, thetas)]
    return [thetas, J_historico]
alpha = 0.01
epoch = 50
(thetas, J_historico) = descendinggradient(X, y, m, epoch, alpha)
# After that, how to build a function to predict the test set.
Let me explain this problem step-by-step:
First, since your code doesn't provide the actual data or a link to it, I've created a random dataset, followed by the same commands you used to create X and y:
batch_size = 20
num_classes = 10
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    4 * rng.random((batch_size, num_classes + 1)) - 2, # Create random array between -2 and 2
    columns=['X0','X1','X2','X3','X4','X5','X6','X7','X8', 'X9','Y']
)
X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
Next, let's take a look at your hypothesis function. If we just run hypothesis and look at the first sample, we get a vector with 10 entries, one per class. I also needed to provide initial thetas for this case:
thetas = rng.random((X.shape[1],num_classes))
h = hypothesis(X, thetas)
print(h[0])
>>>[0.89701729 0.90050806 0.98358408 0.81786334 0.96636732 0.97819512
0.89118488 0.87238045 0.70612173 0.30256924]
Basically, the function calculates a "probability"[1] for each class.
At this point we get to the first issue in your code. The sigmoid function returns "probabilities" that are not "connected" to each other. To relate those "probabilities" to one another we need another function: softmax. You will find plenty of implementations of this function. In short: it calculates the "probabilities" based on the sigmoid outputs so that the sum over all class "probabilities" is 1.
So for your second question, "how to implement a predict function after training", we only need the argmax to determine the class:
h = hypothesis(X, thetas)
p = softmax(h) # needs to be implemented
prediction = np.argmax(p, axis=1)
print(prediction)
>>>[2 5 5 8 3 5 2 1 3 5 2 3 8 3 3 9 5 1 1 8]
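The softmax call above is left unimplemented. One common way to write it, given here as a sketch rather than the only option, is to subtract the row maximum before exponentiating so that np.exp cannot overflow:

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax for a 2-D array of shape (samples, classes)."""
    # Subtracting the row maximum does not change the result but keeps exp() bounded.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

p = softmax(np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]]))
print(p.sum(axis=1))         # each row sums to 1
print(np.argmax(p, axis=1))  # predicted class per row
```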
Now that we know how to predict a class, we also need to know where to set up the training. We want to do this directly after the softmax function, but instead of using the argmax to determine the winning class, we use the cost function and its derivative. The problem in your code: you used the cross-entropy loss for a binary problem. A binary problem also doesn't need the softmax function, because the sigmoid function already provides the connection between the two classes. Since we are not interested in the value of the multiclass cross-entropy loss itself, only in its derivative, we want to calculate that derivative directly.
The conversion from binary cross-entropy to multiclass is somewhat unintuitive at first. I recommend reading a bit about it before implementing. After this you basically use your line:
thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
for updating the thetas.
[1] These are not actual probabilities, but that is a completely different topic.
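To make the multiclass update concrete, here is a minimal sketch under a few assumptions that are not in the code above: labels are integers from 0 to num_classes - 1, they are one-hot encoded, and softmax is applied directly to X @ thetas (the usual formulation, without a sigmoid in between). With those assumptions the gradient of the cross-entropy is X.T @ (probs - Y) / m. The helper names one_hot, softmax_gradient, train and predict are hypothetical, and the toy data stands in for MNIST.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def one_hot(labels, num_classes):
    # Row i has a 1 in column labels[i] and 0 elsewhere.
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

def softmax_gradient(X, Y, thetas):
    # Derivative of the mean multiclass cross-entropy with respect to thetas.
    probs = softmax(X @ thetas)
    return X.T @ (probs - Y) / X.shape[0]

def train(X, labels, num_classes, alpha=0.1, epochs=200):
    Y = one_hot(labels, num_classes)
    thetas = np.zeros((X.shape[1], num_classes))
    for _ in range(epochs):
        thetas = thetas - alpha * softmax_gradient(X, Y, thetas)
    return thetas

def predict(X, thetas):
    return np.argmax(softmax(X @ thetas), axis=1)

# Toy data: three well-separated clusters instead of MNIST.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])
labels = np.repeat(np.arange(3), 20)
features = centers[labels] + 0.3 * rng.standard_normal((60, 2))
X_toy = np.c_[np.ones(60), features]  # bias column, as in the question

thetas_toy = train(X_toy, labels, num_classes=3)
accuracy = np.mean(predict(X_toy, thetas_toy) == labels)
print(accuracy)
```

All classes are updated in one matrix operation, so no explicit loop over the 10 classes is needed inside the gradient descent function.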
Part 1
I'm going through this article and wanted to try to calculate a forward and backward pass with batch normalization.
When doing the steps after the first layer, I get a batch norm output that is equal for all features.
Here is the code (I have on purpose done it in very small steps):
w1 = np.array([[0.3, 0.4],[0.5,0.1],[0.2,0.3]])
X = np.array([[0.7,0.1],[0.3,0.8],[0.4,0.6]])
def mu(x,axis=0):
    return np.mean(x,axis=axis)
def sigma(z, mu):
    Ai = np.sum(z,axis=0)
    return np.sqrt((1/len(Ai)) * (Ai-mu)**2)
def Ai(z):
    return np.sum(z,axis=0)
def norm(Ai,mu,sigma):
    return (Ai-mu)/sigma
z1 = np.dot(w1,X.T)
mu1 = mu(z1)
A1 = Ai(z1)
sigma1 = sigma(z1,mu1)
gamma1 = np.ones(len(A1))
beta1 = np.zeros(len(A1))
Ahat = norm(A1,mu1,sigma1) #since gamma is just ones it doesn't change anything here
The output I get from this is:
[1.73205081 1.73205081 1.73205081]
Part 2
In this image:
Should the sigma_mov and mu_mov be set to zero for the first layer?
EDIT: I think I found what I did wrong. In the normalization step I used A1 and not z1. I also found that it is normal to initialize the moving averages with zeros for the mean and ones for the variance. It would be nice if anyone could confirm this.
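For comparison, here is a minimal sketch of a batch norm forward pass that normalizes z1 itself, per feature across the batch. It assumes rows are samples and columns are features (so the pre-activation is computed as X @ w1.T), and the function name, momentum and eps values are hypothetical choices. The moving statistics start at zeros and ones, as described in the EDIT.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, mu_mov, var_mov, momentum=0.9, eps=1e-5):
    """Normalize each column of z over the batch and update the moving statistics."""
    mu = z.mean(axis=0)                      # per-feature batch mean
    var = z.var(axis=0)                      # per-feature batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)    # normalized pre-activations
    mu_mov = momentum * mu_mov + (1 - momentum) * mu
    var_mov = momentum * var_mov + (1 - momentum) * var
    return gamma * z_hat + beta, mu_mov, var_mov

w1 = np.array([[0.3, 0.4], [0.5, 0.1], [0.2, 0.3]])
X = np.array([[0.7, 0.1], [0.3, 0.8], [0.4, 0.6]])
z1 = X @ w1.T                                # shape (batch, hidden units)

gamma1 = np.ones(z1.shape[1])
beta1 = np.zeros(z1.shape[1])
mu_mov = np.zeros(z1.shape[1])               # moving mean starts at zero
var_mov = np.ones(z1.shape[1])               # moving variance starts at one

out, mu_mov, var_mov = batchnorm_forward(z1, gamma1, beta1, mu_mov, var_mov)
print(out)
```

With this version the normalized values differ per sample, and each column has approximately zero mean and unit variance.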
I am looking for an efficient way to implement a simple filter with one coefficient that is time-varying and specified by a vector with the same length as the input signal.
The following is a simple implementation of the desired behavior:
def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
output = myfilter(signal, weights)
Is there a way to do this more efficiently with numpy or scipy?
You can trade in the overhead of the loop for a couple of additional ops:
import numpy as np
def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output
def vectorised(signal, weights):
    wp = np.r_[1, np.multiply.accumulate(1 - weights[1:])]
    sw = weights * signal
    sw[0] = signal[0]
    sws = np.add.accumulate(sw / wp)
    return wp * sws
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
print(np.allclose(myfilter(signal, weights), vectorised(signal, weights)))
On my machine the vectorised version is several times faster. It uses a "closed form" solution of your recurrence equation.
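For reference, the closed form can be derived as follows. The symbols are introduced here for illustration and do not appear in the code: v_i is the output, s_i the signal, w_i the weights, and P_i the running product that the code stores in wp.

```latex
\begin{aligned}
v_i &= (1 - w_i)\, v_{i-1} + w_i s_i, \qquad v_0 = s_0 \\
P_i &= \prod_{k=1}^{i} (1 - w_k), \qquad P_0 = 1 \\
\frac{v_i}{P_i} &= \frac{v_{i-1}}{P_{i-1}} + \frac{w_i s_i}{P_i}
\quad\Longrightarrow\quad
v_i = P_i \left( s_0 + \sum_{k=1}^{i} \frac{w_k s_k}{P_k} \right)
\end{aligned}
```

The sum in parentheses corresponds to sws and the prefactor to wp. Because P_i shrinks toward zero as i grows, the terms w_k s_k / P_k become very large for long inputs.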
Edit: For very long signal / weight (100,000 samples, say) this method doesn't work because of overflow. In that regime you can still save a bit (more than 50% on my machine) using the following trick, which has the added bonus that you needn't solve the recurrence formula, only invert it.
from scipy import linalg
def solver(signal, weights):
    rw = 1 / weights[1:]
    v = np.r_[1, rw, 1-rw, 0]
    v.shape = 2, -1
    return linalg.solve_banded((1, 0), v, signal)
This trick uses the fact that your recurrence is formally similar to a Gauss elimination on a matrix with only one nonvanishing subdiagonal. It piggybacks on a library function that specialises in doing precisely that.
Actually, quite proud of this one.
I am taking this Coursera class on machine learning / linear regression. Here is how they describe the gradient descent algorithm for solving for the estimated OLS coefficients:
So they use w for the coefficients, H for the design matrix (or features as they call it), and y for the dependent variable. And their convergence criteria is the usual of the norm of the gradient of RSS being less than tolerance epsilon; that is, their definition of "not converged" is:
I am having trouble getting this algorithm to converge and was wondering if I was overlooking something in my implementation. Below is the code. Please note that I also ran the sample dataset I use in it (df) through the statsmodels regression library, just to see that a regression could converge and to get coefficient values to tie out with. It did and they were:
Intercept 4.344435
x1 4.387702
x2 0.450958
Here is my implementation. At each iteration, it prints the norm of the gradient of RSS:
import numpy as np
import numpy.linalg as LA
import pandas as pd
from pandas import DataFrame
# First define the grad function: grad(RSS) = -2H'(y-Hw)
def grad_rss(df, var_name_y, var_names_h, w):
    # Set up feature matrix H
    H = DataFrame({"Intercept" : [1 for i in range(0,len(df))]})
    for var_name_h in var_names_h:
        H[var_name_h] = df[var_name_h]
    # Set up y vector
    y = df[var_name_y]
    # Calculate the gradient of the RSS: -2H'(y - Hw)
    result = -2 * np.transpose(H.values) @ (y.values - H.values @ w)
    return result
def ols_gradient_descent(df, var_name_y, var_names_h, epsilon = 0.0001, eta = 0.05):
    # Set all initial w values to 0.0001 (not related to our choice of epsilon)
    w = np.array([0.0001 for i in range(0, len(var_names_h) + 1)])
    # Iteration counter
    t = 0
    # Basic algorithm: keep subtracting eta * grad(RSS) from w until
    # ||grad(RSS)|| < epsilon.
    while True:
        t = t + 1
        grad = grad_rss(df, var_name_y, var_names_h, w)
        norm_grad = LA.norm(grad)
        if norm_grad < epsilon:
            break
        else:
            print("{} : {}".format(t, norm_grad))
            w = w - eta * grad
        if t > 10:
            raise Exception ("Failed to converge")
    return w
# ##########################################
df = DataFrame({
"y" : [20,40,60,80,100] ,
"x1" : [1,5,7,9,11] ,
"x2" : [23,29,60,85,99]
})
# Run
ols_gradient_descent(df, "y", ["x1", "x2"])
Unfortunately this does not converge, and in fact prints a norm that is exploding with each iteration:
1 : 44114.31506051333
2 : 98203544.03067812
3 : 218612547944.95386
4 : 486657040646682.9
5 : 1.083355358314664e+18
6 : 2.411675439503567e+21
7 : 5.368670935963926e+24
8 : 1.1951287949674022e+28
9 : 2.660496151835357e+31
10 : 5.922574875391406e+34
11 : 1.3184342751414824e+38
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
......
Exception: Failed to converge
If I increase the maximum number of iterations enough, it doesn't converge, but just blows out to infinity.
Is there an implementation error here, or am I misinterpreting the explanation in the class notes?
Updated w/ Answer
As @Kant suggested, eta needs to be updated at each iteration. The course itself had some sample formulas for this, but none of them helped with convergence. This section of the Wikipedia page about gradient descent mentions the Barzilai-Borwein approach as a good way of updating eta. I implemented it and altered my code to update eta with it at each iteration, and the regression converged successfully. Below is my translation of the Wikipedia version of the formula to the variables used in regression, as well as code that implements it. Again, this code is called in the loop of my original ols_gradient_descent to update eta.
def eta_t (w_t, w_t_minus_1, grad_t, grad_t_minus_1):
    delta_w = w_t - w_t_minus_1
    delta_grad = grad_t - grad_t_minus_1
    eta_t = (delta_w.T @ delta_grad) / (LA.norm(delta_grad))**2
    return eta_t
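One way the update could be wired into the loop is sketched below. This is an assumption about the surrounding code, not the exact version used above: it works on plain NumPy arrays built from the sample data, uses a hypothetical small first step eta0 (the formula needs two iterates before it can be applied), and caps the number of iterations.

```python
import numpy as np

# Sample data from the question: intercept, x1, x2.
H = np.array([[1, 1, 23], [1, 5, 29], [1, 7, 60], [1, 9, 85], [1, 11, 99]], dtype=float)
y = np.array([20, 40, 60, 80, 100], dtype=float)

def grad_rss(w):
    # Gradient of the RSS: -2H'(y - Hw)
    return -2 * H.T @ (y - H @ w)

def bb_gradient_descent(epsilon=1e-6, eta0=1e-5, max_iter=10000):
    w = np.full(H.shape[1], 1e-4)
    grad = grad_rss(w)
    eta = eta0  # hypothetical first step size
    for _ in range(max_iter):
        if np.linalg.norm(grad) < epsilon:
            break
        w_new = w - eta * grad
        grad_new = grad_rss(w_new)
        delta_w = w_new - w
        delta_grad = grad_new - grad
        denom = delta_grad @ delta_grad
        w, grad = w_new, grad_new
        if denom == 0:
            break  # gradient stopped changing; avoid dividing by zero
        # Barzilai-Borwein step size, same formula as eta_t above.
        eta = (delta_w @ delta_grad) / denom
    return w

w_bb = bb_gradient_descent()
print(w_bb)
```

The result can be compared against a direct least-squares solve such as np.linalg.lstsq(H, y, rcond=None) to check the fit.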
Try decreasing the value of eta. Gradient descent can diverge if eta is too high.
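For a least-squares problem, "too high" can be made concrete. The gradient -2H'(y - Hw) has the constant Hessian 2H'H, so a fixed step diverges once eta exceeds 1 divided by the largest eigenvalue of H'H. The sketch below, using the sample data from the question and hypothetical helper names, computes that bound and compares eta = 0.05 with a step safely below it.

```python
import numpy as np

H = np.array([[1, 1, 23], [1, 5, 29], [1, 7, 60], [1, 9, 85], [1, 11, 99]], dtype=float)
y = np.array([20, 40, 60, 80, 100], dtype=float)

def grad_rss(w):
    return -2 * H.T @ (y - H @ w)

# Largest eigenvalue of H'H determines the largest stable fixed step.
lam_max = np.linalg.eigvalsh(H.T @ H).max()
eta_max = 1.0 / lam_max

def gradient_norms(eta, steps):
    """Run fixed-step gradient descent and record the gradient norm at each step."""
    w = np.full(H.shape[1], 1e-4)
    norms = []
    for _ in range(steps):
        grad = grad_rss(w)
        norms.append(np.linalg.norm(grad))
        w = w - eta * grad
    return norms

diverging = gradient_norms(0.05, steps=5)          # step from the question
stable = gradient_norms(0.5 * eta_max, steps=50)   # step below the bound

print(eta_max)
print(diverging[0], diverging[-1])
print(stable[0], stable[-1])
```

Even with a stable step, convergence may still be slow on this data because the features are on very different scales; rescaling the columns of H is a common remedy.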
Given is some data, data, which corresponds to a binary sequence of coin flips, where heads are 1's and tails are 0's. Theta is a value between 0 and 1 representing the probability that a coin produces heads when flipped.
How does one go about calculating the likelihood? I faintly remember a formula where:
likelihood = (theta)^(h)*(1-theta)^(1-h)
where h is 1 if heads, and 0 if tails. I implemented the following code:
import numpy as np
(np.prod([theta*1 for i in data if i==1]) * np.prod([1-theta for i in data if i==0]))
This code works for some cases but not for some hidden cases (so I'm not sure what's wrong with it).
There are a couple of ways to interpret what you are trying to calculate:
Probability of exactly that sequence, including the order in which the heads occur (which is how your question is posed here)
Probability of the number of heads (let's call this X) occurring in your sequence, regardless of the order (which is what I think you were asking for).
option 1:
import numpy as np
theta = 0.2 # Probability of H is 0.2, hence NOT a fair coin
data = [0, 1, 0, 1, 1, 1, 0, 0, 1, 1] # T, H, T, H, H, ....
def likelihood(theta, h):
    return (theta)**(h)*(1-theta)**(1-h)
likelihood(theta, 1) # 0.2
likelihood(theta, 0) # 0.8
singlethrow = [likelihood(theta, x) for x in data]
prob1 = np.prod(singlethrow) # 2.6214400000000015e-05
prob1 will converge to zero pretty quickly, because every additional coin toss will multiply the existing probability with a number smaller than 1 (either 0.2 if heads, 0.8 if tails)
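Because the product shrinks so quickly, long sequences are usually handled with the log-likelihood, which turns the product into a sum. A minimal sketch, assuming 0 < theta < 1 and data made of 0s and 1s (the function name log_likelihood is hypothetical):

```python
import numpy as np

def log_likelihood(theta, data):
    """Log of the probability of the exact sequence in data."""
    data = np.asarray(data)
    heads = data.sum()
    tails = data.size - heads
    return heads * np.log(theta) + tails * np.log(1 - theta)

theta = 0.2
data = [0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
log_prob1 = log_likelihood(theta, data)
print(log_prob1, np.exp(log_prob1))
```

Exponentiating the result recovers prob1 for short sequences, while the log value stays representable for sequences where the product itself would underflow to zero.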
option 2:
is a binomial distribution. This adds up the probability of all possible outcomes that results in a total of, say, 6 heads when tossing a coin 10 times. One particular sequence that results in 6 heads for 10 tosses we already evaluated in option 1 above. There are 210 such ways ( = 10! / (6!*(10−6)!) )
The scipy.stats.binom.pmf() functionality calculates this probability for you:
import scipy, scipy.stats
prob2 = scipy.stats.binom.pmf(6, 10, theta)
Or, more generally, if you rely on data in the form I defined above:
X = sum([toss == 1 for toss in data])
N = len(data)
prob3 = scipy.stats.binom.pmf(X, N, theta)
prob2 == prob3 # True
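The two options are related by the number of orderings: the binomial probability is the single-sequence probability multiplied by the binomial coefficient. The following sketch checks this with the standard library only (math.comb requires Python 3.8 or later; the variable names are chosen here for illustration):

```python
from math import comb

theta = 0.2
data = [0, 1, 0, 1, 1, 1, 0, 0, 1, 1]

heads = sum(data)
n = len(data)

# Option 1: probability of this exact sequence.
sequence_prob = theta**heads * (1 - theta)**(n - heads)
# Option 2: probability of this many heads in any order.
count_prob = comb(n, heads) * sequence_prob

print(comb(n, heads), sequence_prob, count_prob)
```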
If you're interested in the Bayesian approach, you might want to have a look at the conjugate_prior package
from conjugate_prior import BetaBinomial
prior_model = BetaBinomial(1,1) # Uninformative prior
updated_model = prior_model.update(heads, tails)
credible_interval = updated_model.posterior(0.45, 0.55)
print ("There's {p:.2f}% chance that the coin is fair".format(p=credible_interval*100))