I've been following Andrew Ng CSC229 machine learning course, and am now covering logistic regression. The goal is to maximize the log likelihood function and find the optimal values of theta to do so. The link to the lecture notes is: [http://cs229.stanford.edu/notes/cs229-notes1.ps][1] -pages 16-19. Now the code below was shown on the course homepage (in matlab though--I converted it to python).
I'm applying it to a data set with 100 training examples (a data set given on the Coursera homepage for a introductory machine learning course). The data has two features which are two scores on two exams. The output is a 1 if the student received admission and 0 is the student did not receive admission. The have shown all of the code below. The following code causes the likelihood function to converge to maximum of about -62. The corresponding values of theta are [-0.05560301 0.01081111 0.00088362]. Using these values when I test out a training example like [1, 30.28671077, 43.89499752] which should give a value of 0 as output, I obtain 0.576 which makes no sense to me. If I test the hypothesis function with input [1, 10, 10] I obtain 0.515 which once again makes no sense. These values should correspond to a lower probability. This has me quite confused.
import numpy as np
import sig as s
def batchlogreg(X, y):
max_iterations = 800
alpha = 0.00001
(m,n) = np.shape(X)
X = np.insert(X, 0, 1, 1)
theta = np.array([0] * (n+1), 'float')
ll = np.array([0] * max_iterations, 'float')
for i in range(max_iterations):
hx = s.sigmoid(np.dot(X, theta))
d = y - hx
theta = theta + alpha*np.dot(np.transpose(X),d)
ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
return (theta, ll)
Note that the sigmoid function has:
sig(0) = 0.5
sig(x > 0) > 0.5
sig(x < 0) < 0.5
Since you get all probabilities above 0.5, this suggests that you never make X * theta negative, or that you do, but your learning rate is too small to make it matter.
for i in range(max_iterations):
hx = s.sigmoid(np.dot(X, theta)) # this will probably be > 0.5 initially
d = y - hx # then this will be "very" negative when y is 0
theta = theta + alpha*np.dot(np.transpose(X),d) # (1)
ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
The problem is most likely at (1). The dot product will be very negative, but your alpha is very small and will negate its effect. So theta will never decrease enough to properly handle correctly classifying labels that are 0.
Positive instances are then only barely correctly classified for the same reason: your algorithm does not discover a reasonable hypothesis under your number of iterations and learning rate.
Possible solution: increase alpha and / or the number of iterations, or use momentum.
It sounds like you could be confusing probabilities with assignments.
The probability will be a real number between 0.0 and 1.0. A label will be an integer (0 or 1). Logistic regression is a model that provides the probability of a label being 1 given the input features. To obtain a label value, you need to make a decision using that probability. An easy decision rule is that the label is 0 if the probability is less than 0.5, and 1 if the probability is greater than or equal to 0.5.
So, for the example you gave, the decisions would both be 1 (which means the model is wrong for the first example where it should be 0).
I came to the same question and found the reason.
Normalize X first or set a scale-comparable intercept like 50.
Otherwise contours of cost function are too "narrow". A big alpha makes it overshoot and a small alpha fails to progress.
Related
Currently I want to generate some samples to get expectation & variance of it.
Given the probability density function: f(x) = {2x, 0 <= x <= 1; 0 otherwise}
I already found that E(X) = 2/3, Var(X) = 1/18, my detail solution is from here https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I have when simulating using python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advanced.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is:
so the CDF is:
Given that the CDF is an easily invertible function for the range [0,1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U1/2.
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third column is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always a positive quantity, but covariance can be either positive or negative. Consequently, it’s easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U’ = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F-1(U) and X’ = F-1(U’) are identically distributed since they’re defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F-1(ui) + F-1(1-ui)) / 2) the expected value E[A] = E[(X + X’)/2] = 2E[X]/2 = E[X] while the variance Var(A) = [(Var(X) + Var(X’) + 2Cov(X,X’)]/4 = 2[Var(X) + Cov(X,X’)]/4 = [Var(X) + Cov(X,X’)]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
To approximate the integral of some function of x, say, g(x), over S = [0, 1], using Monte Carlo simulation, you
generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1])
calculate the arithmetic mean of g(x_i) over i = 1 to i = N where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the result of this by the square of the estimate of the expected value of X to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y, E in this case are numpy arrays. This approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods. The second should be much faster.
After a full week of print statements, dimensional analysis, refactoring, and talking through the code out loud, I can say I'm completely stuck.
The gradients my cost function produces are too far from those produced by finite differences.
I have confirmed my cost function produces correct costs for regularized inputs and not. Here's the cost function:
def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
# reshape parameter/weight vectors to suit network size
Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)], (hidden_layer_size, (input_layer_size + 1)))
Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size+1)):], (num_labels, (hidden_layer_size + 1)))
if lambda_ is None:
lambda_ = 0
# grab number of observations
m = X.shape[0]
# init variables we must return
cost = 0
Theta1_grad = np.zeros(Theta1.shape)
Theta2_grad = np.zeros(Theta2.shape)
# one-hot encode the vector y
y_mtx = pd.get_dummies(y.ravel()).to_numpy()
ones = np.ones((m, 1))
X = np.hstack((ones, X))
# layer 1
a1 = X
z2 = Theta1#a1.T
# layer 2
ones_l2 = np.ones((y.shape[0], 1))
a2 = np.hstack((ones_l2, sigmoid(z2.T)))
z3 = Theta2#a2.T
# layer 3
a3 = sigmoid(z3)
reg_term = (lambda_/(2*m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1))) + np.sum(np.sum(np.multiply(Theta2,Theta2))) - np.subtract((Theta1[:,0].T#Theta1[:,0]),(Theta2[:,0].T#Theta2[:,0])))
cost = (1/m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1-a3).T * (1-y_mtx))) + reg_term
# BACKPROPAGATION
# δ3 equals the difference between a3 and the y_matrix
d3 = a3 - y_mtx.T
# δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units) multiplied element-wise by the g′() of z2 (computed back in Step 2).
d2 = Theta2[:,1:].T#d3 * sigmoidGradient(z2)
# Δ1 equals the product of δ2 and a1.
Delta1 = d2#a1
Delta1 /= m
# Δ2 equals the product of δ3 and a2.
Delta2 = d3#a2
Delta2 /= m
reg_term1 = (lambda_/m) * np.append(np.zeros((Theta1.shape[0],1)), Theta1[:,1:], axis=1)
reg_term2 = (lambda_/m) * np.append(np.zeros((Theta2.shape[0],1)), Theta2[:,1:], axis=1)
Theta1_grad = Delta1 + reg_term1
Theta2_grad = Delta2 + reg_term2
grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())
return cost, grad
Here's the code to check the gradients. I have been over every line and there is nothing whatsoever that I can think of to change here. It seems to be in working order.
def checkNNGradients(lambda_):
"""
Creates a small neural network to check the backpropagation gradients.
Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
Input: Regularization parameter, lambda, as int or float.
Output: Analytical gradients produced by backprop code and the numerical gradients (computed
using computeNumericalGradient). These two gradient computations should result in
very similar values.
"""
input_layer_size = 3
hidden_layer_size = 5
num_labels = 3
m = 5
# generate 'random' test data
Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)
# reusing debugInitializeWeights to generate X
X = debugInitializeWeights(m, input_layer_size - 1)
y = np.ones(m) + np.remainder(np.range(m), num_labels)
# unroll parameters
nn_params = np.append(Theta1.ravel(), Theta2.ravel())
costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)
cost, grad = costFunc(nn_params)
numgrad = computeNumericalGradient(costFunc, nn_params)
# examine the two gradient computations; two columns should be very similar.
print('The columns below should be very similar.\n')
# Credit: http://stackoverflow.com/a/27663954/583834
print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
for numerical, analytical in zip(numgrad, grad):
print('{:<25}{}'.format(numerical, analytical))
# If you have a correct implementation, and assuming you used EPSILON = 0.0001
# in computeNumericalGradient.m, then diff below should be less than 1e-9
diff = np.linalg.norm(numgrad-grad)/np.linalg.norm(numgrad+grad)
print(diff)
print("\n")
print('If your backpropagation implementation is correct, then \n' \
'the relative difference will be small (less than 1e-9). \n' \
'\nRelative Difference: {:.10f}'.format(diff))
The check function generates its own data using a debugInitializeWeights function (so there's the reproducible example; just run that and it will call the other functions), and then calls the function that calculates the gradient using finite differences. Both are below.
def debugInitializeWeights(fan_out, fan_in):
"""
Initializes the weights of a layer with fan_in
incoming connections and fan_out outgoing connections using a fixed
strategy.
Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
of incoming connections for the same layer as int.
Output: Weight matrix, W, of size(1 + fan_in, fan_out), as the first row of W handles the "bias" terms
"""
W = np.zeros((fan_out, 1 + fan_in))
# Initialize W using "sin", this ensures that the values in W are of similar scale;
# this will be useful for debugging
W = np.sin(range(1, np.size(W)+1)) / 10
return W.reshape(fan_out, fan_in+1)
def computeNumericalGradient(J, nn_params):
"""
Computes the gradient using "finite differences"
and provides a numerical estimate of the gradient (i.e.,
gradient of the function J around theta).
Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
Inputs: Cost, J, as computed by nnCost function; Parameter vector, theta.
Output: Gradient vector using finite differences. Per Dr. Ng,
'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should
be the (approximately) the partial derivative of J with respect
to theta(i).)'
"""
numgrad = np.zeros(nn_params.shape)
perturb = np.zeros(nn_params.shape)
e = .0001
for i in range(np.size(nn_params)):
# Set perturbation (i.e., noise) vector
perturb[i] = e
# run cost fxn w/ noise added to and subtracted from parameters theta in nn_params
cost1, grad1 = J((nn_params - perturb))
cost2, grad2 = J((nn_params + perturb))
# record the difference in cost function ouputs; this is the numerical gradient
numgrad[i] = (cost2 - cost1) / (2*e)
perturb[i] = 0
return numgrad
The code is not for class. That MOOC was in MATLAB and it's over. This is for me. Other solutions exist on the web; looking at them has proved fruitless. Everyone has a different (inscrutable) approach. So, I'm in serious need of assistance or a miracle.
Edit/Update: Fortran ordering when raveling vectors influences the outcome, but I have not been able to get the gradients to move together changing that option.
One thought: I think your perturbation is a little large, being 1e-4. For double precision floating point numbers, it should be more like 1e-8, i.e., the root of the machine precision (or are you working with single precision?!).
That being said, finite differences can be very bad approximations to true derivatives. Specifically, floating point computations in numpy are not deterministic, as you seem to have found out. The noise in evaluations can cancel out many significant digits under some circumstances. What values are you seeing and what are you expecting?
All of the following figured into the solution to my problem. For those attempting to translate MATLAB code to Python, whether from Andrew NG's Coursera Machine Learning course or not, these are things everyone should know.
MATLAB does everything in FORTRAN order; Python does everything in C order. This affects how vectors are populated and, thus, your results. You should always be in FORTRAN order, if you want your answers to match what you did in MATLAB. See docs
Getting your vectors in FORTRAN order can be as easy as passing order='F' as an argument to .reshape(), .ravel(), or .flatten(). You may, however, achieve the same thing if you are using .ravel() by transposing the vector then applying the .ravel() function like so X.T.ravel().
Speaking of .ravel(), the .ravel() and .flatten() functions do not do the same thing and may have different use cases. For example, .flatten() is preferred by SciPy optimization methods. So, if your equivalent of fminunc isn't working, it's likely because you forgot to .flatten() your response vector y. See this Q&A StackOverflow and docs on .ravel() which link to .flatten().More Docs
If you're translating your code from MATLAB live script into a Jupyter notebook or Google COLAB, you must police your name space. On one occasion, I found that the variable I thought was being passed was not actually the variable that was being passed. Why? Jupyter and Colab notebooks have a lot of global variables that one would never write ordinarily.
There is a better function to evaluate the differences between numerical and analytical gradients: Relative Error Comparison np.abs(numerical-analyitical)/(numerical+analytical). Read about it here CS231 Also, consider the accepted post above.
Imagine we tossed a biased coin 8 times (we don’t know how biased it is), and we recorded 5 heads (H) to 3 tails (T) so far. What is the probability of that the next 3 tosses will all be tails? In other words, we are wondering the expected probability of having 5Hs and 6Ts after 11th tosses.
I want to build a MCMC simulation model using pyMC3 to find the Bayesian solution. There is also an analytical solution within the Bayesian approach for this problem. So, I will be able to compare the results derived from the simulation, the analytical way as well as the classical frequentest way. Let me briefly explain what I could do so far:
Frequentest solution:
If we consider the probability for a single toss:
E(T) = p = (3/8) = 0,375
Then, the ultimate answer is E({T,T,T}) = p^3 = (3/8)^3 = 0,052.
2.1. Bayesian solution with analytical way:
Please assume the unknown parameter “p” represents for the bias of the coin.
If we consider the probability for a single toss:
E(T) = Integral0-1[p * P(p | H = 5, T = 3) dp] = 0,400 (I calculated the result after some algebraic manipulation)
Similarly, the ultimate answer is:
E({T,T,T}) = Integral0-1[p^3 * P(p | H = 5, T = 3) dp] = 10/11 = 0,909.
2.2. Bayesian solution with MCMC simulation:
When we consider the probability for a single toss, I built the model in pyMC3 as below:
Head: 0
Tail: 1
data = [0, 0, 0, 0, 0, 1, 1, 1]
import pymc3 as pm
with pm.Model() as coin_flipping:
p = pm.Uniform(‘p’, lower=0, upper=1)
y = pm.Bernoulli(‘y’, p=p, observed=data)
trace = pm.sample(1000)
pm.traceplot(trace)
After I run this code, I got that the posterior mean is E(T) =0,398 which is very close to the result of analytical solution (0,400). I am happy so far, but this is not the ultimate answer. I need a model that simulate the probability of E({T,T,T}). I appreciate if someone help me on this step.
One way is to do this empirically is with PyMC3's posterior predictive sampling. That is, once you have a posterior sampling, you can generate samplings from random parameterizations of the model. The pymc3.sample_posterior_predictive() method will generate new samples the size of your original observed data. Since you are only interested in three flips, we can just ignore the additional flips it generates. For example, if you wanted 10000 random sets of predicted flips, you would do
with pm.Model() as coin_flipping:
# this is still uniform, but I always prefer Beta for proportions
p = pm.Beta(‘p’, alpha=1, beta=1)
pm.Bernoulli(‘y’, p=p, observed=data)
# chains looked a bit waggly at 1K; 10K looks smoother
trace = pm.sample(10000, random_seed=2019, chains=4)
# note this generates (10000, 8) observations
post_pred = pm.sample_posterior_predictive(trace, samples=10000, random_seed=2019)
To then see how frequent the next three flips are (1,1,1), we can do
np.mean(np.sum(post_pred['y'][:,:3], axis=1) == 3)
# 0.0919
Analytic Solution
In this example, since we have an analytic posterior predictive distribution (Beta-Binomial[k | n, a=4, b=6] - see the Wikipedia table of conjugate distributions for details), we can exactly calculate the probability of observing three tails in the next three flips as follows:
from scipy.special import comb, beta as beta_fn
n, k = 3, 3 # flips, tails
a, b = 4, 6 # 1 + observed tails, 1 + observed heads
comb(n, k) * beta_fn(n + a, n - k + b) / beta_fn(a, b)
# 0.09090909090909091
Note that the beta_fn is the Euler Beta function, as distinct from the Beta distribution.
So the output of my network is a list of propabilities, which I then round using tf.round() to be either 0 or 1, this is crucial for this project.
I then found out that tf.round isn't differentiable so I'm kinda lost there.. :/
Something along the lines of x - sin(2pi x)/(2pi)?
I'm sure there's a way to squish the slope to be a bit steeper.
You can use the fact that tf.maximum() and tf.minimum() are differentiable, and the inputs are probabilities from 0 to 1
# round numbers less than 0.5 to zero;
# by making them negative and taking the maximum with 0
differentiable_round = tf.maximum(x-0.499,0)
# scale the remaining numbers (0 to 0.5) to greater than 1
# the other half (zeros) is not affected by multiplication
differentiable_round = differentiable_round * 10000
# take the minimum with 1
differentiable_round = tf.minimum(differentiable_round, 1)
Example:
[0.1, 0.5, 0.7]
[-0.0989, 0.001, 0.20099] # x - 0.499
[0, 0.001, 0.20099] # max(x-0.499, 0)
[0, 10, 2009.9] # max(x-0.499, 0) * 10000
[0, 1.0, 1.0] # min(max(x-0.499, 0) * 10000, 1)
This works for me:
x_rounded_NOT_differentiable = tf.round(x)
x_rounded_differentiable = x - tf.stop_gradient(x - x_rounded_NOT_differentiable)
Rounding is a fundamentally nondifferentiable function, so you're out of luck there. The normal procedure for this kind of situation is to find a way to either use the probabilities, say by using them to calculate an expected value, or by taking the maximum probability that is output and choose that one as the network's prediction. If you aren't using the output for calculating your loss function though, you can go ahead and just apply it to the result and it doesn't matter if it's differentiable. Now, if you want an informative loss function for the purpose of training the network, maybe you should consider whether keeping the output in the format of probabilities might actually be to your advantage (it will likely make your training process smoother)- that way you can just convert the probabilities to actual estimates outside of the network, after training.
Building on a previous answer, a way to get an arbitrarily good approximation is to approximate round() using a finite Fourier approximation and use as many terms as you need. Fundamentally, you can think of round(x) as adding a reverse (i. e. descending) sawtooth wave to x. So, using the Fourier expansion of the sawtooth wave we get
With N = 5, we get a pretty nice approximation:
Kind of an old question, but I just solved this problem for TensorFlow 2.0. I am using the following round function on in my audio auto-encoder project. I basically want to create a discrete representation of sound which is compressed in time. I use the round function to clamp the output of the encoder to integer values. It has been working well for me so far.
#tf.custom_gradient
def round_with_gradients(x):
def grad(dy):
return dy
return tf.round(x), grad
In range 0 1, translating and scaling a sigmoid can be a solution:
slope = 1000
center = 0.5
e = tf.exp(slope*(x-center))
round_diff = e/(e+1)
In tensorflow 2.10, there is a function called soft_round which achieves exactly this.
Fortunately, for those who are using lower versions, the source code is really simple, so I just copy-pasted those lines, and it works like a charm:
def soft_round(x, alpha, eps=1e-3):
"""Differentiable approximation to `round`.
Larger alphas correspond to closer approximations of the round function.
If alpha is close to zero, this function reduces to the identity.
This is described in Sec. 4.1. in the paper
> "Universally Quantized Neural Compression"<br />
> Eirikur Agustsson & Lucas Theis<br />
> https://arxiv.org/abs/2006.09952
Args:
x: `tf.Tensor`. Inputs to the rounding function.
alpha: Float or `tf.Tensor`. Controls smoothness of the approximation.
eps: Float. Threshold below which `soft_round` will return identity.
Returns:
`tf.Tensor`
"""
# This guards the gradient of tf.where below against NaNs, while maintaining
# correctness, as for alpha < eps the result is ignored.
alpha_bounded = tf.maximum(alpha, eps)
m = tf.floor(x) + .5
r = x - m
z = tf.tanh(alpha_bounded / 2.) * 2.
y = m + tf.tanh(alpha_bounded * r) / z
# For very low alphas, soft_round behaves like identity
return tf.where(alpha < eps, x, y, name="soft_round")
alpha sets how soft the function is. Greater values leads to better approximations of round function, but then it becomes harder to fit since gradients vanish:
x = tf.convert_to_tensor(np.arange(-2,2,.1).astype(np.float32))
for alpha in [ 3., 7., 15.]:
y = soft_round(x, alpha)
plt.plot(x.numpy(), y.numpy(), label=f'alpha={alpha}')
plt.legend()
plt.title('Soft round function for different alphas')
plt.grid()
In my case, I tried different values for alpha, and 3. looks like a good choice.
def computeCost(X, y, theta):
inner = np.power(((X * theta.T) - y), 2)
return np.sum(inner) / (2 * len(X))
def gradientDescent(X, y, theta, alpha, iters):
temp = np.matrix(np.zeros(theta.shape))
params = int(theta.ravel().shape[1]) #flattens
cost = np.zeros(iters)
for i in range(iters):
err = (X * theta.T) - y
for j in range(params):
term = np.multiply(err, X[:,j])
temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
theta = temp
cost[i] = computeCost(X, y, theta)
return theta, cost
Here is the code for linear regression cost function and gradient descent that I've found on a tutorial, but I am not quite sure how it works.
First I get how the computeCost code works since it's just (1/2M) where M is number of data.
For gradientDescent code, I just don't understand how it works in general. I know the formula for updating the theta is something like
theta = theta - (learningRate) * derivative of J(cost function). But I am not sure where alpha / len(X)) * np.sum(term) this comes from on the line updating temp[0,j].
Please help me to understand!
I'll break this down for you. So in your gradientDescent function, you are taking in the predictor variables(X), target variable(y),
the weight matrix(theta) and two more parameters(alpha, iters) which are the training parameters. The function's job is to figure out how much should
each column in the predictor variables(X) set should be multiplied by before adding to get the predicted values of target variable(y).
In the first line of the function you are initiating a weight matrix called temp to zeros. This basically is the starting point of the
final weight matrix(theta) that the function will output in the end. The params variable is basically the number of weights or number of predictor variables.
This line makes more sense in the context of neural networks where you create abstraction of features.
In linear regression, the weight matrix will mostly be a one dimensional array. In a typical feed forward neural networks we take in a lets say 5 features,
convert it to maybe 4 or 6 or n number of features by using a weight matrix and so on. In linear regression we simply distill all the input
features into one output feature. So theta will essentially be equivalent to a row vector(which is evident by the line temp[0,j]=...) and params will be the number of features. We cost array just stores
that array at each iteration. Now on to the two loops. In the line err = (X * theta.T) - y under the first for loop, we are calculating the prediction
error for each training example. The shape of the err variable be number of examples,1. In the second for loop we are training the model,
that is we are gradually updating our term matrix. We run the second for loop iters
number of times. In Neural network context, we generally call it epochs. It is basically the number of times you want to train the model.
Now the line term = np.multiply(err, X[:,j]): here we are calculating the individual adjustment that should be made to each weight in the temp matrix.
We define the cost as (y_predicted-y_actual)**2/number_of_training_points, where y_predicted = X_1*w_1 + X_2*w_2 + X_3*w_3 +... If we differentiate this cost vwith respect to
a particular weight(say W_i) we get (y_predicted-y_actual)*(X_i)/number_of_training_points where X_i is the column that W_i multiplies with. So the term = line
basically computes that differentiation part. We can multiply the term variable with the learning rate and subtract from W_i.
But as you might notice the term variable is an array. But that is taken care of in the next line.
In the next line we take the average of the term (by summing it up and then dividing it by len(X))
and subtract it from the corresponding weight in the temp matrix. Once the weights have been updated and stored and the temp matrix,
we replace the original theta by temp. We repeat this process iters number of times
If you , instead of writing it like ((alpha / len(X)) * np.sum(term)), write it like (alpha * (np.sum(term) / len(X))) ,which you're allowed to do since multiplication and division are comutative (if i remember the term correctly) then you just multiply alpha by the average error, since term is length X anyway.
This means that you substract the learning rate (alpha) times average error which would be something like X[j] * tetha (actual) - y (ideal) , which incidentally is also close enough to the derivative of (X*tetha - y)^2