I am taking this Coursera class on machine learning / linear regression. Here is how they describe the gradient descent algorithm for solving for the estimated OLS coefficients:

while not converged:
    w(t+1) = w(t) - eta * gradient(RSS(w(t)))

So they use w for the coefficients, H for the design matrix (or "features" as they call it), and y for the dependent variable. Their convergence criterion is the usual one, that the norm of the gradient of RSS be less than a tolerance epsilon; that is, their definition of "not converged" is:

||gradient(RSS(w))|| = ||-2 * H^T * (y - H*w)|| > epsilon
I am having trouble getting this algorithm to converge and was wondering whether I was overlooking something in my implementation. Below is the code. Please note that I also ran the sample dataset I use in it (df) through the statsmodels regression library, just to confirm that a regression could converge and to get coefficient values to tie out with. It did converge, and the coefficients were:
Intercept 4.344435
x1 4.387702
x2 0.450958
Here is my implementation. At each iteration, it prints the norm of the gradient of RSS:
import numpy as np
import numpy.linalg as LA
import pandas as pd
from pandas import DataFrame
# First define the grad function: grad(RSS) = -2H'(y-Hw)
def grad_rss(df, var_name_y, var_names_h, w):
    # Set up feature matrix H with an intercept column of ones
    H = DataFrame({"Intercept": [1 for i in range(0, len(df))]})
    for var_name_h in var_names_h:
        H[var_name_h] = df[var_name_h]
    # Set up y vector
    y = df[var_name_y]
    # Calculate the gradient of the RSS: -2H'(y - Hw)
    result = -2 * np.transpose(H.values) @ (y.values - H.values @ w)
    return result
def ols_gradient_descent(df, var_name_y, var_names_h, epsilon=0.0001, eta=0.05):
    # Set all initial w values to 0.0001 (not related to our choice of epsilon)
    w = np.array([0.0001 for i in range(0, len(var_names_h) + 1)])
    # Iteration counter
    t = 0
    # Basic algorithm: keep subtracting eta * grad(RSS) from w until
    # ||grad(RSS)|| < epsilon.
    while True:
        t = t + 1
        grad = grad_rss(df, var_name_y, var_names_h, w)
        norm_grad = LA.norm(grad)
        if norm_grad < epsilon:
            break
        else:
            print("{} : {}".format(t, norm_grad))
            w = w - eta * grad
            if t > 10:
                raise Exception("Failed to converge")
    return w
# ##########################################
df = DataFrame({
    "y"  : [20, 40, 60, 80, 100],
    "x1" : [1, 5, 7, 9, 11],
    "x2" : [23, 29, 60, 85, 99]
})
# Run
ols_gradient_descent(df, "y", ["x1", "x2"])
Unfortunately this does not converge, and in fact prints a norm that is exploding with each iteration:
1 : 44114.31506051333
2 : 98203544.03067812
3 : 218612547944.95386
4 : 486657040646682.9
5 : 1.083355358314664e+18
6 : 2.411675439503567e+21
7 : 5.368670935963926e+24
8 : 1.1951287949674022e+28
9 : 2.660496151835357e+31
10 : 5.922574875391406e+34
11 : 1.3184342751414824e+38
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
......
Exception: Failed to converge
If I increase the maximum number of iterations, it still doesn't converge; the norm just blows up toward infinity.
Is there an implementation error here, or am I misinterpreting the explanation in the class notes?
Updated w/ Answer
As @Kant suggested, eta needs to be updated at each iteration. The course itself had some sample formulas for this, but none of them helped with convergence. This section of the Wikipedia page about gradient descent mentions the Barzilai-Borwein approach as a good way of updating eta. I implemented it and altered my code to update eta with it at each iteration, and the regression converged successfully. Translating the Wikipedia version of the formula to the variables used in the regression:

eta_t = ((w_t - w_{t-1})^T (grad_t - grad_{t-1})) / ||grad_t - grad_{t-1}||^2

Below is the code that implements it. Again, this code is called in the loop of my original ols_gradient_descent to update eta (a sketch of that wiring follows the function).
def eta_t(w_t, w_t_minus_1, grad_t, grad_t_minus_1):
    delta_w = w_t - w_t_minus_1
    delta_grad = grad_t - grad_t_minus_1
    eta_t = (delta_w.T @ delta_grad) / (LA.norm(delta_grad))**2
    return eta_t
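Here is a minimal sketch of that wiring (assembled from the functions above; ols_gradient_descent_bb, w_prev, and grad_prev are names introduced here for illustration):

def ols_gradient_descent_bb(df, var_name_y, var_names_h, epsilon=0.0001, eta=0.05):
    # Sketch only: same setup as ols_gradient_descent above, but eta is
    # recomputed each step with the Barzilai-Borwein formula.
    w = np.array([0.0001 for i in range(0, len(var_names_h) + 1)])
    w_prev, grad_prev = None, None
    while True:
        grad = grad_rss(df, var_name_y, var_names_h, w)
        if LA.norm(grad) < epsilon:
            return w
        if w_prev is not None:
            # BB step size from the previous iterate; the initial eta is
            # only used for the very first step.
            eta = eta_t(w, w_prev, grad, grad_prev)
        w_prev, grad_prev = w, grad
        w = w - eta * grad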
Try decreasing the value of eta. Gradient descent can diverge if eta is too high: each step overshoots the minimum, so the gradient norm grows geometrically instead of shrinking, which is exactly the explosion you see in your output.
Related
I am trying to fit a complex conductivity model (the Drude-Smith-Anderson model) using lmfit.minimize. In that fit, I want constraints on my parameters c and c1 such that 0 < c < 1, -1 < c1 < 0, and 0 < 1 + c1 - c < 1. So I am using the following code:
#reference: Juluri B.K. "Fitting Complex Metal Dielectric Functions with Differential Evolution Method". http://juluribk.com/?p=1597.
#reference: https://lmfit.github.io/lmfit-py/fitting.html
#import libraries (numdifftools needs to be installed but doesn't need to be imported)
import matplotlib.pyplot as plt
import numpy as np
import lmfit as lmf
import math as mt
#define the complex conductivity model
def model(params, w):
    sigma0 = params["sigma0"].value
    tau = params["tau"].value
    c = params["c"].value
    d = params["d"].value
    c1 = params["c1"].value
    druidanderson = (sigma0/(1 - 1j*2*mt.pi*w*tau))*(1 + c1/(1 - 1j*2*mt.pi*w*tau)) - sigma0*c/(1 - 1j*2*mt.pi*w*d*tau)
    return druidanderson
#defining the complex residues (chi squared is sum of squares of residues)
def complex_residuals(params, w, exp_data):
    delta = model(params, w)
    residual = (abs((delta.real - exp_data.real) / exp_data.real) +
                abs((delta.imag - exp_data.imag) / exp_data.imag))
    return residual
# importing data from CSV file
importpath = input("Path of CSV file: ") #Asking the location of where your data file is kept (give input in form of path\name.csv)
frequency = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(0)) #path to be changed to the file from which data is taken
conductivity = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(1)) + 1j*np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(2)) #path to be changed to the file from which data is taken
frequency = frequency[np.logical_not(np.isnan(frequency))]
conductivity = conductivity[np.logical_not(np.isnan(conductivity))]
w_for_fit = frequency
eps_for_fit = conductivity
#defining the bounds and initial guesses for the fitting parameters
params = lmf.Parameters()
params.add("sigma0", value = float(input("Guess for \u03C3\u2080: ")), min =10 , max = 5000) #bounds have to be changed manually
params.add("tau", value = float(input("Guess for \u03C4: ")), min = 0.0001, max =10) #bounds have to be changed manually
params.add("c1", value = float(input("Guess for c1: ")), min = -1 , max = 0) #bounds have to be changed manually
params.add("constraint", value = float(input("Guess for constraint: ")), min = 0, max=1)
params.add("c", expr="1+c1-constraint", min = 0, max = 1) #bounds have to be changed manually
params.add("d", value = float(input("Guess for \u03C4_1/\u03C4: ")),min = 100, max = 100000) #bounds have to be changed manually
# minimizing the chi square
minimizer_results = lmf.minimize(complex_residuals, params, args=(w_for_fit, eps_for_fit),
                                 method='differential_evolution', strategy='best1bin',
                                 popsize=50, tol=0.01, mutation=(0, 1), recombination=0.9,
                                 seed=None, callback=None, disp=True, polish=True,
                                 init='latinhypercube')
lmf.printfuncs.report_fit(minimizer_results, show_correl=False)
As a result for the fit, I get the following output:
sigma0: 3489.38961 (init = 1000)
tau: 1.2456e-04 (init = 0.01)
c1: -0.99816132 (init = -1)
constraint: 0.98138820 (init = 1)
c: 0.00000000 == '1+c1-constraint'
d: 7333.82306 (init = 1000)
These values don't make any sense: plugging the results back in, 1 + c1 - constraint = -0.97954952, which is not the reported c value of 0 and is thus invalid. How can I fix this issue?
Your code is not runnable. The use of input() is sort of stunning - please do not do that. Write code that is pleasant to read and separates i/o from logic.
To make a floating point residual from a complex array, use complex_array.view(float)
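For example (a small illustration I've added; the array below is made up):

import numpy as np
# .view(float) reinterprets a length-n complex array as a length-2n float
# array with real and imaginary parts interleaved:
delta = np.array([1.0 + 2.0j, 3.0 - 4.0j])
print(delta.view(float))   # [ 1.  2.  3. -4.]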
Guessing any parameter value to be at or very close to its limit (here, c) is a very bad idea, likely to make the fit harder.
More to your question, you defined c as "evaluate 1+c1-constraint and then apply the bounds min=0, max=1". That is literally, precisely, and exactly what your
params.add("c", expr="1+c1-constraint", min = 0, max = 1)
means: calculate c as 1+c1-constraint, and then apply the bounds [0, 1]. The code is doing exactly what you told it to do.
Unless you know what you are doing (I suspect maybe not ;)), I would strongly advise doing a fit with the default leastsq method before trying to use differential_evolution. It turns out that differential_evolution is not a very good global fitting method (shgo is generally better, though no "global" solver should be considered very reliable). But unless you know that you need such a method, you probably do not.
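A minimal sketch of that default fit, reusing the params and residual function defined in the question (this is my assumption of how it would be called):

# Same objective and parameters, but the default Levenberg-Marquardt solver:
minimizer_results = lmf.minimize(complex_residuals, params,
                                 args=(w_for_fit, eps_for_fit),
                                 method='leastsq')
lmf.printfuncs.report_fit(minimizer_results, show_correl=False)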
I would also strongly advise you to plot your data and some models evaluated with what you think are reasonable parameters.
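Something along these lines (a sketch using the question's variables; the marker and label choices are mine):

fig, ax = plt.subplots()
ax.plot(w_for_fit, eps_for_fit.real, "o", label="data (real)")
ax.plot(w_for_fit, eps_for_fit.imag, "s", label="data (imag)")
guess = model(params, w_for_fit)   # model evaluated at the initial guesses
ax.plot(w_for_fit, guess.real, "-", label="model (real)")
ax.plot(w_for_fit, guess.imag, "--", label="model (imag)")
ax.set_xlabel("frequency")
ax.set_ylabel("conductivity")
ax.legend()
plt.show()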
Part 1
I'm going through this article and wanted to try to calculate a forward and backward pass with batch normalization.
When doing the steps after the first layer, I get a batch norm output that is equal for all features.
Here is the code (I have deliberately done it in very small steps):
w1 = np.array([[0.3, 0.4], [0.5, 0.1], [0.2, 0.3]])
X = np.array([[0.7, 0.1], [0.3, 0.8], [0.4, 0.6]])

def mu(x, axis=0):
    return np.mean(x, axis=axis)

def sigma(z, mu):
    Ai = np.sum(z, axis=0)
    return np.sqrt((1/len(Ai)) * (Ai - mu)**2)

def Ai(z):
    return np.sum(z, axis=0)

def norm(Ai, mu, sigma):
    return (Ai - mu)/sigma

z1 = np.dot(w1, X.T)
mu1 = mu(z1)
A1 = Ai(z1)
sigma1 = sigma(z1, mu1)
gamma1 = np.ones(len(A1))
beta1 = np.zeros(len(A1))
Ahat = norm(A1, mu1, sigma1)  # since gamma is just ones it doesn't change anything here
The output I get from this is:
[1.73205081 1.73205081 1.73205081]
Part 2
In this image:
Should the sigma_mov and mu_mov be set to zero for the first layer?
EDIT: I think I found what I did wrong. In the normalization step I used A1 and not z1. Also, I think I found that it's normal to initialize the moving averages with zeros for the mean and ones for the variance. It would be nice if anyone could confirm this.
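For reference, here is a minimal sketch of the fix described in the edit (this is my reading of it, assuming samples run along the columns of z1 after the transpose): normalize the pre-activations z1 per hidden unit, rather than the column sums A1.

z1 = np.dot(w1, X.T)                        # shape (3, 3): 3 hidden units x 3 samples
mu1 = z1.mean(axis=1, keepdims=True)        # per-unit mean over the batch
var1 = z1.var(axis=1, keepdims=True)        # per-unit variance over the batch
z1_hat = (z1 - mu1) / np.sqrt(var1 + 1e-8)  # small epsilon for numerical stability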
I've been trying to get my head around frequentist and Bayesian approaches on a toy A/B test problem.
The results don't really make sense to me, and I am struggling to tell whether I have computed them correctly (an error on my part is quite likely). Furthermore, after much research, I am still somewhat lost as to how to compute Bayes Factors. I've seen packages in R that make this look somewhat easy; alas, I am not familiar with R and would prefer to solve this problem in Python.
I would greatly appreciate any help and guidance regarding this!
Here is the data:
# imports
import pingouin as pg
import pymc3 as pm
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.stats.api as sms
import math
import matplotlib.pyplot as plt
# A = control -- B = treatment
a_success = 10730
a_failure = 61988
a_total = a_success + a_failure
a_cr = a_success / a_total
b_success = 10966
b_failure = 60738
b_total = b_success + b_failure
b_cr = b_success / b_total
I started by doing some power analysis to determine the required number of samples with a power of 0.8, an alpha of 0.05, and a practical significance of 2%. I'm not sure whether the effect size should be computed from the baseline rate and baseline + 2 percentage points, or from the baseline rate and a 2% relative lift on it. Depending on that choice, the required number of samples changes dramatically.
# determine required sample size
baseline_rate = a_cr
practical_significance = 0.02
alpha = 0.05
power = 0.8
nobs1 = None
# is this how to calculate effect size?
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance)  # gives sample_size ≈ 5204
# # or this?
# effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + baseline_rate * practical_significance)  # gives sample_size ≈ 228583
sample_size = sms.NormalIndPower().solve_power(effect_size=effect_size,
                                               power=power,
                                               alpha=alpha,
                                               nobs1=nobs1,
                                               ratio=1)
I continued by trying to determine whether the null hypothesis could be rejected:
# calculate pooled probability
pooled_probability = (a_success + b_success) / (a_total + b_total)
# calculate pooled standard error and margin of error
se_pooled = math.sqrt(pooled_probability * (1 - pooled_probability) * (1 / b_total + 1 / a_total))
z_score = scs.norm.ppf(1 - alpha / 2)
margin_of_error = se_pooled * z_score
# the estimated difference between probability of conversions of both groups
d_hat = (b_success / b_total) - (a_success / a_total)
# test if null hypothesis can be rejected
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error
if practical_significance < lower_bound:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'do not reject the null ...' despite group B (treatment) showing a 3.65% relative improvement in conversion rate over group A (control), which seems... odd?
I tried a slightly different approach (I guess a slightly different hypothesis?):
successes = [a_success, b_success]
nobs = [a_total, b_total]
z_stat, p_value = sms.proportions_ztest(successes, nobs=nobs)
(lower_a, lower_b), (upper_a, upper_b) = sms.proportion_confint(successes, nobs=nobs, alpha=alpha)
if p_value < alpha:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
Which evaluates to 'reject null hypothesis ... ' with p-value: 0.004236. This seems highly contradictory, especially since the p-value is < 0.01.
On to Bayes... I created some arrays of successes and failures and, due to how long this thing takes, only used the first 1,000 observations of each, then ran the following:
# generate lists of 1, 0
obs_a = np.repeat([1, 0], [a_success, a_failure])
obs_b = np.repeat([1, 0], [b_success, b_failure])
for _ in range(10):
    np.random.shuffle(obs_a)
    np.random.shuffle(obs_b)
with pm.Model() as model:
    p_A = pm.Beta("p_A", 1, 1)
    p_B = pm.Beta("p_B", 1, 1)
    delta = pm.Deterministic("delta", p_A - p_B)
    obs_A = pm.Bernoulli("obs_A", p_A, observed=obs_a[:1000])
    obs_B = pm.Bernoulli("obs_B", p_B, observed=obs_b[:1000])
    step = pm.NUTS()
    trace = pm.sample(1000, step=step, chains=2)
Firstly, I understand that you are supposed to burn some proportion of the trace -- how do you determine an appropriate number of indices to burn?
In trying to evaluate the posterior probabilities, is the following code the correct way to do this?
b_lift = (trace['p_B'].mean() - trace['p_A'].mean()) / trace['p_A'].mean() * 100
b_prob = np.mean(trace["delta"] > 0)
a_lift = (trace['p_A'].mean() - trace['p_B'].mean()) / trace['p_B'].mean() * 100
a_prob = np.mean(trace["delta"] < 0)
# is the Bayes Factor just the ratio of the posterior probabilities for these two models?
BF = (trace['p_B'] / trace['p_A']).mean()
print(f'There is {b_prob} probability B outperforms A by a magnitude of {round(b_lift, 2)}%')
print(f'There is {a_prob} probability A outperforms B by a magnitude of {round(a_lift, 2)}%')
print('BF:', BF)
-- output:
There is 0.666 probability B outperforms A by a magnitude of 1.29%
There is 0.334 probability A outperforms B by a magnitude of -1.28%
BF: 1.013357654428127
I suspect that this is not the correct way to calculate Bayes Factors. How can the Bayes Factor be calculated?
I really hope you can help me understand all of the above... I realize it's an exceptionally long post. But I've tried every resource I can find and am still stuck!
Kind regards.
I previously implemented the original Bayesian Probabilistic Matrix Factorization (BPMF) model in pymc3. See my previous question for reference, data source, and problem setup. Per the answer to that question from @twiecki, I've implemented a variation of the model using LKJCorr priors for the correlation matrices and uniform priors for the standard deviations. In the original model, the covariance matrices are drawn from Wishart distributions, but due to current limitations of pymc3, the Wishart distribution cannot be sampled from properly. This answer to a loosely related question provides a succinct explanation for the choice of LKJCorr priors. The new model is below.
import logging
import pymc3 as pm
import numpy as np
import theano.tensor as t

n, m = train.shape
dim = 10    # dimensionality
beta_0 = 1  # scaling factor for lambdas; unclear on its use
alpha = 2   # fixed precision for likelihood function
std = .05   # how much noise to use for model initialization

# We will use separate priors for sigma and correlation matrix.
# In order to convert the upper triangular correlation values to a
# complete correlation matrix, we need to construct an index matrix:
n_elem = dim * (dim - 1) // 2
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)
logging.info('building the BPMF model')
with pm.Model() as bpmf:
    # Specify user feature matrix
    sigma_u = pm.Uniform('sigma_u', shape=dim)
    corr_triangle_u = pm.LKJCorr(
        'corr_u', n=1, p=dim,
        testval=np.random.randn(n_elem) * std)

    corr_matrix_u = corr_triangle_u[tri_index]
    corr_matrix_u = t.fill_diagonal(corr_matrix_u, 1)
    cov_matrix_u = t.diag(sigma_u).dot(corr_matrix_u.dot(t.diag(sigma_u)))
    lambda_u = t.nlinalg.matrix_inverse(cov_matrix_u)

    mu_u = pm.Normal(
        'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
        testval=np.random.randn(dim) * std)
    U = pm.MvNormal(
        'U', mu=mu_u, tau=lambda_u,
        shape=(n, dim), testval=np.random.randn(n, dim) * std)

    # Specify item feature matrix
    sigma_v = pm.Uniform('sigma_v', shape=dim)
    corr_triangle_v = pm.LKJCorr(
        'corr_v', n=1, p=dim,
        testval=np.random.randn(n_elem) * std)

    corr_matrix_v = corr_triangle_v[tri_index]
    corr_matrix_v = t.fill_diagonal(corr_matrix_v, 1)
    cov_matrix_v = t.diag(sigma_v).dot(corr_matrix_v.dot(t.diag(sigma_v)))
    lambda_v = t.nlinalg.matrix_inverse(cov_matrix_v)

    mu_v = pm.Normal(
        'mu_v', mu=0, tau=beta_0 * lambda_v, shape=dim,
        testval=np.random.randn(dim) * std)
    V = pm.MvNormal(
        'V', mu=mu_v, tau=lambda_v,
        shape=(m, dim), testval=np.random.randn(m, dim) * std)

    # Specify rating likelihood function
    R = pm.Normal(
        'R', mu=t.dot(U, V.T), tau=alpha * np.ones((n, m)),
        observed=train)
# `start` is the start dictionary obtained from running find_MAP for PMF.
# See the previous post for PMF code.
for key in bpmf.test_point:
    if key not in start:
        start[key] = bpmf.test_point[key]

with bpmf:
    step = pm.NUTS(scaling=start)
The goal with this reimplementation was to produce a model that could be estimated using the NUTS sampler. Unfortunately, I'm still getting the same error at the last line:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [ 0 1 2 3 ... 1030 1031 1032 1033 1034 ]
I've made all the code for PMF, BPMF, and this modified BPMF available in this gist to make it simple to replicate the error. All you need to do is download the data (also referenced in the gist).
It looks like you are passing the complete precision matrix into the normal distribution:
mu_u = pm.Normal(
    'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
    testval=np.random.randn(dim) * std)
I assume you only want to pass the diagonal values:
mu_u = pm.Normal(
    'mu_u', mu=0, tau=beta_0 * t.diag(lambda_u), shape=dim,
    testval=np.random.randn(dim) * std)
Does this change to mu_u and mu_v fix it for you?
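(For reference: t.diag applied to a matrix extracts its diagonal, the same way np.diag does, so beta_0 * t.diag(lambda_u) is a length-dim vector of the diagonal precision values. A quick numpy illustration with made-up numbers:)

import numpy as np
lam = np.array([[2.0, 0.3],
                [0.3, 1.5]])
print(np.diag(lam))   # [2.  1.5] -- the per-dimension diagonal entries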
I'm trying to build my own implementation of the neural network back propagation algorithm. The code I have written for training is this so far:
def train(x, labels, n):
    lam = 0.5
    w1 = np.random.uniform(0, 0.01, (20, 120))  # weights
    w2 = np.random.uniform(0, 0.01, 20)
    for i in xrange(n):
        w1 = w1/np.linalg.norm(w1)
        w2 = w2/np.linalg.norm(w2)
        for j in xrange(x.shape[0]):
            y1 = np.zeros((600))  # output
            d1 = np.zeros((20))
            p = np.mat(x[j, :])
            a = np.dot(w1, p.T)  # activation
            z = 1/(1 + np.exp((-1)*a))
            y1[j] = np.dot(w2, z)
            for k in xrange(20):
                d1[k] = z[k]*(1 - z[k])*(y1[j] - labels[j])*np.sum(w2)  # delta update rule
                w1[k, :] = w1[k, :] - lam*d1[k]*x[j, :]  # weight update
                w2[k] = w2[k] - lam*(y1[j] - labels[j])*z[k]
            E = 1/2*pow((y1[j] - labels[j]), 2)  # mean squared error
            print E
    return 0
Number of input units: 120
Number of hidden units: 20
Number of output units: 1
Number of training samples: 600
x is a 600x120 training set with zero mean and unit variance (max value 3.28, min value -4.07). The first 200 samples belong to class 1, the second 200 to class 2, and the last 200 to class 3. labels holds the class label assigned to each sample, and n is the number of iterations required for convergence. Each sample has 120 features.
I have initialized the weights between 0 and 0.01, and the input data is scaled to zero mean and unit variance, yet the code throws an overflow warning, resulting in the activation values 'a' becoming NaN. I can't understand what the problem is.
Every sample has 120 elements. A sample row of x :
[ 0.80145231 1.29567936 0.91474224 1.37541992 1.16183938 1.43947296
1.32440357 1.43449479 1.32742415 1.40533852 1.28817561 1.37977183
1.2290933 1.34720161 1.15877069 1.29699635 1.05428735 1.21923531
0.92312685 1.1061345 0.66647463 1.00044203 0.34270708 1.05589558
0.28770958 1.21639524 0.31522575 1.32862243 0.42135899 1.3997094
0.5780146 1.44444501 0.75872771 1.47334256 0.95372771 1.48878048
1.13968139 1.49119962 1.33121905 1.47326017 1.47548571 1.4450047
1.58272343 1.39327328 1.62929132 1.31126604 1.62705274 1.21790335
1.59951034 1.12756958 1.56253815 1.04096709 1.52651382 0.95942134
1.48875633 0.87746762 1.45248623 0.78782313 1.40446404 0.68370011
Overflow
The logistic sigmoid function is prone to overflow in NumPy as the signal strength increases. Try adding the following line:
signal = np.clip( signal, -500, 500 )
This will limit the values in the NumPy array to the given interval. In turn, this will prevent the precision overflow in the sigmoid function. I find ±500 to be a convenient signal saturation level.
>>> arr
array([[-900, -600, -300],
[ 0, 300, 600]])
>>> np.clip( arr, -500, 500)
array([[-500, -500, -300],
[ 0, 300, 500]])
Implementation
This is the snippet I'm using in my projects:
def sigmoid_function( signal ):
    # Prevent overflow.
    signal = np.clip( signal, -500, 500 )
    # Calculate activation signal
    signal = 1.0/( 1 + np.exp( -signal ))
    return signal
#end
Why does the sigmoid function overflow?
As training progresses, the activation function's precision improves. The sigmoid signal will converge to 1 from below or 0 from above as the accuracy approaches perfection, e.g. either 0.99999999999... or 0.00000000000000001... For that to happen, the pre-activation values have to grow very large in magnitude.
Since NumPy maintains the highest possible floating point precision rather than silently saturating, np.exp eventually overflows on such inputs and emits the warning you see.
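A quick demonstration of the overflow (a minimal example of my own):

import numpy as np
a = np.array([-1000.0])
print(1.0/(1 + np.exp(-a)))   # RuntimeWarning: overflow encountered in exp; prints [0.]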
Note: This error message could be ignored by setting:
np.seterr( over='ignore' )