Escaping local minimum with tensorflow - python

I am solving this system of equations with tensorflow:
f1 = y - x*x = 0
f2 = x - (y - 2)*(y - 2) + 1.1 = 0
If I choose a bad starting point (x,y)=(-1.3,2), then I get stuck in a local minimum when optimising f1^2+f2^2 with this code:
import tensorflow as tf

# x and y are the variables being optimised, initialised at the starting point from the question
x = tf.Variable(-1.3, dtype=tf.float32)
y = tf.Variable(2.0, dtype=tf.float32)

f1 = y - x*x
f2 = x - (y - 2)*(y - 2) + 1.1
sq = f1*f1 + f2*f2
o = tf.train.AdamOptimizer(1e-1).minimize(sq)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run([init])
    for i in range(50):
        sess.run([o])
        r = sess.run([x, y, f1, f2])
        print("x", r)
How can I escape this local minimum with built-in tensorflow tools? Maybe there is another TF approach I could use to solve this system starting from this bad point?

At the moment, there is no global optimization method built into tensorflow. There is a window onto the scipy world via ScipyOptimizerInterface, but it (currently?) only wraps scipy's minimize, which is a local minimizer.
However, you can still treat tensorflow's execution result as any other function that can be fed to the optimizer of your choice. Say you want to experiment with scipy's basinhopping global optimizer. You could write
import numpy as np
from scipy.optimize import basinhopping
import tensorflow as tf

v = tf.placeholder(dtype=tf.float32, shape=(2,))
x = v[0]
y = v[1]
f1 = y - x*x
f2 = x - (y - 2)*(y - 2) + 1.1
sq = f1 * f1 + f2 * f2
starting_point = np.array([-1.3, 2.0], np.float32)

with tf.Session() as sess:
    o = basinhopping(lambda x: sess.run(sq, {v: x}), x0=starting_point, T=10, niter=1000)

print(o.x)
# [0.76925635 0.63757862]
(I had to tweak basinhopping's temperature and number of iterations, as the default values would often not let the solution get out of the basin of the local minimum taken as the starting point here.)
What you lose by treating tensorflow as a black box to the optimizer is that the latter does not have access to the gradients that are automatically computed by tensorflow. In that sense, it is not optimal -- though you still benefit from GPU acceleration when computing your function.
EDIT
Since you can explicitly provide gradients to the local minimizer used by basinhopping, you could feed in the result of tensorflow's gradients:
import numpy as np
from scipy.optimize import basinhopping
import tensorflow as tf

v = tf.placeholder(dtype=tf.float32, shape=(2,))
x = v[0]
y = v[1]
f1 = y - x*x
f2 = x - (y - 2)*(y - 2) + 1.1
sq = f1 * f1 + f2 * f2
sq_grad = tf.gradients(sq, v)[0]
init_value = np.array([-1.3, 2.0], np.float32)

with tf.Session() as sess:
    def f(x):
        return sess.run(sq, {v: x})

    def g(x):
        return sess.run(sq_grad, {v: x})

    o = basinhopping(f, x0=init_value, T=10.0, niter=1000, minimizer_kwargs={'jac': g})

print(o.x)
# [0.79057982 0.62501636]
For some reason, this is much slower than without providing the gradient -- though it could be that when gradients are provided the minimization algorithm is not the same, so the comparison may not make sense.
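Another thing that might be worth trying (just a sketch, I have not measured it): fetch the value and the gradient in a single sess.run call, so each objective evaluation costs one session round trip instead of two. scipy's minimizers accept jac=True to mean "the objective returns (value, gradient)", so inside the with block above you could replace f and g by:

def fg(x):
    # one session call returns both the objective and its gradient
    val, grad = sess.run([sq, sq_grad], {v: x})
    return val, grad

o = basinhopping(fg, x0=init_value, T=10.0, niter=1000, minimizer_kwargs={'jac': True})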

Tensorflow (TF) does not include built-in global optimization methods. Depending on the initialization, all gradient-based methods (such as Adam) in TF can converge to a local minimum for non-convex loss functions. This is generally acceptable (if not desirable) for large neural networks, given the over-fitting issues that arise when approaching the global minimum.
For this particular problem, what you may want are the root-finding routines from scipy:
https://docs.scipy.org/doc/scipy/reference/optimize.html#root-finding
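For example, a minimal sketch using scipy.optimize.root on your system, solving it directly instead of minimizing f1^2 + f2^2 (whether it converges from the bad starting point depends on the chosen method, so check sol.success):

import numpy as np
from scipy.optimize import root

def system(p):
    x, y = p
    return [y - x*x,
            x - (y - 2)*(y - 2) + 1.1]

sol = root(system, x0=np.array([-1.3, 2.0]))  # default MINPACK 'hybr' method
print(sol.success, sol.x)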

Related

how to do curve fitting using google jax?

Extending the examples from http://implicit-layers-tutorial.org/neural_odes/, I am trying to mimic the curve fitting function in scipy, scipy.optimize.curve_fit, using google jax. The function to be fitted is a first-order ODE.
# Generate toy data for a first order ODE.
import jax.numpy as jnp
import jax
import numpy as np

# input data
u = np.zeros(100)
u[10:50] = 1
t = np.arange(len(u))
u = jnp.array(u)

# first order ODE
def f(y, t, k, tau, u):
    return (k*u[t] - y)/tau

# Euler integration
def odeint_euler(f, y0, t, *args):
    def step(state, t):
        y_prev, t_prev = state
        dt = t - t_prev
        y = y_prev + dt * f(y_prev, t_prev, *args)
        return (y, t), y
    _, ys = jax.lax.scan(step, (y0, t[0]), t[1:])
    return ys

pred = odeint_euler(f, jnp.array([0.0]), t, 2., 5., u)
pred_noise = pred.reshape(-1) + 0.05*np.random.randn(len(pred))  # this is the data to be fitted

# define loss function
def loss_function(params, u, targets):
    k, tau = params
    pred = odeint_euler(f, jnp.array([0.0]), t, k, tau, u)
    return jnp.sum((pred - targets)**2)

def update(params, u, targets):
    grads = jax.grad(loss_function)(params, u, targets)
    return [w - 0.0001 * dw for w, dw in zip(params, grads)]

updated_params = jnp.array([1.0, 2.0])  # initial parameters
for i in range(100):
    updated_params = update(updated_params, u, pred_noise)

print(updated_params)
The code works fine. However, it runs pretty slowly compared to scipy's curve_fit, and the accuracy of the solution is not good even after 500 or 1000 iterations.
What is wrong with the above code? Any idea how to make it run faster and give a more accurate solution? Is there a better way of doing the curve fitting with jax?
I see two overall issues with your approach:
The reason your code is running slowly is that you are doing your looping in Python, which incurs JAX's dispatch overhead on every iteration. I'd recommend using JAX's built-in tools for minimization of loss functions; for example:
from jax.scipy.optimize import minimize

result = minimize(
    loss_function, x0=jnp.array([1.0, 2.0]),
    method='BFGS', args=(u, pred_noise))
The reason your accuracy does not approach that of scipy is likely because JAX defaults to 32-bit computations (See Double (64 bit) Precision). To run your code in 64-bit, you can run this block before any other imports:
from jax import config
config.update('jax_enable_x64', True)
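If you'd rather keep your explicit gradient-descent loop, another option is to jit-compile the update step so each iteration is a single compiled call instead of many small dispatches. A sketch, assuming loss_function, u and pred_noise are defined as in your question:

import jax
import jax.numpy as jnp

@jax.jit
def update_jit(params, u, targets):
    grads = jax.grad(loss_function)(params, u, targets)
    return params - 0.0001 * grads  # same learning rate as the original loop

params = jnp.array([1.0, 2.0])  # initial parameters
for i in range(100):
    params = update_jit(params, u, pred_noise)
print(params)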

Neural Network Backpropagation code not working

I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train on the xor data, so let's presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0,1.
I worked out the equations for backpropagation and found that, despite being so simple, my code does not converge to correct predictions on the xor data.
Let W be the 3x3 matrix of weights connecting the input and hidden layer, and w be the 1x3 matrix that connects the hidden to output layer. Here are some helper functions for my method
def feed_forward_predict(x, W, w):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    z = np.array(list(map(sigmoid, np.matmul(W, x))))
    L = sigmoid(np.matmul(w, z))
    return [L, z, x]
this just takes in a value and makes a prediction using the formula sig(w*sig(W*x)). We also have
def calculate_objective(data, labels, W, w):
    obj = 0
    for point, label in zip(data, labels):
        L, z, x = feed_forward_predict(point, W, w)
        obj += (label - L)**2
    return obj
which calculates the Mean Squared Error for a bunch of given data points. Both of these functions should work, as I checked them by hand. Now the problem comes in with the backpropagation algorithm
def back_prop(traindata, trainlabels):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
    W = np.random.rand(3, len(traindata[0]))
    w = np.random.rand(1, 3)
    obj = calculate_objective(traindata, trainlabels, W, w)
    print(obj)
    epochs = 10_000
    eta = .01
    prevobj = np.inf
    i = 0
    while(i < epochs):
        prevobj = obj
        dellw = np.zeros((1,3))
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
        w -= eta * dellw
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
            # Note that s,u,v represent the hidden node weights. My professor required it this way
            dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
            dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
            dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
            dellW = np.array([dells, dellu, dellv])
        W -= eta*dellW
        obj = calculate_objective(traindata, trainlabels, W, w)
        i = i + 1
        print("i=", i, " Objective=", obj)
    return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and derivatives I took, does not converge to anything. In fact the error consistently bounces: it will fall, then rise, then fall back to the same spot, then rise again. I believe that the problem lies with the W matrix gradient, but I do not know what exactly it is.
If you'd like to see for yourself what is happening, the input data I used is
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed to np.random.seed(0) just so that the matrices I'm dealing with stay consistent.
It appears you are attempting to set up a manual version of stochastic gradient descent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are doing with so many loops and inconsistencies. In general, it defeats the purpose of using np.array() if you are using loops. Likewise, you should be clear about the difference between elementwise multiplication (*) and matrix products (np.matmul(), i.e. the @ operator, and np.dot()). It is unclear how you are using the derivative. You have it explicitly stated at the start for the activation function and then partially derived in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data; those should be globals. Those should also be derived all at once based on your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) means we are solving for a fixed set of weights given a random variable, in stochastic gradient descent we are doing the opposite: we fix the random variable to our data and optimize our weights. Hence, your functions should only have your weights as "free variables"; everything else is fixed. It is important to track what is fixed and what is free to update. Your code does not reflect that you know what is being updated and what is fixed.
SGD algorithm outline:
1. Start with random params.
2. Update the params by moving them a small step in the direction of steepest descent.
3. Repeat step (2) for a specified number of iterations.
4. Print your params.
Example of SGD code (performing SGD to find the line of best fit for some data):
import numpy as np

# Data
X = np.random.random((100,))  # Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,))  # Linear model + Noise

# Functions (only free variable is the params) (we want the F of best fit under MSE)
F = lambda p: p[0]*X + p[1]
dF = lambda p: np.array([X, np.ones(X.shape)])
MSE = lambda p: (1/Y.shape[0])*((Y - F(p))**2).sum(0)
dMSE = lambda p: (1/Y.shape[0])*(-2*(Y - F(p))*dF(p)).sum(1)

# SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0, 0.0])
for i in range(epochs):
    params -= lr*dMSE(params)

print(params)
Hopefully, written this way it is super clear exactly where the subtraction of the gradient is occurring and exactly how it is calculated. Note also, in case it wasn't clear, the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module. Hence, SGD is a clearly useless way to optimize two variables.
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
I think I figured it out: in my code I was not summing the hidden node weight derivatives, but was instead reassigning them on every loop iteration. The correct version would be as follows
for point, label in zip(traindata, trainlabels):
    y, z, x = feed_forward_predict(point, W, w)
    temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
    # Note that s,u,v represent the hidden node weights. My professor required it this way
    dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
    dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
    dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
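Note that for this accumulation to work, dells, dellu and dellv also have to be reset to zero before each pass over the data, just like dellw is. A minimal sketch of the lines that would go right before this loop, assuming the rest of back_prop stays as in the question:

# reset the accumulated hidden-layer gradients before each pass over the data
dells = np.zeros(W.shape[1])
dellu = np.zeros(W.shape[1])
dellv = np.zeros(W.shape[1])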

Different results when optimizing hyperparameter for a Gaussian process regression

I'm studying Gaussian process regression, and I'm trying to use the built-in functions from scikit-learn, and also trying to implement a custom function for doing so.
This is the code when using scikit-learn:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s

X = np.linspace(0,10,10).reshape(-1,1)  # Input Values
Y = 2*X + np.sin(X)  # Function

v = 1
kernel = v*RBF() + WhiteKernel()  # Defining kernel
gp = gpr(kernel=kernel, n_restarts_optimizer=50).fit(X,Y)  # fitting the process to get optimized hyperparameters
gp.kernel_  # Hyperparameters optimized by the GPR function in scikit-learn
Out[]: 14.1**2 * RBF(length_scale=3.7) + WhiteKernel(noise_level=1e-05)  # result
And this is the code I wrote manually:
def marglike(par, X, Y):  # defining log-marginal-likelihood
    # print(par)
    l, var, sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = var*np.exp(-(1/(2*(l**2)))*dist_X)
    inverse = np.linalg.inv(k + (sigma_n**2)*np.eye(len(k)))
    ml = (1/2)*np.dot(np.dot(Y.T, inverse), Y) + (1/2)*np.log(np.linalg.det(k + (sigma_n**2)*np.eye(len(k)))) + (n/2)*np.log(2*np.pi)
    return ml

b = [0.0005, 100]
bnd = [b, b, b]  # bounds used for "minimize" function
start = np.array([1.1, 1.6, 0.05])  # initial hyperparameters values
re = minimize(marglike, start, args=(X,Y), method="L-BFGS-B", options={'disp': True}, bounds=bnd)  # the method used is the same as the one used by scikit-learn
re.x  # Hyperparameter results
Out[]: array([3.55266484e+00, 9.99986210e+01, 5.00000000e-04])
As you can see, the hyperparameters I got from the two methods are different, even though I used the same data (X,Y) and the same minimization method.
Could somebody help me understand why, and maybe how to get the same results?
As suggested by San Mason, adding noise actually works! Otherwise, when you do it manually (in the custom code), set the initial noise reasonably low and run multiple restarts with different initializations; then you will get values that are close. By the way, noiseless data seems to create a stationary ridge in the space of hyperparameters (like Fig. 1.6 in the Surrogates GP book). Note that scikit-learn's noise_level is sigma_n^2 in your custom function. Below are snippets for the noisy and noise-less cases.
Noise-less case
scikit-learn
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s

X = np.linspace(0,10,10).reshape(-1,1)  # Input Values
Y = 2*X + np.sin(X)  # + np.random.normal(10)  # Function

v = 1
kernel = v*RBF() + WhiteKernel()  # Defining kernel
gp = gpr(kernel=kernel, n_restarts_optimizer=50).fit(X,Y)  # fitting the process to get optimized hyperparameters
gp.kernel_  # Hyperparameters optimized by the GPR function in scikit-learn
# Out[]: 14.1**2 * RBF(length_scale=3.7) + WhiteKernel(noise_level=1e-05)  # result
custom function
def marglike(par, X, Y):  # defining log-marginal-likelihood
    # print(par)
    l, std, sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = std**2*np.exp(-(dist_X/(2*(l**2)))) + (sigma_n**2)*np.eye(n)
    inverse = np.linalg.inv(k)
    ml = (1/2)*np.dot(np.dot(Y.T, inverse), Y) + (1/2)*np.log(np.linalg.det(k)) + (n/2)*np.log(2*np.pi)
    return ml[0,0]

b = [10**-5, 10**5]
bnd = [b, b, b]  # bounds used for "minimize" function
start = [1, 1, 10**-5]  # initial hyperparameters values
re = minimize(fun=marglike, x0=start, args=(X,Y), method="L-BFGS-B", options={'disp': True}, bounds=bnd)  # the method used is the same as the one used by scikit-learn
re.x[1], re.x[0], re.x[2]**2
# Output - (9.920690495739379, 3.5657912350017575, 1.0000000000000002e-10)
Noisy case
scikit-learn
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s

X = np.linspace(0,10,10).reshape(-1,1)  # Input Values
Y = 2*X + np.sin(X) + np.random.normal(size=10).reshape(10,1)*0.1  # Function

v = 1
kernel = v*RBF() + WhiteKernel()  # Defining kernel
gp = gpr(kernel=kernel, n_restarts_optimizer=50).fit(X,Y)  # fitting the process to get optimized hyperparameters
gp.kernel_  # Hyperparameters optimized by the GPR function in scikit-learn
# Out[]: 10.3**2 * RBF(length_scale=3.45) + WhiteKernel(noise_level=0.00792)  # result
Custom function
def marglike(par, X, Y):  # defining log-marginal-likelihood
    # print(par)
    l, std, sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = std**2*np.exp(-(dist_X/(2*(l**2)))) + (sigma_n**2)*np.eye(n)
    inverse = np.linalg.inv(k)
    ml = (1/2)*np.dot(np.dot(Y.T, inverse), Y) + (1/2)*np.log(np.linalg.det(k)) + (n/2)*np.log(2*np.pi)
    return ml[0,0]

b = [10**-5, 10**5]
bnd = [b, b, b]  # bounds used for "minimize" function
start = [1, 1, 10**-5]  # initial hyperparameters values
re = minimize(fun=marglike, x0=start, args=(X,Y), method="L-BFGS-B", options={'disp': True}, bounds=bnd)  # the method used is the same as the one used by scikit-learn
re.x[1], re.x[0], re.x[2]**2
# Output - (10.268943740577331, 3.4462604625225106, 0.007922681239535326)
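As a cross-check (a sketch, not strictly needed), you can also read the optimized values back out of the fitted scikit-learn model and put them in the same parameterization as the custom function (std, l, sigma_n**2). The k1/k2 parameter names below assume the Sum(Product(Constant, RBF), White) kernel built above:

params = gp.kernel_.get_params()
std = np.sqrt(params['k1__k1__constant_value'])  # sklearn stores the squared amplitude
l = params['k1__k2__length_scale']               # RBF length scale
noise = params['k2__noise_level']                # corresponds to sigma_n**2 in the custom code
print(std, l, noise)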

Constrain the sum of coefficients with scikit-learn linear model

I am doing a LassoCV with 1000 coefs. Statsmodels did not seem able to handle this many coefs, so I am using scikit-learn. Statsmodels allowed for .fit_constrained("coef1 + coef2...=1"), which constrained the sum of the coefs to equal 1. I need to do this in scikit-learn. I am also keeping the intercept at zero.
from sklearn.linear_model import LassoCV
LassoCVmodel = LassoCV(fit_intercept=False)
LassoCVmodel.fit(x,y)
Any help would be appreciated.
As mentioned in the comments: the docs and the sources do not indicate that this is supported within sklearn!
I just tried the alternative of using off-the-shelf convex-optimization solvers. It's just a simple prototype-like approach and it might not be a good fit for your (incompletely defined) task (sample size?).
Some comments:
implementation/model-formulation is easy
the problem is harder to solve than I thought
solver ECOS has general trouble
solver SCS reaches good accuracy (worse compared to sklearn)
but: tuning iterations to improve accuracy breaks the solver
the problem becomes infeasible for SCS!
SCS + bigM-based formulation (constraint is posted as penalization-term within objective) looks usable; but might need tuning
only open-source solvers were tested and commercial ones might be much better
Further things to try:
For tackling huge problems (where performance becomes more important than robustness and accuracy), an (Accelerated) Projected Stochastic Gradient approach looks promising
Code
""" data """
from time import perf_counter as pc
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
A = diabetes.data
y = diabetes.target
alpha=0.1
print('Problem-size: ', A.shape)
def obj(x):  # following sklearn's definition from the user-guide!
    return (1. / (2*A.shape[0])) * np.square(np.linalg.norm(A.dot(x) - y, 2)) + alpha * np.linalg.norm(x, 1)
""" sklearn """
print('\nsklearn classic l1')
from sklearn import linear_model
clf = linear_model.Lasso(alpha=alpha, fit_intercept=False)
t0 = pc()
clf.fit(A, y)
print('used (secs): ', pc() - t0)
print(obj(clf.coef_))
print('sum x: ', np.sum(clf.coef_))
""" cvxpy """
print('\ncvxpy + scs classic l1')
from cvxpy import *
x = Variable(A.shape[1])
objective = Minimize((1. / (2*A.shape[0])) * sum_squares(A*x - y) + alpha * norm(x, 1))
problem = Problem(objective, [])
t0 = pc()
problem.solve(solver=SCS, use_indirect=False, max_iters=10000, verbose=False)
print('used (secs): ', pc() - t0)
print(obj(x.value.flat))
print('sum x: ', np.sum(x.value.flat))
""" cvxpy -> sum x == 1 """
print('\ncvxpy + scs sum == 1 / 1st approach')
objective = Minimize((1. / (2*A.shape[0])) * sum_squares(A*x - y))
constraints = [sum(x) == 1]
problem = Problem(objective, constraints)
t0 = pc()
problem.solve(solver=SCS, use_indirect=False, max_iters=10000, verbose=False)
print('used (secs): ', pc() - t0)
print(obj(x.value.flat))
print('sum x: ', np.sum(x.value.flat))
""" cvxpy approach 2 -> sum x == 1 """
print('\ncvxpy + scs sum == 1 / 2nd approach')
M = 1e6
objective = Minimize((1. / (2*A.shape[0])) * sum_squares(A*x - y) + M*(sum(x) - 1))
constraints = [sum(x) == 1]
problem = Problem(objective, constraints)
t0 = pc()
problem.solve(solver=SCS, use_indirect=False, max_iters=10000, verbose=False)
print('used (secs): ', pc() - t0)
print(obj(x.value.flat))
print('sum x: ', np.sum(x.value.flat))
Output
Problem-size: (442, 10)
sklearn classic l1
used (secs): 0.001451024380348898
13201.3508496
sum x: 891.78869298
cvxpy + scs classic l1
used (secs): 0.011165673357417458
13203.6549995
sum x: 872.520510561
cvxpy + scs sum == 1 / 1st approach
used (secs): 0.15350853891775978
13400.1272148
sum x: -8.43795102327
cvxpy + scs sum == 1 / 2nd approach
used (secs): 0.012579569383536493
13397.2932976
sum x: 1.01207061047
Edit
Just for fun I implemented a slow, non-optimized prototype solver using the approach of accelerated projected gradient (remarks in code!).
This one should scale much better for huge problems (as it's a first-order method), despite slow behaviour here (because not optimized). There should be a lot of potential!
Warning: might be seen as advanced numerical-optimization to some people :-)
Edit 2: I forgot to add the nonnegativity constraint to the projection (sum(x) == 1 does not make much sense if x can be negative!). This makes the solving much harder (numerical trouble), and it's obvious that one of those fast special-purpose projections should be used (I'm too lazy right now; I think n*log n algorithms are available). Again: this APG-solver is a prototype not ready for real tasks.
Code
""" accelerated pg -> sum x == 1 """
def solve_pg(A, b, momentum=0.9, maxiter=1000):
""" remarks:
algorithm: accelerated projected gradient
projection: proj on probability-simplex
-> naive and slow using cvxpy + ecos
line-search: armijo-rule along projection-arc (Bertsekas book)
-> suffers from slow projection
stopping-criterion: naive
gradient-calculation: precomputes AtA
-> not needed and not recommended for huge sparse data!
"""
M, N = A.shape
x = np.zeros(N)
AtA = A.T.dot(A)
Atb = A.T.dot(b)
stop_count = 0
# projection helper
x_ = Variable(N)
v_ = Parameter(N)
objective_ = Minimize(0.5 * square(norm(x_ - v_, 2)))
constraints_ = [sum(x_) == 1]
problem_ = Problem(objective_, constraints_)
def gradient(x):
return AtA.dot(x) - Atb
def obj(x):
return 0.5 * np.linalg.norm(A.dot(x) - b)**2
it = 0
while True:
grad = gradient(x)
# line search
alpha = 1
beta = 0.5
sigma=1e-2
old_obj = obj(x)
while True:
new_x = x - alpha * grad
new_obj = obj(new_x)
if old_obj - new_obj >= sigma * grad.dot(x - new_x):
break
else:
alpha *= beta
x_old = x[:]
x = x - alpha*grad
# projection
v_.value = x
problem_.solve()
x = np.array(x_.value.flat)
y = x + momentum * (x - x_old)
if np.abs(old_obj - obj(x)) < 1e-2:
stop_count += 1
else:
stop_count = 0
if stop_count == 3:
print('early-stopping # it: ', it)
return x
it += 1
if it == maxiter:
return x
print('\n acc pg')
t0 = pc()
x = solve_pg(A, y)
print('used (secs): ', pc() - t0)
print(obj(x))
print('sum x: ', np.sum(x))
Output
acc pg
early-stopping # it: 367
used (secs): 0.7714511330487027
13396.8642379
sum x: 1.00000000002
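For reference, a sketch of one of those fast special-purpose projections mentioned in Edit 2: the classic sort-based O(n log n) Euclidean projection onto the probability simplex {x : x >= 0, sum(x) == 1}. It could replace the cvxpy-based projection helper inside solve_pg, assuming the nonnegativity constraint is indeed wanted:

import numpy as np

def project_to_simplex(v):
    # sort descending, find the shift theta, then clip at zero
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)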
I am surprised nobody has stated this before in the comments, but I think there is a conceptual misunderstanding in your question statement.
Let us start with the definition of the Lasso Estimator, for example as given in Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright:
Given a collection of N predictor-response pairs {(xi, yi)}, the lasso finds the fit coefficients (β0, βi) to the least-squares optimization problem with the additional constraint that the L1-norm of the vector of coefficients βi is less than or equal to t.
Here the L1-norm of the coefficient vector is the sum of the magnitudes of all the coefficients. In the case where your coefficients are all positive, this precisely addresses your question.
Now, what is the relationship between this t and the alpha parameter used in scikit-learn? Well, it turns out that by Lagrangian duality, there is a one-to-one correspondence between every value of t and a value for alpha.
This means that when you use LassoCV, since you are using a range of values for alpha, you are using by definition a range of allowable values for the sum of all your coefficients!
To sum up, the condition of the sum of all your coefficients being equal to one is equivalent to using Lasso for a particular value of alpha.
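A small sketch of this correspondence (using the diabetes data from the other answer; the alpha values are arbitrary): as alpha grows, the L1 norm of the fitted coefficients -- the budget t of the constrained formulation -- shrinks, so picking the alpha whose fitted L1 norm is about 1 is the dual view of constraining your (positive) coefficients to sum to one.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    print(alpha, np.abs(coef).sum())  # the L1 norm plays the role of t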

Gradients are not computed with respect to a variable that is stored in a TensorArray

For y=x**2 the gradient dy_dx won't be computed if x is retrieved from a TensorArray.
How can I store both the x and y ops in a TensorArray, then retrieve them, and call tf.gradients to compute the gradient?
The use case would be: one builds a while_loop, a bunch of different values (i.e. x,y) are generated within the iteration, they are pushed into TensorArrays, and then outside the loop I would like to get the derivative of one array with respect to another array.
Example illustrating the problem:
import tensorflow as tf
import numpy as np
x = tf.Variable(np.array([3]).astype(np.float32), trainable=False)
y = x ** 2
xa = tf.TensorArray(tf.float32, 1).write(0, x)
ya = tf.TensorArray(tf.float32, 1).write(0, y)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# these work as expected:
print(sess.run(tf.gradients(y, x))) # stdout: [array([ 6.], dtype=float32)]
print(sess.run(tf.gradients(ya.stack(), x))) # stdout: [array([ 6.], dtype=float32)]
# why no gradient?
print(tf.gradients(ya.stack(), xa.stack())) # stdout: [None]
print(tf.gradients(ya.read(0), xa.read(0))) # stdout: [None]
# desperate attempt, doesn't work either
za = tf.TensorArray(tf.float32, 1).write(0, xa.read(0) ** 2)
print(tf.gradients(za.read(0), xa.read(0))) # stdout: [None]
The reason is that the function tf.gradients is a bit of a misnomer: in fact, tf.gradients implements the backprop algorithm, and it can only provide gradient calculations between nodes that are connected in the graph. Since ya does not depend on xa in any way, there is no connection and backprop does not work. The following example will provide a gradient:
x = tf.Variable(np.array([3]).astype(np.float32), trainable=False)
y = x ** 2
xa = tf.TensorArray(tf.float32, 1).write(0, x)
ya = tf.TensorArray(tf.float32, 1).write(0, xa.read(0))
tf.gradients(ya.stack(), x)
but the following will not:
y = x ** 2
z = x + 1
tf.gradients(y, z) # None
because there is no DAG path from z to y.
This latter example is more similar to what you're trying to do in your question.
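To make the while_loop use case from your question concrete, here is a small sketch (TF1-style graph mode): values derived from x are written to TensorArrays inside the loop, and the gradient is taken with respect to the original variable x rather than with respect to the stacked array:

import numpy as np
import tensorflow as tf

x = tf.Variable(np.array([3.0], dtype=np.float32), trainable=False)
n = 3

def body(i, xs, ys):
    xi = x + tf.cast(i, tf.float32)  # some value derived from x inside the loop
    yi = xi ** 2
    return i + 1, xs.write(i, xi), ys.write(i, yi)

_, xs, ys = tf.while_loop(
    lambda i, *_: i < n,
    body,
    [tf.constant(0),
     tf.TensorArray(tf.float32, size=n),
     tf.TensorArray(tf.float32, size=n)])

# differentiate the stacked results with respect to the original variable x,
# not with respect to xs.stack(); d/dx sum_i (x + i)**2 = 6*x + 6
grad = tf.gradients(ys.stack(), x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))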
