Tensorflow: Linear regression with non-negative constraints

Tensorflow: Linear regression with non-negative constraints - python

I am trying to implement a linear regression model in Tensorflow, with additional constraints (coming from the domain) that the W and b terms must be non-negative.
I believe there are a couple of ways to do this.
We can modify the cost function to penalize negative weights [Lagrangian approach] [See:TensorFlow - best way to implement weight constraints
We can compute the gradients ourselves and project them on [0, infinity] [Projected gradient approach]
Approach 1: Lagrangian
When I tried the first approach, I would often end up with negative b.
I had modified the cost function from:
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
to:
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 100.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint
Keeping the coefficient of nn_b and nn_w to be very high leads to instability and very high cost.
Here is the complete code.
import numpy as np
import tensorflow as tf
n_samples = 50
train_X = np.linspace(1, 50, n_samples)
train_Y = 10*train_X + 6 +40*np.random.randn(50)
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Gradient descent
learning_rate=0.0001
# Initializing the variables
init = tf.global_variables_initializer()
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 1.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_with_constraint)
training_epochs=200
with tf.Session() as sess:
sess.run(init)
# Fit all training data
cost_array = np.zeros(training_epochs)
W_array = np.zeros(training_epochs)
b_array = np.zeros(training_epochs)
for epoch in range(training_epochs):
for (x, y) in zip(train_X, train_Y):
sess.run(optimizer, feed_dict={X: x, Y: y})
W_array[epoch] = sess.run(W)
b_array[epoch] = sess.run(b)
cost_array[epoch] = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
The following is the mean of b across 10 different runs.
0 -1.101268
1 0.169225
2 0.158363
3 0.706270
4 -0.371205
5 0.244424
6 1.312516
7 -0.069609
8 -1.032187
9 -1.711668
Clearly, the first approach is not optimal. Further, there is a lot of art involved in choosing the coefficient of penalty terms.
Approach 2: Projected gradient
I then thought to use the second approach, which is more guaranteed to work.
gr = tf.gradients(cost, [W, b])
We manually compute the gradients and update the W and b.
with tf.Session() as sess:
sess.run(init)
for epoch in range(training_epochs):
for (x, y) in zip(train_X, train_Y):
W_del, b_del = sess.run(gr, feed_dict={X: x, Y: y})
W = max(0, (W - W_del)*learning_rate) #Project the gradient on [0, infinity]
b = max(0, (b - b_del)*learning_rate) # Project the gradient on [0, infinity]
This approach seems to be very slow.
I am wondering if there is a better way to run the second approach, or guarantee the results with the first approach. Can we somehow allow the optimizer to ensure that the learnt weights are non-negative?
Edit: How to do this in Autograd
https://github.com/HIPS/autograd/issues/207

If you modify your linear model to:
pred = tf.add(tf.multiply(X, tf.abs(W)), tf.abs(b))
it will have the same effect as using only positive W and b values.
The reason your second approach is slow is that you clip the W and b values outside of the tensorflow graph. (Also it will not converge because (W - W_del)*learning_rate must instead be W - W_del*learning_rate)
edit:
You can implement the clipping using tensorflow graph like this:
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
with tf.control_dependencies([train_step]):
clip_W = W.assign(tf.maximum(0., W))
clip_b = b.assign(tf.maximum(0., b))
train_step_with_clip = tf.group(clip_W, clip_b)
In this case W and b values will be clipped to 0 and not to small positive numbers.
Here is a small mnist example with clipping:
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x = tf.placeholder(tf.uint8, [None, 28, 28])
x_vec = tf.cast(tf.reshape(x, [-1, 784]), tf.float32) / 255.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x_vec, W) + b
y_target = tf.placeholder(tf.uint8, [None])
y_target_one_hot = tf.one_hot(y_target, 10)
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=y_target_one_hot, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
with tf.control_dependencies([train_step]):
clip_W = W.assign(tf.maximum(0., W))
clip_b = b.assign(tf.maximum(0., b))
train_step_with_clip = tf.group(clip_W, clip_b)
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_target_one_hot, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
with tf.Session() as sess:
tf.global_variables_initializer().run()
for i in range(1000):
sess.run(train_step_with_clip, feed_dict={
x: x_train[(i*100)%len(x_train):((i+1)*100)%len(x_train)],
y_target: y_train[(i*100)%len(x_train):((i+1)*100)%len(x_train)]})
if not i%100:
print("Min_W:", sess.run(tf.reduce_min(W)))
print("Min_b:", sess.run(tf.reduce_min(b)))
print("Accuracy:", sess.run(accuracy, feed_dict={
x: x_test,
y_target: y_test}))

I actually was not able to reproduce your problem of getting negative bs with your first approach.
But I do agree that this is not optimal for your use case and can result in negative values.
You should be able to constrain your parameters to non-negative values like so:
W *= tf.cast(W > 0., tf.float32)
b *= tf.cast(b > 0., tf.float32)
(exchange > with >= if necessary, the cast is necessary as the comparison operators will produce boolean values.
You then would optimize for the "standard cost" without the additional constraints.
However, this does not work in every case. For example, it should be avoided to initialize W or b with negative values in the beginning.
Your second (and probably better) approach can be accelerated by defining the update logic in the general computational graph, i.e. after the definition of cost
params = [W, b]
grads = tf.gradients(cost, params)
optimizer = [tf.assign(param, tf.maximum(0., param - grad*learning_rate))
for param, grad in zip(params, grads)]
I think your solution is slow because it creates new computation nodes every time which is probably very costly and repeated a lot inside the loops.
update using tensorflow optimizer
In my solution above not the gradients are clipped but rather the resulting update values.
Along the lines of this answer you could clip the gradients to be at most the value of the updated parameter like so:
params = [W, b]
opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = opt.compute_gradients(cost, params)
clipped_grads_vars = [(tf.clip_by_value(grad, -np.inf, var), var) for grad, var in grads_and_vars]
optimizer = opt.apply_gradients(clipped_grads_vars)
This way an update will never decrease a parameter to a value below 0.
However, I think this will not work in the case the updated variable is already negative.
Also, if the optimizing algorithm somehow multiplies the clipped gradient by a value greater than 1.
The latter might actually never happen, but I'm not 100% sure.

Related

Control flow in Tensorflow 2 - gradients are None

I have a Tensorflow 2.x model with the purpose of dynamically choosing a computational path. Here's a schematic drawing of this model:
The only trainable block is the Decision Module (DM), which is essentially a fully connected layer with a single binary output (0 or 1; It's differentiable using a technique called Improved Semantic Hashing). Nets A & B have the same network architecture.
In the training progress, I feed forward a batch of images until the output of the DM, and then process the decision image-by-image, directing each image to the decided net (A or B). The predictions are concatenated into a single tensor, who's used to evaluate the performance. Here's the training code (sigma is the output of the DM; model includes the feature extractor and the DM):
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
#tf.function
def train_step(images, labels):
with tf.GradientTape() as tape:
# training=True is only needed if there are custom_layers with different
# behavior during training versus inference (e.g. Dropout).
_, sigma = model(images, training=True)
out = []
for img, s in zip(images, sigma):
if s == 0:
o = binary_classifier_model_a(tf.expand_dims(img, axis=0), training=False)
else:
o = binary_classifier_model_b(tf.expand_dims(img, axis=0), training=False)
out.append(o)
predictions = tf.concat(out, axis=0)
loss = loss_object(labels, predictions)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train_loss(loss)
train_accuracy(labels, predictions)
The problem - when running this code, gradients returns [None, None].
What I know for now is:
The first part of the model (until the DM's output) is differentiable; I tested it by running only this section and applying a loss function (MSE) and then applying tape.gradients - I got actual gradients.
I tried choosing a single (constant) net - e.g, net A - and simply multiplying it's output by s (which is either 0 or 1); This is performed instead of the if-else block in the code. In this case I also got gradients.
My concern is that such thing might not be possible - quoting from the official docs:
x = tf.constant(1.0)
v0 = tf.Variable(2.0)
v1 = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
tape.watch(x)
if x > 0.0:
result = v0
else:
result = v1**2
Depending on the value of x in the above example, the tape either
records result = v0 or result = v1**2. The gradient with respect to
x is always None.
dx = tape.gradient(result, x)
print(dx)
>> None
I'm not 100% sure that this is my case, but I wanted to ask here for the experts' opinion.
Is what I'm trying to do possible? And if yes - what should I change in order for this to work?
Thanks

You correctly identified the issue. The control statement of the conditional is not differentiable, so you lose your link to the model variables that produced sigma.
In your case, because you state that sigma is either 1 or 0, you can use the value of sigma as a mask, and skip the conditional statement (and even the loop).
with tf.GradientTape() as tape:
_, sigma = model(images, training=True)
predictions = (1.0 - sigma) * binary_classifier_model_a(images, training=False)\
+ sigma * binary_classifier_model_b(images, training=False)
loss = loss_object(labels, predictions)

It seems solution to your ploblem is to control flow operations. Try using tf.where. You can implement your condition by doing something like this.
a = tf.constant([1, 1])
b = tf.constant([2, 2])
p = tf.constant([True, False])
x = tf.where(p, a + b, a * b)
For more information please refer this

Using tf.py_func as loss function to implement gradient descent

I'm trying to use tf.train.GradientDescentOptimizer().minimize(loss) to get the minimum value of the loss function. But the loss function is very complicated and I need to use numpy to calculate the value, so I use tf.py_func to change the output to tensor again and try to use gradient descent to get the result. But this occurs the error: No gradients provided for any variable.
def getvalue(w):
return w
train_X = np.array([1,2,3,4,5])
train_Y = np.array([34,56,78,27,96])
X = tf.placeholder("float")
Y = tf.placeholder("float")
w = tf.Variable([0.0,0,0,0,0], name="weight")
loss = tf.py_func(getvalue,[w],tf.float32)
train_op = tf.train.GradientDescentOptimizer(0.02).minimize(loss)
with tf.Session() as sess:
sess.run(tf.initialize_all_variables())
temp = 1
for i in range(100):
for (x, y) in zip(train_X, train_Y):
_, w_value = sess.run([train_op, w],feed_dict={X: x,Y: y})
temp += 1
Here's a very simple loss function, but when compile it still occurs No gradients provided for any variable.
So is there's a way to solve it? Do I need to calculate everything in tensor in order to use gradient descent?
Thanks in advance.

Predictions on a two layer trained neural model on Tensorflow

I'm having trouble making predictions with a trained neural model on Tensor. Here's my attempt:
import tensorflow as tf
import pandas, numpy as np
dataset=[[0.4,0.5,0.6,0],[0.6,0.7,0.8,1],[0.3,0.8,0.5,2],....]
X = tf.placeholder(tf.float32, [None, 3])
W = tf.Variable(tf.zeros([3,10]))
b = tf.Variable(tf.zeros([10]))
Y1 = tf.matmul(X, W) + b
W1 = tf.Variable(tf.zeros([10, 1]))
b1 = tf.Variable(tf.zeros([1]))
Y = tf.nn.sigmoid(tf.matmul(Y1, W1) + b1)
# placeholder for correct labels
Y_ = tf.placeholder(tf.float32, [None, 1])
init = tf.global_variables_initializer()
# loss function
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))
optimizer = tf.train.GradientDescentOptimizer(0.003)
train_step = optimizer.minimize(cross_entropy)
sess = tf.Session()
sess.run(init)
for i in range(1000):
# load batch of images and correct answers
batch_X, batch_Y = [x[:3] for x in dataset[:4000]],[x[-1:] for x in dataset[:4000]]
train_data={X: batch_X, Y_: batch_Y}
sess.run(train_step, feed_dict=train_data)
correct_prediction = tf.equal(tf.argmax(Y,1), tf.argmax(Y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
a,c = sess.run([accuracy, cross_entropy], feed_dict=train_data)
test, lebs=[x[:3] for x in dataset[4000:]],[x[-1:] for x in dataset[4000:]]
test_data={X: test, Y_: lebs}
a,c = sess.run([accuracy, cross_entropy], feed_dict=test_data)
prediction=tf.argmax(Y,1)
print ("predictions", prediction.eval({X:test}, session=sess))
I got the following results when I ran the above code:
predictions [0 0 0 ..., 0 0 0]
My expected output should be the class labels:
predictions `[0,1,2....]`
I will appreciate your suggestions.

There are multiple problems with your code:
Initialisation: You are zero initializing your weight variable.
W = tf.Variable(tf.zeros([3,10]))
Your model will keep propogating same values at each layer for all types of inputs if you zero initialize it. Initialize it with random values. Ex:
W = tf.Variable(tf.truncated_normal((3,10)))
Loss function: I believe you are trying to replicate this familiar looking equation as your loss function:
y * log(prob) + (1 - y) * log(1 - prob). I believe you are having totally 10 classes. For each of the 10 classes, you will have to substitue the above equation and remember, you will use y value in above equation as either correct class or wrong class i.e 1 or 0 only for each class. Do not substitute y value as class label from 0 to 9.
To avoid all this calculations, I will suggest you to make use of Tensorflow's in-built functions like tf.nn.softmax_cross_entropy_with_logits. It will help you in a long way.
Sigmoid function: This is the prime culprit on why all your outputs are giving value as only 0. The output range of sigmoid is from 0 to 1. Think of replacing it with ReLU.
Output units: If you are doing classification, your number of neurons in final layers should be equal to number of classes. Each class denotes one output class. Replace it with 10 neurons.
Learning rate: Keep playing with your learning rate. I believe your learning rate is little high for such a small network.
Hope you understood the problems in your code. Please Google each of the above point I have mentioned for greater details but I have given you more than enough information to start solving the problem.

Feed a trainable variable into a placeholder in TensorFlow

Suppose I want to train a model over a number of samples (and also variables) that are known only at run time.
Case study: PCA (X W = Y)
(this is only a simplification of a much more complex model)
Take for example this simple PCA model, where only the features dimensions (Din and Dout) are known and fixed.
W = tf.Variable(tf.zeros([D_in, D_out]), name='weights', trainable=True)
X = tf.placeholder(tf.float32, [None, D_in], name='placeholder_latent')
Y_est = tf.matmul(X, W)
loss = tf.reduce_sum((Y_tf-Y_est)**2)
train_step = tf.train.AdamOptimizer(0.001).minimize(loss)
Suppose now we generate some data
W_true = np.random.randn(D_in, D_out)
X_true = np.random.randn(N, D_in)
Y_true = np.dot(X_true, W_true)
Y_tf = tf.constant(Y_true.astype(np.float32))
As soon as I know the dimension of my training data, I can declare the latent variable that will be fed to the placeholder X to be optimised.
latent = tf.Variable(tf.zeros([N, D_in]), name='latent', trainable=True)
init_op = tf.global_variables_initializer()
After that, what I would like to do is to feed this latent variable to the placeholder X and run the optimisation.
with tf.Session() as sess:
sess.run(init_op)
for n in range(10000):
sess.run(train_step, feed_dict={X : sess.run(latent)})
if (n+1) % 1000 == 0:
print('iter %i, %f' % (n+1, sess.run(loss, feed_dict={X : sess.run(latent)})))
The problem is that the optimiser does not optimise W and latent, but only W. I have tried also to feed the variable directly without evaluation but I get this error:
ValueError: setting an array element with a sequence.
Have you ever encountered this kind of issue? Do you know how to overcome this problem? Are there any possible workaround to optimise on a placeholder?
By the way, I am using TensorFlow 1.1.0rc0 with Python 2.7.13

Linear regression with tensorflow

I trying to understand linear regression... here is script that I tried to understand:
'''
A linear regression learning algorithm example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''
from __future__ import print_function
import tensorflow as tf
from numpy import *
import numpy
import matplotlib.pyplot as plt
rng = numpy.random
# Parameters
learning_rate = 0.0001
training_epochs = 1000
display_step = 50
# Training Data
train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
7.042,10.791,5.313,7.997,5.654,9.27,3.1])
train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
2.827,3.465,1.65,2.904,2.42,2.94,1.3])
train_X=numpy.asarray(train_X)
train_Y=numpy.asarray(train_Y)
n_samples = train_X.shape[0]
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
sess.run(init)
# Fit all training data
for epoch in range(training_epochs):
for (x, y) in zip(train_X, train_Y):
sess.run(optimizer, feed_dict={X: x, Y: y})
# Display logs per epoch step
if (epoch+1) % display_step == 0:
c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), \
"W=", sess.run(W), "b=", sess.run(b))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
# Graphic display
plt.plot(train_X, train_Y, 'ro', label='Original data')
plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
plt.legend()
plt.show()
Question is what this part represent:
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
And why are there random float numbers?
Also could you show me some math with formals represents cost, pred, optimizer variables?

let's try to put up some intuition&sources together with the tfapproach.
General intuition:
Regression as presented here is a supervised learning problem. In it, as defined in Russel&Norvig's Artificial Intelligence, the task is:
given a training set (X, y) of m input-output pairs (x1, y1), (x2, y2), ... , (xm, ym), where each output was generated by an unknown function y = f(x), discover a function h that approximates the true function f
For that sake, the h hypothesis function combines somehow each x with the to-be-learned parameters, in order to have an output that is as close to the corresponding y as possible, and this for the whole dataset. The hope is that the resulting function will be close to f.
But how to learn this parameters? in order to be able to learn, the model has to be able to evaluate. Here comes the cost (also called loss, energy, merit...) function to play: it is a metric function that compares the output of h with the corresponding y, and penalizes big differences.
Now it should be clear what is exactly the "learning" process here: alter the parameters in order to achieve a lower value for the cost function.
Linear Regression:
The example that you are posting performs a parametric linear regression, optimized with gradient descent based on the mean squared error as cost function. Which means:
Parametric: The set of parameters is fixed. They are held in the exact same memory placeholders thorough the learning process.
Linear: The output of h is merely a linear (actually, affine) combination between the input x and your parameters. So if x and w are real-valued vectors of the same dimensionality, and b is a real number, it holds that h(x,w, b)= w.transposed()*x+b. Page 107 of the Deep Learning Book brings more quality insights and intuitions into that.
Cost function: Now this is the interesting part. The average squared error is a convex function. This means it has a single, global optimum, and furthermore, it can be directly found with the set of normal equations (also explained in the DLB). In the case of your example, the stochastic (and/or minibatch) gradient descent method is used: this is the preferred method when optimizing non-convex cost functions (which is the case in more advanced models like neural networks) or when your dataset has a huge dimensionality (also explained in the DLB).
Gradient descent: tf deals with this for you, so it is enough to say that GD minimizes the cost function by following its derivative "downwards", in small steps, until reaching a saddle point. If you totally need to know, the exact technique applied by TF is called automatic differentiation, kind of a compromise between the numeric and symbolic approaches. For convex functions like yours this point will be the global optimum, and (if your learning rate is not too big) it will always converge to it, so it doesn't matter which values you initialize your Variables with. The random initialization is necessary in more complex architectures like neural networks. There is some extra code regarding the management of the minibatches, but I won't get into that because it is not the main focus of your question.
The TensorFlow approach:
Deep Learning frameworks are nowadays about nesting lots of functions by building computational graphs (you may want to take a look at the presentation on DL frameworks that I did some weeks ago). For constructing and running the graph, TensoFlow follows a declarative style, which means that the graph has to be first completely defined and compiled, before it is deployed and executed. It is very reccommended to read this short wiki article, if you haven't yet. In this context, the setup is split in two parts:
Firstly, you define your computational Graph, where you put your dataset and parameters in memory placeholders, define the hypothesis and cost functions building on them, and tell tf which optimization technique to apply.
Then you run the computation in a Session and the library will be able to (re)load the data placeholders and perform the optimization.
The code:
The code of the example follows this approach closely:
Define the test data X and labels Y, and prepare a placeholder in the Graph for them (which is fed in the feed_dict part).
Define the 'W' and 'b' placeholders for the parameters. They have to be Variables because they will be updated during the Session.
Define pred (our hypothesis) and cost as explained before.
From this, the rest of the code should be clearer. Regarding the optimizer, as I said, tf already knows how to deal with this but you may want to look into gradient descent for more details (again, the DLB is a pretty good reference for that)
Cheers!
Andres
CODE EXAMPLES: GRADIENT DESCENT VS. NORMAL EQUATIONS
This small snippets generate simple multi-dimensional datasets and test both approaches. Notice that the normal equations approach doesn't require looping, and brings better results. For small dimensionality (DIMENSIONS<30k) is probably the preferred approach:
from __future__ import absolute_import, division, print_function
import numpy as np
import tensorflow as tf
####################################################################################################
### GLOBALS
####################################################################################################
DIMENSIONS = 5
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
####################################################################################################
### GRADIENT DESCENT APPROACH
####################################################################################################
# dataset globals
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
ALPHA = 1e-8 # learning rate
LAMBDA = 0.5 # L2 regularization factor
TRAINING_STEPS = 1000
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)] # synthesize data
# ds = normalize_data(ds)
ds = [(x, [f(x)+noise()]) for x in ds] # add labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
x_test = tf.placeholder(tf.float32, shape=(_test_size, DIMENSIONS))
y_test = tf.placeholder(tf.float32, shape=(_test_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS)])
theta_0 = tf.Variable([[0.0]]) # don't forget the bias term!
# forward propagation
train_prediction = tf.matmul(x_train, theta)+theta_0
test_prediction = tf.matmul(x_test, theta) +theta_0
# cost function and optimizer
train_cost = (tf.nn.l2_loss(train_prediction - y_train)+LAMBDA*tf.nn.l2_loss(theta))/float(_train_size)
optimizer = tf.train.GradientDescentOptimizer(ALPHA).minimize(train_cost)
# test results
test_cost = (tf.nn.l2_loss(test_prediction - y_test)+LAMBDA*tf.nn.l2_loss(theta))/float(_test_size)
# run the computation
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized"); print(theta.eval())
for step in range(TRAINING_STEPS):
_, train_c, test_c = s.run([optimizer, train_cost, test_cost],
feed_dict={x_train: train_data, y_train: train_labels,
x_test: test_data, y_test: test_labels })
if (step%100==0):
# it should return bias close to zero and parameters all close to 1 (see definition of f)
print("\nAfter", step, "iterations:")
#print(" Bias =", theta_0.eval(), ", Weights = ", theta.eval())
print(" train cost =", train_c); print(" test cost =", test_c)
PARAMETERS_GRADDESC = tf.concat(0, [theta_0, theta]).eval()
print("Solution for parameters:\n", PARAMETERS_GRADDESC)
####################################################################################################
### NORMAL EQUATIONS APPROACH
####################################################################################################
# dataset globals
DIMENSIONS = 5
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset isused for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
# training globals
LAMBDA = 1e6 # L2 regularization factor
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS+1))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS+1)]) # implicit bias!
# optimum
optimum = tf.matrix_solve_ls(x_train, y_train, LAMBDA, fast=True)
# run the computation: no loop needed!
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized")
opt = s.run(optimum, feed_dict={x_train:train_data, y_train:train_labels})
PARAMETERS_NORMEQ = opt
print("Solution for parameters:\n",PARAMETERS_NORMEQ)
####################################################################################################
### PREDICTION AND ERROR RATE
####################################################################################################
# generate test dataset
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
test_data, test_labels = zip(*ds)
# define hypothesis
h_gd = lambda(x): PARAMETERS_GRADDESC.T.dot(x)
h_ne = lambda(x): PARAMETERS_NORMEQ.T.dot(x)
# define cost
mse = lambda pred, lab: ((pred-np.array(lab))**2).sum()/DS_SIZE
# make predictions!
predictions_gd = np.array([h_gd(x) for x in test_data])
predictions_ne = np.array([h_ne(x) for x in test_data])
# calculate and print total error
cost_gd = mse(predictions_gd, test_labels)
cost_ne = mse(predictions_ne, test_labels)
print("total cost with gradient descent:", cost_gd)
print("total cost with normal equations:", cost_ne)

Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
The variable with type tf.Variable is the parameter which we will learn use TensorFlow. Assume you use the gradient descent to minimize the loss function. You need initial these parameter first. The rng.randn() is used to generate a random value for this purpose.
I think the Getting Started With TensorFlow is a good start point for you.

I'll first define the variables:
W is a multidimensional line that spans R^d (same dimensionality as X)
b is a scalar value (bias)
Y is also a scalar value i.e. the value at X
pred = W (dot) X + b # dot here refers to dot product
# cost equals the average squared error
cost = ((pred - Y)^2) / 2*num_samples
#finally optimizer
# optimizer computes the gradient with respect to each variable and the update
W += learning_rate * (pred - Y)/num_samples * X
b += learning_rate * (pred - Y)/num_samples
Why are W and b set to random well this updates based on gradients from the error calculated from the cost so W and b could have been initialized to anything. It isn't performing linear regression via least squares method although both will converge to the same solution.
Look here for more information: Getting Started

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow: Linear regression with non-negative constraints - python

Related

Control flow in Tensorflow 2 - gradients are None

Using tf.py_func as loss function to implement gradient descent

Predictions on a two layer trained neural model on Tensorflow

Feed a trainable variable into a placeholder in TensorFlow

Linear regression with tensorflow

Categories

Resources