How to test my trained Tensor Flow model - python

I currently have a regression model that tries to predict a value based on 25 other ones.
Here is the code I currently gave
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
rng = np.random
learning_rate = 0.11
training_epochs = 1000
display_step = 50
X = np.random.randint(5,size=(100,25)).astype('float32')
y_data = np.random.randint(5,size=(100,1)).astype('float32')
m = 100
epochs = 100
W = tf.Variable(tf.zeros([25,1]))
b = tf.Variable(tf.zeros([1]))
y = tf.add( tf.matmul(X,W), b)
loss = tf.reduce_sum(tf.square(y - y_data)) / (2 * m)
loss = tf.Print(loss, [loss], "loss: ")
optimizer = tf.train.GradientDescentOptimizer(.01)
train = optimizer.minimize(loss)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for i in range(epochs):
sess.run(train)
sess.close()
I understand that right now these variables are all random so the accuracy would not be very good anyways, but I just want to know how to make a test set and find the accuracy of the predictions.

Typically, you split your training set into two pieces: roughly 2/3 for training and 1/3 for testing (opinions vary on the proportions). Train your model with the first set. Check the training accuracy (run the training set back through the model to see how many it gets right).
Now, run the remainder (test set) through the model and check how well the predictions match the results. "Find the accuracy" depends on what sort of predictions you're making: classification vs scoring, binary vs disjoint vs contiguous, etc.

Related

Implementing binary cross entropy from scratch - inconsistent results in training a neural network

I'm trying to implement and train a neural network using the JAX library and its little neural network submodule, "Stax". Since this library doesn't come with an implementation of binary cross entropy, I wrote my own:
def binary_cross_entropy(y_hat, y):
bce = y * jnp.log(y_hat) + (1 - y) * jnp.log(1 - y_hat)
return jnp.mean(-bce)
I implemented a simple neural network and trained it on MNIST, and started to get suspicious of some of the results I was getting. So I implemented the same setup in Keras, and I immediately got wildly different results! The same model, trained in the same way on the same data, was getting 90% training accuracy in Keras instead of around 50% in JAX. Eventually I tracked down part of the issue to my naive implementation of cross-entropy, which is supposedly numerically unstable. Following this post and this code I found, I wrote the following new version:
def binary_cross_entropy_stable(y_hat, y):
y_hat = jnp.clip(y_hat, 0.000001, 0.9999999)
logits = jnp.log(y_hat/(1 - y_hat))
max_logit = jnp.clip(logits, 0, None)
bces = logits - logits * y + max_logit + jnp.log(jnp.exp(-max_logit) + jnp.exp(-logits - max_logit))
return jnp.mean(bces)
This works a little better. Now my JAX implementation gets up to 80% train accuracy, but that's still a lot less than the 90% Keras gets. What I want to know is what is going on? Why are my two implementations not behaving the same way?
Below, I condensed my two implementations down to a single script. In this script, I implement the same model in JAX and in Keras. I initialize both with the same weights, and train them using full-batch gradient descent for 10 steps on 1000 datapoints from MNIST, the same data for each model. JAX finishes with 80% training accuracy, while Keras finishes with 90%. Specifically, I get this output:
Initial Keras accuracy: 0.4350000023841858
Initial JAX accuracy: 0.435
Final JAX accuracy: 0.792
Final Keras accuracy: 0.9089999794960022
JAX accuracy (Keras weights): 0.909
Keras accuracy (JAX weights): 0.7919999957084656
And actually, when I vary the conditions a little (using different random initial weights or a different training set), sometimes I get back the 50% JAX accuracy and 90% Keras accuracy.
I swap the weights at the end to verify that the weights obtained from training are indeed the issue, not something to do with the actual computation of the network predictions, or the way I calculate accuracy.
The code:
import numpy as np
import jax
from jax import jit, grad
from jax.experimental import stax, optimizers
import jax.numpy as jnp
import keras
import keras.datasets.mnist
def binary_cross_entropy(y_hat, y):
bce = y * jnp.log(y_hat) + (1 - y) * jnp.log(1 - y_hat)
return jnp.mean(-bce)
def binary_cross_entropy_stable(y_hat, y):
y_hat = jnp.clip(y_hat, 0.000001, 0.9999999)
logits = jnp.log(y_hat/(1 - y_hat))
max_logit = jnp.clip(logits, 0, None)
bces = logits - logits * y + max_logit + jnp.log(jnp.exp(-max_logit) + jnp.exp(-logits - max_logit))
return jnp.mean(bces)
def binary_accuracy(y_hat, y):
return jnp.mean((y_hat >= 1/2) == (y >= 1/2))
########################################
# #
# Create dataset #
# #
########################################
input_dimension = 784
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data(path="mnist.npz")
xs = np.concatenate([x_train, x_test])
xs = xs.reshape((70000, 784))
ys = np.concatenate([y_train, y_test])
ys = (ys >= 5).astype(np.float32)
ys = ys.reshape((70000, 1))
train_xs = xs[:1000]
train_ys = ys[:1000]
########################################
# #
# Create JAX model #
# #
########################################
jax_initializer, jax_model = stax.serial(
stax.Dense(1000),
stax.Relu,
stax.Dense(1),
stax.Sigmoid
)
rng_key = jax.random.PRNGKey(0)
_, initial_jax_weights = jax_initializer(rng_key, (1, input_dimension))
########################################
# #
# Create Keras model #
# #
########################################
initial_keras_weights = [*initial_jax_weights[0], *initial_jax_weights[2]]
keras_model = keras.Sequential([
keras.layers.Dense(1000, activation="relu"),
keras.layers.Dense(1, activation="sigmoid")
])
keras_model.compile(
optimizer=keras.optimizers.SGD(learning_rate=0.01),
loss=keras.losses.binary_crossentropy,
metrics=["accuracy"]
)
keras_model.build(input_shape=(1, input_dimension))
keras_model.set_weights(initial_keras_weights)
if __name__ == "__main__":
########################################
# #
# Compare untrained models #
# #
########################################
initial_keras_predictions = keras_model.predict(train_xs, verbose=0)
initial_jax_predictions = jax_model(initial_jax_weights, train_xs)
_, keras_initial_accuracy = keras_model.evaluate(train_xs, train_ys, verbose=0)
jax_initial_accuracy = binary_accuracy(jax_model(initial_jax_weights, train_xs), train_ys)
print("Initial Keras accuracy:", keras_initial_accuracy)
print("Initial JAX accuracy:", jax_initial_accuracy)
########################################
# #
# Train JAX model #
# #
########################################
L = jit(binary_cross_entropy_stable)
gradL = jit(grad(lambda w, x, y: L(jax_model(w, x), y)))
opt_init, opt_apply, get_params = optimizers.sgd(0.01)
network_state = opt_init(initial_jax_weights)
for _ in range(10):
wT = get_params(network_state)
gradient = gradL(wT, train_xs, train_ys)
network_state = opt_apply(
0,
gradient,
network_state
)
final_jax_weights = get_params(network_state)
final_jax_training_predictions = jax_model(final_jax_weights, train_xs)
final_jax_accuracy = binary_accuracy(final_jax_training_predictions, train_ys)
print("Final JAX accuracy:", final_jax_accuracy)
########################################
# #
# Train Keras model #
# #
########################################
for _ in range(10):
keras_model.fit(
train_xs,
train_ys,
epochs=1,
batch_size=1000,
verbose=0
)
final_keras_loss, final_keras_accuracy = keras_model.evaluate(train_xs, train_ys, verbose=0)
print("Final Keras accuracy:", final_keras_accuracy)
########################################
# #
# Swap weights #
# #
########################################
final_keras_weights = keras_model.get_weights()
final_keras_weights_in_jax_format = [
(final_keras_weights[0], final_keras_weights[1]),
tuple(),
(final_keras_weights[2], final_keras_weights[3]),
tuple()
]
jax_accuracy_with_keras_weights = binary_accuracy(
jax_model(final_keras_weights_in_jax_format, train_xs),
train_ys
)
print("JAX accuracy (Keras weights):", jax_accuracy_with_keras_weights)
final_jax_weights_in_keras_format = [*final_jax_weights[0], *final_jax_weights[2]]
keras_model.set_weights(final_jax_weights_in_keras_format)
_, keras_accuracy_with_jax_weights = keras_model.evaluate(train_xs, train_ys, verbose=0)
print("Keras accuracy (JAX weights):", keras_accuracy_with_jax_weights)
Try changing the PRNG seed at line 57 to a value other than 0 to run the experiment using different initial weights.
Your binary_cross_entropy_stable function does not match the output of keras.binary_crossentropy; for example:
x = np.random.rand(10)
y = np.random.rand(10)
print(keras.losses.binary_crossentropy(x, y))
# tf.Tensor(0.8134677734043875, shape=(), dtype=float64)
print(binary_cross_entropy_stable(x, y))
# 0.9781515
That is where I would start if you're trying to exactly duplicate the model.
You can view the source of the keras loss function here: keras/losses.py#L1765-L1810, with the main part of the implementation here: keras/backend.py#L4972-L5017
One detail: it appears that with a sigmoid activation function, Keras re-uses some cached logits to compute the binary cross entropy while avoiding problematic values: keras/backend.py#L4988-L4997. I'm not sure how to easily replicate that behavior using JAX & stax.

How can I tune my neural network to avoid overfitting the mnist data set?

!!!!!!!!!TL;DR at the bottom!!!!!!!!
In an attempt to learn the in's and out's of ML, I have been implementing a neural network optimizer in c++ and wrapped it with swig as a python module. Of course, the first problem I tackled was XOR via the following snip of code: 2 input layers, 2 hidden layers, 1 output layer.
from MikeLearn import NeuralNetwork
from MikeLearn import ClassificationOptimizer
import time
#=======================================================
# Training Set
#=======================================================
X = [[0,1],[1,0],[1,1],[0,0]]
Y = [[1],[1],[0],[0]]
nIn = len(X[0])
nOut = len(Y[0])
#=======================================================
# Model
#=======================================================
verbosity = 0
#Initualize neural network
# NeuralNetwork([nInputs, nHidden1, nHidden2,..,nOutputs],['Activation1','Activation2'...]
N = NeuralNetwork([nIn,2,nOut],['sigmoid','sigmoid'])
N.setLoggerVerbosity(verbosity)
#Initialize the classification optimizer
#ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain)
Opt = ClassificationOptimizer(N,X,Y)
Opt.setLoggerVerbosity(verbosity)
start_time = time.time();
#fit data
#fit(nEpoch,LearningRate)
E = Opt.fit(10000,0.1)
print("--- %s seconds ---" % (time.time() - start_time))
#Make a prediction
print(Opt.predict(X))
This snippet of code yields the following output (Correct answer would be [1,1,0,0])
--- 0.10273098945617676 seconds ---
((0.9398755431175232,), (0.9397522211074829,), (0.0612373948097229,), (0.04882470518350601,))
>>>
Looks great!
Now for the issue. The following snippet of code tries to learn from the mnist dataset, but suffers very obviously from overfitting. ~750 input (28X28 pixels), 50 hidden, 10 output
from MikeLearn import NeuralNetwork
from MikeLearn import ClassificationOptimizer
import matplotlib.pyplot as plt
import numpy as np
import pickle
import time
#=======================================================
# Data Set
#=======================================================
#load the data dictionary
modeldata = pickle.load( open( "mnist_data.p", "rb" ) )
X = modeldata['X']
Y = modeldata['Y']
#normalize data
X = np.array(X)
X = X/255
X = X.tolist()
#training set
X1 = X[0:49999]
Y1 = Y[0:49999]
#validation set
X2 = X[50000:59999]
Y2 = Y[50000:59999]
#number of inputs/outputs
nIn = len(X[0]) #~750
nOut = len(Y[0]) #=10
#=======================================================
# Model
#=======================================================
verbosity = 1
#Initualize neural network
# NeuralNetwork([nInputs, nHidden1, nHidden2,..,nOutputs],['Activation1','Activation2'...]
N = NeuralNetwork([nIn,50,nOut],['sigmoid','sigmoid'])
N.setLoggerVerbosity(verbosity)
#Initialize optimizer
#ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain)
Opt = ClassificationOptimizer(N,X1,Y1)
Opt.setLoggerVerbosity(verbosity)
start_time = time.time();
#fit data
#fit(nEpoch,LearningRate)
E = Opt.fit(10,0.1)
print("--- %s seconds ---" % (time.time() - start_time))
#================================
#Final Accuracy on training set
#================================
XL = Opt.predict(X1)
correct = 0
for i,x in enumerate(XL):
if XL[i].index(max(XL[i])) == Y[i].index(max(Y1[i])):
correct = correct + 1
print("Training set Correct = " + str(correct))
Accuracy = correct/len(XL)*100;
print("Accuracy = " + str(Accuracy) + '%')
#================================
#Final Accuracy on validation set
#================================
XL = Opt.predict(X2)
correct = 0
for i,x in enumerate(XL):
if XL[i].index(max(XL[i])) == Y[i].index(max(Y2[i])):
correct = correct + 1
print("Testing set Correct = " + str(correct))
Accuracy = correct/len(XL)*100;
print("Accuracy = " + str(Accuracy)+'%')
That snippet of code yields the following output which shows the training accuracy and validation accuracy.
-------------------------
Epoch
9
-------------------------
E=
0.00696964
E=
0.350509
E=
3.49568e-05
E=
4.09073e-06
E=
1.38491e-06
E=
0.229873
E=
3.60186e-05
E=
0.000115187
E=
2.29978e-06
E=
2.69165e-06
--- 27.400235176086426 seconds ---
Training set Correct = 48435
Accuracy = 96.87193743874877%
Testing set Correct = 982
Accuracy = 9.820982098209821%
The training set accuracy is great, but then the testing set is no better than a random guess. Any idea what could be causing this?
TL;DR
Solved XOR with a model 2 inputs, 2 hidden, 1 output and sigmoid activation functions. Good results.
Tried to solve the Mnist data set with a model of 750 inputs (28X28 pixels), 50 hidden, 10 output and sigmoid activation functions. Severe overfitting issue. 95% accuracy on the training set, 10% accuracy on validation set.
Any Idea what is causing this?
The cause of overfitting is a combination of the data and model (network in this case). During the training is was 'lazy' and found aspects of the data that worked well in training data but not generalize well.
It is difficult/impossible to point out exactly where in the trained network the nodes/weights are located that are responsible for overfitting.
But we can avoid overfitting with several tricks:
Regularisation
Drop-out (easier to implement)
Change Network Architecture (less layers/less nodes/more dimension-reduction)
https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
To get an idea of regularization, try the playground from tensorflow:
https://playground.tensorflow.org/
A visualisation of dropout
https://yusugomori.com/projects/deep-learning/dropout-relu
Besides try out regularisation techniques, also experiments with different NN architectures.

Tensorflow not recognising numbers in a list correctly

Hi I'm encountering a problem where Tensorflow does not like the numbers I've selected for the training data. I've borrowed the code from the Tensorflow 'Getting started' tutorial but I have subsituted the x_train and y_train data list of coordinates with another.
import numpy as np
import tensorflow as tf
# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# training data
x_train = [1.0, 1.5 ,3.0, 6.0, 8.0, 9.0, 11.0, 12.0,38.0 ,41.0, 82.0]
y_train = [9.5,10.75,14.5,22.0,27.0,29.5,34.5,37,102.0,109.5,212.0]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
#print(sess.run(init))
for i in range(1000):
sess.run(train, {x:x_train, y:y_train})
# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x:x_train, y:y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
Now the code outputs Nan for those values in the lists x_train and y_train. The original x_train and y_train data for which it works is
x_train = [1,2,3,4]
y_train = [0,-1,-2,-3]
EDIT: I'm expecting the value of W and b to converge to 2.5 and 7.0 respectively.
The problem is in the learning rate of the optimizer. Since you have change the inputs to some arrays with a very different scale, the appropriate learning rate for the model is different. You can avoid this sort of problem using some normalization technique on your input, but in this case you can just tune the learning rate until you find the right value. After a couple of tests, I found that in your case something like 1e-5 should work:
optimizer = tf.train.GradientDescentOptimizer(1e-5)
More generally, the problem in this case is that your inputs are quite unevenly distributed. Most of the examples are below 40, but then you have a couple around 100 and one over 200; these last three will have a way bigger impact in the training updates than all the others, so, for example, if there is some noise in one of them, it will have a large effect in the whole model.

Linear regression with tensorflow

I trying to understand linear regression... here is script that I tried to understand:
'''
A linear regression learning algorithm example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''
from __future__ import print_function
import tensorflow as tf
from numpy import *
import numpy
import matplotlib.pyplot as plt
rng = numpy.random
# Parameters
learning_rate = 0.0001
training_epochs = 1000
display_step = 50
# Training Data
train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
7.042,10.791,5.313,7.997,5.654,9.27,3.1])
train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
2.827,3.465,1.65,2.904,2.42,2.94,1.3])
train_X=numpy.asarray(train_X)
train_Y=numpy.asarray(train_Y)
n_samples = train_X.shape[0]
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
sess.run(init)
# Fit all training data
for epoch in range(training_epochs):
for (x, y) in zip(train_X, train_Y):
sess.run(optimizer, feed_dict={X: x, Y: y})
# Display logs per epoch step
if (epoch+1) % display_step == 0:
c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), \
"W=", sess.run(W), "b=", sess.run(b))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
# Graphic display
plt.plot(train_X, train_Y, 'ro', label='Original data')
plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
plt.legend()
plt.show()
Question is what this part represent:
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
And why are there random float numbers?
Also could you show me some math with formals represents cost, pred, optimizer variables?
let's try to put up some intuition&sources together with the tfapproach.
General intuition:
Regression as presented here is a supervised learning problem. In it, as defined in Russel&Norvig's Artificial Intelligence, the task is:
given a training set (X, y) of m input-output pairs (x1, y1), (x2, y2), ... , (xm, ym), where each output was generated by an unknown function y = f(x), discover a function h that approximates the true function f
For that sake, the h hypothesis function combines somehow each x with the to-be-learned parameters, in order to have an output that is as close to the corresponding y as possible, and this for the whole dataset. The hope is that the resulting function will be close to f.
But how to learn this parameters? in order to be able to learn, the model has to be able to evaluate. Here comes the cost (also called loss, energy, merit...) function to play: it is a metric function that compares the output of h with the corresponding y, and penalizes big differences.
Now it should be clear what is exactly the "learning" process here: alter the parameters in order to achieve a lower value for the cost function.
Linear Regression:
The example that you are posting performs a parametric linear regression, optimized with gradient descent based on the mean squared error as cost function. Which means:
Parametric: The set of parameters is fixed. They are held in the exact same memory placeholders thorough the learning process.
Linear: The output of h is merely a linear (actually, affine) combination between the input x and your parameters. So if x and w are real-valued vectors of the same dimensionality, and b is a real number, it holds that h(x,w, b)= w.transposed()*x+b. Page 107 of the Deep Learning Book brings more quality insights and intuitions into that.
Cost function: Now this is the interesting part. The average squared error is a convex function. This means it has a single, global optimum, and furthermore, it can be directly found with the set of normal equations (also explained in the DLB). In the case of your example, the stochastic (and/or minibatch) gradient descent method is used: this is the preferred method when optimizing non-convex cost functions (which is the case in more advanced models like neural networks) or when your dataset has a huge dimensionality (also explained in the DLB).
Gradient descent: tf deals with this for you, so it is enough to say that GD minimizes the cost function by following its derivative "downwards", in small steps, until reaching a saddle point. If you totally need to know, the exact technique applied by TF is called automatic differentiation, kind of a compromise between the numeric and symbolic approaches. For convex functions like yours this point will be the global optimum, and (if your learning rate is not too big) it will always converge to it, so it doesn't matter which values you initialize your Variables with. The random initialization is necessary in more complex architectures like neural networks. There is some extra code regarding the management of the minibatches, but I won't get into that because it is not the main focus of your question.
The TensorFlow approach:
Deep Learning frameworks are nowadays about nesting lots of functions by building computational graphs (you may want to take a look at the presentation on DL frameworks that I did some weeks ago). For constructing and running the graph, TensoFlow follows a declarative style, which means that the graph has to be first completely defined and compiled, before it is deployed and executed. It is very reccommended to read this short wiki article, if you haven't yet. In this context, the setup is split in two parts:
Firstly, you define your computational Graph, where you put your dataset and parameters in memory placeholders, define the hypothesis and cost functions building on them, and tell tf which optimization technique to apply.
Then you run the computation in a Session and the library will be able to (re)load the data placeholders and perform the optimization.
The code:
The code of the example follows this approach closely:
Define the test data X and labels Y, and prepare a placeholder in the Graph for them (which is fed in the feed_dict part).
Define the 'W' and 'b' placeholders for the parameters. They have to be Variables because they will be updated during the Session.
Define pred (our hypothesis) and cost as explained before.
From this, the rest of the code should be clearer. Regarding the optimizer, as I said, tf already knows how to deal with this but you may want to look into gradient descent for more details (again, the DLB is a pretty good reference for that)
Cheers!
Andres
CODE EXAMPLES: GRADIENT DESCENT VS. NORMAL EQUATIONS
This small snippets generate simple multi-dimensional datasets and test both approaches. Notice that the normal equations approach doesn't require looping, and brings better results. For small dimensionality (DIMENSIONS<30k) is probably the preferred approach:
from __future__ import absolute_import, division, print_function
import numpy as np
import tensorflow as tf
####################################################################################################
### GLOBALS
####################################################################################################
DIMENSIONS = 5
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
####################################################################################################
### GRADIENT DESCENT APPROACH
####################################################################################################
# dataset globals
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
ALPHA = 1e-8 # learning rate
LAMBDA = 0.5 # L2 regularization factor
TRAINING_STEPS = 1000
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)] # synthesize data
# ds = normalize_data(ds)
ds = [(x, [f(x)+noise()]) for x in ds] # add labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
x_test = tf.placeholder(tf.float32, shape=(_test_size, DIMENSIONS))
y_test = tf.placeholder(tf.float32, shape=(_test_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS)])
theta_0 = tf.Variable([[0.0]]) # don't forget the bias term!
# forward propagation
train_prediction = tf.matmul(x_train, theta)+theta_0
test_prediction = tf.matmul(x_test, theta) +theta_0
# cost function and optimizer
train_cost = (tf.nn.l2_loss(train_prediction - y_train)+LAMBDA*tf.nn.l2_loss(theta))/float(_train_size)
optimizer = tf.train.GradientDescentOptimizer(ALPHA).minimize(train_cost)
# test results
test_cost = (tf.nn.l2_loss(test_prediction - y_test)+LAMBDA*tf.nn.l2_loss(theta))/float(_test_size)
# run the computation
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized"); print(theta.eval())
for step in range(TRAINING_STEPS):
_, train_c, test_c = s.run([optimizer, train_cost, test_cost],
feed_dict={x_train: train_data, y_train: train_labels,
x_test: test_data, y_test: test_labels })
if (step%100==0):
# it should return bias close to zero and parameters all close to 1 (see definition of f)
print("\nAfter", step, "iterations:")
#print(" Bias =", theta_0.eval(), ", Weights = ", theta.eval())
print(" train cost =", train_c); print(" test cost =", test_c)
PARAMETERS_GRADDESC = tf.concat(0, [theta_0, theta]).eval()
print("Solution for parameters:\n", PARAMETERS_GRADDESC)
####################################################################################################
### NORMAL EQUATIONS APPROACH
####################################################################################################
# dataset globals
DIMENSIONS = 5
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset isused for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
# training globals
LAMBDA = 1e6 # L2 regularization factor
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS+1))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS+1)]) # implicit bias!
# optimum
optimum = tf.matrix_solve_ls(x_train, y_train, LAMBDA, fast=True)
# run the computation: no loop needed!
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized")
opt = s.run(optimum, feed_dict={x_train:train_data, y_train:train_labels})
PARAMETERS_NORMEQ = opt
print("Solution for parameters:\n",PARAMETERS_NORMEQ)
####################################################################################################
### PREDICTION AND ERROR RATE
####################################################################################################
# generate test dataset
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
test_data, test_labels = zip(*ds)
# define hypothesis
h_gd = lambda(x): PARAMETERS_GRADDESC.T.dot(x)
h_ne = lambda(x): PARAMETERS_NORMEQ.T.dot(x)
# define cost
mse = lambda pred, lab: ((pred-np.array(lab))**2).sum()/DS_SIZE
# make predictions!
predictions_gd = np.array([h_gd(x) for x in test_data])
predictions_ne = np.array([h_ne(x) for x in test_data])
# calculate and print total error
cost_gd = mse(predictions_gd, test_labels)
cost_ne = mse(predictions_ne, test_labels)
print("total cost with gradient descent:", cost_gd)
print("total cost with normal equations:", cost_ne)
Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
The variable with type tf.Variable is the parameter which we will learn use TensorFlow. Assume you use the gradient descent to minimize the loss function. You need initial these parameter first. The rng.randn() is used to generate a random value for this purpose.
I think the Getting Started With TensorFlow is a good start point for you.
I'll first define the variables:
W is a multidimensional line that spans R^d (same dimensionality as X)
b is a scalar value (bias)
Y is also a scalar value i.e. the value at X
pred = W (dot) X + b # dot here refers to dot product
# cost equals the average squared error
cost = ((pred - Y)^2) / 2*num_samples
#finally optimizer
# optimizer computes the gradient with respect to each variable and the update
W += learning_rate * (pred - Y)/num_samples * X
b += learning_rate * (pred - Y)/num_samples
Why are W and b set to random well this updates based on gradients from the error calculated from the cost so W and b could have been initialized to anything. It isn't performing linear regression via least squares method although both will converge to the same solution.
Look here for more information: Getting Started

TensorFlow - why doesn't this sofmax regression learn anything?

I am aiming to do big things with TensorFlow, but I'm trying to start small.
I have small greyscale squares (with a little noise) and I want to classify them according to their colour (e.g. 3 categories: black, grey, white). I wrote a little Python class to generate squares, and 1-hot vectors, and modified their basic MNIST example to feed them in.
But it won't learn anything - e.g. for 3 categories it always guesses ≈33% correct.
import tensorflow as tf
import generate_data.generate_greyscale
data_generator = generate_data.generate_greyscale.GenerateGreyScale(28, 28, 3, 0.05)
ds = data_generator.generate_data(10000)
ds_validation = data_generator.generate_data(500)
xs = ds[0]
ys = ds[1]
num_categories = data_generator.num_categories
x = tf.placeholder("float", [None, 28*28])
W = tf.Variable(tf.zeros([28*28, num_categories]))
b = tf.Variable(tf.zeros([num_categories]))
y = tf.nn.softmax(tf.matmul(x,W) + b)
y_ = tf.placeholder("float", [None,num_categories])
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
# let batch_size = 100 --> therefore there are 100 batches of training data
xs = xs.reshape(100, 100, 28*28) # reshape into 100 minibatches of size 100
ys = ys.reshape((100, 100, num_categories)) # reshape into 100 minibatches of size 100
for i in range(100):
batch_xs = xs[i]
batch_ys = ys[i]
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
xs_validation = ds_validation[0]
ys_validation = ds_validation[1]
print sess.run(accuracy, feed_dict={x: xs_validation, y_: ys_validation})
My data generator looks like this:
import numpy as np
import random
class GenerateGreyScale():
def __init__(self, num_rows, num_cols, num_categories, noise):
self.num_rows = num_rows
self.num_cols = num_cols
self.num_categories = num_categories
# set a level of noisiness for the data
self.noise = noise
def generate_label(self):
lab = np.zeros(self.num_categories)
lab[random.randint(0, self.num_categories-1)] = 1
return lab
def generate_datum(self, lab):
i = np.where(lab==1)[0][0]
frac = float(1)/(self.num_categories-1) * i
arr = np.random.uniform(max(0, frac-self.noise), min(1, frac+self.noise), self.num_rows*self.num_cols)
return arr
def generate_data(self, num):
data_arr = np.zeros((num, self.num_rows*self.num_cols))
label_arr = np.zeros((num, self.num_categories))
for i in range(0, num):
label = self.generate_label()
datum = self.generate_datum(label)
data_arr[i] = datum
label_arr[i] = label
#data_arr = data_arr.astype(np.float32)
#label_arr = label_arr.astype(np.float32)
return data_arr, label_arr
For starters, try initializing your W matrix with random values, not zeros - you're not giving the optimizer anything to work with when the output is all zeros for all inputs.
Instead of:
W = tf.Variable(tf.zeros([28*28, num_categories]))
Try:
W = tf.Variable(tf.truncated_normal([28*28, num_categories],
stddev=0.1))
You issue is that the your gradients are increasing/decreasing without bounds, causing the loss function to become nan.
Take a look at this question: Why does TensorFlow example fail when increasing batch size?
Furthermore, make sure that you run the model for a sufficient number of steps. You are only running it once through your train dataset (100 times * 100 examples), and this is not enough for it to converge. Increase it to something like 2000 at a minimum (running 20 times through your dataset).
Edit (can't comment, so i'll add my thoughts here):
The point of the post i linked is that you can use GradientDescentOptimizer, as long as you make the learning rate something like 0.001. That's the issue, your learning rate was too high for the loss function you were using.
Alternatively, use a different loss function, that doesn't increase/decrease the gradients as much. Use tf.reduce_mean instead of tf.reduce_sum in the definition of crossEntropy.
While dga and syncd's responses were helpful, I tried using non-zero weight initialization and larger datasets but to no avail. The thing that finally worked was using a different optimization algorithm.
I replaced:
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
with
train_step = tf.train.AdamOptimizer(0.0005).minimize(cross_entropy)
I also embedded the training for loop in another for loop to train for several epochs, resulting in convergence like this:
===# EPOCH 0 #===
Error: 0.370000004768
===# EPOCH 1 #===
Error: 0.333999991417
===# EPOCH 2 #===
Error: 0.282000005245
===# EPOCH 3 #===
Error: 0.222000002861
===# EPOCH 4 #===
Error: 0.152000010014
===# EPOCH 5 #===
Error: 0.111999988556
===# EPOCH 6 #===
Error: 0.0680000185966
===# EPOCH 7 #===
Error: 0.0239999890327
===# EPOCH 8 #===
Error: 0.00999999046326
===# EPOCH 9 #===
Error: 0.00400000810623
EDIT - WHY IT WORKS: I suppose the problem was that I didn't manually choose a good learning rate schedule, and Adam was able to generate a better one automatically.
Found this question when I was having a similar issue..I fixed mine by scaling the features.
A little background: I was following the tensorflow tutorial, however I wanted to use the data from Kaggle(see data here) to do the modeling, but in the beginning I kepted having the same issue: the model just doesn't learn..after rounds of trouble-shooting, I realized that the Kaggle data was on a completely different scale. Therefore, I scaled the data so that it shares the same scale(0,1) as the tensorflow's MNIST dataset.
Just figured that I would add my two cents here..in case some beginners who are trying to follow the tutorial's settings get stuck like I did =)

Categories

Resources