I tried to train my model using the gradient descent algorithm to minimize the cost value,
and strangely enough, the result came out differently depending on the number of steps.
Below is my training code:
import tensorflow as tf
import os

X = tf.placeholder(tf.float32, shape=[None, 2], name="X")
Y = tf.placeholder(tf.float32, shape=[None, 1], name="Y")
W = tf.Variable(tf.random_normal([2, 1]), name="weight")
b = tf.Variable(tf.random_normal([1]), name="bias")
hypo = tf.sigmoid(tf.matmul(X, W) + b)
cost = -tf.reduce_mean(Y * tf.log(hypo) + (1 - Y) * tf.log(1 - hypo))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-3)
train = optimizer.minimize(cost)

#### Saving model
SAVER_DIR = "model"
saver = tf.train.Saver()
checkpoint_path = os.path.join(SAVER_DIR, "model")
ckpt = tf.train.get_checkpoint_state(SAVER_DIR)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(4201):
    cost_val, hy_val, _ = sess.run([cost, hypo, train], feed_dict={X: x_data, Y: y_data})
saver.save(sess, checkpoint_path, global_step=step)
And restore model:
saver = tf.train.import_meta_graph('./model/model-4200.meta')
saver.restore(sess,'./model/model-4200')
result = sess.run(hypo, feed_dict={X: x_data_test})
fig, ax = plt.subplots()
ax.plot(Julian_test,y_data_test,'ro-') # Correct answer. all items are one of the two:0 or 1.
ax.plot(Julian_test,result,'bo-') # Result of training. Predict answer within
plt.show() # sigmoid function, so all items are in range of 0 ~ 1.
As the figure shows, the result of the sigmoid is reversed.
But when I changed the number of steps to 5000 (in the code above, I only changed the number of steps),
the result came out correctly.
I can't understand why it makes a difference. Did I miss something? I really need help!
In simple terms, by increasing the number of steps you are allowing your TensorFlow model to see the data more times, giving it the ability to learn more about the data and to generalize its representation.
E.g.
Let's say you give your model 2000 steps, and at the end of those 2000 steps it finds a minimum and stops there. But what if the lowest cost the model has found so far is not the global minimum? We can't say, because we restricted it to 2000 steps. If you increase the steps to 20000, the model may find another minimum which gives more accurate results.
But you need to make sure that your model does not overfit, i.e. that it does well on your training data but not on your validation set (so make sure not to increase the number of steps by too much).
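For reference, here is a minimal sketch (not part of the original question) of how you could watch for that while training. It assumes the graph built above (cost, train, X, Y) plus hypothetical x_val / y_val arrays held out from the training data:
for step in range(20001):
    cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
    if step % 1000 == 0:
        # x_val / y_val are a hypothetical held-out split of the original data
        val_cost = sess.run(cost, feed_dict={X: x_val, Y: y_val})
        print(step, "train cost:", cost_val, "val cost:", val_cost)
        # If val_cost starts rising while cost_val keeps falling, the model is
        # beginning to overfit, so there is no point in adding more steps.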
I have a pre-trained PyTorch model. I need to calculate the gradient of the loss with respect to the network's inputs using this model (without training again and only using the pre-trained model).
I wrote the following code, but I am not sure whether it is correct.
import torch
from torch.utils.data import DataLoader

test_X, test_y = load_data(mode='test')
testset_original = MyDataset(test_X, test_y, transform=default_transform)
testloader = DataLoader(testset_original, batch_size=32, shuffle=True)

model = MyModel(device=device).to(device)
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])

gradient_losses = []
for i, data in enumerate(testloader):
    inputs, labels = data
    inputs = inputs.to(device)
    labels = labels.to(device)
    inputs.requires_grad = True
    output = model(inputs)
    loss = loss_function(output)
    loss.backward()
    gradient_losses.append(inputs.grad)
My question is: does this list gradient_losses actually store what I wish to store? If not, what is the correct way to do that?
Does this list gradient_losses actually store what I wish to store?
Yes, if you are looking to get the derivative of the loss with respect to the input, then that seems to be the correct way to do it. Here is a minimal example: take f(x) = a*x, then df/dx = a.
>>> x = torch.rand(10, requires_grad=True)
>>> y = torch.rand(10)
>>> a = torch.tensor([3.], requires_grad=True)
>>> loss = a*x - y
>>> loss.mean().backward()
>>> x.grad
tensor([0.3000, 0.3000, ..., 0.3000, 0.3000])
which, in this case, is equal to a / len(x).
Do note that each gradient you extract with inputs.grad comes from the loss averaged over the whole batch, not from a separate loss for each individual input.
Also, you don't need to .clone() your input gradients, as they are not part of the model and won't get zeroed by model.zero_grad().
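For completeness, here is a hedged sketch of the question's loop with the per-batch input gradients collected and detached; it assumes the same model, testloader, and loss_function as in the question above:
# Sketch: collect the input gradient for each test batch.
gradient_losses = []
for inputs, labels in testloader:
    inputs = inputs.to(device).requires_grad_(True)
    labels = labels.to(device)
    output = model(inputs)
    loss = loss_function(output)   # same call signature as in the question
    loss.backward()
    # detach so the stored tensor is no longer tied to the autograd graph
    gradient_losses.append(inputs.grad.detach().cpu())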
I am using tensorflow to optimize a simple least squares objective function like the following:
$\min_w \; \lVert Y - Xw \rVert_2^2$
Here, $Y$ is the target vector, $X$ is the input matrix, and the vector $w$ represents the weights to be learned.
If I wanted to augment the initial objective function to impose an additional constraint on w1 (the first scalar value in the tensorflow Variable w, where X1 represents the first column of the feature matrix X), how would I achieve this in tensorflow?
One solution I can think of is to use tf.slice to index the first value of $w$ and add a penalty on it to the original cost term, but I am not convinced that it will have the desired effect on the weights.
I would appreciate input on whether something like this is possible in tensorflow and, if so, what the best way to implement it might be.
An alternative option would be to add weight constraints and do it using an augmented Lagrangian objective, but I would first like to explore the regularization option before going the Lagrangian route.
The current code I have for the initial objective function without additional regularization is the following:
train_x, train_y are the training data and training targets, respectively.
test_x, test_y are the testing data and testing targets, respectively.
#Sum of Squared Errs. Cost.
def costfunc(predicted, actual):
    return tf.reduce_sum(tf.square(predicted - actual))

#Mean Squared Error Calc.
def prediction(sess, X, y_, test_x, test_y):
    pred_y = sess.run(y_, feed_dict={X: test_x})
    mymse = tf.reduce_mean(tf.square(pred_y - test_y))
    mseval = sess.run(mymse)
    return mseval, pred_y

with tf.Session() as sess:
    X = tf.placeholder(tf.float32, [None, num_feat]) #Training Data
    Y = tf.placeholder(tf.float32, [None, 1]) # Target Values
    W = tf.Variable(tf.ones([num_feat, 1]), name="weights")

    init = tf.global_variables_initializer()
    sess.run(init)

    #Tensorflow ops and cost function definitions.
    y_ = tf.matmul(X, W)
    cost_history = np.empty(shape=[1], dtype=float)
    out_of_sample_cost_history = np.empty(shape=[1], dtype=float)
    cost = costfunc(y_, Y)
    learning_rate = 0.000001
    training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

    for epoch in range(training_epochs):
        sess.run(training_step, feed_dict={X: train_x, Y: train_y})
        cost_history = np.append(cost_history, sess.run(cost, feed_dict={X: train_x, Y: train_y}))
        out_of_sample_cost_history = np.append(out_of_sample_cost_history, sess.run(cost, feed_dict={X: test_x, Y: test_y}))

    MSETest, pred_test = prediction(sess, X, y_, test_x, test_y) #Predict on full testing set.
tf.slice will do. During optimization, the gradients to w1 will be added up (because gradients accumulate at forks in the graph). Also, please check the graph on TensorBoard to confirm it is wired the way you expect.
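As a hedged illustration of that suggestion (my sketch, not the answerer's code), a penalty on the first weight could be added to the existing cost roughly like this, where lam is a hypothetical regularization strength:
# Sketch: add a quadratic penalty on w1 (the first entry of W) to the cost above.
w1 = tf.slice(W, [0, 0], [1, 1])   # first scalar of the weight vector
lam = 0.1                          # hypothetical penalty strength
augmented_cost = cost + lam * tf.reduce_sum(tf.square(w1))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(augmented_cost)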
I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses the wavelet transform (WT) and a residual learning method (where the residual image = WT(HR) - WT(HR'), and HR' is used for training). The Xavier method suggests initializing the variables from a normal distribution with
stddev = sqrt(2 / (filter_size * filter_size * num_filters))
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
The paper does not explain how to construct the loss function in detail. I am unable to find an equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to stochastic gradient descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume the model() function returns the network mentioned in the paper and lambda = 0.0001:
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])

# get the model output and weights for each conv
pred, weights = model()

# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
regularizers = 0.0   # accumulate the l2 terms below
for weight in weights:
    regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)

learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)
Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable, or switch to slim (it does the initialization automatically for you).
Q2: Is it possible to set the learning rate in log scale?
You can, but do you need it? This is not the first thing that you need to solve in this network. Please check Q3.
However, just for reference, use the following notation.
global_step = tf.Variable(0, trainable=False)  # exponential_decay needs a step counter to advance
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, global_step=global_step, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss, global_step=global_step)
Q3: How to create the loss function described above?
First of all, you have not shown the "pred" to "image" conversion in this message (based on the paper, you need to apply subtraction and IDWT to obtain the final image).
There is one problem here: the logits have to be calculated with respect to your label data, i.e. if you use the marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
The logits are the output of your network; you will use them to calculate the rest. Instead of matmul, you can add a conv2d layer here without batch normalization or an activation function, and set the output feature count to 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capability. However, it is an L1 distance and it may suffer from some issues. Consider the following situation:
Let's say your output is [10, 10, 10, 0, 0] and you are trying to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10), but you have 3/5 success. It may also indicate some overfitting.
For the same target, consider the output [6, 6, 6, 6, 6]. It still has a loss of 20 (4 + 4 + 4 + 4 + 4), but if you apply a threshold of 5, you achieve 5/5 success. This is the case we want.
If you use an L2 loss, for the first case you will have 10^2 + 10^2 = 200 as the loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, the optimizer will try to move away from the first case as quickly as possible, preferring overall success to perfect success on some outputs and complete failure on others. You can apply a loss function like this for that:
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can look at the cross-entropy loss function (it applies softmax internally, so do not apply softmax twice):
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=image))
Q1. How should I initialize the variables? Is the code below correct?
That's correct (although it is missing an opening parenthesis). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
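For illustration, a log-scale schedule over four epochs with tf.train.piecewise_constant might look like the sketch below; steps_per_epoch is a hypothetical value you would compute from your dataset size and batch size:
# Sketch: learning rate dropping from 1e-1 to 1e-4 in log scale, one drop per epoch.
global_step = tf.Variable(0, trainable=False)
boundaries = [steps_per_epoch, 2 * steps_per_epoch, 3 * steps_per_epoch]
values = [1e-1, 1e-2, 1e-3, 1e-4]   # one value per interval
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(loss, global_step=global_step)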
Q3: How to create the loss function described above?
Your loss function looks correct.
The other answers are very detailed and helpful. Here is a code example that uses a placeholder to decay the learning rate on a log scale. HTH.
import tensorflow as tf
import numpy as np

# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D, 1)
y = np.dot(x, w)
print(y.shape)

# modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size, 1])
W = tf.get_variable("w", shape=[D, 1], initializer=tni)
B = tf.zeros([1])

lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X, W), B)
print(pred.shape)
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)

learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
    for i in range(100000):
        if i > 0 and i % N == 0:
            # epoch done, decrease learning rate by 2
            learning_rate /= 2
            print("Epoch completed. LR =", learning_rate)
        idx = i // batch_size + i % batch_size
        f = {X: x[idx:idx + batch_size, :], Y: y[idx:idx + batch_size, :], lr: learning_rate}
        _, err = sess.run([train_op, mse], feed_dict=f)
        acc_err += err
        if i % 5000 == 0:
            print("Average error = {}".format(acc_err / 5000))
            acc_err = 0.0
I built a neural network with tensorflow. It is a simple 3 layer neural network with the last layer being softmax.
I tried it on standard adult income dataset (e.g. https://archive.ics.uci.edu/ml/datasets/adult) since it is publicly available, has a good amount of data (roughly 50k examples) and also provides separate test data.
As there are some categorical attributes, I converted them into one hot encodings. For neural network I used Xavier initialization and Adam Optimizer. As there are only two output classes (>50k and <=50k) the last softmax layer had only two neurons. After one hot encoding expansion, the 14 attributes / columns expanded into 108 columns.
I experimented with different number of neurons in the first two hidden layers (from 5 to 25). I also experimented with number of iterations (from 1000 to 20000).
The training accuracy wasn't affected much by the number of neurons. It went up a little with more iterations. However, I could not do any better than 82% :(
Am I missing something basic in my approach? Has anyone tried this (a neural network with this dataset)? If so, what are the expected results? Could the low accuracy be due to missing values? (I am planning to try filtering out all the missing values if there aren't many in the dataset.)
Any other ideas? Here is my tensorflow neural network code in case there are any bugs in it etc.
def create_placeholders(n_x, n_y):
    X = tf.placeholder(tf.float32, [n_x, None], name = "X")
    Y = tf.placeholder(tf.float32, [n_y, None], name = "Y")
    return X, Y
def initialize_parameters(num_features):
    tf.set_random_seed(1)            # so that your "random" numbers match ours
    layer_one_neurons = 5
    layer_two_neurons = 5
    layer_three_neurons = 2
    W1 = tf.get_variable("W1", [layer_one_neurons, num_features], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b1 = tf.get_variable("b1", [layer_one_neurons, 1], initializer = tf.zeros_initializer())
    W2 = tf.get_variable("W2", [layer_two_neurons, layer_one_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b2 = tf.get_variable("b2", [layer_two_neurons, 1], initializer = tf.zeros_initializer())
    W3 = tf.get_variable("W3", [layer_three_neurons, layer_two_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b3 = tf.get_variable("b3", [layer_three_neurons, 1], initializer = tf.zeros_initializer())
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    return parameters
def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX

    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """
    # Retrieve the parameters from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    W3 = parameters['W3']
    b3 = parameters['b3']

    Z1 = tf.add(tf.matmul(W1, X), b1)
    A1 = tf.nn.relu(Z1)
    Z2 = tf.add(tf.matmul(W2, A1), b2)
    A2 = tf.nn.relu(Z2)
    Z3 = tf.add(tf.matmul(W3, A2), b3)
    return Z3
def compute_cost(Z3, Y):
    """
    Computes the cost

    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3

    Returns:
    cost - Tensor of the cost function
    """
    # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
    logits = tf.transpose(Z3)
    labels = tf.transpose(Y)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels))
    return cost
def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001, num_epochs = 1000, print_cost = True):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.

    Arguments:
    X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
    Y_train -- test set, of shape (output size = 6, number of training examples = 1080)
    X_test -- training set, of shape (input size = 12288, number of training examples = 120)
    Y_test -- test set, of shape (output size = 6, number of test examples = 120)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    print_cost -- True to print the cost every 100 epochs

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    ops.reset_default_graph()    # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)        # to keep consistent results
    seed = 3                     # to keep consistent results
    (n_x, m) = X_train.shape     # (n_x: input size, m : number of examples in the train set)
    n_y = Y_train.shape[0]       # n_y : output size
    costs = []                   # To keep track of the cost

    # Create Placeholders of shape (n_x, n_y)
    X, Y = create_placeholders(n_x, n_y)

    # Initialize parameters
    parameters = initialize_parameters(X_train.shape[0])

    # Forward propagation: Build the forward propagation in the tensorflow graph
    Z3 = forward_propagation(X, parameters)

    # Cost function: Add cost function to tensorflow graph
    cost = compute_cost(Z3, Y)

    # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
    optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)

    # Initialize all the variables
    init = tf.global_variables_initializer()

    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:
        # Run the initialization
        sess.run(init)

        # Do the training loop
        for epoch in range(num_epochs):
            _ , epoch_cost = sess.run([optimizer, cost], feed_dict={X: X_train, Y: Y_train})

            # Print the cost every epoch
            if print_cost == True and epoch % 100 == 0:
                print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            if print_cost == True and epoch % 5 == 0:
                costs.append(epoch_cost)

        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('iterations (per tens)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # lets save the parameters in a variable
        parameters = sess.run(parameters)
        print ("Parameters have been trained!")

        # Calculate the correct predictions
        correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))

        # Calculate accuracy on the test set
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
        #print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))

        return parameters
import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.python.framework import ops
import pandas as pd
%matplotlib inline
np.random.seed(1)
df = pd.read_csv('adult.data', header = None)
X_train_orig = df.drop(df.columns[[14]], axis=1, inplace=False)
Y_train_orig = df[[14]]
X_train = pd.get_dummies(X_train_orig) # get one hot encoding
Y_train = pd.get_dummies(Y_train_orig) # get one hot encoding
parameters = model(X_train.T, Y_train.T, None, None, num_epochs = 10000)
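One practical note (my addition, not from the question): if you later one-hot encode the separate test file the same way, pd.get_dummies can produce mismatched columns between train and test when some categories are missing from one of the files. A common way to keep them aligned is to reindex the test frame against the training columns; X_test_orig below is a hypothetical frame read from the test file:
X_test = pd.get_dummies(X_test_orig)
# align the one-hot columns with the training data, filling missing categories with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)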
Any suggestions for other publicly available datasets for trying this out?
I tried standard algorithms on this dataset from scikit-learn with default parameters and got the following accuracies:
Random Forest: 86%
SVM: 96%
kNN: 83%
MLP: 79%
I have uploaded my iPython notebook for this at: https://github.com/sameermahajan/ClassifiersWithIncomeData/blob/master/Scikit%2BLearn%2BClassifiers.ipynb
The best accuracy is with SVM, which can be expected from the explanation at: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html Interestingly, SVM also took a lot of time to run, far more than any other method.
Looking at the MLPClassifier accuracy above, this may not be a good problem to solve with a neural network. My neural network wasn't that bad after all! Thanks for all the responses and your interest in this.
I didn't experiment on this dataset, but after looking at some papers and doing some research, it looks like your network is doing OK.
First, is your accuracy calculated from the training set or the test set? Having both will give you a good hint of how your network is performing.
I'm still a bit new to machine learning, but maybe I can give some help:
By looking at the data documentation link here: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
And this paper: https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a120.pdf
From those links, 85% accuracy on the training and test set looks like a good score; you are not too far off.
Do you have some kind of cross-validation to look for overfitting in your network?
I don't have your code, so I can't tell whether this is a bug or a programming-related issue; maybe sharing your code would be a good idea.
I think you would gain more accuracy by pre-processing your data a bit:
There are a lot of unknowns inside your data, and neural networks are very sensitive to mislabeling and bad data.
You should try to find and replace or remove the unknowns.
You could also try to identify the most useful features and drop the ones that are nearly useless.
Feature scaling / data normalization can also be quite important for neural networks. I didn't look much into the data, but maybe you can try to figure out how to scale your data between [0, 1] if it's not done already.
The document I linked seems to see an improvement in performance when adding layers, up to 5 layers; did you try adding more layers?
You can also add dropout if your network overfits, if you haven't already.
I would maybe try other models that are generally good for these tasks, like SVM (support vector machine), logistic regression, or even random forest, but looking at the results I'm not sure those will perform better than the artificial neural network.
I would also take a look at these links: https://www.kaggle.com/wenruliu/adult-income-dataset/feed
https://www.kaggle.com/wenruliu/income-prediction
At those links there are people trying algorithms and giving tips on processing the data and tackling this subject.
Hope it helped
Good luck,
Marc.
I think you are focusing too much on your network structure and forgetting that your results also depend largely on data quality. I tried a quick off-the-shelf random forest and it gave me results similar to yours (acc = 0.8275238).
I suggest you do some feature engineering (the Kaggle link provided by @Marc has some nice examples). Decide on a strategy for your NAs, group values when you have many factor levels in categorical variables (e.g. countries grouped into continents), or discretise continuous variables (the age variable into levels such as old, mid_aged, young), as in the sketch below.
Play with your data, study your dataset, and try to apply domain expertise to remove redundant or overly narrow information. Once this is done, start tweaking your model. Additionally, you can consider doing as I did: use ensemble models (which are usually fast and pretty accurate with default values) like RF or XGB to check whether the results are consistent across all your models. Once you are sure you are on the right track, you can start tweaking structure, layers, etc. and see if you can push your results even further.
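As a rough illustration of that kind of preprocessing, here is a sketch (mine, not part of the answer's experiments); it assumes the adult.data layout where column 0 is age and missing values appear as '?':
import pandas as pd
import numpy as np

# Sketch: treat the dataset's '?' placeholders as missing values, then bin a continuous column.
df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df = df.replace('?', np.nan)      # simplest NA strategy below: drop the rows
df = df.dropna()

# Discretise a continuous variable (column 0 is age) into coarse levels
df['age_group'] = pd.cut(df[0], bins=[0, 30, 55, 100],
                         labels=['young', 'mid_aged', 'old'])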
Hope this helps.
Good luck!
I would like to train the weights of a model based on the sum of the loss values of several batches. However, it seems that once you run the graph for each of the individual batches, the object that is returned is just a regular numpy array. So when you try to use an optimizer like GradientDescentOptimizer, it no longer has information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that would help minimize the loss. Here's an example tensorflow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])

loss_op = some_loss_function(weights, feature_values, labels)

with tf.Session() as sess:
    for batch in batches:
        feed_dict = fill_feature_values_and_labels(batch)
        # Calculates loss for one batch
        loss = sess.run(loss_op, feed_dict=feed_dict)
        # Adds it to total loss
        total_loss += loss

# Want to train weights to minimize total_loss, however this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)
with tf.Session() as sess:
    for step in xrange(num_steps):
        sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around this problem, where I want to use information across many batches but still need the graph intact in order to preserve the fact that total_loss is a function of the weights?
The thing you optimize with any of the trainers must be a part of the graph; here what you train on is the actual realized result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches, e.g.
inputs = tf.placeholder("float", (number_of_batches, batch_size, input_size))
Then have your target also be a 3D tensor, which can be trained on.
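To make that concrete, here is a minimal sketch (my own, reusing the question's hypothetical some_loss_function) where the per-batch losses are summed inside the graph, so the total loss remains a function of the weights:
# Sketch: keep the sum of per-batch losses as a graph op instead of a numpy value.
# number_of_batches is assumed known; some_loss_function is the question's helper.
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[number_of_batches, num_feature_values])
labels = tf.placeholder(tf.int32, shape=[number_of_batches, 1])

# One loss op per batch slice; tf.add_n sums them inside the graph.
per_batch_losses = [some_loss_function(weights, feature_values[i], labels[i])
                    for i in range(number_of_batches)]
total_loss_op = tf.add_n(per_batch_losses)

train_op = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss_op)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        # all_feature_values / all_labels: hypothetical arrays holding every batch at once,
        # shaped to match the placeholders above.
        sess.run(train_op, feed_dict={feature_values: all_feature_values,
                                      labels: all_labels})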