I have Tensorflow code for multi-task learning (one input, several outputs, similar to this: https://jg8610.github.io/Multi-Task/). For further explanation see below. The code works, but is slow as there's a lot of overhead from reading data in Python and feeding it to the GPU (with the tf.Session's feed_dict).
So my plan is now to preload the data according to https://www.tensorflow.org/programmers_guide/reading_data#preloaded_data [storing it in a tf.constant and using TF's queuing system]. This raises some problems, of which the most central for now seems to be:
If I preload the different task data into different tensors, I no longer have a task-generic X_in. That means that when declaring the shared layer, I now need to make a decision whether to connect it to X_input_task_A or X_input_task_B, and obviously that's not going to result in a shared layer.
My question
Would you have any idea how to solve this problem, i.e. how to define shared layers with task-specific tensors and then train by alternating between tasks? How would you then call the different optimizer operations in alternation?
Further explanation on the Multi-task learning paradigm
For background, what the mentioned blog post (as well as my code so far) does is to define a placeholder X_in plus a shared layer that consumes that input op. Then, for each task we want to learn, we have different projections and loss functions that use task-specific placeholders y_task, and training happens by alternately running session.run(optimizer_task, feed_dict={X_in: X_batch_task, y_task: y_batch_task}), where optimizer_task is some task-specific optimizer. This is basically what my code does now - it works but is slow because I need to feed the data:
# PLACEHOLDERS
X_in = tf.placeholder(tf.float32, [batch_size, 100])
y_task_a = tf.placeholder(tf.float32, [batch_size, 4])  # 4 output classes
y_task_b = tf.placeholder(tf.float32, [batch_size, 2])  # 2 output classes

# SHARED LAYER
W = tf.get_variable("W", [100, 50])
shared_layer = tf.sigmoid(tf.matmul(X_in, W))

# TASK-SPECIFIC OUTPUTS (kept as logits, since the losses below apply softmax internally)
W_task_a = tf.get_variable("Wa", [50, 4])
W_task_b = tf.get_variable("Wb", [50, 2])
pred_task_a = tf.matmul(shared_layer, W_task_a)
pred_task_b = tf.matmul(shared_layer, W_task_b)

# TASK-SPECIFIC LOSSES AND OPTIMIZERS
loss_task_a = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred_task_a, labels=y_task_a))
loss_task_b = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred_task_b, labels=y_task_b))
optimizer_a = ...(loss_task_a)
optimizer_b = ...(loss_task_b)

# TRAINING
with tf.Session() as sess:
    for i in range(ITERS):
        # ALTERNATE BETWEEN TASKS, GET BATCH FROM DATA PER TASK AND TRAIN
        X_a, y_a = data_task_a.get_batch()
        X_b, y_b = data_task_b.get_batch()
        sess.run(optimizer_a, feed_dict={X_in: X_a, y_task_a: y_a})
        sess.run(optimizer_b, feed_dict={X_in: X_b, y_task_b: y_b})
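For illustration, here is a minimal sketch of one way to keep the layer shared when each task has its own input tensor: build the shared layer inside a function that uses tf.variable_scope with reuse, so both task sub-graphs resolve to the same weight matrix. The input tensors below are hypothetical stand-ins for whatever preloaded tensors (tf.constant slices or dequeue ops) the pipeline produces; this is untested sketch code, not a drop-in solution.

import tensorflow as tf

def shared_layer(x):
    # Both calls resolve to the same variable "shared/W" thanks to AUTO_REUSE (TF 1.4+).
    with tf.variable_scope("shared", reuse=tf.AUTO_REUSE):
        W = tf.get_variable("W", [100, 50])
        return tf.sigmoid(tf.matmul(x, W))

# Hypothetical stand-ins for preloaded, task-specific input tensors.
X_in_task_a = tf.random_normal([32, 100])
X_in_task_b = tf.random_normal([32, 100])

hidden_a = shared_layer(X_in_task_a)  # creates shared/W
hidden_b = shared_layer(X_in_task_b)  # reuses the same shared/W

# Sanity check: only one shared weight matrix exists in the graph.
assert len([v for v in tf.global_variables() if v.name.startswith("shared")]) == 1

Training could then still alternate between the two task-specific optimizer ops with separate session.run calls, just without feed_dict.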
Related
I have tried to implement Proximal Policy Optimization with Intrinsic Curiosity Rewards for a stateful LSTM neural network.
The losses in both PPO and ICM are diverging, and I would like to find out whether it is a bug in the code or badly selected hyperparameters.
Code details (where the implementation could be wrong):
In the ICM model I also use an LSTM as the first layer, to match the input dimensions.
In the ICM the whole dataset is propagated at once, with zeros as the initial hidden state (the resulting tensors are different from what they would be if I propagated only one state or batch and re-used the hidden cells).
In the PPO advantage and discounted-reward processing, the dataset is propagated one state at a time and the hidden cells are re-used (the exact opposite of the ICM, because here the same model is used for selecting actions and this approach is "real-time-like").
In PPO training, the model is trained on batches with re-use of the hidden cells.
I have used https://github.com/adik993/ppo-pytorch as the base code and reworked it to run on my environment and to use an LSTM.
I may provide code samples later if specifically requested, due to the large number of lines.
Hyperparameters:
def __init_curiosity(self):
    curiosity_factory = ICM.factory(MlpICMModel.factory(), policy_weight=1,
                                    reward_scale=0.1, weight=0.2,
                                    intrinsic_reward_integration=0.01,
                                    reporter=self.reporter)
    self.curiosity = curiosity_factory.create(self.state_converter,
                                              self.action_converter)
    self.curiosity.to(self.device, torch.float32)
    self.reward_normalizer = StandardNormalizer()

def __init_PPO_trainer(self):
    self.PPO_trainer = PPO(agent=self,
                           reward=GeneralizedRewardEstimation(gamma=0.99, lam=0.95),
                           advantage=GeneralizedAdvantageEstimation(gamma=0.99, lam=0.95),
                           learning_rate=1e-3,
                           clip_range=0.3,
                           v_clip_range=0.3,
                           c_entropy=1e-2,
                           c_value=0.5,
                           n_mini_batches=32,
                           n_optimization_epochs=10,
                           clip_grad_norm=0.5)
    self.PPO_trainer.to(self.device, torch.float32)
Training graphs (notice the large numbers on the y axis):
UPDATE
For now I have reworked the LSTM processing to use batches and hidden memory everywhere (for both the main model and the ICM), but the problem is still present. I have traced it to the output of the ICM's model; the output diverges mainly in the action_hat tensor.
Found the problem... In the main model I use softmax for eval runs and log_softmax for training in the output layer, and according to the PyTorch docs CrossEntropyLoss applies log_softmax internally, so as advised I used NLLLoss, but for the computation of the ICM model loss, whose output layer does not have a softmax function! So switching back to CrossEntropyLoss (which was originally in the reference code) solved the ICM loss divergence.
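For reference, the distinction behind this fix is that nn.CrossEntropyLoss expects raw logits (it applies log_softmax internally), while nn.NLLLoss expects log-probabilities that already went through log_softmax. A minimal sketch of the equivalence, on illustrative tensors only:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 4)           # raw network outputs, no softmax applied
targets = torch.randint(0, 4, (8,))  # class indices

# CrossEntropyLoss on raw logits ...
loss_ce = F.cross_entropy(logits, targets)
# ... equals NLLLoss on log_softmax of the logits.
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
assert torch.allclose(loss_ce, loss_nll)

# Feeding raw logits (without log_softmax) into NLLLoss, as happened with the
# ICM output layer here, computes a different, unbounded quantity that can diverge.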
I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example to understand the functioning of CNNs better. Here is my attempt at constructing a small CNN for a very simple task, which gives unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images and the outputs y being the corresponding blurred images (y = signal.convolve2d(x, gaussian_kernel, boundary='fill', mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task; nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training =np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',
                                  shape=[size_image, size_image, 1, 1],
                                  dtype=tf.float64,
                                  initializer=tf.contrib.layers.xavier_initializer(seed=0),
                                  trainable=True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
    # Run the initialization
    sess.run(init)
    # Do the training loop
    for epoch in range(nb_epochs):
        minibatch_cost = 0.
        seed = seed + 1
        permutation = list(np.random.permutation(n_train))
        shuffled_ind_samples = np.array(ind_samples_training)[permutation]
        # Learning rate update
        if learning_rate.shape[0] > 1:
            lr = learning_rate[epoch]
        nb_minibatches = int(np.ceil(n_train/minibatch_size))
        for num_minibatch in range(nb_minibatches):
            # Minibatch indices
            ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
            # Loading of the original image (X) and the blurred image (Y)
            minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch, size_image, normalization, sigma)
            _, temp_cost, filter_learnt = sess.run([optimizer, cost, filter_to_learn],
                                                   feed_dict={X: minibatch_X, Y_blur_true: minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the xavier initializer method provided by tensorflow and with the true values of the gaussian kernel we actually would like to learn. In both cases, the learnt filter ends up too different from the true gaussian one, as can be seen in the two images available at https://github.com/megalinier/Helsinki-project.
By examining the photos, it seems like the network is learning OK, as the predicted image is not so far off the true label; for better results you can tweak some hyperparameters, but that is not the issue here.
I think what you are missing is the fact that different kernels can produce quite similar results, since it is a convolution.
Think about it: you are multiplying some matrix with another and then summing all the results to create a new pixel. Now if the true label sum is 10, it could be the result of 2.5 + 2.5 + 2.5 + 2.5 or of -10 + 10 + 10 + 0.
What I am trying to say is that your network could be learning just fine, but you will get different values in the conv kernel than in the original filter.
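As a tiny numeric illustration of that point (purely made-up numbers), two very different kernels can produce exactly the same output pixel for a given patch:

import numpy as np

patch = np.ones((2, 2))                      # a 2x2 image patch of ones
k1 = np.array([[2.5, 2.5], [2.5, 2.5]])      # one kernel
k2 = np.array([[-10.0, 10.0], [10.0, 0.0]])  # a very different kernel

# Both kernels give the same convolution output (10.0) for this patch,
# so matching the blurred output does not by itself pin down the kernel.
print(np.sum(patch * k1), np.sum(patch * k2))  # 10.0 10.0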
I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong, but there could be multiple culprits here. For one, squared error provides a weak signal when the target and prediction are already quite similar, and while the xavier-initialized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
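To make that concrete, here is a sketch of what such a penalty could look like, reusing the tensor names from the question's code; reg_strength is a made-up hyperparameter to tune:

# Assumes filter_to_learn, Y_blur_true and Y_blur_hat from the code above.
reg_strength = 1e-4  # hypothetical value; tune on held-out data
data_cost = tf.losses.mean_squared_error(Y_blur_true, Y_blur_hat, weights=1.0)
l1_penalty = tf.reduce_sum(tf.abs(filter_to_learn))  # pulls unneeded filter values towards 0
cost = data_cost + reg_strength * l1_penalty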
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).
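And a sketch of that last suggestion, swapping Adam for momentum SGD with exponential learning-rate decay; the specific numbers are placeholders, and cost refers to the loss defined in the question:

# Assumes cost from the question's code; hyperparameters are placeholders.
global_step = tf.train.get_or_create_global_step()
decayed_lr = tf.train.exponential_decay(learning_rate=1e-3,
                                        global_step=global_step,
                                        decay_steps=1000,
                                        decay_rate=0.96,
                                        staircase=True)
optimizer = tf.train.MomentumOptimizer(decayed_lr, momentum=0.9).minimize(cost, global_step=global_step)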
I am trying to create a very simple neural network reading in information with the shape 1x2048 and creating a classification for two categories (object or not object). The graph structure, however, deviates from what I believe I have coded. The dense layers should be included in the scope of "inner_layer" and should be receiving their input from the "input" placeholder. Instead, TF seems to be treating them as independent layers which do not receive any information from "input".
Also, when trying to use tensorboard summaries, I get an error telling me that I have not inserted inputs for the apparent placeholders of the dense layers. When omitting tensorboard, everything works as I expected based on the code.
I have spent a lot of time trying to find the problem, but I think I must be overlooking something very basic.
The graph I get in tensorboard is on this image.
Which corresponds to the following code:
tf.reset_default_graph()
keep_prob = 0.5

# Graph structure
## Placeholders for input
with tf.name_scope('input'):
    x_ = tf.placeholder(tf.float32, shape=[None, transfer_values_train.shape[1]], name="input1")
    y_ = tf.placeholder(tf.float32, shape=[None, num_classes], name="labels")

## Dense layer one with 2048 nodes
with tf.name_scope('inner_layers'):
    first_layer = tf.layers.dense(x_, units=2048, activation=tf.nn.relu, name="first_dense")
    dropout_layer = tf.nn.dropout(first_layer, keep_prob, name="dropout_layer")
    # readout layer, without softmax
    y_conv = tf.layers.dense(dropout_layer, units=2, activation=tf.nn.relu, name="second_dense")

# Evaluation and training
with tf.name_scope('cross_entropy'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=y_conv),
                                   name="cross_entropy_layer")
with tf.name_scope('trainer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
with tf.name_scope('accuracy'):
    prediction = tf.argmax(y_conv, axis=1)
    correct_prediction = tf.equal(prediction, tf.argmax(y_, axis=1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Does anyone have an idea why the graph is so different from what you would expect based on the code?
The graph rendering in tensorboard may be a bit confusing (initially), but it's correct. Take a look at this picture where I've left only the inner_layers part of your graph:
You may notice that:
The first_dense and second_dense are actually name scopes themselves (generated by the tf.layers.dense function; see also this question).
Their input/output tensors are inside the inner_layers scope and wire correctly to the dropout_layer. Here, in each of the dense layers, live the corresponding linear ops: MatMul, BiasAdd, Relu.
Both scopes also include the variables (a kernel and a bias each), which are shown separately from inner_layers. They encapsulate the ops related specifically to the variable, such as read, assign, initialize, etc. The linear ops in first_dense depend on the variable ops of first_dense, and second_dense likewise.
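If it helps, you can verify this split without TensorBoard by listing the op names straight from the graph; a quick sketch (assuming the graph from the question has already been built) shows the compute ops living under inner_layers/ and the variable ops under first_dense/ and second_dense/:

# Run after constructing the graph from the question.
graph = tf.get_default_graph()
for op in graph.get_operations():
    if "dense" in op.name:
        print(op.name)
# Roughly expected: compute ops such as inner_layers/first_dense/MatMul and
# inner_layers/first_dense/Relu, alongside variable ops such as
# first_dense/kernel, first_dense/bias, second_dense/kernel, ...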
The reason for this separation is that in distributed settings the variables are managed by a different task called the parameter server. It is usually run on a different device (CPU as opposed to GPU), sometimes even on a different machine. In other words, for tensorflow the variable management is by design separate from the matrix computation.
Having said that, I'd love to see a mode in tensorflow that would not split the scope into variables and ops and keep them coupled.
Other than this the graph perfectly matches the code.
I am a beginner in tensorflow.
I have a data set with 43 inputs and one output. I am going to create mini-batches of the data to run deep learning.
Here are my inputs:
x = tf.placeholder(tf.float32, shape=[None, 43])
y_ = tf.placeholder(tf.float32, shape=[None])
which I am feeding from a matlab file, like this:
train_mat = train_mat["binary_train"].value
feed_dict={x:Train[0:100,0:43] , y_:Train[0:100,43]}
I would like to have random batches instead of always taking records 0:100.
I saw tf.train.batch, but I could not work out how it works.
Could you please guide me on how I can do that?
Thanks,
Afshin
The tf.train.batch and other similar methods are based on Queues, which are best suited to loading huge amounts of samples asynchronously in parallel. The document here describes the basics of using queues in TensorFlow. There is also another blog post describing how to read data from files.
If you are going to use queues, the placeholder and feed_dict are unnecessary.
For your specific case, a potential solution may look like this:
from tensorflow.python.training import queue_runner

# capacity and min_after_dequeue could be set according to your case
# (assuming train_mat has 44 columns: 43 features + 1 label)
q = tf.RandomShuffleQueue(1000, 500, tf.float32, shapes=[[44]])
enq = q.enqueue_many(train_mat)
queue_runner.add_queue_runner(queue_runner.QueueRunner(q, [enq]))
deq = q.dequeue()  # a single dequeued element is a 1-D row of length 44
input = deq[0:43]
label = deq[43]
x, y_ = tf.train.batch([input, label], 100)
# then you can use x and y_ directly in the inference and training process.
The code above is based on some assumptions, because the information provided in the question is not sufficient. However, I hope the code can inspire you in some way.
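One more detail worth noting about the sketch above: the queue runners have to be started inside a session before x and y_ can yield data, otherwise tf.train.batch blocks forever. A minimal usage sketch:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for step in range(100):  # placeholder loop
            x_val, y_val = sess.run([x, y_])  # or run a train op built on x and y_
    finally:
        coord.request_stop()
        coord.join(threads)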
I designed a neural network in tensorflow for my regression problem by following and adapting the tensorflow tutorial. However, due to the structure of my problem (~300,000 data points and the use of the costly FTRL optimizer), my program took too long to execute even on my 32-CPU machine (I don't have GPUs).
According to this comment and a quick confirmation via htop, it appears that I have some single-threaded operations, and the culprit should be feed_dict.
Therefore, as advised here, I tried to use queues for multi-threading my program.
I wrote a simple code file with a queue to train a model, as follows:
import numpy as np
import tensorflow as tf
import threading
#Function for enqueueing in parallel my data
def enqueue_thread():
    sess.run(enqueue_op, feed_dict={x_batch_enqueue: x, y_batch_enqueue: y})
#Set the number of couples (x, y) I use for "training" my model
BATCH_SIZE = 5
#Generate my data where y=x+1+little_noise
x = np.random.randn(10, 1).astype('float32')
y = x+1+np.random.randn(10, 1)/100
#Create the variables for my model y = x*W+b, then W and b should both converge to 1.
W = tf.get_variable('W', shape=[1, 1], dtype='float32')
b = tf.get_variable('b', shape=[1, 1], dtype='float32')
#Prepare the placeholdeers for enqueueing
x_batch_enqueue = tf.placeholder(tf.float32, shape=[None, 1])
y_batch_enqueue = tf.placeholder(tf.float32, shape=[None, 1])
#Create the queue
q = tf.RandomShuffleQueue(capacity=2**20, min_after_dequeue=BATCH_SIZE, dtypes=[tf.float32, tf.float32], seed=12, shapes=[[1], [1]])
#Enqueue operation
enqueue_op = q.enqueue_many([x_batch_enqueue, y_batch_enqueue])
#Dequeue operation
x_batch, y_batch = q.dequeue_many(BATCH_SIZE)
#Prediction with linear model + bias
y_pred = tf.add(tf.multiply(x_batch, W), b)
#MAE cost function
cost = tf.reduce_mean(tf.abs(y_batch-y_pred))
learning_rate = 1e-3
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
available_threads = 1024
#Feed the queue
for i in range(available_threads):
    threading.Thread(target=enqueue_thread).start()
#Train the model
for step in range(1000):
    _, cost_step = sess.run([train_op, cost])
    print(cost_step)
Wf=sess.run(W)
bf=sess.run(b)
This code doesn't work, because each time I evaluate x_batch, a y_batch is also dequeued, and vice versa. Then I am not comparing the features with the corresponding "result".
Is there an easy way to avoid this problem?
My mistake, everything worked fine.
I was misled because I estimated my performance at each step of the algorithm on different batches, and also because my model was too complicated for a dummy one (I should have had something like y = W*x or y = x+b).
Then, when I tried to print in the console, I executed sess.run several times on different variables and obviously got inconsistent results.
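To make that concrete: with dequeue-based inputs, every sess.run call that touches the input pipeline pulls a fresh batch, so related tensors have to be fetched in a single call if they are supposed to come from the same batch. A small sketch using the names from the code above:

# Wrong: two separate run calls dequeue two different batches,
# so this cost and this prediction do not correspond to each other.
cost_value = sess.run(cost)
pred_value = sess.run(y_pred)

# Right: one run call evaluates everything on the same dequeued batch.
cost_value, pred_value, xb, yb = sess.run([cost, y_pred, x_batch, y_batch])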
Even though your problem is solved, I wanted to show you a small inefficiency in your code. When you created your RandomShuffleQueue you specified capacity=2**20. For all queues, capacity is:
The upper bound on the number of elements that may be stored in this
queue.
The queue will try to put as many elements as possible into the queue until it hits this limit. All these elements are eating your RAM. If each element consists of only 1 byte, your queue will eat 1 MB of your RAM. If you have 10 KB images in your queue, it will eat 10 GB of RAM.
This is very wasteful, especially because you never need so many elements in the queue. All you need to make sure of is that your queue is never empty. So find a reasonable capacity for the queue and do not use huge numbers.
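As a hedged sketch of what a more reasonable configuration might look like for the toy example above (the multipliers are arbitrary; the point is to scale capacity to the batch size rather than to available RAM):

BATCH_SIZE = 5
q = tf.RandomShuffleQueue(capacity=20 * BATCH_SIZE,          # small, bounded buffer
                          min_after_dequeue=2 * BATCH_SIZE,  # enough left over to keep shuffling
                          dtypes=[tf.float32, tf.float32],
                          shapes=[[1], [1]],
                          seed=12)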