Batching for a non-image data set with Tensorflow - python

I am a beginner in tensorflow.
I have a data set with 43 inputs and one output. I want to create mini-batches of the data to run deep learning.
Here are my inputs:
x = tf.placeholder(tf.float32, shape=[None, 43])
y_ = tf.placeholder(tf.float32, shape=[None])
which I feed from a MATLAB file, like this:
train_mat = train_mat["binary_train"].value
feed_dict={x:Train[0:100,0:43] , y_:Train[0:100,43]}
I would like to use random batches instead of always taking records 0:100.
I saw
tf.train.batch
but I could not figure out how it works.
Could you please show me how I can do that?
Thanks,
Afshin

tf.train.batch and other similar methods are based on queues, which are best suited to loading large numbers of samples asynchronously and in parallel. The document here describes the basics of using queues in TensorFlow. There is also another blog describing how to read data from files.
If you are going to use queues, the placeholder and feed_dict are unnecessary.
For your specific case, a potential solution may look like this:
import tensorflow as tf
from tensorflow.python.training import queue_runner
# capacity and min_after_dequeue could be set according to your case
# shapes=[[44]] declares each element as one row: 43 inputs followed by 1 output
q = tf.RandomShuffleQueue(1000, 500, tf.float32, shapes=[[44]])
enq = q.enqueue_many(train_mat)
queue_runner.add_queue_runner(queue_runner.QueueRunner(q, [enq]))
# dequeue() returns a single shuffled row of shape [44]
deq = q.dequeue()
input = deq[0:43]
label = deq[43]
x, y_ = tf.train.batch([input, label], 100)
# then you can use x and y_ directly in inference and train process.
The code above rests on some assumptions, because the question does not provide enough information. However, I hope it can inspire you in some way.
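If the data already fits in memory, a simpler alternative to queues is to keep the placeholders and draw random row indices for each batch. The following is a minimal sketch and not the method from the answer above: the helper name minibatch_indices and its parameters are hypothetical, train_step stands for whatever training op you have, and the commented usage assumes that Train is a NumPy array that accepts a list of row indices.

```python
import random


def minibatch_indices(num_rows, batch_size, seed=None):
    """Yield lists of row indices that cover one shuffled pass over the data."""
    rng = random.Random(seed)
    order = list(range(num_rows))
    rng.shuffle(order)
    # Walk through the shuffled order in steps of batch_size;
    # the last batch is smaller when num_rows is not a multiple of batch_size.
    for start in range(0, num_rows, batch_size):
        yield order[start:start + batch_size]


# Example: 250 rows split into random batches of at most 100 rows.
batches = list(minibatch_indices(250, 100, seed=0))
print([len(b) for b in batches])  # [100, 100, 50]

# Hypothetical use with the arrays from the question (train_step is assumed):
# for idx in minibatch_indices(Train.shape[0], 100):
#     sess.run(train_step, feed_dict={x: Train[idx, 0:43], y_: Train[idx, 43]})
```

Calling the generator again produces a fresh shuffle, so each epoch sees the rows in a different order.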

Related

Reason for huge slowdown using TF Dataset API

I'm trying to generate batches for triplet loss where there are always pairs in the batch. The code below achieves this but it's very, very slow. In particular the choose_from_datasets method seems to be the source of the slowness.
Is there something wrong with my code that's creating the slowdown? Or is there a smarter way to do this?
I tried switching to sample_from_datasets instead, but this didn't help.
def batch_pairs3(dataset, num_classes, shuffle=True, num_classes_per_batch=10, num_images_per_class=2):
    # Isolate each class into its own dataset
    datasets = []
    for cl in range(num_classes):
        this_dataset = dataset.filter(lambda xx, yy: tf.equal(tf.reshape(yy, []), cl))
        if shuffle:
            this_dataset = this_dataset.shuffle(100)
        datasets += [this_dataset]
    # if shuffle:
    #     random.shuffle(datasets)
    selector = tf.contrib.data.Counter().map(
        lambda x: generator3(x, num_classes, num_classes_per_batch, num_images_per_class))
    selector = selector.apply(tf.contrib.data.unbatch())
    dataset = tf.contrib.data.choose_from_datasets(datasets, selector)
    # Batch
    batch_size = num_classes_per_batch * num_images_per_class
    return dataset.batch(batch_size)
The tf.data pipeline does not handle this kind of application well, where you process the data on the fly by iterating through it, unless every data point can be mapped independently. For what you are doing, you may be better off preprocessing and storing your data in something like the TFRecord format, and then using the data pipeline to read it in an optimized way.
See this official example, which addresses a similar problem involving triplet loss: Time Contrastive Networks, the data provider
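One way to read the preprocessing advice is to decide the batch composition in plain Python before the data reaches the pipeline. Below is a rough sketch under that assumption. sample_balanced_batch is a hypothetical helper, not part of tf.data, and it expects every class to have at least num_images_per_class examples.

```python
import random
from collections import defaultdict


def sample_balanced_batch(labels, num_classes_per_batch=10,
                          num_images_per_class=2, rng=random):
    """Return example indices with exactly num_images_per_class per chosen class."""
    # Group example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Pick distinct classes, then distinct examples within each class.
    chosen_classes = rng.sample(sorted(by_class), num_classes_per_batch)
    batch = []
    for cl in chosen_classes:
        batch.extend(rng.sample(by_class[cl], num_images_per_class))
    return batch


# Toy data: 20 classes with 5 examples each.
labels = [i % 20 for i in range(100)]
batch = sample_balanced_batch(labels, rng=random.Random(0))
print(len(batch))  # 20
```

The resulting index lists could be written out ahead of time, so that the input pipeline only has to read examples in an already balanced order.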

Feed_dict doesn't accept my data

I've been trying to feed my own images into some TensorFlow code to see how it reacts to them instead of the MNIST set. I've been able to import images (I think) into TensorFlow, but I have two placeholders that should receive my image data and label data. I tried to use feed_dict (which still seems right to me) to pass my data to the rest of my code, but it won't accept any data I feed it. I know I can't feed it a Tensor, and apparently not a batch, so the only way I can think of to make this work is to feed it a list. I saw that feed_dict can take numpy arrays, but I'm not sure how I should approach converting the data to a numpy array.
I'm new to TensorFlow and Python, so please forgive any mistakes I made; I'm still learning how everything works.
with tf.name_scope('Image_Data_Input'):
    def read_labeled_image_list(image_list_file):
        print('read_labeled_image_list function opened')
        f = open(image_list_file, 'r')
        print('image_list_file opened')
        filenames = []
        labels = []
        print('Arrays formed')
        for line in f:
            filename, label = line[:-1].split(' ')
            filenames.append(filename)
            labels.append(label)
        print('Lines deconstructed')
        return filenames, labels

    def read_image(input_queue):
        label = input_queue[1]
        file_contents = tf.read_file(input_queue[0])
        decoded_image = tf.image.decode_jpeg(file_contents, channels=3)
        print('Image decoded to JPEG')
        decoded_image.set_shape([2560, 1440, 3])
        decoded_image = tf.image.resize_images(decoded_image, [128, 128])
        return decoded_image, label

    image_list, label_list = read_labeled_image_list(image_list_file)
    images = tf.convert_to_tensor(image_list, dtype=tf.string)
    labels = tf.convert_to_tensor(label_list, dtype=tf.string)
    input_queue = tf.train.slice_input_producer([images, labels], num_epochs=None, shuffle=True)
    image, label = read_image(input_queue)
The indentation behaved a little weirdly when I pasted my code, so I'm not sure everything is properly placed.
Now I have these placeholders:
with tf.name_scope('input'):
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
And I've seen code routing data to those placeholders this way:
batch_x, batch_y = tf.train.batch([image, label], batchsize)
#_, summary = sess.run([train_writer, summary_op], feed_dict={x: batch_x, y_: batch_y})
But I can't seem to make this work.
Does anyone have any idea how I could make this work?
Again sorry for any mistakes and thanks in advance.
As the error says, you can't feed tensors into a placeholder, and batch_x and batch_y are tensors. The newer tf.data.Dataset API is the preferred way to input data into a model (guide here). I think Dataset.from_tensor_slices would require minimal rewriting. Short of that, build the graph so that batch_x and batch_y flow directly into the model you're using. Then you don't need placeholders at all.
I don't recommend this, but for completeness I want to mention another method. You could:
numpy_batch_x, numpy_batch_y = sess.run([batch_x, batch_y])
_, summary = sess.run([train_writer, summary_op],
feed_dict={x: numpy_batch_x, y_: numpy_batch_y})
PS: If train_writer is a tf.summary.FileWriter, I think you want to:
summary = sess.run(summary_op, ...)
train_writer.add_summary(summary)
EDIT: In response to confusion on the dataset API, I am going to show how to handle this with a Dataset. I am going to use TFRecords. It may not be the simplest solution, but it's one way.
import numpy as np
import tensorflow as tf
import cv2  # Used for resizing; again, others to choose from.
from scipy.misc import imread  # There are others that would work here.

def read_labeled_image_list(image_list_file):
    # See question
    return filenames, labels

def make_tfr(tfr_dir="/YOUR/PREFERRED/TFR/DIR"):
    def _int64_list_feature(a_list):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=a_list))
    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    writer = tf.python_io.TFRecordWriter(tfr_dir)
    all_image_paths, all_labels = read_labeled_image_list(...)
    for path, label in zip(all_image_paths, all_labels):
        disk_im = imread(path)
        resized_im = cv2.resize(disk_im, (128, 128))
        raw_im = resized_im.tobytes()
        # Construct an example proto-obj,
        example = tf.train.Example(
            # which wants a Features proto-obj,
            features=tf.train.Features(
                # which wants a dict.
                feature={
                    'image_raw': _bytes_feature(raw_im),
                    # Labels come from the file as strings; Int64List wants a list of ints.
                    'label': _int64_list_feature([int(label)])
                }))  # close your example object
        serialized = example.SerializeToString()
        writer.write(serialized)
    writer.close()  # Flush the records to disk.

make_tfr()  # After you've done it successfully once, comment out.
def input_pipeline(batch_size, epochs, tfr_dir="/YOUR/PREFERRED/TFR/DIR"):
    # with tf.name_scope("Input"): maybe you like to scope as much as I do?
    dataset = tf.data.TFRecordDataset(tfr_dir)

    def parse_protocol_buffer(example_proto):
        features = {'image_raw': tf.FixedLenFeature((), tf.string),
                    'label': tf.FixedLenFeature((), tf.int64)}
        parsed_features = tf.parse_single_example(
            example_proto, features)
        return parsed_features['image_raw'], parsed_features['label']
    dataset = dataset.map(parse_protocol_buffer)

    def convert_parsed_proto_to_input(image_string, label):
        image_decoded = tf.decode_raw(image_string, tf.uint8)
        image_resized = tf.reshape(image_decoded, (128, 128, 3))
        image = tf.cast(image_resized, tf.float32)
        # I usually put my image elements in [-1, 1]
        return image * (2. / 255) - 1, label
    dataset = dataset.map(convert_parsed_proto_to_input)
    dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.repeat(epochs)  # number of passes over the data
    dataset = dataset.batch(batch_size)  # group single examples into batches
    return dataset
def model(image_tensor):
    ...
    # However you want to do this.
    return predictions

def loss(predictions, labels):
    ...
    return some_loss

def train(some_loss):
    ...
    return train_op

batch_size = 50
iterations = 10000
train_dataset = input_pipeline(batch_size, iterations)
train_iterator = train_dataset.make_initializable_iterator()
image, label = train_iterator.get_next()
predictions = model(image)
# loss() is declared as loss(predictions, labels)
loss_op = loss(predictions, label)
train_op = train(loss_op)
summary_op = tf.summary.merge_all()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter("/YOUR/LOGDIR", sess.graph)
    sess.run(train_iterator.initializer)
    for epoch in range(iterations + 1):
        _, summary = sess.run([train_op, summary_op])
        train_writer.add_summary(summary, epoch)
You say you're new to TensorFlow. I hope this doesn't intimidate you. I was new to TensorFlow not long ago, and it was a pain to figure out how to make a good input pipeline. Learning TFRecords seemed impossible. You also say you're new to Python, so I'll warn you that cv2 has a reputation of difficult installs. You may want to look into other ways to resize an image (though I'd advise against PIL, which is probably even more confusing and difficult at first).
Basically, I'm posting this code because the documentation on writing TFRecords is confusing (Exhibit A vs a blog post that helped me figure it out), but TFRecords is the way I know best to make a Dataset. Even if you don't go the TFRecords route, this could help you with the map function for datasets, e.g. notice how I pass label through convert... even though it's not used. Making a dataset (specifically from TFRecords) takes a lot of lines of code, but Dataset is the preferred way to construct an input pipeline and it's designed to replace the old queue method you're using.
As a side note, the purpose of the queue strategy was to read data from memory directly into the graph without placeholders. Placeholders are slow and memory-intensive compared to the queue strategy, but Datasets are even better when implemented correctly.
I see in your comment that you want to see the placeholder namescope get connected to your graph. The dataset way, you'll see some dataset nodes on the graph. If you scope them with what I commented out, it should be apparent that everything's hooked up right. Your way, you're actually adding this queue-and-preprocess structure onto the graph. Since you'll have to de-tensor-ify the images to pass them into a placeholder, it won't be apparent that your data is flowing correctly.
Now, as I mentioned in the original post, you can just pass batch_x and batch_y into your model and forget the placeholder and dataset altogether. You'll see everything hooked up right from the preprocessing stage, if the queue is implemented right. Still, your images are large before reshaping them. Reading them will be an intensive task. I'd recommend going the hard route of learning to use Datasets and TFRecords.
I hope this helps you implement a Dataset in your code. I hope this helps you get TensorBoard running. And I hope this helps you figure out TFRecords if you decide to go that route.
PS: On the topic of TensorBoard validating that the model is working, you could attach a tf.summary.image (for example tf.summary.image('input', image_tensor)) as the first line of model(...). Then check out the image dash and see if it's what you expect.
EDIT 2: example = tf.train.Example(features=tf.train.Features(feature={}))

Preloading data in Tensorflow with shared layers

I have Tensorflow code for multi-task learning (one input, several outputs, similar to this: https://jg8610.github.io/Multi-Task/). For further explanation see below. The code works, but is slow as there's a lot of overhead from reading data in Python and feeding it to the GPU (with the tf.Session's feed_dict).
So my plan is now to preload the data according to https://www.tensorflow.org/programmers_guide/reading_data#preloaded_data [storing it in a tf.constant and using TF's queuing system]. This raises some problems, of which the most central for now seems to be:
If I preload the different task data into different tensors, I no longer have a task-generic X_in. That means that when declaring the shared layer, I now need to make a decision whether to connect it to X_input_task_A or X_input_task_B, and obviously that's not going to result in a shared layer.
My question
Would you have any idea how to solve this problem, i.e. to define shared layers with task-specific tensors, and then training by alternating between tasks? How would you alternatively call the different optimizer operations?
Further explanation on the Multi-task learning paradigm
For background, what the mentioned blog post (as well as my code so far) does is to define a placeholder X_in plus a shared layer that consumes that input op. Then, for each task we want to learn, we have different projections and loss functions that use task-specific placeholders y_task, and training happens by alternately running session.run(optimizer_task, feed_dict={X_in: X_batch_task, y_task: y_batch_task}), where optimizer_task is some task-specific optimizer. This is basically what my code does now - it works but is slow because I need to feed the data:
# PLACEHOLDERS
X_in = tf.placeholder(tf.float32, [batch_size, 100])
y_task_a = tf.placeholder(tf.float32, [batch_size, 4]) # 4 output classes
y_task_b = tf.placeholder(tf.float32, [batch_size, 2]) # 2 output classes
# SHARED LAYER
W = tf.get_variable("W", [100, 50])
shared_layer = tf.sigmoid(tf.matmul(X_in, W))
# TASK-SPECIFIC OUTPUTS
W_task_a = tf.get_variable("Wa", [50, 4])
W_task_b = tf.get_variable("Wb", [50, 2])
pred_task_a = tf.sigmoid(tf.matmul(shared_layer, W_task_a))
pred_task_b = tf.sigmoid(tf.matmul(shared_layer, W_task_b))
# TASK-SPECIFIC LOSSES AND OPTIMIZERS
loss_task_a = tf.nn.softmax_cross_entropy_with_logits(logits=pred_task_a, labels=y_task_a)
loss_task_b = tf.nn.softmax_cross_entropy_with_logits(logits=pred_task_b, labels=y_task_b)
optimizer_a = ...(loss_task_a)
optimizer_b = ...(loss_task_b)
# TRAINING
with tf.Session() as sess:
    for i in range(ITERS):
        # ALTERNATE BETWEEN TASKS, GET BATCH FROM DATA PER TASK AND TRAIN
        X_a, y_a = data_task_a.get_batch()
        X_b, y_b = data_task_b.get_batch()
        sess.run(optimizer_a, feed_dict={X_in: X_a, y_task_a: y_a})
        sess.run(optimizer_b, feed_dict={X_in: X_b, y_task_b: y_b})

Train model using queue Tensorflow

I designed a neural network in TensorFlow for my regression problem by following and adapting the TensorFlow tutorial. However, due to the structure of my problem (~300,000 data points and use of the costly FTRLOptimizer), it took too long to execute even on my 32-CPU machine (I don't have GPUs).
According to this comment and a quick confirmation via htop, it appears that some operations are single-threaded, and the culprit should be feed_dict.
Therefore, as advised here, I tried to use queues to multi-thread my program.
I wrote a simple script that uses a queue to train a model, as follows:
import numpy as np
import tensorflow as tf
import threading
#Function for enqueueing in parallel my data
def enqueue_thread():
    sess.run(enqueue_op, feed_dict={x_batch_enqueue: x, y_batch_enqueue: y})
#Set the number of couples (x, y) I use for "training" my model
BATCH_SIZE = 5
#Generate my data where y=x+1+little_noise
x = np.random.randn(10, 1).astype('float32')
y = x+1+np.random.randn(10, 1)/100
#Create the variables for my model y = x*W+b, then W and b should both converge to 1.
W = tf.get_variable('W', shape=[1, 1], dtype='float32')
b = tf.get_variable('b', shape=[1, 1], dtype='float32')
#Prepare the placeholdeers for enqueueing
x_batch_enqueue = tf.placeholder(tf.float32, shape=[None, 1])
y_batch_enqueue = tf.placeholder(tf.float32, shape=[None, 1])
#Create the queue
q = tf.RandomShuffleQueue(capacity=2**20, min_after_dequeue=BATCH_SIZE, dtypes=[tf.float32, tf.float32], seed=12, shapes=[[1], [1]])
#Enqueue operation
enqueue_op = q.enqueue_many([x_batch_enqueue, y_batch_enqueue])
#Dequeue operation
x_batch, y_batch = q.dequeue_many(BATCH_SIZE)
#Prediction with linear model + bias
y_pred=tf.add(tf.multiply(x_batch, W), b)
#MAE cost function
cost = tf.reduce_mean(tf.abs(y_batch-y_pred))
learning_rate = 1e-3
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
available_threads = 1024
#Feed the queue
for i in range(available_threads):
    threading.Thread(target=enqueue_thread).start()
#Train the model
for step in range(1000):
    _, cost_step = sess.run([train_op, cost])
    print(cost_step)
Wf=sess.run(W)
bf=sess.run(b)
This code doesn't work because each time I call x_batch, one y_batch is also dequeued and vice versa. Then, I do not compare the features with the corresponding "result".
Is there an easy way to avoid this problem ?
My mistake, everything worked fine.
I was misled because I estimated my performance on a different batch at each step of the algorithm, and also because my model was too complicated for a dummy one (I should have used something like y=W*x or y=x+b).
Then, when I tried to print to the console, I executed sess.run several times on different variables and obviously got inconsistent results.
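The effect described above can be imitated without TensorFlow. This is only an analogy in plain Python, with a hypothetical dequeue_batch function standing in for one sess.run on the dequeue op: every call consumes a fresh batch, so values fetched in separate calls do not belong together.

```python
from collections import deque

# A queue of (x, y) pairs where y = x + 1, mimicking the paired tensors.
queue = deque((float(i), float(i) + 1.0) for i in range(20))


def dequeue_batch(batch_size):
    """Pop batch_size pairs at once, like one run of a dequeue op."""
    pairs = [queue.popleft() for _ in range(batch_size)]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return xs, ys


# One call: x and y come from the same dequeue, so they match.
xs, ys = dequeue_batch(5)
print(all(y == x + 1.0 for x, y in zip(xs, ys)))  # True

# Two calls: each one dequeues again, so x and y come from different batches.
xs_first, _ = dequeue_batch(5)
_, ys_second = dequeue_batch(5)
print(all(y == x + 1.0 for x, y in zip(xs_first, ys_second)))  # False
```

Fetching everything you want to compare in a single run call, as the training loop in the question already does with [train_op, cost], keeps the values consistent.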
Although your problem is solved, I want to point out a small inefficiency in your code. When you created your RandomShuffleQueue you specified capacity=2**20. For all queues, capacity is:
The upper bound on the number of elements that may be stored in this
queue.
The queue will try to hold as many elements as possible until it hits this limit, and all of these elements consume RAM. If each element is only 1 byte, the queue will use 1 MB of memory. If you have 10 KB images in the queue, it will use 10 GB of RAM.
This is very wasteful, especially because you never need that many elements in the queue. All you need is to make sure that the queue is never empty. So choose a reasonable capacity and avoid huge numbers.
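The arithmetic behind those figures is easy to check. A small sketch, assuming binary units (1 KiB = 1024 bytes) and a completely full queue; the helper name queue_memory_bytes is made up for illustration:

```python
def queue_memory_bytes(capacity, bytes_per_element):
    """Upper bound on the payload memory held by a full queue."""
    return capacity * bytes_per_element


capacity = 2 ** 20

# 1-byte elements: 2**20 bytes, i.e. 1 MiB.
print(queue_memory_bytes(capacity, 1) / 2 ** 20)  # 1.0

# 10 KiB images: 2**20 * 10 * 1024 bytes, i.e. 10 GiB.
print(queue_memory_bytes(capacity, 10 * 1024) / 2 ** 30)  # 10.0
```

Plugging in your own element size and a few candidate capacities gives a quick feel for how much memory a given setting can claim.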

Feeding parameters into placeholders in tensorflow

I'm trying to get into tensorflow, setting up a network and then feeding data to it. For some reason I end up with the error message ValueError: setting an array element with a sequence. I made a minimal example of what I'm trying to do:
import tensorflow as tf
K = 10
lchild = tf.placeholder(tf.float32, shape=(K))
rchild = tf.placeholder(tf.float32, shape=(K))
parent = tf.nn.tanh(tf.add(lchild, rchild))
input = [tf.Variable(tf.random_normal([K])),
         tf.Variable(tf.random_normal([K]))]
with tf.Session() as sess:
    print(sess.run([parent], feed_dict={lchild: input[0], rchild: input[1]}))
Basically, I'm setting up a network with place holders and a sequence of input embeddings that I want to learn, and then I try to run the network, feeding the input embeddings into it. From what I can tell by searching for the error message, there might be something wrong with my feed_dict, but I can't see any obvious mismatches in eg. dimensionality.
So, what did I miss, or how did I get this completely backwards?
EDIT: I've edited the above to clarify that the input represents embeddings that need to be learned. I guess the question can be asked more sharply as: Is it possible to use placeholders for parameters?
The inputs should be numpy arrays.
So, instead of tf.Variable(tf.random_normal([K])), simply write np.random.randn(K) and everything should work as expected.
EDIT (The question was clarified after my answer):
It is possible to use placeholders as parameters but in a slightly different way. For example:
lchild = tf.placeholder(tf.float32, shape=(K))
rchild = tf.placeholder(tf.float32, shape=(K))
parent = tf.nn.tanh(tf.add(lchild, rchild))
loss = <some loss that depends on the parent tensor or lchild/rchild>
# Compute gradients with respect to the input variables
grads = tf.gradients(loss, [lchild, rchild])
inputs = [np.random.randn(K), np.random.randn(K)]
for i in range(<number of iterations>):
    np_grads = sess.run(grads, feed_dict={lchild: inputs[0], rchild: inputs[1]})
    inputs[0] -= 0.1 * np_grads[0]
    inputs[1] -= 0.1 * np_grads[1]
However, this is neither the best nor the easiest way to do it. The main problem is that at every iteration you need to copy numpy arrays in and out of the session (which may be running on a different device such as a GPU).
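To make the update loop concrete without a session, here is a sketch in plain Python. The loss is an assumption chosen only for illustration (the sum of squared parent values, which the answer leaves unspecified), and its gradient is written out by hand in place of tf.gradients.

```python
import math
import random

K = 10
rng = random.Random(0)
lchild = [rng.gauss(0.0, 1.0) for _ in range(K)]
rchild = [rng.gauss(0.0, 1.0) for _ in range(K)]


def loss_and_grad(l, r):
    """Loss sum(tanh(l + r)**2) and its gradient, which is the same for l and r."""
    parent = [math.tanh(a + b) for a, b in zip(l, r)]
    loss = sum(p * p for p in parent)
    # d/da tanh(a + b)**2 = 2 * tanh(a + b) * (1 - tanh(a + b)**2)
    grad = [2.0 * p * (1.0 - p * p) for p in parent]
    return loss, grad


initial_loss, _ = loss_and_grad(lchild, rchild)
for _ in range(100):
    _, grad = loss_and_grad(lchild, rchild)
    # Same update rule as in the answer: step against the gradient.
    lchild = [a - 0.1 * g for a, g in zip(lchild, grad)]
    rchild = [b - 0.1 * g for b, g in zip(rchild, grad)]
final_loss, _ = loss_and_grad(lchild, rchild)
print(final_loss < initial_loss)  # True
```

Here everything lives in Python lists, so nothing is copied between devices; in the TensorFlow version each iteration pays for that transfer.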
Placeholders are generally used to feed data that is external to the model (like text or images). The way to solve this with TensorFlow utilities would be something like:
lchild = tf.Variable(tf.random_normal([K]))
rchild = tf.Variable(tf.random_normal([K]))
parent = tf.nn.tanh(tf.add(lchild, rchild))
loss = <some loss that depends on the parent tensor or lchild/rchild>
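# The optimizer takes the learning rate; minimize() takes the loss
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# Variables must be initialized before the first training step
sess.run(tf.global_variables_initializer())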
for i in range(<number of iterations>):
    sess.run(train_op)
# Retrieve the weights back to numpy:
np_lchild = sess.run(lchild)
