I have an HDF5 training dataset of shape (21760, 1, 33, 33), where 21760 is the total number of training samples. I want to train the network with mini-batches of size 128.
I want to ask:
How do I feed mini-batches of 128 training samples from the whole dataset with TensorFlow each time?
If your dataset is so large that it can't be imported into memory the way keveman suggested, you can use the h5py object directly:
import h5py
import tensorflow as tf
data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128
sess = tf.Session()
train_op = ...  # tf.something_useful()
input = ...     # tf.placeholder or something
for i in range(0, data_size, batch_size):
    current_data = data['data_set'][i:i+batch_size]
    sess.run(train_op, feed_dict={input: current_data})
You can also run through a huge number of iterations and randomly select a batch if you want to:
import random
for i in range(iterations):
    pos = random.randint(0, int(data_size/batch_size)-1) * batch_size
    current_data = data['data_set'][pos:pos+batch_size]
    sess.run(train_op, feed_dict={input: current_data})
Or sequentially:
for i in range(iterations):
    pos = (i % int(data_size / batch_size)) * batch_size
    current_data = data['data_set'][pos:pos+batch_size]
    sess.run(train_op, feed_dict={input: current_data})
You probably want to write more sophisticated code that goes through all the data randomly but keeps track of which batches have been used, so that no batch is used more often than the others. Once you have done a full pass through the training set, you enable all batches again and repeat.
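A minimal sketch of that idea, reusing the data, data_size, batch_size, train_op and input names from above (the epoch count is a placeholder):

import numpy as np

n_batches = data_size // batch_size
for epoch in range(10):  # number of epochs is a placeholder
    # visit every batch exactly once per epoch, in a fresh random order
    for b in np.random.permutation(n_batches):
        pos = b * batch_size
        current_data = data['data_set'][pos:pos + batch_size]
        sess.run(train_op, feed_dict={input: current_data})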
You can read the HDF5 dataset into a numpy array, and feed slices of the numpy array to the TensorFlow model. Pseudo code like the following would work:
import numpy, h5py
f = h5py.File('somefile.h5','r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)
for i in range(0, 21760, 128):
    sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})
alkamen's approach seems logically right, but I have not gotten any positive results with it. My best guess is this: using code sample 1 above, the network trains afresh in every iteration, forgetting everything learned in the previous loop. So if we fetch 30 samples per iteration, only those 30 samples are used in that loop, and at the next loop everything is overwritten.
Find below a screenshot of this approach
As can be seen, the loss and accuracy always start afresh. I would be happy if anyone could share a possible way around this.
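For what it's worth, the "starting afresh" symptom usually means the variables are re-initialized (or the session re-created) inside the loop. A minimal sketch of one possible fix, reusing the names from code sample 1 above: initialize once, then keep the same session for every batch.

sess = tf.Session()
sess.run(tf.global_variables_initializer())  # run ONCE, before the loop

for i in range(0, data_size, batch_size):
    current_data = data['data_set'][i:i + batch_size]
    # the same session keeps the variable state, so learning accumulates
    sess.run(train_op, feed_dict={input: current_data})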
Related
I'm trying to load a dataset, stored in two .npy files (for features and ground truth) on my drive, and use it to train a neural network.
print("loading features...")
data = np.load("[...]/features.npy")
print("loading labels...")
labels = np.load("[...]/groundtruth.npy") / 255
dataset = tf.data.Dataset.from_tensor_slices((data, labels))
This throws a tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized. error when calling the from_tensor_slices() method.
The ground-truth file is larger than 2.44 GB, so I run into problems when creating a Dataset from it (see the warnings here and here).
Possible solutions I found were either for TensorFlow 1.x (here and here, while I am running version 2.6) or to use numpy's memmap (here), which I unfortunately can't get to run; also, I wonder whether it would slow down the computation?
I'd appreciate your help, thanks!
You need some kind of data generator, because your data is way too big to fit directly into tf.data.Dataset.from_tensor_slices. I don't have your dataset, but here's an example of how you could get data batches and train your model inside a custom training loop. The data is an NPZ NumPy archive from here:
import random

import numpy as np

def load_data(file='dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz'):
    dataset_zip = np.load(file, encoding='latin1')
    images = dataset_zip['imgs']
    latents_classes = dataset_zip['latents_classes']
    return images, latents_classes

def get_batch(indices, train_images, train_categories):
    shapes_as_categories = np.array([train_categories[i][1] for i in indices])
    images = np.array([train_images[i] for i in indices])
    return [images.reshape((images.shape[0], 64, 64, 1)).astype('float32'),
            shapes_as_categories.reshape(shapes_as_categories.shape[0], 1).astype('float32')]

# Load your data once
train_images, train_categories = load_data()
indices = list(range(train_images.shape[0]))
random.shuffle(indices)

epochs = 2000
batch_size = 256
total_batch = train_images.shape[0] // batch_size

for epoch in range(epochs):
    for i in range(total_batch):
        batch_indices = indices[batch_size * i: batch_size * (i + 1)]
        batch = get_batch(batch_indices, train_images, train_categories)
        ...
        ...
        # Train your model with this batch.
The accepted answer (https://stackoverflow.com/a/69994287) loads all the data into memory with the load_data function; that's why the RAM fills up completely.
You can try making an npz file where each feature is its own npy file, then create a generator that loads them and use that generator with tf.data.Dataset (as in 1), or build a data generator with Keras (as in 2).
Alternatively, use numpy's mmap mode while loading, so you can stick with your single npy feature file (as in 3); a sketch of that route follows.
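For the mmap route (3), a rough sketch using numpy's mmap_mode together with tf.data.Dataset.from_generator; the file names and the division by 255 mirror the question, but the paths are placeholders and the sample shapes are unknown here, so they are left unspecified:

import numpy as np
import tensorflow as tf

def gen():
    # mmap_mode='r' keeps the arrays on disk; only the rows actually
    # iterated over are pulled into memory
    data = np.load("features.npy", mmap_mode="r")        # substitute your real path
    labels = np.load("groundtruth.npy", mmap_mode="r")   # substitute your real path
    for x, y in zip(data, labels):
        yield np.asarray(x, dtype=np.float32), np.asarray(y, dtype=np.float32) / 255.0

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=None, dtype=tf.float32),
        tf.TensorSpec(shape=None, dtype=tf.float32),
    ),
).batch(32).prefetch(tf.data.AUTOTUNE)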
I am new to TensorFlow and deep learning, and I am struggling with the Dataset class. I tried a lot of things and I can't find a good solution.
What I am trying
I have a large amount of images (500k+) to train my DNN with. This is a denoising autoencoder, so I have a pair for each image. I am using the Dataset class of TF to manage the data, but I think I use it really badly.
Here is how I load the filenames in a dataset:
class Data:
    def __init__(self, in_path, out_path):
        self.nb_images = 512
        self.test_ratio = 0.2
        self.batch_size = 8

        # load filenames for inputs and outputs
        inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)

        self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
        self.size_test = int(self.nb_images * self.test_ratio)

        # split arrays into training / validation
        test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
        test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)

        # turn the arrays into tf.data.Dataset
        self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
        self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))
I have a function to call at each epoch that prepares the dataset. It shuffles the filenames, transforms the filenames into images, and batches the data.
def get_batched_data(self, seed, batch_size):
    nb_batch = int(self.size_training / batch_size)

    def img_to_tensor(path_in, path_out):
        img_string_in = tf.read_file(path_in)
        img_string_out = tf.read_file(path_out)
        im_in = tf.image.decode_jpeg(img_string_in, channels=1)
        im_out = tf.image.decode_jpeg(img_string_out, channels=1)
        return im_in, im_out

    t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
    t_datas = t_datas.map(img_to_tensor)
    t_datas = t_datas.batch(batch_size)
    return t_datas
Now during the training, at each epoch we call the get_batched_data function, make an iterator, and run it for each batch, then feed the array to the optimizer operation.
for epoch in range(nb_epoch):
    sess_iter_in = tf.Session()
    sess_iter_out = tf.Session()

    batched_train = data.get_batched_data(epoch, batch_size)
    iterator_train = batched_train.make_one_shot_iterator()
    in_data, out_data = iterator_train.get_next()

    total_batch = int(data.size_training / batch_size)
    for batch in range(total_batch):
        print(f"{batch + 1} / {total_batch}")
        in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
        out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
        sess.run(optimizer, feed_dict={inputs: in_images,
                                       outputs: out_images})
What do I need?
I need to have a pipeline that loads only the images of the current batch (otherwise it will not fit in memory) and I want to shuffle the dataset in a different way for each epoch.
Questions and problems
First question: am I using the Dataset class correctly? I have seen very different things on the internet; for example, in this blog post the dataset is used with a placeholder and fed with data during training. That seems strange, because the data are all in an array and hence already loaded in memory. I don't see the point of using tf.data.Dataset in that case.
I found a solution using repeat(epoch) on the dataset, like this, but the shuffle will not be different for each epoch in that case.
The second problem with my implementation is that I get an OutOfRangeError in some cases. With a small amount of data (512, as in the example) it works fine, but with more data the error occurs. I thought it was a miscalculation of the number of batches due to bad rounding, or the last batch being smaller, but it happens at batch 32 out of 115... Is there any way to know the number of batches created after a batch(n) call on a dataset?
Sorry for this long question, but I've been struggling with this for a few days.
As far as I know, the official Performance Guide is the best material for learning how to build input pipelines.
I want to shuffle the dataset in a different way for each epoch.
Using shuffle() and repeat(), you get a different shuffle pattern for each epoch. You can confirm it with the following code:
dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)
iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()
with tf.Session() as sess:
    for i in range(10):
        print(sess.run(x))
You can also use tf.contrib.data.shuffle_and_repeat, as mentioned on the official page above.
There are some problems in your code beyond creating the data pipeline. You are confusing graph construction with graph execution: you re-create the data input pipeline every epoch, so there are as many redundant input pipelines as epochs. You can observe the redundant pipelines in TensorBoard.
You should place your graph-construction code outside the loop, as in the following (pseudo) code:
batched_train = data.get_batched_data()
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()

for epoch in range(nb_epoch):
    # reset the iterator's state
    sess.run(iterator.initializer)
    try:
        while True:
            in_images = sess.run(in_data).reshape((-1, 64, 64, 1))
            out_images = sess.run(out_data).reshape((-1, 64, 64, 1))
            sess.run(optimizer, feed_dict={inputs: in_images,
                                           outputs: out_images})
    except tf.errors.OutOfRangeError:
        pass
Moreover, there is some minor inefficiency in the code. You loaded a list of file paths with from_tensor_slices(), so the list is embedded in your graph. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for details.)
You would be better off using prefetch, and reducing the number of sess.run calls by combining operations in your graph, for example along the lines of the sketch below.
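A rough sketch of what the get_batched_data pipeline could look like with those suggestions applied (num_parallel_calls and the prefetch depth are illustrative values):

t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
t_datas = t_datas.map(img_to_tensor, num_parallel_calls=4)  # decode in parallel
t_datas = t_datas.batch(batch_size)
t_datas = t_datas.prefetch(1)  # prepare the next batch while the current one trains
return t_datas

Combined with building the model directly on in_data/out_data (instead of feeding the arrays back in with feed_dict), a single sess.run(optimizer) then pulls and consumes a batch.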
I'm just learning to use TensorFlow's tf.data API, and I've found that it is slowing my code down a lot, measured in time per epoch. This is the opposite of what it's supposed to do, I thought. I wrote a simple linear regression program to test it out.
TL;DR: With 100,000 training samples, tf.data slows the time per epoch down by about a factor of ten if you're using full-batch training, and more with smaller batches. The opposite is true with 500 training samples.
My question: What is going on? Is my implementation flawed? Other sources I've read have tf.data improving speeds by about 30%.
import tensorflow as tf
import numpy as np
import timeit
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)
n_epochs = 10
input_dimensions_list = [10]
def function_to_approximate(x):
    return np.dot(x, random_covector).astype(np.float32) + np.float32(.01) * np.random.randn(1,1).astype(np.float32)

def regress_without_tfData(n_epochs, input_dimension, training_inputs, training_labels):
    tf.reset_default_graph()
    weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))

    X = tf.placeholder(tf.float32, shape=(None, input_dimension), name='X')
    Y = tf.placeholder(tf.float32, shape=(None, 1), name='Y')
    prediction = tf.matmul(X, weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(n_epochs):
            sess.run(loss_op, feed_dict={X: training_inputs, Y: training_labels})

def regress_with_tfData(n_epochs, input_dimension, training_inputs, training_labels, batch_size):
    tf.reset_default_graph()
    weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))

    X, Y = data_set.make_one_shot_iterator().get_next()
    prediction = tf.matmul(X, weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        while True:
            try:
                sess.run(loss_op)
            except tf.errors.OutOfRangeError:
                break
for input_dimension in input_dimensions_list:
    for data_size in [500, 100000]:
        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)

        print("Not using tf.data, with data size "
              "{}, input dimension {} and training with "
              "a full batch, it took an average of "
              "{} seconds to run {} epochs.\n".
              format(
                  data_size,
                  input_dimension,
                  timeit.timeit(
                      lambda: regress_without_tfData(
                          n_epochs, input_dimension,
                          training_inputs, training_labels
                      ),
                      number=3
                  ),
                  n_epochs))

for input_dimension in input_dimensions_list:
    for data_size, batch_size in [(500, 50), (500, 500), (100000, 50), (100000, 100000)]:
        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)

        data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
        data_set = data_set.repeat(n_epochs)
        data_set = data_set.batch(batch_size)

        print("Using tf.data, with data size "
              "{}, and input dimension {}, and training with "
              "batch size {}, it took an average of {} seconds "
              "to run {} epochs.\n".
              format(
                  data_size,
                  input_dimension,
                  batch_size,
                  timeit.timeit(
                      lambda: regress_with_tfData(
                          n_epochs, input_dimension,
                          training_inputs, training_labels,
                          batch_size
                      ),
                      number=3
                  )/3,
                  n_epochs
              ))
This outputs for me:
Not using tf.data, with data size 500, input dimension 10 and training
with a full batch, it took an average of 0.20243382899980134 seconds
to run 10 epochs.
Not using tf.data, with data size 100000, input dimension 10 and
training with a full batch, it took an average of 0.2431719040000644
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 50, it took an average of 0.09512088866661846
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 500, it took an average of
0.07286913600000844 seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 50, it took an average of 4.421892363666605
seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 100000, it took an average of
2.2555197536667038 seconds to run 10 epochs.
Edit: Fixed an important issue that Fred Guth pointed out. It didn't much affect the results, though.
I wanted to test the Dataset API, which seems really convenient for processing data. I did a lot of timing tests on this API on CPU, GPU and multi-GPU setups, for small and large NNs with different types of data.
First, your code seems fine to me, but I need to point out that your NN is just one simple layer.
Now, the Dataset API is not suitable for your type of NN, but rather for NNs with a lot more complexity. Why? For several reasons that I explain below (found in my quest to understand the Dataset API).
Firstly, on the one hand the Dataset API processes data batch by batch, whereas on the other hand with placeholders the data is preprocessed up front. Therefore, if the dataset fits in your RAM, you can save time by preprocessing it. Here your data are just too "simple". If you want to test what I am saying, try to find a really, really big dataset to process. Nevertheless, the Dataset API can be tuned by prefetching data; you can take a look at this tutorial, which explains really well why it is good to process data with prefetch.
Secondly, in my quest for a Dataset API for multi-GPU training, I discovered that, as far as I know, the old preprocessing way is faster than the Dataset API for small neural networks. You can verify that by creating a simple stackable RNN which takes a sequence as input. You can try different stack sizes (I have tested 1, 2, 10 and 20). You will see that, using the Dataset API, on 1 GPU or on 4 GPUs, the time does not differ for small RNN stacks (1, 2 and 5).
To summarize, the Dataset API is suitable for neural networks whose data cannot be preprocessed. Depending on your task, preprocessing the data may be more convenient, for example if you want to tweak your NN to improve it. I agree that the Dataset API is really cool for batching, padding and shuffling large amounts of data, but it is also not suitable for multi-GPU training.
First:
You are recreating the dataset unnecessarily.
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
Create the dataset prior to the loop and change the regress_with_tfData input signature to use dataset instead of training_inputs and training_labels.
Second:
The problem here is that minibatches of size 50 or even 500 are too small to compensate for the cost of the tf.data pipeline's build latency. You should increase the minibatch size. Interestingly, you did so with a minibatch of size 100000, but then maybe it is too big (I am not certain of this; I think it would need more tests).
There are a couple of things you could try:
1) Increase the minibatch size to something like 10000 and see whether you get an improvement
2) Change your pipeline to use an iterator, for example:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
iterator = data_set.make_one_shot_iterator()
....
next_element = iterator.get_next()
That is because you are comparing apples with bananas.
On one hand, when using placeholders, you are providing a monolithic tensor as is. On the other hand, when using Dataset, you are slicing the tensor into individual samples. This is very different.
The equivalent of providing a monolithic placeholder tensor with the Dataset pipeline is to use tf.data.Dataset.from_tensors. When I use from_tensors in your example, I get computation times similar to (actually smaller than) those with placeholders.
If you want to compare a more sophisticated pipeline using from_tensor_slices, you should make the comparison with placeholders fair: for example, shuffle your data, or add some preprocessing on your slices. I have no doubt you will then observe the performance gain that makes people switch to this pipeline.
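For reference, a minimal sketch of the two pipelines being compared, reusing the names from the question (TF 1.x style):

# Closest to the placeholder setup: the whole dataset is one element
monolithic = tf.data.Dataset.from_tensors((training_inputs, training_labels))
monolithic = monolithic.repeat(n_epochs)

# What the question benchmarks: per-sample elements, re-assembled into minibatches
sliced = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
sliced = sliced.repeat(n_epochs).batch(batch_size)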
One possible thing you are missing is a prefetch. Add a prefetch of 1 at the end of your data pipeline like so:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size).prefetch(1)
Adding a prefetch of 1 at the end of your dataset pipeline means you try to fetch one batch of data while training is happening. This way you won't be waiting around while the batch is prepared; it should be ready to go as soon as each train iteration is done.
The accepted answer is no longer valid, as the TF behavior has changed. Per the documentation:
from_tensors produces a dataset containing only a single element. To slice the input tensor into multiple elements, use from_tensor_slices instead.
This means you cannot batch it into smaller pieces:
X = np.arange(10)
data = tf.data.Dataset.from_tensors( X )
data = data.batch(2)
for t in data.as_numpy_iterator():
    print(t)
# only one row, whereas expected 5 !!!
The documentation recommends from_tensor_slices, but this has quite some overhead compared to numpy slicing. Slow slicing is an open issue: https://github.com/tensorflow/tensorflow/issues/39750
Essentially, slicing in TF is slow and hurts input-bound or light models, such as small networks (regression, word2vec).
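A rough micro-benchmark sketch (TF 2.x, eager mode) that illustrates the overhead; the sizes are arbitrary, and the exact timings will vary by machine and TF version:

import time
import numpy as np
import tensorflow as tf

X = np.random.randn(100000, 10).astype(np.float32)

start = time.time()
for batch in tf.data.Dataset.from_tensor_slices(X).batch(256):
    pass  # just iterate
print("tf.data from_tensor_slices:", time.time() - start)

start = time.time()
for i in range(0, len(X), 256):
    batch = X[i:i + 256]  # plain numpy slicing
print("numpy slicing:", time.time() - start)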
I am trying to fine-tune the Inception v3 model using the slim TensorFlow library.
I am unable to understand certain things while writing the code for it. I tried to read the source code (there is no proper documentation), figured out a few things, and I am able to fine-tune it and save the checkpoint. Here are the steps I followed:
1. I created a tf.record for my training data, which is fine; now I am reading the data using the code below.
import tensorflow as tf
import tensorflow.contrib.slim.nets as nets
import tensorflow.contrib.slim as slim
import matplotlib.pyplot as plt
import numpy as np
# get the data and labels here
data_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/train1.tfrecords'
# Training setting
num_epochs = 100
initial_learning_rate = 0.0002
learning_rate_decay_factor = 0.7
num_epochs_before_decay = 5
num_classes = 5980
# load the checkpoint
model_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/inception_v3.ckpt'
# log directory
log_dir = '/home/sfarkya/nvidia_challenge/datasets/detrac/fine_tuned_model'
with tf.Session() as sess:
    feature = {'train/image': tf.FixedLenFeature([], tf.string),
               'train/label': tf.FixedLenFeature([], tf.int64)}

    # Create a list of filenames and pass it to a queue
    filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)

    # Define a reader and read the next record
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    # Decode the record read by the reader
    features = tf.parse_single_example(serialized_example, features=feature)

    # Convert the image data from string back to the numbers
    image = tf.decode_raw(features['train/image'], tf.float32)

    # Cast label data into int32
    label = tf.cast(features['train/label'], tf.int32)

    # Reshape image data into the original shape
    image = tf.reshape(image, [128, 128, 3])

    # Create batches by randomly shuffling tensors
    images, labels = tf.train.shuffle_batch([image, label], batch_size=64, capacity=128, num_threads=2,
                                            min_after_dequeue=64)
Now I am fine-tuning the model using slim, and this is the code:
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
sess.run(init_op)

# Create a coordinator and run all QueueRunner objects
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)

# load the inception model from the slim library - we are using inception v3
#inputL = tf.placeholder(tf.float32, (64, 128, 128, 3))
img, lbl = sess.run([images, labels])
one_hot_labels = slim.one_hot_encoding(lbl, num_classes)

with slim.arg_scope(slim.nets.inception.inception_v3_arg_scope()):
    logits, inceptionv3 = nets.inception.inception_v3(inputs=img, num_classes=5980, is_training=True,
                                                      dropout_keep_prob=.6)

# Restore convolutional layers:
variables_to_restore = slim.get_variables_to_restore(exclude=['InceptionV3/Logits', 'InceptionV3/AuxLogits'])
init_fn = slim.assign_from_checkpoint_fn(model_path, variables_to_restore)

# loss function
loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits=logits)
total_loss = tf.losses.get_total_loss()

# train operation
train_op = slim.learning.create_train_op(total_loss + loss, optimizer=tf.train.AdamOptimizer(learning_rate=1e-4))

print('Im here')
# Start training.
slim.learning.train(train_op, log_dir, init_fn=init_fn, save_interval_secs=20, number_of_steps=10)
Now I have a few questions about the code which I am unable to figure out. Once the code reaches slim.learning.train, I don't see anything printing; however, it is training, as I can see in the log. Now:
1. How do I give the number of epochs to the code? Right now it's running step by step, with each step using batch_size = 64.
2. How do I make sure that with tf.train.shuffle_batch I am not repeating my images and that I am training over the whole dataset?
3. How can I print the loss values while it's training?
Here are answers to your questions.
You cannot give epochs directly to slim.learning.train. Instead, you give the number of batches as the argument; it is called number_of_steps, and it is used to set an operation called should_stop_op on line 709. I assume you know how to convert a number of epochs to a number of batches (see the sketch below).
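A small sketch of that conversion (the dataset size here is a placeholder):

num_examples = 50000  # substitute the size of your dataset
batch_size = 64
num_epochs = 100

number_of_steps = num_epochs * (num_examples // batch_size)
slim.learning.train(train_op, log_dir, init_fn=init_fn,
                    number_of_steps=number_of_steps)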
I don't think the shuffle_batch function will repeat images, because internally it uses a RandomShuffleQueue. According to this answer, the RandomShuffleQueue enqueues elements using a background thread as:
While size(queue) < capacity:
    Add an element to the queue
It dequeues elements as:
While the number of elements dequeued < batch_size:
    Wait until size(queue) >= min_after_dequeue + 1 elements.
    Select an element from the queue uniformly at random, remove it from the queue, and add it to the output batch.
So in my opinion there is very little chance that elements would be repeated, because the dequeuing operation removes the chosen element from the queue: it is sampling without replacement.
Will a new queue be created for every epoch?
The tensors fed into tf.train.shuffle_batch are image and label, which ultimately come from filename_queue. If that queue produces TFRecord filenames indefinitely, then I don't think shuffle_batch creates a new queue. You can also write toy code like this to understand how shuffle_batch works (see the sketch below).
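For illustration, a toy sketch of shuffle_batch (this is my own example, not the code from the linked answer); it draws shuffled batches from a simple counter queue:

import tensorflow as tf

# a queue that endlessly produces the integers 0..9 in order
x = tf.train.range_input_producer(10, shuffle=False).dequeue()
batch = tf.train.shuffle_batch([x], batch_size=4, capacity=20, min_after_dequeue=8)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(3):
        print(sess.run(batch))  # e.g. [7 2 9 0] -- near-uniformly shuffled
    coord.request_stop()
    coord.join(threads)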
Coming to the next point, how to train over the whole dataset? In your code, the following line gets the list of TFRecord filenames.
filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
If filename_queue covers all the TFRecords that you have, then you are surely training over the entire dataset. Now, how to shuffle the entire dataset is another question. As mentioned here by @mrry, there is no support (yet, AFAIK) for shuffling out-of-memory datasets. So the best way is to prepare many shards of your dataset such that each shard contains about 1024 examples, then shuffle the list of TFRecord filenames as:
filename_queue = tf.train.string_input_producer([data_path], shuffle=True, capacity=1000)
Note that I removed the num_epochs=1 argument and set shuffle=True; this way it produces the shuffled list of TFRecord filenames indefinitely. Now, on each file, if you use tf.train.shuffle_batch, you will get near-uniform shuffling. Basically, as the number of examples in each shard tends to 1, your shuffling will get more and more uniform. I prefer not to set num_epochs and instead terminate training using the number_of_steps argument mentioned earlier.
To print the loss values, you could probably just edit training.py and add logging.info('total loss = %f', total_loss). I don't know if there is a simpler way. Another way, without changing the code, is to view summaries in TensorBoard.
There are very helpful articles on how to view summaries in TensorBoard, including the link at the end of this answer. Generally, you need to do the following things:
1. Create a summary object.
2. Write variables of interest into summaries.
3. Merge all individual summaries.
4. Create a summary op.
5. Create a summary file writer.
6. Write the summaries throughout training at a desired frequency.
Now steps 5 and 6 are already done automatically for you if you use slim.learning.train.
For the first 4 steps, you could check the file train_image_classifier.py: line 472 shows how to create a summaries object, lines 490, 512 and 536 write the relevant variables into summaries, line 549 merges all summaries, and line 553 creates an op. You can pass this op to slim.learning.train, and you can also specify how frequently you want to write summaries. In my opinion, do not write anything apart from loss, total_loss, accuracy and learning rate into the summaries unless you want to do specific debugging. If you write histograms, the TensorBoard file can take tens of hours to load for networks like ResNet-50 (my TensorBoard file once was 28 GB, and it took 12 hours to load the progress of 6 days!). By the way, you could actually use the train_image_classifier.py file to fine-tune and skip most of the steps above. However, I prefer this approach, as you get to learn a lot of things.
See the launching tensorboard section on how to view the progress in a browser.
Additional remarks:
Instead of minimizing total_loss + loss, you could do the following:
loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits=logits)
tf.losses.add_loss(loss)
total_loss = tf.losses.get_total_loss()
train_op = slim.learning.create_train_op(total_loss, optimizer=tf.train.AdamOptimizer(learning_rate=1e-4))
I found this post to be very useful when I was learning Tensorflow.
Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside TensorFlow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10
#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)
#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    # convert label to one-hot encoding
    one_hot = tf.one_hot(label, stages)

    # read image file
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)
    return image, one_hot
#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
for _ in range(10):
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    train_step.run(feed_dict={x: images, y_: labels})
Somehow using a higher batch_size breaks Python. What I'm trying to do is train my neural network with new batches on each iteration; that's why I'm also using dataset.shuffle(...). Using dataset.shuffle also breaks Python.
What I wanted to do instead (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I get a numpy array; I would then shuffle the array with numpy.random.shuffle(images) and pick the first elements to train on.
e.g.
for _ in range(1000):
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()

    # shuffle
    np.random.shuffle(images)
    np.random.shuffle(labels)

    train_step.run(feed_dict={x: images[0:train_size], y_: labels[0:train_size]})
But then comes the problem that I can't batch my whole dataset: it looks like the data is too big for Python to work with.
How should I solve this differently?
Since I'm not using the MNIST database, there isn't a function like mnist.train.next_batch(100) that would come in handy for me.
Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline on every iteration. In the first iteration, the batch shape is, say, [32, 3600]. In the next iteration, elements of this shape are batched again, into [32, 32, 3600], and so on.
There's a great tutorial on the TF website where you can find out more about how Datasets work, but here are a few suggestions for resolving your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset, so your batches will have a good mixture of examples. Also increase the buffer_size argument; it works differently than you probably assume. It's usually a good idea to shuffle as early as possible, because shuffling can be slow on a large dataset: the shuffled part of the dataset has to be read into memory. Here it does not really matter whether you shuffle the filenames and labels or the decoded images and labels, but the latter has more work to do, since the dataset is larger by that point.
Move batching and the iterator creation to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument (see the sketch below). For more details, see this Q&A: Tensorflow: create minibatch from numpy array > 2 GB
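Putting those suggestions together, a rough sketch of the reorganized pipeline, reusing the names from the question (the buffer size and step count are illustrative):

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=10000)  # shuffle early, before decoding
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()                    # keep producing batches across epochs
dataset = dataset.batch(batch_size)           # batch last

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()

# build the model on `images` / `labels` directly, then simply:
# for _ in range(num_steps):
#     sess.run(train_step)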
I've been running into a lot of problems with creating TensorFlow datasets, so I decided to use OpenCV to import images.
import cv2
import numpy as np

imgDataset = []
for i in range(len(files)):
    imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
The shape of imgDataset is (num_img, height, width, col_channels), so the i-th image is imgDataset[i].
Shuffling the dataset and taking only a batch of it can be done like this:
from sklearn.utils import shuffle

X, y = shuffle(X, y)
X_feed = X[:batch_size]
y_feed = y[:batch_size]
Then you feed X_feed and y_feed into your model.
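A small sketch of a full per-epoch loop built on this idea; labels, num_epochs and batch_size are placeholders, and the training call is a stand-in for however you feed your model:

from sklearn.utils import shuffle

for epoch in range(num_epochs):
    # reshuffle each epoch; shuffle() keeps images and labels aligned
    X_shuf, y_shuf = shuffle(imgDataset, labels)
    for i in range(0, len(X_shuf), batch_size):
        X_feed = X_shuf[i:i + batch_size]
        y_feed = y_shuf[i:i + batch_size]
        # feed X_feed / y_feed into your model here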