How do I check the number of filenames string_input_producer has read? Different operations will be performed depending on the size of the input data, so I need to know how many images will be read or have been read.
The code below does not tell me how many images I have read or am about to read.
import tensorflow as tf
import matplotlib.pyplot as plt
# Make a queue of file names including all the PNG image files in the relative image directory.
filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once("./MNIST_data/*.png"))
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)
image = tf.image.decode_png(value)  # use png or jpg decoder based on your files.
num_preprocess_threads = 1
min_queue_examples = 256
batch_size = 2
images = tf.train.shuffle_batch([image], batch_size, min_queue_examples + 3 * batch_size, num_threads=num_preprocess_threads, min_after_dequeue=min_queue_examples)
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    t_image = image.eval()  # here is your image Tensor :)
    fig = plt.figure()
    plt.imshow(t_image)
    plt.show()
    coord.request_stop()
    coord.join(threads)
Functions like string_input_producer add a queue to the current graph that dequeues only one example at a time. Usually the output tensors are fed to functions like tf.train.shuffle_batch, which is what you want. The batch_size argument of this function controls how many examples are dequeued each time as the input of your model.
UPDATE:
If you want to check whether your input data is correct, you can evaluate it with sess.run(my_img), which returns a numpy.array. You can inspect the elements of this array directly or plot it with matplotlib.
Make sure you have started the queue runners before calling sess.run, or your program will hang forever.
string_input_producer returns a standard FIFOQueue (it returns the result of input_producer, which in turn returns a queue).
A FIFOQueue does not track how many elements have been read from it, only how many elements are currently in the queue (q.size()). If you want to know how many elements have been read, you need to add a counter manually and increment it each time you read an element.
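A minimal sketch of both ideas in plain Python rather than graph ops: count the files matching the pattern before building the queue, and keep a manual counter that is incremented on every read. The names count_matching_files and ReadCounter are illustrative assumptions, not part of TensorFlow, and the sketch assumes the glob pattern matches the same files as match_filenames_once does.

```python
import glob


def count_matching_files(pattern):
    """Return how many files match a glob pattern, e.g. "./MNIST_data/*.png"."""
    return len(glob.glob(pattern))


class ReadCounter:
    """Hypothetical helper: counts how many elements have been read so far."""

    def __init__(self):
        self.count = 0

    def record(self, n=1):
        # Call this once after every successful read (e.g. after each sess.run).
        self.count += n
        return self.count
```

Calling count_matching_files with the same pattern tells you how many images are about to be read; calling record() after each evaluation that dequeues an image tells you how many have been read.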
Can I save the dataset obtained from tensorflow.keras.preprocessing.text_dataset_from_directory() to an external file?
from tensorflow.keras import preprocessing
train_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/train',
    validation_split=0.2,
    subset='training',  # we are in training
    shuffle=True,
    seed=689
)
val_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/train',
    validation_split=0.2,
    subset='validation',
    shuffle=True,
    seed=689
)
test_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/test'
)
I'm reading the documentation but I'm not sure if it's possible.
Answer to @Lescurel's question:
I want to do this because I want to avoid redoing this preprocessing each time and having to wait while it runs. Furthermore, I want to see whether the saved file takes up less space on my computer.
Actually, I don't care about the format. I thought that if this could be done, there would already be a standard format that everyone uses.
Thank you very much.
Technically that is possible.
But you don't want that, because:
preprocessing.text_dataset_from_directory creates a generator-based dataset that supports:
on the fly loading of data,
shuffle after each epoch (for training),
prefetching and other features.
If you just save a shuffled dataset as a file on your computer, you will have to reimplement these features yourself. If the dataset is or becomes larger than your RAM, you will have to handle that too.
If you still want to do it: you can get batches of data with dataset.take(n) (dataset.take(1) yields a single batch) and then either save the individual strings (using for .. in) or use pickle to write the binary objects. But I repeat myself: you do not want to do that.
If you want to do preprocessing up front, use a program that works on your text files and saves them back as text files (e.g. for cleaning). Be aware that you will have to do the same for test and production data later on, so everything you remove from the (keras) pipeline becomes your own responsibility.
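As a rough sketch of that "text files in, text files out" approach, the script below cleans every .txt file under a source directory and writes the result to the same relative path under a destination directory. The cleaning rule in clean_text and both function names are hypothetical placeholders; substitute whatever preprocessing you actually need.

```python
import re
from pathlib import Path


def clean_text(text):
    """Hypothetical cleaning step: replace HTML line breaks and collapse whitespace."""
    text = re.sub(r"<br\s*/?>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess_tree(src_dir, dst_dir):
    """Clean every .txt file under src_dir and write it to the same relative path in dst_dir.

    dst_dir should not be located inside src_dir. Returns the number of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    written = 0
    for src_file in sorted(src.rglob("*.txt")):
        out_file = dst / src_file.relative_to(src)
        out_file.parent.mkdir(parents=True, exist_ok=True)
        cleaned = clean_text(src_file.read_text(encoding="utf-8"))
        out_file.write_text(cleaned, encoding="utf-8")
        written += 1
    return written
```

Because the class subdirectories are preserved, text_dataset_from_directory can be pointed at the cleaned copy (for example a hypothetical 'aclImdb_clean/train') without any other change, and the same function can be run on the test data.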
I am working on training a model on my own images, which I read from my folders. I would be thankful for any help with this.
I can successfully read all my images from the folders and create my own one-hot encoded labels. However, each time I run my code, it takes a long time to read all the images from the folders. Therefore, I want to create a dataset from these images and save it, like MNIST, so that it loads faster and I do not have to read all my images again. Could you please help me with this?
The code is:
import os
import tensorflow as tf
path = "D:/cleandata/train_data/"
loadedImages = []
labels = []
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for i in range(len(os.listdir(path))):
    # Each subfolder of `path` holds the images of one class.
    imagesList = os.listdir(path + os.listdir(path)[i])
    for image in imagesList:
        image_raw_data_jpg = tf.gfile.FastGFile(path + os.listdir(path)[i] + '/' + image, 'rb').read()
        raw_image = tf.image.decode_png(image_raw_data_jpg, 3)
        gray_resize = tf.image.resize_images(raw_image, [28, 28])
        image_data = sess.run(tf.image.rgb_to_grayscale(gray_resize))
        loadedImages.append(image_data)
Here is a tutorial on how to use a TFRecords file. It shows how to create the file (containing images and labels) and read from it.
http://www.machinelearninguru.com/deep_learning/tensorflow/basics/tfrecord/tfrecord.html
Or you could just use zipfile, and include the label in the image file name, thus keeping them together (that is what I did)
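A minimal sketch of the zipfile idea, assuming the label is stored as a prefix of each member name and separated from the original file name by "__". The separator and both function names are illustrative choices, and labels must not contain the separator themselves.

```python
import zipfile

SEPARATOR = "__"  # assumed convention: "<label>__<original file name>"


def pack_images(zip_path, items):
    """Write (label, name, image_bytes) triples into one zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for label, name, data in items:
            zf.writestr(f"{label}{SEPARATOR}{name}", data)


def unpack_images(zip_path):
    """Yield (label, name, image_bytes) for every member of the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            label, name = member.split(SEPARATOR, 1)
            yield label, name, zf.read(member)
```

The image bytes are stored as they are, so they still have to be decoded after unpacking; the gain is that a single archive is opened instead of many small files.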
Hi, I am studying the Dataset API in TensorFlow and I have a question regarding the dataset.map() function, which performs data preprocessing.
file_names = ["image1.jpg", "image2.jpg", ......]
im_dataset = tf.data.Dataset.from_tensor_slices(file_names)
im_dataset = im_dataset.map(lambda image:tuple(tf.py_func(image_parser(), [image], [tf.float32, tf.float32, tf.float32])))
im_dataset = im_dataset.batch(batch_size)
iterator = im_dataset.make_initializable_iterator()
The dataset takes in image names and parses them into 3 tensors (3 pieces of information about the image).
If I have a very large number of images in my training folder, preprocessing them is going to take a long time.
My question is: since the Dataset API is said to be designed for efficient input pipelines, is the preprocessing done for the whole dataset before I feed it to my workers (let's say GPUs), or does it only preprocess one batch of images each time I call iterator.get_next()?
If your preprocessing pipeline is very long and the output is small, the processed data should fit in memory. If this is the case, you can use tf.data.Dataset.cache to cache the processed data in memory or in a file.
From the official performance guide:
The tf.data.Dataset.cache transformation can cache a dataset, either in memory or on local storage. If the user-defined function passed into the map transformation is expensive, apply the cache transformation after the map transformation as long as the resulting dataset can still fit into memory or local storage. If the user-defined function increases the space required to store the dataset beyond the cache capacity, consider pre-processing your data before your training job to reduce resource usage.
Example use of cache in memory
Here is an example where preprocessing each element takes a long time (0.5s). The second epoch over the dataset will be much faster than the first.
import time
import tensorflow as tf
def my_fn(x):
    time.sleep(0.5)  # simulate an expensive preprocessing step
    return x
def parse_fn(x):
    return tf.py_func(my_fn, [x], tf.int64)
dataset = tf.data.Dataset.range(5)
dataset = dataset.map(parse_fn)
dataset = dataset.cache()  # cache the processed dataset, so every input will be processed once
dataset = dataset.repeat(2)  # repeat for multiple epochs
res = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for i in range(10):
        # First 5 iterations will take 0.5s each, last 5 will not
        print(sess.run(res))
Caching to a file
If you want to write the cached data to a file, you can provide an argument to cache():
dataset = dataset.cache('/tmp/cache') # will write cached data to a file
This allows you to process the dataset only once and then run multiple experiments on the data without reprocessing it.
Warning: You have to be careful when caching to a file. If you change your data, but keep the /tmp/cache.* files, it will still read the old data that was cached. For instance, if we use the data from above and change the range of the data to be in [10, 15], we will still obtain data in [0, 5]:
dataset = tf.data.Dataset.range(10, 15)
dataset = dataset.map(parse_fn)
dataset = dataset.cache('/tmp/cache')
dataset = dataset.repeat(2)  # repeat for multiple epochs
res = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for i in range(10):
        print(sess.run(res))  # will still be in [0, 5]...
Always delete the cached files whenever the data that you want to cache changes.
Another issue that may arise is if you interrupt the script before all the data is cached. You will receive an error like this:
AlreadyExistsError (see above for traceback): There appears to be a concurrent caching iterator running - cache lockfile already exists ('/tmp/cache.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator.
Make sure you let the whole dataset be processed so that the cache file is complete.
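Both problems come down to removing stale files before the next run. Here is a small sketch of a cleanup helper, assuming the cache files follow the `<prefix>.*` naming seen above (`/tmp/cache.*`, including `/tmp/cache.lockfile`). The function name is hypothetical, and it should only be called when no other job is using the same cache prefix.

```python
import glob
import os


def clear_dataset_cache(prefix):
    """Delete every file named `<prefix>.*` and return the sorted list of removed paths."""
    removed = []
    # glob.escape keeps special characters in the prefix from being read as wildcards.
    for path in glob.glob(glob.escape(prefix) + ".*"):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return sorted(removed)
```

Calling clear_dataset_cache('/tmp/cache') before rebuilding the pipeline forces the data to be processed and cached again.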
With the old input-pipeline API I can do:
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
and then pass the filenames to another queue, for example:
reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, n)
How can I achieve similar behaviour with the Dataset API?
tf.data.TFRecordDataset() expects a tensor of filenames in a fixed order.
Start reading them in order, shuffle right after:
BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)
EDIT:
The input pipeline of this question gave me an idea on how to implement filenames shuffling with the Dataset API:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here
# further processing of the dataset
This emits all the records from one file before moving on to the next. The files are shuffled, but the records inside each file are produced in their original order.
You can alternatively replace dataset.flat_map with interleave to process multiple files at the same time and return samples from each:
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
Note: interleave does not actually run in multiple threads, it's a round-robin operation. For true parallel processing see parallel_interleave
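To make the difference in output order concrete, here is a simplified pure-Python model of the two transformations. It is not TensorFlow code: each "file" is a plain list of records, interleave_order assumes one record is taken per file per turn, and both function names are made up for this illustration.

```python
from itertools import chain

_DONE = object()  # sentinel meaning "no more files to open"


def flat_map_order(files):
    """Model of flat_map: every record of one file, then the next file."""
    return list(chain.from_iterable(files))


def interleave_order(files, cycle_length):
    """Model of interleave: round-robin over up to cycle_length open files."""
    pending = iter(files)
    slots = [None] * cycle_length
    out = []
    while True:
        produced = False
        for i in range(cycle_length):
            while True:
                if slots[i] is None:
                    nxt = next(pending, _DONE)
                    if nxt is _DONE:
                        break  # nothing left to open for this slot
                    slots[i] = iter(nxt)
                try:
                    out.append(next(slots[i]))
                    produced = True
                    break
                except StopIteration:
                    slots[i] = None  # file exhausted: open the next one in this slot
        if not produced:
            return out
```

For two files of two records each, flat_map_order yields both records of the first file and then both of the second, while interleave_order with cycle_length=2 alternates between the two files.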
The current TensorFlow version (v1.5 as of 02/2018) does not seem to support filename shuffling natively in the Dataset API. Here is a simple workaround using numpy:
import numpy as np
import tensorflow as tf
myShuffledFileList = np.random.choice(myInputFileList, size=len(myInputFileList), replace=False).tolist()
dataset = tf.data.TFRecordDataset(myShuffledFileList)
I'm relatively new to ML and very new to TensorFlow. I've spent quite a bit of time on the TensorFlow MNIST tutorial as well as https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/how_tos/reading_data to try to figure out how to read my own data, but I'm getting a bit confused.
I have a bunch of images (.png) in a directory /images/0_Non/. I'm trying to make these into a TensorFlow dataset so that I can basically run the code from the MNIST tutorial on it as a first pass.
import tensorflow as tf
# Make a queue of file names including all the PNG image files in the relative image directory.
filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once("../images/0_Non/*.png"))
image_reader = tf.WholeFileReader()
# Read a whole file from the queue, the first returned value in the tuple is the filename which we are ignoring.
_, image_file = image_reader.read(filename_queue)
image = tf.image.decode_png(image_file)
# Start a new session to show example output.
with tf.Session() as sess:
    # Required to get the filename matching to run.
    tf.initialize_all_variables().run()
    # Coordinate the loading of image files.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    # Get an image tensor and print its value.
    image_tensor = sess.run([image])
    print(image_tensor)
    # Finish off the filename queue coordinator.
    coord.request_stop()
    coord.join(threads)
I'm having a bit of trouble understanding what's going on here. So it seems like image is a tensor and image_tensor is a numpy array?
How do I get my images into a dataset? I also tried following along with the Iris example, which is for a CSV, and that brought me here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/base.py, but I wasn't sure how to get this to work for my case, where I have a bunch of PNGs.
Thanks!
The recently added tf.data API makes it easier to do this:
import tensorflow as tf
# Make a Dataset of file names including all the PNG images files in
# the relative image directory.
filename_dataset = tf.data.Dataset.list_files("../images/0_Non/*.png")
# Make a Dataset of image tensors by reading and decoding the files.
image_dataset = filename_dataset.map(lambda x: tf.image.decode_png(tf.read_file(x)))
# NOTE: You can add additional transformations, like
# `image_dataset.batch(BATCH_SIZE)` or `image_dataset.repeat(NUM_EPOCHS)`
# in here.
iterator = image_dataset.make_one_shot_iterator()
next_image = iterator.get_next()
# Start a new session to show example output.
with tf.Session() as sess:
    try:
        while True:
            # Get an image array and print its value.
            image_array = sess.run([next_image])
            print(image_array)
    except tf.errors.OutOfRangeError:
        # We have reached the end of `image_dataset`.
        pass
Note that for training you will need to get labels from somewhere. The Dataset.zip() transformation is a possible way to combine together image_dataset with a dataset of labels from a different source.
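One possible source of labels is the directory layout itself. The sketch below is plain Python and rests on an assumption the question does not confirm: that each class lives in its own folder whose name starts with an integer class id, as in "0_Non". The folder "1_Yes" used in the comment and both function names are hypothetical.

```python
import os


def label_from_path(path):
    """Assumed convention: parent folder name starts with the class id, e.g. "0_Non" -> 0."""
    parent = os.path.basename(os.path.dirname(path))
    return int(parent.split("_", 1)[0])


def paths_and_labels(paths):
    """Return two parallel lists: the sorted file paths and their integer labels."""
    ordered = sorted(paths)  # fixed order keeps paths and labels aligned
    return ordered, [label_from_path(p) for p in ordered]


# Example with a hypothetical second class folder "1_Yes":
# paths_and_labels(["../images/1_Yes/b.png", "../images/0_Non/a.png"])
```

The two lists can then be turned into datasets and combined, for example with Dataset.zip() as mentioned above, as long as both datasets are built from these lists in the same order.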