Tensorflow's decode_csv only reading one line - python

How do I get the decode_csv function to read every line in my CSV?
I'm currently trying to load data from my CSV file onto my GPU. Data loads fine onto the GPU, except... only one line of my 640-line CSV file is actually read. Where do you think I'm going wrong?
import tensorflow as tf

with tf.device('/gpu:0'):
    filename_queue = tf.train.string_input_producer(['dataset.csv'])
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['']]*121
    all_columns = tf.decode_csv(value, record_defaults=record_defaults)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    # Iterate through all the columns
    vals = []
    for x in range(121):
        tmp = all_columns.pop()
        myval = tmp.eval(session=sess)
        vals.append(myval)

    coord.request_stop()
    coord.join(threads)
Then if I do...
>>> import numpy as np
>>> vals = np.asarray(vals)
>>> vals.shape
(121,)
I do have 121 columns in each of the 640 rows of my CSV. The values in vals look fine to me, except I'm not actually getting all 640 lines read. I'm guessing it has to do with:
all_columns = tf.decode_csv(value, record_defaults=record_defaults)

Never mind, figured it out.
Apparently there is a difference between sess.run() and pop() in terms of how we extract row data: every separate tmp.eval() call re-runs the reader, so each evaluation pulls a column from a freshly dequeued line rather than 121 columns from the same line. Fetching all the columns in a single sess.run() call reads them from one line, and repeating that call once per row walks through the whole file.
I happen to have 640 lines in my CSV file and 121 columns, hence the:
record_defaults = [['']]*121
and
for x in range(640):
Note that this is mostly hardcoded just for testing purposes. Solution below:
import tensorflow as tf

with tf.device('/gpu:0'):
    filename_queue = tf.train.string_input_producer(['../Datasets/CMU_face_images_dataset.csv'])
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['']]*121
    all_columns = tf.decode_csv(value, record_defaults=record_defaults)

    # TWO NEW LINES
    name = all_columns[0]
    data = all_columns[1:]

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    vals = []
    names = []
    for x in range(640):
        # THIS IS THE NEW LINE
        _name, _val = sess.run([name, data])

        # OLD LINES
        # tmp = all_columns.pop()
        # myval = tmp.eval(session=sess)
        # vals.append(myval)

        names.append(_name)
        vals.append(_val)

    coord.request_stop()
    coord.join(threads)
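As a quick sanity check (a follow-up sketch, not part of the original answer), collecting all 640 rows with the fixed loop should now yield one entry per CSV line:

import numpy as np

# vals holds the 120 data columns of each row, names holds the first column
vals = np.asarray(vals)
names = np.asarray(names)

print(vals.shape)   # expected: (640, 120) -- 121 columns minus the name column
print(names.shape)  # expected: (640,)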

Related

Tensorflow 1.13.1 tf.data map multiple images with a single row together

I'm building my tf dataset where there are multiple inputs (images and numerical/categorical data). The problem I am having is that multiple images correspond to the same row in the pd.DataFrame I have. I am doing regression.
So how (even when shuffling all the inputs) do I ensure that each image gets mapped to the correct row?
Again, say I have 10 rows, and 100 images, with 10 images corresponding to a particular row. Now we shuffle the dataset, and we want to make sure that the shuffled images all correspond to their respective row.
I am using tf.data.Dataset to do this. I also have a directory structure such that the folder name corresponds to an element in the DataFrame, which is what I was thinking of using if I knew how to do the mapping.
i.e. folder1 would be in the df with cols like dir_name, feature1, feature2, .... Naturally, the dir_names should not be passed as data into the model to fit on.
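For concreteness, here is a minimal sketch (not from the question) of one way to build the paths list so each image lines up with its DataFrame row, assuming the folder-per-row layout described above; the column names dir_name and label follow the question, everything else is hypothetical:

import os
import glob

# Hypothetical helper: walk the folders in DataFrame order so that image i
# is always owned by the row recorded in row_indices[i].
paths, row_indices = [], []
for idx, dir_name in enumerate(X_train['dir_name']):
    for p in sorted(glob.glob(os.path.join(dir_name, '*.jpg'))):
        paths.append(p)
        row_indices.append(idx)

# Repeat each row's features and label once per image so all three arrays
# stay aligned element-wise before zipping and shuffling.
features_per_image = X_train.drop(columns=['dir_name']).values[row_indices]
labels_per_image = y_train['label'].values[row_indices]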
#images
path_ds = tf.data.Dataset.from_tensor_slices(paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
#numerical&categorical features. First remove the dirs
x_train_input = X_train[X_train.columns.difference(['dir_name'])]
x_train_input = np.expand_dims(x_train_input, axis=1)
text_ds = tf.data.Dataset.from_tensor_slices(x_train_input)
#labels, y_train's cols are: 'label' and 'dir_name'
label_ds = tf.data.Dataset.from_tensor_slices(
    tf.cast(y_train['label'], tf.float32))
# test creation of dataset without prior shuffling.
xtrain_ = tf.data.Dataset.zip((image_ds, text_ds))
model_ds = tf.data.Dataset.zip((xtrain_, label_ds))
# Shuffling
BATCH_SIZE = 64
# Setting a shuffle buffer size as large as the dataset ensures that
# data is completely shuffled
ds = model_ds.shuffle(buffer_size=len(paths))
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)
# prefetch lets the dataset fetch batches in the background while the
# model is training
# ds = ds.prefetch(buffer_size=AUTOTUNE)
ds = ds.prefetch(buffer_size=BATCH_SIZE)
My solution would be to utilize TFRecords for storing the data and preserving its integrity. This will also open the door to other efficiencies.
What the below code is doing:
1. Create dummy data. All need to be arrays with the same datatype found in the _parse_function. You can change that dtype, just also ensure you change it for your data too.
2. Create a dictionary that holds the arrays by name.
3. Create a feature_dimensions object that holds the shape of all arrays.
4. Create TFRecords by looping over the data dict. You can create one large file, or many small ones. This is a good starting point for you however.
5. Declare functions for generating the dataset. You can add and modify whatever logic you want there. The key, however, is that these functions use the feature_dimensions object to remember how to put the data back together.
6. Create a dataset.
7. Generate a sample. The result is a dictionary with a batch-size worth of data.
You should be able to just run this sample code all by itself and have no issues. Then just make the changes you need for it to work in your problem.
import tensorflow as tf
import pandas as pd
import numpy as np
from functools import partial

# Create dummy data, TODO replace with your own logic
# 10 images per row in DF
images_per_example = 10
examples = 200

# Save name for TFRecords, you can create multiple and pass a list of the names as well
save_name = "my_tfrecords.tfrecords"

# DF, dataframe with random categorical data
x_data = pd.DataFrame(data=(np.random.normal(size=(examples, 50)) > 0).astype(np.float32))
y_data = np.random.uniform(0, 1, size=(examples, )).reshape(-1, 1).astype(np.float32)

def load_and_preprocess_image(file):
    # For dummy purposes generating instead of loading
    img = np.random.uniform(high=255, low=0, size=(15, 15))
    return (img / 255.).astype(np.float32)

# I would preprocess your images prior to creating the tfrecords file
img_data = np.array([[load_and_preprocess_image("add_logic") for j in range(images_per_example)]
                     for k in range(examples)])

# Prepare for tfrecords
data_dict = dict()
data_dict["images"] = img_data        # Already an array
data_dict["x_data"] = x_data.values   # Ensure it's an array
data_dict["y_data"] = y_data          # Already an array

# Remember the dimensions for later restoration, replacing number of examples with -1
feature_dimensions = {k: v.shape for k, v in data_dict.items()}
feature_dimensions = {k: tuple([-1] + list(v[1:])) for k, v in feature_dimensions.items()}

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

writer = tf.python_io.TFRecordWriter(save_name)

# Create TFRecords file
for i in range(examples):
    example_dict = dict()  # New dictionary for each single example
    for name, data in data_dict.items():
        # if name == "images":
        #     break
        example_dict[name] = data[i]

    # Define the features of your tfrecord
    feature = {k: _bytes_feature(tf.compat.as_bytes(v.tostring())) for k, v in example_dict.items()}

    # Serialize to string and write to file
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())

writer.close()

# Declare functions for creating dataset
def _parse_function(proto, feature_dimensions_: dict):
    # define your tfrecord again. Remember that you saved your image as a string.
    keys_to_features = {k: tf.FixedLenFeature([], tf.string) for k in feature_dimensions_.keys()}

    # Load one example
    parsed_features = tf.parse_single_example(proto, keys_to_features)

    # Split data
    for k, v in parsed_features.items():
        parsed_features[k] = tf.decode_raw(v, tf.float32)

    return parsed_features

def create_tf_dataset(file_paths: str, feature_dimensions_: dict, batch_size=64):
    # This works with arrays as well
    dataset = tf.data.TFRecordDataset(file_paths)

    # Maps the parser on every filepath in the array. You can set the number of parallel loaders here
    parse_function = partial(_parse_function, feature_dimensions_=feature_dimensions_)
    dataset = dataset.map(parse_function, num_parallel_calls=1)

    # This dataset will go on forever
    dataset = dataset.repeat()

    # Set the number of datapoints you want to load and shuffle
    dataset = dataset.shuffle(batch_size)  # Put whatever you want here

    # Set the batchsize
    dataset = dataset.batch(batch_size)

    # Set up a pipeline
    dataset = dataset.prefetch(batch_size)  # Put whatever you want here

    # Create an iterator
    iterator = dataset.make_one_shot_iterator()

    # Create your tf representation of the iterator
    parsed_features = iterator.get_next()

    # Reshape arrays and cast to float
    for k, v in parsed_features.items():
        parsed_features[k] = tf.reshape(v, feature_dimensions_[k])
    for k, v in parsed_features.items():
        parsed_features[k] = tf.cast(v, tf.float32)

    return parsed_features

# Create dataset
ds = create_tf_dataset(save_name, feature_dimensions, batch_size=64)

# The final result is a dictionary with the names used above
sample = tf.Session().run(ds)

print("Sample Length:", len(sample))
print("Sample Keys:", sample.keys())
print("images shape:", sample["images"].shape)
print("x_data shape:", sample["x_data"].shape)
print("y_data shape:", sample["y_data"].shape)
Printed Results
Sample Length: 3
Sample Keys: dict_keys(['images', 'x_data', 'y_data'])
images shape: (64, 10, 15, 15)
x_data shape: (64, 50)
y_data shape: (64, 1)
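As a hedged follow-up (not part of the original answer): because create_tf_dataset returns ordinary graph tensors, the batch can also be wired straight into a model instead of going through placeholders. build_model below is a hypothetical stand-in for your own network:

# Sketch only; build_model is hypothetical, the key names match the dummy data above
features = create_tf_dataset(save_name, feature_dimensions, batch_size=64)
# predictions = build_model(images=features["images"], numeric=features["x_data"])
# loss = tf.losses.mean_squared_error(labels=features["y_data"], predictions=predictions)
# train_op = tf.train.AdamOptimizer().minimize(loss)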

Code rewrite - MemoryError

I am writing a python script to read two csv files. A code snippet is available below. The code works perfectly if the files contain few records (8,000); however, I encountered a MemoryError on the line X_train = X_train.astype('float32') when the files contain a large number of records (120,000).
img_lst_train = []
label_lst_train = []
img_lst_test = []
label_lst_test = []

print('Reading training file')
with open('train.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        img = cv2.imread(row[0])
        img_lst_train.append(img)
        label_lst_train.append(row[1])

print('Reading testing file')
with open('val.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        img = cv2.imread(row[0])
        img_lst_test.append(img)
        label_lst_test.append(row[1])

img_lst_train = np.array(img_lst_train)
label_lst_train = np.array(label_lst_train)
img_lst_test = np.array(img_lst_test)
label_lst_test = np.array(label_lst_test)

X_train = img_lst_train
y_train = label_lst_train
X_test = img_lst_test
y_test = label_lst_test

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
Structure of train.csv and val.csv
path to image file, label
path to image file, label
path to image file, label
.........................
How can I rewrite the above code to avoid the MemoryError?
NumPy's astype function supports a copy parameter; setting it to False lets NumPy reuse the original array where possible instead of always generating a copy. In code:
X_train = X_train.astype('float32', copy=False)
X_test = X_test.astype('float32', copy=False)
If you still run out of memory at some point, you can also read and convert your train, validation, and test sets one after the other instead of holding all of them in memory at the same time; that lowers the peak memory footprint and could make the difference.
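If sequential reading alone is not enough, one further option (a sketch under assumptions, not the original answer's code) is to preallocate the float32 array once and fill it image by image, so a separate integer copy of the whole dataset never has to coexist with the float copy. The fixed image size passed in below is an assumption; adjust it to your data:

import csv
import cv2
import numpy as np

def load_split(csv_path, height, width, channels=3):
    # Read the "path,label" rows first so the array can be sized up front
    with open(csv_path) as f:
        rows = list(csv.reader(f, delimiter=','))
    images = np.empty((len(rows), height, width, channels), dtype='float32')
    labels = []
    for i, row in enumerate(rows):
        # Assigning into the float32 array converts each image on the fly,
        # so only one image exists as uint8 at any time
        images[i] = cv2.imread(row[0])
        labels.append(row[1])
    return images, np.array(labels)

# X_train, y_train = load_split('train.csv', height=224, width=224)
# X_test, y_test = load_split('val.csv', height=224, width=224)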

Writing past the previous loop

with tf.Session() as sess:
    out = open('output.csv', 'a')
    for image_path in glob.glob(folder_path + '/*'):
        # Read in the image_data
        image_data = tf.gfile.FastGFile(image_path, 'rb').read()

        # Feed the image_data as input to the graph and get first prediction
        softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
        predictions = sess.run(softmax_tensor,
                               {'DecodeJpeg/contents:0': image_data})

        #print("%s\t%s\t%s\t%s\t%s\t%s\n" % (image_path, predictions[0][1], predictions[0][0], predictions[0][2], predictions[0][3], predictions[0][4]))
        for i in predictions:
            predictions = pd.DataFrame([image_path, i[0][1], i[0][0], i[0][2], i[0][3], i[0][4]],
                                       columns=['predictions']).to_csv('prediction.csv')

        #f = open('/tf_files/testinnggg', 'w')
        #for row in predictions:
        #    f.write(row[0])
        #    f.close()

        #test = []
        #test.append([predictions[0][1], predictions[0][0], predictions[0][2], predictions[0][3], predictions[0][4]])
        #THIS ACTUALLY WORKS, I see in my terminal "/tf_files/tested/pic1.jpg 0.00442768 0.995572"
        #np.savetxt('testinnggg', test, delimiter = ',')#,[predictions[0][0],predictions[0][2],predictions[0][3],predictions[0][4],delimiter = ',')

        #out.write("%s\t%s\t%s\n" % (image_path, predictions[0][1], predictions[0][0]))
        #This does not work, because output.csv is not modified
    out.close()
When using the pandas option to save the predictions, the only prediction that gets saved is for the final file; I think it is overwriting the previous ones. Any suggestions as to how I get all the predictions in the loop?
Thank you
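For what it's worth, one possible fix (a sketch reusing the names from the snippet above, not tested against the asker's graph): DataFrame.to_csv rewrites prediction.csv on every iteration, so only the last image's row survives. Collecting the rows and writing the file once after the loop keeps every prediction; the column order mirrors the commented-out print statement, and the header names are hypothetical:

import csv

rows = []
for image_path in glob.glob(folder_path + '/*'):
    image_data = tf.gfile.FastGFile(image_path, 'rb').read()
    predictions = sess.run(softmax_tensor, {'DecodeJpeg/contents:0': image_data})
    p = predictions[0]
    rows.append([image_path, p[1], p[0], p[2], p[3], p[4]])

# Write everything once, after the loop, instead of overwriting per image
with open('prediction.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['image_path', 'score_1', 'score_0', 'score_2', 'score_3', 'score_4'])
    writer.writerows(rows)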

Tensorflow: How can I run my testing set on a trained Neural Net

I have created a Neural Net that takes a corrupted RGB image as input and produces a clean version of it. After I finish training my NN I want to test it on a big set of 50 images. Each input (image) consists of a batch sized 64*32*32*3 (I crop each image into 64 patches and then feed them to the NN). I train my NN with the following code:
# placeholders, variables etc here
train_step = tf.train.AdamOptimizer().minimize(loss)

# loading data to queue
training_queue = tf.train.string_input_producer(clean_set, shuffle=False)
cor_queue = tf.train.string_input_producer(corrupted_set, shuffle=False)
reader = tf.WholeFileReader()
key, value = reader.read(training_queue)
cor_key, cor_value = reader.read(cor_queue)
data = tf.image.decode_jpeg(value, channels=3)
cor_data = tf.image.decode_jpeg(cor_value, channels=3)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)

for i in range(64):
    my_data = sess.run([key, data, cor_key, cor_data])
    im_list.append(my_data[1].reshape(1, -1))
    key_list.append(my_data[0])
    cor_im_list.append(my_data[3].reshape(1, -1))
    cor_key_list.append(my_data[2])

for j in range(my_times):
    _, y = sess.run([train_step, h_y], feed_dict={x: cor_im_list, y_: im_list})

print 'finished training NN'

coord.request_stop()
coord.join(threads)
This works fine!
Now I want to test my data set:
ressu = []
test_im_list = []
test_key_list = []

# I have 50 images in 50 folders (each folder contains the 64 patches of the image)
for i in range(50):
    path = 'randomize_text/permutated_data/perm_test_' + str(i) + '/*.jpg'
    testing_set = glob.glob(path)
    testing_queue = tf.train.string_input_producer(testing_set, shuffle=False)
    reader = tf.WholeFileReader()
    test_key, test_value = reader.read(testing_queue)
    test_data = tf.image.decode_jpeg(test_value, channels=3)
    for j in range(64):
        print j
        my_data = sess.run([test_key, test_data])
        test_im_list.append(my_data[1].reshape(1, -1))
        test_key_list.append(my_data[0])
    psi = sess.run(y, feed_dict={x: test_im_list})
    ressu.append(psi)
If I put this code right after finishing training the NN, the program becomes unresponsive; my guess is that I don't use the coord and threads, so I can't handle the big set (even if I place it right before I close the threads). If I load it the way I loaded my training set, I can only do it for one image, which is not enough; I need to load them all.
How can I test my trained NN with my testing set?
Thanks
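One hedged sketch of a way around the hang: tf.train.string_input_producer registers a queue runner, but start_queue_runners is only called once in the training code, before the test queues exist, so reads from those new queues block forever. Building a single ordered test queue before that one start_queue_runners call (and reading it back before coord.request_stop()) avoids this. The paths and counts follow the question; the per-folder file ordering and the placement within the full script are assumptions:

# Build one ordered list of all 50 * 64 patch files up front, *before* the
# single call to tf.train.start_queue_runners in the training code
test_files = []
for i in range(50):
    test_files += sorted(glob.glob(
        'randomize_text/permutated_data/perm_test_' + str(i) + '/*.jpg'))

testing_queue = tf.train.string_input_producer(test_files, shuffle=False)
test_reader = tf.WholeFileReader()
test_key, test_value = test_reader.read(testing_queue)
test_data = tf.image.decode_jpeg(test_value, channels=3)

# ... training happens here, using the coord/threads from the question ...

# After training, but before coord.request_stop(), read the patches back
# 64 at a time, one image (folder) per outer iteration
ressu = []
for i in range(50):
    test_im_list = []
    for j in range(64):
        my_data = sess.run([test_key, test_data])
        test_im_list.append(my_data[1].reshape(1, -1))
    psi = sess.run(y, feed_dict={x: test_im_list})
    ressu.append(psi)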

Loading images in tensorflow queue provides random image instead of FIFO

I am trying to load images with tensorflow, but I need the files to be in order. When I load an image, a random image is loaded, not the one in the order I provided through my initial array. However, my understanding is that string_input_producer(file_names) is FIFO. Why are my images random, and how do I make it load images in order?
with open("name.json", 'r') as f:
data = json.load(f)
file_names = []
for i, row in enumerate(data):
load_location = row['location']
file_names.append(load_location)
filename_queue = tf.train.string_input_producer(file_names) # list of files to read
count_num_files = tf.size(file_names)
reader=tf.WholeFileReader()
key,value=reader.read(filename_queue)
img = tf.image.decode_png(value)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
num_files = sess.run(count_num_files)
for i in range(num_files):
# this does not match
location = file_names[i]
# with this image
image_eval=img.eval()
coord.request_stop()
coord.join(threads)
Stupid mistake: string_input_producer's shuffle setting defaults to True, so pass shuffle=False:
filename_queue = tf.train.string_input_producer(file_names, shuffle=False)
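A small follow-up sketch (not part of the original answer): fetching the filename and the decoded image in the same sess.run call guarantees both come from the same dequeued file, which keeps the pairing correct even if shuffling were ever re-enabled:

for i in range(num_files):
    fetched_key, fetched_img = sess.run([key, img])
    # fetched_key is the path of exactly the file fetched_img was decoded from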
