I am writing a Python script to read two CSV files; the code snippet is below. The code works fine if the files contain only a few records (8,000), but I hit a MemoryError on the line X_train = X_train.astype('float32') when a file contains a large number of records (120,000).
import csv
import cv2
import numpy as np
from keras.utils import np_utils  # nb_classes is assumed to be defined earlier in the script

img_lst_train = []
label_lst_train = []
img_lst_test = []
label_lst_test = []

print('Reading training file')
with open('train.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        img = cv2.imread(row[0])
        img_lst_train.append(img)
        label_lst_train.append(row[1])

print('Reading testing file')
with open('val.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        img = cv2.imread(row[0])
        img_lst_test.append(img)
        label_lst_test.append(row[1])

img_lst_train = np.array(img_lst_train)
label_lst_train = np.array(label_lst_train)
img_lst_test = np.array(img_lst_test)
label_lst_test = np.array(label_lst_test)

X_train = img_lst_train
y_train = label_lst_train
X_test = img_lst_test
y_test = label_lst_test

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
Structure of train.csv and val.csv
path to image file, label
path to image file, label
path to image file, label
.........................
How can I rewrite the above code to avoid the MemoryError?
NumPy's astype function supports a copy parameter that, when set to False, avoids making a copy whenever it can (note that a genuine dtype conversion, e.g. uint8 to float32, still has to allocate a new array, but this keeps you from piling up an extra duplicate). In code:
X_train = X_train.astype('float32', copy=False)
X_test = X_test.astype('float32', copy=False)
If you still run out of memory at some point, you can also read your train, validation, and test sets sequentially instead of all at the same time, dropping the intermediate lists as soon as each split has been converted; that alone can make the difference.
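If it helps, here is a rough sketch of that sequential idea, assuming the same train.csv/val.csv layout shown above; the load_split helper is made up for this illustration.

import csv
import cv2
import numpy as np

def load_split(csv_path):
    # Read one split: first column is the image path, second column is the label
    imgs, labels = [], []
    with open(csv_path) as f:
        for row in csv.reader(f, delimiter=','):
            imgs.append(cv2.imread(row[0]))
            labels.append(row[1])
    X = np.array(imgs).astype('float32')
    y = np.array(labels)
    # imgs and labels go out of scope on return, so only the final arrays stay alive
    return X, y

# Each split is fully loaded and converted before the next one is touched,
# so the peak memory use is roughly one split's worth of intermediates.
X_train, y_train = load_split('train.csv')
X_test, y_test = load_split('val.csv')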
Related
I am currently working on face detection software as part of the development of a facial recognition project.
I have run into an issue that I do not know how to resolve.
Essentially, I am converting images to 250x250 resolution and then converting each image into a flattened NumPy array.
The arrays are exported to CSV files.
img = PIL.Image.open('tmp/images/train/cropped/image (' + str(convert_count) + ').jpg').convert('L')
width, height = img.size
img_size = 25, 25
img = img.resize(img_size)
imgarr = np.array(img)
pixels = list(img.getdata())
width, height = img.size
pixels = [pixels[i * width:(i + 1) * width] for i in range(height)]
pixels = np.concatenate(pixels).ravel().tolist()

with open('tmp/csv/train/train (' + str(convert_count) + ').csv', 'w') as csvfile:
    fieldnames = ['array']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'array': pixels})
I would have assumed that the arrays would all have the same number of elements, since they are converted from 25x250 images. However, this is not the case: my first two arrays (images) contain 74898 and 73682 elements.
I was wondering, why is this happening?
TensorFlow will not let me train a model when the input sizes differ.
Code below:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import csv

count = 1
remaining_images = 3
number_images = 3

image_array = {}
image_array[1] = {}
image_array[2] = {}

while remaining_images > count:
    with open('tmp/csv/train/train (' + str(count) + ').csv', 'r') as csvfile:
        reader = csv.reader(csvfile)
        row = [r for r in reader]
        image_array[count] = row[2]
        #print(image_array[count])
    count = count + 1

image_array[1] = str(image_array[1])
image_array[2] = str(image_array[2])

features = np.array([image_array[1], image_array[2]])
labels = np.array([1, 0])

# Example of the number of Elements in Arrays
array_size = len(features[0])
print(array_size)
array_size = len(features[1])
print(array_size)

batch_size = 2
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(batch_size)

model = keras.Sequential([
    keras.layers.Dense(5, activation=tf.nn.relu, input_shape=((array_size),)),
    keras.layers.Dense(3, activation=tf.nn.softmax)
])

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.fit(dataset, epochs=100, batch_size=batch_size, verbose=1)
I'm curious if the problem is coming from how you save to and get from the csv.
In your while loop (2nd code block), if you directly open the image file with PIL, resize, and then use the resulting image array (as you do in the 1st code block), does that resolve the size issue?
Also, since you resize to 25 x 25 = 625 pixels, I think you should just have 625 elements in each image array (rather than 74898 and 73682 elements).
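For what it's worth, here is a rough sketch of that suggestion, reusing the image paths and the number_images value from the snippets above (both are assumptions about the real setup): load each image with PIL, resize it, and stack the flattened arrays directly, skipping the CSV round trip.

import numpy as np
import PIL.Image

number_images = 3  # as in the training snippet above

features = []
for convert_count in range(1, number_images + 1):
    img = PIL.Image.open('tmp/images/train/cropped/image (' + str(convert_count) + ').jpg').convert('L')
    img = img.resize((25, 25))              # every image is now exactly 25 x 25
    features.append(np.array(img).ravel())  # flattening a 25x25 grayscale image gives exactly 625 values

features = np.array(features, dtype=np.float32) / 255.0  # shape: (number_images, 625)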
I'm building a tf dataset with multiple inputs (images and numerical/categorical data). The problem I am having is that multiple images correspond to the same row in the pd.DataFrame I have. I am doing regression.
So how (even when shuffling all the inputs) do I ensure that each image gets mapped to the correct row?
Again, say I have 10 rows and 100 images, with 10 images corresponding to each particular row. Now we shuffle the dataset, and we want to make sure that the shuffled images all still correspond to their respective rows.
I am using tf.data.Dataset to do this. I also have a directory structure where the folder name corresponds to an element in the DataFrame, which is what I was thinking of using if I knew how to do the mapping;
i.e. folder1 appears in the df with columns like dir_name, feature1, feature2, .... Naturally, dir_name should not be passed as data for the model to fit on.
#images
path_ds = tf.data.Dataset.from_tensor_slices(paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
#numerical&categorical features. First remove the dirs
x_train_input = X_train[X_train.columns.difference(['dir_name'])]
x_train_input=np.expand_dims(x_train_input, axis=1)
text_ds = tf.data.Dataset.from_tensor_slices(x_train_input)
#labels, y_train's cols are: 'label' and 'dir_name'
label_ds = tf.data.Dataset.from_tensor_slices(
    tf.cast(y_train['label'], tf.float32))
# test creation of dataset without prior shuffling.
xtrain_ = tf.data.Dataset.zip((image_ds, text_ds))
model_ds = tf.data.Dataset.zip((xtrain_, label_ds))
# Shuffling
BATCH_SIZE = 64
# Setting a shuffle buffer size as large as the dataset ensures that
# data is completely shuffled
ds = model_ds.shuffle(buffer_size=len(paths))
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)
# prefetch lets the dataset fetch batches in the background while the
# model is training
# ds = ds.prefetch(buffer_size=AUTOTUNE)
ds = ds.prefetch(buffer_size=BATCH_SIZE)
My solution would be to utilize TFRecords for storing the data and preserving its integrity. This will also open the door to other efficiencies.
What the code below does:
1. Creates dummy data. Everything needs to be an array with the same datatype used in _parse_function. You can change that dtype, just make sure you change it for your data too.
2. Creates a dictionary that holds the arrays by name.
3. Creates a feature_dimensions object that holds the shapes of all arrays.
4. Creates the TFRecords file by looping over the data dict. You can create one large file or many small ones; this is a good starting point either way.
5. Declares the functions for generating the dataset. You can add and modify whatever logic you want there. The key, however, is that these functions use the feature_dimensions object to remember how to put the data back together.
6. Creates the dataset.
7. Generates a sample. The result is a dictionary with a batch-size worth of data.
You should be able to run this sample code by itself with no issues, then just make the changes you need for it to work on your problem.
import tensorflow as tf
import pandas as pd
import numpy as np
from functools import partial

# Create dummy data, TODO replace with your own logic
# 10 images per row in DF
images_per_example = 10
examples = 200

# Save name for TFRecords, you can create multiple and pass a list of the names as well
save_name = "my_tfrecords.tfrecords"

# DF, dataframe with random categorical data
x_data = pd.DataFrame(data=(np.random.normal(size=(examples, 50)) > 0).astype(np.float32))
y_data = np.random.uniform(0, 1, size=(examples, )).reshape(-1, 1).astype(np.float32)

def load_and_preprocess_image(file):
    # For dummy purposes generating instead of loading
    img = np.random.uniform(high=255, low=0, size=(15, 15))
    return (img / 255.).astype(np.float32)

# I would preprocess your images prior to creating the tfrecords file
img_data = np.array([[load_and_preprocess_image("add_logic") for j in range(images_per_example)]
                     for k in range(examples)])

# Prepare for tfrecords
data_dict = dict()
data_dict["images"] = img_data  # Already an array
data_dict["x_data"] = x_data.values  # Ensure it's an array
data_dict["y_data"] = y_data  # Already an array

# Remember the dimensions for later restoration, replacing number of examples with -1
feature_dimensions = {k: v.shape for k, v in data_dict.items()}
feature_dimensions = {k: tuple([-1] + list(v[1:])) for k, v in feature_dimensions.items()}

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

writer = tf.python_io.TFRecordWriter(save_name)

# Create TFRecords file
for i in range(examples):
    example_dict = dict()  # New dictionary for each single example
    for name, data in data_dict.items():
        # if name == "images":
        #     break
        example_dict[name] = data[i]
    # Define the features of your tfrecord
    feature = {k: _bytes_feature(tf.compat.as_bytes(v.tostring())) for k, v in example_dict.items()}
    # Serialize to string and write to file
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())

writer.close()

# Declare functions for creating dataset
def _parse_function(proto, feature_dimensions_: dict):
    # define your tfrecord again. Remember that you saved your image as a string.
    keys_to_features = {k: tf.FixedLenFeature([], tf.string) for k in feature_dimensions_.keys()}
    # Load one example
    parsed_features = tf.parse_single_example(proto, keys_to_features)
    # Split data
    for k, v in parsed_features.items():
        parsed_features[k] = tf.decode_raw(v, tf.float32)
    return parsed_features

def create_tf_dataset(file_paths: str, feature_dimensions_: dict, batch_size=64):
    # This works with arrays as well
    dataset = tf.data.TFRecordDataset(file_paths)
    # Maps the parser on every filepath in the array. You can set the number of parallel loaders here
    parse_function = partial(_parse_function, feature_dimensions_=feature_dimensions_)
    dataset = dataset.map(parse_function, num_parallel_calls=1)
    # This dataset will go on forever
    dataset = dataset.repeat()
    # Set the number of datapoints you want to load and shuffle
    dataset = dataset.shuffle(batch_size)  # Put whatever you want here
    # Set the batchsize
    dataset = dataset.batch(batch_size)
    # Set up a pipeline
    dataset = dataset.prefetch(batch_size)  # Put whatever you want here
    # Create an iterator
    iterator = dataset.make_one_shot_iterator()
    # Create your tf representation of the iterator
    parsed_features = iterator.get_next()
    # Reshape arrays and cast to float
    for k, v in parsed_features.items():
        parsed_features[k] = tf.reshape(v, feature_dimensions_[k])
    for k, v in parsed_features.items():
        parsed_features[k] = tf.cast(v, tf.float32)
    return parsed_features

# Create dataset
ds = create_tf_dataset(save_name, feature_dimensions, batch_size=64)

# The final result is a dictionary with the names used above
sample = tf.Session().run(ds)
print("Sample Length:", len(sample))
print("Sample Keys:", sample.keys())
print("images shape:", sample["images"].shape)
print("x_data shape:", sample["x_data"].shape)
print("y_data shape:", sample["y_data"].shape)
Printed Results
Sample Length: 3
Sample Keys: dict_keys(['images', 'x_data', 'y_data'])
images shape: (64, 10, 15, 15)
x_data shape: (64, 50)
y_data shape: (64, 1)
I have built a CNN to predict lymph node positivity (cancer or not). Right now, to load the data, I have a self-defined function that loads a batch of data and feeds it to the model for training.
Instead of loading batches myself, I would love to use the flow_from_directory method. The problem is that my data are saved as arrays [#, rows, width, height, PET or CT], not as images (which would later be converted to arrays). For example, [0,:,:,:,0] is a 48x48x32 volume from a CT image.
If I try to use flow_from_directory I get 0 images with 3 classes, which I expected since '.mat' is not a recognized file type (https://github.com/keras-team/keras-preprocessing/blob/362fe9f8daf556151328eb5d02bd5ae638c653b8/keras_preprocessing/image.py#L1868). Interestingly, it doesn't raise any errors, but I am stuck indefinitely on 1/150 epochs. I am going to see if I can write my own flow_from_directory. Not sure if someone has run across this problem and could give me pointers.
Illustrating how data is combined
for fname in fnames:
    data = scipy.io.loadmat(os.path.join(dir_in_train, fname))['roi_patch']
    data_PET = scipy.io.loadmat(os.path.join(dir_in_train_PET, fname))['roi_patch']
    train_combo[0, :, :, :, 0] = data / 4.0950
    train_combo[0, :, :, :, 1] = data_PET / 32.1959
    train_combo[0, :, :, :, :].shape

train_combo = np.zeros((1, 48, 48, 32, 2))
scipy.io.savemat(fname, {fname: train_combo})
This will create a file, e.g. '1.mat', that holds the CT data and the PET data together.
Then I have code that converts it into .npy files.
Example of data generator I already have
# load training data
def load_train_data_batch_generator(self, batch_size=32, rows_in=48, cols_in=48, zs_in=32,
                                    channels_in=2, num_classes=3,
                                    dir_in_train=None, dir_out_train=None):
    # dir_in_train = main_dir + '/test_CT_PET_combo'
    fnames = ['{}.mat'.format(i) for i in range(1, len(os.listdir(dir_in_train)) + 1)]
    y_train = np.zeros((batch_size, num_classes))
    x_train = np.zeros((batch_size, rows_in, cols_in, zs_in, channels_in))
    while True:
        count = 0
        for fname in np.random.choice(fnames, batch_size, replace=False):
            data_label = scipy.io.loadmat(os.path.join(dir_out_train, fname))['output']
            # changing one hot encoding to integer
            integer_label = np.argmax(data_label[0], axis=0)
            y_train[count, :] = data_label
            # Loading train ct w/ c and pet/ct combo that will be saved into new directory
            train_combo = scipy.io.loadmat(os.path.join(dir_in_train, fname))[fname]
            x_train[count, :, :, :, :] = train_combo
            count += 1
        yield (x_train, y_train)
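For what it's worth, since flow_from_directory only recognizes image files, one workaround people use instead of a hand-rolled generator is a keras.utils.Sequence subclass, which Keras can batch and shuffle for you. The sketch below is only an illustration: the directory layout (one .npy file per sample, with a matching label file of the same name) and all the names are assumptions, not the poster's code.

import os
import numpy as np
from tensorflow.keras.utils import Sequence

class PatchSequence(Sequence):
    # Sketch of a flow_from_directory-style loader for .npy patches.
    # Assumes sample i is stored as '<dir_in>/<i>.npy' (e.g. shape 48x48x32x2)
    # with a matching label array at '<dir_out>/<i>.npy'.
    def __init__(self, dir_in, dir_out, batch_size=32):
        self.dir_in = dir_in
        self.dir_out = dir_out
        self.fnames = sorted(os.listdir(dir_in))
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.fnames) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.fnames[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([np.load(os.path.join(self.dir_in, f)) for f in batch])
        y = np.stack([np.load(os.path.join(self.dir_out, f)) for f in batch])
        return x, y

# Hypothetical usage: model.fit(PatchSequence(dir_in_train, dir_out_train), epochs=10)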
How do I get the decode_csv function to read every line in my CSV?
I'm currently trying to load data from my CSV file onto my GPU. Data loads fine onto the GPU, except... only one line of my 640-line CSV file is actually read. Where do you think I'm going wrong?
import tensorflow as tf

with tf.device('/gpu:0'):
    filename_queue = tf.train.string_input_producer(['dataset.csv'])
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['']]*121
    all_columns = tf.decode_csv(value, record_defaults=record_defaults)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    # Iterate through all the columns
    vals = []
    for x in range(121):
        tmp = all_columns.pop()
        myval = tmp.eval(session=sess)
        vals.append(myval)
    coord.request_stop()
    coord.join(threads)
Then if I do...
>>> import numpy as np
>>> vals = np.asarray(vals)
>>> vals.shape
(121,)
I do have 121 columns per each of the 640 rows in my CSV. The values in vals look fine to me, except I'm not actually getting all 640 lines read. I'm guessing it has to do with:
all_columns = tf.decode_csv(value, record_defaults=record_defaults)
Never mind, I figured it out.
Apparently there is a difference between sess.run() and pop()/eval() in terms of how the row data gets extracted: each eval() call runs the graph again and pulls a fresh line from the reader, whereas a single sess.run([name, data]) fetches every column of one line at once, so looping it 640 times walks through the whole file.
I happen to have 640 lines in my CSV file and 121 columns, hence the:
record_defaults = [['']]*121
and
for x in range(640):
Note that this is mostly hardcoded just for testing purposes. Solution below:
import tensorflow as tf

with tf.device('/gpu:0'):
    filename_queue = tf.train.string_input_producer(['../Datasets/CMU_face_images_dataset.csv'])
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['']]*121
    all_columns = tf.decode_csv(value, record_defaults=record_defaults)
    # TWO NEW LINES
    name = all_columns[0]
    data = all_columns[1:]

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    vals = []
    names = []
    for x in range(640):
        # THIS IS THE NEW LINE
        _name, _val = sess.run([name, data])
        # OLD LINES
        # tmp = all_columns.pop()
        # myval = tmp.eval(session=sess)
        # vals.append(myval)
        names.append(_name)
        vals.append(_val)
    coord.request_stop()
    coord.join(threads)
Getting output classification with Lasagne/Theano
I am migrating my code from pure Theano to Lasagne.
I had some code from a tutorial that got the predictions for a given dataset and generated a CSV file to send to Kaggle.
But with Lasagne, it doesn't work.
I have tried several things, but they all give errors.
I would love it if anyone could help me figure out what's wrong!
I pasted the whole code here: http://pastebin.com/e7ry3280
test_data = np.loadtxt("../inputData/test.csv", dtype=np.uint8, delimiter=',', skiprows=1)

# The inputs are vectors now, we reshape them to monochrome 2D images,
# following the shape convention: (examples, channels, rows, columns)
data = data.reshape(-1, 1, 28, 28)
test_data = test_data.reshape(-1, 1, 28, 28)

index = T.lscalar()  # index to a [mini]batch

preds = []
for it in range(len(test_data)):
    test_data = test_data[it]
    N = len(test_data)
    # print "N : ", N

    test_data = theano.shared(np.asarray(test_data, dtype=theano.config.floatX))
    test_labels = T.cast(theano.shared(np.asarray(np.zeros(batch_size), dtype=theano.config.floatX)), 'uint8')

    ###target_var
    #y = T.ivector('y')  # the labels are presented as 1D vector of [int] labels
    #index = T.lscalar()  # index to a [mini]batch

    ppm = theano.function([index], lasagne.layers.get_output(network, deterministic=True),
                          givens={
                              input_var: test_data[index * batch_size: (index + 1) * batch_size],
                              target_var: test_labels
                          }, on_unused_input='warn')

    p = [ppm(ii) for ii in range(N // batch_size)]
    p = np.array(p).reshape((N, 10))
    print(p)
    p = np.argmax(p, axis=1)
    p = p.astype(int)
    preds.append(p)

subm = np.empty((len(preds), 2))
subm[:, 0] = np.arange(1, len(preds) + 1)
subm[:, 1] = preds
np.savetxt('submission.csv', subm, fmt='%d', delimiter=',', header='ImageId,Label', comments='')

return preds
The code fails on the line that starts with ppm = theano.function...:
TypeError: Cannot convert Type TensorType(float32, 3D) (of Variable Subtensor{int64:int64:}.0) into Type TensorType(float32, 4D). You can try to manually convert Subtensor{int64:int64:}.0 into a TensorType(float32, 4D).
I'm just trying to feed the test data to the CNN and get the results into a CSV file. How can I do it? I know I must use minibatches because the whole test set won't fit on the GPU.
As pointed out by the error message and by Daniel Renshaw in the comments, the problem is a mismatch of dimensions between test_data and input_var. On the first line of the loop, you write:
test_data = test_data[it]
This turns the 4D array test_data into a 3D array with the same name (which is why reusing the same variable name for different things is never recommended :) ). After that you wrap it in a shared variable, which doesn't change the dimensionality, and then you slice it to assign it to input_var, which again doesn't change the dimensionality.
If I understand your code, I think you should just remove that first line. That way test_data remains a list of examples, and you can slice it to make a batch.
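A minimal sketch of what that fix might look like, reusing index, batch_size, network, input_var, and the other names from the snippet above (untested against the full pastebin code): keep test_data 4D, wrap it in a shared variable once, and let index slice out each minibatch.

test_data = test_data.reshape(-1, 1, 28, 28)
N = len(test_data)
shared_test = theano.shared(np.asarray(test_data, dtype=theano.config.floatX))

# Compile a single prediction function; target_var is not needed for inference
ppm = theano.function(
    [index],
    lasagne.layers.get_output(network, deterministic=True),
    givens={input_var: shared_test[index * batch_size: (index + 1) * batch_size]},
    on_unused_input='warn')

# Predict batch by batch, then take the argmax over the 10 class probabilities
p = np.concatenate([ppm(ii) for ii in range(N // batch_size)])
preds = np.argmax(p, axis=1).astype(int)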