Preprocessing Image dataset with numpy for CNN: Memory Error - python

I have a dataset (71094 train images and 17000 test images) for which I need to train a CNN. During preprocessing, I tried creating a matrix with numpy that turns out to be ridiculously large (71094 * 100 * 100 * 3 for the train data; all images are RGB, 100 by 100), so I get a memory error. How do I tackle this situation? Please help.
This is my code:
import numpy as np
import cv2
from matplotlib import pyplot as plt

data_dir = './fashion-data/images/'
train_data = './fashion-data/train.txt'
test_data = './fashion-data/test.txt'

f = open(train_data, 'r').read()
ims = f.split('\n')
print len(ims)

train = np.zeros((71094, 100, 100, 3))  # this line causes the error..

for ix in range(train.shape[0]):
    i = cv2.imread(data_dir + ims[ix] + '.jpg')
    label = ims[ix].split('/')[0]
    train[ix, :, :, :] = cv2.resize(i, (100, 100))

print train[0]

train_labels = np.zeros((71094, 1))
for ix in range(train_labels.shape[0]):
    l = ims[ix].split('/')[0]
    train_labels[ix] = int(l)

print train_labels[0]

np.save('./data/train', train)
np.save('./data/train_labels', train_labels)

I recently ran into the same problem, and I believe it's a common one when working with image data.
There are a number of methods you can use to tackle it, depending on what you would like to do.
1) It can make sense to sample the data from each image when training, so as not to train on ALL 71094*100*100 pixels. This can be done simply by creating a function which loads one image at a time and samples your pixels. There is some argument that doing this randomly for each epoch can reduce overfitting, but again it depends on the exact problem. Stratified sampling may also help to balance the classes, should you be working with pixel classification.
2) Mini-batch training - split your training data into small "mini-batches" and train on each of these individually. Your epoch ends once you have completed training on all of your mini-batches, i.e. all of your data. Here you should randomise the order of the data each epoch in order to avoid overfitting.
3) Load and train one image at a time - similar to mini-batch training, but just use one image as a "mini-batch" for each iteration, and run a for-loop through all of the images in the folder. This way only 1x100x100x3 is stored in memory at a time. Depending on the size of your memory, you could perhaps use more than one image per mini-batch, i.e. Nx100x100x3, and run for 71094/N iterations to cover all the training data; a sketch follows below.
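As a rough sketch of option 3 (with N images per mini-batch), here is a generator built on the paths and file-name list from the question's own code. The batch size, the float32 dtype, and the model.train_on_batch call in the usage comment are assumptions for illustration; note also that np.zeros defaults to float64, which is part of why the full array is so large.

import numpy as np
import cv2

def batch_generator(data_dir, image_ids, batch_size=64):
    # Yield (images, labels) one mini-batch at a time instead of
    # allocating the full 71094 x 100 x 100 x 3 array.
    for start in range(0, len(image_ids), batch_size):
        chunk = image_ids[start:start + batch_size]
        batch = np.zeros((len(chunk), 100, 100, 3), dtype=np.float32)
        labels = np.zeros(len(chunk), dtype=np.int64)
        for j, im_id in enumerate(chunk):
            img = cv2.imread(data_dir + im_id + '.jpg')
            batch[j] = cv2.resize(img, (100, 100))
            labels[j] = int(im_id.split('/')[0])  # label is the folder prefix
        yield batch, labels

# Only one mini-batch lives in memory at a time:
# for X, y in batch_generator('./fashion-data/images/', ims, 64):
#     model.train_on_batch(X, y)  # hypothetical training call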
I hope this was clear... and that it helps somewhat!

Related

Tensorflow: Batching whole dataset (MNIST Tutorial)

Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside TensorFlow. It looks like this:
# variables
batch_size = 50
dimension = 784
stages = 10

# step 1: read dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)

# step 2: create dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

# step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    # convert label to one-hot encoding
    one_hot = tf.one_hot(label, stages)
    # read image file
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)
    return image, one_hot

# step 4: final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)  # batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size, dimension]).eval()
labels = tf.reshape(labels, [batch_size, stages]).eval()

for _ in range(10):
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    train_step.run(feed_dict={x: images, y_: labels})
Somehow using a higher batch_size breaks Python. What I'm trying to do is train my neural network with new batches on each iteration, which is why I'm also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do instead (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I get a numpy array; I then shuffle the array with numpy.random.shuffle(images) and pick the first elements to train on.
e.g.
for _ in range(1000):
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    # shuffle
    np.random.shuffle(images)
    np.random.shuffle(labels)
    train_step.run(feed_dict={x: images[0:train_size], y_: labels[0:train_size]})
But then here comes the problem that I can't batch my whole dataset. It looks like the data is too big for Python to work with.
How should I solve this differently?
Since I'm not using the MNIST database, there isn't a function like mnist.train.next_batch(100) that comes in handy for me.
Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that on the first iteration, the batch shape is, say, [32 3600]. On the next iteration, the elements of this shape are batched again, to [32 32 3600], and so on.
There's a great tutorial on the TF website where you can find out more about how Datasets work, but here are a few suggestions for how you can resolve your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset, so your batches will have a good mixture of examples. Also increase the buffer_size argument; it works differently than you probably assume. It's usually a good idea to shuffle as early as possible, as shuffling can be a slow operation if you have a large dataset -- the shuffled part of the dataset has to be read into memory. Here it does not really matter whether you shuffle the filenames and labels or the decoded images and labels -- but the latter will have more work to do, since the dataset is larger by that time.
Move batching and the iterator generator to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details in this Q&A: Tensorflow: create minibatch from numpy array > 2 GB
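Putting these suggestions together, a sketch of the corrected pipeline might look like the following (TF 1.x API matching the question; the repeat() call and the whole-dataset buffer size are assumptions):

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenamesList))  # shuffle early, over everything
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat()  # assumed: keep iterating across epochs

# Build the iterator once, just before the training loop
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
# Define the model directly on `images` and `labels`; no feed_dict needed.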
I've been running into a lot of problems creating TensorFlow datasets, so I decided to use OpenCV to import images.
import cv2
import numpy as np

imgDataset = []
for i in range(len(files)):
    imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
The shape of imgDataset is (num_img, height, width, col_channels), and getting the i-th image is just imgDataset[i].
Shuffling the dataset and taking a batch of it can be done like this:
from sklearn.utils import shuffle

X, y = shuffle(X, y)
X_feed = X[:batch_size]
y_feed = y[:batch_size]

Then you feed X_feed and y_feed into your model.

Why does my keras model get good accuracy but bad predictions?

So, I am trying to make a model which can predict doodles. I am using Google's Quick, Draw! data: https://console.cloud.google.com/storage/browser/quickdraw_dataset/full/numpy_bitmap which contains images rendered as 28x28 greyscale bitmaps in numpy arrays. I chose 10 classes and took 60,000 images to train/evaluate, and I get a test accuracy of 91%. When I make predictions with data from the test set, it works, but when I make a drawing in Paint and convert it to 28x28, the predictions are bad. What sort of data do I need to have? What kind of preprocessing does the image need?
This is how I preprocessed the data from Google's npy file:
def load_set(name, path, resultx, resulty, label):
    loaded_set = np.load(path + name + ".npy")
    loaded_set = loaded_set.reshape(loaded_set.shape[0], 1, 28, 28)
    # print(name, loaded_set.shape)
    loaded_set = loaded_set[0:6000, 0:6000, 0:6000, 0:6000]
    resultx = np.append(resultx, loaded_set, axis=0)
    resulty = createLabelArray(label, loaded_set.shape[0], resulty)
    print("loaded " + name)
    return resultx, resulty

def createLabelArray(label, size, result):
    for i in range(0, size):
        result = np.append(result, [[label]], axis=0)
    return result
where label is the label I want for that category.
I shuffle them afterwards and everything.
And this is how I am trying to process new images (drawings by me):
print("[INFO] loading and preprocessing image...")
image = image_utils.load_img(os.path.join(path, name), grayscale=True,target_size=(28, 28))
image = image_utils.img_to_array(image)
print(image.shape)
image = np.expand_dims(image, axis=0)
print(image.shape)
image = image.astype('float32')
image /= 255
return image
Please help, I've been stuck on this for a while now. Thank you
This seems to be a typical case of overfitting.
Please try 10-fold cross-validation to get a reliable estimate of your model's accuracy.
Further, use regularization and dropout in Keras to prevent overfitting; a sketch follows below.
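As a hedged sketch of that suggestion, dropout and L2 regularization can be added in Keras roughly like this (the layer sizes and rates are illustrative, not tuned for the doodle data):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2

model = Sequential([
    # L2 penalty shrinks the weights; input shape assumes flattened 28x28 images
    Dense(128, activation='relu', kernel_regularizer=l2(1e-4), input_shape=(784,)),
    Dropout(0.5),  # randomly drop half the units during each training step
    Dense(10, activation='softmax'),  # 10 doodle classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])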

Scikit-learn SVM digit recognition

I want to make a program to recognize the digit in an image. I am following the tutorial in scikit-learn.
I can train and fit the SVM classifier as follows.
First, I import the libraries and dataset
from sklearn import datasets, svm, metrics
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
Second, I create the SVM model and train it with the dataset.
classifier = svm.SVC(gamma = 0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])
And then, I try to read my own image and use the function predict() to recognize the digit.
Here is my image:
I reshape the image into (8, 8) and then convert it to a 1D array.
img = misc.imread("w1.jpg")
img = misc.imresize(img, (8, 8))
img = img[:, :, 0]
Finally, when I print out the prediction, it returns [1]
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
print predicted
Whatever other images I use, it still returns [1].
When I print out the "default" dataset image of the number "9", it looks like:
My image of the number "9":
You can see the non-zero values are quite large for my image.
I don't know why. I am looking for help to solve my problem. Thanks.
My best bet would be that there is a problem with your data types and array shapes.
It looks like you are training on numpy arrays that are of the type np.float64 (or possibly np.float32 on 32 bit systems, I don't remember) and where each image has the shape (64,).
Meanwhile your input image for prediction, after the resizing operation in your code, is of type uint8 and shape (1, 64).
I would first try changing the shape of your input image since dtype conversions often just work as you would expect. So change this line:
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
to this:
predicted = classifier.predict(img.reshape(img.shape[0]*img.shape[1]))
If that doesn't fix it, you can always try recasting the data type as well with
img = img.astype(digits.images.dtype).
I hope that helps. Debugging by proxy is a lot harder than actually sitting in front of your computer :)
Edit: According to the scikit-learn documentation, the training data contains integer values from 0 to 16. The values in your input image should be scaled to fit the same interval. (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits)
1) You need to create your own training set - based on data similar to what you will be making predictions on. The call to datasets.load_digits() in scikit-learn loads a preprocessed version of the MNIST digits dataset, which, for all we know, could contain very different images from the ones that you are trying to recognise.
2) You need to set the parameters of your classifier properly. The call to svm.SVC(gamma = 0.001) just chooses an arbitrary value for the gamma parameter of SVC, which may not be the best option. In addition, you are not configuring the C parameter - which is pretty important for SVMs. I'd bet that this is one of the reasons why your output is 'always 1'.
3) Whatever final settings you choose for your model, you'll need to use a cross-validation scheme to ensure that the algorithm is effectively learning.
There's a lot of machine learning theory behind this, but, as a good start, I would really recommend having a look at SVM - scikit-learn for a more in-depth description of how the SVC implementation in scikit-learn works, and at GridSearchCV for a simple technique for parameter setting; a sketch follows below.
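As a sketch of that parameter-tuning suggestion, a grid search over C and gamma could look like this (the grid values are illustrative starting points, not recommendations):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

# Small, illustrative grid over the two key SVC parameters
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(data, digits.target)
print(search.best_params_)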
It's just a guess, but... the training set from scikit-learn has black numbers on a white background, and you are trying to predict numbers which are white on a black background...
I think you should either train on your own training set, or train on the negative version of your pictures (a one-line sketch follows).
I hope this helps!
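A tiny sketch of the "negative version" idea, assuming img is the uint8 grayscale array from the question:

# Invert a uint8 grayscale image: white-on-black becomes black-on-white
img_negative = 255 - img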
If you look at:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
you can see that each point in the matrix has a value between 0 and 16.
You can try to transform the values of your image to the 0-16 range. I did it, and now the prediction works well for the digit 9 but not for 8 and 6. At least it doesn't give 1 any more.
from sklearn import datasets, svm, metrics
import cv2
import numpy as np

# Load digit database
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Train SVM classifier
classifier = svm.SVC(gamma=0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])

# Read image "9"
img = cv2.imread("w1.jpg")
img = img[:, :, 0]
img = cv2.resize(img, (8, 8))

# Normalize the values in the image to 0-16
minValueInImage = np.min(img)
maxValueInImage = np.max(img)
normalizedImg = np.floor(np.divide((img - minValueInImage).astype(np.float), (maxValueInImage - minValueInImage).astype(np.float)) * 16)

# Predict
predicted = classifier.predict(normalizedImg.reshape((1, normalizedImg.shape[0] * normalizedImg.shape[1])))
print predicted
I have solved this problem using the methods below:
check the number of attributes - too large or too small;
check the scale of your gray values - I changed mine to [0, 16];
check the data type - I changed mine to uint8;
check the amount of training data - too small or not.
I hope it helps. ^.^
In addition to @carrdelling's response, I will add that you may use the same training set if you normalize your images to have the same range of values.
For example, you could binarize your data (1 if > 0, 0 otherwise), or you could divide by the maximum intensity in your image to get an arbitrary interval [0, 1]; a short sketch follows.
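A small sketch of those two normalizations, assuming img is a numpy array as elsewhere in the thread:

import numpy as np

binary = (img > 0).astype(np.float32)        # 1 if > 0, 0 otherwise
scaled = img.astype(np.float32) / img.max()  # values in the interval [0, 1]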
You probably want to extract features relevant to your data set from the images and train your model on them.
One example I copied from here.
surf = cv2.SURF(400)
kp, des = surf.detectAndCompute(img,None)
But SURF features may not be the most useful or relevant to your dataset and training task; you should try others too, like HOG (sketched below).
Remember: the higher-level the features you extract, the more general/error-tolerant your model will be to unseen images. However, you may be sacrificing accuracy on your known samples and test cases.
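As a hedged example, HOG features can be computed with scikit-image roughly as follows (the parameters are illustrative, and gray_img is assumed to be a 2D grayscale array):

from skimage.feature import hog

# Histogram-of-oriented-gradients descriptor for one grayscale image
features = hog(gray_img, orientations=8,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))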

how to read batches in one hdf5 data file for training?

I have an HDF5 training dataset of size (21760, 1, 33, 33), where 21760 is the total number of training samples. I want to use mini-batches of size 128 to train the network.
I want to ask:
How do I feed a 128-sample mini-batch from the whole dataset with TensorFlow each time?
If your data set is so large that it can't be imported into memory the way keveman suggested, you can use the h5py object directly:

import h5py
import tensorflow as tf

data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128

sess = tf.Session()
train_op = # tf.something_useful()
input = # tf.placeholder or something

for i in range(0, data_size, batch_size):
    current_data = data['data_set'][i:i + batch_size]  # slice read straight from disk
    sess.run(train_op, feed_dict={input: current_data})
You can also run through a huge number of iterations and randomly select a batch each time:

import random

for i in range(iterations):
    pos = random.randint(0, int(data_size / batch_size) - 1) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})
Or sequentially:

for i in range(iterations):
    pos = (i % int(data_size / batch_size)) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})
You probably want to write some more sophisticated code that goes through all the data randomly but keeps track of which batches have been used, so you don't use any batch more often than the others. Once you've done a full pass through the training set, you enable all batches again and repeat; a sketch follows below.
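A minimal sketch of that idea, reusing the names from the snippets above (num_epochs is an assumed parameter): shuffle the batch start offsets each epoch, so every batch is used exactly once per pass.

import numpy as np

starts = np.arange(0, data_size - batch_size + 1, batch_size)
for epoch in range(num_epochs):
    np.random.shuffle(starts)  # new batch order every epoch
    for pos in starts:
        current_data = data['data_set'][pos:pos + batch_size]
        sess.run(train_op, feed_dict={input: current_data})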
You can read the HDF5 dataset into a numpy array and feed slices of the numpy array to the TensorFlow model. Pseudo-code like the following would work:
import numpy, h5py

f = h5py.File('somefile.h5', 'r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)

for i in range(0, 21760, 128):
    sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})
alkamen's approach seems logically right, but I have not gotten any positive results using it. My best guess is this: using code sample 1 above, in every iteration the network trains afresh, forgetting everything that was learned in the previous loop. So if we are fetching 30 samples or batches per iteration, at every loop/iteration only 30 data samples are used, and at the next loop everything is overwritten.
Find below a screenshot of this approach.
As can be seen, the loss and accuracy always start afresh. I would be happy if anyone could share a possible way around this, please.

Problems while using ScikitLearn's Neural Network implementation

I am trying to implement image processing using the neural network implementation given by scikit-learn. I have close to 10,000 color images in JPG format; I converted those images into PNG format and removed the color information, so the new images are all black OR white. After converting these images into vector format, the image vectors formed the input to the neural network.
For each image there is an output as well, which forms the output of the neural network.
The input file only has values of 0s and 1s and nothing else at all. The output for each image corresponds to a vector which is continuous, between 0 and 1, and is 22 in length, i.e. each image's output is a vector of length 22.
To start off with the processing, I began with only 100 images and their corresponding outputs and got the following error:
ValueError: Array contains NaN or infinity
I would also like to point out that the first iteration completed; I encountered this error during the second iteration.
To try something different, I trimmed my input and output down to 10 images each. Using the same piece of code (coming up shortly), I was able to complete 7 iterations (I had set the number of iterations to 20) and then received the same error.
I then changed the number of iterations to 5, just to check if it works. After this change, I got the following error:
ValueError: bad input shape (10, 22)
I also tried to use np.ravel() on my input and output, but that gave me the NaN or infinity error again.
Here is the code I am using for the whole process:
import numpy as np
import csv
import matplotlib.pyplot as plt
from scipy.ndimage import convolve
from sklearn import linear_model, datasets, metrics
from sklearn.cross_validation import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

def ReadCsv(fileName):
    in_file = open(fileName, 'rUb')
    reader = csv.reader(in_file, delimiter=',', quotechar='"')
    data = [[]]
    for row in reader:
        data.append(row)
    data.pop(0)
    return data

X_train = np.asarray(ReadCsv('100Images.csv'), 'float32')
Y_train = np.asarray(ReadCsv('100Images_Y_new.csv'), 'float32')
X_test = np.asarray(ReadCsv('ImagesForTest.csv'), 'float32')
Y_test = np.asarray(ReadCsv('ImagesForTest_Y_new.csv'), 'float32')

logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])

rbm.learning_rate = 0.06
rbm.n_iter = 5
rbm.n_components = 100
logistic.C = 6000.0

classifier.fit(X_train, Y_train)

print()
print("Logistic regression using RBM features:\n%s\n" % (
    metrics.classification_report(
        Y_test,
        classifier.predict(X_test))))
I would really appreciate any kind of help on this issue.
TIA.
Changing the learning rate to a small value might fix this issue (i.e. rbm.learning_rate).
At least this fixed the problem I had before; a sketch follows below.
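A minimal sketch of that change against the question's code (the value 0.001 is just an illustrative choice):

rbm.learning_rate = 0.001  # much smaller than the original 0.06
classifier.fit(X_train, Y_train)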
