How can I use the Keras OCR example? - python

I found examples/ which seems to be for OCR. Hence it should be possible to give the model an image and receive text. However, I have no idea how to do so. How do I feed the model a new image? What kind of preprocessing is necessary?
What I did
Installing the dependencies:
Install cairocffi: sudo apt-get install python-cairocffi
Install editdistance: sudo -H pip install editdistance
Change train to return the model and save the trained model.
Run the script to train the model.
Now I have a model.h5. What's next?
See for my current code. I know how to load the model (see below) and this seems to work. The problem is that I don't know how to feed new scans of images with text to the model.
Related side questions
What is CTC? Connectionist Temporal Classification?
Are there algorithms which reliably detect the rotation of a document?
Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?
What I tried
#!/usr/bin/env python
from keras import backend as K
import keras
from keras.models import load_model
import os
from image_ocr import ctc_lambda_func, create_model, TextImageGenerator
from keras.layers import Lambda
from keras.utils.data_utils import get_file
import scipy.ndimage
import numpy
img_h = 64
img_w = 512
pool_size = 2
words_per_epoch = 16000
val_split = 0.2
val_words = int(words_per_epoch * (val_split))
if K.image_data_format() == 'channels_first':
    input_shape = (1, img_w, img_h)
else:
    input_shape = (img_w, img_h, 1)
fdir = os.path.dirname(get_file('wordlists.tgz',
                                origin='', untar=True))
img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'),
                             bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'),
                             minibatch_size=32,
                             img_w=img_w,
                             img_h=img_h,
                             downsample_factor=(pool_size ** 2),
                             val_split=words_per_epoch - val_words)
print("Input shape: {}".format(input_shape))
model, _, _ = create_model(input_shape, img_gen, pool_size, img_w, img_h)
x = scipy.ndimage.imread('example.png', mode='L').transpose()
x = x.reshape(x.shape + (1,))
# Does not work
this gives
2017-07-05 22:07:58.695665: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
File "", line 45, in <module>
File "/usr/local/lib/python2.7/dist-packages/keras/engine/", line 1567, in predict
File "/usr/local/lib/python2.7/dist-packages/keras/engine/", line 106, in _standardize_input_data
'Found: array with shape ' + str(data.shape))
ValueError: The model expects 4 arrays, but only received one array. Found: array with shape (512, 64, 1)

Well, I will try to answer everything you asked here:
As commented in the OCR code, Keras doesn't support losses with multiple parameters, so it calculates the NN loss in a lambda layer. What does this mean in this case?
The neural network may look confusing because it uses 4 inputs ([input_data, labels, input_length, label_length]) and loss_out as output. Apart from input_data, everything else is information used only for calculating the loss, which means it is only used for training. We want something like line 468 of the original code:
Model(inputs=input_data, outputs=y_pred).summary()
which means "I have an image as input, please tell me what is written here". So how to achieve it?
1) Keep the original training code as it is, do the training normally;
2) After training, save the model Model(inputs=input_data, outputs=y_pred) in a .h5 file to be loaded wherever you want;
3) Do the prediction: if you take a look at the code, the input image is inverted and translated, so you can use this code to make it easy:
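import numpy as np
from scipy.misc import imread, imresize  # only available in old SciPy releases

# use the width and height from your neural network here
width, height = 512, 64  # img_w and img_h from the question

def load_for_nn(img_file):
    image = imread(img_file, flatten=True)  # read as greyscale
    image = imresize(image, (height, width))
    image = image.T  # the network expects (width, height)
    images = np.ones((1, width, height))  # change 1 to the number of images you want to predict, here I just want to predict one
    images[0] = image
    images = images[:, :, :, np.newaxis]  # add the channel axis
    images /= 255
    return images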
With the image loaded, let's do the prediction:
def predict_image(image_path):  # insert the path of your image
    image = load_for_nn(image_path)  # load using the snippet above
    raw_word = model.predict(image)  # do the prediction with the neural network
    final_word = decode_output(raw_word)[0]  # the network outputs only numbers; decode_output turns them into the desired string
    return final_word
This should be enough. In my experience, the images used in the training are not good enough to make good predictions. If necessary, I will later release code using other datasets that improved my results.
Answering related questions:
What is CTC? Connectionist Temporal Classification?
It is a technique used to improve sequence classification. The original paper shows that it improves results on recognizing what is said in audio. In this case the sequence is a sequence of characters. The explanation is a bit tricky, but you can find a good one here.
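To make the decoding side of CTC more concrete, here is a minimal sketch of greedy (best-path) decoding: take the most likely class per time step, collapse repeated labels, then drop the blank label. The alphabet and the choice of the last class index as the blank are assumptions for illustration, and the helper names are hypothetical; this is not necessarily how the Keras example decodes its output.

```python
# Greedy (best-path) CTC decoding sketch.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed character set
BLANK = len(ALPHABET)                     # assumed: the blank is the last class index


def ctc_greedy_decode(frame_scores):
    """Decode one sequence of per-frame class scores into a string."""
    # 1) pick the most likely class in every time step
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded = []
    previous = None
    for index in best_path:
        # 2) collapse repeated labels and 3) drop the blank label
        if index != previous and index != BLANK:
            decoded.append(ALPHABET[index])
        previous = index
    return "".join(decoded)


def one_hot(index, size=len(ALPHABET) + 1):
    """Demo helper: a score vector with all mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames "h h - e l - l o" (with "-" as the blank) decode to "hello".
frames = [one_hot(i) for i in (7, 7, BLANK, 4, 11, BLANK, 11, 14)]
print(ctc_greedy_decode(frames))
```

The blank between the two "l" frames is what allows a doubled letter to survive the collapsing step.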
Are there algorithms which reliably detect the rotation of a document?
I am not sure but you could take a look at Attention mechanism in neural networks. I don't have any good link now but I know it could be the case.
Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?
OpenCV implements Maximally Stable Extremal Regions (known as MSER). I really like the results of this algorithm; it is fast and was good enough for me when I needed it.
As I said before, I will release code soon. I will edit the question with the repository when I do, but I believe the information here is enough to get the example running.

Now I have a model.h5. What's next?
First I should comment that the model.h5 contains the weights of your network. If you wish to save the architecture of your network as well, you should save it as JSON, as in this example:
model_json = model.to_json()
with open("model_arch.json", "w") as json_file:
    json_file.write(model_json)
Now, once you have your model and its weights you can load them on demand by doing the following:
from keras.models import model_from_json
from keras.optimizers import SGD

with open('model_arch.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model.h5")
# if you already have a loaded model and don't need to save, start from here
# compile loaded model with certain specifications
sgd = SGD(lr=0.01)
loaded_model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
Then, with that loaded_model, you can proceed to predict the classification of a given input like this:
prediction = loaded_model.predict(some_input, batch_size=20, verbose=0)
Which will return the classification of that input.
About the Side Questions:
CTC seems to be a term defined in the paper you referred to, which says:
In what follows, we refer to the task of labelling unsegmented data sequences as temporal classification (Kadous, 2002), and to our use of RNNs for this purpose as connectionist temporal classification
To compensate for the rotation of documents, images, or similar, you could either generate more data from your current data by applying such transformations (take a look at this blog post that explains a way to do that), or you could use a Convolutional Neural Network approach, which is actually what the Keras example you are using does, as we can see from that git:
This example uses a convolutional stack followed by a recurrent stack
and a CTC logloss function to perform optical character recognition
of generated text images.
You can check this tutorial that is related to what you are doing and where they also explain more about Convolutional Neural Networks.
Well, this one is a broad question, but to detect lines, the Hough Line Transform or Canny Edge Detection could be good options.
Edit: The error you are getting occurs because the model expects more inputs than the single one you passed. From the Keras docs we can see:
predict(self, x, batch_size=32, verbose=0)
Raises ValueError: In case of mismatch between the provided input data and the model's expectations, or in case a stateful model receives a number of samples that is not a multiple of the batch size.

Here, you created a model that needs 4 inputs:
model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)
Your predict attempt, on the other hand, is loading just an image.
Hence the message: The model expects 4 arrays, but only received one array
From your code, the necessary inputs are:
input_data = Input(name='the_input', shape=input_shape, dtype='float32')
labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len],dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
The original code and your training work because they're using the TextImageGenerator. This generator takes care of giving you the four necessary inputs for the model.
So, what you have to do is to predict using the generator. As you have the fit_generator() method for training with the generator, you also have the predict_generator() method for predicting with the generator.
Now, for a complete answer and solution, I'd have to study your generator and see how it works (which would take me some time). But now you know what is to be done, you can probably figure it out.
You can either use the generator as it is, and predict probably a huge lot of data, or you can try to replicate a generator that will yield just one or a few images with the necessary labels, length and label length.
Or maybe, if possible, just create the 3 remaining arrays manually, but making sure they have the same shapes (except for the first, which is the batch size) as the generator outputs.
The one thing you must ensure, though, is that you have 4 arrays with the same shapes as the generator outputs, except for the first dimension.
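As a rough illustration of building the arrays manually, the sketch below creates four NumPy arrays keyed by the Input names from your code. The label length of 16 and the input_length formula (img_w // downsample_factor - 2) are assumptions about what the generator produces, so compare the shapes against one real batch from TextImageGenerator before relying on them.

```python
import numpy as np

img_w, img_h = 512, 64   # sizes from the question
max_string_len = 16      # assumed value of img_gen.absolute_max_string_len
downsample_factor = 4    # pool_size ** 2 in the question
batch = 1                # number of images in this batch

inputs = {
    # replace the zeros with a real image, scaled to [0, 1]
    "the_input": np.zeros((batch, img_w, img_h, 1), dtype="float32"),
    # dummy labels: only needed so that the loss layer receives an input
    "the_labels": np.zeros((batch, max_string_len), dtype="float32"),
    # assumed formula for the number of time steps seen by the CTC layer
    "input_length": np.full((batch, 1), img_w // downsample_factor - 2, dtype="int64"),
    # dummy label length
    "label_length": np.ones((batch, 1), dtype="int64"),
}

for name, array in inputs.items():
    print(name, array.shape, array.dtype)
```

Keep in mind that this model outputs loss_out, so even with four well-formed arrays the result is a loss value rather than text.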

Hi, you can look into my GitHub repo for the same. You need to train the model on the type of images you want to run OCR on.
import matplotlib.pyplot as plt
import keras_ocr
images = ["/content/sample_data/IMG_20200224_113657.jpg"]  # Image path
pipeline = keras_ocr.pipeline.Pipeline()
prediction = pipeline.recognize(images)
x_max = 0
temp_str = ""
myfile = open("/content/sample_data/my_file.txt", "a+")#Text File Path to save text
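for i in prediction[0]:
    x_max_local = i[1][:, 0].max()
    if x_max_local > x_max:
        # still moving to the right: same text line
        x_max = x_max_local
        temp_str = temp_str + " " + i[0].ljust(15)
    else:
        # x position jumped back: write the finished line and start a new one
        x_max = 0
        temp_str = temp_str + "\n"
        myfile.write(temp_str)
        temp_str = ""
myfile.close()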


Finding patterns in time series with PyTorch

I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(Net, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
        self.fc = nn.Linear(self.hidden_dim, output_dim)

    def forward(self, x):
        h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
        out, h0 = self.rnn(x, h0.detach())
        out = out[:, -1, :]
        out = self.fc(out)
        return out

network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
    batching = True
    index = 0
    # monitor the cumulative loss for an epoch
    cummloss = []
    # start batching some curves
    while batching:
        # here I start clustering some curves to a batch and normalize the curves
        _input = []
        batch_size = min(len(data)-1, index+batch_size_train) - index
        for d in data[index:min(len(data)-1, index+batch_size_train)]:
            y = np.array(d['data']['y'], dtype='d')
            y = np.multiply(y, y.max())
            y = y[0:1500]
            y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
            if len(_input) == 0:
                _input = y
            else:
                _input = np.vstack((_input, y))
        input = torch.from_numpy(_input).float()
        input = torch.reshape(input, (1, batch_size, len(y)))
        target = np.zeros((1,7))
        # the correct classes have indices, so I create a vector with 1 at the correct locations
        for _index in np.array(d['classifier']):
            target[0,_index-1] = 1
        target = torch.from_numpy(target)
        # get the result from the network
        output = network(input)
        # is this a good loss function?
        loss = F.l1_loss(output, target)
        index = index + batch_size_train
        if index > len(data):
            batching = False

for e in range(1, n_epochs):
    print('Epoch: ' + str(e))
The problem I'm facing right now is that the loss barely changes, even after hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure png/jpg image recognition. When I convert the curves to png, I have little trouble training a net: I took DenseNet and it worked just fine, but it seems to be super overkill for this simple task.
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory, which model you choose does not matter as much as how you formulate your problem.
But in your case the most obvious limitation you are going to face is your sequence length: 1500. An RNN stores information across steps and typically runs into trouble over long sequences because of vanishing or exploding gradients.
LSTM networks were developed to circumvent this limitation with a memory cell, but even then, for long sequences they are still limited by the amount of information that can be stored in the cell.
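A toy calculation illustrates why 1500 steps is a lot. It assumes, as a strong simplification, that backpropagation through time scales the gradient by the same constant factor at every step; real networks are messier, but the trend is the same. The function name is hypothetical.

```python
sequence_length = 1500  # length of the curves in the question


def gradient_scale(per_step_factor, steps=sequence_length):
    """Scale of a gradient after being multiplied by the same factor at every step."""
    return per_step_factor ** steps


print(gradient_scale(0.99))  # a factor slightly below 1 shrinks towards 0 (vanishing)
print(gradient_scale(1.01))  # a factor slightly above 1 grows very large (exploding)
```

Even a per-step factor that is only 1% away from 1 changes the gradient by several orders of magnitude over the full sequence.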
You could try using a CNN network as well and think of it as an image.
Are there existing examples of this kind of problem?
I don't know, but I have some suggestions. If I understood your problem correctly, you're going from a (1500, 1) input to a (7, 1) output, where 6 of the 7 positions are 0 except for the corresponding class, where it's 1.
I don't see any activation function. Usually, when dealing with multiple classes, you don't compute the loss on the raw output of the dense layer; you first apply a normalizing function such as softmax and then compute the loss.
Your description of the features (sine-like structures) brings the frequency domain to mind. If you have an input image, transform it to the frequency domain with a Fourier transform and use that as your feature input.
It might be best to look for such projects on the internet. You might want to read the research paper or watch the video from this group (they have some Jupyter notebooks for you to try), or look at any similar work. They use Fourier features that go through a multi-layer perceptron (MLP).
I am not sure what exactly you want to do, but seems like a classification task, you would use RNN if you want your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent, and as such can be just treated as input.
Regarding the last layer: for a classification problem the output is usually a probability distribution obtained by applying softmax (only if the classes are mutually exclusive, i.e. the probabilities sum to 1), so that, given an input, the net gives the probability of it belonging to each class. If an input can belong to several classes at once, use sigmoid as the last layer of the neural network instead.
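For the multi-label case, the target is a 0/1 vector with possibly several ones, and each class gets its own sigmoid. The plain-Python sketch below shows that setup together with binary cross-entropy; the function names are hypothetical and the 1-based class indices follow the question. In PyTorch, nn.BCEWithLogitsLoss covers the sigmoid and the loss in one step.

```python
import math

NUM_CLASSES = 7  # from the question


def multi_hot(class_indices, num_classes=NUM_CLASSES):
    """Turn 1-based class indices into a 0/1 target vector."""
    target = [0.0] * num_classes
    for index in class_indices:
        target[index - 1] = 1.0
    return target


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logits, target):
    """Mean binary cross-entropy over independent per-class sigmoid outputs."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for logit, t in zip(logits, target):
        p = sigmoid(logit)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(target)


target = multi_hot([1, 4])  # the curve belongs to classes 1 and 4
good_logits = [5.0, -5.0, -5.0, 5.0, -5.0, -5.0, -5.0]
bad_logits = [-5.0, 5.0, 5.0, -5.0, 5.0, 5.0, 5.0]
print(binary_cross_entropy(good_logits, target))  # small loss
print(binary_cross_entropy(bad_logits, target))   # large loss
```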
Regarding your loss, there are many losses you can try and see if they are better. Once again, for different features you have to know what exactly is the measurement of distance (a.k.a. how different 2 things are). Check out this website, or just any loss function explanations on the net.
So you should try a simple MLP on top of Fourier features as a starting point, assuming that is your feature vector.
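A minimal sketch of computing such Fourier features with NumPy is shown below. The number of kept coefficients (32) and the scaling by the curve length are arbitrary choices for illustration, and fourier_features is a hypothetical helper name.

```python
import numpy as np


def fourier_features(curve, n_features=32):
    """Magnitudes of the lowest-frequency components of a 1-D curve."""
    spectrum = np.fft.rfft(np.asarray(curve, dtype=float))
    magnitudes = np.abs(spectrum) / len(curve)  # scale so values do not grow with length
    return magnitudes[:n_features]


# A synthetic curve with values in [0, 1] and exactly 3 full periods.
t = np.arange(1500) / 1500.0
sine_curve = 0.5 + 0.5 * np.sin(2 * np.pi * 3 * t)

features = fourier_features(sine_curve)
print(features.shape)
# Index of the strongest non-constant component (should match the 3 periods).
print(int(np.argmax(features[1:])) + 1)
```

The resulting fixed-length vector can be fed to a small MLP instead of the raw 1500 samples.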
Image Recognition is different from Time-Series data. In the imaging domain your data-set might have more similarity with problems like Activity-Recognition, Video-Recognition which have temporal component. So, I'd recommend looking into some models for those.
As for the current model, I'd recommend using an LSTM instead of a plain RNN. Also, for classification you need an activation function in your final layer. This should be softmax with a cross-entropy-based loss, or sigmoid with an MSE loss.
Keras has a TimeDistributed wrapper which makes it easy to handle time components. You can use a similar approach in PyTorch by applying linear layers followed by an LSTM.
Look into these for a better understanding:
Activity Recognition :
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function ::

Resizing layer in Tensorflow crashes because of different picture shapes

I'm new to TensorFlow. I have an image classification problem with different image sizes. In the documentation I read that it is beneficial to do the resizing inside the model instead of in the function.
I batch my dataset like this:
ds_train = ds_train\
My model is very simple:
base_model = tf.keras.applications.ResNet50V2(
include_top=True, weights=None, input_tensor=None, input_shape=(224,224,3),
pooling=None, classes=NUM_CLASSES, classifier_activation='softmax')
seed = 42
model = tf.keras.Sequential([
tf.keras.Input(shape=(None, None, 3)),
tf.keras.layers.experimental.preprocessing.Resizing(224, 224),
tf.keras.layers.experimental.preprocessing.RandomFlip(mode='horizontal_and_vertical', seed=seed),
This gives me the error:
InvalidArgumentError: Cannot add tensor to the batch: number of elements does not match. Shapes are: [tensor]: [95,116,3], [batch]: [108,112,3]. How can I use the resize layer with batching?
The error is that you cannot batch elements of different sizes. There's unfortunately no way around that. The documentation specifies that preprocessing inside the model is useful at inference (i.e. when you call model.predict()).
The key benefit to doing this is that it makes your model portable [...] When all data preprocessing is part of the model, other people can load and use your model without having to be aware of how each feature is expected to be encoded & normalized. Your inference model will be able to process raw images or raw structured data, and will not require users of the model to be aware of the details of e.g. the tokenization scheme used for text, the indexing scheme used for categorical features, whether image pixel values are normalized to [-1, +1] or to [0, 1], etc.
During training, if you want to use a batch size of >1, you will need to do the preprocessing yourself, if the images have different sizes. You can do that with

Attention is all you need, keeping only the encoding part for video classification

I am trying to modify the code found at the following link so that the Transformer model it implements, from the paper Attention Is All You Need, keeps only the encoder part. Furthermore, I would like to change the input of the network from a sequence of text to a sequence of images (or, better, features extracted from images) coming from a video. In a sense, I would like to figure out which frames of my input are related to each other and encode that information in an output embedding, in the same way the Transformer model does.
The project as it is in the link provided mainly performs sequence-to-sequence transformation. The input is text in one language and the output is text in another language. The model is mainly built in lines 386-463, where it is initialized and compiled. I would like to do something like:
self.encoder = SelfAttention(d_model, d_inner_hid, n_head, layers, dropout)
#self.decoder = Decoder(d_model, d_inner_hid, n_head, layers, dropout)
#self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))
enc_output = self.encoder(src_emb, src_seq, active_layers=active_layers)
#dec_output = self.decoder(tgt_emb, tgt_seq, src_seq, enc_output, active_layers=active_layers)
#final_output = self.target_layer(dec_output)
Furthermore, I would like to combine the output of the Encoder (that is, the output of MultiHeadAttention and PositionwiseFeedForward) using an LSTM and a Dense layer, which will tune the whole encoding procedure through a classification objective. Therefore, when I define my model I add the following layers:
self.lstm = LSTM(units = 256, input_shape = (None, 256), return_sequences = False, dropout = 0.5)
self.fc1 = Dense(64, activation='relu', name = "dense_one")
self.fc2 = Dense(6, activation='sigmoid', name = "dense_two")
and then pass the output of the encoder, in line 434 using the following code:
enc_output = self.lstm(enc_output)
enc_output = self.fc1(enc_output)
enc_output = self.fc2(enc_output)
Now, the video data with which I would like to replace the text data from the GitHub code has the dimensionality Nx10x256, where N is the number of samples, 10 is the number of frames and 256 is the number of features for each frame. I have some difficulty understanding parts of the code well enough to modify it for my needs. I guess that the Embedding layer is no longer necessary for me, since it is related to text classification and NLP.
Furthermore, I need to modify the input in lines 419-420 to be something like:
src_seq_input = Input(shape=(None, 256,), dtype='float32') # source input related to video
tgt_seq_input = Input(shape=(6,), dtype='int32') # the target classification size (since I have 6 classes)
What other parts of the code do I need to skip or modify? What is the usefulness of the PosEncodingLayer that is used in the following line:
self.pos_emb = PosEncodingLayer(len_limit, d_emb) if self.src_loc_info else None
Is it needed in my case? Can I skip it?
After my modification in the code I noticed that when I run the code, I can check the loss function from the def get_loss(y_pred, y_true), however, in my case it is crucial to define a loss for the classification task that returns also the accuracy. How can I do so, with the provided code?
I have to add that I treat my input as the output of the Embedding layer from the initial NLP code. Therefore, for me (in the version of code that functioned for me):
src_seq_input = Input(shape=(None, 256,), dtype='float32')
tgt_seq_input = Input(shape=(6,), dtype='int32')
src_seq = src_seq_input
#src_emb_ = self.i_word_emb(src_seq)
src_emb = src_seq
enc_output = self.encoder(src_emb, src_emb, active_layers=active_layers)
I treat src_emb as my input and completely ignore src_seq.
The way that the loss is calculated is using the following code:
def get_loss(y_pred, y_true):
    y_true = tf.cast(y_true, 'int32')
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
    loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
    loss = K.mean(loss)
    return loss
loss = get_loss(enc_output, tgt_seq_input)
self.ppl = K.exp(loss)
As it is, the loss function (sparse_softmax_cross_entropy_with_logits) returns only a loss score, even though the whole procedure is about classification. How can I further tune my system to also return the accuracy?
I'm afraid this approach is not going to work.
Video data has massive dependence between adjacent frames, with each frame very similar to the last. There is also a weaker dependence on prior frames, because objects tend to continue to move relative to other objects in similar ways. Modern video formats use this redundancy to achieve high compression rates by modelling the motions.
This means that your network will have an extremely strong attention on the previous image. As you suggest, you could subsample frames several seconds apart to destroy much of the dependence on the previous frame, but if you did so, I really wonder whether you would find any structure at all in the result. Even if you feed it hand-coded features optimised for the purpose, there are few general rules about which features will be in motion and which will not, so what structure can your attention network learn?
The problem of handling video is just radically different from handling sentences. Video has very complex elements (pictures) that are largely static over time and have locally predictable motions over a few frames in very simple ways. Text has simple elements (words) in a complex sentence structure with complex dependence extending over many words. These differences mean they require fundamentally different approaches.

Issue with fine-tuning inceptionv3 in slim tensorflow and tf record batches

I am trying to fine-tune inceptionv3 model using slim tensorflow library.
I am unable to understand certain things while writing the code for it. I tried to read the source code (there is no proper documentation) and figured out a few things, and I am able to fine-tune the model and save the checkpoint. Here are the steps I followed.
1. I created a tf.record for my training data which is fine, now I am reading the data using the below code.
import tensorflow as tf
import tensorflow.contrib.slim.nets as nets
import tensorflow.contrib.slim as slim
import matplotlib.pyplot as plt
import numpy as np
# get the data and labels here
data_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/train1.tfrecords'
# Training setting
num_epochs = 100
initial_learning_rate = 0.0002
learning_rate_decay_factor = 0.7
num_epochs_before_decay = 5
num_classes = 5980
# load the checkpoint
model_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/inception_v3.ckpt'
# log directory
log_dir = '/home/sfarkya/nvidia_challenge/datasets/detrac/fine_tuned_model'
with tf.Session() as sess:
feature = {'train/image': tf.FixedLenFeature([], tf.string),
'train/label': tf.FixedLenFeature([], tf.int64)}
# Create a list of filenames and pass it to a queue
filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
# Define a reader and read the next record
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
# Decode the record read by the reader
features = tf.parse_single_example(serialized_example, features=feature)
# Convert the image data from string back to the numbers
image = tf.decode_raw(features['train/image'], tf.float32)
# Cast label data into int32
label = tf.cast(features['train/label'], tf.int32)
# Reshape image data into the original shape
image = tf.reshape(image, [128, 128, 3])
# Creates batches by randomly shuffling tensors
images, labels = tf.train.shuffle_batch([image, label], batch_size=64, capacity=128, num_threads=2,
Now I am finetuning the model using slim and this is the code.
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
# Create a coordinator and run all QueueRunner objects
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
# load model
# load the inception model from the slim library - we are using inception v3
#inputL = tf.placeholder(tf.float32, (64, 128, 128, 3))
img, lbl = sess.run([images, labels])
one_hot_labels = slim.one_hot_encoding(lbl, num_classes)
with slim.arg_scope(slim.nets.inception.inception_v3_arg_scope()):
logits, inceptionv3 = nets.inception.inception_v3(inputs=img, num_classes=5980, is_training=True,
# Restore convolutional layers:
variables_to_restore = slim.get_variables_to_restore(exclude=['InceptionV3/Logits', 'InceptionV3/AuxLogits'])
init_fn = slim.assign_from_checkpoint_fn(model_path, variables_to_restore)
# loss function
loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits = logits)
total_loss = tf.losses.get_total_loss()
# train operation
train_op = slim.learning.create_train_op(total_loss + loss, optimizer= tf.train.AdamOptimizer(learning_rate=1e-4))
print('Im here')
# Start training.
slim.learning.train(train_op, log_dir, init_fn=init_fn, save_interval_secs=20, number_of_steps= 10)
Now I have a few questions about the code which I am unable to figure out. Once the code reaches slim.learning.train I don't see anything printed; however, it is training, as I can see in the log. Now,
1. How do I give the number of epochs to the code? Right now it's running step by step, with each step having batch_size = 64.
2. How do I make sure that in the code tf.train.shuffle_batch I am not repeating my images and I am training over the whole dataset?
3. How can I print the loss values while it's training?
Here are answers to your questions.
You cannot give epochs directly to slim.learning.train. Instead, you give the number of batches as the argument. It is called number_of_steps. It is used to set an operation called should_stop_op on line 709. I assume you know how to convert number of epochs to batches.
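For completeness, the conversion from epochs to steps can be sketched as follows. The dataset size of 10000 is a made-up placeholder, and epochs_to_steps is a hypothetical helper; the batch size and epoch count are taken from the question.

```python
import math


def epochs_to_steps(num_examples, batch_size, num_epochs):
    """Number of training steps (batches) needed to see the data num_epochs times."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# num_examples is a placeholder; use the number of records in your TFRecord file.
number_of_steps = epochs_to_steps(num_examples=10000, batch_size=64, num_epochs=100)
print(number_of_steps)
```

The result is the value you would pass as number_of_steps to slim.learning.train.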
I don't think the shuffle_batch function will repeat images because internally it uses the RandomShuffleQueue. According to this answer, the RandomShuffleQueue enqueues elements using a background thread as:
While size(queue) < capacity:
Add an element to the queue
It dequeues elements as:
While the number of elements dequeued < batch_size:
Wait until the size(queue) >= min_after_dequeue + 1 elements.
Select an element from the queue uniformly at random, remove it from the queue, and add it to the output batch.
So in my opinion, there is very little chance that the elements would be repeated, because in the dequeuing operation, the chosen element is removed from the queue. So it is sampling without replacement.
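The behaviour described above can be imitated with a small single-threaded simulation in plain Python. It is only a sketch of the enqueue/dequeue logic, not TensorFlow's implementation: the function names are hypothetical, and because the queue is refilled before every draw, the min_after_dequeue wait never triggers and is left out.

```python
import random


def shuffle_queue_stream(items, capacity, seed=0):
    """Yield items the way a shuffle queue would: refill, then draw at random."""
    rng = random.Random(seed)
    source = iter(items)
    queue = []
    exhausted = False
    while True:
        # Enqueue side: keep the queue filled up to its capacity.
        while not exhausted and len(queue) < capacity:
            try:
                queue.append(next(source))
            except StopIteration:
                exhausted = True
        if not queue:
            return
        # Dequeue side: pick uniformly at random and remove the element.
        yield queue.pop(rng.randrange(len(queue)))


def batches(stream, batch_size):
    """Group a stream of items into lists of batch_size (last one may be shorter)."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


all_batches = list(batches(shuffle_queue_stream(range(20), capacity=8), batch_size=4))
seen = [item for batch in all_batches for item in batch]
print(all_batches)
print(sorted(seen) == list(range(20)))  # every element appears exactly once
```

Because a drawn element is removed from the queue, one pass over the input yields each element exactly once, only in a locally shuffled order.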
Will a new queue be created for every epoch?
The tensors being inputted to tf.train.shuffle_batch are image and label which ultimately come from the filename_queue. If that queue is producing TFRecord filenames indefinitely, then I don't think a new queue will be created by shuffle_batch. You can also create a toy code like this to understand how shuffle_batch works.
Coming to the next point, how to train over the whole dataset? In your code, the following line gets the list of TFRecord filenames.
filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
If filename_queue covers all TFRecords that you have, then you are surely training over the entire dataset. Now, how to shuffle the entire dataset is another question. As mentioned here by @mrry, there is no support (yet, AFAIK) to shuffle out-of-memory datasets. So the best way is to prepare many shards of your dataset such that each shard contains about 1024 examples. Shuffle the list of TFRecord filenames as:
filename_queue = tf.train.string_input_producer([data_path], shuffle=True, capacity=1000)
Note that I removed the num_epochs = 1 argument and set shuffle=True. This way it will produce the shuffled list of TFRecord filenames indefinitely. Now, if you use tf.train.shuffle_batch on each file, you will get a near-to-uniform shuffling. Basically, as the number of examples in each shard tends to 1, your shuffling will get more and more uniform. I prefer not to set num_epochs and instead terminate the training using the number_of_steps argument mentioned earlier.
To print the loss values, you could probably just edit the training code and introduce a logging call that prints 'total loss = %f' with total_loss. I don't know if there is any simpler way. Another way, without changing the code, is to view summaries in Tensorboard.
There are very helpful articles on how to view summaries in Tensorboard, including the link at the end of this answer. Generally, you need to do the following things.
1. Create a summary object.
2. Write variables of interest into the summary.
3. Merge all individual summaries.
4. Create a summary op.
5. Create a summary file writer.
6. Write the summaries throughout the training at a desired frequency.
Now steps 5 and 6 are already done automatically for you if you use slim.learning.train.
For the first 4 steps, you could check the file. Line 472 shows you how to create a summaries object. Lines 490, 512 and 536 write the relevant variables into summaries. Line 549 merges all summaries and line 553 creates an op. You can pass this op to slim.learning.train and you can also specify how frequently you want to write summaries. In my opinion, do not write anything apart from loss, total_loss, accuracy and learning rate into the summaries, unless you want to do specific debugging. If you write histograms, then the tensorboard file could take tens of hours to load for networks like ResNet-50 (my tensorboard file once was 28 GB, which took 12 hours to load the progress of 6 days!). By the way, you could actually use the file to fine-tune and you will skip most of the steps above. However, I prefer this as you get to learn a lot of things.
See the launching tensorboard section on how to view the progress in a browser.
Additional remarks:
Instead of minimizing total_loss + loss, you could do the following:
loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits = logits)
total_loss = tf.losses.get_total_loss()
train_op = slim.learning.create_train_op(total_loss, optimizer=tf.train.AdamOptimizer(learning_rate=1e-4))
I found this post to be very useful when I was learning Tensorflow.

Scikit-learn SVM digit recognition

I want to make a program to recognize the digit in an image. I followed the tutorial in scikit-learn.
I can train and fit the svm classifier like the following.
First, I import the libraries and dataset
from sklearn import datasets, svm, metrics
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
Second, I create the SVM model and train it with the dataset.
classifier = svm.SVC(gamma=0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])
And then, I try to read my own image and use the function predict() to recognize the digit.
Here is my image:
I reshape the image into (8, 8) and then convert it to a 1D array.
from scipy import misc  # imread/imresize are only available in old SciPy releases
img = misc.imread("w1.jpg")
img = misc.imresize(img, (8, 8))
img = img[:, :, 0]
Finally, when I print out the prediction, it returns [1]
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
print(predicted)
Whatever other images I use, it still returns [1].
When I print out the "default" dataset of number "9", it looks like:
My image number "9" :
You can see that the non-zero values are much larger in my image.
I don't know why. I am looking for help to solve my problem. Thanks
My best bet would be that there is a problem with your data types and array shapes.
It looks like you are training on numpy arrays that are of the type np.float64 (or possibly np.float32 on 32 bit systems, I don't remember) and where each image has the shape (64,).
Meanwhile your input image for prediction, after the resizing operation in your code, is of type uint8 and shape (1, 64).
I would first try changing the shape of your input image since dtype conversions often just work as you would expect. So change this line:
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
to this:
predicted = classifier.predict(img.reshape(img.shape[0]*img.shape[1]))
If that doesn't fix it, you can always try recasting the data type as well with
img = img.astype(digits.images.dtype).
I hope that helps. Debugging by proxy is a lot harder than actually sitting in front of your computer :)
Edit: According to the scikit-learn documentation, the training data contains integer values from 0 to 16. The values in your input image should be scaled to fit the same interval.
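A minimal sketch of such a rescaling with NumPy, assuming the input is a single-channel grayscale array. The helper name to_digits_range is made up for this example, and a simple linear min-max mapping is only one of several reasonable choices.

```python
import numpy as np

def to_digits_range(img, top=16.0):
    """Linearly rescale a grayscale image so its minimum maps to 0 and its maximum to top."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # A constant image carries no contrast; return zeros instead of dividing by zero.
        return np.zeros_like(img)
    return (img - lo) / (hi - lo) * top

# Tiny stand-in for an 8-bit image; a real input would be the resized 8x8 scan.
sample = np.array([[0, 128], [255, 64]], dtype=np.uint8)
scaled = to_digits_range(sample)

# Flatten to one row, the 2D layout that predict() receives in the question.
row = scaled.reshape(1, -1)
```

The result is a float array in the same 0 to 16 interval as digits.images, so the dtype and the value range both line up with the training data.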
1) You need to create your own training set, based on data similar to what you will be making predictions on. The call to datasets.load_digits() in scikit-learn loads a preprocessed copy of the UCI handwritten digits dataset (8x8 images, not MNIST), which, for all we know, could contain very different images from the ones that you are trying to recognise.
2) You need to set the parameters of your classifier properly. The call to svm.SVC(gamma = 0.001) is just choosing an arbitrary value of the gamma parameter in SVC, which may not be the best option. In addition, you are not configuring the C parameter - which is pretty important for SVMs. I'd bet that this is one of the reasons why your output is 'always 1'.
3) Whatever final settings you choose for your model, you'll need to use a cross-validation scheme to ensure that the algorithm is effectively learning.
There's a lot of Machine Learning theory behind this, but, as a good start, I would really recommend having a look at SVM - scikit-learn for a more in-depth description of how the SVC implementation in scikit-learn works, and GridSearchCV for a simple technique for parameter setting.
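As a rough illustration of points 2) and 3), the following sketch runs a small cross-validated grid search over C and gamma on the built-in digits data. The grid values and cv=3 are arbitrary assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Same data preparation as in the question: flatten each 8x8 image to 64 features.
digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

# Candidate values are illustrative only; widen or refine the grid for real use.
param_grid = {"C": [1, 10, 100], "gamma": [0.0001, 0.001, 0.01]}

# 3-fold cross-validation scores every (C, gamma) pair on held-out folds.
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=3)
search.fit(data, digits.target)

print(search.best_params_, search.best_score_)
```

Note that a good cross-validation score here only says the model fits the built-in digits; it does not guarantee good results on scans that look different from that data.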
It's just a guess but... The training set from scikit-learn contains black numbers on a white background, and you are trying to predict numbers which are white on a black background...
I think you should either train on your own training set, or train on the negative version of your pictures.
I hope this helps!
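If the polarity really is the issue, an alternative to retraining is to invert the input image instead. The sketch below assumes an 8-bit grayscale image; invert_uint8 is a hypothetical helper name. Whether inversion is needed at all should be checked by comparing your array with an entry of digits.images.

```python
import numpy as np

def invert_uint8(img):
    """Return the negative of an 8-bit grayscale image (0 becomes 255 and vice versa)."""
    img = np.asarray(img, dtype=np.uint8)
    return 255 - img

# Tiny stand-in for a scanned image.
sample = np.array([[0, 255], [10, 200]], dtype=np.uint8)
negative = invert_uint8(sample)
```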
If you look at:
you can see that each point in the matrix has a value between 0 and 16.
You can try to transform the values of the image to between 0-16. I did it and now the prediction works well for the digit 9 but not for 8 and 6. It doesn't give 1 any more.
from sklearn import datasets, svm, metrics
import cv2
import numpy as np
# Load digit database
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Train SVM classifier
classifier = svm.SVC(gamma=0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])
# Read image "9"
img = cv2.imread("w1.jpg")
img = img[:, :, 0]
img = cv2.resize(img, (8, 8))
# Normalize the values in the image to 0-16
minValueInImage = np.min(img)
maxValueInImage = np.max(img)
# Use the builtin float: the np.float alias no longer exists in recent NumPy
normalizedImg = np.floor((img - minValueInImage).astype(float) / float(maxValueInImage - minValueInImage) * 16)
# Predict
predicted = classifier.predict(normalizedImg.reshape((1, normalizedImg.shape[0] * normalizedImg.shape[1])))
print(predicted)
I solved this problem by checking the following:
Check the number of attributes: is it too large or too small?
Check the scale of your gray values; I changed mine to [0, 16].
Check the data type; I changed it to uint8.
Check the amount of training data: is it too small?
I hope it helps. ^.^
Hi, in addition to @carrdelling's response, I will add that you may use the same training set if you normalize your images to have the same range of values.
For example, you could binarize your data (1 if > 0, 0 otherwise) or you could divide by the maximum intensity in your image to get values in the interval [0, 1].
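Both options can be sketched in a few lines of NumPy. The helper names binarize and scale_by_max are made up for this example, and the same transform would have to be applied to the training images as well as to the new image for the ranges to match.

```python
import numpy as np

def binarize(img):
    """Map every pixel to 1.0 if it is greater than 0, else 0.0."""
    return (np.asarray(img) > 0).astype(np.float64)

def scale_by_max(img):
    """Divide by the maximum intensity so that values fall in [0, 1]."""
    img = np.asarray(img, dtype=np.float64)
    peak = img.max()
    # An all-zero image is returned unchanged to avoid dividing by zero.
    return img / peak if peak > 0 else img

# Tiny stand-in for a grayscale image.
sample = np.array([[0, 4], [16, 8]])
binary = binarize(sample)
unit = scale_by_max(sample)
```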
You probably want to extract features relevant to your data set from the images and train your model on them.
One example I copied from here.
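import cv2
# OpenCV 2.4 API; in OpenCV 3+ SURF is in the contrib module as cv2.xfeatures2d.SURF_create(400)
surf = cv2.SURF(400)
kp, des = surf.detectAndCompute(img, None)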
But the SURF features may not be the most useful or relevant to your dataset and training task. You should try others too, such as HOG.
Remember: the higher-level the features you extract, the more general and error-tolerant your model will be on unseen images. However, you may be sacrificing accuracy on your known samples and test cases.

