Problem definition:
I am implementing a CNN using Tensorflow. The Input and output are of size samples x 128 x 128 x 1 (grayscale image). In loss function I already have SSIM (0-1) and now my goal is to combine SSIM value with perceptual loss using pre-trained VGG16. I have already consulted following answers link1, link2 but instead of concatenating VGG model at end of the main model I would like to compute feature maps inside loss function at specific layers (e.g. pool1, pool2, pool3) and compute overall MSE. I have defined loss function as following:
Combined Loss function:
def lossfun( yTrue, yPred):
alpha = 0.5
return (1-alpha)*perceptual_loss(yTrue, yPred) + alpha*K.mean(1-tf.image.ssim(yTrue, yPred, 1.0))
and perceptual loss:
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
model = VGG16()
model = Model(inputs=model.inputs, outputs=model.layers[1].output)
def perceptual_loss(yTrue, yPred):
true = model(preprocess_input(yTrue))
P=Concatenate()([yPred,yPred,yPred])
pred = model(preprocess_input(P))
vggLoss = tf.math.reduce_mean(tf.math.square(true - pred))
return vggLoss
The Error I am running into is followig:
ValueError: Dimensions must be equal, but are 224 and 128 for 'loss_22/conv2d_132_loss/sub' (op: 'Sub') with input shapes: [?,224,224,64], [?,128,128,64].
Error arises due to following reason:
yPred has size None,128,128,1 , after concatenating it three time and pred = model(preprocess_input(P)) I receive feature map named pred of size None,128,128,64. While yTrue has size None and after true = model(preprocess_input(yTrue)) dimension of true is None,224,224,64. This eventually creates dimension incompatibility while computing final vggLoss.
Question
Since I am new to this task I am not sure if I am approaching the problem in the right manner. Should I create samples of size 224x244 instead of 128x128 in order to avoid this conflict, or is there any other workaround to fix this issue?
Thank you !
I pass a list a to my custom function and I want to tf.tile it after converting it to a constant tensor. The times I tile it depends on the shape of y_true. I don't know how I can get the shape of y_true as integers. Here's the code:
def getloss(a):
a = tf.constant(a, tf.float32)
def loss(y_true, y_pred):
a = tf.reshape(a, [1,1,-1])
ytrue_shape = y_true.get_shape().as_list() #????
multiples = tf.constant([ytrue_shape[0], ytrue_shape[1], 1], tf.int32)
a = tf.tile(a, multiples)
#...
return loss
I have tried y_true.get_shape().as_list() but it reports an error because the first dimension (batch size) is None when compiling the model. Is there any way I can use the shape of y_true here?
When trying to access the shape of a tensor during the building of the model, when not all shapes are known, it is best to use tf.shape. It will be evaluated when the model is ran, as stated in the doc :
tf.shape and Tensor.shape should be identical in eager mode. Within tf.function or within a compat.v1 context, not all dimensions may be known until execution time. Hence when defining custom layers and models for graph mode, prefer the dynamic tf.shape(x) over the static x.shape.
ytrue_shape = tf.shape(y_true)
This will yield a Tensor, so use TF ops to get what you want :
multiples = tf.concat((tf.shape(y_true_shape)[:2],[1]),axis=0)
I am having some issues with regards to the dimensionality of my tensors in my training function at present. I am using the MNIST dataset, so 10 possible targets, and originally wrote the prototype code using a training batch size of 10, which was in retrospect not the wisest choice. It gave poor results during some earlier tests, and increasing the amount of training iterations saw no benefit. Upon trying to then increase the batch size, I realised that what I had written was not that general, and I was likely never training it on the proper data. Below is my training function:
def Train(tLoops, Lrate):
for _ in range(tLoops):
tempData = train_data.view(batch_size_train, 1, 1, -1)
output = net(tempData)
trainTarget = train_targets
criterion = nn.MSELoss()
print("target:", trainTarget.size())
print("Output:", output.size())
loss = criterion(output, trainTarget.float())
# print("error is:", loss)
net.zero_grad() # zeroes the gradient buffers of all parameters
loss.backward()
for j in net.parameters():
j.data.sub_(j.grad.data * Lrate)
of which the print functions return
target: torch.Size([100])
Output: torch.Size([100, 1, 1, 10])
before the error message on the line where loss is calculated;
RuntimeError: The size of tensor a (10) must match the size of tensor b (100) at non-singleton dimension 3
The first print, target, is a 1-dimensional list of the respective ground truth values for each image. Output contains the output of the neural net for each of those 100 samples, so a 10 x 100 list, however from skimming and reshaping the data from 28 x 28 to 1 x 784 earlier, I seem to have extra dimensions unnecessarily. Does PyTorch provide a way to remove these? I couldn't find anything in the documentation, or is there something else that could be my issue?
There are several problems in your training script. I will address each of them below.
First, you should NOT do data batching by hand. Pytorch/torchvision have functions for that, use a dataset and a data loader: https://pytorch.org/tutorials/recipes/recipes/loading_data_recipe.html.
You should also NEVER update the parameters of you network by hand. Use an Optimizer: https://pytorch.org/docs/stable/optim.html. In your case, SGD without momentum will have the same effect.
The dimensionality of your input seems to be wrong, for MNIST an input tensor should be (batch_size, 1, 28, 28) or (batch_size, 784) if you're training a MLP. Furthermore, the output of your network should be (batch_size, 10)
I am trying to achieve the segmentation of the bone on the cross sectional area of MRI images with the Unet I found here https://github.com/zhixuhao/unet. The label is a binary png image which I intend to compare to my prediction. For the frames, I wrote my own data loader for my DICOM MRI files
I am using the following dice loss in keras
def DiceLoss(targets, inputs, smooth=1e-6):
#flatten label and prediction tensors
inputs = K.flatten(inputs)
targets = K.flatten(targets)
intersection = K.sum(K.dot(targets, inputs))
dice = (2*intersection + smooth) / (K.sum(targets) + K.sum(inputs) + smooth)
return 1 - dice
I have seen it in kaggle https://www.kaggle.com/bigironsphere/loss-function-library-keras-pytorch
among other similar implementations in Keras that I found out there.
But the DICE loss is always too small (10^-4) even though there is no overlap between the prediction and the label. Why is this? should I first convert my prediction to a binary mask? if so, How can I do this in keras? I tried to convert targets to a numpy array by doing targets.numpy() and then apply a threshold, but it throws a "tensor does not have attribute numpy" error. Do you have any other idea?
thank you
I found examples/image_ocr.py which seems to for OCR. Hence it should be possible to give the model an image and receive text. However, I have no idea how to do so. How do I feed the model with a new image? Which kind of preprocessing is necessary?
What I did
Installing the depencencies:
Install cairocffi: sudo apt-get install python-cairocffi
Install editdistance: sudo -H pip install editdistance
Change train to return the model and save the trained model.
Run the script to train the model.
Now I have a model.h5. What's next?
See https://github.com/MartinThoma/algorithms/tree/master/ML/ocr/keras for my current code. I know how to load the model (see below) and this seems to work. The problem is that I don't know how to feed new scans of images with text to the model.
Related side questions
What is CTC? Connectionist Temporal Classification?
Are there algorithms which reliably detect the rotation of a document?
Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?
What I tried
#!/usr/bin/env python
from keras import backend as K
import keras
from keras.models import load_model
import os
from image_ocr import ctc_lambda_func, create_model, TextImageGenerator
from keras.layers import Lambda
from keras.utils.data_utils import get_file
import scipy.ndimage
import numpy
img_h = 64
img_w = 512
pool_size = 2
words_per_epoch = 16000
val_split = 0.2
val_words = int(words_per_epoch * (val_split))
if K.image_data_format() == 'channels_first':
input_shape = (1, img_w, img_h)
else:
input_shape = (img_w, img_h, 1)
fdir = os.path.dirname(get_file('wordlists.tgz',
origin='http://www.mythic-ai.com/datasets/wordlists.tgz', untar=True))
img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'),
bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'),
minibatch_size=32,
img_w=img_w,
img_h=img_h,
downsample_factor=(pool_size ** 2),
val_split=words_per_epoch - val_words
)
print("Input shape: {}".format(input_shape))
model, _, _ = create_model(input_shape, img_gen, pool_size, img_w, img_h)
model.load_weights("my_model.h5")
x = scipy.ndimage.imread('example.png', mode='L').transpose()
x = x.reshape(x.shape + (1,))
# Does not work
print(model.predict(x))
this gives
2017-07-05 22:07:58.695665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
File "eval_example.py", line 45, in <module>
print(model.predict(x))
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1567, in predict
check_batch_axis=False)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 106, in _standardize_input_data
'Found: array with shape ' + str(data.shape))
ValueError: The model expects 4 arrays, but only received one array. Found: array with shape (512, 64, 1)
Well, I will try to answer everything you asked here:
As commented in the OCR code, Keras doesn't support losses with multiple parameters, so it calculated the NN loss in a lambda layer. What does this mean in this case?
The neural network may look confusing because it is using 4 inputs ([input_data, labels, input_length, label_length]) and loss_out as output. Besides input_data, everything else is information used only for calculating the loss, it means it is only used for training. We desire something like in line 468 of the original code:
Model(inputs=input_data, outputs=y_pred).summary()
which means "I have an image as input, please tell me what is written here". So how to achieve it?
1) Keep the original training code as it is, do the training normally;
2) After training, save this model Model(inputs=input_data, outputs=y_pred)in a .h5 file to be loaded wherever you want;
3) Do the prediction: if you take a look at the code, the input image is inverted and translated, so you can use this code to make it easy:
from scipy.misc import imread, imresize
#use width and height from your neural network here.
def load_for_nn(img_file):
image = imread(img_file, flatten=True)
image = imresize(image,(height, width))
image = image.T
images = np.ones((1,width,height)) #change 1 to any number of images you want to predict, here I just want to predict one
images[0] = image
images = images[:,:,:,np.newaxis]
images /= 255
return images
With the image loaded, let's do the prediction:
def predict_image(image_path): #insert the path of your image
image = load_for_nn(image_path) #load from the snippet code
raw_word = model.predict(image) #do the prediction with the neural network
final_word = decode_output(raw_word)[0] #the output of our neural network is only numbers. Use decode_output from image_ocr.py to get the desirable string.
return final_word
This should be enough. From my experience, the images used in the training are not good enough to make good predictions, I will release a code using other datasets that improved my results later if necessary.
Answering related questions:
What is CTC? Connectionist Temporal Classification?
It is a technique used to improve sequence classification. The original paper proves it improves results on discovering what is said in audio. In this case it is a sequence of characters. The explanation is a bit trick but you can find a good one here.
Are there algorithms which reliably detect the rotation of a document?
I am not sure but you could take a look at Attention mechanism in neural networks. I don't have any good link now but I know it could be the case.
Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?
OpenCV implements Maximally Stable Extremal Regions (known as MSER). I really like the results of this algorithm, it is fast and was good enough for me when I needed.
As I said before, I will release a code soon. I will edit the question with the repository when I do, but I believe the information here is enough to get the example running.
Now I have a model.h5. What's next?
First I should comment that the model.h5 contains the weights of your network, if you wish to save the architecture of your network as well you should save it as a json like this example:
model_json = model_json = model.to_json()
with open("model_arch.json", "w") as json_file:
json_file.write(model_json)
Now, once you have your model and its weights you can load them on demand by doing the following:
json_file = open('model_arch.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
# if you already have a loaded model and dont need to save start from here
loaded_model.load_weights("model.h5")
# compile loaded model with certain specifications
sgd = SGD(lr=0.01)
loaded_model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
Then, with that loaded_module you can proceed to predict the classification of certain input like this:
prediction = loaded_model.predict(some_input, batch_size=20, verbose=0)
Which will return the classification of that input.
About the Side Questions:
CTC seems to be a term they are defining in the paper you refered, extracting from it says:
In what follows, we refer to the task of labelling un-
segmented data sequences as
temporal classification
(Kadous, 2002), and to our use of RNNs for this pur-
pose as
connectionist temporal classification
(CTC).
To compensate the rotation of a document, images, or similar you could either generate more data from your current one by applying such transformations (take a look at this blog post that explains a way to do that ), or you could use a Convolutional Neural Network approach, which also is actually what that Keras example you are using does, as we can see from that git:
This example uses a convolutional stack followed by a recurrent stack
and a CTC logloss function to perform optical character recognition
of generated text images.
You can check this tutorial that is related to what you are doing and where they also explain more about Convolutional Neural Networks.
Well this one is a broad question but to detect lines you could use the Hough Line Transform, or also Canny Edge Detection could be good options.
Edit: The error you are getting is because it is expected more parameters instead of 1, from the keras docs we can see:
predict(self, x, batch_size=32, verbose=0)
Raises ValueError: In case of mismatch between the provided input data and the model's expectations, or in case a stateful model receives a number of samples that is not a multiple of the batch size.
Here, you created a model that needs 4 inputs:
model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)
Your predict attempt, on the other hand, is loading just an image.
Hence the message: The model expects 4 arrays, but only received one array
From your code, the necessary inputs are:
input_data = Input(name='the_input', shape=input_shape, dtype='float32')
labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len],dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
The original code and your training work because they're using the TextImageGenerator. This generator cares to give you the four necessary inputs for the model.
So, what you have to do is to predict using the generator. As you have the fit_generator() method for training with the generator, you also have the predict_generator() method for predicting with the generator.
Now, for a complete answer and solution, I'd have to study your generator and see how it works (which would take me some time). But now you know what is to be done, you can probably figure it out.
You can either use the generator as it is, and predict probably a huge lot of data, or you can try to replicate a generator that will yield just one or a few images with the necessary labels, length and label length.
Or maybe, if possible, just create the 3 remaining arrays manually, but making sure they have the same shapes (except for the first, which is the batch size) as the generator outputs.
The one thing you must assert, though, is: have 4 arrays with the same shapes as the generator outputs, except for the first dimension.
Hi You can Look in to my github repo for the same. You need to train the model for type of images you want to do the ocr.
# USE GOOGLE COLAB
import matplotlib.pyplot as plt
import keras_ocr
images = [keras_ocr.tools.read("/content/sample_data/IMG_20200224_113657.jpg")] #Image path
pipeline = keras_ocr.pipeline.Pipeline()
prediction = pipeline.recognize(images)
x_max = 0
temp_str = ""
myfile = open("/content/sample_data/my_file.txt", "a+")#Text File Path to save text
for i in prediction[0]:
x_max_local = i[1][:, 0].max()
if x_max_local > x_max:
x_max = x_max_local
temp_str = temp_str + " " + i[0].ljust(15)
else:
x_max = 0
temp_str = temp_str + "\n"
myfile.write(temp_str)
print(temp_str)
temp_str = ""
myfile.close()