I am asking a lot, but I am very stuck on this one...
I have this piece of code that I used to extract features with SIFT, and I am trying to adapt it to extract features based on a VGG16 model.
No matter how hard I try, I can't get it to work and always raise errors.
So I would appreciate any help to get the features in a form that I can use for clustering afterwards.
Here is the code with SIFT :
# identification of key points and associated descriptors
import time, cv2
import numpy as np

sift_keypoints = []
temps1 = time.time()
sift = cv2.xfeatures2d.SIFT_create(500)

for image_num in range(len(list_photos)):
    if image_num % 100 == 0:
        print(image_num)
    image = cv2.imread(path + list_photos[image_num], 0)  # read in grayscale
    image = cv2.GaussianBlur(image, (7, 7), cv2.BORDER_DEFAULT)  # apply Gaussian blur filter
    # image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    res = cv2.equalizeHist(image)  # equalize image histogram
    kp, des = sift.detectAndCompute(res, None)
    sift_keypoints.append(des)

sift_keypoints_by_img = np.asarray(sift_keypoints)
sift_keypoints_all = np.concatenate(sift_keypoints_by_img, axis=0)
And here is how I use it for my clustering :
from sklearn import cluster, metrics
# Determine the number of clusters
k = int(round(np.sqrt(len(sift_keypoints_all)), 0))
print("Estimated number of clusters: ", k)
print("Creating", k, "descriptor clusters ...")
# Clustering
kmeans = cluster.MiniBatchKMeans(n_clusters=k, init_size=3*k, random_state=0)
kmeans.fit(sift_keypoints_all)
What should I do to be able to extract features with a VGG model?
Thanks
There is an example of feature extraction with VGG16 in the official Keras documentation [1].
Note that the layers of a convolutional network are successive representations, of varying dimensions, of your picture. Depending on the layer you choose as output, the clustering results may be very different.
[1] https://keras.io/api/applications/
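For reference, here is a minimal sketch along those lines, assuming the same path and list_photos variables as in your SIFT loop, a 224x224 input size, and the pooling='avg' option (all of those are assumptions you can change):
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
import numpy as np

# convolutional base only; global average pooling yields one 512-dimensional vector per image
model = VGG16(weights='imagenet', include_top=False, pooling='avg')

vgg_features = []
for name in list_photos:
    img = image.load_img(path + name, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)      # add the batch dimension
    x = preprocess_input(x)            # VGG16-specific preprocessing
    vgg_features.append(model.predict(x).flatten())

vgg_features = np.asarray(vgg_features)  # shape (n_images, 512)
Unlike SIFT, this gives one fixed-length vector per image, so vgg_features can go straight into MiniBatchKMeans without building a bag-of-visual-words histogram first.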
I have several images and I want to know whether there is any aircraft in each of them or not.
I used the CLIP code shown below, but the output is [[1.0]] even though the image is of a human face. I think that is because it uses softmax.
I tried to use logits_per_image, but the value is not interpretable to me: tensor([[20.03]]).
Is there any way to know whether an image is related to a word, as a percentage or something similar?
Can I use object detection in my problem to see if there are any aircraft in my image?
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open('image_4.jpg')
inputs = processor(text=['aircraft'], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
probs.tolist()
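With a single prompt, the softmax over one logit is always 1.0, so the output carries no information. A common workaround is to compare the image against several candidate texts and read the relative probabilities; a short sketch (the prompt wordings below are just my guesses) built on your existing model and processor:
texts = ['a photo of an aircraft', 'a photo of a human face', 'a photo of something else']
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # probabilities are now spread over the prompts
print(dict(zip(texts, probs[0].tolist())))
Alternatively, you can threshold the raw logits_per_image (a scaled cosine similarity) directly, but the threshold has to be calibrated on your own data.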
I'm new to image clustering, and I followed this tutorial:
Which results in the following code:
import os
import numpy as np
from sklearn.cluster import KMeans
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input

model = VGG16(weights='imagenet', include_top=False)

directory = './imageSample'
vgg16_feature_list = []

for filename in os.listdir(directory):
    if filename != '.DS_Store':
        img_path = directory + '/' + filename
        print(img_path)
        img = image.load_img(img_path)  # color_mode = "grayscale")
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)

        vgg16_feature = model.predict(img_data)
        vgg16_feature_np = np.array(vgg16_feature)
        vgg16_feature_list.append(vgg16_feature_np.flatten())

vgg16_feature_list_np = np.array(vgg16_feature_list)
kmeans = KMeans(n_clusters=5, random_state=0).fit(vgg16_feature_list_np)
However, I'm receiving this error:
ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (5). Possibly due to duplicate points in X.
return_n_iter=True)
I wonder if it is because of the sample images? They are 80x80 pixels, and there are 52 of them:
I tried changing the color mode to grayscale, however I received
IndexError: index 1 is out of bounds for axis 3 with size 1 instead.
Kindly advise whether such clustering is feasible with my dataset. Will it work if I expand the dataset to perhaps 100-200 images, or is there another approach I should look at to group the dataset? Thanks!
UPDATE
It seems that the real issue is that the same features are extracted for different images, so I've moved this to another post: Keras Same Feature Extraction from Different Images
I am trying to implement myself a bag of words classifier to classify a dataset I have. To be certain that my implementation is correct, I used just two classes from the Caltech dataset (http://www.vision.caltech.edu/Image_Datasets/Caltech101/) to test my implementation: elephant and electric guitar. As they are totally different visually, I believe that a correct implementation of Bag Of Visual Words (BOVW) classification could classify these images accurately.
From my understanding (please correct me if I am wrong), the correct BOVW classification happens in three steps:
Detect SIFT 128-dimensional descriptors in the training images and cluster them with k-means.
Assign the SIFT descriptors of the training and testing images to the k-means clusters (trained in step 1) and build a histogram of cluster assignments per image.
Use these histograms as feature vectors for SVM classification.
As I explained before, I tried to solve a very easy problem: classifying two very distinct classes. I read the training and testing file lists from a text file, use the training images' SIFT descriptors to train a k-means model, use the training and testing images to build the histograms of cluster assignments, and finally use those histograms as feature vectors for classification.
The source code of my solution follows:
import cv2
import numpy as np
from sklearn import svm
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import accuracy_score

#this function will get SIFT descriptors from training images and
#train a k-means classifier
def read_and_clusterize(file_images, num_cluster):
    sift_keypoints = []
    with open(file_images) as f:
        images_names = f.readlines()
        images_names = [a.strip() for a in images_names]
    for line in images_names:
        print(line)
        #read image
        image = cv2.imread(line, 1)
        # Convert it to grayscale
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # SIFT extraction
        sift = cv2.xfeatures2d.SIFT_create()
        kp, descriptors = sift.detectAndCompute(image, None)
        #append the descriptors to a list of descriptors
        sift_keypoints.append(descriptors)
    sift_keypoints = np.asarray(sift_keypoints)
    sift_keypoints = np.concatenate(sift_keypoints, axis=0)
    #with the descriptors detected, let's cluster them
    print("Training kmeans")
    kmeans = MiniBatchKMeans(n_clusters=num_cluster, random_state=0).fit(sift_keypoints)
    #return the learned model
    return kmeans
#with the k-means model found, this code generates the feature vectors
#by building a histogram of classified keypoints in the kmeans classifier
def calculate_centroids_histogram(file_images, model):
    feature_vectors = []
    class_vectors = []
    with open(file_images) as f:
        images_names = f.readlines()
        images_names = [a.strip() for a in images_names]
    for line in images_names:
        print(line)
        #read image
        image = cv2.imread(line, 1)
        #Convert it to grayscale
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        #SIFT extraction
        sift = cv2.xfeatures2d.SIFT_create()
        kp, descriptors = sift.detectAndCompute(image, None)
        #classification of all descriptors in the model
        predict_kmeans = model.predict(descriptors)
        #calculate the histogram
        hist, bin_edges = np.histogram(predict_kmeans)
        #the histogram is the feature vector
        feature_vectors.append(hist)
        #define the class of the image (elephant or electric guitar)
        class_sample = define_class(line)
        class_vectors.append(class_sample)
    feature_vectors = np.asarray(feature_vectors)
    class_vectors = np.asarray(class_vectors)
    #return vectors and classes we want to classify
    return class_vectors, feature_vectors
def define_class(img_patchname):
    #print(img_patchname)
    print(img_patchname.split('/')[4])
    if img_patchname.split('/')[4] == "electric_guitar":
        class_image = 0
    if img_patchname.split('/')[4] == "elephant":
        class_image = 1
    return class_image
def main(train_images_list, test_images_list, num_clusters):
    #step 1: read and detect SIFT keypoints over the input images (train images) and cluster them via k-means
    print("Step 1: Calculating Kmeans classifier")
    model = read_and_clusterize(train_images_list, num_clusters)

    print("Step 2: Extracting histograms of training and testing images")
    print("Training")
    [train_class, train_featvec] = calculate_centroids_histogram(train_images_list, model)
    print("Testing")
    [test_class, test_featvec] = calculate_centroids_histogram(test_images_list, model)

    #use the training vectors to train the classifier
    print("Step 3: Training the SVM classifier")
    clf = svm.SVC()
    clf.fit(train_featvec, train_class)

    print("Step 4: Testing the SVM classifier")
    predict = clf.predict(test_featvec)

    score = accuracy_score(np.asarray(test_class), predict)

    file_object = open("results.txt", "a")
    file_object.write("%f\n" % score)
    file_object.close()

    print("Accuracy:" + str(score))

if __name__ == "__main__":
    main("train.txt", "test.txt", 1000)
    main("train.txt", "test.txt", 2000)
    main("train.txt", "test.txt", 3000)
    main("train.txt", "test.txt", 4000)
    main("train.txt", "test.txt", 5000)
As you can see, I tried varying the number of clusters in the k-means step quite a lot. However, no matter what I try, the accuracy is always 53.62%, which is terrible considering that the image classes are quite different.
So, is there any problem with my understanding or implementation of BOVW? Where have I gone wrong here?
The solution is simpler than I thought.
In this line:
hist, bin_edges=np.histogram(predict_kmeans)
the number of bins is numpy's default (I believe it is 10). By doing this instead:
hist, bin_edges=np.histogram(predict_kmeans, bins=num_clusters)
The accuracy increased from the 53.62% I reported to 78.26%, using 1000 clusters and therefore 1000-dimensional feature vectors.
It looks like you are creating clusters and histograms for each image. But in order to make it work, you have to aggregate the SIFT features of all images, cluster those, and use these common clusters to create the histograms. Check out also https://github.com/shackenberg/Minimal-Bag-of-Visual-Words-Image-Classifier
I found examples/image_ocr.py, which seems to be for OCR. Hence it should be possible to give the model an image and receive text. However, I have no idea how to do so. How do I feed the model with a new image? Which kind of preprocessing is necessary?
What I did
Installing the dependencies:
Install cairocffi: sudo apt-get install python-cairocffi
Install editdistance: sudo -H pip install editdistance
Change train to return the model and save the trained model.
Run the script to train the model.
Now I have a model.h5. What's next?
See https://github.com/MartinThoma/algorithms/tree/master/ML/ocr/keras for my current code. I know how to load the model (see below) and this seems to work. The problem is that I don't know how to feed new scans of images with text to the model.
Related side questions
What is CTC? Connectionist Temporal Classification?
Are there algorithms which reliably detect the rotation of a document?
Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?
What I tried
#!/usr/bin/env python
from keras import backend as K
import keras
from keras.models import load_model
import os
from image_ocr import ctc_lambda_func, create_model, TextImageGenerator
from keras.layers import Lambda
from keras.utils.data_utils import get_file
import scipy.ndimage
import numpy
img_h = 64
img_w = 512
pool_size = 2
words_per_epoch = 16000
val_split = 0.2
val_words = int(words_per_epoch * (val_split))
if K.image_data_format() == 'channels_first':
input_shape = (1, img_w, img_h)
else:
input_shape = (img_w, img_h, 1)
fdir = os.path.dirname(get_file('wordlists.tgz',
                                origin='http://www.mythic-ai.com/datasets/wordlists.tgz',
                                untar=True))
img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'),
                             bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'),
                             minibatch_size=32,
                             img_w=img_w,
                             img_h=img_h,
                             downsample_factor=(pool_size ** 2),
                             val_split=words_per_epoch - val_words)
print("Input shape: {}".format(input_shape))
model, _, _ = create_model(input_shape, img_gen, pool_size, img_w, img_h)
model.load_weights("my_model.h5")
x = scipy.ndimage.imread('example.png', mode='L').transpose()
x = x.reshape(x.shape + (1,))
# Does not work
print(model.predict(x))
this gives
2017-07-05 22:07:58.695665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
File "eval_example.py", line 45, in <module>
print(model.predict(x))
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1567, in predict
check_batch_axis=False)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 106, in _standardize_input_data
'Found: array with shape ' + str(data.shape))
ValueError: The model expects 4 arrays, but only received one array. Found: array with shape (512, 64, 1)
Well, I will try to answer everything you asked here:
As commented in the OCR code, Keras doesn't support loss functions with multiple parameters, so the example calculates the NN loss in a Lambda layer. What does this mean in this case?
The neural network may look confusing because it uses 4 inputs ([input_data, labels, input_length, label_length]) and loss_out as its output. Besides input_data, everything else is information used only for calculating the loss, which means it is only needed for training. What we actually want is something like line 468 of the original code:
Model(inputs=input_data, outputs=y_pred).summary()
which means "I have an image as input, please tell me what is written in it". So how do we achieve that?
1) Keep the original training code as it is, do the training normally;
2) After training, save this model Model(inputs=input_data, outputs=y_pred) in a .h5 file to be loaded wherever you want;
3) Do the prediction: if you take a look at the code, the input image is inverted and translated, so you can use this code to make it easy:
import numpy as np
from scipy.misc import imread, imresize

#use the width and height from your neural network here.
def load_for_nn(img_file):
    image = imread(img_file, flatten=True)
    image = imresize(image, (height, width))
    image = image.T
    images = np.ones((1, width, height))  #change 1 to any number of images you want to predict, here I just want to predict one
    images[0] = image
    images = images[:, :, :, np.newaxis]
    images /= 255
    return images
With the image loaded, let's do the prediction:
def predict_image(image_path):  #insert the path of your image
    image = load_for_nn(image_path)  #load from the snippet above
    raw_word = model.predict(image)  #do the prediction with the neural network
    final_word = decode_output(raw_word)[0]  #the output of the neural network is only numbers. Use decode_output from image_ocr.py to get the desired string.
    return final_word
This should be enough. From my experience, the images used in training are not good enough to make good predictions; I will release code using other datasets that improved my results later, if necessary.
Answering related questions:
What is CTC? Connectionist Temporal Classification?
It is a technique used to improve sequence classification. The original paper shows that it improves results on recognizing what is said in audio; in this case it is a sequence of characters. The explanation is a bit tricky, but you can find a good one here.
Are there algorithms which reliably detect the rotation of a document?
I am not sure, but you could take a look at the attention mechanism in neural networks. I don't have a good link right now, but I know it could be the case.
Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?
OpenCV implements Maximally Stable Extremal Regions (known as MSER). I really like the results of this algorithm; it is fast and was good enough for me when I needed it.
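For reference, a minimal MSER sketch (assuming OpenCV 3.2 or later, where detectRegions returns both the regions and their bounding boxes, and a hypothetical input file name):
import cv2

img = cv2.imread('document.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)   # candidate text regions and their boxes
for (x, y, w, h) in bboxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)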
As I said before, I will release code soon. I will edit the answer with the repository when I do, but I believe the information here is enough to get the example running.
Now I have a model.h5. What's next?
First I should note that model.h5 contains only the weights of your network; if you wish to save the architecture of your network as well, you should save it as JSON, like in this example:
model_json = model.to_json()
with open("model_arch.json", "w") as json_file:
    json_file.write(model_json)
Now, once you have your model and its weights you can load them on demand by doing the following:
from keras.models import model_from_json
from keras.optimizers import SGD

json_file = open('model_arch.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into the new model
# if you already have a loaded model and don't need to save, start from here
loaded_model.load_weights("model.h5")
# compile the loaded model with certain specifications
sgd = SGD(lr=0.01)
loaded_model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
Then, with that loaded_model you can proceed to predict the classification of a certain input like this:
prediction = loaded_model.predict(some_input, batch_size=20, verbose=0)
Which will return the classification of that input.
About the Side Questions:
CTC is a term they define in the paper you referred to; quoting from it:
In what follows, we refer to the task of labelling unsegmented data sequences as temporal classification (Kadous, 2002), and to our use of RNNs for this purpose as connectionist temporal classification (CTC).
To compensate for the rotation of a document, images, or similar, you could either generate more data from your current data by applying such transformations (take a look at this blog post that explains a way to do that), or you could use a convolutional neural network approach, which is actually what the Keras example you are using does, as we can see from that repo:
This example uses a convolutional stack followed by a recurrent stack and a CTC logloss function to perform optical character recognition of generated text images.
You can check this tutorial that is related to what you are doing and where they also explain more about Convolutional Neural Networks.
Well, this one is a broad question, but to detect lines you could use the Hough line transform, and Canny edge detection could also be a good option.
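For example, a minimal sketch of the probabilistic Hough transform on Canny edges (the file name and thresholds are placeholders you would tune for your scans):
import cv2
import numpy as np

img = cv2.imread('page.png', 0)   # read the scan as grayscale
edges = cv2.Canny(img, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=100, maxLineGap=10)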
Edit: The error you are getting is because the model expects more inputs than the single one you provided; from the Keras docs we can see:
predict(self, x, batch_size=32, verbose=0)
Raises ValueError: In case of mismatch between the provided input data and the model's expectations, or in case a stateful model receives a number of samples that is not a multiple of the batch size.
Here, you created a model that needs 4 inputs:
model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)
Your predict attempt, on the other hand, is loading just an image.
Hence the message: The model expects 4 arrays, but only received one array
From your code, the necessary inputs are:
input_data = Input(name='the_input', shape=input_shape, dtype='float32')
labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len],dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
The original code and your training work because they're using the TextImageGenerator. This generator takes care of providing the four inputs the model needs.
So, what you have to do is to predict using the generator. As you have the fit_generator() method for training with the generator, you also have the predict_generator() method for predicting with the generator.
Now, for a complete answer and solution, I'd have to study your generator and see how it works (which would take me some time). But now that you know what needs to be done, you can probably figure it out.
You can either use the generator as it is, and predict on what will probably be a huge batch of data, or you can try to replicate a generator that yields just one or a few images with the necessary labels, input lengths and label lengths.
Or maybe, if possible, just create the 3 remaining arrays manually, making sure they have the same shapes (except for the first dimension, which is the batch size) as the generator outputs.
The one thing you must assert, though, is: have 4 arrays with the same shapes as the generator outputs, except for the first dimension.
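As an illustration of that last option, here is a sketch built from the input shapes declared above; the dummy values are placeholders, and note that with outputs=loss_out the prediction you get back is the CTC loss, not the decoded text (use Model(inputs=input_data, outputs=y_pred) if you want readable output):
import numpy as np

batch_img = x[np.newaxis, ...]   # the (img_w, img_h, 1) image from your script, plus a batch axis

# dummy companions matching the declared input shapes
dummy_labels = np.zeros((1, img_gen.absolute_max_string_len), dtype='float32')
dummy_input_length = np.array([[img_w // (pool_size ** 2)]], dtype='int64')
dummy_label_length = np.array([[1]], dtype='int64')

out = model.predict([batch_img, dummy_labels, dummy_input_length, dummy_label_length])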
Hi, you can look into my GitHub repo for the same. You need to train the model on the type of images you want to OCR.
# USE GOOGLE COLAB
import matplotlib.pyplot as plt
import keras_ocr

images = [keras_ocr.tools.read("/content/sample_data/IMG_20200224_113657.jpg")]  #Image path
pipeline = keras_ocr.pipeline.Pipeline()
prediction = pipeline.recognize(images)

x_max = 0
temp_str = ""
myfile = open("/content/sample_data/my_file.txt", "a+")  #Text file path to save text

for i in prediction[0]:
    x_max_local = i[1][:, 0].max()
    if x_max_local > x_max:
        x_max = x_max_local
        temp_str = temp_str + " " + i[0].ljust(15)
    else:
        x_max = 0
        temp_str = temp_str + "\n"
        myfile.write(temp_str)
        print(temp_str)
        temp_str = ""
myfile.close()
I want to make a program to recognize the digit in an image. I followed the tutorial in scikit-learn.
I can train and fit the SVM classifier as follows.
First, I import the libraries and dataset
from sklearn import datasets, svm, metrics
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
Second, I create the SVM model and train it with the dataset.
classifier = svm.SVC(gamma = 0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])
And then, I try to read my own image and use the function predict() to recognize the digit.
Here is my image:
I reshape the image into (8, 8) and then convert it to a 1D array.
from scipy import misc

img = misc.imread("w1.jpg")
img = misc.imresize(img, (8, 8))
img = img[:, :, 0]
Finally, when I print out the prediction, it returns [1]
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
print predicted
Whatever other images I use, it still returns [1].
When I print out the "default" dataset image of the number "9", it looks like this:
My image of the number "9":
You can see that the non-zero values are quite large in my image.
I don't know why. I am looking for help to solve my problem. Thanks.
My best bet would be that there is a problem with your data types and array shapes.
It looks like you are training on numpy arrays of type np.float64 (or possibly np.float32 on 32-bit systems, I don't remember), where each image has the shape (64,).
Meanwhile your input image for prediction, after the resizing operation in your code, is of type uint8 and shape (1, 64).
I would first try changing the shape of your input image since dtype conversions often just work as you would expect. So change this line:
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
to this:
predicted = classifier.predict(img.reshape(img.shape[0]*img.shape[1]))
If that doesn't fix it, you can always try recasting the data type as well with
img = img.astype(digits.images.dtype).
I hope that helps. Debugging by proxy is a lot harder than actually sitting in front of your computer :)
Edit: According to the scikit-learn documentation, the training data contains integer values from 0 to 16. The values in your input image should be scaled to fit the same interval. (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits)
1) You need to create your own training set, based on data similar to what you will be making predictions on. The call to datasets.load_digits() in scikit-learn loads a preprocessed handwritten-digits dataset which, for all we know, could contain very different images from the ones that you are trying to recognise.
2) You need to set the parameters of your classifier properly. The call to svm.SVC(gamma = 0.001) is just choosing an arbitrary value of the gamma parameter in SVC, which may not be the best option. In addition, you are not configuring the C parameter - which is pretty important for SVMs. I'd bet that this is one of the reasons why your output is 'always 1'.
3) Whatever final settings you choose for your model, you'll need to use a cross-validation scheme to ensure that the algorithm is effectively learning.
There's a lot of machine-learning theory behind this but, as a good start, I would really recommend having a look at SVM - scikit-learn for a more in-depth description of how the SVC implementation in scikit-learn works, and at GridSearchCV for a simple technique for parameter setting.
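As a rough sketch of that last point, a grid search over C and gamma on the scikit-learn digits data could look like this (the parameter ranges are only illustrative; in older scikit-learn versions GridSearchCV lives in sklearn.grid_search):
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.01, 0.1]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(data, digits.target)
print(search.best_params_, search.best_score_)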
It's just a guess, but... the training set from scikit-learn contains black numbers on a white background, and you are trying to predict numbers that are white on a black background...
I think you should either train on your own training set, or train on the negative version of your pictures.
I hope this helps!
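A quick way to test that guess, assuming your image is 8-bit grayscale, is simply to invert the pixel values before prediction:
img_inverted = 255 - img   # white-on-black becomes black-on-white (uint8 assumption)
predicted = classifier.predict(img_inverted.reshape((1, img_inverted.shape[0] * img_inverted.shape[1])))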
If you look at:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
you can see that each point in the matrix has a value between 0 and 16.
You can try to transform the values of your image into the 0-16 range. I did it, and now the prediction works well for the digit 9, but not for 8 and 6. It doesn't give 1 any more.
from sklearn import datasets, svm, metrics
import cv2
import numpy as np
# Load digit database
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Train SVM classifier
classifier = svm.SVC(gamma = 0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])
# Read image "9"
img = cv2.imread("w1.jpg")
img = img[:,:,0];
img = cv2.resize(img, (8, 8))
# Normalize the values in the image to 0-16
minValueInImage = np.min(img)
maxValueInImage = np.max(img)
normaliizeImg = np.floor(np.divide((img - minValueInImage).astype(np.float),(maxValueInImage-minValueInImage).astype(np.float))*16)
# Predict
predicted = classifier.predict(normaliizeImg.reshape((1,normaliizeImg.shape[0]*normaliizeImg.shape[1] )))
print predicted
I have solved this problem using the methods below:
Check the number of attributes: too large or too small.
Check the scale of your gray values; I changed mine to [0, 16].
Check the data type; I changed mine to uint8.
Check the amount of training data: too small or not.
I hope it helps. ^.^
Hi, in addition to #carrdelling's response, I will add that you may use the same training set if you normalize your images to have the same range of values.
For example, you could binarize your data (1 if > 0, 0 otherwise), or you could divide by the maximum intensity in your image to get values in the interval [0, 1].
You probably want to extract features relevant to your data set from the images and train your model on them.
One example I copied from here.
surf = cv2.xfeatures2d.SURF_create(400)  # cv2.SURF(400) in older OpenCV releases
kp, des = surf.detectAndCompute(img, None)
But SURF features may not be the most useful or relevant to your dataset and training task. You should try others too, like HOG.
Remember: the higher-level the features you extract, the more general and error-tolerant your model will be on unseen images. However, you may be sacrificing accuracy on your known samples and test cases.
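For instance, a minimal HOG sketch using scikit-image (the cell and block sizes are just common defaults, not tuned values, and w1.jpg is the image from the question):
from skimage.feature import hog
from skimage import io, color

img = color.rgb2gray(io.imread('w1.jpg'))
features = hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))   # a 1-D descriptor usable as an SVM feature vector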