I am training a model to extract all the necessary fields from a resume, for which I am using Mask R-CNN to detect the fields in the image. I have trained my Mask R-CNN model on 1000 training samples with 49 fields to extract, but I am unable to improve the accuracy. How can I improve the model? Are there any pretrained weights that may help?
Looks like you want to do text classification/processing: you need to extract details from the text, but you are applying object detection algorithms. I believe you need to use OCR to extract the text (if you have the CV as an image) and then use a text classification model. Check out the links below for more information about text classification:
https://medium.com/@armandj.olivares/a-basic-nlp-tutorial-for-news-multiclass-categorization-82afa6d46aa5
https://www.tensorflow.org/tutorials/tensorflow_text/intro
You can break the problem up in two different ways:
Step 1- OCR seems to be the most direct way to get at your data. But increase the image size, and thus the resolution; otherwise you may lose data.
Step 2- Store the coordinates of each OCRed word. This is valuable information in this context; how words line up has significance.
Step 3- At this point you can try basic positional clustering to group words (see the sketch after this list). However, this can easily fail on a columnar vs. row-based distribution of related text.
Step 4- See if you can identify which of the 49 tags these clusters belong to.
Look at text classification with Hidden Markov models and the Baum-Welch algorithm, i.e. go for basic models first.
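A minimal sketch of Steps 1-3, assuming pytesseract (with Tesseract installed), Pillow, and scikit-learn; the filename and the DBSCAN eps are illustrative values you would tune to your layouts:

```python
# Sketch of Steps 1-3: OCR with word coordinates, then naive positional
# clustering. "resume.png" and eps=40 are illustrative assumptions.
import numpy as np
import pytesseract
from PIL import Image
from sklearn.cluster import DBSCAN

img = Image.open("resume.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

words, centers = [], []
for text, left, top, w, h in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]):
    if text.strip():                                  # drop empty OCR tokens
        words.append(text)
        centers.append([left + w / 2, top + h / 2])   # word center point

# Group words whose centers are close together (Step 3).
labels = DBSCAN(eps=40, min_samples=1).fit_predict(np.array(centers))
for cluster_id in sorted(set(labels)):
    cluster = [w for w, l in zip(words, labels) if l == cluster_id]
    print(cluster_id, " ".join(cluster))
```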
OR
The above ignores the inherent classification opportunity that is the image of, well, a properly formatted CV.
Step 1- Train your model to partition the image into sections without OCR. A good model should not break up sentences, tables, etc. This approach may leverage separator lines and the like (see the sketch after these steps). There is also an opportunity to decrease the size of your image, since you are not OCRing yet.
Step 2- OCR the image sections and try to classify them as above.
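A rough sketch of the OCR-free sectioning idea using plain OpenCV morphology rather than a trained model; the kernel size and the area filter are assumptions to tune per layout:

```python
# OCR-free sectioning via morphology: dilate the binarized page so nearby
# text merges into blocks, then treat the contours as candidate sections.
import cv2

page = cv2.imread("resume.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 10))
blocks = cv2.dilate(binary, kernel, iterations=2)

contours, _ = cv2.findContours(blocks, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w * h > 1000:                                    # ignore tiny noise blobs
        cv2.rectangle(page, (x, y), (x + w, y + h), 0, 2)
cv2.imwrite("sections.png", page)
```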
Another option is to use a neural network like PixelLink: Detecting Scene Text via Instance Segmentation
https://arxiv.org/pdf/1801.01315.pdf
I have a dataset of images split into two folders: test and training. I need to do object detection using OpenCV and YOLO.
Thus, I need to create my own YOLO model for the street objects.
[Screenshots attached: the training folder, an example training image, and the test folder.]
I have a classes .txt file which includes the id, name, and classification (warning, indication, or mandatory).
Example:
0 = animal crossing (warning)
1 = soft verges (warning)
2 = road narrows (warning)
Here, each entry gives the id used in the training folder, the name, and the classification.
My purpose is to create a YOLO model from these training images. I have checked some papers and articles, but in their case they label the full image using labelImg, whereas in my case the training images are so small that they don't seem to need any labeling.
Thus, I'm confused about how to do this. Could you please give me some ideas?
Labeling images is a must in YOLO; that's how it computes its loss function, using a measure called intersection over union (IoU).
An easier way to label images is to use the Roboflow site.
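For illustration, IoU between two axis-aligned boxes given as (x1, y1, x2, y2) can be computed like this:

```python
# Intersection over union (IoU) for two axis-aligned boxes (x1, y1, x2, y2):
# the overlap measure YOLO-style losses and evaluations are built on.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> 0.1428...
```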
I would refer to this image that describes the different types of computer vision tasks.
I think what you want to do is a classification task. YOLO is for object detection tasks, where you usually want to detect more than one object per image.
For classification tasks it can be easier, because you don't need to make separate label files: the names of the folders are the labels. Here is an example of a classification model that you can use: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
If you really want to use YOLO, you will need to make label files. If you are going to do classification of the whole image, then the format of the annotation is easy. It would be something like this:
`0 0.5 0.5 1 1`
The first column is the class number (0, 1, 2, 3, etc.); the remaining columns are the normalized box center (x, y) and its width and height, so `0.5 0.5 1 1` covers the whole image. You will need to make one file for each image, with the same name as the image and a .txt extension.
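A minimal sketch of generating such label files, assuming one whole-image object per file and that the class name is encoded in the filename (both assumptions for illustration):

```python
# Write one YOLO label file per image when every image contains a single
# object filling the whole frame. The folder name and the convention that
# the class name is encoded in the filename are both illustrative.
import os

image_dir = "training"
class_ids = {"animal_crossing": 0, "soft_verges": 1, "road_narrows": 2}

for fname in os.listdir(image_dir):
    if not fname.lower().endswith((".jpg", ".png")):
        continue
    class_name = "_".join(fname.split("_")[:-1])   # e.g. "soft_verges_03.jpg"
    label_path = os.path.join(image_dir, os.path.splitext(fname)[0] + ".txt")
    with open(label_path, "w") as f:
        # class x_center y_center width height (all normalized to [0, 1])
        f.write(f"{class_ids[class_name]} 0.5 0.5 1 1\n")
```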
Does this help you?
I am currently working on a project where I need to be able to dynamically classify incoming documents. These documents can come as text-based PDF files as well as scanned PDF files.
I have the following labels:
Invoice
Packing list
Certificate
I am trying to figure out how I should approach this problem.
My initial thoughts
I was thinking the best way to solve this issue would be to perform text classification, based on the document text.
Step 1 - Train a model
First convert the PDF files to text.
Then label the text content with one of the three labels. (Do this for a large dataset.)
Step 2 - Use the model
Once the model is trained, convert new incoming documents to text.
Run the text content through the model to get the text classification.
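A minimal sketch of this plan with scikit-learn, assuming the PDFs have already been converted to text (e.g. via pdfminer, or an OCR pass for scans); the training strings are placeholders:

```python
# Sketch of the proposed pipeline: TF-IDF features + a linear classifier.
# The training strings below are placeholders for extracted PDF text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["invoice no. 1234 total due ...",      # extracted document text
         "packing list qty 12 cartons ...",
         "this is to certify that ..."]
labels = ["Invoice", "Packing list", "Certificate"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# New incoming document: convert to text, then classify.
print(clf.predict(["total amount payable within 30 days"]))
```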
Is there another way to do this? My concern is that I am not sure whether you can perform NLP on entire text documents. Maybe object detection (computer vision) is needed instead?
Computer vision would be faster and would be my first choice in your use case. Are the three types of documents visually different when you look at them in terms of layout? Certificates probably have a different "look" and "layout", but packing lists and invoices may look similar. You would want to convert the PDFs into page images, then train and run an image classification model. You should use transfer learning on a pre-trained image classification model like ResNet.
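A minimal transfer-learning sketch in PyTorch, assuming the pages have already been rendered to images under hypothetical pages/train/&lt;label&gt;/ folders:

```python
# Transfer learning on page images with a pre-trained ResNet. Assumes pages
# were already rendered to images under pages/train/<label>/ (hypothetical).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder("pages/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_ds, batch_size=16, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                         # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 3)       # 3 document classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for images, labels in loader:                       # one epoch shown
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```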
You can perform NLP on "entire documents", but it works best on prose text and not on the text of invoices or packing lists. You can look up sentence embedding models (InferSent, Google USE, BERT) that can actually be used to classify full-page text and not just sentences, although some of them can be computationally expensive.
I understand your problem. Some key points about it:
a) First, pre-process the input data, e.g. determine how many pages each invoice or certificate PDF has. Then convert the PDF into TIFF images.
b) Train a model using image, visual/layout, and text features. You will get good accuracy.
c) You can use computer vision and deep learning (Keras and TensorFlow).
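For step (a), a small sketch using the pdf2image package (a poppler wrapper); the filename is hypothetical:

```python
# Step (a) sketch: count pages and convert a PDF to TIFF page images.
from pdf2image import convert_from_path

pages = convert_from_path("invoice.pdf")            # one PIL image per page
print(f"{len(pages)} pages")                        # pre-processing: page count
for i, page in enumerate(pages):
    page.save(f"invoice_page_{i}.tiff", "TIFF")
```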
I want to detect small objects (9x9 px) in my images (around 1200x900) using neural networks. Searching the net, I've found several webpages with Keras code using customized layers for custom object classification. In this case, I've understood that you need to provide images where your object appears alone. Although the training goes well and it classifies them properly, unfortunately I haven't found how to later load this trained network to find objects in my big images.
On the other side, I have found that I can do this using OpenCV's dnn module if I load the weights from the YOLOv3 network. In this case I provide the big images with the proper annotations, but the network is not well trained...
Given this context, could someone show me how to load weights that were trained with a customized network, and how to train that network?
After a lot of search, I've found a better approach:
Cut your images into subimages (I cut mine into 2 rows and 4 columns).
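A minimal sketch of this tiling step with Pillow; the filename is hypothetical, and note that the annotation coordinates also have to be recomputed for each tile:

```python
# Tiling step: cut each large image into 2 rows x 4 columns of subimages.
from PIL import Image

def tile_image(path, rows=2, cols=4):
    img = Image.open(path)
    w, h = img.size
    tile_w, tile_h = w // cols, h // rows
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            img.crop(box).save(f"{path.rsplit('.', 1)[0]}_r{r}c{c}.png")

tile_image("big_image.png")   # -> big_image_r0c0.png ... big_image_r1c3.png
```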
Feed YOLO with these subimages and their proper annotations. I used YOLOv3-tiny, with a size of 960x960, for 10k steps. In my case intensity and color were important, so random parameters such as hue, saturation, and exposure were kept at 0. Use random angles. If your objects do not change in size, disable random in the [yolo] layers (random=0 in the cfg files; it only randomizes whether the input size changes during training at every step). For this, I'm using Alexey's darknet fork. If you have some blurred objects, add blur=1 in the [net] properties in the cfg file (after hue). For blur you need Alexey's fork, compiled with OpenCV (apart from CUDA if you can).
Calculate anchors with Alexey's fork. cluster_num is the number of anchor pairs you use; you can find it by opening your cfg and looking at any anchors= line. Anchors are the sizes of the boxes that darknet will use to predict positions. cluster_num = number of anchor pairs.
Change the cfg with your new anchors. If you have fixed-size objects, the anchors will be very close in size. I left the ones for the bigger objects (first [yolo] layer), but for the second, the tiny ones, I modified them and even removed one pair. If you remove some, then change the order in mask= (in all [yolo] sections); mask refers to the indices of the anchors, starting at index 0. If you remove some, also change num= inside each [yolo].
After this, detection is quite good. It can happen that if you detect on a video, objects are lost in some frames. You can try to avoid this by using the LSTM cfg. https://github.com/AlexeyAB/darknet/issues/3114
Now, if you also want to track them, you can apply a Deep SORT algorithm with your YOLO pretrained network. For example, you can convert your pretrained network to Keras using https://github.com/allanzelener/YAD2K (add this commit for tiny YOLOv3: https://github.com/allanzelener/YAD2K/pull/154/commits/e76d1e4cd9da6e177d7a9213131bb688c254eb20) and then use https://github.com/Qidian213/deep_sort_yolov3
As an alternative, you can train it with Mask R-CNN or any other Faster R-CNN algorithm and then look into Deep SORT.
I am working on code to classify the text of scientific articles (using the title and the abstract). For this I'm using an SVM, which delivers good accuracy (83%). At the same time, I used a CNN to classify the images of these articles. My idea is to merge the text classifier with the image classifier to improve the accuracy.
Is it possible? If so, do you have some idea how I could implement it, or some kind of guideline?
Thank you!
You could use a CNN to do both. For this you'd need two (or even three) inputs: one for the text (or two, where one is for the abstract and the other for the title), and the second input for the image. Then you'd have some conv/max-pooling layers before you merge them at one point, and then plug in some additional CNN or dense layers.
You could also have multiple outputs in this model, e.g. a combined one, one for the text, and one for the images. If you're using Keras you would need the functional API. A picture of an example model can be found here (they're using LSTM in the example, but I guess you should stick to CNNs).
If you get probabilities from both classifiers you can average them and take the combined result. However, taking a weighted average might be a better approach, in which case you can use a validation set to find a suitable value for the weight.
prob_svm = probability from the SVM text classifier
prob_cnn = probability from the CNN image classifier
prob_total = alpha * prob_svm + (1 - alpha) * prob_cnn  # fine-tune alpha on a validation set
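A small sketch of tuning alpha on a validation set, assuming both classifiers expose per-class probabilities (e.g. scikit-learn's predict_proba or a softmax output):

```python
# Grid-search alpha for the weighted average on held-out validation data.
import numpy as np

def best_alpha(prob_svm, prob_cnn, y_val):
    """prob_*: (n_samples, n_classes) arrays; y_val: true label indices."""
    best, best_acc = 0.5, 0.0
    for alpha in np.linspace(0, 1, 21):
        preds = np.argmax(alpha * prob_svm + (1 - alpha) * prob_cnn, axis=1)
        acc = np.mean(preds == y_val)
        if acc > best_acc:
            best, best_acc = alpha, acc
    return best
```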
If you can get another classifier (maybe a different version of either of these two), you can also do majority voting, i.e. take the class on which two or all three classifiers agree.
I have a database of images that contains identity cards, bills and passports.
I want to classify these images into different groups (i.e identity cards, bills and passports).
From what I have read, one way to do this task is clustering (since it is going to be unsupervised).
The idea for me is like this: the clustering will be based on the similarity between images (i.e. images that have similar features will be grouped together).
I know also that this process can be done by using k-means.
So the problem for me is about features and using images with K-means.
If anyone has done this before, or has a clue about it, could you please recommend some links to start with or suggest any features that could be helpful?
The simplest way to get good results is to break the problem down into two parts:
Getting the features from the images: using the raw pixels as features will give you poor results. Pass the images through a pre-trained CNN (you can get several of those online), then use the output of the last CNN layer (just before the fully connected layers) as the image features.
Clustering of features: having got rich features for each image, you can cluster them (e.g. with K-means).
I would recommend implementing steps 1 and 2 with existing implementations in Keras and scikit-learn, respectively.
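A minimal sketch of that two-step pipeline, assuming TensorFlow/Keras and scikit-learn; the image paths are hypothetical:

```python
# Step 1: features from a pre-trained CNN (classification head removed);
# Step 2: K-means on those features.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import KMeans

# include_top=False + pooling="avg" yields one 2048-d vector per image.
model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x, verbose=0)[0]

paths = ["card1.jpg", "bill1.jpg", "passport1.jpg"]  # your image files
X = np.stack([features(p) for p in paths])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```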
Label a few examples, and use classification.
Clustering is just as likely to give you the clusters "images with a blueish tint", "grayscale scans", and "warm color temperature". That is a quite reasonable way to cluster such images.
Furthermore, k-means is very sensitive to outliers. And you probably have some in there.
Since you want your clusters to correspond to certain human concepts, classification is what you need to use.
I have implemented Unsupervised Clustering based on Image Similarity using Agglomerative Hierarchical Clustering.
My use case had images of People, so I had extracted the Face Embedding (aka Feature) Vector from each image. I have used dlib for face embedding and so each feature vector was 128d.
In general, a feature vector can be extracted from each image. A pre-trained VGG or other CNN, with its final classification layer removed, can be used for feature extraction.
A dictionary with the IMAGE_FILENAME as key and the FEATURE_VECTOR as value can be created for all the images in the folder. This makes the correlation between a filename and its feature vector easier.
Then create a single feature matrix, say X, which comprises the individual feature vectors of each image in the folder/group that needs to be clustered.
In my use case, X had the shape (NUMBER OF IMAGES IN THE FOLDER, 128), i.e. (number of images, size of each feature vector). For instance, the shape of X was (50, 128).
This feature matrix can then be used to fit an agglomerative hierarchical clustering model. One needs to fine-tune the distance threshold parameter empirically.
Finally, we can write a code to identify which IMAGE_FILENAME belongs to which cluster.
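A minimal sketch of that last step, assuming X is the feature matrix described above and filenames is the matching list of image names; the distance threshold is an illustrative value to tune:

```python
# Fit the agglomerative clusterer and map each filename to its cluster.
from sklearn.cluster import AgglomerativeClustering

# n_clusters=None lets distance_threshold decide how many clusters emerge.
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.6)
labels = clusterer.fit_predict(X)

clusters = {}
for fname, label in zip(filenames, labels):
    clusters.setdefault(label, []).append(fname)
for label, members in sorted(clusters.items()):
    print(f"CLUSTER {label}: {members}")
```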
In my case, there were about 50 images per folder, so this was a manageable solution. This approach was able to group images of a single person into a single cluster. For example, 15 images of PERSON1 belong to CLUSTER 0, 10 images of PERSON2 belong to CLUSTER 2, and so on...