I am currently working on a project where I need to be able to dynamically classify incoming documents. These documents can arrive as text-based PDF files as well as scanned PDF files.
I have the following labels:
Invoice
Packing list
Certificate
I am trying to figure out how I should approach this problem.
My initial thoughts
I was thinking the best way to solve this issue would be to perform text classification, based on the document text.
Step 1 - Train a model
First convert the PDF files to text.
Then label the text content by one of the three labels. (Do this for a large dataset)
Step 2 - Use the model
Once the model is trained, convert each new incoming document to text.
Run the text content through the model to get the text classification.
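To make the plan concrete, here is a minimal sketch of what I have in mind for Steps 1-2, assuming PyPDF2 for text extraction (scanned PDFs would need OCR instead) and a simple TF-IDF + logistic regression baseline; the file names and labels are placeholders:

```python
from PyPDF2 import PdfReader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pdf_to_text(path):
    # Works for text-based PDFs; scanned PDFs would need OCR instead
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Placeholder training data: (pdf path, label) pairs labelled by hand
train_files = [("invoice_01.pdf", "Invoice"),
               ("packing_01.pdf", "Packing list"),
               ("cert_01.pdf", "Certificate")]

texts = [pdf_to_text(path) for path, _ in train_files]
labels = [label for _, label in train_files]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Step 2: classify a new incoming document
print(model.predict([pdf_to_text("new_document.pdf")]))
```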
Is there another way to do this? My concern is that I am not sure whether you can perform NLP on entire text documents. Maybe object detection (computer vision) is needed instead?
Computer vision would be faster and would be my first choice for your use case. Are the three types of documents visually different when you look at them in terms of layout? Certificates probably have a different "look" and "layout", but packing lists and invoices may look similar. You would want to convert the PDFs into page images and train and run an image classification model first. You should use transfer learning on a pre-trained image classification model like ResNet.
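If you try the image route first, a minimal transfer-learning sketch with Keras could look like the following, assuming the PDF pages have already been rendered to images in one folder per class (the directory layout, image size and training settings are placeholders):

```python
import tensorflow as tf

# Placeholder directory: pages/train/<class_name>/*.png
train_ds = tf.keras.utils.image_dataset_from_directory(
    "pages/train", image_size=(224, 224), batch_size=32)

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the pre-trained backbone first

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = base(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # Invoice / Packing list / Certificate

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```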
You can perform NLP on "entire documents", but it works best on prose text and not on the text of invoices or packing lists. You can look up sentence embedding models (InferSent, Google USE, BERT) that can actually be used to classify full-page text and not just sentences, although some of them can be computationally expensive.
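As a rough idea of the embedding route, you could embed the full extracted text with a pre-trained encoder such as the Universal Sentence Encoder and train a light classifier on top (the example texts and labels below are placeholders):

```python
import tensorflow_hub as hub
from sklearn.linear_model import LogisticRegression

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Placeholder documents: the full text extracted from each PDF
texts = ["INVOICE No. 1042 Total due ...",
         "PACKING LIST Carton count ...",
         "CERTIFICATE OF ORIGIN The undersigned ..."]
labels = ["Invoice", "Packing list", "Certificate"]

X = embed(texts).numpy()  # one 512-dimensional vector per document
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print(clf.predict(embed(["COMMERCIAL INVOICE ..."]).numpy()))
```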
I understand your problem.
Some key points about it:
a) First, pre-process the input data (e.g. determine how many pages an invoice or certificate has in the PDF). Then convert the PDF into TIFF images.
b) Train a model using the image, visual/layout, and text features. You will get good accuracy.
c) You can use computer vision and deep learning (Keras and TensorFlow).
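For the PDF-to-TIFF step, something like pdf2image can render each page (the file name and DPI below are just illustrative, and poppler has to be installed on the system):

```python
from pdf2image import convert_from_path

# Render each PDF page to an image; 300 DPI is an illustrative choice
pages = convert_from_path("invoice_01.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save(f"invoice_01_page_{i}.tiff", "TIFF")
```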
Related
I am training a model to extract all the necessary fields from a resume, for which I am using Mask R-CNN to detect the fields in the image. I have trained my Mask R-CNN model on 1000 training samples with 49 fields to extract. I am unable to improve the accuracy. How can I improve the model? Are there any pretrained weights that may help?
Difficulty in reading the following text -
Looks like you want to do text classification/processing; you need to extract details from the text, but you are applying object detection algorithms. I believe you need to use OCR to extract the text (if you have the CV as an image) and then use a text classification model. Check out the links below for more information about text classification -
https://medium.com/#armandj.olivares/a-basic-nlp-tutorial-for-news-multiclass-categorization-82afa6d46aa5
https://www.tensorflow.org/tutorials/tensorflow_text/intro
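A minimal sketch of that route, assuming pytesseract for the OCR step and a simple bag-of-words classifier on top (the file name, training texts and labels are placeholders):

```python
import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# OCR the scanned page into plain text
text = pytesseract.image_to_string(Image.open("resume_page.png"))

# Placeholder labelled texts for training the classifier
train_texts = ["John Doe Software Engineer Java Spring ...",
               "Jane Smith Data Scientist Python pandas ..."]
train_labels = ["engineering", "data_science"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict([text]))
```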
You can break up the problem in two different ways:
Step 1- OCR seems to be the most direct way to get to your data. But increase the image size, and thus the resolution; otherwise you may lose data.
Step 2- Store the coordinates of each OCRed word. This is valuable information in this context; how words line up has significance (see the sketch after these steps).
Step 3- At this point you can try to use basic positional clustering to group words. However, this can easily fail on a columnar vs. row-based distribution of related text.
Step 4- See if you can identify which of the 49 tags these clusters belong to.
Look at text classification with Hidden Markov Models and the Baum-Welch algorithm, i.e. go for basic models first.
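For step 2, pytesseract's image_to_data gives word-level bounding boxes you can cluster on; the file name and the crude 20-pixel row grouping below are just illustrative:

```python
import pytesseract
from PIL import Image

# Get word-level text and bounding boxes from the OCR engine
data = pytesseract.image_to_data(Image.open("resume_page.png"),
                                 output_type=pytesseract.Output.DICT)

words = [(data["text"][i], data["left"][i], data["top"][i])
         for i in range(len(data["text"])) if data["text"][i].strip()]

# Very crude positional clustering: group words whose top coordinate
# falls in the same 20-pixel band (i.e. roughly the same line)
rows = {}
for text, left, top in words:
    rows.setdefault(top // 20, []).append((left, text))

for band in sorted(rows):
    line = " ".join(t for _, t in sorted(rows[band]))
    print(band, line)
```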
OR
The above ignores the inherent classification opportunity that is the image of a, well, a properly formatted CV.
Step 1- Train your model to partition the image into sections without OCR. A good model should not break up sentences, tables, etc. This approach may leverage separator lines and the like. There is also an opportunity to decrease the size of your image, since you are not OCRing yet.
Step 2- OCR the image sections and try to classify them similarly to the above.
Another option is to use neural networks like PixelLink: Detecting Scene Text via Instance Segmentation
https://arxiv.org/pdf/1801.01315.pdf
I'm building an Android application with OCR and Tensorflow. It scans price tags in supermarkets and has to put the scanned data into different fields. I've done the OCR part, so the image -> text recognition works fine and Tensorflow is only required to work with text input.
I'm new to Tensorflow and machine learning in general. Is it possible to do the following using Tensorflow, and if yes, could you share some ideas on how to do so?
The average input looks like this:
CARLSBERG
EESTI
HELE OLU 5%
1.59 +0.10
500 ml pudel
3.18 /I
4740019113419
The goal is to sort this data as follows:
Brand: CARLSBERG
Product name: HELE OLU 5%
Size: 500
Units: ml
The parameters that determine how a particular string will be classified are:
Case
Line number
Supermarket (it's known by default)
Total number of lines
Letters/numbers ratio
I think the first step would be to get your hands on or generate some labelled training data. You should look into feature extraction; for example, if you notice that for a certain item the second line is usually the price, you could represent that as a parameter. Or, say, if a number is followed by a unit like ml/l/oz, it's likely to be the volume. What you want to know is how confident you are that a specific line/string is, say, the price.
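To make that concrete, a hand-rolled feature extractor for one OCRed line might look like the sketch below; the feature names and regexes are my own illustrative assumptions, not a fixed recipe:

```python
import re

def line_features(line, line_number, total_lines):
    # Hand-crafted features for a single OCRed price-tag line
    letters = sum(c.isalpha() for c in line)
    digits = sum(c.isdigit() for c in line)
    return {
        "is_upper": line.isupper(),
        "line_position": line_number / total_lines,
        "letters_numbers_ratio": letters / (digits + 1),
        "has_volume_unit": bool(re.search(r"\b(ml|l|g|kg|oz)\b", line.lower())),
        "looks_like_price": bool(re.match(r"^\d+\.\d{2}", line.strip())),
    }

lines = ["CARLSBERG", "EESTI", "HELE OLU 5%", "1.59 +0.10", "500 ml pudel"]
for i, line in enumerate(lines):
    print(line, line_features(line, i, len(lines)))
```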
However, I think TensorFlow would be more suited for the OCR portion of the problem, which you have already solved. What you are asking is more towards text parsing, which could be better solved with an NLP approach.
As mentioned in 4d11's answer, one of the biggest challenges in machine learning is often getting a high-quality, sufficiently large set of training data.
In terms of feeding data into a Tensorflow network/model, I'd recommend you check out their 'get started' tutorial on feature columns:
https://www.tensorflow.org/get_started/feature_columns
Feature columns are used to represent data of different types numerically, in a form that can be fed into the model. The tutorial goes into some detail on the ways in which this works and why you may choose to represent different data in different ways. I found it pretty helpful as an intro.
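As a rough idea, the kind of parameters you listed (line number, total lines, letters/numbers ratio, supermarket) could be declared as feature columns like this; the column names and vocabulary are placeholders:

```python
import tensorflow as tf

# Placeholder feature columns mirroring the parameters discussed above
feature_columns = [
    tf.feature_column.numeric_column("line_number"),
    tf.feature_column.numeric_column("total_lines"),
    tf.feature_column.numeric_column("letters_numbers_ratio"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "supermarket", ["store_a", "store_b", "store_c"])),
]

# The columns can then be fed into an estimator or a DenseFeatures layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
```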
A Tensorflow model for text recognition (CNN + seq2seq with visual attention) is available as a Python package and is compatible with Google Cloud ML Engine.
https://github.com/emedvedev/attention-ocr
I am confused about how to build a machine learning chatbot for a closed-domain topic about cars.
I have a lot of text-format information about different car models; I ran Word2Vec on this data and saved a Word2Vec model.
Then the question set and answer set will be converted to vectors by looking them up in the word2vec model. Finally, they are fed to a seq2seq model for network training.
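For reference, that preparation step would look roughly like the sketch below with gensim (the tokenised sentences and vector size are placeholders; gensim 4.x uses vector_size rather than size):

```python
from gensim.models import Word2Vec

# Placeholder tokenised sentences from the car-domain text corpus
sentences = [["the", "new", "model", "has", "a", "hybrid", "engine"],
             ["fuel", "consumption", "is", "five", "litres", "per", "hundred", "km"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
model.save("word2vec.model")

# Look up vectors for a tokenised question before feeding the seq2seq model
question = ["what", "engine", "does", "the", "new", "model", "have"]
question_vectors = [model.wv[word] for word in question if word in model.wv]
```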
——-
My questions:
Should I build two word2vec models instead of one, e.g. word2vec_question.model and word2vec_answer.model, and convert the question set to vectors based on word2vec_question.model while the answer set uses word2vec_answer.model?
Why do some chatbot examples not use word embeddings, but just tokenise the question and answer sets and go straight to seq2seq training? Is this because the conversation sets are large enough to train the seq2seq network without vectorising?
Should we say that if the data is large enough, tokenising alone is sufficient and there is no need for word2vec modelling?
Back to my car expert system: please give me some advice on the right way to prepare the data and feed it to the Q&A examples. My ultimate wish is that, every week, I can feed the word2vec model(s?) with information from car magazines (not in conversation format, just passages about new cars), and then the chatbot can also answer questions about that new model.
Thanks in advance.
So here is my question:
I want to make my very own dataset using a motion capture camera system to get the ground truth poses and one RGB camera to get images, and then using this as input to my network, train/test a convNet.
I have looked around at other datasets for tensorflow, caffe and Matlab. I have viewed the MNIST, Cats/Dogs, Iris, LSP, HumanEva, HumanEva3.6, FLIC, etc. datasets and have tried to understand their data as best as I can. I have also looked at people online making their own datasets. The one thing is that usually, when you use their datasets as an example, you download a .txt file that already contains the labels.
If anyone could please explain to me how to use the image data with the labels to feed it into my network, it would be a tremendous help. I have made code before using tensorflow to input a .txt file into the network and get the correct predicted output. But my brain is missing something about how to input an image with a label. How do I create that dataset?
Your input images and your labels are two separate variables. You will be writing separate bits of code to import them. The videos typically need to be converted to JPG files (it's a royal pain to read video files directly, mostly because you can't randomly skip around the video easily).
Probably the easiest way to structure your data is via a CSV that contains filename, poseinfoA, poseinfoB, etc., where the filename refers to the JPG image on disk.
To get started on the basics, I suggest looking at the aymericdamien tutorial examples; I haven't found tutorials anywhere that were as clear and concise.
https://github.com/aymericdamien/TensorFlow-Examples
Those examples don't go into detail on the data input pipeline though. To set up a good data input pipeline in tensorflow I suggest you use the new (as of TF 1.4) Dataset object. It will force you into a good data input pipeline workflow, and it's the way all data input is going in tensorflow, so it's worth learning. It's also easy to test and debug when you write it this way. Here's the guide you want to follow.
https://www.tensorflow.org/programmers_guide/datasets
You can start your Dataset object from the CSV, and use dataset.map() to load the images with tf.image.decode_jpeg.
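Putting those pieces together, a sketch of the pipeline might look like this; the CSV columns, image size and batch size are placeholders:

```python
import pandas as pd
import tensorflow as tf

# Placeholder CSV with columns: filename, poseA, poseB
df = pd.read_csv("poses.csv")

def load_example(filename, pose):
    image = tf.image.decode_jpeg(tf.io.read_file(filename), channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0  # scale pixels to [0, 1]
    return image, pose

dataset = (tf.data.Dataset
           .from_tensor_slices((df["filename"].values,
                                df[["poseA", "poseB"]].values.astype("float32")))
           .map(load_example)
           .shuffle(1000)
           .batch(32))

for images, poses in dataset.take(1):
    print(images.shape, poses.shape)
```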
Since you're doing pose estimation I'll also suggest a nice blog I came across recently that will probably interest you. The topic is segmentation, but pose estimation is quite related.
http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review
I am very new to machine learning and have been implementing ML algorithms on the datasets.
But how do I go about classifying images using ML algorithms?
How do I feed the images to the learning models in the form of numpy arrays?
Can anyone brief me about the steps involved? I have been reading about feature extraction but I am not able to figure out how to do that.
Image classification is not much different, at its core, from any other sort of classification.
Your data are images, right? Well, we need to create some variables ("features") from those images in order to get a sense of what's in the images. Computers can understand matrices, not just straight-up images like humans do (although there are arguments that what humans are doing when they see images is deconstructing images into patterns of pixels, but let's keep it simple). Using OpenCV is a great way to turn image pixels into matrices.
Each matrix (i.e. each image) will have a corresponding tag or classification (e.g. "dog" or "cat"). You feed those matrices through your algorithm in order to classify each image.
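A tiny end-to-end sketch of that idea is below; the file names and labels are placeholders, and a real image classifier would normally use a CNN rather than flattened pixels:

```python
import cv2
import numpy as np
from sklearn.linear_model import LogisticRegression

def image_to_vector(path, size=(64, 64)):
    # Read the image as a pixel matrix, resize, and flatten into a feature vector
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, size)
    return img.flatten() / 255.0

# Placeholder labelled images
X = np.array([image_to_vector(p) for p in ["dog1.jpg", "cat1.jpg", "dog2.jpg", "cat2.jpg"]])
y = ["dog", "cat", "dog", "cat"]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([image_to_vector("new_image.jpg")]))
```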
That will get you started. There's so much that goes into machine learning related to images, but at its core, the problem is the same as elsewhere: take a matrix/set of data and use an algorithm to find patterns in the data and a function that maps the input to the output label. You might be served well by reading an intro to machine learning book or taking a course.