How to classify and digitize a huge amount of paper documents using Python - python

I have an archive of papers in a company representing different business operations from different sections.
I want to scan all these documents, and after that I want a way to classify all the scanned documents into different categories and sub-categories based on custom preferences such as (name, age, section, etc.).
I want the end result to be digital files categorized according to the preferences that I set.
How can I do this using Python NLP or any other machine learning approach?

I think that this can be a basic pipeline (a minimal sketch follows the list):
Scanning part: preprocess the paper images with OpenCV + extract the text using an OCR library (pytesseract, EasyOCR);
Topic extraction: get the desired information to classify the documents using e.g. spaCy;
Categorize using plain Python, maybe pandas.
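A minimal sketch of the scanning and categorizing steps, assuming Tesseract plus the opencv-python and pytesseract packages are installed; the category names, keywords, and file name below are placeholders you would replace with your own preferences:

```python
import cv2
import pytesseract

# Hypothetical categories and the keywords that map to them (adjust to your own sections)
CATEGORIES = {
    "invoices": ["invoice", "amount due", "payment"],
    "hr": ["employee", "leave request", "salary"],
}

def ocr_scan(image_path):
    """Preprocess a scanned page with OpenCV and extract its text with Tesseract."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binarize to reduce noise before OCR
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(thresh)

def classify(text):
    """Return the first category whose keywords appear in the text, else 'uncategorized'."""
    lowered = text.lower()
    for category, keywords in CATEGORIES.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "uncategorized"

text = ocr_scan("scan_001.png")  # hypothetical file name
print(classify(text))
```

From there you can move the file into a folder per category, or record the category in a pandas DataFrame for further filtering by name, section, and so on.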

Related

How can I tag a text document with a topic?

I have a set of documents and a corresponding set of tags for those documents,
e.g.
Document - "Learned Counsel appearing for the Appellants however points out that in the..etc etc"
Tags - "Compensation, Fundamental Right"
Now I have multiple documents with their corresponding tags, and another test set of data without any tags. What NLP techniques do I use to tag these documents? Do I use text classification or topic modeling? Can someone please guide me or suggest some ideas.
You can use two approaches:
1- Rule based (extract common words for each tag and classify documents with them)
2- Machine learning
If you have large-scale training data, you can use machine learning to classify the documents (a minimal sketch follows the links below). You can use these approaches:
https://arxiv.org/abs/1904.08398
https://medium.com/#armandj.olivares/using-bert-for-classifying-documents-with-long-texts-5c3e7b04573d
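If you do go the machine learning route, a minimal sketch with scikit-learn is given below; since each document can carry several tags, it treats the task as multi-label classification with TF-IDF features and one linear classifier per tag. The documents and tags here are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Placeholder training data: each document can carry several tags
docs = [
    "Learned Counsel appearing for the Appellants however points out that ...",
    "The petitioner claims a violation of a fundamental right ...",
    "The tribunal fixed the compensation payable to the claimant ...",
]
tags = [["Compensation", "Fundamental Right"], ["Fundamental Right"], ["Compensation"]]

# Turn the tag lists into a binary indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# TF-IDF features + one logistic regression per tag
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(docs, y)

# Predict tags for an untagged test document
pred = clf.predict(["The court awarded compensation to the appellant ..."])
print(mlb.inverse_transform(pred))
```

The BERT-based approaches in the links above follow the same pattern, just with a transformer encoder instead of TF-IDF features.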

Extracting and ranking keywords from short text

I am working on a project to extract keywords from short texts (3-4 sentences). Using the spaCy library I extract noun phrases and named entities and use them as keywords. However, I would like to sort them by their importance with respect to the original text.
I tried standard information retrieval approaches, like TF-IDF, and even a couple of graph-based algorithms, but with such short texts the results weren't great.
I was thinking that maybe a neural network with an attention mechanism could help me rank those keywords. Is there any way to use the pre-trained models that come with spaCy to do some kind of ranking?
How about something like maximal marginal relevance (MMR)? http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf A rough sketch follows below.
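A rough sketch of MMR-style re-ranking of candidate keywords using spaCy similarity scores; it assumes a model with word vectors (e.g. en_core_web_md) is installed, and the lambda value and example text are arbitrary choices:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # needs a model that ships with word vectors

def mmr_rank(text, lambda_=0.7, top_n=5):
    """Rank noun-chunk candidates by relevance to the text while penalizing redundancy."""
    doc = nlp(text)
    candidates = list({chunk.text for chunk in doc.noun_chunks})
    cand_docs = {c: nlp(c) for c in candidates}

    selected = []
    while candidates and len(selected) < top_n:
        def mmr_score(c):
            relevance = cand_docs[c].similarity(doc)
            redundancy = max((cand_docs[c].similarity(cand_docs[s]) for s in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

print(mmr_rank("The company opened a new office on Main Street. "
               "The office focuses on financial banking."))
```

The trade-off between relevance and diversity is controlled by lambda; with very short texts, leaning towards relevance (higher lambda) is usually the safer default.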

Methods to extract keywords from large documents that are relevant to a set of predefined guidelines using NLP/ Semantic Similarity

I need suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
given a document about a company, I need the owner's name, where the office is situated, and what the operating industry is, and the defined set of words would be
{owner, director, office, industry...} - (1)
The intended output has to be something like
{Mr. Smith James, Main Street, Financial Banking} - (2)
I was looking at a method related to semantic similarity, where sentences containing words similar to the given set (1) would be extracted, and POS tagging would be used to extract nouns from those sentences.
It would be useful if further resources supporting this approach could be provided.
What you want to do is referred to as Named Entity Recognition (NER).
In Python there is a popular library called spaCy that can be used for that. The standard models are able to detect 18 different entity types, which is a fairly good amount.
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult. You may have to train your own model on these entity types. spaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see if that is sufficient for your needs. POS tags can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches (a minimal sketch is below). If you have more structured data, you could perhaps take advantage of that.
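A minimal spaCy NER sketch, assuming the small English model is installed (python -m spacy download en_core_web_sm); the example sentence is made up. PERSON, ORG and GPE cover the name and location parts, while the industry would likely need a custom entity type as noted above:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Mr. Smith James is the owner of Acme Finance, "
        "whose office is situated on Main Street, London.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

# Expected labels include PERSON, ORG and GPE out of the box;
# a custom label such as INDUSTRY would require training your own model.
```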

Classifying text documents using nltk

I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then categorize them with labels like sports, politics, technology, etc.
I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic example of sentiment-analysis classification using the movie_reviews corpus, separating documents into positive and negative reviews.
I'm just wondering how I could apply this approach to my project. Can I create multiple classes like sports, technology, politics, entertainment, etc.? I have hit a roadblock here and am looking for a push in the right direction.
If this isn't an appropriate question for SO I'll happily delete it.
Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project; here is a link to the code in the project's GitHub repo:
https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py
The task of text classification is a Supervised Machine Learning problem. This means that you need to have labelled data. When you approached the movie_reviews problem, you used the +1/-1 labels to train your sentiment analysis system.
Getting back to your problem:
If you have labels for your data, approach the problem in the same manner. I suggest you use the scikit-learn library. You can draw some inspiration from here: Scikit-Learn for Text Classification
If you don't have labels, you can try an unsupervised learning approach. If you have any clue about how many categories you have (call the number K), you can try a KMeans approach. This means grouping the emails into K categories based on how similar they are. Similar emails will end up in similar buckets. Then inspect the clusters by hand and come up with a label, and assign new emails to the most similar cluster (a quick sketch follows below). If you need help with KMeans check this quick recipe: Text Clustering Recipe
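A minimal KMeans sketch with scikit-learn; the example email bodies and the choice of K are placeholders:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder email bodies; in practice these are your stripped message bodies
emails = [
    "The match ended with a last minute goal ...",
    "The new phone ships with a faster processor ...",
    "Parliament votes on the new budget today ...",
]

K = 3  # your guess at the number of categories
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)

km = KMeans(n_clusters=K, random_state=42, n_init=10)
labels = km.fit_predict(X)

# Inspect each cluster by hand and give it a name (sports, technology, politics, ...)
for email, label in zip(emails, labels):
    print(label, email[:40])
```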
Suggestion: Getting labels for emails can be easier than you think. For example, Gmail lets you export your emails with folder information. If you have categorised your email, you can take advantage of this.
To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:
create your own data set
use a pre-existing dataset
The Brown corpus is a well-known corpus with many of the categories you are talking about. This could be a starting point to help classify your emails, using a package like gensim to find semantically similar texts (a rough sketch follows).
Once you classify your emails, you can then train a system to predict a label for each unseen email.
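A rough sketch of that idea with gensim, assuming nltk's Brown corpus has been downloaded (nltk.download('brown')); each Brown category's words are treated as one reference document and an email is assigned to the most similar category. The example email is a placeholder:

```python
from gensim import corpora, models, similarities
from nltk.corpus import brown

# One "reference document" per Brown category (news, sports, government, ...)
categories = brown.categories()
texts = [[w.lower() for w in brown.words(categories=c)] for c in categories]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(bow_corpus)
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

# Score a (placeholder) email body against every category
email = "the government announced a new tax policy today".split()
sims = index[tfidf[dictionary.doc2bow(email)]]
best = max(zip(categories, sims), key=lambda pair: pair[1])
print(best)
```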

NLTK extract categories present in text and map to a taxonomy

How can I map the categories extracted by my program from a text analysis (using NLP/NLTK or TextBlob) to a standard (or almost standard) taxonomy?
Preferably open source products.
I would also prefer to download the selected taxonomies (by theme) and work on them offline in Python (rather than use an online service/API).
I've just found this on the subject...
http://www.iab.com/guidelines/iab-quality-assurance-guidelines-qag-taxonomy/
There are several companies which provide a REST API for classification. I tried the three below, which were applicable for me.
1) AYLIEN - NLP service
2) https://www.uclassify.com/ - providing multiple NLP classifiers
3) https://iab.taxonome.org - I found this one very simple and easy to use; they also have a kind of free trial and some video classification demos
