I need some help, i am working on a problem where i have the OCR of an image of an invoice and i want to extract certain data from it like invoice number, amount, date etc which is all present within the OCR. I tried with the classification model where i was individually passing each sentence from the OCR to the model and to predict it the invoice number or date or anything else, but this approach takes a lot of time and i don't think this is the right approach.
So, i was thinking whether there is an algorithm where i can have an input string and have outputs mapped from that string like, invoice number, date and amount are present within the string.
E.g:
Inp string: The invoice 1234 is due on 12 oct 2018 with amount of 287
Output: Invoice Number: 1234, Date: 12 oct 2018, Amount 287
So, my question is, is there an algorithm which i can train on several invoices and then make predictions?
Essentially you are looking for NER (Named entity recognition). There are multiple free and paid tools available for intent and entity mapping. You can use Google DialogFlow, MS LUIS, or open source RASA for entity identification in given text.
if you want to develop your own solution then you can look at OpenNLP too.
Please revert on your observation on these wrt to your problem
What you are searching for is invoice data extraction ML. There are plenty of ML algorithms available, but none of them is done for your use case. Why? Because it is a very special use case. You can't just use Tensorflow and use sentences as input, although it can return multiple outputs.
You could use NLP (natural language processing) approaches to extract data. It is used by Taggun to extract data from receipts. In that case, you can use only sentences. But you will still need to convert your sentences into NLP form (tokenization).
You could use deep learning (e.g. Tensorflow). In that case, you need to vectorize your sentences into vectors that can be input into a neural network. This approach needs much more creativity while there is no standard approach to do that. The goal is to describe every sentence as good as possible. But there is still one problem - how to parse dates, amounts, etc. Would it help NN if you would mark sentences with contains_date True/False? Probably yes.
A similar approach is used in invoice data extraction services like:
rossum.ai
typless.com
So if you are doing it for fun/research I suggest starting with a really simple invoice. Try to write a program that will extract invoice number, issue date, supplier and total amount with parsing and if statements. It will help you to define properties for feature vector input of NN. For example, contains_date, contains_total_amount_word, etc. See this tutorial to start with NN.
If you are using it for work I suggest taking a look at one of the existing services for invoice data extraction.
Disclaimer: I am one of the creators of typless. Feel free to suggest edits.
Related
So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include a name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of a product.
I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).
The thing is: the texts may be in two different languages: Ukrainian and Russian. Also, if there is a foreign brand (like, Coca Cola), it will be written in English.
My initial idea on solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between those texts. However, I am not sure how efficient this will be, and if it is ok, what to start with.
Because each text I have is just a set of product characteristics, some word embeddings based on a context may not work (I'm not sure in this statement, it is just my assumption).
So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with faiss installation.
Hence, my questions are:
Is my idea with word embeddings worth trying?
Is there maybe a better approach?
If the idea with word embeddings is okay, which ones should I use?
Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.
Thanks in advance for any help! Any advice would be highly appreciated!
You could try Milvus that adopted Faiss to search similar vectors. It's easy to be installed with docker in windows OS.
Word embedding is meaningful inside the language but can't be transferrable to other languages. An observation for this statement is: if two words co-occur with a lot inside sentences, their embeddings can be near each other. Hence, as there is no one-to-one mapping between two general languages, you cannot compare word embeddings.
However, if two languages are similar enough to one-to-one mapping words, you may count on your idea.
In sum, without translation, your idea is not applicable to two general languages anymore.
Does the data contain lots of numerical information (e.g. nutritional facts)? If yes, this could be used to compare the products to some extent. My advice is to think of it not as a linguistic problem, but pattern matching as these texts have been assumably produced using semi-automatic methods using translation memories. Therefore similar texts across languages may have similar form and if so this should be used for comparison.
Multilingual text comparison is not a trivial task and I don't think there are any reasonably good out-of-box solutions for that. Yes, multilingual embeddings exist, but they have to be fine-tuned to work on specific downstream tasks.
Let's say that your task is about a fine-grained entity recognition. I think you have a well defined entities: brand, size etc...
So, these features that defines a product each could be a vector, which means your products could be represented with a matrix.
You can potentially represent each feature with an embedding.
Or mixture of the embedding and one-hot vectors.
Here is how.
Define a list of product features:
product name, brand name, size, weight.
For each product feature, you need a text recognition model:
E.g. with brand recognition you find what part of the text is its brand name.
Use machine translation if it is possible to make unified language representation for all sub texts. E.g. Coca Cola to
ru Кока-Кола, en Coca Cola.
Use contextual embeddings (i.e. huggingface multilingial BERT or something better) to convert prompted text into one vector.
In order to compare two products, compare their feature vectors: what is the average similarity between two feature array. You can also decide what is the weight on each feature.
Try other vectorization methods. Perhaps you dont want to mix brand knockoffs: "Coca Cola" is similar to "Cool Cola". So, maybe embeddings aren't good for brand names and size and weight but good enough for product names. If you want an exact match, you need a hash function for their text. On their multi-lingual prompt-engineered text.
You can also extend each feature vectors, with concatenations of several embeddings or one hot vector of their source language and things like that.
There is no definitive answer here, you need to experiment and test to see what is the best solution. You cam create a test set and make benchmarks for your solutions.
I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped each review per product, such that I can have different reviews succeeding one after the other in my dataframe (i.e: different authors for one product). Furthermore, I also already tokenized the reviews (and all other pre-processing methods). Below is a mock-up dataframe of what I'm having (the list of tokens per product is actually very high, as well as the number of products):
Product
reviews_tokenized
XGame3000
absolutely amazing simulator feel inaccessible ...
Poliamo
production value effect tend cover rather ...
Artemis
absolutely fantastic possibly good oil ...
Ratoiin
ability simulate emergency operator town ...
However, I'm not sure of what would be the most efficient between doc2Vec and Word2Vec. I would initially go for Doc2Vec, since it has the ability to find similarities by taking into account the paragraph/sentence, and find the topic of it (which I'd like to have, since I'm trying to cluster products by topics), but I'm a bit worry about the fact that the reviews are from different authors, and thus might bias the embeddings? Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which giving me a quite good silhouette score (~0.7).
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed = SEED, ns_exponent = 0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
product2vec = [model3.infer_vector((df['tokens'][i].split(' '))) for i in range(0,len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, or else, please tell me.
Thank you for your help.
EDIT: Below is the clusters I'm obtaining:
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
I am training an text classifier for addresses such that if given sentence is an address or not.
Sentence examples :-
(1) Mirdiff City Centre, DUBAI United Arab Emirates
(2) Ultron Inc. <numb> Toledo Beach Rd #1189 La Salle, MI <numb>
(3) Avenger - HEAD OFFICE P.O. Box <numb> India
As addresses can be of n types it's very difficult to make such classifier. Is there any pre-trained model or database for the same or any other non ML way.
As mentioned earlier, verifying that an address is valid - is probably better formalized as an information retrieval problem rather than a machine learning problem. (e.g. using a service).
However, from the examples you gave, it seems like you have several entity types that reoccur, such as organizations and locations.
I'd recommend enriching the data with a NER, such a spacy, and use the entity types for either a feature or a rule.
Note that named-entity recognizers rely more on context than the typical bag-of-words classifier, and are usually more robust to unseen data.
When I did this the last time the problem was very hard, esp. since I had international adresses and the variation across countries is enormous. Add to that the variation added by people and the problem becomes quite hard even for humans.
I finally build a heuristic (contains it some like PO BOX, a likely country name (grep wikipedia), maybe city names) and then threw every remaining maybe address into the google maps API. GM is quite good a recognizing adresses, but even that will have false positives, so manual checking will most likely be needed.
I did not use ML because my adress db was "large" but not large enough for training, esp. we lacked labeled training data.
As you are asking for recommendation for literature (btw this question is probably to broad for this place), I can recommend you two links:
https://www.reddit.com/r/datasets/comments/4jz7og/how_to_get_a_large_at_least_100k_postal_address/
https://www.red-gate.com/products/sql-development/sql-data-generator/
https://openaddresses.io/
You need to build a labeled data as #Christian Sauer has already mentioned, where you have examples with adresses. And probably you need to make false data with wrong adresses as well! So for example you have to make sentences with only telephone numbers or whatever. But in anyway this will be a quite disbalanced dataset, as you will have a lot of correct adresses and only a few which are not adresses. In total you would need around 1000 examples to have a starting point for it.
Other option is to identify the basic adresses manually and do a similarity analysis to identify the sentences which are clostet to it.
As mentioned by Uri Goren, the problem is of Named entity recognition, while there are a lot of trained models in the market. Still, the best one cant get is the Stanford NER.
https://nlp.stanford.edu/software/CRF-NER.shtml
It is a conditional random field NER. It is available in java.
If you are looking for a python implementation of the same. Have a look at:
How to install and invoke Stanford NERTagger?
Here you can gather info from a multiple sequence of tags like
, , or any other sequence like that. If it doesn't give you the correct stuff, it will still somehow get you closer to any address in the whole document. That's a head start.
Thanks.
I'm in need of suggestions how to extract keywords from a large document. The keywords should be inline what we have defined as the intended search results.
For example,
I need the owner's name, where the office is situated, what the operating industry is when a document about a company is given, and the defined set of words would be,
{owner, director, office, industry...}-(1)
the intended output has to be something like,
{Mr.Smith James, ,Main Street, Financial Banking}-(2)
I was looking for a method related to Semantic Similarity where sentences containing words similar to the given corpus (1), would be extracted, and using POS tagging to extract nouns from those sentences.
It would be a useful if further resources could be provided that support this approach.
What you want to do is referred to as Named Entity Recognition.
In Python there is a popular library called SpaCy that can be used for that. The standard models are able to detect 18 different entity types which is a fairly good amount.
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult. Maybe you would have to train your own model on these entity types. SpaCy also provides an API for training your own models.
Please note, that you need quite a lot of training data to have decent results. Start with 1000 examples per entity type and see if it's sufficient for your needs. POS can be used as a feature.
If your data is unstructured, this is probably one of most suited approaches. If you have more structured data, you could maybe take advantage of that.
I've this CSV file which has comments (tweets, comments). I want to classify them into 4 categories, viz.
Pre Sales
Post Sales
Purchased
Service query
Now the problems that I'm facing are these :
There is a huge number of overlapping words between each of the
categories, hence using NaiveBayes is failing.
The size of tweets being only 160 chars, what is the best way to
prevent words from one category falling into the another.
What all ways should I select the features which can take care of both the 160 char tweets and a bit lengthier facebook comments.
Please let me know of any reference link/tutorial link to follow up the same, being a newbee in this field
Thanks
I wouldn't be so quick to write off Naive Bayes. It does fine in many domains where there are lots of weak clues (as in "overlapping words"), but no absolutes. It all depends on the features you pass it. I'm guessing you are blindly passing it the usual "bag of words" features, perhaps after filtering for stopwords. Well, if that's not working, try a little harder.
A good approach is to read a couple of hundred tweets and see how you know which category you are looking at. That'll tell you what kind of things you need to distill into features. But be sure to look at lots of data, and focus on the general patterns.
An example (but note that I haven't looked at your corpus): Time expressions may be good clues on whether you are pre- or post-sale, but they take some work to detect. Create some features "past expression", "future expression", etc. (in addition to bag-of-words features), and see if that helps. Of course you'll need to figure out how to detect them first, but you don't have to be perfect: You're after anything that can help the classifier make a better guess. "Past tense" would probably be a good feature to try, too.
This is going to be a complex problem.
How do you define the categories? Get as many tweets and FB posts as you can and tag them all with the correct categories to get some ground truth data
Then you can identify which words/phrases are the best for identifying particular category using e.g. PCA
Look into scikit-learn they have tutorials for text processing and classification.