Fasttext model representations for numbers - python

I would like to create a fasttext model for numbers. Is this a good approach?
Use Case:
I have a given set of about 100,000 integer invoice numbers.
Our OCR sometimes produces false invoice numbers like 1000o00 or 383I338, so my idea was to use fastText to predict the nearest invoice number based on my 100,000 integers.
As the correct invoice numbers are known in advance, I trained a fastText model with all invoice numbers to create a word-embedding space containing just invoice numbers.
But it is not working, and I don't know if my idea is completely wrong. I would assume that even though I have no sentences, embedding into a vector space should work, and therefore the model should also find a similarity between 383I338 and 3831338.
Here is some of my code:
import pandas as pd
from random import seed
from random import randint
import fasttext
# seed random number generator
seed(9999)
number_of_vnr = 100000
min_vnr = 1111
max_vnr = 999999999
# generate vnr integers
versicherungsscheinnummern = [randint(min_vnr, max_vnr) for i in range(number_of_vnr)]
# save numbers as csv
df_vnr = pd.DataFrame(versicherungsscheinnummern, columns=['VNR'])
df_vnr['VNR'].dropna().astype(str).to_csv('vnr_str.csv', index=False)
# train model
model = fasttext.train_unsupervised('vnr_str.csv', model="cbow", minn=2, maxn=5)
Even numbers that are present in the training data are not found:
model.get_nearest_neighbors("833803015")
[(0.10374893993139267, '</s>')]
The model has no words:
model.words
["'</s>'"]

I doubt FastText is the right approach for this.
Unlike in natural languages, where word roots/prefixes/suffixes (character n-grams) can be hints to meaning, most invoice-number schemes are just incrementing numbers.
Every '###' or '####' is going to have a similar frequency. (Well, perhaps there'd be a little bit of a bias towards lower digits to the left, for Benford's-Law-like reasons.) Unless the exact same invoice numbers repeat often throughout the corpus, so that the whole token, and its fragments, acquire a word-like meaning from surrounding other tokens, FastText's post-training nearest-neighbors are unlikely to offer any hints about correct numbers. (For it to have a chance to help, you'd want the same invoice numbers not just to repeat many times, but for a lot of those appearances to have similar OCR errors – but I strongly suspect your corpus instead has each invoice number appearing only in individual texts.)
Is the real goal to correct the invoice numbers, or just to have them be less noisy in a model that's trained on a lot of more meaningful, text-like tokens? (If the latter, it might be better just to discard anything that looks like an invoice number – with or without OCR glitches – or is similarly so rare it's likely an OCR scanno.)
That said, statistical & edit-distance methods could potentially help if the real need is correcting OCR errors – just not semantic-context-dependent methods like FastText. You might get useful ideas from Peter Norvig's classic writeup, "How to Write a Spelling Corrector".

Related

improve gensim most_similar() return values by using wordnet hypernyms

I first ran this code to download the pre-trained model:
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-200')
Then running
glove.most_similar(positive=['sushi', 'uae'], negative=['japan'])
results in:
[('nahyan', 0.5181387066841125),
('caviar', 0.4778318405151367),
('paella', 0.4497394263744354),
('nahayan', 0.44313961267471313),
('zayed', 0.4321245849132538),
('omani', 0.4285220503807068),
('seafood', 0.4279175102710724),
('saif', 0.426000714302063),
('dirham', 0.4214130640029907),
('sashimi', 0.4165934920310974)]
In this example, we can see that the method failed to capture the 'type' or 'category' of the query: 'zayed' and 'nahyan' are not actually of type 'food'; rather, they are person names.
The approach suggested by my professor is to use wordnet hypernyms to find the 'type'.
With much research, the closest solution I found is to somehow incorporate lowest_common_hypernyms(), which gives the lowest common hypernym between two synsets, and use it to filter the results of most_similar().
I am not sure if my idea makes sense and would like the community's feedback on this.
My idea is to compute the hypernyms of, e.g., 'sushi' and the hypernyms of all the similar words returned by most_similar(), and to choose only the word with the 'longest' lowest-common-hypernym path. I expect this should return the word that best matches the 'type'.
Not sure if it makes sense...
Does your proposed approach give adequate results when you try it?
That's the only test of whether the idea makes sense.
Word2vec is generally oblivious to all the variations of category that a lexicon like WordNet can provide – all the words that are similar to another word, in any aspect, will be neighbors. Even words that people consider opposites – like 'hot' and 'cold' – will often be fairly close to each other, in some direction in the coordinate space, as they are similar in what they describe and what contexts they're used in. (They can be drop-in replacements for each other.)
Word2vec is also fairly oblivious to polysemy in its standard formulation.
Some other things worth trying might be:
if you need only answers of a certain type, mix in some measurement ranking candidate answers by their closeness to a word either describing that type ('food') or representing multiple examples (say, an average vector for many food names you'd know to be good answers) – see the sketch just after this list
choose another vector-set, or train your own. There's no universal "goodness" for word-vectors: their quality for certain tasks will vary based on their training data & parameters. Vectors trained on something broader than Wikipedia (your named vector file), or some text corpus more focused on your domain-of-interest – say, food criticism – might do better on some tasks. Changing training parameters can also change which kinds of similarity are most emphasized in the resulting vectors. For example, some observers have noticed small context-windows tend to put words that are direct drop-in replacements for each other closer-together, while larger context-windows bring words from the same domains-of-use, even if not drop-in replacements of the same 'type', closer. (It sounds like your current need might be best served with a model trained with smaller windows.)
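To make the first idea concrete, here's a minimal sketch that re-ranks the most_similar() candidates by their closeness to the type word 'food'; the 50/50 weighting is an arbitrary illustrative choice you'd want to tune:
import gensim.downloader as api

glove = api.load('glove-wiki-gigaword-200')
candidates = glove.most_similar(positive=['sushi', 'uae'], negative=['japan'], topn=30)
# mix the analogy score with each candidate's similarity to the type word 'food'
rescored = sorted(((word, 0.5 * score + 0.5 * glove.similarity(word, 'food'))
                   for word, score in candidates),
                  key=lambda pair: pair[1], reverse=True)
print(rescored[:10])
Averaging the vectors of several known-good food words, instead of the single word 'food', tends to give a more robust type anchor.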
Nahyan is from the UAE - it seems to be part of the name of all three presidents. So you seem to be getting what you ask for. If you want more foods, add "food" to your positive query, and maybe "people" to your negative query?
Another approach is to post-filter your results to remove anything that isn't a food. Or is a person. (WordNet won't be much help, as it is nowhere near comprehensive on foods, and even less so on people; Wikidata is likely to be more useful.)
By the way, if you find the common hypernym of sushi and UAE it will probably be the top-level entity in wordnet. So that will give you no filtering.
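You can check that quickly with NLTK's WordNet interface (assuming the wordnet corpus has been downloaded; picking the first synset for each word is just for illustration, since it may not be the intended sense):
from nltk.corpus import wordnet as wn

sushi = wn.synsets('sushi')[0]
uae = wn.synsets('united_arab_emirates')[0]
# for concepts this unrelated, the result tends to sit near the top of the hierarchy
print(sushi.lowest_common_hypernyms(uae))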

What should be used between Doc2Vec and Word2Vec when analyzing product reviews?

I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped each review per product, such that I can have different reviews succeeding one another in my dataframe (i.e. different authors for one product). Furthermore, I have also already tokenized the reviews (and applied all other pre-processing steps). Below is a mock-up dataframe of what I have (the list of tokens per product is actually very long, as is the number of products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which of Doc2Vec and Word2Vec would be the most effective. I would initially go for Doc2Vec, since it has the ability to find similarities by taking the whole paragraph/sentence into account and to find its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews being from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a quite good silhouette score (~0.7).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per product, tagged with its row index (SEED is defined elsewhere)
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# re-infer one vector per product from its token list
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, or else, please tell me.
Thank you for your help.
EDIT: Below are the clusters I'm obtaining (cluster plot not reproduced here).
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
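If you want to experiment with it, here's a minimal sketch using gensim's KeyedVectors.wmdistance (it needs an optimal-transport backend installed – pyemd or POT, depending on your gensim version – and the two token lists here are just placeholders):
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-100')  # any pre-trained word vectors will do
review_a = ['absolutely', 'amazing', 'simulator', 'feel']
review_b = ['fantastic', 'realistic', 'simulation', 'experience']
# lower distance = more similar wording/meaning; cost grows quickly with text length
print(wv.wmdistance(review_a, review_b))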

Python3 - Doc2Vec: Get document by vector/ID

I've already built my Doc2Vec model, using around 20,000 files. I'm looking for a way to find the string representation of a given vector/ID, something similar to Word2Vec's index2entity. I'm able to get the vector itself, using model['n'], but now I'm wondering whether there's a way to get some sort of string representation of it as well.
If you want to look up your actual training text, for a given text+tag that was part of training, you should retain that mapping outside the Doc2Vec model. (The model doesn't store training texts – it only looks at them, repeatedly, during training.)
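A minimal sketch of what that can look like, assuming a gensim 4.x install (where doc-vectors live under model.dv; in 3.x it was model.docvecs) and made-up tags and texts:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# your own tag -> original-text mapping, kept outside the model
texts = {'doc_0': 'first training document text',
         'doc_1': 'second training document text'}
corpus = [TaggedDocument(text.split(), [tag]) for tag, text in texts.items()]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

vec = model.dv['doc_1']                       # vector for a known tag
tag, sim = model.dv.most_similar([vec], topn=1)[0]
print(tag, '->', texts[tag])                  # string lookup happens outside the model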
If you want to generate a text from a Doc2Vec doc-vector, that's not an existing feature, nor do I know any published work describing a reliable technique for doing so.
There's a speculative/experimental bit of work-in-progress for gensim Doc2Vec that will forward-propagate a doc-vector through the model's neural-network, and report back the most-highly-predicted target words. (This is somewhat the opposite of the way infer_vector() works.)
That might, plausibly, give a sort-of summary text. For more details see this open issue & the attached PR-in-progress:
https://github.com/RaRe-Technologies/gensim/issues/2459
Whether this is truly useful or likely to become part of gensim is still unclear.
However, note that such a set-of-words wouldn't be grammatical. (It'll just be the ranked-list of most-predicted words. Perhaps some other subsystem could try to string those words together in a natural, grammatical way.)
Also, the subtleties of whether a concept has many potential associates words, or just one, could greatly affect the "top N" results of such a process. Contriving a possible example: there are many words for describing a 'cold' environment. As a result, a doc-vector for a text about something cold might have lots of near-synonyms for 'cold' in the 11th-20th ranked positions – such that the "total likelihood" of at least one cold-ish word is very high, maybe higher than any one other word. But just looking at the top-10 most-predicted words might instead list other "purer" words whose likelihood isn't so divided, and miss the (more-important-overall) sense of "coldness". So, this experimental pseudo-summarization method might benefit from a second-pass that somehow "coalesces" groups-of-related-words into their most-representative words, until some overall proportion (rather than fixed top-N) of the doc-vector's predicted-words are communicated. (This process might be vaguely like finding a set of M words whose "Word Mover's Distance" to the full set of predicted-words is minimized – though that could be a very expensive search.)

How to dynamically assign the right "size" for Word2Vec?

The question is two-fold:
1. How to select the ideal value for size?
2. How to get the vocabulary size dynamically (per row as I intend) to set that ideal size?
My data looks like the following (example)—just one row and one column:
Row 1
{kfhahf}
Lfhslnf;
.
.
.
Row 2
(stdgff ksshu, hsihf)
asgasf;
.
.
.
Etc.
Based on this post (Python: What is the "size" parameter in Gensim Word2vec model class), the size parameter should be less than (or equal to?) the vocabulary size. So, I am trying to dynamically assign the size as follows:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
# I do Word2Vec for each row
for item in dataset:
    tokenized = word_tokenize(item)
    model = Word2Vec([tokenized], min_count=1)
I get the vocabulary size here. So I create a second model:
model1 = Word2Vec([tokenized], min_count=1, size=len(model.wv.vocab))
This sets the size value to the current vocabulary size of the current row, as I intended. But is it the right way to do it? What is the right size for a small-vocabulary text?
There's no simple formula for the best size - it will depend on your data and purposes.
The best practice is to devise a robust, automatable way to score a set of word-vectors for your purposes – likely with some hand-constructed representative subset of the kinds of judgments, and preferred results, you need. Then, try many values of size (and other parameters) until you find the value(s) that score highest for your purposes.
In the domain of natural language modeling, where vocabularies are at least in the tens-of-thousands of unique words but possibly in the hundreds-of-thousands or millions, typical size values are usually in the 100-1000 range, but very often in the 200-400 range. So you might start a search of alternate values around there, if your task/vocabulary is similar.
But if your data or vocabulary is small, you may need to try smaller values. (Word2Vec really needs large, diverse training data to work best, though.)
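As a rough sketch of such a search – corpus, good_pairs and bad_pairs stand for your own tokenized texts and a hand-built set of related/unrelated word pairs, and vector_size is the gensim 4.x name for the older size parameter:
from gensim.models import Word2Vec

def pick_vector_size(corpus, good_pairs, bad_pairs, sizes=(25, 50, 100, 200, 400)):
    # score = fraction of hand-picked related pairs that beat matched unrelated pairs;
    # the pair words need to be frequent enough to survive min_count pruning
    def score(wv):
        hits = sum(wv.similarity(a, b) > wv.similarity(c, d)
                   for (a, b), (c, d) in zip(good_pairs, bad_pairs))
        return hits / len(good_pairs)
    results = {}
    for size in sizes:
        model = Word2Vec(corpus, vector_size=size, min_count=5, workers=4, epochs=10)
        results[size] = score(model.wv)
    return max(results, key=results.get), results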
Regarding your code-as-shown:
there's unlikely to be any point in computing a new model for every item in your dataset (discarding the previous model on each loop iteration). If you want a count of the unique tokens in any one tokenized item, you could use idiomatic Python like len(set(word_tokenize(item))). Any Word2Vec model of interest would likely need to be trained on the combined corpus of tokens from all items (see the sketch after these notes).
it's usually the case that min_count=1 makes a model worse than larger values (like the default of min_count=5). Words that only appear once generally can't get good word-vectors, as the algorithm needs multiple subtly-contrasting examples to work its magic. But, trying-and-failing to make useful word-vectors from such singletons tends to take up training-effort and model-state that could be more helpful for other words with adequate examples – so retaining those rare words even makes other word-vectors worse. (It is most definitely not the case that "retaining every raw word makes the model better", though it is almost always the case that "more real diverse data makes the model better".)
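Putting those two notes into code – a minimal sketch, where dataset is assumed to be the question's iterable of raw text strings:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

def train_single_model(dataset, size=100):
    # one shared model over all rows, instead of a throwaway model per row
    tokenized_corpus = [word_tokenize(item) for item in dataset]
    model = Word2Vec(tokenized_corpus, vector_size=size, min_count=5, workers=4)
    print(len(model.wv))  # vocabulary that survives min_count pruning
    return model

def unique_token_count(item):
    # if all you need is the number of distinct tokens in one row
    return len(set(word_tokenize(item)))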

Sample size for Named Entity Recognition gold standard corpus

I have a corpus of 170 Dutch literary novels on which I will apply Named Entity Recognition. For an evaluation of existing NER taggers for Dutch I want to manually annotate Named Entities in a random sample of this corpus – I use brat for this purpose. The manually annotated random sample will function as the 'gold standard' in my evaluation of the NER taggers. I wrote a Python script that outputs a random sample of my corpus on the sentence level.
My question is: what is the ideal size of the random sample in terms of the number of sentences per novel? For now, I used a random 100 sentences per novel, but this leads to a pretty big sample of almost 21,626 lines (which is a lot to annotate manually, and which leads to a slow working environment in brat).
NB, before the actual answer: the biggest issue I see is that you can only evaluate the tools with respect to those 170 books. So at best, it will tell you how well the NER tools you evaluate work on those books or similar texts. But I guess that is obvious...
As to sample sizes, I would guesstimate that you need no more than a dozen random sentences per book. Here's a simple way to check whether your sample size is already big enough: randomly choose only half of the sentences (stratified per book!) you annotated and evaluate all the tools on that subset. Do that a few times and see if results for the same tool vary widely between runs (say, by more than +/- 0.1 if you use F-score, for example – mostly depending on how "precise" you have to be to detect significant differences between the tools). If the variance is very large, continue to annotate more random sentences. If the numbers start to stabilize, you're good and can stop annotating.
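A rough sketch of that stability check – annotated and evaluate_tool are placeholders for your per-book gold sentences and your own scoring routine:
import random
from statistics import mean, stdev

def fscore_spread(annotated, evaluate_tool, runs=10):
    # annotated: dict book_id -> list of gold-annotated sentences
    # evaluate_tool: function returning an F-score for a list of sentences
    scores = []
    for _ in range(runs):
        subset = [s for sents in annotated.values()
                  for s in random.sample(sents, len(sents) // 2)]  # stratified: half per book
        scores.append(evaluate_tool(subset))
    return mean(scores), stdev(scores)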
Indeed, the "ideal" size would be... the whole corpus :)
Results will be correlated with the degree of detail of the typology: just PERS, LOC, ORG would require a minimal size, but what about a fine-grained typology or even full disambiguation (linking)? I suspect good performance wouldn't need much data (just enough to validate), whilst low performance would require more data to get a more detailed view of the errors.
As an indicator, cross-validation is considered as a standard methodology, it often uses 10% of the corpus to evaluate (but the evaluation is done 10 times).
Besides, if you are working with ancient novels, you will probably face a lexical-coverage problem: many old proper names will not be included in the lexical resources of available software, and this is a severe drawback for NER accuracy. Thus it could be a nice idea to split the corpus by decade or century and conduct multiple evaluations, so as to measure the impact of this issue on performance.
