I have downloaded a .bin FastText model, and load it as follows:
ft = fasttext.load_model('/content/drive/MyDrive/dataset/cc.en.300.bin')
How can I apply preprocessing and normalization to the cc.en.300.bin model?
I want to do lemmatization, stopword removal, and other operations.
Your question doesn't really make sense, given how FastText models are usually used, on a couple levels:
A pre-trained FastText model, like cc.en.300.bin, no longer has any original text inside it, as would normally be the input for preprocessing/normalization. It is the end result of someone else's training on a corpus they already prepared for the FastText model. Essentially, you're stuck with their choices of tokenization, normalization, & other preprocessing.
Because FastText models learn from the same kind of word-morphology (including roots/stems/alternate-forms) that's removed by stemming/lemmatization, generally the training texts used for FastText training aren't preprocessed in that way. And, you wouldn't perform any such transformation on word-tokens you want to look-up in such a pre-trained model.
The only way I can imagine your question representing a real need is if you already have some other texts that have been destructively preprocessed – such as by replacing words with their stems/lemmas – and now you want to look up those words in this model.
If that's your true need:
1. You should try to go back to texts that weren't destructively changed, and look up the original (non-lemmatized) words in the FastText model. That's the usual way to use FastText, and what I'd expect to work best. (For example, looking up each of 'walking', 'walked', 'walks', etc. will likely give better vectors than reducing them all to the lemma 'walk' and only looking that up.) If you can't recover the original words...
2. You could try just looking up the lemmas/etc. you have directly. The FastText model will give you its vector for that root word, synthesizing a guess-vector if necessary from the word-fragments it knows. That might work fine. Or...
3. You could conceivably iterate over all the model's known words, and map them to the alternate/normalized words of your preprocessing scheme. This would potentially map N known words to 1 new 'reduced' word. Then, create a new model where that single reduced word, in your new scheme, has just one vector – perhaps by some sort of (weighted) average of all the other words it might have been, before the reductive canonicalization. But, I have a hard time imagining any situation where this extra complexity would offer any advantage over either (1) above, using FastText as intended by looking up the words that are already in the model, or (2) above, just settling for what it already has for your new word. So really, don't do anything like this unless you've got some good reason.
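For anyone who does have a good reason, the weighted-average idea in the last option could be sketched roughly as below. This is only an illustration on toy data: `merge_vectors`, the 2-d vectors, the counts, and the tiny lemma table are all hypothetical stand-ins. With the real model you would presumably feed it from `ft.get_words(include_freq=True)` and `ft.get_word_vector(word)`, and use a real lemmatizer for `normalize`.

```python
from collections import defaultdict

def merge_vectors(word_vectors, word_counts, normalize):
    """Collapse many known words into one vector per normalized form.

    word_vectors: dict word -> list of floats (stand-in for model vectors)
    word_counts:  dict word -> frequency, used as the averaging weight
    normalize:    callable mapping a word to its reduced form (e.g. a lemmatizer)
    """
    sums = {}
    weights = defaultdict(float)
    for word, vec in word_vectors.items():
        key = normalize(word)
        weight = float(word_counts.get(word, 1))
        if key not in sums:
            sums[key] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[key][i] += weight * value
        weights[key] += weight
    # Divide each accumulated sum by the total weight of its group.
    return {key: [x / weights[key] for x in total] for key, total in sums.items()}

# Toy data (hypothetical, NOT taken from cc.en.300.bin).
toy_vectors = {"walk": [1.0, 0.0], "walked": [0.0, 1.0], "walks": [1.0, 1.0], "cat": [5.0, 5.0]}
toy_counts = {"walk": 2, "walked": 1, "walks": 1, "cat": 3}
toy_lemmas = {"walked": "walk", "walks": "walk"}

merged = merge_vectors(toy_vectors, toy_counts, lambda w: toy_lemmas.get(w, w))
print(merged["walk"])  # frequency-weighted average of walk/walked/walks
```

Weighting by frequency means the most common surface form dominates the merged vector; a plain unweighted mean is the other obvious choice, and neither is guaranteed to beat options (1) or (2).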
Related
I am working on an NLP project using FastText. I have some texts which contain words like #.poisonjamak, #aminagabread, #iamquak123 and I want to see their FastText representation. I want to mention that the model has the following form:
# FastText
ft_model = FastText(word_tokenized_corpus,
max_n=0,
vector_size=64,
window=5,
min_count=1,
sg=1,
workers=20,
epochs=50,
seed=42)
Using this I can see their representation, however I have an error
print(ft_model.wv['#.poisonjamak'])
KeyError: 'cannot calculate vector for OOV word without ngrams'
Of course, these words are in my texts. I get the above error for all 3 of these words; however, if I do the following, it works.
print(ft_model.wv['#.poisonjamak']) -----> print(ft_model.wv['poisonjamak'])
print(ft_model.wv['#aminagabread']) -----> print(ft_model.wv['aminagabread'])
print(ft_model.wv['#_iamquak123_']) -----> print(ft_model.wv['_iamquak123_'])
Question: So do you know why I have this problem?
Update:
My dataset is called 'df' and the column with the texts is called 'text'. I am using the following code to prepare the texts for FastText. The FastText model is trained on word_tokenized_corpus.
extra_list = df.text.tolist()
final_corpus = [sentence for sentence in extra_list if sentence.strip() !='']
word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]
As comments note, the main issue is likely with your tokenizer, which won't put '#' characters inside your tokens. As a result, your FastText model isn't seeing the tokens you expect – but probably does have a word-vector for the 'word' '#'.
Separately reviewing your actual word_tokenized_corpus, to see what it truly includes before the model gets to do its training, is a good way to confirm this (or catch this class of error in the future).
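A quick way to see the effect without any training is to run a sample line through the same splitting rule. NLTK documents `WordPunctTokenizer` as using the regexp `\w+|[^\w\s]+`, so the sketch below applies that pattern directly with `re` (so it runs without NLTK installed). The sample sentence and the `whitespace_tokenize` alternative are just illustrations, not a recommendation for your data.

```python
import re

# Same pattern NLTK documents for WordPunctTokenizer: runs of word
# characters, or runs of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    return WORD_PUNCT.findall(text)

def whitespace_tokenize(text):
    # Naive alternative that keeps '#' attached to the following word.
    return text.split()

sample = "thanks #aminagabread and #.poisonjamak"
print(word_punct_tokenize(sample))
# ['thanks', '#', 'aminagabread', 'and', '#.', 'poisonjamak']
print(whitespace_tokenize(sample))
# ['thanks', '#aminagabread', 'and', '#.poisonjamak']
```

Since `_` counts as a word character, a token like `_iamquak123_` survives intact while its leading `#` is split off, which matches the lookups that worked in the question.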
There is however another contributing issue: your use of the max_n=0 parameter. This essentially turns off subword learning, by qualifying no positive-length word-substrings (aka 'character n-grams') for vector-learning. This setting essentially turns FastText into plain Word2Vec.
If instead you were using FastText in a more usual way, it would've learned subword-vectors for some of the subwords in 'aminagabread' etc, and thus would've provided synthetic "guess" word-vectors for the full '#aminagabread' unseen OOV token.
So in a way, you're only seeing the error letting you know about a problem in your tokenization because of this other deviation from usual FastText OOV behavior. If you really want FastText for its unique benefit of synthetic vectors for OOV words, you should return to a more typical max_n setting.
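To make the max_n effect concrete, here is a small pure-Python sketch of how character n-grams are typically enumerated for a word: the word is wrapped in '<' and '>' boundary markers and every substring of length min_n through max_n is taken. The function `char_ngrams` is illustrative only, not the library's actual code; the defaults of 3 and 6 mirror the usual FastText defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, FastText-style (illustrative)."""
    extended = "<" + word + ">"
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for start in range(len(extended) - n + 1):
            ngrams.append(extended[start:start + n])
    return ngrams

print(char_ngrams("cat"))
# ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']

# With max_n=0 no length qualifies, so there are no subwords to learn from.
print(char_ngrams("#aminagabread", max_n=0))
# []

# With typical settings, the unseen '#aminagabread' shares many n-grams with
# the in-vocabulary 'aminagabread', which is what makes an OOV guess possible.
shared = set(char_ngrams("#aminagabread")) & set(char_ngrams("aminagabread"))
print(len(shared) > 0)
```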
Separate usage tips:
min_count=1 is usually a bad idea with such word2vec-family algorithms, as such rare words don't have enough varied usage examples to get good vectors themselves, but the failed attempt to try degrades training for surrounding words. Often, discarding such words (as with the default min_count=5), as if they weren't there at all, improves downstream evaluations.
Because of some inherent threading inefficiencies of the Python Global Interpreter Lock ("GIL"), and the Gensim approach to iterating over your corpus in one thread, parcelling work out to worker threads, it is likely you'll get higher training throughput with fewer workers than your workers=20 setting, even if you have 20 (or far more) CPU cores. The exact best setting in any situation will vary by a lot of things, including some of the model parameters, and only trial-and-error can narrow the best values. But it's more likely to be in the 6-12 range, even when more cores are available, than 16+.
I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the word Jaguar [car brand].
As you know, word2vec only gives one representation for a given word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple way with taking an average vector of the word and category word, for example like this.
Now I would like to try to create and train a neural network that would take entire sentences, e.g.
Jaguar F-PACE is a great SUV sports car.
Among cats, only tigers and lions are bigger than jaguars.
And then it would undertake the task of text classification (I have a dataset with several categories like animals, cars, etc.), but the result would be new representations for the word jaguar, but in different contexts, so two different embeddings.
Does anyone have any idea how I could create such a network? I don't hide that I'm a beginner and have no idea how to go about it.
If you've already been able to perform sense-disambiguation outside word2vec, then you can change the word-tokens to reflect your external judgement. For example, change some appearances of the token 'jaguar' to 'jaguar*car' and others to 'jaguar*animal'. Proceeding with normal word2vec training will then get your two different tokens two different word-vectors.
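As a rough sketch of that retagging step, assuming you already have a sense label per sentence from your external judgement: the helper name `tag_senses`, the toy sentences, and the labels below are all made up for illustration.

```python
def tag_senses(tokens, target, sense):
    """Replace every occurrence of `target` with '<target>*<sense>'."""
    return [f"{target}*{sense}" if tok.lower() == target else tok for tok in tokens]

# Toy sentences, each paired with an externally supplied sense label.
labelled_sentences = [
    ("jaguar f-pace is a great suv sports car".split(), "car"),
    ("among cats only tigers and lions are bigger than a jaguar".split(), "animal"),
]

corpus = [tag_senses(tokens, "jaguar", sense) for tokens, sense in labelled_sentences]
print(corpus[0][0])   # jaguar*car
print(corpus[1][-1])  # jaguar*animal
```

The resulting `corpus` is an ordinary list of token lists, so it can be passed to word2vec training unchanged; the model simply sees two unrelated vocabulary entries.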
If you're hoping for the training to discover these itself, as ~Erwan mentioned in a comment, that seems like an open research question, without a standard or off-the-shelf solution that a beginner could drop-in.
I'd once seen a paper (around the time of the original word2vec papers, but can't find the link now) that tried to do this in a word2vec-compatible way by 1st proceeding with traditional polysemy-oblivious training. Then, for every appearance of a word X, model its surrounding context via some combination of the word-vectors of neighbors within a certain number of positions. (That in itself is very similar to the preparation of a context-vector in the CBOW mode of word2vec.) Perform some clustering on that collection-of-all-contexts to come up with some idea of alternate senses – each associated with one cluster. Then, in a followup pass on the original corpus, replace word-tokens with those that also reflect their nearby-context cluster. (EG: 'jaguar' might be replaced with 'jaguar*1', 'jaguar*2', etc based on which discrete cluster its context suggested.) Then, repeat (or continue) word2vec training to get sense-specific word-vectors. Of course, the devil would be in the details of how contexts are defined, how clusters are deduced, and tough edge-cases (where potentially the text's author is themselves deploying the multiple senses).
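A toy version of that pipeline might look like the following. Everything here is a simplifying assumption rather than the paper's method: the 2-d `toy_vectors` stand in for vectors from a first training pass, the context is a plain mean of neighbours within a window, and `kmeans` is a deliberately naive implementation seeded with the first k points.

```python
def mean(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def context_vector(tokens, pos, vectors, window=2):
    """Average the vectors of known neighbours around position `pos`.

    Assumes at least one neighbour has a vector (true for the toy data).
    """
    lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
    neighbours = [vectors[tokens[j]] for j in range(lo, hi)
                  if j != pos and tokens[j] in vectors]
    return mean(neighbours)

def kmeans(points, k, iterations=10):
    """Naive k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels

# Hypothetical vectors from a first, polysemy-oblivious training pass.
toy_vectors = {"car": [1.0, 0.0], "suv": [0.9, 0.1], "engine": [1.0, 0.2],
               "cat": [0.0, 1.0], "tiger": [0.1, 0.9], "lion": [0.0, 0.8]}
sentences = [["suv", "jaguar", "car"], ["tiger", "jaguar", "cat"],
             ["engine", "jaguar", "suv"], ["lion", "jaguar", "tiger"]]

# 1) collect a context vector per occurrence, 2) cluster, 3) relabel tokens.
occurrences = [(s, i) for s, sent in enumerate(sentences)
               for i, tok in enumerate(sent) if tok == "jaguar"]
contexts = [context_vector(sentences[s], i, toy_vectors) for s, i in occurrences]
labels = kmeans(contexts, k=2)

relabelled = [list(sent) for sent in sentences]
for (s, i), lab in zip(occurrences, labels):
    relabelled[s][i] = f"jaguar*{lab + 1}"
print(relabelled)
```

The relabelled corpus would then go through another round of word2vec training. On real text, the choice of k, the context definition, and the initialization all matter far more than this toy suggests.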
Some other interesting efforts to model or deduce polysemy in word2vec models:
"Linear Algebraic Structure of Word Meanings"
"A Simple Approach to Learn Polysemous Word Embeddings"
But per above, I've not seen these sorts of techniques widely implemented/adopted in a form that's easy to drop-in to another project.
In my NLP project I build my own model to identify sentences in a PDF document. Now I would like to check if my extracted sentences are complete sentences. During my research I have already come across this question, with the solutions presented there allowing quite a few false positives. Does anyone perhaps have a tip on how I can check whether a sentence is a complete sentence?
This is a non-trivial problem, so no approach will work in each and every case. You should also consider that whatever parser you use might merge or split sentences which in the original document were complete sentences, but after they are parsed are not any more.
A general alternative to purely rule-based approaches: you could use a model pretrained on the CoLA (Corpus of Linguistic Acceptability) task. These models try to classify sentences/documents into the classes "linguistically acceptable" and "linguistically unacceptable".
On huggingface's model hub there are several pretrained transformer models for this, see for example this inference API for one which is a fine-tuned version of Facebook's RoBERTa model:
Correct Sentence
Incorrect Sentence
You should have a look at how the model was trained when it comes to bullet points/self-standing half sentences etc. though, as some scores might be surprising at first glance.
You might want to combine the model's results with a rule-based approach, for example: "The sentence is acceptable if the score is 0.95 or higher AND the sentence has at least 4 words AND ends with '.', '?' or '!'." You can then see what sentences your combined model + rule-based approach produces, and keep modifying the rules until the results are to your satisfaction.
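That example rule could be written as a small filter like the one below. It assumes the acceptability score has already been obtained from whichever CoLA model you use; the function name `is_complete_sentence` and its parameters are hypothetical, and the thresholds are just the ones from the example rule.

```python
def is_complete_sentence(sentence, acceptability_score, min_score=0.95, min_words=4):
    """Combine a model score with simple surface rules.

    `acceptability_score` is assumed to be the model's probability that the
    sentence is linguistically acceptable (computed elsewhere).
    """
    text = sentence.strip()
    return (
        acceptability_score >= min_score
        and len(text.split()) >= min_words
        and text.endswith((".", "?", "!"))
    )

print(is_complete_sentence("The cat sat on the mat.", 0.98))  # True
print(is_complete_sentence("The cat sat on", 0.98))           # False: no end mark
print(is_complete_sentence("Stop now.", 0.99))                # False: too short
```

Keeping the thresholds as parameters makes it easy to tune them against a small hand-labelled sample of your extracted sentences.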
I have a couple of issues regarding Gensim in its Word2Vec model.
The first is: what happens if I set it to train for 0 epochs? Does it just create the random vectors and call it done? So they would be random every time, correct?
The second concerns the WV object, about which the doc page says:
This object essentially contains the mapping between words and embeddings.
After training, it can be used directly to query those embeddings in various ways.
See the module level docstring for examples.
But that is not clear to me. Allow me to explain: I have created my own word vectors, which I have substituted in as follows:
word2vecObject.wv['word'] = my_own
Then I call the train method with those replacement word vectors. But I would like to know which part I am replacing: is it the input-to-hidden weight layer or the hidden-to-output one? This is to check whether it can be called pre-training or not. Any help? Thank you.
I've not tried the nonsense parameter epochs=0, but it might behave as you expect. (Have you tried it and seen otherwise?)
However, if your real goal is to be able to tamper with the model after initialization, but before training, the usual way to do that is to not supply any corpus when constructing the model instance, and instead manually do the two followup steps, .build_vocab() & .train(), in your own code - inserting extra steps between the two. (For even finer-grained control, you can examine the source of .build_vocab() & its helper methods, and simply ensure you do all those necessary things, with your own extra steps interleaved.)
The "word vectors" in the .wv property of type KeyedVectors are essentially the "input projection layer" of the model: the data which converts a single word into a vector_size-dimensional dense embedding. (You can think of the keys – word token strings – as being somewhat like a one-hot word-encoding.)
So, assigning into that structure only changes that "input projection vector", which is the "word vector" usually collected from the model. If you need to tamper with the hidden-to-output weights, you need to look at the model's .syn1neg (or .syn1 for HS mode) property.
Is it possible to use Google BERT for calculating similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited size. Some works use BERT for similarity calculation between sentences, like:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT done to use it for large documents instead of sentences as inputs ( Documents with thousands of words)?
BERT is not trained to determine if one sentence follows another. That is just ONE of the GLUE tasks and there are a myriad more. ALL of the GLUE tasks (and superglue) are getting knocked out of the park by ALBERT.
BERT (and Albert for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bi-directional attention based encoder built on the Transformer which is the incarnation of the Google Brain paper Attention is All you Need. Also see this Visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language which doesn't use RNNs or LSTMs or tf-idf or any of that stuff. We aren't turning words or docs into vectors anymore. GloVe (Global Vectors for Word Representation) with LSTMs is old. Doc2Vec is old.
BERT is reeeeeallly powerful - like, pass the Turing test easily powerful. Take a look at
See superGLUE which just came out. Scroll to the bottom and look at how insane those tasks are. THAT is where NLP is at.
Okay so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks in two layers:
Perform either Extractive or Abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (NOTICE HOW BIG THOSE DOCUMENTS OF TEXT ARE) - and reduce your document down to a summary.
In a separate step, take each summary and do the STS-B task from Page 3 GLUE
Now, we are talking about absolutely bleeding edge technology here (Albert came out in just the last few months). You will need to be extremely proficient to get through this but it CAN be done, and I believe in you!!
BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires quadratic memory with respect to the input length which would not be feasible with documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
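The averaging step itself is simple. A minimal sketch, assuming you have already obtained one vector per sentence (for instance the [CLS] vectors mentioned above): the 2-d vectors here are made-up placeholders, not real BERT outputs, and `document_embedding` is a hypothetical helper name.

```python
import math

def document_embedding(sentence_vectors):
    """Average per-sentence vectors into a single document vector."""
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    return [sum(vec[i] for vec in sentence_vectors) / n for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Placeholder sentence vectors for three documents (hypothetical values).
doc_a = document_embedding([[1.0, 0.0], [0.8, 0.2]])
doc_b = document_embedding([[0.9, 0.1], [1.0, 0.1]])
doc_c = document_embedding([[0.0, 1.0], [0.1, 0.9]])

print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # True
```

Because each sentence is encoded separately, this sidesteps the input-length limit, at the cost of losing any cross-sentence context.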
There are some document-level embeddings. For instance doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to state of the art, so I would not hesitate using it. Or at least it is worth trying to see how it compares to embeddings.
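To show what such a frequency-based baseline computes, here is a small from-scratch sketch using only the standard library. It is not scikit-learn's implementation: it uses whitespace tokenization and one common smoothed idf weighting, and the three toy documents are invented for illustration. In practice you would more likely reach for scikit-learn's vectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "the jaguar is a big cat",
    "the tiger is a big cat",
    "the jaguar is a fast car",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```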
To add to #jindřich's answer, BERT is meant to find missing words in a sentence and predict the next sentence. Word-embedding-based doc2vec is still a good way to measure similarity between docs. If you want to delve deeper into why the best model isn't always the best choice for a use case, give this post a read; it clearly explains why not every state-of-the-art model is suitable for a task.
Ya. You would just do each part independently. For summarization you hardly need to do much. Just look on pypi for summarize and you have several packages. Don't even need to train. Now for sentence to sentence similarity there is a fairly complex method for getting loss but it's spelled out in the GLUE website. It's considered part of the challenge (meeting the metric). Determining that distance (sts) is non trivial and I think they call it "coherence" but I'm not sure.