I am working on a text classification task where my dataset contains a lot of abbreviations and proper nouns. For instance: Milka choc. bar.
My idea is to use a bidirectional LSTM model with word2vec embeddings.
And here is my problem: how do I encode words that do not appear in the dictionary?
I partially solved this problem by merging the pre-trained vectors with randomly initialized ones. Here is my implementation:
import gensim
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('ru.vec', binary=False, unicode_errors='ignore')

EMBEDDING_DIM = 300
# word_index and num_words come from the Keras tokenizer / preprocessing step (not shown)
vocabulary_size = min(len(word_index) + 1, num_words)
embedding_matrix = np.zeros((vocabulary_size, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= num_words:
        continue
    try:
        embedding_vector = word_vectors[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # OOV word: fall back to a random vector
        embedding_matrix[i] = np.random.normal(0, np.sqrt(0.25), EMBEDDING_DIM)
def LSTMModel(X, words_nb, embed_dim, num_classes):
    _input = Input(shape=(X.shape[1],))
    X = Embedding(words_nb,
                  embed_dim,
                  weights=[embedding_matrix],
                  trainable=True)(_input)
    X = The_rest_of_the_LSTM_model()(X)
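For completeness, here is roughly how I imagine the rest of the model; the Bidirectional/Dense layer sizes are placeholders, not something I have settled on (assuming a tf.keras setup):

from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

def build_bilstm(seq_len, words_nb, embed_dim, num_classes, embedding_matrix):
    _input = Input(shape=(seq_len,))
    x = Embedding(words_nb, embed_dim,
                  weights=[embedding_matrix],
                  trainable=True)(_input)
    x = Bidirectional(LSTM(128))(x)  # placeholder layer size
    output = Dense(num_classes, activation='softmax')(x)
    model = Model(_input, output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model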
Do you think that allowing the model to adjust the embedding weights is a good idea?
Could you please tell me how I can encode words like choc? Obviously, this abbreviation stands for chocolate.
It is often not a good idea to adjust word2vec embeddings if you do not have a sufficiently large training corpus. To clarify, take an example where your corpus contains television but not TV. Even though both might have word2vec embeddings, after training only television will be adjusted and not TV, so you disrupt the information that came from word2vec.
To solve this problem you have 3 options:
You let the LSTM in the upper layer figure out what the word might mean based on its context. For example, in "I like choc." the LSTM can figure out that it is an object. This was demonstrated by Memory Networks.
The easy option: pre-process and canonicalise as much as you can before passing the text to the model. Spell checkers often capture these cases very well and are really fast (see the sketch after this list).
You can use character encodings alongside word2vec. This is employed in many question answering models such as BiDAF, where a character-level representation is merged with word2vec so you have some information relating characters to words. In that case, choc might end up similar to chocolate.
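For option 2, a minimal pre-processing sketch using pyspellchecker; the abbreviation map is a made-up example you would maintain for your own domain:

from spellchecker import SpellChecker

spell = SpellChecker()
# hypothetical, hand-maintained expansions for domain abbreviations
ABBREVIATIONS = {"choc": "chocolate"}

def canonicalise(tokens):
    cleaned = []
    for tok in tokens:
        tok = tok.lower().strip(".")
        tok = ABBREVIATIONS.get(tok, tok)
        # fall back to the spell checker's best guess for words it does not know
        if not spell.known([tok]):
            tok = spell.correction(tok) or tok
        cleaned.append(tok)
    return cleaned

print(canonicalise(["Milka", "choc.", "bar"]))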
One way to do this would be to add a function that maps your abbreviations to existing vectors that are most likely to be related, i.e. initialize the choc vector to the chocolate vector in w2v.
word_in_your_embedding_matrix[:len(abbreviated_word)]
There are two possible cases:
There is only one candidate that starts with the same n letters as your abbreviation; then you can initialize your abbreviation embedding with that vector.
There are multiple items that start with the same n letters as your abbreviation; then you can use their average as your initialization vector.
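A minimal sketch of that prefix-matching initialization, assuming a gensim 3-style KeyedVectors (index2word; in gensim 4 it is index_to_key) and the embedding_matrix loop from the question:

import numpy as np

def init_from_prefix(abbrev, word_vectors, embed_dim):
    # average the vectors of all vocabulary words that start with the abbreviation
    prefix = abbrev.lower().rstrip(".")
    candidates = [w for w in word_vectors.index2word if w.startswith(prefix)]
    if not candidates:
        # no prefix match: fall back to a random vector
        return np.random.normal(0, np.sqrt(0.25), embed_dim)
    return np.mean([word_vectors[w] for w in candidates], axis=0)

# e.g. inside the loop over word_index, instead of the purely random fallback:
# embedding_matrix[i] = init_from_prefix(word, word_vectors, EMBEDDING_DIM)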
I'm new to NLP (pardon the very noob question!), and am looking for a way to perform vector operations on sentence embeddings (e.g., randomization in embedding-space in a uniform ball around a given sentence) and then decode them. I'm currently attempting to use the following strategy with T5 and Huggingface Transformers:
Encode the text with T5Tokenizer.
Run a forward pass through the encoder with model.encoder. Use the last hidden state as the embedding. (I've tried .generate as well, but it doesn't allow me to use the decoder separately from the encoder.)
Perform any desired operations on the embedding.
The problematic step: Pass it through model.decoder and decode with the tokenizer.
I'm having trouble with (4). My sanity check: I set (3) to do nothing (no change to the embedding), and I check whether the resulting text is the same as the input. So far, that check always fails.
I get the sense that I'm missing something rather important (something to do with the lack of beam search or some other similar generation method?). I'm unsure of whether what I think is an embedding (as in (2)) is even correct.
How would I go about encoding a sentence embedding with T5, modifying it in that vector space, and then decoding it into generated text? Also, might another model be a better fit?
As a sample, below is my incredibly broken code, based on this:
import transformers

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")

text = "Foo bar is typing some words."
input_ids = t5_tok(text, return_tensors="pt").input_ids
encoder_output_vectors = t5_model.encoder(input_ids, return_dict=True).last_hidden_state

# The rest is what I think is problematic:
decoder_input_ids = t5_tok("<pad>", return_tensors="pt", add_special_tokens=False).input_ids
decoder_output = t5_model.decoder(decoder_input_ids, encoder_hidden_states=encoder_output_vectors)
t5_tok.decode(decoder_output.last_hidden_state[0].softmax(0).argmax(1))
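One idea I have not been able to verify: hand the precomputed encoder states straight to generate(), assuming the installed transformers version accepts encoder_outputs there (wrapped in a BaseModelOutput). Untested sketch:

import torch
import transformers
from transformers.modeling_outputs import BaseModelOutput

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")

input_ids = t5_tok("Foo bar is typing some words.", return_tensors="pt").input_ids

# (2) encode once and keep the last hidden state as the "embedding"
with torch.no_grad():
    hidden = t5_model.encoder(input_ids, return_dict=True).last_hidden_state

# (3) perturb the embedding here; the scale is 0.0 so this run is the no-op sanity check
hidden = hidden + 0.0 * torch.randn_like(hidden)

# (4)+(5) let generate() run the decoder over the (possibly modified) encoder states
out_ids = t5_model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=hidden),
                            max_length=32)
print(t5_tok.decode(out_ids[0], skip_special_tokens=True))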
I am doing POS tagging for some texts. I use spaCy to get the POS tags. Why am I getting NOUN tags for unknown words? For instance, if I pass sbxdata, I get a noun tag for it, even though sbxdata is not a meaningful word. What I want is to get no tags at all for unknown words, or to get POS tags only for common English words. Are there any libraries or approaches available for this?
For instance, I am having the below sentence.
value large column sbxdata actual maximum ptsavatar
For this, I am getting the POS tags below.
How to get rid of the Noun tag for sbxdata and ptsavatar? Similarly, I need to get rid of any random tags for unknown words. Also, I suspect that NOUN is being assigned by default. Any help would be really appreciated.
This is my code.
import spacy

nlp = spacy.load('en', disable=['parser', 'ner'])
doc = nlp("value large column sbxdata actual maximum ptsavatar")
print('NLP doc :::', doc)
for token in doc:
    print('token :::', token, ' --> token.pos_ :::', token.pos_)
Sadly, I did not find a clear notice about this in the spaCy documentation,
but it seems your guess about the default assignment of NOUN is correct.
There were some suggestions to use spaCy's vectors to determine whether a word is a true English word. This did not work for me, however.
What about using a plain old spell checker? At least the two obvious cases in your example can be filtered out with pyspellchecker.
import spacy
from spellchecker import SpellChecker

nlp = spacy.load('en', disable=['parser', 'ner'])
spell = SpellChecker()

if __name__ == '__main__':
    doc = nlp("He who values value large column sbxdata actual maximum ptsavatar")
    # tokens the spell checker recognises as real English words
    known = set(spell.known([token.text.lower() for token in doc]))
    for token in doc:
        if token.text.lower() in known:
            pos = token.pos_
        else:
            pos = ""
        print(f'{pos:5s} {token}')
gives me
PRON He
PRON who
VERB values
VERB value
ADJ large
NOUN column
sbxdata
ADJ actual
ADJ maximum
ptsavatar
A POS tagger tries to find the tag that best fits each token (or word). If you use an unknown word like sbxdata in a sentence, it is most likely being used as a noun. Think of it syntactically rather than semantically. Here is an example: "I want to get rid of sbxdata." Although sbxdata has no known meaning in this sentence, it IS actually used as a noun. So it's not that random, in fact.
Also keep in mind that, since a POS tagger MUST find a tag for any given word, it is normal that when the tagger has a hard time finding the right tag it falls back on a common one like noun.
To clarify some misconceptions.
spaCy doesn't have a concept of a default TAG, although being out-of-vocabulary may bias the model's predictions. What spaCy does for TAG assignment is use:
Convolutional layers with residual connections, layer normalization and maxout non-linearity are used, giving much better efficiency than the standard BiLSTM solution.
to predict multilabel class probabilities:
This convolutional layer is shared between the tagger, parser and NER, and will also be shared by the future neural lemmatizer. Because the parser shares these layers with the tagger, the parser does not require tag features. I got this trick from David Weiss's "Stack Combination" paper.
To boost the representation, the tagger actually predicts a "super tag" with POS, morphology and dependency label. The tagger predicts these supertags by adding a softmax layer onto the convolutional layer – so, we're teaching the convolutional layer to give us a representation that's one affine transform from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions, too). The parser model makes a state vector by concatenating the vector representations for its context tokens. source
You can find the source code for prediction here
The magic of choosing certain TAG out of multilabel probabilities happens here:
doc_guesses = doc_scores.argmax(axis=1)
This is a simple argmax over an array of predictions. In general, in machine learning there is no "default" value unless one has a very strong reason to introduce one.
Back to the OP:
How to get rid of the Noun tag for sbxdata and ptsavatar.
The TAGs assigned are reasonable choices by the model based on the vectors (including the all-zeros OOV vector) and info about the surrounding tokens (see the links above). If you do not like them for some reason, I can think of 2 scenarios:
either you drop/substitute what you don't like with a dummy tag, e.g. <UNK_TAG> (see the sketch below), or
retrain your model from scratch, assigning the words that look strange an <UNK_TAG> in the raw text input.
Note that in both cases you'll need to define what is acceptable and what is not.
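A minimal sketch of the first option, using NLTK's words corpus as a stand-in vocabulary (any word list would do; requires nltk.download('words')):

import spacy
from nltk.corpus import words  # requires nltk.download('words')

nlp = spacy.load('en', disable=['parser', 'ner'])
english_vocab = set(w.lower() for w in words.words())

doc = nlp("value large column sbxdata actual maximum ptsavatar")
for token in doc:
    # substitute the model's guess with a dummy tag for words outside the vocabulary
    pos = token.pos_ if token.text.lower() in english_vocab else "<UNK_TAG>"
    print(f"{pos:10s} {token.text}")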
I have a large dataset with 3 columns: text, phrase and topic.
I want to find a way to extract key-phrases (phrases column) based on the topic.
Key-Phrase can be part of the text value or the whole text value.
import pandas as pd

text = ["great game with a lot of amazing goals from both teams",
        "goalkeepers from both teams made misteke",
        "he won all four grand slam championchips",
        "the best player from three-point line",
        "Novak Djokovic is the best player of all time",
        "amazing slam dunks from the best players",
        "he deserved yellow-card for this foul",
        "free throw points"]
phrase = ["goals", "goalkeepers", "grand slam championchips", "three-point line", "Novak Djokovic", "slam dunks", "yellow-card", "free throw points"]
topic = ["football", "football", "tennis", "basketball", "tennis", "basketball", "football", "basketball"]

df = pd.DataFrame({"text": text,
                   "phrase": phrase,
                   "topic": topic})

print(df.text)
print(df.phrase)
I'm having big trouble finding a way to do something like this, because I have more than 50,000 rows in my dataset, around 48,000 unique phrase values, and 3 different topics.
I guess that building a separate dataset for each of the football, basketball and tennis topics is not really the best solution. So I was thinking about building some kind of ML model for this, but again that means I would have 2 features (text and topic) and one result (phrase), with more than 48,000 different classes in the result, and that is not a good approach.
I was thinking about using the text column as a feature and applying a classification model in order to find sentiment. After that I could use the predicted sentiment to extract key features, but I do not know how to extract them.
One more problem is that I get only 66% accuracy when I try to classify sentiment using CountVectorizer or TfidfTransformer with Random Forest, Decision Tree, or any other classification algorithm, and also 66% accuracy if I'm using TextBlob for sentiment analysis.
Any help?
It looks like a good approach here would be to use a Latent Dirichlet allocation model, which is an example of what are known as topic models.
LDA is an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through a possible approach: training a model using the sentences in the text column. Though if the phrases are representative enough and contain the necessary information to be captured by the model, they could also be a good (possibly better) candidate for training, but you'll be the better judge of that.
Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords, lemmatizing and stemming. For that you can use nltk:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import lda
from sklearn.feature_extraction.text import CountVectorizer

ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()

text = []
for sentence in df.text:
    words = word_tokenize(sentence)
    stemmed = []
    for word in words:
        if word not in ignore:
            stemmed.append(stemmer.lemmatize(word))
    text.append(' '.join(stemmed))
Now we have more appropriate corpus to train the model:
print(text)
['great game lot amazing goal team',
'goalkeeper team made misteke',
'four grand slam championchips',
'best player three-point line',
'Novak Djokovic best player time',
'amazing slam dunk best player',
'deserved yellow-card foul',
'free throw point']
We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)
Note that you can use the ngram_range parameter to specify the n-gram range you want to use to train the model. By setting ngram_range=(1,2), for instance, you'd end up with features containing all individual words as well as 2-grams in each sentence; here's an example having trained CountVectorizer with ngram_range=(1,2):
vec.get_feature_names()
['amazing',
'amazing goal',
'amazing slam',
'best',
'best player',
....
The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.
Then we can train the LDA with however many topics you want; in this case I'll just be selecting 3 topics (note that this has nothing to do with the topics column), which you can consider to be the key-phrases (or words, in this case) that you mention. Here I'll be using lda, though there are several options, such as gensim.
Each topic will have associated with it a set of words from the vocabulary it has been trained on, each with a score measuring the relevance of that word in the topic.
model = lda.LDA(n_topics=3, random_state=1)
model.fit(X)
Through topic_word_ we can now obtain the scores associated with each topic. We can use argsort to sort the vector of scores, and use it to index the vector of feature names, which we can obtain with vec.get_feature_names:
import numpy as np

topic_word = model.topic_word_
vocab = vec.get_feature_names()
n_top_words = 3
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card
The printed results don't really represent much in this case, since the model has been trained on the small sample from the question; however, you should see clearer and more meaningful topics when training with your entire corpus.
Also note that for this example I've used the whole vocabulary to train the model. However, it seems that in your case what would make more sense is to split the text column into groups according to the different topics you already have, and train a separate model on each group. But hopefully this gives you a good idea of how to proceed.
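As a follow-up, if you also want to assign each sentence its most likely topic, the fitted lda model exposes a doc_topic_ matrix you can argmax over (a minimal sketch reusing the objects defined above):

import numpy as np

# doc_topic_ has one row per document and one column per topic
doc_topic = model.doc_topic_
for sentence, dist in zip(df.text, doc_topic):
    print('topic {}: {}'.format(np.argmax(dist), sentence))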
It appears you're looking to group short pieces of text by topic. You will have to tokenize the data in one way or another. There are a variety of encodings that you could consider:
Bag of words, which represents each entry by the frequency of each word in your vocabulary.
TF-IDF: does the same as above, but down-weights words that appear in many entries.
n-grams / bigrams / trigrams, which essentially apply the bag-of-words method but also maintain some context around each word. So you'll have encodings for each word, but you'll also have tokens for "great_game", "game_with" and "great_game_with", etc.
Orthogonal Sparse Bigrams (OSBs), which also create features from words further apart, like "great__with".
Any of these options could be ideal for your dataset (the last two are likely your best bet; a quick sketch of the TF-IDF/n-gram route appears at the end of this answer). If none of these options work, there are a few more you could try:
First, you could use word embeddings. These are vector representations of each word that, unlike one-hot encoding, intrinsically contain word meaning. You can sum the word vectors in a sentence to get a new vector capturing the general idea of what the sentence is about, which can then be used as a feature.
You can also use word embeddings alongside a Bidirectional LSTM. This is the most computationally intensive option, but if your other options are not working this might be a good choice. biLSTMs try to interpret sentences by looking at the context around words to understand what each word might mean in that context.
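For illustration, here is a minimal sketch of the TF-IDF/n-gram route with scikit-learn; the pipeline and the choice of classifier are just an example, reusing the df from the question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# unigrams + bigrams, weighted by TF-IDF, fed to a simple linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(df.text, df.topic)
print(clf.predict(["he scored an amazing goal"]))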
Hope this helps
I think what you're looking for is called "topic modeling" in NLP.
You should try using LDA for topic modeling. It's one of the easiest methods to apply.
Also, as @Mike mentioned, there are many approaches for converting words to vectors. You should first try simple approaches like CountVectorizer and then gradually move to something like word2vec or GloVe.
I am attaching some links on applying LDA to a corpus.
1. https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
2. https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/
I ran a word2vec algorithm on a text of about 750k words (before removing some stop words). Using my model, I started looking at the words most similar to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all super close to 1. The tenth-closest score is still around .998, so I feel like I'm not getting any significant differences between word similarities, which leads to meaningless "similar" words.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
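(Enabling that logging is a one-liner; gensim then reports the surviving vocabulary size and training progress at INFO level:)

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)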
Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words model instead of the skip-gram model. CBOW works better on smaller datasets; in the original paper, the skip-gram model was trained on about 1 billion words.
Use min_count=5, which is the value used in the paper, and they had about 1 billion words. I think 30 is way too high for your data.
Don't remove the stop words, as doing so changes the neighboring words in the moving window.
Use more iterations, e.g. iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize, as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, but I don't know whether this is applicable to your dataset.
When following these steps, your code should be:
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)
I am trying to learn how to use Elmo embeddings via this tutorial:
https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
I am specifically trying to use the interactive mode as described like this:
$ ipython
> from allennlp.commands.elmo import ElmoEmbedder
> elmo = ElmoEmbedder()
> tokens = ["I", "ate", "an", "apple", "for", "breakfast"]
> vectors = elmo.embed_sentence(tokens)
> assert(len(vectors) == 3)  # one for each layer in the ELMo output
> assert(len(vectors[0]) == len(tokens))  # the vector elements correspond with the input tokens
> import scipy
> vectors2 = elmo.embed_sentence(["I", "ate", "a", "carrot", "for", "breakfast"])
> scipy.spatial.distance.cosine(vectors[2][3], vectors2[2][3])  # cosine distance between "apple" and "carrot" in the last layer
0.18020617961883545
My overall question is how do I make sure to use the pre-trained elmo model on the original 5.5B set (described here: https://allennlp.org/elmo)?
I don't quite understand why we have to call "assert" or why we use the [2][3] indexing on the vector output.
My ultimate purpose is to average the all the word embeddings in order to get a sentence embedding, so I want to make sure I do it right!
Thanks for your patience as I am pretty new in all this.
By default, ElmoEmbedder uses the Original weights and options from the models pretrained on the 1 Billion Word Benchmark (about 800 million tokens). To ensure you're using the largest model, look at the arguments of the ElmoEmbedder class. From there you can see that you can set the options and weights of the model:
elmo = ElmoEmbedder(
    options_file='https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json',
    weight_file='https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5'
)
I got these links from the pretrained models table provided by AllenNLP.
assert is a convenient way to test and ensure specific values of variables. This looks like a good resource to read more. For example, the first assert statement ensures the embedding has three output matrices.
Going off of that, we index with [i][j] because the model outputs 3 layer matrices (where we choose the i-th) and each matrix has n tokens (where we choose the j-th) each of length 1024. Notice how the code compares the similarity of "apple" and "carrot", both of which are the 4th token at index j=3. From the example documentation, i represents one of:
The first layer corresponds to the context insensitive token representation, followed by the two LSTM layers. See the ELMo paper or follow up work at EMNLP 2018 for a description of what types of information is captured in each layer.
The paper provides the details on those two LSTM layers.
Lastly, if you have a set of sentences, with ELMo you don't need to average the token vectors. The model works on characters, so it handles tokenized whole sentences perfectly well. Use one of the methods designed for working with sets of sentences: embed_sentences(), embed_batch(), etc. More in the code!
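For instance, a minimal sketch of embedding a batch of sentences and averaging the top layer per sentence; the averaging and layer choice are only there because the question asks for a sentence vector, not something the model requires:

from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # pass options_file/weight_file as above for the 5.5B model
sentences = [["I", "ate", "an", "apple", "for", "breakfast"],
             ["I", "ate", "a", "carrot", "for", "breakfast"]]

# embed_sentences yields one (3, n_tokens, 1024) array per sentence
for vectors in elmo.embed_sentences(sentences):
    sentence_embedding = vectors[2].mean(axis=0)  # average the top LSTM layer over tokens
    print(sentence_embedding.shape)  # (1024,)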