How to embed user names in word2vec model in gensim

How to embed user names in word2vec model in gensim - python

I have some volunteer essay writings in the format of:
volunteer_names, essay
["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]
["jenne", "li"], [["lets", "manage", "waste"]]
["emi", "li", "jim"], [["python", "is", "cool"]]
...
...
...
I want to identify the similar users based on their essay writings. I feel like word2vec is more suitable in problems like this. However, since I want to embed user names too in the model I am not sure how to do it. The examples I found in the internet only uses the words (See example code).
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
In that case, I am wondering if there is special way of doing this in word2vec or can I simply consider user names as just words to input to the model. please let me know your thoughts on this.
I am happy to provide more details if needed.

Word2vec infers the word representation from surrounding words: words similarly often appear in a similar company end up with similar vectors. Usually, a window of 5 words is considered. So, if you want to hack Word2vec, you would need to make sure that the student names will appear frequently enough (perhaps at a beginning and at the end of a sentence or something like that).
Alternatively, you can have a look at Doc2vec. During training, each document gets an ID and learns an embedding for the ID, they are in a lookup table as if they were word embeddings. If you use student names as document IDs, you would get student embeddings. If you have multiple essays from one student, I suppose you would need to hack Gensim a little bit not to have a unique ID for each essay.

Related

How to generate a meaningful sentence from words only?

I want to generate a sentence from list of words. I have tried n-gram model but it only generates the text from already existing sentence i.e. we input a sentence and it outputs the next generated words based on the value of n. Which model will be helpful to generate a meaningful sentence from only the list of words and which dataset should be used to train the model?

The dataset:
Just take a dataset constisting of sentences. Tokenize each sentence and shuffle the sentences. These shuffled tokens are your input, your sentence the output. Therefore you can generate as many samples as you wish:
def create_input(sentence):
tokens = nltk.word_tokenize(sentence)
shuffle(tokens)
return tokens
More difficult is the model: You could try to Fine-Tune a BERT model and I guess it will probably work.

You can use GPT-J. It is a free GPT model and its performance is comparable to GPT-3. The model takes the input that you provide it with, and tries to complete it.
How I use GPT-J to generate a sentence from a set of keywords:
Input:
Make a sentence with the following words: earth, dirt, alligator
Sentence: While the alligator is a species which mainly lives in the water, the earth is not uncommon territory and they like to dig through the dirt.
Make a sentence with the following words: shape, lantern, hair
Sentence:
Output:
Make a sentence with the following words: earth, dirt, alligator
Sentence: While the alligator is a species which mainly lives in the water, the earth is not uncommon territory and they like to dig through the dirt.
Make a sentence with the following words: shape, lantern, hair
Sentence: The hair is so thick on the lantern that it is almost like a shape.
How to tweak to a certain use-case?
Giving an example of what you want in the input (example keywords + sentence) can help GPT to understand the structure of the desired output. Explicitly explaining the GPT what the desired task is in the input (Make a sentence...) can help it to understand the task in my experience.
You can change the complexity of the output sentence by changing the example sentence to something like: An alligator likes to dig dirt out of the earth.
How to use?
Git repo: https://github.com/kingoflolz/mesh-transformer-jax
As shown in the repo, you can use the web demo of the model for testing, and you can implement it using Colab.
Web demo: https://6b.eleuther.ai/
Colab notebook: http://colab.research.google.com/github/kingoflolz/mesh-transformer-jax/blob/master/colab_demo.ipynb
I do not recommend trying to run it locally.

What you want is called lexically constrained beam search in natural language generation literature.
pip install -q git+https://github.com/huggingface/transformers.git
then this code can generated a sentence with the forced words list.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
encoder_input_str = "Generate a sentence:"
force_words = ["I", "school"]
input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids
outputs = model.generate(
input_ids,
force_words_ids=force_words_ids,
num_beams=5,
num_return_sequences=1,
no_repeat_ngram_size=1,
remove_invalid_values=True,
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
For further information refer to this.
If you don't want to use deep learning, index a lot of sentences,
search for the keywords using a retrieval system like Lucence, and retrieve a sentence that is closest to your query.

Thanks to text generation models like GPT-3, GPT-J, and GPT-NeoX, you can generate content out of simple keywords.
For example, let's say you want to generate a product description out of a couple of keywords, you could use few-shot learning and do something like this:
Generate a product description out of keywords.
Keywords: shoes, women, $59
Sentence: Beautiful shoes for women at the price of $59.
###
Keywords: trousers, men, $69
Sentence: Modern trousers for men, for $69 only.
###
Keywords: gloves, winter, $19
Sentence: Amazingly hot gloves for cold winters, at $19.
###
Keywords: t-shirt, men, $39
Sentence:
I actually wrote an article about this that you might find useful: effectively using GPT-J with few-shot learning

Emotional score of sentences using Spacy

I have a series of 100.000+ sentences and I want to rank how emotional they are.
I am quite new to the NLP world, but this is how I managed to get started (adaptation from spacy 101)
import spacy
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
def set_sentiment(matcher, doc, i, matches):
doc.sentiment += 0.1
myemotionalwordlist = ['you','superb','great','free']
sentence0 = 'You are a superb great free person'
sentence1 = 'You are a great person'
sentence2 = 'Rocks are made o minerals'
sentences = [sentence0,sentence1,sentence2]
pattern2 = [[{"ORTH": emotionalword, "OP": "+"}] for emotionalword in myemotionalwordlist]
matcher.add("Emotional", set_sentiment, *pattern2) # Match one or more emotional word
for sentence in sentences:
doc = nlp(sentence)
matches = matcher(doc)
for match_id, start, end in matches:
string_id = nlp.vocab.strings[match_id]
span = doc[start:end]
print("Sentiment", doc.sentiment)
myemotionalwordlist is a list of about 200 words that Ive built manually.
My questions are:
(1-a) Counting the number of emotional words does not seem like the best approach. Anyone has any suggetions of a better way of doing so?
(1-b) In case this approach is good enough, any suggestions on how I can extract emotional words from wordnet?
(2) Whats the best way of escalating this? I am thinking about adding all sentences to a pandas data frame and then applying the match function to each one of them
Thanks in advance!

There are going to be two main approaches:
the one you have started, which is a list of emotional words, and counting how often they appear
showing examples of what you consider emotional sentences and what are unemotional sentences to a machine learning model, and let it work it out.
The first way will get better as you give it more words, but you will eventually hit a limit. (Simply due to the ambiguity and flexibility of human language, e.g. while "you" is more emotive than "it", there are going to be a lot of unemotional sentences that use "you".)
any suggestions on how I can extract emotional words from wordnet?
Take a look at sentiwordnet, which adds a measure of positivity, negativity or neutrality to each wordnet entry. For "emotional" you could extract just those that have either pos or neg score over e.g. 0.5. (Watch out for the non-commercial-only licence.)
The second approach will probably work better if you can feed it enough training data, but "enough" can sometimes be too much. Other downsides are the models often need much more compute power and memory (a serious issue if you need to be offline, or working on a mobile device), and that they are a blackbox.
I think the 2020 approach would be to start with a pre-trained BERT model (the bigger the better, see the recent GPT-3 paper), and then fine-tune it with a sample of your 100K sentences that you've manually annotated. Evaluate it on another sample, and annotate more training data for the ones it got wrong. Keep doing this until you get the desired level of accuracy.
(Spacy has support for both approaches, by the way. What I called fine-tuning above is also called transfer learning. See https://spacy.io/usage/training#transfer-learning Also googling for "spacy sentiment analysis" will find quite a few tutorials.)

Extracting Key-Phrases from text based on the Topic with Python

I have a large dataset with 3 columns, columns are text, phrase and topic.
I want to find a way to extract key-phrases (phrases column) based on the topic.
Key-Phrase can be part of the text value or the whole text value.
import pandas as pd
text = ["great game with a lot of amazing goals from both teams",
"goalkeepers from both teams made misteke",
"he won all four grand slam championchips",
"the best player from three-point line",
"Novak Djokovic is the best player of all time",
"amazing slam dunks from the best players",
"he deserved yellow-card for this foul",
"free throw points"]
phrase = ["goals", "goalkeepers", "grand slam championchips", "three-point line", "Novak Djokovic", "slam dunks", "yellow-card", "free throw points"]
topic = ["football", "football", "tennis", "basketball", "tennis", "basketball", "football", "basketball"]
df = pd.DataFrame({"text":text,
"phrase":phrase,
"topic":topic})
print(df.text)
print(df.phrase)
I'm having big trouble with finding a path to do something like this, because I have more than 50000 rows in my dataset and around 48000 of unique values of phrases, and 3 different topics.
I guess that building a dataset with all football, basketball and tennis topics are not really the best solution. So I was thinking about making some kind of ML model for this, but again that means that I will have 2 features (text and topic) and one result (phrase), but I will have more than 48000 of different classes in my result, and that is not a good approach.
I was thinking about using text column as a feature and applying classification model in order to find sentiment. After that I can use predicted sentiment to extract key features, but I do not know how to extract them.
One more problem is that I get only 66% accuracy when I try to classify sentiment by using CountVectorizer or TfidfTransformer with Random Forest, Decision Tree, or any other classifying algorithm, and also 66% of accuracy if Im using TextBlob for sentiment analysis.
Any help?

It looks like a good approach here would be to use a Latent Dirichlet allocation model, which is an example of what are known as topic models.
A LDA is a an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through what could be an approach to solve this by training a model using the sentences in the text column. Though in the case the phrases are representative enough an contain the necessary information to be captured by the models, then they could also be a good (possibly better) candidate for training the model, though that you'll better judge by yourself.
Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords, lemmatizing and stemming. For that you can use nltk:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import lda
from sklearn.feature_extraction.text import CountVectorizer
ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()
text = []
for sentence in df.text:
words = word_tokenize(sentence)
stemmed = []
for word in words:
if word not in ignore:
stemmed.append(stemmer.lemmatize(word))
text.append(' '.join(stemmed))
Now we have more appropriate corpus to train the model:
print(text)
['great game lot amazing goal team',
'goalkeeper team made misteke',
'four grand slam championchips',
'best player three-point line',
'Novak Djokovic best player time',
'amazing slam dunk best player',
'deserved yellow-card foul',
'free throw point']
We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)
Note that you can use the ngram parameter to spacify the n-gram range you want to consider to train the model. By setting ngram_range=(1,2) for instance you'd end up with features containing all individual words as well as 2-grams in each sentence, here's an example having trained CountVectorizer with ngram_range=(1,2):
vec.get_feature_names()
['amazing',
'amazing goal',
'amazing slam',
'best',
'best player',
....
The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.
Then we can train the LDA with whatever amount of topics you want, in this case I'll just be selecting 3 topics (note that this has nothing to do with the topics column), which you can consider to be the Key-Phrases - or words in this case - that you mention. Here I'll be using lda, though there are several options such as gensim.
Each topic will have associated a set of words from the vocabulary it has been trained with, with each word having a score measuring the relevance of the word in a topic.
model = lda.LDA(n_topics=3, random_state=1)
model.fit(X)
Through topic_word_ we can now obtain these scores associated to each topic. We can use argsort to sort the vector of scores, and use it to index the vector of feature names, which we can obtain with vec.get_feature_names:
topic_word = model.topic_word_
vocab = vec.get_feature_names()
n_top_words = 3
for i, topic_dist in enumerate(topic_word):
topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card
The printed results don't really represent much in this case, since the model has been trained with the sample from the question, however you should see more clear and meaningful topics by training with your entire corpus.
Also note that for this example I've use the whole vocabulary to train the model. However it seems that in your case what would make more sense, is to split the text column into groups according to the different topics you already have, and train a separate model on each group. But hopefully this gives you a good idea on how to proceed.

It appears you're looking to group short pieces of text by topic. You will have to tokenize the data in one way or another. There are a variety of encodings that you could consider:
Bag of words, which classifies by counting the frequency of each word in your vocabulary.
TF-IDF: Does what's above but makes words that appear in more entries less important
n_grams / bigrams / trigrams which essentially does the bag of words method but also maintains some context around each word. So you'll have encodings for each word but you'll also have tokens for "great_game", "game_with" and "great_game_with" etc.
Orthogonal Sparse Bigrams (OSB)s Also create features that have the words further apart, like "great__with"
Any of these options could be ideal for your dataset (the last two are likely your best bet). If none of these options work, There are a few more options you could try:
First you could use word embeddings. These are vector representations of each word that unlike one-hot-encoding intrinsically contain word meaning. You can sum the words in a sentence together to get a new vector containing the general idea of what the sentence is about which can then be decoded.
You can also use word embeddings alongside a Bidirectional LSTM. This is the most computationally intensive option but if your other options are not working this might be a good choice. biLSTMs try to interpret sentences by looking at the context around words to try to understand what the word might mean in that context.
Hope this helps

I think what your looking for is called "Topic modeling" in NLP.
you should try using LDA for topic modeling. It's one of easiest methods to apply.
also as #Mike mentioned, converting word to vector has many approaches. You should first try simple approaches like count vectorizer and then gradually move to something like word-2-vect or glove.
I am attaching some links for applying LDA to the corpus.
1. https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
2. https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/

Word2Vec Vocab Similarities

I ran a word2vec algo on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (for model.wv.most_similar method) are all super close to 1. The tenth closest score is still like .998, so I feel like I'm not getting any significant differences between the similarity of words which leads to meaningless similar words.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.

Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)

Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag of word model instead of the skip-gram model. CBOW is better at smaller dataset. The skip-gram model was trained in the official paper over 1 billion words.
Use min_count=5 which is the one they used in the paper and they had 1 billion. I think 30 is way too much for your data.
Don't remove the stop words as it will change the neighboring words in the moving window.
Use more iterations like iter=10 for example.
Use gensim.utils.simple_preprocess instead of word_tokenize as the punctuation isn't helpful in this case.
Also, I recommend split your dataset into paragraphs instead of sentences, but I don't know if this is applicable in your dataset or not
When following these steps, your code should be:
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

python: How to use POS (part of speech) features in scikit learn classfiers (SVM) etc

I want to use part of speech (POS) returned from nltk.pos_tag for sklearn classifier, How can I convert them to vector and use it?
e.g.
sent = "This is POS example"
tok=nltk.tokenize.word_tokenize(sent)
pos=nltk.pos_tag(tok)
print (pos)
This returns following
[('This', 'DT'), ('is', 'VBZ'), ('POS', 'NNP'), ('example', 'NN')]
Now I am unable to apply any of the vectorizer (DictVectorizer, or FeatureHasher, CountVectorizer from scikitlearn to use in classifier
Pls suggest

If I'm understanding you right, this is a bit tricky. Once you tag it, your sentence (or document, or whatever) is no longer composed of words, but of pairs (word + tag), and it's not clear how to make the most useful vector-of-scalars out of that.
Most text vectorizers do something like counting how many times each vocabulary item occurs, and then making a feature for each one:
the: 4, player: 1, bats: 1, well: 2, today: 3,...
The next document might have:
the: 0, quick:5, flying:3, bats:1, caught:1, bugs:2
Both can be stored as arrays of integers so long as you always put the same key in the same array element (you'll have a lot of zeros for most documents) -- or as a dict. So a vectorizer does that for many "documents", and then works on that.
So your question boils down to how to turn a list of pairs into a flat list of items that the vectorizors can count.
The most trivial way is to flatten your data to
('This', 'POS_DT', 'is', 'POS_VBZ', 'POS', 'POS_NNP', 'example', 'POS_NN')
The usual counting would then get a vector of 8 vocabulary items, each occurring once. I renamed the tags to make sure they can't get confused with words.
That would get you up and running, but it probably wouldn't accomplish much. That's because just knowing how many occurrences of each part of speech there are in a sample may not tell you what you need -- notice that any notion of which parts of speech go with which words is gone after the vectorizer does its counting.
Running a classifier on that may have some value if you're trying to distinguish something like style -- fiction may have more adjectives, lab reports may have fewer proper names (maybe), and so on.
Instead, you could change your data to
('This_DT', 'is_VBZ', 'POS_NNP', 'example_NN')
That keeps each tag "tied" to the word it belongs with, so now the vectors will be able to distinguish samples where "bat" is used as a verbs, from samples where it's only used as a noun. That would tell you slightly different things -- for example, "bat" as a verb is more likely in texts about baseball than in texts about zoos.
And there are many other arrangements you could do.
To get good results from using vector methods on natural language text, you will likely need to put a lot of thought (and testing) into just what features you want the vectorizer to generate and use. It depends heavily on what you're trying to accomplish in the end.
Hope that helps.

I know this is a bit late, but gonna add an answer here.
Depending on what features you want, you'll need to encode the POST in a way that makes sense. I've had the best results with SVM classification using ngrams when I glue the original sentence to the POST sentence so that it looks like the following:
word1 word2 word3 ... wordn POST1 POST2 POST3... POSTn
Once this is done, I feed it into a standard ngram or whatever else and feed that into the SVM.
This method keeps the information of the individual words, but also keeps the vital information of POST patterns when you give your system a words it hasn't seen before but that the tagger has encountered before.

What about merging the word and its tag like 'word/tag' then you may feed your new corpus to a vectorizer that count the word (TF-IDF or word of bags) then make a feature for each one:
wpt = nltk.WordPunctTokenizer()
text = wpt.tokenize('Someone should have this ring to a volcano')
text_tagged = nltk.pos_tag(text)
new_text = []
for word in text_tagged:
new_text.append(word[0] + "/" + word[1])
doc = ' '.join(new_text)
output for this is
Someone/NN should/MD have/VB this/DT piece/NN of/IN shit/NN to/TO a/DT volcano/NN

I think a better method would be to :
Step-1: Create word/sentence embeddings for each text/sentence.
Step-2: Calculate the POS-tags. Feed the POS-tags to a embedder as Step-1.
Step-3: Elementwise multiply the two vectors. (This is to ensure that the word-embeddings in each sentence is weighted by the POS-tags associated with it.
Thanks

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.