Gensim doc2vec training on ngrams - python

I have several thousand documents that I'd like to use in a gensim doc2vec model, but I only have 5grams for each of the documents, not the full texts in their original word order. In the doc2vec tutorial on the gensim website (https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html), a corpus is created with full texts and then the model is trained on that corpus. It looks something like this:
[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern',...], tags=[1]), TaggedDocument(words=[.....], tags=[2]),...]
Is it possible to create a training corpus where each document consists of a list of 5-grams rather than a list of words in their original order?

If you have all the 5-grams from the documents – perhaps even still in the order they appeared – it should be possible to stitch together the original documents (or nearly-equivalent pseudo-documents), as if the 5-grams were puzzle pieces or dominoes.
(For example, find the first 5-gram, either by its ordinal position in your data, or by finding a 5-gram whose 4-token prefix isn't any other 5-gram's 4-token suffix. Then find its successor by matching its 4-token suffix to the 4-token prefix of another candidate 5-gram. If at any point you have more than one candidate 'start' or 'continuation', you could try any one and keep going until you either finish or reach a dead end – a depth-first search for consistent chains – and if you hit a dead end, back up and try another. Alternatively, you could probably just pick another good start 5-gram and continue, at the risk of slightly misordering the document and repeating a few tokens. A bunch of such errors won't have much effect on final results in a large corpus.)
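A greedy version of that puzzle-piece chaining might look like the following rough sketch. It assumes each document's 5-grams are available as a list of 5-token tuples, and it skips the full depth-first backtracking, simply restarting from another 5-gram when it hits a dead end:
from collections import defaultdict

def reconstruct_doc(fivegrams):
    """Greedily chain a document's 5-grams (5-token tuples) back into one token list."""
    unused = set(range(len(fivegrams)))
    by_prefix = defaultdict(list)          # 4-token prefix -> indexes of 5-grams
    suffixes = set()
    for i, g in enumerate(fivegrams):
        by_prefix[g[:4]].append(i)
        suffixes.add(g[1:])

    def pop_start():
        # prefer a 5-gram that no other 5-gram can precede
        for i in sorted(unused):
            if fivegrams[i][:4] not in suffixes:
                return i
        return min(unused)                 # no clean start found: just pick any

    tokens = []
    while unused:
        i = pop_start()
        unused.discard(i)
        tokens.extend(fivegrams[i])        # start a new chain with all 5 tokens
        while True:
            key = tuple(tokens[-4:])
            nxt = next((j for j in by_prefix.get(key, []) if j in unused), None)
            if nxt is None:
                break                      # dead end: restart from another 5-gram
            unused.discard(nxt)
            tokens.append(fivegrams[nxt][4])   # 4-token overlap, so add 1 new token
    return tokens
Handling duplicated 5-grams or true branching points properly would need the depth-first backtracking described above; this greedy version is usually close enough for corpus-building purposes.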
Alternatively, the 'PV-DBOW' mode (dm=0) doesn't use context windows or neighboring words – so the exact original word order doesn't matter, just stand-in documents with the right words in any order. Simply concatenating all the 5-grams creates a reasonable pseudo-document – especially if you then discard about 4 out of every 5 tokens (to account for the fact that any one word in the original doc, except at the very beginning or end, appears in 5 different 5-grams).
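For the PV-DBOW route, a minimal sketch might look like this. Here docs_5grams is a hypothetical dict mapping each document's tag to its list of 5-gram tuples, and keeping roughly 1 token in 5 is the rough correction for the 5-fold repetition mentioned above:
import random
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def pseudo_doc(fivegrams, keep_fraction=0.2):
    # concatenate all 5-grams, then keep roughly 1 token in 5, since each
    # original word (except near the edges) appears in 5 different 5-grams
    tokens = [tok for gram in fivegrams for tok in gram]
    return [tok for tok in tokens if random.random() < keep_fraction]

# docs_5grams: {doc_tag: [(w1, w2, w3, w4, w5), ...], ...}  <- your data
corpus = [TaggedDocument(words=pseudo_doc(grams), tags=[tag])
          for tag, grams in docs_5grams.items()]

# dm=0 selects PV-DBOW, which ignores word order within a document
model = Doc2Vec(corpus, dm=0, vector_size=100, epochs=20, min_count=2)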

Retrieve n-grams with word2vec

I have a list of texts. I turn each text into a token list. For example, if one of the texts is 'I am studying word2vec', the respective token list will be (assuming I consider n-grams with n = 1, 2, 3) ['I', 'am', 'studying', 'word2vec', 'I am', 'am studying', 'studying word2vec', 'I am studying', 'am studying word2vec'].
Is this the right way to transform any text in order to apply most_similar()?
(I could also delete n-grams that contain at least one stopword, but that's not the point of my question.)
I call this list of lists of tokens texts. Now I build the model:
model = Word2Vec(texts)
then, if I use
words = model.most_similar('term', topn=5)
Is there a way to determine what kind of results I will get? For example, if term is a 1-gram, will I get a list of five 1-grams? If term is a 2-gram, will I get a list of five 2-grams?
Generally, the very best way to determine "what kinds of results" you will get if you were to try certain things is to try those things, and observe the results you actually get.
In preparing text for word2vec training, it is not typical to convert an input text to the form you've shown, with a bunch of space-delimited word n-grams added. Rather, the string 'I am studying word2vec' would typically just be preprocessed/tokenized to a list of (unigram) tokens like ['I', 'am', 'studying', 'word2vec'].
The model will then learn one vector per single word – with no vectors for multigrams. And since it only knows such 1-word vectors, all the results it reports from .most_similar() will also be single words.
You can preprocess your text to combine some words into multiword entities, based on some sort of statistical or semantic understanding of the text. Very often, this process converts runs of related words into underscore-connected single tokens. For example, 'I visited New York City' might become ['I', 'visited', 'New_York_City'].
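One common statistical way to do that is gensim's own Phrases model. A sketch, where texts is the list of token lists from the question and the min_count/threshold values are purely illustrative; the connector it inserts is an underscore:
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# texts = [['I', 'am', 'studying', 'word2vec'], ['I', 'visited', 'New', 'York', 'City'], ...]
bigram_model = Phrases(texts, min_count=5, threshold=10.0)                 # learns pairs like 'New_York'
trigram_model = Phrases(bigram_model[texts], min_count=5, threshold=10.0)  # then 'New_York_City'

phrased_texts = [trigram_model[bigram_model[sent]] for sent in texts]
model = Word2Vec(phrased_texts, vector_size=100, min_count=5)
# any learned phrase is now just another token, so it can be queried directly, e.g.:
# model.wv.most_similar('New_York_City', topn=5)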
But any such preprocessing decisions are separate from the word2vec algorithm itself, which just considers whatever 'words' you feed it as 1:1 keys for looking-up vectors-in-training. It only knows tokens, not n-grams.

string matching with NLP

I have two dataframes, df1 and df2, with ~40,000 rows and ~70,000 rows respectively of data about polling stations in country A.
The two dataframes have some common columns like 'polling_station_name', 'province', 'district' etc., however df1 has latitude and longitude columns, whereas df2 doesn't, so I am trying to do string matching between the two dataframes so at least some rows of df2 will have geolocations available. I am blocking on the 'district' column while doing the string matching.
This is the code that I have so far:
import recordlinkage
from recordlinkage.standardise import clean

# block on 'district' so only stations within the same district are compared
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)

# fuzzy-compare the station names within each block
compare = recordlinkage.Compare()
compare.string('polling_station_name', 'polling_station_name', method='damerau_levenshtein', threshold=0.75)
compare_vectors = compare.compute(candidate_links, df1, df2)
This produced about 12,000 matches; however, I have noticed that some polling station names are being matched incorrectly because the names are very similar even though the stations are in different locations - e.g. 'government girls primary school meilabu' and 'government girls primary school muzaka' are clearly different, yet they are being matched.
I think utilising NLP might help here: there are certain words that occur very frequently in the data, like 'government', 'girls', 'boys', 'primary', 'school', etc., so I would like to put less emphasis on those words and more emphasis on 'meilabu', 'muzaka', etc. while doing the string matching, but I am not sure where to start.
(For reference, many of the polling stations are government, i.e. public, schools.)
Any advice would be greatly appreciated!
The topic is very broad, just pay attention to standard approaches:
TFIDF: term frequency–inverse document frequency is often used as a weighting factor.
Measure similarity between two sentences using cosine similarity
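For example, combining those two over the polling station names with scikit-learn might look like the following sketch. It assumes the df1/df2 dataframes from the question, and in practice you would still restrict comparisons to the district blocks:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names1 = df1['polling_station_name'].fillna('').str.lower()
names2 = df2['polling_station_name'].fillna('').str.lower()

# fit the IDF weights on all names, so frequent words ('government', 'girls',
# 'primary', 'school', ...) get low weight and rare ones ('meilabu', 'muzaka')
# dominate the similarity score
vectorizer = TfidfVectorizer(analyzer='word')
vectorizer.fit(list(names1) + list(names2))

sims = cosine_similarity(vectorizer.transform(names1),
                         vectorizer.transform(names2))
# sims[i, j] is the name similarity between df1 row i and df2 row j; with
# ~40k x ~70k rows this should be computed per district block, not all at once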
@ipj said it correctly, the topic is very broad. You can try out the methods below:
def get_sim_measure(sentence1, sentence2):
    # turn each sentence into a vector, then compare the two vectors
    vec1 = get_vector(sentence1)
    vec2 = get_vector(sentence2)
    return cosine_similarity(vec1, vec2)
Now the get_vector method can be many things:
Remove the stop words first, then use word2vec or GloVe at the word level and average the vectors over the sentence (simple; see the sketch after this list).
Use doc2vec from Gensim for a vector embedding of the sentence (medium).
Use BERT (DistilBERT or something lighter) for dynamic embeddings with context (hard).
Use TF-IDF and then GloVe embeddings (simple).
Use spaCy's entity recognition and then do similarity matching on the entity labels; in this case words from 'government girls primary school' will act as stop words (slow process but simple).
Use BLEU score for measuring similar words, in case you need it (maybe misleading).
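As one concrete fill-in for get_vector, here is the first (word-vector averaging) option sketched with a small pre-trained GloVe model from gensim's downloader; the model name, tokenization and stopword handling are just one possible choice:
import numpy as np
import gensim.downloader as api
from nltk.corpus import stopwords          # assumes the nltk 'stopwords' data is downloaded
from sklearn.metrics.pairwise import cosine_similarity

wv = api.load('glove-wiki-gigaword-50')    # small pre-trained word vectors
stop = set(stopwords.words('english'))

def get_vector(sentence):
    # average the vectors of non-stopword, in-vocabulary tokens
    tokens = [t for t in sentence.lower().split() if t not in stop and t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

def get_sim_measure(sentence1, sentence2):
    vec1 = get_vector(sentence1).reshape(1, -1)
    vec2 = get_vector(sentence2).reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0, 0]

# note: rare place names like 'meilabu' may be out of the pre-trained vocabulary
# and get dropped, so for this particular dataset the TF-IDF route may work better
print(get_sim_measure('government girls primary school meilabu',
                      'government girls primary school muzaka'))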
There can be many situations, so it's best to give a few of the simple ones a try and go from there.

Emotional score of sentences using Spacy

I have a series of 100,000+ sentences and I want to rank how emotional they are.
I am quite new to the NLP world, but this is how I managed to get started (an adaptation from spaCy 101):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')   # any English pipeline works here
matcher = Matcher(nlp.vocab)

def set_sentiment(matcher, doc, i, matches):
    # callback: bump the doc's sentiment score for every emotional-word match
    doc.sentiment += 0.1

myemotionalwordlist = ['you', 'superb', 'great', 'free']

sentence0 = 'You are a superb great free person'
sentence1 = 'You are a great person'
sentence2 = 'Rocks are made of minerals'
sentences = [sentence0, sentence1, sentence2]

pattern2 = [[{"ORTH": emotionalword, "OP": "+"}] for emotionalword in myemotionalwordlist]
matcher.add("Emotional", set_sentiment, *pattern2)  # match one or more emotional words (spaCy 2.x API)

for sentence in sentences:
    doc = nlp(sentence)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
    print("Sentiment", doc.sentiment)
myemotionalwordlist is a list of about 200 words that I've built manually.
My questions are:
(1-a) Counting the number of emotional words does not seem like the best approach. Does anyone have suggestions for a better way of doing this?
(1-b) In case this approach is good enough, any suggestions on how I can extract emotional words from WordNet?
(2) What's the best way of scaling this up? I am thinking about adding all the sentences to a pandas DataFrame and then applying the match function to each one of them.
Thanks in advance!
There are going to be two main approaches:
the one you have started, which is a list of emotional words, and counting how often they appear
showing a machine learning model examples of what you consider emotional sentences and what you consider unemotional sentences, and letting it work it out.
The first way will get better as you give it more words, but you will eventually hit a limit. (Simply due to the ambiguity and flexibility of human language, e.g. while "you" is more emotive than "it", there are going to be a lot of unemotional sentences that use "you".)
any suggestions on how I can extract emotional words from wordnet?
Take a look at sentiwordnet, which adds a measure of positivity, negativity or neutrality to each wordnet entry. For "emotional" you could extract just those that have either pos or neg score over e.g. 0.5. (Watch out for the non-commercial-only licence.)
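A rough sketch of that extraction with NLTK's sentiwordnet interface (the 0.5 cutoff is just the example threshold above):
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download('wordnet')
nltk.download('sentiwordnet')

emotional_words = set()
for senti_synset in swn.all_senti_synsets():
    # keep lemmas of senses that are strongly positive or strongly negative
    if senti_synset.pos_score() > 0.5 or senti_synset.neg_score() > 0.5:
        for lemma in senti_synset.synset.lemma_names():
            emotional_words.add(lemma.replace('_', ' ').lower())

print(len(emotional_words))
print(sorted(emotional_words)[:20])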
The second approach will probably work better if you can feed it enough training data, but "enough" can sometimes be too much. Other downsides are that the models often need much more compute power and memory (a serious issue if you need to be offline, or working on a mobile device), and that they are a black box.
I think the 2020 approach would be to start with a pre-trained BERT model (the bigger the better, see the recent GPT-3 paper), and then fine-tune it with a sample of your 100K sentences that you've manually annotated. Evaluate it on another sample, and annotate more training data for the ones it got wrong. Keep doing this until you get the desired level of accuracy.
(spaCy has support for both approaches, by the way. What I called fine-tuning above is also called transfer learning; see https://spacy.io/usage/training#transfer-learning. Googling for "spacy sentiment analysis" will also turn up quite a few tutorials.)
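As a quick baseline before doing any fine-tuning, a ready-made sentiment pipeline can already score your sentences. A sketch with the Hugging Face transformers library; its default DistilBERT model measures positive/negative polarity rather than 'emotionality', so treat it only as a starting point:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')   # downloads a default fine-tuned DistilBERT

sentences = ['You are a superb great free person',
             'Rocks are made of minerals']
for sentence, result in zip(sentences, classifier(sentences)):
    # result looks like {'label': 'POSITIVE', 'score': 0.999}; a score close
    # to 0.5 suggests the model finds the sentence fairly neutral
    print(sentence, '->', result['label'], round(result['score'], 3))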

Word2Vec Vocab Similarities

I ran a word2vec algorithm on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all super close to 1. The tenth-closest score is still something like .998, so I feel like I'm not getting any significant differences between the similarities of words, which leads to meaningless similar words.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.corpus.stopwords.words('english')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
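Enabling that logging is just standard Python logging, e.g.:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
# subsequent gensim calls will now report the surviving vocabulary size,
# how many words were dropped by min_count, and training progress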
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words (CBOW) model instead of the skip-gram model. CBOW works better on smaller datasets; the skip-gram model in the official paper was trained on about 1 billion words.
Use min_count=5, which is what was used in the paper, and that was with a billion-word corpus. I think 30 is far too high for your data.
Don't remove the stop words, as doing so changes the neighboring words in the moving window.
Use more iterations, e.g. iter=10.
Use gensim.utils.simple_preprocess instead of word_tokenize, as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, but I don't know whether that is applicable to your dataset.
When following these steps, your code should be:
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

python: How to use POS (part of speech) features in scikit-learn classifiers (SVM) etc

I want to use the part-of-speech (POS) tags returned by nltk.pos_tag as features for an sklearn classifier. How can I convert them to a vector and use them?
e.g.
sent = "This is POS example"
tok=nltk.tokenize.word_tokenize(sent)
pos=nltk.pos_tag(tok)
print(pos)
This returns the following:
[('This', 'DT'), ('is', 'VBZ'), ('POS', 'NNP'), ('example', 'NN')]
Now I am unable to apply any of the vectorizers (DictVectorizer, FeatureHasher, or CountVectorizer) from scikit-learn to use this in a classifier.
Please suggest an approach.
If I'm understanding you right, this is a bit tricky. Once you tag it, your sentence (or document, or whatever) is no longer composed of words, but of pairs (word + tag), and it's not clear how to make the most useful vector-of-scalars out of that.
Most text vectorizers do something like counting how many times each vocabulary item occurs, and then making a feature for each one:
the: 4, player: 1, bats: 1, well: 2, today: 3,...
The next document might have:
the: 0, quick:5, flying:3, bats:1, caught:1, bugs:2
Both can be stored as arrays of integers so long as you always put the same key in the same array element (you'll have a lot of zeros for most documents) -- or as a dict. So a vectorizer does that for many "documents", and then works on that.
So your question boils down to how to turn a list of pairs into a flat list of items that the vectorizers can count.
The most trivial way is to flatten your data to
('This', 'POS_DT', 'is', 'POS_VBZ', 'POS', 'POS_NNP', 'example', 'POS_NN')
The usual counting would then get a vector of 8 vocabulary items, each occurring once. I renamed the tags to make sure they can't get confused with words.
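A sketch of that flattening fed through scikit-learn's CountVectorizer (the identity analyzer just counts the already-built token lists as they are):
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# assumes the nltk 'punkt' and 'averaged_perceptron_tagger' data are downloaded
def flatten_with_tags(sentence):
    # interleave each word with a renamed copy of its POS tag
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [item for word, tag in tagged for item in (word, 'POS_' + tag)]

docs = ["This is POS example", "The player bats well today"]
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform([flatten_with_tags(d) for d in docs])
print(vectorizer.get_feature_names_out())   # every word and every POS_ tag is a feature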
That would get you up and running, but it probably wouldn't accomplish much. That's because just knowing how many occurrences of each part of speech there are in a sample may not tell you what you need -- notice that any notion of which parts of speech go with which words is gone after the vectorizer does its counting.
Running a classifier on that may have some value if you're trying to distinguish something like style -- fiction may have more adjectives, lab reports may have fewer proper names (maybe), and so on.
Instead, you could change your data to
('This_DT', 'is_VBZ', 'POS_NNP', 'example_NN')
That keeps each tag "tied" to the word it belongs with, so now the vectors will be able to distinguish samples where "bat" is used as a verb from samples where it's only used as a noun. That would tell you slightly different things -- for example, "bat" as a verb is more likely in texts about baseball than in texts about zoos.
And there are many other arrangements you could do.
To get good results from using vector methods on natural language text, you will likely need to put a lot of thought (and testing) into just what features you want the vectorizer to generate and use. It depends heavily on what you're trying to accomplish in the end.
Hope that helps.
I know this is a bit late, but I'm going to add an answer here.
Depending on what features you want, you'll need to encode the POS tags in a way that makes sense. I've had the best results with SVM classification using n-grams when I glue the original sentence to the POS-tag sentence so that it looks like the following:
word1 word2 word3 ... wordn POST1 POST2 POST3... POSTn
Once this is done, I feed it into a standard n-gram vectorizer or whatever else and feed that into the SVM.
This method keeps the information of the individual words, but also keeps the vital information of POS-tag patterns when you give your system a word it hasn't seen before but that the tagger has encountered before.
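A sketch of that glued representation going into an n-gram vectorizer and an SVM (the toy texts and labels are placeholders for your own training data):
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def glue_words_and_tags(sentence):
    # "word1 word2 ... wordn TAG1 TAG2 ... TAGn"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return ' '.join(w for w, t in tagged) + ' ' + ' '.join(t for w, t in tagged)

train_texts = ["This is POS example", "The player bats well today"]
train_labels = [0, 1]                                  # placeholder labels

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit([glue_words_and_tags(t) for t in train_texts], train_labels)
print(clf.predict([glue_words_and_tags("The bats fly at night")]))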
What about merging the word and its tag, like 'word/tag'? Then you can feed your new corpus to a vectorizer that counts the words (TF-IDF or bag-of-words) and makes a feature for each one:
import nltk

wpt = nltk.WordPunctTokenizer()
text = wpt.tokenize('Someone should have this ring to a volcano')
text_tagged = nltk.pos_tag(text)
new_text = []
for word in text_tagged:
    new_text.append(word[0] + "/" + word[1])
doc = ' '.join(new_text)
The output for this is
Someone/NN should/MD have/VB this/DT ring/NN to/TO a/DT volcano/NN
I think a better method would be to:
Step 1: Create word/sentence embeddings for each text/sentence.
Step 2: Calculate the POS tags. Feed the POS tags to an embedder as in Step 1.
Step 3: Elementwise-multiply the two vectors. (This is to ensure that the word embeddings in each sentence are weighted by the POS tags associated with them.)
Thanks
