Python treat multiple words as single - python

Is there any method to treat multiple words as a single token in Python? I've written a script to find the Tf-Idf value of words in a collection of documents. The problem is that it gives the Tf-Idf for individual words, but there are cases where I have to treat multiple words as one: terms like Big Data and Machine Learning should be treated as a single word and the Tf-Idf score should be calculated for them. Any help would be highly useful.

I would approach it using scikit-learn and the TfidfVectorizer. Tweaking some of its parameters would basically allow you to do all the work.
It's hard to show its functionality without a good example, though.
from sklearn.feature_extraction.text import TfidfVectorizer
# corpus must be an iterable of documents (one string per document), not a single string
corpus = ["lots of text in document one", "lots of text in document two"]
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
result = vectorizer.fit_transform(corpus)
Note that the ngram_range parameter lets you choose whether you are interested in e.g. bigrams, trigrams, etc. by specifying a range.
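For example (a quick sketch with a made-up two-document corpus), you can check which multi-word terms ended up in the vocabulary; with ngram_range=(1, 2) you get both single words and bigrams, each with its own Tf-Idf column:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "big data and machine learning are related fields",
    "companies invest in machine learning and big data",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(corpus)
# Terms like "big data" and "machine learning" now appear as their own features.
# On older scikit-learn versions use get_feature_names() instead.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray())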

Related

classification of documents considering the order of words

I'm trying to classify a list of documents. I'm using CountVectorizer and TfidfVectorizer to vectorize the documents before the classification. The results are good, but I think they could be better if we considered not only the presence of specific words in the document but also the order of those words. I know that it is possible to also consider pairs and triples of words, but I'm looking for something more inclusive.
Believe it or not, bag-of-words approaches work quite well on a wide range of text datasets. You've already thought of bi-grams or tri-grams. Let's say you had 10-grams. You would have information about the order of your words, but it turns out there is rarely more than one instance of each 10-gram, so there would be few examples for your classification model to learn from. You could try some other custom feature engineering based on the text, but it would be a good amount of work that rarely helps much. There are other successful approaches in Natural Language Processing, especially from the last few years, but they usually focus on more than word ordering.
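If you do want to fold some local word order into a bag-of-words classifier, a minimal sketch (with made-up documents and labels, not your exact setup) is simply to widen ngram_range:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["the movie was not good", "the movie was good", "great film overall", "bad film overall"]
labels = [0, 1, 1, 0]  # hypothetical sentiment labels

# Unigrams, bigrams and trigrams: bigrams like "not good" carry some
# word-order information without the sparsity of very long n-grams.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["the film was not good"]))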

word2vec gensim multiple languages

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:
model.wv.most_similar(positive = ['man'])
Out[14]:
[('woman', 0.7380284070968628),
('lady', 0.6933152675628662),
('monk', 0.6662989258766174),
('guy', 0.6513140201568604),
('soldier', 0.6491742134094238),
('priest', 0.6440571546554565),
('farmer', 0.6366012692451477),
('sailor', 0.6297377943992615),
('knight', 0.6290514469146729),
('person', 0.6288090944290161)]
Problem is, these are all English words. Then I tried to find the similarity between Hindi and English words with the same meaning:
model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215
This similarity should have been higher than all the others. The Hindi corpus I have was made by translating the English one, so the words appear in similar contexts and should therefore be close.
This is what I am doing here:
# Combining all the words together.
all_reviews = HindiWordsList + EnglishWordsList

# Training the Word2Vec model
import multiprocessing
from gensim.models import Word2Vec

cpu_count = multiprocessing.cpu_count()
model = Word2Vec(size=300, window=5, min_count=1, alpha=0.025,
                 workers=cpu_count, max_vocab_size=None, negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews, total_examples=model.corpus_count, epochs=model.iter)
model.save("word2vec_combined_50.bin")
I have been dealing with a very similar problem and came across a reasonably robust solution. This paper shows that a linear relationship can be defined between two Word2Vec models that have been trained on different languages. This means you can derive a translation matrix to convert word embeddings from one language model into the vector space of another language model. What does all of that mean? It means I can take a word from one language, and find words in the other language that have a similar meaning.
I've written a small Python package that implements this for you: transvec. Here's an example where I use pre-trained models to search for Russian words and find English words with a similar meaning:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN"),
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
So basically, if you can provide a list of words with their translations, then you can train a TranslationWordVectorizer to translate any word that exists in your source language corpus into the target language. When I used this for real, I produced some training data by extracting all the individual Russian words from my data, running them through Google Translate and then keeping everything that translated to a single word in English. The results were pretty good (sorry I don't have any more detail for the benchmark yet; it's still a work in progress!).
First of all, you should really use self.wv.similarity().
I'm assuming there are close to no words that exist in both your Hindi corpus and your English corpus, since the Hindi corpus is in Devanagari and the English one is, well, in English. Simply concatenating the two corpora to train one model does not make sense. Corresponding words in the two languages co-occur across the two versions of a document, but they never co-occur in the sentences you feed to Word2Vec, so it has nothing to learn their similarity from.
E.g. until your model can learn, from the word embeddings, that
Man:Aadmi::Woman:Aurat,
it can never make out the
Raja:King::Rani:Queen
relation. And for that, you need some anchor between the two corpora.
Here are a few suggestions that you can try out:
Make an independent Hindi corpus/model
Maintain and look up a small table of English->Hindi word pairs, which you will have to create manually.
Randomly replace input document words with their counterparts from the corresponding document while training (a rough sketch is shown after these suggestions).
These might be enough to give you an idea. You can also look into seq2seq if you only want to do translation. You can also read the Word2Vec theory in detail to understand what it does.
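A rough sketch of the third suggestion (the tiny corpora and the English->Hindi dictionary below are made-up placeholders; a real translation list would need to be much larger):
import random
from gensim.models import Word2Vec

english_sentences = [["the", "man", "saw", "the", "woman"]]
hindi_sentences = [["आदमी", "ने", "औरत", "को", "देखा"]]
en_to_hi = {"man": "आदमी", "woman": "औरत"}

def randomly_translate(sentence, mapping, p=0.3):
    # Swap a word for its translation with probability p, so words from
    # both languages end up sharing contexts and thus nearby vectors.
    return [mapping.get(w, w) if random.random() < p else w for w in sentence]

train_sentences = hindi_sentences + [randomly_translate(s, en_to_hi) for s in english_sentences]
# On gensim < 4 use size=300 instead of vector_size=300.
model = Word2Vec(train_sentences, vector_size=300, window=5, min_count=1)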
After reading the comments, I think that the problem lies in the very different grammatical construction of English and Hindi sentences. I have worked with Hindi NLP models and it is much more difficult to get results comparable to English (since you mention it).
In Hindi, word order is largely free; grammatical roles are marked by declension rather than position. Moreover, the translation of a sentence between languages that are not even descendants of the same root language is somewhat arbitrary, and you cannot assume that the contexts of the two sentences are similar.

Need of context while using Word2Vec

I have a large number of strings in a list:
A small example of the list contents is :
["machine learning","Apple","Finance","AI","Funding"]
I wish to convert these into vectors and use them for clustering purposes.
Is the context of these strings in the sentences considered while finding out their respective vectors?
How should I go about getting the vectors of these strings if I have just this list containing the strings?
This is the code I have so far:
from gensim.models import Word2Vec
vec = Word2Vec(mylist)
P.S. Also, can I get a good reference/tutorial on Word2Vec?
To find word vectors using word2vec you need a list of tokenized sentences, not a list of bare strings.
What word2vec does is go through every word in a sentence and, for each word, try to predict the words around it in a specified window (usually about 5), adjusting the vector associated with that word so that the error is minimized.
Obviously, this means that the order of words matters when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.
I'm not sure, but I think you will find LDA better suited in this case, because your list of strings has no inherent order in it.
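For concreteness, here is a minimal sketch of the input format Word2Vec expects (the tokenized sentences below are made up):
from gensim.models import Word2Vec

# Each sentence is a list of tokens, not a bare string.
sentences = [
    ["machine", "learning", "startups", "attract", "funding"],
    ["apple", "invests", "in", "ai", "and", "machine", "learning"],
    ["finance", "firms", "adopt", "ai", "for", "fraud", "detection"],
]

# Parameter values are illustrative; on gensim < 4 use size= instead of vector_size=.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv.most_similar("ai"))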
Answers to your 2 questions:
Is the context of these strings in the sentences considered while finding out their respective vectors?
Yes, word2vec creates one vector per word (or per string, since it can treat a multiword expression as a unique word, e.g. New York); this vector describes the word by its context. It assumes that similar words appear in similar contexts. The context is composed of the surrounding words (in a window, with a bag-of-words or skip-gram assumption).
How should I go about with getting the vectors of these strings if i have just this list containing the strings?
You need more words. The quality of Word2Vec's output depends on the size of the training set, so training Word2Vec on just this list makes no sense.
The links provided by @Beta are a good introduction/explanation.
Word2Vec is an artificial neural network method. Word2Vec creates embeddings, which reflect the relationships among words. The links below will help you get the complete code to implement Word2Vec.
Some good links are this and this. For the 2nd link try his github repo for the detail code. He is explaining only major part in the blog. Main article is this.
You can use the following code to convert words to their corresponding integer indices:
from collections import Counter

# Count word frequencies and sort the vocabulary by frequency, most common first.
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)

# Map indices to words and words to indices.
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
word2vec + context = doc2vec
Build sentences from text you have and tag them with labels.
Train doc2vec on tagged sentences to get vectors for each label embedded in the same space as words.
Then you can infer a vector for an arbitrary piece of text and find the labels closest to it.
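A minimal sketch of that idea with gensim's Doc2Vec (the sentences and tags here are invented):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each tokenized sentence with a label; the tags are embedded in the
# same space as the words.
documents = [
    TaggedDocument(words=["machine", "learning", "needs", "data"], tags=["ML"]),
    TaggedDocument(words=["startups", "raise", "funding"], tags=["Finance"]),
]
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new piece of text and find the closest tags
# (on gensim < 4 use model.docvecs instead of model.dv).
vec = model.infer_vector(["apple", "funds", "machine", "learning"])
print(model.dv.most_similar([vec], topn=2))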

R/python: build model from training sentences

What I'm trying to achieve:
I have been looking for an approach for a long while now but I'm not able to find an (effective) way to do this:
build a model from example sentences while taking word order and synonyms into account.
map a sentence against this model and get a similarity score (thus a score indicating how much this sentence fits the model, in other words fits the sentences which were used to train the model)
What I tried:
Python: NLTK in combination with gensim (as far as I could code and read, it was only capable of using word similarity, not taking order into account).
R: used tm to build a TermDocumentMatrix, which looked really promising but I was not able to map anything to this matrix. Furthermore, the TermDocumentMatrix seems to take the order into account but misses the synonyms (I think).
I know the lemmatization didn't go that well hahah :)
Question:
Is there any way to achieve the steps described above using either R or Python? A simple sample code would be great (or references to a good tutorial).
There are many ways to do what you described above, and it will of course take lots of testing to find an optimized solution. But here is some helpful functionality to help solve this using python/nltk.
build a model from example sentences while taking word order and
synonyms into account.
1. Tokenization
In this step you will want to break down individual sentences into a list of words.
Sample code:
import nltk
tokenized_sentence = nltk.word_tokenize('this is my test sentence')
print(tokenized_sentence)
['this', 'is', 'my', 'test', 'sentence']
2. Finding synonyms for each word.
Sample code:
from nltk.corpus import wordnet as wn
synset_list = wn.synsets('motorcar')
print(synset_list)
[Synset('car.n.01')]
Feel free to research synsets if you are unfamiliar, but for now just know the above returns a list, so multiple synsets are possibly returned.
From the synset you can get a list of synonyms.
Sample code:
print( wn.synset('car.n.01').lemma_names() )
['car', 'auto', 'automobile', 'machine', 'motorcar']
Great, now you are able to convert your sentence into a list of words, and you're able to find synonyms for all words in your sentences (while retaining the order of your sentence). Also, you may want to consider removing stopwords and stemming your tokens, so feel free to look up those concepts if you think it would be helpful.
You will of course need to write the code to do this for all sentences, and store the data in some data structure, but that is probably outside the scope of this question.
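If it helps, here is a small sketch of the optional stop-word removal and stemming step mentioned above (the sentence is just a placeholder):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads.
nltk.download('punkt')
nltk.download('stopwords')

tokens = nltk.word_tokenize('this is my test sentence about motorcars')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Drop stop words and reduce the remaining tokens to their stems.
cleaned = [stemmer.stem(t) for t in tokens if t.lower() not in stop_words]
print(cleaned)  # e.g. ['test', 'sentenc', 'motorcar']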
map a sentence against this model and get a similarity score (thus a
score indicating how much this sentence fits the model, in other words
fits the sentences which were used to train the model)
This is difficult to answer since the possibilities to do this are endless, but here are a few examples of how you could approach it.
If you're interested in binary classification you could do something as simple as: have I seen this sentence, or a variation of this sentence, before (a variation being the same sentence but with words replaced by their synonyms)? If so, the score is 1, else the score is 0. This would work, but may not be what you want.
Another example: store each sentence along with its synonyms in a Python dictionary and calculate a score depending on how far down the dictionary you can align the new sentence.
Example:
training_sentence1 = 'This is my awesome sentence'
training_sentence2 = 'This is not awesome'
And here is a sample data structure on how you would store those 2 sentences:
my_dictionary = {
    'this': {
        'is': {
            'my': {
                'awesome': {
                    'sentence': {}
                }
            },
            'not': {
                'awesome': {}
            }
        }
    }
}
Then you could write a function that traverses that data structure for each new sentence and, depending on how deep it gets, gives it a higher score.
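A rough sketch of that traversal, reusing the my_dictionary structure above (scoring by depth reached divided by sentence length is just one arbitrary choice):
def score_sentence(sentence, dictionary):
    # Walk the nested dictionary word by word; the deeper we get before
    # falling off, the more of the sentence matches previously seen text.
    words = sentence.lower().split()
    node = dictionary
    depth = 0
    for word in words:
        if word not in node:
            break
        node = node[word]
        depth += 1
    return depth / len(words) if words else 0.0

print(score_sentence('This is my awesome sentence', my_dictionary))  # 1.0
print(score_sentence('This is something new', my_dictionary))        # 0.5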
Conclusion:
The above two examples are just some possible ways to approach the similarity problem. There are countless articles/whitepapers about computing semantic similarity between text, so my advice would be just explore many options.
I purposely excluded supervised classification models, since you never mentioned having access to labelled training data, but of course that route is possible if you do have a gold standard data source.

How to calculate the distance in meaning of two words in Python

I am wondering if it's possible to calculate the distance/similarity between two related words in Python (like "fraud" and "steal"). These two words are not synonymous per se but they are clearly related. Are there any concepts/algorithms in NLP that can show this relationship numerically? Maybe via NLTK?
I'm not looking for the Levenshtein distance as that relates to the individual characters that make up a word. I'm looking for how the meaning relates.
Would appreciate any help provided.
My suggestion is as follows:
Put each word through the same thesaurus, to get a list of synonyms.
Get the size of the intersection of the two synonym sets.
That is a measure of similarity between the words.
If you would like to do a more thorough analysis:
Also get the antonyms for each of the two words.
Get the size of the intersection of the sets of antonyms for the two words.
If you would like to go further!...
Put each word through the same thesaurus, to get a list of synonyms.
Use the top n (=5, or whatever) words from the query result to initiate a new query.
Repeat this to a depth you feel is adequate.
Make a collection of synonyms from the repeated synonym queries.
Get the size of the intersection of the two words' collections of synonyms.
That is a measure of similarity between the words.
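A minimal sketch of the basic version of this idea, using WordNet lemma names as the thesaurus (the Jaccard ratio of shared synonyms is just one way to turn the overlap into a score):
from nltk.corpus import wordnet as wn
# Requires a one-time: import nltk; nltk.download('wordnet')

def synonym_set(word):
    # Collect the lemma names of every sense of the word.
    return {lemma.name() for synset in wn.synsets(word) for lemma in synset.lemmas()}

def synonym_overlap(word1, word2):
    s1, s2 = synonym_set(word1), synonym_set(word2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

print(synonym_overlap('fraud', 'steal'))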
NLTK's wordnet is the tool you'd want to use for this. First get the set of all the senses of each word using:
synonymSet = wordnet.synsets(word)
Then loop through each possible sense of each of the 2 words and compare them to each other in a nested loop:
similarity = synonym1.res_similarity(synonym2,semcor_ic)
Either average that value or use the maximum you find; up to you.
This example is using a word similarity comparison that uses "IC" or information content. This will score similarity higher if the word is more specific, or contains more information, so generally it's closer to what we mean when we think about word similarity.
To use this stuff you'll need the imports and variables:
import nltk
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic
# If needed, run nltk.download('wordnet') and nltk.download('wordnet_ic') first
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
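Putting the pieces together, here is a sketch of the nested loop taking the maximum over sense pairs (res_similarity only works between senses of the same part of speech, so mismatched pairs are skipped):
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic
from nltk.corpus.reader.wordnet import WordNetError

semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def max_res_similarity(word1, word2):
    best = 0.0
    for syn1 in wordnet.synsets(word1):
        for syn2 in wordnet.synsets(word2):
            try:
                score = syn1.res_similarity(syn2, semcor_ic)
            except WordNetError:
                continue  # e.g. different parts of speech
            best = max(best, score)
    return best

print(max_res_similarity('fraud', 'steal'))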
As @jose_bacoy suggested above, the Gensim library can provide a measure of similarity between words using the word2vec technique. The example below is adapted from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download the pretrained Google News word vectors and load them:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. To use your example:
word_vectors.similarity('fraud', 'steal')
>>> 0.19978741
Twenty percent similarity may be a surprisingly low level of similarity to you if you consider these words to be similar. But consider that fraud is a noun and steal is generally a verb. This will give them very different associations as viewed by word2vec.
They become much more similar if you modify the noun to become a verb:
word_vectors.similarity('defraud', 'steal')
>>> 0.43293646
