I need to cluster sentences according to the common n-grams they contain. I can extract n-grams easily with NLTK, but I have no idea how to perform clustering based on n-gram overlap, which is why I couldn't write real code for that part; sorry for that. I wrote six simple sentences and the expected output to illustrate the problem.
import nltk
from nltk.util import ngrams

sentences = """I would like to eat pizza with her.
She would like to eat pizza with olive.
There are some sentences must be clustered.
These sentences must be clustered according to common trigrams.
The quick brown fox jumps over the lazy dog.
Apples are red, bananas are yellow."""

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokens = sent_detector.tokenize(sentences.strip())

mytrigrams = []
for sentence in sentence_tokens:
    trigrams = ngrams(sentence.lower().split(), 3)
    mytrigrams.append(list(trigrams))
After this I have no idea how to cluster them according to common trigrams (I am not even sure whether this part is okay). I tried to do it with itertools.combinations but got lost, and I didn't know how to generate multiple clusters, since the number of clusters cannot be known without comparing each sentence with every other one. The expected output is given below; thanks in advance for any help.
Cluster1: 'I would like to eat pizza with her.'
          'She would like to eat pizza with olive.'
Cluster2: 'There are some sentences must be clustered.'
          'These sentences must be clustered according to common trigrams.'
Sentences that do not belong to any cluster:
          'The quick brown fox jumps over the lazy dog.'
          'Apples are red, bananas are yellow.'
EDIT: I have tried with combinations one more time, but it didn't cluster at all; it just returned all the sentence pairs (obviously I did something wrong).
from itertools import combinations

new_dict = {k: v for k, v in zip(sentence_tokens, mytrigrams)}
common = []
no_cluster = []
sentence_pairs = combinations(new_dict.keys(), 2)
for keys, values in new_dict.items():
    for values in sentence_pairs:
        sentence1 = values[0]
        sentence2 = values[1]
        # print(sentence1, sentence2)
        if len(set(sentence1) & set(sentence2)) != 0:
            common.append((sentence1, sentence2))
        else:
            no_cluster.append((sentence1, sentence2))
print(common)
But even if this code worked, it would not give the output I expect, as I don't know how to generate multiple clusters based on common n-grams.
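For reference, here is a minimal sketch of one way to produce the expected grouping above. It assumes (based on the expected output, not stated explicitly) that two sentences belong together whenever they share at least one trigram, and takes the connected components of that relation as clusters; it reuses sentence_tokens and mytrigrams from the code above.

# Minimal sketch (assumption: sentences sharing at least one trigram belong
# to the same cluster; clusters are connected components of that relation).
from itertools import combinations

trigram_sets = {sent: set(tris) for sent, tris in zip(sentence_tokens, mytrigrams)}

# Link sentences that share at least one trigram.
neighbours = {sent: set() for sent in trigram_sets}
for s1, s2 in combinations(trigram_sets, 2):
    if trigram_sets[s1] & trigram_sets[s2]:
        neighbours[s1].add(s2)
        neighbours[s2].add(s1)

# Collect connected components with a simple depth-first search.
clusters, no_cluster, seen = [], [], set()
for sent in trigram_sets:
    if sent in seen:
        continue
    component, stack = [], [sent]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        component.append(current)
        stack.extend(neighbours[current] - seen)
    if len(component) > 1:
        clusters.append(component)
    else:
        no_cluster.append(sent)

for i, cluster in enumerate(clusters, 1):
    print("Cluster{}: {}".format(i, cluster))
print("Sentences that do not belong to any cluster:", no_cluster)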
To understand your problem better, it would help if you explained the purpose and the expected outcome.
Using n-grams is something that must be done very carefully: when you use n-grams, you increase the number of dimensions of your dataset.
I advise you to first use TF-IDF, and only if you have not reached the minimum hit rate should you then move on to n-grams.
If you can explain your problem better, I can see if I can help you.
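To illustrate the TF-IDF route suggested above, here is a rough sketch (scikit-learn is an assumption, and it reuses sentence_tokens from the question; what threshold counts as "similar" would still be up to you):

# Hedged sketch of the TF-IDF suggestion (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectors = TfidfVectorizer().fit_transform(sentence_tokens)
similarities = cosine_similarity(vectors)  # pairwise sentence similarities
print(similarities.round(2))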
Related
It may be a naïve approach that I use to recognise and extract colour names despite slight variations or misspellings in texts; a first attempt also works better in English than in German, but the challenges seem to be approximately the same.
Different spellings such as grey/gray or weiß/weiss, where the difference from a human perspective does not seem huge, but according to word2vec, grey and green are more similar than grey and gray.
Colours not yet known or available in color_list, such as brown in the following case. It may not be the best example, but perhaps the colour can be deduced from the context of the sentence, just as you as a human get an idea that it could be a colour.
Both cases could presumably be covered by extending the vocabulary with a lot of other colour names. But not knowing about such combinations in the first place seems difficult.
Does anyone see another adjusting screw, or even a completely different procedure, that could possibly achieve even better results?
from collections import Counter
from math import sqrt
import pandas as pd

# list of known colors
colors = ['red', 'green', 'yellow', 'black', 'gray']

# dataframe of sentences that may or may not contain colors
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'text': ['grey donkey with black mane',
             'brown dog with sharp teeth',
             'red cat with yellowish / seagreen glowing eyes',
             'proud rooster with red comb']
})

# create a character-count vector for a word
def word2vec(word):
    cw = Counter(word)
    sw = set(cw)
    lw = sqrt(sum(c * c for c in cw.values()))
    return cw, sw, lw

# cosine distance between a word and a color
def cosdis(v1, v2):
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

df['color_matches'] = [[(w, round(cd, 2), c)
                        for w in s.split()
                        for c in colors
                        if (cd := cosdis(word2vec(c), word2vec(w))) >= 0.85]
                       for s in df.text]
   id  text                                             color_matches
0   1  grey donkey with black mane                      [('black', 1.0, 'black')]
1   2  brown dog with sharp teeth                       []
2   3  red cat with yellowish / seagreen glowing eyes   [('red', 1.0, 'red'), ('yellowish', 0.85, 'yellow'), ('seagreen', 0.91, 'green')]
3   4  proud rooster with red comb                      [('red', 1.0, 'red')]
Your best strategy is to begin by having a list of colors ahead of time. A color includes adjectives which contain the word "color" in their definition. I said includes because this doesn't cover all cases: The type of edge case that can kill you would be something like
"The yellowish jacket matched his yellow shoes".
This has the problem that "yellowish" in the Oxford dictionary is defined as:
adjective: having a yellow tinge; slightly yellow.
Now you can do a little bit of recursion here:
First colors are adjectives which contain the word "color" in their definition
Second colors are adjectives which contain a <first color> in their definition
Third colors are adjectives which contain a <second color> in their definition.
etc...
Mining this from a dictionary dataset can let you scoop up as many colors as possible. You might need to be a little careful here, though, and only select adjectives whose definitions include a phrase of the form adverb color_of_lower_rank.
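As a rough illustration of that mining step, here is a hedged sketch using WordNet through NLTK (an assumption: the answer doesn't name a particular dictionary, and the matching rule below is much cruder than the adverb-plus-lower-rank-color pattern just described, so expect noise):

# Hedged sketch: iteratively grow a colour lexicon from WordNet adjective
# definitions (requires nltk and the 'wordnet' corpus; matching is deliberately crude).
from nltk.corpus import wordnet as wn

def expand_colors(seed_words, rounds=2):
    known = {w.lower() for w in seed_words}
    for _ in range(rounds):
        newly_found = set()
        for pos in ('a', 's'):          # adjectives and satellite adjectives
            for syn in wn.all_synsets(pos):
                definition_words = set(syn.definition().lower().split())
                if definition_words & known:
                    newly_found.update(name.lower() for name in syn.lemma_names())
        known |= newly_found
    return known

# Round 1 picks up adjectives defined via "color"/"colour";
# round 2 picks up adjectives defined via those first-round words, and so on.
candidate_colors = expand_colors({'color', 'colour'})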
Once you have a set of colors, compound colors such as "blue-green" become tractable. There are also ill-defined colors such as "royal blue". Parsing these is more difficult because you need to know whether the "royal" refers to the blue or to the object, e.g.:
"The prince's royal blue cloak was beautiful"
The royal property here has to do with the fact that it's a prince's cloak.
vs
"The shirt was a beautiful royal blue".
Here you can just imagine a shirt colored a beautiful shade of blue that you would consider "royal blue".
So in general parsing adverb-adjective phrases can get a bit complex.
The space of color names is so small, compared to all words, that I'd suggest you mainly focus on hand-curated lists of colors.
You could initially extract such a lexicon from reference materials, whether they're general references (like WordNet) or domain-specific documentation (like, say, HTML Color Names - even if you limit it to the subset of one-word names).
If you then need to expand your fixed-list to include other novel colors, I'd expect that the words very-close to known-colors, in a suitably well trained word2vec model, are likely to be other, and related, colors. You'd probably need some manual review, but again: the space of colors isn't that large, so a manual process seems reasonable.
Your comment that word2vec placed green closer to grey than gray surprises me. Are you sure you weren't using some severely-impoverished word2vec model, far undertrained on insufficient data, or poorly-parameterized? Placing variant-spellings of the same concept near each other is something a word2vec model, trained on sufficient data, usually does very very well. Have you tried looking at the nearest-neighbors of known color-names in a large, sufficiently-trained word2vec model, like the large 3M word GoogleNews model released by Google circa 2013?
(Note: no toy-sized examples, like the 4-text, ~25-word dataset shown in your code, will show anything useful from word2vec-style algorithms, which require many subtly-contrasting realistic uses of words in context to generate good vectors for those words.)
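For example, one quick way to inspect those nearest neighbours with gensim (an assumption; the GoogleNews vectors have to be downloaded separately, and the file name below may differ on your machine):

# Hedged sketch: nearest neighbours of known color names in a large
# pretrained word2vec model, using gensim (file path is an assumption).
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                       binary=True)
for color in ['grey', 'yellow', 'green']:
    print(color, [word for word, _ in kv.most_similar(color, topn=10)])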
Is there a way in Python to equate strings based on their meaning, despite them not being lexically similar?
For example,
temp. Max
maximum ambient temperature
I've tried using fuzzywuzzy and difflib and although they are generally good for this using token matching, they also provide false positives when I threshold the outputs over a large number of strings.
Is there some other method using NLP or tokenization that I'm missing here?
Edit:
The answer provided by A CO does solve the problem mentioned above, but is there any way to match specific substrings using word2vec, given a key?
e.g. Key = max temp
Sent = the maximum ambient temperature expected tomorrow in California is 34 degrees.
So here I'd like to get the substring "maximum ambient temperature". Any tips on that?
As you say, packages like fuzzywuzzy or difflib will be limited because they compute similarities based on the spelling of the strings, not on their meaning.
You could use word embeddings. Word embeddings are vector representations of words, computed in a way that captures their meaning to a certain extent.
There are different methods for generating word embeddings, but the most common one is to train a neural network on one, or a set, of word-level NLP tasks and use the penultimate layer as the representation of the word. This way, the final representation of the word is supposed to have accumulated enough information to complete the task, and this information can be interpreted as an approximation of the meaning of the word. I recommend that you read a bit about word2vec, which is the method that made word embeddings popular; it is simple to understand and representative of what word embeddings are. Here is a good introductory article. The similarity between two words can then be computed, usually with the cosine distance between their vector representations.
Of course, you don't need to train word embeddings yourself, as there are plenty of pretrained vectors available (GloVe, word2vec, fastText, spaCy...). The choice of which embeddings to use depends on the observed performance and on your understanding of how well suited they are to the task you want to perform. Here is an example with spaCy's word vectors, where the sentence vector is computed by averaging the word vectors:
# Importing spacy and fuzzy wuzzy
import spacy
from fuzzywuzzy import fuzz
# Loading spacy's large english model
nlp_model = spacy.load('en_core_web_lg')
s1 = "temp. Max"
s2 = "maximum ambient temperature"
s3 = "the blue cat"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors (The document or sentence vector is the average of the word vectors it contains)
print("Document vectors similarity between '{}' and '{}' is: {:.4f} ".format(s1, s2, doc1.similarity(doc2)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, doc1.similarity(doc3)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, doc2.similarity(doc3)))
# Fuzzy logic
print("Character ratio similarity between '{}' and '{}' is: {:.4f} ".format(s1, s2, fuzz.ratio(doc1, doc2)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, fuzz.ratio(doc1, doc3)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, fuzz.ratio(doc2, doc3)))
This will print:
>>> Document vectors similarity between 'temp. Max' and 'maximum ambient temperature' is: 0.6432
>>> Document vectors similarity between 'temp. Max' and 'the blue cat' is: 0.3810
>>> Document vectors similarity between 'maximum ambient temperature' and 'the blue cat' is: 0.3117
>>> Character ratio similarity between 'temp. Max' and 'maximum ambient temperature' is: 28.0000
>>> Character ratio similarity between 'temp. Max' and 'the blue cat' is: 38.0000
>>> Character ratio similarity between 'maximum ambient temperature' and 'the blue cat' is: 21.0000
As you can see, the similarity computed from word vectors better reflects the similarity in meaning of the documents.
However, this is really just a starting point, as there can be plenty of caveats. Here is a list of some of the things you should watch out for:
Word (and document) vectors do not represent the meaning of the word (or document) per se, they are a way to approximate it. That implies that they will hit a limitation at some point and you cannot take for granted that they will allow you to differentiate all nuances of the language.
What we expect to be the "similarity in meaning" between two words/sentences varies according to the task we have. As an example, what would be the "ideal" similarity between "maximum temperature" and "minimum temperature"? High because they refer to an extreme state of the same concept, or low because they refer to opposite states of the same concept? With word embeddings, you will usually get a high similarity for these sentences, because as "maximum" and "minimum" often appear in the same contexts the two words will have similar vectors.
In the example given, 0.6432 is still not a very high similarity. This probably comes from the use of abbreviated words in the example. Depending on how the word embeddings were generated, they might not handle abbreviations well. In general, it is better to give syntactically and grammatically correct inputs to NLP algorithms. Depending on what your dataset looks like and your knowledge of it, doing some cleaning beforehand can be very helpful. Here is an example with grammatically correct sentences that highlights the similarity in meaning better:
s1 = "The president has given a good speech"
s2 = "Our representative has made a nice presentation"
s3 = "The president ate macaronis with cheese"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors
print(doc1.similarity(doc2))
>>> 0.8779
print(doc1.similarity(doc3))
>>> 0.6131
print(doc2.similarity(doc3))
>>> 0.5771
Anyway, word embeddings are probably what you are looking for but you need to take the time to learn about them. I would recommend that you read about word (and sentence, and document) embeddings and that you play a bit around with different pretrained vectors to get a better understanding of how they can be used for the task you have.
I am learning natural language processing, currently the topic of bigrams. At this stage, I am having difficulty with the Python computation, but I am trying.
I will be using this corpus, which has not been tokenized, as my main raw dataset. I can generate the bigram results using the nltk module. However, my question is how to compute, in Python, the bigrams that contain specific words. More specifically, I wish to find all the bigrams available in corpus_A that contain words from word_of_interest.
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
I want to get the bigrams for each of the individual words from the word_of_interest list. Next, I want to get the frequency of each bigram based on its appearance in corpus_A. With the frequencies available, I want to sort and print out the bigrams by their probability, from highest to lowest.
I have tried out code found through online searches, but it does not give me any output. The code is shown below:
for i in corpus:
    bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
    bigram_j = lambda i[x] not in i
    x += 1
print(bigram_j)
Unfortunately, the output did not return what I am planning to achieve.
Please advise me. The output I want will contain the bigrams with the specific words from word_of_interest and their probabilities, sorted as shown below.
[((santa, clauss), 0.89), ((he, and), 0.67), ((stands, firm), 0.34)]
You can try this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(2, 2), use_idf=False)

corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']

matrix = vec.fit_transform(corpus).toarray()
vocabulary = vec.get_feature_names()

all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:
            index = vocabulary.index(bigram)
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)

df = pd.DataFrame({'bigram': all_bigrams, 'frequency': all_frequencies})
df.sort_values('frequency', inplace=True)
df.head()
The output is a pandas dataframe showing the bigrams sorted by frequency.
bigram frequency
0 (for, santa) 0.109764
19 (stands, firm) 0.109764
18 (he, stands) 0.109764
17 (their, hand) 0.109764
16 (hand, to) 0.109764
The rationale here is that TfidfVectorizer counts how many times each token appears in each document of the corpus, computes the term-specific frequency, and stores this information in a column associated with that token. The index of that column is the same as the index of the associated token in the vocabulary, which is retrieved with the .get_feature_names() method on the already-fitted vectorizer.
Then you just have to select the rows of the matrix containing the relative frequencies of the tokens and sum along the column of interest.
The double-nested for loop is not ideal, though, and there may be a more efficient implementation of it. The issue is that get_feature_names returns a list of strings of the form 'first_token second_token', not tuples.
I would be interested in seeing a better implementation of the second half of the above code.
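For what it's worth, here is a hedged sketch of one way the second half could be written without the nested loop. Note that it matches whole tokens rather than substrings, which is a slight change in behaviour compared to the original "if word in bigram" test; it reuses matrix, vocabulary and word_of_interest from the code above.

# Hedged sketch: a single pass over the vocabulary instead of the nested loop.
# Unlike "if word in bigram", this matches whole tokens only.
import pandas as pd

interest = set(word_of_interest)
rows = [
    {'bigram': tuple(bigram.split(' ')), 'frequency': matrix[:, index].sum()}
    for index, bigram in enumerate(vocabulary)
    if interest & set(bigram.split(' '))
]
df = pd.DataFrame(rows).sort_values('frequency', ascending=False)
df.head()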
I want to find the unique strings in a list of strings, based on a specified similarity percentage (in Python). However, the strings I keep should be significantly different from each other: if there is only a small difference between two strings, it is not interesting to me.
I can loop through the strings to find their similarity percentage, but I was wondering if there is a better way to do it.
For example,
String A: He is going to school.
String B: He is going to school tomorrow.
Let's say that these two strings are 80% similar.
Similarity: strings with the exact same words in the same order are most similar. A string is 100% similar to itself.
It's a bit of a vague definition, but it works for my use case.
If you want to measure how similar two sentences are, and you care about them having the exact same word ordering, then you can use the single-sentence BLEU score.
I would use the sentence_bleu found here: http://www.nltk.org/_modules/nltk/translate/bleu_score.html
You will need to make sure that you do something with your weights for short sentences. An example from something I have done in the past is
from nltk.translate.bleu_score import sentence_bleu
from nltk import word_tokenize

sentence1 = "He is a dog"
sentence2 = "She is a dog"

reference = word_tokenize(sentence1.lower())
hypothesis = word_tokenize(sentence2.lower())

if min(len(hypothesis), len(reference)) < 4:
    weighting = 1.0 / min(len(hypothesis), len(reference))
    weights = tuple([weighting] * min(len(hypothesis), len(reference)))
else:
    weights = (0.25, 0.25, 0.25, 0.25)

bleu_score = sentence_bleu([reference], hypothesis, weights=weights)
Note that single sentence BLEU is quite bad at detecting similar sentences with different word orderings. So if that's what you're interested in then be careful. Other methods you could try are document similarity, Jaccard similarity, and cosine similarity.
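As a quick illustration of the Jaccard option mentioned above (a sketch only; whether word-set overlap matches your notion of similarity is an assumption, since it ignores word order):

# Hedged sketch: Jaccard similarity over word sets, a 0-1 score that can be
# read as a percentage. Word order is ignored.
def jaccard_similarity(a, b):
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard_similarity("He is going to school.",
                         "He is going to school tomorrow."))  # ~0.83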
I'm familiar with k-means for clustering data points, but not text. I have one column of words (some rows have only one word, some have more, etc.) in CSV format, and I want to cluster the rows that share a similar word or more, then map those clusters to numbers used as an index; those index numbers need to be added as a second column. I know there are scipy packages and word2vec in Python, but this is the first time I am dealing with clustering text. Any idea on how to do this? Any code examples would be appreciated.
Edit: What I want is not words that are similar in meaning; I want rows that share the exact same text. For example, say we have three entries in three different rows: Heart attack, Heart failure, Heart broken. I want those rows to be in one cluster because they have the common word "Heart". And by the way, all the rows are connected with each other somehow, so what I really want is to cluster on the exact words.
from csv import DictReader
import sets

### converting my csv file into a list
with open("export.csv") as f:
    my_list = [row["BASE_NAME"] for row in DictReader(f)]
# print(my_list)

## having every word in the csv file
Set = list()
for item in my_list:
    MySet = list(set(item.split(' ')))
    Set.append(MySet)
# print(Set)

cleanlist = []
[cleanlist.append(x) for x in Set if x not in cleanlist]
print(cleanlist[1])
# print(cleanlist)

### my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
# for item in my_list:
for i in xrange(len(cleanlist)):
    # matching = [s for s in my_list if cleanlist[i] in s]
    # matching = [x for x in my_list if cleanlist[i] in x]
    matching = any(cleanlist[[i]] in item for item in my_list)
    print(matching)
Sample of my_list is ['Carbon Monoxide (Blood)', 'Carbon Monoxide Poisoning', 'Carbonic anhydrase inhibitor administered']
Sample of cleanlist is [['Antibody', 'Cardiolipin'], ['Cardiomegaly'], ['Cardiomyopathy'], ['Cardiopulmonary', 'Resuscitation', '(CPR)'], ['Diet', 'Cardiovascular'], ['Disease', 'Cardiovascular']]
SOLVED [Now I'm having problems: my cleanlist does not contain only one item per index, which makes the comparison for matching hard. How do I fix that?]
Also, I want to create a list for each comparison, so for each comparison against the clean list I want to create one list that holds the words they have in common. Any help with that, please?
I'm a newbie to machine learning, but maybe I can give you some advice.
Let's suppose we have one row:
rowText = "mary have a little lamb"
Then split the words and put them in a set:
MySet = set(rowText.split(' '))
Now we can add more rows to this set; in the end we get a big set containing all words.
MySet.update(newRowText.split(' '))
Now we should remove some unimportant words such as "a", "an", "is", "are", etc.
Convert the set back to an ordered list and check the length of the list. Now we can create an N-dimensional space. For example, if the total list is ["Mary", "has", "lamb", "happy"] and the row is "Mary has a little lamb", we can convert the sentence to [1, 1, 1, 0].
Now you can do the clustering, right?
If you find that "Mary" is very important, you can adjust the weight of "Mary", e.g. [2, 1, 1, 0].
For the weighting process you can take inspiration from Bayes, in my opinion.
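A minimal sketch of the idea described above, assuming scikit-learn (an assumption; the answer itself only builds the vectors by hand, and CountVectorizer plus KMeans stand in for those manual steps):

# Minimal sketch of the bag-of-words + clustering idea described above
# (scikit-learn is an assumption; stop words are removed as suggested).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

rows = ["Heart attack", "Heart failure", "Heart broken",
        "Carbon Monoxide Poisoning", "Carbon Monoxide (Blood)"]

vectors = CountVectorizer(stop_words='english').fit_transform(rows)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(list(zip(rows, labels)))  # inspect which rows were grouped together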
Clustering text is hard, and most approaches do not work well. Clustering single words essentially requires a lot of background knowledge.
If you had longer text, you could measure similarity by the words they have in common.
But for single words, this approach does not work.
Consider:
Apple
Banana
Orange
Pear
Pea
To a human with a lot of background knowledge, apple and pear are arguably the two most similar. To a computer that has only these 3 to 6 bytes of string data, pear and pea are the most similar words.
You see: language is a lot about background knowledge and associations. A computer that cannot associate both "apple" and "pear" with "fruit growing on a tree, usually green outside and white inside, black seeds in the center and a stem on top, about the size of a palm usually" cannot recognize what these have in common, and thus cannot cluster them.
For clustering you need some kind of distance measurement. I suggest using the Hamming Distance (see https://en.wikipedia.org/wiki/Hamming_distance). I think it's common to use it to measure similarity between two words.
Edit:
For your examples this would mean:
Heart attack  / Heart failure => dist 7
Heart attack  / Heart broken  => dist 6
Heart failure / Heart broken  => dist 7
Heart broken  / Banana        => dist 12
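Strictly speaking, Hamming distance is defined for strings of equal length; here is a hedged sketch that pads the shorter string (which reproduces the distances above):

# Hedged sketch: character-level Hamming distance, padding the shorter string
# so the extra characters count as mismatches.
from itertools import zip_longest

def hamming(a, b):
    return sum(x != y for x, y in zip_longest(a, b, fillvalue=''))

print(hamming("Heart attack", "Heart failure"))  # 7
print(hamming("Heart attack", "Heart broken"))   # 6
print(hamming("Heart failure", "Heart broken"))  # 7
print(hamming("Heart broken", "Banana"))         # 12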