I am following this example of semantic clustering:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.']
corpus_embeddings = embedder.encode(corpus)
# Perform k-means clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])
for i, cluster in enumerate(clustered_sentences):
    print("Cluster", i + 1)
    print(cluster)
    print(len(cluster))
    print("")
This results in the following lists:
Cluster 1
['The girl is carrying a baby.', 'The baby is carried by the woman']
2
Cluster 2
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
2
Cluster 3
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']
3
Cluster 4
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']
2
Cluster 5
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']
2
How can I combine these separate lists into one?
Expected outcome:
list2 = [['The girl is carrying a baby.', 'The baby is carried by the woman'], ..., ['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']]
I tried the following:
list2 = []
for i in cluster:
    list2.append(i)
list2
But it returns only the last one:
['A monkey is playing drums.',
'Someone in a gorilla costume is playing a set of drums.']
Any ideas?
Following that example, you don't need to do anything to get a list of lists; that's already been done for you.
Try printing clustered_sentences.
If, on the other hand, you need a "flat" list from a list of lists, you can achieve that with a Python list comprehension:
flat = [item for sub in clustered_sentences for item in sub]
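For example, continuing from the snippet above (the cluster order can vary between runs, since KMeans starts from a random initialization):
print(clustered_sentences)  # already a list of 5 lists, one per cluster
print(flat)                 # all 11 sentences in one flat list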
I wonder if it is possible to incrementally encode a corpus using Python's sentence_transformers. An example is the following:
from sentence_transformers import SentenceTransformer, util
import torch
embedder = SentenceTransformer('all-MiniLM-L6-v2')
corpus = ['I drank an apple.',
          'A man is eating food.',
          'A man is eating a piece of bread.']
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
Suppose now I want to incorporate
corpus2 = ['I ate an apple.', 'A man is drinking food.']
into corpus_embeddings. What should I do?
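One possible approach (a sketch, not an official incremental API; it simply relies on encode() embedding each sentence independently) is to encode only the new batch and concatenate it onto the existing tensor:
# embed only the new sentences; the old embeddings need no recomputation
new_embeddings = embedder.encode(corpus2, convert_to_tensor=True)

corpus += corpus2  # keep the sentence list aligned with the embedding rows
corpus_embeddings = torch.cat([corpus_embeddings, new_embeddings], dim=0)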
I am currently learning to work with NLP. One of the problems I am facing is finding the most common n words in a text. Consider the following:
text=['Lion Monkey Elephant Weed','Tiger Elephant Lion Water Grass','Lion Weed Markov Elephant Monkey Fine','Guard Elephant Weed Fortune Wolf']
Suppose n = 2. I am not looking for the most common bigrams; I am searching for the n words that occur together most often in the text. For instance, the output for the above should give:
'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2
and so on.
Could anyone suggest a suitable way to tackle this?
I would suggest using Counter and combinations as follows.
from collections import Counter
from itertools import combinations, chain
text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']
def count_combinations(text, n_words, n_most_common=None):
    count = []
    for t in text:
        words = t.split()
        combos = combinations(words, n_words)  # all n-word combinations within this string
        # sort each combination so that, e.g., 'Elephant & Lion' and 'Lion & Elephant' share one key
        count.append([" & ".join(sorted(c)) for c in combos])
    return dict(Counter(sorted(list(chain(*count)))).most_common(n_most_common))
count_combinations(text, 2)
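For the sample text above, this returns the pair counts; asking for the top four, for example (ties come out in alphabetical order because the list is sorted before counting):
count_combinations(text, 2, n_most_common=4)
# {'Elephant & Lion': 3, 'Elephant & Weed': 3, 'Elephant & Monkey': 2, 'Lion & Monkey': 2}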
It was tricky, but I solved it for you :-) I used the space count to detect how many words each element contains: an element with exactly two words must contain exactly one space, so only the two-word elements are printed.
l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
    if elem.count(" ") == 1:
        print(elem)
Output:
hello world
wassap babe
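A slightly more robust variant (a sketch) counts words with split(), which also tolerates repeated, leading, or trailing spaces:
l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
two_word = [elem for elem in l if len(elem.split()) == 2]  # split() handles any whitespace
print(two_word)  # ['hello world', 'wassap babe']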
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
message = 'When the mutton and an omelet had been served and a samovar and vodka brought, with some wine which the French had taken from a Russian cellar and brought with them, Ramballe invited Pierre to share his dinner, and himself began to eat greedily and quickly like a healthy and hungry man, munching his food rapidly with his strong teeth, continually smacking his lips, and repeating- Excellent! Delicious! His face grew red and was covered with perspiration. Pierre was hungry and shared the dinner with pleasure. Morel, the orderly, brought some hot water in a saucepan and placed a bottle of claret in it. He also brought a bottle of kvass, taken from the kitchen for them to try. That beverage was already known to the French and had been given a special name. They called it limonade de cochon (pigs lemonade), and Morel spoke well of the limonade de cochon he had found in the kitchen. But as the captain had the wine they had taken while passing through Moscow, he left the kvass to Morel and applied himself to the bottle of Bordeaux. He wrapped the bottle up to its neck in a table napkin and poured out wine for himself and for Pierre. The satisfaction of his hunger and the wine rendered the captain still more lively and he chatted incessantly all through dinner'
key1 = 11
key2 = 3
def affine(message, key1, key2):
    for m in message:
        output = ''
        if m.upper() not in ALPHABET:
            output += m
            continue
        index = ALPHABET.find(m.upper())
        newIndex = (index * key1 + key2) % len(ALPHABET)
        output += ALPHABET.find(newIndex)
    return output
ALPHABET.find() requires that the argument be a string to search for in ALPHABET; newIndex is a number, not a letter. You just want:
output += ALPHABET[newIndex]
There is a second bug: output = '' is re-initialized inside the loop, so every iteration discards what was built so far and the function ends up returning only the last enciphered character. Initialize output once, before the loop.
Try this code:
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
message = 'When the mutton and an omelet had been served and a samovar and vodka brought, with some wine which the French had taken from a Russian cellar and brought with them, Ramballe invited Pierre to share his dinner, and himself began to eat greedily and quickly like a healthy and hungry man, munching his food rapidly with his strong teeth, continually smacking his lips, and repeating- Excellent! Delicious! His face grew red and was covered with perspiration. Pierre was hungry and shared the dinner with pleasure. Morel, the orderly, brought some hot water in a saucepan and placed a bottle of claret in it. He also brought a bottle of kvass, taken from the kitchen for them to try. That beverage was already known to the French and had been given a special name. They called it limonade de cochon (pigs lemonade), and Morel spoke well of the limonade de cochon he had found in the kitchen. But as the captain had the wine they had taken while passing through Moscow, he left the kvass to Morel and applied himself to the bottle of Bordeaux. He wrapped the bottle up to its neck in a table napkin and poured out wine for himself and for Pierre. The satisfaction of his hunger and the wine rendered the captain still more lively and he chatted incessantly all through dinner'
key1 = 11
key2 = 3
def affine(message, key1, key2):
    output = ''  # initialize once, before the loop
    for m in message:
        if m.upper() not in ALPHABET:
            output += m  # pass non-letters through unchanged
            continue
        index = ALPHABET.find(m.upper())
        newIndex = (index * key1 + key2) % len(ALPHABET)
        output += ALPHABET[newIndex]  # index into ALPHABET instead of calling find()
    return output
affine(message, key1, key2)
Output (the beginning of the enciphered text):
'LCVQ ECV FPEEBQ ...'
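The same cipher can also be written more compactly with a generator expression and str.join (a sketch equivalent to the fixed function above):
def affine(message, key1, key2):
    # encipher letters, pass every other character through unchanged
    return ''.join(
        ALPHABET[(ALPHABET.find(m.upper()) * key1 + key2) % len(ALPHABET)]
        if m.upper() in ALPHABET else m
        for m in message)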
I have a column in my DataFrame, and my requirements are:
Repeated word forms should not be present in each row: if I have dog and dogs in my string, only dog should be present.
Case sensitivity: even if I have dog and Dogs, it should remove Dogs and give back only dog.
Special cases like dog and dog's: remove dog's so that the result contains only dog.
Please find the example code I used below.
1. Tried stemming and lemmatizing the data, but it was not accurate.
2. Also used spaCy to lemmatize the data, with the same result.
The output looks a bit better, but while lemmatizing, other words are affected.
from nltk.stem import SnowballStemmer

m = 'dog ran out of Dogs and Dog ran out of cat and dog''s adidas'  # note: 'dog''s' is two adjacent literals and evaluates to "dogs"
try:
    def stem(tokens):
        x = []
        stemmer = SnowballStemmer(language='english')
        for token in tokens:
            x.append(stemmer.stem(token))
        return x
except:
    print('problem at stemming')
s12 = ' '.join(stem(m.split()))

##### Then the code for duplicate removal
try:
    def unique_list(list1):
        marker = set()
        result = [not marker.add(x.casefold()) and x for x in list1 if x.casefold() not in marker]
        return result
except:
    print("Problem in removing duplicates")
s5 = ' '.join(unique_list(s12.split()))
Actual string: 'dog ran out of Dogs and Dog ran out of cat and dog''s adidas'
Actual result: 'dog ran out of dog and dog ran out of cat and dog adida'
So, in the actual result, it is also stemming adidas, the last word in the string: it becomes 'adida' instead of 'adidas'.
Expected result: 'dog ran out of dog and dog ran out of cat and dog adidas'
I need your thoughts or help in resolving this issue.
from nltk import WordNetLemmatizer
lemm = WordNetLemmatizer()
sent = 'dog ran out of Dogs and Dog ran out of cat and dog''s adidas'
word_token = [y.lower() for y in sent.split()]
print(' '.join([lemm.lemmatize(word,'n') for word in word_token]))
# Output:
'dog ran out of dog and dog ran out of cat and dog adidas'
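Since the original question is about a DataFrame column, the same normalization can be applied row by row; here is a minimal sketch, assuming pandas and a hypothetical column name text:
import pandas as pd

def normalize(sent):
    # lowercase each token, then lemmatize it as a noun
    return ' '.join(lemm.lemmatize(y.lower(), 'n') for y in sent.split())

df = pd.DataFrame({'text': [sent]})  # hypothetical one-row frame, reusing sent from above
df['text'] = df['text'].apply(normalize)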
park = "a park.shp"
road = "the roads.shp"
school = "a school.shp"
train = "the train"
bus = "the bus.shp"
mall = "a mall"
ferry = "the ferry"
viaduct = "a viaduct"
dataList = [park, road, school, train, bus, mall, ferry, viaduct]
print dataList
for a in dataList:
    print a
    #if a.endswith(".shp"):
    #    dataList.remove(a)
print dataList
gives the following output (so the loop is working and reading everything correctly):
['a park.shp', 'the roads.shp', 'a school.shp', 'the train', 'the bus.shp', 'a mall', 'the ferry', 'a viaduct']
a park.shp
the roads.shp
a school.shp
the train
the bus.shp
a mall
the ferry
a viaduct
['a park.shp', 'the roads.shp', 'a school.shp', 'the train', 'the bus.shp', 'a mall', 'the ferry', 'a viaduct']
but when I remove the # marks to run the if statement, which should remove the strings ending in .shp, the string roads.shp remains in the list:
['a park.shp', 'the roads.shp', 'a school.shp', 'the train', 'the bus.shp', 'a mall', 'the ferry', 'a viaduct']
a park.shp
a school.shp
the bus.shp
the ferry
a viaduct
['the roads.shp', 'the train', 'a mall', 'the ferry', 'a viaduct']
Something else I noticed: it doesn't print all the strings, even though it's clearly a for loop that should go through each string. Can someone please explain what's going wrong, and why the loop keeps the string roads.shp but finds the other strings ending with .shp and removes them correctly?
Thanks,
C
(FYI, this is on Python 2.6.6, because of Arc 10.0)
You are mutating the list while iterating over it, which makes the loop's internal index skip items: when 'a park.shp' (index 0) is removed, 'the roads.shp' shifts down into index 0, but the loop moves on to index 1, so 'the roads.shp' is never examined.
Use a list comprehension like this:
[d for d in dataList if not d.endswith('.shp')]
and then get:
['the train', 'a mall', 'the ferry', 'a viaduct']
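To see the skipping concretely, here is a toy sketch of the same bug:
data = ['a.shp', 'b.shp', 'keep']
for item in data:          # the loop advances an internal index: 0, 1, 2, ...
    if item.endswith('.shp'):
        data.remove(item)  # removing 'a.shp' shifts 'b.shp' down to index 0;
                           # the loop then moves on to index 1 ('keep'),
                           # so 'b.shp' is never examined
print(data)                # ['b.shp', 'keep']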
Removing items from the same list you're iterating over almost always causes problems. Make a copy of the original list and iterate over that instead; that way you don't skip anything.
for a in dataList[:]:  # Iterate over a copy of the list
    print a
    if a.endswith(".shp"):
        dataList.remove(a)  # Remove items from the original, not the copy
Of course, if this loop has no purpose other than creating a list with no .shp files, you can just use one list comprehension and skip the whole mess.
no_shp_files = [a for a in dataList if not a.endswith('.shp')]