I wonder if it is possible to incrementally encode a corpus using Python's sentence-transformers library. An example is the following:
from sentence_transformers import SentenceTransformer, util
import torch
embedder = SentenceTransformer('all-MiniLM-L6-v2')
corpus = ['I drank an apple.',
'A man is eating food.',
'A man is eating a piece of bread.']
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
Suppose now I want to incorporate
corpus2 = ['I ate an apple.', 'A man is drinking food.']
into corpus_embeddings, what should I do?
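One possible approach (a minimal sketch, not from the original post): since each sentence is embedded independently, you can encode the new sentences on their own and concatenate the result with the existing tensor, for example with torch.cat:
# Sketch: encode only the new sentences and append them to the existing embeddings.
corpus2 = ['I ate an apple.', 'A man is drinking food.']
corpus2_embeddings = embedder.encode(corpus2, convert_to_tensor=True)

# Keep the raw sentences and the embedding tensor in sync.
corpus += corpus2
corpus_embeddings = torch.cat([corpus_embeddings, corpus2_embeddings], dim=0)
print(corpus_embeddings.shape)  # (5, 384) for all-MiniLM-L6-v2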
I am following this semantic clustering example:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Corpus with example sentences
corpus = ['A man is eating food.',
'A man is eating a piece of bread.',
'A man is eating pasta.',
'The girl is carrying a baby.',
'The baby is carried by the woman',
'A man is riding a horse.',
'A man is riding a white horse on an enclosed ground.',
'A monkey is playing drums.',
'Someone in a gorilla costume is playing a set of drums.',
'A cheetah is running behind its prey.',
'A cheetah chases prey on across a field.'
]
corpus_embeddings = embedder.encode(corpus)
# Perform kmean clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster", i+1)
    print(cluster)
    print(len(cluster))
    print("")
Which results in the following lists:
Cluster 1
['The girl is carrying a baby.', 'The baby is carried by the woman']
2
Cluster 2
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
2
Cluster 3
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']
3
Cluster 4
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']
2
Cluster 5
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']
2
How do I add these separate lists into one?
Expected outcome:
list2 = [['The girl is carrying a baby.', 'The baby is carried by the woman'], ..., ['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']]
I tried the following:
list2 = []
for i in cluster:
    list2.append(i)
list2
But it returns only the last one:
['A monkey is playing drums.',
'Someone in a gorilla costume is playing a set of drums.']
Any ideas?
Following that example, you don't need to do anything to get a list of lists; that's already been done for you.
Try printing clustered_sentences.
If what you actually need is a "flat" list from the list of lists, you can achieve that with a Python list comprehension:
flat = [item for sub in clustered_sentences for item in sub]
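To make the difference concrete, here is a tiny self-contained illustration (the cluster contents are placeholders, not the real output):
# Placeholder data with the same shape as clustered_sentences.
clustered_sentences = [['a', 'b'], ['c'], ['d', 'e']]

# The list of lists is already what the question asks for.
print(clustered_sentences)  # [['a', 'b'], ['c'], ['d', 'e']]

# Flattening it yields one list containing every sentence.
flat = [item for sub in clustered_sentences for item in sub]
print(flat)  # ['a', 'b', 'c', 'd', 'e']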
I'm trying to extract sentences that contain selected keywords using set.intersection().
So far I'm only getting sentences that have the word 'van'. I can't get sentences with the words 'blue tinge' or 'off the road' because the code below can only handle single keywords.
Why is this happening, and what can I do to solve the problem? Thank you.
from textblob import TextBlob
import nltk
nltk.download('punkt')
search_words = set(["off the road", "blue tinge" ,"van"])
blob = TextBlob("That is the off the road vehicle I had in mind for my adventure. "
                "Which one? The one with the blue tinge. Oh, I'd use the money for a van.")
matches = []
for sentence in blob.sentences:
    blobwords = set(sentence.words)
    if search_words.intersection(blobwords):
        matches.append(str(sentence))
print(matches)
Output: ["Oh, I'd use the money for a van."]
set.intersection compares individual word tokens, so a multi-word phrase like "blue tinge" or "off the road" can never match a set of single words. If you want to check for an exact match of the search keywords, you can instead test each phrase as a substring of the sentence:
from nltk.tokenize import sent_tokenize
text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)
print(matches)
The output is:
['That is the off the road vehicle I had in mind for my adventure.',
"Oh, I'd use the money for a van.",
'The one with the blue tinge.']
If you want partial matching, use fuzzywuzzy for percentage-based matching.
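As a rough sketch of what that could look like (not part of the original answer; the fuzzywuzzy package must be installed, and the threshold of 80 is an arbitrary choice):
from fuzzywuzzy import fuzz
from nltk.tokenize import sent_tokenize

text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]

threshold = 80  # arbitrary cut-off; tune for your data
matches = []
for phrase in search_words:
    for sentence in sent_tokenize(text):
        # partial_ratio scores the best-matching substring on a 0-100 scale
        if fuzz.partial_ratio(phrase.lower(), sentence.lower()) >= threshold:
            matches.append(sentence)
print(matches)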
I am trying to understand how to create a clustering of texts using sklearn. I have 800 texts (600 training and 200 test) like the following:
Texts    # column name
1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
4 Outcry after Trump suggests injecting disinfectant as treatment.
5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.
and I would like to create clusters from them.
To transform the corpus into vector space I have used tf-idf, and to cluster the documents I have used the k-means algorithm.
However, I cannot tell whether the results are as expected, since the output is not 'graphical' (I have tried to use CountVectorizer to get a frequency matrix, but I am probably using it the wrong way).
What I would expect is that, when I test with the test dataset:
test_dataset = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.", "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19", "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus."]
(the test dataset comes from the column df["0"]['Names'])
I would like to see which cluster (made by k-means) each text belongs to.
Please see below the code that I am currently using:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
import re
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def preprocessing(line):
    # lower-case, keep letters only, tokenize, drop stop words, lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)
vec = CountVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Names'])
matrix = vec.fit_transform(df["0"]['Names'])
kmeans = KMeans(n_clusters=2).fit(tfidf)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
where df["0"]['Names'] is the column 'Names' of the 0th dataframe.
A visual example, even with a different dataset but roughly the same dataframe structure (just for better understanding), would also be welcome, if you prefer.
Any help you can provide will be greatly appreciated. Thanks.
Taking your test_data and adding four more sentences to make a corpus:
train_data = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.",
"Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19",
"Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus.",
"find the most representative document for each topic",
"topic distribution across documents",
"to help with understanding the topic",
"one of the practical application of topic modeling is to determine"]
Creating a dataframe from the above dataset:
import pandas as pd

df = pd.DataFrame(train_data, columns=['text'])
Now you can use either CountVectorizer or TfidfVectorizer for vectorizing the text; I am using TfidfVectorizer:
vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['text'])
kmeans = KMeans(n_clusters=2).fit(vectorized_text)
# now predicting the cluster for given dataset
df['predicted cluster'] = kmeans.predict(vectorized_text)
Now, when you are going to predict for test data or new data
new_sent = 'coronavirus has created lot of problem in the world'
kmeans.predict(vect.transform([new_sent]))  # you have to use transform only, not fit_transform
# output
array([1])
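Since the question also mentions that the output is not 'graphical', one rough way to inspect the clusters is to print the highest-weighted tf-idf terms of each k-means centroid. This is not part of the original answer, just a sketch (get_feature_names_out needs a recent scikit-learn; use get_feature_names on older versions):
# Sketch: show the top tf-idf terms closest to each k-means centroid.
terms = vect.get_feature_names_out()                # vocabulary of the fitted vectorizer
order = kmeans.cluster_centers_.argsort()[:, ::-1]  # term indices sorted by weight, per cluster
for cluster_id in range(kmeans.n_clusters):
    top_terms = [terms[idx] for idx in order[cluster_id, :5]]
    print("Cluster", cluster_id, ":", ", ".join(top_terms))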
input: noun
output: noun
example:
Input: dog, I want to get an output of cat, wolf, tiger
or
Input: computer, I want to get an output of laptop, tablet, phone, etc.
Can anyone write Python code that does this, preferably using NLTK?
sample code:
import nltk

text = nltk.Text(word.lower() for word in nltk.corpus.treebank.words())
y = text.similar('dog')  # note: Text.similar() prints its matches and returns None
print(y)
Update: I found this and it works better:
from py_thesaurus import Thesaurus
thesaurus = Thesaurus('dog')
print(thesaurus.get_synonym())
For 'dog', I am also getting "No matches".
Similar words depend on the corpus you are using. For now I have used the same corpus you used.
Code:
import nltk
text = nltk.Text(word.lower() for word in nltk.corpus.treebank.words())
text.similar('computer')
similar_words = text._word_context_index.similar_words('computer')  # _word_context_index is a non-public attribute
print(similar_words)
Output:
industrial insurance research manufacturing market car farm trading
presence stored
['manufacturing', 'industrial', 'insurance', 'research', 'market', 'presence', 'farm', 'trading', 'stored', 'car']
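If what you are after is dictionary-style synonyms and related nouns rather than corpus-dependent distributional similarity, a minimal sketch (not part of the answer above) using NLTK's WordNet interface might look like this:
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')  # one-time download of the WordNet data

word = 'dog'
synonyms = set()
related = set()
for synset in wn.synsets(word, pos=wn.NOUN):
    # lemma names of the synset itself are synonyms
    synonyms.update(lemma.name() for lemma in synset.lemmas())
    # hypernyms/hyponyms give broader and narrower related nouns
    for rel in synset.hypernyms() + synset.hyponyms():
        related.update(lemma.name() for lemma in rel.lemmas())

print(sorted(synonyms))
print(sorted(related)[:10])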
I use the lda package to model topics for a large set of text documents. A simplified(!) example of my code (I removed all the other cleaning steps, lemmatization, bigrams, etc.) is below, and I'm happy with the results so far. But now I am struggling to write code to predict a new text. I can't find any reference in lda's documentation about save/load/predict options. I can add the new text to my set and fit it again, but that is an expensive way of doing it.
I know I can do it with gensim, but somehow the results from the gensim model are less impressive, so I'd rather stick with my initial LDA model.
Will appreciate any suggestions!
My code:
import lda
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words('english')) # nltk stopwords list
documents = ["Liz Dawn: Coronation Street's Vera Duckworth dies at 77",
'Game of Thrones stars Kit Harington and Rose Leslie to wed',
'Tony Booth: Till Death Us Do Part actor dies at 85',
'The Child in Time: Mixed reaction to Benedict Cumberbatch drama',
"Alanna Baker: The Cirque du Soleil star who 'ran off with the circus'",
'How long can The Apprentice keep going?',
'Strictly Come Dancing beats X Factor for Saturday viewers',
"Joe Sugg: 8 things to know about one of YouTube's biggest stars",
'Sir Terry Wogan named greatest BBC radio presenter',
"DJs celebrate 50 years of Radio 1 and 2'"]
clean_docs = []
for doc in documents:
    # set all to lower case and tokenize
    tokens = nltk.tokenize.word_tokenize(doc.lower())
    # remove stop words
    texts = [i for i in tokens if i not in stops]
    clean_docs.append(texts)
# join back all tokens to create a list of docs
docs_vect = [' '.join(txt) for txt in clean_docs]
cvectorizer = CountVectorizer(max_features=10000, stop_words=stops)
cvz = cvectorizer.fit_transform(docs_vect)
n_topics = 3
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 3
topic_summaries = []
topic_word = lda_model.topic_word_ # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i+1, ' '.join(topic_words)))
# How to predict a new document?
new_text = '50 facts about Radio 1 & 2 as they turn 50'
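One way to approach that last step (a sketch, assuming the version of the lda package in use exposes a transform method for inference on unseen documents, as recent versions do): clean the new text the same way as the training documents, vectorize it with the already fitted CountVectorizer, and pass it to transform instead of refitting:
# Sketch: infer the topic distribution of the unseen document above without refitting.
# Apply the same cleaning as for the training documents.
new_tokens = [w for w in nltk.tokenize.word_tokenize(new_text.lower()) if w not in stops]
new_vec = cvectorizer.transform([' '.join(new_tokens)])  # reuse the fitted vocabulary

doc_topic = lda_model.transform(new_vec)  # shape: (1, n_topics)
print(doc_topic)
print('Most likely topic:', doc_topic.argmax() + 1)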