Predicting new content for text-clustering using sklearn - python

I am trying to understand how to cluster texts using sklearn. I have 800 texts (600 training data and 200 test data) like the following:
Texts # column name
1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
4 Outcry after Trump suggests injecting disinfectant as treatment.
5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.
and I would like to create clusters from them.
To transform the corpus into vector space I have used tf-idf, and to cluster the documents I have used the k-means algorithm.
However, I cannot tell whether the results are what I would expect, as the output is not 'graphical' (I have tried to use CountVectorizer to get a frequency matrix, but I am probably using it in the wrong way).
What I would expect from tf-idf is that when I test with the following test dataset:
test_dataset = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.", "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19", "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus."]
(the test dataset comes from the column df["0"]['Names'])
I would like to see which cluster (made by k-means) each text belongs to.
Please see below the code that I am currently using:
import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans

stop_words = stopwords.words('english')

def preprocessing(line):
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)
vec = CountVectorizer()

tfidf = tfidf_vectorizer.fit_transform(df["0"]['Names'])
matrix = vec.fit_transform(df["0"]['Names'])

kmeans = KMeans(n_clusters=2).fit(tfidf)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
where df["0"]['Names'] is the column 'Names' of the 0th dataframe.
A visual example, even with a different dataset but roughly the same dataframe structure (just for better understanding), would also be good, if you prefer.
Any help will be greatly appreciated. Thanks.

Taking your test_data and adding four more sentences to make a corpus:
train_data = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.",
"Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19",
"Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus.",
"find the most representative document for each topic",
"topic distribution across documents",
"to help with understanding the topic",
"one of the practical application of topic modeling is to determine"]
Creating a dataframe from the above dataset:
df = pd.DataFrame(train_data, columns=['text'])
Now you can use either CountVectorizer or TfidfVectorizer to vectorize the text; I am using TfidfVectorizer:
vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['text'])
kmeans = KMeans(n_clusters=2).fit(vectorized_text)
# now predicting the cluster for given dataset
df['predicted cluster'] = kmeans.predict(vectorized_text)
Now, when you want to predict for test data or new data:
new_sent = 'coronavirus has created lot of problem in the world'
kmeans.predict(vect.transform([new_sent])) # you have to use transform only, not fit_transform
# output
array([1])
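If you also want a rough, non-graphical check of what each cluster is about, one option (not part of the original answer, just a sketch reusing the vect and kmeans objects above) is to print the highest-weighted terms of each cluster centre:
import numpy as np

terms = vect.get_feature_names()   # on newer sklearn versions this is get_feature_names_out()
order = np.argsort(kmeans.cluster_centers_, axis=1)[:, ::-1]   # term indices, highest weight first

for cluster_id in range(kmeans.n_clusters):
    top_terms = [terms[i] for i in order[cluster_id, :5]]
    print('cluster {}: {}'.format(cluster_id, ', '.join(top_terms)))
This gives a quick sanity check of the clusters without any plotting.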

Related

Vectorize document based on vocabulary AND regex

I am trying to train a text classifier using sklearn's CountVectorizer. The problem is that my training documents contain many tokens that are document-specific. For example, there are regular English words that the CountVectorizer.fit_transform method handles perfectly well, but there are also tokens formatted so that they match the regex '\w\d\d\w\w\d', such as 'd84ke2'. As it stands, the fit_transform method just takes 'd84ke2' at face value and uses it as a feature.
I want the tokens that match that specific regex to be condensed into a single feature, and leave the regular English words as their own features, since creating a feature such as 'd84ke2' would be useless because it will not come up again in any other document.
I've yet to find a way to do this, much less the "best" way. Below is an example of code I have, where you can see that the tokens 'j64ke2', 'r32kl4', 'w35kf9', and 'e93mf9' are all turned into their own features. I repeat for clarity: I want to basically condense these features into one and keep the others.
docs = ['the quick brown j64ke2 jumped over the lazy dogs r32kl4.',
        'an apple a day keeps the w35kf9 away',
        'you got the lions share of the e93mf9']

import numpy as np

# define target and target_names
target_names = ['zero', 'one', 'two']
target = np.array([0, 1, 2])

# Create message bunch.
from sklearn.utils import Bunch
doc_info = Bunch(data=docs, target=target, target_names=target_names)

# Vectorize training data
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
count_vect.fit(doc_info.data)
vocab = count_vect.vocabulary_
vocab_keys = list(vocab.keys())
#vocab_vals = list(vocab.values())
X_train_counts = count_vect.transform(doc_info.data)
X = X_train_counts.toarray()

import pandas as pd
# use get_feature_names() so the column labels line up with the matrix columns
df = pd.DataFrame(X, columns=count_vect.get_feature_names())
yatu's comment is a good solution. I was able to clean the documents before feeding them to CountVectorizer by substituting a placeholder word for each token that matched the regex.
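For reference, a minimal sketch of that cleaning step (the placeholder name 'codetoken' is just an illustration, not from the original answer):
import re
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the quick brown j64ke2 jumped over the lazy dogs r32kl4.',
        'an apple a day keeps the w35kf9 away',
        'you got the lions share of the e93mf9']

# collapse every token matching the pattern into one shared placeholder feature
pattern = re.compile(r'\b\w\d\d\w\w\d\b')
cleaned = [pattern.sub('codetoken', doc) for doc in docs]

count_vect = CountVectorizer()
X = count_vect.fit_transform(cleaned)
print(count_vect.get_feature_names())   # 'codetoken' appears once; the regular words stay separate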

Why can't spacy differentiate between two homograph tokens in the following code?

A homograph is a word that has the same spelling as another word but a different sound and a different meaning, for example, lead (to go in front of) / lead (a metal).
I was trying to use spaCy word vectors to compare documents with each other by summing the word vectors for each document and then computing cosine similarity. If, for example, spaCy has the same vector for the two 'lead' words listed above, the results will probably be bad.
In the code below, why does the similarity between the two 'bank' tokens come out as 1.0?
import spacy
nlp = spacy.load('en')
str1 = 'The guy went inside the bank to take out some money'
str2 = 'The house by the river bank.'
str1_tokenized = nlp(str1.decode('utf8'))
str2_tokenized = nlp(str2.decode('utf8'))
token1 = str1_tokenized[-6]
token2 = str2_tokenized[-2]
print 'token1 = {} token2 = {}'.format(token1,token2)
print token1.similarity(token2)
The output for given program is
token1 = bank token2 = bank
1.0
As kntgu already pointed out, spaCy distinguishes tokens by their characters, not by their semantic meaning. The sense2vec approach by the developers of spaCy concatenates tokens with their POS-tag and can help in the case of 'lead_VERB' vs. 'lead_NOUN'. However, it will not help in your example of 'bank (river bank)' vs. 'bank (financial institute)', as both are nouns.
SpaCy does not support any solution to this out of the box, but you can have a look at contextualized word representations like ELMo or BERT. Both generate word vectors for a given sentence, taking the context into account. Therefore, I assume the vectors for both 'bank' tokens will be substantially different.
Both are relatively recent approaches and are not as comfortable to use, but might help in your use case. For ELMo, there is a command line tool which lets you generate word embeddings for a set of sentences without having to write any code: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#writing-contextual-representations-to-disk
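For illustration, here is a minimal sketch of the BERT idea using the Hugging Face transformers library (the model name, the helper function, and the exact similarity value are assumptions, not part of the original answer):
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def bank_vector(sentence):
    # contextual embedding of the token 'bank' ('bank' happens to be a single wordpiece here)
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs['input_ids'][0].tolist().index(tokenizer.convert_tokens_to_ids('bank'))
    return hidden[idx]

v1 = bank_vector('The guy went inside the bank to take out some money')
v2 = bank_vector('The house by the river bank.')
print(torch.cosine_similarity(v1, v2, dim=0))   # noticeably below 1.0, unlike the static vectors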

Predict new text using Python Latent Dirichlet Allocation (LDA) model

I use the lda package to model topics for a large set of text documents. A simplified(!) example of my code (I removed all the other cleaning steps, lemmatization, bigrams, etc.) is below, and I'm happy with the results so far. But now I am struggling to write code to predict on a new text. I can't find any reference in lda's documentation about save/load/predict options. I can add the new text to my set and fit again, but that is an expensive way of doing it.
I know I can do it with gensim. But somehow the results from the gensim model are less impressive so I'd stick to my initial LDA model.
Will appreciate any suggestions!
My code:
import lda
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words('english')) # nltk stopwords list
documents = ["Liz Dawn: Coronation Street's Vera Duckworth dies at 77",
'Game of Thrones stars Kit Harington and Rose Leslie to wed',
'Tony Booth: Till Death Us Do Part actor dies at 85',
'The Child in Time: Mixed reaction to Benedict Cumberbatch drama',
"Alanna Baker: The Cirque du Soleil star who 'ran off with the circus'",
'How long can The Apprentice keep going?',
'Strictly Come Dancing beats X Factor for Saturday viewers',
"Joe Sugg: 8 things to know about one of YouTube's biggest stars",
'Sir Terry Wogan named greatest BBC radio presenter',
"DJs celebrate 50 years of Radio 1 and 2'"]
clean_docs = []
for doc in documents:
    # set all to lower case and tokenize
    tokens = nltk.tokenize.word_tokenize(doc.lower())
    # remove stop words
    texts = [i for i in tokens if i not in stops]
    clean_docs.append(texts)
# join back all tokens to create a list of docs
docs_vect = [' '.join(txt) for txt in clean_docs]
cvectorizer = CountVectorizer(max_features=10000, stop_words=stops)
cvz = cvectorizer.fit_transform(docs_vect)
n_topics = 3
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 3
topic_summaries = []
topic_word = lda_model.topic_word_ # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i+1, ' '.join(topic_words)))
# How to predict a new document?
new_text = '50 facts about Radio 1 & 2 as they turn 50'
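A hedged sketch of how the new document could be scored without refitting, assuming a recent version of the lda package (which exposes a transform method) and reusing the fitted cvectorizer and lda_model from above:
# reuse the fitted vectorizer and model; do not call fit_transform again
new_tokens = [w for w in nltk.tokenize.word_tokenize(new_text.lower()) if w not in stops]
new_vec = cvectorizer.transform([' '.join(new_tokens)])
doc_topic = lda_model.transform(new_vec)   # topic distribution for the new document
print(doc_topic)
print('Most likely topic: {}'.format(doc_topic.argmax(axis=1)[0] + 1))  # 1-based, to match the printout above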

How to extract rows with only meaningful text in a column

I have a large excel file like the following:
Timestamp     | Text                                | Work       | Id
5/4/16 17:52  | rain a lot the packs maybe damage.  | Delivery   | XYZ
5/4/16 18:29  | wh. screen                          | Other      | ABC
5/4/16 14:54  | 15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support | Delivery | YYY
5/6/16 13:05  | How will I know if I                | Signing up | ASX
5/4/16 23:07  | an quality                          | Delivery   | DFC
I want to work only on the "Text" column and then eliminate those rows that basically just have gibberish in the "Text" column (rows 2, 4, and 5 in the above example).
I'm reading only the 2nd column as follows:
import xlrd
book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in xrange(1, sheet.nrows): # skip heading row
    timestamp, text = sheet.row_values(row_index, end_colx=2)
    print(text)
How do I remove the gibberish rows? I have an idea that I need to work with nltk and have a positive corpus (one that does not have any gibberish), one negative corpus (only having gibberish text), and train my model with it. But how do I go about implementing it? Please help!!
You can use nltk to do the following.
import nltk
english_words = set(w.lower() for w in nltk.corpus.words.words())
'a' in english_words
True
'dog' in english_words
True
'asdasdase' in english_words
False
How to get individual words from a string with nltk:
individual_words_front_string = nltk.word_tokenize('This is my text from text column')
individual_words_front_string
['This', 'is', 'my', 'text', 'from', 'text', 'column']
For each row's text column, test the individual words to see if they are in the English dictionary. If they all are, you know that row's text column is not gibberish.
If your definition of gibberish vs non-gibberish is different from the English words found in nltk, you can use the same process above, just with a different list of acceptable words.
How to accept numbers and street addresses?
Simple way to determine if something is a number.
word = '32423432'
word.isdigit()
True
word = '32423432ds'
word.isdigit()
False
Addresses are more difficult. You can find info on that here: Parsing Addresses, and probably many other places. Of course, you can always use the above logic if you have access to a list of cities, states, roads, etc.
Will it fail if any one word is False?
It's your code, you decide. Perhaps you can mark something as gibberish if more than x% of the words in the text are not recognised? A small sketch of that idea is below.
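A minimal sketch of that threshold idea (the 50% cut-off and the is_gibberish helper are made up for illustration):
import nltk

english_words = set(w.lower() for w in nltk.corpus.words.words())

def is_gibberish(text, threshold=0.5):
    # treat a row as gibberish when fewer than `threshold` of its alphabetic words are known
    words = [w for w in nltk.word_tokenize(text.lower()) if w.isalpha()]
    if not words:
        return True
    known = sum(1 for w in words if w in english_words)
    return known / len(words) < threshold

print(is_gibberish('rain a lot the packs maybe damage.'))  # False
print(is_gibberish('wh. screen'))                          # depends on the threshold you choose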
How to determine if grammar is correct?
This is a bigger topic, and a more in-depth explanation can be found at the following link:
Checking Grammar. But the above answer will just check if words are in the nltk corpus, not whether or not the sentence is grammatically correct.
Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).
A misspelled word does not make a sample unusable and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could use for newspaper texts, but not for raw, user generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on it. The annotation does not have to be a big effort, since the gibberish texts are shorter than the good ones and should be easy to recognise (at least some of them). Also, you could start with a corpus of ~100 datapoints (50 good / 50 bad) and expand it once the first model is more or less working.
This is a sample code that I always use for text classification. You need to install scikit-learn and numpy though:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of (category, text) tuples.
    Returns a tuple of a list of labels and a list of texts.
    """
    random.shuffle(data)
    return zip(*data)
# Format training data
training_data = [
("good", "rain a lot the packs maybe damage."),
("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
("gibber", "wh. screen"),
("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)
# Format test set
test_data = [
("gibber", "an quality"),
("good", "<datapoint with valid text>",
# ...
]
test_labels, test_texts = prepare_data(test_data)
# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels
# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)
# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels
# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))
# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words about how it works: the CountVectorizer tokenizes the text and creates vectors containing the counts of all words in the corpus. Based on these vectors, the classifier tries to recognise patterns that distinguish the two categories. A text with only a few, uncommon (because misspelled) words will tend to land in the 'gibber' category, while a text with many words that are typical of common sentences (think of all the stop words here: 'I', 'you', 'is'...) is more likely to be a good text.
If this method works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus.

TextBlob Naive Bayes. Choosing highest likelihood

As training data, I have restaurant reviews in XML, each with an associated target expression that a sentiment is expressed toward, a category (a discrete label the target belongs to), and the polarity expressed toward it:
<text>With the great variety on the menu , I eat here often and never get bored .</text>
<Opinions>
<Opinion target="menu" category="FOOD#STYLE_OPTIONS" polarity="positive" from="30" to="34"/>
</Opinions>
I have used the TextBlob NB classifier to learn the mapping from target terms to their associated categories.
For test data, my aim is to predict the target expression, given a sentence and the category. I have first extracted nouns and noun phrases from the sentence, assuming the expression will be a subset of these. For the sentence:
"what may be interesting to most is the worst sevice attitude come from the owner of this establishment", these are ['sevice attitude', 'owner', 'establishment'].
I would like to know which of these is most likely given the category, which in this case is SERVICE#GENERAL. How could I go about this?
TextBlob's NB classifier extracts text features as a bag of words by default. So you can simply concatenate the words in the list of extracted nouns, then concatenate that with the category, and use the result as the training text, with the target as the training label.
Since the bag of words treats words independently, you should transform each noun phrase into a single token. You can put a '-' instead of a space, for example ('sevice attitude' would become 'sevice-attitude').
Example:
from textblob.classifiers import NaiveBayesClassifier
train = [('sevice-attitude owner establishment SERVICE#GENERAL', 'owner'),
('menu variety FOOD#STYLE_OPTIONS', 'menu')]
cl = NaiveBayesClassifier(train)
If you want you can customize the feature extraction: https://textblob.readthedocs.io/en/dev/classifiers.html#feature-extractors
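As a rough usage sketch (not from the original answer; the probabilities depend entirely on your training data), you can then rank candidate targets for a new sentence with classify and prob_classify:
# build the test string in the same format as the training texts above
test_text = 'sevice-attitude owner establishment SERVICE#GENERAL'
print(cl.classify(test_text))              # most likely target label
prob_dist = cl.prob_classify(test_text)
print(prob_dist.max())                     # same label as classify()
print(round(prob_dist.prob('owner'), 3))   # probability for one candidate, useful for ranking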
