I use the lda package to model topics on a large set of text documents. A simplified(!) example (I removed all other cleaning steps, lemmatization, bigrams etc.) of my code is below, and I'm happy with the results so far. But now I'm struggling to write code to predict on a new text. I can't find any reference in lda's documentation about save/load/predict options. I could add the new text to my set and fit the whole thing again, but that is an expensive way of doing it.
I know I can do it with gensim, but somehow the results from the gensim model are less impressive, so I'd rather stick with my initial lda model.
I will appreciate any suggestions!
My code:
import lda
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words('english')) # nltk stopwords list
documents = ["Liz Dawn: Coronation Street's Vera Duckworth dies at 77",
'Game of Thrones stars Kit Harington and Rose Leslie to wed',
'Tony Booth: Till Death Us Do Part actor dies at 85',
'The Child in Time: Mixed reaction to Benedict Cumberbatch drama',
"Alanna Baker: The Cirque du Soleil star who 'ran off with the circus'",
'How long can The Apprentice keep going?',
'Strictly Come Dancing beats X Factor for Saturday viewers',
"Joe Sugg: 8 things to know about one of YouTube's biggest stars",
'Sir Terry Wogan named greatest BBC radio presenter',
"DJs celebrate 50 years of Radio 1 and 2'"]
clean_docs = []
for doc in documents:
    # set all to lower case and tokenize
    tokens = nltk.tokenize.word_tokenize(doc.lower())
    # remove stop words
    texts = [i for i in tokens if i not in stops]
    clean_docs.append(texts)
# join back all tokens to create a list of docs
docs_vect = [' '.join(txt) for txt in clean_docs]
cvectorizer = CountVectorizer(max_features=10000, stop_words=stops)
cvz = cvectorizer.fit_transform(docs_vect)
n_topics = 3
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 3
topic_summaries = []
topic_word = lda_model.topic_word_ # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i+1, ' '.join(topic_words)))
# How to predict a new document?
new_text = '50 facts about Radio 1 & 2 as they turn 50'
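One way to do this, assuming your installed version of the lda package exposes a transform method on a fitted model (recent releases do), is to vectorize the new text with the already-fitted CountVectorizer and infer its topic distribution; saving/loading can then be done by pickling the fitted vectorizer and model together. A minimal sketch:

import pickle

# use transform, not fit_transform, so the training vocabulary is reused
new_vec = cvectorizer.transform([new_text])

# infer the topic distribution of the unseen document
# (assumption: lda.LDA.transform is available in your installed version)
doc_topic = lda_model.transform(new_vec)
print(doc_topic)            # topic probabilities for the new document
print(doc_topic.argmax())   # index of the most likely topic

# save/load: pickle the fitted vectorizer and model together
with open('lda_model.pkl', 'wb') as f:
    pickle.dump((cvectorizer, lda_model), f)
with open('lda_model.pkl', 'rb') as f:
    cvectorizer, lda_model = pickle.load(f)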
I am trying to understand how to create clusterings of texts using sklearn. I have 800 texts (600 training data and 200 test data) like the following:
Texts    # column name
1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
4 Outcry after Trump suggests injecting disinfectant as treatment.
5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.
and I would like to create clusters from those.
To transform the corpus into vector space I have used tf-idf, and to cluster the documents I am using the k-means algorithm.
However, I cannot understand whether the results are as expected or not, as unfortunately the output is not 'graphical' (I have tried to use CountVectorizer to get a matrix of frequencies, but probably I am using it in the wrong way).
What I would expect from tf-idf is that, when I test with the following test dataset:
test_dataset = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.", "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19", "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus."]
(the test dataset comes from the column df["0"]['Names'])
I would like to see which cluster (made by k-means) each text belongs to.
Please see below the code that I am currently using:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
def preprocessing(line):
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed
tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)
vec = CountVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Names'])
matrix = vec.fit_transform(df["0"]['Names'])
kmeans = KMeans(n_clusters=2).fit(tfidf)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
where df["0"]['Names'] is the column 'Names' of the 0th dataframe.
A visual example, even with a different dataset but with pretty much the same dataframe structure (just for better understanding), would also be good, if you prefer.
Any help you can provide will be greatly appreciated. Thanks
Taking your test_data and adding a few more sentences to make a corpus:
train_data = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.",
"Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19",
"Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus.",
"find the most representative document for each topic",
"topic distribution across documents",
"to help with understanding the topic",
"one of the practical application of topic modeling is to determine"]
Creating a dataframe from the above dataset (note that columns must be a list):
import pandas as pd
df = pd.DataFrame(train_data, columns=['text'])
Now you can use either CountVectorizer or TfidfVectorizer for vectorizing the text; I am using TfidfVectorizer here:
vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['text'])
kmeans = KMeans(n_clusters=2).fit(vectorized_text)
# now predicting the cluster for given dataset
df['predicted cluster'] = kmeans.predict(vectorized_text)
Now, when you want to predict for test data or new data:
new_sent = 'coronavirus has created lot of problem in the world'
kmeans.predict(vect.transform([new_sent]))  # use transform only here, not fit_transform
# output:
array([1])
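If you also want a rough 'graphical' look at the clusters (which the question asks about), one common approach is to project the tf-idf vectors down to two dimensions and scatter-plot them coloured by cluster. A minimal sketch, assuming matplotlib is installed and reusing vect, vectorized_text, kmeans and df from above:

from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# TruncatedSVD works directly on sparse matrices, so it is handy for tf-idf output
svd = TruncatedSVD(n_components=2, random_state=42)
points = svd.fit_transform(vectorized_text)

# colour each point by its predicted cluster and label it with its row index
plt.scatter(points[:, 0], points[:, 1], c=df['predicted cluster'])
for i in range(len(df)):
    plt.annotate(str(i), (points[i, 0], points[i, 1]))
plt.title('K-means clusters of tf-idf vectors (2-D SVD projection)')
plt.show()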
I am trying to extract noun phrases from sentences using Stanza (with Stanford CoreNLP). This can only be done with the CoreNLPClient module in Stanza.
# Import client module
from stanza.server import CoreNLPClient
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse'], memory='4G', endpoint='http://localhost:9001')
Here is an example sentence, and I am using the tregex function of the client to get all the noun phrases. The tregex function returns a dict of dicts in Python, so I needed to process its output before passing it to the Tree.fromstring function in NLTK to correctly extract the noun phrases as strings.
pattern = 'NP'
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
matches = client.tregex(text, pattern)
Hence, I came up with the method stanza_phrases, which has to loop through the dict of dicts returned by tregex and format it correctly for Tree.fromstring in NLTK.
def stanza_phrases(matches):
    Nps = []
    for match in matches:
        for items in matches['sentences']:
            for keys, values in items.items():
                s = '(ROOT\n' + values['match'] + ')'
                Nps.extend(extract_phrase(s, pattern))
    return set(Nps)
The following generates a tree to be used by NLTK:
from nltk.tree import Tree
def extract_phrase(tree_str, label):
    phrases = []
    trees = Tree.fromstring(tree_str)
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == label:
                t = ' '.join(subtree.leaves())
                phrases.append(t)
    return phrases
Here is my output:
{'Albert Einstein', 'He', 'a German-born theoretical physicist', 'relativity', 'the theory', 'the theory of relativity'}
Is there a way I can make this code more efficient, with fewer lines (especially the stanza_phrases and extract_phrase methods)?
from stanza.server import CoreNLPClient
# get noun phrases with tregex
def noun_phrases(_client, _text, _annotators=None):
    pattern = 'NP'
    matches = _client.tregex(_text, pattern, annotators=_annotators)
    print("\n".join(["\t" + sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence]))
# English example
with CoreNLPClient(timeout=30000, memory='16G') as client:
    englishText = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print('---')
    print(englishText)
    noun_phrases(client, englishText, _annotators="tokenize,ssplit,pos,lemma,parse")

# French example
with CoreNLPClient(properties='french', timeout=30000, memory='16G') as client:
    frenchText = "Je suis John."
    print('---')
    print(frenchText)
    noun_phrases(client, frenchText, _annotators="tokenize,ssplit,mwt,pos,lemma,parse")
The Constituent-Treelib does exactly what you want to accomplish, and with very few lines of code.
First install it via: pip install constituent-treelib
Then, perform the following steps:
import spacy
from constituent_treelib import ConstituentTree
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
nlp_pipeline = ConstituentTree.create_pipeline(ConstituentTree.Language.English)
doc = nlp_pipeline(text)
extracted_phrases = []
for sent in doc.sents:
    sentence = sent.text
    tree = ConstituentTree(sentence, nlp_pipeline)
    extracted_phrases.append(tree.extract_all_phrases())
# --------------------------------------------------------------
# Output of extracted_phrases:
[{'S': ['Albert Einstein was a German - born theoretical physicist .'],
'ADJP': ['German - born'],
'VP': ['was a German - born theoretical physicist'],
'NP': ['Albert Einstein', 'a German - born theoretical physicist']},
{'S': ['He developed the theory of relativity .'],
'PP': ['of relativity'],
'VP': ['developed the theory of relativity'],
'NP': ['the theory of relativity', 'the theory']}]
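Since the question only asks for noun phrases, you can pull just the NP entries out of that output with a small comprehension (a follow-up sketch based on the structure shown above):

# collect only the 'NP' phrases from every sentence dict
noun_phrases = {np for sentence_phrases in extracted_phrases for np in sentence_phrases.get('NP', [])}
print(noun_phrases)
# e.g. {'Albert Einstein', 'a German - born theoretical physicist', 'the theory', 'the theory of relativity'}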
I want to extract all bigrams and trigrams of the given sentences.
from gensim.models import Phrases
documents = ["the mayor of new york was there", "Human Computer Interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
trigram = Phrases(bigram(sentence_stream, min_count=1, threshold=2, delimiter=b' '))
for sent in sentence_stream:
    #print(sent)
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]
    print(bigrams_)
    print(trigrams_)
The code works fine for bigrams and captures 'new york' and 'machine learning' as bigrams.
However, I get the following error when I try to insert trigrams.
TypeError: 'Phrases' object is not callable
Please let me know, how to correct my code.
I am following the example documentation of gensim.
According to the docs, you can do:
from gensim.models import Phrases
from gensim.models.phrases import Phraser
phrases = Phrases(sentence_stream)
bigram = Phraser(phrases)
trigram = Phrases(bigram[sentence_stream])
In your code, bigram is a Phrases object and cannot be called like a function, which is what Phrases(bigram(sentence_stream, ...)) tries to do; index it instead, as in bigram[sentence_stream]. A corrected sketch of your example follows below.
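Applied to your documents, a corrected version might look like this (note that in recent gensim versions the delimiter argument is a plain string rather than bytes, so it is simply omitted here to use the default):

from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentence_stream = [doc.split(" ") for doc in documents]

# train the bigram model, then freeze it into a Phraser for fast application
bigram_phrases = Phrases(sentence_stream, min_count=1, threshold=2)
bigram = Phraser(bigram_phrases)

# train the trigram model on the bigram-transformed corpus
trigram_phrases = Phrases(bigram[sentence_stream], min_count=1, threshold=2)
trigram = Phraser(trigram_phrases)

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]
    print(bigrams_)
    print(trigrams_)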
I am currently using uni-grams in my word2vec model as follows.
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words
    #
    # Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
However, then I will miss important bigrams and trigrams in my dataset.
E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"
Hence, I want to capture the important bigrams, trigrams etc. in my dataset and input into my word2vec model.
I am new to word2vec and struggling with how to do it. Please help me.
First of all, you should use gensim's Phrases class in order to get bigrams, which works as pointed out in the docs:
>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
To get trigrams and so on, you should use the bigram model that you already have and apply Phrases to it again, and so on.
Example:
trigram_model = Phrases(bigram_sentences)
There is also a good notebook and an accompanying video that explain how to use it.
The most important part of it is how to use it in real life sentences which is as follows:
# to create the bigrams
bigram_model = Phrases(unigram_sentences)

# apply the trained model to a sentence
for unigram_sentence in unigram_sentences:
    bigram_sentence = u' '.join(bigram_model[unigram_sentence])

# get a trigram model out of the bigram-transformed sentences
trigram_model = Phrases(bigram_sentences)
Hope this helps you, but next time give us more information on what you are using, etc.
P.S.: Now that you have edited your question: you are not doing anything to get bigrams, just splitting the text; you have to use Phrases in order to get words like 'New York' as a bigram.
from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = [
"the mayor of new york was there",
"machine learning can be useful sometimes",
"new york mayor was present"
]
sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
print(bigram_phraser)
for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
Phrases and Phraser are what you should be looking for.
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
Once you are done adding vocabulary, use Phraser for faster access and more efficient memory usage. It is not mandatory, but useful.
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
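To then feed the phrase-merged sentences into word2vec (which is what the question is ultimately about), here is a minimal sketch reusing data_words, bigram_mod and trigram_mod from above, and assuming gensim 4.x, where the embedding size parameter of Word2Vec is called vector_size (older versions call it size):

from gensim.models import Word2Vec

# transform each tokenized sentence: unigrams -> bigrams -> trigrams
phrased_sentences = [trigram_mod[bigram_mod[sent]] for sent in data_words]

# train word2vec on the phrase-merged sentences
w2v = Word2Vec(phrased_sentences, vector_size=100, window=5, min_count=1, workers=4)

# 'new_york' is only present if the phrase models actually merged it
print(w2v.wv.most_similar('new_york'))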
Thanks,
I have a large excel file like the following:
Timestamp Text Work Id
5/4/16 17:52 rain a lot the packs maybe damage. Delivery XYZ
5/4/16 18:29 wh. screen Other ABC
5/4/16 14:54 15107 Lane Pflugerville,
TX customer called me and his phone
number and my phone numbers were not
masked. thank you customer has had a
stroke and items were missing from his
delivery the cleaning supplies for his
wet vacuum steam cleaner. he needs a
call back from customer support Delivery YYY
5/6/16 13:05 How will I know if I Signing up ASX
5/4/16 23:07 an quality Delivery DFC
I want to work only on the "Text" column and then eliminate the rows that basically just have gibberish in the "Text" column (rows 2, 4 and 5 in the above example).
I'm reading only the 2nd column as follows:
import xlrd
book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in xrange(1, sheet.nrows):  # skip heading row
    timestamp, text = sheet.row_values(row_index, end_colx=2)
    print(text)
How do I remove the gibberish rows? I have an idea that I need to work with nltk and have a positive corpus (one that does not have any gibberish), one negative corpus (only having gibberish text), and train my model with it. But how do I go about implementing it? Please help!!
You can use nltk to do the following.
import nltk
english_words = set(w.lower() for w in nltk.corpus.words.words())
'a' in english_words
True
'dog' in english_words
True
'asdasdase' in english_words
False
How to get individual words in nltk from string:
individual_words_front_string = nltk.word_tokenize('This is my text from text column')
individual_words_front_string
['This', 'is', 'my', 'text', 'from', 'text', 'column']
For each row's text column, test the individual words to see if they are in the English dictionary. If they all are, you know that row's text column is not gibberish.
If your definition of gibberish vs. non-gibberish is different from 'English words found in nltk', you can use the same process as above, just with a different list of acceptable words.
How to accept numbers and street addresses?
Simple way to determine if something is a number.
word = '32423432'
word.isdigit()
True
word = '32423432ds'
word.isdigit()
False
Addresses are more difficult. You can find info on that here: Parsing Addresses, and probably in many other places. Of course you can always use the above logic if you have access to a list of cities, states, roads, etc.
Will it fail if any one word is False?
It's your code, you decide. Perhaps you can mark something as gibberish if x% of the words in the text are not recognised? A small sketch of that threshold idea is given after the grammar note below.
How to determine if grammar is correct?
This is a bigger topic, and a more in-depth explanation can be found at the following link:
Checking Grammar. But the above answer will just check if words are in the nltk corpus, not whether or not the sentence is grammatically correct.
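Putting the pieces above together, a minimal sketch of such a threshold-based check might look like the following (the function name and the 50% cutoff are just illustrative choices, not part of the original answer):

import nltk

# one-time downloads: nltk.download('words'); nltk.download('punkt')
english_words = set(w.lower() for w in nltk.corpus.words.words())

def looks_like_gibberish(text, max_unknown_ratio=0.5):
    # flag text as gibberish if too many tokens are neither English words nor numbers
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalnum()]
    if not tokens:
        return True
    unknown = [t for t in tokens if not t.isdigit() and t not in english_words]
    return len(unknown) / len(tokens) > max_unknown_ratio

print(looks_like_gibberish('rain a lot the packs maybe damage.'))  # most tokens are known words -> False
print(looks_like_gibberish('asdasd qwerty zzz'))                    # all tokens unknown -> True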
Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).
A misspelled word does not make a sample unusable, and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could apply to newspaper texts, but not to raw, user-generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on it. Annotation does not have to be a big effort, since these gibberish texts are shorter than the good ones and should be easy to recognise (at least some of them). Also, you could start with a corpus size of ~100 data points (50 good / 50 bad) and expand it once the first model is more or less working.
This is a sample code that I always use for text classification. You need to install scikit-learn and numpy though:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of (category, text) tuples.
    Returns a tuple of a list of labels and a list of texts.
    """
    random.shuffle(data)
    return zip(*data)
# Format training data
training_data = [
("good", "rain a lot the packs maybe damage."),
("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
("gibber", "wh. screen"),
("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)
# Format test set
test_data = [
    ("gibber", "an quality"),
    ("good", "<datapoint with valid text>"),
    # ...
]
test_labels, test_texts = prepare_data(test_data)
# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels
# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)
# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels
# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))
# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words about how it works: the count vectorizer tokenizes the text and creates vectors containing the counts of all words in the corpus. Based on these vectors, the classifier tries to recognise patterns that distinguish the two categories. A text with only a few, uncommon (because misspelled) words is more likely to land in the 'gibber' category, while a text with a lot of words that are typical of common sentences (think of all the stop words here: 'I', 'you', 'is'...) is more likely to be good text.
If this method works for you, you should also try other classifiers (a small sketch of that is given below) and use the first model to semi-automatically annotate a larger training corpus.
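For example, other scikit-learn classifiers can be dropped in on the same features; a minimal sketch reusing vectorizer, training_texts/training_labels and test_texts/test_labels from above (the two candidate models are just illustrative picks):

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# reuse the vectorizer fitted on the training texts so the features stay identical
X_train = vectorizer.transform(training_texts)
X_eval = vectorizer.transform(test_texts)

for candidate in (MultinomialNB(), LinearSVC()):
    candidate.fit(X_train, training_labels)
    preds = candidate.predict(X_eval)
    print(type(candidate).__name__, metrics.accuracy_score(test_labels, preds))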